Local model landscape
Living document. Rewritten as new models ship. Last updated: 2026-04-12.
RG’s hardware
| Machine | Role | Key spec | Memory bandwidth | Budget for models |
|---|---|---|---|---|
| M3 Max MBP 14” | Main — tiny models, high tok/s, multiple in parallel | 36GB unified | ~400 GB/s | ~21-24 GB |
| M2 Max MBP 14” | Dispatch — big jobs, 7B-14B | 32GB unified | ~400 GB/s | ~19-22 GB |
| WSL + 3060 12GB | Heavy compute — biggest models, GPU offload | 12GB VRAM + 64GB RAM | PCIe bottleneck on offload | 12GB GPU / 64GB total |
Preferences
- Abliterated/uncensored variants preferred — no alignment tax
- Key producers: huihui-ai (Ollama + HF), mlabonne (HF), bartowski (GGUF quants), DavidAU (HERETIC method)
- Inference: Ollama 0.19+ (MLX backend — 57% faster prefill, 93% faster decode vs 0.18)
What just shipped
Gemma 4 (April 2, 2026) — Apache 2.0 license (major change from Gemma 3’s custom license)
| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | 7.2 GB | 128K | Text, image, audio |
| Gemma 4 E4B | 8B | 4.5B | 9.6 GB | 128K | Text, image, audio |
| Gemma 4 26B (MoE) | 25.2B | 3.8B | 18 GB | 256K | Text, image |
| Gemma 4 31B (Dense) | 30.7B | 30.7B | 20 GB | 256K | Text, image |
Gemma 4 E2B beats Gemma 3 27B on most benchmarks with only 2.3B active params. Most efficient model per byte I’ve tracked. ~75-85 tok/s on M3 Max.
Abliterated variants expanding — see abliteration section below.
Nemotron 3 Nano — Mamba-Transformer hybrid, benchmarks now available
| Model | Total params | Active params | Size (Q4) | Architecture | Key benchmarks |
|---|---|---|---|---|---|
| Nemotron 3 Nano 4B | 3.6B | 3.6B | ~2.5 GB | Mamba-Transformer hybrid | TBD for this size |
| Nemotron 3 Nano 30B-A3B | 31.6B | 3.2B (MoE) | ~18 GB | Mamba-Transformer hybrid | AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7% |
Independent benchmarks (via NeMo Evaluator):
- AIME 2025: 89.1% (beats Qwen3-30B-A3B at 85.0%; 99.2% with Python tools)
- LiveCodeBench v6: 68.3% (beats Qwen3 66.0% and gpt-oss 61.0%)
- Arena-Hard-v2: 67.7% (vs Qwen3-30B 57.8%, gpt-oss-20b 48.5%)
- RULER: 87.5% at 64K, 82.9% at 128K, 70.6% at 512K (supports 1M context)
- 3.3x throughput vs Qwen3-30B-A3B on single H200
Verdict: At 3.2B active params, this runs on RTX 3060 12GB comfortably. Strong coding/reasoning at tiny active parameter count. Priority recommendation for RG’s 3060. GGUF quants available from Unsloth.
Hardware x Model fit matrix
M3 Max 36GB — Tiny fleet for background tasks
| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Gemma 4 E2B | Q8_0 | ~4 GB | 75-85 | Best tiny general-purpose; multimodal+audio |
| Nemotron 3 Nano 4B | Q8_0 | ~3.5 GB | TBD | Mamba hybrid for agentic tasks — evaluate |
| Qwen3.5-0.8B | Q8_0 | ~1 GB | 120-150 | Ultra-fast drafting/classification |
| Qwen3.5-2B | Q8_0 | ~2.7 GB | 80-100 | Fast chat/code assist |
| SmolLM3-3B | Q8_0 | ~3.5 GB | 60-80 | Best-in-class 3B; 128K context |
| Qwen3.5-4B | Q6_K | ~3.4 GB | 50-65 | Strong coding at 4B |
Multi-model strategy: Set OLLAMA_MAX_LOADED_MODELS=4. Example fleet: Qwen3.5-0.8B (1GB) + Gemma 4 E2B (4GB) + SmolLM3-3B (3.5GB) + Qwen3.5-2B (2.7GB) = ~11GB total, plenty of headroom.
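The fleet-sizing arithmetic can be sanity-checked in a few lines. A sketch in Python, using the example fleet above and the upper end of the M3 Max's ~21-24 GB model budget (the dictionary keys are shorthand labels, not exact Ollama tags):

```python
# Example fleet for the M3 Max 36GB (sizes in GB, from the table above).
fleet = {
    "qwen3.5-0.8b-q8": 1.0,
    "gemma4-e2b-q8": 4.0,
    "smollm3-3b-q8": 3.5,
    "qwen3.5-2b-q8": 2.7,
}

MODEL_BUDGET_GB = 24  # upper end of the ~21-24 GB budget from the hardware table

total = sum(fleet.values())
headroom = MODEL_BUDGET_GB - total
print(f"fleet: {total:.1f} GB, headroom: {headroom:.1f} GB")

# With OLLAMA_MAX_LOADED_MODELS=4, all four can stay resident at once.
assert len(fleet) <= 4
assert total <= MODEL_BUDGET_GB
```

Swapping a fleet member is just editing the dictionary and re-checking the headroom before pulling the model.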
M2 Max 32GB — Dispatch workhorse
| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Qwen 2.5 Coder 14B | Q4_K_M | ~9 GB | 25-35 | Primary coding workhorse (HumanEval ~89%) |
| Nemotron 3 Nano 30B-A3B | Q4_K_M | ~18 GB | ~77 (MLX) | AIME 89.1%, LCBv6 68.3% — top priority evaluation |
| DeepSeek-R1-Distill 14B | Q4_K_M | ~9 GB | 22-30 | Chain-of-thought reasoning + code |
| Qwen3.5-9B | Q5_K_M | ~6.5 GB | 28-38 | General + coding, 256K context |
| Phi-4 (14B) | Q4_K_M | ~9 GB | 30-38 | STEM reasoning |
| Qwen3.5-27B | Q4_K_M | ~17 GB | 12-18 | Peak quality (LiveCodeBench 80.7) — slow but usable for batch |
Avoid: Gemma 4 26B MoE — community reports 11 tok/s vs 60+ for similarly-sized dense models. MoE has higher bandwidth demands per active param.
WSL + 3060 12GB — Heavy compute
| Model | Quant | VRAM fit | tok/s | Notes |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | Q4 | ~5 GB VRAM | 40-60 | Best MoE for this card — only 3.2B active |
| Qwen 2.5 Coder 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Interactive workhorse |
| DeepSeek-R1-Distill 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Reasoning + code |
| Qwen3.5-27B | Q4_K_M | Partial (16GB) | 4-8 | ~75% GPU offload |
| Qwen 2.5 Coder 32B | Q4_K_M | Partial (20GB) | 3-5 | HumanEval 92.7% — overnight batch jobs |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | Partial (24GB) | 5-10 | Only 3B active, benefits from partial offload |
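For the partial-offload rows, a rough way to pick llama.cpp's --n-gpu-layers (-ngl) value is to offload however many layers fit in free VRAM, assuming weights are spread evenly across layers. A sketch with illustrative numbers (the 48-layer count and 12 GB usable VRAM are assumptions, not measured values for Qwen3.5-27B):

```python
def gpu_layers(model_gb: float, n_layers: int, usable_vram_gb: float) -> int:
    """Estimate how many transformer layers fit on the GPU,
    assuming weights are spread evenly across layers."""
    return min(n_layers, int(n_layers * usable_vram_gb / model_gb))

# Illustrative: ~16 GB of Q4_K_M weights, 48 layers, ~12 GB usable VRAM on the 3060.
ngl = gpu_layers(16.0, 48, 12.0)
print(f"-ngl {ngl}  (~{100 * ngl / 48:.0f}% of layers on GPU)")
```

With these numbers it lands at 36 of 48 layers, matching the ~75% offload figure in the table. Ollama does the equivalent automatically (tunable via the num_gpu parameter); the manual calculation matters mainly for llama.cpp.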
Coding-specific models
| Model | Params | HumanEval | SWE-bench | Best for |
|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 88.4% | — | Autocomplete/FIM |
| Qwen 2.5 Coder 14B | 14B | ~89% | — | Best balance capability/speed |
| Qwen 2.5 Coder 32B | 32B | 92.7% | — | Highest code quality |
| Qwen3-Coder-Next (80B MoE) | 80B/3B active | — | 64.6% | Beats Claude Opus 4.6 on SWE-bench |
| Qwen3.5-9B | 9B | — | 65.6 LCBv6 | Chat-based coding with vision |
| Qwen3.5-27B | 27B | — | 80.7 LCBv6 | Multi-file reasoning |
Abliterated variant sources
| Producer | Method | Models | Where |
|---|---|---|---|
| huihui-ai | Abliteration | Qwen3.5 (all sizes), Qwen3, Gemma 3, gpt-oss-20b | Ollama + HuggingFace |
| mlabonne | Abliteration | Gemma 3 (1B-27B) + GGUF | HuggingFace |
| bartowski | GGUF quants | QwQ-32B, Llama 3.1 8B, many others | HuggingFace |
| DavidAU | HERETIC | Gemma 4 31B, gpt-oss-20b (multiple variants) | HuggingFace |
| HauhauCS | Abliteration | Gemma 4 E2B, E4B (“aggressive”) | HuggingFace |
| trohrbaugh | Heretic ARA | Gemma 4 31B (KL 0.012, refusals 98→5/100) | HuggingFace |
| p-e-w (Heretic tool) | Automated HERETIC | 1000+ models including Gemma 4 | GitHub + HuggingFace |
| TrevorJS | Biprojection + EGA | Gemma 4 (E2B, E4B, 26B MoE, 31B) | GitHub |
| amarck | Abliteration | Gemma 4 31B (GGUF quants, Q4_K_M ~19GB) | HuggingFace |
| pmarreck | HERETIC | Gemma 4 31B (one-command Ollama/MLX setup) | GitHub |
| aoxo | Fine-tune | gpt-oss-20b | HuggingFace |
Quick Ollama access:
```
ollama pull huihui_ai/qwen3.5-abliterated   # Qwen 3.5 uncensored
ollama pull huihui_ai/gemma3-abliterated    # Gemma 3 uncensored
```
gpt-oss-20b abliterated landscape (complete)
| Variant | Producer | Method | Format |
|---|---|---|---|
| Huihui-gpt-oss-20b-BF16-abliterated | huihui-ai | Abliteration | BF16/Ollama (v1+v2) |
| GPT-oss-20b-abliterated-uncensored-NEO | DavidAU | Abliteration+NEO | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-HERETIC-uncensored-NEO | DavidAU | HERETIC | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4 | DavidAU | HERETIC | Native MXFP4 |
| gpt-oss-20b-uncensored | aoxo | Fine-tune | BF16 |
All fit comfortably on all three machines. MXFP4 at ~14GB or IQ4_NL at ~11.5GB. HERETIC variant claims complete refusal removal.
Independent benchmarks (via BenchLM, DataRobot, Artificial Analysis):
- Arena-Hard-v2: 48.5% (behind Nemotron 3 Nano at 67.7%)
- LiveCodeBench v6: 61.0% (behind Nemotron 3 Nano at 68.3%)
- Matches or exceeds o3-mini on most benchmarks
- Outperforms gpt-oss-120B on HumanEval and MMLU despite being much smaller
- “Low thinking effort” mode outperforms more expensive competitors
- Fits 16GB devices — runs on RTX 3060 and M3 Max easily
Verdict: Solid general-purpose model but Nemotron 3 Nano beats it on coding benchmarks at similar active params. Best use: general reasoning/chat where abliterated variant is preferred.
Quantization reference
| Quant | Bits | Quality | 7B size | 14B size | 27B size |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 | Good | 4.5 GB | 9 GB | 16 GB |
| Q5_K_M | ~5.5 | Better (<2% perplexity loss) | 5.2 GB | 10 GB | 19 GB |
| Q6_K | ~6.5 | High | 6.0 GB | 12 GB | 22 GB |
| Q8_0 | ~8.0 | Near-lossless | 7.5 GB | 15 GB | 27 GB |
Rule of thumb for Apple Silicon: model should be <=60-70% of total unified memory.
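The table rows follow roughly params × bits/8, plus some padding for embeddings and metadata. A sketch of that estimate and the 60-70% rule (the 1.1 overhead factor and 0.65 cutoff are rough assumptions, not measured constants):

```python
def quant_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough GGUF file size: params * bits/8, padded ~10% for embeddings/metadata."""
    return params_b * bits_per_weight / 8 * overhead

def fits(model_gb: float, unified_gb: float, frac: float = 0.65) -> bool:
    """Apple Silicon rule of thumb: keep the model under ~60-70% of unified memory."""
    return model_gb <= frac * unified_gb

size = quant_size_gb(14, 4.5)  # 14B at Q4_K_M (~4.5 bits/weight)
print(f"14B Q4_K_M ~ {size:.1f} GB, fits 32GB Mac: {fits(size, 32)}")
```

The estimate comes out at ~8.7 GB for a 14B Q4_K_M, consistent with the ~9 GB in the table; the remaining gap is context (KV cache), which is why the rule leaves 30-40% free.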
Key insight: TurboQuant — 6x KV cache compression (NEW — April 12)
Google Research’s TurboQuant (March 25, ICLR 2026) compresses KV cache to 3 bits with zero accuracy loss. No retraining required. 6x reduction in KV memory.
Impact on RG’s hardware:
- M3 Max 36GB: Gemma 4 31B at full 262K context becomes possible. KV cache drops from ~22GB to ~3.7GB. 31B Q4 (~20GB) + 3.7GB KV = 23.7GB total — fits.
- M2 Max 32GB: Nemotron 30B-A3B and Qwen3.5-27B can serve dramatically longer contexts within existing memory.
- RTX 3060 12GB: Context length multiplied within same VRAM budget. 14B models can run at very long context.
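The KV-cache arithmetic behind these numbers: per token, the cache stores 2 tensors (K and V) × layers × KV heads × head dim. A sketch with hypothetical dimensions for a 31B-class dense model (the 44-layer / 4-KV-head / 128-dim config below is illustrative, chosen to reproduce the ~22 GB figure; Gemma 4's actual config may differ):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, bits_per_value):
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bits_per_value / 8
    return bytes_total / 2**30

# Illustrative 31B-class config at the full 262K context.
CTX = 262_144
fp16 = kv_cache_gb(44, 4, 128, CTX, 16)
tq3 = kv_cache_gb(44, 4, 128, CTX, 3)  # TurboQuant-style 3-bit cache
print(f"fp16: {fp16:.1f} GB, 3-bit: {tq3:.1f} GB, ratio: {fp16 / tq3:.1f}x")
```

The raw 16-to-3-bit ratio is ~5.3x; the quoted 6x and ~3.7 GB figures presumably reflect TurboQuant's exact packing of scales/outliers, which this naive sketch does not model.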
Implementation status:
- Google official: Q2 2026
- llama.cpp: turboquant_plus project, experimental, Metal support on Apple Silicon
- Validated on models from 1.5B to 104B parameters
The synthesis: TurboQuant + Ollama 0.19 MLX backend = two multiplicative improvements. MLX accelerates compute, TurboQuant expands context. Together they make Apple Silicon the most improved local inference platform.
Key insight: Ollama 0.19 MLX backend
Released March 2026. On Apple Silicon: 57% faster prefill, 93% faster decode vs v0.18 (llama.cpp). The M3 Max has higher memory bandwidth than M4 Pro, so it outperforms newer chips for memory-bound inference. Make sure Ollama is updated.
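The end-to-end effect depends on the prompt/output mix, since prefill and decode speed up by different factors. A sketch, using hypothetical v0.18 baseline rates (300 tok/s prefill, 30 tok/s decode are illustrative; only the 1.57x and 1.93x factors come from the release notes):

```python
def gen_time(prompt_toks: int, out_toks: int, prefill_tps: float, decode_tps: float) -> float:
    """Total wall time: prompt ingestion plus token generation."""
    return prompt_toks / prefill_tps + out_toks / decode_tps

# Hypothetical baseline rates; 1.57x / 1.93x are the quoted MLX speedups.
old = gen_time(2000, 500, 300, 30)
new = gen_time(2000, 500, 300 * 1.57, 30 * 1.93)
print(f"{old:.1f}s -> {new:.1f}s ({old / new:.2f}x faster end-to-end)")
```

For this decode-heavy mix the end-to-end gain lands near the decode factor (~1.8x); long-prompt, short-answer workloads will sit closer to the 1.57x prefill factor.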
Models NOT practical for RG’s hardware
| Model | Why |
|---|---|
| GLM-5.1 (744B MoE, 40B active) | MIT license, #1 SWE-Bench Pro (58.4). Smallest GGUF ~206GB. Cloud/API only. Watch for distills. |
| Kimi K2.5 (1T params) | Even smallest quant (1.8-bit) is ~240GB |
| Llama 4 Scout (109B) | Q4 is ~60GB+ |
| Llama 4 Maverick (400B) | Data center only |
| gpt-oss-120b (117B MoE) | Needs 66GB+ unified for usable speed |
| Nemotron 3 Super 120B-A12B | Too large at full quality |
Other models to assess
- Nemotron 3 Nano 4B: Mamba-Transformer hybrid, claims 5x throughput. Tiny enough for fleet member on M3 Max. Priority evaluation.
- Nemotron 3 Nano 30B-A3B: MoE with only 3B active, Mamba hybrid. Local benchmarks now available: ~77 tok/s M2 Max (MLX), 40-60 tok/s RTX 3060 (Q4, ~5GB VRAM). Fits 3060 comfortably — top priority.
- MiniMax M2.7: “Self-evolving” training. 56.22% SWE-Pro — approaching Claude Opus 4.6. Too large for local but watch for quants/distills.
- Cogito v1 (3B/8B/14B/32B/70B): Dense, hybrid reasoning toggle. Llama/Qwen-base variants. On Ollama.
- Phi-4-mini-reasoning (3.8B): 128K context, reasoning-capable. Worth testing as alternative to SmolLM3.
- Gemma 4 31B HERETIC+Thinking (DavidAU): Chain-of-thought reasoning + uncensored 31B.
- Qwen3-Coder abliterated (huihui-ai): Abliterated variant for the coding model line.
- gpt-oss-20b HERETIC (DavidAU): Claims complete refusal removal. Priority evaluation.
- Gemma 4 31B abliterated GGUF (amarck): Q4_K_M ~19GB, fits M3 Max at short context.
Known issues
- Qwen 3.5 GGUF + Ollama incompatibility: GGUF versions do not work in Ollama due to separate mmproj vision files. Use llama.cpp directly for now.
- Gemma 4 GGUF chat template bug: Community GGUF uploads ship with incorrect chat templates (wrong delimiters), causing "---" output loops. pmarreck/gemma4-heretical fixes this via Ollama RENDERER/PARSER support.
- Gemma 4 31B flash-attention bug in Ollama: Hangs on prompts over ~500 tokens. Workaround: OLLAMA_FLASH_ATTENTION=0, but that tanks speed to ~15 tok/s on Apple Silicon. The 26B MoE is the better pick at ~20-30 tok/s.
- Gemma 4 31B context limits on 36GB Macs: 31B Q4 needs ~20GB weights + ~22GB KV at full 262K context. Only works at short context (<16K) on M3 Max 36GB.
Open threads
- Meta Muse Spark — open-weight contraction: Meta went proprietary. Llama future unclear. The open-weight producers (Google Gemma, Alibaba Qwen, Zhipu GLM, community) become more important. Google’s Apache 2.0 shift for Gemma 4 looks prescient.
- Heretic ARA quality: trohrbaugh’s gemma-4-31b-it-heretic-ara achieves KL divergence 0.012 (virtually no quality loss) while reducing refusals 98→5/100. Current best-quality abliteration for Gemma 4 31B. Needs evaluation.
- Nemotron 3 Nano evaluation: Benchmarks available. AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%. At 3.2B active params, top priority for RG’s 3060. Beats Qwen3-30B-A3B and gpt-oss-20b on coding.
- gpt-oss-20b evaluation: Benchmarks available. Arena-Hard 48.5%, LCBv6 61.0%. Solid but Nemotron 3 Nano beats it. Best for general reasoning with abliterated variant.
- Ollama v0.20.5 (April 9): New release. Gemma 4 all sizes available. Check for stability/perf fixes.
- TrevorJS abliteration technique: Biprojection + EGA, cross-validated against 686 prompts. New method worth tracking.
- Qwen 3.6-Plus: API-only (Alibaba Bailian, OpenRouter). 1M context, agentic coding. Watch for local release.
- DeepSeek V4: Imminent (~1T MoE, ~37B active, 1M context, multimodal, Apache 2.0). Too large for local, but distilled variants will follow.
- vllm-mlx: Server framework claiming 400+ tok/s on tiny models, continuous batching, Claude Code compatible.
- OpenClaw community adopting Kimi K2.5: Signal that model preference is shifting away from Claude for agentic work in open-source community.
- GLM-5.1: 744B MoE, #1 SWE-Bench Pro (58.4), MIT license. Too large for local but distilled variants may follow. Watch Zhipu AI.
- Copilot CLI BYOK: Now supports Ollama, vLLM, any OpenAI-compatible endpoint. Local models become usable inside a major agent’s workflow for the first time.