Local model landscape

Living document. Rewritten as new models ship. Last updated: 2026-04-12.

RG’s hardware

| Machine | Role | Key spec | Memory bandwidth | Budget for models |
|---|---|---|---|---|
| M3 Max MBP 14” | Main — tiny models, high tok/s, multiple in parallel | 36GB unified | ~400 GB/s | ~21-24 GB |
| M2 Max MBP 14” | Dispatch — big jobs, 7B-14B | 32GB unified | ~400 GB/s | ~19-22 GB |
| WSL + 3060 12GB | Heavy compute — biggest models, GPU offload | 12GB VRAM + 64GB RAM | PCIe bottleneck on offload | 12GB GPU / 64GB total |

Preferences

What just shipped

Gemma 4 (April 2, 2026) — Apache 2.0 license (major change from Gemma 3’s custom license)

| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | 7.2 GB | 128K | Text, image, audio |
| Gemma 4 E4B | 8B | 4.5B | 9.6 GB | 128K | Text, image, audio |
| Gemma 4 26B (MoE) | 25.2B | 3.8B | 18 GB | 256K | Text, image |
| Gemma 4 31B (Dense) | 30.7B | 30.7B | 20 GB | 256K | Text, image |

Gemma 4 E2B beats Gemma 3 27B on most benchmarks with only 2.3B active params. Most efficient model per byte I’ve tracked. ~75-85 tok/s on M3 Max.

Abliterated variants expanding — see abliteration section below.

Nemotron 3 Nano — Mamba-Transformer hybrid, benchmarks now available

| Model | Total params | Active params | Size (Q4) | Architecture | Key benchmarks |
|---|---|---|---|---|---|
| Nemotron 3 Nano 4B | 3.6B | 3.6B | ~2.5 GB | Mamba-Transformer hybrid | TBD for this size |
| Nemotron 3 Nano 30B-A3B | 31.6B | 3.2B (MoE) | ~18 GB | Mamba-Transformer hybrid | AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7% |

Independent benchmarks (via NeMo Evaluator):

Verdict: at only 3.2B active params, this runs comfortably on the RTX 3060 12GB. Strong coding/reasoning for such a small active parameter count. Priority recommendation for RG’s 3060. GGUF quants available from Unsloth.

Hardware x Model fit matrix

M3 Max 36GB — Tiny fleet for background tasks

| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Gemma 4 E2B | Q8_0 | ~4 GB | 75-85 | Best tiny general-purpose; multimodal+audio |
| Nemotron 3 Nano 4B | Q8_0 | ~3.5 GB | TBD | Mamba hybrid for agentic tasks — evaluate |
| Qwen3.5-0.8B | Q8_0 | ~1 GB | 120-150 | Ultra-fast drafting/classification |
| Qwen3.5-2B | Q8_0 | ~2.7 GB | 80-100 | Fast chat/code assist |
| SmolLM3-3B | Q8_0 | ~3.5 GB | 60-80 | Best-in-class 3B; 128K context |
| Qwen3.5-4B | Q6_K | ~3.4 GB | 50-65 | Strong coding at 4B |

Multi-model strategy: Set OLLAMA_MAX_LOADED_MODELS=4. Example fleet: Qwen3.5-0.8B (1GB) + Gemma 4 E2B (4GB) + SmolLM3-3B (3.5GB) + Qwen3.5-2B (2.7GB) = ~11GB total, plenty of headroom.
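The fleet arithmetic above can be sanity-checked in a few lines. A minimal sketch, using the Q8_0 sizes from the table and the low end of the ~21-24 GB budget; the model keys and `fleet_fits` helper are just illustrative names:

```python
# Check that a fleet of resident models fits a machine's model budget.
# Sizes (GB) are the Q8_0 figures from the M3 Max table above.
FLEET_GB = {
    "qwen3.5-0.8b": 1.0,
    "gemma4-e2b": 4.0,
    "smollm3-3b": 3.5,
    "qwen3.5-2b": 2.7,
}

def fleet_fits(fleet: dict[str, float], budget_gb: float) -> tuple[float, bool]:
    """Return (total resident GB, whether the fleet fits the budget)."""
    total = sum(fleet.values())
    return total, total <= budget_gb

total, ok = fleet_fits(FLEET_GB, budget_gb=21.0)  # low end of ~21-24 GB
print(f"{total:.1f} GB resident, fits: {ok}")  # 11.2 GB resident, fits: True
```

Swapping a model in or out is then just a dict edit before re-checking the budget.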

M2 Max 32GB — Dispatch workhorse

| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Qwen 2.5 Coder 14B | Q4_K_M | ~9 GB | 25-35 | Primary coding workhorse (HumanEval ~89%) |
| Nemotron 3 Nano 30B-A3B | Q4_K_M | ~18 GB | ~77 (MLX) | AIME 89.1%, LCBv6 68.3% — top priority evaluation |
| DeepSeek-R1-Distill 14B | Q4_K_M | ~9 GB | 22-30 | Chain-of-thought reasoning + code |
| Qwen3.5-9B | Q5_K_M | ~6.5 GB | 28-38 | General + coding, 256K context |
| Phi-4 (14B) | Q4_K_M | ~9 GB | 30-38 | STEM reasoning |
| Qwen3.5-27B | Q4_K_M | ~17 GB | 12-18 | Peak quality (LiveCodeBench 80.7) — slow but usable for batch |

Avoid: Gemma 4 26B MoE — community reports 11 tok/s vs 60+ for similarly sized dense models. Expert routing drives up the effective bandwidth cost per active param.
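The gap is starker against the standard memory-bound decode estimate. A back-of-envelope sketch, using sizes from the tables above; it ignores KV-cache reads and assumes each token streams the active weights exactly once:

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Memory-bound ceiling: each decoded token streams the active weights once."""
    return bandwidth_gb_s / active_weight_gb

# Dense Qwen3.5-27B at Q4_K_M (~17 GB) on ~400 GB/s M-series bandwidth:
dense = decode_tok_s_upper_bound(400, 17)  # ~24 tok/s ceiling; 12-18 observed

# Gemma 4 26B MoE: ~3.8B of 25.2B params active, so ~18 GB * (3.8/25.2) ≈ 2.7 GB/token:
moe = decode_tok_s_upper_bound(400, 18 * 3.8 / 25.2)  # ~147 tok/s in theory

print(round(dense), round(moe))
```

In theory the MoE should be far faster than dense; the community-reported 11 tok/s suggests current runtimes get nowhere near that ceiling for this architecture.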

WSL + 3060 12GB — Heavy compute

| Model | Quant | VRAM fit | tok/s | Notes |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | Q4 | ~5 GB VRAM | 40-60 | Best MoE for this card — only 3.2B active |
| Qwen 2.5 Coder 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Interactive workhorse |
| DeepSeek-R1-Distill 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Reasoning + code |
| Qwen3.5-27B | Q4_K_M | Partial (16GB) | 4-8 | ~75% GPU offload |
| Qwen 2.5 Coder 32B | Q4_K_M | Partial (20GB) | 3-5 | HumanEval 92.7% — overnight batch jobs |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | Partial (24GB) | 5-10 | Only 3B active, benefits from partial offload |

Coding-specific models

| Model | Params | HumanEval | SWE-bench | Best for |
|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 88.4% | — | Autocomplete/FIM |
| Qwen 2.5 Coder 14B | 14B | ~89% | — | Best balance of capability/speed |
| Qwen 2.5 Coder 32B | 32B | 92.7% | — | Highest code quality |
| Qwen3-Coder-Next (80B MoE) | 80B/3B active | — | 64.6% | Beats Claude Opus 4.6 on SWE-bench |
| Qwen3.5-9B | 9B | 65.6 (LCBv6) | — | Chat-based coding with vision |
| Qwen3.5-27B | 27B | 80.7 (LCBv6) | — | Multi-file reasoning |

Abliterated variant sources

| Producer | Method | Models | Where |
|---|---|---|---|
| huihui-ai | Abliteration | Qwen3.5 (all sizes), Qwen3, Gemma 3, gpt-oss-20b | Ollama + HuggingFace |
| mlabonne | Abliteration | Gemma 3 (1B-27B) + GGUF | HuggingFace |
| bartowski | GGUF quants | QwQ-32B, Llama 3.1 8B, many others | HuggingFace |
| DavidAU | HERETIC | Gemma 4 31B, gpt-oss-20b (multiple variants) | HuggingFace |
| HauhauCS | Abliteration | Gemma 4 E2B, E4B (“aggressive”) | HuggingFace |
| trohrbaugh | Heretic ARA | Gemma 4 31B (KL 0.012, refusals 98→5/100) | HuggingFace |
| p-e-w (Heretic tool) | Automated HERETIC | 1000+ models including Gemma 4 | GitHub + HuggingFace |
| TrevorJS | Biprojection + EGA | Gemma 4 (E2B, E4B, 26B MoE, 31B) | GitHub |
| amarck | Abliteration | Gemma 4 31B (GGUF quants, Q4_K_M ~19GB) | HuggingFace |
| pmarreck | HERETIC | Gemma 4 31B (one-command Ollama/MLX setup) | GitHub |
| aoxo | Fine-tune | gpt-oss-20b | HuggingFace |

Quick Ollama access:

```shell
ollama pull huihui_ai/qwen3.5-abliterated   # Qwen 3.5 uncensored
ollama pull huihui_ai/gemma3-abliterated    # Gemma 3 uncensored
```

gpt-oss-20b abliterated landscape (complete)

| Variant | Producer | Method | Format |
|---|---|---|---|
| Huihui-gpt-oss-20b-BF16-abliterated | huihui-ai | Abliteration | BF16/Ollama (v1+v2) |
| GPT-oss-20b-abliterated-uncensored-NEO | DavidAU | Abliteration+NEO | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-HERETIC-uncensored-NEO | DavidAU | HERETIC | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4 | DavidAU | HERETIC | Native MXFP4 |
| gpt-oss-20b-uncensored | aoxo | Fine-tune | BF16 |

All fit comfortably on all three machines. MXFP4 at ~14GB or IQ4_NL at ~11.5GB. HERETIC variant claims complete refusal removal.

Independent benchmarks (via BenchLM, DataRobot, Artificial Analysis):

Verdict: Solid general-purpose model but Nemotron 3 Nano beats it on coding benchmarks at similar active params. Best use: general reasoning/chat where abliterated variant is preferred.

Quantization reference

| Quant | Bits | Quality | 7B size | 14B size | 27B size |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 | Good | 4.5 GB | 9 GB | 16 GB |
| Q5_K_M | ~5.5 | Better (<2% perplexity loss) | 5.2 GB | 10 GB | 19 GB |
| Q6_K | ~6.5 | High | 6.0 GB | 12 GB | 22 GB |
| Q8_0 | ~8.0 | Near-lossless | 7.5 GB | 15 GB | 27 GB |

Rule of thumb for Apple Silicon: keep the model at ≤60-70% of total unified memory, leaving the rest for KV cache, the OS, and other apps.
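The table and the rule of thumb can be sketched as a small estimator. The ~1.1 overhead factor (embeddings, metadata) is my assumption, tuned to roughly match the sizes above; the function names are illustrative:

```python
def quant_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate in-memory size of a quantized model."""
    return params_b * bits_per_weight / 8 * overhead

def fits_apple_silicon(size_gb: float, unified_gb: float, frac: float = 0.7) -> bool:
    """Rule of thumb: keep the model at <=60-70% of unified memory."""
    return size_gb <= frac * unified_gb

size_14b = quant_size_gb(14, 4.5)            # 14B at Q4_K_M ≈ 8.7 GB (table: 9 GB)
print(fits_apple_silicon(size_14b, 32))      # True: fits the M2 Max 32GB
print(fits_apple_silicon(quant_size_gb(30.7, 4.5), 32))  # True: Gemma 4 31B dense, ~19 GB
```

Worth noting the estimator excludes KV cache, which is exactly what the 60-70% headroom is for.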

Key insight: TurboQuant — 6x KV cache compression (NEW — April 12)

Google Research’s TurboQuant (March 25, ICLR 2026) compresses KV cache to 3 bits with zero accuracy loss. No retraining required. 6x reduction in KV memory.
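To see what a 6x KV-cache reduction buys, a back-of-envelope calculation; the layer/head/dim numbers below are a hypothetical 14B-class GQA config, not any specific model's:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int, ctx_tokens: int,
                bytes_per_elem: float = 2.0) -> float:
    """KV cache size: K and V tensors, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

fp16 = kv_cache_gb(48, 8, 128, 128 * 1024)  # ~25.8 GB at FP16 for 128K context
turbo = fp16 / 6                            # ~4.3 GB with the claimed 6x reduction
print(f"{fp16:.1f} GB -> {turbo:.1f} GB")
```

On a 36GB machine, that is the difference between a 128K context being impossible alongside the weights and it fitting with room to spare.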

Impact on RG’s hardware:

Implementation status:

The synthesis: TurboQuant + Ollama 0.19 MLX backend = two multiplicative improvements. MLX accelerates compute, TurboQuant expands context. Together they make Apple Silicon the most improved local inference platform.

Key insight: Ollama 0.19 MLX backend

Released March 2026. On Apple Silicon: 57% faster prefill, 93% faster decode vs v0.18 (llama.cpp). The M3 Max has higher memory bandwidth than M4 Pro, so it outperforms newer chips for memory-bound inference. Make sure Ollama is updated.

Models NOT practical for RG’s hardware

| Model | Why |
|---|---|
| GLM-5.1 (744B MoE, 40B active) | MIT license, #1 SWE-Bench Pro (58.4). Smallest GGUF ~206GB. Cloud/API only. Watch for distills. |
| Kimi K2.5 (1T params) | Even the smallest quant (1.8-bit) is ~240GB |
| Llama 4 Scout (109B) | Q4 is ~60GB+ |
| Llama 4 Maverick (400B) | Data center only |
| gpt-oss-120b (117B MoE) | Needs 66GB+ unified for usable speed |
| Nemotron 3 Super 120B-A12B | Too large at full quality |

Other models to assess

Known issues

Open threads
