Local model landscape
Living document. Rewritten as new models ship. Last updated: 2026-04-12.
RG’s hardware
| Machine | Role | Key spec | Memory bandwidth | Budget for models |
|---|---|---|---|---|
| M3 Max MBP 14” | Main — tiny models, high tok/s, multiple in parallel | 36GB unified | ~400 GB/s | ~21-24 GB |
| M2 Max MBP 14” | Dispatch — big jobs, 7B-14B | 32GB unified | ~400 GB/s | ~19-22 GB |
| WSL + 3060 12GB | Heavy compute — biggest models, GPU offload | 12GB VRAM + 64GB RAM | PCIe bottleneck on offload | 12GB GPU / 64GB total |
Preferences
- Abliterated/uncensored variants preferred — no alignment tax
- Key producers: huihui-ai (Ollama + HF), mlabonne (HF), bartowski (GGUF quants), DavidAU (HERETIC method)
- Inference: Ollama 0.19+ (MLX backend — 57% faster prefill, 93% faster decode vs 0.18)
What just shipped
Gemma 4 (April 2, 2026) — Apache 2.0 license (major change from Gemma 3’s custom license)
| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | 7.2 GB | 128K | Text, image, audio |
| Gemma 4 E4B | 8B | 4.5B | 9.6 GB | 128K | Text, image, audio |
| Gemma 4 26B (MoE) | 25.2B | 3.8B | 18 GB | 256K | Text, image |
| Gemma 4 31B (Dense) | 30.7B | 30.7B | 20 GB | 256K | Text, image |
Gemma 4 E2B beats Gemma 3 27B on most benchmarks with only 2.3B active params. Most efficient model per byte I’ve tracked. ~75-85 tok/s on M3 Max.
Abliterated variants expanding — see abliteration section below.
Nemotron 3 Nano — Mamba-Transformer hybrid, benchmarks now available
| Model | Total params | Active params | Size (Q4) | Architecture | Key benchmarks |
|---|---|---|---|---|---|
| Nemotron 3 Nano 4B | 3.6B | 3.6B | ~2.5 GB | Mamba-Transformer hybrid | TBD for this size |
| Nemotron 3 Nano 30B-A3B | 31.6B | 3.2B (MoE) | ~18 GB | Mamba-Transformer hybrid | AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7% |
Independent benchmarks (via NeMo Evaluator):
- AIME 2025: 89.1% (beats Qwen3-30B-A3B at 85.0%; 99.2% with Python tools)
- LiveCodeBench v6: 68.3% (beats Qwen3 66.0% and gpt-oss 61.0%)
- Arena-Hard-v2: 67.7% (vs Qwen3-30B 57.8%, gpt-oss-20b 48.5%)
- RULER: 87.5% at 64K, 82.9% at 128K, 70.6% at 512K (supports 1M context)
- 3.3x throughput vs Qwen3-30B-A3B on single H200
Verdict: At 3.2B active params, this runs on RTX 3060 12GB comfortably. Strong coding/reasoning at tiny active parameter count. Priority recommendation for RG’s 3060. GGUF quants available from Unsloth.
Hardware x Model fit matrix
M3 Max 36GB — Tiny fleet for background tasks
| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Gemma 4 E2B | Q8_0 | ~4 GB | 75-85 | Best tiny general-purpose; multimodal+audio |
| Nemotron 3 Nano 4B | Q8_0 | ~3.5 GB | TBD | Mamba hybrid for agentic tasks — evaluate |
| Qwen3.5-0.8B | Q8_0 | ~1 GB | 120-150 | Ultra-fast drafting/classification |
| Qwen3.5-2B | Q8_0 | ~2.7 GB | 80-100 | Fast chat/code assist |
| SmolLM3-3B | Q8_0 | ~3.5 GB | 60-80 | Best-in-class 3B; 128K context |
| Qwen3.5-4B | Q6_K | ~3.4 GB | 50-65 | Strong coding at 4B |
Multi-model strategy: Set OLLAMA_MAX_LOADED_MODELS=4. Example fleet: Qwen3.5-0.8B (1GB) + Gemma 4 E2B (4GB) + SmolLM3-3B (3.5GB) + Qwen3.5-2B (2.7GB) = ~11GB total, plenty of headroom.
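The fleet-sizing arithmetic can be sanity-checked in a few lines. A sketch in Python, using the example fleet above and the upper end of the M3 Max's ~21-24 GB model budget (the dictionary keys are shorthand labels, not exact Ollama tags):

```python
# Example fleet for the M3 Max 36GB (sizes in GB, from the table above).
fleet = {
    "qwen3.5-0.8b-q8": 1.0,
    "gemma4-e2b-q8": 4.0,
    "smollm3-3b-q8": 3.5,
    "qwen3.5-2b-q8": 2.7,
}

MODEL_BUDGET_GB = 24  # upper end of the ~21-24 GB budget from the hardware table

total = sum(fleet.values())
headroom = MODEL_BUDGET_GB - total
print(f"fleet: {total:.1f} GB, headroom: {headroom:.1f} GB")

# With OLLAMA_MAX_LOADED_MODELS=4, all four can stay resident at once.
assert len(fleet) <= 4
assert total <= MODEL_BUDGET_GB
```

Swapping a fleet member is just editing the dictionary and re-checking the headroom before pulling the model.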
M2 Max 32GB — Dispatch workhorse
| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Qwen 2.5 Coder 14B | Q4_K_M | ~9 GB | 25-35 | Primary coding workhorse (HumanEval ~89%) |
| Nemotron 3 Nano 30B-A3B | Q4_K_M | ~18 GB | ~77 (MLX) | AIME 89.1%, LCBv6 68.3% — top priority evaluation |
| DeepSeek-R1-Distill 14B | Q4_K_M | ~9 GB | 22-30 | Chain-of-thought reasoning + code |
| Qwen3.5-9B | Q5_K_M | ~6.5 GB | 28-38 | General + coding, 256K context |
| Phi-4 (14B) | Q4_K_M | ~9 GB | 30-38 | STEM reasoning |
| Qwen3.5-27B | Q4_K_M | ~17 GB | 12-18 | Peak quality (LiveCodeBench 80.7) — slow but usable for batch |
Avoid: Gemma 4 26B MoE — community reports 11 tok/s vs 60+ for similarly-sized dense models. MoE has higher bandwidth demands per active param.
WSL + 3060 12GB — Heavy compute
| Model | Quant | VRAM fit | tok/s | Notes |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | Q4 | ~5 GB VRAM | 40-60 | Best MoE for this card — only 3.2B active |
| Qwen 2.5 Coder 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Interactive workhorse |
| DeepSeek-R1-Distill 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Reasoning + code |
| Qwen3.5-27B | Q4_K_M | Partial (16GB) | 4-8 | ~75% GPU offload |
| Qwen 2.5 Coder 32B | Q4_K_M | Partial (20GB) | 3-5 | HumanEval 92.7% — overnight batch jobs |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | Partial (24GB) | 5-10 | Only 3B active, benefits from partial offload |
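For the partial-offload rows, a rough way to pick llama.cpp's --n-gpu-layers (-ngl) value is to offload however many layers fit in free VRAM, assuming weights are spread evenly across layers. A sketch with illustrative numbers (the 48-layer count and 12 GB usable VRAM are assumptions, not measured values for Qwen3.5-27B):

```python
def gpu_layers(model_gb: float, n_layers: int, usable_vram_gb: float) -> int:
    """Estimate how many transformer layers fit on the GPU,
    assuming weights are spread evenly across layers."""
    return min(n_layers, int(n_layers * usable_vram_gb / model_gb))

# Illustrative: ~16 GB of Q4_K_M weights, 48 layers, ~12 GB usable VRAM on the 3060.
ngl = gpu_layers(16.0, 48, 12.0)
print(f"-ngl {ngl}  (~{100 * ngl / 48:.0f}% of layers on GPU)")
```

With these numbers it lands at 36 of 48 layers, matching the ~75% offload figure in the table. Ollama does the equivalent automatically (tunable via the num_gpu parameter); the manual calculation matters mainly for llama.cpp.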
Coding-specific models
| Model | Params | HumanEval | SWE-bench | Best for |
|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 88.4% | — | Autocomplete/FIM |
| Qwen 2.5 Coder 14B | 14B | ~89% | — | Best balance capability/speed |
| Qwen 2.5 Coder 32B | 32B | 92.7% | — | Highest code quality |
| Qwen3-Coder-Next (80B MoE) | 80B/3B active | — | 64.6% | Beats Claude Opus 4.6 on SWE-bench |
| Qwen3.5-9B | 9B | — | 65.6 LCBv6 | Chat-based coding with vision |
| Qwen3.5-27B | 27B | — | 80.7 LCBv6 | Multi-file reasoning |
Abliterated variant sources
| Producer | Method | Models | Where |
|---|---|---|---|
| huihui-ai | Abliteration | Qwen3.5 (all sizes), Qwen3, Gemma 3, gpt-oss-20b | Ollama + HuggingFace |
| mlabonne | Abliteration | Gemma 3 (1B-27B) + GGUF | HuggingFace |
| bartowski | GGUF quants | QwQ-32B, Llama 3.1 8B, many others | HuggingFace |
| DavidAU | HERETIC | Gemma 4 31B, gpt-oss-20b (multiple variants) | HuggingFace |
| HauhauCS | Abliteration | Gemma 4 E2B, E4B (“aggressive”) | HuggingFace |
| trohrbaugh | Heretic ARA | Gemma 4 31B (KL 0.012, refusals 98→5/100) | HuggingFace |
| p-e-w (Heretic tool) | Automated HERETIC | 1000+ models including Gemma 4 | GitHub + HuggingFace |
| TrevorJS | Biprojection + EGA | Gemma 4 (E2B, E4B, 26B MoE, 31B) | GitHub |
| amarck | Abliteration | Gemma 4 31B (GGUF quants, Q4_K_M ~19GB) | HuggingFace |
| pmarreck | HERETIC | Gemma 4 31B (one-command Ollama/MLX setup) | GitHub |
| aoxo | Fine-tune | gpt-oss-20b | HuggingFace |
Quick Ollama access:
```
ollama pull huihui_ai/qwen3.5-abliterated   # Qwen 3.5 uncensored
ollama pull huihui_ai/gemma3-abliterated    # Gemma 3 uncensored
```
gpt-oss-20b abliterated landscape (complete)
| Variant | Producer | Method | Format |
|---|---|---|---|
| Huihui-gpt-oss-20b-BF16-abliterated | huihui-ai | Abliteration | BF16/Ollama (v1+v2) |
| GPT-oss-20b-abliterated-uncensored-NEO | DavidAU | Abliteration+NEO | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-HERETIC-uncensored-NEO | DavidAU | HERETIC | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4 | DavidAU | HERETIC | Native MXFP4 |
| gpt-oss-20b-uncensored | aoxo | Fine-tune | BF16 |
All fit comfortably on all three machines. MXFP4 at ~14GB or IQ4_NL at ~11.5GB. HERETIC variant claims complete refusal removal.
Independent benchmarks (via BenchLM, DataRobot, Artificial Analysis):
- Arena-Hard-v2: 48.5% (behind Nemotron 3 Nano at 67.7%)
- LiveCodeBench v6: 61.0% (behind Nemotron 3 Nano at 68.3%)
- Matches or exceeds o3-mini on most benchmarks
- Outperforms gpt-oss-120B on HumanEval and MMLU despite being much smaller
- “Low thinking effort” mode outperforms more expensive competitors
- Fits 16GB devices — runs on RTX 3060 and M3 Max easily
Verdict: Solid general-purpose model but Nemotron 3 Nano beats it on coding benchmarks at similar active params. Best use: general reasoning/chat where abliterated variant is preferred.
Quantization reference
| Quant | Bits | Quality | 7B size | 14B size | 27B size |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 | Good | 4.5 GB | 9 GB | 16 GB |
| Q5_K_M | ~5.5 | Better (<2% perplexity loss) | 5.2 GB | 10 GB | 19 GB |
| Q6_K | ~6.5 | High | 6.0 GB | 12 GB | 22 GB |
| Q8_0 | ~8.0 | Near-lossless | 7.5 GB | 15 GB | 27 GB |
Rule of thumb for Apple Silicon: model should be <=60-70% of total unified memory.
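The table rows follow roughly params × bits/8, plus some padding for embeddings and metadata. A sketch of that estimate and the 60-70% rule (the 1.1 overhead factor and 0.65 cutoff are rough assumptions, not measured constants):

```python
def quant_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough GGUF file size: params * bits/8, padded ~10% for embeddings/metadata."""
    return params_b * bits_per_weight / 8 * overhead

def fits(model_gb: float, unified_gb: float, frac: float = 0.65) -> bool:
    """Apple Silicon rule of thumb: keep the model under ~60-70% of unified memory."""
    return model_gb <= frac * unified_gb

size = quant_size_gb(14, 4.5)  # 14B at Q4_K_M (~4.5 bits/weight)
print(f"14B Q4_K_M ~ {size:.1f} GB, fits 32GB Mac: {fits(size, 32)}")
```

The estimate comes out at ~8.7 GB for a 14B Q4_K_M, consistent with the ~9 GB in the table; the remaining gap is context (KV cache), which is why the rule leaves 30-40% free.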
Key insight: TurboQuant — 6x KV cache compression (NEW — April 12)
Google Research’s TurboQuant (March 25, ICLR 2026) compresses KV cache to 3 bits with zero accuracy loss. No retraining required. 6x reduction in KV memory.
Impact on RG’s hardware:
- M3 Max 36GB: Gemma 4 31B at full 262K context becomes possible. KV cache drops from ~22GB to ~3.7GB. 31B Q4 (~20GB) + 3.7GB KV = 23.7GB total — fits.
- M2 Max 32GB: Nemotron 30B-A3B and Qwen3.5-27B can serve dramatically longer contexts within existing memory.
- RTX 3060 12GB: Context length multiplied within same VRAM budget. 14B models can run at very long context.
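The KV-cache arithmetic behind these numbers: per token, the cache stores 2 tensors (K and V) × layers × KV heads × head dim. A sketch with hypothetical dimensions for a 31B-class dense model (the 44-layer / 4-KV-head / 128-dim config below is illustrative, chosen to reproduce the ~22 GB figure; Gemma 4's actual config may differ):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, bits_per_value):
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bits_per_value / 8
    return bytes_total / 2**30

# Illustrative 31B-class config at the full 262K context.
CTX = 262_144
fp16 = kv_cache_gb(44, 4, 128, CTX, 16)
tq3 = kv_cache_gb(44, 4, 128, CTX, 3)  # TurboQuant-style 3-bit cache
print(f"fp16: {fp16:.1f} GB, 3-bit: {tq3:.1f} GB, ratio: {fp16 / tq3:.1f}x")
```

The raw 16-to-3-bit ratio is ~5.3x; the quoted 6x and ~3.7 GB figures presumably reflect TurboQuant's exact packing of scales/outliers, which this naive sketch does not model.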
Implementation status:
- Google official: Q2 2026
- llama.cpp: turboquant_plus project, experimental, Metal support on Apple Silicon
- Validated on models from 1.5B to 104B parameters
The synthesis: TurboQuant + Ollama 0.19 MLX backend = two multiplicative improvements. MLX accelerates compute, TurboQuant expands context. Together they make Apple Silicon the most improved local inference platform.
Key insight: Ollama 0.19 MLX backend
Released March 2026. On Apple Silicon: 57% faster prefill, 93% faster decode vs v0.18 (llama.cpp). The M3 Max has higher memory bandwidth than M4 Pro, so it outperforms newer chips for memory-bound inference. Make sure Ollama is updated.
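The end-to-end effect depends on the prompt/output mix, since prefill and decode speed up by different factors. A sketch, using hypothetical v0.18 baseline rates (300 tok/s prefill, 30 tok/s decode are illustrative; only the 1.57x and 1.93x factors come from the release notes):

```python
def gen_time(prompt_toks: int, out_toks: int, prefill_tps: float, decode_tps: float) -> float:
    """Total wall time: prompt ingestion plus token generation."""
    return prompt_toks / prefill_tps + out_toks / decode_tps

# Hypothetical baseline rates; 1.57x / 1.93x are the quoted MLX speedups.
old = gen_time(2000, 500, 300, 30)
new = gen_time(2000, 500, 300 * 1.57, 30 * 1.93)
print(f"{old:.1f}s -> {new:.1f}s ({old / new:.2f}x faster end-to-end)")
```

For this decode-heavy mix the end-to-end gain lands near the decode factor (~1.8x); long-prompt, short-answer workloads will sit closer to the 1.57x prefill factor.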
Models NOT practical for RG’s hardware
| Model | Why |
|---|---|
| GLM-5.1 (744B MoE, 40B active) | MIT license, #1 SWE-Bench Pro (58.4). Smallest GGUF ~206GB. Cloud/API only. Watch for distills. |
| Kimi K2.5 (1T params) | Even smallest quant (1.8-bit) is ~240GB |
| Llama 4 Scout (109B) | Q4 is ~60GB+ |
| Llama 4 Maverick (400B) | Data center only |
| gpt-oss-120b (117B MoE) | Needs 66GB+ unified for usable speed |
| Nemotron 3 Super 120B-A12B | Too large at full quality |
Other models to assess
- Nemotron 3 Nano 4B: Mamba-Transformer hybrid, claims 5x throughput. Tiny enough for fleet member on M3 Max. Priority evaluation.
- Nemotron 3 Nano 30B-A3B: MoE with only 3B active, Mamba hybrid. Local benchmarks now available: ~77 tok/s M2 Max (MLX), 40-60 tok/s RTX 3060 (Q4, ~5GB VRAM). Fits 3060 comfortably — top priority.
- MiniMax M2.7: “Self-evolving” training. 56.22% SWE-Pro — approaching Claude Opus 4.6. Too large for local but watch for quants/distills.
- Cogito v1 (3B/8B/14B/32B/70B): Dense, hybrid reasoning toggle. Llama/Qwen-base variants. On Ollama.
- Phi-4-mini-reasoning (3.8B): 128K context, reasoning-capable. Worth testing as alternative to SmolLM3.
- Gemma 4 31B HERETIC+Thinking (DavidAU): Chain-of-thought reasoning + uncensored 31B.
- Qwen3-Coder abliterated (huihui-ai): Abliterated variant for the coding model line.
- gpt-oss-20b HERETIC (DavidAU): Claims complete refusal removal. Priority evaluation.
- Gemma 4 31B abliterated GGUF (amarck): Q4_K_M ~19GB, fits M3 Max at short context.
Known issues
- Qwen 3.5 GGUF + Ollama incompatibility: GGUF versions do not work in Ollama due to separate mmproj vision files. Use llama.cpp directly for now.
- Gemma 4 GGUF chat template bug: Community GGUF uploads ship with incorrect chat templates (wrong delimiters), causing "---" output loops. pmarreck/gemma4-heretical fixes this via Ollama RENDERER/PARSER support.
- Gemma 4 31B flash-attention bug in Ollama: Hangs on prompts over ~500 tokens. Workaround: OLLAMA_FLASH_ATTENTION=0, but that tanks speed to ~15 tok/s on Apple Silicon. The 26B MoE is the better pick at ~20-30 tok/s.
- Gemma 4 31B context limits on 36GB Macs: 31B Q4 needs ~20GB weights + ~22GB KV at full 262K context. Only works at short context (<16K) on M3 Max 36GB.
Open threads
- Meta Muse Spark — open-weight contraction: Meta went proprietary. Llama future unclear. The open-weight producers (Google Gemma, Alibaba Qwen, Zhipu GLM, community) become more important. Google’s Apache 2.0 shift for Gemma 4 looks prescient.
- Heretic ARA quality: trohrbaugh’s gemma-4-31b-it-heretic-ara achieves KL divergence 0.012 (virtually no quality loss) while reducing refusals 98→5/100. Current best-quality abliteration for Gemma 4 31B. Needs evaluation.
- Nemotron 3 Nano evaluation: Benchmarks available. AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%. At 3.2B active params, top priority for RG’s 3060. Beats Qwen3-30B-A3B and gpt-oss-20b on coding.
- gpt-oss-20b evaluation: Benchmarks available. Arena-Hard 48.5%, LCBv6 61.0%. Solid but Nemotron 3 Nano beats it. Best for general reasoning with abliterated variant.
- Ollama v0.20.5 (April 9): New release. Gemma 4 all sizes available. Check for stability/perf fixes.
- TrevorJS abliteration technique: Biprojection + EGA, cross-validated against 686 prompts. New method worth tracking.
- Qwen 3.6-Plus: API-only (Alibaba Bailian, OpenRouter). 1M context, agentic coding. Watch for local release.
- DeepSeek V4: Imminent (~1T MoE, ~37B active, 1M context, multimodal, Apache 2.0). Too large for local, but distilled variants will follow.
- vllm-mlx: Server framework claiming 400+ tok/s on tiny models, continuous batching, Claude Code compatible.
- OpenClaw community adopting Kimi K2.5: Signal that model preference is shifting away from Claude for agentic work in open-source community.
- GLM-5.1: 744B MoE, #1 SWE-Bench Pro (58.4), MIT license. Too large for local but distilled variants may follow. Watch Zhipu AI.
- Copilot CLI BYOK: Now supports Ollama, vLLM, any OpenAI-compatible endpoint. Local models become usable inside a major agent’s workflow for the first time.