landscape · models

Local models

living document · last updated 2026-06-10 · 2d ago

2026-06-10

last updated

age

sections

tracked rows

Reference hardware

3 / 3

Profile	Tier	Key spec	Memory bandwidth	Budget for models
M3 Max 36GB	Tiny-fleet — tiny models, high tok/s, multiple in parallel	36GB unified	~400 GB/s	~21-24 GB
M2 Max 32GB	Dispatch — big jobs, 7B-14B	32GB unified	~400 GB/s	~19-22 GB
WSL + 3060 12GB	Heavy compute — biggest models, GPU offload	12GB VRAM + 64GB RAM	PCIe bottleneck on offload	12GB GPU / 64GB total

Cloud model context (benchmarks for reference)

4 / 4

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Gemini 3.1 Pro
SWE-Bench Pro	58.6%	—	64.3%	—
Terminal-Bench 2.0	82.7%	—	69.4%	68.5%
GPQA Diamond	93.6%	—	94.2%	94.3%
FrontierMath Tier 4	—	39.6%	22.9%	—

2 / 2

Model	Total params	Active params	Context	FLOPs vs V3.2	KV cache vs V3.2
V4-Pro	1.6T	49B	1M	27%	10%
V4-Flash	284B	13B	1M	—	—

What just shipped

Mellum 2 (June 1, 2026) — JetBrains' worker-tier coding MoE

1 / 1

Model	Total params	Active params	Architecture	License	Modality
Mellum2-12B-A2.5B-Thinking	12B	2.5B	MoE (sparse)	Apache 2.0	text + code

1 / 1

Model	Base	Total params	Active params	Architecture	Training data
Huihui4-8B-A4B-v2	Gemma 4 26B-A4B-it	9B	~4B	MoE (32 experts, 8 active)	GLM-5.1-Multilingual-STEM

1 / 1

Model	Total params	Active params	Architecture	Context	Modalities
Qwen3.6-27B	27B	27B (dense)	Hybrid Gated DeltaNet + self-attention	TBD	Text

1 / 1

Model	Total params	Active params	Architecture	Context	Key benchmark
Kimi K2.6	1T	32B (384 experts)	MoE, native multimodal	TBD	SWE-Bench Pro 58.6

1 / 1

Model	Total params	Active params	Size (Ollama)	Context	Modalities
Qwen3.6-35B-A3B	35B	3B	~20 GB	262K (1M+ YaRN)	Text, image, video

4 / 4

Model	Total params	Active params	Size (Ollama)	Context	Modalities
Gemma 4 E2B	5.1B	2.3B	7.2 GB	128K	Text, image, audio
Gemma 4 E4B	8B	4.5B	9.6 GB	128K	Text, image, audio
Gemma 4 26B (MoE)	25.2B	3.8B	18 GB	256K	Text, image
Gemma 4 31B (Dense)	30.7B	30.7B	20 GB	256K	Text, image

2 / 2

Model	Total params	Active params	Size (Q4)	Architecture	Key benchmarks
Nemotron 3 Nano 4B	3.6B	3.6B	~2.5 GB	Mamba-Transformer hybrid	TBD for this size
Nemotron 3 Nano 30B-A3B	31.6B	3.2B (MoE)	~18 GB	Mamba-Transformer hybrid	AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%

M3 Max 36GB — Tiny fleet for background tasks

6 / 6

Model	Quant	Size	tok/s	Role
Gemma 4 E2B	MLX 8-bit (Unsloth)	~4 GB	75-85+	Best tiny general-purpose; multimodal+audio. MLX-native = optimal on Apple Silicon
Nemotron 3 Nano 4B	Q8_0	~3.5 GB	TBD	Mamba hybrid for agentic tasks — evaluate
Qwen3.5-0.8B	Q8_0	~1 GB	120-150	Ultra-fast drafting/classification
Qwen3.5-2B	Q8_0	~2.7 GB	80-100	Fast chat/code assist
SmolLM3-3B	Q8_0	~3.5 GB	60-80	Best-in-class 3B; 128K context
Qwen3.5-4B	Q6_K	~3.4 GB	50-65	Strong coding at 4B

M2 Max 32GB — Dispatch workhorse

8 / 8

Model	Quant	Size	tok/s	Role
Qwen3.6-27B Dense	MLX 4-bit (Unsloth)	~15 GB	TBD	NEW — PRIORITY. Outperforms 397B MoE on agentic coding. Dense = predictable inference.
Qwen 2.5 Coder 14B	Q4_K_M	~9 GB	25-35	Primary coding workhorse (HumanEval ~89%)
Qwen3.6-35B-A3B	Q3_K_M	~15 GB	TBD	Best agentic coding MoE. Q3 safer than Q4 on 32GB.
Nemotron 3 Nano 30B-A3B	Q4_K_M	~18 GB	~77 tok/s (MLX)	AIME 89.1%, LCBv6 68.3% — priority evaluation
DeepSeek-R1-Distill 14B	Q4_K_M	~9 GB	22-30	Chain-of-thought reasoning + code
Qwen3.5-9B	Q5_K_M	~6.5 GB	28-38	General + coding, 256K context
Phi-4 (14B)	Q4_K_M	~9 GB	30-38	STEM reasoning
Qwen3.5-27B	Q4_K_M	~17 GB	12-18	Peak quality (LiveCodeBench 80.7) — slow but usable for batch

WSL + 3060 12GB — Heavy compute

6 / 6

Model	Quant	VRAM fit	tok/s	Notes
Nemotron 3 Nano 30B-A3B	Q4	~5 GB VRAM	40-60	Best MoE for this card — only 3.2B active
Qwen 2.5 Coder 14B	Q4_K_M	Full GPU (9GB)	12-18	Interactive workhorse
DeepSeek-R1-Distill 14B	Q4_K_M	Full GPU (9GB)	12-18	Reasoning + code
Qwen3.5-27B	Q4_K_M	Partial (16GB)	4-8	~75% GPU offload
Qwen 2.5 Coder 32B	Q4_K_M	Partial (20GB)	3-5	HumanEval 92.7% — overnight batch jobs
Qwen3.5-35B-A3B (MoE)	Q4_K_M	Partial (24GB)	5-10	Only 3B active, benefits from partial offload

Coding-specific models

6 / 6

Model	Params	HumanEval	SWE-bench	Best for
Qwen 2.5 Coder 7B	7B	88.4%	—	Autocomplete/FIM
Qwen 2.5 Coder 14B	14B	~89%	—	Best balance capability/speed
Qwen 2.5 Coder 32B	32B	92.7%	—	Highest code quality
Qwen3-Coder-Next (80B MoE)	80B/3B active	—	64.6%	Beats Claude Opus 4.6 on SWE-bench
Qwen3.5-9B	9B	—	65.6 LCBv6	Chat-based coding with vision
Qwen3.5-27B	27B	—	80.7 LCBv6	Multi-file reasoning

Abliterated variant sources

11 / 11

Producer	Method	Models	Where
huihui-ai	Abliteration + Expert pruning	Qwen3.6, Qwen3.5 (all sizes), Qwen3, Gemma 3, GLM-5.1, gpt-oss-20b + Huihui4-8B-A4B-v2 (pruned Gemma 4, 9B/4B active, INT4 6-9GB)	Ollama + HuggingFace
mlabonne	Abliteration	Gemma 3 (1B-27B) + GGUF	HuggingFace
bartowski	GGUF quants	QwQ-32B, Llama 3.1 8B, many others	HuggingFace
DavidAU	HERETIC	Gemma 4 31B, gpt-oss-20b (multiple variants)	HuggingFace
HauhauCS	Abliteration	Gemma 4 E2B, E4B, Qwen3.6-35B-A3B ("aggressive")	HuggingFace
trohrbaugh	Heretic ARA	Gemma 4 31B (KL 0.012, refusals 98→5/100)	HuggingFace
p-e-w (Heretic tool)	Automated HERETIC	1000+ models including Gemma 4	GitHub + HuggingFace
TrevorJS	Biprojection + EGA	Gemma 4 (E2B, E4B, 26B MoE, 31B)	GitHub
amarck	Abliteration	Gemma 4 31B (GGUF quants, Q4_K_M ~19GB)	HuggingFace
pmarreck	HERETIC	Gemma 4 31B (one-command Ollama/MLX setup)	GitHub
aoxo	Fine-tune	gpt-oss-20b	HuggingFace

gpt-oss-20b abliterated landscape (complete)

5 / 5

Variant	Producer	Method	Format
Huihui-gpt-oss-20b-BF16-abliterated	huihui-ai	Abliteration	BF16/Ollama (v1+v2)
GPT-oss-20b-abliterated-uncensored-NEO	DavidAU	Abliteration+NEO	GGUF (IQ4_NL, Q5_1, Q8_0)
GPT-oss-20b-HERETIC-uncensored-NEO	DavidAU	HERETIC	GGUF (IQ4_NL, Q5_1, Q8_0)
GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4	DavidAU	HERETIC	Native MXFP4
gpt-oss-20b-uncensored	aoxo	Fine-tune	BF16

Quantization reference

4 / 4

Quant	Bits	Quality	7B size	14B size	27B size
Q4_K_M	~4.5	Good	4.5 GB	9 GB	16 GB
Q5_K_M	~5.5	Better (<2% perplexity loss)	5.2 GB	10 GB	19 GB
Q6_K	~6.5	High	6.0 GB	12 GB	22 GB
Q8_0	~8.0	Near-lossless	7.5 GB	15 GB	27 GB

Key insight: Unsloth MLX-native Gemma 4 lineup (NEW — April 14)

Unsloth uploaded MLX-native quantizations for the full Gemma 4 family:

4 / 4

Model	MLX 3-bit	MLX 4-bit	MLX 8-bit
Gemma 4 E2B	—	—	✓
Gemma 4 E4B	—	✓	✓
Gemma 4 26B MoE	✓	✓	✓
Gemma 4 31B Dense	✓	✓	✓

Models NOT practical on the reference hardware

6 / 6

Model	Why
GLM-5.1 (744B MoE, 40B active)	MIT license, #1 SWE-Bench Pro (58.4). Open-weight on HuggingFace (`zai-org/GLM-5.1`). Smallest GGUF ~206GB. huihui-ai abliterated GGUF exists. MLX community version exists. Watch for distills.
Kimi K2.5 (1T params)	Even smallest quant (1.8-bit) is ~240GB
Llama 4 Scout (109B)	Q4 is ~60GB+
Llama 4 Maverick (400B)	Data center only
gpt-oss-120b (117B MoE)	Needs 66GB+ unified for usable speed
Nemotron 3 Super 120B-A12B	Too large at full quality

Full document — prose, analysis, and everything not in the tables above

Local model landscape

Living document. Rewritten as new models ship. Last updated: 2026-06-10.

Reference hardware

Three representative consumer profiles, used as a baseline for the fit recommendations below. They span the common local-inference envelope: high-bandwidth Apple Silicon for fast small-model fleets, and a consumer NVIDIA card for GPU-offload of larger models.

Profile	Tier	Key spec	Memory bandwidth	Budget for models
M3 Max 36GB	Tiny-fleet — tiny models, high tok/s, multiple in parallel	36GB unified	~400 GB/s	~21-24 GB
M2 Max 32GB	Dispatch — big jobs, 7B-14B	32GB unified	~400 GB/s	~19-22 GB
WSL + 3060 12GB	Heavy compute — biggest models, GPU offload	12GB VRAM + 64GB RAM	PCIe bottleneck on offload	12GB GPU / 64GB total

Preferences

Abliterated/uncensored variants preferred — no alignment tax
Key producers: huihui-ai (Ollama + HF), mlabonne (HF), bartowski (GGUF quants), DavidAU (HERETIC method)
Inference: Ollama 0.19+ (MLX backend — 57% faster prefill, 93% faster decode vs 0.18)

Cloud model context (benchmarks for reference)

Claude Fable 5 / Mythos 5 (June 9) — new Anthropic frontier ceiling, replacing Opus 4.8 as the GA top. One set of weights, two names: Fable 5 (general, safeguarded) and Mythos 5 (ungated, Glasswing/bio-research partners only). SOTA “on nearly all tested benchmarks”; higher than Opus 4.8 on FrontierCode even at medium effort; new vision SOTA; “millions of tokens” long-context focus. Pricing $10/$50 (2× regular Opus 4.8). Safety is a routing layer, not a weights property: cyber/bio-chem/distillation queries classifier-fall-back to Opus 4.8 — the n-1 frontier is now the next one’s safety floor. Local-inference consequence: the bar TurboQuant-class compression has to clear on consumer hardware just moved up again, and the frontier↔local capability gap widened — but the governance gap (per-query capability gating) is something open weights structurally cannot copy, which keeps the open-weight value proposition on cost/control rather than frontier parity. Local relevance otherwise low (hosted-only). Full analysis: reports/2026-06-10-the-fable-and-the-fallback.md.

North Mini Code (Cohere, June 9) — Cohere’s first developer model and a new open-weight coding contributor: 30B MoE / 3B active, Apache 2.0 (bf16+fp8), 128K context, trained across multiple agent scaffolds for harness-robustness. 80.2% pass@10 SWE-Bench Verified, 55.1% pass@10 Terminal-Bench v2; positioned above similarly-sized Qwen3.5/Gemma 4/Devstral Small 2 and larger Nemotron 3 Super/Mistral Small 4/Devstral 2. 3B-active = sub-agent/worker economics (cf. Mellum 2); 30B at fp8 (~30GB) exceeds Mac budgets, Q4 (~15–16GB) fits. Candidate local coding worker. See radar/signals/2026-06-09-introducing-north-mini-code-cohere-s-first-model-for-develop.md.

Gemma 4 12B (June 3) — new encoder-free, native-audio mid-tier in a tracked family; full detail + hardware fit in models/families/gemma/README.md.

GPT-5.5 “Spud” (April 23) — first fully retrained base since GPT-4.5. 1M context (API). Natively omnimodal.

Benchmark	GPT-5.5	GPT-5.5 Pro	Claude Opus 4.7	Gemini 3.1 Pro
SWE-Bench Pro	58.6%	—	64.3%	—
Terminal-Bench 2.0	82.7%	—	69.4%	68.5%
GPQA Diamond	93.6%	—	94.2%	94.3%
FrontierMath Tier 4	—	39.6%	22.9%	—

Pricing: Standard $5/$30, Pro $30/$180 per 1M tokens. The frontier is now a surface, not a point — no single model wins every benchmark. Local models compete on cost at the expense of benchmark position.

DeepSeek V4 (April 23-24) — MIT license, largest open-weight model, radical efficiency gains:

Model	Total params	Active params	Context	FLOPs vs V3.2	KV cache vs V3.2
V4-Pro	1.6T	49B	1M	27%	10%
V4-Flash	284B	13B	1M	—	—

V4-Pro #1 open on Vibe Code Bench. Flash $0.14/$0.28, Pro $1.74/$3.48 per 1M tokens — 36-107x cheaper than GPT-5.5. Not viable for local inference (too large), but the CSA/HCA attention compression architecture will propagate to smaller models. When it reaches 7-27B scale, local agents get dramatically longer context without hardware upgrades.

What just shipped

Mellum 2 (June 1, 2026) — JetBrains’ worker-tier coding MoE

Model	Total params	Active params	Architecture	License	Modality
Mellum2-12B-A2.5B-Thinking	12B	2.5B	MoE (sparse)	Apache 2.0	text + code

JetBrains’ model card positions it not as a flagship coder but for routing/orchestration in multi-model systems and sub-agent tasks (planning, validation, transformation), RAG context compression, and private-code deployment — i.e. the worker slot in an agent fleet. ~2× faster inference than comparable dense models from the low active-param count. Context length not stated in the launch post (arXiv 2605.31268 + model card to confirm).

Hardware fit (the low active-param count is the story — fully GPU-resident on all three):

M3 Max 36GB: Q4 ~7GB — comfortable; can host several parallel instances, high tok/s
M2 Max 32GB: Q4 ~7GB — ideal dispatch worker for batch sub-agent jobs
RTX 3060 12GB: Q4/Q5 ~7–8GB — fully in VRAM, no CPU offload; Q8 (~13GB) spills, stay at Q4/Q5

Significance / recommendation change: for a local sub-agent / code-completion worker role, Mellum 2 (Q4/Q5) is now the model to reach for on the 3060 box — the rare 12B that stays GPU-resident on a 12GB card because only 2.5B params activate, fast enough to run fanned-out rather than as a single assistant. On the Macs it’s a strong dispatch-worker default. Lands in the slot the orchestration layer is creating: as fleet-of-cheap-workers-under-one-planner becomes the dominant pattern (Opus 4.8 Dynamic Workflows; 60%+ of Codex users run parallel tasks), an open, fast, code-specialized local worker is the piece an open-source agent stack was missing.

Huihui4-8B-A4B-v2 (April 27, 2026) — Expert-pruned Gemma 4 variant

Model	Base	Total params	Active params	Architecture	Training data
Huihui4-8B-A4B-v2	Gemma 4 26B-A4B-it	9B	~4B	MoE (32 experts, 8 active)	GLM-5.1-Multilingual-STEM

Expert pruning (128 → 32 experts) + SFT. Uses GLM-5.1 thinking mode format. INT4/INT8: 6-9GB VRAM. Cross-architecture lineage: Google model base, Chinese training data + reasoning format.

Hardware fit:

M3 Max 36GB: INT4 ~6GB — comfortable, room for multi-model fleet
M2 Max 32GB: INT4 ~6GB — comfortable
RTX 3060: INT4 ~6GB — fits entirely in VRAM

Significance: huihui-ai’s first technique beyond abliteration. Expert pruning restructures the model rather than removing safety guardrails. The v2 suffix indicates iterative refinement. At 6-9GB, this is the smallest Gemma 4-based coding-capable model — evaluate against Gemma 4 E4B (4.5B active, 9.6GB Ollama) to see if pruning preserves coding quality.

Qwen3.6-27B Dense (April 22, 2026) — Apache 2.0, most important local model release since Gemma 4

Model	Total params	Active params	Architecture	Context	Modalities
Qwen3.6-27B	27B	27B (dense)	Hybrid Gated DeltaNet + self-attention	TBD	Text

Dense model (not MoE). “Thinking Preservation” mechanism. Outperforms the 397B MoE Qwen3.6 on agentic coding benchmarks — 14x smaller, better at the specific task. Unsloth MLX quants (4/6/8-bit) available same day.

Hardware fit:

M3 Max 36GB: Q4_K_M (~15 GB) — fits with ~7GB headroom. Primary coding model candidate.
M2 Max 32GB: Q4_K_M (~15 GB) — fits with ~7GB headroom. Dispatch upgrade.
RTX 3060: Does not fit in 12GB VRAM. CPU offload with 64GB RAM possible but slow.

Priority evaluation. If the agentic coding benchmarks hold on practical tasks, this replaces the MoE models as the recommended local coding model for Apple Silicon. Dense architecture = more predictable inference, no routing overhead.

Kimi K2.6 (April 20, 2026) — Modified MIT, agent swarm architecture

Model	Total params	Active params	Architecture	Context	Key benchmark
Kimi K2.6	1T	32B (384 experts)	MoE, native multimodal	TBD	SWE-Bench Pro 58.6

First model designed for massive multi-agent orchestration: 300 sub-agents, 4,000 coordinated steps. SWE-Bench Pro 58.6 beats GPT-5.4 (57.7) and Opus 4.6 (57.3). Too large for local at full scale. Watch for distilled variants targeting the 32B active param slice.

Qwen3.6-35B-A3B (April 15-16, 2026) — Apache 2.0, MoE variant

Model	Total params	Active params	Size (Ollama)	Context	Modalities
Qwen3.6-35B-A3B	35B	3B	~20 GB	262K (1M+ YaRN)	Text, image, video

Terminal-Bench 2.0: 51.5. Best open-weight agentic coding MoE at this parameter count. huihui-ai shipped abliterated + Claude-named variants. Unsloth Dynamic 2.0 + bartowski imatrix available.

Hardware fit:

M3 Max 36GB: Q4_K_M (~18-19 GB) — fits but tight
M2 Max 32GB: Q3_K_M (~15 GB) safer
RTX 3060: does not fit. CPU+GPU split viable for batch.

Gemma 4 (April 2, 2026) — Apache 2.0 license (major change from Gemma 3’s custom license)

Model	Total params	Active params	Size (Ollama)	Context	Modalities
Gemma 4 E2B	5.1B	2.3B	7.2 GB	128K	Text, image, audio
Gemma 4 E4B	8B	4.5B	9.6 GB	128K	Text, image, audio
Gemma 4 26B (MoE)	25.2B	3.8B	18 GB	256K	Text, image
Gemma 4 31B (Dense)	30.7B	30.7B	20 GB	256K	Text, image

Gemma 4 E2B beats Gemma 3 27B on most benchmarks with only 2.3B active params. Most efficient model per byte I’ve tracked. ~75-85 tok/s on M3 Max.

Abliterated variants expanding — see abliteration section below.

Nemotron 3 Nano — Mamba-Transformer hybrid, benchmarks now available

Model	Total params	Active params	Size (Q4)	Architecture	Key benchmarks
Nemotron 3 Nano 4B	3.6B	3.6B	~2.5 GB	Mamba-Transformer hybrid	TBD for this size
Nemotron 3 Nano 30B-A3B	31.6B	3.2B (MoE)	~18 GB	Mamba-Transformer hybrid	AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%

Independent benchmarks (via NeMo Evaluator):

AIME 2025: 89.1% (beats Qwen3-30B-A3B at 85.0%; 99.2% with Python tools)
LiveCodeBench v6: 68.3% (beats Qwen3 66.0% and gpt-oss 61.0%)
Arena-Hard-v2: 67.7% (vs Qwen3-30B 57.8%, gpt-oss-20b 48.5%)
RULER: 87.5% at 64K, 82.9% at 128K, 70.6% at 512K (supports 1M context)
3.3x throughput vs Qwen3-30B-A3B on single H200

Verdict: At 3.2B active params, this runs on RTX 3060 12GB comfortably. Strong coding/reasoning at tiny active parameter count. Priority recommendation for the 3060 profile. GGUF quants available from Unsloth.

Hardware x Model fit matrix

M3 Max 36GB — Tiny fleet for background tasks

Model	Quant	Size	tok/s	Role
Gemma 4 E2B	MLX 8-bit (Unsloth)	~4 GB	75-85+	Best tiny general-purpose; multimodal+audio. MLX-native = optimal on Apple Silicon
Nemotron 3 Nano 4B	Q8_0	~3.5 GB	TBD	Mamba hybrid for agentic tasks — evaluate
Qwen3.5-0.8B	Q8_0	~1 GB	120-150	Ultra-fast drafting/classification
Qwen3.5-2B	Q8_0	~2.7 GB	80-100	Fast chat/code assist
SmolLM3-3B	Q8_0	~3.5 GB	60-80	Best-in-class 3B; 128K context
Qwen3.5-4B	Q6_K	~3.4 GB	50-65	Strong coding at 4B

Multi-model strategy: Set OLLAMA_MAX_LOADED_MODELS=4. Example fleet: Qwen3.5-0.8B (1GB) + Gemma 4 E2B (4GB) + SmolLM3-3B (3.5GB) + Qwen3.5-2B (2.7GB) = ~11GB total, plenty of headroom.

M2 Max 32GB — Dispatch workhorse

Model	Quant	Size	tok/s	Role
Qwen3.6-27B Dense	MLX 4-bit (Unsloth)	~15 GB	TBD	NEW — PRIORITY. Outperforms 397B MoE on agentic coding. Dense = predictable inference.
Qwen 2.5 Coder 14B	Q4_K_M	~9 GB	25-35	Primary coding workhorse (HumanEval ~89%)
Qwen3.6-35B-A3B	Q3_K_M	~15 GB	TBD	Best agentic coding MoE. Q3 safer than Q4 on 32GB.
Nemotron 3 Nano 30B-A3B	Q4_K_M	~18 GB	~77 tok/s (MLX)	AIME 89.1%, LCBv6 68.3% — priority evaluation
DeepSeek-R1-Distill 14B	Q4_K_M	~9 GB	22-30	Chain-of-thought reasoning + code
Qwen3.5-9B	Q5_K_M	~6.5 GB	28-38	General + coding, 256K context
Phi-4 (14B)	Q4_K_M	~9 GB	30-38	STEM reasoning
Qwen3.5-27B	Q4_K_M	~17 GB	12-18	Peak quality (LiveCodeBench 80.7) — slow but usable for batch

Avoid: Gemma 4 26B MoE — community reports 11 tok/s vs 60+ for similarly-sized dense models. MoE has higher bandwidth demands per active param.

WSL + 3060 12GB — Heavy compute

Model	Quant	VRAM fit	tok/s	Notes
Nemotron 3 Nano 30B-A3B	Q4	~5 GB VRAM	40-60	Best MoE for this card — only 3.2B active
Qwen 2.5 Coder 14B	Q4_K_M	Full GPU (9GB)	12-18	Interactive workhorse
DeepSeek-R1-Distill 14B	Q4_K_M	Full GPU (9GB)	12-18	Reasoning + code
Qwen3.5-27B	Q4_K_M	Partial (16GB)	4-8	~75% GPU offload
Qwen 2.5 Coder 32B	Q4_K_M	Partial (20GB)	3-5	HumanEval 92.7% — overnight batch jobs
Qwen3.5-35B-A3B (MoE)	Q4_K_M	Partial (24GB)	5-10	Only 3B active, benefits from partial offload

Coding-specific models

Model	Params	HumanEval	SWE-bench	Best for
Qwen 2.5 Coder 7B	7B	88.4%	—	Autocomplete/FIM
Qwen 2.5 Coder 14B	14B	~89%	—	Best balance capability/speed
Qwen 2.5 Coder 32B	32B	92.7%	—	Highest code quality
Qwen3-Coder-Next (80B MoE)	80B/3B active	—	64.6%	Beats Claude Opus 4.6 on SWE-bench
Qwen3.5-9B	9B	—	65.6 LCBv6	Chat-based coding with vision
Qwen3.5-27B	27B	—	80.7 LCBv6	Multi-file reasoning

Abliterated variant sources

Producer	Method	Models	Where
huihui-ai	Abliteration + Expert pruning	Qwen3.6, Qwen3.5 (all sizes), Qwen3, Gemma 3, GLM-5.1, gpt-oss-20b + Huihui4-8B-A4B-v2 (pruned Gemma 4, 9B/4B active, INT4 6-9GB)	Ollama + HuggingFace
mlabonne	Abliteration	Gemma 3 (1B-27B) + GGUF	HuggingFace
bartowski	GGUF quants	QwQ-32B, Llama 3.1 8B, many others	HuggingFace
DavidAU	HERETIC	Gemma 4 31B, gpt-oss-20b (multiple variants)	HuggingFace
HauhauCS	Abliteration	Gemma 4 E2B, E4B, Qwen3.6-35B-A3B (“aggressive”)	HuggingFace
trohrbaugh	Heretic ARA	Gemma 4 31B (KL 0.012, refusals 98→5/100)	HuggingFace
p-e-w (Heretic tool)	Automated HERETIC	1000+ models including Gemma 4	GitHub + HuggingFace
TrevorJS	Biprojection + EGA	Gemma 4 (E2B, E4B, 26B MoE, 31B)	GitHub
amarck	Abliteration	Gemma 4 31B (GGUF quants, Q4_K_M ~19GB)	HuggingFace
pmarreck	HERETIC	Gemma 4 31B (one-command Ollama/MLX setup)	GitHub
aoxo	Fine-tune	gpt-oss-20b	HuggingFace

Quick Ollama access:

ollama pull huihui_ai/qwen3.5-abliterated       # Qwen 3.5 uncensored
ollama pull huihui_ai/gemma3-abliterated         # Gemma 3 uncensored

gpt-oss-20b abliterated landscape (complete)

Variant	Producer	Method	Format
Huihui-gpt-oss-20b-BF16-abliterated	huihui-ai	Abliteration	BF16/Ollama (v1+v2)
GPT-oss-20b-abliterated-uncensored-NEO	DavidAU	Abliteration+NEO	GGUF (IQ4_NL, Q5_1, Q8_0)
GPT-oss-20b-HERETIC-uncensored-NEO	DavidAU	HERETIC	GGUF (IQ4_NL, Q5_1, Q8_0)
GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4	DavidAU	HERETIC	Native MXFP4
gpt-oss-20b-uncensored	aoxo	Fine-tune	BF16

All fit comfortably on all three machines. MXFP4 at ~14GB or IQ4_NL at ~11.5GB. HERETIC variant claims complete refusal removal.

Independent benchmarks (via BenchLM, DataRobot, Artificial Analysis):

Arena-Hard-v2: 48.5% (behind Nemotron 3 Nano at 67.7%)
LiveCodeBench v6: 61.0% (behind Nemotron 3 Nano at 68.3%)
Matches or exceeds o3-mini on most benchmarks
Outperforms gpt-oss-120B on HumanEval and MMLU despite being much smaller
“Low thinking effort” mode outperforms more expensive competitors
Fits 16GB devices — runs on RTX 3060 and M3 Max easily

Verdict: Solid general-purpose model but Nemotron 3 Nano beats it on coding benchmarks at similar active params. Best use: general reasoning/chat where abliterated variant is preferred.

Quantization reference

Quant	Bits	Quality	7B size	14B size	27B size
Q4_K_M	~4.5	Good	4.5 GB	9 GB	16 GB
Q5_K_M	~5.5	Better (<2% perplexity loss)	5.2 GB	10 GB	19 GB
Q6_K	~6.5	High	6.0 GB	12 GB	22 GB
Q8_0	~8.0	Near-lossless	7.5 GB	15 GB	27 GB

Rule of thumb for Apple Silicon: model should be <=60-70% of total unified memory.

Key insight: TurboQuant — 6x KV cache compression (NEW — April 12)

Google Research’s TurboQuant (March 25, ICLR 2026) compresses KV cache to 3 bits with zero accuracy loss. No retraining required. 6x reduction in KV memory.

Impact on the reference hardware:

M3 Max 36GB: Gemma 4 31B at full 262K context becomes possible. KV cache drops from ~22GB to ~3.7GB. 31B Q4 (~20GB) + 3.7GB KV = 23.7GB total — fits.
M2 Max 32GB: Nemotron 30B-A3B and Qwen3.5-27B can serve dramatically longer contexts within existing memory.
RTX 3060 12GB: Context length multiplied within same VRAM budget. 14B models can run at very long context.

Implementation status:

Google official: Q2 2026
llama.cpp: turboquant_plus project, experimental, Metal support on Apple Silicon
Validated from 1.5B to 104B parameter models

The synthesis: TurboQuant + Ollama 0.19 MLX backend = two multiplicative improvements. MLX accelerates compute, TurboQuant expands context. Together they make Apple Silicon the most improved local inference platform.

Key insight: Unsloth MLX-native Gemma 4 lineup (NEW — April 14)

Unsloth uploaded MLX-native quantizations for the full Gemma 4 family:

Model	MLX 3-bit	MLX 4-bit	MLX 8-bit
Gemma 4 E2B	—	—	✓
Gemma 4 E4B	—	✓	✓
Gemma 4 26B MoE	✓	✓	✓
Gemma 4 31B Dense	✓	✓	✓

Why this matters: MLX-native quants skip GGUF→MLX conversion overhead. Combined with Ollama 0.19’s MLX backend, these are the optimal format for Apple Silicon. The Gemma 4 26B MoE at MLX 4-bit (~17GB) fits M3 Max and M2 Max comfortably. The 31B Dense at MLX 3-bit may also fit within budget.

Updated recommendation: For general-purpose inference on Apple Silicon, prefer Unsloth MLX quants over GGUF when available.

huihui-ai abliteration wave (April 14-16)

Huihui4-48B-A4B-abliterated (April 16) — experimental expanded architecture: takes Gemma 4 26B-A4B and replaces MLP layers with 256-expert MoE, expanding to 48B total. Not fine-tuned yet. Experimental.
Huihui3.5-67B-A3B (April 16) — Qwen3.5-35B-A3B base expanded to 512 experts, 67B/3B active MoE
Gemma 4 E2B, 31B, 26B MoE abliterated v2 — refreshed versions
Full Qwen3.5, GLM-4.7, Kimi, Mistral-Small-4 lineups now abliterated

DavidAU LFM2 HERETIC series (NEW — April 14-16)

Liquid AI’s LFM2 foundation models (SSM-Transformer hybrids) getting the HERETIC treatment:

LFM2-12B-A1B High-Intelligence Series-B (April 16)
LFM2-12B-A1B Deckard-II HERETIC Uncensored (April 16)
LFM2-8B-A1B Deckard-II HERETIC Uncensored (April 15)
LFM2-8B-A1B GLM-4.7-Flash Thinking (April 15)
gemma-4-19B-A4B-it INSTRUCT Heretic-Uncensored (April 15)
gemma-4-E4B-it Deckard-V2 Strong HERETIC (April 14)

At 1B active params, the LFM2-8B models are extremely efficient — run on all three machines. SSM-Transformer hybrid architecture is distinct from standard transformer; worth evaluating for latency characteristics.

Key insight: Ollama 0.19 MLX backend

Released March 2026. On Apple Silicon: 57% faster prefill, 93% faster decode vs v0.18 (llama.cpp). The M3 Max has higher memory bandwidth than M4 Pro, so it outperforms newer chips for memory-bound inference. Make sure Ollama is updated.

Models NOT practical on the reference hardware

Model	Why
GLM-5.1 (744B MoE, 40B active)	MIT license, #1 SWE-Bench Pro (58.4). Open-weight on HuggingFace (`zai-org/GLM-5.1`). Smallest GGUF ~206GB. huihui-ai abliterated GGUF exists. MLX community version exists. Watch for distills.
Kimi K2.5 (1T params)	Even smallest quant (1.8-bit) is ~240GB
Llama 4 Scout (109B)	Q4 is ~60GB+
Llama 4 Maverick (400B)	Data center only
gpt-oss-120b (117B MoE)	Needs 66GB+ unified for usable speed
Nemotron 3 Super 120B-A12B	Too large at full quality

Other models to assess

DavidAU LFM2-8B-A1B variants: SSM-Transformer hybrid from Liquid AI, HERETIC uncensored. 1B active params. Extremely efficient — fits all three machines. New architecture worth evaluating for latency and agentic task performance.
DavidAU gemma-4-19B-A4B HERETIC: Compact uncensored Gemma 4 variant. Fits M3 Max and M2 Max at Q4.
Nemotron 3 Nano 4B: Mamba-Transformer hybrid, claims 5x throughput. Tiny enough for fleet member on M3 Max. Priority evaluation.
Nemotron 3 Nano 30B-A3B: MoE with only 3B active, Mamba hybrid. Local benchmarks now available: ~77 tok/s M2 Max (MLX), 40-60 tok/s RTX 3060 (Q4, ~5GB VRAM). Fits 3060 comfortably — top priority.
MiniMax M2.7: “Self-evolving” training. 56.22% SWE-Pro — approaching Claude Opus 4.6. Too large for local but watch for quants/distills.
Cogito v1 (3B/8B/14B/32B/70B): Dense, hybrid reasoning toggle. Llama/Qwen-base variants. On Ollama.
Phi-4-mini-reasoning (3.8B): 128K context, reasoning-capable. Worth testing as alternative to SmolLM3.
Gemma 4 31B HERETIC+Thinking (DavidAU): Chain-of-thought reasoning + uncensored 31B.
Qwen3-Coder abliterated (huihui-ai): Abliterated variant for the coding model line.
gpt-oss-20b HERETIC (DavidAU): Claims complete refusal removal. Priority evaluation.
Gemma 4 31B abliterated GGUF (amarck): Q4_K_M ~19GB, fits M3 Max at short context.

Known issues

Qwen 3.5 GGUF + Ollama incompatibility: GGUF versions do not work in Ollama due to separate mmproj vision files. Use llama.cpp directly for now.
Gemma 4 GGUF chat template bug: Community GGUF uploads ship with incorrect chat templates (wrong delimiters), causing ”---” output loops. pmarreck/gemma4-heretical fixes this via Ollama RENDERER/PARSER support.
Gemma 4 31B flash-attention bug in Ollama: Hangs on prompts over ~500 tokens. Workaround: OLLAMA_FLASH_ATTENTION=0 but tanks speed to ~15 tok/s on Apple Silicon. The 26B MoE is the better pick at ~20-30 tok/s.
Gemma 4 31B context limits on 36GB Macs: 31B Q4 needs ~20GB weights + ~22GB KV at full 262K context. Only works at short context (<16K) on M3 Max 36GB.

Open threads

Meta Muse Spark — open-weight contraction: Meta went proprietary. Llama future unclear. The open-weight producers (Google Gemma, Alibaba Qwen, Zhipu GLM, community) become more important. Google’s Apache 2.0 shift for Gemma 4 looks prescient.
Heretic ARA quality: trohrbaugh’s gemma-4-31b-it-heretic-ara achieves KL divergence 0.012 (virtually no quality loss) while reducing refusals 98→5/100. Current best-quality abliteration for Gemma 4 31B. Needs evaluation.
Nemotron 3 Nano evaluation: Benchmarks available. AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%. At 3.2B active params, top priority for the 3060 profile. Beats Qwen3-30B-A3B and gpt-oss-20b on coding.
gpt-oss-20b evaluation: Benchmarks available. Arena-Hard 48.5%, LCBv6 61.0%. Solid but Nemotron 3 Nano beats it. Best for general reasoning with abliterated variant.
Ollama v0.20.5 (April 9): New release. Gemma 4 all sizes available. Check for stability/perf fixes.
TrevorJS abliteration technique: Biprojection + EGA, cross-validated against 686 prompts. New method worth tracking.
Qwen 3.6-Plus: API-only (Alibaba Bailian, OpenRouter). 1M context, agentic coding. Watch for local release.
huihui-ai Huihui4-8B-A4B: New model family uploaded April 25. 8B total/4B active (MoE), image-text-to-text. GGUF variant also available. If original work (not abliteration), marks huihui-ai’s transition to model producer. Fits all three machines easily. Evaluate.
huihui-ai Qwen3.6-27B abliterated: Uploaded April 23, 539 downloads. Abliterated dense Qwen3.6-27B — the model that outperforms 397B MoE. Combined with Unsloth MLX quants for Apple Silicon deployment.
DeepSeek V4: Shipped April 23-24. V4-Pro 1.6T/49B active, V4-Flash 284B/13B active. MIT license. CSA/HCA attention compression (27% FLOPs, 10% KV cache vs V3.2). Too large for local. The architecture matters more than the weights — watch for compression techniques propagating to smaller models.
vllm-mlx: Server framework claiming 400+ tok/s on tiny models, continuous batching, Claude Code compatible.
OpenClaw community adopting Kimi K2.5: Signal that model preference is shifting away from Claude for agentic work in open-source community.
GLM-5.1: 744B MoE (40B active), #1 SWE-Bench Pro (58.4), MIT license. Open-weight since April 7 (corrected from “cloud-only”). huihui-ai abliterated GGUF available. MLX community version exists. Too large for local at full scale (~206GB) but distills/aggressive quants may change this. Watch Z.ai for smaller variants.
Copilot CLI BYOK: Now supports Ollama, vLLM, any OpenAI-compatible endpoint. Local models become usable inside a major agent’s workflow for the first time.

← all landscape docs