Local models

living document · last updated 2026-06-10

Local model landscape

Living document. Rewritten as new models ship. Last updated: 2026-06-10.

Reference hardware

Three representative consumer profiles, used as a baseline for the fit recommendations below. They span the common local-inference envelope: high-bandwidth Apple Silicon for fast small-model fleets, and a consumer NVIDIA card for GPU-offload of larger models.

| Profile | Tier | Key spec | Memory bandwidth | Budget for models |
|---|---|---|---|---|
| M3 Max 36GB | Tiny-fleet: tiny models, high tok/s, multiple in parallel | 36GB unified | ~400 GB/s | ~21-24 GB |
| M2 Max 32GB | Dispatch: big jobs, 7B-14B | 32GB unified | ~400 GB/s | ~19-22 GB |
| WSL + 3060 12GB | Heavy compute: biggest models, GPU offload | 12GB VRAM + 64GB RAM | PCIe bottleneck on offload | 12GB GPU / 64GB total |

Preferences

  • Abliterated/uncensored variants preferred — no alignment tax
  • Key producers: huihui-ai (Ollama + HF), mlabonne (HF), bartowski (GGUF quants), DavidAU (HERETIC method)
  • Inference: Ollama 0.19+ (MLX backend — 57% faster prefill, 93% faster decode vs 0.18)

Cloud model context (benchmarks for reference)

Claude Fable 5 / Mythos 5 (June 9): the new Anthropic frontier ceiling, replacing Opus 4.8 as the GA top. One set of weights ships under two names: Fable 5 (general, safeguarded) and Mythos 5 (ungated, Glasswing/bio-research partners only). It is SOTA “on nearly all tested benchmarks”, scores higher than Opus 4.8 on FrontierCode even at medium effort, sets a new vision SOTA, and focuses on “millions of tokens” of long context. Pricing is $10/$50 (2× regular Opus 4.8). Safety is a routing layer, not a weights property: a classifier routes cyber, bio-chem, and distillation queries back to Opus 4.8, so the n-1 frontier model is now the next one’s safety floor. Consequences for local inference: the bar that TurboQuant-class compression has to clear on consumer hardware moved up again, and the frontier-to-local capability gap widened. However, the governance gap (per-query capability gating) is something open weights structurally cannot copy, which keeps the open-weight value proposition on cost and control rather than frontier parity. Local relevance is otherwise low (hosted-only). Full analysis: reports/2026-06-10-the-fable-and-the-fallback.md.

North Mini Code (Cohere, June 9) — Cohere’s first developer model and a new open-weight coding contributor: 30B MoE / 3B active, Apache 2.0 (bf16+fp8), 128K context, trained across multiple agent scaffolds for harness-robustness. 80.2% pass@10 SWE-Bench Verified, 55.1% pass@10 Terminal-Bench v2; positioned above similarly-sized Qwen3.5/Gemma 4/Devstral Small 2 and larger Nemotron 3 Super/Mistral Small 4/Devstral 2. 3B-active = sub-agent/worker economics (cf. Mellum 2); 30B at fp8 (~30GB) exceeds Mac budgets, Q4 (~15–16GB) fits. Candidate local coding worker. See radar/signals/2026-06-09-introducing-north-mini-code-cohere-s-first-model-for-develop.md.

Gemma 4 12B (June 3) — new encoder-free, native-audio mid-tier in a tracked family; full detail + hardware fit in models/families/gemma/README.md.

GPT-5.5 “Spud” (April 23) — first fully retrained base since GPT-4.5. 1M context (API). Natively omnimodal.

Benchmark · GPT-5.5 · GPT-5.5 Pro · Claude Opus 4.7 · Gemini 3.1 Pro
SWE-Bench Pro · 58.6% · 64.3%
Terminal-Bench 2.0 · 82.7% · 69.4% · 68.5%
GPQA Diamond · 93.6% · 94.2% · 94.3%
FrontierMath Tier 4 · 39.6% · 22.9%

Pricing: Standard $5/$30, Pro $30/$180 per 1M tokens. The frontier is now a surface, not a point — no single model wins every benchmark. Local models compete on cost at the expense of benchmark position.

DeepSeek V4 (April 23-24) — MIT license, largest open-weight model, radical efficiency gains:

| Model | Total params | Active params | Context | FLOPs vs V3.2 | KV cache vs V3.2 |
|---|---|---|---|---|---|
| V4-Pro | 1.6T | 49B | 1M | 27% | 10% |
| V4-Flash | 284B | 13B | 1M | | |

V4-Pro is the #1 open-weight model on Vibe Code Bench. Pricing is $0.14/$0.28 for Flash and $1.74/$3.48 for Pro per 1M tokens. Compared pairwise against GPT-5.5 Standard ($5/$30), Flash is roughly 36x to 107x cheaper and Pro roughly 3x to 9x cheaper. Neither is viable for local inference (too large), but the CSA/HCA attention compression architecture will propagate to smaller models. When it reaches 7-27B scale, local agents get dramatically longer context without hardware upgrades.
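
The price gap can be checked with a few lines of arithmetic. The sketch below assumes each price pair quoted above is (input, output) in USD per 1M tokens; the dictionary keys and the `price_ratio` helper are illustrative names, not part of any vendor API.

```python
# Prices in USD per 1M tokens, copied from the text above.
# Assumption: each pair is (input, output).
PRICES = {
    "GPT-5.5 Standard": (5.00, 30.00),
    "GPT-5.5 Pro": (30.00, 180.00),
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro": (1.74, 3.48),
}


def price_ratio(expensive: str, cheap: str) -> tuple:
    """Return (input_ratio, output_ratio): how many times cheaper `cheap` is."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


for name in ("DeepSeek V4-Flash", "DeepSeek V4-Pro"):
    r_in, r_out = price_ratio("GPT-5.5 Standard", name)
    print(f"{name}: {r_in:.1f}x cheaper on input, {r_out:.1f}x on output")
```

Under these assumptions the 36-107x figure corresponds to V4-Flash against GPT-5.5 Standard; V4-Pro lands at roughly 3x on input and 9x on output.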

What just shipped

Mellum 2 (June 1, 2026) — JetBrains’ worker-tier coding MoE

| Model | Total params | Active params | Architecture | License | Modality |
|---|---|---|---|---|---|
| Mellum2-12B-A2.5B-Thinking | 12B | 2.5B | MoE (sparse) | Apache 2.0 | text + code |

JetBrains’ model card positions it not as a flagship coder but for routing/orchestration in multi-model systems and sub-agent tasks (planning, validation, transformation), RAG context compression, and private-code deployment — i.e. the worker slot in an agent fleet. ~2× faster inference than comparable dense models from the low active-param count. Context length not stated in the launch post (arXiv 2605.31268 + model card to confirm).

Hardware fit (the low active-param count is the story — fully GPU-resident on all three):

  • M3 Max 36GB: Q4 ~7GB — comfortable; can host several parallel instances, high tok/s
  • M2 Max 32GB: Q4 ~7GB — ideal dispatch worker for batch sub-agent jobs
  • RTX 3060 12GB: Q4/Q5 ~7–8GB — fully in VRAM, no CPU offload; Q8 (~13GB) spills, stay at Q4/Q5

Significance / recommendation change: for a local sub-agent / code-completion worker role, Mellum 2 (Q4/Q5) is now the model to reach for on the 3060 box — the rare 12B that stays GPU-resident on a 12GB card because only 2.5B params activate, fast enough to run fanned-out rather than as a single assistant. On the Macs it’s a strong dispatch-worker default. Lands in the slot the orchestration layer is creating: as fleet-of-cheap-workers-under-one-planner becomes the dominant pattern (Opus 4.8 Dynamic Workflows; 60%+ of Codex users run parallel tasks), an open, fast, code-specialized local worker is the piece an open-source agent stack was missing.

Huihui4-8B-A4B-v2 (April 27, 2026) — Expert-pruned Gemma 4 variant

| Model | Base | Total params | Active params | Architecture | Training data |
|---|---|---|---|---|---|
| Huihui4-8B-A4B-v2 | Gemma 4 26B-A4B-it | 9B | ~4B | MoE (32 experts, 8 active) | GLM-5.1-Multilingual-STEM |

Expert pruning (128 → 32 experts) + SFT. Uses GLM-5.1 thinking mode format. INT4/INT8: 6-9GB VRAM. Cross-architecture lineage: Google model base, Chinese training data + reasoning format.

Hardware fit:

  • M3 Max 36GB: INT4 ~6GB — comfortable, room for multi-model fleet
  • M2 Max 32GB: INT4 ~6GB — comfortable
  • RTX 3060: INT4 ~6GB — fits entirely in VRAM

Significance: huihui-ai’s first technique beyond abliteration. Expert pruning restructures the model rather than removing safety guardrails. The v2 suffix indicates iterative refinement. At 6-9GB, this is the smallest Gemma 4-based coding-capable model — evaluate against Gemma 4 E4B (4.5B active, 9.6GB Ollama) to see if pruning preserves coding quality.


Qwen3.6-27B Dense (April 22, 2026) — Apache 2.0, most important local model release since Gemma 4

| Model | Total params | Active params | Architecture | Context | Modalities |
|---|---|---|---|---|---|
| Qwen3.6-27B | 27B | 27B (dense) | Hybrid Gated DeltaNet + self-attention | TBD | Text |

Dense model (not MoE). “Thinking Preservation” mechanism. Outperforms the 397B MoE Qwen3.6 on agentic coding benchmarks while being roughly 15x smaller (27B vs 397B). Unsloth MLX quants (4/6/8-bit) were available the same day.

Hardware fit:

  • M3 Max 36GB: Q4_K_M (~15 GB) — fits with ~7GB headroom. Primary coding model candidate.
  • M2 Max 32GB: Q4_K_M (~15 GB) — fits with ~7GB headroom. Dispatch upgrade.
  • RTX 3060: Does not fit in 12GB VRAM. CPU offload with 64GB RAM possible but slow.

Priority evaluation. If the agentic coding benchmarks hold on practical tasks, this replaces the MoE models as the recommended local coding model for Apple Silicon. Dense architecture = more predictable inference, no routing overhead.


Kimi K2.6 (April 20, 2026) — Modified MIT, agent swarm architecture

| Model | Total params | Active params | Architecture | Context | Key benchmark |
|---|---|---|---|---|---|
| Kimi K2.6 | 1T | 32B (384 experts) | MoE, native multimodal | TBD | SWE-Bench Pro 58.6 |

First model designed for massive multi-agent orchestration: 300 sub-agents, 4,000 coordinated steps. SWE-Bench Pro 58.6 beats GPT-5.4 (57.7) and Opus 4.6 (57.3). Too large for local at full scale. Watch for distilled variants targeting the 32B active param slice.


Qwen3.6-35B-A3B (April 15-16, 2026) — Apache 2.0, MoE variant

| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | 35B | 3B | ~20 GB | 262K (1M+ YaRN) | Text, image, video |

Terminal-Bench 2.0: 51.5. Best open-weight agentic coding MoE at this parameter count. huihui-ai shipped abliterated + Claude-named variants. Unsloth Dynamic 2.0 + bartowski imatrix available.

Hardware fit:

  • M3 Max 36GB: Q4_K_M (~18-19 GB) — fits but tight
  • M2 Max 32GB: Q3_K_M (~15 GB) safer
  • RTX 3060: does not fit. CPU+GPU split viable for batch.

Gemma 4 (April 2, 2026) — Apache 2.0 license (major change from Gemma 3’s custom license)

| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | 7.2 GB | 128K | Text, image, audio |
| Gemma 4 E4B | 8B | 4.5B | 9.6 GB | 128K | Text, image, audio |
| Gemma 4 26B (MoE) | 25.2B | 3.8B | 18 GB | 256K | Text, image |
| Gemma 4 31B (Dense) | 30.7B | 30.7B | 20 GB | 256K | Text, image |

Gemma 4 E2B beats Gemma 3 27B on most benchmarks with only 2.3B active params. Most efficient model per byte I’ve tracked. ~75-85 tok/s on M3 Max.

Abliterated variants expanding — see abliteration section below.

Nemotron 3 Nano — Mamba-Transformer hybrid, benchmarks now available

| Model | Total params | Active params | Size (Q4) | Architecture | Key benchmarks |
|---|---|---|---|---|---|
| Nemotron 3 Nano 4B | 3.6B | 3.6B | ~2.5 GB | Mamba-Transformer hybrid | TBD for this size |
| Nemotron 3 Nano 30B-A3B | 31.6B | 3.2B (MoE) | ~18 GB | Mamba-Transformer hybrid | AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7% |

Independent benchmarks (via NeMo Evaluator):

  • AIME 2025: 89.1% (beats Qwen3-30B-A3B at 85.0%; 99.2% with Python tools)
  • LiveCodeBench v6: 68.3% (beats Qwen3 66.0% and gpt-oss 61.0%)
  • Arena-Hard-v2: 67.7% (vs Qwen3-30B 57.8%, gpt-oss-20b 48.5%)
  • RULER: 87.5% at 64K, 82.9% at 128K, 70.6% at 512K (supports 1M context)
  • 3.3x throughput vs Qwen3-30B-A3B on single H200

Verdict: At 3.2B active params, this runs on RTX 3060 12GB comfortably. Strong coding/reasoning at tiny active parameter count. Priority recommendation for the 3060 profile. GGUF quants available from Unsloth.

Hardware x Model fit matrix

M3 Max 36GB — Tiny fleet for background tasks

| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Gemma 4 E2B | MLX 8-bit (Unsloth) | ~4 GB | 75-85+ | Best tiny general-purpose; multimodal+audio. MLX-native = optimal on Apple Silicon |
| Nemotron 3 Nano 4B | Q8_0 | ~3.5 GB | TBD | Mamba hybrid for agentic tasks; evaluate |
| Qwen3.5-0.8B | Q8_0 | ~1 GB | 120-150 | Ultra-fast drafting/classification |
| Qwen3.5-2B | Q8_0 | ~2.7 GB | 80-100 | Fast chat/code assist |
| SmolLM3-3B | Q8_0 | ~3.5 GB | 60-80 | Best-in-class 3B; 128K context |
| Qwen3.5-4B | Q6_K | ~3.4 GB | 50-65 | Strong coding at 4B |

Multi-model strategy: Set OLLAMA_MAX_LOADED_MODELS=4. Example fleet: Qwen3.5-0.8B (1GB) + Gemma 4 E2B (4GB) + SmolLM3-3B (3.5GB) + Qwen3.5-2B (2.7GB) = ~11GB total, plenty of headroom.
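
The fleet total above is simple addition, and a small check makes it easy to re-run when the fleet changes. This is a sketch using the sizes quoted in the text; it counts weights only, so KV cache and runtime overhead are not included, and the `fleet_report` helper is a hypothetical name.

```python
# Approximate loaded sizes in GB, copied from the example fleet in the text.
FLEET_GB = {
    "Qwen3.5-0.8B": 1.0,
    "Gemma 4 E2B": 4.0,
    "SmolLM3-3B": 3.5,
    "Qwen3.5-2B": 2.7,
}
M3_MAX_BUDGET_GB = (21.0, 24.0)  # "Budget for models" range from the hardware table
MAX_LOADED_MODELS = 4            # mirrors OLLAMA_MAX_LOADED_MODELS=4


def fleet_report(fleet: dict, budget_low_gb: float, max_loaded: int) -> dict:
    """Sum the fleet and compare it with the conservative end of the budget."""
    total = sum(fleet.values())
    return {
        "total_gb": total,
        "headroom_gb": budget_low_gb - total,
        "within_model_limit": len(fleet) <= max_loaded,
    }


report = fleet_report(FLEET_GB, M3_MAX_BUDGET_GB[0], MAX_LOADED_MODELS)
print(report)
```

Measuring headroom against the low end of the budget keeps the estimate conservative: the four models total about 11.2 GB, leaving close to 10 GB even at the 21 GB bound.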

M2 Max 32GB — Dispatch workhorse

| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Qwen3.6-27B Dense | MLX 4-bit (Unsloth) | ~15 GB | TBD | NEW, PRIORITY. Outperforms 397B MoE on agentic coding. Dense = predictable inference. |
| Qwen 2.5 Coder 14B | Q4_K_M | ~9 GB | 25-35 | Primary coding workhorse (HumanEval ~89%) |
| Qwen3.6-35B-A3B | Q3_K_M | ~15 GB | TBD | Best agentic coding MoE. Q3 safer than Q4 on 32GB. |
| Nemotron 3 Nano 30B-A3B | Q4_K_M | ~18 GB | ~77 tok/s (MLX) | AIME 89.1%, LCBv6 68.3%; priority evaluation |
| DeepSeek-R1-Distill 14B | Q4_K_M | ~9 GB | 22-30 | Chain-of-thought reasoning + code |
| Qwen3.5-9B | Q5_K_M | ~6.5 GB | 28-38 | General + coding, 256K context |
| Phi-4 (14B) | Q4_K_M | ~9 GB | 30-38 | STEM reasoning |
| Qwen3.5-27B | Q4_K_M | ~17 GB | 12-18 | Peak quality (LiveCodeBench 80.7); slow but usable for batch |

Avoid: Gemma 4 26B MoE — community reports 11 tok/s vs 60+ for similarly-sized dense models. MoE has higher bandwidth demands per active param.

WSL + 3060 12GB — Heavy compute

| Model | Quant | VRAM fit | tok/s | Notes |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | Q4 | ~5 GB VRAM | 40-60 | Best MoE for this card; only 3.2B active |
| Qwen 2.5 Coder 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Interactive workhorse |
| DeepSeek-R1-Distill 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Reasoning + code |
| Qwen3.5-27B | Q4_K_M | Partial (16GB) | 4-8 | ~75% GPU offload |
| Qwen 2.5 Coder 32B | Q4_K_M | Partial (20GB) | 3-5 | HumanEval 92.7%; overnight batch jobs |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | Partial (24GB) | 5-10 | Only 3B active, benefits from partial offload |

Coding-specific models

| Model | Params | Benchmark | Best for |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | HumanEval 88.4% | Autocomplete/FIM |
| Qwen 2.5 Coder 14B | 14B | HumanEval ~89% | Best balance capability/speed |
| Qwen 2.5 Coder 32B | 32B | HumanEval 92.7% | Highest code quality |
| Qwen3-Coder-Next (80B MoE) | 80B/3B active | SWE-bench 64.6% | Beats Claude Opus 4.6 on SWE-bench |
| Qwen3.5-9B | 9B | LCBv6 65.6 | Chat-based coding with vision |
| Qwen3.5-27B | 27B | LCBv6 80.7 | Multi-file reasoning |

Abliterated variant sources

| Producer | Method | Models | Where |
|---|---|---|---|
| huihui-ai | Abliteration + Expert pruning | Qwen3.6, Qwen3.5 (all sizes), Qwen3, Gemma 3, GLM-5.1, gpt-oss-20b + Huihui4-8B-A4B-v2 (pruned Gemma 4, 9B/4B active, INT4 6-9GB) | Ollama + HuggingFace |
| mlabonne | Abliteration | Gemma 3 (1B-27B) + GGUF | HuggingFace |
| bartowski | GGUF quants | QwQ-32B, Llama 3.1 8B, many others | HuggingFace |
| DavidAU | HERETIC | Gemma 4 31B, gpt-oss-20b (multiple variants) | HuggingFace |
| HauhauCS | Abliteration | Gemma 4 E2B, E4B, Qwen3.6-35B-A3B (“aggressive”) | HuggingFace |
| trohrbaugh | Heretic ARA | Gemma 4 31B (KL 0.012, refusals 98→5/100) | HuggingFace |
| p-e-w (Heretic tool) | Automated HERETIC | 1000+ models including Gemma 4 | GitHub + HuggingFace |
| TrevorJS | Biprojection + EGA | Gemma 4 (E2B, E4B, 26B MoE, 31B) | GitHub |
| amarck | Abliteration | Gemma 4 31B (GGUF quants, Q4_K_M ~19GB) | HuggingFace |
| pmarreck | HERETIC | Gemma 4 31B (one-command Ollama/MLX setup) | GitHub |
| aoxo | Fine-tune | gpt-oss-20b | HuggingFace |

Quick Ollama access:

ollama pull huihui_ai/qwen3.5-abliterated       # Qwen 3.5 uncensored
ollama pull huihui_ai/gemma3-abliterated         # Gemma 3 uncensored
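
Once a model is pulled, it can be queried from a script through Ollama's local REST API. The sketch below assumes a default install listening on `localhost:11434` and uses the non-streaming form of `/api/generate`; the helper names `build_generate_request` and `generate` are illustrative, and the timeout is an arbitrary choice.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

For example, `generate("huihui_ai/qwen3.5-abliterated", "Summarize this diff in one line.")` would return the completion as a string, provided the server is running and the model has been pulled.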

gpt-oss-20b abliterated landscape (complete)

| Variant | Producer | Method | Format |
|---|---|---|---|
| Huihui-gpt-oss-20b-BF16-abliterated | huihui-ai | Abliteration | BF16/Ollama (v1+v2) |
| GPT-oss-20b-abliterated-uncensored-NEO | DavidAU | Abliteration+NEO | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-HERETIC-uncensored-NEO | DavidAU | HERETIC | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4 | DavidAU | HERETIC | Native MXFP4 |
| gpt-oss-20b-uncensored | aoxo | Fine-tune | BF16 |

The quantized variants fit on all three machines: MXFP4 is ~14GB and IQ4_NL is ~11.5GB. The HERETIC variant claims complete refusal removal.

Independent benchmarks (via BenchLM, DataRobot, Artificial Analysis):

  • Arena-Hard-v2: 48.5% (behind Nemotron 3 Nano at 67.7%)
  • LiveCodeBench v6: 61.0% (behind Nemotron 3 Nano at 68.3%)
  • Matches or exceeds o3-mini on most benchmarks
  • Outperforms gpt-oss-120B on HumanEval and MMLU despite being much smaller
  • “Low thinking effort” mode outperforms more expensive competitors
  • Fits 16GB devices — runs on RTX 3060 and M3 Max easily

Verdict: Solid general-purpose model but Nemotron 3 Nano beats it on coding benchmarks at similar active params. Best use: general reasoning/chat where abliterated variant is preferred.

Quantization reference

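| Quant | Bits | Quality | 7B size | 14B size | 27B size |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 | Good | 4.5 GB | 9 GB | 16 GB |
| Q5_K_M | ~5.5 | Better (<2% perplexity loss) | 5.2 GB | 10 GB | 19 GB |
| Q6_K | ~6.5 | High | 6.0 GB | 12 GB | 22 GB |
| Q8_0 | ~8.0 | Near-lossless | 7.5 GB | 15 GB | 27 GB |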
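
For model sizes not listed in the table, the Bits column gives a quick weights-only estimate: parameters times bits per weight, divided by eight. The sketch below is a rough lower bound under that assumption; the table's figures sit at or slightly above it, presumably because real files carry mixed-precision tensors and metadata. The function name is illustrative.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


# Bits-per-weight values copied from the table above.
QUANT_BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q6_K": 6.5, "Q8_0": 8.0}

for quant, bits in QUANT_BITS.items():
    sizes = [estimate_weights_gb(p, bits) for p in (7, 14, 27)]
    print(quant, [f"{s:.1f} GB" for s in sizes])
```

A 12B model at ~4.5 bits comes out near 6.8 GB by this estimate, in line with the ~7GB Q4 figure quoted for Mellum 2 earlier in the document.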

Rule of thumb for Apple Silicon: model should be <=60-70% of total unified memory.
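
The rule of thumb translates directly into a fit check. This is a minimal sketch: the 60% and 70% fractions come from the sentence above, while the three-way "yes / tight / no" labelling and the function names are assumptions made for illustration.

```python
def model_budget_gb(total_memory_gb: float, low: float = 0.60, high: float = 0.70) -> tuple:
    """Return the (conservative, upper) model budget for a unified-memory machine."""
    return total_memory_gb * low, total_memory_gb * high


def fits(model_gb: float, total_memory_gb: float) -> str:
    """'yes' below the conservative bound, 'tight' between the bounds, 'no' above."""
    low, high = model_budget_gb(total_memory_gb)
    if model_gb <= low:
        return "yes"
    if model_gb <= high:
        return "tight"
    return "no"


for machine_gb in (36, 32):
    low, high = model_budget_gb(machine_gb)
    print(f"{machine_gb}GB unified: budget {low:.1f}-{high:.1f} GB")
```

The computed ranges (about 21.6-25.2 GB for 36GB and 19.2-22.4 GB for 32GB) are close to the budgets in the reference hardware table, which are slightly more conservative at the top end for the 36GB machine.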

Key insight: TurboQuant — 6x KV cache compression (NEW — April 12)

Google Research’s TurboQuant (March 25, ICLR 2026) compresses the KV cache to 3 bits with no reported accuracy loss and no retraining, for a 6x reduction in KV memory.

Impact on the reference hardware:

  • M3 Max 36GB: Gemma 4 31B at full 262K context becomes possible. KV cache drops from ~22GB to ~3.7GB, so 31B Q4 (~20GB) + 3.7GB KV = 23.7GB total, which fits at the top of the ~21-24 GB budget.
  • M2 Max 32GB: Nemotron 30B-A3B and Qwen3.5-27B can serve dramatically longer contexts within existing memory.
  • RTX 3060 12GB: Context length multiplied within same VRAM budget. 14B models can run at very long context.
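
The M3 Max figures above follow from dividing the KV cache by the compression ratio and adding the weights back. The sketch below reproduces that arithmetic with the numbers from the text; the flat 6x ratio is taken at face value, and the function name is illustrative.

```python
def total_memory_gb(weights_gb: float, kv_gb: float, kv_compression: float = 1.0) -> float:
    """Weights plus KV cache, with the KV cache divided by a compression ratio."""
    return weights_gb + kv_gb / kv_compression


WEIGHTS_GB = 20.0   # Gemma 4 31B at Q4, from the text
KV_FULL_GB = 22.0   # KV cache at full 262K context, from the text

uncompressed = total_memory_gb(WEIGHTS_GB, KV_FULL_GB)
compressed = total_memory_gb(WEIGHTS_GB, KV_FULL_GB, kv_compression=6.0)
print(f"without TurboQuant: {uncompressed:.1f} GB")
print(f"with 6x KV compression: {compressed:.1f} GB")
```

Without compression the total is 42 GB, more than the machine's entire 36GB of unified memory; with 6x compression it drops to about 23.7 GB.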

Implementation status:

  • Google official: Q2 2026
  • llama.cpp: turboquant_plus project, experimental, Metal support on Apple Silicon
  • Validated from 1.5B to 104B parameter models

The synthesis: TurboQuant + Ollama 0.19 MLX backend = two multiplicative improvements. MLX accelerates compute, TurboQuant expands context. Together they make Apple Silicon the most improved local inference platform.

Key insight: Unsloth MLX-native Gemma 4 lineup (NEW — April 14)

Unsloth uploaded MLX-native quantizations for the full Gemma 4 family:

| Model | MLX 3-bit | MLX 4-bit | MLX 8-bit |
|---|---|---|---|
| Gemma 4 E2B | | | |
| Gemma 4 E4B | | | |
| Gemma 4 26B MoE | | | |
| Gemma 4 31B Dense | | | |

Why this matters: MLX-native quants skip GGUF→MLX conversion overhead. Combined with Ollama 0.19’s MLX backend, these are the optimal format for Apple Silicon. The Gemma 4 26B MoE at MLX 4-bit (~17GB) fits M3 Max and M2 Max comfortably. The 31B Dense at MLX 3-bit may also fit within budget.

Updated recommendation: For general-purpose inference on Apple Silicon, prefer Unsloth MLX quants over GGUF when available.

huihui-ai abliteration wave (April 14-16)

  • Huihui4-48B-A4B-abliterated (April 16): experimental expanded architecture that takes Gemma 4 26B-A4B and replaces the MLP layers with a 256-expert MoE, expanding it to 48B total. Not fine-tuned yet.
  • Huihui3.5-67B-A3B (April 16) — Qwen3.5-35B-A3B base expanded to 512 experts, 67B/3B active MoE
  • Gemma 4 E2B, 31B, 26B MoE abliterated v2 — refreshed versions
  • Full Qwen3.5, GLM-4.7, Kimi, Mistral-Small-4 lineups now abliterated

DavidAU LFM2 HERETIC series (NEW — April 14-16)

Liquid AI’s LFM2 foundation models (SSM-Transformer hybrids) getting the HERETIC treatment:

  • LFM2-12B-A1B High-Intelligence Series-B (April 16)
  • LFM2-12B-A1B Deckard-II HERETIC Uncensored (April 16)
  • LFM2-8B-A1B Deckard-II HERETIC Uncensored (April 15)
  • LFM2-8B-A1B GLM-4.7-Flash Thinking (April 15)
  • gemma-4-19B-A4B-it INSTRUCT Heretic-Uncensored (April 15)
  • gemma-4-E4B-it Deckard-V2 Strong HERETIC (April 14)

At 1B active params, the LFM2-8B models are extremely efficient — run on all three machines. SSM-Transformer hybrid architecture is distinct from standard transformer; worth evaluating for latency characteristics.

Key insight: Ollama 0.19 MLX backend

Released March 2026. On Apple Silicon: 57% faster prefill, 93% faster decode vs v0.18 (llama.cpp). The M3 Max has higher memory bandwidth than M4 Pro, so it outperforms newer chips for memory-bound inference. Make sure Ollama is updated.
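
The bandwidth point can be made concrete with the usual back-of-the-envelope bound for memory-bound decoding: each generated token has to read the active weights once, so tokens per second cannot exceed bandwidth divided by the bytes read. The sketch below is an upper bound under that assumption only; it ignores KV cache reads, compute limits, and runtime overhead, and the function name is illustrative.

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed when every token reads all active weights once."""
    return bandwidth_gb_s / active_weights_gb


BANDWIDTH_GB_S = 400.0  # figure quoted for the Macs in the reference hardware table

# Sizes taken from the M2 Max table: 14B at Q4_K_M and Qwen3.5-27B at Q4_K_M.
for label, size_gb in (("14B Q4_K_M (~9 GB)", 9.0), ("27B Q4_K_M (~17 GB)", 17.0)):
    bound = decode_tok_s_upper_bound(BANDWIDTH_GB_S, size_gb)
    print(f"{label}: at most {bound:.0f} tok/s")
```

The bounds (about 44 and 24 tok/s) sit above the 25-35 and 12-18 tok/s ranges listed for those models, which is what one would expect if decoding is mostly limited by memory bandwidth rather than compute.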

Models NOT practical on the reference hardware

| Model | Why |
|---|---|
| GLM-5.1 (744B MoE, 40B active) | MIT license, #1 SWE-Bench Pro (58.4). Open-weight on HuggingFace (zai-org/GLM-5.1). Smallest GGUF ~206GB. huihui-ai abliterated GGUF exists. MLX community version exists. Watch for distills. |
| Kimi K2.5 (1T params) | Even smallest quant (1.8-bit) is ~240GB |
| Llama 4 Scout (109B) | Q4 is ~60GB+ |
| Llama 4 Maverick (400B) | Data center only |
| gpt-oss-120b (117B MoE) | Needs 66GB+ unified for usable speed |
| Nemotron 3 Super 120B-A12B | Too large at full quality |

Other models to assess

  • DavidAU LFM2-8B-A1B variants: SSM-Transformer hybrid from Liquid AI, HERETIC uncensored. 1B active params. Extremely efficient — fits all three machines. New architecture worth evaluating for latency and agentic task performance.
  • DavidAU gemma-4-19B-A4B HERETIC: Compact uncensored Gemma 4 variant. Fits M3 Max and M2 Max at Q4.
  • Nemotron 3 Nano 4B: Mamba-Transformer hybrid, claims 5x throughput. Tiny enough for fleet member on M3 Max. Priority evaluation.
  • Nemotron 3 Nano 30B-A3B: MoE with only 3B active, Mamba hybrid. Local benchmarks now available: ~77 tok/s M2 Max (MLX), 40-60 tok/s RTX 3060 (Q4, ~5GB VRAM). Fits 3060 comfortably — top priority.
  • MiniMax M2.7: “Self-evolving” training. 56.22% SWE-Pro — approaching Claude Opus 4.6. Too large for local but watch for quants/distills.
  • Cogito v1 (3B/8B/14B/32B/70B): Dense, hybrid reasoning toggle. Llama/Qwen-base variants. On Ollama.
  • Phi-4-mini-reasoning (3.8B): 128K context, reasoning-capable. Worth testing as alternative to SmolLM3.
  • Gemma 4 31B HERETIC+Thinking (DavidAU): Chain-of-thought reasoning + uncensored 31B.
  • Qwen3-Coder abliterated (huihui-ai): Abliterated variant for the coding model line.
  • gpt-oss-20b HERETIC (DavidAU): Claims complete refusal removal. Priority evaluation.
  • Gemma 4 31B abliterated GGUF (amarck): Q4_K_M ~19GB, fits M3 Max at short context.

Known issues

  • Qwen 3.5 GGUF + Ollama incompatibility: GGUF versions do not work in Ollama due to separate mmproj vision files. Use llama.cpp directly for now.
  • Gemma 4 GGUF chat template bug: Community GGUF uploads ship with incorrect chat templates (wrong delimiters), causing “---” output loops. pmarreck/gemma4-heretical fixes this via Ollama RENDERER/PARSER support.
  • Gemma 4 31B flash-attention bug in Ollama: Hangs on prompts over ~500 tokens. Workaround: OLLAMA_FLASH_ATTENTION=0 but tanks speed to ~15 tok/s on Apple Silicon. The 26B MoE is the better pick at ~20-30 tok/s.
  • Gemma 4 31B context limits on 36GB Macs: 31B Q4 needs ~20GB weights + ~22GB KV at full 262K context. Only works at short context (<16K) on M3 Max 36GB.
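
The context ceiling in the last item can be estimated from the same numbers. The sketch below assumes the KV cache grows linearly with context length, which is a simplification: models with sliding-window or hybrid attention layers may deviate. The 262,144-token full context and the function name are assumptions for illustration.

```python
def max_context_tokens(budget_gb: float, weights_gb: float,
                       kv_full_gb: float, full_context: int) -> int:
    """Largest context that fits, assuming KV memory scales linearly with tokens."""
    free_gb = budget_gb - weights_gb
    if free_gb <= 0:
        return 0
    return min(full_context, int(free_gb / kv_full_gb * full_context))


FULL_CONTEXT = 262_144  # assumption: "262K" read as 256 * 1024 tokens

# Weights (~20GB) and full-context KV (~22GB) from the text; budgets from the hardware table.
for budget in (21.0, 24.0):
    tokens = max_context_tokens(budget, 20.0, 22.0, FULL_CONTEXT)
    print(f"budget {budget:.0f} GB: about {tokens:,} tokens")
```

Under these assumptions the ceiling is roughly 12K tokens at the conservative 21 GB budget and grows with every extra GB of headroom, so the <16K guidance sits near the cautious end of the range.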

Open threads

  • Meta Muse Spark — open-weight contraction: Meta went proprietary. Llama future unclear. The open-weight producers (Google Gemma, Alibaba Qwen, Zhipu GLM, community) become more important. Google’s Apache 2.0 shift for Gemma 4 looks prescient.
  • Heretic ARA quality: trohrbaugh’s gemma-4-31b-it-heretic-ara achieves KL divergence 0.012 (virtually no quality loss) while reducing refusals 98→5/100. Current best-quality abliteration for Gemma 4 31B. Needs evaluation.
  • Nemotron 3 Nano evaluation: Benchmarks available. AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%. At 3.2B active params, top priority for the 3060 profile. Beats Qwen3-30B-A3B and gpt-oss-20b on coding.
  • gpt-oss-20b evaluation: Benchmarks available. Arena-Hard 48.5%, LCBv6 61.0%. Solid but Nemotron 3 Nano beats it. Best for general reasoning with abliterated variant.
  • Ollama v0.20.5 (April 9): New release. Gemma 4 all sizes available. Check for stability/perf fixes.
  • TrevorJS abliteration technique: Biprojection + EGA, cross-validated against 686 prompts. New method worth tracking.
  • Qwen 3.6-Plus: API-only (Alibaba Bailian, OpenRouter). 1M context, agentic coding. Watch for local release.
  • huihui-ai Huihui4-8B-A4B: New model family uploaded April 25. 8B total/4B active (MoE), image-text-to-text. GGUF variant also available. If original work (not abliteration), marks huihui-ai’s transition to model producer. Fits all three machines easily. Evaluate.
  • huihui-ai Qwen3.6-27B abliterated: Uploaded April 23, 539 downloads. Abliterated dense Qwen3.6-27B — the model that outperforms 397B MoE. Combined with Unsloth MLX quants for Apple Silicon deployment.
  • DeepSeek V4: Shipped April 23-24. V4-Pro 1.6T/49B active, V4-Flash 284B/13B active. MIT license. CSA/HCA attention compression (27% FLOPs, 10% KV cache vs V3.2). Too large for local. The architecture matters more than the weights — watch for compression techniques propagating to smaller models.
  • vllm-mlx: Server framework claiming 400+ tok/s on tiny models, continuous batching, Claude Code compatible.
  • OpenClaw community adopting Kimi K2.5: Signal that model preference is shifting away from Claude for agentic work in open-source community.
  • GLM-5.1: 744B MoE (40B active), #1 SWE-Bench Pro (58.4), MIT license. Open-weight since April 7 (corrected from “cloud-only”). huihui-ai abliterated GGUF available. MLX community version exists. Too large for local at full scale (~206GB) but distills/aggressive quants may change this. Watch Z.ai for smaller variants.
  • Copilot CLI BYOK: Now supports Ollama, vLLM, any OpenAI-compatible endpoint. Local models become usable inside a major agent’s workflow for the first time.
