Local model landscape
Living document. Rewritten as new models ship. Last updated: 2026-06-10.
Reference hardware
Three representative consumer profiles, used as a baseline for the fit recommendations below. They span the common local-inference envelope: high-bandwidth Apple Silicon for fast small-model fleets, and a consumer NVIDIA card for GPU-offload of larger models.
| Profile | Tier | Key spec | Memory bandwidth | Budget for models |
|---|---|---|---|---|
| M3 Max 36GB | Tiny-fleet — tiny models, high tok/s, multiple in parallel | 36GB unified | ~300 GB/s | ~21-24 GB |
| M2 Max 32GB | Dispatch — big jobs, 7B-14B | 32GB unified | ~400 GB/s | ~19-22 GB |
| WSL + 3060 12GB | Heavy compute — biggest models, GPU offload | 12GB VRAM + 64GB RAM | PCIe bottleneck on offload | 12GB GPU / 64GB total |
Preferences
- Abliterated/uncensored variants preferred — no alignment tax
- Key producers: huihui-ai (Ollama + HF), mlabonne (HF), bartowski (GGUF quants), DavidAU (HERETIC method)
- Inference: Ollama 0.19+ (MLX backend — 57% faster prefill, 93% faster decode vs 0.18)
Cloud model context (benchmarks for reference)
Claude Fable 5 / Mythos 5 (June 9) — new Anthropic frontier ceiling, replacing Opus 4.8 as the GA top. One set of weights, two names: Fable 5 (general, safeguarded) and Mythos 5 (ungated, Glasswing/bio-research partners only). SOTA “on nearly all tested benchmarks”; higher than Opus 4.8 on FrontierCode even at medium effort; new vision SOTA; “millions of tokens” long-context focus. Pricing $10/$50 (2× regular Opus 4.8). Safety is a routing layer, not a weights property: cyber/bio-chem/distillation queries classifier-fall-back to Opus 4.8 — the n-1 frontier is now the next one’s safety floor. Local-inference consequence: the bar TurboQuant-class compression has to clear on consumer hardware just moved up again, and the frontier↔local capability gap widened — but the governance gap (per-query capability gating) is something open weights structurally cannot copy, which keeps the open-weight value proposition on cost/control rather than frontier parity. Local relevance otherwise low (hosted-only). Full analysis: reports/2026-06-10-the-fable-and-the-fallback.md.
North Mini Code (Cohere, June 9) — Cohere’s first developer model and a new open-weight coding contributor: 30B MoE / 3B active, Apache 2.0 (bf16+fp8), 128K context, trained across multiple agent scaffolds for harness-robustness. 80.2% pass@10 SWE-Bench Verified, 55.1% pass@10 Terminal-Bench v2; positioned above similarly-sized Qwen3.5/Gemma 4/Devstral Small 2 and larger Nemotron 3 Super/Mistral Small 4/Devstral 2. 3B-active = sub-agent/worker economics (cf. Mellum 2); 30B at fp8 (~30GB) exceeds Mac budgets, Q4 (~15–16GB) fits. Candidate local coding worker. See radar/signals/2026-06-09-introducing-north-mini-code-cohere-s-first-model-for-develop.md.
Gemma 4 12B (June 3) — new encoder-free, native-audio mid-tier in a tracked family; full detail + hardware fit in models/families/gemma/README.md.
GPT-5.5 “Spud” (April 23) — first fully retrained base since GPT-4.5. 1M context (API). Natively omnimodal.
| Benchmark | GPT-5.5 | GPT-5.5 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 58.6% | — | 64.3% | — |
| Terminal-Bench 2.0 | 82.7% | — | 69.4% | 68.5% |
| GPQA Diamond | 93.6% | — | 94.2% | 94.3% |
| FrontierMath Tier 4 | — | 39.6% | 22.9% | — |
Pricing: Standard $5/$30, Pro $30/$180 per 1M tokens. The frontier is now a surface, not a point — no single model wins every benchmark. Local models compete on cost at the expense of benchmark position.
DeepSeek V4 (April 23-24) — MIT license, largest open-weight model, radical efficiency gains:
| Model | Total params | Active params | Context | FLOPs vs V3.2 | KV cache vs V3.2 |
|---|---|---|---|---|---|
| V4-Pro | 1.6T | 49B | 1M | 27% | 10% |
| V4-Flash | 284B | 13B | 1M | — | — |
V4-Pro #1 open on Vibe Code Bench. Flash $0.14/$0.28, Pro $1.74/$3.48 per 1M tokens. Against GPT-5.5 Standard ($5/$30), Flash is roughly 36x cheaper on input and 107x cheaper on output; Pro is roughly 3x and 9x cheaper. Not viable for local inference (too large), but the CSA/HCA attention compression architecture is likely to propagate to smaller models. When it reaches 7-27B scale, local agents get dramatically longer context without hardware upgrades.
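The price gap can be checked directly from the per-1M-token figures quoted in this section. A minimal sketch, assuming the comparison is against GPT-5.5 Standard pricing; the `PRICES` table and `price_ratio` helper are illustrative names, not part of any vendor tooling.

```python
# USD per 1M tokens as (input, output), copied from the figures in this section.
PRICES = {
    "GPT-5.5 Standard": (5.00, 30.00),
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro": (1.74, 3.48),
}


def price_ratio(baseline: str, candidate: str) -> tuple[float, float]:
    """Return (input_ratio, output_ratio): how many times cheaper the candidate is."""
    base_in, base_out = PRICES[baseline]
    cand_in, cand_out = PRICES[candidate]
    return base_in / cand_in, base_out / cand_out


for name in ("DeepSeek V4-Flash", "DeepSeek V4-Pro"):
    in_x, out_x = price_ratio("GPT-5.5 Standard", name)
    print(f"{name}: {in_x:.1f}x cheaper on input, {out_x:.1f}x cheaper on output")
```

The two ends of the large multiplier both come from Flash (input vs output pricing); Pro is cheaper by a single-digit factor.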
What just shipped
Mellum 2 (June 1, 2026) — JetBrains’ worker-tier coding MoE
| Model | Total params | Active params | Architecture | License | Modality |
|---|---|---|---|---|---|
| Mellum2-12B-A2.5B-Thinking | 12B | 2.5B | MoE (sparse) | Apache 2.0 | text + code |
JetBrains’ model card positions it not as a flagship coder but for routing/orchestration in multi-model systems and sub-agent tasks (planning, validation, transformation), RAG context compression, and private-code deployment — i.e. the worker slot in an agent fleet. ~2× faster inference than comparable dense models from the low active-param count. Context length not stated in the launch post (arXiv 2605.31268 + model card to confirm).
Hardware fit (the low active-param count is the story — fully GPU-resident on all three):
- M3 Max 36GB: Q4 ~7GB — comfortable; can host several parallel instances, high tok/s
- M2 Max 32GB: Q4 ~7GB — ideal dispatch worker for batch sub-agent jobs
- RTX 3060 12GB: Q4/Q5 ~7–8GB — fully in VRAM, no CPU offload; Q8 (~13GB) spills, stay at Q4/Q5
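To put a number on "several parallel instances", the sketch below divides each profile's model budget by the ~7 GB Q4 footprint. It assumes the pessimistic case where every instance loads its own copy of the weights; the per-instance overhead parameter is a hypothetical allowance for KV cache and runtime buffers, not a measured value.

```python
import math

# Model budgets in GB, taken from the reference hardware table (low end for the Macs).
BUDGET_GB = {"M3 Max 36GB": 21.0, "M2 Max 32GB": 19.0, "RTX 3060 12GB": 12.0}
MELLUM2_Q4_GB = 7.0  # Q4 footprint quoted in the hardware-fit list


def max_instances(budget_gb: float, model_gb: float, overhead_gb: float = 0.0) -> int:
    """Whole number of independent copies that fit in the budget."""
    return math.floor(budget_gb / (model_gb + overhead_gb))


for profile, budget in BUDGET_GB.items():
    print(profile, max_instances(budget, MELLUM2_Q4_GB))
```

A server that batches concurrent requests against one loaded copy needs far less memory than this, so treat the result as a floor on fan-out rather than a ceiling.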
Significance / recommendation change: for a local sub-agent / code-completion worker role, Mellum 2 (Q4/Q5) is now the model to reach for on the 3060 box — the rare 12B that stays GPU-resident on a 12GB card because only 2.5B params activate, fast enough to run fanned-out rather than as a single assistant. On the Macs it’s a strong dispatch-worker default. Lands in the slot the orchestration layer is creating: as fleet-of-cheap-workers-under-one-planner becomes the dominant pattern (Opus 4.8 Dynamic Workflows; 60%+ of Codex users run parallel tasks), an open, fast, code-specialized local worker is the piece an open-source agent stack was missing.
Huihui4-8B-A4B-v2 (April 27, 2026) — Expert-pruned Gemma 4 variant
| Model | Base | Total params | Active params | Architecture | Training data |
|---|---|---|---|---|---|
| Huihui4-8B-A4B-v2 | Gemma 4 26B-A4B-it | 9B | ~4B | MoE (32 experts, 8 active) | GLM-5.1-Multilingual-STEM |
Expert pruning (128 → 32 experts) + SFT. Uses GLM-5.1 thinking mode format. INT4/INT8: 6-9GB VRAM. Cross-architecture lineage: Google model base, Chinese training data + reasoning format.
Hardware fit:
- M3 Max 36GB: INT4 ~6GB — comfortable, room for multi-model fleet
- M2 Max 32GB: INT4 ~6GB — comfortable
- RTX 3060: INT4 ~6GB — fits entirely in VRAM
Significance: huihui-ai’s first technique beyond abliteration. Expert pruning restructures the model rather than removing safety guardrails. The v2 suffix indicates iterative refinement. At 6-9GB, this is the smallest Gemma 4-based coding-capable model — evaluate against Gemma 4 E4B (4.5B active, 9.6GB Ollama) to see if pruning preserves coding quality.
Qwen3.6-27B Dense (April 22, 2026) — Apache 2.0, most important local model release since Gemma 4
| Model | Total params | Active params | Architecture | Context | Modalities |
|---|---|---|---|---|---|
| Qwen3.6-27B | 27B | 27B (dense) | Hybrid Gated DeltaNet + self-attention | TBD | Text |
Dense model (not MoE). “Thinking Preservation” mechanism. Outperforms the 397B MoE Qwen3.6 on agentic coding benchmarks — 14x smaller, better at the specific task. Unsloth MLX quants (4/6/8-bit) available same day.
Hardware fit:
- M3 Max 36GB: Q4_K_M (~15 GB) — fits with ~7GB headroom. Primary coding model candidate.
- M2 Max 32GB: Q4_K_M (~15 GB) — fits with ~4-7GB headroom against the ~19-22 GB budget. Dispatch upgrade.
- RTX 3060: Does not fit in 12GB VRAM. CPU offload with 64GB RAM possible but slow.
Priority evaluation. If the agentic coding benchmarks hold on practical tasks, this replaces the MoE models as the recommended local coding model for Apple Silicon. Dense architecture = more predictable inference, no routing overhead.
Kimi K2.6 (April 20, 2026) — Modified MIT, agent swarm architecture
| Model | Total params | Active params | Architecture | Context | Key benchmark |
|---|---|---|---|---|---|
| Kimi K2.6 | 1T | 32B (384 experts) | MoE, native multimodal | TBD | SWE-Bench Pro 58.6 |
First model designed for massive multi-agent orchestration: 300 sub-agents, 4,000 coordinated steps. SWE-Bench Pro 58.6 beats GPT-5.4 (57.7) and Opus 4.6 (57.3). Too large for local at full scale. Watch for distilled variants targeting the 32B active param slice.
Qwen3.6-35B-A3B (April 15-16, 2026) — Apache 2.0, MoE variant
| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | 35B | 3B | ~20 GB | 262K (1M+ YaRN) | Text, image, video |
Terminal-Bench 2.0: 51.5. Best open-weight agentic coding MoE at this parameter count. huihui-ai shipped abliterated + Claude-named variants. Unsloth Dynamic 2.0 + bartowski imatrix available.
Hardware fit:
- M3 Max 36GB: Q4_K_M (~18-19 GB) — fits but tight
- M2 Max 32GB: Q3_K_M (~15 GB) safer
- RTX 3060: does not fit. CPU+GPU split viable for batch.
Gemma 4 (April 2, 2026) — Apache 2.0 license (major change from Gemma 3’s custom license)
| Model | Total params | Active params | Size (Ollama) | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | 7.2 GB | 128K | Text, image, audio |
| Gemma 4 E4B | 8B | 4.5B | 9.6 GB | 128K | Text, image, audio |
| Gemma 4 26B (MoE) | 25.2B | 3.8B | 18 GB | 256K | Text, image |
| Gemma 4 31B (Dense) | 30.7B | 30.7B | 20 GB | 256K | Text, image |
Gemma 4 E2B beats Gemma 3 27B on most benchmarks with only 2.3B active params. Most efficient model per byte I’ve tracked. ~75-85 tok/s on M3 Max.
Abliterated variants expanding — see abliteration section below.
Nemotron 3 Nano — Mamba-Transformer hybrid, benchmarks now available
| Model | Total params | Active params | Size (Q4) | Architecture | Key benchmarks |
|---|---|---|---|---|---|
| Nemotron 3 Nano 4B | 3.6B | 3.6B | ~2.5 GB | Mamba-Transformer hybrid | TBD for this size |
| Nemotron 3 Nano 30B-A3B | 31.6B | 3.2B (MoE) | ~18 GB | Mamba-Transformer hybrid | AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7% |
Independent benchmarks (via NeMo Evaluator):
- AIME 2025: 89.1% (beats Qwen3-30B-A3B at 85.0%; 99.2% with Python tools)
- LiveCodeBench v6: 68.3% (beats Qwen3 66.0% and gpt-oss 61.0%)
- Arena-Hard-v2: 67.7% (vs Qwen3-30B 57.8%, gpt-oss-20b 48.5%)
- RULER: 87.5% at 64K, 82.9% at 128K, 70.6% at 512K (supports 1M context)
- 3.3x throughput vs Qwen3-30B-A3B on single H200
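Verdict: With only 3.2B active params, the 30B-A3B runs comfortably on the RTX 3060 12GB even though its Q4 weights (~18 GB) exceed VRAM: about 5 GB sits on the GPU and the rest is served from system RAM. Strong coding/reasoning at a tiny active parameter count. Priority recommendation for the 3060 profile. GGUF quants available from Unsloth.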
Hardware x Model fit matrix
M3 Max 36GB — Tiny fleet for background tasks
| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Gemma 4 E2B | MLX 8-bit (Unsloth) | ~4 GB | 75-85+ | Best tiny general-purpose; multimodal+audio. MLX-native = optimal on Apple Silicon |
| Nemotron 3 Nano 4B | Q8_0 | ~3.5 GB | TBD | Mamba hybrid for agentic tasks — evaluate |
| Qwen3.5-0.8B | Q8_0 | ~1 GB | 120-150 | Ultra-fast drafting/classification |
| Qwen3.5-2B | Q8_0 | ~2.7 GB | 80-100 | Fast chat/code assist |
| SmolLM3-3B | Q8_0 | ~3.5 GB | 60-80 | Best-in-class 3B; 128K context |
| Qwen3.5-4B | Q6_K | ~3.4 GB | 50-65 | Strong coding at 4B |
Multi-model strategy: Set OLLAMA_MAX_LOADED_MODELS=4. Example fleet: Qwen3.5-0.8B (1GB) + Gemma 4 E2B (4GB) + SmolLM3-3B (3.5GB) + Qwen3.5-2B (2.7GB) = ~11GB total, plenty of headroom.
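The fleet arithmetic above is easy to re-run when swapping members in or out. A small sketch using the sizes from the M3 Max table and the low end of that profile's model budget; the dictionary and function names are illustrative.

```python
# Approximate loaded sizes in GB, from the M3 Max fit table.
FLEET_GB = {
    "Qwen3.5-0.8B": 1.0,
    "Gemma 4 E2B": 4.0,
    "SmolLM3-3B": 3.5,
    "Qwen3.5-2B": 2.7,
}
BUDGET_LOW_GB = 21.0  # conservative end of the ~21-24 GB M3 Max budget


def fleet_total_gb(fleet: dict[str, float]) -> float:
    """Sum of model sizes; ignores KV cache and runtime overhead."""
    return sum(fleet.values())


def headroom_gb(fleet: dict[str, float], budget_gb: float) -> float:
    """Budget left after loading every fleet member at once."""
    return budget_gb - fleet_total_gb(fleet)


print(f"{len(FLEET_GB)} models, {fleet_total_gb(FLEET_GB):.1f} GB loaded, "
      f"{headroom_gb(FLEET_GB, BUDGET_LOW_GB):.1f} GB headroom")
```

The model count should match the value given to OLLAMA_MAX_LOADED_MODELS, otherwise the least recently used member gets unloaded.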
M2 Max 32GB — Dispatch workhorse
| Model | Quant | Size | tok/s | Role |
|---|---|---|---|---|
| Qwen3.6-27B Dense | MLX 4-bit (Unsloth) | ~15 GB | TBD | NEW — PRIORITY. Outperforms 397B MoE on agentic coding. Dense = predictable inference. |
| Qwen 2.5 Coder 14B | Q4_K_M | ~9 GB | 25-35 | Primary coding workhorse (HumanEval ~89%) |
| Qwen3.6-35B-A3B | Q3_K_M | ~15 GB | TBD | Best agentic coding MoE. Q3 safer than Q4 on 32GB. |
| Nemotron 3 Nano 30B-A3B | Q4_K_M | ~18 GB | ~77 tok/s (MLX) | AIME 89.1%, LCBv6 68.3% — priority evaluation |
| DeepSeek-R1-Distill 14B | Q4_K_M | ~9 GB | 22-30 | Chain-of-thought reasoning + code |
| Qwen3.5-9B | Q5_K_M | ~6.5 GB | 28-38 | General + coding, 256K context |
| Phi-4 (14B) | Q4_K_M | ~9 GB | 30-38 | STEM reasoning |
| Qwen3.5-27B | Q4_K_M | ~17 GB | 12-18 | Peak quality (LiveCodeBench 80.7) — slow but usable for batch |
Avoid: Gemma 4 26B MoE — community reports 11 tok/s vs 60+ for similarly-sized dense models. MoE has higher bandwidth demands per active param.
WSL + 3060 12GB — Heavy compute
| Model | Quant | VRAM fit | tok/s | Notes |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | Q4 | ~5 GB VRAM | 40-60 | Best MoE for this card — only 3.2B active |
| Qwen 2.5 Coder 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Interactive workhorse |
| DeepSeek-R1-Distill 14B | Q4_K_M | Full GPU (9GB) | 12-18 | Reasoning + code |
| Qwen3.5-27B | Q4_K_M | Partial (16GB) | 4-8 | ~75% GPU offload |
| Qwen 2.5 Coder 32B | Q4_K_M | Partial (20GB) | 3-5 | HumanEval 92.7% — overnight batch jobs |
| Qwen3.5-35B-A3B (MoE) | Q4_K_M | Partial (24GB) | 5-10 | Only 3B active, benefits from partial offload |
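The "Partial" rows follow from a simple ratio of VRAM to model size. The sketch below reproduces the ~75% figure for the 16 GB row; it assumes all 12 GB of VRAM is available for weights, which is optimistic since KV cache and the CUDA context also need room. The layer count of 64 is a hypothetical value used only to show how a fraction maps to a whole number of offloaded layers.

```python
import math

VRAM_GB = 12.0  # RTX 3060


def gpu_fraction(model_gb: float, vram_gb: float = VRAM_GB) -> float:
    """Share of the weights that can sit in VRAM, capped at 100%."""
    return min(1.0, vram_gb / model_gb)


def gpu_layers(n_layers: int, model_gb: float, vram_gb: float = VRAM_GB) -> int:
    """Whole layers to place on the GPU, assuming equally sized layers."""
    return math.floor(n_layers * gpu_fraction(model_gb, vram_gb))


for name, size_gb in [("14B Q4_K_M", 9.0), ("27B Q4_K_M", 16.0), ("35B-A3B Q4_K_M", 24.0)]:
    print(f"{name}: {gpu_fraction(size_gb):.0%} on GPU")
```

In practice it is safer to round the layer count down a little further and check actual VRAM use after loading.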
Coding-specific models
| Model | Params | HumanEval | SWE-bench / LCBv6 | Best for |
|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 88.4% | — | Autocomplete/FIM |
| Qwen 2.5 Coder 14B | 14B | ~89% | — | Best balance capability/speed |
| Qwen 2.5 Coder 32B | 32B | 92.7% | — | Highest code quality |
| Qwen3-Coder-Next (80B MoE) | 80B/3B active | — | 64.6% | Beats Claude Opus 4.6 on SWE-bench |
| Qwen3.5-9B | 9B | — | 65.6 LCBv6 | Chat-based coding with vision |
| Qwen3.5-27B | 27B | — | 80.7 LCBv6 | Multi-file reasoning |
Abliterated variant sources
| Producer | Method | Models | Where |
|---|---|---|---|
| huihui-ai | Abliteration + Expert pruning | Qwen3.6, Qwen3.5 (all sizes), Qwen3, Gemma 3, GLM-5.1, gpt-oss-20b + Huihui4-8B-A4B-v2 (pruned Gemma 4, 9B/4B active, INT4 6-9GB) | Ollama + HuggingFace |
| mlabonne | Abliteration | Gemma 3 (1B-27B) + GGUF | HuggingFace |
| bartowski | GGUF quants | QwQ-32B, Llama 3.1 8B, many others | HuggingFace |
| DavidAU | HERETIC | Gemma 4 31B, gpt-oss-20b (multiple variants) | HuggingFace |
| HauhauCS | Abliteration | Gemma 4 E2B, E4B, Qwen3.6-35B-A3B (“aggressive”) | HuggingFace |
| trohrbaugh | Heretic ARA | Gemma 4 31B (KL 0.012, refusals 98→5/100) | HuggingFace |
| p-e-w (Heretic tool) | Automated HERETIC | 1000+ models including Gemma 4 | GitHub + HuggingFace |
| TrevorJS | Biprojection + EGA | Gemma 4 (E2B, E4B, 26B MoE, 31B) | GitHub |
| amarck | Abliteration | Gemma 4 31B (GGUF quants, Q4_K_M ~19GB) | HuggingFace |
| pmarreck | HERETIC | Gemma 4 31B (one-command Ollama/MLX setup) | GitHub |
| aoxo | Fine-tune | gpt-oss-20b | HuggingFace |
Quick Ollama access:
ollama pull huihui_ai/qwen3.5-abliterated # Qwen 3.5 uncensored
ollama pull huihui_ai/gemma3-abliterated # Gemma 3 uncensored
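Once a model is pulled, it can be called from a script through Ollama's local HTTP API. A minimal sketch, assuming an Ollama server is running on the default port 11434 and the model tag above has already been pulled; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running server):
# print(generate("huihui_ai/qwen3.5-abliterated", "Write a one-line Python hello world."))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything touches the network.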
gpt-oss-20b abliterated landscape (complete)
| Variant | Producer | Method | Format |
|---|---|---|---|
| Huihui-gpt-oss-20b-BF16-abliterated | huihui-ai | Abliteration | BF16/Ollama (v1+v2) |
| GPT-oss-20b-abliterated-uncensored-NEO | DavidAU | Abliteration+NEO | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-HERETIC-uncensored-NEO | DavidAU | HERETIC | GGUF (IQ4_NL, Q5_1, Q8_0) |
| GPT-oss-20b-INSTRUCT-Heretic-Uncensored-MXFP4 | DavidAU | HERETIC | Native MXFP4 |
| gpt-oss-20b-uncensored | aoxo | Fine-tune | BF16 |
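All fit on all three machines: MXFP4 is ~14GB and IQ4_NL is ~11.5GB. Both are comfortable on the Macs; on the 3060, IQ4_NL is a tight fit in 12GB VRAM and MXFP4 needs partial CPU offload. HERETIC variant claims complete refusal removal.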
Independent benchmarks (via BenchLM, DataRobot, Artificial Analysis):
- Arena-Hard-v2: 48.5% (behind Nemotron 3 Nano at 67.7%)
- LiveCodeBench v6: 61.0% (behind Nemotron 3 Nano at 68.3%)
- Matches or exceeds o3-mini on most benchmarks
- Outperforms gpt-oss-120B on HumanEval and MMLU despite being much smaller
- “Low thinking effort” mode outperforms more expensive competitors
- Fits 16GB devices — runs easily on the M3 Max; on the RTX 3060 (12GB VRAM) use IQ4_NL (~11.5GB) or partial offload
Verdict: Solid general-purpose model but Nemotron 3 Nano beats it on coding benchmarks at similar active params. Best use: general reasoning/chat where abliterated variant is preferred.
Quantization reference
| Quant | Bits | Quality | 7B size | 14B size | 27B size |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 | Good | 4.5 GB | 9 GB | 16 GB |
| Q5_K_M | ~5.5 | Better (<2% perplexity loss) | 5.2 GB | 10 GB | 19 GB |
| Q6_K | ~6.5 | High | 6.0 GB | 12 GB | 22 GB |
| Q8_0 | ~8.0 | Near-lossless | 7.5 GB | 15 GB | 27 GB |
Rule of thumb for Apple Silicon: model should be <=60-70% of total unified memory.
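The table and the rule of thumb combine into a quick fit check. A rough sketch: weight size is estimated as parameters times bits per weight divided by 8, which gives a lower bound (the table's figures run somewhat higher), and the budget is taken as 60-70% of unified memory. Function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Lower-bound weight size in GB: params * bits / 8."""
    return params_billion * bits_per_weight / 8


def unified_memory_budget(total_gb: float, low: float = 0.60, high: float = 0.70) -> tuple[float, float]:
    """Model budget range under the 60-70% rule of thumb."""
    return total_gb * low, total_gb * high


def fits(params_billion: float, bits_per_weight: float, total_gb: float) -> bool:
    """True if the estimated weights fit under the conservative (60%) budget."""
    return weights_gb(params_billion, bits_per_weight) <= unified_memory_budget(total_gb)[0]


for total in (36, 32):
    low, high = unified_memory_budget(total)
    print(f"{total} GB unified: budget {low:.1f}-{high:.1f} GB, "
          f"27B Q4_K_M fits: {fits(27, 4.5, total)}, 27B Q8_0 fits: {fits(27, 8.0, total)}")
```

The computed ranges (about 21.6-25.2 GB and 19.2-22.4 GB) line up with the budgets listed for the two Mac profiles. Context length is not modeled here, so leave extra room for KV cache.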
Key insight: TurboQuant — 6x KV cache compression (NEW — April 12)
Google Research’s TurboQuant (March 25, ICLR 2026) compresses KV cache to 3 bits with zero accuracy loss. No retraining required. 6x reduction in KV memory.
Impact on the reference hardware:
- M3 Max 36GB: Gemma 4 31B at full 262K context becomes possible. KV cache drops from ~22GB to ~3.7GB. 31B Q4 (~20GB) + 3.7GB KV = 23.7GB total — fits.
- M2 Max 32GB: Nemotron 30B-A3B and Qwen3.5-27B can serve dramatically longer contexts within existing memory.
- RTX 3060 12GB: Context length multiplied within same VRAM budget. 14B models can run at very long context.
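The M3 Max case above is plain arithmetic on the figures quoted in this section. A small sketch that redoes it, assuming the 6x factor applies uniformly to the whole KV cache; the constants are the document's own estimates, not measurements.

```python
WEIGHTS_GB = 20.0        # Gemma 4 31B at Q4
KV_FULL_GB = 22.0        # uncompressed KV cache at full context
COMPRESSION = 6.0        # TurboQuant's claimed KV reduction
BUDGET_GB = (21.0, 24.0) # M3 Max 36GB model budget range


def compressed_kv_gb(kv_gb: float, compression: float = COMPRESSION) -> float:
    """KV cache size after compression."""
    return kv_gb / compression


def footprint_gb(weights_gb: float, kv_gb: float) -> float:
    """Weights plus KV cache; ignores activations and runtime overhead."""
    return weights_gb + kv_gb


before = footprint_gb(WEIGHTS_GB, KV_FULL_GB)
after = footprint_gb(WEIGHTS_GB, compressed_kv_gb(KV_FULL_GB))
print(f"before: {before:.1f} GB, after: {after:.1f} GB")
print(f"fits upper budget: {after <= BUDGET_GB[1]}, fits lower budget: {after <= BUDGET_GB[0]}")
```

The result lands just under the upper end of the budget and above the lower end, so full context on this profile only works with little else loaded.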
Implementation status:
- Google official: Q2 2026
- llama.cpp: `turboquant_plus` project, experimental, Metal support on Apple Silicon
- Validated from 1.5B to 104B parameter models
The synthesis: TurboQuant + Ollama 0.19 MLX backend = two multiplicative improvements. MLX accelerates compute, TurboQuant expands context. Together they make Apple Silicon the most improved local inference platform.
Key insight: Unsloth MLX-native Gemma 4 lineup (NEW — April 14)
Unsloth uploaded MLX-native quantizations for the full Gemma 4 family:
| Model | MLX 3-bit | MLX 4-bit | MLX 8-bit |
|---|---|---|---|
| Gemma 4 E2B | — | — | ✓ |
| Gemma 4 E4B | — | ✓ | ✓ |
| Gemma 4 26B MoE | ✓ | ✓ | ✓ |
| Gemma 4 31B Dense | ✓ | ✓ | ✓ |
Why this matters: MLX-native quants skip GGUF→MLX conversion overhead. Combined with Ollama 0.19’s MLX backend, these are the optimal format for Apple Silicon. The Gemma 4 26B MoE at MLX 4-bit (~17GB) fits M3 Max and M2 Max comfortably. The 31B Dense at MLX 3-bit may also fit within budget.
Updated recommendation: For general-purpose inference on Apple Silicon, prefer Unsloth MLX quants over GGUF when available.
huihui-ai abliteration wave (April 14-16)
- Huihui4-48B-A4B-abliterated (April 16) — experimental expanded architecture: takes Gemma 4 26B-A4B and replaces MLP layers with 256-expert MoE, expanding to 48B total. Not fine-tuned yet.
- Huihui3.5-67B-A3B (April 16) — Qwen3.5-35B-A3B base expanded to 512 experts, 67B/3B active MoE
- Gemma 4 E2B, 31B, 26B MoE abliterated v2 — refreshed versions
- Full Qwen3.5, GLM-4.7, Kimi, Mistral-Small-4 lineups now abliterated
DavidAU LFM2 HERETIC series (NEW — April 14-16)
Liquid AI’s LFM2 foundation models (SSM-Transformer hybrids) getting the HERETIC treatment:
- LFM2-12B-A1B High-Intelligence Series-B (April 16)
- LFM2-12B-A1B Deckard-II HERETIC Uncensored (April 16)
- LFM2-8B-A1B Deckard-II HERETIC Uncensored (April 15)
- LFM2-8B-A1B GLM-4.7-Flash Thinking (April 15)
- gemma-4-19B-A4B-it INSTRUCT Heretic-Uncensored (April 15)
- gemma-4-E4B-it Deckard-V2 Strong HERETIC (April 14)
At 1B active params, the LFM2-8B models are extremely efficient — run on all three machines. SSM-Transformer hybrid architecture is distinct from standard transformer; worth evaluating for latency characteristics.
Key insight: Ollama 0.19 MLX backend
Released March 2026. On Apple Silicon: 57% faster prefill, 93% faster decode vs v0.18 (llama.cpp). The M3 Max has higher memory bandwidth than M4 Pro, so it outperforms newer chips for memory-bound inference. Make sure Ollama is updated.
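Why bandwidth dominates can be seen with a back-of-envelope ceiling: if decoding is memory-bound, each generated token has to stream the weights once, so tokens per second cannot exceed bandwidth divided by the bytes read per token. This is a common approximation rather than a measured model; it ignores KV cache reads, compute limits, and software overhead. The example uses the M2 Max bandwidth and model sizes listed earlier.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on decode speed when every token streams the weights once."""
    return bandwidth_gb_s / weights_read_gb


M2_MAX_BANDWIDTH_GB_S = 400.0

for name, size_gb in [("14B Q4_K_M", 9.0), ("27B Q4_K_M", 17.0)]:
    ceiling = decode_ceiling_tok_s(M2_MAX_BANDWIDTH_GB_S, size_gb)
    print(f"{name}: ceiling ~{ceiling:.1f} tok/s")
```

The ceilings (about 44 and 24 tok/s) sit above the 25-35 and 12-18 tok/s ranges in the M2 Max table, which is the expected relationship for a bound of this kind.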
Models NOT practical on the reference hardware
| Model | Why |
|---|---|
| GLM-5.1 (744B MoE, 40B active) | MIT license, #1 SWE-Bench Pro (58.4). Open-weight on HuggingFace (zai-org/GLM-5.1). Smallest GGUF ~206GB. huihui-ai abliterated GGUF exists. MLX community version exists. Watch for distills. |
| Kimi K2.5 (1T params) | Even smallest quant (1.8-bit) is ~240GB |
| Llama 4 Scout (109B) | Q4 is ~60GB+ |
| Llama 4 Maverick (400B) | Data center only |
| gpt-oss-120b (117B MoE) | Needs 66GB+ unified for usable speed |
| Nemotron 3 Super 120B-A12B | Too large at full quality |
Other models to assess
- DavidAU LFM2-8B-A1B variants: SSM-Transformer hybrid from Liquid AI, HERETIC uncensored. 1B active params. Extremely efficient — fits all three machines. New architecture worth evaluating for latency and agentic task performance.
- DavidAU gemma-4-19B-A4B HERETIC: Compact uncensored Gemma 4 variant. Fits M3 Max and M2 Max at Q4.
- Nemotron 3 Nano 4B: Mamba-Transformer hybrid, claims 5x throughput. Tiny enough for fleet member on M3 Max. Priority evaluation.
- Nemotron 3 Nano 30B-A3B: MoE with only 3B active, Mamba hybrid. Local benchmarks now available: ~77 tok/s M2 Max (MLX), 40-60 tok/s RTX 3060 (Q4, ~5GB VRAM). Fits 3060 comfortably — top priority.
- MiniMax M2.7: “Self-evolving” training. 56.22% SWE-Pro — approaching Claude Opus 4.6. Too large for local but watch for quants/distills.
- Cogito v1 (3B/8B/14B/32B/70B): Dense, hybrid reasoning toggle. Llama/Qwen-base variants. On Ollama.
- Phi-4-mini-reasoning (3.8B): 128K context, reasoning-capable. Worth testing as alternative to SmolLM3.
- Gemma 4 31B HERETIC+Thinking (DavidAU): Chain-of-thought reasoning + uncensored 31B.
- Qwen3-Coder abliterated (huihui-ai): Abliterated variant for the coding model line.
- gpt-oss-20b HERETIC (DavidAU): Claims complete refusal removal. Priority evaluation.
- Gemma 4 31B abliterated GGUF (amarck): Q4_K_M ~19GB, fits M3 Max at short context.
Known issues
- Qwen 3.5 GGUF + Ollama incompatibility: GGUF versions do not work in Ollama due to separate mmproj vision files. Use llama.cpp directly for now.
- Gemma 4 GGUF chat template bug: Community GGUF uploads ship with incorrect chat templates (wrong delimiters), causing “---” output loops. pmarreck/gemma4-heretical fixes this via Ollama RENDERER/PARSER support.
- Gemma 4 31B flash-attention bug in Ollama: Hangs on prompts over ~500 tokens. Workaround: `OLLAMA_FLASH_ATTENTION=0`, but tanks speed to ~15 tok/s on Apple Silicon. The 26B MoE is the better pick at ~20-30 tok/s.
- Gemma 4 31B context limits on 36GB Macs: 31B Q4 needs ~20GB weights + ~22GB KV at full 262K context. Only works at short context (<16K) on M3 Max 36GB.
Open threads
- Meta Muse Spark — open-weight contraction: Meta went proprietary. Llama future unclear. The open-weight producers (Google Gemma, Alibaba Qwen, Zhipu GLM, community) become more important. Google’s Apache 2.0 shift for Gemma 4 looks prescient.
- Heretic ARA quality: trohrbaugh’s gemma-4-31b-it-heretic-ara achieves KL divergence 0.012 (virtually no quality loss) while reducing refusals 98→5/100. Current best-quality abliteration for Gemma 4 31B. Needs evaluation.
- Nemotron 3 Nano evaluation: Benchmarks available. AIME 89.1%, LCBv6 68.3%, Arena-Hard 67.7%. At 3.2B active params, top priority for the 3060 profile. Beats Qwen3-30B-A3B and gpt-oss-20b on coding.
- gpt-oss-20b evaluation: Benchmarks available. Arena-Hard 48.5%, LCBv6 61.0%. Solid but Nemotron 3 Nano beats it. Best for general reasoning with abliterated variant.
- Ollama v0.20.5 (April 9): New release. Gemma 4 all sizes available. Check for stability/perf fixes.
- TrevorJS abliteration technique: Biprojection + EGA, cross-validated against 686 prompts. New method worth tracking.
- Qwen 3.6-Plus: API-only (Alibaba Bailian, OpenRouter). 1M context, agentic coding. Watch for local release.
- huihui-ai Huihui4-8B-A4B: New model family uploaded April 25. 8B total/4B active (MoE), image-text-to-text. GGUF variant also available. If original work (not abliteration), marks huihui-ai’s transition to model producer. Fits all three machines easily. Evaluate.
- huihui-ai Qwen3.6-27B abliterated: Uploaded April 23, 539 downloads. Abliterated dense Qwen3.6-27B — the model that outperforms 397B MoE. Combined with Unsloth MLX quants for Apple Silicon deployment.
- DeepSeek V4: Shipped April 23-24. V4-Pro 1.6T/49B active, V4-Flash 284B/13B active. MIT license. CSA/HCA attention compression (27% FLOPs, 10% KV cache vs V3.2). Too large for local. The architecture matters more than the weights — watch for compression techniques propagating to smaller models.
- vllm-mlx: Server framework claiming 400+ tok/s on tiny models, continuous batching, Claude Code compatible.
- OpenClaw community adopting Kimi K2.5: Signal that model preference is shifting away from Claude for agentic work in open-source community.
- GLM-5.1: 744B MoE (40B active), #1 SWE-Bench Pro (58.4), MIT license. Open-weight since April 7 (corrected from “cloud-only”). huihui-ai abliterated GGUF available. MLX community version exists. Too large for local at full scale (~206GB) but distills/aggressive quants may change this. Watch Z.ai for smaller variants.
- Copilot CLI BYOK: Now supports Ollama, vLLM, any OpenAI-compatible endpoint. Local models become usable inside a major agent’s workflow for the first time.
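For the Heretic ARA thread above, it helps to recall what a KL divergence of 0.012 measures: how far the modified model's next-token distribution drifts from the original's, with 0 meaning identical. How that specific figure was measured is not stated here, so the sketch below only shows the general quantity; the two probability vectors are hypothetical toy distributions.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
original = [0.70, 0.20, 0.10]
modified = [0.60, 0.25, 0.15]

print(f"identical models: {kl_divergence(original, original):.3f}")
print(f"drifted model:    {kl_divergence(original, modified):.3f}")
```

Even this modest toy shift produces a value larger than 0.012, which gives a sense of how small the reported drift is.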