A Skeptic’s Guide to Running Local LLMs on macOS
Added September 8, 2025

A skeptical but practical guide to running LLMs locally on Apple Silicon Macs using llama.cpp or LM Studio. It explains why local use matters (experimentation, privacy, ethics), how to pick models (size, runtime, quantization, vision/reasoning), and how to use tools safely via MCPs (Model Context Protocol servers). The author stresses fact-checking, avoiding anthropomorphism, and using compaction to keep the context window manageable.
Key Points
- Run LLMs locally to experiment freely, protect sensitive data, and avoid funding companies with questionable practices.
- Two main options on macOS: llama.cpp (open-source, flexible) and LM Studio (closed-source, easier UI with guardrails and MCP/tooling).
- Choose models based on what fits in RAM, the correct runtime/format (GGUF vs. MLX), 4-bit quantization, and whether you need vision or reasoning; a minimal loading sketch follows this list.
- Use tools/MCPs cautiously (confirm tool calls, beware data exfiltration); they’re powerful but quickly pollute context.
- LLMs are helpful for summarization and brain-dumps but hallucinate; always fact-check and avoid anthropomorphizing.
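
To make "pick a 4-bit GGUF that fits in RAM" concrete, here is a minimal sketch using the llama-cpp-python bindings, one of several ways to drive llama.cpp from a script. The model path and filename are placeholders for whatever quantized GGUF you download; the sketch assumes llama-cpp-python was installed with Metal support, which is the default on Apple Silicon.

```python
# Minimal sketch: load a 4-bit quantized GGUF with llama-cpp-python and ask one question.
# Assumptions: llama-cpp-python built with Metal support, and a quantized GGUF file
# already downloaded locally (the path below is hypothetical).
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-12b-model.Q4_K_M.gguf",  # placeholder: any 4-bit GGUF that fits in RAM
    n_ctx=4096,        # context window; larger values use more memory
    n_gpu_layers=-1,   # offload all layers to the GPU via Metal
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the trade-offs of 4-bit quantization in two sentences."}]
)
print(response["choices"][0]["message"]["content"])
```

The same GGUF file also works with the llama.cpp command-line tools and with LM Studio; the bindings are simply a convenient way to script the summarization and brain-dump style tasks the article recommends local models for.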
Sentiment
Cautiously positive: most participants agree local LLMs are valuable and increasingly usable on macOS for specific, privacy-focused automation tasks, while acknowledging significant limits in quality, power use, and hardware/ANE support compared to state-of-the-art cloud models.
In Agreement
- Local LLMs are surprisingly capable for summarization, classification, embeddings, search, grammar checking, and coding assistance—use them for automation over factual recall.
- Model size must match available RAM; on 16 GB Macs, 12B–20B-parameter models are the practical upper limit, while larger models need 48–512 GB of RAM.
- On Apple Silicon, inference generally runs best on the GPU via Metal; ANE support for transformers is limited and opaque.
- LM Studio, MLX, and llama.cpp/ollama are effective, user-friendly ways to get started; OpenAI-compatible local servers make integration easy (see the sketch after this list).
- Privacy, offline access, and cost control during development are compelling reasons to go local.
- Quantization (e.g., 4-bit) offers a solid performance/quality balance for local use.
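
The "OpenAI-compatible local servers" point is what makes integration easy: LM Studio's local server and llama.cpp's llama-server both speak the familiar chat-completions API, so the standard openai Python client works against them unchanged. A hedged sketch follows; the port and model identifier are assumptions (LM Studio commonly defaults to port 1234, llama-server to 8080), so check what your server actually reports.

```python
# Sketch of talking to a local OpenAI-compatible server (LM Studio or llama-server).
# The base_url port and model identifier are assumptions; adjust to your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's usual default; llama-server typically uses :8080/v1
    api_key="not-needed",                 # local servers generally ignore the key, but the client requires one
)

completion = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier your server lists for the loaded model
    messages=[
        {"role": "system", "content": "You are a concise grammar checker."},
        {"role": "user", "content": "Fix this sentence: 'Their going to the store tomorrow.'"},
    ],
)
print(completion.choices[0].message.content)
```

Because any client that speaks the OpenAI API can be pointed at localhost this way, moving an existing script from a cloud model to a local one during development is often just a base_url change, which is part of the cost-control argument above.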
Opposed
- Local LLMs still hallucinate too often for certain workflows, making manual verification more costly than doing tasks by hand.
- State-of-the-art reasoning and coding quality remain better in cloud models; local models may be impractical for general use for years.
- Battery drain on laptops is severe during continuous inference, reducing viability for mobile scenarios.
- ANE/NPU support is lacking or ineffective; without standardized interfaces and adequate performance, local hardware accelerators are underused.
- High-end local boxes ($5k–$12k) are expensive and may be inferior to paying for access to frontier cloud models.
- Installation/maintenance complexity (vs a simple in-browser, no-install solution) is still a barrier for some users.