
Running AI Models Locally: A Beginner's Guide to Local LLMs


Cloud-based AI services like ChatGPT and Claude are convenient, but they come with trade-offs: subscription costs, data privacy concerns, internet dependency, and limited customization. Running large language models (LLMs) on your own hardware addresses all four. This guide walks through exactly how to get started, from understanding hardware requirements to running your first local model in under five minutes.

Why Run LLMs Locally?

Before diving into setup, it helps to understand what you gain by going local.

Privacy and Data Control

Every prompt you send to a cloud API travels across the internet and lands on someone else's server. For personal projects that might be fine, but for businesses handling customer data, medical records, legal documents, or proprietary code, this is a serious liability. Local models process everything on your machine. Nothing leaves your network.

Cost Elimination

GPT-4o API calls cost roughly $2.50 per million input tokens and $10 per million output tokens as of early 2026. If you run thousands of queries daily — for summarization, code review, or document processing — costs add up fast. A local model runs on hardware you already own, with zero per-query fees. The ROI becomes obvious within weeks for heavy users.
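To make the arithmetic concrete, here is a small sketch at the per-token rates quoted above. The query volume and per-query token counts are illustrative assumptions, not figures from any real workload:

```python
# Estimated monthly cost of a cloud API at the GPT-4o rates quoted above.
# Query volume and per-query token counts are hypothetical assumptions.

INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def monthly_cost(queries_per_day, in_tokens, out_tokens, days=30):
    """Dollar cost of a month of API usage at the given per-query sizes."""
    per_query = in_tokens * INPUT_RATE + out_tokens * OUTPUT_RATE
    return queries_per_day * per_query * days

# 5,000 summarization queries/day, ~2,000 tokens in, ~300 tokens out
cost = monthly_cost(5_000, 2_000, 300)
print(f"${cost:,.2f}/month")  # $1,200.00/month; the local equivalent is $0 per query
```

At that hypothetical volume the spend reaches four figures per month, which is why batch-heavy workloads are the first thing teams move to local hardware.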

Offline Access

Cloud APIs require internet. Local models work on airplanes, in remote locations, or during outages. If you build applications that depend on AI inference, removing the network dependency makes your system fundamentally more reliable.

Customization and Fine-Tuning

With local models, you can fine-tune on your own datasets, adjust inference parameters freely, create custom model merges, and run specialized quantizations optimized for your hardware. Cloud providers give you a fixed menu; local deployment gives you the kitchen.

Hardware Requirements: What You Actually Need

The single biggest factor determining which models you can run is RAM — specifically, the amount of memory available to load the model weights. Here is a practical breakdown by hardware tier.

Tier 1: 8 GB RAM (Entry Level)

With 8 GB of system RAM and no dedicated GPU, you can run smaller models using CPU-only inference. Expect slower generation speeds (around 5–15 tokens per second), but the quality of compact models has improved dramatically.

Models that work well:
  • Phi-3 Mini (3.8B) — Microsoft's compact model, surprisingly capable for its size
  • Gemma 2 2B — Google's efficient small model, strong at instruction following
  • TinyLlama (1.1B) — Fast and lightweight, good for simple tasks
  • Qwen2.5 3B — Alibaba's model, solid multilingual support

At this tier, stick to Q4_K_M or Q5_K_M quantizations to balance quality with memory usage. You will be limited to shorter context windows (2K–4K tokens).

Tier 2: 16 GB RAM (Sweet Spot)

This is where local LLMs become genuinely useful. With 16 GB, you can load 7B–8B parameter models comfortably with room for context.

Models that work well:
  • Llama 3.1 8B — Meta's flagship small model, excellent general performance
  • Mistral 7B v0.3 — Strong reasoning and instruction following
  • Gemma 2 9B — Google's mid-range model, impressive benchmark results
  • Qwen2.5 7B — Excellent coding and math capabilities
  • DeepSeek-R1 Distill 8B — Reasoning-focused with chain-of-thought

At Q4_K_M quantization, a 7B model uses roughly 4–5 GB of RAM, leaving space for the operating system and applications. Generation speeds on a modern CPU hit 10–25 tokens per second. Add a GPU with 8+ GB VRAM and you jump to 40–80 tokens per second.
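As a rough rule of thumb, you can estimate a model's weight footprint from its parameter count and the quantization's average bits per weight (K-quant formats mix 4- and 6-bit blocks plus scale factors, so Q4_K_M averages about 4.5 bits). A sketch, with the bits-per-weight values as approximations:

```python
# Rough memory estimate for quantized GGUF model weights.
# Bits-per-weight figures are approximations; real files vary slightly.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_size_gb(params_billion, quant="Q4_K_M"):
    """Approximate size of the model weights in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

print(f"7B  @ Q4_K_M: ~{model_size_gb(7):.1f} GB")   # ~3.9 GB of weights
print(f"8B  @ Q4_K_M: ~{model_size_gb(8):.1f} GB")   # ~4.5 GB of weights
print(f"70B @ Q4_K_M: ~{model_size_gb(70):.1f} GB")  # ~39 GB, matching the tier-3 note
```

Weights are not the whole story: the context cache and runtime buffers add on top, which is why a "4.5 GB" model wants more than 5 GB of free RAM.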

Tier 3: 32 GB+ RAM (Power User)

With 32 GB or more, you unlock larger models that rival cloud API quality for many tasks.

Models that work well:
  • Llama 3.1 70B (Q4) — Requires ~40 GB, so 48–64 GB RAM is ideal; near-GPT-4 quality
  • Mixtral 8x7B — Mixture-of-experts architecture, fast and capable
  • Qwen2.5 32B — Strong across coding, reasoning, and creative writing
  • Command R+ 35B — Cohere's model, excellent for RAG and tool use
  • DeepSeek-R1 Distill 32B — Best reasoning in its class

If you have a GPU with 24 GB VRAM (like an RTX 4090 or RTX 3090), you can run 13B–34B models entirely in VRAM for blazing fast inference at 60–100+ tokens per second.

GPU vs CPU: What Matters

GPU (CUDA/ROCm): Dramatically faster inference. An RTX 3060 12 GB can run a 7B model at 50+ tokens per second, and an RTX 4090 24 GB handles 34B models smoothly. AMD GPUs work via ROCm, but driver support can be finicky.

CPU-only: Perfectly viable for models up to 13B with enough RAM. Modern CPUs with AVX2/AVX-512 support (most processors from 2016 onward) handle inference well.

Apple Silicon: If you own an M-series Mac, you are in a uniquely good position for local LLMs. The M1 Pro/Max/Ultra and M2/M3/M4 series use unified memory, meaning the GPU and CPU share the same RAM pool, and the Metal framework provides GPU acceleration. Because your full RAM is available for model loading, an M2 Max with 32 GB can run 34B models at impressive speeds.

Tool Comparison: Picking Your Runtime

Four tools dominate the local LLM space. Each has distinct strengths.

Ollama

Best for: Getting started quickly, server-style deployment, API integration

Ollama wraps llama.cpp in a clean CLI with a model library. You pull models by name (ollama pull llama3.1) and run them instantly. It exposes an OpenAI-compatible API on localhost:11434, making it trivial to integrate with existing applications.

  • Supports macOS, Linux, and Windows
  • Built-in model management (pull, list, delete)
  • Modelfile system for custom configurations
  • GPU acceleration detected automatically
  • Active development with frequent updates

LM Studio

Best for: GUI users, model exploration, beginners who prefer visual interfaces

LM Studio provides a desktop application with a chat interface, model search, and download management. You can browse Hugging Face models directly, adjust parameters with sliders, and compare outputs side by side.

  • Visual model browser and download manager
  • Built-in chat interface with conversation history
  • Local server mode with OpenAI-compatible API
  • Quantization format support (GGUF)
  • Available on macOS, Windows, and Linux

llama.cpp

Best for: Maximum performance, advanced users, custom builds

llama.cpp is the underlying C/C++ inference engine that powers Ollama and many other tools. Running it directly gives you the most control: custom compilation flags, experimental features, and bleeding-edge optimizations.

  • Highest raw performance
  • Supports every quantization format
  • Compiles for specific hardware targets
  • Server mode available (llama-server)
  • Requires command-line comfort

GPT4All

Best for: Privacy-focused users, enterprise deployment, offline-first use cases

GPT4All by Nomic emphasizes privacy and ease of use. It includes a desktop app, local document chat (primitive RAG), and a curated model selection. The focus is on models that run well on consumer hardware.

  • Curated model library optimized for consumer hardware
  • Built-in local document chat
  • Plugin ecosystem
  • Enterprise deployment options
  • Strong privacy focus

Step-by-Step: Your First Local Model with Ollama

Let us get a model running. Ollama is the fastest path from zero to working local LLM.

Step 1: Install Ollama

macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com and run it.

Ollama runs as a background service. Verify the installation:
ollama --version

Step 2: Pull a Model

For your first model, start with Llama 3.1 8B — it strikes the best balance of quality and resource usage:
ollama pull llama3.1
This downloads the Q4_K_M quantized version (~4.7 GB). The download happens once; subsequent runs load from disk. For systems with limited RAM, try the smaller Phi-3 Mini:
ollama pull phi3:mini

Step 3: Run and Chat

Start an interactive chat session:
ollama run llama3.1
You are now chatting with a local LLM. Type your prompt and press Enter. Type /bye to exit.

Step 4: Use the API

Ollama automatically serves an OpenAI-compatible API. With the service running, send requests from any HTTP client:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain quicksort in 3 sentences."}]
  }'
This means any application that supports the OpenAI API format can use your local model by simply changing the base URL to http://localhost:11434/v1.
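The same request can be made from Python with nothing but the standard library. This sketch only constructs the request object; the commented lines at the end would send it once the Ollama service is running:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint (default port 11434).
URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, prompt):
    """Build an OpenAI-format chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("llama3.1", "Explain quicksort in 3 sentences.")
print(req.full_url)
# To send (requires a running Ollama service):
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the request body follows the OpenAI chat format, swapping between a local and a cloud backend is a one-line URL change.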

Step 5: Customize with a Modelfile

Create a file called Modelfile to customize behavior:
FROM llama3.1

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM """You are a senior software engineer. You write clean, well-documented code and explain your reasoning step by step."""
Build and run your custom model:
ollama create code-assistant -f Modelfile
ollama run code-assistant

Local vs Cloud: Honest Performance Comparison

Local models are not a universal replacement for cloud APIs. Here is where each excels.

Where Local Models Win

  • Batch processing: Running thousands of documents through summarization or classification is dramatically cheaper locally
  • Code completion: Low-latency, privacy-preserving autocomplete for IDEs (tools like Continue and Tabby use local models)
  • Sensitive data: Legal, medical, financial, or proprietary content that should never touch external servers
  • Prototyping: Experimenting with prompts and workflows without worrying about API costs
  • Embedded systems: Edge deployment where internet connectivity is unreliable

Where Cloud APIs Still Win

  • Raw capability ceiling: GPT-4o and Claude Opus still outperform the best locally runnable models on complex reasoning, nuanced writing, and multi-step tasks
  • Long context: Cloud models handle 100K–200K token contexts natively; local models typically max out at 8K–32K due to memory constraints
  • Multimodal: Vision and audio capabilities are more mature in cloud offerings
  • Zero setup: Cloud APIs work immediately with no hardware investment

The Hybrid Approach

Many teams use both. Route simple, high-volume tasks (classification, extraction, summarization) to local models and reserve cloud APIs for complex tasks requiring maximum capability. This hybrid strategy cuts costs by 70–90% while maintaining quality where it matters.
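One way to implement such routing is a simple dispatcher that sends a task to the local endpoint unless it is flagged as complex or oversized. The task categories and the context threshold below are illustrative assumptions, not a prescribed policy:

```python
# Hypothetical cost-aware router: cheap, high-volume task types go to the
# local model; everything else goes to a cloud API. Categories are illustrative.
LOCAL_TASKS = {"classification", "extraction", "summarization"}

def route(task_type, context_tokens, max_local_context=8_000):
    """Return which backend should handle a task."""
    if task_type in LOCAL_TASKS and context_tokens <= max_local_context:
        return "local"   # e.g. Ollama at localhost:11434
    return "cloud"       # e.g. a frontier-model API

print(route("summarization", 1_500))         # local
print(route("multi_step_reasoning", 1_500))  # cloud
print(route("summarization", 50_000))        # cloud: exceeds local context budget
```

Because both backends can speak the same OpenAI request format, the router only needs to pick a base URL, keeping the application code identical either way.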

Use Cases Where Local LLMs Shine

Development and Coding

Use local models as coding assistants in your IDE. Tools like Continue (VS Code extension) and Tabby connect to Ollama and provide autocomplete, code explanation, and refactoring suggestions — all without sending your codebase to external servers.

Document Processing

Build pipelines that summarize, classify, or extract information from documents. A local 8B model handles invoice parsing, contract summarization, and email categorization with excellent accuracy for structured tasks.

Privacy-First Business Applications

Healthcare organizations can use local models for clinical note summarization. Law firms can analyze contracts. Financial institutions can process sensitive reports. The data never leaves the premises.

Personal Knowledge Bases

Combine a local model with a vector database (ChromaDB, Qdrant) to build a personal RAG system. Index your notes, documents, and bookmarks, then query them in natural language — all running on your laptop.
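The retrieval half of such a system can be illustrated without any external dependencies. The sketch below uses a toy bag-of-words "embedding" and cosine similarity in place of a real embedding model and a vector database like ChromaDB, purely to show the shape of the pipeline:

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words term counts.
# A real system would use a proper embedding model and a vector DB.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "ollama exposes an openai compatible api on port 11434",
    "quantization trades model quality for lower memory usage",
    "unified memory lets apple silicon gpus share the full ram pool",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, p[1]), reverse=True)
    return [d for d, _ in ranked][:k]

# The retrieved passage would then be prepended to the prompt for the local model.
print(retrieve("how does quantization affect memory?"))
```

In a real pipeline the final step stuffs the retrieved passages into the prompt ("Answer using only the context below: ...") before calling the local model.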

Education and Experimentation

Local models are perfect for learning about LLM behavior. Adjust parameters, test different quantizations, compare model architectures, and build intuition without spending money on API calls.

Tips for Getting the Best Results

  • Start small, then scale up. Begin with a 7B–8B model and move to larger models only if you hit quality limitations for your specific use case. Many tasks do not require 70B parameters.
  • Use the right quantization. Q4_K_M is the default sweet spot. Q5_K_M offers slightly better quality at roughly 15% more memory usage. Q3_K_M saves memory but noticeably degrades output quality. Avoid Q2 quantizations for anything beyond simple classification.
  • Increase context gradually. Larger context windows consume more RAM. Start with 2048 or 4096 tokens and increase only if your task demands it. Each doubling of context roughly doubles the memory overhead during inference.
  • Match the model to the task. Use coding-specialized models (like DeepSeek Coder or CodeGemma) for code tasks and reasoning models (like DeepSeek-R1 distills) for math and logic. General-purpose models are jacks of all trades but masters of none.
  • Keep models updated. The local LLM space moves fast, with new model releases and quantization improvements arriving monthly. Check Ollama's library and Hugging Face regularly for upgrades.
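The context-memory point can be quantified: the KV cache grows linearly with context length. Using approximate architecture figures for Llama 3.1 8B (32 layers, 8 grouped-query KV heads, head dimension 128, 16-bit cache entries; treat these numbers as assumptions):

```python
# Approximate KV-cache size. Architecture numbers are assumptions
# roughly matching Llama 3.1 8B with grouped-query attention.
def kv_cache_mib(context_tokens, layers=32, kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    """MiB of KV cache: keys and values for every layer, head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return context_tokens * per_token / (1024 ** 2)

for ctx in (2048, 4096, 8192, 16384):
    print(f"{ctx:>6} tokens -> {kv_cache_mib(ctx):,.0f} MiB")
# Each doubling of context doubles the cache, as noted above.
```

At these assumed figures a 4K context costs about half a GiB on top of the weights, and a 16K context about 2 GiB, which is why long contexts push smaller machines over their RAM budget.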

What Comes Next

Once you are comfortable running models locally, the natural next steps are:

  • Build a local RAG system — combine your model with a vector database for document Q&A
  • Set up a coding assistant — integrate with your IDE for privacy-preserving autocomplete
  • Explore fine-tuning — customize a model on your own data using tools like Unsloth or Axolotl
  • Deploy as an API — serve your model to other applications on your network using Ollama's built-in server
Local LLMs have crossed the threshold from hobbyist curiosity to practical daily tool. The hardware you already own is likely sufficient to get started. The setup takes minutes, the cost is zero, and your data stays yours. That is a hard combination to beat.
