Ollama
Run open-source LLMs locally with a single command. OpenAI-compatible API, GPU acceleration on Apple Silicon, NVIDIA, and AMD, and a growing library of models including Llama, Mistral, Gemma, and DeepSeek
Quick Start
curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.2 Overview
Ollama is the standard tool for running open-source language models on your own hardware. One command installs it on macOS, Windows, or Linux. A second command pulls a model and starts a chat session. After that, it runs a local API server at localhost:11434 that behaves exactly like the OpenAI API, so any application already built against OpenAI’s SDK can point to a local model by changing one environment variable.
The model library covers most of what you would reach for: Llama 3, Mistral, Gemma, Phi, Qwen, DeepSeek, and dozens of specialist models for code, vision, and embeddings. Quantized variants let you run larger models on modest hardware by trading a small amount of accuracy for dramatically lower memory use. A Modelfile format lets you bake in system prompts, adjust parameters, and create named model variants you can share or version-control.
GPU acceleration works across Apple Silicon (Metal), NVIDIA (CUDA), and AMD (ROCm). On a MacBook with an M-series chip, inference on a 7B model is fast enough to feel conversational. Without a GPU, CPU inference works but slows down considerably on anything larger than a 3B model.
Ollama does not ship a chat UI. For a browser interface, pair it with Open WebUI, which is built specifically for Ollama and discovers your installed models automatically. For document Q&A, AnythingLLM connects to Ollama directly.
A cloud option exists for when local hardware is not enough, but the local self-hosted path is what 172k GitHub stars are voting for.
Ollama: Pros & Cons
| Pros (The Wins) | Cons (The Friction) |
|---|---|
| One-command install: macOS, Windows, Linux; models pull and run immediately. | No chat UI: API server only; pair with Open WebUI for a browser interface. |
| OpenAI-compatible API: Swap one env var; any OpenAI SDK app just works. | RAM requirements: 7B needs ~8GB, 13B needs 16GB+; large models punish weak hardware. |
| Multi-platform GPU: Apple Silicon, NVIDIA, AMD acceleration all supported. | Quantization trade-offs: Smaller models run faster but lose some response accuracy. |
| 172k stars, MIT: De facto standard for local model inference. | Windows rough edges: Newer platform support; occasional issues vs macOS and Linux. |
Use Cases
Specific ways to use Ollama for your workflow.
Deployment Strategy
Recommended ways to host Ollama in your own environment.