Ollama

aideveloper tools

Run open-source LLMs locally with a single command. OpenAI-compatible API, GPU acceleration on Apple Silicon, NVIDIA, and AMD, and a growing library of models including Llama, Mistral, Gemma, and DeepSeek

#llm#ai#local-ai#openai-compatible#self-hosted#inference#gpu#ollama
Alternative to OpenAI APIAnthropic API

Quick Start

curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.2

Overview

Ollama is the standard tool for running open-source language models on your own hardware. One command installs it on macOS, Windows, or Linux. A second command pulls a model and starts a chat session. After that, it runs a local API server at localhost:11434 that behaves exactly like the OpenAI API, so any application already built against OpenAI’s SDK can point to a local model by changing one environment variable.

The model library covers most of what you would reach for: Llama 3, Mistral, Gemma, Phi, Qwen, DeepSeek, and dozens of specialist models for code, vision, and embeddings. Quantized variants let you run larger models on modest hardware by trading a small amount of accuracy for dramatically lower memory use. A Modelfile format lets you bake in system prompts, adjust parameters, and create named model variants you can share or version-control.

GPU acceleration works across Apple Silicon (Metal), NVIDIA (CUDA), and AMD (ROCm). On a MacBook with an M-series chip, inference on a 7B model is fast enough to feel conversational. Without a GPU, CPU inference works but slows down considerably on anything larger than a 3B model.

Ollama does not ship a chat UI. For a browser interface, pair it with Open WebUI, which is built specifically for Ollama and discovers your installed models automatically. For document Q&A, AnythingLLM connects to Ollama directly.

A cloud option exists for when local hardware is not enough, but the local self-hosted path is what 172k GitHub stars are voting for.

Ollama: Pros & Cons

Pros (The Wins)Cons (The Friction)
One-command install:
macOS, Windows, Linux;
models pull and run immediately.
No chat UI:
API server only; pair with
Open WebUI for a browser interface.
OpenAI-compatible API:
Swap one env var; any
OpenAI SDK app just works.
RAM requirements:
7B needs ~8GB, 13B needs 16GB+;
large models punish weak hardware.
Multi-platform GPU:
Apple Silicon, NVIDIA, AMD
acceleration all supported.
Quantization trade-offs:
Smaller models run faster
but lose some response accuracy.
172k stars, MIT:
De facto standard for
local model inference.
Windows rough edges:
Newer platform support; occasional
issues vs macOS and Linux.

Use Cases

Specific ways to use Ollama for your workflow.

01
Run a local code assistant that reads your codebase without sending anything to a third-party API
02
Build and test LLM-powered features locally with no API key, no usage bill, and no rate limits
03
Point any OpenAI SDK app at a local model by swapping one environment variable
04
Deploy a local model on hardware with no internet connection for offline or air-gapped environments

Deployment Strategy

Recommended ways to host Ollama in your own environment.

self-hosted
docker
binary