Feedback

Chat Icon

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Running Models and Understanding How They Work inside Ollama
31%

Running Models with the Ollama API

The Ollama CLI is convenient for poking around, but every real integration goes through the HTTP API on http://localhost:11434. Three endpoints cover 90% of what you'll do:

  • /api/pull to get a model
  • /api/generate for one-shot completions
  • /api/chat for multi-turn conversations with roles and tool calls

Prerequisites

Before using the API, make sure the server is running by testing the /api/version endpoint:

curl http://localhost:11434/api/version

Expected response:

{"version":"0.30.0"}

If you get Connection refused, start the server with systemctl start ollama. All examples below assume the default host (localhost) and port (11434). Override with OLLAMA_HOST and OLLAMA_PORT if you have a different setup.

Set a model variable so the rest of the lesson is simple to copy/paste:

export MODEL="llama3.2:3b"

/api/pull: Download a Model

/api/pull is the HTTP equivalent of ollama pull. It streams NDJSON (Newline Delimited JSON) progress events while the layers download. Use this when your app needs to provision a model on first run instead of requiring the user to pre-pull it.

curl http://localhost:11434/api/pull \
  -d "{\"model\": \"$MODEL\"}"

You'll see a series of JSON lines like:

{"status":"pulling manifest"}
{"status":"pulling dde5aa3fc5ff","digest":"sha256:...","total":2019377376}

// [... progress ...]

{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}

If you don't care about progress and just want to block until done, set stream: false:

curl http://localhost:11434/api/pull \
  -d "{\"model\": \"$MODEL\", \"stream\": false}"

You get a single response when the pull finishes. The tradeoff is no progress feedback, so use streaming for anything user-facing where a multi-GB download would otherwise look hung.

/api/generate: Single-Shot Completion

/api/generate takes a prompt and returns a completion without taking into consideration the context of previous interactions. Reach for it when you want stateless generation like a one-off summary, a classification, or a code rewrite.

Non-streaming for a clean response object:

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"Explain what a transformer architecture is in two sentences.\",
  \"stream\": false
}" | jq

Response (truncated):

{
  "model": "llama3.2:3b",
  "created_at": "2026-05-12T08:10:31.659858474Z",
  "response": "A transformer architecture is a type of neural network design that uses self-attention mechanisms to process sequential data, such as text or images...",
  "done": true,
  "done_reason": "stop",
  "context": [
    128006,
    // [... other tokens ...]
  ],
  "total_duration": 5946213748,
  "load_duration": 334729261,
  "prompt_eval_count": 37,
  "prompt_eval_duration": 91284365,
  "eval_count": 63,
  "eval_duration": 5389747128
}

The response object has these fields:

FieldWhat it tells you
modelWhich model produced the response. Useful when your app routes between models.
created_atTimestamp when the response completed, in RFC 3339 format.
responseThe generated text. Empty string when stream: true since each token arrived in a prior event.
donetrue on the final event, false on intermediate streaming events.
done_reasonWhy generation stopped. Common values: stop (hit a stop token or natural end), length (hit num_predict limit), load (model was just loaded, no generation happened).
contextToken IDs representing the full conversation state. Pass this back in the next /api/generate call as context to continue without resending the prompt. Deprecated in favor of /api/chat with a messages array, but still works.
total_durationWhole-request time in nanoseconds, including model load, prompt eval, and generation.
load_durationTime spent loading the model into memory in nanoseconds. Zero or tiny when the model was already resident.
prompt_eval_countNumber of tokens in your prompt. When the prompt was served from cache, this will not be exact, so don't rely on it as an exact input-token count across repeated calls.
prompt_eval_durationTime spent processing the prompt (the prefill phase) in nanoseconds.
eval_countNumber of tokens generated.
eval_durationTime spent generating tokens (the decode phase) in nanoseconds.

Tokens per second is eval_count / (eval_duration / 1e9). This is the number you actually care about when comparing models or hardware.

TPS is, indeed, the headline number for "is this model usable on this hardware". People use it in concrete ways to compare models and quantization (on the same hardware). Here's a table of TPS for some hardware and the use cases that make sense on it:

Use caseScenarioTPSHardware example
Background summarizationRAG, digest jobs, async tasks5 to 10CPU only, low-end GPU
Interactive chatStreaming responses to a user15 to 20Apple Silicon M2, RTX 3060 12GB
Code completionInline suggestions in an editor50+RTX 4090 24GB, M3 Max

Streaming (default) returns one JSON object per token chunk. Each chunk has done: false until the last one, which carries the stats:

curl http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"List three reasons to run models locally.\"
}"

You'll see something like:

{"model":"llama3.2:3b","created_at":"...","response":"Here","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" are","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" three","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" reasons","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" to","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" run","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" models","done":false}
{"model":"llama3.2:3b","created_at":"...","response":" locally","done":false}

// ...

{"model":"llama3.2:3b","created_at":"...","response":"","done":true,"total_duration":"..."}

To watch it accumulate in a terminal:

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"List 3 reasons to run models locally.\"
}" | jq -j '.response'

The -j flag tells jq to skip newlines so the tokens flow as a single stream.


/api/generate accepts a handful of extra fields that matter in practice:

  • system: a system prompt prepended to your input
  • options: model parameters such as temperature, top_p, num_ctx, num_predict, seed
  • format: "json" or a JSON schema object for structured output
  • keep_alive: how long to keep the model in memory after the call (reminder: default 5m, set "0" to unload immediately, "-1" to keep loaded indefinitely)

Example with options and a system prompt:

curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"system\": \"You answer in exactly one sentence.\",
  \"prompt\": \"What is quantization?\",
  \"options\": {\"temperature\": 0.2, \"num_predict\": 80},
  \"stream\": false
}" | jq -j '.response'

/api/chat: Multi-Turn Conversations

/api/chat accepts a messages array where each message has a role (system, user, assistant, or tool) and content. The server applies the model's chat template, so you don't hand-build the prompt. Use this when you want conversation history, system prompts handled cleanly, multimodal input, or tool calling.

To get a non-streaming chat with a system prompt, you can use an array of 2 messages:

curl -s http://localhost:11434/api/chat -d "{
  \"model\": \"$MODEL\",
  \"messages\": [
    {\"role\": \"system\", \"content\": \"You are a terse Linux assistant. You answer in one sentence.\"},
    {\"role\": \"user\", \"content\": \"How do I find files larger than 100MB?\"}
  ],
  \"stream\": false
}" | jq

The response will look something like:

{
  "model": "llama3.2:3b",
  "created_at": "2026-05-12T08:50:25.218995078Z"

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.