Concurrency: Parallel Requests and the Queue
54%
Choosing the Right Number of Parallel Slots
The right value depends on memory headroom and how much per-request latency you can tolerate. The math is the same in both cases: weights stay constant, and KV cache scales linearly with OLLAMA_NUM_PARALLEL × num_ctx.
On GPU, VRAM is the hard limit, and compute scales reasonably well up to a point. Rough starting points:
| VRAM | Typical hardware | Starting slots |
|---|---|---|
| 8 GB | Consumer cards, small models only | 1 to 2 |
| 16 GB | Mid-range cards, 7B-8B at Q4 | 2 to 4 |
| 24 GB | RTX 4090, RTX 3090, 7B-8B with room | 4 to 8 |
| 48 GB+ | RTX A6000, dual GPUs, datacenter cards | 8 to 16+ |
While the numbers above are just rough estimations, you can base your decisions on testing: raise the value, restart the server, watch ollama ps:
Local AI Engineering with Ollama
Run, understand, customize, fine-tune, and build agentic apps on your own hardwareEnroll now to unlock all content and receive all future updates for free.
