How Many Requests One Model Handles at Once

By default, a loaded model processes one request at a time. The second request waits for the first to finish before it starts. This is fine for interactive use, terrible for anything that fans out work (multiple users, parallel tool calls, batch evaluation).

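OLLAMA_NUM_PARALLEL raises that ceiling. It is a setting on the Ollama server rather than on the model itself, so set it through a systemd drop-in for the ollama service. Let's allow each loaded model to handle 4 requests at once: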

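# Create a drop-in override for the unit (run as root; this overwrites any existing override.conf)
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<'EOF'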
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
EOF

# Reload and restart
systemctl daemon-reload
systemctl restart ollama
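
Before testing, it's worth confirming that the service actually picked up the variable. The helper below is a small sketch, not part of Ollama: num_parallel_from_env is a made-up name, and it only parses the Environment= line that systemd prints for a unit.

```shell
# Read a systemd "Environment=..." line on stdin and print the value of
# OLLAMA_NUM_PARALLEL. Prints nothing if the variable is not set.
# (Hypothetical helper name; assumes values contain no spaces.)
num_parallel_from_env() {
  sed 's/^Environment=//' | tr ' ' '\n' | sed -n 's/^OLLAMA_NUM_PARALLEL=//p'
}
```

On a systemd host, pipe the unit's environment into it: systemctl show ollama --property=Environment | num_parallel_from_env. If nothing is printed, the override was not applied.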

Now the model handles up to 4 requests in parallel. Beyond that, requests queue. The extra capacity costs memory, because each parallel slot needs its own share of context. Check the output of ollama ps and you'll see an increase in the SIZE column:

MODEL=granite3.3:2b

# Run the model with an empty prompt to load it
ollama run "$MODEL" ""

# Check the size and note the increase
ollama ps
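
The SIZE increase follows from simple arithmetic: the context allocated for a loaded model grows with the number of parallel slots. The sketch below only illustrates that multiplication. The 2048-token context is an assumed example value, and effective_context is a made-up helper name, not an Ollama command.

```shell
# Effective context allocated for a model = per-request context x parallel slots.
# Both arguments are illustrative inputs, not values read from Ollama.
effective_context() {
  local ctx_tokens=$1 num_parallel=$2
  echo $((ctx_tokens * num_parallel))
}

effective_context 2048 1   # prints 2048
effective_context 2048 4   # prints 8192
```

The actual growth you see in ollama ps depends on the model and on your configured context length, so treat these numbers as a way to reason about the trend, not as a prediction.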

To understand how the user experience changes, run 4 parallel requests using this script:

cat > $HOME/test.sh << 'EOF'
#!/bin/bash
MODEL=${MODEL}
N=${N:-4}

for i in $(seq 1 $N); do
  (

