Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Core Concepts: From Tokens and Embeddings to Quantization and KV Cache

What Are Inference Parameters?

Inference parameters are settings that control how the model picks its next token from the scores it produces. They don't change the model's weights or what it knows. They only change how the output is sampled from what's already there.

The most common ones are:

  • Temperature: randomness dial
  • Top k: candidate cutoff
  • Top p: probability cutoff
  • Frequency and presence penalties: reduce repetition
  • Seed: reproducible output

The neural network does its full work, layer by layer, regardless of what your inference parameters are set to. Every layer multiplies and every weight gets used. The output of the last layer is a list of scores, one for every token in the vocabulary. That part is fixed: the same prompt and the same weights produce the same scores (apart from tiny floating-point differences between hardware setups). Inference parameters kick in just after that. They're the rules the runtime follows when picking one token from the score list.

Layers think, inference parameters choose

(i) Layers do the thinking while the inference parameters do the choosing.

Depending on the model provider, you can set some or all of these parameters in different ways:

  • For Ollama, you set them per-model in a Modelfile (PARAMETER lines), per-request through the API (the options object), or interactively inside an ollama run session (with /set parameter).

  • For LM Studio, you set them in the right-hand sidebar of the chat UI or in the Server tab when running as an OpenAI-compatible API. Settings persist per-model.

  • For llama.cpp, you pass them as command-line flags to the llama-cli or llama-server binary. When using the server, you can also override per-request through the OpenAI-compatible endpoint.

  • For OpenAI and similar providers (Anthropic, Mistral La Plateforme, Together, Groq, etc.), you set them per-request in the JSON body.
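
To make the per-request route concrete, here is a minimal Python sketch of a request body for Ollama's /api/generate endpoint, with the sampling settings under "options". The model name llama3.2, the default local port, and the helper names build_request and generate are assumptions for illustration; substitute whatever model you have pulled.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.2", **options):
    """Build the JSON body for one generation request.

    Sampling settings go under "options" and apply to this request only.
    """
    return {
        "model": model,      # placeholder model name, not from the text
        "prompt": prompt,
        "stream": False,     # ask for one JSON reply instead of a stream
        "options": options,
    }


def generate(payload):
    """POST the payload to the local server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


payload = build_request(
    "The cat sat on the",
    temperature=0.3,
    top_p=0.9,
    seed=42,
)
print(json.dumps(payload, indent=2))
# With the server running, send it with: print(generate(payload))
```

Nothing is sent until generate is called, so you can inspect the payload first. Other providers use different field names and layouts for the same settings, so treat this shape as specific to Ollama.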

Let's quickly understand the most important parameters.

Temperature

Temperature controls how randomly the model picks its next token.

  • At low temperature (0.1-0.3), the model almost always picks the highest-scoring token, so the output is focused and predictable.
  • At higher temperature (above 1.0), the score gaps get flattened, so lower-ranked tokens get a real chance, and the output becomes more varied and "creative".

Mathematically, temperature rescales the model's scores before they are turned into probabilities. Here's what happens under the hood:

  • For each possible next token, the model produces a score indicating how likely that token is to come next. At this stage, the scores are just raw numbers, not yet probabilities.

  • These scores are divided by the temperature value. Dividing by a low temperature (like 0.3) stretches the differences between scores and makes the best option stand out dramatically. Dividing by a high temperature (like 1.8) shrinks the differences and makes all options look more similar. At exactly 1.0, the scores are left unchanged.

Say you prompt the model with "The cat sat on the" and the scores look like this:

mat:    6.0
floor:  5.0
couch:  4.0
roof:   2.0

At temperature = 0.3 (low), divide each score by 0.3:

mat:    20.0
floor:  16.7
couch:  13.3
roof:   6.7

The gaps widen. mat wins almost every time.

At temperature = 1.8 (high), divide each score by 1.8:

mat:    3.3
floor:  2.8
couch:  2.2
roof:   1.1

mat is still the favorite, but it only wins about half the time. floor, couch, and even roof get picked occasionally.

  • The adjusted scores are then converted into probabilities that add up to 100% (this conversion is the softmax function). This is where the "sharpening" or "flattening" effect becomes visible: with low temperature, one token ends up with nearly all the probability. With high temperature, even unlikely tokens get a meaningful share.

Temperature visualization

  • Finally, the model picks a token by sampling from this distribution.
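
The divide-then-normalize steps above can be sketched in a few lines of Python. This is an illustration rather than any runtime's actual code: the function name temperature_probs is made up for this example, and the scores are the four from the cat example.

```python
import math


def temperature_probs(scores, temperature):
    """Divide each score by the temperature, then apply softmax."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    # Subtracting the largest scaled score keeps exp() numerically stable
    # and does not change the resulting probabilities.
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


scores = {"mat": 6.0, "floor": 5.0, "couch": 4.0, "roof": 2.0}

for t in (0.3, 1.0, 1.8):
    probs = temperature_probs(scores, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

With these scores, mat ends up with roughly 96% of the probability at temperature 0.3, about 66% at 1.0, and about 50% at 1.8, while roof climbs from effectively zero to around 5%.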

Without diving too deep into the math, here's the intuition:

  • Low temperature (close to 0): The model becomes more conservative. It strongly prefers the most likely token at each step, making responses focused and predictable. The same prompt with the same low temperature will produce nearly identical outputs.

  • Medium temperature (around 1.0): The model samples from its learned distribution more or less as is. Responses feel balanced and coherent, with some variation in phrasing and word choice. Defaults tend to sit in this range (0.8 in Ollama, 1.0 in the OpenAI API) for a reason: it works well for most domains.

  • High temperature (roughly 1.5 to 1.8): The distribution flattens and even unlikely tokens get considered. Responses become creative, surprising, and sometimes chaotic. The same prompt can produce wildly different outputs.

  • Very high temperature (above 1.8): Depending on the model, the output may become nonsensical or grammatically broken. High temperature can amplify hallucinations and off-topic responses, so it's rarely useful to go this high.
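
To see this intuition in numbers, the sketch below draws 1,000 tokens from the cat example's scores at several temperatures and counts the picks. It is a toy simulation, not a model run: sample_tokens is a hypothetical helper, and treating temperature 0 as "always take the top token" is a convention assumed here (dividing by zero is undefined). The fixed seed also shows how the seed parameter makes sampling repeatable.

```python
import math
import random
from collections import Counter

SCORES = {"mat": 6.0, "floor": 5.0, "couch": 4.0, "roof": 2.0}


def sample_tokens(scores, temperature, n, seed):
    """Draw n tokens from temperature-scaled scores using a seeded RNG."""
    rng = random.Random(seed)  # same seed -> same sequence of picks
    tokens = list(scores)
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy (argmax) decoding.
        best = max(tokens, key=scores.get)
        return [best] * n
    peak = max(scores.values())
    # Unnormalized softmax weights; random.choices normalizes them itself.
    weights = [math.exp((scores[t] - peak) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)


for temperature in (0, 0.2, 1.0, 1.8):
    picks = sample_tokens(SCORES, temperature, n=1000, seed=42)
    print(temperature, Counter(picks).most_common())
```

At 0 and 0.2 the counts are dominated by mat. At 1.8 the other tokens take a substantial share. Rerunning with the same seed reproduces the same picks, while a different seed changes them.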

Top P

Temperature is not the only way to control randomness. Most runtimes also expose a parameter called top_p, known as nucleus sampling. It controls randomness too, but in a different way.

Where temperature reshapes the entire probability distribution, top_p simply trims it. It works like this:

  1. The model sorts all candidate tokens from most likely to least likely.

  2. It adds them to a "pool" one by one, starting with the most likely, until their combined probability reaches the top_p threshold.

  3. Everything outside that pool is thrown away.

  4. The model samples from the remaining tokens.

So top_p is less of a dial and more of a cutoff. It says: "only consider the most likely tokens that together account for X% of the probability, ignore the rest".

Let's go back to the "cat" example. The model is picking the next word after "The cat sat on the ___". The candidates and their probabilities look like this:

  • mat (45%)
  • couch (20%)
  • floor (15%)
  • roof (8%)
  • fence (6%)
  • moon (4%)
  • Tuesday (2%)

With top_p = 0.9, the model keeps adding candidates until their combined probability reaches 90%:

  • mat = 45%
  • + couch = 65%
  • + floor = 80%
  • + roof = 88%
  • + fence = 94% (just crossed the 90% threshold, so stop)
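
The four top_p steps and the running total above can be sketched in Python as follows. The function name top_p_filter is made up for this example, and rescaling the surviving probabilities so they sum to 1 before sampling is an assumption about how "sample from the remaining tokens" is carried out.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p."""
    # Step 1: sort candidates from most likely to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Step 2: add tokens to the pool until the running total reaches top_p.
    pool = []
    cumulative = 0.0
    for token, p in ranked:
        pool.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # Step 3: everything after this point is dropped.

    # Step 4 (assumed detail): rescale the pool so it sums to 1 for sampling.
    total = sum(p for _, p in pool)
    return {token: p / total for token, p in pool}


probs = {
    "mat": 0.45, "couch": 0.20, "floor": 0.15, "roof": 0.08,
    "fence": 0.06, "moon": 0.04, "Tuesday": 0.02,
}

nucleus = top_p_filter(probs, top_p=0.9)
print(list(nucleus))  # ['mat', 'couch', 'floor', 'roof', 'fence']
```

With top_p = 0.9 the pool stops at fence, so moon and Tuesday can never be picked, no matter how the random draw goes.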
