Core Concepts: From Tokens and Embeddings to Quantization and KV Cache
What Are Inference Parameters?
Inference parameters are settings that control how the model picks its next token from the scores it produces. They don't change the model's weights or what it knows. They only change how the output is sampled from what's already there.
The most common ones are:
- Temperature: randomness dial
- Top k: candidate cutoff
- Top p: probability cutoff
- Frequency and presence penalties: reduce repetition
- Seed: reproducible output
The neural network does its full work, layer by layer, regardless of how your inference parameters are set. Every layer runs and every weight gets used. The output of the last layer is a list of scores, one for every token in the vocabulary. That part is fixed: the same prompt and the same weights produce the same scores. Inference parameters kick in just after that. They're the rules the runtime follows when picking one token from the score list.
[Figure: Layers think, inference parameters choose]
Note: Layers do the thinking while the inference parameters do the choosing.
Depending on the model provider, you can set some or all of these parameters in different ways:
- For Ollama, you set them per-model in a Modelfile, per-request through the API, or live in `ollama run`.
- For LM Studio, you set them in the right-hand sidebar of the chat UI or in the `Server` tab when running as an OpenAI-compatible API. Settings persist per-model.
- For llama.cpp, you pass them as command-line flags to the `llama-cli` or `llama-server` binary. When using the server, you can also override per-request through the OpenAI-compatible endpoint.
- For OpenAI and similar providers (Anthropic, Mistral La Plateforme, Together, Groq, etc.), you set them per-request in the JSON body.
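As a rough illustration of the per-request route, the sketch below builds a request body for Ollama's local `/api/generate` endpoint, where sampling settings go under `options`. The model name `llama3.2`, the prompt, the parameter values, and the helper names `build_payload` and `generate` are placeholders chosen for this example, and the default local port 11434 is assumed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port


def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    """Build a request body with per-request sampling settings."""
    return {
        "model": model,      # placeholder model name
        "prompt": prompt,
        "stream": False,     # ask for one JSON response instead of a stream
        "options": {         # illustrative values, not recommendations
            "temperature": 0.7,
            "top_k": 40,
            "top_p": 0.9,
            "seed": 42,
        },
    }


def generate(prompt: str) -> str:
    """Send the request. Needs a running Ollama server, so it is not called here."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


payload = build_payload("The cat sat on the")
print(json.dumps(payload, indent=2))
```

Only the payload is printed here. Calling `generate(...)` would need the server running and the model pulled locally.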
Let's quickly understand the most important parameters.
Temperature
Temperature controls how randomly the model picks its next token.
- At low temperature (0.1-0.3), the model almost always picks the highest-scoring token, so the output is focused and predictable.
- At high temperature (0.8-1.5), the score gaps get flattened, so lower-ranked tokens get a real chance, and the output becomes more varied and "creative".
Mathematically, temperature reshapes how the model picks its next token. Here's what happens under the hood:
- For each possible next token, the model produces a score indicating how likely that token is to come next. At this stage, the scores are just raw numbers, not yet probabilities.
- These scores are divided by the `temperature` value. A low `temperature` (like 0.3) exaggerates the differences between scores and makes the best option stand out dramatically. A high `temperature` (like 1.8) shrinks the differences and makes all options look more similar.
Say you prompt the model with "The cat sat on the" and the scores look like this:
mat: 6.0
floor: 5.0
couch: 4.0
roof: 2.0
At temperature = 0.3 (low), divide each score by 0.3:
mat: 20.0
floor: 16.7
couch: 13.3
roof: 6.7
The gaps widen. mat wins almost every time.
At temperature = 1.8 (high), divide each score by 1.8:
mat: 3.3
floor: 2.8
couch: 2.2
roof: 1.1
The gaps shrink. mat is still the favorite, but once these scores are turned into probabilities it wins only about half the time. floor and couch get picked often, and even roof shows up occasionally.
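The division step in this example can be reproduced with a few lines of Python. This is a minimal sketch: the `scale` helper is a name made up for illustration, and the scores are the toy values from the example, not real model output.

```python
scores = {"mat": 6.0, "floor": 5.0, "couch": 4.0, "roof": 2.0}


def scale(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide every raw score by the temperature."""
    return {token: score / temperature for token, score in scores.items()}


for temperature in (0.3, 1.8):
    scaled = scale(scores, temperature)
    gap = scaled["mat"] - scaled["floor"]  # distance between the top two tokens
    rounded = {token: round(value, 1) for token, value in scaled.items()}
    print(f"temperature={temperature}: {rounded} (mat-floor gap: {gap:.2f})")
```

The gap between the top two tokens is what matters for the next step: the larger it is, the more probability the favorite ends up with.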
- The adjusted scores are then converted into probabilities that add up to 100%. This is where the "sharpening" or "flattening" effect becomes visible: with low `temperature`, one token ends up with nearly all the probability. With high `temperature`, even unlikely tokens get a meaningful share.
[Figure: Temperature visualization]
- Finally, the model picks a token by sampling from this distribution.
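The full sequence above (divide by temperature, convert to probabilities, sample) can be sketched as follows. The conversion used here is the standard softmax function. The function names and the fixed seed are choices made for this illustration, and the scores are the toy values from the cat example.

```python
import math
import random

scores = {"mat": 6.0, "floor": 5.0, "couch": 4.0, "roof": 2.0}


def softmax_with_temperature(scores, temperature):
    """Scale scores by temperature, then normalize them into probabilities."""
    scaled = {token: s / temperature for token, s in scores.items()}
    top = max(scaled.values())  # subtracting the max keeps exp() numerically stable
    exps = {token: math.exp(s - top) for token, s in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(42)  # fixed seed so repeated runs give the same picks
for temperature in (0.3, 1.8):
    probs = softmax_with_temperature(scores, temperature)
    picks = [sample(probs, rng) for _ in range(1000)]
    share = picks.count("mat") / len(picks)
    rounded = {t: round(p, 3) for t, p in probs.items()}
    print(f"temperature={temperature}: {rounded}, mat picked {share:.0%} of the time")
```

Running it prints the probability of each token at both temperatures, plus how often mat was actually drawn over 1,000 samples.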
Without diving too deep into the math, here's the intuition:
- Low `temperature` (close to 0): The model becomes more conservative. It strongly prefers the most likely token at each step, making responses focused and predictable. The same prompt with the same low `temperature` will produce nearly identical outputs.
- Medium `temperature` (around 1.0): The model samples naturally from its learned distribution. Responses feel balanced and coherent, with some variation in phrasing and word choice. Defaults usually sit at or near this value for a reason: it works well for most domains.
- High `temperature` (above 1.5): The model flattens the distribution and considers even unlikely tokens. Responses become creative, surprising, and sometimes chaotic. The same prompt can produce wildly different outputs.
- Very high `temperature` (above 1.8): Depending on the model, the output may become nonsensical or grammatically broken. High `temperature` can amplify hallucinations and off-topic responses, so it's rarely useful to go this high.
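To put rough numbers on these ranges, the sketch below computes how much probability the top token receives at several temperatures, using the toy scores from the cat example. The helper `top_token_probability` is made up for illustration. Treating a temperature of exactly 0 as "always pick the top token" is an assumption about how runtimes commonly avoid dividing by zero, not something every runtime is guaranteed to do.

```python
import math

scores = {"mat": 6.0, "floor": 5.0, "couch": 4.0, "roof": 2.0}


def top_token_probability(scores, temperature):
    """Probability of the highest-scoring token at a given temperature."""
    if temperature == 0:
        return 1.0  # assumption: temperature 0 is treated as greedy (argmax) decoding
    best = max(scores.values())
    # Softmax written relative to the best score, so the top token's numerator is 1.
    denominator = sum(math.exp((s - best) / temperature) for s in scores.values())
    return 1.0 / denominator


for temperature in (0, 0.3, 1.0, 1.5, 1.8):
    p = top_token_probability(scores, temperature)
    print(f"temperature={temperature}: top token probability = {p:.2f}")
```

The exact values depend entirely on the scores. With a real model and a vocabulary of tens of thousands of tokens, the same trend holds, but the numbers will differ.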
Top P
`temperature` is not the only way to control randomness. Most runtimes and APIs also expose a parameter called `top_p`, also known as nucleus sampling. It controls randomness too, but in a different way.
Where temperature reshapes the entire probability distribution, top_p simply trims it. It works like this:
- The model sorts all candidate tokens from most likely to least likely.
- It adds them to a "pool" one by one, starting with the most likely, until their combined probability reaches the `top_p` threshold.
- Everything outside that pool is thrown away.
- The model samples from the remaining tokens.
So top_p is less of a dial and more of a cutoff. It says: "only consider the most likely tokens that together account for X% of the probability, ignore the rest".
Let's go back to the "cat" example. The model is picking the next word after The cat sat on the ___. The candidates and their probabilities look like this:
mat: 45%
couch: 20%
floor: 15%
roof: 8%
fence: 6%
moon: 4%
Tuesday: 2%
With top_p = 0.9, the model keeps adding candidates until their combined probability reaches 90%:
mat: 45%
+ couch: 65%
+ floor: 80%
+ roof: 88%
+ fence: 94% (just crossed the 90% threshold, stop)
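The running total above can be turned into a small filter. This is a minimal sketch using the example's probabilities. The `top_p_filter` name is made up for illustration, and rescaling the surviving probabilities so they sum to 1 is the usual convention for nucleus sampling, assumed here and not taken from the text.

```python
import random

probs = {
    "mat": 0.45, "couch": 0.20, "floor": 0.15, "roof": 0.08,
    "fence": 0.06, "moon": 0.04, "Tuesday": 0.02,
}


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p."""
    pool = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        pool[token] = p
        cumulative += p
        if cumulative >= top_p:  # threshold crossed: stop adding candidates
            break
    # Assumed convention: rescale the survivors so they add up to 1 again.
    return {token: p / cumulative for token, p in pool.items()}


pool = top_p_filter(probs, top_p=0.9)
print(pool)  # moon and Tuesday are no longer candidates

rng = random.Random(0)  # fixed seed so repeated runs give the same picks
tokens = list(pool)
print(rng.choices(tokens, weights=[pool[t] for t in tokens], k=5))
```

With `top_p=0.9`, the pool holds the five tokens from the running total (mat through fence), and sampling happens only among them.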


