Core Concepts: From Tokens and Embeddings to Quantization and KV Cache
What Is a Neural Network?
A neural network is a system that turns input numbers into output numbers by passing them through a stack of layers. Each layer does a simple math step, and the output of one layer becomes the input to the next.
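To make the "stack of layers" idea concrete, here is a minimal sketch in plain Python. It assumes each layer computes a weighted sum followed by a ReLU (negative values clipped to zero), which is one common choice rather than the only one. The weights, biases, and the names `layer`, `forward`, and `LAYERS` are made up for illustration.

```python
def layer(inputs, weights, biases):
    """One layer: a weighted sum per output value, followed by ReLU."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU: negative results become 0
    return outputs


def forward(inputs, layers):
    """Pass the numbers through each layer in order."""
    values = inputs
    for weights, biases in layers:
        # The output of one layer becomes the input to the next.
        values = layer(values, weights, biases)
    return values


# Hypothetical hand-picked parameters: 3 inputs -> 2 hidden values -> 1 output.
LAYERS = [
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, 1.0]),
    ([[2.0, 1.0]], [0.0]),
]

result = forward([1.0, 2.0, 3.0], LAYERS)
print(result)  # [4.0]
```

A real model follows the same pattern, only with far more layers and with weights learned during training instead of typed in by hand.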
Everything a neural network sees has to be numbers first:
- Text gets split into tokens and each token is mapped to its token ID.
- Images become numbers for each pixel.
- Audio becomes numbers for each slice of sound.
- And so on.
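As a rough illustration of the text case, the sketch below maps words to token IDs using a tiny made-up vocabulary. It is a simplification: real tokenizers split text into subword pieces rather than whole words, and their vocabularies hold tens of thousands of entries or more. `VOCAB`, `encode`, and `decode` are hypothetical names.

```python
# Hypothetical toy vocabulary; ID 0 is reserved for unknown words.
VOCAB = {"<unk>": 0, "the": 1, "capital": 2, "of": 3, "france": 4, "is": 5, "paris": 6}


def encode(text):
    """Lowercase the text, split on whitespace, and look up each token ID."""
    return [VOCAB.get(piece, VOCAB["<unk>"]) for piece in text.lower().split()]


def decode(token_ids):
    """Map token IDs back to their text pieces."""
    id_to_token = {token_id: token for token, token_id in VOCAB.items()}
    return " ".join(id_to_token[token_id] for token_id in token_ids)


ids = encode("The capital of France is")
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # the capital of france is
```

The network itself only ever sees the list of IDs; the mapping back to text happens outside the model.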
The network takes those numbers in at one end, transforms them stage by stage, and produces a prediction at the other end. For a language model, the input is your prompt (the text you typed), and the output is a score for every possible next token (how likely each one is to come next).
For example, if you type "The capital of France is", the model converts that into tokens, runs them through every layer, and produces scores for the entire vocabulary:
- The token "Paris" might get a score of 0.92,
- "the" might get 0.01,
- "banana" might get 0.0001,
- and so on for every token the model knows.
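Scores like these behave as probabilities: each is between 0 and 1, and together they sum to 1. Language models typically get them by applying the softmax function to the raw numbers that come out of the last layer (often called logits). The sketch below shows that step on three made-up raw scores; the values and the three-token vocabulary are assumptions for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    peak = max(logits.values())  # subtracting the max avoids overflow in exp()
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical raw scores for a three-token vocabulary.
logits = {"Paris": 6.0, "the": 1.5, "banana": -3.0}
probs = softmax(logits)
best = max(probs, key=probs.get)

for token, p in probs.items():
    print(f"{token}: {p:.4f}")
print("most likely:", best)
```

Softmax keeps the ranking of the raw scores but widens the gaps, so a clearly best token ends up with most of the probability.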
The runtime then picks one token based on those scores (in the simplest case the highest-scoring one; sampling parameters such as temperature can make it choose a less likely token), appends it to your text, and feeds the whole thing back in to predict the token after that.
Note: Common runtimes for local models include llama.cpp (the most widely used engine) and tools built on top of it like Ollama and LM Studio.
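The predict, pick, append cycle described above can be sketched in a few lines. This is not how llama.cpp or any other runtime is implemented: `FAKE_MODEL` is a hypothetical lookup table standing in for a real network, it looks only at the last token, and the loop always takes the highest-scoring token (greedy decoding) with no sampling.

```python
# Hypothetical stand-in for a model: maps the last token to next-token scores.
# A real model considers the whole sequence and scores its entire vocabulary.
FAKE_MODEL = {
    "is": {"Paris": 0.92, "the": 0.01, "banana": 0.0001},
    "Paris": {".": 0.95, "the": 0.02},
    ".": {"<end>": 0.99},
}


def predict_scores(tokens):
    """Return next-token scores for the sequence so far."""
    return FAKE_MODEL.get(tokens[-1], {"<end>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly predict, pick the top token, and append it to the sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = predict_scores(tokens)
        next_token = max(scores, key=scores.get)  # greedy: highest score wins
        if next_token == "<end>":  # assumed end-of-text marker
            break
        tokens.append(next_token)  # the longer sequence is fed back in
    return tokens


output = generate(["The", "capital", "of", "France", "is"])
print(" ".join(output))  # The capital of France is Paris .
```

Swapping the `max(...)` line for a random draw weighted by the scores would turn this into sampling, which is roughly what parameters like temperature control.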
That loop is how a model writes a full sentence, one token at a time:
You ask it:
Who is the creator of the Linux kernel?

