Core Concepts: From Tokens and Embeddings to Quantization and KV Cache
Quantization: Trade Precision You Don't Need for Memory You Do
Let's start with some basic facts about floating-point numbers.
CPUs and high-level languages like Python default to 64-bit floats because general-purpose computing prioritizes accuracy. When you write a decimal number like 0.123456789 in Python, it is stored as a 64-bit float with about 15 to 17 significant decimal digits of precision.
GPUs and ML frameworks, on the other hand, default to 32-bit floats because graphics and neural networks don't need that much precision, and 32-bit is faster and uses less memory. AI runs on GPUs, so it inherited the 32-bit default.
[Figure: CPU vs GPU]
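To make the gap between the two defaults concrete, here is a minimal sketch that uses only Python's standard library. It round-trips the same decimal through a 32-bit float with `struct` and compares it with the 64-bit original. The helper name `to_float32` is made up for this example.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.123456789          # stored by Python as a 64-bit float
as_f32 = to_float32(value)   # the same number after a trip through 32 bits

print(f"64-bit: {value:.17g}")
print(f"32-bit: {as_f32:.17g}")
print(f"rounding error: {abs(value - as_f32):.3e}")
```

The 32-bit copy should agree with the original for roughly the first 7 significant digits and then drift, which is the precision the rest of this section starts from.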
With that in mind, let's talk about quantization.
Quantization is the practice of cutting below that 32-bit default to save memory and speed up inference. Instead of storing each weight in 32 bits, you store it in 16, 8, or even 4. Fewer bits mean fewer possible values per weight, so the number gets rounded to the nearest one the format can represent. You lose precision, but you gain the ability to fit a model that used to need a server into a laptop or a phone.
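As a rough illustration of that rounding, the sketch below applies symmetric "absmax" quantization to a handful of weights at 8 and 4 bits. This is a deliberately simplified scheme chosen for illustration, not the block-wise method real model formats use. The function names and the sample weights are made up.

```python
def quantize(weights, bits):
    """Symmetric absmax quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for 8 bits, 7 for 4 bits
    # One scale for the whole list; fall back to 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    codes = [round(w / scale) for w in weights]  # nearest representable step
    return codes, scale

def dequantize(codes, scale):
    """Turn integer codes back into approximate float weights."""
    return [c * scale for c in codes]

weights = [0.7831, -0.1204, 0.0457, -0.9912, 0.3310]  # illustrative values

for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}  max error: {worst:.4f}")
```

With fewer bits there are fewer integer codes to choose from, so the worst-case rounding error grows as the bit width shrinks.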
(i) In reality, people didn't conclude that 32-bit precision was useless. They discovered that trained neural networks tolerate rounding errors well during inference. Cutting precision saves memory roughly in proportion to the bit reduction, with surprisingly small quality loss down to about 4 bits per weight. Below that, quality degrades fast.
Today, most modern open-weight language models are released at 16-bit floating-point precision, almost always BF16. Each weight is stored with only about 2 to 3 significant decimal digits of precision (think 0.78 or 0.783).
[Figure: Bits vs precision]
The model has already been heavily rounded compared to its full-precision form; quantization just continues that compression further.
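To get a feel for how coarse BF16 is, here is a small sketch that rounds a number to BF16 by keeping only the top 16 bits of its 32-bit float representation (with round-to-nearest-even on the dropped bits). The helper `to_bf16` is written for this example and assumes finite inputs; it is not taken from any framework.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value, returned as a Python float.

    BF16 is the upper 16 bits of a 32-bit float: 1 sign bit, 8 exponent
    bits, 7 stored mantissa bits. Assumes x is finite (NaN is not handled).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for w in (0.78, 0.783, 0.7831):
    print(f"{w} -> {to_bf16(w)}")
```

All three inputs come back as 0.78125: at this magnitude BF16 cannot tell them apart, which is what "2 to 3 digits" means in practice.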
At 16-bit precision, an 8B parameter model takes about 16 GB on disk and roughly the same amount in RAM or VRAM to run (8 billion weights times 2 bytes per weight). Most people running models locally don't have 16 GB of spare VRAM, so the weights are compressed further.
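The arithmetic generalizes to any bit width. The sketch below is a back-of-the-envelope estimate for the weights alone, assuming exactly 8 billion parameters and 1 GB = 10^9 bytes. The function name is made up, and runtime overhead such as activations and context is deliberately ignored.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(8e9, bits):.0f} GB")
```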
Quantization rounds each weight to a smaller number of bits. 4-bit quantization, for example, cuts the file size to roughly a quarter of BF16: an 8B model goes from 16 GB to about 4 to 5 GB. The model gets dumber as you compress harder, but the curve is not linear: there's a flat region near the top where quality barely changes, then a cliff.
(i) The more precisely each weight is stored, the more accurately the model can reproduce the exact relationships it learned during training, but the cost is more memory and slower inference. Quantization trades a small amount of that precision for a large reduction in size and a gain in speed.
This table lists some GGUF quantization options you can choose from when downloading or running a local model in tools like Ollama, llama.cpp, or LM Studio. The numbers are approximate and are for an 8B model (8 billion parameters).
| Quant | File size (8B model) | Quality vs original | When to use |
|---|---|---|---|
| Q8_0 | ~8.5 GB | Near-lossless | You have the memory and want max quality |
| Q6_K | ~6.6 GB | Very close to Q8 | Sweet spot for quality-conscious users |
| Q5_K_M | ~5.7 GB | Slight degradation | Good balance |
| Q4_K_M | ~4.9 GB | Noticeable but usable | Default recommendation in most tools |
| Q3_K_M | ~4.0 GB | Visible quality drop | Tight memory budgets |
| Q2_K | ~3.2 GB | Significant quality loss, can break on smaller models | Last resort, test heavily |
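The file sizes in the table can be turned back into an effective number of bits per weight, which is a handy way to compare options. The sketch below uses the approximate sizes from the table and assumes exactly 8 billion parameters and 1 GB = 10^9 bytes; the names are made up for this example.

```python
PARAMS = 8e9      # assumption: exactly 8 billion weights
BF16_GB = 16.0    # 16-bit baseline from the text above

# Approximate file sizes copied from the table, in GB.
sizes_gb = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.9, "Q3_K_M": 4.0, "Q2_K": 3.2,
}

def effective_bits(size_gb: float, params: float = PARAMS) -> float:
    """Average bits stored per weight, given the file size in GB."""
    return size_gb * 1e9 * 8 / params

for name, gb in sizes_gb.items():
    print(f"{name:<7} {effective_bits(gb):.1f} bits/weight, "
          f"{gb / BF16_GB:.0%} of BF16")
```

The results come out above the nominal bit width in each name (Q8_0 lands at about 8.5 bits per weight, not 8) because these formats store scaling information alongside the rounded weights.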


