Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Core Concepts: From Tokens and Embeddings to Quantization and KV Cache

Quantization: Trade Precision You Don't Need for Memory You Do

Let's start with some basic facts about floating point numbers.

CPU-oriented, high-level languages like Python default to 64-bit floats because general computing prioritizes accuracy. This means that when you write a decimal number like 0.123456789, it is stored as a 64-bit float with about 15 to 17 significant digits of precision.

GPUs and ML frameworks, on the other hand, default to 32-bit floats because graphics and neural networks don't need that much precision, and 32-bit is faster and uses less memory. AI runs on GPUs, so it inherited the 32-bit default.
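To make the difference concrete, here is a minimal sketch that rounds a Python float (64-bit) to 32-bit and back using only the standard library. The helper name `to_float32` is made up for this illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 0.123456789
x32 = to_float32(x)

print(f"64-bit: {x:.17g}")
print(f"32-bit: {x32:.17g}")
print(f"rounding error: {abs(x - x32):.3g}")
```

The 32-bit value agrees with the original for roughly the first 7 to 8 significant digits and then drifts, which is the precision graphics and neural networks have historically been happy with.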

Figure: CPU vs GPU

With that in mind, let's talk about quantization.

Quantization is the practice of cutting below that 32-bit default to save memory and speed up inference. Instead of storing each weight in 32 bits, you store it in 16, 8, or even 4. Fewer bits mean fewer possible values per weight, so the number gets rounded to the nearest one the format can represent. You lose precision, but you gain the ability to fit a model that used to need a server into a laptop or a phone.

(i) In reality, people didn't conclude that 32-bit precision was useless. They discovered that trained neural networks tolerate rounding errors well during inference. Cutting precision saves memory roughly in proportion to the bit reduction, with surprisingly small quality loss down to about 4 bits per weight. Below that, quality degrades fast.

Today, most modern open-weight language models are released at 16-bit floating point precision, almost always BF16. Each weight is stored with only about 2 to 3 digits of precision (think 0.78 or 0.783).
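You can see the 2-to-3-digit figure for yourself by emulating BF16 in plain Python. BF16 keeps the sign and the 8 exponent bits of a 32-bit float but only 7 stored mantissa bits, so it is effectively the top 16 bits of a 32-bit float. The sketch below assumes round-to-nearest-even and ignores NaN and overflow handling; `to_bf16` is a name invented for this example, not a library function.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the float32 encoding.

    Simplified sketch: round-to-nearest-even, no special handling of NaN/inf.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

print(to_bf16(0.783))   # 0.78125
print(to_bf16(1.0))     # 1.0
```

A weight of 0.783 comes back as 0.78125: the first two digits survive and the third is already off.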

Figure: Bits vs precision

In other words, a BF16 model has already been heavily rounded compared to a full-precision (32-bit) version; quantization just pushes that compression further.

At 16-bit precision, an 8B parameter model takes about 16 GB on disk and roughly the same amount in RAM or VRAM to run (8 billion weights times 2 bytes per weight). Most people running models locally don't have 16 GB of spare VRAM, so the weights are compressed further.
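The arithmetic generalizes to any bit width. Here is a back-of-the-envelope sketch, assuming the weights dominate memory use (it ignores the KV cache and runtime overhead); `weights_size_gb` is a hypothetical helper name.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at different precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weights_size_gb(8e9, bits):.0f} GB")
```

For an 8B model this prints 32, 16, 8, and 4 GB. Real quantized files come out somewhat larger than the pure bit count suggests.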

Quantization rounds each weight to a smaller number of bits. 4-bit quantization, for example, cuts the file size to roughly a quarter of BF16: an 8B model goes from 16 GB to about 4 to 5 GB. The model gets dumber as you compress harder, but the curve is not linear: there's a flat region near the top where quality barely changes, then a cliff.
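To show what "rounding each weight to a smaller number of bits" means mechanically, here is a deliberately simplified sketch of symmetric (absmax) quantization: one shared scale per group of weights, with each weight stored as a small signed integer. This is an illustration of the idea only, not the actual scheme used by GGUF K-quants, and the function names are made up for the example.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:                          # all-zero group: any scale works
        scale = 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.70, 0.03, -0.28]
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)

print("stored ints:", q)
for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.3f} (error {abs(w - r):.3f})")
```

Each restored weight lands within half a step (`scale / 2`) of the original. With fewer bits the steps get wider, which is where the quality loss comes from.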

(i) The more precisely each weight is stored, the more accurately the model can reproduce the exact relationships it learned during training, but the cost is more memory and slower inference. Quantization trades a small amount of that precision for a much smaller file and faster inference.

This table lists some of the GGUF quantization options you can choose from when downloading or running a local model in tools like Ollama, llama.cpp, or LM Studio. The file sizes are approximate and apply to an 8B model (8 billion parameters).

| Quant | File size (8B model) | Quality vs original | When to use |
|---|---|---|---|
| Q8_0 | ~8.5 GB | Near-lossless | You have the memory and want max quality |
| Q6_K | ~6.6 GB | Very close to Q8 | Sweet spot for quality-conscious users |
| Q5_K_M | ~5.7 GB | Slight degradation | Good balance |
| Q4_K_M | ~4.9 GB | Noticeable but usable | Default recommendation in most tools |
| Q3_K_M | ~4.0 GB | Visible quality drop | Tight memory budgets |
| Q2_K | ~3.2 GB | Significant quality loss, can break on smaller models | Last resort, test heavily |
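One way to turn the table into a decision is to pick the largest quant that fits your free memory. The sketch below hard-codes the approximate 8B file sizes from the table; the helper `pick_quant` and the 1.5 GB headroom reserved for context and runtime overhead are assumptions made for illustration, not values from any tool.

```python
# Approximate file sizes (GB) for an 8B model, copied from the table above.
QUANT_SIZES_GB = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q3_K_M": 4.0,
    "Q2_K": 3.2,
}

def pick_quant(free_memory_gb: float, headroom_gb: float = 1.5):
    """Return the largest quant that fits, or None if nothing does.

    headroom_gb is an assumed reserve for context and runtime overhead.
    """
    budget = free_memory_gb - headroom_gb
    by_size = sorted(QUANT_SIZES_GB.items(), key=lambda kv: kv[1], reverse=True)
    for name, size_gb in by_size:
        if size_gb <= budget:
            return name
    return None

print(pick_quant(8))    # Q5_K_M
print(pick_quant(12))   # Q8_0
print(pick_quant(4))    # None
```

Treat the result as a starting point: the right headroom depends on your context length and on whatever else is using the memory.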
