Core Concepts: From Tokens and Embeddings to Quantization and KV Cache
What Is GGUF and Why Does It Exist?
Before GGUF, running a model meant juggling several files, including but not limited to:
Tokenizer
The rules for splitting text into tokens and mapping them to IDs.
Model's Settings
The architecture details (layer count, hidden size (aka dimension), and other parameters).
Generation Defaults
The default sampling parameters (temperature, top_p, etc.) the model ships with.
Chat Template
A chat model needs to know who said what. The chat template is the formatting recipe that labels each message as coming from the user, the assistant, or the system (the hidden instructions that set the model's behavior). Different model families use different recipes, kind of like how letters, emails, and text messages all communicate but follow different conventions.
Example of a template that has 3 parts:
<system>
{{ system_message }}
</system>
{% for message in messages %}
<{{ message.role }}>
{{ message.content }}
</{{ message.Local AI Engineering with Ollama
Run, understand, customize, fine-tune, and build agentic apps on your own hardwareEnroll now to unlock all content and receive all future updates for free.
