Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Creating a Fine-Tuned Model (English to SQL)

Fine-Tuning the Model

By the end of this section you will have a Granite model trained to turn an English question into a SQL query, saved to disk as a LoRA adapter. This is the full training run, from install to saved result. Exporting it to Ollama comes later.

The code targets a single GPU. A free Colab T4 or any 8 GB-plus card is enough, because QLoRA loads the base model in 4-bit and trains only a small adapter on top.

We are going to use Unsloth, an open-source tool that simplifies fine-tuning. As its docs put it, "Unsloth lets you run and train AI models on your own local hardware via an open-source UI". In this chapter we drive it from Python code rather than from the UI.

Step 1: Clone the Repo and Install Unsloth

Unsloth pulls in the right versions of PyTorch, transformers, PEFT, and TRL.

# Install git
apt update && apt install git -y

# Install uv if needed
curl -LsSf https://astral.sh/uv/0.11.13/install.sh | sh \
    && source $HOME/.local/bin/env 

# Clone the repo
git clone \
    https://github.com/eon01/LocalAIEngineeringWithOllamaCompanionToolkit.git \
    $HOME/companion/

# Change directory
cd $HOME/companion/code/finetuning

# Install packages from the lockfile
uv sync

The complete code lives in $HOME/companion/code/finetuning. We will walk through it piece by piece over the rest of the chapter.

Step 2: Load the Base Model in 4-Bit

After installing the packages using uv sync, the first thing we do is load the base model.

The base model is the starting point: a general model that already understands language and SQL, but has not yet been trained on our specific task. Fine-tuning means taking this model and nudging it toward our goal. So before anything else, we need to load it.

Where the Model Comes From

We download the base model from Hugging Face, a site that hosts models in a form you can train.

You might expect to reuse the Granite you already have in Ollama, but that copy will not work here. The Ollama version is packaged for running: its weights are compressed and frozen so the model loads fast and serves answers, not so it can learn. Training needs the source version, with the weights at their original, uncompressed precision and laid out so each one can still be adjusted. That is what lives on Hugging Face. It is the same Granite, just the build made for training instead of serving.

Which Model We Use

We use Granite for two simple reasons: it is small enough to train on a free GPU, and its license is permissive (Apache 2.0), so you can use, modify, and ship the result, including commercially.

Granite 4 comes in two versions of this size, with nearly identical names: granite-4.0-micro and granite-4.0-h-micro. They are built differently inside, and we use the first one.

  • The plain granite-4.0-micro is built the standard way, like most models you can run locally.
  • The h version uses Mamba layers, a model design that llama.cpp and Ollama support less reliably than the standard transformer design.

The standard one is the well-tested path, so every later step, including the export to Ollama, is more likely to work without surprises.

One more detail you will see in the code: we load Unsloth's copy of the model, not IBM's original. It is the same model, just repackaged by Unsloth to download faster and work cleanly with their training tool.


We load the model in 4-bit. A model is a huge pile of numbers, and normally each number is stored in a large, precise form that uses a lot of memory. Loading in 4-bit stores each number in a smaller, rougher form that takes about a quarter of the memory. That is what lets the model fit on a small GPU like the free one on Colab.

The rougher form makes the model slightly less precise, but not in a way that hurts results for our task. This memory-saving trick is the QLoRA part of the setup.
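
To put rough numbers on that saving, here is a back-of-the-envelope sketch. The 3-billion parameter count is an assumption used for illustration. The estimate covers the weights only: it ignores the small overhead that quantization adds, as well as the memory used by activations and the optimizer during training.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumed parameter count, for illustration only.
NUM_PARAMS = 3e9

for bits in (16, 4):
    print(f"{bits}-bit: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under that assumption the weights drop from about 6 GB at 16-bit to about 1.5 GB at 4-bit, which is where the "about a quarter" figure comes from.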

Here is the code that loads it:

from unsloth import FastLanguageModel

# How long each training example can be, in tokens.
# A SQL schema plus a question fits comfortably in 2048.
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-micro",
    max_seq_length = max_seq_length,
    load_in_4bit = True,   # load the compressed 4-bit form (the QLoRA part)
)

The first time you run this, it downloads the model from Hugging Face, a few gigabytes, and saves it on disk. Every run after that loads from that saved copy, so the download happens only once.

Step 3: Attach the LoRA Adapter

The base model is loaded and frozen. Now we add the small trainable part on top of it: the adapter. This is the only piece that learns during training.

The function below adds the adapter to the model. Its settings do two things:

  • They set the size of the adapter
  • They choose where inside the model it attaches

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                # adapter size
    lora_alpha = 16,       # adapter strength
    lora_dropout = 0,
    bias = "none",
    target_modules = [     # where the adapter attaches (leave as-is)
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    use_gradient_checkpointing = "unsloth",  # saves memory during training
    random_state = 3407,                     # makes the run repeatable
)

Here is what each setting means.

Size of the adapter: r is the size of the adapter (its rank). It must be a positive whole number, with no fixed maximum, but in practice people pick from a standard ladder: 8, 16, 32, 64, 128. A bigger number gives the adapter more room to learn, but uses more memory and trains more slowly. A smaller number is lighter but learns less. Going very low (1 or 2) rarely learns enough to be useful, and going very high uses so much memory that you lose the savings that made LoRA worth choosing. 16 sits in the middle and works well for a task like ours. If you later find the model is not learning enough, stepping up to 32 or 64 is the first thing to try.
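
To see why r drives memory use, it helps to count what the adapter adds. For one weight matrix of shape d_out × d_in, LoRA trains two thin matrices of shapes r × d_in and d_out × r in place of the full matrix. The sketch below counts those numbers. The 4096 × 4096 layer size is a made-up example, not Granite's actual dimension.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable numbers LoRA adds to one d_out x d_in weight matrix."""
    return r * d_in + d_out * r


# Hypothetical layer size, chosen only to make the arithmetic concrete.
D_IN, D_OUT = 4096, 4096
full = D_IN * D_OUT

for r in (8, 16, 32, 64, 128):
    added = lora_param_count(D_IN, D_OUT, r)
    print(f"r={r:>3}: {added:>9,} trainable ({added / full:.2%} of the full matrix)")
```

The count grows linearly with r, so doubling r doubles the adapter's size for every layer it attaches to.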

Strength of the adapter: lora_alpha sets how much the adapter's learning is allowed to influence the model. Internally, the adapter's effect is scaled by lora_alpha divided by r. So if lora_alpha equals r, the scale is 1, and the adapter applies at full strength as-is. If lora_alpha is larger than r, the adapter's effect is amplified; if smaller, it is dampened. This is why the two values are always considered together: r sets how much the adapter can learn, and lora_alpha sets how loudly that learning speaks to the model. Setting lora_alpha equal to r, as we do here, keeps the scale at 1, which is the simplest and most predictable choice. A common alternative is to set lora_alpha to twice r, which doubles the adapter's effect so the model shifts toward the new task faster. Pushing it much higher can make training erratic, so most people stay at one or two times r.
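
The scaling rule is easier to see with numbers. The following is a minimal sketch of a LoRA layer's forward pass in plain Python, assuming the usual formulation: output = W·x + (lora_alpha / r)·B·(A·x). The tiny matrices are invented for the example, and real libraries do this with tensors on the GPU.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled adapter output."""
    base = matvec(W, x)                 # the frozen model's own answer
    update = matvec(B, matvec(A, x))    # the adapter's contribution, B @ (A @ x)
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy values with r = 2 (illustrative only).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
A = [[1.0, 0.0], [0.0, 1.0]]   # adapter "down" matrix, shape r x d_in
B = [[0.5, 0.0], [0.0, 0.0]]   # adapter "up" matrix, shape d_out x r
x = [2.0, 3.0]

for alpha in (1, 2, 4):
    print(f"lora_alpha={alpha}: {lora_forward(W, A, B, x, alpha, r=2)}")
```

With lora_alpha equal to r the adapter's contribution is added unchanged; at half of r it is halved, and at twice r it is doubled.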

Overfitting guard: lora_dropout is a fraction between 0 and 1 that pushes the model toward learning the general task instead of memorizing the exact examples. It works by randomly switching off that fraction of the adapter's inputs on each training step, so the adapter cannot lean too heavily on any single one. At 0 the guard is off, so nothing pushes the model away from memorizing, though on a large, varied dataset it usually generalizes fine anyway. Higher values apply a stronger push toward generalizing, at the cost of slower and slightly harder learning. In practice people stay low, around 0.05 to 0.1, and only raise it when the dataset is small or repetitive and the model starts memorizing instead of handling new inputs. Our dataset does not need it, so we leave it at 0.
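
For readers who want to see the mechanism, here is a small sketch of dropout itself, written in plain Python. It assumes the common "inverted dropout" formulation, where surviving values are scaled up so the average stays the same. The function name and sample values are made up for the example.

```python
import random


def dropout(values, p, rng):
    """Zero each value with probability p and rescale the survivors.

    p must be in the range [0, 1). With p == 0 the input passes through untouched.
    """
    if p == 0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


rng = random.Random(3407)
inputs = [1.0, 2.0, 3.0, 4.0]

print(dropout(inputs, 0, rng))     # guard off: nothing changes
print(dropout(inputs, 0.5, rng))   # each value is either 0.0 or doubled
```

Dropout is only active during training. At inference time every input is used as-is.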

Bias terms: inside a model, the main numbers that do the heavy lifting are called weights. In simple terms, the model takes an input, multiplies it by the weights, and adds a second, much smaller number called the bias to get the output: input × weight + bias = output. If the weights decide the answer, the bias is a tiny nudge that shifts it a little in one direction. The bias setting decides whether training is allowed to adjust those nudges too. Turning them on adds a little training cost and, in LoRA, barely changes the result, so there's rarely a reason to. The standard choice is "none": train only the adapter, leave the nudges alone. We use "none".

Where the adapter attaches: target_modules is the list of spots inside the model where we plug in the adapter. A model is made of many internal parts, and each name in this list points to one kind of part worth training. The two groups you see are the model's attention parts (q_proj, k_proj, v_proj, o_proj), which control what the model focuses on, and its processing parts (gate_proj, up_proj, down_proj), which do the work of transforming that information into an output. You do not need to understand each name. This is the standard, recommended set for this kind of model, and attaching to all of them gives the adapter the best chance to learn. Use it as written. Leaving names out would train fewer parts and save a little memory, but the model would learn less, so there is no reason to change it here.

Memory saver: use_gradient_checkpointing lets training run in less memory, at the cost of a little speed. During training, the model normally keeps a lot of temporary information in memory to learn from each example. That information is what fills up a small GPU and causes out-of-memory crashes. This setting tells the model to keep less of it and recreate what it needs later. The result is lower memory use, so the run fits on a small GPU, in exchange for slightly slower training (Unsloth measures the slowdown at around 2 percent). It takes 3 values: "unsloth" is Unsloth's own version, tuned to save the most memory, and is what we use; True is the standard version that saves less; False turns it off (fastest, but uses the most memory and will likely crash on a small GPU). Keep it at "unsloth". If you ever hit an out-of-memory error with "unsloth", try True instead, which occasionally fits when "unsloth" does not.
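
The "keep less and recreate it later" idea can be shown with a toy example. This is only a sketch of the general technique, not Unsloth's implementation: the "layers" are simple arithmetic functions, and every name here is invented for illustration.

```python
def forward_all(x, layers):
    """Standard approach: keep every intermediate value in memory."""
    activations = [x]
    for layer in layers:
        activations.append(layer(activations[-1]))
    return activations


def forward_checkpointed(x, layers, every):
    """Keep only the input and every `every`-th intermediate value."""
    checkpoints = {0: x}
    current = x
    for i, layer in enumerate(layers, start=1):
        current = layer(current)
        if i % every == 0:
            checkpoints[i] = current
    return checkpoints


def recompute(checkpoints, layers, index):
    """Rebuild the value at `index` from the nearest earlier checkpoint."""
    start = max(i for i in checkpoints if i <= index)
    value = checkpoints[start]
    for layer in layers[start:index]:
        value = layer(value)
    return value


# Eight toy "layers"; each one just does a little arithmetic.
layers = [lambda v, k=k: v * 2 + k for k in range(8)]

full = forward_all(1, layers)
saved = forward_checkpointed(1, layers, every=4)

print(f"stored without checkpointing: {len(full)} values")
print(f"stored with checkpointing:    {len(saved)} values")
print(f"recomputed value matches:     {recompute(saved, layers, 6) == full[6]}")
```

The checkpointed run stores 3 values where the standard run stores 9, and pays for it by re-running a few layers whenever a missing value is needed. That is the memory-for-speed trade described above.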

Repeatability: random_state makes a training run repeatable, so that fine-tuning the model twice with the same script and data produces the same result. Training relies on randomness in several places, so without a fixed value two runs of the same script can come out slightly different. Setting this to a fixed number locks that randomness in place. The exact value does not matter (3407 is just what we picked); it only needs to stay the same across the runs you want to match. Change it and the next run differs slightly; keep it, on the same hardware and library versions, and you should get the same fine-tuned model each time.
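
The effect of a fixed seed is easy to demonstrate with Python's own random number generator. This sketch only illustrates the principle; it does not touch the training libraries, and the helper name is made up.

```python
import random


def draw(seed, count=3):
    """Return `count` pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


first = draw(3407)
second = draw(3407)
other = draw(3408)

print(first == second)   # same seed, same numbers
print(first == other)    # different seed, different numbers
```

The same thing happens inside a training run: the "random" choices are generated from the seed, so fixing the seed fixes the choices.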

This table summarizes the settings we use:

| Config | Definition |
| --- | --- |
| r | How big the add-on is. Bigger learns more but uses more memory and is slower. 16 is a safe start. |
| lora_alpha | How loud the add-on speaks to the model. Setting it equal to r is the simplest, most predictable choice. |
| lora_dropout | Stops the model from just memorizing examples. Leave at 0 unless your data is small or repetitive. |
| bias | Whether to also tweak the model's tiny "nudge" numbers. "none" means don't bother; it barely helps. |
| target_modules | The spots inside the model where the add-on plugs in. Use the list as written. |
| use_gradient_checkpointing | Trades a little speed for much less memory, so training fits on a small GPU. Keep "unsloth". |
| random_state | A fixed number that makes the run repeatable. Same number = same result every time. |

Note: The two settings you might ever touch are r and lora_alpha, and even those are fine at 16. Everything else you can leave exactly as shown.

Step 4: Load the Dataset

Pull the text-to-SQL dataset from Hugging Face. The slice takes the first 5,000 rows, enough to teach the task without a long training time.

from datasets import load_dataset

dataset = load_dataset(
    "gretelai/synthetic_text_to_sql",
    split = "train[:5000]",
)

# Confirm the column names before mapping them. A renamed column
# is a silent break, so check the real data once.
print(dataset.column_names)
print(dataset[0])

This dataset has 3 columns we use:

  • sql_prompt is the question in plain English, like "How many users signed up in May?"
  • sql_context is the database description: the CREATE TABLE statements that say which tables and columns exist. A query only makes sense against a specific database, so the model needs this to know what it is querying.
  • sql is the answer: the query the model should produce.

Earlier we said a training example has 2 parts, an input and an output. That still holds. Here the input is built from 2 columns, not one: the schema (sql_context) and the question (sql_prompt) together form what the model reads, and sql is what it should write back. So 3 columns in the dataset, but still just 2 roles in training: question in, query out. The next step combines the schema and the question into a single input.
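
Since a renamed column would break the mapping silently, one option is to turn the manual check into a small guard that fails loudly. This is a sketch, not part of the companion code: the helper name is made up, and the example list stands in for the real dataset.column_names.

```python
REQUIRED_COLUMNS = ("sql_prompt", "sql_context", "sql")


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent, in their original order."""
    present = set(column_names)
    return [name for name in required if name not in present]


# Stand-in for dataset.column_names; the extra "id" column is only illustrative.
example_columns = ["id", "sql_prompt", "sql_context", "sql"]

missing = missing_columns(example_columns)
if missing:
    raise KeyError(f"Dataset is missing expected columns: {missing}")
print("All required columns are present.")
```

In the real script you would pass dataset.column_names in place of the example list, right after loading the dataset.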

Step 5: Format Each Row into Granite's Chat Format

The model learns the format along with the content, so each row must look the way you will prompt the model later. You build 3 turns per example and let the tokenizer apply Granite's own chat template. Do not write the template by hand.

# A fixed instruction. The schema and question change per row,
# so they go in the user turn, not here.
SYSTEM = ("You translate questions into SQL for the given schema. "
          "Reply with one SQL query and nothing else.")

def format_example(row):
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content":
            f"Schema:\n{row['sql_context']}\n\nQuestion: {
