Pass 5: Swap Hard Trimming for an Automatic Summary of Older Messages

The trim function we wrote works, but it has an honest weakness: it throws information away. The moment a turn falls outside the character budget, it's gone, and the model has no way to recover what it knew. If you spent the first 5 turns telling the assistant about a project you're working on (the language, the constraints, the deadlines), and then you fill the next 30 turns with related but separate questions, pass 4 will eventually drop those early turns. The model will keep answering your questions, but it'll be answering them without any memory of the context you gave it at the start. From your side this looks like the assistant getting dumber over time, which is exactly the experience we want to avoid.

A better approach is to compress old turns instead of dropping them. Rather than deleting the first ten messages when they fall out of budget, we ask the model itself to read them and produce a short summary. That summary takes the place of the original messages and becomes the assistant's "long-term memory" of the early conversation. The recent turns stay intact, so the model still has the immediate context it needs to reply well, and the summary preserves the gist of everything that came before. This is the same technique production assistants and coding agents use to handle long sessions; it works because language models are remarkably good at distilling conversation history into a paragraph or two.

We could write this ourselves. It would be about eighty lines of code: a token counter, a trigger condition, a summarization prompt, careful handling of system messages and assistant/tool message pairs, and a state machine that knows when summarization has already happened so it doesn't summarize the summary. None of that is conceptually hard, but all of it is detail work that's been done dozens of times by other people. For this section we'll use the implementation that ships with LangChain, called SummarizationMiddleware, which gives us the same behavior in about ten lines of configuration.

Why LangChain

It's worth being upfront about what changes when we bring LangChain into the picture, because this is the first part where we step outside the Ollama SDK and into a framework. The Ollama SDK is a thin client over the HTTP API: you give it a model name and a list of messages, and it gives you back a response. LangChain is something different, a framework that wraps any number of model providers behind a unified interface and adds layers on top of them (agents, middleware, tool calling, retrievers). For most of what we did, the SDK was the right tool (less code, no abstraction tax, easy to debug). For summarization specifically, the calculus flips. LangChain has already solved the conversation-management problem in a way that integrates cleanly with token counting, message-pair preservation, and per-model defaults. Reinventing it would be a waste of time.

The trade-off going forward is that the REPL no longer talks directly to ollama.Client. It talks to ChatOllama, which is LangChain's adapter for Ollama. Under the hood, ChatOllama still uses the same ollama package we've been using, so we're not bypassing what we learned; we're wrapping it.

A standalone version of pass 3 written against ChatOllama looks like this:

from langchain_core.messages import HumanMessage, AIMessage
from langchain_ollama import ChatOllama

# ChatOllama accepts the same base_url and model that we already use.
# It also takes a num_predict cap (max tokens per reply) and a temperature.
llm = ChatOllama(
    model=OLLAMA_MODEL,
    base_url=OLLAMA_HOST,
    num_predict=512,
)

messages = [HumanMessage(content="What's 17 times 19?")]

# .stream() returns an iterator of AIMessageChunk objects. Each chunk
# has a small piece of text in chunk.content, same idea as the SDK's
# chunk.message.content but renamed in the LangChain abstraction.
full_reply = ""
for chunk in llm.stream(messages):
    piece = chunk.content
    print(piece, end="", flush=True)
    full_reply += piece

# To remember this turn, append an AIMessage (not a dict) to the list.
messages.append(AIMessage(content=full_reply))

You could stop here, plug this into pass 4's loop in place of the SDK call, and have a working LangChain version of the REPL. But the reason we did this is so we can stack SummarizationMiddleware on top, and for that we need to upgrade from a raw model call to an agent.

Step 1: Build the Chat Model with `ChatOllama`

We don't talk to Ollama via the raw Client anymore. We wrap it in LangChain's ChatOllama, which gives us the same backend behind a LangChain-friendly interface:

llm = ChatOllama(
    model=OLLAMA_MODEL,
    base_url=OLLAMA_HOST,
    num_predict=2048,
)

num_predict=2048 is the maximum number of tokens the model will generate in one reply. You can give it more space if needed.

Step 2: Build a Second Model for Summarizing

The middleware needs a model to write the summaries. We create another ChatOllama for that role:

summarizer = ChatOllama(
    model=OLLAMA_MODEL,
    base_url=OLLAMA_HOST,
    num_predict=2048,
)

Reusing the same model is fine for learning. In real apps you'd often pick a smaller, cheaper model here, because summarizing is an easier task than answering.

Step 3: Wrap the Model in an Agent with Summarization Middleware

create_agent(...) is a LangChain function that launches a new agent. It takes a model and a list of middleware and returns an "agent": the model plus the extra behavior wired around it. We attach our summarization middleware:

agent = create_agent(
    model=llm,    
    middleware=[
        SummarizationMiddleware(
            model=summarizer,
            trigger=("tokens", 2000