Pass 7: Give the Chat a Long-Term Memory That Survives Restarts

Our previous example solved one half of the memory problem. Inside a single conversation, summarization keeps old turns alive in compressed form so the model stays coherent even after 20 or 30 exchanges. What the previous code cannot do is help the model recognize you the next time you start the REPL. The summary lives in RAM for the duration of one run, and when you hit /bye it goes away. The next session starts blank, and you re-introduce yourself, your projects, and so on.

For most chat applications, that's the bigger problem. A research assistant that forgets your project every morning is annoying. A coding helper that doesn't remember which framework you use is useless. The fix is a separate layer that writes durable facts to disk after each conversation and reads them back at the start of every new turn. This layer is usually called long-term memory, and it works very differently from summarization: instead of compressing recent context, it extracts stable facts about the user (preferences, identity, recurring projects, etc.) and stores them in a database that survives restarts.

We'll use a library called mem0, which packages this idea into a small Python API. Under the hood, it runs an LLM over each conversation turn to extract facts, embeds those facts into vectors, and stores them in a local vector database (Chroma in our case). On retrieval, it embeds your latest question and returns the most semantically similar facts, which we then inject into the conversation as a system message. The model sees those facts as context before reading your question, so it can answer correctly without ever seeing the original conversation those facts came from.

The following diagram describes what we're going to implement:

How long-term memory works

Step 1: Add the Memory Settings

We will add the following constants to config.py:

# repl/config.py

# Model used by mem0 to extract durable facts from conversations.
# Needs to be smart enough to distinguish "the user said X" from
# generic assistant text. A 2B model is too small for reliable
# extraction; 3B is the practical minimum.
EXTRACTION_MODEL: str = os.environ.get(
    "EXTRACTION_MODEL", "granite3.3:8b"
)

# Model used to turn text into vectors for similarity search.
EMBED_MODEL: str = os.environ.get(
    "EMBED_MODEL", "nomic-embed-text:v1.5"
)

# Collection name inside Chroma. One collection per app is fine.
COLLECTION_NAME: str = os.environ.get(
    "COLLECTION_NAME", "my_memories"
)

# Where mem0 keeps its vector store on disk. Survives restarts.
# Delete this directory to wipe all memories and start fresh.
MEMORY_DB_PATH: str = os.path.expanduser(
    os.environ.get("MEMORY_DB_PATH", "/var/data/ollama")
)

# Maximum cosine distance (Chroma raw, lower = more similar) for a memory
# to be considered relevant to the current query. mem0's own `threshold`
# is unusable - its normalized score saturates at 1.0 for irrelevant
# content. With nomic-embed-text:v1.5, relevant matches sit below ~1.0
# and noise sits above ~1.2. Lower this to be stricter.
MEMORY_RELEVANCE_THRESHOLD: float = float(
    os.environ.get("MEMORY_RELEVANCE_THRESHOLD", "1.0")
)

# When true, print extra diagnostic lines (e.g. memory write lifecycle).
# Errors are always shown regardless of this flag.
DEBUG: bool = os.environ.get("DEBUG", "false").lower() in {
    "1",
    "true",
    "yes",
    "on",
}

mem0 needs two extra models running on Ollama:

One to extract facts from conversation text so they can be stored: EXTRACTION_MODEL
One to turn text into vectors so we can find related facts later: EMBED_MODEL

DEBUG lights up extra diagnostic prints inside the memory code. We'll add some conditional print() statements to see what's going on.

COLLECTION_NAME is just a name for the storage bucket inside Chroma.

MEMORY_DB_PATH is where the database files live on disk (delete this folder to wipe all memories).

MEMORY_RELEVANCE_THRESHOLD is a maximum distance: smaller means "more strict about what counts as relevant".

All of these settings can be overridden by environment variables (in the .env file).

Before using any model (summarization, extraction, embedding), make sure you download it with ollama pull.

Step 2: Build the Memory Object

build_memory() wires mem0 together. mem0 needs three pieces of configuration:

def build_memory() -> Memory:
    """Set up mem0. Everything runs locally, no cloud services.

    mem0 needs three pieces:
      - llm: extracts the durable facts from a conversation.
      - embedder: turns text into vectors (lists of numbers) so we
        can find facts that look similar to the user's question.
      - vector_store: where the facts and their vectors live on disk.
    """
    config = {
        "llm": {
            "provider": "ollama",
            "config": {
                "model": EXTRACTION_MODEL,
                "ollama_base_url": OLLAMA_HOST,
            },
        },
        "embedder": {
            "provider": "ollama",
            "config": {
                "model": EMBED_MODEL,
                "ollama_base_url": OLLAMA_HOST,
            },
        },
        "vector_store": {
            "provider": "chroma",
            "config": {
                "collection_name": COLLECTION_NAME,
                "path": MEMORY_DB_PATH,
            },
        },
    }
    return Memory.from_config(config)

llm: extracts the durable facts from a conversation turn.
embedder: converts text into a vector (a list of numbers) so we can compare meanings.
vector_store: where the facts and their vectors live on disk.

Step 3: Identify the User at Startup

Memories are scoped per user, so we ask for an id once when the program starts:

USER_ID = (
    input("Enter your user id: ").strip() or "default"
)

We store it in a module-level variable because the background memory-writer (later) needs to read it without us passing it around.

Step 4: Look Up Relevant Facts before Answering

relevant_memories(memory, query, user_id, k=5) is the read path. It returns a formatted block of facts that look related to the user's current question. Three sub-steps inside it.

Sub-step A: embed the query.

We turn the question into the same kind of vector that the stored facts were embedded with, so they're comparable:

# Step 1: turn the user's question into a vector (a list of
# numbers) using the same embedding model mem0 used when storing
# facts. Same model = comparable vectors.
embedding = memory.embedding_model.embed(
    query, memory_action="search"
)

Sub-step B: ask Chroma for the closest candidates.

# Step 2: ask Chroma for the closest stored facts. We ask for
# k*4 candidates so we have spares after filtering by distance.
# `where` restricts the search to this user only.
res = memory.vector_store.collection.query(
    query_embeddings=[embedding],
    n_results=k * 4,
    where={"user_id": user_id},
)

We ask for k*4 candidates so we have spares to throw away once we apply the relevance filter. The where clause restricts the search to this user's facts only.

We're talking directly to Chroma here rather than using mem0's own search(), because mem0's relevance score saturates at 1.0 even for unrelated content, so it can't be used for filtering.

Sub-step C: filter by distance, format as text.

# Chroma returns lists of lists (one inner list per query). We
# only ran one query, so we grab the [0] inner list.
distances = res.get("distances", [[]])[0]
metadatas = res.get("metadatas", [[]])[0]

# Step 3: walk through the candidates from closest to farthest,

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next