Pass 3: Stream the Reply Token-by-Token and Accept Multi-Line Input

The second version works, but it has two problems that become obvious as soon as you ask it anything substantial. The first is that long replies arrive as a single block of text after a noticeable pause. The second is that you can only send one line at a time, which means pasting a code snippet or a multi-paragraph question is impossible: the first newline ends your turn.

We're going to fix both, and the fix is small.

For streaming, we switch to stream=True and iterate the chunks. Each chunk's message.content is a small piece of the full reply; we print it as it arrives, and at the same time we accumulate it into a string so we have the complete reply to append to the history when streaming ends. This dual responsibility (display now, store for later) is the part that trips people up, so it's worth doing slowly.
For multi-line input, we read lines until the user signals they're done. The convention we'll use is a blank line: hit Enter once to add a newline to your message, hit Enter on an empty line to send.

Step 1: Read Multi-Line Input

We pull input reading out of the main loop into its own helper, read_input(). The trick is simple: keep calling input() and collect lines until the user submits an empty line.

while True:
    line = input(prompt)
    if line == "":
        break
    lines.append(line)
    prompt = "      "
return "\n".join(lines)

The first line uses the You > prompt. After that, we switch prompt to spaces so continuation lines visually align under the user's text.
An empty line ends the message. If the user hits Enter right away on the first line, we return "" and let the main loop ignore it.

Step 2: Handle the Empty Submission in the Main Loop

Since read_input() can return an empty string, the main loop has to skip those instead of sending nothing to the model:

# If the user pressed Enter on an empty prompt, just loop back.
if user == "":
    continue
if user.strip() == "/bye":
    break

Step 3: Raise the Client Timeout

Streamed answers can be long, and we want the connection to stay open while the model is producing tokens. We bump the timeout when creating the client:

client = Client(host=OLLAMA_HOST, timeout=300)

300 seconds (5 minutes) is comfortably above what a normal reply will take.

Step 4: Ask the API to Stream

The change in the API call itself is one flag: stream=True. Instead of returning a single response, client.chat(...) now returns an iterator that yields small chunks as the model generates them:

# stream=True makes the model send back its reply one tiny
# piece at a time instead of all at once at the end. We print
# each piece as it arrives (the "typing" effect) and also
# build up the full reply so we can save it to history.
print

Local AI Engineering with Ollama

Run, understand, customize, fine-tune, and build agentic apps on your own hardware

Enroll now to unlock all content and receive all future updates for free.

Unlock now $26.99 Learn More

Previous Next