Building Advanced Agents: Streaming and Multi-Line Input
Pass 3: Stream the Reply Token-by-Token and Accept Multi-Line Input
The second version works, but it has two problems that become obvious as soon as you ask it anything substantial. The first is that long replies arrive as a single block of text after a noticeable pause. The second is that you can only send one line at a time, which means pasting a code snippet or a multi-paragraph question is impossible: the first newline ends your turn.
We're going to fix both, and the fix is small.
For streaming, we switch to
stream=Trueand iterate the chunks. Each chunk'smessage.contentis a small piece of the full reply; we print it as it arrives, and at the same time we accumulate it into a string so we have the complete reply to append to the history when streaming ends. This dual responsibility (display now, store for later) is the part that trips people up, so it's worth doing slowly.For multi-line input, we read lines until the user signals they're done. The convention we'll use is a blank line: hit Enter once to add a newline to your message, hit Enter on an empty line to send.
Step 1: Read Multi-Line Input
We pull input reading out of the main loop into its own helper, read_input(). The trick is simple: keep calling input() and collect lines until the user submits an empty line.
while True:
line = input(prompt)
if line == "":
break
lines.append(line)
prompt = " "
return "\n".join(lines)
- The first line uses the
You >prompt. After that, we switchpromptto spaces so continuation lines visually align under the user's text. - An empty line ends the message. If the user hits Enter right away on the first line, we return
""and let the main loop ignore it.
Step 2: Handle the Empty Submission in the Main Loop
Since read_input() can return an empty string, the main loop has to skip those instead of sending nothing to the model:
# If the user pressed Enter on an empty prompt, just loop back.
if user == "":
continue
if user.strip() == "/bye":
break
Step 3: Raise the Client Timeout
Streamed answers can be long, and we want the connection to stay open while the model is producing tokens. We bump the timeout when creating the client:
client = Client(host=OLLAMA_HOST, timeout=300)
300 seconds (5 minutes) is comfortably above what a normal reply will take.
Step 4: Ask the API to Stream
The change in the API call itself is one flag: stream=True. Instead of returning a single response, client.chat(...) now returns an iterator that yields small chunks as the model generates them:
# stream=True makes the model send back its reply one tiny
# piece at a time instead of all at once at the end. We print
# each piece as it arrives (the "typing" effect) and also
# build up the full reply so we can save it to history.
printLocal AI Engineering with Ollama
Run, understand, customize, fine-tune, and build agentic apps on your own hardwareEnroll now to unlock all content and receive all future updates for free.
