Background
Agents are an exciting space in the development of LLM applications - especially with the release of OpenAI function calling.
This post is about chat assistant agents (though the ontology of “agents” is still a bit fuzzy). Basically, these are chatbots that approximate a domain-specific human assistant. They must be able to use tools and perform basic planning. ChatGTD is an example.
My working hypothesis is that there is a baseline of interaction types that a chat assistant agent must support - or it’ll be super frustrating to use. I haven’t seen an agent that actually achieves this baseline. (I’ve tried to build one, and can’t quite get there!)
This resource
I’ve been keeping a list of those interaction types, for use while designing agents. That list is below. Each item has a simple example.
Caveats:
- Items are categories, not test scenarios. Within each category, an agent could pass the simple example here but fail many others.
- These are atomic. An agent could pass all of these in isolation, yet still fail when they are combined in a single chat.
Nevertheless, I've found this list helpful when designing agents. In one project, I turned it into a suite of regression tests.
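As a sketch of what such regression tests can look like (everything here is a hypothetical stand-in - `Task` and `FakeAgent` are toy objects, not a real agent harness):

```python
# Toy regression tests for the first two checklist items below.
# A real suite would drive an actual LLM-backed agent; `FakeAgent`
# just maps a few phrasings onto a close-task tool call.

class Task:
    def __init__(self):
        self.closed = False

class FakeAgent:
    def __init__(self, task):
        self.task = task

    def chat(self, message):
        # Stand-in for the LLM choosing a tool from the message.
        if any(kw in message.lower() for kw in ("close", "complete", "done")):
            self.task.closed = True          # <closes task>
            return "Closed the task."
        return "I'm not sure what to do."

def test_single_tool():
    task = Task()
    FakeAgent(task).chat("Close the task")
    assert task.closed

def test_command_translation():
    task = Task()
    FakeAgent(task).chat("Complete this task")
    assert task.closed
```

Each checklist category below becomes at least one test like these, run on every change to the agent's prompts or tools.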
Checklist
Single tool
This is the Hello World example: ask an AI to use a tool, and the AI does it.
human: Close the task
ai: <closes task>
Command translation
This is low-hanging fruit for an LLM, yet it can still fail at the extremes, e.g. slang.
human: Complete this task
ai: <closes task>
Tool inference
“This is done” isn’t an instruction, yet the AI recognizes that a tool is appropriate.
human: This is done
ai: <closes task>
Confirmation (human-in-the-loop)
AI must confirm tool use with the human inline.
human: Close the task
ai: Okay, I’ll close the task. Confirm? [Y/n]
human: Y
ai: <closes task>
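One way to implement this gate is to hold the proposed tool call in a pending slot and execute nothing until the human replies. A minimal sketch, where `ConfirmingAgent`, the tool registry, and the hardcoded proposal are all hypothetical:

```python
# Human-in-the-loop confirmation gate: the agent proposes a tool call,
# and nothing runs until the human confirms with "Y".

def close_task(task_id):
    return f"closed {task_id}"

TOOLS = {"close_task": close_task}

class ConfirmingAgent:
    def __init__(self):
        self.pending = None  # (tool_name, kwargs) awaiting confirmation

    def chat(self, message):
        if self.pending:
            name, kwargs = self.pending
            self.pending = None
            if message.strip().lower() in ("y", "yes", ""):
                return TOOLS[name](**kwargs)
            return "Okay, cancelled."
        # A real agent would let the LLM pick the tool; hardcoded here.
        self.pending = ("close_task", {"task_id": "T-1"})
        return "Okay, I'll close the task. Confirm? [Y/n]"
```

The key design point is that the tool function itself is never called on the first turn - only the proposal is stored.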
Modify a tool execution
This is different from rejecting a confirmation, because the second human message assumes context from the first. The AI must consider the full chat history in its response.
human: Set the due date to July 1st
ai: Okay, I’ll set the due date to July 1. Confirm? [Y/n]
human: Actually, that’s over a holiday weekend, let’s do July 15th.
ai: Okay, I’ll set the due date to July 15. Confirm? [Y/n]
...
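The tricky part is that the pending call can be revised, not only confirmed or rejected. One way to sketch this: any reply other than "Y" re-runs the model with the full history, which may emit a new proposal that supersedes the old one. Here `guess_date` is a toy stand-in for the LLM call:

```python
# A pending tool call that can be revised mid-confirmation.

def set_due_date(date):
    return f"due date set to {date}"

def guess_date(history):
    # Stand-in: a real agent would prompt the LLM with the whole history.
    last = history[-1][1]
    return {"date": "July 15" if "July 15" in last else "July 1"}

class RevisableAgent:
    def __init__(self, llm):
        self.llm = llm
        self.history = []
        self.pending = None  # kwargs for set_due_date, awaiting confirmation

    def chat(self, message):
        self.history.append(("human", message))
        if self.pending and message.strip().lower() in ("y", "yes"):
            kwargs, self.pending = self.pending, None
            return set_due_date(**kwargs)
        # Not a plain Y/n: ask the model again with the full history,
        # replacing the pending proposal ("Actually... July 15th").
        self.pending = self.llm(self.history)
        return f"Okay, I'll set the due date to {self.pending['date']}. Confirm? [Y/n]"
```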
Multiple tools
A single human message specifies multiple tools.
human: Assign to Bob and set the due date to July 1st.
ai: <assigns to bob>
ai: <sets due date>
Sequential tools
Note the “and then” here - the human is prescribing an order.
human: Set the due date to July 1st and then assign to Bob.
ai: <sets due date>
ai: <assigns to bob>
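For both of these items, one workable design is to have the LLM emit an ordered list of tool calls and run them with an executor that preserves the sequence. A sketch, with hypothetical tools and a hardcoded parse of the "and then" message:

```python
# Execute a list of proposed tool calls in order. Whether the model
# produced them from "A and B" or "A and then B", the executor
# preserves the sequence it was given.

log = []

def set_due_date(date):
    log.append(f"set_due_date({date})")

def assign(person):
    log.append(f"assign({person})")

TOOLS = {"set_due_date": set_due_date, "assign": assign}

def run_plan(calls):
    """calls: ordered list of (tool_name, kwargs), e.g. parsed from LLM output."""
    for name, kwargs in calls:
        TOOLS[name](**kwargs)

# "Set the due date to July 1st and then assign to Bob."
run_plan([("set_due_date", {"date": "July 1"}),
          ("assign", {"person": "Bob"})])
```

After this runs, `log` holds the calls in the prescribed order.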
Multi-step task
Human gives a direct instruction that has two steps, without listing those steps as they did above. However, the steps are obvious and can be directly inferred from the set of tools.
human: Can you block 30mins tomorrow morning?
ai: <looks up availability in calendar>
ai: <creates event in calendar>
Multi-step planning
The agent is given a goal instead of an explicit instruction, and has to reason out steps. This is the type of planning that ReAct and AutoGPT do, but inline in the chat.
Note the Thought: clause. This implies that the AI can't fully predict the set of steps at the start.
human: Can you find some restaurants for dinner with Jeff on Wednesday?
ai: <looks up Jeff from contacts>
ai: Thought: Jeff will like expensive restaurants.
ai: <looks up $$$$ restaurants with availability on Wednesday>
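This kind of loop can be sketched as an inline ReAct cycle: each turn, the model emits either a thought, a tool call, or a final answer, and sees the results accumulated so far. Everything here is a stand-in - `llm_step` fakes the model with a fixed policy, and the tools return canned data:

```python
# Inline ReAct-style loop: steps aren't predictable up front, so the
# "model" decides one step at a time based on the scratchpad.

def lookup_contact(name):
    return {"name": name, "title": "Managing Director"}

def search_restaurants(price):
    return [f"{price} restaurant with Wednesday availability"]

TOOLS = {"lookup_contact": lookup_contact,
         "search_restaurants": search_restaurants}

def llm_step(scratchpad):
    # Stand-in policy; real code would prompt the LLM with the scratchpad.
    if not scratchpad:
        return ("tool", "lookup_contact", {"name": "Jeff"})
    if len(scratchpad) == 1:
        return ("thought", "Jeff will like expensive restaurants.", None)
    if len(scratchpad) == 2:
        return ("tool", "search_restaurants", {"price": "$$$$"})
    return ("answer", scratchpad[-1][1], None)

def react(goal, max_steps=8):
    # `goal` would seed the prompt in a real agent; unused in this stub.
    scratchpad = []
    for _ in range(max_steps):
        kind, payload, args = llm_step(scratchpad)
        if kind == "answer":
            return payload
        if kind == "thought":
            scratchpad.append(("thought", payload))
        else:
            scratchpad.append((payload, TOOLS[payload](**args)))
    return None
```

The change-course item below then amounts to letting a human message interrupt this loop and reset the scratchpad's direction.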
Multi-step reasoning with course change
Same as above, but the user can change course mid-plan. Note that we have to bring back the Confirm: here.
human: Can you find some restaurants for dinner with Jeff on Wednesday?
ai: Okay, I’ll look up Jeff from contacts. Confirm? [Y/n]
ai: <looks up Jeff from contacts>
ai: Thought: Jeff will like expensive restaurants.
ai: I’ll look for expensive restaurants with availability on Wednesday. Confirm? [Y/n]
human: No Jeff is chill, just find some good Asian food.
ai: Thought: I should find restaurants with Japanese, Thai, Korean, Chinese cuisine and $$ or $$$ pricing and order by highest rated.
...
Explain its reasoning
This is the first example where the agent responds with free text - no tools are run.
(Here, the reasoning isn’t actually connected to the original commands - but IMO it’s still important to support this.)
human: Can you find some restaurants for dinner with Jeff on Wednesday?
ai: <looks up Jeff from contacts>
ai: Thought: Jeff will like expensive restaurants.
ai: <looks up $$$$ restaurants with availability on Wednesday>
human: Why did you pick these restaurants?
ai: I selected restaurants marked $$$$ because Jeff’s title is Managing Director, so he probably likes expensive restaurants.
Summarization
Here, the agent needs the full context of the current task - it can’t solely rely on the chat history.
human: What’s the status of that dinner reservation?
ai: On Monday, we talked about restaurants that you and Jeff would like, and we chose Zen sushi.
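One way to satisfy this is to assemble the summarization prompt from both the task record and the chat transcript, since neither alone has the full picture. A sketch with made-up task fields:

```python
# Build a summarization prompt from task state plus chat history.
# The task dict and history here are hypothetical examples.

task = {"id": "dinner-123", "status": "reserved", "restaurant": "Zen sushi"}
history = [
    ("human", "Can you find some restaurants for dinner with Jeff on Wednesday?"),
    ("ai", "I found a few $$$$ options; Zen sushi looks best."),
]

def build_summary_prompt(task, history):
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (f"Task state: {task}\n"
            f"Chat so far:\n{transcript}\n"
            "Summarize the status of this task for the user.")
```

If the reservation was made outside the chat (or in an earlier session), only the task record knows about it - which is why the prompt can't be built from history alone.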
Suggest next steps
Here, the agent needs to infer tools from its own free-text messages earlier in the chat.
human: What's next for scheduling dinner?
ai: You may want to confirm the reservation with Jeff.
human: Right, can you email him to confirm?
ai: Okay, I'll send this message. Confirm? [Y/n]
...