A Checklist For Chat Assistant Agents

Status: I find this helpful, but it's still a first draft that needs technical diligence. If I'm still using it in a month, I'll probably add some items and revise some examples.

Background

Agents are an exciting space in the development of LLM applications - especially with the release of OpenAI function calling.

This post is about chat assistant agents (though the ontology of “agents” is still a bit fuzzy). Basically, these are chatbots that approximate a domain-specific human assistant. They must be able to use tools and perform basic planning. ChatGTD is an example.

My working hypothesis is that there is a baseline of interaction types that a chat assistant agent must support - or it’ll be super frustrating to use. I haven’t seen an agent that actually achieves this baseline. (I’ve tried to build one, and can’t quite get there!)

This resource

I’ve been keeping a list of those interaction types, for use while designing agents. That list is below. Each item has a simple example.

Caveats:

  • Items are categories, not test scenarios. Within each category, an agent could pass the simple example here but fail many others.

  • These are atomic. An agent could pass all of these in isolation, yet still fail when they are combined in a single chat.

Nevertheless, I've found this helpful when designing agents. In one project, I turned this checklist into a suite of regression tests.
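
To make that concrete, here's a minimal sketch of what such a test can look like with pytest. `run_agent` and `FakeTaskStore` are hypothetical placeholders, not part of any real project; wire them up to your own agent and tool implementations.

```python
# Sketch of turning checklist items into pytest regression tests.
# run_agent() is a placeholder for your own agent entry point; replace it
# with whatever function takes a user message plus tool implementations.
import pytest


class FakeTaskStore:
    """In-memory task whose method doubles as the agent's only tool."""

    def __init__(self):
        self.status = "open"

    def close_task(self) -> str:
        self.status = "closed"
        return "task closed"


def run_agent(message: str, tools) -> None:
    raise NotImplementedError("wire this up to your agent")


@pytest.mark.parametrize("message", [
    "Close the task",      # Single tool
    "Complete this task",  # Command translation
    "This is done",        # Tool inference
])
def test_task_gets_closed(message):
    store = FakeTaskStore()
    run_agent(message, tools=[store.close_task])
    assert store.status == "closed"
```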

Checklist

Single tool

This is the Hello World example: ask an AI to use a tool, and the AI does it.

human: Close the task

ai: <closes task>
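
For concreteness, here's a minimal sketch of this round trip using the OpenAI Python SDK's tools interface (the current form of function calling). The `close_task` tool, its schema, the system prompt, and the model name are my own illustrations.

```python
# Minimal "single tool" round trip with the OpenAI chat completions API.
# The close_task tool, its schema, and the model name are illustrative.
import json

from openai import OpenAI

client = OpenAI()


def close_task(task_id: str) -> str:
    # Replace with a real call into your task system.
    return f"closed {task_id}"


tools = [{
    "type": "function",
    "function": {
        "name": "close_task",
        "description": "Mark a task as complete.",
        "parameters": {
            "type": "object",
            "properties": {"task_id": {"type": "string"}},
            "required": ["task_id"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You manage tasks. The current task id is t1."},
    {"role": "user", "content": "Close the task"},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]  # assumes the model chose the tool
if call.function.name == "close_task":
    result = close_task(**json.loads(call.function.arguments))
```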

Command translation

This is low-hanging fruit for an LLM, yet it can still fail at the extremes, e.g. slang.

human: Complete this task

ai: <closes task>

Tool inference

“This is done” isn’t an instruction, yet the AI recognizes that a tool is appropriate.

human: This is done

ai: <closes task>

Confirmation (human-in-the-loop)

The AI must confirm tool use with the human, inline in the chat, before executing.

human: Close the task

ai: Okay, I’ll close the task. Confirm? [Y/n]

human: Y

ai: <closes task>
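
One way to implement this is to intercept the model's proposed tool call, render it as a confirmation prompt, and only execute on a yes. A rough sketch, assuming the OpenAI tool-call object shape; the prompt wording and y/n parsing are my own choices:

```python
# Human-in-the-loop gate between the model's proposed tool call and execution.
# `call` is an OpenAI tool-call object; tool_registry maps names to functions.
import json


def describe_call(call) -> str:
    """Render a pending tool call as a confirmation prompt."""
    args = json.loads(call.function.arguments)
    return f"Okay, I'll run {call.function.name}({args}). Confirm? [Y/n] "


def confirm_and_run(call, tool_registry):
    """Execute the tool only if the human confirms; otherwise report the refusal."""
    answer = input(describe_call(call)).strip().lower()
    if answer in ("", "y", "yes"):
        fn = tool_registry[call.function.name]
        return fn(**json.loads(call.function.arguments))
    return "User declined; the tool was not executed."
```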

Modify a tool execution

This is different from rejecting a confirmation, because the second human message assumes context from the first. The AI must consider the full chat history in its response.

human: Set the due date to July 1st

ai: Okay, I’ll set the due date to July 1. Confirm? [Y/n]

human: Actually, that’s over a holiday weekend, let’s do July 15th. 

ai: Okay, I’ll set the due date to July 15. Confirm? [Y/n]
...
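
A common way to support this is to keep the unconfirmed proposal in the message history, append the human's follow-up, and ask the model again, rather than hard-coding a yes/no branch. A sketch under that assumption; the "not executed" tool result is a workaround for the chat completions requirement that every tool call gets a tool message, and the model name is illustrative.

```python
# The proposed (unexecuted) call stays in history, so a follow-up like
# "let's do July 15th" yields a revised tool call instead of a dead end.
def handle_user_reply(client, messages, proposal_msg, user_reply, tools):
    # proposal_msg is the assistant message that contained the tool call.
    messages.append(proposal_msg)
    # The chat completions API expects a tool message for every tool_call_id,
    # so record that the call was held for confirmation rather than executed.
    for call in proposal_msg.tool_calls or []:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": "Not executed: awaiting user confirmation.",
        })
    messages.append({"role": "user", "content": user_reply})
    # With the original proposal and the correction both in context, the model
    # can emit a new set_due_date call for July 15 (or re-confirm the old one).
    return client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
```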

Multiple tools

A single human message specifies multiple tools.

human: Assign to Bob and set the due date to July 1st. 

ai: <assigns to bob>

ai: <sets due date>
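
With the OpenAI tools interface, a single assistant turn can carry several tool calls (parallel function calling), so the dispatch step just iterates over all of them. A sketch, where `tool_registry` is an assumed name-to-function mapping:

```python
# Dispatch every tool call carried by a single assistant turn.
import json


def dispatch_all(assistant_msg, tool_registry, messages):
    messages.append(assistant_msg)
    for call in assistant_msg.tool_calls or []:
        fn = tool_registry[call.function.name]  # e.g. assign_task, set_due_date
        result = fn(**json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": str(result),
        })
```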

Sequential tools

Note the “and then” here - the human is prescribing an order.

human: Set the due date to July 1st and then assign to Bob. 

ai: <sets due date>

ai: <assigns to bob>

Multi-step task

The human gives a direct instruction that involves two steps, without listing those steps as in the examples above. However, the steps are obvious and can be inferred directly from the set of tools.

human: Can you block 30mins tomorrow morning?

ai: <looks up availability in calendar>

ai: <creates event in calendar>
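
The usual way to get multi-step behavior is an agent loop: call the model, execute any tool calls, append their results, and repeat until the model answers in plain text. A minimal sketch along those lines; the model name and `tool_registry` are assumptions:

```python
# Basic agent loop: keep calling the model until it stops calling tools.
import json


def agent_loop(client, messages, tools, tool_registry, max_steps=10):
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content  # plain-text answer: the task is done
        for call in msg.tool_calls:  # e.g. check_availability, then create_event
            result = tool_registry[call.function.name](
                **json.loads(call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    raise RuntimeError("agent did not finish within max_steps")
```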

Multi-step planning

The agent is given a goal instead of an explicit instruction, and has to reason out steps. This is the type of planning that ReAct and AutoGPT do, but inline in the chat.

Note the Thought: clause. This implies that the AI can’t fully predict the set of steps at the start.

human: Can you find some restaurants for dinner with Jeff on Wednesday?

ai: <looks up Jeff from contacts> 

ai: Thought: Jeff will like expensive restaurants.

ai: <looks up $$$$ restaurants with availability on Wednesday>
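
One lightweight way to get those inline Thought: lines is to ask for them in the system prompt and surface the assistant message's text content, which can accompany tool calls, to the user. A sketch; the prompt wording is mine:

```python
# Surface the model's free-text "Thought:" alongside its tool calls.
SYSTEM_PROMPT = (
    "You are a planning assistant. Before every tool call, write one line "
    "starting with 'Thought:' explaining why you are calling that tool."
)


def show_turn(assistant_msg):
    # An assistant message can carry both text content and tool calls.
    if assistant_msg.content:
        print(assistant_msg.content)  # e.g. "Thought: Jeff will like expensive restaurants."
    for call in assistant_msg.tool_calls or []:
        print(f"<calling {call.function.name}>")
```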

Multi-step reasoning with change course

Same as above, but the user can change course mid-plan. Note that we have to bring back the Confirm? [Y/n] step here.

human: Can you find some restaurants for dinner with Jeff on Wednesday?

ai: Okay, I’ll look up Jeff from contacts. Confirm? [Y/n]

ai: <looks up Jeff from contacts> 

ai: Thought: Jeff will like expensive restaurants.

ai: I’ll look for expensive restaurants with availability on Wednesday. Confirm? [Y/n]

human: No Jeff is chill, just find some good Asian food.

ai: Thought: I should find restaurants with Japanese, Thai, Korean, Chinese cuisine and $$ or $$$ pricing and order by highest rated.
...
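
One way to wire this in is to treat any reply that isn't a yes as a course correction: record that the proposed call was not executed, add the human's reply to the history, and let the model re-plan. A sketch, assuming the assistant message proposing `call` is already in `messages`:

```python
# Inside the agent loop: anything other than a yes is treated as a course
# correction and handed back to the model instead of being executed.
# Assumes the assistant message proposing `call` is already in `messages`.
def gate(call, messages) -> bool:
    answer = input(f"Okay, I'll run {call.function.name}. Confirm? [Y/n] ").strip()
    if answer.lower() in ("", "y", "yes"):
        return True  # caller executes the tool as usual
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": "Not executed: the user replied with a correction instead.",
    })
    messages.append({"role": "user", "content": answer})
    return False  # caller skips execution and lets the model re-plan
```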

Explain its reasoning

This is the first example where the agent responds with free text - no tools are run.

(Here, the reasoning isn’t actually connected to the original commands - but IMO it’s still important to support this.)

human: Can you find some restaurants for dinner with Jeff on Wednesday?

ai: <looks up Jeff from contacts> 

ai: Thought: Jeff will like expensive restaurants.

ai: <looks up $$$$ restaurants with availability on Wednesday>

human: Why did you pick these restaurants?

ai: I selected restaurants marked $$$$ because Jeff’s title is Managing Director, so he probably likes expensive restaurants. 

Summarization

Here, the agent needs the full context of the current task - it can’t solely rely on the chat history.

human: What’s the status of that dinner reservation?

ai: On Monday, we talked about restaurants that you and Jeff would like, and we chose Zen sushi. 
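
One way to give the agent that context is to inject the current task's state into the prompt (or expose it through a read-only tool) instead of relying on the transcript alone. A sketch of the injection approach; `load_task` is a hypothetical helper:

```python
# Prepend the current task's state so summaries don't depend on the chat
# transcript alone. load_task() is a stand-in for however you fetch state.
import json


def build_messages(task_id, chat_history, load_task):
    task = load_task(task_id)  # e.g. {"title": "Dinner with Jeff", "status": "reserved"}
    context = "Current task state:\n" + json.dumps(task, indent=2)
    return [{"role": "system", "content": context}, *chat_history]
```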

Suggest next steps

Here, the agent needs to infer tools from its free-text messages earlier in the chat.

human: What's next for scheduling dinner?

ai: You may want to confirm the reservation with Jeff.

human: Right, can you email him to confirm?

ai: Okay, I'll send this message. Confirm? [Y/n]
...