Evals are automated tests that simulate conversations with your agent and assert on the results. You write a conversation script with expected behaviors, and the eval runner plays it against your running agent and reports pass/fail.
Create an eval
Create .eval.ts files in the evals/ directory at your agent root:
import { Eval } from "@botpress/runtime"

export default new Eval({
  name: "greeting",
  description: "Agent should greet the user",
  conversation: [
    {
      user: "Hello",
      assert: {
        response: [
          { llm_judge: "The response is a friendly greeting" },
        ],
      },
    },
  ],
})
Each file can export one or more Eval instances as the default export or named exports.
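For example, a single file could group two related evals as named exports. This is a minimal sketch; the eval names and messages are illustrative:

export const politeGreeting = new Eval({
  name: "polite-greeting",
  conversation: [
    {
      user: "Hello",
      assert: {
        response: [{ llm_judge: "The response is a friendly greeting" }],
      },
    },
  ],
})

export const farewell = new Eval({
  name: "farewell",
  conversation: [
    {
      user: "Bye!",
      assert: {
        response: [{ llm_judge: "The response acknowledges the goodbye" }],
      },
    },
  ],
})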
Conversation turns
An eval’s conversation is a sequence of turns. Each turn sends a user message, fires an event, or asserts that the agent stays silent, and can optionally assert on what happens in response.
Send user messages:
conversation: [
  {
    user: "What's the weather in Paris?",
    assert: {
      response: [{ contains: "Paris" }],
      tools: [{ called: "getWeather", params: { city: { equals: "Paris" } } }],
    },
  },
  {
    user: "And in London?",
    assert: {
      response: [{ contains: "London" }],
    },
  },
]
Fire an event instead of a message:
{
  event: {
    type: "order.placed",
    payload: { orderId: "ord-123", total: 49.99 },
  },
  assert: {
    response: [{ contains: "order" }],
  },
}
Assert the agent doesn’t respond:
{
  user: "ok thanks",
  expectSilence: true,
}
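The turn kinds can be mixed freely within one conversation. A sketch combining all three, assuming a hypothetical order.shipped event type:

conversation: [
  {
    user: "Where is my order?",
    assert: {
      response: [{ contains: "order" }],
    },
  },
  {
    // Hypothetical event type and payload for illustration.
    event: {
      type: "order.shipped",
      payload: { orderId: "ord-123" },
    },
    assert: {
      response: [{ contains: "shipped" }],
    },
  },
  {
    user: "great, thanks",
    expectSilence: true,
  },
]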
Assertions
Each turn’s assert block can hold any combination of the assertion types below.
Response
Assert on the text of the agent’s reply:
| Assertion | Description |
|---|---|
| { contains: "text" } | Response includes this string |
| { not_contains: "text" } | Response does not include this string |
| { matches: "regex" } | Response matches this regex pattern |
| { similar_to: "text" } | Response is semantically similar to this text |
| { llm_judge: "criteria" } | An LLM evaluates whether the response meets the criteria (see LLM judge) |
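Several response assertions can apply to the same turn. A sketch using the pattern- and similarity-based checks; the regex and expected wording are illustrative:

assert: {
  response: [
    { matches: "\\d+\\s?°[CF]" },                         // reply includes a temperature such as "21°C"
    { similar_to: "It's currently sunny and warm in Paris" },
    { not_contains: "I'm not sure" },
  ],
}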
Tools
Assert on which tools the agent called:
| Assertion | Description |
|---|---|
| { called: "toolName" } | Tool was called |
| { called: "toolName", params: { key: { equals: "value" } } } | Tool was called with matching parameters |
| { not_called: "toolName" } | Tool was not called |
| { call_order: ["tool1", "tool2"] } | Tools were called in this order |
Use these operators inside params to match tool arguments:
| Operator | Description |
|---|---|
| { equals: value } | Exact match |
| { contains: "text" } | String contains |
| { not_contains: "text" } | String does not contain |
| { matches: "regex" } | Regex match |
| { in: [1, 2, 3] } | Value is in array |
| { exists: true } | Field exists |
| { gte: 10 } | Greater than or equal |
| { lte: 100 } | Less than or equal |
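Putting the tool assertions and parameter operators together, a turn might check call order, argument values, and that a destructive tool was never used. The lookupUser, createTicket, and deleteTicket tool names below are hypothetical:

assert: {
  tools: [
    { call_order: ["lookupUser", "createTicket"] },
    {
      called: "createTicket",
      params: {
        priority: { in: ["low", "medium", "high"] },
        title: { contains: "VPN" },
        assigneeId: { exists: true },
      },
    },
    { not_called: "deleteTicket" },
  ],
}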
State
Assert on state changes after a turn:
assert: {
  state: [
    { path: "messageCount", changed: true },
    { path: "topic", equals: "weather" },
  ],
}
Tables
Assert on table data:
assert: {
  tables: [
    { table: "OrderTable", row_exists: { userId: { equals: "user123" } } },
    { table: "OrderTable", row_count: { gte: 1 }, where: { status: { equals: "pending" } } },
  ],
}
Workflows
Assert on workflow state:
assert: {
  workflow: [
    { name: "processOrder", entered: true },
    { name: "processOrder", completed: true },
  ],
}
Timing
Assert on response time:
assert: {
  timing: [
    { response_time: { lte: 5000 } },
  ],
}
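Since a turn's assert block can hold any combination of these assertion types, checks are often combined on a single turn. A sketch mixing response, tool, state, and timing assertions, reusing the weather example from above:

{
  user: "What's the weather in Paris?",
  assert: {
    response: [{ contains: "Paris" }],
    tools: [{ called: "getWeather", params: { city: { equals: "Paris" } } }],
    state: [{ path: "topic", equals: "weather" }],
    timing: [{ response_time: { lte: 5000 } }],
  },
}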
LLM judge
Use llm_judge when the right answer isn’t a fixed string. You describe what a good response looks like in plain English, and an LLM scores the agent’s reply against it. It’s the right choice for subjective checks like tone, intent, or whether the agent understood the user:
assert: {
  response: [
    { llm_judge: "The response is polite and professional" },
    { llm_judge: "The response correctly identifies the user's issue" },
  ],
}
Configure the judge model and pass threshold under Configure eval behavior.
Setup and outcome
setup runs before the first turn. outcome runs after the last.
Setup
Seed state or trigger a workflow before the conversation starts:
new Eval({
  name: "returning-user",
  setup: {
    state: {
      user: { name: "Alice", visitCount: 5 },
      conversation: { topic: "billing" },
    },
  },
  conversation: [
    {
      user: "Hi",
      assert: {
        response: [{ contains: "Alice" }],
      },
    },
  ],
})
Trigger a workflow instead of (or in addition to) seeding state:
setup: {
  workflow: {
    trigger: "onboarding",
    input: { userId: "user123" },
  },
},
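The two setup options can also be combined, seeding state and triggering a workflow before the first turn. The state values below are illustrative:

setup: {
  state: {
    user: { name: "Alice", visitCount: 5 },
  },
  workflow: {
    trigger: "onboarding",
    input: { userId: "user123" },
  },
},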
Outcome
Assert on state, tables, or workflows after the entire conversation completes. Outcome assertions use the same shapes as turn assertions:
new Eval({
  name: "full-flow",
  conversation: [
    { user: "Create a ticket for VPN issues" },
    { user: "Set priority to high" },
  ],
  outcome: {
    tables: [
      { table: "TicketTable", row_exists: { priority: { equals: "high" } } },
    ],
    workflow: [
      { name: "createTicket", completed: true },
    ],
  },
})
Configure eval behavior
Set defaults for all evals in agent.config.ts:
evals: {
  judgeModel: "openai:gpt-4o",
  judgePassThreshold: 3,
  idleTimeout: 15000,
},
| Option | Type | Description |
|---|---|---|
| judgeModel | string | Model for llm_judge assertions. Defaults to "fast" |
| judgePassThreshold | number (1–5) | Minimum score for llm_judge to pass. Defaults to 3 |
| idleTimeout | number (ms) | How long to wait for the agent to respond. Defaults to 15000 |
Override judgePassThreshold or idleTimeout on a specific eval:
new Eval({
  name: "quality-check",
  options: {
    judgePassThreshold: 4,
    idleTimeout: 30000,
  },
  conversation: [/* ... */],
})
Organize evals
Evals have two organizing fields that pair with CLI filters: tags and type.
Tags are free-form labels. Use them to group evals you want to run together (smoke tests, critical paths, slow suites):
new Eval({
  name: "greeting",
  tags: ["smoke", "core"],
  conversation: [/* ... */],
})
adk evals --tag smoke # Run only evals tagged "smoke"
Type marks what the eval is for:
capability: tests that a feature works as designed
regression: reproduces a past bug so it doesn’t come back
new Eval({
  name: "fix-for-ticket-priority-bug",
  type: "regression",
  conversation: [/* ... */],
})
adk evals --type regression # Run only regression evals
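A capability eval uses the same shape with the other type value. A sketch built from the weather example used earlier; the eval name and tag are illustrative:

new Eval({
  name: "weather-lookup",
  type: "capability",
  tags: ["core"],
  conversation: [
    {
      user: "What's the weather in Paris?",
      assert: {
        tools: [{ called: "getWeather" }],
        response: [{ contains: "Paris" }],
      },
    },
  ],
})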
Run evals
From the CLI:
adk evals # Run all evals
adk evals greeting # Run a specific eval by name
adk evals --tag smoke # Run evals with a specific tag
adk evals --type regression # Run only regression evals
adk evals -v # Show full details, not just failures
View past runs:
adk evals runs # List recent runs
adk evals runs --latest # Show the latest run
adk evals runs <run-id> # Show a specific run
You can also run and view evals from the dev console under Evals.