Docs idea: evaluating agent tool-routing changes beyond final-answer quality #9673

## Context

Agent evaluation often focuses on final-answer quality, but production agent changes can regress in other ways:

- wrong tool selected
- too many tools exposed
- unsafe/destructive tool attempted
- unresolved request rate increases
- cost/latency increases
- context grows with raw tool outputs
- policy/approval decisions change unexpectedly

## Proposal

Would it make sense to document a recipe for evaluating agent tool-routing changes along multiple dimensions, rather than only through final-answer assertions?

A possible report could compare the current and candidate agent, router, or tool policy on:

```text
success
correct_tool
unsafe_action_rate
unresolved_rate
avg_cost
avg_latency
context_tokens
policy_decision_delta
```
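To make the report shape concrete, here is a minimal Python sketch of how these fields could be computed from per-case results. Everything in it is an assumption for illustration: the `CaseResult` record, its field names, and the helper functions are hypothetical and are not part of promptfoo or the linked lab. It also assumes both runs cover the same test cases in the same order, and it interprets `policy_decision_delta` as the fraction of cases whose policy decision differs between the two runs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    """Hypothetical per-test-case record; field names are illustrative only."""
    success: bool
    expected_tool: str
    selected_tool: str
    unsafe_action: bool
    unresolved: bool
    cost: float
    latency_ms: float
    context_tokens: int
    policy_decision: str  # e.g. "allow", "deny", "needs_approval"


def summarize(results):
    """Aggregate one run into rates and averages."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one case result")
    return {
        "success": sum(r.success for r in results) / n,
        "correct_tool": sum(r.selected_tool == r.expected_tool for r in results) / n,
        "unsafe_action_rate": sum(r.unsafe_action for r in results) / n,
        "unresolved_rate": sum(r.unresolved for r in results) / n,
        "avg_cost": mean(r.cost for r in results),
        "avg_latency": mean(r.latency_ms for r in results),
        "context_tokens": mean(r.context_tokens for r in results),
    }


def compare(current, candidate):
    """Build a current-vs-candidate report with per-metric deltas."""
    if len(current) != len(candidate):
        raise ValueError("runs must cover the same cases in the same order")
    cur, cand = summarize(current), summarize(candidate)
    report = {
        name: {"current": cur[name], "candidate": cand[name], "delta": cand[name] - cur[name]}
        for name in cur
    }
    # Assumed definition: share of paired cases whose policy decision changed.
    changed = sum(a.policy_decision != b.policy_decision for a, b in zip(current, candidate))
    report["policy_decision_delta"] = changed / len(current)
    return report


# Tiny made-up example: the candidate fixes routing on the second case,
# but that case becomes more expensive and now requires approval.
current_run = [
    CaseResult(True, "search_docs", "search_docs", False, False, 0.5, 100.0, 1000, "allow"),
    CaseResult(False, "create_ticket", "search_docs", False, True, 0.5, 300.0, 3000, "allow"),
]
candidate_run = [
    CaseResult(True, "search_docs", "search_docs", False, False, 0.5, 100.0, 1000, "allow"),
    CaseResult(True, "create_ticket", "create_ticket", False, False, 1.5, 500.0, 5000, "needs_approval"),
]

report = compare(current_run, candidate_run)
for name, value in report.items():
    print(name, value)
```

Keeping the per-case records paired makes it possible to report decision changes case by case, which an aggregate-only comparison would hide.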

## Why this might help

For agent workflows, prompt quality and routing quality are coupled. A candidate change can improve the final answer on a small set while worsening safety, cost, or tool correctness elsewhere.
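As a rough illustration of that coupling, the sketch below treats the comparison as a deployment gate: a candidate is flagged if any metric regresses beyond a tolerance, even when `success` improves. The tolerance values, the `find_regressions` helper, and the sample numbers are all hypothetical and chosen only to show the idea.

```python
# Hypothetical guardrails: (direction, absolute tolerance for regression).
# "higher" means larger values are better; "lower" means smaller values are better.
TOLERANCES = {
    "success": ("higher", 0.0),
    "correct_tool": ("higher", 0.01),
    "unsafe_action_rate": ("lower", 0.0),
    "avg_cost": ("lower", 0.002),
}


def find_regressions(current, candidate, tolerances=TOLERANCES):
    """Return the metrics where the candidate is worse than allowed."""
    violations = []
    for metric, (direction, tolerance) in tolerances.items():
        delta = candidate[metric] - current[metric]
        # Normalize so a positive value always means "got worse".
        regression = -delta if direction == "higher" else delta
        if regression > tolerance:
            violations.append(metric)
    return violations


# Made-up numbers: final-answer success improves, but tool correctness
# and safety both get worse.
current_metrics = {"success": 0.80, "correct_tool": 0.90, "unsafe_action_rate": 0.00, "avg_cost": 0.010}
candidate_metrics = {"success": 0.85, "correct_tool": 0.88, "unsafe_action_rate": 0.02, "avg_cost": 0.011}

violations = find_regressions(current_metrics, candidate_metrics)
print("block deployment" if violations else "ok to deploy", violations)
```

A final-answer-only assertion would pass this candidate; the gate surfaces the trade-off so it can be reviewed explicitly.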

## Related exploration

I have been exploring this in a small open-source lab for offline evaluation of agent routing/tool-selection policies:

https://github.com/dgenio/agent-routing-eval-lab

The broader pattern could be useful as a promptfoo recipe: evaluate agent/tool changes as deployment decisions, not just answer-level tests.
