Context
Agent evaluation often focuses on final-answer quality, but production agent changes can regress in other ways:
- wrong tool selected
- too many tools exposed
- unsafe/destructive tool attempted
- unresolved request rate increases
- cost/latency increases
- context grows with raw tool outputs
- policy/approval decisions change unexpectedly
Proposal
Would it make sense to document a recipe for evaluating agent tool-routing changes across multiple dimensions, rather than only through final-answer assertions?
A possible report could compare current vs candidate agent/router/tool policy on:
- success
- correct_tool
- unsafe_action_rate
- unresolved_rate
- avg_cost
- avg_latency
- context_tokens
- policy_decision_delta
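As a rough illustration of what such a report could compute, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the `CaseResult` record, its field names, the `summarize` and `compare` helpers, and the choice to define `policy_decision_delta` as the fraction of paired cases whose policy decision changed. It is not an existing promptfoo API, and it assumes both runs cover the same test cases in the same order.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    """One evaluated test case (hypothetical schema, for illustration only)."""
    success: bool
    correct_tool: bool
    unsafe_action: bool
    unresolved: bool
    cost_usd: float
    latency_s: float
    context_tokens: int
    policy_decision: str  # e.g. "allow", "deny", "needs_approval"


def summarize(results):
    """Aggregate per-case results into the report dimensions."""
    n = len(results)
    return {
        "success": sum(r.success for r in results) / n,
        "correct_tool": sum(r.correct_tool for r in results) / n,
        "unsafe_action_rate": sum(r.unsafe_action for r in results) / n,
        "unresolved_rate": sum(r.unresolved for r in results) / n,
        "avg_cost": mean(r.cost_usd for r in results),
        "avg_latency": mean(r.latency_s for r in results),
        "context_tokens": mean(r.context_tokens for r in results),
    }


def compare(current, candidate):
    """Build a current-vs-candidate report with a delta per dimension.

    Assumes both lists hold results for the same cases in the same order.
    """
    cur, cand = summarize(current), summarize(candidate)
    report = {
        name: {
            "current": cur[name],
            "candidate": cand[name],
            "delta": cand[name] - cur[name],
        }
        for name in cur
    }
    # Assumed definition: share of paired cases whose policy decision changed.
    changed = sum(
        a.policy_decision != b.policy_decision
        for a, b in zip(current, candidate)
    )
    report["policy_decision_delta"] = changed / len(current)
    return report


current = [
    CaseResult(True, True, False, False, 0.01, 1.0, 1000, "allow"),
    CaseResult(False, False, False, True, 0.02, 2.0, 2000, "deny"),
]
candidate = [
    CaseResult(True, True, True, False, 0.03, 1.0, 3000, "allow"),
    CaseResult(True, True, False, False, 0.03, 1.0, 3000, "allow"),
]
report = compare(current, candidate)
```

In this toy data the candidate raises `success` but also introduces an unsafe action and flips one policy decision, which is the kind of trade-off a single final-answer assertion would not surface.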
Why this might help
For agent workflows, prompt quality and routing quality are coupled. A candidate change can improve the final answer on a small set while worsening safety, cost, or tool correctness elsewhere.
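One way to make this coupling explicit is to treat the comparison as a deployment gate: the candidate is accepted only if success does not drop and no guardrail metric rises beyond a tolerance. The sketch below is hypothetical; the `gate` function, the metric dictionaries, and the tolerance values in `MAX_INCREASE` are placeholders chosen for the example, not recommended thresholds.

```python
# Hypothetical tolerances: the largest increase allowed per guardrail metric.
MAX_INCREASE = {
    "unsafe_action_rate": 0.0,   # any increase blocks the change
    "unresolved_rate": 0.02,
    "avg_cost": 0.005,           # same unit as the reported cost
    "avg_latency": 0.5,          # seconds
}


def gate(current, candidate, max_increase=MAX_INCREASE):
    """Return (ok, reasons) for a candidate summary versus the current one."""
    reasons = []
    if candidate["success"] < current["success"]:
        reasons.append("success dropped")
    for metric, limit in max_increase.items():
        increase = candidate[metric] - current[metric]
        if increase > limit:
            reasons.append(f"{metric} rose by {increase:.3f} (limit {limit})")
    return (not reasons, reasons)


current_summary = {
    "success": 0.80,
    "unsafe_action_rate": 0.0,
    "unresolved_rate": 0.05,
    "avg_cost": 0.010,
    "avg_latency": 2.0,
}
candidate_summary = {
    "success": 0.90,             # better final answers...
    "unsafe_action_rate": 0.02,  # ...but unsafe tool attempts appear
    "unresolved_rate": 0.05,
    "avg_cost": 0.011,
    "avg_latency": 2.1,
}
ok, reasons = gate(current_summary, candidate_summary)
```

Here the candidate improves `success` yet fails the gate because `unsafe_action_rate` increased, so the report reads as a ship/no-ship decision rather than a single score.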
Related exploration
I have been exploring this in a small open-source lab for offline evaluation of agent routing/tool-selection policies:
https://github.com/dgenio/agent-routing-eval-lab
The broader pattern could be useful as a promptfoo recipe: evaluate agent/tool changes as deployment decisions, not just answer-level tests.