
Docs idea: evaluating agent tool-routing changes beyond final-answer quality #9673

@dgenio


Context

Agent evaluation often focuses on final-answer quality, but production agent changes can regress in other ways:

  • wrong tool selected
  • too many tools exposed
  • unsafe/destructive tool attempted
  • unresolved request rate increases
  • cost/latency increases
  • context grows with raw tool outputs
  • policy/approval decisions change unexpectedly

Proposal

Would it make sense to document a recipe for evaluating agent tool-routing changes along multiple dimensions, rather than relying only on final-answer assertions?

A possible report could compare current vs candidate agent/router/tool policy on:

  • success
  • correct_tool
  • unsafe_action_rate
  • unresolved_rate
  • avg_cost
  • avg_latency
  • context_tokens
  • policy_decision_delta
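
As a rough illustration of what such a report could compute, here is a minimal Python sketch. It is not promptfoo's API. The `RunRecord` fields, the `summarize` and `compare` helpers, and the convention that an unresolved request has `selected_tool=None` are all assumptions made for this example. It also assumes the current and candidate runs cover the same test cases in the same order.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunRecord:
    """One evaluated request. All field names are hypothetical."""
    success: bool
    expected_tool: str
    selected_tool: Optional[str]  # assumption: None means the request went unresolved
    unsafe_action: bool
    cost_usd: float
    latency_ms: float
    context_tokens: int
    policy_decision: str  # e.g. "allow", "deny", "needs_approval"


def summarize(records):
    """Aggregate per-request records into the metrics listed above."""
    n = len(records)
    return {
        "success": sum(r.success for r in records) / n,
        "correct_tool": sum(r.selected_tool == r.expected_tool for r in records) / n,
        "unsafe_action_rate": sum(r.unsafe_action for r in records) / n,
        "unresolved_rate": sum(r.selected_tool is None for r in records) / n,
        "avg_cost": mean(r.cost_usd for r in records),
        "avg_latency": mean(r.latency_ms for r in records),
        "context_tokens": mean(r.context_tokens for r in records),
    }


def compare(current, candidate):
    """Build a current-vs-candidate report; records are paired by position."""
    if not current or len(current) != len(candidate):
        raise ValueError("need two non-empty runs over the same test cases")
    cur, cand = summarize(current), summarize(candidate)
    report = {
        name: {"current": cur[name], "candidate": cand[name], "delta": cand[name] - cur[name]}
        for name in cur
    }
    # Fraction of test cases whose policy/approval decision changed.
    changed = sum(a.policy_decision != b.policy_decision for a, b in zip(current, candidate))
    report["policy_decision_delta"] = {"changed_fraction": changed / len(current)}
    return report
```

Here `policy_decision_delta` is reported as the fraction of paired cases whose decision changed. That is one possible definition; a per-decision breakdown might be more useful in practice.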

Why this might help

For agent workflows, prompt quality and routing quality are coupled. A candidate change can improve the final answer on a small set while worsening safety, cost, or tool correctness elsewhere.
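
To make that trade-off concrete, the sketch below treats the comparison as a gate: a candidate is blocked if any guardrail metric regresses beyond a tolerance, even when `success` improves. The `GUARDRAILS` thresholds, the `regressions` and `decide` helpers, and the "ship" / "hold" / "block" labels are hypothetical and chosen only for illustration.

```python
# Hypothetical guardrails: metric -> (direction, largest regression tolerated).
GUARDRAILS = {
    "correct_tool": ("higher_is_better", 0.0),
    "unsafe_action_rate": ("lower_is_better", 0.0),
    "unresolved_rate": ("lower_is_better", 0.01),
    "avg_cost": ("lower_is_better", 0.002),
    "avg_latency": ("lower_is_better", 100.0),
}


def regressions(current, candidate, guardrails=GUARDRAILS):
    """Return the guardrail metrics on which the candidate got worse than allowed."""
    failed = []
    for metric, (direction, tolerance) in guardrails.items():
        delta = candidate[metric] - current[metric]
        # Normalize so that a positive value always means "got worse".
        worsening = -delta if direction == "higher_is_better" else delta
        if worsening > tolerance:
            failed.append(metric)
    return failed


def decide(current, candidate):
    """Block on any guardrail regression; otherwise ship only if success improved."""
    failed = regressions(current, candidate)
    if failed:
        return "block", failed
    improved = candidate["success"] > current["success"]
    return ("ship" if improved else "hold"), failed
```

With these thresholds, a candidate that raises `success` from 0.80 to 0.85 but moves `unsafe_action_rate` from 0.0 to 0.02 would be blocked, which is the kind of regression an answer-only test would miss.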

Related exploration

I have been exploring this in a small open-source lab for offline evaluation of agent routing/tool-selection policies:

https://github.com/dgenio/agent-routing-eval-lab

The broader pattern could be useful as a promptfoo recipe: evaluate agent/tool changes as deployment decisions, not just answer-level tests.
