

PandaProbe evaluation helps you measure how well your agents perform after traces have been captured. You can score a single trace to understand one request, or score an entire session to understand the lifecycle of an agent across multiple steps. Evaluations produce structured scores with reasons and metadata, so you can debug failures, compare changes, track trends, and monitor production quality over time.
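
To make that output concrete, here is a minimal sketch of what a structured score could carry. The field names (`metric`, `score`, `reason`, `metadata`, `trace_id`, `eval_run_id`) are illustrative assumptions based on the description above, not the documented PandaProbe schema:

```python
# Illustrative shape only; field names and scale are assumptions, not the documented schema.
example_score = {
    "eval_run_id": "run_123",         # eval run that produced this score (assumed field)
    "trace_id": "trace_456",          # trace (or session) the score is linked to (assumed field)
    "metric": "task_completion",      # built-in metric name (see the tables below)
    "score": 0.25,                    # numeric result (scale assumed)
    "reason": "The agent retrieved data but never answered the user's question.",
    "metadata": {"judge_stages": 2},  # rich metadata attached by the metric (contents assumed)
}
```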

Two levels of evaluation

PandaProbe evaluates at two levels. Each answers a different question:

Trace Evaluation

“How well did the agent handle this single request?” Scores individual traces for task completion, tool use, arguments, planning, coherence, loops, and related quality signals.

Agent Evaluation

“How reliable is this agent across an entire session?” Scores sessions by aggregating trace-level signals across the full agent lifecycle, capturing reliability, consistency, and worst-case failures.

Start with trace evaluation when you need to debug specific failures. Use session evaluation when you need to understand how an agent behaves across a complete conversation, workflow, or task.

How it works

Evaluations run asynchronously in the background. You create an eval run from the dashboard or API; PandaProbe then resolves the matching traces or sessions, runs the selected metrics, and stores the results as scores.
1. Create an eval run. Select the target type, metrics, and filters. You can evaluate all matching data, filter by fields such as date range, status, session, user, or tags, and sample a fraction of results to control cost (see the example request after this list).

2. Background processing. A worker executes each metric against the selected traces or sessions. Some trace metrics use an LLM judge, while others use embedding or similarity analysis. Session metrics aggregate trace-level signals.

3. Scores are persisted. Each metric produces a score, a reason, and rich metadata. Scores are stored and linked to the originating eval run, trace, or session.

4. Review and iterate. View scores in the dashboard, query them via the API, track trends over time, and set up recurring monitors to evaluate new data automatically.
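
The example below sketches what creating an eval run programmatically might look like. PandaProbe's actual endpoint paths, authentication scheme, and request fields are not specified on this page, so every name here (`/v1/eval-runs`, `target_type`, `sample_rate`, and so on) is an assumption used only to illustrate selecting a target type, metrics, filters, and a sampling fraction; see the Run via API page for the real interface.

```python
import os
import requests

# Hypothetical sketch: endpoint path, auth header, and field names are assumptions,
# not the documented PandaProbe API.
API_BASE = "https://api.pandaprobe.com"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['PANDAPROBE_API_KEY']}"}

eval_run = {
    "target_type": "trace",  # or "session" for agent-level evaluation
    "metrics": ["task_completion", "tool_correctness", "argument_correctness"],
    "filters": {  # evaluate only matching traces
        "date_range": {"from": "2024-06-01", "to": "2024-06-07"},
        "status": "error",
        "tags": ["checkout-agent"],
    },
    "sample_rate": 0.2,  # score a fraction of matches to control cost
}

response = requests.post(f"{API_BASE}/v1/eval-runs", json=eval_run, headers=headers)
response.raise_for_status()
print(response.json())  # the run then executes asynchronously in the background
```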

Built-in metrics at a glance

PandaProbe includes metrics for both single-trace quality and session-level agent behavior.

Trace-level metrics

Metric               | Method              | What it measures
task_completion      | LLM judge (2-stage) | Did the agent accomplish the user’s objective?
tool_correctness     | LLM judge (2-stage) | Did the agent select the right tools?
argument_correctness | LLM judge (3-stage) | Were tool call arguments correct?
step_efficiency      | LLM judge (2-stage) | Did the agent execute with minimal unnecessary steps?
confidence           | LLM judge (1-stage) | Were the agent’s actions decisive and well-founded?
plan_adherence       | LLM judge (3-stage) | Did the agent follow its declared plan?
plan_quality         | LLM judge (3-stage) | Is the agent’s plan complete and well-structured?
coherence            | Embedding distance  | Does the output logically follow from the input?
loop_detection       | Hybrid similarity   | Is the agent stuck repeating itself across traces?
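
The table lists coherence as an embedding-distance metric rather than an LLM judge. Purely as an illustration of that style of scoring (not PandaProbe's actual implementation, whose embedding model and scaling are not documented here), cosine similarity between an input embedding and an output embedding can be rescaled into a 0-to-1 coherence-style score:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coherence_style_score(input_emb: np.ndarray, output_emb: np.ndarray) -> float:
    """Map cosine similarity (-1..1) onto a 0..1 score. Illustrative only."""
    return (cosine_similarity(input_emb, output_emb) + 1.0) / 2.0

# Toy vectors standing in for real input/output embeddings.
print(coherence_style_score(np.array([0.1, 0.9, 0.3]), np.array([0.2, 0.8, 0.4])))
```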

Session-level metrics

Metric            | Method                        | What it measures
agent_reliability | Max-compose + top-k tail risk | Worst-case failure risk across the session
agent_consistency | Weighted RMS aggregation      | Overall stability and smooth operation
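
The aggregation methods named above (weighted RMS, max-compose with top-k tail risk) are not defined in detail on this page. The sketch below shows one plausible reading, purely for intuition: a consistency-style aggregate as a weighted root-mean-square of trace scores, and tail risk as the mean of the k worst trace scores. PandaProbe's actual formulas may differ:

```python
import math

def weighted_rms(scores: list[float], weights: list[float]) -> float:
    """Weighted root-mean-square of trace-level scores (illustrative consistency-style aggregate)."""
    total_weight = sum(weights)
    return math.sqrt(sum(w * s * s for s, w in zip(scores, weights)) / total_weight)

def top_k_tail_risk(scores: list[float], k: int = 3) -> float:
    """Mean of the k lowest trace scores: a worst-case (tail) view of the session."""
    worst = sorted(scores)[:k]
    return sum(worst) / len(worst)

# Toy session: per-trace quality scores in 0..1, with later traces weighted more heavily.
trace_scores = [0.9, 0.85, 0.4, 0.95, 0.2]
weights = [1.0, 1.0, 1.5, 1.5, 2.0]

print("consistency-style aggregate:", weighted_rms(trace_scores, weights))
print("tail risk (3 worst traces):", top_k_tail_risk(trace_scores))
```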

How to choose

Use trace evaluation when you want to inspect or regress-test individual interactions:
  • Did this request succeed?
  • Did the agent call the right tools?
  • Were the tool arguments correct?
  • Did the output follow from the input?
Use session evaluation when the behavior only makes sense across multiple traces:
  • Did the agent stay reliable across the whole task?
  • Did quality degrade over a conversation?
  • Did retries, loops, or inefficient steps affect the final outcome?
  • Which sessions need review first?

Scheduling evaluations

Beyond one-off eval runs, PandaProbe supports monitors: recurring evaluation schedules that automatically create eval runs on a cadence (every_6h, daily, weekly, or custom cron). Monitors can skip runs when no new data has arrived, which helps control evaluation cost.
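
As with eval runs, the request below is a hypothetical sketch of creating a monitor; the endpoint and field names (`/v1/monitors`, `cadence`, `skip_if_no_new_data`) are assumptions chosen to mirror the cadence options described above, not the documented API.

```python
import os
import requests

# Hypothetical sketch; the real monitor interface is covered on the Scheduling Evaluations page.
API_BASE = "https://api.pandaprobe.com"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['PANDAPROBE_API_KEY']}"}

monitor = {
    "name": "nightly-trace-quality",
    "target_type": "trace",
    "metrics": ["task_completion", "loop_detection"],
    "cadence": "daily",           # every_6h, daily, weekly, or a custom cron
    "cron": None,                 # e.g. "0 3 * * *" when using a custom cadence (assumed field)
    "skip_if_no_new_data": True,  # skip runs when no new data has arrived (assumed field)
    "filters": {"tags": ["production"]},
}

response = requests.post(f"{API_BASE}/v1/monitors", json=monitor, headers=headers)
response.raise_for_status()
```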

Scheduling Evaluations

Set up automated evaluation monitors with custom cadences and filters.

Next steps

Core Concepts

Understand the evaluation model: runs, scores, signals, and aggregation.

Set Up Evaluation

Choose dashboard, API, or scheduled monitors for running evaluations.

Run via API

Create eval runs programmatically using the evaluation API.