PandaProbe evaluation helps you measure how well your agents perform after traces have been captured. You can score a single trace to understand one request, or score an entire session to understand the lifecycle of an agent across multiple steps. Evaluations produce structured scores with reasons and metadata, so you can debug failures, compare changes, track trends, and monitor production quality over time.

Documentation Index
Fetch the complete documentation index at: https://docs.pandaprobe.com/llms.txt
Use this file to discover all available pages before exploring further.
Two levels of evaluation
PandaProbe evaluates at two levels. Each answers a different question:

Trace Evaluation
“How well did the agent handle this single request?”
Scores individual traces for task completion, tool use, arguments, planning, coherence, loops, and related quality signals.
Agent Evaluation
“How reliable is this agent across an entire session?”
Scores sessions by aggregating trace-level signals across the full agent lifecycle, capturing reliability, consistency, and worst-case failures.
How it works
Evaluations run asynchronously in the background. You create an eval run from the dashboard or API; PandaProbe resolves the matching traces or sessions, runs the selected metrics, and stores the results as scores.

Create an eval run
Select the target type, metrics, and filters. You can evaluate all matching data, filter by fields such as date range, status, session, user, or tags, and sample a fraction of results to control cost.
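For a concrete sense of what creating an eval run over the API can look like, here is a minimal Python sketch. The base URL, endpoint path, payload field names, and auth header are illustrative assumptions, not the documented API; see Run via API for the actual contract.

```python
import os
import requests

# Hypothetical sketch of creating an eval run over HTTP. The base URL, path,
# payload field names, and auth header are assumptions, not the documented API.
API_BASE = "https://api.pandaprobe.com"  # assumed base URL
API_KEY = os.environ["PANDAPROBE_API_KEY"]

payload = {
    "target_type": "trace",                            # or "session" for agent evaluation
    "metrics": ["task_completion", "tool_correctness"],
    "filters": {
        "date_from": "2024-06-01",
        "date_to": "2024-06-07",
        "tags": ["production"],
    },
    "sample_rate": 0.2,                                 # evaluate 20% of matches to control cost
}

resp = requests.post(
    f"{API_BASE}/v1/eval-runs",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
run = resp.json()
print(run["id"], run["status"])
```

Sampling a fraction of the matching data, as in the payload above, is the main lever for keeping LLM-judge costs predictable.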
Background processing
A worker executes each metric against the selected traces or sessions. Some trace metrics use an LLM judge, while others use embedding or similarity analysis. Session metrics aggregate trace-level signals.
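Because runs execute in the background, a client typically polls until the run finishes before reading scores. A minimal sketch, again with an assumed endpoint and assumed status values:

```python
import time
import requests

# Hypothetical polling loop for an asynchronous eval run. The endpoint path
# and the status values ("completed", "failed") are assumptions.
def wait_for_run(api_base: str, api_key: str, run_id: str, poll_seconds: int = 10) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        resp = requests.get(f"{api_base}/v1/eval-runs/{run_id}", headers=headers, timeout=30)
        resp.raise_for_status()
        run = resp.json()
        if run["status"] in ("completed", "failed"):
            return run
        time.sleep(poll_seconds)
```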
Scores are persisted
Each metric produces a score, a reason, and rich metadata. Scores are stored and linked to the originating eval run, trace, or session.
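The snippet below illustrates roughly what a persisted score record could contain. The field names are assumptions drawn from the description above (a score, a reason, metadata, and links back to the run and trace or session), not a documented schema.

```python
# Illustrative shape of a persisted score. Field names are assumptions based on
# the description above, not a documented schema.
example_score = {
    "eval_run_id": "run_123",        # the eval run that produced this score
    "trace_id": "trace_456",         # or "session_id" for session-level metrics
    "metric": "task_completion",
    "score": 0.85,                   # normalized metric value
    "reason": "The agent located the requested record and confirmed the update.",
    "metadata": {"judge_stages": 2}, # metric-specific details
}
```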
Built-in metrics at a glance
PandaProbe includes metrics for both single-trace quality and session-level agent behavior.

Trace-level metrics
| Metric | Method | What it measures |
|---|---|---|
| task_completion | LLM judge (2-stage) | Did the agent accomplish the user’s objective? |
| tool_correctness | LLM judge (2-stage) | Did the agent select the right tools? |
| argument_correctness | LLM judge (3-stage) | Were tool call arguments correct? |
| step_efficiency | LLM judge (2-stage) | Did the agent execute with minimal unnecessary steps? |
| confidence | LLM judge (1-stage) | Were the agent’s actions decisive and well-founded? |
| plan_adherence | LLM judge (3-stage) | Did the agent follow its declared plan? |
| plan_quality | LLM judge (3-stage) | Is the agent’s plan complete and well-structured? |
| coherence | Embedding distance | Does the output logically follow from the input? |
| loop_detection | Hybrid similarity | Is the agent stuck repeating itself across traces? |
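To make the non-judge methods more concrete, here is a rough sketch of an embedding-distance coherence check. The `embed` function is a stand-in for any sentence-embedding model; this illustrates the general technique, not PandaProbe's actual coherence implementation.

```python
import numpy as np

# Conceptual sketch of an embedding-distance coherence check. `embed` is a
# stand-in for any sentence-embedding model; this is an illustration of the
# general technique, not PandaProbe's implementation.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coherence_score(embed, trace_input: str, trace_output: str) -> float:
    """Higher when the agent's output is semantically close to the input."""
    return cosine_similarity(embed(trace_input), embed(trace_output))
```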
Session-level metrics
| Metric | Method | What it measures |
|---|---|---|
| agent_reliability | Max-compose + top-k tail risk | Worst-case failure risk across the session |
| agent_consistency | Weighted RMS aggregation | Overall stability and smooth operation |
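As a rough illustration of the aggregation ideas named in the table, the sketch below shows a weighted RMS aggregate (consistency-style) and a top-k worst-score aggregate (tail-risk, reliability-style) over per-trace scores. These are generic versions of the techniques, not PandaProbe's exact formulas or weighting scheme.

```python
import math

# Conceptual sketches of the session-level aggregation ideas named above.
# Generic versions of the techniques, not PandaProbe's exact formulas.
# Per-trace scores are assumed to lie in [0, 1], higher meaning better.

def weighted_rms(scores: list[float], weights: list[float]) -> float:
    """Weighted root-mean-square of per-trace scores (consistency-style aggregate)."""
    total = sum(weights)
    return math.sqrt(sum(w * s * s for s, w in zip(scores, weights)) / total)

def top_k_tail_risk(scores: list[float], k: int = 3) -> float:
    """Mean of the k worst per-trace scores (worst-case, reliability-style aggregate)."""
    worst = sorted(scores)[:k]
    return sum(worst) / len(worst)
```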
How to choose
Use trace evaluation when you want to inspect or regression-test individual interactions:

- Did this request succeed?
- Did the agent call the right tools?
- Were the tool arguments correct?
- Did the output follow from the input?

Use agent evaluation when you want to understand how an agent behaves across an entire session:

- Did the agent stay reliable across the whole task?
- Did quality degrade over a conversation?
- Did retries, loops, or inefficient steps affect the final outcome?
- Which sessions need review first?
Scheduling evaluations
Beyond one-off eval runs, PandaProbe supports monitors: recurring evaluation schedules that automatically create eval runs on a cadence (every_6h, daily, weekly, or custom cron). Monitors can skip runs when no new data has arrived, which helps control evaluation cost.
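Below is a hypothetical sketch of creating such a monitor through the API. The endpoint path, payload field names, and the flag for skipping empty windows are assumptions based on the description above; see Scheduling Evaluations for the supported options.

```python
import os
import requests

# Hypothetical sketch of creating a recurring monitor. The endpoint path,
# payload field names, and cadence strings are assumptions based on the
# description above, not a documented schema.
API_BASE = "https://api.pandaprobe.com"  # assumed base URL
API_KEY = os.environ["PANDAPROBE_API_KEY"]

monitor = {
    "name": "daily-production-trace-eval",
    "cadence": "daily",                          # or "every_6h", "weekly", or a cron expression
    "target_type": "trace",
    "metrics": ["task_completion", "loop_detection"],
    "filters": {"tags": ["production"]},
    "skip_if_no_new_data": True,                 # avoid paying for empty windows
}

resp = requests.post(
    f"{API_BASE}/v1/monitors",
    json=monitor,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
```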
Scheduling Evaluations
Set up automated evaluation monitors with custom cadences and filters.
Next steps
Core Concepts
Understand the evaluation model: runs, scores, signals, and aggregation.
Set Up Evaluation
Choose dashboard, API, or scheduled monitors for running evaluations.
Run via API
Create eval runs programmatically using the evaluation API.

