

PandaProbe evaluation helps you measure how well your agents perform after traces have been captured. You can score a single trace to understand one request, or score an entire session to understand the lifecycle of an agent across multiple steps. Evaluations produce structured scores with reasons and metadata, so you can debug failures, compare changes, track trends, and monitor production quality over time.
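
To make that output concrete, here is a minimal sketch of what a structured score could carry. The field names (`metric`, `score`, `reason`, `metadata`, `trace_id`, `eval_run_id`) are illustrative assumptions based on the description above, not the documented PandaProbe schema:

```python
# Illustrative shape only; field names and scale are assumptions, not the documented schema.
example_score = {
    "eval_run_id": "run_123",         # eval run that produced this score (assumed field)
    "trace_id": "trace_456",          # trace (or session) the score is linked to (assumed field)
    "metric": "task_completion",      # built-in metric name (see the tables below)
    "score": 0.25,                    # numeric result (scale assumed)
    "reason": "The agent retrieved data but never answered the user's question.",
    "metadata": {"judge_stages": 2},  # rich metadata attached by the metric (contents assumed)
}
```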

Two levels of evaluation

PandaProbe evaluates at two levels. Each answers a different question:

Trace Evaluation

“How well did the agent handle this single request?” Scores individual traces for task completion, tool use, arguments, planning, coherence, loops, and related quality signals.

Agent Evaluation

“How reliable is this agent across an entire session?” Scores sessions by aggregating trace-level signals across the full agent lifecycle, capturing reliability, consistency, and worst-case failures.

Start with trace evaluation when you need to debug specific failures. Use session evaluation when you need to understand how an agent behaves across a complete conversation, workflow, or task.

How it works

Evaluations run asynchronously in the background. You create an eval run from the dashboard or API; PandaProbe then resolves the matching traces or sessions, runs the selected metrics, and stores the results as scores.
1. Create an eval run. Select the target type, metrics, and filters. You can evaluate all matching data, filter by fields such as date range, status, session, user, or tags, and sample a fraction of results to control cost (see the example request after this list).

2. Background processing. A worker executes each metric against the selected traces or sessions. Some trace metrics use an LLM judge, while others use embedding or similarity analysis. Session metrics aggregate trace-level signals.

3. Scores are persisted. Each metric produces a score, a reason, and rich metadata. Scores are stored and linked to the originating eval run, trace, or session.

4. Review and iterate. View scores in the dashboard, query them via the API, track trends over time, and set up recurring monitors to evaluate new data automatically.
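
The example below sketches what creating an eval run programmatically might look like. PandaProbe's actual endpoint paths, authentication scheme, and request fields are not specified on this page, so every name here (`/v1/eval-runs`, `target_type`, `sample_rate`, and so on) is an assumption used only to illustrate selecting a target type, metrics, filters, and a sampling fraction; see the Run via API page for the real interface.

```python
import os
import requests

# Hypothetical sketch: endpoint path, auth header, and field names are assumptions,
# not the documented PandaProbe API.
API_BASE = "https://api.pandaprobe.com"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['PANDAPROBE_API_KEY']}"}

eval_run = {
    "target_type": "trace",  # or "session" for agent-level evaluation
    "metrics": ["task_completion", "tool_correctness", "argument_correctness"],
    "filters": {  # evaluate only matching traces
        "date_range": {"from": "2024-06-01", "to": "2024-06-07"},
        "status": "error",
        "tags": ["checkout-agent"],
    },
    "sample_rate": 0.2,  # score a fraction of matches to control cost
}

response = requests.post(f"{API_BASE}/v1/eval-runs", json=eval_run, headers=headers)
response.raise_for_status()
print(response.json())  # the run then executes asynchronously in the background
```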

Built-in metrics at a glance

PandaProbe includes metrics for both single-trace quality and session-level agent behavior.

Trace-level metrics

Metric               | Method              | What it measures
task_completion      | LLM judge (2-stage) | Did the agent accomplish the user’s objective?
tool_correctness     | LLM judge (2-stage) | Did the agent select the right tools?
argument_correctness | LLM judge (3-stage) | Were tool call arguments correct?
step_efficiency      | LLM judge (2-stage) | Did the agent execute with minimal unnecessary steps?
confidence           | LLM judge (1-stage) | Were the agent’s actions decisive and well-founded?
plan_adherence       | LLM judge (3-stage) | Did the agent follow its declared plan?
plan_quality         | LLM judge (3-stage) | Is the agent’s plan complete and well-structured?
coherence            | Embedding distance  | Does the output logically follow from the input?
loop_detection       | Hybrid similarity   | Is the agent stuck repeating itself across traces?
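
The table lists coherence as an embedding-distance metric rather than an LLM judge. Purely as an illustration of that style of scoring (not PandaProbe's actual implementation, whose embedding model and scaling are not documented here), cosine similarity between an input embedding and an output embedding can be rescaled into a 0-to-1 coherence-style score:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coherence_style_score(input_emb: np.ndarray, output_emb: np.ndarray) -> float:
    """Map cosine similarity (-1..1) onto a 0..1 score. Illustrative only."""
    return (cosine_similarity(input_emb, output_emb) + 1.0) / 2.0

# Toy vectors standing in for real input/output embeddings.
print(coherence_style_score(np.array([0.1, 0.9, 0.3]), np.array([0.2, 0.8, 0.4])))
```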

Session-level metrics

Metric            | Method                        | What it measures
agent_reliability | Max-compose + top-k tail risk | Worst-case failure risk across the session
agent_consistency | Weighted RMS aggregation      | Overall stability and smooth operation
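
The aggregation methods named above (weighted RMS, max-compose with top-k tail risk) are not defined in detail on this page. The sketch below shows one plausible reading, purely for intuition: a consistency-style aggregate as a weighted root-mean-square of trace scores, and tail risk as the mean of the k worst trace scores. PandaProbe's actual formulas may differ:

```python
import math

def weighted_rms(scores: list[float], weights: list[float]) -> float:
    """Weighted root-mean-square of trace-level scores (illustrative consistency-style aggregate)."""
    total_weight = sum(weights)
    return math.sqrt(sum(w * s * s for s, w in zip(scores, weights)) / total_weight)

def top_k_tail_risk(scores: list[float], k: int = 3) -> float:
    """Mean of the k lowest trace scores: a worst-case (tail) view of the session."""
    worst = sorted(scores)[:k]
    return sum(worst) / len(worst)

# Toy session: per-trace quality scores in 0..1, with later traces weighted more heavily.
trace_scores = [0.9, 0.85, 0.4, 0.95, 0.2]
weights = [1.0, 1.0, 1.5, 1.5, 2.0]

print("consistency-style aggregate:", weighted_rms(trace_scores, weights))
print("tail risk (3 worst traces):", top_k_tail_risk(trace_scores))
```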

How to choose

Use trace evaluation when you want to inspect or regress-test individual interactions:
  • Did this request succeed?
  • Did the agent call the right tools?
  • Were the tool arguments correct?
  • Did the output follow from the input?
Use session evaluation when the behavior only makes sense across multiple traces:
  • Did the agent stay reliable across the whole task?
  • Did quality degrade over a conversation?
  • Did retries, loops, or inefficient steps affect the final outcome?
  • Which sessions need review first?

Scheduling evaluations

Beyond one-off eval runs, PandaProbe supports monitors: recurring evaluation schedules that automatically create eval runs on a cadence (every_6h, daily, weekly, or custom cron). Monitors can skip runs when no new data has arrived, which helps control evaluation cost.
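
As with eval runs, the request below is a hypothetical sketch of creating a monitor; the endpoint and field names (`/v1/monitors`, `cadence`, `skip_if_no_new_data`) are assumptions chosen to mirror the cadence options described above, not the documented API.

```python
import os
import requests

# Hypothetical sketch; the real monitor interface is covered on the Scheduling Evaluations page.
API_BASE = "https://api.pandaprobe.com"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['PANDAPROBE_API_KEY']}"}

monitor = {
    "name": "nightly-trace-quality",
    "target_type": "trace",
    "metrics": ["task_completion", "loop_detection"],
    "cadence": "daily",           # every_6h, daily, weekly, or a custom cron
    "cron": None,                 # e.g. "0 3 * * *" when using a custom cadence (assumed field)
    "skip_if_no_new_data": True,  # skip runs when no new data has arrived (assumed field)
    "filters": {"tags": ["production"]},
}

response = requests.post(f"{API_BASE}/v1/monitors", json=monitor, headers=headers)
response.raise_for_status()
```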

Scheduling Evaluations

Set up automated evaluation monitors with custom cadences and filters.

Next steps

Core Concepts

Understand the evaluation model: runs, scores, signals, and aggregation.

Set Up Evaluation

Choose dashboard, API, or scheduled monitors for running evaluations.

Run via API

Create eval runs programmatically using the evaluation API.