Agent evaluation operates at the session level. While trace evaluation scores one request at a time, agent evaluation answers a broader question: how reliable and consistent is this agent across a full conversation or workflow? In PandaProbe, a session is the unit of an agent's lifecycle. It can represent an entire conversation, a multi-step workflow, a support ticket, or an autonomous job. Agent evaluation aggregates trace-level signals into session-level scores that capture behavior only visible across multiple steps.
Why session-level evaluation?
A trace-level score tells you about one moment. But agents fail in patterns:
- An agent might handle 9 out of 10 requests well but catastrophically fail on the 10th
- An agent might show gradually declining confidence across a long conversation
- An agent might get stuck in a loop, repeating the same response over and over
How it works
Session evaluation is a two-phase process:
1. Collect traces in the session: PandaProbe starts with all traces that share the same session_id, preserving the sequence of interactions in the agent lifecycle (a minimal grouping sketch follows this list).
2. Compute trace-level signals: for each trace, PandaProbe computes signals such as confidence, coherence, tool_correctness, and loop_detection.
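Here is a minimal sketch of the collection step. The trace fields (session_id, timestamp) are illustrative assumptions, not PandaProbe's actual schema:

```python
from collections import defaultdict

def collect_session_traces(traces):
    """Group traces by session_id, preserving interaction order.

    Assumes each trace is a dict carrying `session_id` and `timestamp`
    fields; both names are illustrative, not PandaProbe's actual schema.
    """
    sessions = defaultdict(list)
    for trace in sorted(traces, key=lambda t: t["timestamp"]):
        sessions[trace["session_id"]].append(trace)
    return sessions
```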
Phase 1: Trace-level signals
For each trace in the session, PandaProbe computes four signals using the trace-level metrics:

| Signal | Source metric | What it captures |
|---|---|---|
| confidence | Confidence metric (LLM judge) | Decisiveness and appropriateness of actions |
| loop_detection | Loop Detection metric (hybrid similarity) | Repetition across traces |
| tool_correctness | Tool Correctness metric (LLM judge) | Quality of tool selection |
| coherence | Coherence metric (embedding distance) | Input-output alignment |
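The output of Phase 1 can be pictured as one record per trace. The shape below is purely illustrative (not the actual API response), assuming all signals land in [0, 1]:

```python
# Illustrative per-trace signal records; scores assumed to be in [0, 1].
trace_signals = [
    {"trace_id": "t-001", "confidence": 0.92, "loop_detection": 1.00,
     "tool_correctness": 0.88, "coherence": 0.95},
    {"trace_id": "t-002", "confidence": 0.45, "loop_detection": 0.30,
     "tool_correctness": 0.88, "coherence": 0.50},  # likely stuck in a loop
]
```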
Phase 2: Session-level aggregation
The two session metrics receive the precomputed signals and aggregate them using pure mathematical functions: no additional LLM or embedding calls. This makes the aggregation fast and deterministic. The scores are research-driven and algorithmic, designed specifically for agents with long trajectories (a sketch follows the list below):
- Agent Reliability focuses on the worst moments: a single catastrophic trace drags the score down
- Agent Consistency focuses on overall stability: many moderate issues compound even if no single trace is terrible
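The exact formulas are not given on this page, so the following is a minimal sketch of the stated shape: a weighted mean combines each trace's signals, a minimum captures worst-case reliability, and an RMS over score deficits captures consistency. Every function name and formula here is an assumption:

```python
import math

DEFAULT_WEIGHTS = {"confidence": 1.0, "loop_detection": 1.0,
                   "tool_correctness": 0.8, "coherence": 1.0}

def trace_score(signals, weights=DEFAULT_WEIGHTS):
    """Weighted mean of one trace's signals (assumed combination rule)."""
    present = [s for s in weights if s in signals]
    total = sum(weights[s] * signals[s] for s in present)
    return total / sum(weights[s] for s in present)

def agent_reliability(scores):
    """Worst-case focus: a single catastrophic trace drags the score down."""
    return min(scores)

def agent_consistency(scores):
    """RMS of per-trace deficits: moderate issues compound across traces.

    In this sketch the 'weighted' part of the weighted RMS comes from the
    signal weights applied in trace_score above.
    """
    deficits = [1.0 - s for s in scores]
    return 1.0 - math.sqrt(sum(d * d for d in deficits) / len(deficits))
```

The design intuition: min is maximally sensitive to the single worst trace, while the RMS of deficits grows with both the severity and the number of shortfalls, so many moderate issues compound.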
Signal weights
Both session metrics apply configurable weights to each signal:

| Signal | Default weight | Rationale |
|---|---|---|
| confidence | 1.0 | Core indicator of agent behavior quality |
| loop_detection | 1.0 | Critical for detecting stuck agents |
| tool_correctness | 0.8 | Slightly lower: not all traces involve tools |
| coherence | 1.0 | Fundamental quality signal |
You can override the defaults with the signal_weights parameter. This lets you emphasize the signals most important for your use case; a hypothetical example follows.
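Here is a hypothetical override that emphasizes loop detection for a long-running autonomous agent. The endpoint URL and payload shape are assumptions; only the signal_weights parameter name and the metric names come from this page:

```python
import requests

# Hypothetical REST call; the endpoint and payload shape are assumptions.
response = requests.post(
    "https://api.pandaprobe.com/v1/session-eval-runs",  # assumed URL
    headers={"Authorization": "Bearer <PANDAPROBE_API_KEY>"},
    json={
        "session_id": "sess_abc123",
        "metrics": ["agent_reliability", "agent_consistency"],
        "signal_weights": {
            "confidence": 1.0,
            "loop_detection": 1.5,    # emphasize stuck-agent detection
            "tool_correctness": 0.5,  # this agent rarely uses tools
            "coherence": 1.0,
        },
    },
)
```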
Available session metrics
Agent Reliability
Worst-case failure risk across the session. A single catastrophic trace makes the whole session score poorly, even if every other trace is fine.
Agent Consistency
Overall stability via weighted RMS. Many moderate issues compound even without a single catastrophic failure.
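To see how the two metrics diverge, compare two sessions with the same mean trace score (0.80) under the illustrative aggregation sketched earlier; the formulas below are assumptions, restated so the snippet is self-contained:

```python
import math

def reliability(scores):   # worst-case, as sketched above (assumed formula)
    return min(scores)

def consistency(scores):   # RMS of deficits (assumed formula)
    return 1 - math.sqrt(sum((1 - s) ** 2 for s in scores) / len(scores))

spiky  = [0.95, 0.95, 0.95, 0.35]  # one near-catastrophic trace
steady = [0.80, 0.80, 0.80, 0.80]  # uniformly mediocre

print(reliability(spiky), reliability(steady))                      # 0.35 vs 0.8
print(round(consistency(spiky), 2), round(consistency(steady), 2))  # 0.67 vs 0.8
```

Reliability punishes the spiky session hard, while consistency penalizes it more gently but still ranks it below the steady one.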
Next steps
Session Metrics Reference
Detailed documentation for agent_reliability and agent_consistency.
Run via API
Create session eval runs programmatically.

