

Agent evaluation operates at the session level. While trace evaluation scores one request at a time, agent evaluation answers a broader question: how reliable and consistent is this agent across a full conversation or workflow? In PandaProbe, a session is the unit of an agent lifecycle: it can represent an entire conversation, a multi-step workflow, a support ticket, or an autonomous job. Agent evaluation aggregates trace-level signals into session-level scores that capture behavior only visible across multiple steps.

Why session-level evaluation?

A trace-level score tells you about one moment. But agents fail in patterns:
  • An agent might handle 9 out of 10 requests well but catastrophically fail on the 10th
  • An agent might show gradually declining confidence across a long conversation
  • An agent might get stuck in a loop, repeating the same response over and over
These patterns are invisible at the trace level. Session-level metrics surface them by looking at the distribution of trace-level signals across the entire session.

How it works

Session evaluation starts by collecting every trace that shares the same session_id, preserving the sequence of interactions in the agent lifecycle. Scoring then runs in two phases:

1. Compute trace-level signals. For each trace, PandaProbe computes signals such as confidence, coherence, tool_correctness, and loop_detection.
2. Aggregate into session-level scores. Session metrics combine those signals into agent_reliability and agent_consistency, using deterministic math instead of additional LLM calls.
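
As a rough sketch of this flow (the trace shape, field names, and helper functions below are illustrative assumptions, not PandaProbe's actual SDK):

```python
from collections import defaultdict

def compute_signals(trace):
    # Stand-in for the trace-level metrics described below
    # (confidence, coherence, tool_correctness, loop_detection).
    return trace["signals"]

def aggregate_session(signals):
    # Stand-in for agent_reliability / agent_consistency; a fuller
    # sketch of the aggregation appears later on this page.
    n = len(signals)
    return {key: sum(s[key] for s in signals) / n for key in signals[0]}

def evaluate_sessions(traces):
    # Collect: group traces by session_id, preserving interaction order.
    sessions = defaultdict(list)
    for trace in sorted(traces, key=lambda t: t["timestamp"]):
        sessions[trace["session_id"]].append(trace)
    # Phase 1: per-trace signals.  Phase 2: session-level aggregation.
    return {
        session_id: aggregate_session([compute_signals(t) for t in ts])
        for session_id, ts in sessions.items()
    }
```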

Phase 1: Trace-level signals

For each trace in the session, PandaProbe computes four signals using the trace-level metrics:
| Signal | Source metric | What it captures |
| --- | --- | --- |
| confidence | Confidence metric (LLM judge) | Decisiveness and appropriateness of actions |
| loop_detection | Loop Detection metric (hybrid similarity) | Repetition across traces |
| tool_correctness | Tool Correctness metric (LLM judge) | Quality of tool selection |
| coherence | Coherence metric (embedding distance) | Input-output alignment |
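
Concretely, the signals computed for a single trace can be pictured as a small record of scores; the field names follow the table above, while the dict shape and the 0-to-1 scale are assumptions for illustration:

```python
# Hypothetical output of the four trace-level metrics for one trace.
trace_signals = {
    "confidence": 0.92,        # LLM judge: decisive, appropriate actions
    "loop_detection": 1.00,    # hybrid similarity: no repetition found
    "tool_correctness": 0.85,  # LLM judge: sensible tool selection
    "coherence": 0.88,         # embedding distance: output tracks input
}
```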

Phase 2: Session-level aggregation

The two session metrics receive the precomputed signals and aggregate them using pure mathematical functions: no additional LLM or embedding calls. This makes the aggregation fast and deterministic. Session evaluation scores are research-driven and algorithmic, designed specifically for agents with long trajectories (a sketch of both follows the list below):
  • Agent Reliability focuses on the worst moments — a single catastrophic trace drags the score down
  • Agent Consistency focuses on overall stability — many moderate issues compound even if no single trace is terrible
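
One plausible reading of these two behaviors, as a sketch: score each trace with a weighted mean of its signals, then take the session minimum for reliability and one minus the weighted-RMS shortfall for consistency. These are illustrative formulas, not PandaProbe's published math:

```python
import math

DEFAULT_WEIGHTS = {"confidence": 1.0, "loop_detection": 1.0,
                   "tool_correctness": 0.8, "coherence": 1.0}

def trace_score(signals, weights=DEFAULT_WEIGHTS):
    # Weighted mean of one trace's signals, assumed to lie in [0, 1].
    total = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total

def agent_reliability(session_signals, weights=DEFAULT_WEIGHTS):
    # Worst-case view: the single worst trace sets the score.
    return min(trace_score(s, weights) for s in session_signals)

def agent_consistency(session_signals, weights=DEFAULT_WEIGHTS):
    # Stability view: RMS of per-trace shortfalls, so many moderate
    # issues compound even when no single trace is terrible.
    shortfalls = [1.0 - trace_score(s, weights) for s in session_signals]
    rms = math.sqrt(sum(d * d for d in shortfalls) / len(shortfalls))
    return 1.0 - rms
```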

Signal weights

Both session metrics apply configurable weights to each signal:
| Signal | Default weight | Rationale |
| --- | --- | --- |
| confidence | 1.0 | Core indicator of agent behavior quality |
| loop_detection | 1.0 | Critical for detecting stuck agents |
| tool_correctness | 0.8 | Slightly lower, since not all traces involve tools |
| coherence | 1.0 | Fundamental quality signal |
Weights can be overridden per eval run via the API’s signal_weights parameter. This lets you emphasize the signals most important for your use case.
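
For example, a run that mostly cares about stuck agents could up-weight loop_detection. Only the signal_weights parameter comes from this page; the endpoint, payload shape, and auth header below are hypothetical:

```python
import requests

response = requests.post(
    "https://api.pandaprobe.com/v1/eval-runs",   # hypothetical endpoint
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json={
        "session_id": "sess_abc123",             # hypothetical identifier
        "metrics": ["agent_reliability", "agent_consistency"],
        "signal_weights": {                      # documented parameter
            "confidence": 1.0,
            "loop_detection": 2.0,               # emphasize stuck-agent detection
            "tool_correctness": 0.5,
            "coherence": 1.0,
        },
    },
)
response.raise_for_status()
print(response.json())
```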

Available session metrics

Agent Reliability

Worst-case failure risk across the session. A single catastrophic trace drags the session score down, even if every other trace is fine.

Agent Consistency

Overall stability via weighted RMS. Many moderate issues compound even without a single catastrophic failure.
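
A small numeric illustration of how the two metrics can diverge, using the min / RMS reading sketched earlier (the per-trace scores here are hypothetical scalars, not real output):

```python
import math

def reliability(scores):
    return min(scores)                       # worst-case trace

def consistency(scores):
    shortfalls = [1 - s for s in scores]     # 1 - RMS of shortfalls
    return 1 - math.sqrt(sum(d * d for d in shortfalls) / len(shortfalls))

spiky   = [0.9, 0.9, 0.9, 0.2]   # one catastrophic trace
one_dip = [0.9, 0.9, 0.9, 0.6]   # a single mediocre trace
steady  = [0.6, 0.6, 0.6, 0.6]   # persistent moderate issues

for name, s in [("spiky", spiky), ("one_dip", one_dip), ("steady", steady)]:
    print(name, round(reliability(s), 2), round(consistency(s), 2))
# spiky   0.2 0.59  <- flagged hardest by reliability
# one_dip 0.6 0.78
# steady  0.6 0.6   <- same worst case as one_dip; consistency separates them
```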

Next steps

Session Metrics Reference

Detailed documentation for agent_reliability and agent_consistency.

Run via API

Create session eval runs programmatically.