Agent evaluation operates at the session level. While trace evaluation scores one request at a time, agent evaluation answers a broader question: how reliable and consistent is this agent across a full conversation or workflow? In PandaProbe, a session is the unit of an agent's lifecycle. It can represent an entire conversation, a multi-step workflow, a support ticket, or an autonomous job. Agent evaluation aggregates trace-level signals into session-level scores that capture behavior only visible across multiple steps.
Why session-level evaluation?
A trace-level score tells you about one moment. But agents fail in patterns:
- An agent might handle 9 out of 10 requests well but catastrophically fail on the 10th
- An agent might show gradually declining confidence across a long conversation
- An agent might get stuck in a loop, repeating the same response over and over
How it works
Session evaluation is a two-phase process:
1. Collect traces in the session: PandaProbe starts with all traces that share the same session_id, preserving the sequence of interactions in the agent lifecycle (a minimal grouping sketch follows this list).
2. Compute trace-level signals: for each trace, PandaProbe computes signals such as confidence, coherence, tool_correctness, and loop_detection.
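Here is a minimal sketch of the collection step. The trace fields (session_id, timestamp) are illustrative assumptions, not PandaProbe's actual schema:

```python
from collections import defaultdict

def collect_session_traces(traces):
    """Group traces by session_id, preserving interaction order.

    Assumes each trace is a dict carrying `session_id` and `timestamp`
    fields; both names are illustrative, not PandaProbe's actual schema.
    """
    sessions = defaultdict(list)
    for trace in sorted(traces, key=lambda t: t["timestamp"]):
        sessions[trace["session_id"]].append(trace)
    return sessions
```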
Phase 1: Trace-level signals
For each trace in the session, PandaProbe computes four signals using the trace-level metrics:

| Signal | Source metric | What it captures |
|---|---|---|
| confidence | Confidence metric (LLM judge) | Decisiveness and appropriateness of actions |
| loop_detection | Loop Detection metric (hybrid similarity) | Repetition across traces |
| tool_correctness | Tool Correctness metric (LLM judge) | Quality of tool selection |
| coherence | Coherence metric (embedding distance) | Input-output alignment |
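The output of Phase 1 can be pictured as one record per trace. The shape below is purely illustrative (not the actual API response), assuming all signals land in [0, 1]:

```python
# Illustrative per-trace signal records; scores assumed to be in [0, 1].
trace_signals = [
    {"trace_id": "t-001", "confidence": 0.92, "loop_detection": 1.00,
     "tool_correctness": 0.88, "coherence": 0.95},
    {"trace_id": "t-002", "confidence": 0.45, "loop_detection": 0.30,
     "tool_correctness": 0.88, "coherence": 0.50},  # likely stuck in a loop
]
```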
Phase 2: Session-level aggregation
The two session metrics receive the precomputed signals and aggregate them using pure mathematical functions: no additional LLM or embedding calls. This makes the aggregation fast and deterministic. The scores are research-driven and algorithmic, designed specifically for agents with long trajectories (a sketch follows the list below):
- Agent Reliability focuses on the worst moments: a single catastrophic trace drags the score down
- Agent Consistency focuses on overall stability: many moderate issues compound even if no single trace is terrible
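The exact formulas are not given on this page, so the following is a minimal sketch of the stated shape: a weighted mean combines each trace's signals, a minimum captures worst-case reliability, and an RMS over score deficits captures consistency. Every function name and formula here is an assumption:

```python
import math

DEFAULT_WEIGHTS = {"confidence": 1.0, "loop_detection": 1.0,
                   "tool_correctness": 0.8, "coherence": 1.0}

def trace_score(signals, weights=DEFAULT_WEIGHTS):
    """Weighted mean of one trace's signals (assumed combination rule)."""
    present = [s for s in weights if s in signals]
    total = sum(weights[s] * signals[s] for s in present)
    return total / sum(weights[s] for s in present)

def agent_reliability(scores):
    """Worst-case focus: a single catastrophic trace drags the score down."""
    return min(scores)

def agent_consistency(scores):
    """RMS of per-trace deficits: moderate issues compound across traces.

    In this sketch the 'weighted' part of the weighted RMS comes from the
    signal weights applied in trace_score above.
    """
    deficits = [1.0 - s for s in scores]
    return 1.0 - math.sqrt(sum(d * d for d in deficits) / len(deficits))
```

The design intuition: min is maximally sensitive to the single worst trace, while the RMS of deficits grows with both the severity and the number of shortfalls, so many moderate issues compound.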
Signal weights
Both session metrics apply configurable weights to each signal:

| Signal | Default weight | Rationale |
|---|---|---|
| confidence | 1.0 | Core indicator of agent behavior quality |
| loop_detection | 1.0 | Critical for detecting stuck agents |
| tool_correctness | 0.8 | Slightly lower: not all traces involve tools |
| coherence | 1.0 | Fundamental quality signal |
You can override the defaults with the signal_weights parameter. This lets you emphasize the signals most important for your use case; a hypothetical example follows.
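Here is a hypothetical override that emphasizes loop detection for a long-running autonomous agent. The endpoint URL and payload shape are assumptions; only the signal_weights parameter name and the metric names come from this page:

```python
import requests

# Hypothetical REST call; the endpoint and payload shape are assumptions.
response = requests.post(
    "https://api.pandaprobe.com/v1/session-eval-runs",  # assumed URL
    headers={"Authorization": "Bearer <PANDAPROBE_API_KEY>"},
    json={
        "session_id": "sess_abc123",
        "metrics": ["agent_reliability", "agent_consistency"],
        "signal_weights": {
            "confidence": 1.0,
            "loop_detection": 1.5,    # emphasize stuck-agent detection
            "tool_correctness": 0.5,  # this agent rarely uses tools
            "coherence": 1.0,
        },
    },
)
```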
Available session metrics
Agent Reliability
Worst-case failure risk across the session. A single catastrophic trace makes the whole session score poorly, even if every other trace is fine.
Agent Consistency
Overall stability via weighted RMS. Many moderate issues compound even without a single catastrophic failure.
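To see how the two metrics diverge, compare two sessions with the same mean trace score (0.80) under the illustrative aggregation sketched earlier; the formulas below are assumptions, restated so the snippet is self-contained:

```python
import math

def reliability(scores):   # worst-case, as sketched above (assumed formula)
    return min(scores)

def consistency(scores):   # RMS of deficits (assumed formula)
    return 1 - math.sqrt(sum((1 - s) ** 2 for s in scores) / len(scores))

spiky  = [0.95, 0.95, 0.95, 0.35]  # one near-catastrophic trace
steady = [0.80, 0.80, 0.80, 0.80]  # uniformly mediocre

print(reliability(spiky), reliability(steady))                      # 0.35 vs 0.8
print(round(consistency(spiky), 2), round(consistency(steady), 2))  # 0.67 vs 0.8
```

Reliability punishes the spiky session hard, while consistency penalizes it more gently but still ranks it below the steady one.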
Next steps
Session Metrics Reference
Detailed documentation for agent_reliability and agent_consistency.
Run via API
Create session eval runs programmatically.

