# Introduction

> Evaluate agent reliability and consistency across entire sessions.

Agent evaluation operates at the **session level**. While trace evaluation scores one request at a time, agent evaluation answers a broader question: *How reliable and consistent is this agent across a full conversation or workflow?*

In PandaProbe, a **session** is the unit that represents one agent lifecycle. It can be an entire conversation, a multi-step workflow, a support ticket, or an autonomous job. Agent evaluation aggregates trace-level signals into session-level scores that capture behavior only visible across multiple steps.

## Why session-level evaluation?

A trace-level score tells you about one moment. But agents fail in patterns:

* An agent might handle 9 out of 10 requests well but catastrophically fail on the 10th
* An agent might show gradually declining confidence across a long conversation
* An agent might get stuck in a loop, repeating the same response over and over

These patterns are invisible at the trace level. Session-level metrics surface them by looking at the **distribution** of trace-level signals across the entire session.

## How it works

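Session evaluation collects the traces in a session and then runs in two phases: computing trace-level signals and aggregating them into session-level scores.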

<Steps>
  <Step title="Collect traces in the session">
    PandaProbe starts with all traces that share the same `session_id`, preserving the sequence of interactions in the agent lifecycle.
  </Step>

  <Step title="Compute trace-level signals">
    For each trace, PandaProbe computes signals such as `confidence`, `coherence`, `tool_correctness`, and `loop_detection`.
  </Step>

  <Step title="Aggregate session-level scores">
    Session metrics combine those signals into `agent_reliability` and `agent_consistency`, using deterministic math instead of additional LLM calls.
  </Step>
</Steps>
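The first step can be pictured as a simple grouping operation. The sketch below is illustrative only: it assumes traces are plain dictionaries with a `session_id` field (from the description above) and a hypothetical `timestamp` field used for ordering. It is not PandaProbe's internal implementation.

```python
from collections import defaultdict


def group_traces_by_session(traces):
    """Group traces by session_id and keep each session in chronological order.

    `timestamp` is a hypothetical field name used here only for ordering.
    """
    sessions = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    for session_traces in sessions.values():
        # Preserve the sequence of interactions within the session.
        session_traces.sort(key=lambda t: t["timestamp"])
    return dict(sessions)


traces = [
    {"session_id": "ticket-1", "timestamp": 2, "input": "follow-up"},
    {"session_id": "ticket-2", "timestamp": 1, "input": "hello"},
    {"session_id": "ticket-1", "timestamp": 1, "input": "first message"},
]
grouped = group_traces_by_session(traces)
```

Here `grouped["ticket-1"]` holds two traces with the first message ahead of the follow-up, which is the order the later phases would see.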

### Phase 1: Trace-level signals

For each trace in the session, PandaProbe computes four signals using the trace-level metrics:

| Signal             | Source metric                             | What it captures                            |
| ------------------ | ----------------------------------------- | ------------------------------------------- |
| `confidence`       | Confidence metric (LLM judge)             | Decisiveness and appropriateness of actions |
| `loop_detection`   | Loop Detection metric (hybrid similarity) | Repetition across traces                    |
| `tool_correctness` | Tool Correctness metric (LLM judge)       | Quality of tool selection                   |
| `coherence`        | Coherence metric (embedding distance)     | Input-output alignment                      |
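One way to picture the output of this phase is a small record per trace holding the four signals from the table. The sketch below is an assumption for illustration: the `TraceSignals` class, the idea that `tool_correctness` is absent for traces without tool calls, and the example values are all hypothetical rather than PandaProbe's actual data model.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceSignals:
    """Hypothetical per-trace record of the four signals named in the table."""

    confidence: float
    loop_detection: float
    coherence: float
    # Assumption: None means the trace made no tool calls, so nothing was scored.
    tool_correctness: Optional[float] = None

    def present(self):
        """Return only the signals that were actually computed for this trace."""
        return {name: value for name, value in asdict(self).items() if value is not None}


no_tools = TraceSignals(confidence=0.9, loop_detection=1.0, coherence=0.8)
with_tools = TraceSignals(confidence=0.7, loop_detection=1.0, coherence=0.6, tool_correctness=0.5)
```

Keeping the optional signal explicit makes it clear to any later aggregation which traces contributed a tool score and which did not.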

### Phase 2: Session-level aggregation

The two session metrics receive the precomputed signals and aggregate them using pure mathematical functions: **no additional LLM or embedding calls**. This makes the aggregation fast and deterministic.

Session evaluation scores are research-driven and algorithmic, designed specifically for agents with long trajectories:

* **Agent Reliability** focuses on the **worst moments** — a single catastrophic trace drags the score down
* **Agent Consistency** focuses on **overall stability** — many moderate issues compound even if no single trace is terrible
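The contrast between the two bullets can be illustrated numerically. The sketch below assumes each trace has already been reduced to a single risk value in `[0, 1]` (0 means no issue, 1 means catastrophic). The function names and the exact formulas (maximum risk for the worst case, root mean square for stability) are simplifying assumptions chosen to show the behavior, not PandaProbe's actual scoring code.

```python
import math


def worst_case_score(risks):
    """Score driven by the single worst trace: 1 minus the maximum risk."""
    return 1.0 - max(risks)


def rms_score(risks):
    """Score driven by overall stability: 1 minus the root-mean-square risk."""
    return 1.0 - math.sqrt(sum(r * r for r in risks) / len(risks))


one_catastrophe = [0.0] * 9 + [1.0]  # nine clean traces, one total failure
many_moderate = [0.4] * 10           # every trace has a moderate issue

for name, risks in [("one_catastrophe", one_catastrophe), ("many_moderate", many_moderate)]:
    print(name, round(worst_case_score(risks), 3), round(rms_score(risks), 3))
```

Under these assumptions, the worst-case score ranks the session with one catastrophic trace far below the other, while the RMS score ranks the session with many moderate issues lower. Both views are deterministic and need no further model calls.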

## Signal weights

Both session metrics apply configurable weights to each signal:

| Signal             | Default weight | Rationale                                     |
| ------------------ | -------------- | --------------------------------------------- |
| `confidence`       | 1.0            | Core indicator of agent behavior quality      |
| `loop_detection`   | 1.0            | Critical for detecting stuck agents           |
| `tool_correctness` | 0.8            | Slightly lower — not all traces involve tools |
| `coherence`        | 1.0            | Fundamental quality signal                    |

Weights can be overridden per eval run via the API's `signal_weights` parameter. This lets you emphasize the signals most important for your use case.
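As a rough sketch of how an override might interact with the defaults, the example below merges a partial `signal_weights` mapping onto the default weights from the table and uses the result in a weighted mean. The helper names, the partial-override behavior, and the weighted-mean combination are assumptions for illustration; check the API reference for the actual request shape and semantics.

```python
# Default weights copied from the table above.
DEFAULT_SIGNAL_WEIGHTS = {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0,
}


def resolve_weights(overrides=None):
    """Merge per-run overrides onto the defaults, rejecting unknown signal names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_SIGNAL_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return {**DEFAULT_SIGNAL_WEIGHTS, **overrides}


def weighted_trace_score(signals, weights):
    """Weighted mean over the signals present on a trace (illustrative only)."""
    total = sum(weights[name] for name in signals)
    if total == 0:
        raise ValueError("total weight must be positive")
    return sum(weights[name] * value for name, value in signals.items()) / total


# Hypothetical override: a tool-heavy agent that cares more about tool selection.
weights = resolve_weights({"tool_correctness": 2.0})
score = weighted_trace_score({"confidence": 1.0, "tool_correctness": 0.4}, weights)
```

With the override in place, the weak `tool_correctness` value pulls the combined score down more than it would under the default weight of 0.8.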

## Available session metrics

<CardGroup cols={2}>
  <Card title="Agent Reliability" icon="shield">
    Worst-case failure risk across the session. A single catastrophic trace scores poorly even if all others are fine.
  </Card>

  <Card title="Agent Consistency" icon="activity">
    Overall stability via weighted RMS. Many moderate issues compound even without a single catastrophic failure.
  </Card>
</CardGroup>

## Next steps

<CardGroup cols={2}>
  <Card title="Session Metrics Reference" icon="book-open" href="/evaluation/agent-evaluation/metrics">
    Detailed documentation for `agent_reliability` and `agent_consistency`.
  </Card>

  <Card title="Run via API" icon="terminal" href="/evaluation/setup/run-eval-api">
    Create session eval runs programmatically.
  </Card>
</CardGroup>
