> ## Documentation Index
> Fetch the complete documentation index at: https://docs.pandaprobe.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Introduction

> Evaluate agent reliability and consistency across entire sessions.

Agent evaluation operates at the **session level**. While trace evaluation scores one request at a time, agent evaluation answers a broader question: *How reliable and consistent is this agent across a full conversation or workflow?*

In PandaProbe, a **session** is the unit for an agent lifecycle. It can represent an entire conversation, a multi-step workflow, a support ticket, or an autonomous job. Agent evaluation aggregates trace-level signals into session-level scores that capture behavior only visible across multiple steps.

## Why session-level evaluation?

A trace-level score tells you about one moment. But agents fail in patterns:

* An agent might handle 9 out of 10 requests well but catastrophically fail on the 10th
* An agent might show gradually declining confidence across a long conversation
* An agent might get stuck in a loop, repeating the same response over and over

These patterns are invisible at the trace level. Session-level metrics surface them by looking at the **distribution** of trace-level signals across the entire session.

## How it works

Session evaluation is a two-phase process:

1. PandaProbe starts with all traces that share the same `session_id`, preserving the sequence of interactions in the agent lifecycle.
2. For each trace, PandaProbe computes signals such as `confidence`, `coherence`, `tool_correctness`, and `loop_detection`.
3. Session metrics combine those signals into `agent_reliability` and `agent_consistency`, using deterministic math instead of additional LLM calls.
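The first step above, collecting traces by `session_id` while keeping their order, can be illustrated with a minimal sketch. This is not PandaProbe's implementation: the `Trace` record, its `trace_id` and `started_at` fields, and the `group_by_session` helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical minimal trace record. Only session_id comes from the
    # documentation; the other field names are assumptions for this sketch.
    trace_id: str
    session_id: str
    started_at: float  # e.g. a Unix timestamp


def group_by_session(traces):
    """Group traces by session_id, ordered by start time within each session."""
    sessions = defaultdict(list)
    for trace in traces:
        sessions[trace.session_id].append(trace)
    for session_traces in sessions.values():
        session_traces.sort(key=lambda t: t.started_at)
    return dict(sessions)


traces = [
    Trace("t2", "s1", 20.0),
    Trace("t1", "s1", 10.0),
    Trace("t3", "s2", 5.0),
]
sessions = group_by_session(traces)
print([t.trace_id for t in sessions["s1"]])  # ['t1', 't2']
```

Sorting inside each session matters because signals such as `loop_detection` look at repetition across consecutive traces.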
### Phase 1: Trace-level signals

For each trace in the session, PandaProbe computes four signals using the trace-level metrics:

| Signal             | Source metric                             | What it captures                            |
| ------------------ | ----------------------------------------- | ------------------------------------------- |
| `confidence`       | Confidence metric (LLM judge)             | Decisiveness and appropriateness of actions |
| `loop_detection`   | Loop Detection metric (hybrid similarity) | Repetition across traces                    |
| `tool_correctness` | Tool Correctness metric (LLM judge)       | Quality of tool selection                   |
| `coherence`        | Coherence metric (embedding distance)     | Input-output alignment                      |

### Phase 2: Session-level aggregation

The two session metrics receive the precomputed signals and aggregate them using pure mathematical functions: **no additional LLM or embedding calls**. This makes the aggregation fast and deterministic.

Session evaluation scores are research-driven and algorithmic, designed specifically for agents with long trajectories:

* **Agent Reliability** focuses on the **worst moments**: a single catastrophic trace drags the score down
* **Agent Consistency** focuses on **overall stability**: many moderate issues compound even if no single trace is terrible

## Signal weights

Both session metrics apply configurable weights to each signal:

| Signal             | Default weight | Rationale                                          |
| ------------------ | -------------- | -------------------------------------------------- |
| `confidence`       | 1.0            | Core indicator of agent behavior quality           |
| `loop_detection`   | 1.0            | Critical for detecting stuck agents                |
| `tool_correctness` | 0.8            | Slightly lower, since not all traces involve tools |
| `coherence`        | 1.0            | Fundamental quality signal                         |

Weights can be overridden per eval run via the API's `signal_weights` parameter. This lets you emphasize the signals most important for your use case.

## Available session metrics

**Agent Reliability**: Worst-case failure risk across the session.
A single catastrophic trace scores poorly even if all others are fine.

**Agent Consistency**: Overall stability via weighted RMS. Many moderate issues compound even without a single catastrophic failure.

## Next steps

* Detailed documentation for `agent_reliability` and `agent_consistency`.
* Create session eval runs programmatically.
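To make the Phase 2 aggregation more concrete, here is a rough sketch of a worst-case aggregate and a weighted-RMS aggregate over per-trace signals. It is an illustration only and does not reproduce PandaProbe's actual formulas. It assumes every signal is normalized to `[0, 1]` with `1.0` as the best value, and the function names (`trace_score`, `worst_case_score`, `weighted_rms_score`) are hypothetical. Only the default weights come from the table above.

```python
import math

# Default weights taken from the "Signal weights" table.
DEFAULT_WEIGHTS = {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0,
}


def trace_score(signals, weights):
    """Weighted mean of one trace's signals (assumed to lie in [0, 1])."""
    total = sum(weights[name] for name in signals)
    return sum(weights[name] * value for name, value in signals.items()) / total


def worst_case_score(session, weights=DEFAULT_WEIGHTS):
    """Illustrative worst-case aggregate: the lowest per-trace score."""
    return min(trace_score(signals, weights) for signals in session)


def weighted_rms_score(session, weights=DEFAULT_WEIGHTS):
    """Illustrative stability aggregate: 1 minus the weighted RMS of deficits."""
    num = 0.0
    den = 0.0
    for signals in session:
        for name, value in signals.items():
            num += weights[name] * (1.0 - value) ** 2
            den += weights[name]
    return 1.0 - math.sqrt(num / den)


good = {name: 1.0 for name in DEFAULT_WEIGHTS}
bad = {name: 0.0 for name in DEFAULT_WEIGHTS}
spiky_session = [good, good, good, bad]  # one catastrophic trace

print(worst_case_score(spiky_session))
print(weighted_rms_score(spiky_session))

# Overriding weights, in the spirit of the `signal_weights` parameter:
custom = {**DEFAULT_WEIGHTS, "loop_detection": 2.0}
print(weighted_rms_score(spiky_session, custom))
```

In this sketch, the single failed trace pulls the worst-case aggregate to its minimum, while the RMS-style aggregate reflects the whole distribution of signals. Both functions are pure arithmetic, so repeated runs on the same signals give the same result.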