Trace evaluation

Trace evaluation scores a single agent execution: one trace with all of its spans, model calls, tool calls, inputs, outputs, and metadata. Use it to answer questions like “Did the agent complete the task?”, “Were the right tools used?”, and “Was the output coherent?” PandaProbe ships with 9 built-in trace-level metrics. Most use LLM-as-judge analysis, while coherence and loop detection use embedding or similarity-based methods.

Available metrics

• Task Completion: Did the agent accomplish the user’s stated objective?
• Tool Correctness: Did the agent select appropriate tools for the task?
• Argument Correctness: Were tool call arguments correctly specified?
• Step Efficiency: Did the agent execute with minimal unnecessary steps?
• Confidence: Were the agent’s actions decisive and well-founded?
• Plan Adherence: Did the agent follow its declared plan?
• Plan Quality: Was the agent’s plan complete and well-structured?
• Coherence: Does the output logically follow from the input?
• Loop Detection: Is the agent stuck repeating itself across traces?

How trace metrics work

Each metric receives the full trace and produces a score between 0 and 1, a human-readable reason, and structured metadata. Higher scores generally mean better behavior.
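
In code terms, a single metric result might look roughly like the sketch below; this is a minimal illustration with made-up example values, not PandaProbe’s actual MetricResult definition.

# Minimal sketch of the result shape described above (score, reason, metadata).
# Illustration only; not PandaProbe's actual MetricResult implementation.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class MetricResult:
    score: float        # 0.0 to 1.0; higher generally means better behavior
    reason: str         # human-readable explanation of the verdict
    metadata: dict[str, Any] = field(default_factory=dict)  # intermediate extractions or per-call verdicts

result = MetricResult(
    score=0.87,
    reason="The agent completed the stated task with one redundant tool call.",
    metadata={"task": "Summarize the support ticket", "outcome": "Summary produced"},
)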

LLM-as-judge metrics

Most LLM-as-judge metrics separate context extraction from scoring:
1. Extract trace context: PandaProbe turns raw trace data into focused inputs for the metric, such as the user task, final outcome, tool calls, tool arguments, or declared plan.
2. Judge one quality dimension: An LLM evaluates the extracted context for a specific metric, such as task completion, tool correctness, or plan quality.
3. Store a structured score: The metric returns a numeric score, a reason, and metadata that explains the intermediate extraction or verdict.

This multi-stage design improves reliability because the judge LLM receives clean, focused inputs rather than raw trace data.

Example: Task Completion (2-stage)
  1. Extract — LLM identifies the user’s task and the agent’s factual outcome from the trace
  2. Score — LLM compares task vs. outcome and returns a 0–1 verdict with explanation
Example: Argument Correctness (3-stage)
  1. Extract — LLM identifies user input and all tool calls (name, parameters, reasoning)
  2. Verdict — LLM evaluates each tool call’s arguments individually (yes/no per call)
  3. Reason — LLM produces an overall explanation from the per-call verdicts
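
A minimal sketch of what the two-stage flow could look like in code, assuming a hypothetical llm.complete() helper that returns the judge model’s text as JSON; the prompts and return shape are illustrative only, not PandaProbe’s internals.

# Two-stage LLM-as-judge sketch: extract focused context, then score it.
# `llm.complete()` is a hypothetical helper, not PandaProbe's documented API.
import json

EXTRACT_PROMPT = (
    "From the trace below, identify the user's task and the agent's factual outcome. "
    "Reply as JSON with keys 'task' and 'outcome'.\n\n{trace_text}"
)
SCORE_PROMPT = (
    "Task: {task}\nOutcome: {outcome}\n"
    "Does the outcome complete the task? Reply as JSON with keys 'score' (0 to 1) and 'reason'."
)

async def judge_task_completion(trace_text: str, llm) -> dict:
    # Stage 1 (extract): turn raw trace text into focused context.
    extracted = json.loads(await llm.complete(EXTRACT_PROMPT.format(trace_text=trace_text)))
    # Stage 2 (score): judge only the task-vs-outcome dimension.
    verdict = json.loads(await llm.complete(
        SCORE_PROMPT.format(task=extracted["task"], outcome=extracted["outcome"])
    ))
    return {"score": float(verdict["score"]), "reason": verdict["reason"], "metadata": extracted}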

Embedding-based metrics

Two metrics skip LLM calls entirely:
  • Coherence computes the cosine distance between input and output embeddings. A small distance means high coherence (score close to 1.0).
  • Loop Detection uses a hybrid approach: cosine similarity (semantic overlap) multiplied by Jaccard similarity (lexical overlap) across recent traces. High scores on both indicate the agent is stuck repeating itself.
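
Both calculations are simple enough to sketch directly. The functions below assume embeddings are plain float vectors and use whitespace tokenization for the lexical overlap; they follow the description above rather than PandaProbe’s exact implementation.

# Sketch of the embedding-based scoring described above.
# Assumptions: embeddings are lists of floats; Jaccard overlap uses whitespace tokens.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def coherence_score(input_emb: list[float], output_emb: list[float]) -> float:
    # Small cosine distance between input and output embeddings -> score near 1.0.
    distance = 1.0 - cosine_similarity(input_emb, output_emb)
    return 1.0 - distance

def jaccard_similarity(text_a: str, text_b: str) -> float:
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def loop_signal(curr_emb: list[float], prev_emb: list[float], curr_text: str, prev_text: str) -> float:
    # Semantic overlap multiplied by lexical overlap; values near 1.0 suggest repetition.
    return cosine_similarity(curr_emb, prev_emb) * jaccard_similarity(curr_text, prev_text)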

Metric interface

Every trace metric implements the same evaluate() method:
async def evaluate(
    self,
    trace: Trace,
    llm: LLMEngine,
    *,
    threshold: float | None = None,
    model: str | None = None,
    session_traces: list[Trace] | None = None,
) -> MetricResult:
    ...
Parameter        Description
trace            The full trace entity with all spans
llm              LLM engine for judge calls and embeddings
threshold        Override the metric’s default pass/fail threshold
model            Override the default LLM model (e.g., openai/gpt-5.4)
session_traces   Previous traces in the same session (only used by loop_detection)
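
As a usage sketch, calling a metric with overrides might look like the following; the TaskCompletion class name, LLMEngine construction, and get_trace() helper are assumptions made for illustration, with only the evaluate() signature taken from above.

# Hypothetical usage sketch. Class names and the trace-loading helper are
# assumptions; only the evaluate() keyword arguments mirror the signature above.
import asyncio

async def main() -> None:
    llm = LLMEngine()                    # assumed: default engine construction
    trace = get_trace("trace_abc123")    # hypothetical helper for loading a trace
    metric = TaskCompletion()            # assumed: built-in metric exposed as a class

    result = await metric.evaluate(
        trace,
        llm,
        threshold=0.8,                   # override the metric's default pass/fail threshold
        model="openai/gpt-5.4",          # override the default judge model
    )
    print(result.score, result.reason, result.metadata)

asyncio.run(main())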

Standalone vs. session-context metrics

Most trace metrics evaluate a trace in isolation — they don’t need other traces. These are available for standalone eval runs. One metric, loop_detection, requires session context (the session_traces parameter) to compare the current trace against previous outputs. It is excluded from standalone trace eval runs but is automatically computed as a signal during session-level evaluation.

When you run a session eval, PandaProbe first computes trace-level signals (confidence, loop_detection, tool_correctness, coherence) for each trace in the session, then feeds those signals into the session-level aggregation metrics.
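
A rough sketch of that ordering, with hypothetical metric objects and an aggregate_session() placeholder standing in for the session-level aggregation step:

# Sketch of the session-eval flow described above; function and variable names
# are assumptions, not PandaProbe's API.
SIGNAL_METRICS = ("confidence", "loop_detection", "tool_correctness", "coherence")

async def evaluate_session(session_traces, metrics, llm):
    # First pass: compute trace-level signals for every trace in the session.
    signals = []
    for i, trace in enumerate(session_traces):
        per_trace = {}
        for name in SIGNAL_METRICS:
            per_trace[name] = await metrics[name].evaluate(
                trace,
                llm,
                # loop_detection compares against the earlier traces in the session.
                session_traces=session_traces[:i] if name == "loop_detection" else None,
            )
        signals.append(per_trace)
    # Second pass: feed the per-trace signals into session-level aggregation metrics.
    return aggregate_session(signals)  # placeholder for the aggregation step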

Next steps

• Trace Metrics Reference: Detailed documentation for each trace-level metric.
• Run via API: Create trace eval runs programmatically.