Trace evaluation

Trace evaluation scores a single agent execution: one trace with all of its spans, model calls, tool calls, inputs, outputs, and metadata. Use it to answer questions like “Did the agent complete the task?”, “Were the right tools used?”, and “Was the output coherent?” PandaProbe ships with 9 built-in trace-level metrics. Most use LLM-as-judge analysis, while coherence and loop detection use embedding or similarity-based methods.

Available metrics

• Task Completion: Did the agent accomplish the user’s stated objective?
• Tool Correctness: Did the agent select appropriate tools for the task?
• Argument Correctness: Were tool call arguments correctly specified?
• Step Efficiency: Did the agent execute with minimal unnecessary steps?
• Confidence: Were the agent’s actions decisive and well-founded?
• Plan Adherence: Did the agent follow its declared plan?
• Plan Quality: Was the agent’s plan complete and well-structured?
• Coherence: Does the output logically follow from the input?
• Loop Detection: Is the agent stuck repeating itself across traces?

How trace metrics work

Each metric receives the full trace and produces a score between 0 and 1, a human-readable reason, and structured metadata. Higher scores generally mean better behavior.
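
In code terms, a single metric result might look roughly like the sketch below; this is a minimal illustration with made-up example values, not PandaProbe’s actual MetricResult definition.

# Minimal sketch of the result shape described above (score, reason, metadata).
# Illustration only; not PandaProbe's actual MetricResult implementation.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class MetricResult:
    score: float        # 0.0 to 1.0; higher generally means better behavior
    reason: str         # human-readable explanation of the verdict
    metadata: dict[str, Any] = field(default_factory=dict)  # intermediate extractions or per-call verdicts

result = MetricResult(
    score=0.87,
    reason="The agent completed the stated task with one redundant tool call.",
    metadata={"task": "Summarize the support ticket", "outcome": "Summary produced"},
)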

LLM-as-judge metrics

Most LLM-as-judge metrics separate context extraction from scoring:
1. Extract trace context: PandaProbe turns raw trace data into focused inputs for the metric, such as the user task, final outcome, tool calls, tool arguments, or declared plan.
2. Judge one quality dimension: An LLM evaluates the extracted context for a specific metric, such as task completion, tool correctness, or plan quality.
3. Store a structured score: The metric returns a numeric score, a reason, and metadata that explains the intermediate extraction or verdict.

This multi-stage design improves reliability because the judge LLM receives clean, focused inputs rather than raw trace data.

Example: Task Completion (2-stage)
  1. Extract — LLM identifies the user’s task and the agent’s factual outcome from the trace
  2. Score — LLM compares task vs. outcome and returns a 0–1 verdict with explanation
Example: Argument Correctness (3-stage)
  1. Extract — LLM identifies user input and all tool calls (name, parameters, reasoning)
  2. Verdict — LLM evaluates each tool call’s arguments individually (yes/no per call)
  3. Reason — LLM produces an overall explanation from the per-call verdicts
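
A minimal sketch of what the two-stage flow could look like in code, assuming a hypothetical llm.complete() helper that returns the judge model’s text as JSON; the prompts and return shape are illustrative only, not PandaProbe’s internals.

# Two-stage LLM-as-judge sketch: extract focused context, then score it.
# `llm.complete()` is a hypothetical helper, not PandaProbe's documented API.
import json

EXTRACT_PROMPT = (
    "From the trace below, identify the user's task and the agent's factual outcome. "
    "Reply as JSON with keys 'task' and 'outcome'.\n\n{trace_text}"
)
SCORE_PROMPT = (
    "Task: {task}\nOutcome: {outcome}\n"
    "Does the outcome complete the task? Reply as JSON with keys 'score' (0 to 1) and 'reason'."
)

async def judge_task_completion(trace_text: str, llm) -> dict:
    # Stage 1 (extract): turn raw trace text into focused context.
    extracted = json.loads(await llm.complete(EXTRACT_PROMPT.format(trace_text=trace_text)))
    # Stage 2 (score): judge only the task-vs-outcome dimension.
    verdict = json.loads(await llm.complete(
        SCORE_PROMPT.format(task=extracted["task"], outcome=extracted["outcome"])
    ))
    return {"score": float(verdict["score"]), "reason": verdict["reason"], "metadata": extracted}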

Embedding-based metrics

Two metrics skip LLM calls entirely:
  • Coherence computes the cosine distance between input and output embeddings. A small distance means high coherence (score close to 1.0).
  • Loop Detection uses a hybrid approach: cosine similarity (semantic overlap) multiplied by Jaccard similarity (lexical overlap) across recent traces. High scores on both indicate the agent is stuck repeating itself.
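
Both calculations are simple enough to sketch directly. The functions below assume embeddings are plain float vectors and use whitespace tokenization for the lexical overlap; they follow the description above rather than PandaProbe’s exact implementation.

# Sketch of the embedding-based scoring described above.
# Assumptions: embeddings are lists of floats; Jaccard overlap uses whitespace tokens.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def coherence_score(input_emb: list[float], output_emb: list[float]) -> float:
    # Small cosine distance between input and output embeddings -> score near 1.0.
    distance = 1.0 - cosine_similarity(input_emb, output_emb)
    return 1.0 - distance

def jaccard_similarity(text_a: str, text_b: str) -> float:
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def loop_signal(curr_emb: list[float], prev_emb: list[float], curr_text: str, prev_text: str) -> float:
    # Semantic overlap multiplied by lexical overlap; values near 1.0 suggest repetition.
    return cosine_similarity(curr_emb, prev_emb) * jaccard_similarity(curr_text, prev_text)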

Metric interface

Every trace metric implements the same evaluate() method:
async def evaluate(
    self,
    trace: Trace,
    llm: LLMEngine,
    *,
    threshold: float | None = None,
    model: str | None = None,
    session_traces: list[Trace] | None = None,
) -> MetricResult:
    ...
Parameter        Description
trace            The full trace entity with all spans
llm              LLM engine for judge calls and embeddings
threshold        Override the metric’s default pass/fail threshold
model            Override the default LLM model (e.g., openai/gpt-5.4)
session_traces   Previous traces in the same session (only used by loop_detection)
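
As a usage sketch, calling a metric with overrides might look like the following; the TaskCompletion class name, LLMEngine construction, and get_trace() helper are assumptions made for illustration, with only the evaluate() signature taken from above.

# Hypothetical usage sketch. Class names and the trace-loading helper are
# assumptions; only the evaluate() keyword arguments mirror the signature above.
import asyncio

async def main() -> None:
    llm = LLMEngine()                    # assumed: default engine construction
    trace = get_trace("trace_abc123")    # hypothetical helper for loading a trace
    metric = TaskCompletion()            # assumed: built-in metric exposed as a class

    result = await metric.evaluate(
        trace,
        llm,
        threshold=0.8,                   # override the metric's default pass/fail threshold
        model="openai/gpt-5.4",          # override the default judge model
    )
    print(result.score, result.reason, result.metadata)

asyncio.run(main())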

Standalone vs. session-context metrics

Most trace metrics evaluate a trace in isolation — they don’t need other traces. These are available for standalone eval runs. One metric, loop_detection, requires session context (the session_traces parameter) to compare the current trace against previous outputs. It is excluded from standalone trace eval runs but is automatically computed as a signal during session-level evaluation.

When you run a session eval, PandaProbe first computes trace-level signals (confidence, loop_detection, tool_correctness, coherence) for each trace in the session, then feeds those signals into the session-level aggregation metrics.
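
A rough sketch of that ordering, with hypothetical metric objects and an aggregate_session() placeholder standing in for the session-level aggregation step:

# Sketch of the session-eval flow described above; function and variable names
# are assumptions, not PandaProbe's API.
SIGNAL_METRICS = ("confidence", "loop_detection", "tool_correctness", "coherence")

async def evaluate_session(session_traces, metrics, llm):
    # First pass: compute trace-level signals for every trace in the session.
    signals = []
    for i, trace in enumerate(session_traces):
        per_trace = {}
        for name in SIGNAL_METRICS:
            per_trace[name] = await metrics[name].evaluate(
                trace,
                llm,
                # loop_detection compares against the earlier traces in the session.
                session_traces=session_traces[:i] if name == "loop_detection" else None,
            )
        signals.append(per_trace)
    # Second pass: feed the per-trace signals into session-level aggregation metrics.
    return aggregate_session(signals)  # placeholder for the aggregation step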

Next steps

• Trace Metrics Reference: Detailed documentation for each trace-level metric.
• Run via API: Create trace eval runs programmatically.