Trace evaluation scores a single agent execution: one trace with all of its spans, model calls, tool calls, inputs, outputs, and metadata. Use it to answer questions like “Did the agent complete the task?”, “Were the right tools used?”, and “Was the output coherent?” PandaProbe ships with 9 built-in trace-level metrics. Most use LLM-as-judge analysis, while coherence and loop detection use embedding- or similarity-based methods.
Available metrics
| Metric | Question it answers |
|---|---|
| Task Completion | Did the agent accomplish the user’s stated objective? |
| Tool Correctness | Did the agent select appropriate tools for the task? |
| Argument Correctness | Were tool call arguments correctly specified? |
| Step Efficiency | Did the agent execute with minimal unnecessary steps? |
| Confidence | Were the agent’s actions decisive and well-founded? |
| Plan Adherence | Did the agent follow its declared plan? |
| Plan Quality | Was the agent’s plan complete and well-structured? |
| Coherence | Does the output logically follow from the input? |
| Loop Detection | Is the agent stuck repeating itself across traces? |
How trace metrics work
Each metric receives the full trace and produces a score between 0 and 1, a human-readable reason, and structured metadata. Higher scores generally mean better behavior.
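As a rough mental model, a single metric result might look like the sketch below; the field names are illustrative only, not PandaProbe’s exact schema.

```python
# Illustrative shape of a trace-metric result; field names are assumptions,
# not PandaProbe's exact schema.
from dataclasses import dataclass, field

@dataclass
class TraceMetricResult:
    score: float                 # 0.0-1.0; higher generally means better behavior
    reason: str                  # human-readable explanation of the verdict
    metadata: dict = field(default_factory=dict)  # structured extras, e.g. per-call verdicts
```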
LLM-as-judge metrics
Most LLM-as-judge metrics separate context extraction from scoring:
1. Extract trace context: PandaProbe turns raw trace data into focused inputs for the metric, such as the user task, final outcome, tool calls, tool arguments, or declared plan.
2. Judge one quality dimension: an LLM evaluates the extracted context for a specific metric, such as task completion, tool correctness, or plan quality.
For example, task completion uses a two-step flow:
- Extract — LLM identifies the user’s task and the agent’s factual outcome from the trace
- Score — LLM compares task vs. outcome and returns a 0–1 verdict with explanation

Argument correctness adds a per-call verdict step:
- Extract — LLM identifies user input and all tool calls (name, parameters, reasoning)
- Verdict — LLM evaluates each tool call’s arguments individually (yes/no per call)
- Reason — LLM produces an overall explanation from the per-call verdicts
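A minimal sketch of that extract-then-score flow, assuming a hypothetical llm.complete() helper that takes a prompt and returns text; PandaProbe’s real judge prompts and engine interface will differ.

```python
# Hypothetical extract-then-score flow for a task-completion-style judge.
# `llm.complete()` and the prompt wording are assumptions, not PandaProbe's API.

def judge_task_completion(llm, trace_text: str) -> tuple[float, str]:
    # Step 1: extract focused context (user task + factual outcome) from the raw trace.
    context = llm.complete(
        "From the trace below, state the user's task and the agent's final outcome.\n\n"
        + trace_text
    )
    # Step 2: judge a single quality dimension using only the extracted context.
    verdict = llm.complete(
        "Given the task and outcome below, answer with a completion score between 0 and 1 "
        "on the first line and a one-sentence reason on the second line.\n\n" + context
    )
    score_line, _, reason = verdict.partition("\n")
    return float(score_line), reason.strip()
```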
Embedding-based metrics
Two metrics skip LLM calls entirely:
- Coherence computes the cosine distance between input and output embeddings. A small distance means high coherence (score close to 1.0).
- Loop Detection uses a hybrid approach: cosine similarity (semantic overlap) multiplied by Jaccard similarity (lexical overlap) across recent traces. High scores on both indicate the agent is stuck repeating itself.
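The math behind both scores is simple enough to sketch. In the snippet below, embed() is a stand-in for whatever embedding model the LLM engine provides, and the exact normalization and tokenization PandaProbe uses may differ.

```python
# Sketch of the embedding-based scores; `embed()` is a stand-in for the
# LLM engine's embedding model. Normalization details are assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coherence_score(embed, input_text: str, output_text: str) -> float:
    # Small cosine distance between input and output embeddings -> score near 1.0.
    distance = 1.0 - cosine_similarity(embed(input_text), embed(output_text))
    return 1.0 - distance

def loop_signal(embed, current_output: str, previous_output: str) -> float:
    # Hybrid signal: semantic overlap (cosine) multiplied by lexical overlap (Jaccard).
    semantic = cosine_similarity(embed(current_output), embed(previous_output))
    cur, prev = set(current_output.split()), set(previous_output.split())
    lexical = len(cur & prev) / max(len(cur | prev), 1)
    return semantic * lexical
```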
Metric interface
Every trace metric implements the same evaluate() method:
| Parameter | Description |
|---|---|
| trace | The full trace entity with all spans |
| llm | LLM engine for judge calls and embeddings |
| threshold | Override the metric’s default pass/fail threshold |
| model | Override the default LLM model (e.g., openai/gpt-5.4) |
| session_traces | Previous traces in the same session (only used by loop_detection) |
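In practice a call might look like the following; the metric object, engine variable, and result attributes are assumptions made only to illustrate the parameter table above, not PandaProbe’s published imports.

```python
# Hypothetical usage of the shared evaluate() interface; the metric object,
# engine, and result attributes are illustrative, not exact PandaProbe names.
result = task_completion.evaluate(
    trace=trace,                # full trace entity with all spans
    llm=llm_engine,             # engine used for judge calls and embeddings
    threshold=0.8,              # override the metric's default pass/fail threshold
    model="openai/gpt-5.4",     # override the default judge model
)
print(result.score, result.reason)
```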
Standalone vs. session-context metrics
Most trace metrics evaluate a trace in isolation; they don’t need other traces and are available for standalone eval runs. One metric, loop_detection, requires session context (the session_traces parameter) to compare the current trace against previous outputs. It is excluded from standalone trace eval runs but is automatically computed as a signal during session-level evaluation.
When you run a session eval, PandaProbe first computes trace-level signals (confidence, loop_detection, tool_correctness, coherence) for each trace in the session, then feeds those signals into the session-level aggregation metrics.
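As a rough sketch of that pre-pass (function and attribute names are illustrative, not PandaProbe’s actual session-eval API):

```python
# Rough sketch of the session-eval pre-pass: compute per-trace signals first,
# then feed them into session-level aggregation. All names are illustrative.
def run_session_eval(session, llm):
    signals = []
    for i, trace in enumerate(session.traces):
        signals.append({
            "confidence": confidence.evaluate(trace=trace, llm=llm),
            "tool_correctness": tool_correctness.evaluate(trace=trace, llm=llm),
            "coherence": coherence.evaluate(trace=trace, llm=llm),
            # loop_detection compares the current trace against earlier ones in the session
            "loop_detection": loop_detection.evaluate(
                trace=trace, llm=llm, session_traces=session.traces[:i]
            ),
        })
    return aggregate_session_metrics(signals)  # session-level aggregation metrics
```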
Next steps
- Trace Metrics Reference: detailed documentation for each trace-level metric.
- Run via API: create trace eval runs programmatically.

