This page provides a comprehensive reference for every trace-level metric in PandaProbe. Each metric produces a score between 0.0 and 1.0, where higher is better (except where noted).
Task Completion
Registry name:
`task_completion` · Default threshold: 0.5 · Method: LLM judge (2-stage)
How it works
| Stage | What happens | Output schema |
|---|---|---|
| 1. Extract | LLM identifies the user’s task and a strictly factual description of what the agent did from the trace. Subjective language (e.g., “successfully”) is explicitly excluded. | task (string), outcome (string) |
| 2. Score | LLM compares the extracted task against the actual outcome and scores fulfillment on a 0–1 scale. | verdict (float 0–1), reason (string) |
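As a quick orientation, the two stage outputs can be read as the Python types below. The field names come from the table above; the class names and the threshold helper are illustrative, not part of PandaProbe's API.

```python
from typing import TypedDict

# Stage 1 output: what the judge extracts from the trace (field names from the table above).
class TaskExtraction(TypedDict):
    task: str      # the user's task, stated neutrally
    outcome: str   # strictly factual description of what the agent did

# Stage 2 output: the fulfillment judgment.
class TaskVerdict(TypedDict):
    verdict: float  # 0.0-1.0 fulfillment score
    reason: str     # short justification

def passes_default_threshold(verdict: TaskVerdict, threshold: float = 0.5) -> bool:
    """Apply the metric's default pass/fail threshold (0.5) to the Stage 2 verdict."""
    return verdict["verdict"] >= threshold
```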
Scoring guide
| Score | Meaning |
|---|---|
| 1.0 | Task perfectly accomplished |
| 0.75–0.99 | Mostly complete, minor aspects missing |
| 0.5–0.74 | Partially complete |
| 0.25–0.49 | Significant gaps in completion |
| 0.0–0.24 | Task not meaningfully addressed |
Metadata returned
Tool Correctness
Registry name:
`tool_correctness` · Default threshold: 0.5 · Method: LLM judge (2-stage)
How it works
| Stage | What happens | Output schema |
|---|---|---|
| 1. Extract | LLM extracts the user’s goal, all tool calls made (name + parameters), and all available tools (name + description) from trace spans. | user_input, tools_called (list), available_tools (list) |
| 2. Score | LLM judges tool selection quality considering correct selection, over-selection, under-selection, and mis-selection. | score (float 0–1), reason (string) |
What the judge evaluates
- Correct selection — Were the tools used appropriate and sufficient?
- Over-selection — Were unnecessary or redundant tools called?
- Under-selection — Were useful available tools ignored?
- Mis-selection — Were wrong or irrelevant tools chosen?
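To make these failure modes concrete, here is a hypothetical Stage 1 extraction. The keys mirror the schema above (`user_input`, `tools_called`, `available_tools`), but every value is invented for illustration.

```python
# Hypothetical extraction from a trace; values are illustrative only.
extraction = {
    "user_input": "What's the weather in Lisbon tomorrow?",
    "tools_called": [
        {"name": "web_search", "parameters": {"query": "Lisbon weather tomorrow"}},
        {"name": "web_search", "parameters": {"query": "Lisbon weather tomorrow"}},  # duplicate call -> over-selection
    ],
    "available_tools": [
        {"name": "web_search", "description": "Search the web"},
        {"name": "get_forecast", "description": "Fetch a weather forecast for a city"},  # ignored -> under-selection
    ],
}
```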
Metadata returned
Argument Correctness
Registry name:
`argument_correctness` · Default threshold: 0.5 · Method: LLM judge (3-stage)
How it works
| Stage | What happens | Output schema |
|---|---|---|
| 1. Extract | LLM identifies the user input and all tool calls with their parameters and reasoning from the trace. | user_input (string), tool_calls (list of name/parameters/reasoning) |
| 2. Verdicts | LLM evaluates each tool call individually, returning a yes/no verdict on whether its arguments correctly address the task. | verdicts (list of verdict + reason) |
| 3. Reason | LLM produces a concise overall explanation from the score and list of incorrect-call reasons. | reason (string) |
Score calculation
The score is computed deterministically from the Stage 2 verdicts; the Stage 3 LLM call only produces the explanation.
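The exact aggregation formula isn't spelled out here, so the following is only a plausible sketch, assuming the score is the fraction of tool calls whose arguments received a "yes" verdict. The function name, verdict encoding, and empty-list behaviour are assumptions, not the documented implementation.

```python
def argument_correctness_score(verdicts: list[dict]) -> float:
    """Plausible deterministic aggregation of Stage 2 verdicts.

    Assumes each entry looks like {"verdict": "yes" | "no", "reason": "..."}.
    """
    if not verdicts:
        return 1.0  # assumed behaviour when no tool calls were made
    correct = sum(1 for v in verdicts if v["verdict"] == "yes")
    return correct / len(verdicts)
```
Metadata returned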
Step Efficiency
Registry name:
`step_efficiency` · Default threshold: 0.5 · Method: LLM judge (2-stage)
How it works
| Stage | What happens | Output schema |
|---|---|---|
| 1. Extract | LLM extracts the user’s original goal from the trace. | task (string) |
| 2. Score | LLM scores execution efficiency based on minimality of actions, penalizing redundant steps, unnecessary tool calls, and speculative work. | score (float 0–1), reason (string) |
What lowers the score
- Redundant or duplicate tool calls
- Unnecessary intermediate steps
- Speculative work that wasn’t needed for the task
- Overly verbose reasoning chains that don’t add value
Metadata returned
Confidence
Registry name:
`confidence` · Default threshold: 0.5 · Method: LLM judge (1-stage)
Related metrics: `agent_reliability` and `agent_consistency`.
How it works
A single LLM call evaluates the entire trace against four criteria:
- Decisiveness — Did the agent act without unnecessary hesitation or contradictory steps?
- Appropriateness — Were the actions relevant to the user’s goal?
- Consistency — Did the agent maintain a coherent strategy throughout?
- Indicators of low confidence — hedging language, contradictions, unnecessary retries, vague outputs, repeated tool calls with identical parameters, or abandoned strategies
Scoring guide
| Score | Meaning |
|---|---|
| 1.0 | Fully confident, decisive execution |
| 0.75 | Minor hesitation or suboptimal strategy |
| 0.5 | Noticeable indecision or inconsistency |
| 0.25 | Significant uncertainty, multiple abandoned approaches |
| 0.0 | Completely uncertain, contradictory actions |
Metadata returned
Plan Adherence
Registry name:
`plan_adherence` · Default threshold: 0.5 · Method: LLM judge (3-stage)
How it works
| Stage | What happens | Output schema |
|---|---|---|
| 1. Extract task | LLM extracts the user’s task from the trace (reuses the Step Efficiency extraction prompt). | task (string) |
| 2. Extract plan | LLM extracts the agent’s explicit or implied plan from reasoning/thought fields. Every step must be supported by trace evidence — no hallucination. | plan (list of strings) |
| 3. Score | LLM scores how strictly execution followed the plan, checking step order and completeness. | score (float 0–1), reason (string) |
Scoring guide
| Score | Meaning |
|---|---|
| 1.0 | Perfect adherence — execution matches plan exactly |
| 0.75 | Nearly all steps in order, minor deviations |
| 0.5 | Partial adherence, some steps skipped or reordered |
| 0.25 | Weak adherence, significant deviation from plan |
| 0.0 | No adherence, execution bears little relation to plan |
What the judge evaluates
- Were all planned steps executed?
- Were steps followed in the intended order?
- Were there extraneous actions not in the plan?
- Were any planned steps skipped?
Plan Quality
Registry name:
`plan_quality` · Default threshold: 0.5 · Method: LLM judge (3-stage)
How it works
| Stage | What happens | Output schema |
|---|---|---|
| 1. Extract task | LLM extracts the user’s task from the trace. | task (string) |
| 2. Extract plan | LLM extracts the agent’s plan from reasoning fields. | plan (list of strings) |
| 3. Score quality | LLM scores plan quality based on completeness, logical coherence, optimality, detail level, and alignment with the task. | score (float 0–1), reason (string) |
What the judge evaluates
- Completeness — Does the plan address all aspects of the task?
- Logical coherence — Are steps ordered and structured sensibly?
- Optimality/efficiency — Could the plan be streamlined?
- Level of detail — Sufficiently detailed without being overly verbose?
- Alignment with task — Does the plan match the user’s intent?
Scoring guide
| Score | Meaning |
|---|---|
| 1.0 | Excellent plan — complete, coherent, optimal |
| 0.75 | Good plan — minor flaws or suboptimal choices |
| 0.5 | Adequate but flawed — works but has notable gaps |
| 0.25 | Weak plan — significant issues |
| 0.0 | Inadequate — does not address the task |
Coherence
Registry name:
`coherence` · Default threshold: 0.5 · Method: Embedding distance (no LLM call)
How it works
- The trace’s `input` and `output` are serialized to text
- Both texts are embedded using the configured embedding model
- The cosine distance between the two embeddings is computed
- Score = `1.0 - cosine_distance`, clamped to [0, 1]
Edge cases
- If either input or output is empty, the metric returns 1.0 with a note explaining coherence was assumed
- The metric also serves as a signal for session-level aggregation
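A minimal sketch of the scoring step, assuming the serialized input and output are embedded by a caller-supplied `embed` function (a stand-in for the configured embedding model, whose interface is not documented here):

```python
import numpy as np

def coherence_score(input_text: str, output_text: str, embed) -> float:
    """Sketch: 1 - cosine distance between input/output embeddings, clamped to [0, 1]."""
    if not input_text or not output_text:
        return 1.0  # documented edge case: coherence is assumed when either side is empty
    u, v = np.asarray(embed(input_text)), np.asarray(embed(output_text))
    cosine_similarity = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    cosine_distance = 1.0 - cosine_similarity
    return max(0.0, min(1.0, 1.0 - cosine_distance))
```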
Metadata returned
Loop Detection
Registry name:
`loop_detection` · Default threshold: 0.5 · Method: Hybrid semantic + Jaccard similarity
How it works
- The current trace’s output and the previous traces’ outputs (up to a window of 3) are collected
- All outputs are embedded using the configured embedding model
- For each previous trace, two similarity scores are computed:
- Cosine similarity (semantic overlap) between embeddings
- Jaccard similarity (lexical overlap) between tokenized word sets (with stop-word removal)
- A hybrid score = cosine × Jaccard is computed for each pair
- Final score = `1.0 - max(hybrid_scores)`, clamped to [0, 1]
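The scoring math can be sketched as below, under the assumptions above. The stop-word list, whitespace tokenizer, function names, and the behaviour when no previous traces exist are simplifications; embeddings are passed in as precomputed vectors rather than produced by the configured model.

```python
import numpy as np

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is"}  # illustrative subset

def jaccard(a: str, b: str) -> float:
    """Lexical overlap between tokenized word sets, with stop words removed."""
    ta = {w for w in a.lower().split() if w not in STOP_WORDS}
    tb = {w for w in b.lower().split() if w not in STOP_WORDS}
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic overlap between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def loop_detection_score(current: tuple[str, np.ndarray],
                         previous: list[tuple[str, np.ndarray]]) -> float:
    """Hybrid score per pair = cosine * Jaccard; final = 1 - max, clamped to [0, 1]."""
    cur_text, cur_emb = current
    hybrid_scores = [cosine(cur_emb, emb) * jaccard(cur_text, text)
                     for text, emb in previous[-3:]]  # window of 3 previous traces
    if not hybrid_scores:
        return 1.0  # assumed: with no previous outputs there is nothing to repeat
    return max(0.0, min(1.0, 1.0 - max(hybrid_scores)))
```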
Why the hybrid approach
| Scenario | Cosine | Jaccard | Hybrid | Interpretation |
|---|---|---|---|---|
| Agent repeating exact same response | High | High | High → Low score | Stuck in a loop |
| Agent enumerating related items | High | Low | Low → High score | Valid exploration |
| Completely unrelated outputs | Low | Low | Low → High score | No repetition |
Metadata returned
Model override
All LLM-based metrics support a model override parameter. When creating an eval run, you can specify `model` (e.g., "openai/gpt-5.4") to change which LLM serves as the judge. If omitted, the system default is used.
This lets you balance cost and accuracy — use a faster model for quick checks and a more capable model for production evaluations.
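For illustration only, a request with an override might look like the sketch below. The endpoint URL, trace identifier, and payload fields other than `model` are assumptions; see the Run via API page for the actual request schema.

```python
import requests

# Hypothetical request shape; only the "model" override itself is documented above.
resp = requests.post(
    "https://api.pandaprobe.com/v1/eval-runs",   # assumed endpoint
    json={
        "trace_id": "tr_123",                    # assumed field and value
        "metrics": ["task_completion", "tool_correctness"],
        "model": "openai/gpt-5.4",               # override the default judge model
    },
    timeout=30,
)
resp.raise_for_status()
```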
Next steps
Agent Evaluation Metrics
Learn about session-level aggregation metrics.
Run via API
Create trace eval runs programmatically.

