
This page provides a comprehensive reference for every trace-level metric in PandaProbe. Each metric produces a score between 0.0 and 1.0, where higher is better (except where noted).

Task Completion

Registry name: task_completion · Default threshold: 0.5 · Method: LLM judge (2-stage)
Evaluates whether the agent accomplished the user’s stated objective. This is typically the most important metric — it answers the fundamental question: Did the agent do what was asked?

How it works

| Stage | What happens | Output schema |
| --- | --- | --- |
| 1. Extract | LLM identifies the user’s task and a strictly factual description of what the agent did from the trace. Subjective language (e.g., “successfully”) is explicitly excluded. | task (string), outcome (string) |
| 2. Score | LLM compares the extracted task against the actual outcome and scores fulfillment on a 0–1 scale. | verdict (float 0–1), reason (string) |
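
As a rough illustration of the flow, a minimal sketch follows. The call_llm helper, its prompts, and the inclusive threshold comparison are assumptions for illustration, not PandaProbe’s actual internals:

def call_llm(prompt: str) -> dict:
    """Hypothetical structured-output LLM call; returns parsed JSON."""
    raise NotImplementedError  # placeholder, not part of PandaProbe

def task_completion(trace_text: str, threshold: float = 0.5) -> dict:
    # Stage 1: extract the task and a strictly factual outcome description.
    extraction = call_llm(
        "From this trace, return JSON with 'task' (the user's objective) and "
        "'outcome' (a factual description of what the agent did, with no "
        "subjective language):\n" + trace_text
    )
    # Stage 2: judge how well the outcome fulfills the task on a 0-1 scale.
    judgment = call_llm(
        f"Task: {extraction['task']}\nOutcome: {extraction['outcome']}\n"
        "Return JSON with 'verdict' (float 0-1) and 'reason' (string)."
    )
    score = float(judgment["verdict"])
    # The returned metadata (see example below) omits the raw score.
    return {
        "task": extraction["task"],
        "outcome": extraction["outcome"],
        "threshold": threshold,
        "success": score >= threshold,  # assumption: threshold is inclusive
    }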

Scoring guide

| Score | Meaning |
| --- | --- |
| 1.0 | Task perfectly accomplished |
| 0.75–0.99 | Mostly complete, minor aspects missing |
| 0.5–0.74 | Partially complete |
| 0.25–0.49 | Significant gaps in completion |
| 0.0–0.24 | Task not meaningfully addressed |

Metadata returned

{
  "task": "Book a round-trip flight from SFO to JFK for next Friday",
  "outcome": "Found 3 available flights, selected the cheapest option, and completed booking confirmation",
  "threshold": 0.5,
  "success": true
}

Tool Correctness

Registry name: tool_correctness · Default threshold: 0.5 · Method: LLM judge (2-stage)
Evaluates whether the agent selected appropriate tools for its task. Catches over-selection (unnecessary tools), under-selection (missing tools), and mis-selection (wrong tools).

How it works

| Stage | What happens | Output schema |
| --- | --- | --- |
| 1. Extract | LLM extracts the user’s goal, all tool calls made (name + parameters), and all available tools (name + description) from trace spans. | user_input, tools_called (list), available_tools (list) |
| 2. Score | LLM judges tool selection quality, considering correct selection, over-selection, under-selection, and mis-selection. | score (float 0–1), reason (string) |

What the judge evaluates

  • Correct selection — Were the tools used appropriate and sufficient?
  • Over-selection — Were unnecessary or redundant tools called?
  • Under-selection — Were useful available tools ignored?
  • Mis-selection — Were wrong or irrelevant tools chosen?

Metadata returned

{
  "user_input": "Look up the weather in Tokyo and book a restaurant",
  "tools_called": [
    {"name": "weather_lookup", "parameters": {"city": "Tokyo"}},
    {"name": "restaurant_booking", "parameters": {"city": "Tokyo", "cuisine": "sushi"}}
  ],
  "available_tools": [
    {"name": "weather_lookup", "description": "Get weather forecast"},
    {"name": "restaurant_booking", "description": "Book a restaurant"},
    {"name": "flight_search", "description": "Search flights"}
  ],
  "threshold": 0.5,
  "success": true
}

Argument Correctness

Registry name: argument_correctness · Default threshold: 0.5 · Method: LLM judge (3-stage)
Evaluates whether the arguments passed to each tool call were correct for the user’s task. While Tool Correctness checks which tools were called, Argument Correctness checks how they were called.

How it works

| Stage | What happens | Output schema |
| --- | --- | --- |
| 1. Extract | LLM identifies the user input and all tool calls with their parameters and reasoning from the trace. | user_input (string), tool_calls (list of name/parameters/reasoning) |
| 2. Verdicts | LLM evaluates each tool call individually, returning a yes/no verdict on whether its arguments correctly address the task. | verdicts (list of verdict + reason) |
| 3. Reason | LLM produces a concise overall explanation from the score and the list of incorrect-call reasons. | reason (string) |

Score calculation

The score is computed deterministically from the verdicts:
score = correct_count / total_verdicts
If a trace has no tool calls, the metric returns a perfect 1.0 score (nothing to evaluate).
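
A minimal sketch of this aggregation, using the same verdict shape as the metadata example below:

def argument_correctness_score(verdicts: list[dict]) -> float:
    # No tool calls: nothing to evaluate, so the score is a perfect 1.0.
    if not verdicts:
        return 1.0
    correct = sum(1 for v in verdicts if v["verdict"] == "yes")
    return correct / len(verdicts)

# One correct and one incorrect call yields 0.5, which meets the default
# 0.5 threshold (the metadata example below reports success: true).
verdicts = [
    {"verdict": "yes", "reason": None},
    {"verdict": "no", "reason": "Price filter was set to $1000 instead of $500"},
]
assert argument_correctness_score(verdicts) == 0.5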

Metadata returned

{
  "user_input": "Find flights from SFO to JFK under $500",
  "verdicts": [
    {"verdict": "yes", "reason": null},
    {"verdict": "no", "reason": "Price filter was set to $1000 instead of $500"}
  ],
  "threshold": 0.5,
  "success": true
}

Step Efficiency

Registry name: step_efficiency · Default threshold: 0.5 · Method: LLM judge (2-stage)
Evaluates how efficiently the agent executed its task, penalizing redundant steps, unnecessary tool calls, and speculative work.

How it works

| Stage | What happens | Output schema |
| --- | --- | --- |
| 1. Extract | LLM extracts the user’s original goal from the trace. | task (string) |
| 2. Score | LLM scores execution efficiency based on minimality of actions, penalizing redundant steps, unnecessary tool calls, and speculative work. | score (float 0–1), reason (string) |

What lowers the score

  • Redundant or duplicate tool calls
  • Unnecessary intermediate steps
  • Speculative work that wasn’t needed for the task
  • Overly verbose reasoning chains that don’t add value

Metadata returned

{
  "task": "Get the current Bitcoin price",
  "threshold": 0.5,
  "success": true
}

Confidence

Registry name: confidence · Default threshold: 0.5 · Method: LLM judge (1-stage)
Evaluates whether the agent’s actions were decisive, appropriate, and well-founded. This metric is also used as a signal for session-level aggregation in agent_reliability and agent_consistency.

How it works

A single LLM call evaluates the entire trace against four criteria:
  • Decisiveness — Did the agent act without unnecessary hesitation or contradictory steps?
  • Appropriateness — Were the actions relevant to the user’s goal?
  • Consistency — Did the agent maintain a coherent strategy throughout?
  • Indicators of low confidence — hedging language, contradictions, unnecessary retries, vague outputs, repeated tool calls with identical parameters, or abandoned strategies

Scoring guide

| Score | Meaning |
| --- | --- |
| 1.0 | Fully confident, decisive execution |
| 0.75 | Minor hesitation or suboptimal strategy |
| 0.5 | Noticeable indecision or inconsistency |
| 0.25 | Significant uncertainty, multiple abandoned approaches |
| 0.0 | Completely uncertain, contradictory actions |

Metadata returned

{
  "threshold": 0.5,
  "success": true
}

Plan Adherence

Registry name: plan_adherence · Default threshold: 0.5 · Method: LLM judge (3-stage)
Evaluates how closely the agent followed its declared or implied plan during execution. Useful for agents that produce a plan before acting.

How it works

| Stage | What happens | Output schema |
| --- | --- | --- |
| 1. Extract task | LLM extracts the user’s task from the trace (reuses the Step Efficiency extraction prompt). | task (string) |
| 2. Extract plan | LLM extracts the agent’s explicit or implied plan from reasoning/thought fields. Every step must be supported by trace evidence, with no hallucination. | plan (list of strings) |
| 3. Score | LLM scores how strictly execution followed the plan, checking step order and completeness. | score (float 0–1), reason (string) |

If no plan is found in the trace, the metric returns 1.0 (no plan to deviate from).

Scoring guide

| Score | Meaning |
| --- | --- |
| 1.0 | Perfect adherence: execution matches the plan exactly |
| 0.75 | Nearly all steps in order, minor deviations |
| 0.5 | Partial adherence, some steps skipped or reordered |
| 0.25 | Weak adherence, significant deviation from the plan |
| 0.0 | No adherence, execution bears little relation to the plan |

What the judge evaluates

  • Were all planned steps executed?
  • Were steps followed in the intended order?
  • Were there extraneous actions not in the plan?
  • Were any planned steps skipped?

Plan Quality

Registry name: plan_quality · Default threshold: 0.5 · Method: LLM judge (3-stage)
Evaluates the intrinsic quality of the agent’s plan, independent of whether the plan was followed. While Plan Adherence checks execution vs. plan, Plan Quality checks whether the plan itself was good.

How it works

| Stage | What happens | Output schema |
| --- | --- | --- |
| 1. Extract task | LLM extracts the user’s task from the trace. | task (string) |
| 2. Extract plan | LLM extracts the agent’s plan from reasoning fields. | plan (list of strings) |
| 3. Score quality | LLM scores plan quality based on completeness, logical coherence, optimality, detail level, and alignment with the task. | score (float 0–1), reason (string) |

If no plan is found, the metric returns 1.0.

What the judge evaluates

  • Completeness — Does the plan address all aspects of the task?
  • Logical coherence — Are steps ordered and structured sensibly?
  • Optimality/efficiency — Could the plan be streamlined?
  • Level of detail — Sufficiently detailed without being overly verbose?
  • Alignment with task — Does the plan match the user’s intent?

Scoring guide

| Score | Meaning |
| --- | --- |
| 1.0 | Excellent plan: complete, coherent, optimal |
| 0.75 | Good plan: minor flaws or suboptimal choices |
| 0.5 | Adequate but flawed: works but has notable gaps |
| 0.25 | Weak plan: significant issues |
| 0.0 | Inadequate: does not address the task |

Coherence

Registry name: coherence · Default threshold: 0.5 · Method: Embedding distance (no LLM call)
Measures whether the agent’s output logically follows from its input using embedding-based cosine distance. This metric is fast and deterministic — it only requires embedding API calls, no LLM generation.

How it works

  1. The trace’s input and output are serialized to text
  2. Both texts are embedded using the configured embedding model
  3. The cosine distance between the two embeddings is computed
  4. Score = 1.0 - cosine_distance (clamped to [0, 1])
A small distance (high score) means the output is semantically aligned with the input. A large distance (low score) suggests the output is unrelated or off-topic.
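
A minimal sketch of the computation, assuming a hypothetical embed() helper that calls the configured embedding model:

import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical wrapper around the configured embedding model."""
    raise NotImplementedError  # placeholder, not part of PandaProbe

def coherence_score(input_text: str, output_text: str) -> float:
    # Empty input or output: coherence is assumed (see edge cases below).
    if not input_text or not output_text:
        return 1.0
    a, b = embed(input_text), embed(output_text)
    cosine_similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    cosine_distance = 1.0 - cosine_similarity
    return float(np.clip(1.0 - cosine_distance, 0.0, 1.0))

The coherence_gap field in the metadata below presumably reports this cosine distance; a gap of 0.1234 would correspond to a score of roughly 0.88.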

Edge cases

  • If either input or output is empty, the metric returns 1.0 with a note explaining coherence was assumed
  • The metric also serves as a signal for session-level aggregation

Metadata returned

{
  "coherence_gap": 0.1234,
  "threshold": 0.5,
  "success": true
}

Loop Detection

Registry name: loop_detection · Default threshold: 0.5 · Method: Hybrid semantic + Jaccard similarity
Detects whether the agent is stuck repeating itself across traces in the same session. This metric requires session context — it compares the current trace’s output against previous traces.
Loop Detection is excluded from standalone trace eval runs. It runs automatically as a signal during session-level evaluations.

How it works

  1. The current trace’s output and the previous traces’ outputs (up to a window of 3) are collected
  2. All outputs are embedded using the configured embedding model
  3. For each previous trace, two similarity scores are computed:
    • Cosine similarity (semantic overlap) between embeddings
    • Jaccard similarity (lexical overlap) between tokenized word sets (with stop-word removal)
  4. A hybrid score = cosine × Jaccard is computed for each pair
  5. Final score = 1.0 - max(hybrid_scores) (clamped to [0, 1])
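
A sketch of this computation, assuming a hypothetical embed() helper for the configured embedding model and a deliberately abbreviated stop-word list:

import numpy as np

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "or", "the", "to"}  # abbreviated

def embed(text: str) -> np.ndarray:
    """Hypothetical wrapper around the configured embedding model."""
    raise NotImplementedError  # placeholder, not part of PandaProbe

def jaccard(a: str, b: str) -> float:
    # Lexical overlap between stop-word-filtered token sets.
    ta = {w for w in a.lower().split() if w not in STOP_WORDS}
    tb = {w for w in b.lower().split() if w not in STOP_WORDS}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Semantic overlap between embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def loop_detection_score(current: str, previous: list[str], window: int = 3) -> float:
    cur_vec = embed(current)
    hybrids = [
        cosine(cur_vec, embed(prev)) * jaccard(current, prev)
        for prev in previous[-window:]
    ]
    if not hybrids:
        return 1.0  # no prior traces to compare against
    return float(np.clip(1.0 - max(hybrids), 0.0, 1.0))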

Why the hybrid approach

| Scenario | Cosine | Jaccard | Hybrid | Interpretation |
| --- | --- | --- | --- | --- |
| Agent repeating the exact same response | High | High | High → low score | Stuck in a loop |
| Agent enumerating related items | High | Low | Low → high score | Valid exploration |
| Completely unrelated outputs | Low | Low | Low → high score | No repetition |

The multiplication of cosine × Jaccard ensures that only outputs that are both semantically and lexically similar are flagged as loops.
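
In the metadata example below, the most similar previous trace has cosine similarity 0.9512 and Jaccard similarity 0.7608, giving the reported hybrid score of 0.7234; the final score is 1.0 - 0.7234 = 0.2766, which falls below the 0.5 threshold, so success is false.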

Metadata returned

{
  "window_size": 3,
  "max_hybrid": 0.7234,
  "comparisons": [
    {
      "trace_index": 2,
      "cosine_similarity": 0.9512,
      "jaccard_similarity": 0.7608,
      "hybrid_score": 0.7234
    }
  ],
  "threshold": 0.5,
  "success": false
}

Model override

All LLM-based metrics support a model override parameter. When creating an eval run, you can specify model (e.g., "openai/gpt-5.4") to change which LLM serves as the judge. If omitted, the system default is used. This lets you balance cost and accuracy — use a faster model for quick checks and a more capable model for production evaluations.
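
For instance, a run-creation request might look like the sketch below. The endpoint path and every field other than model are hypothetical here; consult the Run via API page for the actual schema:

import requests

# Hypothetical endpoint and payload shape; only the "model" override
# semantics come from this page.
response = requests.post(
    "https://api.pandaprobe.com/v1/eval-runs",  # hypothetical URL
    json={
        "trace_id": "tr_abc123",                            # hypothetical field
        "metrics": ["task_completion", "tool_correctness"], # hypothetical field
        "model": "openai/gpt-5.4",                          # judge model override
    },
    timeout=30,
)
response.raise_for_status()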

Next steps

Agent Evaluation Metrics

Learn about session-level aggregation metrics.

Run via API

Create trace eval runs programmatically.