# Introduction

> Evaluate individual traces with LLM-as-judge metrics and embedding analysis.

Trace evaluation scores a single agent execution: one trace with all of its spans, model calls, tool calls, inputs, outputs, and metadata. Use it to answer questions like "Did the agent complete the task?", "Were the right tools used?", and "Was the output coherent?"

PandaProbe ships with **9 built-in trace-level metrics**. Most use LLM-as-judge analysis, while coherence and loop detection use embedding or similarity-based methods.

## Available metrics

* Did the agent accomplish the user's stated objective?
* Did the agent select appropriate tools for the task?
* Were tool call arguments correctly specified?
* Did the agent execute with minimal unnecessary steps?
* Were the agent's actions decisive and well-founded?
* Did the agent follow its declared plan?
* Was the agent's plan complete and well-structured?
* Does the output logically follow from the input?
* Is the agent stuck repeating itself across traces?

## How trace metrics work

Each metric receives the full trace and produces a score between 0 and 1, a human-readable reason, and structured metadata. Higher scores generally mean better behavior.

### LLM-as-judge metrics

Most LLM-as-judge metrics separate context extraction from scoring:

1. PandaProbe turns raw trace data into focused inputs for the metric, such as the user task, final outcome, tool calls, tool arguments, or declared plan.
2. An LLM evaluates the extracted context for a specific metric, such as task completion, tool correctness, or plan quality.
3. The metric returns a numeric score, a reason, and metadata that explains the intermediate extraction or verdict.

This multi-stage design improves reliability because the judge LLM receives clean, focused inputs rather than raw trace data.
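The extract-then-score flow described above can be sketched in a few lines. This is a minimal illustration, not PandaProbe's implementation: `JudgeFn`, `SketchResult`, `two_stage_evaluate`, the prompt wording, and the JSON reply format are all assumptions made for the example, and `fake_judge` stands in for a real LLM call.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns the model's text reply.
JudgeFn = Callable[[str], str]


@dataclass
class SketchResult:
    score: float  # clamped to the range [0, 1]
    reason: str
    metadata: dict = field(default_factory=dict)


def two_stage_evaluate(raw_trace: str, judge: JudgeFn) -> SketchResult:
    # Stage 1 (extract): reduce the raw trace to the task and the outcome.
    extracted = json.loads(judge(
        "Return JSON with keys 'task' and 'outcome' for this trace:\n" + raw_trace
    ))

    # Stage 2 (score): the judge sees only the focused context, not the raw trace.
    verdict = json.loads(judge(
        "Return JSON with keys 'score' (0-1) and 'reason'.\n"
        f"Task: {extracted['task']}\nOutcome: {extracted['outcome']}"
    ))

    # Clamp defensively in case the judge returns an out-of-range number.
    score = min(1.0, max(0.0, float(verdict["score"])))
    return SketchResult(
        score=score,
        reason=verdict["reason"],
        metadata={"extracted": extracted},
    )


def fake_judge(prompt: str) -> str:
    # Canned replies standing in for a real LLM call (assumption for the demo).
    if "'task' and 'outcome'" in prompt:
        return json.dumps({"task": "Book a flight", "outcome": "Flight booked"})
    return json.dumps({"score": 1.2, "reason": "Outcome matches the task."})


result = two_stage_evaluate("<raw trace text>", fake_judge)
```

Because stage 2 only sees the extracted task and outcome, its prompt stays short regardless of how large the trace is, which is the reliability argument made above.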
**Example: Task Completion (2-stage)**

1. **Extract**: LLM identifies the user's task and the agent's factual outcome from the trace
2. **Score**: LLM compares task vs. outcome and returns a 0–1 verdict with explanation

**Example: Argument Correctness (3-stage)**

1. **Extract**: LLM identifies user input and all tool calls (name, parameters, reasoning)
2. **Verdict**: LLM evaluates each tool call's arguments individually (yes/no per call)
3. **Reason**: LLM produces an overall explanation from the per-call verdicts

### Embedding-based metrics

Two metrics skip LLM judge calls entirely:

* **Coherence** computes the cosine distance between input and output embeddings. A small distance means high coherence (score close to 1.0).
* **Loop Detection** uses a hybrid approach: cosine similarity (semantic overlap) multiplied by Jaccard similarity (lexical overlap) across recent traces. High scores on *both* indicate the agent is stuck repeating itself.

## Metric interface

Every trace metric implements the same `evaluate()` method:

```python
async def evaluate(
    self,
    trace: Trace,
    llm: LLMEngine,
    *,
    threshold: float | None = None,
    model: str | None = None,
    session_traces: list[Trace] | None = None,
) -> MetricResult: ...
```

| Parameter        | Description                                                         |
| ---------------- | ------------------------------------------------------------------- |
| `trace`          | The full trace entity with all spans                                |
| `llm`            | LLM engine for judge calls and embeddings                           |
| `threshold`      | Override the metric's default pass/fail threshold                   |
| `model`          | Override the default LLM model (e.g., `openai/gpt-5.4`)             |
| `session_traces` | Previous traces in the same session (only used by `loop_detection`) |

## Standalone vs. session-context metrics

Most trace metrics evaluate a trace **in isolation**: they don't need other traces. These are available for standalone eval runs.
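The two calculations described under Embedding-based metrics can be illustrated with plain Python, which also shows why one of them needs earlier traces while the other does not. The sketch assumes embeddings are already available as lists of floats. The whitespace tokenization, the `1 - distance` mapping for coherence, taking the maximum over previous traces, and all function names are assumptions for illustration, not PandaProbe's actual implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Semantic overlap: cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: str, b: str) -> float:
    # Lexical overlap on lowercase whitespace tokens (tokenization is assumed).
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def coherence_score(input_vec: list[float], output_vec: list[float]) -> float:
    # Cosine distance is 1 - cosine similarity; a small distance gives a score
    # near 1.0. The exact distance-to-score mapping here is an assumption.
    distance = 1.0 - cosine_similarity(input_vec, output_vec)
    return min(1.0, max(0.0, 1.0 - distance))


def repetition_signal(
    current: tuple[str, list[float]],
    previous: list[tuple[str, list[float]]],
) -> float:
    # Hybrid signal: semantic overlap multiplied by lexical overlap, taking the
    # strongest match among earlier traces (the max aggregation is assumed).
    text, vec = current
    return max(
        (
            cosine_similarity(vec, prev_vec) * jaccard_similarity(text, prev_text)
            for prev_text, prev_vec in previous
        ),
        default=0.0,
    )
```

Here `previous` plays the role of `session_traces`: with no earlier traces there is nothing to compare against, so the signal falls back to 0.0, whereas `coherence_score` needs only the current trace's input and output. How the repetition signal maps onto the final 0 to 1 metric score is not specified here.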
One metric, `loop_detection`, requires **session context** (the `session_traces` parameter) to compare the current trace against previous outputs. It is excluded from standalone trace eval runs but is automatically computed as a signal during session-level evaluation.

When you run a session eval, PandaProbe first computes trace-level signals (`confidence`, `loop_detection`, `tool_correctness`, `coherence`) for each trace in the session, then feeds those signals into the session-level aggregation metrics.

## Next steps

* Detailed documentation for each trace-level metric.
* Create trace eval runs programmatically.