PandaProbe includes two session-level metrics that aggregate trace-level signals into scores capturing agent behavior across an entire session. Both are pure mathematical functions — they receive precomputed per-trace signals and make no LLM or embedding calls.

Agent Reliability

Registry name: agent_reliability · Default threshold: 0.5 · Method: Max-compose + top-k tail risk
Measures worst-case failure risk across a session. A session with one catastrophic trace scores poorly even if all other traces are fine. Use this metric to catch agents that are generally good but occasionally fail badly.

Algorithm

For each trace in the session, the metric computes a per-trace risk from the precomputed signals:
1. Convert signals to risks

Each signal score is inverted to become a risk value:
confidence_risk = 1.0 - confidence_score
loop_risk       = 1.0 - loop_detection_score
tool_risk       = 1.0 - tool_correctness_score
coherence_risk  = 1.0 - coherence_score
2. Weight and max-compose per trace

Each risk is multiplied by its signal weight, and the maximum weighted risk becomes the trace’s risk:
per_trace_risk = max(w_conf × confidence_risk,
                     w_loop × loop_risk,
                     w_tool × tool_risk,
                     w_coh  × coherence_risk)
Only signals that are present for a trace are included — missing signals are skipped, not treated as zero.
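
A minimal Python sketch of steps 1 and 2, assuming signals arrive as a per-trace dict of scores keyed by signal name (the dict shape, weights, and helper name are illustrative, not the library's actual API):

# Hypothetical sketch: convert present signals to weighted risks, then max-compose.
SIGNAL_WEIGHTS = {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0,
}

def per_trace_risk(signals):
    weighted_risks = [
        weight * (1.0 - signals[name])  # invert score to risk, then weight
        for name, weight in SIGNAL_WEIGHTS.items()
        if name in signals              # missing signals are skipped, not zeroed
    ]
    return max(weighted_risks) if weighted_risks else None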
3. Top-k tail risk aggregation

Per-trace risks are sorted in descending order and the top 15% (at least one trace) are selected; a session with 10 evaluated traces, for example, gives k = max(1, ceil(10 × 0.15)) = 2:
k = max(1, ceil(num_traces × 0.15))
mean_top_k = mean(sorted_risks[:k])
max_risk   = sorted_risks[0]
4. Ensemble and final score

The raw session risk blends the top-k mean with the single worst trace:
raw_risk = 0.9 × mean_top_k + 0.1 × max_risk
score    = clamp(1.0 - raw_risk, 0, 1)
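
Putting steps 3 and 4 together, a sketch of the session-level aggregation (it reuses the hypothetical per_trace_risk output above; the 0.5 flag threshold comes from the Flagged traces note below):

import math

def reliability_score(per_trace_risks):
    # per_trace_risks: {trace_id: risk} from the max-compose step
    if not per_trace_risks:
        return 1.0, []                               # edge case: nothing to evaluate
    risks = sorted(per_trace_risks.values(), reverse=True)
    k = max(1, math.ceil(len(risks) * 0.15))         # top-k tail
    mean_top_k = sum(risks[:k]) / k
    raw_risk = 0.9 * mean_top_k + 0.1 * risks[0]     # blend in the single worst trace
    score = min(1.0, max(0.0, 1.0 - raw_risk))
    flagged = [tid for tid, r in per_trace_risks.items() if r > 0.5]
    return score, flagged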

Why max-compose + top-k

  • Max-compose ensures each trace’s risk is driven by its worst signal. An agent that has great tool selection but terrible coherence on one trace still gets flagged.
  • Top-k focuses on the tail of the distribution. A session with 100 traces where 3 have high risk will be scored based on those 3, not diluted by the 97 good ones.
  • The 10% max-risk blend gives extra weight to the single worst trace, so the most severe failure is never fully diluted within the top-k mean.

Flagged traces

Traces with per_trace_risk > 0.5 are flagged in the metadata. This lets you quickly identify which specific traces are dragging down the session’s reliability score.

Metadata returned

{
  "total_traces_in_session": 12,
  "traces_evaluated": 10,
  "raw_risk": 0.3200,
  "signal_weights": {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0
  },
  "per_trace_signals": {
    "trace-id-1": {
      "confidence_risk": 0.12,
      "loop_risk": 0.05,
      "tool_risk": 0.08,
      "coherence_risk": 0.03,
      "step_risk": 0.12
    },
    "trace-id-7": {
      "confidence_risk": 0.65,
      "loop_risk": 0.72,
      "tool_risk": 0.10,
      "coherence_risk": 0.15,
      "step_risk": 0.72
    }
  },
  "flagged_traces": ["trace-id-7"],
  "aggregation": {
    "method": "max_compose_top_k",
    "top_k_percentile": 0.15,
    "ensemble_weight": 0.1,
    "mean_top_k_risk": 0.72,
    "max_risk": 0.72
  }
}
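
As an illustration, the flagged traces and their dominant risks can be pulled out of this metadata (run_result is a hypothetical variable holding the eval output; the dict shape follows the example above):

metadata = run_result["metadata"]  # hypothetical accessor
for trace_id in metadata["flagged_traces"]:
    signals = metadata["per_trace_signals"][trace_id]
    print(trace_id, "dominant risk:", signals["step_risk"])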

Interpreting the score

Score range   Meaning
0.9–1.0       Highly reliable — no elevated risk in any trace
0.7–0.89      Generally reliable — minor risk in a few traces
0.5–0.69      Moderate risk — some traces show concerning behavior
0.3–0.49      Elevated risk — multiple traces with significant issues
0.0–0.29      High risk — session contains catastrophic failures

Agent Consistency

Registry name: agent_consistency · Default threshold: 0.5 · Method: Weighted RMS aggregation
Measures overall stability across a session. Unlike reliability (which focuses on worst moments), consistency penalizes any trace that deviates from smooth operation. A session with many moderate issues scores poorly even if no single trace is catastrophic.

Algorithm

1. Filter traces

Only traces with a confidence signal are included. Traces missing confidence are skipped entirely (unlike reliability, which includes traces with any signal subset).
2. Compute weighted uncertainty per trace

For each trace:
confidence_risk = 1.0 - confidence_score

penalty = w_loop × (1 - loop_detection)      # if present
        + w_tool × (1 - tool_correctness)     # if present
        + w_coh  × (1 - coherence)            # if present

amplification = 1.0 + penalty
weighted_uncertainty = amplification × (w_conf × confidence_risk)
The penalty terms amplify the confidence risk. A trace with low confidence and poor tool correctness gets a higher uncertainty than one with low confidence alone.
3. RMS aggregation

The root mean square of all weighted uncertainties becomes the raw instability:
rms = sqrt(sum(wu² for wu in weighted_uncertainties) / n)
score = clamp(1.0 - rms, 0, 1)
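
A minimal sketch of the full consistency computation in Python, under the same assumed signal-dict shape as the reliability sketches above:

import math

WEIGHTS = {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0,
}

def consistency_score(traces):
    # traces: list of per-trace signal dicts
    uncertainties = []
    for signals in traces:
        if "confidence" not in signals:
            continue                                 # step 1: skip traces without confidence
        confidence_risk = 1.0 - signals["confidence"]
        penalty = sum(
            WEIGHTS[name] * (1.0 - signals[name])    # step 2: penalty from optional signals
            for name in ("loop_detection", "tool_correctness", "coherence")
            if name in signals
        )
        uncertainties.append((1.0 + penalty) * WEIGHTS["confidence"] * confidence_risk)
    if not uncertainties:
        return 1.0                                   # edge case: no evaluable traces
    rms = math.sqrt(sum(u * u for u in uncertainties) / len(uncertainties))
    return min(1.0, max(0.0, 1.0 - rms))             # step 3: RMS -> score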

Why RMS

RMS (root mean square) is sensitive to variation. Unlike a simple average:
  • A session where every trace has the same moderate uncertainty (e.g., all at 0.3) gets an RMS equal to that value
  • A session where most traces are fine but a few have high uncertainty gets a higher RMS, because squaring amplifies the outliers
This means consistency captures the spread of issues, not just their average severity, as the quick comparison below shows.
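
A quick numeric comparison (values invented for illustration):

uniform = [0.3, 0.3, 0.3, 0.3]  # mean 0.3
skewed  = [0.1, 0.1, 0.1, 0.9]  # mean 0.3 as well

def rms(values):
    return (sum(v * v for v in values) / len(values)) ** 0.5

print(rms(uniform))  # 0.30 -- equals the mean when all values match
print(rms(skewed))   # ~0.46 -- the one bad trace dominates after squaring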

Why amplification

The confidence signal is the foundation — it’s the only required signal. But confidence alone doesn’t tell the whole story. The penalty terms from other signals amplify the base confidence risk:
  • If an agent is uncertain (low confidence) and using wrong tools, the combined uncertainty is worse than either alone
  • If an agent is uncertain but coherent with correct tools, the uncertainty is less concerning
This multiplicative interaction captures real-world failure modes where problems compound; the arithmetic below makes it concrete.
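
For instance, with the default confidence and tool weights (numbers invented for illustration):

w_conf, w_tool = 1.0, 0.8
confidence_risk = 0.4                              # an uncertain agent

# Uncertain but tools correct: no penalty, uncertainty stays at the base risk.
print((1.0 + 0.0) * w_conf * confidence_risk)      # 0.40

# Uncertain AND tool_correctness = 0.5: the penalty amplifies the same base risk.
penalty = w_tool * (1.0 - 0.5)                     # 0.40
print((1.0 + penalty) * w_conf * confidence_risk)  # 0.56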

Metadata returned

{
  "total_traces_in_session": 12,
  "traces_evaluated": 10,
  "raw_instability": 0.2800,
  "signal_weights": {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0
  },
  "per_trace_signals": {
    "trace-id-1": {
      "confidence_risk": 0.12,
      "loop_risk": 0.05,
      "tool_risk": 0.08,
      "coherence_risk": 0.03,
      "situational_penalty": 0.134,
      "weighted_uncertainty": 0.136
    }
  },
  "aggregation": {
    "method": "weighted_rms",
    "rms_value": 0.2800
  }
}

Interpreting the score

Score range   Meaning
0.9–1.0       Highly consistent — smooth operation across all traces
0.7–0.89      Generally consistent — minor fluctuations
0.5–0.69      Moderate instability — some traces deviate noticeably
0.3–0.49      High instability — frequent or severe deviations
0.0–0.29      Unstable — multiple signals compounding across traces

Reliability vs. Consistency

These two metrics complement each other:
                   Agent Reliability                  Agent Consistency
Focus              Worst-case failures                Overall stability
Question           "Did the agent ever fail badly?"   "Did the agent perform smoothly?"
Sensitive to       Tail risk — a single bad trace     Variance — spread of uncertainty
Signal handling    Uses any available subset          Requires confidence signal
Aggregation        Max-compose + top-k tail risk      Weighted RMS
Use case           Safety-critical applications       Quality-sensitive applications
A session can have:
  • High reliability + High consistency — agent is both safe and smooth
  • High reliability + Low consistency — no catastrophic failures, but uneven performance
  • Low reliability + High consistency — consistently mediocre (not great, but predictable)
  • Low reliability + Low consistency — unreliable and unstable

Edge cases

Both metrics handle edge cases gracefully:
Scenario                              Score                Reason
No traces or no precomputed signals   1.0                  "No traces or signals to evaluate."
No evaluable traces after filtering   1.0                  "No evaluable traces."
Single trace in session               Evaluated normally   Top-k degenerates to k = 1

Next steps

  • Run via API: create session eval runs programmatically.
  • Scheduling: set up automated recurring evaluations.