PandaProbe includes two session-level metrics that aggregate trace-level signals into scores capturing agent behavior across an entire session. Both are pure mathematical functions — they receive precomputed per-trace signals and make no LLM or embedding calls.

Agent Reliability

Registry name: agent_reliability · Default threshold: 0.5 · Method: Max-compose + top-k tail risk
Measures worst-case failure risk across a session. A session with one catastrophic trace scores poorly even if all other traces are fine. Use this metric to catch agents that are generally good but occasionally fail badly.

Algorithm

For each trace in the session, the metric computes a per-trace risk from the precomputed signals:
1. Convert signals to risks

Each signal score is inverted to become a risk value:
confidence_risk = 1.0 - confidence_score
loop_risk       = 1.0 - loop_detection_score
tool_risk       = 1.0 - tool_correctness_score
coherence_risk  = 1.0 - coherence_score
2. Weight and max-compose per trace

Each risk is multiplied by its signal weight, and the maximum weighted risk becomes the trace’s risk:
per_trace_risk = max(w_conf × confidence_risk,
                     w_loop × loop_risk,
                     w_tool × tool_risk,
                     w_coh  × coherence_risk)
Only signals that are present for a trace are included — missing signals are skipped, not treated as zero.
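
A minimal Python sketch of steps 1 and 2, assuming signals arrive as a per-trace dict of scores keyed by signal name (the dict shape, weights, and helper name are illustrative, not the library's actual API):

# Hypothetical sketch: convert present signals to weighted risks, then max-compose.
SIGNAL_WEIGHTS = {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0,
}

def per_trace_risk(signals):
    weighted_risks = [
        weight * (1.0 - signals[name])  # invert score to risk, then weight
        for name, weight in SIGNAL_WEIGHTS.items()
        if name in signals              # missing signals are skipped, not zeroed
    ]
    return max(weighted_risks) if weighted_risks else None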
3. Top-k tail risk aggregation

Per-trace risks are sorted in descending order and the top 15% (at least one trace) are selected; a session with 10 evaluated traces, for example, gives k = max(1, ceil(10 × 0.15)) = 2:
k = max(1, ceil(num_traces × 0.15))
mean_top_k = mean(sorted_risks[:k])
max_risk   = sorted_risks[0]
4. Ensemble and final score

The raw session risk blends the top-k mean with the single worst trace:
raw_risk = 0.9 × mean_top_k + 0.1 × max_risk
score    = clamp(1.0 - raw_risk, 0, 1)
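
Putting steps 3 and 4 together, a sketch of the session-level aggregation (it reuses the hypothetical per_trace_risk output above; the 0.5 flag threshold comes from the Flagged traces note below):

import math

def reliability_score(per_trace_risks):
    # per_trace_risks: {trace_id: risk} from the max-compose step
    if not per_trace_risks:
        return 1.0, []                               # edge case: nothing to evaluate
    risks = sorted(per_trace_risks.values(), reverse=True)
    k = max(1, math.ceil(len(risks) * 0.15))         # top-k tail
    mean_top_k = sum(risks[:k]) / k
    raw_risk = 0.9 * mean_top_k + 0.1 * risks[0]     # blend in the single worst trace
    score = min(1.0, max(0.0, 1.0 - raw_risk))
    flagged = [tid for tid, r in per_trace_risks.items() if r > 0.5]
    return score, flagged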

Why max-compose + top-k

  • Max-compose ensures each trace’s risk is driven by its worst signal. An agent that has great tool selection but terrible coherence on one trace still gets flagged.
  • Top-k focuses on the tail of the distribution. A session with 100 traces where 3 have high risk will be scored based on those 3, not diluted by the 97 good ones.
  • The 10% max-risk blend gives extra weight to the single worst trace, so the most severe failure is never fully diluted within the top-k mean.

Flagged traces

Traces with per_trace_risk > 0.5 are flagged in the metadata. This lets you quickly identify which specific traces are dragging down the session’s reliability score.

Metadata returned

{
  "total_traces_in_session": 12,
  "traces_evaluated": 10,
  "raw_risk": 0.3200,
  "signal_weights": {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0
  },
  "per_trace_signals": {
    "trace-id-1": {
      "confidence_risk": 0.12,
      "loop_risk": 0.05,
      "tool_risk": 0.08,
      "coherence_risk": 0.03,
      "step_risk": 0.12
    },
    "trace-id-7": {
      "confidence_risk": 0.65,
      "loop_risk": 0.72,
      "tool_risk": 0.10,
      "coherence_risk": 0.15,
      "step_risk": 0.72
    }
  },
  "flagged_traces": ["trace-id-7"],
  "aggregation": {
    "method": "max_compose_top_k",
    "top_k_percentile": 0.15,
    "ensemble_weight": 0.1,
    "mean_top_k_risk": 0.72,
    "max_risk": 0.72
  }
}
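
As an illustration, the flagged traces and their dominant risks can be pulled out of this metadata (run_result is a hypothetical variable holding the eval output; the dict shape follows the example above):

metadata = run_result["metadata"]  # hypothetical accessor
for trace_id in metadata["flagged_traces"]:
    signals = metadata["per_trace_signals"][trace_id]
    print(trace_id, "dominant risk:", signals["step_risk"])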

Interpreting the score

Score range   Meaning
0.9–1.0       Highly reliable — no elevated risk in any trace
0.7–0.89      Generally reliable — minor risk in a few traces
0.5–0.69      Moderate risk — some traces show concerning behavior
0.3–0.49      Elevated risk — multiple traces with significant issues
0.0–0.29      High risk — session contains catastrophic failures

Agent Consistency

Registry name: agent_consistency · Default threshold: 0.5 · Method: Weighted RMS aggregation
Measures overall stability across a session. Unlike reliability (which focuses on worst moments), consistency penalizes any trace that deviates from smooth operation. A session with many moderate issues scores poorly even if no single trace is catastrophic.

Algorithm

1. Filter traces

Only traces with a confidence signal are included. Traces missing confidence are skipped entirely (unlike reliability, which includes traces with any signal subset).
2. Compute weighted uncertainty per trace

For each trace:
confidence_risk = 1.0 - confidence_score

penalty = w_loop × (1 - loop_detection)      # if present
        + w_tool × (1 - tool_correctness)     # if present
        + w_coh  × (1 - coherence)            # if present

amplification = 1.0 + penalty
weighted_uncertainty = amplification × (w_conf × confidence_risk)
The penalty terms amplify the confidence risk. A trace with low confidence and poor tool correctness gets a higher uncertainty than one with low confidence alone.
3. RMS aggregation

The root mean square of all weighted uncertainties becomes the raw instability:
rms = sqrt(sum(wu² for wu in weighted_uncertainties) / n)
score = clamp(1.0 - rms, 0, 1)
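
A minimal sketch of the full consistency computation in Python, under the same assumed signal-dict shape as the reliability sketches above:

import math

WEIGHTS = {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0,
}

def consistency_score(traces):
    # traces: list of per-trace signal dicts
    uncertainties = []
    for signals in traces:
        if "confidence" not in signals:
            continue                                 # step 1: skip traces without confidence
        confidence_risk = 1.0 - signals["confidence"]
        penalty = sum(
            WEIGHTS[name] * (1.0 - signals[name])    # step 2: penalty from optional signals
            for name in ("loop_detection", "tool_correctness", "coherence")
            if name in signals
        )
        uncertainties.append((1.0 + penalty) * WEIGHTS["confidence"] * confidence_risk)
    if not uncertainties:
        return 1.0                                   # edge case: no evaluable traces
    rms = math.sqrt(sum(u * u for u in uncertainties) / len(uncertainties))
    return min(1.0, max(0.0, 1.0 - rms))             # step 3: RMS -> score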

Why RMS

RMS (root mean square) is sensitive to variation. Unlike a simple average:
  • A session where every trace has the same moderate uncertainty (e.g., all at 0.3) gets an RMS equal to that value
  • A session where most traces are fine but a few have high uncertainty gets a higher RMS, because squaring amplifies the outliers
This means consistency captures the spread of issues, not just their average severity, as the quick comparison below shows.
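
A quick numeric comparison (values invented for illustration):

uniform = [0.3, 0.3, 0.3, 0.3]  # mean 0.3
skewed  = [0.1, 0.1, 0.1, 0.9]  # mean 0.3 as well

def rms(values):
    return (sum(v * v for v in values) / len(values)) ** 0.5

print(rms(uniform))  # 0.30 -- equals the mean when all values match
print(rms(skewed))   # ~0.46 -- the one bad trace dominates after squaring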

Why amplification

The confidence signal is the foundation — it’s the only required signal. But confidence alone doesn’t tell the whole story. The penalty terms from other signals amplify the base confidence risk:
  • If an agent is uncertain (low confidence) and using wrong tools, the combined uncertainty is worse than either alone
  • If an agent is uncertain but coherent with correct tools, the uncertainty is less concerning
This multiplicative interaction captures real-world failure modes where problems compound; the arithmetic below makes it concrete.
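
For instance, with the default confidence and tool weights (numbers invented for illustration):

w_conf, w_tool = 1.0, 0.8
confidence_risk = 0.4                              # an uncertain agent

# Uncertain but tools correct: no penalty, uncertainty stays at the base risk.
print((1.0 + 0.0) * w_conf * confidence_risk)      # 0.40

# Uncertain AND tool_correctness = 0.5: the penalty amplifies the same base risk.
penalty = w_tool * (1.0 - 0.5)                     # 0.40
print((1.0 + penalty) * w_conf * confidence_risk)  # 0.56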

Metadata returned

{
  "total_traces_in_session": 12,
  "traces_evaluated": 10,
  "raw_instability": 0.2800,
  "signal_weights": {
    "confidence": 1.0,
    "loop_detection": 1.0,
    "tool_correctness": 0.8,
    "coherence": 1.0
  },
  "per_trace_signals": {
    "trace-id-1": {
      "confidence_risk": 0.12,
      "loop_risk": 0.05,
      "tool_risk": 0.08,
      "coherence_risk": 0.03,
      "situational_penalty": 0.134,
      "weighted_uncertainty": 0.136
    }
  },
  "aggregation": {
    "method": "weighted_rms",
    "rms_value": 0.2800
  }
}

Interpreting the score

Score range   Meaning
0.9–1.0       Highly consistent — smooth operation across all traces
0.7–0.89      Generally consistent — minor fluctuations
0.5–0.69      Moderate instability — some traces deviate noticeably
0.3–0.49      High instability — frequent or severe deviations
0.0–0.29      Unstable — multiple signals compounding across traces

Reliability vs. Consistency

These two metrics complement each other:
                   Agent Reliability                  Agent Consistency
Focus              Worst-case failures                Overall stability
Question           "Did the agent ever fail badly?"   "Did the agent perform smoothly?"
Sensitive to       Tail risk — a single bad trace     Variance — spread of uncertainty
Signal handling    Uses any available subset          Requires confidence signal
Aggregation        Max-compose + top-k tail risk      Weighted RMS
Use case           Safety-critical applications       Quality-sensitive applications
A session can have:
  • High reliability + High consistency — agent is both safe and smooth
  • High reliability + Low consistency — no catastrophic failures, but uneven performance
  • Low reliability + High consistency — consistently mediocre (not great, but predictable)
  • Low reliability + Low consistency — unreliable and unstable

Edge cases

Both metrics handle edge cases gracefully:
Scenario                              Score                Reason
No traces or no precomputed signals   1.0                  "No traces or signals to evaluate."
No evaluable traces after filtering   1.0                  "No evaluable traces."
Single trace in session               Evaluated normally   Top-k degenerates to k = 1

Next steps

  • Run via API: create session eval runs programmatically.
  • Scheduling: set up automated recurring evaluations.