PandaProbe includes two session-level metrics that aggregate trace-level signals into scores capturing agent behavior across an entire session. Both are pure mathematical functions — they receive precomputed per-trace signals and perform zero LLM or embedding calls.
Agent Reliability
Registry name: agent_reliability · Default threshold: 0.5 · Method: Max-compose + top-k tail risk
Algorithm
For each trace in the session, the metric computes a per-trace risk from the precomputed signals.
Weight and max-compose per trace
Each risk is multiplied by its signal weight, and the maximum weighted risk becomes the trace’s risk. Only signals that are present for a trace are included — missing signals are skipped, not treated as zero.
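As a concrete illustration, here is a minimal sketch of the max-compose step, assuming each trace carries a dict of precomputed signals mapping a signal name to a risk in [0, 1]. The signal names and weights are illustrative placeholders, not PandaProbe's actual registry values.

```python
# Minimal sketch of max-compose per trace. The weights and signal names below
# are assumptions for illustration, not PandaProbe's real registry values.
ASSUMED_WEIGHTS = {
    "tool_selection": 1.0,
    "coherence": 0.8,
    "confidence": 0.6,
}

def per_trace_risk(signals):
    """Return the maximum weighted risk over the signals present on this trace."""
    weighted = [
        ASSUMED_WEIGHTS[name] * risk
        for name, risk in signals.items()
        if name in ASSUMED_WEIGHTS  # missing signals are skipped, not treated as zero
    ]
    return max(weighted) if weighted else None  # None: nothing to evaluate for this trace
```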
Top-k tail risk aggregation
Per-trace risks are sorted in descending order, and the top 15% (at least one trace) are selected for the tail-risk aggregation.
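A sketch of the aggregation under stated assumptions: the top 15% of per-trace risks (at least one) are averaged, the result is blended with the single worst risk using the 10% max-risk weight described below, and the session score is taken as one minus the blended risk. The exact 90/10 arithmetic and the final score conversion are inferences from this page, not a verified implementation.

```python
import math

def session_reliability(per_trace_risks):
    """Sketch: mean of the top-15% risks, blended with the single worst risk."""
    if not per_trace_risks:
        return 1.0                                # no traces or signals to evaluate
    risks = sorted(per_trace_risks, reverse=True)
    k = max(1, math.ceil(0.15 * len(risks)))      # top 15%, at least 1 trace
    tail_mean = sum(risks[:k]) / k
    blended = 0.9 * tail_mean + 0.1 * risks[0]    # assumed 10% max-risk blend
    return 1.0 - blended                          # assumed score = 1 - session risk

# Example: 100 traces where 3 carry almost all of the risk.
print(round(session_reliability([0.05] * 97 + [0.9, 0.8, 0.7]), 3))  # 0.73
```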
Why max-compose + top-k
- Max-compose ensures each trace’s risk is driven by its worst signal. An agent that has great tool selection but terrible coherence on one trace still gets flagged.
- Top-k focuses on the tail of the distribution. A session with 100 traces where 3 have high risk will be scored based on those 3, not diluted by the 97 good ones.
- The 10% max-risk blend gives extra weight to the single worst trace, preventing a handful of high-risk traces from being averaged away.
Flagged traces
Traces with per_trace_risk > 0.5 are flagged in the metadata. This lets you quickly identify which specific traces are dragging down the session’s reliability score.
Metadata returned
Interpreting the score
| Score range | Meaning |
|---|---|
| 0.9–1.0 | Highly reliable — no elevated risk in any trace |
| 0.7–0.89 | Generally reliable — minor risk in a few traces |
| 0.5–0.69 | Moderate risk — some traces show concerning behavior |
| 0.3–0.49 | Elevated risk — multiple traces with significant issues |
| 0.0–0.29 | High risk — session contains catastrophic failures |
Agent Consistency
Registry name: agent_consistency · Default threshold: 0.5 · Method: Weighted RMS aggregation
Algorithm
Filter traces
Only traces with a confidence signal are included. Traces missing confidence are skipped entirely (unlike reliability, which includes traces with any signal subset).
Compute weighted uncertainty per trace
For each trace, the base confidence risk is combined with penalty terms from the other signals. The penalty terms amplify the confidence risk: a trace with low confidence and poor tool correctness gets a higher uncertainty than one with low confidence alone.
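A minimal sketch of this step together with the RMS aggregation, assuming signal values lie in [0, 1] with higher confidence meaning more confident. The penalty signal names and the multiplicative amplification are stand-ins; the page only states that penalties from other signals increase the base confidence risk.

```python
import math

# Assumed penalty signals; the real signal set and amplification formula may differ.
PENALTY_SIGNALS = ("tool_correctness", "coherence")

def trace_uncertainty(signals):
    """Base uncertainty from confidence, amplified by penalties from other signals."""
    if "confidence" not in signals:
        return None                              # trace is skipped entirely
    base = 1.0 - signals["confidence"]           # low confidence -> high uncertainty
    penalty = sum(1.0 - signals[s] for s in PENALTY_SIGNALS if s in signals)
    return min(1.0, base * (1.0 + penalty))      # assumed amplification, clamped to 1

def session_consistency(traces):
    """One minus the RMS of per-trace uncertainties (assumed score convention)."""
    uncertainties = [u for u in map(trace_uncertainty, traces) if u is not None]
    if not uncertainties:
        return 1.0                               # no evaluable traces
    rms = math.sqrt(sum(u * u for u in uncertainties) / len(uncertainties))
    return 1.0 - rms
```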
Why RMS
RMS (root mean square) is sensitive to variation. Unlike a simple average (illustrated in the sketch after this list):
- A session where all traces have moderate uncertainty (e.g., all at 0.3) gets the same RMS as that average
- A session where most traces are fine but a few have high uncertainty gets a higher RMS due to the squaring
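A small numeric illustration, assuming uncertainties in [0, 1]: the spiky session below actually has a lower mean uncertainty than the uniform one, yet a noticeably higher RMS, because squaring emphasizes the outliers.

```python
import math

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

uniform = [0.3] * 10                 # every trace moderately uncertain
spiky = [0.05] * 8 + [0.9, 1.0]      # most traces fine, two very uncertain

print(round(sum(uniform) / len(uniform), 3), round(rms(uniform), 3))  # 0.3 0.3
print(round(sum(spiky) / len(spiky), 3), round(rms(spiky), 3))        # 0.23 0.428
```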
Why amplification
The confidence signal is the foundation — it’s the only required signal. But confidence alone doesn’t tell the whole story. The penalty terms from other signals amplify the base confidence risk:
- If an agent is uncertain (low confidence) and using wrong tools, the combined uncertainty is worse than either alone
- If an agent is uncertain but coherent with correct tools, the uncertainty is less concerning
Metadata returned
Interpreting the score
| Score range | Meaning |
|---|---|
| 0.9–1.0 | Highly consistent — smooth operation across all traces |
| 0.7–0.89 | Generally consistent — minor fluctuations |
| 0.5–0.69 | Moderate instability — some traces deviate noticeably |
| 0.3–0.49 | High instability — frequent or severe deviations |
| 0.0–0.29 | Unstable — multiple signals compounding across traces |
Reliability vs. Consistency
These two metrics complement each other:
| | Agent Reliability | Agent Consistency |
|---|---|---|
| Focus | Worst-case failures | Overall stability |
| Question | “Did the agent ever fail badly?” | “Did the agent perform smoothly?” |
| Sensitive to | Tail risk — a single bad trace | Variance — spread of uncertainty |
| Signal handling | Uses any available subset | Requires confidence signal |
| Aggregation | Max-compose + top-k tail risk | Weighted RMS |
| Use case | Safety-critical applications | Quality-sensitive applications |
- High reliability + High consistency — agent is both safe and smooth
- High reliability + Low consistency — no catastrophic failures, but uneven performance
- Low reliability + High consistency — consistently mediocre (not great, but predictable)
- Low reliability + Low consistency — unreliable and unstable
Edge cases
Both metrics handle edge cases gracefully:
| Scenario | Score | Reason |
|---|---|---|
| No traces or no precomputed signals | 1.0 | “No traces or signals to evaluate.” |
| No evaluable traces after filtering | 1.0 | “No evaluable traces.” |
| Single trace in session | Evaluated normally | Top-k degenerates to k=1 |
Next steps
- Run via API: Create session eval runs programmatically.
- Scheduling: Set up automated recurring evaluations.

