Before you begin, make sure you have:
  • A PandaProbe account. Sign up at app.pandaprobe.com.
  • At least one trace captured in your project. If you haven’t set up tracing yet, follow the Observability Quickstart first.
  • For agent (session) evaluation: traces grouped under the same session_id.
PandaProbe Cloud manages the evaluation LLM infrastructure for you, so you do not need to bring your own LLM API key to run evaluations.
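
The session prerequisite above means that traces must share a session_id before they can be evaluated together. As a rough illustration of that grouping (not the PandaProbe SDK), the sketch below assumes each trace is a plain dict with a hypothetical `trace_id` key and an optional `session_id` key; the function name `group_by_session` is also made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_session(traces: Iterable[dict]) -> Dict[str, List[str]]:
    """Group trace IDs by session_id.

    The dict layout ("trace_id", "session_id") is an assumption made for
    illustration; it is not a documented PandaProbe data format.
    """
    sessions: Dict[str, List[str]] = defaultdict(list)
    for trace in traces:
        session_id = trace.get("session_id")
        if session_id:
            # Only traces that carry a session_id can join a session.
            sessions[session_id].append(trace["trace_id"])
    return dict(sessions)


if __name__ == "__main__":
    demo = [
        {"trace_id": "t1", "session_id": "s1"},
        {"trace_id": "t2", "session_id": "s1"},
        {"trace_id": "t3"},  # no session_id: trace-level evaluation only
    ]
    print(group_by_session(demo))  # {'s1': ['t1', 't2']}
```

In this sketch, a trace without a session_id is still eligible for trace-level evaluation; it simply does not appear in any session.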

Run your first evaluation

The fastest way to evaluate is directly from the dashboard. You pick a trace (or session), choose a metric, and PandaProbe runs the evaluation in the background.
1. Open the Traces tab

In the PandaProbe dashboard, open the Traces tab. You should see the traces that were captured by the SDK.
2. Select traces to evaluate

Pick one or more traces, then click Evaluate. You can also open a single trace and click Evaluate from the detail view.
3. Choose a metric

Start with task_completion, a two-stage LLM-as-judge metric that scores whether the agent accomplished the user’s objective.
You can run multiple metrics in the same eval run. Each one produces an independent score attached to the trace.
4. Submit the run

Click Submit. PandaProbe creates an eval run with status PENDING and dispatches the work to a background worker. The API responds with 202 Accepted.
5. Review the score

Open the trace once the run completes. You should see a score with a numeric value, a pass/fail status, a human-readable reason, and structured metadata explaining how the score was produced.
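
The steps above describe an asynchronous flow: a run starts as PENDING, a background worker processes it, and a score appears once it completes. The sketch below models that flow in plain Python. It is not the PandaProbe SDK or API: the `Score` and `EvalRun` classes, the `COMPLETED` status name, the `wait_for_run` helper, and the fake fetcher are all hypothetical. Only the PENDING status, the task_completion metric, and the four score parts (value, pass/fail status, reason, metadata) come from this page.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Score:
    """Hypothetical score shape; the field names are assumptions."""
    metric: str
    value: float   # numeric value
    passed: bool   # pass/fail status
    reason: str    # human-readable explanation
    metadata: dict = field(default_factory=dict)  # how the score was produced


@dataclass
class EvalRun:
    """Hypothetical eval run record. Only PENDING is named in the docs."""
    run_id: str
    status: str = "PENDING"
    score: Optional[Score] = None


def wait_for_run(fetch: Callable[[], EvalRun], interval_s: float = 1.0,
                 max_polls: int = 30) -> EvalRun:
    """Poll `fetch` until the run leaves PENDING or the poll budget is spent."""
    for _ in range(max_polls):
        run = fetch()
        if run.status != "PENDING":
            return run
        time.sleep(interval_s)
    raise TimeoutError("eval run still PENDING after the polling budget")


def make_fake_fetch(polls_until_done: int) -> Callable[[], EvalRun]:
    """Stand-in for a real status lookup: reports done on the Nth poll."""
    calls = {"n": 0}

    def fetch() -> EvalRun:
        calls["n"] += 1
        if calls["n"] < polls_until_done:
            return EvalRun(run_id="run-demo")
        return EvalRun(
            run_id="run-demo",
            status="COMPLETED",  # assumed terminal status name
            score=Score(
                metric="task_completion",
                value=0.9,  # made-up demo value, not a real result
                passed=True,
                reason="Example reason text.",
                metadata={"stages": 2},
            ),
        )

    return fetch


if __name__ == "__main__":
    run = wait_for_run(make_fake_fetch(3), interval_s=0.0)
    print(run.status, run.score.metric, run.score.value, run.score.passed)
    # COMPLETED task_completion 0.9 True
```

Capping the number of polls keeps a caller such as a CI job from waiting forever if a run never leaves PENDING.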

Try session evaluation

If you have traces grouped under a session_id, you can evaluate the entire agent lifecycle:
1. Open the Sessions tab

Open Sessions to view grouped agent lifecycles.
2. Select a session and click Evaluate

Choose one or more sessions and click Evaluate.
3. Pick a session metric

Start with agent_reliability, which surfaces worst-case failure risk across the session by aggregating trace-level signals (confidence, coherence, tool_correctness, loop_detection).
4. Submit and review

Submit the run. When it completes, the session detail page shows the aggregated score along with the trace-level signals that produced it.
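
To make "worst-case failure risk" more concrete, here is one way an aggregation over trace-level signals could work. This is an illustration only, not PandaProbe's actual agent_reliability formula, which this page does not specify. It assumes every signal is a score in [0, 1] where higher is better, takes the weakest signal in each trace, and then reports the weakest trace in the session. The function names and the result keys are hypothetical.

```python
from typing import Dict, List, Tuple

# Signal names as listed on this page.
SIGNALS = ("confidence", "coherence", "tool_correctness", "loop_detection")


def trace_floor(signals: Dict[str, float]) -> Tuple[str, float]:
    """Return the (name, value) of the weakest signal in one trace."""
    missing = [name for name in SIGNALS if name not in signals]
    if missing:
        raise ValueError(f"missing signals: {missing}")
    weakest = min(SIGNALS, key=lambda name: signals[name])
    return weakest, signals[weakest]


def session_worst_case(traces: List[Dict[str, float]]) -> dict:
    """Assumed aggregation: the session score is its weakest trace floor."""
    if not traces:
        raise ValueError("session has no traces")
    floors = [trace_floor(trace) for trace in traces]
    worst_index = min(range(len(floors)), key=lambda i: floors[i][1])
    name, value = floors[worst_index]
    return {"score": value, "trace_index": worst_index, "weakest_signal": name}


if __name__ == "__main__":
    # Made-up signal values for two traces in one session.
    session = [
        {"confidence": 0.9, "coherence": 0.8,
         "tool_correctness": 1.0, "loop_detection": 1.0},
        {"confidence": 0.7, "coherence": 0.9,
         "tool_correctness": 0.4, "loop_detection": 1.0},
    ]
    print(session_worst_case(session))
    # {'score': 0.4, 'trace_index': 1, 'weakest_signal': 'tool_correctness'}
```

A minimum, unlike an average, lets a single bad trace pull down the whole session score, which matches the idea of surfacing worst-case risk.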

What’s next?

Core Concepts

Learn how eval runs, metrics, scores, signals, and monitors fit together.

Evaluation Approaches

Understand when to use trace vs. agent (session) evaluation.

Run via API

Create eval runs programmatically from CI, notebooks, or internal tools.