

All evaluation features are exposed through the REST API under /evaluations. This page covers every endpoint for running trace-level and session-level evaluations.
All endpoints require authentication: pass your API key in the X-API-Key header and your project name in the X-Project-Name header.
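
The examples on this page can be exercised with any HTTP client. Below is a minimal Python sketch of an authenticated client, assuming the requests library and a hypothetical base URL (substitute your deployment's host):

import requests

# Hypothetical base URL: replace with your PandaProbe deployment's host.
BASE_URL = "https://pandaprobe.example.com"

session = requests.Session()
session.headers.update({
    "X-API-Key": "YOUR_API_KEY",      # authentication key
    "X-Project-Name": "my-project",   # project scope
})

# Sanity check: list the registered trace-level metrics.
resp = session.get(f"{BASE_URL}/evaluations/trace-metrics")
resp.raise_for_status()
print([m["name"] for m in resp.json()])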

Discover available metrics

Before creating eval runs, check which metrics are available.

List trace metrics

GET /evaluations/trace-metrics
Returns all registered trace-level metrics:
[
  {
    "name": "task_completion",
    "description": "Evaluates whether the agent accomplished the user's stated objective.",
    "category": "trace"
  },
  {
    "name": "tool_correctness",
    "description": "Evaluates whether the agent selected appropriate tools for the task.",
    "category": "trace"
  }
]

List session metrics

GET /evaluations/session-metrics
Returns all registered session-level metrics:
[
  {
    "name": "agent_reliability",
    "description": "Evaluates worst-case failure risk across a session.",
    "category": "session"
  },
  {
    "name": "agent_consistency",
    "description": "Evaluates overall stability across a session.",
    "category": "session"
  }
]

Check LLM provider availability

GET /evaluations/providers
Returns which LLM providers are configured and available for judge calls.
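
As a pre-flight step, you might confirm a provider is configured before dispatching runs. A sketch continuing the client above; the response shape is not documented here, so inspect it rather than relying on specific fields:

resp = session.get(f"{BASE_URL}/evaluations/providers")
resp.raise_for_status()
# Shape intentionally not assumed: print and inspect before branching on it.
print(resp.json())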

Trace evaluation runs

Create a filtered trace eval run

POST /evaluations/trace-runs
Resolves traces matching your filters, samples them, and dispatches background evaluation. Returns 202 Accepted immediately. Request body:
{
  "name": "Weekly production eval",
  "metrics": ["task_completion", "tool_correctness", "step_efficiency"],
  "filters": {
    "date_from": "2026-03-01T00:00:00Z",
    "date_to": "2026-03-29T00:00:00Z",
    "status": "COMPLETED",
    "tags": ["production"]
  },
  "sampling_rate": 0.5,
  "model": "openai/gpt-5.4"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Human-readable label for the run |
| metrics | string[] | Yes | Metric names to run (at least one) |
| filters.date_from | ISO 8601 | No | Include traces started on or after this time |
| filters.date_to | ISO 8601 | No | Include traces started before this time (exclusive) |
| filters.status | string | No | PENDING, RUNNING, COMPLETED, or ERROR |
| filters.session_id | string | No | Exact session ID |
| filters.user_id | string | No | Exact user ID |
| filters.tags | string[] | No | Match traces with ANY of these tags |
| filters.name | string | No | Substring match on trace name (case-insensitive) |
| sampling_rate | float | No | Fraction of matches to evaluate (default: 1.0) |
| model | string | No | LLM model override (default: system default) |
Response (202):
{
  "id": "a1b2c3d4-...",
  "name": "Weekly production eval",
  "status": "PENDING",
  "metric_names": ["task_completion", "tool_correctness", "step_efficiency"],
  "total_traces": 150,
  "evaluated_count": 0,
  "failed_count": 0,
  "created_at": "2026-03-29T10:00:00Z",
  "completed_at": null,
  "project_id": "...",
  "target_type": "TRACE",
  "filters": {"status": "COMPLETED", "tags": ["production"]},
  "sampling_rate": 0.5,
  "model": "openai/gpt-5.4",
  "monitor_id": null,
  "error_message": null
}
Rate limit: 50/min
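
A sketch of dispatching a filtered run with the client above and capturing the run ID for later polling (filter values are illustrative):

run_req = {
    "name": "Weekly production eval",
    "metrics": ["task_completion", "tool_correctness"],
    "filters": {"status": "COMPLETED", "tags": ["production"]},
    "sampling_rate": 0.5,
}
resp = session.post(f"{BASE_URL}/evaluations/trace-runs", json=run_req)
assert resp.status_code == 202  # accepted; evaluation happens in the background
run_id = resp.json()["id"]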

Create a batch trace eval run

POST /evaluations/trace-runs/batch
Evaluate specific traces by ID instead of using filters.
{
  "trace_ids": [
    "11111111-1111-1111-1111-111111111111",
    "22222222-2222-2222-2222-222222222222"
  ],
  "metrics": ["task_completion", "argument_correctness"],
  "name": "Manual review batch",
  "model": null
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| trace_ids | UUID[] | Yes | Specific traces to evaluate (at least one) |
| metrics | string[] | Yes | Metric names to run |
| name | string | No | Human-readable label |
| model | string | No | LLM model override |
Rate limit: 50/min

Poll eval run status

GET /evaluations/trace-runs/{run_id}
Check the progress of an eval run. Poll this endpoint until status is COMPLETED or FAILED.
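
A polling sketch continuing from run_id above; the five-second interval is an arbitrary choice:

import time

while True:
    run = session.get(f"{BASE_URL}/evaluations/trace-runs/{run_id}").json()
    if run["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)  # arbitrary interval; tune to your workload

print(run["status"], run["evaluated_count"], run["failed_count"])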

List eval runs

GET /evaluations/trace-runs?status=COMPLETED&limit=20&offset=0
| Parameter | Type | Description |
| --- | --- | --- |
| status | string | Filter by status: PENDING, RUNNING, COMPLETED, FAILED |
| limit | int | Page size (1–200, default 50) |
| offset | int | Items to skip (default 0) |

Get scores for a run

GET /evaluations/trace-runs/{run_id}/scores
Returns all trace scores produced by a specific eval run.
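
Once the run reports COMPLETED, its scores can be fetched in one call (the list response shape is an assumption):

resp = session.get(f"{BASE_URL}/evaluations/trace-runs/{run_id}/scores")
resp.raise_for_status()
scores = resp.json()  # assumed: a JSON array of score objects
print(f"{len(scores)} scores")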

Retry failed metrics

POST /evaluations/trace-runs/{run_id}/retry
Creates a new eval run targeting only the trace+metric pairs that failed in the original run. Returns 422 if the original run has no failures. Rate limit: 50/min
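
A retry sketch that treats 422 as "nothing to retry"; the id field on the new run is assumed to match the run response shape above:

resp = session.post(f"{BASE_URL}/evaluations/trace-runs/{run_id}/retry")
if resp.status_code == 422:
    print("No failed trace+metric pairs to retry.")
else:
    resp.raise_for_status()
    retry_run_id = resp.json()["id"]  # new run covering only the failures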

Delete an eval run

DELETE /evaluations/trace-runs/{run_id}?delete_scores=false
By default, only the run record is deleted — scores are preserved with eval_run_id set to null. Pass ?delete_scores=true to also delete all scores from this run.

Trace scores

Create a manual score

POST /evaluations/trace-scores
Manually attach a score to a trace (human annotation or programmatic submission).
{
  "trace_id": "11111111-1111-1111-1111-111111111111",
  "name": "quality",
  "value": "0.9",
  "data_type": "NUMERIC",
  "source": "ANNOTATION",
  "reason": "High quality response with accurate information"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| trace_id | UUID | Yes | Trace to score |
| name | string | Yes | Score name (e.g., metric name or custom label) |
| value | string | Yes | Score value: "0.85" (NUMERIC), "true" (BOOLEAN), "PASS" (CATEGORICAL) |
| data_type | string | No | NUMERIC (default), BOOLEAN, or CATEGORICAL |
| source | string | No | ANNOTATION (default) or PROGRAMMATIC |
| reason | string | No | Explanation or annotation note |
| metadata | object | No | Custom metadata |

NUMERIC scores must be in the range [0.0, 1.0]. BOOLEAN scores must be "true" or "false".
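
A sketch of submitting a manual score with the client above. Note that value is string-encoded even for NUMERIC scores:

score_req = {
    "trace_id": "11111111-1111-1111-1111-111111111111",
    "name": "quality",
    "value": "0.9",            # string-encoded; NUMERIC must fall in [0.0, 1.0]
    "data_type": "NUMERIC",
    "reason": "Spot-checked by a reviewer",
}
resp = session.post(f"{BASE_URL}/evaluations/trace-scores", json=score_req)
assert resp.status_code == 201  # score created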

List trace scores

GET /evaluations/trace-scores
Comprehensive filtering:
| Parameter | Description |
| --- | --- |
| trace_id | Filter by trace UUID |
| name | Filter by metric name (exact match) |
| source | AUTOMATED, ANNOTATION, or PROGRAMMATIC |
| status | SUCCESS, FAILED, or PENDING |
| data_type | NUMERIC, BOOLEAN, or CATEGORICAL |
| eval_run_id | Filter by eval run UUID |
| environment | Filter by trace environment |
| date_from / date_to | ISO 8601 datetime range |
| limit / offset | Pagination (default 50, max 200) |

Get latest scores for a trace

GET /evaluations/trace-scores/{trace_id}
Returns one score per metric name, deduplicated by most recent created_at. Use this to display a score overview panel for a specific trace.

Update a score

PATCH /evaluations/trace-scores/{score_id}
{
  "value": "0.95",
  "reason": "Revised after manual review"
}
Only value, reason, and metadata can be changed. status is automatically set to SUCCESS and source to ANNOTATION.

Delete a score

DELETE /evaluations/trace-scores/{score_id}

Session evaluation runs

Session eval runs follow the same pattern as trace eval runs but target sessions instead of traces.

Create a filtered session eval run

POST /evaluations/session-runs
{
  "name": "Agent reliability check",
  "metrics": ["agent_reliability", "agent_consistency"],
  "filters": {
    "date_from": "2026-03-01T00:00:00Z",
    "date_to": "2026-03-29T00:00:00Z",
    "min_trace_count": 3
  },
  "sampling_rate": 1.0,
  "model": "openai/gpt-5.4",
  "signal_weights": {
    "confidence": 1.0,
    "loop_detection": 1.5,
    "tool_correctness": 0.8,
    "coherence": 1.0
  }
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Human-readable label |
| metrics | string[] | Yes | Session metric names (at least one) |
| filters.date_from | ISO 8601 | No | Include sessions from this time |
| filters.date_to | ISO 8601 | No | Include sessions before this time |
| filters.user_id | string | No | Exact user ID |
| filters.has_error | boolean | No | Only sessions with/without errors |
| filters.tags | string[] | No | Match traces with ANY of these tags |
| filters.min_trace_count | int | No | Minimum traces in session (≥1) |
| sampling_rate | float | No | Fraction of sessions to evaluate (default: 1.0) |
| model | string | No | LLM model override for trace-level signal computation |
| signal_weights | object | No | Override signal weights for aggregation |
Rate limit: 50/min
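
A sketch of a session run that upweights loop detection relative to the other signals (continuing the client above; signal names are taken from the example request):

session_run = {
    "metrics": ["agent_reliability"],
    "filters": {"min_trace_count": 3},
    "signal_weights": {
        "confidence": 1.0,
        "loop_detection": 2.0,    # weight looping failures more heavily
        "tool_correctness": 0.8,
    },
}
resp = session.post(f"{BASE_URL}/evaluations/session-runs", json=session_run)
assert resp.status_code == 202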

Create a batch session eval run

POST /evaluations/session-runs/batch
{
  "session_ids": ["session-abc-123", "session-def-456"],
  "metrics": ["agent_reliability"],
  "signal_weights": {"confidence": 1.0, "loop_detection": 2.0}
}
Rate limit: 50/min

Other session run endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /session-runs | GET | List session eval runs (supports status, limit, offset) |
| /session-runs/{run_id} | GET | Get session eval run detail |
| /session-runs/{run_id} | DELETE | Delete a session eval run (?delete_scores=true optional) |
| /session-runs/{run_id}/retry | POST | Retry failed session+metric pairs |
| /session-runs/{run_id}/scores | GET | List session scores from a run |

Session scores

| Endpoint | Method | Description |
| --- | --- | --- |
| /session-scores | GET | List session scores (supports filtering by session_id, name, source, status, eval_run_id, date range) |
| /session-scores/{session_id} | GET | Get all scores for a specific session |
| /session-scores/{score_id} | DELETE | Delete a single session score |

Analytics

PandaProbe provides analytics endpoints for both trace and session scores.

Trace score analytics

Summary — aggregated stats per metric:
GET /evaluations/analytics/trace-scores/summary?date_from=2026-03-01T00:00:00Z
[
  {
    "metric_name": "task_completion",
    "avg_score": 0.82,
    "min_score": 0.15,
    "max_score": 1.0,
    "median_score": 0.87,
    "success_count": 145,
    "failed_count": 5,
    "latest_score_at": "2026-03-29T09:30:00Z"
  }
]
Trend — time series of average scores:
GET /evaluations/analytics/trace-scores/trend?metric_name=task_completion&granularity=day
| Parameter | Options |
| --- | --- |
| metric_name | Required; the metric to track |
| granularity | hour, day, week (default: day) |
| date_from / date_to | Optional date range |
Distribution — histogram of score values:
GET /evaluations/analytics/trace-scores/distribution?metric_name=task_completion&buckets=10
| Parameter | Description |
| --- | --- |
| metric_name | Required; the metric to analyze |
| buckets | Number of histogram buckets (1–100, default 10) |
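
A sketch pulling a daily trend for one metric; the query parameters are documented above, but the exact shape of each time-series point is an assumption:

resp = session.get(
    f"{BASE_URL}/evaluations/analytics/trace-scores/trend",
    params={"metric_name": "task_completion", "granularity": "day"},
)
resp.raise_for_status()
for point in resp.json():  # assumed: one aggregate per day
    print(point)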

Session score analytics

Session score analytics mirror the trace analytics:
| Endpoint | Description |
| --- | --- |
| /analytics/session-scores/summary | Aggregated stats per session metric |
| /analytics/session-scores/trend | Time series of session scores |
| /analytics/session-scores/distribution | Histogram of session score values |
| /analytics/session-scores/history/{session_id} | Score evolution for a session across re-evaluations |
| /analytics/session-scores/comparison | Leaderboard: sessions ranked by a metric |
Session score history — track how a session’s score evolves over re-evaluations:
GET /evaluations/analytics/session-scores/history/{session_id}?metric_name=agent_reliability&limit=50
Session comparison — rank sessions by a metric (useful for finding worst-performing sessions):
GET /evaluations/analytics/session-scores/comparison?metric_name=agent_reliability&sort_order=asc&limit=10
Pass sort_order=asc to surface the worst sessions first.
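
For example, surfacing the ten worst sessions by reliability (the ranked-list response shape is an assumption):

resp = session.get(
    f"{BASE_URL}/evaluations/analytics/session-scores/comparison",
    params={"metric_name": "agent_reliability", "sort_order": "asc", "limit": 10},
)
resp.raise_for_status()
print(resp.json())  # worst sessions first, given sort_order=asc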

Get an eval run template

To help build eval run requests, PandaProbe can generate a pre-filled template for a metric:
GET /evaluations/trace-runs/template?metric=task_completion
Returns the metric’s full info (including prompt previews), default filters, sampling rate, and the default model. Use this to populate a form in your own tooling.
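
A sketch that fetches the template and inspects it before building a run request; fields beyond those listed above are not assumed:

resp = session.get(
    f"{BASE_URL}/evaluations/trace-runs/template",
    params={"metric": "task_completion"},
)
resp.raise_for_status()
template = resp.json()  # default filters, sampling rate, model, prompt previews
print(template)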

Error handling

| HTTP Code | Meaning |
| --- | --- |
| 201 | Score created |
| 202 | Eval run created and dispatched (async) |
| 204 | Resource deleted |
| 400 | Bad request (invalid filters, unknown metric, etc.) |
| 404 | Resource not found |
| 422 | Validation error (e.g., NUMERIC score out of range, retry on a run with no failures) |
| 429 | Rate limit exceeded |

Next steps

Scheduling Evaluations

Set up automated recurring evaluations with monitors.

Trace Metrics Reference

Detailed documentation for each trace metric.