All evaluation features are accessible through the REST API at /evaluations. This page covers every endpoint for running trace-level and session-level evaluations.
All endpoints require authentication: pass your API key in the X-API-Key header and your project name in the X-Project-Name header.
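For example, a minimal Python setup with the `requests` library might look like the sketch below; the host is a placeholder for your own deployment, and the key and project name are stand-ins:

```python
import requests

BASE = "https://your-pandaprobe-host"  # placeholder: your deployment's base URL

# Both headers are required on every /evaluations request.
session = requests.Session()
session.headers.update({
    "X-API-Key": "YOUR_API_KEY",
    "X-Project-Name": "your-project",
})
```

The sketches in the rest of this page reuse this `session` and `BASE`.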
Discover available metrics
Before creating eval runs, check which metrics are available.
List trace metrics
GET /evaluations/trace-metrics
Returns all registered trace-level metrics:
```json
[
  {
    "name": "task_completion",
    "description": "Evaluates whether the agent accomplished the user's stated objective.",
    "category": "trace"
  },
  {
    "name": "tool_correctness",
    "description": "Evaluates whether the agent selected appropriate tools for the task.",
    "category": "trace"
  }
]
```
List session metrics
GET /evaluations/session-metrics
Returns all registered session-level metrics:
```json
[
  {
    "name": "agent_reliability",
    "description": "Evaluates worst-case failure risk across a session.",
    "category": "session"
  },
  {
    "name": "agent_consistency",
    "description": "Evaluates overall stability across a session.",
    "category": "session"
  }
]
```
Check LLM provider availability
GET /evaluations/providers
Returns which LLM providers are configured and available for judge calls.
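As a pre-flight sketch (reusing `session` and `BASE` from the authentication example), you might confirm that the metrics you plan to run are registered and that a judge provider is configured:

```python
# Reuses `session` and BASE from the authentication example above.
trace_metrics = session.get(f"{BASE}/evaluations/trace-metrics").json()
providers = session.get(f"{BASE}/evaluations/providers").json()
print(providers)  # shape depends on which providers you have configured

registered = {m["name"] for m in trace_metrics}
wanted = {"task_completion", "tool_correctness"}
if not wanted <= registered:
    raise RuntimeError(f"Unregistered metrics: {wanted - registered}")
```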
Trace evaluation runs
Create a filtered trace eval run
POST /evaluations/trace-runs
Resolves traces matching your filters, samples them, and dispatches background evaluation. Returns 202 Accepted immediately.
Request body:
```json
{
  "name": "Weekly production eval",
  "metrics": ["task_completion", "tool_correctness", "step_efficiency"],
  "filters": {
    "date_from": "2026-03-01T00:00:00Z",
    "date_to": "2026-03-29T00:00:00Z",
    "status": "COMPLETED",
    "tags": ["production"]
  },
  "sampling_rate": 0.5,
  "model": "openai/gpt-5.4"
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Human-readable label for the run |
| metrics | string[] | Yes | Metric names to run (at least one) |
| filters.date_from | ISO 8601 | No | Include traces started on or after this time |
| filters.date_to | ISO 8601 | No | Include traces started before this time (exclusive) |
| filters.status | string | No | PENDING, RUNNING, COMPLETED, or ERROR |
| filters.session_id | string | No | Exact session ID |
| filters.user_id | string | No | Exact user ID |
| filters.tags | string[] | No | Match traces with ANY of these tags |
| filters.name | string | No | Substring match on trace name (case-insensitive) |
| sampling_rate | float | No | Fraction of matches to evaluate (default: 1.0) |
| model | string | No | LLM model override (default: system default) |
Response (202):
```json
{
  "id": "a1b2c3d4-...",
  "name": "Weekly production eval",
  "status": "PENDING",
  "metric_names": ["task_completion", "tool_correctness", "step_efficiency"],
  "total_traces": 150,
  "evaluated_count": 0,
  "failed_count": 0,
  "created_at": "2026-03-29T10:00:00Z",
  "completed_at": null,
  "project_id": "...",
  "target_type": "TRACE",
  "filters": {"status": "COMPLETED", "tags": ["production"]},
  "sampling_rate": 0.5,
  "model": "openai/gpt-5.4",
  "monitor_id": null,
  "error_message": null
}
```
Rate limit: 50/min
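Putting the request together, a minimal sketch (reusing `session` and `BASE` from the authentication example; the run is asynchronous, so keep the returned id for polling):

```python
# Reuses `session` and BASE from the authentication example above.
resp = session.post(f"{BASE}/evaluations/trace-runs", json={
    "name": "Weekly production eval",
    "metrics": ["task_completion", "tool_correctness"],
    "filters": {"status": "COMPLETED", "tags": ["production"]},
    "sampling_rate": 0.5,
})
resp.raise_for_status()      # expect 202 Accepted
run_id = resp.json()["id"]   # needed to poll progress and fetch scores
```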
Create a batch trace eval run
POST /evaluations/trace-runs/batch
Evaluate specific traces by ID instead of using filters.
```json
{
  "trace_ids": [
    "11111111-1111-1111-1111-111111111111",
    "22222222-2222-2222-2222-222222222222"
  ],
  "metrics": ["task_completion", "argument_correctness"],
  "name": "Manual review batch",
  "model": null
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| trace_ids | UUID[] | Yes | Specific traces to evaluate (at least one) |
| metrics | string[] | Yes | Metric names to run |
| name | string | No | Human-readable label |
| model | string | No | LLM model override |
Rate limit: 50/min
Poll eval run status
GET /evaluations/trace-runs/{run_id}
Check the progress of an eval run. Poll this endpoint until status is COMPLETED or FAILED.
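A simple polling loop might look like this sketch (reusing `session`, `BASE`, and the `run_id` from the 202 response; the 10-second interval is an arbitrary choice):

```python
import time

while True:
    run = session.get(f"{BASE}/evaluations/trace-runs/{run_id}").json()
    print(f'{run["status"]}: {run["evaluated_count"]}/{run["total_traces"]} traces')
    if run["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)  # arbitrary interval; keep well under the rate limits
```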
List eval runs
GET /evaluations/trace-runs?status=COMPLETED&limit=20&offset=0
| Parameter | Type | Description |
| --- | --- | --- |
| status | string | Filter by status: PENDING, RUNNING, COMPLETED, FAILED |
| limit | int | Page size (1–200, default 50) |
| offset | int | Items to skip (default 0) |
Get scores for a run
GET /evaluations/trace-runs/{run_id}/scores
Returns all trace scores produced by a specific eval run.
Retry failed metrics
POST /evaluations/trace-runs/{run_id}/retry
Creates a new eval run targeting only the trace+metric pairs that failed in the original run. Returns 422 if the original run has no failures.
Rate limit: 50/min
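Since retrying a clean run returns 422, a sketch would handle that case explicitly (assuming the success response carries the new run's id, like the other run-creation endpoints):

```python
# Reuses `session`, BASE, and `run_id` from the examples above.
resp = session.post(f"{BASE}/evaluations/trace-runs/{run_id}/retry")
if resp.status_code == 422:
    print("Nothing to retry: the original run has no failures.")
else:
    resp.raise_for_status()
    retry_run_id = resp.json()["id"]  # the retry is itself a new eval run
```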
Delete an eval run
DELETE /evaluations/trace-runs/{run_id}?delete_scores=false
By default, only the run record is deleted — scores are preserved with eval_run_id set to null. Pass ?delete_scores=true to also delete all scores from this run.
Trace scores
Create a manual score
POST /evaluations/trace-scores
Manually attach a score to a trace (human annotation or programmatic submission).
```json
{
  "trace_id": "11111111-1111-1111-1111-111111111111",
  "name": "quality",
  "value": "0.9",
  "data_type": "NUMERIC",
  "source": "ANNOTATION",
  "reason": "High quality response with accurate information"
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| trace_id | UUID | Yes | Trace to score |
| name | string | Yes | Score name (e.g., metric name or custom label) |
| value | string | Yes | Score value: "0.85" (NUMERIC), "true" (BOOLEAN), "PASS" (CATEGORICAL) |
| data_type | string | No | NUMERIC (default), BOOLEAN, or CATEGORICAL |
| source | string | No | ANNOTATION (default) or PROGRAMMATIC |
| reason | string | No | Explanation or annotation note |
| metadata | object | No | Custom metadata |
NUMERIC scores must be in the range [0.0, 1.0]. BOOLEAN scores must be "true" or "false".
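A sketch of submitting an annotation from Python, with the range check done client-side before the string conversion the API expects:

```python
# Reuses `session` and BASE from the authentication example above.
score = 0.9
assert 0.0 <= score <= 1.0, "NUMERIC scores must be in [0.0, 1.0]"

resp = session.post(f"{BASE}/evaluations/trace-scores", json={
    "trace_id": "11111111-1111-1111-1111-111111111111",
    "name": "quality",
    "value": str(score),  # values are sent as strings
    "data_type": "NUMERIC",
    "source": "ANNOTATION",
    "reason": "High quality response with accurate information",
})
resp.raise_for_status()  # expect 201 Created
```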
List trace scores
GET /evaluations/trace-scores
Comprehensive filtering:
| Parameter | Description |
| --- | --- |
| trace_id | Filter by trace UUID |
| name | Filter by metric name (exact match) |
| source | AUTOMATED, ANNOTATION, or PROGRAMMATIC |
| status | SUCCESS, FAILED, or PENDING |
| data_type | NUMERIC, BOOLEAN, or CATEGORICAL |
| eval_run_id | Filter by eval run UUID |
| environment | Filter by trace environment |
| date_from / date_to | ISO 8601 datetime range |
| limit / offset | Pagination (default 50, max 200) |
Get latest scores for a trace
GET /evaluations/trace-scores/{trace_id}
Returns one score per metric name, deduplicated by most recent created_at. Use this to display a score overview panel for a specific trace.
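For example, a small sketch of rendering such a panel (assuming the response is a list of score objects with name and value fields, as in the score schema above):

```python
# Reuses `session` and BASE from the authentication example above.
trace_id = "11111111-1111-1111-1111-111111111111"
latest = session.get(f"{BASE}/evaluations/trace-scores/{trace_id}").json()
for s in latest:
    print(f'{s["name"]}: {s["value"]}')
```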
Update a score
PATCH /evaluations/trace-scores/{score_id}
```json
{
  "value": "0.95",
  "reason": "Revised after manual review"
}
```
Only value, reason, and metadata can be changed. status is automatically set to SUCCESS and source to ANNOTATION.
Delete a score
DELETE /evaluations/trace-scores/{score_id}
Session evaluation runs
Session eval runs follow the same pattern as trace eval runs but target sessions instead of traces.
Create a filtered session eval run
POST /evaluations/session-runs
```json
{
  "name": "Agent reliability check",
  "metrics": ["agent_reliability", "agent_consistency"],
  "filters": {
    "date_from": "2026-03-01T00:00:00Z",
    "date_to": "2026-03-29T00:00:00Z",
    "min_trace_count": 3
  },
  "sampling_rate": 1.0,
  "model": "openai/gpt-5.4",
  "signal_weights": {
    "confidence": 1.0,
    "loop_detection": 1.5,
    "tool_correctness": 0.8,
    "coherence": 1.0
  }
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Human-readable label |
| metrics | string[] | Yes | Session metric names (at least one) |
| filters.date_from | ISO 8601 | No | Include sessions from this time |
| filters.date_to | ISO 8601 | No | Include sessions before this time |
| filters.user_id | string | No | Exact user ID |
| filters.has_error | boolean | No | Only sessions with/without errors |
| filters.tags | string[] | No | Match traces with ANY of these tags |
| filters.min_trace_count | int | No | Minimum traces in session (≥1) |
| sampling_rate | float | No | Fraction of sessions to evaluate (default: 1.0) |
| model | string | No | LLM model override for trace-level signal computation |
| signal_weights | object | No | Override signal weights for aggregation |
Rate limit: 50/min
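A sketch of the same request from Python; the weight values are illustrative, with weights above 1.0 making a signal count more heavily in the aggregate:

```python
# Reuses `session` and BASE from the authentication example above.
resp = session.post(f"{BASE}/evaluations/session-runs", json={
    "metrics": ["agent_reliability", "agent_consistency"],
    "filters": {"min_trace_count": 3},
    "signal_weights": {"confidence": 1.0, "loop_detection": 1.5},
})
resp.raise_for_status()  # expect 202 Accepted, same as trace runs
```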
Create a batch session eval run
POST /evaluations/session-runs/batch
```json
{
  "session_ids": ["session-abc-123", "session-def-456"],
  "metrics": ["agent_reliability"],
  "signal_weights": {"confidence": 1.0, "loop_detection": 2.0}
}
```
Rate limit: 50/min
Other session run endpoints
| Endpoint | Method | Description |
| --- | --- | --- |
| /session-runs | GET | List session eval runs (supports status, limit, offset) |
| /session-runs/{run_id} | GET | Get session eval run detail |
| /session-runs/{run_id} | DELETE | Delete a session eval run (?delete_scores=true optional) |
| /session-runs/{run_id}/retry | POST | Retry failed session+metric pairs |
| /session-runs/{run_id}/scores | GET | List session scores from a run |
Session scores
| Endpoint | Method | Description |
| --- | --- | --- |
| /session-scores | GET | List session scores (supports filtering by session_id, name, source, status, eval_run_id, date range) |
| /session-scores/{session_id} | GET | Get all scores for a specific session |
| /session-scores/{score_id} | DELETE | Delete a single session score |
Analytics
PandaProbe provides analytics endpoints for both trace and session scores.
Trace score analytics
Summary — aggregated stats per metric:
GET /evaluations/analytics/trace-scores/summary?date_from=2026-03-01T00:00:00Z
```json
[
  {
    "metric_name": "task_completion",
    "avg_score": 0.82,
    "min_score": 0.15,
    "max_score": 1.0,
    "median_score": 0.87,
    "success_count": 145,
    "failed_count": 5,
    "latest_score_at": "2026-03-29T09:30:00Z"
  }
]
```
Trend — time series of average scores:
GET /evaluations/analytics/trace-scores/trend?metric_name=task_completion&granularity=day
| Parameter | Options |
| --- | --- |
| metric_name | Required; metric to track |
| granularity | hour, day, week (default: day) |
| date_from / date_to | Optional date range |
Distribution — histogram of score values:
GET /evaluations/analytics/trace-scores/distribution?metric_name=task_completion&buckets=10
| Parameter | Description |
| --- | --- |
| metric_name | Required; metric to analyze |
| buckets | Number of histogram buckets (1–100, default 10) |
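As a sketch, both analytics calls from Python, letting `requests` handle the query-string encoding:

```python
# Reuses `session` and BASE from the authentication example above.
trend = session.get(
    f"{BASE}/evaluations/analytics/trace-scores/trend",
    params={"metric_name": "task_completion", "granularity": "day"},
).json()

dist = session.get(
    f"{BASE}/evaluations/analytics/trace-scores/distribution",
    params={"metric_name": "task_completion", "buckets": 10},
).json()
```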
Session score analytics
Session score analytics mirror the trace analytics:
| Endpoint | Description |
| --- | --- |
| /analytics/session-scores/summary | Aggregated stats per session metric |
| /analytics/session-scores/trend | Time series of session scores |
| /analytics/session-scores/distribution | Histogram of session score values |
| /analytics/session-scores/history/{session_id} | Score evolution for a session across re-evaluations |
| /analytics/session-scores/comparison | Leaderboard: sessions ranked by a metric |
Session score history — track how a session’s score evolves over re-evaluations:
GET /evaluations/analytics/session-scores/history/{session_id}?metric_name=agent_reliability&limit=50
Session comparison — rank sessions by a metric (useful for finding worst-performing sessions):
GET /evaluations/analytics/session-scores/comparison?metric_name=agent_reliability&sort_order=asc&limit=10
Pass sort_order=asc to surface the worst sessions first.
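A sketch of pulling that worst-sessions leaderboard:

```python
# Reuses `session` and BASE. Ascending order surfaces the weakest sessions.
worst = session.get(
    f"{BASE}/evaluations/analytics/session-scores/comparison",
    params={"metric_name": "agent_reliability", "sort_order": "asc", "limit": 10},
).json()
```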
Get an eval run template
To help build eval run requests, PandaProbe can generate a pre-filled template for a metric:
GET /evaluations/trace-runs/template?metric=task_completion
Returns the metric’s full info (including prompt previews), default filters, sampling rate, and the default model. Use this to populate a form in your own tooling.
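For instance (reusing `session` and `BASE`; exactly which template fields can be passed back to POST /evaluations/trace-runs is an assumption to verify against your deployment):

```python
template = session.get(
    f"{BASE}/evaluations/trace-runs/template",
    params={"metric": "task_completion"},
).json()
# The template is a starting point for a run request; inspect and
# adjust its defaults before POSTing.
print(template)
```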
Error handling
| HTTP Code | Meaning |
| --- | --- |
| 202 | Eval run created and dispatched (async) |
| 201 | Score created |
| 204 | Resource deleted |
| 400 | Bad request (invalid filters, unknown metric, etc.) |
| 404 | Resource not found |
| 422 | Validation error (e.g., NUMERIC score out of range, retry on a run with no failures) |
| 429 | Rate limit exceeded |
Next steps
- Scheduling Evaluations: set up automated recurring evaluations with monitors.
- Trace Metrics Reference: detailed documentation for each trace metric.