Create Eval Run

POST /evaluations/trace-runs
curl --request POST \
  --url https://api.pandaprobe.com/evaluations/trace-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "metrics": [
    "task_completion",
    "tool_correctness",
    "step_efficiency"
  ]
}
'
{
  "id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "name": "<string>",
  "status": "PENDING",
  "metric_names": [
    "<string>"
  ],
  "total_traces": 123,
  "evaluated_count": 123,
  "failed_count": 123,
  "created_at": "<string>",
  "completed_at": "<string>",
  "project_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "target_type": "<string>",
  "filters": {},
  "sampling_rate": 123,
  "model": "<string>",
  "monitor_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "error_message": "<string>"
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

Create a filtered eval run.

The system resolves matching traces from the filters, optionally samples a fraction of them, then dispatches a background task to run the requested metrics asynchronously via an LLM judge.

Typical dashboard flow:

  1. User selects metrics -> call GET /trace-runs/template?metric=task_completion
  2. Dashboard renders the template as a form with editable filters
  3. User customizes filters/sampling -> frontend builds this request body
  4. Frontend calls POST /trace-runs with the final body (see the curl sketch below)
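
A minimal sketch of that flow as raw requests, assuming the template endpoint lives under /evaluations alongside this one and returns a prefilled request body (the metric names and sampling value here are illustrative):

# 1-2. Fetch a template for the selected metric and render it as a form
curl --request GET \
  --url 'https://api.pandaprobe.com/evaluations/trace-runs/template?metric=task_completion' \
  --header 'Authorization: Bearer <token>'

# 3-4. User customizes filters/sampling; dispatch the run with the final body
curl --request POST \
  --url https://api.pandaprobe.com/evaluations/trace-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "metrics": ["task_completion", "tool_correctness"],
  "sampling_rate": 0.1
}
'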
metrics
string[]
required

List of metric names to run. Use GET /evaluations/trace-metrics to see available names. Example: ["task_completion", "tool_correctness"].

Minimum array length: 1
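
To discover which metric names are valid here, a sketch of the lookup call referenced above:

curl --request GET \
  --url https://api.pandaprobe.com/evaluations/trace-metrics \
  --header 'Authorization: Bearer <token>'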
name
string | null

Optional human-readable label for this run (e.g. 'Weekly prod eval').

filters
EvalRunFilters · object

Trace filters to select which traces to evaluate. Omit or leave empty to evaluate all traces in the project.

sampling_rate
number
default:1

Fraction of matching traces to evaluate (0.0 to 1.0). 1.0 = all matching traces, 0.1 = random 10%.

Required range: 0 <= x <= 1
model
string | null

LLM model string override for the judge (e.g. 'openai/gpt-4o'). Null uses the system default.
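
Putting the body fields together, a request that labels the run, evaluates a random 10% of matching traces, and overrides the judge model might look like this (the empty filters object is a placeholder; its actual shape is defined by EvalRunFilters):

{
  "metrics": ["task_completion", "tool_correctness"],
  "name": "Weekly prod eval",
  "filters": {},
  "sampling_rate": 0.1,
  "model": "openai/gpt-4o"
}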

Response

Successful Response

Full eval run representation used by both list and detail endpoints.

id
string<uuid>
required
name
string | null
required
status
enum<string>
required

Lifecycle status of an evaluation job.

Available options:
PENDING,
RUNNING,
COMPLETED,
FAILED
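
Because metrics run asynchronously, a newly created run starts in PENDING and moves through RUNNING to COMPLETED or FAILED. A minimal polling sketch, assuming runs can be fetched by id at GET /evaluations/trace-runs/{id} (a detail endpoint is implied by the description above, but its exact path is an assumption) and that jq is available:

# Poll the run every 10 seconds until it reaches a terminal state.
RUN_ID=3c90c3cc-0d44-4b50-8888-8dd25736052a
while true; do
  STATUS=$(curl --silent \
    --url "https://api.pandaprobe.com/evaluations/trace-runs/$RUN_ID" \
    --header 'Authorization: Bearer <token>' \
    | jq -r '.status')
  echo "status: $STATUS"
  if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ]; then
    break
  fi
  sleep 10
done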
metric_names
string[]
required
total_traces
integer
required
evaluated_count
integer
required
failed_count
integer
required
created_at
string
required
completed_at
string | null
required
project_id
string<uuid>
required
target_type
string
required
filters
Filters · object
required
sampling_rate
number
required
model
string | null
required
monitor_id
string<uuid> | null
required
error_message
string | null
required