Create Eval Run

POST /evaluations/trace-runs
curl --request POST \
  --url https://api.pandaprobe.com/evaluations/trace-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "metrics": [
    "task_completion",
    "tool_correctness",
    "step_efficiency"
  ]
}
'
{
  "id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "name": "<string>",
  "status": "PENDING",
  "metric_names": [
    "<string>"
  ],
  "total_traces": 123,
  "evaluated_count": 123,
  "failed_count": 123,
  "created_at": "<string>",
  "completed_at": "<string>",
  "project_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "target_type": "<string>",
  "filters": {},
  "sampling_rate": 123,
  "model": "<string>",
  "monitor_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "error_message": "<string>"
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

Create a filtered eval run.

The system resolves matching traces from the filters, optionally samples a fraction of them, then dispatches a background task to run the requested metrics asynchronously via an LLM judge.

Typical dashboard flow:

  1. User selects metrics -> call GET /trace-runs/template?metric=task_completion
  2. Dashboard renders the template as a form with editable filters
  3. User customizes filters/sampling -> frontend builds this request body
  4. Frontend calls POST /trace-runs with the final body (see the curl sketch below)
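
A minimal sketch of that flow as raw requests, assuming the template endpoint lives under /evaluations alongside this one and returns a prefilled request body (the metric names and sampling value here are illustrative):

# 1-2. Fetch a template for the selected metric and render it as a form
curl --request GET \
  --url 'https://api.pandaprobe.com/evaluations/trace-runs/template?metric=task_completion' \
  --header 'Authorization: Bearer <token>'

# 3-4. User customizes filters/sampling; dispatch the run with the final body
curl --request POST \
  --url https://api.pandaprobe.com/evaluations/trace-runs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "metrics": ["task_completion", "tool_correctness"],
  "sampling_rate": 0.1
}
'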
metrics
string[]
required

List of metric names to run. Use GET /evaluations/trace-metrics to see available names. Example: ["task_completion", "tool_correctness"].

Minimum array length: 1
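
To discover which metric names are valid here, a sketch of the lookup call referenced above:

curl --request GET \
  --url https://api.pandaprobe.com/evaluations/trace-metrics \
  --header 'Authorization: Bearer <token>'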
name
string | null

Optional human-readable label for this run (e.g. 'Weekly prod eval').

filters
EvalRunFilters · object

Trace filters to select which traces to evaluate. Omit or leave empty to evaluate all traces in the project.

sampling_rate
number
default:1

Fraction of matching traces to evaluate (0.0 to 1.0). 1.0 = all matching traces, 0.1 = random 10%.

Required range: 0 <= x <= 1
model
string | null

LLM model string override for the judge (e.g. 'openai/gpt-4o'). Null uses the system default.
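
Putting the body fields together, a request that labels the run, evaluates a random 10% of matching traces, and overrides the judge model might look like this (the empty filters object is a placeholder; its actual shape is defined by EvalRunFilters):

{
  "metrics": ["task_completion", "tool_correctness"],
  "name": "Weekly prod eval",
  "filters": {},
  "sampling_rate": 0.1,
  "model": "openai/gpt-4o"
}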

Response

Successful Response

Full eval run representation used by both list and detail endpoints.

id
string<uuid>
required
name
string | null
required
status
enum<string>
required

Lifecycle status of an evaluation job.

Available options:
PENDING,
RUNNING,
COMPLETED,
FAILED
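
Because metrics run asynchronously, a newly created run starts in PENDING and moves through RUNNING to COMPLETED or FAILED. A minimal polling sketch, assuming runs can be fetched by id at GET /evaluations/trace-runs/{id} (a detail endpoint is implied by the description above, but its exact path is an assumption) and that jq is available:

# Poll the run every 10 seconds until it reaches a terminal state.
RUN_ID=3c90c3cc-0d44-4b50-8888-8dd25736052a
while true; do
  STATUS=$(curl --silent \
    --url "https://api.pandaprobe.com/evaluations/trace-runs/$RUN_ID" \
    --header 'Authorization: Bearer <token>' \
    | jq -r '.status')
  echo "status: $STATUS"
  if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ]; then
    break
  fi
  sleep 10
done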
metric_names
string[]
required
total_traces
integer
required
evaluated_count
integer
required
failed_count
integer
required
created_at
string
required
completed_at
string | null
required
project_id
string<uuid>
required
target_type
string
required
filters
Filters · object
required
sampling_rate
number
required
model
string | null
required
monitor_id
string<uuid> | null
required
error_message
string | null
required