Terminal-Bench Evaluator
Select Task Zip File
Backend:
Codex
DSPy (gpt-5.2-codex)
Use Harbor (v2) Prompt:
Iterations (k):
Aggregation:
All@K (pass only if ALL k runs pass)
Majority Vote (pass if >50% of the k runs pass)
Pass@K (pass if ANY run passes)
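
The three aggregation modes listed above can be sketched as follows. This is a minimal illustration only, not the evaluator's actual code: the function name `aggregate` and the mode strings `"all_at_k"`, `"majority"`, and `"pass_at_k"` are hypothetical names chosen for this example.

```python
from typing import Sequence


def aggregate(results: Sequence[bool], mode: str) -> bool:
    """Combine the pass/fail outcomes of k runs into a single verdict.

    `results` holds one boolean per run (True = that run passed).
    `mode` is a hypothetical identifier for the aggregation rule.
    """
    k = len(results)
    if k == 0:
        raise ValueError("at least one run is required")

    passes = sum(results)  # True counts as 1, False as 0

    if mode == "all_at_k":
        # All@K: pass only if every one of the k runs passed.
        return passes == k
    if mode == "majority":
        # Majority Vote: pass if strictly more than 50% of runs passed.
        return passes * 2 > k
    if mode == "pass_at_k":
        # Pass@K: pass if any single run passed.
        return passes >= 1
    raise ValueError(f"unknown aggregation mode: {mode!r}")
```

With k = 4 and exactly two passing runs, Majority Vote fails under this reading because 50% is not strictly greater than 50%, while Pass@K passes and All@K fails.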
Verify Issues:
Evaluate Task
Download Evaluations CSV