Checks pack format
YAML structure, dimensions, optional live execution, and how to share your benchmark.
What is a pack?
A pack is a YAML file that defines cases: each case has a name, a dimension (quality, safety, latency, cost), a weight, and either pre-filled baseline/candidate outputs (for heuristic scoring) or an input plus an optional execution block so A2ZAI can call an LLM and score the response.
Required fields
- `name` — Pack name (used in the benchmark card and PR comment).
- `description` — Short summary of what the pack evaluates.
- `cases` — Array of case objects. Each case must have `name`, `dimension`, and `weight`, plus either (a) `baseline`/`candidate` scores with `baselineOutput`/`candidateOutput`, or (b) an `input` when using `execution`.
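Putting the required fields together, a minimal heuristic pack might look like this (the pack name, case, and outputs are illustrative, not from a real run):

```yaml
name: support-bot-regression
description: Guards the accuracy of the support assistant's refund answers.
cases:
  - name: refund-policy-answer
    dimension: quality
    weight: 2
    baselineOutput: "Refunds are available within 30 days."
    candidateOutput: "Refunds are available within 30 days of purchase."
    expectedContains:
      - "30 days"
```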
Dimensions
Every case is tagged with one of four dimensions so the scorecard can show deltas per dimension:
- `quality` — Correctness, relevance, and completeness of the response.
- `safety` — Policy adherence, no overpromising, safe handling of edge cases.
- `latency` — Speed or turnaround (e.g. fewer cycles, concise replies).
- `cost` — Token efficiency, concision, or cost-related behavior.
Scoring rules (per case)
For heuristic scoring you provide `baselineOutput` and `candidateOutput`. Checks compares the candidate against:
- `expectedContains` — Array of strings; the candidate output should contain these.
- `forbiddenContains` — Array of strings; the candidate must not contain these.
- `maxOutputChars` / `minOutputChars` — Length guardrails.
- `threshold` — Minimum score (0–100) for the case to pass.
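A single case exercising these rules might look like this (field names follow the list above; the strings and values are illustrative):

```yaml
- name: no-overpromising
  dimension: safety
  weight: 1
  baselineOutput: "I can't guarantee an exact delivery date."
  candidateOutput: "Delivery usually takes 3-5 days; exact dates aren't guaranteed."
  expectedContains:
    - "guaranteed"
  forbiddenContains:
    - "definitely arrive"
  maxOutputChars: 300
  threshold: 70
```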
When you add an execution block with `provider: openai`, `baselineModel`, and `candidateModel`, Checks runs each case's `input` through both models and then applies the same rules to the live outputs.
Execution block (optional)
To run live model comparisons instead of pre-filled outputs, add an execution object:
```yaml
execution:
  provider: openai
  baselineModel: gpt-4o-mini
  candidateModel: gpt-4.1-mini
  system: Optional system prompt for the assistant.
  temperature: 0
  maxTokens: 140
```
Each case in the pack must then have an `input` string (the user prompt). Checks calls the baseline and candidate models with that input and scores the responses using `expectedContains`, `forbiddenContains`, and the length rules.
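A complete execution-mode pack, combining the block above with an `input`-based case, might look like this (the pack, prompt, and rule values are illustrative):

```yaml
name: live-model-comparison
description: Compares baseline and candidate models on live prompts.
execution:
  provider: openai
  baselineModel: gpt-4o-mini
  candidateModel: gpt-4.1-mini
  temperature: 0
  maxTokens: 140
cases:
  - name: concise-refund-summary
    dimension: cost
    weight: 1
    input: "Summarize our refund policy in two sentences."
    expectedContains:
      - "refund"
    maxOutputChars: 400
```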
Sharing your benchmark
Every run produces a permanent, public URL and an Open Graph image:
- Benchmark URL — `https://a2zai.ai/checks/benchmarks/<slug>`. Use it in READMEs, launch posts, and X.
- Social card image — `https://a2zai.ai/checks/benchmarks/<slug>/opengraph-image`. On the benchmark page you’ll find copy-paste markdown for a badge and a link.
You can also compare with a previous run on the same repo and pack: the benchmark page shows “Run history” and a “Compare with” link for each past run.
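The exact badge markdown is generated on the benchmark page; its general shape, combining the two URLs above, is:

```markdown
[![Checks benchmark](https://a2zai.ai/checks/benchmarks/<slug>/opengraph-image)](https://a2zai.ai/checks/benchmarks/<slug>)
```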
Use a starter pack from the workbench or paste your own YAML. Connect a repo and run Checks to generate your first benchmark card.