Checks pack format

YAML structure, dimensions, optional live execution, and how to share your benchmark.

What is a pack?

A pack is a YAML file that defines cases. Each case has a name, a dimension (quality, safety, latency, or cost), a weight, and either pre-filled baseline/candidate outputs (for heuristic scoring) or an input plus an optional execution block so A2ZAI can call an LLM and score the response.

Required fields

  • name — Pack name (used in the benchmark card and PR comment).
  • description — Short summary of what the pack evaluates.
  • cases — Array of case objects. Each case must have: name, dimension, weight, and either (a) baseline / candidate scores plus baselineOutput / candidateOutput, or (b) input when using execution.
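
Putting the required fields together, a minimal pack using pre-filled outputs might look like the sketch below. The field names come from this page; the case name, scores, and output strings are purely illustrative:

```yaml
name: Support reply checks
description: Evaluates accuracy of refund-policy answers in support replies.
cases:
  - name: refund-window            # illustrative case
    dimension: quality
    weight: 2
    baseline: 62                   # pre-filled baseline score
    candidate: 81                  # pre-filled candidate score
    baselineOutput: "Refunds take 30 days."
    candidateOutput: "Refunds are processed within 14 days of approval."
```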

Dimensions

Every case is tagged with one of four dimensions so the scorecard can show deltas per dimension:

  • quality — Correctness, relevance, and completeness of the response.
  • safety — Policy adherence, no overpromising, safe handling of edge cases.
  • latency — Speed or turnaround (e.g. fewer cycles, concise replies).
  • cost — Token efficiency, concision, or cost-related behavior.
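
For instance, a pack might spread its cases across the four dimensions like this (case names and weights are illustrative):

```yaml
cases:
  - { name: accurate-answer,   dimension: quality, weight: 2 }
  - { name: no-overpromising,  dimension: safety,  weight: 3 }
  - { name: concise-reply,     dimension: latency, weight: 1 }
  - { name: token-budget,      dimension: cost,    weight: 1 }
```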

Scoring rules (per case)

For heuristic scoring you provide baselineOutput and candidateOutput. Checks compares the candidate against:

  • expectedContains — Array of strings; the candidate output should contain these.
  • forbiddenContains — Array of strings; the candidate must not contain these.
  • maxOutputChars / minOutputChars — Length guardrails.
  • threshold — Minimum score (0–100) for the case to pass.
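
The exact scoring formula is not specified on this page, but the way the rule fields interact can be sketched roughly as follows. The function name, starting score, and per-violation deductions below are assumptions for illustration only:

```python
def score_case(output: str, rules: dict) -> int:
    """Hypothetical per-case heuristic: start at 100, deduct per violation.
    The real Checks formula may differ; this only shows how the rule
    fields (expectedContains, forbiddenContains, length guards) interact."""
    score = 100
    for s in rules.get("expectedContains", []):
        if s not in output:                 # required string missing
            score -= 25
    for s in rules.get("forbiddenContains", []):
        if s in output:                     # forbidden string present
            score -= 40
    if "maxOutputChars" in rules and len(output) > rules["maxOutputChars"]:
        score -= 15                         # output too long
    if "minOutputChars" in rules and len(output) < rules["minOutputChars"]:
        score -= 15                         # output too short
    return max(score, 0)

rules = {"expectedContains": ["14 days"], "forbiddenContains": ["guarantee"]}
passed = score_case("Refunds arrive within 14 days.", rules) >= 70  # threshold check → True
```

The threshold then acts as the pass/fail cutoff on whatever score the rules produce.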

When you add an execution block with provider: openai, baselineModel, and candidateModel, Checks runs each case’s input through both models and then applies the same rules to the live outputs.

Execution block (optional)

To run live model comparisons instead of pre-filled outputs, add an execution object:

execution:
  provider: openai
  baselineModel: gpt-4o-mini
  candidateModel: gpt-4.1-mini
  system: Optional system prompt for the assistant.
  temperature: 0
  maxTokens: 140

Each case in the pack must then have an input string (the user prompt). Checks will call the baseline and candidate models with that input and score the responses using expectedContains, forbiddenContains, and length rules.
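
For example, a case under an execution block needs only an input plus the scoring rules. The field names come from this page; the prompt, strings, and numbers are illustrative:

```yaml
cases:
  - name: refund-window
    dimension: quality
    weight: 2
    input: "How long do refunds take?"     # sent as the user prompt
    expectedContains: ["14 days"]
    forbiddenContains: ["guarantee"]
    maxOutputChars: 300
    threshold: 70
```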

Sharing your benchmark

Every run produces a permanent, public URL and an Open Graph image:

  • Benchmark URL — https://a2zai.ai/checks/benchmarks/<slug>. Use it in READMEs, launch posts, and X.
  • Social card image — https://a2zai.ai/checks/benchmarks/<slug>/opengraph-image. On the benchmark page you’ll find copy-paste markdown for a badge and a link.
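
The copy-paste snippet on the benchmark page is authoritative, but given the two URLs above it will resemble a standard markdown image link (illustrative):

```markdown
[![Checks benchmark](https://a2zai.ai/checks/benchmarks/<slug>/opengraph-image)](https://a2zai.ai/checks/benchmarks/<slug>)
```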

You can also compare with a previous run on the same repo and pack: the benchmark page shows “Run history” and a “Compare with” link for each past run.

View benchmark showcase

Next

Use a starter pack from the workbench or paste your own YAML. Connect a repo and run Checks to generate your first benchmark card.

Open workbench →