
DriftCheck quickstart and pack format

Install the local-first runner, use the three V1 packs, and publish proof cards only when you choose.

Install and run

DriftCheck runs on your machine or in your CI. The CLI creates three starter packs for tool calling, RAG faithfulness, and model migration, then writes a local JSON report plus a markdown summary.

npx @a2zai-ai/driftcheck init
npx @a2zai-ai/driftcheck check
npx @a2zai-ai/driftcheck check --pack tool-calling

Local runs write .driftcheck/runs/latest.json and driftcheck-report.md. Nothing is uploaded unless you explicitly run publish.

V1 starter packs

  • Tool-Calling Reliability — schema-valid tool arguments, fallback behavior, and hallucinated tools.
  • RAG Faithfulness — grounded answers, citations, missing-context refusal, and source scope.
  • Model Migration — quality, cost, latency, and safety drift when moving between models.

What is a pack?

A pack is a YAML file that defines cases. Each case has a name, a dimension (quality, safety, latency, or cost), a weight, and either pre-filled baseline/candidate outputs (for heuristic scoring) or an input plus an optional execution block so A2ZAI can call an LLM and score the response.

Required fields

  • id — Stable pack id, for example tool-calling.
  • name — Pack name (used in the proof card and PR comment).
  • category — One of tool-calling, rag-faithfulness, or model-migration.
  • description — Short summary of what the pack evaluates.
  • cases — Array of case objects. Each case must have: name, dimension, weight, and either (a) baseline / candidate scores plus baselineOutput / candidateOutput, or (b) input when using execution.
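Putting the required fields together, a minimal pack using pre-filled outputs might look like the sketch below. The field names come from the list above; the id, case name, scores, and outputs are illustrative placeholders, not shipped defaults.

```yaml
id: tool-calling
name: Tool-Calling Reliability
category: tool-calling
description: Checks that tool arguments stay schema-valid.
cases:
  - name: schema-valid-arguments   # illustrative case name
    dimension: quality
    weight: 1
    baseline: 92                   # pre-filled scores (heuristic mode)
    candidate: 88
    baselineOutput: '{"city": "Paris", "unit": "celsius"}'
    candidateOutput: '{"city": "Paris"}'
```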

Dimensions

Every case is tagged with one of four dimensions so the scorecard can show deltas per dimension:

  • quality — Correctness, relevance, and completeness of the response.
  • safety — Policy adherence, no overpromising, safe handling of edge cases.
  • latency — Speed or turnaround (e.g. fewer cycles, concise replies).
  • cost — Token efficiency, concision, or cost-related behavior.

Scoring rules (per case)

For heuristic scoring you provide baselineOutput and candidateOutput. Checks compares the candidate against:

  • expectedContains — Array of strings; the candidate output should contain these.
  • forbiddenContains — Array of strings; the candidate must not contain these.
  • maxOutputChars / minOutputChars — Length guardrails.
  • threshold — Minimum score (0–100) for the case to pass.
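As a sketch, a single heuristic case combining these rules might look like this (the strings and numbers are illustrative assumptions, not prescribed values):

```yaml
cases:
  - name: refuses-unknown-tool     # illustrative case name
    dimension: safety
    weight: 2
    baselineOutput: "I don't have a tool for that, so I can't help."
    candidateOutput: "I don't have a tool for that request."
    expectedContains: ["don't have a tool"]   # candidate should include this
    forbiddenContains: ["calling tool"]       # candidate must not include this
    maxOutputChars: 300                       # length guardrail
    threshold: 70                             # minimum score (0-100) to pass
```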

When you add an execution block with provider: openai, baselineModel, and candidateModel, Checks runs each case’s input through both models and then applies the same rules to the live outputs.

Execution block (optional)

To run live model comparisons instead of pre-filled outputs, add an execution object:

execution:
  provider: openai
  baselineModel: gpt-4o-mini
  candidateModel: gpt-4.1-mini
  system: Optional system prompt for the assistant.
  temperature: 0
  maxTokens: 140

Each case in the pack must then have an input string (the user prompt). Checks will call the baseline and candidate models with that input and score the responses using expectedContains, forbiddenContains, and length rules.
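For example, a case paired with an execution block needs only an input plus the scoring rules; the prompt and rule values below are illustrative assumptions:

```yaml
cases:
  - name: grounded-capital-answer  # illustrative case name
    dimension: quality
    weight: 1
    input: "Using only the provided context, what is the capital of France?"
    expectedContains: ["Paris"]        # live outputs should include this
    forbiddenContains: ["I'm not sure"]
    threshold: 80
```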

Sharing your benchmark

Local runs stay private. When you explicitly publish a report, A2ZAI creates a proof URL:

DRIFTCHECK_TOKEN="paste-token-here" npx @a2zai-ai/driftcheck publish --run .driftcheck/runs/latest.json --public
  • Proof URL — https://a2zai.ai/checks/proof/<slug>. Use it in READMEs, launch posts, and X only after you choose to publish.
  • Local report — driftcheck-report.md remains in your repo or CI artifact.

Hosted history, richer comparison, and team dashboards are later phases. V1 proves the local-first loop first.


Next

Use a starter pack from the local runner or paste your own YAML in the workbench. Publish only when you want your first proof card.

Open workbench →