DriftCheck quickstart and pack format
Install the local-first runner, use the three V1 packs, and publish proof cards only when you choose.
Install and run
DriftCheck runs on your machine or in your CI. The CLI creates three starter packs for tool calling, RAG faithfulness, and model migration, then writes a local JSON report plus a markdown summary.
```bash
npx @a2zai-ai/driftcheck init
npx @a2zai-ai/driftcheck check
npx @a2zai-ai/driftcheck check --pack tool-calling
```
Local runs write .driftcheck/runs/latest.json and driftcheck-report.md. Nothing is uploaded unless you explicitly run publish.
V1 starter packs
- Tool-Calling Reliability — schema-valid tool arguments, fallback behavior, and hallucinated tools.
- RAG Faithfulness — grounded answers, citations, missing-context refusal, and source scope.
- Model Migration — quality, cost, latency, and safety drift when moving between models.
What is a pack?
A pack is a YAML file that defines cases. Each case has a name, a dimension (quality, safety, latency, or cost), a weight, and either pre-filled baseline/candidate outputs (for heuristic scoring) or an input plus an optional execution block so DriftCheck can call an LLM and score the response.
Required fields
- `id` — Stable pack id, for example `tool-calling`.
- `name` — Pack name (used in the proof card and PR comment).
- `category` — One of `tool-calling`, `rag-faithfulness`, or `model-migration`.
- `description` — Short summary of what the pack evaluates.
- `cases` — Array of case objects. Each case must have `name`, `dimension`, and `weight`, plus either (a) `baseline`/`candidate` scores with `baselineOutput`/`candidateOutput`, or (b) an `input` when using `execution`.
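Putting the required fields together, a minimal pack might look like the sketch below. The field names come from the list above; the case content and score values are illustrative, not taken from a shipped pack:

```yaml
id: tool-calling
name: Tool-Calling Reliability
category: tool-calling
description: Checks that tool arguments stay schema-valid across model versions.
cases:
  - name: schema-valid-arguments
    dimension: quality
    weight: 1
    # Pre-filled scores and outputs enable heuristic scoring (option (a) above).
    baseline: 90
    candidate: 80
    baselineOutput: '{"city": "Berlin", "unit": "celsius"}'
    candidateOutput: '{"city": "Berlin"}'
```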
Dimensions
Every case is tagged with one of four dimensions so the scorecard can show deltas per dimension:
- `quality` — Correctness, relevance, and completeness of the response.
- `safety` — Policy adherence, no overpromising, safe handling of edge cases.
- `latency` — Speed or turnaround (e.g. fewer cycles, concise replies).
- `cost` — Token efficiency, concision, or cost-related behavior.
Scoring rules (per case)
For heuristic scoring you provide `baselineOutput` and `candidateOutput`. DriftCheck compares the candidate against:
- `expectedContains` — Array of strings; the candidate output should contain these.
- `forbiddenContains` — Array of strings; the candidate output must not contain these.
- `maxOutputChars` / `minOutputChars` — Length guardrails.
- `threshold` — Minimum score (0–100) for the case to pass.
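Applied to a single case, the rules above combine like this. The case content is illustrative; here the candidate output trips `forbiddenContains`, so the case would score below its baseline:

```yaml
- name: grounded-rate-limit-answer
  dimension: quality
  weight: 2
  baselineOutput: "Per source [1], the limit is 100 requests per minute."
  candidateOutput: "Per source [1], the limit is 100 requests per minute, guaranteed forever."
  expectedContains: ["source [1]", "100 requests"]   # both strings must appear
  forbiddenContains: ["guaranteed"]                   # overpromising is penalized
  maxOutputChars: 300                                 # length guardrail
  threshold: 70                                       # minimum passing score
```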
When you add an execution block with `provider: openai`, `baselineModel`, and `candidateModel`, DriftCheck runs each case's `input` through both models and then applies the same rules to the live outputs.
Execution block (optional)
To run live model comparisons instead of pre-filled outputs, add an execution object:
```yaml
execution:
  provider: openai
  baselineModel: gpt-4o-mini
  candidateModel: gpt-4.1-mini
  system: Optional system prompt for the assistant.
  temperature: 0
  maxTokens: 140
```
Each case in the pack must then have an `input` string (the user prompt). DriftCheck will call the baseline and candidate models with that input and score the responses using `expectedContains`, `forbiddenContains`, and the length rules.
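In execution mode a case needs only an `input` plus the scoring rules, since the outputs come from live model calls. A sketch with illustrative values:

```yaml
cases:
  - name: refuses-missing-context
    dimension: safety
    weight: 1
    # The user prompt sent to both baseline and candidate models.
    input: "What does section 9 of the attached policy say?"
    # No document is attached, so a safe answer admits the missing context.
    expectedContains: ["context"]
    forbiddenContains: ["Section 9 states"]
    threshold: 60
```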
Sharing your benchmark
Local runs stay private. When you explicitly publish a report, A2ZAI creates a proof URL:
```bash
DRIFTCHECK_TOKEN="paste-token-here" npx @a2zai-ai/driftcheck publish --run .driftcheck/runs/latest.json --public
```
- Proof URL — `https://a2zai.ai/checks/proof/<slug>`. Use it in READMEs, launch posts, and X only after you choose to publish.
- Local report — `driftcheck-report.md` remains in your repo or CI artifact.
Hosted history, richer comparison, and team dashboards are later phases. V1 proves the local-first loop first.
View proof gallery
Use a starter pack from the local runner or paste your own YAML in the workbench. Publish only when you want your first proof card.
Open workbench →