ML evaluation reproducibility audit: a method that survives contact with the auditor
Reproducibility audits of ML evaluation claims fail at a remarkably consistent rate. The reason is structural, not technical. The model card says 94% accuracy. The audit shows 71%. Both numbers are arithmetically correct on the data each party ran. Nobody wrote down which dataset, metric, threshold, and seed produced the 94%, and there is no way to recover that answer after the fact.
Why most ML reproducibility audits fail
An ML evaluation reproducibility audit is supposed to answer a small set of questions. Did the claim hold? Was the dataset the same? Was the metric the same? Was the threshold the same? Was the seed the same? In practice the audit instead spends three weeks producing a forensic reconstruction of what the team probably meant, because none of those values were committed before the experiment ran.
The common failure modes:
- Silent threshold drift. The model card says "above 0.90 accuracy." The notebook that produced the number compares against 0.875. Both were true at different moments. Neither is provable.
- Dataset substitution. The held-out set was regenerated when someone noticed a class imbalance. Nobody flagged the change. The original bytes are gone.
- Metric ambiguity. "Accuracy" turns out to mean micro-averaged, then macro-averaged, then top-5 — whichever was kindest in a given quarter.
- Seed laundering. The seed in the published notebook is the seed that produced the best of seven runs. The other six are deleted.
- Post-hoc rationalisation. The claim and the experiment are in the wrong order. The number arrives, the prose follows.
None of these are caught by a CI pipeline that runs the eval. The CI pipeline runs whatever the file says today. The audit needs to know what the file said the day the claim was made.
Define "reproducible" before you audit anything
"Reproducible" gets used to mean three different things, and conflating them is what makes audits drag. We separate them explicitly:
| Level | What it means | What it costs |
|---|---|---|
| Re-runnable | Someone else can run your eval harness and get a number | Docker image, deterministic seed |
| Reproducible | Someone else gets the same number on the same bytes | Above + dataset content hash + canonical metric implementation |
| Auditable | A third party can verify the claim was committed before the experiment ran | Above + pre-experimental SHA-256 commitment + tamper-evident chain |
Re-runnable is the floor. Reproducible is what most ML governance documents ask for. Auditable is what a reproducibility audit actually needs, and it is the level the field rarely hits, because it requires a discipline imported from clinical trials and high-energy physics: commit the claim before you see the result.
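The jump from re-runnable to reproducible is mostly the dataset content hash: freeze the held-out bytes, digest them, and pin the digest next to the eval config. A minimal sketch of that step, assuming a single flat file whose name and extension are illustrative rather than part of any spec:

```python
import hashlib

def dataset_content_hash(path: str) -> str:
    """SHA-256 over the dataset's frozen bytes, streamed so large files are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

# Pin the result wherever the eval is configured, e.g.
# print(dataset_content_hash("enron-spam-holdout-v3.csv"))  # illustrative file name
```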
The PRML method: hash first, run second
PRML (Pre-Registered ML Manifest, v0.1 spec, Zenodo DOI 10.5281/zenodo.20177839) makes auditability mechanical. A manifest binds metric, comparator, threshold, dataset hash, seed, and producer identity into a single canonical YAML document, then commits to the SHA-256 of those canonical bytes.
For a classifier evaluation audit, the manifest looks like this:
```yaml
spec: prml/v0.1
claim_id: spam-classifier-v4.2-holdout
created_at: 2026-05-15T08:00:00Z
producer:
  id: research.example
metric: f1
comparator: ">="
threshold: 0.90
dataset:
  id: enron-spam-holdout-v3
  hash: sha256:7b21...e904
  rows: 5800
seed: 1729
prior_hash: null
```
Run falsify lock to write the sidecar, which records the SHA-256 of the canonical manifest bytes. The sidecar is the commitment. From that moment, the team can run the eval, run it again, run it on different hardware. The number either clears the threshold or it doesn't. If anyone edits the threshold, dataset hash, or seed afterwards, falsify verdict returns exit code 3 (TAMPERED). The audit doesn't ask the team whether the spec was edited; the hash answers.
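Why any edit to a committed field is detectable comes down to one property of the hash: change the bytes and the digest changes. The sketch below is illustrative only; it assumes the committed bytes are the manifest file exactly as written and invents a one-line sidecar format, whereas the real canonicalization and sidecar rules are defined by the v0.1 spec and the falsify tooling.

```python
import hashlib
from pathlib import Path

def commit(manifest_path: str, sidecar_path: str) -> str:
    """Stand-in for the lock step: digest the manifest bytes and record the digest."""
    digest = hashlib.sha256(Path(manifest_path).read_bytes()).hexdigest()
    Path(sidecar_path).write_text(digest + "\n")
    return digest

def unchanged(manifest_path: str, sidecar_path: str) -> bool:
    """Recompute the digest and compare it with the committed one."""
    committed = Path(sidecar_path).read_text().strip()
    current = hashlib.sha256(Path(manifest_path).read_bytes()).hexdigest()
    return current == committed  # False: the manifest was edited after commitment

# commit("claim.yaml", "claim.yaml.sha256")     # before the experiment runs
# unchanged("claim.yaml", "claim.yaml.sha256")  # any time afterwards, by anyone
```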
Auditing a calibration claim
For probabilistic classifiers and forecasters, the more honest claim is calibration rather than headline accuracy. A Brier score manifest:
```yaml
spec: prml/v0.1
claim_id: medical-triage-calibration-2026Q2
created_at: 2026-04-30T14:11:00Z
producer:
  id: clinic.example
metric: brier
comparator: "<="
threshold: 0.20
dataset:
  id: triage-prospective-2026Q1
  hash: sha256:4ef8...ab32
  rows: 12440
seed: 42
prior_hash: sha256:1aa3...90d7  # Q1 calibration claim
notes: |
  Calibration is the primary; AUC is reported but not gated.
  Prospective dataset frozen 2026-03-31 23:59 UTC.
```
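The gated metric here is the Brier score: the mean squared difference between predicted probabilities and binary outcomes, so lower is better and the comparator is <=. A generic computation, not the PRML canonical reference implementation:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    assert len(probs) == len(outcomes) and len(probs) > 0
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# The manifest above gates on brier <= 0.20:
# verdict = "PASS" if brier_score(probs, outcomes) <= 0.20 else "FAIL"
```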
The auditor in 2027 doesn't need access to your evaluation harness. They need the manifest, the sidecar, and any of the four reference implementations (Python, JavaScript, Go, Rust — all byte-equivalent on the 12 v0.1 conformance vectors). They recompute the canonical bytes, recompute the hash, compare. Integrity verification requires no trust in the producer's tooling.
The amendment chain replaces the "we re-ran it" email
Real evaluations get re-run. New holdout sets ship. Distribution shifts force recalibration. The audit-killing question is which version of the claim was authoritative on a given date.
PRML answers it with a forward-only chain. Each new manifest points at the canonical hash of the previous one via prior_hash. The chain's terminal hash is a single 32-byte value that compresses the entire evaluation history. An auditor can confirm the whole history with a hash walk: every manifest in order, no skipped links, no retroactive edits.
```yaml
spec: prml/v0.1
claim_id: spam-classifier-v4.2-holdout-amend-1
created_at: 2026-06-20T10:42:11Z
producer: { id: research.example }
metric: f1
comparator: ">="
threshold: 0.90
dataset:
  id: enron-spam-holdout-v3.1  # adversarial examples added
  hash: sha256:c0a1...7f55
  rows: 6240
seed: 1729
prior_hash: sha256:9d2f...8c01  # the v3 manifest above
reason: |
  Re-evaluation after adversarial holdout expansion.
  F1 dropped from 0.93 to 0.87 on the harder set.
  Threshold not relaxed; this manifest verdicts FAIL (exit 10).
```
An auditor reading that chain sees: claim committed, claim verified, claim re-evaluated under harder conditions, claim failed, team did not silently relax the threshold. That is what passing a reproducibility audit looks like.
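The hash walk itself is a few lines. The sketch below assumes, purely for illustration, that each manifest's canonical hash is the SHA-256 of its file bytes and that prior_hash can be read with an ordinary YAML parser; the real canonicalization rules are those of the v0.1 spec.

```python
import hashlib
from pathlib import Path

import yaml  # PyYAML

def walk_chain(manifest_paths):
    """Check every prior_hash link, oldest manifest first; return the terminal hash."""
    prev_digest = None
    for path in manifest_paths:
        raw = Path(path).read_bytes()
        prior = yaml.safe_load(raw).get("prior_hash")
        prior = None if prior is None else prior.removeprefix("sha256:")
        if prior != prev_digest:
            raise ValueError(f"broken link at {path}: prior_hash does not match the previous manifest")
        prev_digest = hashlib.sha256(raw).hexdigest()
    return prev_digest  # a single value that commits to the whole history

# terminal = walk_chain(["spam-v4.2-holdout.yaml", "spam-v4.2-holdout-amend-1.yaml"])
```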
What this gives ML governance teams
The practical wins, in order of how much auditor time they save:
- The threshold-drift question becomes arithmetic. No interviews, no Slack archeology.
- The dataset-substitution question becomes arithmetic. The hash matches or it doesn't.
- The "which run is real" question goes away. The seed is in the manifest. The eval is deterministic against it.
- The retention burden drops. A 32-byte chain hash and a directory of YAML files outlive every MLOps vendor you currently pay.
- The audit conversation shrinks. "Send the chain hash" replaces a week of meetings.
What it doesn't give you: a substantive opinion on whether the claim is the right claim. PRML is a primitive for record integrity, not for evaluation design. Pair it with HELM-style external benchmarking, fairness auditing, and whatever sectoral process your domain demands. The AI Act crosswalk spells out the boundary explicitly.
How this fits next to existing standards
PRML is downstream of pre-registration norms in clinical trials (ICMJE, ClinicalTrials.gov), upstream of model cards (which it does not replace), and orthogonal to provenance frameworks like in-toto and SLSA, which prove where an artifact came from but not what claim was committed about it. The positioning analysis works through the comparison field by field.
For regulated contexts, PRML mechanically supports EU AI Act Article 12 logging, Article 18 retention, and Article 72 post-market monitoring — covered in our Article 12 compliance page. The reproducibility audit and the AI Act audit converge on the same artifact.
Read the spec. PRML v0.1 is 14 pages and CC BY 4.0. The v0.2 RFC is in public review, frozen 2026-05-22. Reference implementations and the 20-vector conformance suite live on GitHub; registry.falsify.dev lets you commit your first manifest in a browser.