
Pre-registration for machine learning: how the clinical-trials discipline ports to ML

Pre-registration emerged in clinical research because too many positive results disappeared on replication. The same pattern is now routine in ML. Headline accuracy numbers get retracted. Benchmark leaderboards regress under scrutiny. Nature retracts an AI-for-materials paper; an arXiv preprint claiming a record on MMLU evaporates when an external team re-runs it. The mechanism is the one Daryl Bem's critics named in 2011: claims and analyses arrive in the wrong order.

What pre-registration originally solved

The clinical-trials and psychology literatures converged on pre-registration in the 2010s as the practical answer to two specific failure modes:

1. HARKing: hypothesising after the results are known, then presenting the post-hoc story as if it had been the prediction.
2. Analytic flexibility: trying analyses, exclusions, and endpoints until one crosses the significance line, and reporting only that one.

Both look like science from the outside. Both produce literatures that fail to replicate. ClinicalTrials.gov, the AsPredicted registry, the Open Science Framework's pre-registration template, and Registered Reports (peer review of the protocol before data collection) were the institutional responses. They worked. The replication rate in pre-registered psychology studies is materially higher than in unregistered ones; the effect is even stronger in clinical trials.

The format is boring on purpose. You write down, before the data arrives: the hypothesis, the dependent and independent variables, the sample size, the exclusion criteria, the analysis plan, and the rule that distinguishes a positive from a negative result. You commit it. Then you run the study. If you deviate, you say so explicitly.

ML is the field that most needs pre-registration

Machine learning makes empirical claims at industrial scale. Every paper, every model card, every leaderboard entry, every internal release note asserts that some metric clears some threshold on some dataset. The structural pressure to silently relax the threshold, swap the dataset, or pick the seed that worked is at least as strong as in psychology — and the gatekeeping is weaker.

Concrete failure modes in ML that map cleanly onto pre-registration's original targets:

1. Seed shopping: running many seeds and reporting the best one, with no record that the others existed.
2. Threshold drift: deciding what counts as a win after seeing the number.
3. Dataset swapping: quietly moving to a different split, version, or filtered subset when the original disappoints.
4. Evaluation-set tuning: doing model selection on the set that is supposed to be held out.

None of these require malice. They are emergent under publication pressure and the absence of any binding pre-experimental commitment.

Why a Notion page or a README isn't enough

A pre-registration is only useful if it is verifiable — if a third party can confirm that the document you point at is the document you committed to, and that it hasn't been edited since. ClinicalTrials.gov solves this institutionally: the registry timestamps and freezes the protocol. AsPredicted does the same with a fixed PDF.

ML has nothing analogous. A Notion page is editable. A README on the main branch is editable. A preprint on arXiv is timestamped, but the supplementary code is not. The audit-grade question, "what claim did you commit to on the day you ran the experiment?", has no infrastructural answer.

The answer that does work, and that the field has been edging toward, is cryptographic: hash the protocol, publish the hash, run the experiment. Tampering becomes arithmetically detectable.
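
A minimal sketch of that arithmetic, using nothing beyond Python's standard library (the file name is illustrative):

import hashlib

def digest(path: str) -> str:
    """SHA-256 hex digest of a file's exact bytes."""
    with open(path, "rb") as f:
        return "sha256:" + hashlib.sha256(f.read()).hexdigest()

# Before the experiment: hash the protocol, publish the digest anywhere public.
published = digest("protocol.yaml")

# Any later check: re-hash and compare. A single edited byte, a relaxed
# threshold, a swapped dataset id, changes the digest.
assert digest("protocol.yaml") == published, "protocol edited after registration"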

PRML: the format

PRML (Pre-Registered ML Manifest, v0.1 spec, Zenodo DOI 10.5281/zenodo.20177839) is what the AsPredicted form would look like if it were machine-readable and content-addressed. A manifest binds: metric, comparator, threshold, dataset.hash, seed, producer.id, and an optional prior_hash chain pointer. The manifest is canonicalised — a small deterministic subset of YAML — and SHA-256 hashed. The hash is the registration.

A registered claim for a new retrieval method might read:

spec: prml/v0.1
claim_id: dense-retriever-nq-2026-05
created_at: 2026-05-15T07:30:00Z
producer:
  id: lab.example
  role: research
metric: ndcg_at_10
comparator: ">="
threshold: 0.42
dataset:
  id: nq-open-dev-v1
  hash: sha256:b113...8027
  rows: 8757
seed: 13
prior_hash: null
notes: |
  Primary endpoint: nDCG@10 on NQ-open dev set.
  Secondary endpoints (Recall@20, MRR) reported but not gated.
  No model selection on the dev set; tuning frozen pre-lock.

Lock it before the experiment runs. Publish the hash anywhere — a tweet, an arXiv comment, the project's registry.falsify.dev page. The hash is short, durable, and verifiable by anyone with the four-language reference implementation or any SHA-256 library.
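
What that third-party check looks like, as a sketch: parse the manifest, serialise it into the canonical form, hash, and compare against the published digest. The canonical_bytes below is a stand-in (sorted keys, UTF-8); the normative canonicalisation rules live in the v0.1 spec, and the file name and truncated digest are illustrative.

import hashlib
import yaml  # pyyaml

def canonical_bytes(manifest: dict) -> bytes:
    # Stand-in canonicalisation. The real v0.1 rules (key order, scalar
    # formatting, the YAML subset) are normative; this only shows the shape.
    return yaml.safe_dump(manifest, sort_keys=True, allow_unicode=True).encode("utf-8")

def registration_hash(path: str) -> str:
    with open(path) as f:
        return "sha256:" + hashlib.sha256(canonical_bytes(yaml.safe_load(f))).hexdigest()

published = "sha256:fe22...09a4"  # the digest the authors tweeted or registered
print("OK" if registration_hash("claim.yaml") == published else "TAMPERED")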

Negative results are first-class

The standard pre-registration outcome that ML has trouble reporting is "we tried this and it didn't work." PRML makes that outcome a deterministic exit code — 10 (FAIL) — not a narrative judgement. The chain shows you committed, ran, and got FAIL; the published artifact is the FAIL itself.

spec: prml/v0.1
claim_id: dense-retriever-nq-2026-05-result
created_at: 2026-05-15T11:48:02Z
producer: { id: lab.example }
metric: ndcg_at_10
comparator: ">="
threshold: 0.42
dataset:
  id: nq-open-dev-v1
  hash: sha256:b113...8027
  rows: 8757
seed: 13
observed_value: 0.387
verdict: FAIL          # exit code 10
prior_hash: sha256:fe22...09a4    # the pre-registered manifest above

This is what an honest negative result looks like in a content-addressed world: the pre-registration and the outcome are linked by hash; the threshold was not relaxed; the seed was not changed; the dataset was not swapped. The result counts as evidence precisely because the claim was committed before the data spoke.
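
Both properties are mechanically checkable. A sketch of the gate evaluation, with values from the result manifest above; the spec fixes exit code 10 for FAIL, while exit 0 for PASS and the comparators beyond ">=" are assumptions here:

import operator
import sys

# ">=" appears in the examples; the other comparators are assumed siblings.
COMPARATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}

def verdict(observed: float, comparator: str, threshold: float) -> tuple[str, int]:
    passed = COMPARATORS[comparator](observed, threshold)
    return ("PASS", 0) if passed else ("FAIL", 10)  # 10 per the PRML verdict codes

label, code = verdict(observed=0.387, comparator=">=", threshold=0.42)
print(label)    # FAIL

# Second check, not shown: recompute the registration hash of the locked
# pre-registration and confirm it equals this result's prior_hash.
sys.exit(code)  # 10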

Deviations: how to be honest about them

The single most common objection to pre-registration is operational: "what if we need to deviate?" In clinical trials the answer is a protocol amendment; in PRML it's an amendment manifest with prior_hash pointing at the prior canonical hash and a reason field stating why. The chain preserves both versions. The auditor (or reader, or reviewer) can see exactly what changed and when.

Crucially, deviations don't retroactively edit the original claim. They follow it. The discipline is that you can change your mind — you just can't pretend you didn't.
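
An auditor walks the chain pairwise: each manifest's prior_hash must equal the canonical hash of the manifest before it. A sketch, reusing the hypothetical registration_hash helper from the verification sketch above; the file names are illustrative:

import yaml

def verify_chain(paths: list[str]) -> bool:
    """paths: manifests oldest-first. True iff every prior_hash matches
    the canonical hash of its predecessor (None for the first link)."""
    prev_hash = None
    for path in paths:
        with open(path) as f:
            manifest = yaml.safe_load(f)
        if manifest.get("prior_hash") != prev_hash:
            return False                      # broken or retro-edited link
        prev_hash = registration_hash(path)   # helper defined earlier
    return True

# Original claim, an amendment, then the result: three links, two hash checks.
assert verify_chain(["claim.yaml", "amendment.yaml", "result.yaml"])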

What PRML doesn't try to be

PRML is not Registered Reports. It does not gate publication. It does not assess whether the claim is interesting or the methodology is sound. It is the smallest possible primitive that makes the pre-experimental commitment verifiable — a building block, not a workflow. Layer Registered Reports, peer review, and external benchmarking on top as you need them. The positioning analysis compares PRML against in-toto, SLSA, Model Cards, HELM, and ClinicalTrials.gov; each does a different job.

For regulated ML, the same manifest doubles as evidence for EU AI Act Article 12 logging and Article 72 post-market monitoring — see the Article 12 page and the crosswalk for the article-by-article reading.

For researchers crossing over from clinical or psychology

If you are familiar with the OSF template or AsPredicted, PRML will feel under-specified — that is deliberate. The clinical pre-registration burden is heavy for a reason (human subjects, ethics, IRBs). The ML pre-registration burden has to be light enough that researchers actually do it, or it won't propagate. PRML is the minimum viable commitment: metric, threshold, data hash, seed, signed and dated. Everything else lives in the prose around the manifest, where it can be richer and more discipline-specific.

The hash is the part that has to be machine-checkable. The rest can stay in your usual writing.

Read the v0.2 RFC: it is in public review and freezes 2026-05-22, with comments via GitHub Discussions. To register your first claim in a browser without installing anything, use registry.falsify.dev.