Research

How we measure ourselves

A verification tool is only as trustworthy as the way it's tested. We benchmark Hallucite continuously and publish the numbers — including the ones we're still improving.

Last validated 2026-06-17 · refreshed automatically from each nightly benchmark run

The synthetic benchmark (100,000 cases)

Real references seeded from OpenAlex (CC0) metadata, rendered across every major citation style and corrupted at graded severity, from cosmetic formatting changes through metadata errors to outright fabrication. Figures are as of 2026-06-17.

Metric	Value
F1 score	0.84
Precision	92%
Recall	79%
False alarm on clean refs	9%

Figures are measured on our public benchmark under controlled conditions.

Because a citation checker has to be right in two directions, we measure both: catching corrupted references and leaving clean ones alone. A purely cosmetic change — a reformatted page range, an abbreviated journal — wrongly triggers a false alarm only 9% of the time.
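As a minimal sketch of what "right in two directions" means in numbers, the four metrics above can be written in terms of confusion counts. The `Counts` container, the function names, and the example counts are all illustrative assumptions, not Hallucite's code or benchmark data; the positive class is assumed to be "corrupted reference".

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for a citation checker (hypothetical container)."""
    true_pos: int   # corrupted references that were flagged
    false_pos: int  # clean references that were flagged (false alarms)
    false_neg: int  # corrupted references that slipped through
    true_neg: int   # clean references that stayed verified


def precision(c: Counts) -> float:
    # Of everything flagged, how much was really corrupted?
    return c.true_pos / (c.true_pos + c.false_pos)


def recall(c: Counts) -> float:
    # Of everything corrupted, how much was flagged?
    return c.true_pos / (c.true_pos + c.false_neg)


def false_alarm_rate(c: Counts) -> float:
    # Of all clean references, how many were wrongly flagged?
    return c.false_pos / (c.false_pos + c.true_neg)


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


# Made-up counts, for illustration only.
example = Counts(true_pos=80, false_pos=10, false_neg=20, true_neg=90)
print(f"precision={precision(example):.2f}")
print(f"recall={recall(example):.2f}")
print(f"false alarm rate={false_alarm_rate(example):.2f}")
print(f"F1={f1_score(example):.2f}")
```

Precision and recall cover the "catching corrupted references" direction, while the false alarm rate covers the "leaving clean ones alone" direction; F1 alone would hide the second.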

Where it's strong, where it isn't

Error type	Cases	Accuracy
Cosmetic (should stay verified)	44,250	91%
Metadata error	36,320	70%
Fabrication	19,430	95%
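
The headline figures can be roughly reconstructed from the table above. The sketch below assumes that "accuracy" means "left verified" for the cosmetic class and "flagged" for the two corrupted classes, and that the positive class is "corrupted reference"; those readings are assumptions, and the function name is hypothetical.

```python
# Rows copied from the table above: (cases, accuracy).
cosmetic = (44_250, 0.91)        # assumed: share correctly left verified
metadata_error = (36_320, 0.70)  # assumed: share correctly flagged
fabrication = (19_430, 0.95)     # assumed: share correctly flagged


def headline_metrics(clean_class, corrupted_classes):
    """Derive precision, recall and F1 from per-class case counts and accuracies."""
    clean_cases, clean_accuracy = clean_class
    false_pos = clean_cases * (1 - clean_accuracy)           # clean refs flagged
    true_pos = sum(n * acc for n, acc in corrupted_classes)  # corrupted refs flagged
    total_corrupted = sum(n for n, _ in corrupted_classes)
    false_neg = total_corrupted - true_pos                   # corrupted refs missed
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / total_corrupted
    f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
    return precision, recall, f1


p, r, f = headline_metrics(cosmetic, [metadata_error, fabrication])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.917 recall=0.787 f1=0.847
```

Under these assumptions the table reproduces the published 92% precision and 79% recall. The F1 comes out near 0.85 rather than the published 0.84, a gap that is plausibly explained by rounding in the per-class percentages.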

Outright fabrications are the easiest to catch and metadata errors the hardest, which is exactly why we keep investing there. On the check that sets us apart, DOI identifier hijacking (a real-looking DOI that resolves to a different paper), Hallucite flags 65% of cases, a class that title-only checks miss entirely. We publish the hard numbers rather than a single flattering headline.
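
To illustrate why a title-only check misses identifier hijacking, here is a minimal sketch of the general idea: resolve the DOI, then compare the record it points to with the title the citation claims. This is not Hallucite's implementation. The in-memory `REGISTRY`, the DOI, the titles, the function names, and the 0.8 threshold are all made up for illustration; a real checker would query a live source.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a DOI lookup (made-up DOI and title).
REGISTRY = {
    "10.5555/example.001": "Graph neural networks for molecular property prediction",
}


def normalize(title: str) -> str:
    # Lowercase and collapse whitespace so formatting differences don't count.
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the normalized titles are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def check_doi(cited_title: str, doi: str, threshold: float = 0.8) -> str:
    """Classify a (title, DOI) pair; the threshold is an arbitrary example value."""
    registered_title = REGISTRY.get(doi)
    if registered_title is None:
        return "unresolved"
    if title_similarity(cited_title, registered_title) < threshold:
        return "hijack-suspected"  # DOI is real but points to another paper
    return "consistent"


print(check_doi("Graph Neural Networks for Molecular Property Prediction",
                "10.5555/example.001"))
print(check_doi("A survey of reinforcement learning in robotics",
                "10.5555/example.001"))
print(check_doi("Any title", "10.5555/not-registered"))
```

The second call is the hijack pattern: the DOI resolves, so a checker that only confirms the identifier exists would pass it, and only the comparison against the resolved record exposes the mismatch.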

Validation against live databases

Synthetic tests use a controlled stand-in for the scholarly databases, so we also run a curated set of 137 adversarial cases against all five live sources (Crossref, PubMed, Semantic Scholar, OpenAlex, Google Books), the exact path the product uses. There, Hallucite caught 100% of fabricated citations (none missed) and flagged all 9 retracted works.

Validation against the real world

Synthetic tests can be unrealistically easy, so we also validate against fabricated citations found in the wild. We use 151 adjudicated real cases compiled from public audits of accepted papers (the GPTZero NeurIPS 2025 and ICLR 2026 audits), including 17 identifier-hijack cases, a metadata-error class that broad audits exclude.

We report only aggregate results from these sets and do not redistribute their data.

What we don't do yet