Research

How we measure ourselves

A verification tool is only as trustworthy as the way it's tested. We benchmark Hallucite continuously and publish the numbers — including the ones we're still improving.

Last validated 2026-06-17 · refreshed automatically from each nightly benchmark run

The synthetic benchmark (100,000 cases)

References seeded from real OpenAlex (CC0) metadata, rendered across every major citation style and corrupted at graded severity, from cosmetic formatting changes through metadata errors to outright fabrication. As of 2026-06-17.

F1 score: 0.84
Precision: 92%
Recall: 79%
False alarm on clean refs: 9%

Figures are measured on our public benchmark under controlled conditions.

Because a citation checker has to be right in two directions, we measure both: catching corrupted references and leaving clean ones alone. A purely cosmetic change (a reformatted page range, an abbreviated journal) triggers a false alarm only 9% of the time.
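As a rough illustration of how the two directions combine, the sketch below rebuilds confusion counts from the per-class breakdown table further down this page and derives precision, recall, F1 and the false-alarm rate. It rests on assumptions the page does not state: that "accuracy" means "left verified" for cosmetic cases and "flagged" for metadata-error and fabrication cases, and that the headline figures cover the same 100,000 cases. The function and variable names are hypothetical, not part of Hallucite.

```python
# Per-class (cases, accuracy) pairs copied from the breakdown table on this page.
# Assumption: accuracy = "left verified" for clean (cosmetic) cases and
# "flagged" for corrupted (metadata-error, fabrication) cases.
CLEAN = {"cosmetic": (44_250, 0.91)}
CORRUPTED = {"metadata_error": (36_320, 0.70), "fabrication": (19_430, 0.95)}


def detection_metrics(clean, corrupted):
    """Rebuild approximate confusion counts and derive summary metrics."""
    clean_total = sum(n for n, _ in clean.values())
    corrupted_total = sum(n for n, _ in corrupted.values())

    true_neg = sum(n * acc for n, acc in clean.values())      # clean refs left alone
    true_pos = sum(n * acc for n, acc in corrupted.values())  # corrupted refs flagged
    false_pos = clean_total - true_neg                        # clean refs wrongly flagged
    false_neg = corrupted_total - true_pos                    # corrupted refs missed

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return {
        "cases": clean_total + corrupted_total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_alarm_rate": false_pos / clean_total,
    }


metrics = detection_metrics(CLEAN, CORRUPTED)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Because the table percentages are rounded, the derived values only approximate the published figures: under these assumptions precision and recall round to 92% and 79%, and F1 lands within about 0.01 of the published 0.84.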

Where it's strong, where it isn't

Error type                         Cases     Accuracy
Cosmetic (should stay verified)    44,250    91%
Metadata error                     36,320    70%
Fabrication                        19,430    95%

Outright fabrications are the easiest to catch and metadata errors the hardest — which is exactly why we keep investing there. On our differentiator, DOI identifier-hijacking (a real-looking DOI that resolves to a different paper), Hallucite flags 65% of cases — a class that title-only checks miss entirely. We publish the hard numbers rather than a single flattering headline.
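To make the identifier-hijacking idea concrete, here is a minimal sketch of one way such a check could work. It is not Hallucite's actual method. It compares the title in the citation with the title of the record the DOI resolves to, and flags the pair when they are too dissimilar. The resolved title is passed in directly here, where a real check would fetch it from a source such as Crossref. The function names and the 0.6 threshold are hypothetical choices for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so cosmetic differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(cited_title: str, resolved_title: str) -> float:
    """Similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(
        None, normalize_title(cited_title), normalize_title(resolved_title)
    ).ratio()


def looks_hijacked(cited_title: str, resolved_title: str, threshold: float = 0.6) -> bool:
    """True when the DOI's registered title is too unlike the cited title.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return title_similarity(cited_title, resolved_title) < threshold


# Same paper, different capitalization and punctuation: should not be flagged.
same_paper = looks_hijacked("Attention Is All You Need", "Attention is all you need.")

# A real-looking DOI that resolves to an unrelated paper: should be flagged.
different_paper = looks_hijacked(
    "Deep learning for citation verification",
    "Soil microbial diversity in alpine meadows",
)
print(same_paper, different_paper)
```

A title-only check never sees this mismatch, because it looks up the cited title without testing whether the DOI printed next to it points at the same record.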

Validation against live databases

Synthetic tests use a controlled stand-in for the scholarly databases. So we also run a curated set of 137 adversarial cases against all five live sources (Crossref, PubMed, Semantic Scholar, OpenAlex, Google Books) — the product's exact path. There, Hallucite caught 100% of fabricated citations (none missed) and flagged 9/9 retracted works.

Validation against the real world

Synthetic tests can be unrealistically easy, so we also validate against fabricated citations found in the wild. We use 151 adjudicated real cases compiled from public audits of accepted papers (GPTZero NeurIPS 2025 audit and GPTZero ICLR 2026 audit), including 17 identifier-hijack cases — the metadata class broad audits exclude.

We report only aggregate results from these sets and do not redistribute their data.

What we don't do yet

Confirming a cited work exists is a different, easier problem than confirming it supports the claim it's cited for. Claim-support verification remains an active research challenge, and we're explicit that Hallucite checks existence and consistency, not whether a real source actually backs the sentence citing it.

Further reading

  • Zhao et al., LLM hallucinations in the wild: large-scale evidence from non-existent citations (2026) — the 111-million-reference audit.
  • Linardon et al. (2025) and Walters & Wilder (2023) — field-level measurements of how LLMs fabricate and mis-format references.
  • See also our plain-language guides and how verification works.