Benchmark · 2026-05-21

BibFox catches 99.5% of all hallucinated references

We ran our reference verification tool against a scientific benchmark of 636 references — real citations mixed with ones ChatGPT invented. It got 626 right.

That is 98.4% accuracy, measured on the full dataset, with nothing removed to lift the score. If you are comparing reference checkers, how the test was built matters as much as the numbers — in this article you'll find both.

Accuracy: 98.4% overall correct calls (626 / 636 correct)
Precision: 99.8% of "real" calls are right (1 false positive)
Recall: 98.0% of real references validated (9 false negatives)
F-score: 98.9%, the harmonic mean of precision and recall

The short version

BibFox reaches a precision of 99.8%, getting almost every "real" call right: 432 of 433.
BibFox finds 99.5% of all fake citations: 194 of 195 fabricated references caught.
BibFox correctly classifies 98.4% of all references: 626 of 636 references in the benchmark.
The test data is external and peer-reviewed, and the benchmark used every reference in it.

See how BibFox handles your own sources and verify a reference list.

Or dive into:

Method: using a scientific and published dataset
Results: Confusion matrix reveals only one false positive
What this means in review work

Check your own reference list

Run BibFox on your sources and review the evidence yourself.

Verify a reference list

Method: using a scientific and published dataset

The benchmark uses a dataset that we did not build and that BibFox was not tuned to. Walters and Wilder (2023) prompted ChatGPT for citations and published the resulting 636 references in Scientific Reports, a Nature Portfolio journal.¹ Some point to genuine papers; some were fabricated by the model.

We ran all 636. That matters more than it sounds. A reference checker's score depends heavily on which references it is tested against. Drop the entries that are hard to verify automatically — old books, dead URLs, sources outside the major scholarly databases — and the score climbs on its own. We removed none of them. The hard references are exactly the ones that decide whether a tool earns its place in your workflow, so they stayed in.

For the statistical evaluation, we used the following definitions, where "positive" means a real reference:

False positives: Fabricated references flagged as "validated" (green) by BibFox.
False negatives: Real references flagged as "invalid" or "unverified" (red) by BibFox.
True positives and negatives: The remaining references in each category. Click here to learn why a "warning" flag counts as a correct result in our benchmark.
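
As a rough illustration, the definitions above can be written as a small mapping from a reference's ground truth and its flag to a confusion-matrix cell. This is a sketch under stated assumptions, not BibFox's actual code: the flag strings are taken from the wording of the definitions, and the function names `outcome` and `confusion_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical flag names, taken from the wording of the definitions above.
CLEARED = {"validated"}               # green
REJECTED = {"invalid", "unverified"}  # red
REVIEW = {"warning"}


def outcome(is_real: bool, flag: str) -> str:
    """Map one benchmark reference to a confusion-matrix cell.

    "Positive" means the reference is real.
    """
    if flag not in CLEARED | REJECTED | REVIEW:
        raise ValueError(f"unknown flag: {flag!r}")
    if is_real:
        # A real reference is a miss only when it is rejected outright.
        return "FN" if flag in REJECTED else "TP"
    # A fabricated reference is a miss only when it is cleared.
    return "FP" if flag in CLEARED else "TN"


def confusion_counts(rows):
    """Tally cells for an iterable of (is_real, flag) pairs."""
    counts = Counter(outcome(is_real, flag) for is_real, flag in rows)
    return {cell: counts.get(cell, 0) for cell in ("TP", "FN", "FP", "TN")}
```

Under these definitions a "warning" lands in a correct cell for both real and fabricated references, because only an outright rejection of a real reference or an outright clearance of a fake one counts as an error.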

The benchmark run used the High strictness preset. Find out more about the effect of different strictness settings here.

Results: Confusion matrix reveals only one false positive

Confusion matrix for BibFox on the 636-reference benchmark: 432 true positive, 9 false negative, 1 false positive, 194 true negative.
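
The headline metrics can be recomputed from these four counts. The short Python sketch below uses the standard definitions of accuracy, precision, recall, and F1, with "positive" meaning a real reference. The variable names are ours, and we assume the reported F-score is the balanced F1 (the harmonic mean of precision and recall).

```python
# Counts from the confusion matrix above ("positive" = real reference).
tp, fn, fp, tn = 432, 9, 1, 194

total = tp + fn + fp + tn                 # all references in the benchmark
accuracy = (tp + tn) / total              # share of all calls that are correct
precision = tp / (tp + fp)                # share of "real" calls that are real
recall = tp / (tp + fn)                   # share of real references recognized
f_score = 2 * precision * recall / (precision + recall)
fake_detection = tn / (tn + fp)           # share of fabricated references caught

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F-score", f_score),
    ("Fabricated references caught", fake_detection),
]:
    print(f"{name}: {value:.1%}")
```

Rounded to one decimal place, this prints 98.4%, 99.8%, 98.0%, 98.9%, and 99.5%, which matches the figures reported in this article.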

BibFox cleared 432 genuine references correctly and caught 194 of the 195 fabricated ones. It made two kinds of mistakes:

Nine times, it flagged a real reference for review that did not need it.
Once, it cleared a fabricated reference as genuine.

Clearing a fake citation is the expensive mistake, and it happened just once in the set of 636. Over-flagging a real reference is cheaper, because you will easily catch this in manual review.

To minimize manual review effort, take a look at our strictness setting, which controls how much evidence BibFox requires before it clears a reference.

What this means in review work

Without a tool, you have two options as a reviewer: spot-check a few suspicious references, or spend hours checking the list manually. BibFox gives you a third option. It checks the whole list first, then points your attention to the references that need a closer look. Pro tip: save even more time with the right strictness setting.

BibFox is built to assist academic review and leave the judgment to you. Do you want to truly check a reference list or continue with spot-checking by instinct?

1. Walters, W.H., Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5