BibFox catches 99.5% of all hallucinated references
We ran our reference verification tool against a scientific benchmark of 636 references — real citations mixed with ones ChatGPT invented. It got 626 right.
That is 98.4% accuracy, measured on the full dataset, with nothing removed to lift the score. If you are comparing reference checkers, how the test was built matters as much as the numbers — in this article you'll find both.
- Accuracy: 98.4% overall correct calls (626 / 636 correct)
- Precision: 99.8% of "real" calls are right (1 false positive)
- Recall: 98.0% of real references validated (9 false negatives)
- F-score: 98.9%, the combined balance of precision and recall
The short version
- BibFox reaches a precision of 99.8%: of the 433 references it marked as "real", 432 were genuine.
- BibFox finds 99.5% of all fake citations: 194 of 195 fabricated references caught.
- BibFox correctly classifies 98.4% of all references: 626 of 636 references in the benchmark.
- The test data is external and peer-reviewed, and the benchmark used every reference in it.
See how BibFox handles your own sources and verify a reference list.
Or dive into:
- Method: using a scientific and published dataset
- Results: Confusion matrix reveals only one false positive
- What this means in review work
Check your own reference list
Run BibFox on your sources and review the evidence yourself.
Method: using a scientific and published dataset
The benchmark uses a dataset that BibFox neither built nor was tuned on. Walters and Wilder (2023) prompted ChatGPT for citations and published the resulting 636 references in Scientific Reports, a Nature Portfolio journal [1]. Some point to genuine papers; some were fabricated by the model.
We ran all 636. That matters more than it sounds. A reference checker's score depends heavily on which references it is tested against. Drop the entries that are hard to verify automatically — old books, dead URLs, sources outside the major scholarly databases — and the score climbs on its own. We removed none of them. The hard references are exactly the ones that decide whether a tool earns its place in your workflow, so they stayed in.
For the statistical evaluation, we used the following definitions:
- False positives: Fabricated references flagged as "validated" (green) by BibFox.
- False negatives: Actually real references flagged as "invalid" or "unverified" (red) by BibFox.
- True positives and true negatives: The remaining references in each category. A "warning" flag counts as a correct result in our benchmark.
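The definitions above can be sketched as a small mapping from a reference's ground truth and its BibFox flag to a confusion-matrix cell. This is an illustration only: the function name `confusion_cell` and the lowercase flag strings are assumptions made for the example, not BibFox's actual API or output format. "Positive" here means "the reference is real".

```python
def confusion_cell(is_real: bool, flag: str) -> str:
    """Map one benchmark reference to "TP", "FP", "FN", or "TN".

    The flag strings are hypothetical labels for the statuses named
    in the article: "validated", "invalid", "unverified", "warning".
    """
    if not is_real and flag == "validated":
        return "FP"  # fabricated reference cleared as genuine
    if is_real and flag in ("invalid", "unverified"):
        return "FN"  # real reference rejected
    # Everything else, including "warning", counts as a correct result.
    return "TP" if is_real else "TN"
```

Under this reading, a fabricated reference with a "warning" flag lands in the true-negative cell, and a real reference with a "warning" flag lands in the true-positive cell.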
The benchmark run used the High strictness preset. Find out more about the effect of different strictness settings here.
Results: Confusion matrix reveals only one false positive

BibFox cleared 432 genuine references correctly and caught 194 of the 195 fabricated ones. It made two kinds of mistakes:
- Nine times it flagged a real reference for review that did not need it.
- Once it cleared a fabricated reference as genuine.
Clearing a fake citation is the expensive mistake, and it happened just once in the set of 636. Over-flagging a real reference is cheaper, because you will easily catch this in manual review.
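The headline figures follow directly from these four counts. As a minimal sketch, assuming the standard textbook definitions of accuracy, precision, recall, and F-score (the variable names are ours, and "positive" means "real reference"), the arithmetic can be reproduced like this:

```python
# Counts reported in this article ("positive" = real reference).
tp = 432  # real references cleared correctly
fp = 1    # fabricated reference cleared as genuine
fn = 9    # real references flagged without need
tn = 194  # fabricated references caught

total = tp + fp + fn + tn                      # 636 references
accuracy = (tp + tn) / total                   # share of correct calls
precision = tp / (tp + fp)                     # share of "real" calls that are right
recall = tp / (tp + fn)                        # share of real references cleared
f_score = 2 * precision * recall / (precision + recall)
fake_catch_rate = tn / (tn + fp)               # share of fabricated references caught

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F-score", f_score),
    ("Fabricated references caught", fake_catch_rate),
]:
    print(f"{name}: {value:.1%}")
```

Rounded to one decimal place, this gives 98.4%, 99.8%, 98.0%, 98.9%, and 99.5%, matching the figures reported above.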
To minimize manual review effort, take a look at our strictness setting: It controls how much evidence BibFox requires before it clears a reference.
What this means in review work
Without a tool, you have two options as a reviewer: spot-check a few suspicious references, or spend hours checking the list manually. BibFox gives you a third option. It checks the whole list first, then points your attention to the references that need a closer look. Pro tip: save even more time with the right strictness setting.
BibFox is built to assist academic review and leave the judgment to you. Do you want to truly check a reference list or continue with spot-checking by instinct?
1. Walters, W.H., Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5