BibFox catches 99.5% of the hallucinated references in an independent benchmark

We ran our reference verification tool against a scientific benchmark of 636 references — real citations mixed with ones ChatGPT invented. It got 626 right.

That is 98.4% accuracy, measured on the full dataset, with nothing removed to lift the score. If you are comparing reference checkers, how the test was built matters as much as the numbers — in this article you'll find both.

Accuracy

98.4%

overall correct calls

626 / 636 correct

Precision

99.8%

of "real" calls are right

1 false positive

Recall

98.0%

of real refs validated

9 false negatives

F-score

98.9%

combined balance

precision + recall

The short version

  • BibFox reaches a precision of 99.8%, getting almost every "real" call right: 432 of the 433 references it cleared were genuine.
  • BibFox finds 99.5% of the fake citations in the benchmark: 194 of 195 fabricated references caught.
  • BibFox correctly classifies 98.4% of all references: 626 of 636 references in the benchmark.
  • The test data is external and peer-reviewed, and the benchmark used every reference in it.

See how BibFox handles your own sources and verify a reference list.

Check your own reference list

Run BibFox on your sources and review the evidence yourself.

Verify a reference list

Method: using a scientific and published dataset

The benchmark uses a dataset that BibFox did not build and was not tuned on. Walters and Wilder (2023) prompted ChatGPT for citations and published the resulting 636 references in Scientific Reports, a Nature Portfolio journal [1]. Some point to genuine papers; some were fabricated by the model.

We ran all 636. That matters more than it sounds. A reference checker's score depends heavily on which references it is tested against. Drop the entries that are hard to verify automatically — old books, dead URLs, sources outside the major scholarly databases — and the score climbs on its own. We removed none of them. The hard references are exactly the ones that decide whether a tool earns its place in your workflow, so they stayed in.
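To make the selection effect concrete, here is a small sketch with invented numbers for a hypothetical checker. None of the figures below are BibFox results or come from the Walters and Wilder dataset; they only illustrate how removing hard references can lift a score without the tool improving.

```python
def accuracy(correct: int, total: int) -> float:
    """Share of references classified correctly."""
    return correct / total

# Invented numbers for a hypothetical checker, NOT BibFox results.
total_refs = 100      # full test set
errors_total = 10     # wrong calls on the full set
hard_refs = 12        # entries that are hard to verify automatically
errors_on_hard = 8    # wrong calls that fall on those hard entries

# Score on the full set.
full = accuracy(total_refs - errors_total, total_refs)

# Score after quietly dropping the hard entries.
filtered_total = total_refs - hard_refs
filtered_errors = errors_total - errors_on_hard
filtered = accuracy(filtered_total - filtered_errors, filtered_total)

print(f"full set:     {full:.1%}")      # 90.0%
print(f"hard removed: {filtered:.1%}")  # 97.7%
```

The checker is identical in both lines; only the test set changed.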

For the statistical evaluation, we used the following definitions:

  • False positives: fabricated references flagged as "validated" (green) by BibFox.
  • False negatives: genuine references flagged as "invalid" or "unverified" (red) by BibFox.
  • True positives and true negatives: the remaining references in each category. A "warning" flag counts as a correct result in our benchmark.
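The definitions above can be sketched as a small classification rule. This is an illustration only: the flag strings, the function names, and the sample data are assumptions made for this example, not BibFox's actual output format or API.

```python
from collections import Counter

# Hypothetical flag labels; the identifiers BibFox really emits are not given here.
CLEARED = {"validated"}               # green
REJECTED = {"invalid", "unverified"}  # red
# Any other flag (for example "warning") counts as a correct result.

def outcome(is_real: bool, flag: str) -> str:
    """Map one benchmark reference to a confusion-matrix cell."""
    if is_real:
        # A genuine reference is a miss only when it is flagged red.
        return "FN" if flag in REJECTED else "TP"
    # A fabricated reference is a miss only when it is cleared green.
    return "FP" if flag in CLEARED else "TN"

def confusion_counts(labelled):
    """Tally outcomes for an iterable of (is_real, flag) pairs."""
    return Counter(outcome(is_real, flag) for is_real, flag in labelled)

# Made-up sample, one reference per possible combination.
sample = [
    (True, "validated"),
    (True, "warning"),
    (True, "unverified"),
    (False, "invalid"),
    (False, "warning"),
    (False, "validated"),
]
counts = confusion_counts(sample)
print(dict(counts))
```

Under this rule a "warning" never produces an error: on a genuine reference it lands in the true positives, on a fabricated one in the true negatives.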

The benchmark run used the High strictness preset. Results with other strictness settings may differ.

Results: Confusion matrix reveals only one false positive

Confusion matrix for BibFox on the 636-reference benchmark: 432 true positive, 9 false negative, 1 false positive, 194 true negative.

BibFox cleared 432 genuine references correctly and caught 194 of the 195 fabricated ones. It made two kinds of mistakes:

  • Nine times it flagged a real reference for review that did not need it.
  • Only once did it clear a fabricated reference as genuine.
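The headline figures can be recomputed from the four confusion-matrix counts. The sketch below uses the standard textbook formulas and assumes the reported F-score is the F1 score (the harmonic mean of precision and recall), which matches the 98.9% shown. The function and key names are chosen for this example.

```python
def benchmark_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)  # share of "real" calls that are right
    recall = tp / (tp + fn)     # share of genuine references validated
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # Assumption: F-score means F1, the harmonic mean of the two.
        "f_score": 2 * precision * recall / (precision + recall),
        # Share of fabricated references that were caught.
        "fake_detection_rate": tn / (tn + fp),
    }

# Counts from the benchmark: 432 TP, 9 FN, 1 FP, 194 TN.
metrics = benchmark_metrics(tp=432, fn=9, fp=1, tn=194)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 98.4%
# precision: 99.8%
# recall: 98.0%
# f_score: 98.9%
# fake_detection_rate: 99.5%
```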

Clearing a fake citation is the expensive mistake, and it happened just once among the 195 fabricated references. Over-flagging a real reference is cheaper, because you will catch it quickly in manual review.

To minimize manual review effort, take a look at our strictness setting, which controls how much evidence BibFox requires before it clears a reference.

What this means in review work

Without a tool, you have two options as a reviewer: spot-check a few suspicious references, or spend hours checking the list manually. BibFox gives you a third option. It checks the whole list first, then points your attention to the references that need a closer look. Pro tip: save even more time with the right strictness setting.

BibFox is built to assist academic review and leave the judgment to you. Do you want to check a reference list thoroughly, or keep spot-checking by instinct?

1. Walters, W.H., Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5