Documentation · 2026-05-21

Why a "warning" flag counts as a correct result in our benchmark

When BibFox flags a real reference with an orange warning, it looks like a mistake. The paper exists; you can open it.

Yet in our benchmark, every one of those warnings counted as a correct call.

That is not a scoring trick. A warning is correct because of what BibFox is built to do: surface a discrepancy and route the reference to you, rather than deliver the verdict itself. A warning that triggers the right human check did its job. Only a clear green or red verdict can be a benchmark error.

This post explains the scoring rule behind the 98.4% accuracy figure, and what an orange pile actually tells you about your own reference list.

How the benchmark scores each label:
For a real reference: validated is a true positive, warning is a true positive, invalid is a false negative.
For a fabricated reference: validated is a false positive, warning is a true negative, invalid is a true negative.
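The scoring rule above can be sketched as a lookup table. This is only an illustration of the rule as stated in this post; the names SCORING, score, and tally are made up for the example and are not part of BibFox, and the sample input is not benchmark data.

```python
from collections import Counter

# Scoring rule from the text: (ground truth, BibFox label) -> benchmark outcome.
SCORING = {
    ("real", "validated"): "TP",
    ("real", "warning"): "TP",
    ("real", "invalid"): "FN",
    ("fabricated", "validated"): "FP",
    ("fabricated", "warning"): "TN",
    ("fabricated", "invalid"): "TN",
}


def score(truth: str, label: str) -> str:
    """Return the benchmark outcome for a single reference."""
    return SCORING[(truth, label)]


def tally(results):
    """Count outcomes over an iterable of (truth, label) pairs."""
    return Counter(score(truth, label) for truth, label in results)


# Illustrative input only, not benchmark data.
sample = [
    ("real", "validated"),
    ("real", "warning"),
    ("fabricated", "warning"),
    ("fabricated", "validated"),
]
counts = tally(sample)
print(dict(counts))
```

Note that a warning never maps to FP or FN: in this table only "validated" and "invalid" can produce an error.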

The short version

A warning is scored as correct because it does the one job it exists for: it surfaces a discrepancy and routes the reference to a human instead of guessing.
For a fabricated reference, a warning is a true negative. BibFox flagged it, the manual check confirmed the source is fake, and the fabrication never passed as validated.
For a genuine reference, a warning is still a true positive. The source exists, and the flagged discrepancy is real, just not always material.
Only green ("validated") and red ("invalid" or "unverified") verdicts can be counted as errors. A warning makes no claim the benchmark can prove wrong.
Pro tip: Whether a warning becomes annoying depends on what you are checking. The strictness setting controls how many you see.

Jump to:

A warning is the tool working, not failing
Why a warning counts as correct for fake and real references
Why only a clear verdict can be wrong
What to do when warnings become annoying

Check your own reference list

Run BibFox on your sources and review the evidence yourself.

Verify a reference list

A warning is the tool working, not failing

A warning is the result BibFox produces on purpose when it finds a discrepancy it should not resolve on its own. BibFox checks each reference against scholarly databases and the web. When the citation and the source record line up, it returns green. When the reference cannot be matched to any real source, it returns red.

A warning is the third outcome. BibFox found a difference between your citation and the record, and that difference is neither large enough to call the source fabricated nor small enough to call the citation clean.

At that point BibFox stops, and the stop is deliberate. It surfaces the discrepancy, shows you the evidence behind it, and routes the reference to you. It marks the difference; it does not adjudicate it. BibFox cannot reliably know which side of a discrepancy is right, the database record or your citation, because both are fallible. Scholarly databases are maintained imperfectly. Citations copied from a download carry their own errors.

So a warning is BibFox reaching the right conclusion for that reference: this is a human's call. A benchmark that scored that as an error would penalize the tool for doing exactly what it was built to do.

Why a warning counts as correct for fake and real references

A warning lands on two very different references, a fabrication and a genuine source, and the benchmark counts it as correct in both cases. The reasoning is not the same, so take them one at a time.

A fabricated reference with a warning is a true negative

This is the straightforward case. The benchmark set contains 195 references that ChatGPT invented. When BibFox puts a warning on one of them, the warning triggers a manual check, and the check confirms what the warning suspected: the source does not exist.

That is the task completed. The job was to stop a fabricated citation from passing as a real one, and it was stopped. The fabrication never reached a green "validated" label, which is the only outcome that would have let it through. It counts as a true negative: a fake reference correctly kept out of the validated pile.

A genuine reference with a warning is still a true positive

This is the case that looks wrong. The reference is real. The paper exists. BibFox put an orange warning on it anyway. The intuition is immediate: it should have been green, so the warning must be a misclassification, and a misclassification on a real reference should count against the tool.

We encourage you to look closer at the data and challenge your intuition. The benchmark set comes from Walters and Wilder (2023),¹ and the study is titled, precisely, Fabrication and errors in the bibliographic citations generated by ChatGPT. Errors, not only fabrications. In that set, 33% of the genuine references are real sources cited imperfectly: a wrong author rendering, a mismatched journal, an off page range.

When BibFox flags one of those, it has detected a real discrepancy. Take the shape of a typical one: the paper is genuine and indexed, but the citation lists the author as "F. Kalter" where the record has "Kalter Frank", and gives the pages as 112–119 where the journal has 112–121. Both differences are real. Neither makes the source fake.

The source exists, so the reference is not red. The citation does not match cleanly, so it is not green. A warning is the accurate label. The reference was not wrongly rejected — it counts as a true positive.

Why only a clear verdict can be wrong

Only green and red feed the error counts, because only a clear verdict can be clearly wrong. A warning makes no claim the benchmark can falsify.

Start from how BibFox works. It compares your citation to a source record field by field. Algorithmically, some discrepancy is almost always present: a name formatted differently, or a missing DOI. Detecting a discrepancy is therefore not interesting on its own — it is the normal case. The real question is never whether a discrepancy exists. It is whether the discrepancy is material: does it mean the source was fabricated, or is it a swapped first and last name, or a wrong page number?
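A field-by-field comparison can be sketched in a few lines. This is a minimal illustration and not BibFox's actual matching logic: the function compare_fields, the Discrepancy type, and the dictionary layout are assumptions, and the title value is a placeholder. The author and page values follow the "F. Kalter" example used earlier in this post.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    name: str       # which field differs
    cited: str      # value in the citation ("" if missing)
    recorded: str   # value in the source record ("" if missing)


def compare_fields(citation: dict, record: dict) -> list:
    """Return one Discrepancy per field that differs or is missing on one side."""
    found = []
    for name in sorted(set(citation) | set(record)):
        cited = citation.get(name, "")
        recorded = record.get(name, "")
        # Ignore case and surrounding whitespace; everything else counts as a difference.
        if cited.strip().casefold() != recorded.strip().casefold():
            found.append(Discrepancy(name, cited, recorded))
    return found


citation = {"author": "F. Kalter", "pages": "112-119", "title": "Placeholder title"}
record = {"author": "Kalter Frank", "pages": "112-121", "title": "Placeholder title"}

diffs = compare_fields(citation, record)
for d in diffs:
    print(f"{d.name}: cited {d.cited!r}, record has {d.recorded!r}")
```

The sketch only detects differences. It deliberately says nothing about whether a difference is material, which is the judgment the surrounding text describes as left to the reader.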

That question is a judgment call, and BibFox does not make it. It hands it to you. A warning is BibFox saying, in effect: I found a difference, you decide if it matters. You cannot mark that as a wrong answer, because it is not an answer. It is a correctly identified open question.

A green or red label is different. Green says the source is real and the citation checks out. Red says the source could not be verified. Those are verdicts, and a verdict can be wrong: green on a fabricated reference is a false positive, red on a genuine source is a false negative. The 636-reference run produced exactly one false positive and nine false negatives. Those ten are the only calls the benchmark could fairly count against BibFox, because they are the only ten where BibFox issued a verdict and the verdict was wrong.
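The arithmetic behind the 98.4% figure can be checked from the numbers quoted in this post. The sketch assumes the 195 fabricated references are part of the 636-reference run, which the post implies but does not state outright; the variable names are illustrative.

```python
TOTAL = 636            # references in the benchmark run
FABRICATED = 195       # references ChatGPT invented (assumed to be within the 636)
FALSE_POSITIVES = 1    # fabricated references labeled "validated"
FALSE_NEGATIVES = 9    # genuine references labeled "invalid"

genuine = TOTAL - FABRICATED
true_negatives = FABRICATED - FALSE_POSITIVES   # fabricated, kept out of the validated pile
true_positives = genuine - FALSE_NEGATIVES      # genuine, labeled validated or warning

accuracy = (true_positives + true_negatives) / TOTAL
print(f"correct calls: {true_positives + true_negatives} of {TOTAL}")
print(f"accuracy = {accuracy:.1%}")  # accuracy = 98.4%
```

Under that assumption, 626 of 636 calls are correct, which rounds to the 98.4% reported above.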

This is why counting warnings as correct is the only honest scoring. A warning is BibFox declining to adjudicate, and "a human should decide this one" cannot be graded as a mistake.

What to do when warnings become annoying

A warning being correct and a warning being useful are two different things. The benchmark settles the first. The second depends on your use case.

Take two examples at opposite ends. In the first, you are polishing your own bibliography before submission. You want every author name spelled right and every page number exact. Here a warning on a swapped name or an off page range is precisely what you want: it points you at a small error in your own list while you can still fix it. The labels around it are right too: the reference should not be green, because your citation is not clean, and not red, because the source genuinely exists.

In the second, you are peer-reviewing a colleague's manuscript, or checking the reference list of a student thesis. You may not care whether a journal title is abbreviated to the character, or whether the publisher field is exact. What you care about is whether the source exists at all, and whether the title, authors, and year hold together as a real paper. For that question, a warning on a trivial formatting difference is friction. It is a manual check you did not need.

Same warning, same correct detection, different value to you. This is what the strictness setting is for. A lower setting stops surfacing the less material discrepancies as warnings, so the orange pile shrinks to the references actually worth your attention. Save time with the right strictness setting.
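One way to picture a strictness setting is as a threshold on how material a discrepancy has to be before it is shown. The sketch below is purely hypothetical: the level names, the per-field weights, and the function visible_warnings are invented for illustration and do not describe BibFox's real setting or its values.

```python
# Hypothetical materiality weights per field (higher = more material).
MATERIALITY = {
    "title": 3,
    "authors": 3,
    "year": 3,
    "pages": 2,
    "journal_abbreviation": 1,
    "publisher": 1,
}

# Hypothetical strictness levels: minimum weight that still raises a warning.
THRESHOLD = {"strict": 1, "balanced": 2, "lenient": 3}


def visible_warnings(discrepant_fields, strictness):
    """Keep only the discrepancies that are material enough for this setting."""
    minimum = THRESHOLD[strictness]
    # Unknown fields default to the highest weight so they are never hidden.
    return [f for f in discrepant_fields if MATERIALITY.get(f, 3) >= minimum]


found = ["publisher", "pages", "title"]
for level in ("strict", "balanced", "lenient"):
    print(level, visible_warnings(found, level))
```

In this toy model the detection step is unchanged across settings; only the set of discrepancies surfaced to the reader shrinks as the setting is lowered.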

The bottom line

A warning is a correct benchmark result because it is BibFox surfacing a real discrepancy and handing the judgment to you. You cannot score "a human should decide this one" as a wrong answer. Only a clear green or red verdict can be wrong, and in the 636-reference benchmark only ten were.

Counting warnings as correct is the only honest way to score them: the question is never whether BibFox found a discrepancy, because it almost always will. The question is whether that discrepancy is material, and that call is yours to make, or yours to tune with the right strictness setting.

1. Walters, W.H., Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5