May 8, 2026
The Papers That Never Existed Are Already Shaping What We Know
Science built a system to catch human error. It was never designed for this.
Vruddhi Shah
Imagine you're a doctor making a treatment decision. You pull up a recent study — published in a reputable journal, peer-reviewed by experts, cited by three other papers. You trust it. Why wouldn't you? It went through the process.
Now imagine that one of the sources it cites to support its core claim doesn't exist. Not retracted, not disputed — never published. The authors, the journal, the DOI: all invented. Invented by an AI, inserted into the manuscript, and missed by every reviewer who read it.
This isn't a thought experiment anymore. It's already happened. More times than anyone has counted.
What GPTZero Found at the World's Most Prestigious AI Conference
In January 2026, GPTZero released a scan of all 4,841 papers accepted to NeurIPS 2025 — one of the most competitive AI research conferences in the world, where fewer than 1 in 4 submissions make it through. The result: at least 100 confirmed hallucinated citations, distributed across 51 published papers.
These weren't formatting errors or broken links. They were citations to sources that don't exist — invented authors, fabricated paper titles, fake DOIs, journals that never published those volumes. Everything generated from whole cloth by an AI writing assistant, and left in by researchers who either didn't notice or didn't check.
Each of those 51 papers had been reviewed by three to five expert researchers. Each one beat out more than 15,000 competing submissions. Each was publicly presented at one of the year's most watched academic events.
The challenge is partly one of scale. The main NeurIPS research track received 21,575 valid submissions in 2025 — up from 15,671 in 2024 and 12,343 in 2023. Even with thousands of volunteer reviewers, that volume makes deep scrutiny of every paper and its references increasingly difficult.
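To put those submission counts in proportion, here is a quick back-of-the-envelope calculation. It is only a sketch: it uses the three figures quoted above and nothing else, and the helper name `growth` is made up for this example.

```python
# Submission counts for the main NeurIPS research track, as quoted above.
submissions = {2023: 12_343, 2024: 15_671, 2025: 21_575}


def growth(old: int, new: int) -> float:
    """Fractional growth from old to new (0.25 means +25%)."""
    return (new - old) / old


yoy_2024 = growth(submissions[2023], submissions[2024])
yoy_2025 = growth(submissions[2024], submissions[2025])
two_year = growth(submissions[2023], submissions[2025])

print(f"2023 -> 2024: {yoy_2024:+.1%}")
print(f"2024 -> 2025: {yoy_2025:+.1%}")
print(f"2023 -> 2025: {two_year:+.1%}")
```

On these numbers, submissions grew by roughly 27% and then 38% year over year, or about 75% across the two years.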
The reviewers weren't negligent. They were overwhelmed. And the citations looked real.
The Three Tiers of a Convincing Fake
Not all hallucinated citations are created equal. The GPTZero analysis found they tend to fall into three distinct patterns of sophistication.
The obvious ones are easy to spot if you're looking: "Firstname Lastname and Others. A large-scale multi-agent benchmark, 2023. URL to be updated." But almost nobody reads a citation that carefully. And that's the point.
The second tier is subtler: real authors attached to papers they didn't write. Real journals credited with papers that never appeared in them. Real conference names, wrong years, wrong volumes. Plausible enough that a reviewer skimming at 11pm, reviewing their ninth paper of the week, would not pause.
The third tier is what should genuinely unsettle anyone working in research. In some cases, the model started from a real paper but made subtle changes — expanding an author's initials into a guessed first name, dropping or adding coauthors, or paraphrasing the title. The citation passes a surface check because it looks like a legitimate variant of something real. The paper it points to doesn't exist, but you'd have to actually retrieve and read the source to find out. Peer reviewers rarely do.
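To make the three tiers concrete, here is a minimal sketch of how a screening script might sort citations into them. Everything in it is an assumption for illustration: the `Citation` type, the single invented entry in `KNOWN_RECORDS` (a stand-in for a trusted bibliographic index), the placeholder patterns, and the 0.75 similarity threshold. It is not GPTZero's method, and a label other than "exact match" only means a human should retrieve the source.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Citation:
    authors: tuple[str, ...]
    title: str
    year: int


# Invented record standing in for a trusted bibliographic index.
KNOWN_RECORDS = [
    Citation(("J. Doe", "R. Roe"), "Sparse Routing for Multi-Agent Planning", 2021),
]

# Leftover template text of the kind quoted for the first tier.
PLACEHOLDERS = re.compile(r"firstname\s+lastname|url to be updated", re.IGNORECASE)


def surnames(authors):
    """Lower-cased last token of each non-empty author name."""
    return {name.split()[-1].lower() for name in authors if name.split()}


def title_similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(citation, records, threshold=0.75):
    """Label a citation; anything other than 'exact match' needs a human."""
    if PLACEHOLDERS.search(" ".join(citation.authors) + " " + citation.title):
        return "placeholder"  # tier one
    if not records:
        return "no match"
    best = max(records, key=lambda r: title_similarity(citation.title, r.title))
    score = title_similarity(citation.title, best.title)
    same_people = surnames(citation.authors) == surnames(best.authors)
    if score == 1.0 and same_people and citation.year == best.year:
        return "exact match"
    if score >= threshold:
        return "near match"  # tier three: a variant of something real
    known = set().union(*(surnames(r.authors) for r in records))
    if surnames(citation.authors) & known:
        return "known author, unknown title"  # tier two
    return "no match"


examples = [
    Citation(("Firstname Lastname",), "A large-scale multi-agent benchmark", 2023),
    Citation(("J. Doe",), "Quantum Gradient Pruning at Scale", 2022),
    Citation(("Jane Doe",), "Sparse Routing in Multi-Agent Planning", 2021),
]
for example in examples:
    print(f"{classify(example, KNOWN_RECORDS):30} <- {example.title}")
```

The third example is the uncomfortable one: a single changed word and a dropped coauthor still score as a close variant of a real record, which is exactly why a surface check passes it.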
It's Not Just One Conference
NeurIPS made headlines because someone scanned it. The actual scope is much larger.
A Nature analysis suggests tens of thousands of publications from 2025 might include invalid references generated by AI. A separate investigation of ICLR 2026 submissions found that 20% of a sample drawn from 20,000 total submissions contained at least one AI hallucination, with over 50 hallucinated citations appearing in papers that peer reviewers had already scored highly.
A 2025 study archived in PubMed Central evaluated citations produced by GPT-based systems and reported that 19.9% of AI-generated references were completely fabricated, with no traceable existence in the scholarly record. Among the remaining citations, 45.4% contained serious bibliographic errors. In effect, fewer than half of the references generated by the model (about 44%) were fully accurate and verifiable.
That's a failure rate that would disqualify any measurement instrument from clinical use. It's also the rate at which the academic world's most widely used writing tool generates its citations.
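The two percentages combine multiplicatively, because the second applies only to the references that survive the first. A short check of the arithmetic, assuming the 45.4% is measured against the non-fabricated remainder as the sentence above states (the variable names are made up for this example):

```python
fabricated = 0.199         # share of all references with no traceable source
errors_among_real = 0.454  # share of the remaining references with serious errors

real = 1 - fabricated
real_with_errors = real * errors_among_real
fully_accurate = real * (1 - errors_among_real)

print(f"fabricated:         {fabricated:.1%}")
print(f"real but erroneous: {real_with_errors:.1%}")
print(f"fully accurate:     {fully_accurate:.1%}")
```

Under that reading, about 36.4% of all references are real but erroneous, and about 43.7% are fully accurate.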
And it's getting into published work. A bibliometric study published in Frontiers in January 2026 analyzed 335 retracted AI-related papers and found that compromised peer review was the most common retraction reason, while 37.9% of retractions lacked any specific justification — meaning editors couldn't even fully articulate why the papers failed.
Medicine Is Reading These Papers Too
The reach of this problem extends well beyond computer science conferences. A longitudinal study tracking AI-generated content across all publications in JAMA Network Open found that the proportion of articles containing AI-generated text rose from zero in January 2022 to 11.3% by March 2025, a steep climb whose start maps almost exactly onto the release of ChatGPT.
In the era of LLMs, the speed and scale at which incorrect claims about a pharmaceutical product or medical procedure could be generated and disseminated, and potentially perceived as credible, greatly increase the risk of misguided healthcare decisions and the spread of misinformation. The researchers raising that concern had a historical reference point: digital misinformation during the COVID-19 pandemic directly influenced harmful behaviors, from the inappropriate use of hydroxychloroquine to reduced vaccine uptake.
The mechanism here is the same, but slower and harder to trace. A hallucinated citation in a paper about a drug interaction doesn't cause immediate harm. It sits in the literature. Other researchers cite it. It gets incorporated into systematic reviews. Those reviews inform clinical guidelines. The guidelines reach doctors. The doctors make decisions.
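That chain is a walk through a citation graph, and it can be sketched as a breadth-first search. The graph below is entirely hypothetical: the node names simply mirror the steps in the paragraph above, and `downstream` is a name invented for this example.

```python
from collections import deque

# Hypothetical citation graph: each key is cited by the works in its list.
cited_by = {
    "fabricated reference": ["paper A"],
    "paper A": ["paper B", "systematic review"],
    "paper B": ["systematic review"],
    "systematic review": ["clinical guideline"],
    "clinical guideline": [],
}


def downstream(source, graph):
    """Map every work that transitively cites `source` to its hop count."""
    distance = {}
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for citer in graph.get(node, []):
            if citer not in seen:
                seen.add(citer)
                distance[citer] = hops + 1
                queue.append((citer, hops + 1))
    return distance


affected = downstream("fabricated reference", cited_by)
for work, hops in affected.items():
    print(f"{hops} hop(s) away: {work}")
```

In this toy graph the guideline sits three hops from the fabricated reference, and nothing in the guideline's own reference list would reveal that.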
The median time between publication and retraction of an AI-related paper is 550 days. That's a year and a half of a bad paper circulating through the literature before anyone pulls it.
The Feedback Loop Nobody Wants to Name
Here's the part that makes researchers go quiet when you raise it. The papers being published today — hallucinated citations and all — will become training data for tomorrow's AI models.
The AI generates a fake citation. It passes peer review. It gets published. It's indexed. It gets scraped into training corpora. The next generation of models learns from it and treats the fabricated reference as a real part of scientific history. The model cites it again, in new papers, with greater confidence because it's seen it multiple times.
According to a study in the Journal of Advanced Research, LLMs are using material from retracted scientific papers to answer questions on their chatbot interfaces. "People are increasingly using ChatGPT or similar to summarize topics, and this shows that they risk being misled by the inclusion of retracted information."
The problem isn't just that AI is polluting the literature. It's that the literature is now training the AI. These aren't separate issues running in parallel. They're a loop — and nobody has closed it.
Why Peer Review Can't Catch This
The honest answer is that peer review was never designed to catch it.
The system was built for a different failure mode: human researchers making honest mistakes, or occasionally dishonest ones, in a world where producing a single paper required months of real work. Reviewers were supposed to check logic, methods, and framing. They were not supposed to retrieve and verify every citation, because doing so would have consumed more time than the reviewing itself.
AI has broken that assumption. A researcher can now generate a plausible-looking 20-page paper with 40 citations in an afternoon. The citations require more time to verify than the paper required to produce. Submissions to the main NeurIPS track alone grew by roughly 75% in two years, from 12,343 to 21,575, while the reviewer pool hasn't kept pace. The math doesn't work anymore.
No author within the NeurIPS dataset acknowledged using AI to generate citations, even though all four conference policies required disclosure, indicating that current policies are insufficient. The rule exists. The enforcement mechanism doesn't. You can declare a policy requiring AI disclosure and still have no way to verify whether researchers followed it.
Some journals are beginning to deploy automated citation-checking tools. GPTZero's hallucination detection is one. Others are in development. But editors observing an increase in AI-generated references note that fake citations can propagate through scientific literature, decreasing the reliability of published research, and that a coordinated effort across stakeholders is needed to combat the issue. A coordinated effort. Across stakeholders. In a publishing ecosystem with thousands of journals, no shared infrastructure, and strong financial incentives to publish more, not less.
The Trust We Assumed We Could Keep
Science doesn't work without trust. Not blind trust — the whole architecture of peer review, replication, and citation exists to make trust earned rather than assumed. But somewhere inside that architecture was a foundational assumption: that the papers being reviewed were actually written by the humans who submitted them, who had actually read the sources they cited.
That assumption is now wrong in a measurable, documented, and growing number of cases. And the disturbing part isn't the cases we've found. It's the ones we haven't.
NeurIPS got scanned because someone had a tool and decided to use it. Most conferences haven't been scanned. Most journals haven't been scanned. The 11% of JAMA Network Open articles flagged for AI-generated content represents only what detection software caught — and AI detection software is reliably unreliable at catching sophisticated generation.
The papers we found are embarrassing. The papers we haven't found are the ones informing policy, training doctors, and teaching graduate students what the literature says.
Science has survived fraud before. It has mechanisms for it: retractions, corrections, post-publication peer review. Those mechanisms assume that bad papers get identified eventually. They assume the volume of fraud is manageable.
What they don't assume is a world where one researcher with a laptop can generate a hundred plausible papers in a month, each with forty citations to sources that never existed, each passing review because the reviewers are checking nine other papers at the same time, from a pool of more than 21,000 submissions that arrived this year alone.
The question worth sitting with isn't whether this is a problem. We have the numbers now. The question is whether the institution of science can move faster than the incentives that are breaking it — and whether we'll know the difference between a paper we can trust and one that simply learned to sound like one.