Imagine an AI chatbot confidently telling an investor that a company's revenue jumped 40% last quarter when, in fact, it fell. The correct data existed on the internet, but the model retrieved its answer from an unreliable source, misread inconsistent labels, or grabbed the wrong reporting period. This isn't a thought experiment. As AI becomes embedded in investment decisions, the quality of the underlying data matters more than ever.
For nearly two decades, I have focused on open public datasets and data standardisation, joining the XBRL community in 2006 to pursue the promise that structured, digital reporting would finally deliver high-quality data that users could trust and analyse at scale. That promise is now more important than ever with the emergence of natural language AI interfaces, where the investor can ask any question they want, such as the one above.
Working at Oracle on data warehousing and business intelligence, I saw first-hand how difficult it was to turn inconsistent data into meaningful insight. The classic promises of Business Intelligence (BI), 'turning data into information' and 'helping users make better decisions', consistently fell apart when the underlying data was fragmented and used different labels and definitions. I came to realize that standardizing data at source and validating quality is a prerequisite for BI systems to deliver real value.
For AI chatbots, data quality, data structure, and the authority of the original publisher are even more critical to producing reliable analysis than they are for traditional BI systems built on tightly defined datasets.
Unfortunately, investment in data collection systems is being questioned like never before, while the benefits of data standardisation remain poorly understood. The result is too many fragmented implementations, driven by local interests insisting on exceptions that ultimately erode trust in the data itself.
The Quiet Crisis in Public Data
One claim commonly used in calls to scale back data collection frameworks, or to delay new ones, is that digital reporting standards, ones that require some form of consistent data tagging and data validation, increase compliance costs for companies and therefore harm global competitiveness and growth.
Meanwhile, the costs of not collecting and standardizing data for analysis remain hidden. They can be found in the wasted time of numerous users struggling to find and pull data together, and in the systems built to transform data from multiple sources so they can be integrated and analysed together.
My own experience suggests that well-structured, standardised public data is not a burden; it is part of the important infrastructure that helps information flow in the so-called 'age of information'. In fact, the potential benefits of standardising open datasets are huge and can promote growth through better allocation of economic assets.
Open Data Sets are Underfunded
Alan Smith, who worked for the UK's Office for National Statistics, wrote an article for the Financial Times (August 21 2025: The challenge of restoring credibility to the nation's statistics) that caught my attention. He started by saying that "2025 had been somewhat of an annus horribilis for the nation's number cruncher."
He also noted that trust in key UK statistics was at an all-time low and that several data series have been suspended or delayed. Much of this he attributed to cuts in government budgets, but also to a lack of awareness of the important value such datasets have. Alan also noted that investment in roads and hospitals is easy to see, while the value of saving analysts time and money, and of demonstrating the impact on growth, is not.
The UK is not alone. This 'crisis' is mirrored in many developed countries, with growing concerns over the quality of data leaving economic policymakers and business leaders 'flying blind'. How can you be 'data-driven' when the data is not available?
AI chatbots that retrieve data and assemble answers in a 'black box' manner introduce the new issue identified above: is the data correct or not? Without authoritative sources and well-structured data, AI agents could mislead analysts, leading companies, investors, and governments to make costly wrong decisions.
Company Finance and Sustainability Reporting: A Use Case
Underfunding is the start of the story; lack of knowledge and implementation failures compound the problem. European company financial and sustainability reporting are good examples of the key challenges.
Focus on the Wrong Stakeholders
The European Single Electronic Format (ESEF) and its UK equivalent have been successful in delivering thousands of digital reports, which can be read by investors and analysed by computer systems (using inline XBRL, per earlier articles).
However, ESMA has primarily sought feedback from local country business registers (EU politics). Only a few of these have well-structured collection systems of their own, and many want to protect their local role in collecting and publishing the data, so as to keep their relationships with significant local companies, accounting bodies, and auditors.
ESMA has only slowly (…maybe reluctantly) engaged with the submitting companies, auditors, investors, and technical data experts to help resolve the many issues that have arisen. The result is a historical set of documents littered with errors.
New Data Sets: Corporate Sustainability Reporting Directive (CSRD)
Establishing new public datasets is always a huge challenge. The initial scope and detail of data collection requirements can make or break a project: too high-level and the analysis is thin; too detailed and submitting firms find the requests costly and overwhelming.
The EU's Corporate Sustainability Reporting Directive (CSRD) was launched with great ambition and widespread support. However, like many Business Intelligence systems I have seen, the taxonomy attempted to codify the complete regulatory requirement set from the outset (… as directed by EU authorities). The new XBRL taxonomy was an avalanche of detailed specifications and complex rules, which predictably triggered a backlash.
The EU Parliament asked for a reassessment (EU Omnibus), leading to simplification: reducing mandated reporting to the largest companies; cutting the required data points to about a third; and pushing back deadlines for other companies by two years.
In hindsight, BI specialists would have recommended the 'THINK BIG, start small' strategy, i.e., starting with a smaller data slice from broader contributors, then evolving over time to meet developing requirements. This approach would have served both policymakers and companies better.
Local Interest Groups Muddy the Waters
Practices change slowly and cautiously in accounting, and pockets of resistance remain in the community to the changes forced upon it by the move to pan-European digital systems. For example, the Deutsches Aktieninstitut (German Stock Institute) has called for the European Securities and Markets Authority (ESMA) to remove the digital iXBRL reporting format completely. It claims that iXBRL imposes considerable expenses and complexity on companies without providing corresponding benefits to investors.
The question is: what would they replace it with? Perhaps they would prefer to continue with unstructured PDFs (also known as 'paper under glass') or go back further to pen and paper? What would be the net impact on the total system cost, from collection to analysis? Would it really be cheaper, or would it just move greater costs along the information supply chain to the analysis stage? Plus, what would be the impact on the 'great' EU idea of a single united capital market to help drive growth?
Europe is also still waiting for the European Single Access Point (ESAP), which would provide access to these reports in a single place, rather than investors having to root them out across 28 different local country systems or thousands of private company websites. The delivery is mired in discussions on its implementation with the local business registers identified above.
As we move to an era of AI search and analysis, can we afford to continue making these mistakes when errors and missing data are amplified by AI's confident delivery of incorrect information?
Is Public Data Ready for AI?
Much of the focus around AI is on the general language intelligence (interpretation of the question, and answer generation) that Large Language Models (LLMs) possess. This is important for natural language interfaces and the flowing interaction that users want with conversational interfaces. However, in practice, the success of an AI application in answering a specific question depends far more on the underlying data structure, format, and trustworthiness of its source.
Ask a question, and an AI chatbot must identify which sources to check in order to assemble an answer. It will come up with a priority list based on some filtering algorithm. The chatbot will prioritize content that is already summarized and easily accessible, and ignore data that is costly to search for. AI does not ask why; it simply computes probabilities and selects the best options.
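As a rough illustration of this behaviour, the sketch below scores candidate sources by authority, accessibility, and freshness, the kind of weighting a retrieval pipeline might apply. The field names and weights are hypothetical, not taken from any specific chatbot.

```python
# Hypothetical sketch of how a retrieval pipeline might rank sources.
# Fields and weights are illustrative only, not from any real system.
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    authority: float      # 0-1: is this the original publisher or a re-aggregator?
    accessibility: float  # 0-1: how cheaply can the content be fetched and parsed?
    freshness: float      # 0-1: how recent is the content?

def rank_sources(sources: list[Source]) -> list[Source]:
    # Note the risk: easy-to-parse summaries can outrank the authoritative filing
    # if accessibility is weighted more heavily than authority.
    score = lambda s: 0.3 * s.authority + 0.5 * s.accessibility + 0.2 * s.freshness
    return sorted(sources, key=score, reverse=True)

candidates = [
    Source("Official structured filing (XBRL)", authority=1.0, accessibility=0.4, freshness=0.9),
    Source("Third-party summary blog",          authority=0.3, accessibility=1.0, freshness=0.8),
]
print([s.name for s in rank_sources(candidates)])
```

With these illustrative weights, the easily parsed summary outranks the authoritative filing, which is exactly the failure mode described above.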
Based upon this data, AI chatbots can churn out new KPIs or custom aggregates, transform the data, and reorganise it into new formats and structures. However, how does the user know that AI agents are finding the 'right' data rather than misinformation, or worse, fake data that leads to sophisticated scams?
The key issue here for analysts using open public data is that AI is very good at sounding confident even when it is wrong.
The Hallucination Challenge
In traditional BI dashboards, it is often easy to see and check for incorrect data, given their structured approach. If a systematic error is found, the computer code can be checked and fixed: a painful but well-understood route.
AI chatbots effectively 'grow' code as they evaluate a question and cannot always reveal the query they have written, even when asked (intermediate code steps are discarded). AI agents can also interoperate in ways that we cannot predict.
The result is plausible but incorrect output, what marketers euphemistically call 'hallucinations', arising from AI's probabilistic nature. Users can address this through detailed prompt engineering, but that only underscores the need to rethink public datasets themselves as the guardrails and verification layer for AI outcomes.
How do policymakers and authorities address the risks involved?
I would argue that trusted organizations need to invest more in their collection systems and in data standardisation. The critical part is for the published datasets to have clear definitions and structure, for the data to be error-free (or as near as can be), and to include metadata (data about the data) that helps AI agents find the right data.
The ODI Framework
The Open Data Institute (ODI) has released an AI-ready framework, which identifies four critical areas for 'AI Ready' data:
- Qualities of the data — Compliance with data standards, and use of appropriate file formats that are easily machine-readable and interoperable.
- Metadata (Context) — The data model should detail the data structure, types, and constraints, plus the data lineage to track its origin and transformations.
- Infrastructure — The physical and software infrastructure must be designed for accessibility, scalability, and control.
- Proactive data governance — Updating and automating governance policies.
These types of systems do need additional investment to set up and maintain, but they also need a different mindset. They need a global view of the application scope, one which recognises that every local deviation in data standards and definitions increases global costs and reduces comparability, and that data about the data (system or semantic metadata) is just as important as the data itself (see the sketch below).
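To make the ODI points above concrete, here is a minimal sketch of what 'data about the data' could look like when published alongside an open dataset. The field names are hypothetical, chosen for illustration rather than taken from the ODI framework or any particular metadata standard.

```python
# Hypothetical, simplified metadata record published alongside an open dataset.
# Field names are illustrative; a real implementation would follow an agreed
# metadata standard so that AI agents and BI tools can rely on it.
dataset_metadata = {
    "title": "Listed company annual financial statements",
    "publisher": "National business register (authoritative source)",
    "licence": "open-government-licence",
    "schema": {
        "Revenue": {"type": "monetary", "unit": "ISO 4217 currency", "period": "duration"},
        "CashAndCashEquivalents": {"type": "monetary", "unit": "ISO 4217 currency", "period": "instant"},
    },
    "lineage": [
        {"step": "collected", "from": "company XBRL filing", "validated": True},
        {"step": "published", "transformations": "none"},
    ],
    "quality": {"validation_rules_passed": True, "known_issues": []},
}

def is_ai_ready(meta: dict) -> bool:
    """Crude illustrative check: does the record carry the context an AI agent needs?"""
    required = ("publisher", "licence", "schema", "lineage", "quality")
    return all(key in meta and meta[key] for key in required)

print(is_ai_ready(dataset_metadata))  # True
```

The point of the sketch is that the definitions, lineage, and quality signals travel with the data, so a downstream AI agent does not have to guess at them.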
Why XBRL Makes AI Smarter
There have been numerous articles and posts on why XBRL provides a platform for AI analysis of company financial and sustainability reports. The reasons are straightforward: structured data includes context and semantic information and links this to a wider knowledge base.
Consider the fundamental difference in approach:
- Company analysts learn accounting from the ground up, building a broad model of how concepts relate, and have specific knowledge of how this is applied to their company's financial data. Auditors also have deep background knowledge that lets them spot when something looks wrong, such as when a report diverges from typical patterns.
- AI systems use probability and statistics to identify specific numbers and language patterns in analysing a company report, assigning meaning based on the context provided by the question and general training data.
This statistical approach is powerful but fundamentally different, and without trustworthy, structured, semantic data such as XBRL, it is unreliable. Company analysts may have holes in their knowledge… but they can always ask an AI chatbot.
The natural and confident response of AI chatbots to questions such as the one at the beginning of this article has tempted non-technical users to believe that AI doesn't need data tagging and standardization, a dangerous misconception.
Recent research by XBRL US, using large volumes of filings, has shown that AI systems perform significantly better when they are trained on, or use, the models provided by XBRL taxonomies to understand reports, rather than raw text or HTML.
An XBRL taxonomy defines what each reported fact represents, its data type, its relationships to other facts, and the rules that govern its use. For example, 'Cash' is explicitly defined as a current asset, and 'Revenue' belongs on the income statement. The units used, context (entity, period, dimensions), and precision are unambiguous.
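As a simplified illustration of what tagging adds, the sketch below contrasts a bare number scraped from text with a fact carrying the kind of context an XBRL report provides (concept, entity, period, unit). It is loosely modelled on the xBRL-JSON idea but is not a conformant document, and the entity identifier is hypothetical.

```python
# What an AI system gets from scraped text: a number with no reliable context.
scraped = "Revenue 1,250"   # millions? thousands? which period? which entity?

# What structured tagging provides: the same number with explicit semantics.
# Loosely modelled on the xBRL-JSON (OIM) idea; simplified and not conformant.
tagged_fact = {
    "value": "1250000000",
    "dimensions": {
        "concept": "ifrs-full:Revenue",         # defined in the IFRS taxonomy
        "entity": "lei:529900EXAMPLE0000000",   # hypothetical LEI
        "period": "2024-01-01/2024-12-31",
        "unit": "iso4217:EUR",
    },
    "decimals": -6,                             # stated precision (rounded to millions)
}

# A downstream check becomes trivial once the context is explicit:
assert tagged_fact["dimensions"]["unit"] == "iso4217:EUR"
assert tagged_fact["dimensions"]["concept"].endswith(":Revenue")
```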
The semantic context provided by structured, tagged data is exactly what AI needs to reason reliably. Without reliable, structured sources, AI systems fall back on scraping text, interpreting inconsistent labels, or relying on third-party aggregations of unknown provenance. That opens the door to misinformation, biased results, and sophisticated errors that are extremely difficult to detect.
XBRL is also interesting from another angle: it is a direct connection to what the company wanted to report. XBRL tags can therefore tell us useful information about the company's intentions and approach, better than any general data provider can. However, data and tagging errors in the historical reports, such as those for ESEF, significantly reduce the value of the dataset, which is why it is so important to fix issues in the reporting framework as soon as possible.
Signs of Convergence
Despite these challenges, encouraging developments suggest the tide is turning. Institutions are recognizing that structured data isn't optional; it is part of essential infrastructure, like roads and bridges.
Industry Recognition
Global bodies such as IOSCO have explicitly called for machine-readable disclosures to improve market efficiency. Sustainability initiatives increasingly emphasize cross-border comparability, such as the ISSB's proposed 'data passporting', under which jurisdictions can accept reports using standard base taxonomies, reducing duplication and fragmentation.
Meanwhile, new data toolsets are emerging. Snowflake's Open Semantic Interchange (OSI) aims to standardize how semantic models are shared across platforms, recognizing that standardizing semantic metadata, not just the data, is critical for interoperability across both BI and AI tools.
XBRL Evolution and AI Adaptation
XBRL is not a silver bullet, but it already provides much of the semantic foundation AI systems need in company financial reporting. The real challenge now is for the XBRL standard to develop an open specification that also fits with modern data architectures and evolving semantic standards.
The XBRL community understands this, and the proposed Open Information Model (OIM) update aims to simplify and modernize how XBRL data is expressed, making it easier to integrate into contemporary data and AI platforms. It is now time to deliver on this promise.
Greater exposure to AI also means that AI tools themselves are being adapted. For example, financial services applications are moving beyond text-only language models toward multimodal systems that integrate company reports, financial data, and other media simultaneously, mirroring how human analysts work, according to Ben Lorica of Gradient Flow.
More importantly, the industry is implementing 'White Box' verification frameworks that use LLMs as auditors, validating numerical claims against primary documents. These developments enable AI systems to process multiple knowledge sets simultaneously, verify claims against source documents, maintain audit trails for compliance, and operate securely within institutional boundaries, directly addressing the hallucination risks that could undermine investment decisions.
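As a rough sketch of the verification idea, not of any specific vendor framework, the code below checks a numerical claim extracted from a chatbot answer against the tagged fact in the primary filing before the answer is released. The function name, data layout, and tolerance are all illustrative assumptions.

```python
# Illustrative 'white box' style check: validate a numerical claim from an AI answer
# against the tagged fact in the primary filing before releasing the answer.
# All names and tolerances here are hypothetical.

def verify_claim(claimed_value: float, source_facts: dict, concept: str,
                 period: str, tolerance: float = 0.005) -> bool:
    """Return True only if the claim matches the authoritative tagged fact."""
    fact = source_facts.get((concept, period))
    if fact is None:
        return False                      # no authoritative fact: do not trust the claim
    reported = float(fact["value"])
    return abs(claimed_value - reported) <= abs(reported) * tolerance

# Tagged facts pulled from the company's structured filing (simplified).
facts = {
    ("ifrs-full:Revenue", "2024"): {"value": "1250000000", "unit": "iso4217:EUR"},
    ("ifrs-full:Revenue", "2023"): {"value": "1450000000", "unit": "iso4217:EUR"},
}

# The chatbot claims revenue 'jumped 40%'; the filing shows it actually fell.
claim_2024 = 1.40 * 1450000000
print(verify_claim(claim_2024, facts, "ifrs-full:Revenue", "2024"))  # False: block or flag the answer
```

Note that this kind of check only works because the filing carries unambiguous, machine-readable facts to verify against.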
However, these sophisticated AI systems still depend on the quality and structure of underlying data. Without standardized, semantic-rich sources like XBRL, even advanced verification frameworks struggle to ensure accuracy.
Conclusion
Public company data is one of the most valuable datasets we possess for driving growth. When prepared properly, it supports better investment decisions, more effective regulation, and improved public policy.
The lesson is not to collect more data, but to collect the right data, once, using shared standards, and to treat that data as essential infrastructure that needs public investment.
Of course, the challenge is not just technological. It is also organizational and political, such as aligning incentives, reducing data framework fragmentation, and investing in the unglamorous work of data standardization.
I would make some specific recommendations:
Regulators: Deliver AI-ready datasets like ESAP now; every delay multiplies costs as analysts waste time searching for data. Resist local interests that lead to incomparable data. Once established, data repositories like ESAP should enable 'crowdsourcing' to root out errors.
Companies: Demand software built on digital-first principles that fully supports XBRL and is capable of exporting data to multiple frameworks from a single source. These systems exist (see earlier article); it is simply a question of when you decide to move away from PDF conversion tools.
Technologists: Contribute to semantic layer standardization through initiatives like the XBRL community's Open Information Model (OIM) project or Snowflake's Open Semantic Interchange (OSI).
The choice is stark: invest in proper data infrastructure now, or watch AI confidently mislead us at scale. XBRL and similar data standards offer a proven path forward, if we have the collective will to follow it to its logical conclusion and provide AI with 'context' through a standardised semantic layer.
We can all understand that AI is only as good as the data it finds, and right now we are not giving it the foundation it needs.
The author is Martin DeVille of AM2 Limited
Please send comments, corrections, and any alternative ideas to mdv@am2ltd.com.