Academic study finds HTML reports wield more influence in LLMs than PDF
Generative AI (genAI) has become a central gatekeeper between companies and stakeholders. A growing number of investors use large language models (LLMs) such as ChatGPT to research companies, challenge investment ideas and analyze financial and market information. But how visible are annual reports as sources in these systems? And how can IROs improve the visibility of verified financial disclosures?
Verified information in unverified ecosystems
The shift in information usage creates new risks for IROs. LLMs can hallucinate or generate plausible but incorrect answers, especially when relying on third-party sources such as Reddit or Wikipedia, which feature prominently in genAI responses. User-generated content often mixes facts with opinions and framing.
Corporate reports, by contrast, are the most comprehensive and reliable source of information on a company’s financial and non-financial development.
Corporate reports undergo internal verification and external auditing, clearly distinguishing them from most unverified online sources. In the age of AI, listed companies have a strong interest in ensuring that this verified content is reflected in LLM-generated answers.
Visibility of annual reports in genAI
In theory, corporate reports are valuable inputs for tools such as ChatGPT. But how visible are they in practice? To address this question, USTP – University of Applied Sciences St. Pölten, HHL Leipzig Graduate School of Management and nexxar have initiated a joint research project. In a large-scale study, we analyzed more than 2,500 structured prompts across 20 publicly listed European companies, examined over 24,000 cited sources and recorded nearly five million automated requests to digital annual reports.
Some key results:
- Trustworthiness: Reports are a reliable source on finance- and ESG-related questions for genAI. Almost three fifths (58 percent) of all cited sources in reporting-related prompts refer to the annual report
- Format matters: Companies with structured, HTML-based online annual reports were cited 3.05 times more frequently than those relying primarily on PDF reports
- Accuracy: Responses about companies with HTML reports were significantly more accurate (71 percent) than responses about companies with PDF-only reports (54 percent), driven by a lower reliance on external information sources
- Content: GenAI access activity focuses on core report sections, particularly operating activities, strategic direction, financial statements, financial metrics and sustainability reporting.
How to optimize reports for LLMs?
Although LLMs can process PDFs, they show a clear preference for structured, machine-readable disclosures. HTML reports offer major advantages:
- Clear semantic structure and clean code: Headings, paragraphs, lists and sections are clearly labeled in the source code (<h1>, <h2>, <ol>, <ul> etc), so LLMs can easily identify document hierarchy and extract specific parts
- Tables are reliably structured in HTML: Table head, body, rows and columns are explicitly encoded (<thead>, <tbody>, <tr>, <td> etc), while in PDFs, tables are often just visually aligned text that must be reconstructed
- Consistent and linear text flow reduces parsing errors: Text in HTML follows the order of the DOM (Document Object Model), so it arrives in the correct reading order, while PDFs often require layout interpretation and may mix columns or split words
- Optimized for web retrieval: Web crawlers and AI systems use HTML as their standard format, making websites easier and faster to find, read and process than PDFs
- Lower processing complexity: HTML can be parsed directly, while PDFs often require specialized parsing tools, layout reconstruction or OCR (Optical Character Recognition, which converts scanned documents into readable text).
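To illustrate the points on semantic structure and tables, here is a minimal Python sketch that reads headings and table rows directly from the tags, using only the standard library. The report fragment, its figures and the names ReportParser and parse_report are hypothetical and chosen for illustration; this is not the tooling used in the study, nor a description of how any particular LLM ingests web pages.

```python
from html.parser import HTMLParser

# Hypothetical fragment of an HTML annual report; all figures are invented.
SAMPLE_REPORT = """
<h1>Annual Report</h1>
<h2>Income statement</h2>
<table>
  <thead><tr><th>Item</th><th>2024</th></tr></thead>
  <tbody>
    <tr><td>Revenue</td><td>100</td></tr>
    <tr><td>Net income</td><td>12</td></tr>
  </tbody>
</table>
"""


class ReportParser(HTMLParser):
    """Collects headings and table rows straight from the tags."""

    def __init__(self):
        super().__init__()
        self.headings = []  # (level, text) pairs in document order
        self.rows = []      # one list of cell strings per <tr>
        self._tag = None    # heading or cell tag that is currently open
        self._text = []     # text pieces of the open heading or cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("h1", "h2", "h3", "th", "td"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._text).strip()
        if tag in ("th", "td"):
            self.rows[-1].append(text)
        else:
            # "h2" -> heading level 2
            self.headings.append((int(tag[1]), text))
        self._tag = None


def parse_report(html):
    """Return (headings, rows) extracted from an HTML string."""
    parser = ReportParser()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.rows


headings, rows = parse_report(SAMPLE_REPORT)
print(headings)
print(rows)
```

The document hierarchy and the table cells come out without any layout reconstruction, because the markup already says which text is a heading and which is a cell. Recovering the same table from a PDF would typically mean inferring rows and columns from the position of text on the page.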
There are, of course, practical advantages as well. Linking to a dedicated web page (such as the income statement) is easier than linking to a 300-page report in PDF format. HTML reports can also be optimized with Generative Engine Optimization (GEO). GEO structures digital content so that genAI models can interpret and retrieve it more effectively. Modern reports may include descriptive summaries and structured metadata such as JSON-LD (JSON for Linking Data, a JSON-based format for structured data designed to help search engines understand website content).
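As a rough illustration of what such metadata can look like, the Python sketch below builds a small JSON-LD description of an annual report with schema.org vocabulary and wraps it in the script tag that web pages use to embed it. The company name, year, URL and the helper names build_report_jsonld and to_script_tag are assumptions made for this example; which properties a real report should carry depends on the company and is not prescribed by the study.

```python
import json


def build_report_jsonld(company, year, url):
    """Return a schema.org description of an annual report as a dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Report",
        "name": f"{company} Annual Report {year}",
        "url": url,
        "inLanguage": "en",
        "publisher": {"@type": "Organization", "name": company},
    }


def to_script_tag(data):
    """Wrap JSON-LD in the script element used to embed it in a page."""
    payload = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical company and URL, for illustration only.
metadata = build_report_jsonld(
    "Example AG", 2024, "https://example.com/annual-report-2024/"
)
print(to_script_tag(metadata))
```

The resulting tag would sit in the head of the report's landing page, giving crawlers a compact, machine-readable statement of what the page is and who published it.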
What this means for IROs
For IROs, AI visibility affects investor communication, reputation and misinformation risks. When verified disclosures are less accessible, third-party interpretations gain importance, reducing corporate presence. As genAI becomes a central intermediary, visibility in LLMs will be an increasingly critical challenge.
Eloy Barrantes is CEO at nexxar and associate lecturer at the USTP – University of Applied Sciences St. Pölten
