Reputation in AI responses: What language models reveal and what they don't

IROs might be tempted to take AI-driven reputation data as gospel, but there are important limitations to bear in mind

In October 2025, a merchant sent a screenshot to his account manager. He had searched Google for ‘Is Unzer trustworthy?’, a seemingly harmless question of a kind that is likely asked about companies thousands of times every day. The answer he received was a Google AI Overview stating: ‘Whether Unzer is trustworthy cannot be answered definitively, as there are conflicting experiences.’

The response went on to highlight several issues from the company’s past that had long since been resolved and publicly addressed. The problem was not simply the content itself. The bigger problem was its prominence: the AI-generated answer appeared at the very top of the search results, above our own website and above media coverage that provided a more complete picture.

Many users never scroll beyond an AI summary when it appears at the top of the page. The answer effectively becomes the result. For a European payment service provider working with more than 90,000 merchants, whose business is built on trust and compliance, such a response represents a clear reputational risk.

We suddenly faced a communications challenge for which there was no established playbook. It also raised a question that communication teams increasingly need to answer: What exactly do large language models (LLMs) know about a company’s reputation?

Over the following months, we built a monitoring framework to answer that question. The most important insight was surprisingly simple: language models do not measure reputation, but reveal discursive reputation. Everything else follows from that observation.

What language models measure

Reputation is the overall perception of an organisation. It reflects how a company is seen, evaluated and categorised by others. As such, it is not something a company possesses but an external attribution.

Traditional reputation research therefore relies on structured stakeholder surveys. Frameworks such as RepTrak or the Reputation Quotient satisfy the methodological requirements of empirical research: objectivity, reliability and validity.

Language models satisfy none of these criteria. Their responses depend on prompts, context, language and system settings. The same question may produce different answers, and seemingly precise reputation scores often vary substantially between identical queries. These outputs are not measurements. They are statistically generated text that creates the appearance of measurement. Trying to calculate reputation scores with language models is therefore methodologically unsound.

Although language models cannot measure reputation, dismissing them would be a mistake, because they reveal something different: discursive reputation. Rather than producing metrics, LLM-based monitoring generates structured observations about how organisations are described in an increasingly AI-mediated public discourse.

The distinction matters. Experienced reputation is rooted in direct interactions and personal experiences. Discursive reputation, by contrast, reflects the image of an organisation that has become established within the public sphere of language.

Individual experiences may contribute to that image, but discursive reputation is shaped primarily by what is repeatedly said, written, and cited about a company over time. It emerges through media coverage, customer reviews, analyst reports, social media discussions and other forms of public communication.

This is where language models are particularly useful. Trained on vast amounts of text, they are exceptionally good at identifying recurring themes and compressing them into coherent narratives. Through web search integration, many systems can also incorporate more recent information, although historical training data often continues to play a significant role in shaping their outputs.

How we monitor AI-mediated reputation

At Unzer, we monitor this discursive reputation across three dimensions: awareness, sentiment and attribution.

Awareness captures whether and how prominently a company appears within relevant market conversations. Does it feature in the responses generated by language models when users ask realistic decision-making or orientation questions?

Sentiment focuses on the overall tone of those responses. Is the company described positively, negatively or somewhere in between?

Attribution examines meaning. What does the company stand for? Which characteristics, capabilities and narratives are consistently associated with it?

For each dimension, we use standardised prompts that are applied consistently across multiple language models. For example, awareness is assessed with questions such as ‘Which five providers of online payment solutions for mid-sized merchants in Europe are particularly relevant?’ Sentiment analysis identifies recurring themes in customer reviews, while attribution compares the adjectives language models associate with our company against our intended positioning.
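To make the awareness dimension concrete, here is a minimal sketch of how the share of responses that mention a company could be tallied per model. It is an illustration under stated assumptions, not the monitoring setup described in this article: the function name `mention_rate`, the company name ExampleCo, the model labels and all response texts are hypothetical placeholders.

```python
import re

def mention_rate(responses, company):
    """Return the share of responses (0.0 to 1.0) that mention the company."""
    if not responses:
        return 0.0
    # Whole-word, case-insensitive match; assumes a simple alphanumeric name.
    pattern = re.compile(rf"\b{re.escape(company)}\b", re.IGNORECASE)
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)

# Hypothetical answers to the same standardised prompt, grouped by model.
sample_responses = {
    "model_a": [
        "Relevant providers: Alpha Pay, ExampleCo, Beta Pay.",
        "Consider Alpha Pay and Beta Pay.",
    ],
    "model_b": [
        "ExampleCo and Alpha Pay are often named.",
        "Many merchants use exampleco or Gamma Pay.",
    ],
}

rates = {
    model: mention_rate(texts, "ExampleCo")
    for model, texts in sample_responses.items()
}
print(rates)
```

A tally like this only records whether a name appears. It says nothing about how the company is described, which is what the sentiment and attribution dimensions are for.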

We do not focus on individual responses. What matters is the aggregate pattern. When multiple systems converge on similar descriptions, this suggests the presence of a relatively stable discursive profile. When descriptions differ substantially, it may indicate a fragmented or less clearly defined reputation.
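One way to make 'convergence' tangible is to compare the sets of adjectives that different models associate with a company. The sketch below uses the Jaccard index (shared descriptors divided by all descriptors) averaged over every pair of models. This is an assumed illustration rather than the method used at Unzer, and the model labels and adjectives are invented placeholders.

```python
from itertools import combinations

# Hypothetical adjectives that three models associate with one company.
descriptors = {
    "model_a": {"reliable", "european", "compliant", "flexible"},
    "model_b": {"reliable", "european", "compliant", "complex"},
    "model_c": {"reliable", "european", "affordable", "flexible"},
}

def jaccard(a, b):
    """Overlap of two descriptor sets: 0.0 means disjoint, 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def convergence(by_model):
    """Mean pairwise Jaccard overlap across all models."""
    pairs = list(combinations(by_model.values(), 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def shared_core(by_model):
    """Descriptors that every model uses."""
    if not by_model:
        return set()
    return set.intersection(*by_model.values())

print(round(convergence(descriptors), 3))
print(sorted(shared_core(descriptors)))
```

A value close to 1 would point to a stable profile, a value close to 0 to a fragmented one. Like everything else in this kind of monitoring, the figure is a structured observation about language, not a measurement of reputation.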

What we’ve learned

Developing a framework is one thing. The more important question is whether it generates meaningful insights in practice. Over the past six months, we have built a pragmatic monitoring programme for AI-mediated reputation. The results are cautiously encouraging.

One of the clearest benefits lies in visibility. Specialised tools provide transparency into whether, how often, and in which contexts companies appear in AI-generated responses across different systems, from ChatGPT and Gemini to Perplexity. The same tools also reveal which competitors are mentioned alongside a company and which alternatives AI systems recommend. For organisations seeking to understand their position within an increasingly AI-mediated information environment, these insights can be highly valuable.

The same applies to sentiment. Most tools provide an initial classification of responses as positive, neutral or critical. While these assessments are often simplistic and sometimes require manual correction, they offer a useful starting point. Language models can also help identify dominant narratives and recurring themes associated with a brand.

Another advantage is the relatively low barrier to entry. Traditional reputation studies often require significant budgets and substantial research effort. AI-based monitoring, by contrast, can be implemented at comparatively modest cost. Depending on the provider, organisations can expect to spend between €200 and €600 ($228 to $684) per month.

Dealing with limitations

Despite these benefits, AI-based reputation monitoring has important limitations.

The first is methodological. As discussed earlier, language models do not reflect the views of a clearly defined public. Their responses often sound convincing, but they remain statistically generated text rather than empirical observations. What appears plausible is not necessarily true.

The problem becomes particularly obvious when organisations attempt to quantify reputation. A model can be asked to calculate a reputation score between 0 and 100 based on a set of dimensions. The result typically looks credible, complete with explanations and seemingly rigorous reasoning. Yet repeated queries often produce different outcomes. One response may generate a score of 72, another 68 and a third 81, even though nothing has changed in the real world. These numbers are not measurements. They are plausible inventions, making reliable longitudinal tracking impossible.
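The arithmetic behind this example is worth spelling out. The short sketch below takes the three scores mentioned above and computes their mean, range and sample standard deviation. The helper name `spread` is a hypothetical choice for illustration.

```python
from statistics import mean, stdev

# The three example scores from the text, produced by identical queries.
scores = [72, 68, 81]

def spread(values):
    """Summarise how far repeated scores drift apart."""
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "stdev": stdev(values),  # sample standard deviation
    }

summary = spread(scores)
print(summary)
```

With these three values the range is 13 points and the sample standard deviation is roughly 6.7 points around a mean of about 73.7. A month-on-month change of five points would therefore be indistinguishable from the noise between two identical queries.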

A second challenge stems from AI’s dual role as both analytical instrument and object of analysis. The training data and response patterns of language models are derived from public discourse, but public discourse itself is increasingly influenced by AI-generated content. Reputation therefore emerges within a circular process of public communication, AI-driven analysis and AI-generated reproduction. Understanding this growing self-referentiality will be one of the central challenges of reputation management in the years ahead.

Beyond methodological concerns, there are also practical limitations. For example, while visibility tools can easily identify the sources that are most relevant for a given prompt, influencing those sources requires a great deal of manual work. In the early stages of our monitoring efforts, I contacted authors and website operators whenever I encountered information that was outdated or misleading. Increasingly, however, there was no identifiable contact person, no active editor and no clear mechanism for correction. The effort required quickly outweighed the potential impact.

This highlights a broader challenge. Many AI visibility tools generate hundreds of recommendations every month and can even produce suggested articles, FAQs and website content automatically. Organisations can easily find themselves caught in a race to create ever more generative engine optimisation (GEO)-driven content.

Unlike traditional SEO, LLM visibility remains largely opaque. That makes me sceptical of vendors promising systematic GEO. High-quality content, strong brands and credible third-party mentions almost certainly matter. Whether endless AI-generated FAQs and GEO projects do is far less clear.

Most importantly, no tool can replace strategic judgement. Data and analysis can reveal patterns, but they cannot determine what those patterns mean, which issues deserve attention or which actions should be prioritised. Those decisions remain fundamentally human responsibilities.

After six months of AI-based reputation monitoring, my most important conclusion is straightforward: language models do not measure reputation, but they are becoming one of the places where reputation is encountered. They create a new observation layer for communications professionals because they increasingly condense public discourse into the answers millions of people rely on every day.

The challenge is therefore neither to embrace every new optimisation technique nor to dismiss AI-generated responses altogether. The challenge is to understand what these systems actually reveal, and what they do not.

Lydia Prexl is a communications strategist with more than fifteen years of experience. Since 2022, she has led corporate communications at the European payment service provider Unzer. Prior to joining Unzer, she built the communications function at the fintech insurer Getsafe. She is also the author and editor of several books and practical guides on communication and writing.