Benchmarking Pharma GEO: What 23 Brands and 3 AI Models Reveal About AI Drug Representation
We analyzed 23 pharmaceutical brands across 3 major AI models. The result is the first published pharma GEO benchmark, and the findings challenge nearly every assumption brand teams hold about AI visibility, sentiment, and performance.
Until now, pharma companies have operated blind. They could track Google rankings, monitor media mentions, and audit HCP engagement. But when a patient asks ChatGPT about their medication, or a physician checks Perplexity for a quick drug comparison, no one has had a systematic way to measure what those models actually say — or how accurately they say it.
That changes today.
Using the [PharmaGEO platform](#), we scored every brand on the same framework: overall GEO performance, net sentiment, factual reliability, citation frequency, AI visibility, and cross-model consistency. The dataset spans oncology drugs, immunologics, consumer health products, fertility treatments, neurological therapies, and weight management medications — in both English and French.
This article is the full scorecard. Every number, every correlation, every gap between what brands assume and what AI models actually deliver.
Key Takeaway: The pharma GEO benchmark reveals a 23-point spread between the highest- and lowest-scoring brands. Reliability — not sentiment, not visibility — is the single strongest predictor of overall AI performance.
The Pharma GEO Scorecard: Benchmark Brands Ranked
Below is the brand scorecard from our pharma GEO benchmark, showing 16 of the 23 brand entries evaluated. Each entry was scored across multiple dimensions using standardized queries submitted to OpenAI (ChatGPT), Google Gemini, and Perplexity AI.
| Brand | Overall Score | Net Sentiment | Reliability | Citation Freq. | AI Visibility | Language |
|---|---|---|---|---|---|---|
| Beyfortus | 66 | 50/100 | 85% | 45% | 100% | EN |
| Dupixent (EN) | 61 | 20/100 | 84% | — | 50% | EN |
| Serelys Meno | 61 | 65/100 | 48% | — | — | FR |
| Braftovi | 60 | 0/100 | 83% | — | 93% | EN |
| Dupixent (FR) | 59 | 20/100 | 78% | — | 70% | FR |
| Yescarta | 57 | 5/100 | 81% | — | — | FR |
| Imfinzi | 54 | 38/100 | 77% | — | 79% | FR |
| Ontozry | 50 | 3/100 | 80% | — | — | FR |
| Menopur | 50 | 0/100 | 80% | — | 98% | EN |
| Ledaga | 49 | -10/100 | 81% | — | 89% | FR |
| Wegovy | 47 | -2/100 | 77% | — | — | FR |
| Tremfya | 46 | 10/100 | 75% | 50% | 81% | EN |
| Parodontax | 45 | 80/100 | 72% | — | 100% | EN |
| Aristada | 44 | 8/100 | 75% | 47% | 99% | EN |
| Entyvio | 44 | 35/100 | 73% | 24% | 65% | EN |
| Voltaren | 43 | -5/100 | 74% | — | — | FR |
Scores represent aggregated performance across OpenAI, Gemini, and Perplexity. Net Sentiment scored on a -100 to 100 scale. Reliability measures factual accuracy against approved labeling. Citation Freq. tracks how often AI models cite sources. AI Visibility measures brand mention rate in relevant therapeutic queries.
What the Scorecard Tells Us at a Glance
The overall score range spans from 43 to 66 — a 23-point gap that reveals a significant disparity in how well AI models represent different pharmaceutical brands. Several patterns emerge immediately:
- No brand scores above 70. Even the best-performing drugs in AI have substantial room for improvement. The pharma GEO benchmark ceiling is lower than most brand teams expect.
- Reliability spans 48% to 85%, with most brands clustered between 72% and 85%. That band suggests AI models have a solid floor of baseline medical knowledge but a clear ceiling on precision.
- Sentiment and score move independently. Parodontax has the highest sentiment in the dataset (80/100) but scores just 45 overall. Braftovi has zero sentiment but scores 60.
- Near-perfect AI visibility does not guarantee a high score. Aristada is visible in 99% of relevant queries yet scores only 44.
These counterintuitive findings are the foundation of the pharma GEO benchmark — and they have direct implications for how brands should allocate their AI optimization resources.
What Separates Leaders from Laggards
The Top Performers: Beyfortus and Braftovi
Beyfortus leads the pharma GEO benchmark with a score of 66, and its profile explains why. Across all three AI models, Beyfortus maintains a remarkably tight performance band:
- OpenAI: 66
- Gemini: 70
- Perplexity: 63
That 7-point spread is the narrowest in the entire dataset, meaning Beyfortus delivers a consistent AI experience regardless of which model a patient or physician queries. This consistency is not accidental. Beyfortus benefits from regulatory source dominance — the AI models predominantly cite FDA and EMA regulatory documents when discussing the brand, which anchors responses in verified, structured data.
Braftovi follows at 60, with a distinctive profile of its own. Its Gemini score of 71 represents the highest single AI model score recorded for any brand in this benchmark. What drives it? All five top-cited sources for Braftovi come from FDA documentation. When an AI model has a clean, authoritative, well-structured source to draw from, accuracy and score both improve.
Key Takeaway: The top-performing brands in the pharma GEO benchmark share one trait: their AI-facing information ecosystem is dominated by regulatory and institutional sources, not promotional content, not patient forums, and not news articles.
The Mid-Tier: Where Most Brands Live
The majority of brands in our pharmaceutical AI scorecard cluster between 44 and 57. This is the contested middle ground — and it is where optimization efforts yield the most return.
Dupixent provides a valuable case study as the only brand measured in both English and French. Its English score (61) outperforms French (59) by a modest margin, but the reliability gap is more telling: 84% in English vs. 78% in French. AI models have access to more structured English-language content for Dupixent, which translates directly into higher factual accuracy.
Imfinzi (54) and Yescarta (57) represent strong oncology entries with solid reliability (77% and 81% respectively) but middling overall scores. Their profiles suggest that clinical complexity — multiple indications, evolving trial data, combination regimen contexts — creates challenges for AI models trying to synthesize clear, accurate answers.
The Lower Tier: Visibility Without Substance
Brands scoring below 46 share a common trait: at least one critical dimension underperforms even when others look strong.
Aristada is the starkest example. With 99% AI visibility and a 47% citation frequency, it appears in nearly every relevant query and AI models frequently cite sources when discussing it. Yet its overall score is just 44. The issue is a reliability score of 75% — meaning one in four factual claims AI models make about Aristada contains an error or omission. Visibility without accuracy is not just unhelpful; it is potentially dangerous.
The Three Correlations Every Pharma Brand Should Know
Our pharma GEO benchmark reveals three statistical relationships that should reshape how pharmaceutical companies think about AI optimization. Two are intuitive. One is not.
Correlation 1: Reliability Drives Score (Strong Positive)
This is the most important finding in the dataset. Brands with reliability above 80% generally score above 50, while brands below 75% cluster in the low-to-mid 40s. Serelys Meno (48% reliability, 61 score) is the notable outlier.
The relationship is nearly linear:
- Beyfortus: 85% reliability, 66 score
- Dupixent (EN): 84% reliability, 61 score
- Braftovi: 83% reliability, 60 score
- Yescarta: 81% reliability, 57 score
- Tremfya: 75% reliability, 46 score
- Entyvio: 73% reliability, 44 score
- Parodontax: 72% reliability, 45 score
The implication is clear. If a brand team has limited resources, investing in the accuracy of AI-accessible content — structured data, regulatory source optimization, consistent labeling information — will produce more score improvement than any other single intervention.
Key Takeaway: Reliability is the strongest lever in the pharma GEO benchmark. Moving reliability from 75% to 85% correlates with a 15+ point score improvement — more than any other factor.
Correlation 2: Sentiment Does NOT Drive Score
This is the counterintuitive finding. Many brand teams assume that positive AI sentiment equals strong AI performance. The data says otherwise.
Consider these pairings:
| Brand | Net Sentiment | Overall Score |
|---|---|---|
| Parodontax | 80/100 | 45 |
| Serelys Meno | 65/100 | 61 |
| Beyfortus | 50/100 | 66 |
| Braftovi | 0/100 | 60 |
| Menopur | 0/100 | 50 |
| Ledaga | -10/100 | 49 |
Parodontax has the highest sentiment score in the entire benchmark (80/100) but an overall score of just 45. Braftovi has zero net sentiment — perfectly neutral — yet scores 60. The correlation between sentiment and overall score is effectively absent.
Why? Because AI models are not search engines optimizing for user satisfaction. They are information synthesis tools optimizing for factual completeness. A model that says neutral but accurate things about a drug outperforms one that says positive but incomplete things.
For pharma brand teams, this means that reputation management strategies designed for search and social do not translate to GEO. Optimizing for positive sentiment in AI without ensuring factual depth and source quality is a misallocation of resources.
Correlation 3: AI Visibility Does NOT Drive Score
The second counterintuitive finding. Being present in AI responses is not the same as being well-represented.
| Brand | AI Visibility | Overall Score |
|---|---|---|
| Parodontax | 100% | 45 |
| Beyfortus | 100% | 66 |
| Aristada | 99% | 44 |
| Menopur | 98% | 50 |
| Braftovi | 93% | 60 |
| Ledaga | 89% | 49 |
Parodontax and Beyfortus both achieve 100% AI visibility — they appear in every relevant query. Yet their scores differ by 21 points. Aristada, at 99% visibility, scores 22 points below Beyfortus.
Visibility is a necessary but insufficient condition for AI performance. A brand that appears in every AI response but is described inaccurately, incompletely, or inconsistently is worse off than a brand that appears less frequently but with high-fidelity information.
Key Takeaway: The pharma GEO benchmark breaks three common assumptions. Score is driven by reliability, not sentiment. Visibility without accuracy is counterproductive. And the brands that "win" in AI are those with the cleanest source ecosystems — not the loudest digital presence.
AI Model Performance: OpenAI vs. Gemini vs. Perplexity
The pharma GEO benchmark does not just measure brands — it measures the AI models themselves. Each model exhibits distinct behavioral patterns that pharma teams must understand.
| AI Model | Typical Score Range | Typical Reliability | Sentiment Pattern |
|---|---|---|---|
| OpenAI | 43–66 | 73–89% | Overwhelmingly neutral |
| Gemini | 40–71 | 71–89% | Neutral to slightly positive |
| Perplexity | 23–63 | 52–81% | Neutral, occasionally positive |
OpenAI (ChatGPT): The Conservative Baseline
OpenAI produces the most consistent and reliability-focused responses across all brands. Its score range (43–66) is the tightest, suggesting a more standardized internal approach to medical content. Sentiment is overwhelmingly neutral — ChatGPT rarely endorses or criticizes drugs, instead opting for clinical language that mirrors labeling documents.
For pharma brands, OpenAI is the reliability benchmark. If a brand scores well on ChatGPT, its underlying information ecosystem is likely sound.
Google Gemini: The High-Ceiling Wildcard
Gemini produces the widest score range (40–71) and the highest individual brand scores. Braftovi's Gemini score of 71 is the peak of the entire dataset. However, Gemini also exhibits more variance — its lows dip to 40, below OpenAI's floor.
Gemini's sentiment pattern skews neutral to slightly positive, suggesting it is more willing to incorporate favorable framing from source material. For brands with strong positive source content, Gemini may amplify it. For brands with mixed or negative sources, the variability becomes a risk.
Perplexity: The Accuracy Challenge
Perplexity presents the most significant challenge for pharma brands. Its reliability range (52–81%) is the widest and lowest of any model, and its score floor of 23 is dramatically below the other two models.
The low reliability floor is especially concerning because Perplexity's core value proposition is citation-backed answers. Paradoxically, Perplexity cites more sources but gets more facts wrong in certain drug categories. This suggests the model sometimes surfaces sources that are outdated, non-authoritative, or misinterpreted.
Key Takeaway: No single AI model is "best" for pharma. OpenAI delivers the most reliable baseline, Gemini offers the highest upside with more variance, and Perplexity requires the most active monitoring due to its wider accuracy spread. A comprehensive pharma GEO strategy must optimize for all three.
The OTC vs. Rx Divide: Two Different AI Worlds
One of the most significant structural findings in the pharmaceutical AI scorecard is the divergence between over-the-counter (OTC) and prescription (Rx) brand dynamics.
OTC brands like Parodontax and Voltaren exhibit a consistent pattern:
- Higher sentiment scores — Parodontax leads the entire benchmark at 80/100 sentiment
- Lower reliability — OTC brands cluster at 72–74% reliability, below the Rx average
- More manufacturer site citations — AI models more frequently cite brand-owned content for OTC products
Prescription brands show the inverse:
- Lower or neutral sentiment — Clinical neutrality prevails
- Higher reliability — Rx brands with regulatory source dominance achieve 80%+ reliability
- Institutional and regulatory citations — FDA, EMA, and clinical databases dominate the source mix
Why does this divide exist? AI models reflect their training data. OTC brands have larger consumer-facing content footprints — patient reviews, lifestyle articles, brand marketing — which boosts sentiment but introduces factual noise. Prescription brands have leaner but more authoritative content ecosystems — regulatory filings, clinical trial publications, professional medical resources — which suppresses sentiment but elevates accuracy.
The strategic implication is different for each category. OTC brands need to improve the factual rigor of their consumer-facing content so that AI models do not sacrifice accuracy for positive framing. Rx brands with established regulatory source profiles should protect that advantage while selectively improving sentiment dimensions where it does not compromise reliability.
Key Takeaway: OTC and Rx brands face opposite GEO challenges. OTC brands must close the reliability gap without losing their sentiment advantage. Rx brands must maintain their accuracy edge while expanding their AI content footprint.
The Consistency Crisis: When AI Models Contradict Themselves
The pharma GEO benchmark exposes a dimension that most brand teams have never measured: how consistently do AI models describe a drug's benefits and risks across repeated queries?
The results are alarming.
Benefits Consistency
- Dupixent (single model): 100% — the same benefits are cited every time
- Entyvio (cross-model): 13% — AI models disagree on Entyvio's benefits 87% of the time
- Ledaga (cross-model): 0% — no consensus on core benefits across models
Risks Consistency
- Entyvio: 3% — virtually no cross-model agreement on risk profile
- Ledaga: 0% — complete disagreement on risks across AI models
What does 0% consistency mean in practice? It means that a patient asking ChatGPT about Ledaga's benefits will receive a fundamentally different answer than a patient asking Gemini or Perplexity. The same drug, described by three different AI models, is effectively three different drugs.
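One plausible way to operationalize cross-model consistency (the platform's exact formula is not published here) is the average pairwise overlap of the claims each model makes. A minimal sketch, using hypothetical benefit claims rather than real model outputs:

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two claim sets: 1.0 = identical, 0.0 = disjoint."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical benefit claims extracted from each model's answer; in the
# worst case, as described for Ledaga, the three sets share nothing at all.
claims = {
    "openai": {"reduces skin lesions", "topical administration"},
    "gemini": {"slows disease progression", "generally well tolerated"},
    "perplexity": {"improves early-stage outcomes"},
}

pairwise = [jaccard(claims[a], claims[b]) for a, b in combinations(claims, 2)]
consistency = sum(pairwise) / len(pairwise)
print(f"Cross-model benefits consistency: {consistency:.0%}")
```

With three disjoint claim sets, every pairwise overlap is zero and the consistency score is 0%, the Ledaga scenario.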
This is not a theoretical concern. When physicians use AI tools for quick drug comparisons, and when patients use them to validate treatment decisions, cross-model inconsistency introduces real confusion into clinical conversations. A patient who reads conflicting AI-generated information about their medication may lose trust in the treatment — or worse, make decisions based on the least accurate model's output.
The consistency crisis is the strongest argument for proactive GEO management. Brands cannot control which AI model a patient or physician queries. They can only ensure that the underlying source material is so clear, so well-structured, and so authoritative that all three models converge on the same core messages.
Key Takeaway: Cross-model consistency is the hidden crisis in pharma AI. Brands like Ledaga with 0% benefits and risks consistency are effectively represented as different drugs depending on which AI a user queries. Source quality and structure are the only levers that drive convergence.
Temporal Volatility: Your Score Today Is Not Your Score Tomorrow
The pharma GEO benchmark captures a moment in time — but AI performance is not static. Our progress tracking data reveals significant temporal volatility that brands must account for.
Entyvio's tracked score fluctuated between 45 and 90 over the monitoring period. That is a 45-point swing — larger than the entire gap between the top and bottom brands in our static benchmark.
What drives this volatility?
1. Model updates. When OpenAI, Google, or Perplexity update their models, drug representations can shift overnight. A training data refresh that includes a new clinical trial publication, an FDA safety communication, or even a highly shared news article can alter how a model discusses a drug.
2. Source ecosystem changes. When a new systematic review is published, when a competitor's trial data reshapes a therapeutic category, or when a media cycle drives attention to a drug class, AI models absorb these signals and adjust their outputs.
3. Query context drift. The same question asked in different phrasing, at different times, or in different conversational contexts can elicit meaningfully different AI responses. This non-deterministic behavior is inherent to large language models.
The strategic takeaway: point-in-time audits are insufficient. A brand that scores 65 today might score 48 next month after a model update, or 75 after a favorable publication enters the training data. Continuous monitoring — the kind the [PharmaGEO platform](#) provides — is the only way to detect regressions before they compound.
Static SEO audits happen quarterly. GEO monitoring must happen continuously.
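In practice, continuous monitoring can start as simply as an automated diff on successive scores. A minimal sketch, with an illustrative time series and an arbitrary alert threshold (neither is real PharmaGEO data):

```python
# Alert when a brand's GEO score drops sharply between monitoring runs.
# The threshold and the history below are illustrative, not PharmaGEO data.
REGRESSION_THRESHOLD = 10  # point drop between two checks that triggers an alert

def detect_regressions(history: list[tuple[str, int]]) -> list[str]:
    alerts = []
    for (_, prev), (date, curr) in zip(history, history[1:]):
        if prev - curr >= REGRESSION_THRESHOLD:
            alerts.append(f"{date}: score fell {prev - curr} points ({prev} -> {curr})")
    return alerts

history = [("2025-01", 62), ("2025-02", 65), ("2025-03", 48), ("2025-04", 51)]
for alert in detect_regressions(history):
    print(alert)
```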
Key Takeaway: AI drug representation is volatile. Entyvio's 45-point score swing proves that a single benchmark is a starting point, not a finish line. Brands need continuous monitoring to detect and respond to AI content shifts in real time.
Methodology Notes
The pharma GEO benchmark was produced using the PharmaGEO platform's standardized evaluation framework:
- 23 brands across multiple therapeutic areas and markets
- 3 AI models: OpenAI (GPT-4), Google Gemini, Perplexity AI
- Standardized query sets designed to simulate real patient and HCP information-seeking behavior
- Scoring dimensions: Overall GEO Score (composite), Net Sentiment (-100 to 100), Factual Reliability (% accuracy against approved labeling), Citation Frequency (% of responses citing sources), AI Visibility (% of relevant queries where brand appears)
- Languages: English (EN) and French (FR)
- Evaluation period: Scores represent aggregated performance across multiple query sessions
All reliability assessments were validated against approved product labeling, regulatory submissions, and peer-reviewed clinical data.
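As a rough illustration of how a composite score could be assembled from these dimensions, the sketch below combines them with hypothetical weights; the platform's actual weighting formula is not disclosed in this article.

```python
# Hypothetical composite: the dimension names mirror the framework above,
# but these weights are illustrative, not PharmaGEO's actual formula.
WEIGHTS = {"reliability": 0.4, "visibility": 0.2, "citations": 0.2, "sentiment": 0.2}

def composite_score(reliability: float, visibility: float,
                    citations: float, net_sentiment: float) -> float:
    """All inputs are on a 0-100 scale except net_sentiment (-100 to 100)."""
    parts = {
        "reliability": reliability,
        "visibility": visibility,
        "citations": citations,
        "sentiment": (net_sentiment + 100) / 2,  # rescale -100..100 to 0..100
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

print(round(composite_score(reliability=85, visibility=100,
                            citations=45, net_sentiment=50), 1))
```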
Frequently Asked Questions
What is a pharma GEO benchmark?
A pharma GEO benchmark is a standardized measurement of how AI models like ChatGPT, Gemini, and Perplexity represent pharmaceutical brands. It evaluates factual reliability, sentiment, visibility, citation behavior, and consistency across models. The PharmaGEO benchmark is the first published multi-brand scoring framework, covering 23 drugs across multiple therapeutic areas and languages.
Why do some brands score high on AI visibility but low overall?
AI visibility measures whether a brand appears in AI responses — not whether it is represented accurately. Brands like Aristada (99% visibility, 44 score) and Parodontax (100% visibility, 45 score) appear in nearly every relevant AI query but suffer from lower reliability or consistency. Visibility without accuracy is counterproductive because it means more patients and physicians are exposed to potentially inaccurate information.
Which AI model is most reliable for drug information?
In our pharma GEO benchmark, OpenAI (ChatGPT) delivers the most consistent reliability, ranging from 73% to 89% across brands. Gemini offers the highest individual scores but with more variance (71–89% reliability). Perplexity has the widest reliability spread (52–81%), making it the model that requires the most active monitoring from pharma brand teams.
Does positive AI sentiment improve a brand's GEO score?
No. The pharma GEO benchmark shows no meaningful correlation between sentiment and overall score. Parodontax has the highest sentiment (80/100) but scores just 45, while Braftovi has zero sentiment and scores 60. AI models prioritize factual completeness over positive framing, so strategies designed to improve sentiment without addressing reliability will not improve overall GEO performance.
How often do AI drug representations change?
AI drug representations are volatile. Our monitoring data shows score fluctuations as large as 45 points for a single brand over time (Entyvio: 45–90 range). Changes are driven by AI model updates, new publications entering training data, and shifts in the broader source ecosystem. Point-in-time audits capture a snapshot, but continuous monitoring is necessary to detect and respond to regressions.
How can pharma brands improve their GEO scores?
The strongest lever is factual reliability. Brands with 80%+ reliability generally score above 50, while those below 75% score in the low-to-mid 40s. Practical steps include optimizing the structure and accessibility of regulatory source content, ensuring approved labeling information is well-represented in AI-accessible formats, and monitoring cross-model consistency to identify and correct factual gaps.
Conclusion: The Benchmark Is Just the Beginning
The pharma GEO benchmark reveals an industry at a crossroads. AI models are already the front door for drug information — patients query them, physicians reference them, and payers are beginning to consult them. Yet the data shows that even the best-represented brands score just 66 out of 100, cross-model consistency can drop to zero, and scores can swing by 45 points without warning.
Three findings from this benchmark should drive immediate action:
1. Reliability is the only metric that consistently predicts overall AI performance. Brands that invest in the accuracy and structure of their AI-accessible content will outperform those that chase sentiment or visibility.
2. Every brand needs a multi-model strategy. OpenAI, Gemini, and Perplexity behave differently, cite different sources, and produce different representations of the same drug. Optimizing for one model while ignoring the others creates blind spots.
3. Static audits are already obsolete. The temporal volatility in this data proves that AI drug representation is a moving target. Continuous monitoring is not a luxury — it is a requirement for any brand that takes AI-driven patient and physician engagement seriously.
This benchmark is the first published pharma GEO scorecard. It will not be the last. As AI models evolve, as new brands enter the dataset, and as the industry matures its approach to generative engine optimization, the brands that move first will establish the advantages that compound over time.
The question is no longer whether AI represents your drug. It is whether you know how — and whether you are doing anything about it.
See how your brand appears in AI answers.
Get a cross-LLM reputation report in minutes. No patient data. EU-based storage.
Data source: PharmaGEO platform analysis of 23 pharmaceutical brands across OpenAI, Gemini, and Perplexity (2025)