A benchmark that measures pharma brand AI visibility on a single engine in a single language is not a benchmark. It is a point estimate of one-sixth of the picture. The May 2026 PharmaGEO public index — measuring Answer Rate (AR) and Share of Voice (SOV) across OpenAI, Gemini, and Perplexity, in English, French, and Spanish, across four therapeutic areas — shows that leadership positions shift materially between engines, and that a brand calling itself the "AI category leader" based on ChatGPT performance alone may be a distant third on Perplexity and nearly invisible in French.
This article builds a public benchmark from the May 2026 PharmaGEO public index data. No proprietary client data is used. Every figure references publicly observable brand scores across the index's four TAs — atopic dermatitis, obesity, psoriasis, and lung cancer. The central argument: a benchmark that does not span multiple engines and multiple languages is structurally misleading for any brand with a European patient or HCP audience.
Why multi-engine benchmarking is structurally necessary
The theoretical foundation for engine divergence comes from citation overlap data. Digital Bloom's 2025 AI Citation LLM Visibility Report found that the domain overlap between ChatGPT and Perplexity citations is only 11%. The two engines are drawing from fundamentally different source pools — not just ranking the same sources differently, but citing largely non-overlapping sets of domains. A content strategy that earns strong citations from one source set may have near-zero visibility in the other.
The upstream driver of this divergence is brand mention density. Ahrefs' study of 75,000 brands found a correlation of r = 0.664 between web brand-mentions and AI Overview citation frequency. YouTube mentions were even more predictive at r ≈ 0.737. But because each engine weights different domains and source types differently, the same brand-mention distribution produces different citation outcomes per engine. Two brands with similar total web mention counts but different distributions — one concentrated in clinical journals, one distributed across news and forums — will have reversed rankings on Perplexity versus OpenAI.
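That last point can be made concrete with a toy model. Everything below is hypothetical (the weights, the mention counts, the source-type taxonomy); it is a sketch of the mechanism, not index data: two brands with comparable total mentions rank in opposite orders once each engine applies its own source-type weighting.

```python
# Hypothetical per-engine weights for each source type. Real engines do
# not publish such weights; these are illustrative only.
ENGINE_WEIGHTS = {
    "perplexity": {"clinical_journal": 3.0, "news": 1.0, "forum": 0.5},
    "openai":     {"clinical_journal": 1.0, "news": 1.5, "forum": 1.5},
}

# Hypothetical mention distributions; both brands total ~1,000 mentions.
BRAND_MENTIONS = {
    "brand_a": {"clinical_journal": 700, "news": 200, "forum": 100},
    "brand_b": {"clinical_journal": 100, "news": 500, "forum": 400},
}

def engine_score(mentions: dict, weights: dict) -> float:
    """Weighted mention score: a toy proxy for citation likelihood."""
    return sum(mentions[src] * weights[src] for src in mentions)

def ranking(engine: str) -> list:
    """Brands sorted best-first by their score on one engine."""
    w = ENGINE_WEIGHTS[engine]
    return sorted(BRAND_MENTIONS,
                  key=lambda b: engine_score(BRAND_MENTIONS[b], w),
                  reverse=True)

print(ranking("perplexity"))  # journal-heavy brand_a leads
print(ranking("openai"))      # news/forum-heavy brand_b leads
```

Same inputs, reversed leaderboards: this is the shape of the divergence the 11% citation overlap implies.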
The 4-TA × 3-engine matrix: where leadership actually sits
The May 2026 PharmaGEO public index allows construction of a brand leadership matrix across four TAs and three engines. The key finding: "leadership" is not a stable category property. It shifts by engine, and the shifts are not small.
| Therapeutic area | OpenAI top-2 (SOV) | Gemini top-2 (SOV) | Perplexity notable |
|---|---|---|---|
| Atopic Dermatitis | Dupixent 16.3% • Rinvoq 14.9% | Dupixent 23.5% • Rinvoq (est. lower) | Protopic AR 29.9% (rank 4 — 2000-era brand) |
| Obesity | Wegovy 22.2% • Zepbound 19.5% | Wegovy 34.3% • Zepbound (est. higher) | Ozempic ~6% SOV (T2D-only brand in obesity TA) |
| Psoriasis | Cosentyx 10.4% • Skyrizi 9.4% | Cosentyx leads (est.) | Tremfya #2 at 10% — Skyrizi drops to #6 at 7% |
| Lung Cancer | Keytruda 7.6% • Tagrisso 7.2% | Similar (highly fragmented TA) | 29 products with non-zero SOV (vs 10 on OpenAI) |
The Wegovy SOV anomaly: 22.2% vs 34.3%
Wegovy (semaglutide for obesity) holds 22.2% SOV on OpenAI and 34.3% on Gemini — a 12.1-point difference on the same query set in the same week. This reflects a structural pattern in Gemini's retrieval behavior: across TAs, Gemini gives consistently larger SOV to category leaders. The hypothesis is that Gemini surfaces fewer distinct products per response than OpenAI, amplifying the winner-take-most effect. For Wegovy, this is a favorable asymmetry. For the #3 or #4 brand in a metabolic TA, it cuts the other way: displacing the leader is structurally harder on Gemini than on OpenAI.
For brand teams modeling GEO ROI, the Gemini SOV premium on category leaders means that a brand in the top-2 position needs fewer content interventions to maintain its Gemini visibility than to maintain its Perplexity presence. But a challenger brand that measures only its Gemini SOV will underestimate its competitive position, because the gap to the leader is wider on Gemini than on platforms where the leader's premium is less pronounced.
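The gap arithmetic is worth making explicit. In the sketch below, the two Wegovy SOV figures are the index's published values; the Zepbound Gemini value is a hypothetical placeholder, since the index table notes it only as "est. higher":

```python
# Wegovy SOV figures are from the index; the Zepbound Gemini value is a
# hypothetical placeholder, not index data.
LEADER_SOV = {"openai": 22.2, "gemini": 34.3}       # Wegovy
CHALLENGER_SOV = {"openai": 19.5, "gemini": 21.0}   # Zepbound (Gemini assumed)

def leader_gap(engine: str) -> float:
    """SOV points separating the challenger from the category leader."""
    return LEADER_SOV[engine] - CHALLENGER_SOV[engine]

for engine in ("openai", "gemini"):
    print(f"{engine}: {leader_gap(engine):.1f} pts to the leader")
```

Under these assumptions the challenger trails by under 3 points on OpenAI but by over 13 on Gemini: two very different competitive stories from the same week of data.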
The Skyrizi vs Tremfya flip: a 4-rank swing on Perplexity
In the psoriasis TA, Skyrizi holds rank 2 on OpenAI with 9.4% SOV and rank 4 on Gemini. On Perplexity it falls to rank 6 with 7% SOV. Tremfya, conversely, ranks 4th on OpenAI (8.1% SOV) and rises to rank 2 on Perplexity with 10% SOV. A brand that looks like a strong #2 on the engine the team monitors is actually a mid-tier #6 on Perplexity — a 4-rank swing within a single IL-23 inhibitor class, in the same category, in the same week.
This is not an edge case. It is a structural consequence of the engines drawing from different source pools. Perplexity's live retrieval surfaces different content — specialist hubs, current medical news, psoriasis patient communities — than OpenAI's corpus-weighted model. A brand with strong coverage in psoriasis-hub.com and dermnetnz.org — two of Perplexity's high-use sources in this TA — will rank differently there than a brand whose content is concentrated in NEJM publications that OpenAI's corpus has absorbed but Perplexity's live retrieval weights less.
Lung cancer: the fragmentation benchmark
The most fragmented TA in the index is lung cancer. The top-3 brands — Keytruda (7.6%), Tagrisso (7.2%), and Tecentriq (6.1%) — combine for only 20.9% SOV on OpenAI. The remaining 79.1% is distributed across 26 or more products. On Perplexity, 29 distinct products show non-zero SOV versus 10 in the OpenAI public-facing table. This is the benchmark case for why Perplexity cannot be omitted from an oncology GEO program: its competitive set is fundamentally different and materially larger than what OpenAI surfaces.
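Fragmentation of this kind can be summarized in one number: the inverse Herfindahl index, read as an "effective number of brands." The top-3 shares below are the index's OpenAI figures; spreading the tail evenly across 26 further products is our simplifying assumption, not index data.

```python
# Top-3 OpenAI SOV shares from the index's lung cancer table.
top3 = [0.076, 0.072, 0.061]           # Keytruda, Tagrisso, Tecentriq

# Simplifying assumption: the remaining SOV is split evenly across
# 26 further products (the real tail is not uniform).
tail_products = 26
tail_share = (1.0 - sum(top3)) / tail_products

shares = top3 + [tail_share] * tail_products
hhi = sum(s * s for s in shares)       # Herfindahl-Hirschman index
effective_brands = 1.0 / hhi           # uniform-equivalent brand count

print(f"top-3 concentration: {sum(top3):.1%}")
print(f"effective number of brands: {effective_brands:.1f}")
```

Under these assumptions the lung cancer conversation behaves like a market of roughly 26 equally weighted brands. By comparison, obesity (Wegovy 22.2% + Zepbound 19.5% on OpenAI) behaves like a market of a handful.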
The source concentration in lung cancer makes the category's GEO dynamics distinctive. NCCN's NSCLC guideline PDFs account for approximately 218 of 258 total citation uses in Perplexity lung cancer answers — roughly 85% of the citation stack dominated by a single organization. An oncology brand's path to AI visibility in lung cancer runs almost entirely through NCCN guideline inclusion, not general web content or press syndication. A benchmark that does not include Perplexity misses the engine where NCCN dominance is most acute and most legible.
Language shifts the benchmark again
The third dimension the index adds is language. Across TAs, the English-language benchmark substantially misrepresents AI visibility for brands with European footprints.
| Brand | English AR (OpenAI) | French AR (OpenAI) | Spanish AR (OpenAI) |
|---|---|---|---|
| Dupixent | 65.5% | 63.7% | 62.3% |
| Rinvoq | 59.8% | 57.5% | 59.7% |
| Adtralza (tralokinumab) | 13.8% | 48.8% | 51.9% |
| Cibinqo | 58.6% | 56.3% | 58.4% |
Global brands with simultaneous FDA/EMA approvals and large uniform clinical programs — Dupixent, Rinvoq, Cibinqo — are stable across languages, varying by less than 3 percentage points. Mid-tier brands with regional approval profiles are wildly unstable. Adtralza's 35-point gain from English to French is not a data artifact: it reflects the fact that tralokinumab has a larger clinical content footprint in the EU-language internet than in the English-language internet, and LLMs draw on that asymmetric source pool.
The benchmark implication: a pharma brand team that runs an English-only GEO benchmark and concludes its brand is a minor player is potentially wrong for its French and Spanish prescriber audiences, and vice versa. European market access strategies that depend on AI visibility for HCP awareness need language-level measurement as standard input, not as a specialized add-on.
Why a single-engine, single-language benchmark is misleading
The three structural features of the May 2026 index — engine divergence, SOV compression, and language leaderboard shifts — combine to make single-point benchmarking a strategically dangerous practice.
- Engine divergence: a 33-point AR gap on the same brand in the same TA (A1) means your "AI visibility" is 4x different depending on which engine you measure.
- SOV compression: Wegovy's 93.2% Answer Rate compresses to 22.2% SOV (C3). A benchmark that reports AR only will consistently overstate a brand's actual conversation surface by approximately 4x.
- Language shifts: Adtralza's English AR of 13.8% vs French AR of 48.8% means an English-only benchmark assigns a major European brand to the "tail" category it does not actually occupy for the majority of its prescriber base.
- Citation pool overlap is only 11%: According to Digital Bloom, ChatGPT and Perplexity share only 11% of their cited domains. A strategy built entirely on one engine's source pool will have near-zero lift on the other.
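The AR-to-SOV compression in the second bullet is simple arithmetic, but it is the single number most often missing from vendor reports. A sketch (the figures are Wegovy's published index values; the helper function name is ours):

```python
def compression_ratio(answer_rate: float, share_of_voice: float) -> float:
    """How many times AR overstates conversational share relative to SOV.

    AR = share of answers where the brand appears at all.
    SOV = the brand's share of all brand mentions in those answers.
    """
    return answer_rate / share_of_voice

# Wegovy, OpenAI, obesity TA (May 2026 index figures).
wegovy = compression_ratio(93.2, 22.2)
print(f"Wegovy AR/SOV compression: {wegovy:.1f}x")
```

A report quoting the 93.2% AR alone implies near-total dominance; the 22.2% SOV shows the brand owns barely a fifth of the mention surface in those same answers.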
The minimum viable benchmark for a pharma brand team in 2026 is three engines × two languages × two query types (branded + disease-state), measured monthly. Anything less is a partial measurement presented as a complete picture. Vendors and agencies offering single-engine GEO audits are selling an instrument that cannot measure its own blind spots.
What a meaningful benchmark looks like in practice
A structurally valid pharma GEO benchmark has four components:
- Multi-engine coverage — at minimum OpenAI, Gemini, and Perplexity, because their citation pools overlap by only 11% and their retrieval architectures produce systematically different brand rankings. A brand's competitive set on Perplexity (29 products in lung cancer) is not the same competitive set as on OpenAI (10 products).
- Multi-language coverage — at minimum English and the primary market language for each geography where the brand has material commercial goals. French and Spanish are the obvious extensions for EU/LATAM brands. Language choice determines which source pool the engine draws from, which determines which brands lead.
- Both AR and SOV — Answer Rate measures reach; SOV measures depth. The 4x compression ratio means they require different strategic responses and should not be reported as the same metric.
- TA-specific source attribution — because NCCN guideline PDFs dominate lung cancer citations while FDA labels dominate obesity citations, the GEO lever set is fundamentally different by TA. A benchmark without source attribution cannot inform content investment decisions.
The May 2026 PharmaGEO public index applies all four of these dimensions to publicly named brands across four TAs. The result is not a comfortable league table with stable leaders and stable laggards. It is a picture of a measurement problem: brand AI visibility is a six-dimensional metric that most pharma GEO programs are measuring in one dimension and calling it complete.
The citation volatility dimension: why benchmarks go stale
Beyond the structural engine and language gaps, there is a temporal dimension that makes static benchmarks particularly unreliable in pharma. Digital Bloom reports 59.3% monthly volatility in AI citation graphs — meaning that the set of sources an engine cites for a given topic changes by nearly 60% month over month. A benchmark conducted in January may have limited predictive value for April. In practice this means pharma GEO benchmarks need to be run at minimum quarterly, and monthly for categories with rapid clinical development or pipeline news coverage that could shift the source pool.
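Digital Bloom's exact volatility formula is not public, so here is a plausible stand-in: month-over-month set churn over cited domains. The domain lists below are hypothetical, chosen only to illustrate the calculation.

```python
def citation_churn(previous: set, current: set) -> float:
    """Fraction of this month's cited domains that were not cited last month."""
    if not current:
        return 0.0
    return len(current - previous) / len(current)

# Hypothetical cited-domain sets for one TA in two consecutive months.
jan = {"nccn.org", "nejm.org", "cancer.gov", "medscape.com", "asco.org"}
feb = {"nccn.org", "cancer.gov", "onclive.com", "esmo.org", "nature.com"}

print(f"churn: {citation_churn(jan, feb):.0%}")  # 3 of 5 domains are new
```

At the reported ~59% monthly churn, roughly three of every five sources an engine cites for a topic this month were not in its citation set last month, which is exactly why a January benchmark says little about April.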
The volatility rate has an important implication for competitive analysis. A brand that was ranked third on Perplexity in Q1 and has not been measured since may have moved to first or fifth by Q3, depending on the clinical publication cycle and guideline updates in its TA. In lung cancer specifically, where NCCN updates its NSCLC guidelines multiple times per year, each guideline revision represents a potential reordering of the citation stack. Any brand in an oncology TA that benchmarks less often than quarterly is effectively flying blind through the update cycles that most determine its AI visibility.
The practical conclusion is that multi-engine, multi-language benchmarking at the right cadence is not an advanced capability. It is the baseline. A benchmark that does not use multiple engines and multiple languages cannot answer the most commercially significant questions a pharma brand team needs answered: where am I weak that my team does not currently monitor, and what changed this month that I need to respond to?
Want a real audit on your brand? Request a sample report or get the full PharmaGEO Playbook.