A brand that appears in 93% of AI responses to obesity queries still holds only 22% of the conversation surface in those responses. This is the AR/SOV compression problem — and it is the reason that a single headline metric cannot tell a brand team whether their GEO position is strong or fragile. The metrics framework that follows is designed to resolve that ambiguity. It covers what to track, how to build a sampling design, and what each number actually means operationally.

This article is the metrics layer of the PharmaGEO measurement framework. Readers who want to go deeper on the specific distinction between Answer Rate and Share of Voice as independent strategic signals should read the companion piece on AR vs SOV dynamics, which covers the structural forces that drive the compression ratio and how to diagnose which lever to pull in each case.

Why AR alone is a misleading headline metric

The AR/SOV compression documented in the May 2026 PharmaGEO public index is the starting point for any serious GEO metrics conversation. Two data points make the problem concrete (insight C3, May 2026 PharmaGEO public index):

| Brand | Therapeutic area | Answer Rate (AR) | Share of Voice (SOV) | AR/SOV ratio |
| --- | --- | --- | --- | --- |
| Dupixent | Atopic Dermatitis | 65.5% | 16.3% | 4.0x |
| Wegovy | Obesity | 93.2% | 22.2% | 4.2x |

Wegovy's AR of 93.2% means the brand appears in nearly every AI response to obesity queries on OpenAI. A brand team seeing this metric in isolation would consider the position very strong. But the SOV of 22.2% reveals that every Wegovy answer is also an answer that names Zepbound, Qsymia, and multiple other competitors — and that Wegovy accounts for less than a quarter of the conversation surface, despite appearing in almost all of it. The AR/SOV ratio of 4.2x is the compression factor that explains the gap.

This is not a failure of Wegovy's GEO strategy. It is a structural feature of AI answer behaviour: answers to treatment landscape queries name multiple brands per response. The brand that appears in 93% of answers shares each of those answers with competitors. AR tells you whether you are in the room; SOV tells you how much of the room you control. Both metrics are necessary to understand competitive position.
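The compression factor is simply the ratio of the two headline numbers. A minimal sketch (the `compression` helper is illustrative, not part of any PharmaGEO tooling) reproduces the ratios from the table above:

```python
def compression(ar_pct: float, sov_pct: float) -> float:
    """AR/SOV compression ratio: how much wider a brand's presence (AR)
    is than the conversation surface it actually controls (SOV)."""
    return ar_pct / sov_pct

# May 2026 index figures from the table above:
print(round(compression(93.2, 22.2), 1))  # Wegovy  -> 4.2
print(round(compression(65.5, 16.3), 1))  # Dupixent -> 4.0
```

A ratio near 1.0 would mean the brand rarely shares an answer with competitors; ratios around 4x mean every appearance is a shared appearance.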

Concentration by therapeutic area: what SOV ceiling your category allows

The upper bound on achievable SOV is partly determined by the structural concentration of the therapeutic area. May 2026 PharmaGEO public index data for OpenAI shows four distinct market structures (insight C1):

| Therapeutic area | Top-3 SOV combined | Top brands (SOV) | Structural pattern |
| --- | --- | --- | --- |
| Obesity | 53.9% | Wegovy 22.2%, Zepbound 19.5%, Qsymia 12.2% | Bipolar GLP-1 duopoly with gap to tail |
| Atopic Dermatitis | 45.8% | Dupixent 16.3%, Rinvoq 14.9%, Cibinqo 14.6% | Tight three-way cluster + extended tail |
| Psoriasis | 28.2% | Cosentyx 10.4%, Skyrizi 9.4%, Taltz 8.4% | Crowded biologic field, highly fragmented |
| Lung Cancer | 20.9% | Keytruda 7.6%, Tagrisso 7.2%, Tecentriq 6.1% | Highly fragmented, 26+ brands with non-zero SOV |

The concentration index directly shapes goal-setting for any SOV improvement programme. An obesity brand targeting a 5-percentage-point gain in SOV is attempting to capture a gain equal to roughly 23% of the leader's current share (5 / 22.2), and there is a 46-point tail outside the top three to draw it from. A lung cancer brand targeting the same 5-point gain is attempting to capture nearly a quarter of the total top-3 SOV (5 / 20.9), which means directly displacing one of the three leading brands. The same number means entirely different things in different categories. SOV targets should always be set relative to the concentration structure of the TA, not as absolute figures.
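Translating a flat point target into its structural meaning can be sketched as a small helper; the function and its output shape are illustrative, using the May 2026 index figures from the table:

```python
def gain_in_context(gain_pts: float, leader_sov: float, top3_sov: float) -> dict:
    """Express a flat SOV gain relative to the category's structure:
    as a share of the leader's current SOV and of the top-3 block."""
    return {"vs_leader": round(100 * gain_pts / leader_sov, 1),
            "vs_top3_block": round(100 * gain_pts / top3_sov, 1)}

obesity = gain_in_context(5, 22.2, 53.9)      # leader: Wegovy
lung_cancer = gain_in_context(5, 7.6, 20.9)   # leader: Keytruda

print(obesity)      # vs_leader: 22.5 -- about a quarter of Wegovy's share
print(lung_cancer)  # vs_leader: 65.8 -- two thirds of Keytruda's entire share
```

The contrast is the point: the same 5-point target is a fraction of the obesity leader's position, but in fragmented lung cancer it is larger than half of any single leader's share.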

The six-axis PharmaGEO score: what to track across dimensions

A single SOV figure, even one that correctly distinguishes AR from SOV, is insufficient for GEO programme management. A brand may have high SOV driven by inaccurate claims, or high accuracy but negative sentiment, or good English-language performance with invisible cross-engine gaps. The PharmaGEO measurement framework tracks six axes, each measuring a distinct and non-redundant dimension of GEO health.

Axis 1: Visibility (Answer Rate)

Visibility — the percentage of structured prompts for which your brand is named at least once — is the entry-level GEO metric. A rate above 70% on a given platform is generally strong; below 50% indicates a structural gap. Visibility should be reported per engine, not blended. A brand with 80% visibility on ChatGPT and 40% on Gemini has a different problem from one with 60% on both — and the diagnosis and fix are different in each case.
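Per-engine reporting can be sketched as follows; the `(engine, answer_text)` tuple shape and the substring match are simplifying assumptions (a production pipeline needs brand alias handling):

```python
from collections import defaultdict

def visibility_by_engine(results, brand):
    """results: iterable of (engine, answer_text) pairs.
    Returns Answer Rate per engine -- reported separately, never blended."""
    tallies = defaultdict(lambda: [0, 0])  # engine -> [answers naming brand, total]
    for engine, text in results:
        tallies[engine][1] += 1
        if brand.lower() in text.lower():
            tallies[engine][0] += 1
    return {engine: 100 * hits / total for engine, (hits, total) in tallies.items()}
```

Keeping engines separate in the return value makes the 80/40 split visible instead of collapsing it into a meaningless 60% blend.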

Axis 2: Share of Voice (SOV)

SOV measures brand-token surface area as a proportion of total brand mentions in the answer set. Calculate it as: (brand token mentions / total brand token mentions across all answers in the prompt set) × 100. SOV should be tracked per engine, per query type (branded, unbranded, comparative, clinical evidence), and compared to the TA-level concentration benchmark described above.
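The formula translates directly into code. This sketch uses naive substring counting as the brand tokeniser, which is an assumption; real measurement needs alias lists (generic names, common misspellings) per brand:

```python
from collections import Counter

def share_of_voice(answers, brand, all_brands):
    """SOV = (brand token mentions / total brand token mentions
    across all answers in the prompt set) x 100."""
    mentions = Counter()
    for text in answers:
        low = text.lower()
        for b in all_brands:
            mentions[b] += low.count(b.lower())
    total = sum(mentions.values())
    return 100 * mentions[brand] / total if total else 0.0
```

Run the same function once per engine and once per query-type subset to get the per-engine, per-query-type breakdown.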

Axis 3: Accuracy index

The accuracy index measures the proportion of factual claims about your brand that are correct against the current label. It requires human review — a medical affairs reviewer reads each answer and codes each factual claim as accurate, inaccurate, or incomplete. An accuracy index below 80% should be treated as a compliance risk signal, not just a brand signal. Material inaccuracies about dosing, contraindications, or approval status can reach HCPs at scale with no opportunity for correction before harm occurs. Track accuracy by claim category: dosing accuracy, indication accuracy, safety accuracy, and mechanism accuracy. These have different root causes and different interventions.
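Aggregating reviewer codes into the index can be sketched as below; the verdict labels mirror the three codes above, the category keys are the four claim categories, and everything else about the data shape is illustrative:

```python
def accuracy_index(coded_claims):
    """coded_claims: list of (category, verdict) pairs coded by a medical
    affairs reviewer; verdict is 'accurate', 'inaccurate', or 'incomplete'.
    Only 'accurate' counts toward the index."""
    overall_hits = overall_total = 0
    per_category = {}
    for category, verdict in coded_claims:
        hit = int(verdict == "accurate")
        overall_hits += hit
        overall_total += 1
        h, n = per_category.get(category, (0, 0))
        per_category[category] = (h + hit, n + 1)
    return (100 * overall_hits / overall_total,
            {c: 100 * h / n for c, (h, n) in per_category.items()})
```

The per-category breakdown is what makes the metric actionable: a 60% overall score driven by dosing errors calls for a different intervention than one driven by stale approval-status claims.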

Axis 4: Sentiment score

Sentiment measures the tonal framing of your brand across answers, assessed on a five-point scale from strongly positive to strongly negative. A brand can be visible, accurate, and negatively framed simultaneously — typically when third-party content in the indexed ecosystem emphasises tolerability issues or access limitations. Sentiment is the metric most sensitive to what competitors and advocacy groups have published about your brand, not what you have published about yourself.

Axis 5: Source quality and citation diversity

Source quality measures whether the citations driving your brand's mentions come from authoritative indexed sources (society guidelines, peer-reviewed journals, regulatory documents) or from consumer content, forums, or low-authority aggregators. Citation diversity measures how many independent domains are driving your citations — a brand driven by a single high-authority source is more fragile than one backed by 10–15 independent sources across multiple archetype categories. Both metrics are available from Perplexity's explicit citation layer.
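Both measures reduce to set operations over citation URLs. A sketch, assuming a hand-maintained allow-list of authoritative domains (the list itself is an editorial decision, not part of the framework):

```python
from urllib.parse import urlparse

def citation_profile(citation_urls, authoritative_domains):
    """Counts independent domains behind a brand's citations and the
    share of those domains on the authoritative allow-list."""
    domains = {urlparse(u).netloc.removeprefix("www.") for u in citation_urls}
    authoritative = domains & set(authoritative_domains)
    share = 100 * len(authoritative) / len(domains) if domains else 0.0
    return {"independent_domains": len(domains), "authoritative_share": share}
```

A brand whose `independent_domains` count is 1 is fragile regardless of how authoritative that one source is; the 10-15 range described above is the resilience target.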

Axis 6: Competitive share

Competitive share of mention measures how often your brand is named relative to competitors in shared-category queries. If disease-state queries produce 150 total brand mentions and your brand accounts for 30, your competitive share is 20%. Track this per query type — comparative queries typically show lower competitive share for non-leaders, while disease-state queries show more balanced distribution because the AI is constructing a treatment landscape rather than executing a direct comparison.
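The worked example (30 of 150 mentions = 20%) generalises to a per-query-type breakdown; the nested-dict shape here is an assumption about how mention counts are stored:

```python
def competitive_share(mentions_by_query_type, brand):
    """mentions_by_query_type: {query_type: {brand: mention_count}}.
    Returns the brand's share of mention per query type, in percent."""
    shares = {}
    for query_type, counts in mentions_by_query_type.items():
        total = sum(counts.values())
        shares[query_type] = 100 * counts.get(brand, 0) / total if total else 0.0
    return shares

data = {"disease_state": {"BrandA": 30, "BrandB": 70, "BrandC": 50}}
print(competitive_share(data, "BrandA"))  # {'disease_state': 20.0}
```

Comparing the same brand's share across `comparative` and `disease_state` keys surfaces the leader-bias pattern described above.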

Sampling design: building a prompt set that reflects real-world queries

The six-axis framework is only as good as the prompt set driving the audit. The most common sampling error is a prompt set built around branded queries — queries that name your brand directly — where your brand is naturally most likely to appear. The real competitive risk is in unbranded disease-state queries, which represent the majority of real-world HCP search behaviour and the query type where AI most actively constructs the treatment landscape.

A robust GEO prompt set covers four query categories:

  • Unbranded disease-state queries: "What are the current treatment options for moderate-to-severe atopic dermatitis in adult patients?" These are the queries where the AI builds the competitive landscape without being asked about any specific brand.
  • Mechanism and class queries: "How do IL-4/IL-13 inhibitors work in atopic dermatitis?" These reveal class-level attribution — which brand gets treated as the representative example of the class.
  • Direct comparative queries: "How does Brand A compare to Brand B for [indication]?" Run in both orders. Score both which brand is treated as the reference point and the direction of the framing.
  • Clinical evidence queries: "What Phase 3 data supports [brand] in [indication]?" These reveal whether your key trial is being cited accurately and whether the source driving that citation is owned or third-party content.

The recommended cadence is quarterly deep audits (full prompt set across all target engines and languages) supplemented by monthly pulse checks (a 20-prompt subset covering visibility and accuracy). This provides a trend line rather than a point-in-time snapshot — important because monthly citation graph volatility runs at 59.3%, according to the Digital Bloom 2025 AI Citation and LLM Visibility Report.

Building the dashboard: the six-axis view

A functional GEO dashboard requires a consistent reporting format that translates six-axis data into clear strategic priorities. The key view is a spider chart — one axis per metric, plotted for each engine — which makes it immediately clear where the brand is strong, where it is weak, and how the profile differs across platforms.

Alongside the spider chart, the dashboard should include three trend lines: SOV over time (to detect whether content investments are moving the number), accuracy index over time (to catch label drift or third-party inaccuracy spreading in the indexed ecosystem), and competitive share over time per query type (to identify where competitors are gaining ground). These three trends, updated quarterly, provide the signal required to prioritise content investment, syndication effort, and medical education strategy in the AI answer layer.
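The quarterly trend computation behind those three lines can be sketched as follows; the data shape is assumed, and the values in the usage example are placeholders, not index figures:

```python
def trend_deltas(quarterly_series):
    """quarterly_series: {metric_name: [values, oldest -> newest]}.
    Returns latest value and quarter-over-quarter change per trend line."""
    return {
        metric: {"latest": values[-1],
                 "qoq_change": round(values[-1] - values[-2], 1)}
        for metric, values in quarterly_series.items()
        if len(values) >= 2
    }

print(trend_deltas({"sov": [14.0, 15.5, 17.2], "accuracy": [88.0, 84.0]}))
```

A negative `qoq_change` on the accuracy line is the early-warning signal for label drift or spreading third-party inaccuracy.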

For brand teams building this capability for the first time: start with the AR/SOV pair on a single engine, establish the TA-level concentration baseline, and add axes quarterly as the measurement infrastructure matures. The goal is not a perfect dashboard on day one — it is a dashboard that improves decision-making on day one and becomes more comprehensive over time.

Want a real audit on your brand? Request a sample report or get the full PharmaGEO Playbook.