AI Visibility Intelligence

How AI Systems Decide What to Quote (and Ignore)

You publish good content. It ranks. It gets shared. AI answers ignore you. The instinct is to chase “authority.” More links. More brand mentions. That’s almost always the wrong fix. You’re not losing a popularity contest. You’re failing a pipeline.

~1% of users click cited sources in Google AI summaries (Pew Research, 2025).
50–90% of LLM citations fail to fully support the claims they’re attached to (Nature Communications, 2025).
Over 60% of tested AI search queries returned incorrect answers (Tow Center / CJR, 2025).

Three terms decide whether you get cited.

Most AI answer systems that show citations behave like a retrieval pipeline. They assemble candidates, extract claims, then select what to present. This architecture is well-documented: Lewis et al. defined RAG as combining “a pre-trained retriever with a generator” [3], Karpukhin et al. showed QA systems “rely on efficient passage retrieval to select candidate contexts” [4], and Nakano et al.’s WebGPT demonstrated that citation-bearing systems “collect references while browsing” web pages [5].

That retrieve-then-generate pattern maps onto three failure points.

STAGE 01: Eligibility
Can the system access your page at all? Crawlable. Indexable. Not blocked. Not paywalled. Not a rendering mess. If you fail here, nothing downstream matters. You don’t exist in the candidate set. The system doesn’t penalize you — it simply never finds you.
Failure = Invisible. Nothing downstream matters.

STAGE 02: Extractability
Can it lift a clean, bounded answer from your page? Clear headings. Tight definitions. Copyable blocks. Systems split documents into passage-sized chunks and evaluate them individually [4]. If your page buries the answer in paragraph 12 of a 2,500-word narrative, the system moves to the next candidate. Your content doesn’t get a second chance.
Failure = Hard to quote. The system picks a cleaner competitor.

STAGE 03: Selection
Does it pick your page over alternatives? This is where trust, canonicity, relevance, and duplication control matter. It is the only stage where anything resembling “authority” plays a role. It’s the last filter, not the first. Factors include source trust, content freshness, entity consistency, and whether the cited page is the canonical version of that answer.
Failure = Outcompeted. But at least you’re in the race.

Critical insight: most failures happen at stages one and two. You’re not losing a competition. You’re getting filtered out before the competition starts.

How the retrieval pipeline actually works.

Keep this four-step flow in your head while you diagnose pages.

1. Query enters the system
The model’s retrieval layer translates the user’s question into a search against its available corpus. In RAG systems, this typically uses dense vector retrieval to find candidate passages [3][4].

2. Candidate set forms
The system pulls pages it can access. Anything blocked by robots.txt, stuck behind a login wall, tagged noindex, or buried in JavaScript that doesn’t render server-side never enters this set. These pages are not penalized — they are absent. Robots rules control crawling; noindex controls indexing. They break different things [6][7].

3. Passages get extracted
From the candidate pages, the system identifies answer-shaped content: passages, claims, definitions, data points. Pages with clean headings and bounded answer blocks make extraction fast. Pages where the answer lives halfway through a narrative get skipped; when extraction is hard, the system moves to the next candidate.

4. Citation gets assigned
The system picks the best extracted passage and attributes it. Factors include source trust, content freshness, entity consistency, and whether the cited page is the canonical version. Even this step is unreliable: 50–90% of LLM citations are “not fully supported” by the sources cited [8].
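The four steps above can be sketched as a toy pipeline. This is an illustration only, not any vendor’s implementation: the word-overlap scorer and 40-word chunks are assumptions standing in for the dense vector retrieval real systems use [3][4].

```python
# Toy sketch of retrieve -> extract -> select. The overlap scorer and
# 40-word chunks are illustrative assumptions, not a real system.

def chunk(text, size=40):
    """Stage 2: split a document into passage-sized chunks (word windows)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query, passage):
    """Naive relevance: fraction of query terms present in the passage."""
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / len(q)

def answer(query, corpus):
    """corpus: {url: page_text}. Only accessible pages are in the corpus
    at all (stage 1). Returns (best_passage, cited_url) (stages 3-4)."""
    candidates = [(score(query, p), p, url)
                  for url, text in corpus.items()
                  for p in chunk(text)]
    best = max(candidates, key=lambda c: c[0])
    return best[1], best[2]

corpus = {
    "https://example.com/guides/soc-2":
        "SOC 2 Type II is an audit that evaluates controls over a period of time.",
    "https://example.com/blog/story":
        "Years ago our founder wondered about compliance while hiking in the Alps.",
}
passage, url = answer("What is SOC 2 Type II", corpus)
```

Note what the sketch makes obvious: the story page never had a chance, because no passage in it looks like the answer. That is extractability failing before selection even starts.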

Four tests to isolate your bottleneck.

You don’t need pep talks. You need checks that isolate the problem.

TEST 01
Indexation & canonical inspection
Confirm eligibility and canonical consistency.

Run site:yourdomain.com/target-page in Google. No result means no index. Note: Google warns the site: operator is “not guaranteed” to show all indexed pages [9].

In Google Search Console, use URL Inspection to check index status and the canonical Google selected [10]. Look for: noindex detected, blocked by robots.txt, duplicate without user-selected canonical, crawl anomaly.

Working theory: if Google can’t settle on a canonical, other retrieval systems may also treat your site as duplicative noise.
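The two most common eligibility blockers from Test 01 can be checked offline with a short script. This is a minimal sketch, assuming you already have the page’s raw HTML and the site’s robots.txt in hand; the function name and return format are illustrative, not a real tool.

```python
# Offline sketch of the Test 01 blockers: robots.txt disallow rules
# and a meta robots noindex tag. Helper name and output are assumptions.
import re
from urllib.robotparser import RobotFileParser

def eligibility_issues(path, html, robots_txt, agent="*"):
    issues = []
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(agent, path):
        issues.append("blocked by robots.txt")   # crawling blocked [6]
    # noindex blocks indexing, a separate mechanism from robots rules [7].
    if re.search(r'<meta[^>]+name=["\']robots["\'][^>]+noindex', html, re.I):
        issues.append("noindex meta tag")
    return issues

robots = "User-agent: *\nDisallow: /private/"
html = '<html><head><meta name="robots" content="noindex"></head></html>'
print(eligibility_issues("/private/page", html, robots))
# Should report both blockers for this example page.
```

Remember the X-Robots-Tag HTTP header can also carry noindex, so check response headers too, not just the HTML.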

TEST 02
Retrieval check with a query set
Measure whether you appear as a cited source for queries that matter.

Pick 20 representative queries using real buyer language. Run each in ChatGPT, Perplexity, Google AI Overviews, and Bing Copilot. Log outcomes per query per surface:

Cited = your URL appears as a source. Mentioned = brand referenced but not linked. Absent = nothing.

Do this monthly. Treat it like a visibility KPI, not a one-time audit. Absent everywhere points to eligibility. Retrieved on some but not others suggests surface-specific crawling differences. Retrieved but never quoted points to extractability.
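The Cited / Mentioned / Absent rule above is mechanical enough to encode. A minimal sketch, assuming you capture each surface’s answer text and its list of cited URLs; the function and its arguments are illustrative, not any tool’s API.

```python
# Sketch of the per-query, per-surface logging rule from Test 02.
# Function name and argument shapes are illustrative assumptions.
def classify(answer_text, cited_urls, brand, domain):
    """Return Cited / Mentioned / Absent for one query on one surface."""
    if any(domain in u for u in cited_urls):
        return "Cited"        # your URL appears as a source
    if brand.lower() in answer_text.lower():
        return "Mentioned"    # brand referenced but not linked
    return "Absent"           # nothing

print(classify("Acme is one option for this.", [], "Acme", "acme.com"))
# prints Mentioned
```

Run this over all 20 queries and four surfaces each month and you have an 80-cell log instead of a vibe.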

TEST 03
15-second extractability test
See whether your page produces a clean, liftable answer block.

Open your page. Start a timer. Find the single best answer to the query your page targets. Copy and paste it into a blank document.

If it takes longer than 15 seconds, your content is not extractable. The answer is buried, split across sections, or tangled with unrelated material.

Additional checks: Is the answer near the top, under a heading that reflects the query? Is it self-contained in 2–4 sentences? Are entity names consistent throughout?

TEST 04
Duplication check across your own site
Identify selection killers you created yourself.

Search your own site for the core intent behind your target query. Count how many pages could plausibly answer it. Look for: five “What is X” posts with slightly different framing, three comparison pages that all claim to be definitive, docs and blog posts both defining the same term.

If multiple URLs can answer the same question, you’re forcing the system to deduplicate you. Google confirms: “the canonical page will be crawled most regularly” while “duplicates are crawled less frequently” [11].
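The tally in Test 04 is easy to automate once you declare one intent per page. A sketch under that assumption; the intent labels and helper name are hypothetical, and in practice you would assign intents by hand during the audit.

```python
# Sketch of the Test 04 duplication tally: flag any intent claimed by
# more than one URL. Intent labels are hand-assigned, illustrative values.
from collections import defaultdict

def duplicate_intents(url_to_intent):
    pages_by_intent = defaultdict(list)
    for url, intent in url_to_intent.items():
        pages_by_intent[intent].append(url)
    # Only clusters with 2+ pages force the system to deduplicate you.
    return {i: urls for i, urls in pages_by_intent.items() if len(urls) > 1}

site = {
    "/blog/what-is-scim": "define scim",
    "/docs/scim": "define scim",
    "/guides/saml-vs-scim": "compare saml scim",
}
print(duplicate_intents(site))
```

Every cluster this returns is a merge/redirect candidate for the P1 consolidation step.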


Failure mode reference table.

Map your symptom to the pipeline stage that’s broken.

Symptom | Stage | Likely cause | First fix
Absent in all AI surfaces | Eligibility | Blocked by robots, noindex, login wall, or JS rendering failure | Audit robots.txt, noindex tags, rendering
Indexed but never cited | Extractability | Answer is buried, split across sections, or tangled with unrelated content | Add a 2–4 sentence citation target near the top
Cited on some surfaces, absent on others | Eligibility | Surface-specific crawling/rendering differences | Check crawl access per AI system; server-render critical content
Mentioned but not linked | Extractability | System knows your brand but can’t find a clean passage to cite | Create explicit, bounded answer blocks
Competitor cited instead of you | Selection | Competitor has cleaner canonical, fresher content, or stronger trust signals | Consolidate duplicates; add sourcing and update signals
Wrong page from your site cited | Selection | Multiple pages compete for the same intent; system picked the weaker variant | Merge/redirect to one canonical page per intent
Cited but misquoted | Extractability | Ambiguous passage boundaries; entity naming inconsistency | Tighten definition blocks; standardize terminology

The fix list, in priority order.

Work top-down. If you skip step 1, steps 2 through 5 waste time.

P0 — HIGHEST: Remove eligibility blockers
  • Audit robots rules, noindex tags, and login gating on your best answer pages
  • Reduce reliance on client-side rendering for core explanatory content
  • Fix broken canonicals and parameterized duplicates

Eligibility determines whether you enter the candidate set at all. Vercel’s analysis found that “none of the major AI crawlers currently render JavaScript” [14]. If your critical content depends on client-side rendering, it may be invisible to the fastest-growing class of crawlers.

Success criteria: URL Inspection shows “indexable” and a stable canonical [10]; a fetch test shows meaningful content in the response body; your “Absent” count drops in the 20-query log.
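The fetch test can be approximated offline: strip tags from the raw HTML (without executing any JavaScript, since AI crawlers don’t [14]) and check that your key facts survive. The phrase list and helper name here are illustrative assumptions.

```python
# Rough fetch-test sketch for P0: does the raw response body, with no
# JavaScript executed, contain the text you need cited? Phrase list
# and function name are illustrative assumptions.
import re

def server_rendered(html, key_phrases):
    """True if every key phrase appears in the tag-stripped raw HTML."""
    text = re.sub(r"<[^>]+>", " ", html)   # strip tags; no JS runs here
    return all(p.lower() in text.lower() for p in key_phrases)

shell = '<div id="app"></div><script src="/bundle.js"></script>'
static = "<h1>What is SOC 2</h1><p>SOC 2 is an audit framework.</p>"
print(server_rendered(shell, ["audit framework"]))   # JS shell: phrase missing
print(server_rendered(static, ["audit framework"]))  # static page: phrase present
```

If your page behaves like the shell above, crawlers that don’t render JavaScript see a thin wrapper, not your answer.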

P1 — HIGH: Create citation targets
  • Add a 2–4 sentence definition block near the top of the page
  • Add a short “best answer” paragraph that directly answers the query
  • Add a small table of comparisons or a checklist with literal labels

Retrieval systems extract passages, not pages [4][5]. You’re reducing extraction cost and the risk of misquoting. Given that 50–90% of LLM citations fail to fully support their claims [8], making your content unambiguous isn’t optional — it’s how you become the source the system can safely quote.

P1 — HIGH: Make one page the canonical answer per intent
  • Identify clusters of near-duplicate intent across your site
  • Merge content where it’s truly the same question
  • Redirect old pages to the canonical page
  • Use consistent canonical signals: redirects, rel=canonical, sitemap [15]

Selection includes duplication control. When multiple URLs compete, you dilute signals. Google confirms duplicate pages are “crawled less frequently” while the canonical gets priority [11]. You also increase the chance the system picks a weaker version of your own content.
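One quick audit for this step: every page in a duplicate cluster should declare the same rel=canonical target [15]. A sketch that checks that offline from each page’s raw HTML; the regex is a simplification (it assumes rel comes before href in the tag) and the helper names are illustrative.

```python
# Sketch of a canonical-consistency check for a duplicate cluster.
# Simplified regex assumes rel appears before href; helper names are
# illustrative assumptions, not a real tool.
import re

def declared_canonical(html):
    m = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)',
        html, re.I)
    return m.group(1) if m else None

def consistent_cluster(pages_html):
    """True if every page declares the same, non-missing canonical."""
    targets = {declared_canonical(h) for h in pages_html}
    return len(targets) == 1 and None not in targets

cluster = [
    '<link rel="canonical" href="https://example.com/guides/soc-2">',
    '<link rel="canonical" href="https://example.com/guides/soc-2">',
]
print(consistent_cluster(cluster))
```

A cluster that disagrees with itself, or where any page omits the tag, is exactly the “duplicate without user-selected canonical” state you saw in Test 01.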

P2 — MEDIUM: Add extractable structure
  • Add FAQ sections for common objections and “how do I” queries
  • Use step lists for processes
  • Add “If X then Y” decision rules for comparison pages
  • Use literal headings, not clever ones

Structure creates boundaries. Boundaries create extractable chunks. FAQs mirror how queries are phrased. Each section should control one answer.

P2 — MEDIUM: Strengthen selection signals
  • Cite original sources when you make factual claims
  • Add an author and editorial policy consistent across the site
  • Add update notes when content changes
  • Standardize entity naming (products, features, standards, competitors)

Google’s Search Quality Evaluator Guidelines identify Trust as “the most important member” of E-E-A-T [18]. Selection favors sources that look stable, accountable, and consistent. Original references reduce risk. Update notes reduce staleness ambiguity.


Three scenarios that explain most “why aren’t we cited” cases.

SCENARIO 01
You rank, but your answer is buried under story time
The classic thought leadership trap. Your intro is a memoir. Your definition appears halfway down the page. Humans tolerate that when they’re already committed. Retrieval systems do not. Your 800-word preamble isn’t context — it’s noise.
Fix: Write a literal answer block first, then earn the right to tell the story.
SCENARIO 02
You have five near-duplicate pages, so the system picks someone else
You create “Best alternatives,” “Competitor vs Us,” “Competitor comparison,” “Switch from Competitor,” and “Competitor pricing.” All partially answer the same intent. The system deduplicates — and might not pick yours.
Fix: Consolidate to one canonical “answer page” per intent. The rest support or redirect.
SCENARIO 03
Your “helpful” interactive page is JS-heavy and invisible
Interactive calculators and configurators can be great for conversion. They can also be invisible. None of the major AI crawlers currently render JavaScript [14]. If core text only appears after client-side rendering, many systems see a thin shell.
Fix: Server-render the core explanation. Keep the interactive layer as an enhancement.

Checklists you can hand to your team today.


Eligibility
  • Robots rules allow crawl for citation-target pages
  • No noindex on pages you want cited
  • Page accessible without login or fragile session state
  • Canonical is correct and consistent across duplicates
  • Critical text available without complex JS rendering
Extractability
  • Best answer appears in first screen
  • One literal H1 that matches query intent
  • Tight definition block (2–4 sentences)
  • Lists and tables are copyable and self-contained
  • Images are not doing the job of text for key facts
Selection
  • One canonical page per intent; duplicates redirect
  • Original sources referenced when making claims
  • Consistent author and update signals across the site
  • Stable entity naming and terminology
  • Page is clearly better for the query than adjacent pages on own domain

A reproducible tracking method.

Build the query set. Run it monthly. Track movement across stages: Absent → Mentioned → Cited.

AI Citation Tracking Sheet — Example (run monthly)

Query | Intent | Target URL | ChatGPT | Perplexity | Google AIO
What is SOC 2 Type II | Define | /guides/soc-2 | Cited | Cited | Mentioned
SAML vs SCIM difference | Compare | /guides/saml-vs-scim | Absent | Mentioned | Absent
How to calculate seat utilization | How-to | /resources/seat-calc | Cited | Absent | Cited
Best alternative to [Competitor] | Compare | /vs/competitor | Absent | Absent | Absent

Query set structure (20 queries): 10 definitional (What is X, How does Y work), 5 comparison (A vs B, alternatives), 5 procedural (How to do Z, checklist for Z). Treat this like a visibility KPI, not a one-time audit.
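Turning the monthly log into a stage diagnosis follows the rules from Test 02: absent everywhere points to eligibility, mentioned-but-never-cited points to extractability, anything cited means you’re competing on selection. A sketch of that tally; the row shape and thresholds are illustrative assumptions.

```python
# Sketch of turning the tracking log into a stage hint, per the Test 02
# rules. Row shape (query, surface, outcome) is an illustrative assumption.
from collections import Counter

def diagnose(rows):
    """rows: (query, surface, outcome) tuples, outcome in
    {'Cited', 'Mentioned', 'Absent'}. Returns (tally, stage hint)."""
    tally = Counter(outcome for _, _, outcome in rows)
    if tally["Cited"] == 0 and tally["Mentioned"] == 0:
        hint = "eligibility"      # absent everywhere: not in the candidate set
    elif tally["Cited"] == 0:
        hint = "extractability"   # known, but never quoted
    else:
        hint = "selection"        # in the race; compete on trust and canonicals
    return dict(tally), hint

log = [
    ("SAML vs SCIM difference", "ChatGPT", "Absent"),
    ("SAML vs SCIM difference", "Perplexity", "Mentioned"),
    ("SAML vs SCIM difference", "Google AIO", "Absent"),
]
print(diagnose(log))
```

Run it per query or per URL, and the hint tells you which checklist below to start with.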

Myths to kill before they waste your quarter.

MYTH “Schema is a magic key.”
It’s labeling, not eligibility. Google’s own guidelines state: “Google does not guarantee that structured data will show up” in enhanced results [21].
Do instead: Fix crawl, index, and answer blocks first.
MYTH “Longer content wins.”
Unbounded answers lose. Extraction prefers tight passages [4]. Long is fine if it’s well segmented.
Do instead: Add bounded citation targets near the top.
MYTH “Backlinks guarantee AI citations.”
Links can help selection. Google calls them “one of the factors” [20]. They don’t fix extractability.
Do instead: Make your best answer easy to lift, then worry about authority.
MYTH “If we’re authoritative, AI will find us.”
Authority doesn’t resurrect blocked pages. No candidate set, no citation.
Do instead: Run the eligibility audit first.
MYTH “We just need to mention the keyword more.”
Entity clarity beats repetition. Google notes: “You likely don’t want a page with the word ‘dogs’ hundreds of times” [20]. Systems look for clean definitions and consistent naming, not density.
Do instead: Tighten terminology and headings around real questions.
Bottom line: start with an eligibility audit — fix robots rules, noindex mistakes, canonical drift, and fragile rendering. Then create extractable answer blocks. Then consolidate to one canonical page per intent. Ignore hype tactics. Ignore “AI hacks.” Ignore tool-chasing before your plumbing works. If you want AI citations, stop treating visibility like vibes. Treat it like a pipeline.

Not sure where your pages are failing?

The Visibility Scan diagnoses eligibility, extractability, and selection issues across your site in 48 hours — with dev-ready tickets to fix them.

References
  • [1] Pew Research Center. (2025). Do people click on links in Google AI summaries? pewresearch.org
  • [2] Tow Center / Columbia Journalism Review. (2025). AI Search Has a Citation Problem. cjr.org
  • [3] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
  • [4] Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
  • [5] Nakano, R., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv.
  • [6] Koster, M., et al. (2022). Robots Exclusion Protocol. RFC 9309, IETF.
  • [7] Google Search Central. Block Search indexing with noindex.
  • [8] Wu, K., et al. (2025). Automated framework for assessing LLM citations. Nature Communications, 16, 3615.
  • [9] Google Search Central. site: search operator documentation.
  • [10] Google Search Console Help. URL Inspection tool.
  • [11] Google Search Central. What is canonicalization.
  • [12] Google Search Central. Google Search Essentials.
  • [13] Google Search Central. JavaScript SEO basics.
  • [14] Vercel. (2024). The rise of the AI crawler.
  • [15] Google Search Central. How to specify a canonical URL.
  • [16] Google Search Central. Fix canonicalization issues.
  • [17] MDN Web Docs. <meta name="robots">.
  • [18] Google. (2025). Search Quality Evaluator Guidelines.
  • [19] Google Search Central. Creating helpful, reliable, people-first content.
  • [20] Google Search. How Search works: ranking results.
  • [21] Google Search Central. General structured data guidelines.