
Your robots.txt Isn't Blocking What You Think: A 2026 LLM Crawler Decoder

Blocking GPTBot doesn't stop ChatGPT from citing you, and Google-Extended isn't a crawler at all. Here's the per-bot robots.txt policy 2026 LLM crawlers actually require.

A single blanket Disallow line in robots.txt no longer matches how 2026 LLM vendors actually crawl the web. OpenAI, Anthropic, Perplexity, and Google each operate two or three distinct user-agents that do different jobs — training, search indexing, and user-triggered fetch — and you need to allow or block each one based on whether you want citations, training data inclusion, or neither. The right approach is a deliberate per-bot policy: keep search-indexing bots if you want AI citations, block training-only bots if you don't want your content reused for model training, and use Google-Extended as a separate switch for Gemini.

The 2026 LLM crawler landscape

Ten or more LLM-related user-agents now hit your robots.txt from four main vendors, and they fall into three independent job categories. Conflating them is what makes most blanket rules useless.

  • Training crawlers fetch content to build or update foundation models. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), and content fetched by Googlebot when Google-Extended is allowed.
  • Search-indexing crawlers build the live index that powers AI search citations. Examples: OAI-SearchBot (ChatGPT Search), Claude-SearchBot (Claude with web search), PerplexityBot.
  • User-triggered fetchers fire only when an end user pastes a URL or asks about a specific page inside the chat. Examples: ChatGPT-User, Claude-User, Perplexity-User.

The most common 2026 misconception is treating these as the same actor. Blocking GPTBot does not block OAI-SearchBot, because they are separate user-agents with separate purposes — one trains models, the other powers citations in ChatGPT Search. Google-Extended adds another wrinkle: it is a robots.txt product token, not a crawler, so disallowing it controls whether Googlebot's already-fetched content can be reused to train Gemini, without affecting Google Search indexing at all.

How robots.txt is interpreted in 2026

The Robots Exclusion Protocol was finally standardized as RFC 9309 in 2022, ending decades of ambiguity. All four major AI vendors publicly commit to honoring it. The rules that matter for LLM crawlers:

  • User-agent matching is case-insensitive and keyed to the crawler's product token, not its full user-agent string: GPTBot and gptbot both match.
  • Longest-match path wins when Allow and Disallow conflict within a group.
  • A specific User-agent group overrides the * group entirely — there is no inheritance. If you have a block for User-agent: GPTBot, you must list every directive that applies; rules under User-agent: * are ignored once a specific block matches (the sketch after this list shows this and the longest-match rule in action).
  • Empty Disallow allows everything; Disallow: / blocks everything in that group.
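
To make the precedence rules concrete, here is a minimal Python sketch of how an RFC 9309-style parser picks a group and resolves Allow/Disallow conflicts. It is illustrative only, not a production parser: the group names and paths are hypothetical, and real crawlers also support * wildcards and $ end anchors, which this skips.

# Minimal sketch of RFC 9309 group selection and longest-match precedence.
# Illustrative only: no wildcard (*) or end-anchor ($) support, and the
# groups and paths below are hypothetical.

def is_allowed(groups: dict, agent: str, path: str) -> bool:
    # A specific group overrides '*' entirely; there is no inheritance.
    rules = groups.get(agent.lower(), groups.get("*", []))
    verdict, matched = "allow", ""          # allowed by default when nothing matches
    for directive, rule_path in rules:
        if not path.startswith(rule_path):
            continue
        # Longest matching path wins; on a tie, Allow wins.
        if len(rule_path) > len(matched) or (
            len(rule_path) == len(matched) and directive == "allow"
        ):
            verdict, matched = directive, rule_path
    return verdict == "allow"

groups = {
    "*":             [("disallow", "/members/")],
    "gptbot":        [("disallow", "/")],
    "oai-searchbot": [("allow", "/"), ("disallow", "/drafts/")],
}

print(is_allowed(groups, "GPTBot", "/docs/intro"))             # False: GPTBot group blocks /
print(is_allowed(groups, "OAI-SearchBot", "/docs/intro"))      # True: separate group, its own rules
print(is_allowed(groups, "OAI-SearchBot", "/members/pricing")) # True: no inheritance from '*'
print(is_allowed(groups, "OAI-SearchBot", "/drafts/post"))     # False: longest match is the Disallow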

Common pitfalls that quietly break LLM blocks:

  • Misordered groups — each group must be a contiguous block. If several User-agent: lines are meant to share one rule set, they must sit directly above it; placing directives between them splits them into separate groups, and each agent gets only the directives directly under it.
  • Missing newline at EOF — some parsers truncate the final directive.
  • Mid-directive comments — Disallow: /private # secret stuff works, but Disallow: /private/# secret is risky: many parsers only treat # as a comment start when it is preceded by whitespace. Always leave a space before an inline comment.
  • Wrong assumptions about trailing slashes — Disallow: /docs blocks /docs, /docs/foo, and even /documentation, because paths are matched as prefixes; when Allow and Disallow both match, precedence goes to the longest match, not to slash counting (see the snippet after this list).
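
The prefix behavior is easy to confirm with Python's standard-library urllib.robotparser. Treat it as a smoke test only: the module follows the older first-match precedence rather than RFC 9309 longest-match, and the one-rule file below is a made-up example.

import urllib.robotparser

# Disallow: /docs is a prefix match, not a directory match.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /docs",
])
rp.modified()  # can_fetch() answers False for everything until a fetch time is recorded

print(rp.can_fetch("GPTBot", "https://example.com/docs"))           # False
print(rp.can_fetch("GPTBot", "https://example.com/docs/guide"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/documentation"))  # False: still a prefix hit
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))      # True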

Decoder table: what each directive actually blocks

This is the lookup most teams need. Each row maps a documented user-agent to its purpose and what blocking it costs.

User-agent | Vendor | Purpose | Blocking removes you from…
GPTBot | OpenAI | Training data crawl | OpenAI model training reuse
OAI-SearchBot | OpenAI | ChatGPT Search index | ChatGPT Search citation surface
ChatGPT-User | OpenAI | User-pasted URL fetch | URL summaries inside ChatGPT
ClaudeBot | Anthropic | General crawl / training | Anthropic model training reuse
Claude-SearchBot | Anthropic | Claude web search citations | Claude citation surface
Claude-User | Anthropic | User-triggered fetch | URL summaries inside Claude
PerplexityBot | Perplexity | Index for Perplexity answers | Perplexity citation surface
Perplexity-User | Perplexity | User-triggered fetch | URL summaries inside Perplexity
Google-Extended | Google | Token controlling AI training reuse | Gemini training reuse (Search unaffected)
Googlebot | Google | Web Search crawl | Google Search index

Two worked examples make the trap concrete:

  • Disallow: / under User-agent: GPTBot still allows ChatGPT Search to cite you, because OAI-SearchBot is a separate user-agent with its own rules.
  • Blocking Claude-SearchBot removes you from Claude's citation surface but does not stop ClaudeBot from fetching for training — those are independent decisions.

Copy-paste templates by strategy

Pick the strategy that matches your goals; copy verbatim and tune paths.

Strategy A — citation-friendly (recommended for content sites)

# Allow AI search and on-demand fetch; block training-only bots.
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Strategy B — selective (block training, gate private paths)

Useful when you have a paid section under /members/ that should never be cited or trained on:

User-agent: *
Disallow: /members/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Strategy C — full opt-out (block everything AI)

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

A common publisher scenario: a SaaS docs site that wants developers asking "how do I use library X" to be cited from its docs but does not want raw docs reused to train competitor models. Strategy A is the answer — keep the three search bots open, close GPTBot, ClaudeBot, and Google-Extended.

Verification: confirm your policy actually works

Writing the directives is the easy part. Confirming they take effect is where most teams skip steps.

  • Check the file itself. Fetch https://yoursite/robots.txt exactly as a bot would — no cookies, no auth — and verify it returns 200 with Content-Type: text/plain. CDN edges occasionally serve cached HTML 404 pages on this path. The script after this list automates this check and a per-bot spot test.
  • Match user-agent strings precisely. Vendors publish exact strings: OpenAI's are listed at their crawlers page, Anthropic's at its bot policy article, and Perplexity's at docs.perplexity.ai/guides/bots along with the IP ranges it publishes for verification. Server logs should show entries like Mozilla/5.0 ... GPTBot/1.2 when OpenAI's training crawl visits.
  • Watch the log for the user-agents you blocked. If you see them succeeding with 200 responses on disallowed paths, your block is not matching. The most common cause is a stray rule in User-agent: * that the specific block does not duplicate — remember, no inheritance.
  • Review every quarter and after vendor announcements. New crawlers get launched periodically; the list above is current for 2026 Q2 and will grow. Set a calendar reminder, or check our blog for ongoing tracking.
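
A small script can automate the fetch check and the per-bot test in one pass. It is a sketch, not an official tool: the site URL, sample path, bot list, and the robots-check user-agent string are placeholders, and urllib.robotparser uses first-match precedence rather than RFC 9309 longest-match, so spot-check anything surprising against the file itself.

import urllib.request
import urllib.robotparser

SITE = "https://yoursite.example"          # placeholder: your own origin
SAMPLE_PATH = "/docs/getting-started"      # placeholder: a page you care about
BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]  # Google-Extended is a token, not a crawler, so it is not tested here

# 1. Fetch robots.txt with no cookies or auth, the way a crawler would.
req = urllib.request.Request(f"{SITE}/robots.txt", headers={"User-Agent": "robots-check/0.1"})
with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    print("status:", resp.status)
    print("content-type:", resp.headers.get("Content-Type"))

# 2. Ask the stdlib parser what each bot may fetch on a sample path.
rp = urllib.robotparser.RobotFileParser()
rp.parse(body.splitlines())
rp.modified()  # record a fetch time so can_fetch() will answer

for bot in BOTS:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}{SAMPLE_PATH}") else "blocked"
    print(f"{bot:<18} {verdict}")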

If a block is being ignored despite a correct directive, confirm the exact user-agent token in vendor docs — case is irrelevant, spelling and hyphenation are not — then escalate via the vendor's support channel. Every major AI vendor in this list has a published path for that.
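
To find those cases in the first place, a quick log scan helps. The sketch below assumes a combined-format access log at a hypothetical path and a hand-maintained map of disallowed prefixes per bot; both are placeholders to adjust to your setup and your actual robots.txt policy.

import re

LOG_PATH = "/var/log/nginx/access.log"   # placeholder: your access log
# Hand-maintained mirror of your robots.txt policy: bot token -> disallowed prefixes.
DISALLOWED = {
    "GPTBot": ["/"],
    "ClaudeBot": ["/"],
    "OAI-SearchBot": ["/members/"],
}

# Combined log format: "METHOD PATH HTTP/x" status size "referer" "user-agent"
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]+" (?P<status>\d{3}) .* "(?P<ua>[^"]*)"$')

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if not m:
            continue
        path, status, ua = m.group("path"), m.group("status"), m.group("ua")
        for bot, prefixes in DISALLOWED.items():
            # Case-insensitive token match, like the crawlers themselves use.
            if bot.lower() in ua.lower() and status == "200":
                if any(path.startswith(p) for p in prefixes):
                    print(f"{bot} fetched disallowed path {path} -> check your robots.txt groups")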

A blanket Disallow: / is a 2019 answer to a 2026 problem. Treat each crawler as a separate decision, write the policy once with comments, and revisit it quarterly. GEON helps content teams track how AI search engines actually treat their sites, but the robots.txt itself is yours to own.

Deniz

Content & GEO Strategy