
Your robots.txt Isn't Blocking What You Think: A 2026 LLM Crawler Decoder

Blocking GPTBot doesn't stop ChatGPT from citing you, and Google-Extended isn't a crawler at all. Here's the per-bot robots.txt policy 2026 LLM crawlers actually require.

A single blanket Disallow line in robots.txt no longer matches how 2026 LLM vendors actually crawl the web. OpenAI, Anthropic, Perplexity, and Google each operate two or three distinct user-agents that do different jobs — training, search indexing, and user-triggered fetch — and you need to allow or block each one based on whether you want citations, training data inclusion, or neither. The right approach is a deliberate per-bot policy: keep search-indexing bots if you want AI citations, block training-only bots if you don't want your content reused for model training, and use Google-Extended as a separate switch for Gemini.

The 2026 LLM crawler landscape

Ten or more LLM-related user-agents now hit your robots.txt from four main vendors, and they fall into three independent job categories. Conflating them is what makes most blanket rules useless.

  • Training crawlers fetch content to build or update foundation models. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), and content fetched by Googlebot when Google-Extended is allowed.
  • Search-indexing crawlers build the live index that powers AI search citations. Examples: OAI-SearchBot (ChatGPT Search), Claude-SearchBot (Claude with web search), PerplexityBot.
  • User-triggered fetchers fire only when an end user pastes a URL or asks about a specific page inside the chat. Examples: ChatGPT-User, Claude-User, Perplexity-User.

The most common 2026 misconception is treating these as the same actor. Blocking GPTBot does not block OAI-SearchBot, because they are separate user-agents with separate purposes — one trains models, the other powers citations in ChatGPT Search. Google-Extended adds another wrinkle: it is a robots.txt product token, not a crawler, so disallowing it controls whether Googlebot's already-fetched content can be reused to train Gemini, without affecting Google Search indexing at all.

How robots.txt is interpreted in 2026

The Robots Exclusion Protocol was finally standardized as RFC 9309 in 2022, ending decades of ambiguity. All four major AI vendors publicly commit to honoring it. The rules that matter for LLM crawlers:

  • User-agent matching is case-insensitive and keyed to the crawler's product token, not its full user-agent string: GPTBot and gptbot both match.
  • Longest-match path wins when Allow and Disallow conflict within a group.
  • A specific User-agent group overrides the * group entirely — there is no inheritance. If you have a block for User-agent: GPTBot, you must list every directive that applies; rules under User-agent: * are ignored once a specific block matches (the sketch after this list shows this and the longest-match rule in action).
  • Empty Disallow allows everything; Disallow: / blocks everything in that group.
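
To make the precedence rules concrete, here is a minimal Python sketch of how an RFC 9309-style parser picks a group and resolves Allow/Disallow conflicts. It is illustrative only, not a production parser: the group names and paths are hypothetical, and real crawlers also support * wildcards and $ end anchors, which this skips.

# Minimal sketch of RFC 9309 group selection and longest-match precedence.
# Illustrative only: no wildcard (*) or end-anchor ($) support, and the
# groups and paths below are hypothetical.

def is_allowed(groups: dict, agent: str, path: str) -> bool:
    # A specific group overrides '*' entirely; there is no inheritance.
    rules = groups.get(agent.lower(), groups.get("*", []))
    verdict, matched = "allow", ""          # allowed by default when nothing matches
    for directive, rule_path in rules:
        if not path.startswith(rule_path):
            continue
        # Longest matching path wins; on a tie, Allow wins.
        if len(rule_path) > len(matched) or (
            len(rule_path) == len(matched) and directive == "allow"
        ):
            verdict, matched = directive, rule_path
    return verdict == "allow"

groups = {
    "*":             [("disallow", "/members/")],
    "gptbot":        [("disallow", "/")],
    "oai-searchbot": [("allow", "/"), ("disallow", "/drafts/")],
}

print(is_allowed(groups, "GPTBot", "/docs/intro"))             # False: GPTBot group blocks /
print(is_allowed(groups, "OAI-SearchBot", "/docs/intro"))      # True: separate group, its own rules
print(is_allowed(groups, "OAI-SearchBot", "/members/pricing")) # True: no inheritance from '*'
print(is_allowed(groups, "OAI-SearchBot", "/drafts/post"))     # False: longest match is the Disallow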

Common pitfalls that quietly break LLM blocks:

  • Misordered groups — each group must be a contiguous block. If several User-agent: lines are meant to share one rule set, they must sit directly above it; placing directives between them splits them into separate groups, and each agent gets only the directives directly under it.
  • Missing newline at EOF — some parsers truncate the final directive.
  • Mid-directive comments — Disallow: /private # secret stuff works, but Disallow: /private/# secret is risky: many parsers only treat # as a comment start when it is preceded by whitespace. Always leave a space before an inline comment.
  • Wrong assumptions about trailing slashes — Disallow: /docs blocks /docs, /docs/foo, and even /documentation, because paths are matched as prefixes; when Allow and Disallow both match, precedence goes to the longest match, not to slash counting (see the snippet after this list).
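
The prefix behavior is easy to confirm with Python's standard-library urllib.robotparser. Treat it as a smoke test only: the module follows the older first-match precedence rather than RFC 9309 longest-match, and the one-rule file below is a made-up example.

import urllib.robotparser

# Disallow: /docs is a prefix match, not a directory match.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /docs",
])
rp.modified()  # can_fetch() answers False for everything until a fetch time is recorded

print(rp.can_fetch("GPTBot", "https://example.com/docs"))           # False
print(rp.can_fetch("GPTBot", "https://example.com/docs/guide"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/documentation"))  # False: still a prefix hit
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))      # True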

Decoder table: what each directive actually blocks

This is the lookup most teams need. Each row maps a documented user-agent to its purpose and what blocking it costs.

User-agent | Vendor | Purpose | Blocking removes you from…
GPTBot | OpenAI | Training data crawl | OpenAI model training reuse
OAI-SearchBot | OpenAI | ChatGPT Search index | ChatGPT Search citation surface
ChatGPT-User | OpenAI | User-pasted URL fetch | URL summaries inside ChatGPT
ClaudeBot | Anthropic | General crawl / training | Anthropic model training reuse
Claude-SearchBot | Anthropic | Claude web search citations | Claude citation surface
Claude-User | Anthropic | User-triggered fetch | URL summaries inside Claude
PerplexityBot | Perplexity | Index for Perplexity answers | Perplexity citation surface
Perplexity-User | Perplexity | User-triggered fetch | URL summaries inside Perplexity
Google-Extended | Google | Token controlling AI training reuse | Gemini training reuse (Search unaffected)
Googlebot | Google | Web Search crawl | Google Search index

Two worked examples make the trap concrete:

  • Disallow: / under User-agent: GPTBot still allows ChatGPT Search to cite you, because OAI-SearchBot is a separate user-agent with its own rules.
  • Blocking Claude-SearchBot removes you from Claude's citation surface but does not stop ClaudeBot from fetching for training — those are independent decisions.

Copy-paste templates by strategy

Pick the strategy that matches your goals; copy verbatim and tune paths.

Strategy A — citation-friendly (recommended for content sites)

# Allow AI search and on-demand fetch; block training-only bots.
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Strategy B — selective (block training, gate private paths)

Useful when you have a paid section under /members/ that should never be cited or trained on:

User-agent: *
Disallow: /members/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Strategy C — full opt-out (block everything AI)

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

A common publisher scenario: a SaaS docs site that wants developers asking "how do I use library X" to be cited from its docs but does not want raw docs reused to train competitor models. Strategy A is the answer — keep the three search bots open, close GPTBot, ClaudeBot, and Google-Extended.

Verification: confirm your policy actually works

Writing the directives is the easy part. Confirming they take effect is where most teams skip steps.

  • Check the file itself. Fetch https://yoursite/robots.txt exactly as a bot would — no cookies, no auth — and verify it returns 200 with Content-Type: text/plain. CDN edges occasionally serve cached HTML 404 pages on this path. The script after this list automates this check and a per-bot spot test.
  • Match user-agent strings precisely. Vendors publish exact strings: OpenAI's are listed at their crawlers page, Anthropic's at its bot policy article, and Perplexity's at docs.perplexity.ai/guides/bots along with the IP ranges it publishes for verification. Server logs should show entries like Mozilla/5.0 ... GPTBot/1.2 when OpenAI's training crawl visits.
  • Watch the log for the user-agents you blocked. If you see them succeeding with 200 responses on disallowed paths, your block is not matching. The most common cause is a stray rule in User-agent: * that the specific block does not duplicate — remember, no inheritance.
  • Review every quarter and after vendor announcements. New crawlers get launched periodically; the list above is current for 2026 Q2 and will grow. Set a calendar reminder, or check our blog for ongoing tracking.
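
A small script can automate the fetch check and the per-bot test in one pass. It is a sketch, not an official tool: the site URL, sample path, bot list, and the robots-check user-agent string are placeholders, and urllib.robotparser uses first-match precedence rather than RFC 9309 longest-match, so spot-check anything surprising against the file itself.

import urllib.request
import urllib.robotparser

SITE = "https://yoursite.example"          # placeholder: your own origin
SAMPLE_PATH = "/docs/getting-started"      # placeholder: a page you care about
BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]  # Google-Extended is a token, not a crawler, so it is not tested here

# 1. Fetch robots.txt with no cookies or auth, the way a crawler would.
req = urllib.request.Request(f"{SITE}/robots.txt", headers={"User-Agent": "robots-check/0.1"})
with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    print("status:", resp.status)
    print("content-type:", resp.headers.get("Content-Type"))

# 2. Ask the stdlib parser what each bot may fetch on a sample path.
rp = urllib.robotparser.RobotFileParser()
rp.parse(body.splitlines())
rp.modified()  # record a fetch time so can_fetch() will answer

for bot in BOTS:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}{SAMPLE_PATH}") else "blocked"
    print(f"{bot:<18} {verdict}")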

If a block is being ignored despite a correct directive, confirm the exact user-agent token in vendor docs — case is irrelevant, spelling and hyphenation are not — then escalate via the vendor's support channel. Every major AI vendor in this list has a published path for that.
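
To find those cases in the first place, a quick log scan helps. The sketch below assumes a combined-format access log at a hypothetical path and a hand-maintained map of disallowed prefixes per bot; both are placeholders to adjust to your setup and your actual robots.txt policy.

import re

LOG_PATH = "/var/log/nginx/access.log"   # placeholder: your access log
# Hand-maintained mirror of your robots.txt policy: bot token -> disallowed prefixes.
DISALLOWED = {
    "GPTBot": ["/"],
    "ClaudeBot": ["/"],
    "OAI-SearchBot": ["/members/"],
}

# Combined log format: "METHOD PATH HTTP/x" status size "referer" "user-agent"
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]+" (?P<status>\d{3}) .* "(?P<ua>[^"]*)"$')

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if not m:
            continue
        path, status, ua = m.group("path"), m.group("status"), m.group("ua")
        for bot, prefixes in DISALLOWED.items():
            # Case-insensitive token match, like the crawlers themselves use.
            if bot.lower() in ua.lower() and status == "200":
                if any(path.startswith(p) for p in prefixes):
                    print(f"{bot} fetched disallowed path {path} -> check your robots.txt groups")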

A blanket Disallow: / is a 2019 answer to a 2026 problem. Treat each crawler as a separate decision, write the policy once with comments, and revisit it quarterly. GEON helps content teams track how AI search engines actually treat their sites, but the robots.txt itself is yours to own.

Deniz

Content & GEO Strategy