Your robots.txt Isn't Blocking What You Think: A 2026 LLM Crawler Decoder
Blocking GPTBot doesn't stop ChatGPT from citing you, and Google-Extended isn't a crawler at all. Here's the per-bot robots.txt policy 2026 LLM crawlers actually require.
A single blanket Disallow line in robots.txt no longer matches how 2026 LLM vendors actually crawl the web. OpenAI, Anthropic, Perplexity, and Google each operate two or three distinct user-agents that do different jobs — training, search indexing, and user-triggered fetch — and you need to allow or block each one based on whether you want citations, training data inclusion, or neither. The right approach is a deliberate per-bot policy: keep search-indexing bots if you want AI citations, block training-only bots if you don't want your content reused for model training, and use Google-Extended as a separate switch for Gemini.
The 2026 LLM crawler landscape
Ten or more LLM-related user-agents now hit your robots.txt from four main vendors, and they fall into three independent job categories. Conflating them is what makes most blanket rules useless.
- Training crawlers fetch content to build or update foundation models. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), and content fetched by Googlebot when Google-Extended is allowed.
- Search-indexing crawlers build the live index that powers AI search citations. Examples: OAI-SearchBot (ChatGPT Search), Claude-SearchBot (Claude with web search), PerplexityBot.
- User-triggered fetchers fire only when an end user pastes a URL or asks about a specific page inside the chat. Examples: ChatGPT-User, Claude-User, Perplexity-User.
The most common 2026 misconception is treating these as the same actor. Blocking GPTBot does not block OAI-SearchBot, because they are separate user-agents with separate purposes — one trains models, the other powers citations in ChatGPT Search. Google-Extended adds another wrinkle: it is a robots.txt product token, not a crawler, so disallowing it controls whether Googlebot's already-fetched content can be reused to train Gemini, without affecting Google Search indexing at all.
How robots.txt is interpreted in 2026
The Robots Exclusion Protocol was finally standardized as RFC 9309 in 2022, ending decades of ambiguity. All four major AI vendors publicly commit to honoring it. The rules that matter for LLM crawlers:
- User-agent matching is a case-insensitive match on the product token: `GPTBot` and `gptbot` both match.
- When Allow and Disallow conflict within a group, the longest matching path wins.
- A specific User-agent group overrides the `*` group entirely; there is no inheritance. Once a `User-agent: GPTBot` group matches, rules under `User-agent: *` are ignored, so the specific group must list every directive that applies.
- An empty `Disallow:` allows everything; `Disallow: /` blocks everything in that group.
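The precedence rules above can be sketched in a few lines. This is a simplified illustration, not a full RFC 9309 parser: it ignores wildcards, percent-encoding, and group selection, and the function name is my own.

```python
# Simplified sketch of RFC 9309 precedence (no wildcards, no %-encoding):
# within the matched group, the longest matching path prefix wins, and a
# tie between Allow and Disallow goes to Allow.

def decide(rules, path):
    """rules: [("allow" | "disallow", path_prefix), ...] from ONE group.
    Returns True if the path may be fetched."""
    best_verb, best_len = "allow", -1      # no match at all means allowed
    for verb, prefix in rules:
        if prefix == "":                   # empty Disallow matches nothing
            continue
        if path.startswith(prefix):
            if len(prefix) > best_len or (len(prefix) == best_len and verb == "allow"):
                best_verb, best_len = verb, len(prefix)
    return best_verb == "allow"

group = [("disallow", "/docs"), ("allow", "/docs/public")]
decide(group, "/docs/private")    # False: "/docs" is the longest match
decide(group, "/docs/public/x")   # True: "/docs/public" wins on length
```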
Common pitfalls that quietly break LLM blocks:
- Misordered groups: each group must be a contiguous block. Writing rules between two `User-agent:` lines that were meant to share one group splits them, producing undefined behavior.
- Missing newline at EOF: some parsers truncate the final directive.
- Mid-directive comments: `Disallow: /private # secret stuff` works, but `Disallow: /private/# secret` does not, because `#` only starts a comment when preceded by whitespace.
- Wrong assumptions about trailing slashes: `Disallow: /docs` blocks `/docs`, `/docs/foo`, and even `/docsearch`; precedence is governed by longest prefix match, not literal slash counting.
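A quick lint pass can catch the first three pitfalls before deployment. This is a heuristic sketch under my own naming, not an RFC 9309 validator:

```python
def lint_robots(text):
    """Flag common robots.txt pitfalls. Heuristic sketch, not a validator."""
    warnings = []
    if text and not text.endswith("\n"):
        warnings.append("missing newline at EOF: some parsers drop the last directive")
    prev = None
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            prev = None                      # blank line acts as a group separator
            continue
        if line.startswith("#"):
            continue
        field = line.partition(":")[0].strip().lower()
        if field == "user-agent" and prev == "rule":
            warnings.append(f"line {n}: User-agent after rules starts a new group; "
                            "earlier User-agent lines do not share these rules")
        if field in ("allow", "disallow"):
            if "#" in line and " #" not in line:
                warnings.append(f"line {n}: '#' without preceding whitespace may "
                                "not start a comment in some parsers")
            prev = "rule"
        elif field == "user-agent":
            prev = "ua"
    return warnings
```

Feeding it a file where a rule sits between two `User-agent:` lines, or where the final directive has no trailing newline, returns a warning per issue.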
Decoder table: what each directive actually blocks
This is the lookup most teams need. Each row maps a documented user-agent to its purpose and what blocking it costs.
| User-agent | Vendor | Purpose | Blocking removes you from… |
|---|---|---|---|
| GPTBot | OpenAI | Training data crawl | OpenAI model training reuse |
| OAI-SearchBot | OpenAI | ChatGPT Search index | ChatGPT Search citation surface |
| ChatGPT-User | OpenAI | User-pasted URL fetch | URL summaries inside ChatGPT |
| ClaudeBot | Anthropic | General crawl / training | Anthropic model training reuse |
| Claude-SearchBot | Anthropic | Claude web search citations | Claude citation surface |
| Claude-User | Anthropic | User-triggered fetch | URL summaries inside Claude |
| PerplexityBot | Perplexity | Index for Perplexity answers | Perplexity citation surface |
| Perplexity-User | Perplexity | User-triggered fetch | URL summaries inside Perplexity |
| Google-Extended | Google | Token controlling AI training reuse | Gemini training reuse (Search unaffected) |
| Googlebot | Google | Web Search crawl | Google Search index |
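For teams automating policy, the table translates naturally into a lookup. The tokens are the documented user-agents; the category labels and the helper function are my own naming:

```python
# The decoder table as data: token -> (vendor, job category).
CRAWLERS = {
    "GPTBot":           ("OpenAI",     "training"),
    "OAI-SearchBot":    ("OpenAI",     "search-index"),
    "ChatGPT-User":     ("OpenAI",     "user-fetch"),
    "ClaudeBot":        ("Anthropic",  "training"),
    "Claude-SearchBot": ("Anthropic",  "search-index"),
    "Claude-User":      ("Anthropic",  "user-fetch"),
    "PerplexityBot":    ("Perplexity", "search-index"),
    "Perplexity-User":  ("Perplexity", "user-fetch"),
    "Google-Extended":  ("Google",     "training"),   # a token, not a crawler
}

def bots_to_block(allow_citations=True, allow_training=False):
    """Return the user-agent tokens to Disallow under the chosen strategy."""
    blocked = []
    for token, (_vendor, category) in CRAWLERS.items():
        if category == "training" and not allow_training:
            blocked.append(token)
        elif category in ("search-index", "user-fetch") and not allow_citations:
            blocked.append(token)
    return blocked
```

With the defaults (citations yes, training no), the function returns exactly the three training switches: GPTBot, ClaudeBot, and Google-Extended.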
Two worked examples make the trap concrete:
- `Disallow: /` under `User-agent: GPTBot` still allows ChatGPT Search to cite you, because OAI-SearchBot is a separate user-agent with its own rules.
- Blocking `Claude-SearchBot` removes you from Claude's citation surface but does not stop ClaudeBot from fetching for training; those are independent decisions.
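The first trap is easy to reproduce with Python's standard-library parser (note that `urllib.robotparser` is not fully RFC 9309 compliant, but it handles this simple case):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks only the training bot.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

rp.can_fetch("GPTBot", "https://example.com/post")         # False: blocked
rp.can_fetch("OAI-SearchBot", "https://example.com/post")  # True: separate UA
```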
Copy-paste templates by strategy
Pick the strategy that matches your goals; copy verbatim and tune paths.
Strategy A — citation-friendly (recommended for content sites)
```
# Allow AI search and on-demand fetch; block training-only bots.
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
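If the allow and block lists change often, it can be safer to render the file than to hand-edit it. A minimal sketch (the list contents mirror Strategy A; the function name is my own):

```python
# Render a per-bot robots.txt from two lists so the policy lives in one
# place and the file stays internally consistent.

ALLOW = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
         "ChatGPT-User", "Claude-User", "Perplexity-User"]
BLOCK = ["GPTBot", "ClaudeBot", "Google-Extended"]

def render(allow, block):
    groups = [f"User-agent: {ua}\nAllow: /" for ua in allow]
    groups += [f"User-agent: {ua}\nDisallow: /" for ua in block]
    return "\n\n".join(groups) + "\n"   # trailing newline: see the pitfalls list

print(render(ALLOW, BLOCK))
```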
Strategy B — selective (block training, gate private paths)
Useful when you have a paid section under /members/ that should never be cited or trained on:
```
User-agent: *
Disallow: /members/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Strategy C — full opt-out (block everything AI)
```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /
```
A common publisher scenario: a SaaS docs site that wants developers asking "how do I use library X" to be cited from its docs but does not want raw docs reused to train competitor models. Strategy A is the answer — keep the three search bots open, close GPTBot, ClaudeBot, and Google-Extended.
Verification: confirm your policy actually works
Writing the directives is the easy part. Confirming they take effect is where most teams skip steps.
- Check the file itself. Fetch `https://yoursite/robots.txt` exactly as a bot would (no cookies, no auth) and verify it returns 200 with `Content-Type: text/plain`. CDN edges occasionally serve cached HTML 404 pages on this path.
- Match user-agent strings precisely. Vendors publish exact strings: OpenAI's are listed on its crawlers page, Anthropic's in its bot policy article, and Perplexity's at docs.perplexity.ai/guides/bots along with the IP ranges it publishes for verification. Server logs should show entries like `Mozilla/5.0 ... GPTBot/1.2` when OpenAI's training crawl visits.
- Watch the logs for the user-agents you blocked. If they succeed with 200 responses on disallowed paths, your block is not matching. The most common cause is a rule that lives only under `User-agent: *` and is not duplicated in the bot's specific group; remember, there is no inheritance.
- Review every quarter and after vendor announcements. New crawlers launch periodically; the list above is current as of 2026 Q2 and will grow. Set a calendar reminder, or check our blog for ongoing tracking.
If a block is being ignored despite a correct directive, confirm the exact user-agent token in vendor docs — case is irrelevant, spelling and hyphenation are not — then escalate via the vendor's support channel. Every major AI vendor in this list has a published path for that.
A blanket Disallow: / is a 2019 answer to a 2026 problem. Treat each crawler as a separate decision, write the policy once with comments, and revisit it quarterly. GEON helps content teams track how AI search engines actually treat their sites, but the robots.txt itself is yours to own.
Deniz
Content & GEO Strategy