robots.txt for AI Crawlers: The 2026 Setup

If you want AI engines to cite you, your robots.txt must allow the crawlers that feed their answers: OAI-SearchBot, PerplexityBot, ClaudeBot, and Bingbot, plus Googlebot for Google's AI Overviews. (Google-Extended is a control token for Gemini rather than a crawler, and it does not affect inclusion in Google Search.) Block one of these crawlers and you will most likely drop out of that engine's answers. The catch most people miss: search crawling and training crawling are separate decisions, controlled by separate bots. You can allow citation while blocking training, or the reverse. Conflating the two is how well-meaning teams accidentally make themselves invisible.

This is one of the most common own goals in AI search. A site does everything right with content, then quietly blocks a search crawler with a robots.txt rule someone added two years ago to "keep AI from stealing our content," and wonders why it never gets cited. The rule did its job too well.

The two decisions you are actually making

For each AI company there are usually two crawlers doing two jobs:

Search crawlers fetch pages so the engine can cite them in live answers. Allow these if you want visibility in that engine's responses.
Training crawlers fetch pages to train the underlying model. This is a separate call about whether you want your content in training data.

OpenAI is the clearest example. OAI-SearchBot powers ChatGPT Search citations. GPTBot collects pages for model training. Allowing one does not require allowing the other, because each bot is governed by its own User-agent group in your robots.txt. If you want to be cited in ChatGPT but kept out of training, you allow OAI-SearchBot and disallow GPTBot. That combination is completely valid.
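To see that the two rules really are independent, here is a minimal sketch using Python's standard-library robots.txt parser. The helper name can_crawl is made up for illustration, and the standard-library parser is only an approximation of how any given crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Allow ChatGPT Search citations, keep content out of training.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch path.

    Hypothetical helper for this article, not part of any crawler's tooling.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "GPTBot"):
        print(bot, "allowed" if can_crawl(ROBOTS_TXT, bot) else "blocked")
```

Each bot matches only its own group, so the Disallow under GPTBot has no effect on OAI-SearchBot.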

Who's who

Bot	Company	Job	Allow if you want...
`OAI-SearchBot`	OpenAI	ChatGPT Search citations	To be cited in ChatGPT
`GPTBot`	OpenAI	Model training	Your content in training data
`ChatGPT-User`	OpenAI	User-triggered fetches	ChatGPT to browse to your page on request
`PerplexityBot`	Perplexity	Search and citations	To be cited in Perplexity
`ClaudeBot`	Anthropic	Crawling for Claude	Claude to access your content
`Google-Extended`	Google	Control token for Gemini training and grounding (not a separate crawler)	Gemini to use your content (AI Overviews follow Googlebot, not this token)
`Bingbot`	Microsoft	Bing index, powers Copilot	Copilot visibility (and Bing)
`CCBot`	Common Crawl	Open training dataset	(Block to limit broad training reuse)

A working config

Here is a robots.txt that allows the search and AI-answer crawlers, makes the OpenAI training decision explicit, and blocks the open training crawler. Adjust to your own policy.

# Allow AI search and answer engines to cite you
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bingbot
Allow: /

# Training decision (allow if you're comfortable with training reuse)
User-agent: GPTBot
Allow: /

# Block the open training crawler
User-agent: CCBot
Disallow: /

# Everyone else
User-agent: *
Allow: /

Sitemap: https://okara.ai/sitemap.xml

If you would rather keep your content out of training while staying citable, change the GPTBot rule to Disallow: / and keep OAI-SearchBot on Allow: /. The citation path stays open. There is no setting that lets you be cited without being crawled by the search bot, so allowing the search crawlers is non-negotiable if visibility is the goal.
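If you maintain several sites or policies, one option is to generate the file from a small policy table instead of editing it by hand. The sketch below is an assumption about how you might do that, not a standard tool: render_robots_txt, CITABLE_NOT_TRAINED, and the example.com sitemap URL are all illustrative names. It produces the "citable but not trained on" variant described above.

```python
from typing import Dict, Optional

# True means "Allow: /", False means "Disallow: /".
# Mirrors the config above, with GPTBot flipped to disallow.
CITABLE_NOT_TRAINED: Dict[str, bool] = {
    "OAI-SearchBot": True,
    "PerplexityBot": True,
    "ClaudeBot": True,
    "Google-Extended": True,
    "Bingbot": True,
    "GPTBot": False,
    "CCBot": False,
    "*": True,
}


def render_robots_txt(policy: Dict[str, bool],
                      sitemap_url: Optional[str] = None) -> str:
    """Render one User-agent group per bot, separated by blank lines."""
    groups = []
    for bot, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {bot}\n{rule}\n")
    if sitemap_url:
        groups.append(f"Sitemap: {sitemap_url}\n")
    return "\n".join(groups)


if __name__ == "__main__":
    print(render_robots_txt(CITABLE_NOT_TRAINED,
                            "https://example.com/sitemap.xml"))
```

Keeping the policy in one mapping makes the training decision a one-word change that is easy to review.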

Beyond robots.txt: don't block yourself other ways

robots.txt is necessary but not sufficient. A page also has to be reachable and renderable:

Do not noindex or nosnippet pages you want quoted. If a page cannot show a snippet in Search, it cannot be cited in an AI Overview.
Server-render or pre-render the content. If the answer only appears after heavy client-side JavaScript, many crawlers never see it. This catches a lot of modern single-page apps.
Submit your sitemap to Bing Webmaster Tools, since ChatGPT and Copilot lean on Bing's index.
Watch for accidental blocks at the CDN or firewall level. Some bot-protection services block AI crawlers by default, which overrides whatever your robots.txt says.
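The first two checks can be roughed out in code. This sketch inspects the raw HTML a server returns, before any JavaScript runs, for robots meta directives and for a phrase you expect to be quoted. The names RobotsMetaParser and check_html are illustrative, and the check covers only meta tags; it does not look at an X-Robots-Tag response header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def check_html(raw_html: str, key_phrase: str) -> dict:
    """Report noindex/nosnippet and whether key_phrase is in the raw HTML.

    A missing phrase suggests the content is only rendered client-side.
    """
    parser = RobotsMetaParser()
    parser.feed(raw_html)
    return {
        "noindex": "noindex" in parser.directives,
        "nosnippet": "nosnippet" in parser.directives,
        "phrase_in_raw_html": key_phrase.lower() in raw_html.lower(),
    }
```

Feed it the response body from a plain HTTP fetch rather than a browser's rendered DOM, since the point is to see what a non-rendering crawler would receive.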

How to audit what you have now

Three quick checks:

Open yoursite.com/robots.txt and search for Disallow rules under any of the bots in the table above. If a search crawler is disallowed, that is very likely why you are absent from that engine.
In a logged-out browser, run your top queries through ChatGPT, Perplexity, and Google. If you are never cited despite ranking, suspect crawl access first.
Check your CDN or WAF (Cloudflare and similar) for "AI bot" or "scraper" blocking toggles that may be on by default.
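The first check is easy to script. The sketch below tests every bot from the table against a robots.txt and lists the ones blocked at the site root. The function names are illustrative, and Python's built-in parser applies rules in file order rather than by longest match, so treat the result as a first pass, not a verdict on how each crawler interprets the file.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# User-agent tokens from the table in this article.
AI_BOTS = (
    "OAI-SearchBot", "GPTBot", "ChatGPT-User", "PerplexityBot",
    "ClaudeBot", "Google-Extended", "Bingbot", "CCBot",
)


def fetch_robots_txt(site: str) -> str:
    """Download robots.txt from a site root such as 'https://example.com'."""
    with urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def audit_robots(robots_txt: str, path: str = "/") -> dict:
    """Map each bot token to True (allowed) or False (blocked) for path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}


def blocked_bots(robots_txt: str) -> list:
    """List the bots that may not fetch the site root."""
    return [bot for bot, ok in audit_robots(robots_txt).items() if not ok]
```

Usage would look like blocked_bots(fetch_robots_txt("https://yoursite.com")). A search crawler in the result is the first thing to fix; a training crawler there may be exactly what you intended.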

What about llms.txt?

You will see advice to add an llms.txt file to your root. It is harmless, and some practitioners include it, but there is no solid evidence that major engines currently use it for ranking or citation. Treat it as optional, and do not let it distract from the things that demonstrably matter: crawler access, page structure, and sourcing. Spend the hour on your robots.txt and schema instead.

Where Okara fits

A blocked crawler is the kind of single-line problem that costs you an entire engine and goes unnoticed for months because nothing surfaces it. Okara's SEO agent audits your technical setup continuously, including robots.txt and crawlability, and flags exactly these silent blockers, while the coding agent can apply the fix without you digging through config. It is the safeguard against doing all the GEO work and then quietly disqualifying yourself on a technicality. Point it at your site for a crawlability and AI-visibility check.

Frequently asked questions

If I block GPTBot, will ChatGPT stop citing me? No. GPTBot is for training. ChatGPT Search citations come through OAI-SearchBot, which is a separate rule. Block training, keep search, and you stay citable.

I blocked AI bots a while ago. Is that why I'm not cited? Very possibly. Audit your robots.txt for Disallow rules targeting any of the search crawlers above. This is one of the most common reasons a site is absent from AI answers.

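Does allowing these bots hurt my SEO? No. Most of these are AI-specific crawlers, and allowing them does not affect how Googlebot ranks your pages. Bingbot is the one regular search crawler on the list: blocking it removes you from Bing results as well as Copilot.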

Should I add llms.txt? You can, but do not expect it to do much yet. There is no clear evidence engines use it for ranking. Prioritize crawler access and content structure first.

Could my CDN be blocking AI crawlers even if robots.txt allows them? Yes. Some bot-protection services block AI crawlers by default at the network level, which overrides robots.txt. Check your CDN or firewall settings if you are crawlable on paper but still absent.
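A rough way to probe for a network-level block is to request the same page twice, once with a browser-style User-Agent and once with a bot-style one, and compare status codes. This is a sketch under loud assumptions: both User-Agent strings are simplified stand-ins rather than the exact strings real clients send, and many bot-protection services also check source IP addresses, so a spoofed request can be treated differently from the genuine crawler. Read the result as a hint, not proof.

```python
import urllib.error
import urllib.request

# Simplified, illustrative User-Agent strings (assumptions, not exact values).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36"
BOT_UA = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"


def status_for(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server gives this User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep the code.
        return err.code


def looks_blocked(browser_status: int, bot_status: int) -> bool:
    """True when the browser gets 2xx but the bot-labelled request gets 4xx/5xx."""
    return 200 <= browser_status < 300 and bot_status >= 400
```

For example, looks_blocked(status_for(url, BROWSER_UA), status_for(url, BOT_UA)) returning True on a page that robots.txt allows points toward a CDN or firewall rule.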