You've published great content. Maybe you've even set up llms.txt and per-page .md files. But none of that matters if your robots.txt is blocking the AI agents that want to consume it.
Many websites are invisible to AI search - not because their content is poor, but because their robots.txt rules inadvertently shut the door. This article walks you through auditing and fixing it.
The problem you don't know you have
robots.txt has been around since 1994. It tells web crawlers what they can and can't access. Most robots.txt files were written with Google in mind - and they work fine for traditional search.
But AI crawlers use different user-agent strings than search engine crawlers. A robots.txt that allows Googlebot may inadvertently block ClaudeBot, GPTBot, and PerplexityBot through broad disallow rules, default-deny configurations, or security plugins that block unrecognised bots.
Check your robots.txt right now: go to yoursite.com/robots.txt in your browser. If you see any of these patterns, you may have a problem:
User-agent: *
Disallow: /
That blocks everything - including all AI crawlers.
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
That only allows Google. Every AI crawler is blocked.
Some CMS platforms and security plugins add these rules automatically. You might not even know they're there.
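To see how the two patterns above play out for individual crawlers, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `blocked_agents` and the short agent list are illustrative assumptions, and real crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of crawler tokens mentioned in this article.
AGENTS = ["Googlebot", "ClaudeBot", "GPTBot", "PerplexityBot"]

# Pattern 1: blanket deny for every crawler.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Pattern 2: blanket deny with an exception for Googlebot only.
GOOGLE_ONLY = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
"""


def blocked_agents(robots_txt, agents=AGENTS, url="/"):
    """Return the agents that this robots.txt text bars from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print("Pattern 1 blocks:", blocked_agents(BLOCK_ALL))
    print("Pattern 2 blocks:", blocked_agents(GOOGLE_ONLY))
```

Pasting your own file's contents in place of the sample strings gives a quick first read on which crawlers are locked out.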
The AI crawlers you need to know
There are twelve major AI crawler families. Each serves a different purpose:
| Crawler | Company | What it does |
|---|---|---|
| ClaudeBot | Anthropic | Powers Claude AI assistant |
| GPTBot | OpenAI | ChatGPT training and search |
| ChatGPT-User | OpenAI | ChatGPT browsing mode (when users ask it to browse) |
| PerplexityBot | Perplexity | Perplexity AI search engine |
| Google-Extended | Google | Gemini AI training |
| Googlebot | Google | Google Search (also feeds AI Overviews) |
| BingBot | Microsoft | Bing Search and Copilot |
| Applebot | Apple | Siri and Apple Intelligence |
| Meta-ExternalAgent | Meta | Meta AI assistant |
| Bytespider | ByteDance | TikTok and Douyin AI |
| CCBot | Common Crawl | Open dataset for AI training |
| Amazonbot | Amazon | Alexa and Amazon search |
Not all of these crawlers are equal. Some drive AI search visibility (GPTBot, PerplexityBot, ClaudeBot). Some are primarily for training (CCBot, Google-Extended). Some do both (Googlebot feeds both traditional search and AI Overviews).
How to audit your robots.txt
Step 1: Read your current file. Navigate to yoursite.com/robots.txt and read what's there. Look for blanket disallow rules, and check whether AI-specific user agents are mentioned.
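Step 2: Test specific agents. Use a robots.txt testing tool or parser that lets you choose the user agent, and check whether ClaudeBot, GPTBot, and PerplexityBot can access specific URLs. Google's own robots.txt tooling only reports on Google's crawlers, so it won't answer this question for AI bots.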
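Step 3: Check server logs. If you have access to your server or CDN logs, search for AI crawler user agents. Requests answered with 403 (Forbidden) status codes mean a server, CDN, or firewall rule is blocking the crawler; robots.txt itself never produces a 403. A crawler that fetches /robots.txt and then requests nothing else is probably obeying a Disallow rule.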
Step 4: Check CMS settings. Some platforms manage robots.txt through their own interface:
- HubSpot: Settings → Website → Pages → SEO → robots.txt
- WordPress: Often managed by SEO plugins (Yoast, Rank Math)
- Webflow: Site Settings → SEO → Robots.txt
- Shopify: Managed via the robots.txt.liquid template
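The log check from Step 3 can be sketched as a small script. This assumes logs in the common "combined" access-log format (status code after the quoted request, user agent as the last quoted field). The function name `crawler_status_counts` and the sample lines are hypothetical, and your server or CDN may log in a different shape.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent field (a subset of the table above).
AI_AGENTS = ["ClaudeBot", "GPTBot", "ChatGPT-User", "PerplexityBot", "CCBot"]

# Combined log format: ... "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def crawler_status_counts(lines):
    """Count (crawler, status) pairs for log lines sent by known AI crawlers."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ua = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in ua:
                counts[(agent, int(match.group("status")))] += 1
                break
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /blog/ HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for (agent, status), n in sorted(crawler_status_counts(SAMPLE).items()):
        print(f"{agent} {status}: {n}")
```

To run it against a real log, pass an open file object instead of `SAMPLE`; a run of 403 responses for one crawler is the signal Step 3 describes.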
The recommended configuration
Add these rules to your robots.txt to explicitly allow AI crawlers:
# AI crawlers
User-agent: ClaudeBot
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
If your robots.txt has a blanket User-agent: * / Disallow: / rule, you'll need to either remove it or add a separate group with explicit Allow rules for each AI crawler. A crawler follows the group that names it and ignores the wildcard group, so specific user-agent rules take precedence wherever they sit in the file.
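A quick way to convince yourself of the precedence rule is to parse a file that combines a wildcard block with one named group. The sketch below uses Python's standard-library `urllib.robotparser`. The sample file and the `can_fetch` wrapper are illustrative, and individual crawlers may implement matching slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Blanket deny plus one named group for GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt, agent, url="/"):
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # GPTBot has its own group, so the wildcard Disallow does not apply to it.
    print("GPTBot:", can_fetch(ROBOTS_TXT, "GPTBot"))
    # ClaudeBot is not named, so it falls back to the wildcard group.
    print("ClaudeBot:", can_fetch(ROBOTS_TXT, "ClaudeBot"))
```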
Selective access: it's a legitimate choice
You don't have to allow everything. You might want:
- Search bots allowed, training bots blocked - Allow GPTBot, PerplexityBot, ClaudeBot (they drive search citations) but block CCBot and Bytespider (primarily training crawlers)
- All major AI bots allowed, except specific ones - Allow the major twelve but block crawlers from companies you don't want accessing your content
- Specific paths blocked - Allow AI crawlers site-wide but block them from specific directories (e.g., /internal/ or /members/)
This is where robots.txt does well - it gives you per-bot, per-path control. But it's still binary: allow or block. For more nuanced governance, like allowing search citations but blocking training use on the same content, you need Content Signals.
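One way to keep a selective policy readable is to generate the file from a small description of who gets what. The following sketch assumes a hypothetical `build_robots_txt` helper and an example policy: search-oriented bots allowed except for two private paths, training-oriented bots blocked. The groupings mirror this article's examples, not any official classification.

```python
from urllib.robotparser import RobotFileParser

# Example policy: crawler token -> list of path prefixes to disallow.
# An empty list means "allow everything"; ["/"] means "block everything".
POLICY = {
    "GPTBot": ["/internal/", "/members/"],
    "PerplexityBot": ["/internal/", "/members/"],
    "ClaudeBot": ["/internal/", "/members/"],
    "CCBot": ["/"],
    "Bytespider": ["/"],
}


def build_robots_txt(policy):
    """Render one User-agent group per crawler from the policy mapping."""
    groups = []
    for agent, blocked_paths in policy.items():
        lines = [f"User-agent: {agent}"]
        if blocked_paths:
            lines += [f"Disallow: {path}" for path in blocked_paths]
        else:
            lines.append("Allow: /")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    text = build_robots_txt(POLICY)
    print(text)

    # Sanity-check the generated file with the standard-library parser.
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    print("GPTBot /blog/post:", parser.can_fetch("GPTBot", "/blog/post"))
    print("GPTBot /members/area:", parser.can_fetch("GPTBot", "/members/area"))
    print("CCBot /blog/post:", parser.can_fetch("CCBot", "/blog/post"))
```

The generated groups use only Disallow lines for partial blocks, because parsers differ in how they resolve a broad Allow against a narrower Disallow.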
The CMS complication
Some CMS platforms make robots.txt management harder than it should be:
HubSpot manages robots.txt through its own interface but doesn't always support custom user-agent rules in the way you'd expect. You may need to use the "Additional rules" field or edit the raw file.
WordPress with security plugins like Wordfence or Sucuri may add rules that block bots with unrecognised user agents. Check your security plugin settings alongside your robots.txt.
Webflow allows custom robots.txt editing in site settings, but changes only take effect on publish. Make sure to publish after editing.
Shopify uses a template-based approach (robots.txt.liquid) that requires some Liquid syntax knowledge to customise.
In all cases: after editing, verify by visiting yoursite.com/robots.txt in a browser and confirming your changes are live.
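The same verification can be scripted so you can rerun it after every edit. This sketch assumes a hypothetical `verify_live` helper; `yoursite.com` is the article's placeholder, and the `fetch` parameter exists only so the network call can be swapped out. Some servers respond differently to a script than to a browser, so treat a failure here as a prompt to look closer, not as a verdict.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["ClaudeBot", "GPTBot", "ChatGPT-User", "PerplexityBot"]


def fetch_text(url):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def verify_live(site, agents=AI_AGENTS, path="/", fetch=fetch_text):
    """Fetch https://<site>/robots.txt and map each agent to True (allowed) or False."""
    robots_txt = fetch(f"https://{site}/robots.txt")
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Example (needs network access; replace the placeholder domain first):
#   for agent, allowed in verify_live("yoursite.com").items():
#       print(agent, "allowed" if allowed else "BLOCKED")
```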
The quick win
This is the highest-impact, lowest-effort step in making your website AI-readable. Five minutes of editing your robots.txt can be the difference between being visible and being invisible to AI search.
If you're doing nothing else from this series, do this:
- Go to yoursite.com/robots.txt
- Check for blanket disallow rules
- Add explicit Allow rules for the AI crawlers listed above
- Verify the changes are live
Then move on to the bigger wins: llms.txt, per-page .md files, and AI crawl analytics to confirm bots are actually getting through.
This article is part of our series on making your website AI-readable.