You Blocked ChatGPT and Didn't Even Know It
Your website may be refusing AI crawlers right now. 5.89% of all sites block GPTBot. Cloudflare blocks AI by default since July 2025. Here's how to check and fix it in 2 minutes.
Right now, your website may be invisible to ChatGPT, Perplexity, and Claude — not because your content is bad, but because your server is refusing to let them in. Open a new tab. Type yoursite.com/robots.txt. If you see GPTBot, ClaudeBot, or PerplexityBot next to a Disallow line, you found the problem.
Check Right Now (10 Seconds)
Open https://yoursite.com/robots.txt in a new tab. This file tells every crawler — search engines and AI bots alike — what they're allowed to access. It takes 10 seconds to read, and it's the first gate on your AI visibility.
Look for these user agents next to "Disallow" lines:
- GPTBot, OAI-SearchBot, ChatGPT-User (OpenAI)
- ClaudeBot, anthropic-ai (Anthropic)
- PerplexityBot (Perplexity)
- Google-Extended, Googlebot (Google)
Also check these three places — robots.txt isn't always the culprit:
- Cloudflare dashboard: Security > Bots > "AI Bots" toggle. Since July 2025, new domains block AI by default.
- WordPress Settings: Settings > Reading > "Discourage search engines" checkbox. This adds noindex and can trigger broad bot blocks.
- Security plugins: Wordfence, Sucuri, and similar plugins maintain bot blocklists that may include AI crawlers.
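The robots.txt half of this check can also be scripted. Below is a minimal sketch using Python's standard-library `urllib.robotparser` — the `check_robots` function name, the bot list, and the `yoursite.com` URL are illustrative, not part of any official tooling. It parses a robots.txt file's contents and reports whether each major AI crawler may fetch a given URL:

```python
from urllib.robotparser import RobotFileParser

# The AI user agents covered in this article (assumed list, extend as needed)
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "PerplexityBot", "Google-Extended"]

def check_robots(robots_txt: str, url: str = "https://yoursite.com/") -> dict:
    """Return {user_agent: allowed?} for each AI crawler, given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}

# Example: a robots.txt that blocks GPTBot but allows everything else
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(check_robots(sample))
```

Feed it the text of your live robots.txt (e.g. fetched with `urllib.request`) to get a per-bot allow/deny report in one call.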
The AI Crawler Field Guide
Not all AI bots are equal. The critical distinction: training bots vs. search/citation bots. Block training bots if you want your content excluded from model training. Never block search/citation bots if you want AI visibility. The "What Blocking Means" column in the table below tells you which is which: bots marked "fine to block" only feed training, while the rest carry your AI search presence.
| User Agent | Company | Purpose | What Blocking Means |
|---|---|---|---|
| GPTBot | OpenAI | Model training | No training (fine to block) |
| OAI-SearchBot | OpenAI | ChatGPT search | Can’t cite you in ChatGPT search |
| ChatGPT-User | OpenAI | User browsing | Can’t browse your page in ChatGPT |
| ClaudeBot | Anthropic | Chat citation | Claude can’t cite you |
| anthropic-ai | Anthropic | Bulk training | No training (fine to block) |
| PerplexityBot | Perplexity | Search index | Invisible in Perplexity search |
| Google-Extended | Google | Gemini training | No Gemini training (doesn’t affect AI Overviews) |
| Googlebot | Google | Search + AI Overviews | Invisible in Google entirely |
Key rule: Block GPTBot, anthropic-ai, and Google-Extended if you want to opt out of training. Keep OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Googlebot allowed. Always.
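The key rule above can be codified. This is a hedged sketch — the `policy` function and the two sets are illustrative names, not an official API — useful as a lookup when reviewing firewall rules or plugin blocklists:

```python
# Training bots: blocking only opts you out of model training (per the table above)
TRAINING_BOTS = {"GPTBot", "anthropic-ai", "Google-Extended"}
# Search/citation bots: blocking makes you invisible in that AI search engine
SEARCH_BOTS = {"OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "PerplexityBot", "Googlebot", "Bingbot"}

def policy(user_agent: str) -> str:
    """Return the recommended robots.txt stance for a crawler user agent."""
    if user_agent in SEARCH_BOTS:
        return "allow"            # never block: these carry citations
    if user_agent in TRAINING_BOTS:
        return "block"            # only if you want to opt out of training
    return "allow"                # default: allow unknown crawlers

print(policy("GPTBot"))          # → block
print(policy("OAI-SearchBot"))   # → allow
```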
5 Ways You Accidentally Blocked AI
Most sites don't intentionally block AI crawlers. These five causes account for the majority of accidental blocks we see in audits.
Cloudflare's Default Flip (July 2025)
Cloudflare enabled "Block AI Bots" by default for all new domains. With ~20% of the web behind Cloudflare, millions of sites started blocking AI crawlers overnight without any action from site owners.
Exact path: Cloudflare Dashboard > Security > Bots > "Block AI scrapers and crawlers". If this toggle is on, every AI bot gets a 403 before it even sees your robots.txt. Existing domains may have been auto-opted-in during Cloudflare plan renewals.
WordPress Security Plugins
Wordfence: Firewall > Blocking > "Advanced Blocking" — check the User Agent Pattern field for GPTBot, ClaudeBot, or wildcard patterns like *bot*. Sucuri: WAF > Settings > "Blocked User Agents" list. iThemes Security: Security > Bots > "Banned User Agents."
Plugin updates silently add new AI user agents to block lists. After every update, verify your bot allowlist. These plugins also stack: Wordfence can block a bot that Cloudflare already allowed through.
Server-Level Firewall Rules
WAF rules that require JavaScript execution or CAPTCHAs silently reject AI crawlers. Unlike humans in real browsers, AI crawlers cannot solve CAPTCHAs or execute JavaScript challenges (Cloudflare Turnstile, hCaptcha, reCAPTCHA). They receive a 403 or a challenge page, fail the check, and move on.
This is invisible to you. The bot never reaches your server, so your analytics show nothing. Check your WAF's "challenge" or "managed challenge" rules — if they apply to all traffic (not just suspicious IPs), AI crawlers are getting blocked on every request.
Staging robots.txt Leftover
The classic two-line disaster that blocks every crawler — Googlebot, GPTBot, all of them:
```
User-agent: *
Disallow: /
```
It's the standard staging robots.txt, designed to prevent staging from being indexed. But it ships to production more often than anyone admits — through CI/CD pipelines that copy the wrong file, environment-variable misconfigs, or merge conflicts that default to the restrictive version. Always verify robots.txt after every deployment.
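That "verify after every deployment" step is easy to automate. Here is a minimal sketch of a post-deploy check (the `staging_leak` function and `yoursite.com` URL are illustrative) that flags the blanket staging block using Python's standard-library robots.txt parser:

```python
from urllib.robotparser import RobotFileParser

def staging_leak(robots_txt: str) -> bool:
    """True if this robots.txt blocks ALL crawlers from the whole site —
    the classic staging leftover described above."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # If even Googlebot can't fetch the homepage, the blanket block shipped.
    return not parser.can_fetch("Googlebot", "https://yoursite.com/")

# The two-line staging disaster trips the check; an allow-all file passes.
print(staging_leak("User-agent: *\nDisallow: /\n"))  # → True (leak!)
print(staging_leak("User-agent: *\nAllow: /\n"))     # → False (healthy)
```

Wired into a CI/CD pipeline as a smoke test, it fails the deploy before the restrictive file ever reaches production.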
CDN/Hosting Rate Limiting
AI crawlers make burst requests — they don't browse page by page like humans. A search bot indexing your site may hit 50–100 pages in a few seconds. If your rate limit is set to 30 requests/minute per IP, the bot gets 429 Too Many Requests after the first burst.
After repeated 429s, crawlers deprioritize your domain and reduce crawl frequency — sometimes permanently. Check your hosting provider's rate limiting settings (Vercel, Netlify, AWS CloudFront, and Nginx all have different defaults). Whitelist known AI bot IP ranges, or raise your burst threshold to at least 120 requests/minute.
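To see whether rate limiting is already hitting AI crawlers, you can scan your access logs for 403/429 responses to known AI user agents. A rough sketch, assuming combined-format logs (the `blocked_bot_hits` helper and sample lines are illustrative):

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot")

def blocked_bot_hits(log_lines):
    """Count 403/429 responses per AI user agent in combined-format access logs."""
    hits = Counter()
    for line in log_lines:
        # Status code appears right after the closing quote of the request
        m = re.search(r'" (\d{3}) ', line)
        if not m or m.group(1) not in ("403", "429"):
            continue
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

logs = [
    '1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 429 162 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Mar/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(blocked_bot_hits(logs))  # → Counter({'GPTBot': 1})
```

A nonzero count for any search/citation bot means the rate limit (or WAF) is costing you citations right now.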
The Recommended robots.txt
Two approaches below. Pick the one that fits your content policy. Both are copy-pasteable — replace your entire robots.txt with one of these.
Approach 1: Block Training, Allow Search (Recommended)
Keeps your content out of model training datasets while preserving full AI search visibility. This is the best balance for most sites.
```
# ===========================================
# robots.txt — Block training, allow search
# ===========================================

# --- Training bots (safe to block) ---
# These crawl your site to feed model training.
# Blocking them does NOT affect search citations.

User-agent: GPTBot
Disallow: /
# OpenAI model training. Does NOT affect ChatGPT search.

User-agent: Google-Extended
Disallow: /
# Gemini model training. Does NOT affect Google Search or AI Overviews.

User-agent: anthropic-ai
Disallow: /
# Anthropic bulk training crawler. Does NOT affect Claude citations.

# --- Search/citation bots (NEVER block) ---
# These fetch your pages when a user asks a question.
# Blocking them = invisible in that AI search engine.

User-agent: OAI-SearchBot
Allow: /
# ChatGPT search — serves 900M+ weekly active users.

User-agent: ChatGPT-User
Allow: /
# ChatGPT browse mode — when a user clicks "browse" on your link.

User-agent: ClaudeBot
Allow: /
# Claude citation crawler — fetches pages to answer user queries.

User-agent: PerplexityBot
Allow: /
# Perplexity search index — 100M+ monthly queries.

# --- Standard search engines (NEVER block) ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- Everything else: allow by default ---
User-agent: *
Allow: /

# Sitemap (update with your actual URL)
Sitemap: https://yoursite.com/sitemap.xml
```
Approach 2: Allow All Bots (Maximum Visibility)
If you don't mind your content being used for training and want maximum discoverability, use this minimal config.
```
# ===========================================
# robots.txt — Allow all crawlers
# ===========================================
# Maximum visibility: all bots can access all pages.
# Your content may be used for AI model training.

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```
Important: robots.txt rules are processed per user agent. An Allow: / for OAI-SearchBot is completely independent from a Disallow: / for GPTBot — they are separate bots with separate rule blocks.
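You can verify that independence directly with Python's standard-library parser — a small sketch (the `yoursite.com` URL is a placeholder) showing that a `Disallow` for one bot has no effect on another:

```python
from urllib.robotparser import RobotFileParser

# Two independent rule blocks: GPTBot blocked, OAI-SearchBot allowed
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://yoursite.com/post"))         # → False
print(parser.can_fetch("OAI-SearchBot", "https://yoursite.com/post"))  # → True
```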
Also note: robots.txt is a request, not enforcement. Well-behaved bots obey it, but it provides no technical barrier. For true access control, use authentication or server-level IP restrictions.
What Blocking Costs You
AI search is no longer a niche channel. ChatGPT alone surpassed 900 million weekly active users in early 2026 — that's more weekly users than X (Twitter) and LinkedIn combined. Blocking search bots means zero presence in platforms that collectively serve billions of queries per week.
- 900M+ ChatGPT weekly active users
- 100M+ Perplexity monthly queries
- 5.89% of all websites block GPTBot
To make this tangible: if your site gets cited in a ChatGPT search answer that's shown to even 0.001% of those weekly users, that's 9,000 potential visitors — from a single query. Block the bot, and that number is permanently zero. Every day your site blocks AI crawlers is a day competitors accumulate citations you're not eligible for.
The Perplexity Stealth Crawler Controversy (August 2025)
In August 2025, Cloudflare publicly documented that Perplexity was using stealth crawlers — user agents that didn't identify as PerplexityBot — to bypass robots.txt blocks. The crawlers impersonated regular browser user agents while systematically scraping content for Perplexity's search index.
Cloudflare responded by fingerprinting the stealth bots and offering blocking tools, and Perplexity faced significant backlash from publishers. The incident underscores a practical reality: robots.txt is a voluntary standard. Well-behaved bots respect it, but there's no technical enforcement.
The takeaway isn't to give up on robots.txt. It's to be strategic: allow the search bots you want to cite you, block the training bots you don't, and accept that controlling all AI access is not realistic. The sites that win are the ones that make themselves easy to cite, not the ones that try to hide.
Not sure what's blocking you? Our audit checks AI crawler accessibility as the first of 7 branches. Indexability testing identifies robots.txt blocks, Cloudflare settings, firewall issues, and rate limiting — in 60 seconds.
Run your first audit free. First 5 audits free; no credit card required.
Frequently Asked Questions
Does allowing AI crawlers affect my Google rankings?
No. AI crawlers (GPTBot, ClaudeBot, PerplexityBot) are completely separate from Googlebot. Allowing or blocking them has zero effect on your Google rankings. Google uses Googlebot for search indexing and Google-Extended only for Gemini training.
Should I block GPTBot?
Your choice. Blocking GPTBot prevents your content from being used in OpenAI model training, but it does not affect ChatGPT search citations. ChatGPT search uses OAI-SearchBot, which is a separate user agent. Block training bots if you want; just keep search bots allowed.
Does Cloudflare's AI bot blocking affect Google AI Overviews?
No. Google AI Overviews use Googlebot, not Google-Extended. Cloudflare’s AI bot toggle only affects non-Google AI crawlers. Your AI Overview eligibility is determined by standard Googlebot access and content quality signals.
How can I tell if my site is already blocking AI crawlers?
Check your server access logs for 403 or 429 responses to AI user agents (GPTBot, ClaudeBot, PerplexityBot). If you don’t have log access, run a free audit — our indexability branch checks crawler accessibility as its first test.
Can I block AI crawlers from specific pages only?
Yes. Use path-specific rules in robots.txt (e.g., Disallow: /private/ for a specific bot) or X-Robots-Tag HTTP headers for per-page control. This lets you protect sensitive content while keeping public pages citable.
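A sketch of that path-specific pattern — the /private/ path is a placeholder for whatever you want to shield, and the rest of the site stays open to the same bot:

```
# Keep /private/ away from GPTBot, but leave the rest of the site crawlable
User-agent: GPTBot
Disallow: /private/
Allow: /

# All other crawlers: unrestricted
User-agent: *
Allow: /
```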