AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended)
The web agents AI companies use to read, train on and cite the web.
AI crawlers are the web agents AI companies use to read web content, and each one can be allowed or blocked independently in robots.txt. The crucial distinction is between training crawlers and search/citation crawlers: they serve separate purposes and answer to separate robots.txt tokens. OpenAI runs GPTBot (training data for foundation models), OAI-SearchBot (powers ChatGPT search results and citations) and ChatGPT-User (user-initiated page fetches). Anthropic runs ClaudeBot (training), Claude-SearchBot (indexing for Claude's search) and Claude-User (user-triggered fetches).
Perplexity runs PerplexityBot, which it states surfaces and links sites in search results and is "not used to crawl content for AI foundation models," plus Perplexity-User for user-triggered fetches. Google-Extended (introduced 28 September 2023) is not a separate crawler but a robots.txt token that controls whether content may be used to train and ground Gemini; Google states it "does not impact a site's inclusion in Google Search nor is it used as a ranking signal." The practical takeaway: to be cited, you must allow the relevant search/citation crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot/Claude-User). Blocking training crawlers is a separate decision that does not by itself make you citable, while blocking search crawlers makes you invisible in those products.
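As a rough illustration of the split described above, the sketch below embeds one possible robots.txt policy (opt out of training, stay reachable for search and citation) and checks it with Python's standard urllib.robotparser. The policy itself, the helper name crawler_access and the example.com URL are assumptions made for this example, not a recommended configuration. Python's parser is also only an approximation: each vendor's crawler applies its own robots.txt matching, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy for illustration: block the training tokens,
# allow the search/citation and user-triggered fetchers.
ROBOTS_TXT = """\
# Training tokens: blocked
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

# Search, citation and user-triggered fetchers: allowed
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
Allow: /

# Everything else: allowed
User-agent: *
Allow: /
"""

TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended"]
SEARCH_TOKENS = [
    "OAI-SearchBot", "ChatGPT-User",
    "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]


def crawler_access(robots_txt: str, tokens: list[str],
                   url: str = "https://example.com/article") -> dict[str, bool]:
    """Return {token: allowed?} for `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    access = crawler_access(ROBOTS_TXT, TRAINING_TOKENS + SEARCH_TOKENS)
    for token, allowed in access.items():
        print(f"{token:18} {'allowed' if allowed else 'blocked'}")
```

Run against a site's real robots.txt text, the same helper gives a quick way to spot a rule that blocks a search/citation token by accident. Note that Google-Extended appears here only as a robots.txt token; as described above, it is not a separate crawler.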
Sources
- Overview of OpenAI Crawlers (GPTBot, OAI-SearchBot, ChatGPT-User) — OpenAI
- Does Anthropic crawl the web, and how to block the crawler — Anthropic (Claude Help Center)
- Perplexity Crawlers (PerplexityBot, Perplexity-User) — Perplexity
- Google-Extended — Google's common crawlers — Google Search Central