
llms.txt and AI Crawlers: A Practical 2026 Guide

AI crawlers now take 30%+ of all bot bandwidth. Here's what robots.txt and llms.txt actually do in 2026 — and the specific configuration that protects visibility.

May 15, 2026 · 11 min read

In 2026, AI crawlers are no longer a niche concern. Google-Extended accounts for roughly 31.6% of all bot bandwidth globally. Meta-ExternalAgent has climbed to the second most active AI crawler at 16.7% bandwidth share. OpenAI's GPTBot, Anthropic's ClaudeBot, PerplexityBot, and a growing list of others now visit most public sites daily. Every site owner has to make a decision about how to handle them — and most of the decisions being made are based on outdated guidance.

The confusion is understandable. A new file format called llms.txt has been proposed as the AI-era equivalent of robots.txt. Major frameworks have shipped support for it. Hundreds of articles recommend implementing it as a priority. And yet a 300,000-domain study published in late 2025 found that llms.txt produced no measurable improvement in AI citation rates, while the actual control over how AI engines treat a site still happens almost entirely through robots.txt — the same file that has governed crawler behavior for the last 30 years.

This guide explains what the AI crawler landscape actually looks like in 2026, what robots.txt and llms.txt each do (and do not do), which bots site owners should think carefully about, and the configuration that protects both Google search visibility and AI citation eligibility.

The AI Crawler Landscape in 2026

The bot ecosystem has changed substantially in two years. Where 2024 had a handful of major search crawlers and a small number of AI training bots, 2026 has a sprawling population of specialized crawlers serving different purposes — and the distinctions matter for how to handle them.

The major search crawlers remain. Googlebot crawls for search indexing as it always has. Bingbot serves Bing and Copilot search. These should be allowed without question for any site that wants visibility in traditional search.

The AI training crawlers are the larger and more confusing category. Google-Extended is Google's separate crawler for AI training (Gemini, Vertex AI), distinct from Googlebot which serves search. Blocking Google-Extended has no measurable impact on Google Search rankings, but it removes content from the data Google uses to train its AI products. Meta-ExternalAgent crawls for Meta's Llama training. Bytespider crawls for ByteDance's models. CCBot crawls for Common Crawl, which feeds a long list of downstream AI training datasets.

The AI search crawlers are a third category, often confused with the training crawlers. OAI-SearchBot serves ChatGPT's search and citation features — this is the crawler that determines whether a site appears in ChatGPT responses. Blocking OAI-SearchBot removes a site from ChatGPT search answers entirely. GPTBot is OpenAI's separate training crawler. The two are sometimes treated as one bot in robots.txt configurations, but they serve different purposes and the decision to allow one does not automatically apply to the other.

PerplexityBot crawls for Perplexity's search and answer features. ClaudeBot crawls for Anthropic's training and product features, alongside the related Claude-User and Claude-Web agents. Each of the major AI engines now operates one or more crawlers, and the configurations that govern them are inconsistent across providers.

robots.txt: Still the Real Lever

Despite the attention llms.txt has received, robots.txt remains the only standardized mechanism that AI crawlers actually respect at scale. It is the same file format that has governed crawler behavior since 1994, formalized in RFC 9309 in 2022. Every major AI crawler reads it, follows its directives, and uses it as the authoritative source for what it is permitted to access.

The mechanics are familiar. A robots.txt file lives at the root of a domain. It contains directives that allow or disallow specific user agents from accessing specific paths. User agent matching is case-insensitive but exact on the product token — a group for `GPTBot` does not affect `Googlebot` — while the paths in Allow and Disallow rules are matched case-sensitively.

What has changed in the AI era is the granularity that matters. Blocking a single user agent string used to be sufficient because each crawler had one job. Now, blocking "GPTBot" affects training but not search, while blocking "OAI-SearchBot" affects ChatGPT search visibility but not training. The same is true for Google-Extended versus Googlebot, and for several other providers. Site owners who want to block training but stay visible in AI search need to be specific about which user agent they are targeting.
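To make the split concrete: a minimal robots.txt fragment that opts a site out of OpenAI and Google AI training while keeping it visible in ChatGPT search looks like this.

```
# Opt out of training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Stay visible in ChatGPT search
User-agent: OAI-SearchBot
Allow: /
```

Googlebot needs no explicit group here: with no `User-agent: *` rules present, any crawler that matches no group is unrestricted by default.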

The other practical change is that robots.txt is now scanned by AI engines as a signal in its own right. A robots.txt that explicitly allows a long list of AI crawlers signals openness; one that disallows many of them signals restriction. While this is not a direct ranking factor, it does affect whether a site is treated as available for citation by AI engines that are increasingly making proactive decisions about which sources to draw from.

What llms.txt Actually Does


llms.txt is a proposed file format intended to give AI systems structured guidance about a site's content — what it covers, how it should be referenced, which pages are canonical for which topics. It is positioned as the AI-era complement to robots.txt: where robots.txt governs access, llms.txt governs interpretation.

The reality of llms.txt in 2026 is more limited than its proponents suggest. The major AI crawlers — OpenAI, Google, Anthropic, Perplexity — do not request llms.txt in meaningful volume. GPTBot occasionally fetches it; the others rarely or never do. The 300,000-domain study cited above found no statistically significant relationship between having an llms.txt file and AI citation rates.

This does not mean llms.txt is useless. It does mean the use case is narrower than the marketing suggests.

Where llms.txt has genuine value is as a documentation file that human developers and operators of AI-powered tools can reference when building integrations with a site. A well-structured llms.txt explains the site's content architecture, points to canonical resources, and signals which content is approved for downstream use. For sites with API consumers, partner integrations, or documentation that gets referenced by AI-assisted developer tools, an llms.txt file is a useful artifact.

Where llms.txt does not have value is as a citation optimization tool. Adding an llms.txt to a site does not improve AI citation rates. The signals that drive AI citations remain content quality, structured data, author authority, internal linking, and topical depth — the same factors that drive answer engine optimization and generative engine optimization generally.

The honest framing: llms.txt is worth implementing if the site has a developer-facing audience or partner ecosystem. It is not worth implementing as a substitute for the underlying content and structural work that actually drives AI visibility.

Which AI Bots to Allow

For most sites, the working configuration is to allow the AI crawlers that drive citations and traffic, and to be deliberate about the training crawlers that do not.

The crawlers worth allowing without hesitation:

  • OAI-SearchBot — controls ChatGPT search visibility. Blocking this removes the site from ChatGPT answers entirely.
  • PerplexityBot — controls Perplexity citations.
  • ClaudeBot / Claude-User / Claude-Web — controls Claude citations and product features.
  • Googlebot — search visibility.
  • Bingbot — Bing and Copilot search visibility.

These crawlers determine whether a site is eligible to appear in the AI answers and traditional search results that drive most visibility and traffic. Blocking them is a significant decision with measurable visibility consequences.

The training crawlers — GPTBot, Google-Extended, Meta-ExternalAgent, Bytespider, CCBot — are a separate decision. Allowing them contributes a site's content to model training, which does not directly produce traffic but may indirectly contribute to the model's familiarity with the site's content and brand. Blocking them prevents that contribution. There is no consensus position on the right choice here, and the decision depends on how a site values training contribution versus restriction.

A common pragmatic configuration is to allow the AI search crawlers and the training crawlers from the major AI providers (which are also likely to drive citations) while blocking the more aggressive scraping bots like Bytespider and CCBot that contribute to large-scale training datasets without producing direct citation benefit.

A Recommended Starter Configuration

The following robots.txt fragment is a reasonable starting point for most sites that want AI visibility while preserving some control over training data use.

```
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
```

This configuration allows the crawlers that affect visibility and citation in major search and AI surfaces while blocking aggressive third-party training crawlers. Each line is a deliberate choice, and the right configuration for a specific site depends on its content, audience, and posture toward training data use.
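Before deploying a configuration like this, it is worth verifying that each bot actually sees the access intended. One way is a minimal sketch using Python's built-in `urllib.robotparser`, shown here against an abbreviated copy of the starter configuration:

```python
import urllib.robotparser

# Abbreviated copy of the starter configuration above
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler with no matching group (Googlebot here) falls back to the
# default, which is fully allowed when no "User-agent: *" group exists.
for bot in ["OAI-SearchBot", "GPTBot", "Bytespider", "CCBot", "Googlebot"]:
    verdict = "allowed" if parser.can_fetch(bot, "/any-page/") else "blocked"
    print(f"{bot}: {verdict}")
```

A few seconds of this kind of checking is cheaper than discovering weeks later that a typo in a user agent token silently blocked a crawler that matters.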

For sites where content scraping is a significant business concern — proprietary research, paid content, or licensed media — the configuration should be more restrictive, with explicit blocks for any training crawler and careful evaluation of which search crawlers to allow. For sites whose content is meant to be widely referenced — documentation, public knowledge bases, openly published guides — a more permissive configuration is usually appropriate.

llms.txt Adoption Reality Check

For sites that want to implement llms.txt despite its limited current impact, the format is straightforward. The file lives at the root of the domain (`/llms.txt`). It is written in Markdown, with a structured header section followed by a list of significant resources on the site. The basic structure, shown here with placeholder names and URLs, is:

```
# Example Site

> Brief description of the site and its content focus

## Documentation

- [Getting started](https://example.com/docs/start): What the guide covers
- [API reference](https://example.com/docs/api): What the reference covers

## Articles

- [Key article title](https://example.com/articles/key-post): What the article covers
```

The file is a curation, not a sitemap. It points to the resources the site considers canonical for the topics it covers, with descriptions that help AI tools understand what each resource is for.

The realistic expectation: llms.txt is unlikely to materially affect AI citation rates in 2026. It may become more impactful over time as more AI tools adopt the format, particularly developer-facing and agent-based tools that benefit from structured site guidance. For most sites, implementing llms.txt is a low-cost, low-impact investment that may pay off later. The high-impact work remains in content quality, structured data, author authority, and the broader optimizations covered in AI-powered SEO practice.

Monitoring AI Bot Traffic

A practical step that is often overlooked: most sites have no visibility into which AI bots are actually visiting and how often. Server logs reveal this directly, and the patterns are often surprising.

Bot traffic from Google-Extended, GPTBot, and Meta-ExternalAgent tends to be substantial and consistent — these crawlers are running continuous coverage of the public web. PerplexityBot and OAI-SearchBot show more query-driven patterns, fetching pages in response to user queries that surface the site. ClaudeBot's behavior varies depending on whether it is being triggered by a user query or running broader training collection.
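Extracting these patterns from a raw access log takes only a few lines. A minimal sketch in Python, assuming a combined-format log at a hypothetical path (adjust both for the actual server):

```python
import re
from collections import Counter

# AI crawler substrings to look for in the user agent field
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot",
    "Claude-Web", "Google-Extended", "Meta-ExternalAgent",
    "Bytespider", "CCBot",
]

# In combined log format, the user agent is the last quoted field
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open("/var/log/nginx/access.log") as log:  # hypothetical path
    for line in log:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                counts[bot] += 1
                break

for bot, hits in counts.most_common():
    print(f"{bot}: {hits} requests")
```

Running this against each day's rotated log is enough to distinguish the continuous-coverage crawlers from the query-driven ones.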

Tracking this traffic over time reveals two useful signals. First, it confirms that AI engines are actually visiting the site — if they are not, the content may be undiscovered for reasons unrelated to robots.txt. Second, it shows how AI traffic patterns shift over time as the site builds visibility in AI surfaces, which is one of the few direct indicators of AI search visibility growth available to site owners.

For Philippine businesses specifically, monitoring AI bot traffic also helps identify which AI engines are surfacing the site's content for queries from Philippine audiences, which informs where to focus optimization effort. A site that gets significant Perplexity traffic but minimal ChatGPT traffic has a different optimization priority than one with the reverse pattern. This kind of monitoring is increasingly part of a comprehensive SEO strategy rather than a niche concern.

How This Connects to Broader AI Visibility

The crawler configuration work is foundational but limited in scope. It determines whether AI engines are technically able to access a site's content. It does not determine whether they choose to cite the content when answering user queries — that decision depends on the content, structure, author signals, and topical depth that have been discussed throughout this guide.

The right mental model: robots.txt and llms.txt are gating decisions, not optimization mechanisms. Get them right so that the site is eligible for AI visibility, then invest in the content and structural work that actually earns it. Sites that spend significant effort on llms.txt while neglecting content optimization for ChatGPT and the broader structural patterns AI engines reward are optimizing the wrong layer of the problem.

For sites integrating AI visibility into their broader strategy, the sequence is straightforward. First, get the crawler configuration right — allow what should be allowed, block what should be blocked. Second, audit the content for the structural elements AI engines weight: clear headings, FAQ blocks, schema, named authors, internal linking. Third, build the topical depth and ongoing publishing cadence that produces citation-worthy content over time. The crawler work takes hours; the content work takes years; the relative impact reflects that ratio.

Frequently Asked Questions

Should I implement llms.txt on my site in 2026?

It is worth implementing if the site has a developer audience, partner integrations, or documentation that benefits from structured guidance for AI tools. It is not worth prioritizing as a way to improve AI citation rates — current data shows no measurable impact. Implement it when convenient, but invest the larger share of effort in content and structural optimization.

Does blocking GPTBot affect my Google search rankings?

No. GPTBot is OpenAI's training crawler and has no relationship to Google's ranking systems. Blocking it removes content from OpenAI's training data but does not affect Google search visibility. The crawler to think carefully about for Google is Google-Extended, which controls Google's AI training (Gemini), and even that does not affect search rankings — only Google's AI training data use.

What happens if I accidentally block OAI-SearchBot?

The site is removed from ChatGPT search answers entirely. ChatGPT users asking questions where the site's content would have been a relevant citation will get answers that exclude it. This is one of the most consequential bot decisions a site can make, and worth double-checking on any site that wants AI search visibility.

Are there legal implications for allowing or blocking AI training crawlers?

The legal landscape is unsettled. Some jurisdictions are introducing or considering regulations around AI training data and the right to opt out. The practical position in 2026 is that robots.txt directives are the standard mechanism for opting out, and major AI providers have stated they respect those directives. Sites with significant proprietary content concerns should consult legal counsel rather than rely on technical configuration alone.

How do I know which AI bots are actually visiting my site?

Server logs are the authoritative source. Filter access logs by user agent to identify which AI crawlers are visiting, how often, and which pages they access. Most hosting platforms provide log access; many analytics tools now have bot traffic dashboards that surface this without raw log analysis. Monitoring this over time reveals both the technical posture of AI engines toward the site and shifts in how AI visibility is evolving.
