Crawler Guide

What to Allow in robots.txt for AI

Most AI crawler policy is copied, not chosen. Companies block or allow bots in bulk without deciding what business outcome they actually want. That is how teams end up blocking citation while thinking they only blocked training, or vice versa.

Updated 2026-03-19

At a glance

  • First principle: separate search access from training access.
  • Verify on: the live robots.txt file, not only the repo.
  • Important distinction: search bots and training bots are not the same policy decision.
  • Common failure: upstream services overriding origin robots settings.
  • Business question: do you want citation, training access, both, or neither?
  • Best practice: document the policy instead of inheriting defaults.

The policy decisions that matter

  • Whether search-oriented AI bots can crawl the site.
  • Whether training-oriented bots can crawl the site.
  • Whether upstream infrastructure is injecting its own directives.
  • Whether the live file matches what the team thinks it shipped.

Why teams get surprised

They assume all AI bots are equivalent. They are not. Search discovery, user-initiated fetches, and training crawls are different behaviors.
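As an illustration, a policy that treats those three behaviors separately might look like the fragment below. The user-agent tokens shown (OAI-SearchBot for search, ChatGPT-User for user-initiated fetches, GPTBot for training) are the ones OpenAI documents; verify each vendor's current tokens before shipping, since this is only a sketch of the split.

```
# Search discovery: allowed, so the site can be found and cited.
User-agent: OAI-SearchBot
Allow: /

# User-initiated fetches: allowed, a person asked for this page.
User-agent: ChatGPT-User
Allow: /

# Training crawls: blocked site-wide.
User-agent: GPTBot
Disallow: /
```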

They also forget that CDN or bot-management layers can inject their own directives or override robots behavior entirely, even when the application code looks correct.

The practical process

  • Choose the business outcome first.
  • Set the policy at the origin and any upstream layer that can override it.
  • Fetch the live robots file and confirm the public result.
  • Re-check after infrastructure changes, not just content deploys.
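The confirmation step can be sketched with Python's standard-library robots parser. The policy below is sample data, and the bot tokens (OAI-SearchBot for search, GPTBot for training) are the ones OpenAI documents; substitute whatever bots your own policy covers.

```python
from urllib import robotparser

# Sample live file: search bot allowed, training bot blocked,
# everyone else allowed. Tokens are illustrative.
LIVE_ROBOTS = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(LIVE_ROBOTS.splitlines())

# Confirm the public result matches the intended policy.
assert parser.can_fetch("OAI-SearchBot", "/pricing")  # search: allowed
assert not parser.can_fetch("GPTBot", "/pricing")     # training: blocked
```

Pointing the parser at the live URL with `parser.set_url(...)` followed by `parser.read()` runs the same check against the file crawlers actually see, which is the one that matters.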

Frequently Asked Questions

Short answers to the questions serious buyers and operators ask first.

Can we allow search bots but still restrict training bots?

Yes. That is often the most sensible middle ground for businesses that want visibility and citation without granting blanket training access.

What if our repo looks correct but live traffic still fails?

Check the public robots.txt first. CDN or bot-management layers can override the origin output, and the live file is the one crawlers actually see.
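One way to make that check concrete is to diff what the origin serves against what the public edge serves. In the sketch below, the fetch helper and any hostnames you would pass it are placeholders; the comparison itself is plain line-by-line logic over the two files.

```python
import urllib.request

def fetch_robots_lines(url: str) -> list[str]:
    """Fetch a robots.txt and return its non-empty, stripped lines."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return [line.strip() for line in body.splitlines() if line.strip()]

def robots_drift(origin: list[str], public: list[str]) -> tuple[list[str], list[str]]:
    """Return directives the edge injected (public-only) and stripped (origin-only)."""
    injected = [line for line in public if line not in origin]
    stripped = [line for line in origin if line not in public]
    return injected, stripped

# Example with inline data: the edge has injected a blanket GPTBot block
# that never existed in the repo or at the origin.
origin = ["User-agent: *", "Allow: /"]
public = ["User-agent: GPTBot", "Disallow: /", "User-agent: *", "Allow: /"]
injected, stripped = robots_drift(origin, public)
```

Anything that shows up in `injected` came from an upstream layer, not from your application, which is exactly the surprise this question is about.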

How often should we review this?

Any time infrastructure changes, bot controls change, or AI-search visibility becomes a meaningful growth channel. This is not a set-it-once decision.

Ready for Your AI Workforce?

Book a demo to see how Grail agents can work for your team.
