Reply To: Best robots.txt setup for large SEO websites

I’ve been through this a few times on larger sites, and the main thing I’d say is: **robots.txt is useful, but it’s not where most crawl-efficiency wins come from**. It’s a blunt instrument: used well, it can cut down on crawl waste, but if you overdo it, you make discovery and debugging harder than necessary.

Here’s how I’d approach it.

## 1) Block only the truly useless URL patterns
For large sites, I usually block things like:

– internal search result URLs
– endless filter combinations that create near-duplicate pages
– session IDs
– tracking parameters that generate crawl traps
– login/admin/cart/account areas
– faceted URLs that don’t have search value

Example patterns:

```txt
User-agent: *
# Internal search, cart, and account areas
Disallow: /search
Disallow: /cart
Disallow: /account
# Parameter-based crawl traps (sort/filter/session/tracking)
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?session=
# Note: only matches when a utm_ parameter comes directly after the ?
Disallow: /*?utm_
```

That said, I’d be careful with blanket parameter blocking. If some parameterized URLs are actually useful landing pages, you may want to handle them with canonicals or noindex instead of robots blocking.
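
For example, if a sorted or filtered URL is just a variation of a category page, a canonical in the page head (rather than a robots.txt rule) keeps it crawlable while consolidating signals to the main URL. A minimal sketch, with placeholder URLs:

```html
<!-- On /category/shoes?sort=price — still crawlable, but consolidates to the base URL -->
<link rel="canonical" href="https://www.example.com/category/shoes" />
```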

A common mistake is blocking too much and then wondering why Google isn’t seeing important variations or internal links.

## 2) Don’t rely on robots.txt for crawl budget optimization alone
If the site is only a few hundred pages, crawl budget usually isn’t the real problem. The bigger issue is often **crawl prioritization** and **site architecture**.

What tends to help more:

– strong internal linking to money pages
– shallow click depth
– clean category hierarchy
– removing low-value pages from the internal link graph
– noindexing thin pages instead of robots-blocking them, so crawlers can still fetch them and see the noindex (example below)
– consolidating duplicates with canonicals

In practice, I’ve seen more benefit from fixing internal linking and pruning weak pages than from aggressive robots rules.
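
On the noindex point above: the reason it pairs badly with robots.txt is that a crawler has to be able to fetch a page to see the noindex at all, and a robots-blocked URL can still end up indexed from external links. A minimal sketch of the on-page version, as a placeholder example:

```html
<!-- In the <head> of a thin page you want crawled but dropped from the index -->
<meta name="robots" content="noindex, follow" />
```

The same directive can also be sent as an `X-Robots-Tag: noindex` HTTP response header, which is handy for PDFs and other non-HTML files.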

## 3) Sitemap structure matters more than people think
For large sites, I prefer:

– separate sitemaps by content type
– separate sitemap index file
– only include canonical, indexable URLs
– keep sitemap URLs clean and current
– remove redirected, noindexed, blocked, or duplicate URLs

Example structure:

– `/sitemap_index.xml`
– `/sitemaps/pages.xml`
– `/sitemaps/categories.xml`
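
For reference, the index file itself is just a small XML document pointing at the child sitemaps. A minimal sketch, assuming the paths above (domain and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemaps/pages.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/categories.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
</sitemapindex>
```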