Reply To: Best robots.txt setup for large SEO websites

mercer

For a large site, I’d keep robots.txt pretty conservative. The main goal isn’t to “control Google” so much as to stop wasting crawl on junk while not accidentally blocking something that still needs discovery.

A setup that usually works well for me looks like this:

### 1) Block true junk, not just “thin” pages
I’d block:
– internal search results
– faceted/filter combinations that create endless URL variants
– session IDs / tracking parameters
– sort parameters
– cart/account/login areas
– duplicate print versions if they exist

I would **not** block parameter URLs blindly if some of them are actually useful landing pages or if Google needs to see them to understand the site structure.

A common mistake is blocking too much and then wondering why some pages stop getting discovered or re-crawled efficiently.
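As a rough sketch of the directory side of this, assuming a fairly typical URL structure (the paths below are placeholders, not a drop-in file):

```
# Hypothetical paths - match these to your own site structure
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /account/
Disallow: /login/
Disallow: /print/
```

Anything that's genuinely a dead end for crawlers can go here; anything you might still want discovered or consolidated should not.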

### 2) Use robots.txt for crawl control, not index control
This is where a lot of people go wrong.

If a URL is already known and you want it out of the index, robots.txt alone is often the wrong tool. Disallowing prevents crawling, but the URL can still linger in the index if Google finds links to it elsewhere, and because the page can no longer be crawled, Google will never see a `noindex` tag you put on it.

For pages like:
– internal search pages
– filter pages you don’t want indexed
– low-value parameter URLs

I usually prefer:
– `noindex` where the page can still be crawled
– canonical tags when there’s a clear preferred version
– robots.txt only for the worst crawl traps

That combination is usually safer than aggressive disallow rules.
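To make that concrete, here's a sketch of the on-page version (the URLs are placeholders): the `noindex` sits in the page `<head>`, which only works if the page stays crawlable, and the canonical points a duplicate variant at its preferred version.

```html
<!-- On an internal search or filter page you want crawled but not indexed -->
<meta name="robots" content="noindex, follow">

<!-- On a parameter variant that duplicates a clean category page -->
<link rel="canonical" href="https://example.com/category/running-shoes/">
```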

### 3) Parameter URLs: be selective
For parameter handling, I’d only block patterns that are clearly infinite or useless, like:
– `?sort=`
– `?session=`
– `?replytocom=`
– `?filter=` if it creates huge combinations with no SEO value

But if a parameterized URL is effectively a unique category landing page with search demand, I'd leave it crawlable and handle it with canonicals and internal linking instead.
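For the genuinely useless patterns, the robots.txt side might look something like this (the parameter names are illustrative; use your own, and cover both the `?` and `&` positions):

```
User-agent: *
# Block only parameters you know create infinite or worthless variants
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?session=
Disallow: /*&session=
Disallow: /*?replytocom=
```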

If your site has hundreds of pages, you probably don’t have a “crawl budget crisis” in the classic sense. You usually have a **site architecture and duplication problem** that shows up as crawl inefficiency.

### 4) Sitemap structure matters more than people think
I’d split sitemaps by page type:
– core money pages
– categories
– supporting content
– maybe images/video if relevant

Keep only canonical URLs in each one, so you're not feeding Google URLs you've blocked or noindexed elsewhere.
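A minimal sketch of that split as a sitemap index (domain and filenames are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-categories.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-guides.xml</loc></sitemap>
</sitemapindex>
```

Then reference the index with a `Sitemap:` line in robots.txt and submit it in Search Console, so you can see indexation reported per page type instead of one blended number.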