Reply To: Best robots.txt setup for large SEO websites

#560
Anonymous
Guest

I’d keep it pretty simple and avoid getting too clever with robots.txt.

For large SEO sites, my usual setup is:

– block obvious junk URLs only
– let Google crawl money pages freely
– keep internal search pages out of the index
– use canonicals + noindex where needed, not robots.txt for everything
– make sure sitemaps are clean and split by type

A few practical thoughts:

### 1) Parameter URLs
If params are creating tons of duplicate crawl paths, block the worst offenders in robots.txt only if they’re truly useless.

Example:
```txt
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?price=
```

But I wouldn’t block every parameter blindly. Sometimes Google needs to see the page to understand canonicals and content relationships. If the page is duplicate but still useful for discovery, `noindex,follow` or canonical is usually better than a hard block.
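If it helps, here's the tag-level version of that. A minimal sketch; the /shoes URLs are just placeholders:

```html
<!-- on the parameterized duplicate, e.g. /shoes?sort=price (hypothetical URL) -->
<!-- option A: keep it crawlable but out of the index, links still followed -->
<meta name="robots" content="noindex, follow">

<!-- option B: consolidate signals to the clean version instead -->
<link rel="canonical" href="https://example.com/shoes">
```

Either way, Google has to actually crawl the page to see these tags, which is exactly why a hard robots.txt block would hide them.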

### 2) Crawl budget
For big sites, crawl waste usually comes from:
– internal search pages
– faceted navigation
– endless sort/filter combos
– calendar/archive junk
– thin tag pages

The best move is to reduce URL creation at the source. Robots.txt is more of a cleanup tool than the main fix.

If you want faster indexing of important pages:
– keep internal links tight
– reduce orphan pages
– link priority pages from hubs/categories
– keep XML sitemaps clean and updated

### 3) Sitemap structure
I’d split sitemaps by page type:
– /sitemap-products.xml
– /sitemap-categories.xml
– /sitemap-articles.xml
– /sitemap-locations.xml, if relevant

That makes it easier to spot crawl/index issues. Also only include URLs you actually want indexed. Don’t dump junk into the sitemap just because the CMS spits it out.
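For reference, the index file tying those together looks roughly like this (example.com and the lastmod dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
</sitemapindex>
```

Checking each child sitemap separately in Search Console makes it obvious which page type is lagging on indexation.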

### 4) Filters and search pages
Internal search pages should usually be blocked or noindexed. They’re almost always crawl traps.
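Assuming search lives at /search or runs on an s/q parameter (adjust for your setup), the block is simple:

```txt
User-agent: *
Disallow: /search
Disallow: /*?s=
Disallow: /*?q=
```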

Filters are trickier:
– if a filtered page has real search demand and unique value, let it exist and optimize it
– if it’s just a duplicate combination, noindex it or stop it from being generated in the first place
– if there are thousands of combinations, don’t let them explode

I’ve seen sites bleed crawl budget hard from faceted nav. Fixing that alone can improve indexation of the real pages pretty quickly.
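If the combinations are already exploding and you need a stopgap, one pattern I’ve used is to leave single-facet URLs crawlable but block anything with a second parameter chained on. A rough sketch, assuming the parameter names from earlier:

```txt
User-agent: *
# first-parameter versions (/shoes?filter=red) stay crawlable;
# anything with a second chained parameter (/shoes?color=red&filter=x) is blocked
Disallow: /*?*&filter=
Disallow: /*?*&sort=
Disallow: /*?*&price=
```

Wildcard rules like these are easy to overblock with, so test them against a sample of real URLs before shipping.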