For a large SEO site, I’d keep robots.txt pretty lean and only block the stuff that’s clearly wasting crawl.
My usual setup:
– **Block obvious junk** (see the sketch after this list):
  – internal search pages
  – faceted/filter URLs that create endless combinations
  – session IDs, tracking params, sort/order params if they explode crawl
– **Don’t go crazy with wildcard blocking** unless you’re sure it won’t catch legit URLs
– **Use canonicals + noindex** for pages you want crawled but not indexed
– **Keep XML sitemaps clean** and only include URLs you actually want indexed
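A minimal sketch of that kind of lean file. The paths and param names here (`/search`, `sessionid`, `sort`) are placeholders for whatever your stack actually uses:

```
# Placeholders only – swap in your real paths/params
User-agent: *
# internal site search
Disallow: /search
# session IDs and sort/order variants that explode crawl
Disallow: /*?*sessionid=
Disallow: /*?*sort=

Sitemap: https://www.example.com/sitemap_index.xml
```

Note the patterns are anchored to the param name. A bare `Disallow: /*?*` would nuke every parameterized URL on the site, which is exactly the wildcard overreach I mean above.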
### On parameter URLs
If the parameter pages are truly duplicate or low-value, blocking them in robots is fine.
But if Google needs to see the page to understand canonicals or discover links, sometimes **noindex is better than disallow**.
That’s the part a lot of people mess up: if you block a URL in robots.txt, Google can’t fetch it at all, so it never sees the canonical, the noindex, or the links on that page. A disallowed URL can even stay indexed as a bare URL if enough links point at it.
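For the “crawl it, but don’t index it” case, that’s the robots meta tag (or the equivalent `X-Robots-Tag` HTTP header):

```html
<!-- Page stays crawlable and its links still pass signals,
     but it's kept out of the index -->
<meta name="robots" content="noindex, follow">
```

The catch: for Google to honor the noindex, the page must *not* be disallowed in robots.txt, otherwise the tag is never seen.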
### Crawl budget
For bigger sites, crawl budget usually gets improved more by:
– removing junk links from the site
– tightening internal linking
– cutting faceted crawl paths
– making sure important pages are linked closer to the homepage/category hubs
Robots.txt helps, but it’s not the main fix. I’ve seen sites waste crawl because their nav/filter system was basically generating infinite URLs. Do the math on one category: 5 colors × 6 sizes × 4 sort orders is already 120 crawlable variants of a single page, before pagination even kicks in.
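One way of cutting faceted crawl paths at the template level, if your setup allows it. The `/shoes` URL is made up, and the caveat is that Google treats nofollow as a hint, so this is damage control rather than a guarantee:

```html
<!-- Hint to crawlers not to follow stacked filter combinations -->
<a href="/shoes?color=red&size=9" rel="nofollow">Red, size 9</a>
```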
### Sitemap structure
I’d split sitemaps by type:
– main pages
– categories
– articles/posts
– product pages
– maybe images/video if relevant
And keep the sitemap URLs aligned with what you actually want indexed. Don’t dump every thin filter page in there.
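Mechanically that’s just a sitemap index pointing at the per-type files (the filenames here are invented):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-categories.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-products.xml</loc></sitemap>
</sitemapindex>
```

Splitting by type also means Search Console reports indexation per bucket, so it’s obvious when, say, products are getting crawled but not indexed.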
### AI crawler access
On **GPTBot** / **ChatGPT-User**: I usually allow them unless there’s a reason not to.
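For reference, GPTBot is OpenAI’s training crawler, while ChatGPT-User fetches pages on demand when a user asks. If you want to be explicit either way, it’s just normal user-agent groups:

```
# Allow OpenAI's crawlers explicitly (flip to "Disallow: /" to opt out)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```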
Do I think allowing them gives a direct visibility boost? Maybe an indirect one, but I wouldn’t treat it like a ranking lever.
Real talk: I haven’t seen any solid proof that allowing those bots moves the needle for Google rankings or “AI search” visibility in a meaningful way. If you want to be listed in AI answers, the bigger wins are still:
– strong topical content
– clear entity signals
– good internal linking
– being cited/linked elsewhere
### My practical take
If I were setting up robots.txt for a large SEO site, I’d focus on blocking genuine crawl waste (internal search, runaway facets, session/tracking params), keep the sitemaps clean, and leave everything else crawlable. The lean file sketched at the top is basically the whole playbook.