Tagged: ai seo, crawl budget, indexing, robots.txt, sitemap
Den
Keymaster
Working on a large SEO-driven website with hundreds of pages and trying to optimize crawl efficiency.
Currently testing different robots.txt configurations to reduce useless crawling and improve index prioritization.
Looking for advice about:
– blocking parameter URLs
– crawl budget optimization
– sitemap structure
– AI crawler access
– handling filters and search pages

Also interested in whether allowing GPTBot and ChatGPT-User provides any visibility benefits for AI search systems.
Anonymous
Guest
I’ve been through this a few times on larger sites, and the main thing I’d say is: **robots.txt is useful, but it’s not where most crawl-efficiency wins come from**. It’s more of a blunt instrument. If you use it well, it can save crawl waste, but if you overdo it, you can make discovery and debugging harder than necessary.
Here’s how I’d approach it.
## 1) Block only the truly useless URL patterns
For large sites, I usually block things like:
– internal search result URLs
– endless filter combinations that create near-duplicate pages
– session IDs
– tracking parameters that generate crawl traps
– login/admin/cart/account areas
– faceted URLs that don’t have search value

Example patterns:
```txt
User-agent: *
Disallow: /search
Disallow: /cart
Disallow: /account
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?session=
Disallow: /*?utm_
```

That said, I’d be careful with blanket parameter blocking. If some parameterized URLs are actually useful landing pages, you may want to handle them with canonicals or noindex instead of robots blocking.
A common mistake is blocking too much and then wondering why Google isn’t seeing important variations or internal links.
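If you do find one parameterized URL that deserves to stay crawlable, Google applies the most specific (longest) matching rule, so you can carve out an exception instead of dropping the whole block. A quick sketch with a made-up URL:

```txt
User-agent: *
# Keep blocking filter combinations in general...
Disallow: /*?filter=
# ...but carve out one filtered page that earns real search traffic.
# The longer rule wins, and $ anchors the match to the end of the URL.
Allow: /shoes?filter=color-blue$
```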
## 2) Don’t rely on robots.txt for crawl budget optimization alone
If the site is only a few hundred pages, crawl budget usually isn’t the real problem. The bigger issue is often **crawl prioritization** and **site architecture**.

What tends to help more:
– strong internal linking to money pages
– shallow click depth
– clean category hierarchy
– removing low-value pages from the internal link graph
– noindexing thin pages instead of blocking them, if you still want them crawled at least once (see the sketch below)
– consolidating duplicates with canonicals

In practice, I’ve seen more benefit from fixing internal linking and pruning weak pages than from aggressive robots rules.
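To be clear on the noindex-instead-of-block point: the page has to stay crawlable (no robots.txt block), or the directive is never seen. A minimal, purely illustrative snippet for the head of a thin page:

```txt
<!-- Keep the URL crawlable so Google can actually read this directive. -->
<!-- "follow" lets the page keep passing link signals while staying out of the index. -->
<meta name="robots" content="noindex, follow">
```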
## 3) Sitemap structure matters more than people think
For large sites, I prefer:
– separate sitemaps by content type
– separate sitemap index file
– only include canonical, indexable URLs
– keep sitemap URLs clean and current
– remove redirected, noindexed, blocked, or duplicate URLs

Example structure:
– `/sitemap_index.xml`
– `/sitemaps/pages.xml`
– `/sitemaps/categories.xml`
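For reference, the index file itself is tiny; a bare-bones sketch with a placeholder domain and dates:

```txt
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <sitemap> entry per child sitemap; <lastmod> is optional but useful. -->
  <sitemap>
    <loc>https://www.example.com/sitemaps/pages.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/categories.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
</sitemapindex>
```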
Anonymous
Guest
I’d keep it pretty simple and avoid getting too clever with robots.txt.
For large SEO sites, my usual setup is:
– block obvious junk URLs only
– let Google crawl money pages freely
– keep internal search pages out of the index
– use canonicals + noindex where needed, not robots.txt for everything
– make sure sitemaps are clean and split by type

A few practical thoughts:
### 1) Parameter URLs
If params are creating tons of duplicate crawl paths, block the worst offenders in robots.txt only if they’re truly useless.

Example:
```txt
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?price=
```

But I wouldn’t block every parameter blindly. Sometimes Google needs to see the page to understand canonicals and content relationships. If the page is a duplicate but still useful for discovery, `noindex,follow` or a canonical is usually better than a hard block.
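For the canonical route, the duplicate/parameter URL just points at the clean version from its head (URLs made up for the example):

```txt
<!-- Served on https://www.example.com/shoes?sort=price -->
<!-- Signals that the unparameterized category page is the preferred version. -->
<link rel="canonical" href="https://www.example.com/shoes">
```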
### 2) Crawl budget
For big sites, crawl waste usually comes from:
– internal search pages
– faceted navigation
– endless sort/filter combos
– calendar/archive junk
– thin tag pages

Best move is to reduce URL creation at the source. Robots.txt is more of a cleanup tool, not the main fix.
If you want faster indexing of important pages:
– keep internal links tight
– reduce orphan pages
– link priority pages from hubs/categories
– keep XML sitemaps clean and updated

### 3) Sitemap structure
I’d split sitemaps by page type:
– /sitemap-products.xml
– /sitemap-categories.xml
– /sitemap-articles.xml
– /sitemap-location.xml if relevant

That makes it easier to spot crawl/index issues. Also, only include URLs you actually want indexed. Don’t dump junk into the sitemap just because the CMS spits it out.
### 4) Filters and search pages
Internal search pages should usually be blocked or noindexed. They’re almost always crawl traps.

Filters are trickier:
– if a filtered page has real search demand and unique value, let it exist and optimize it
– if it’s just a duplicate combination, noindex it or prevent it from being generated/indexed
– if there are thousands of combinations, don’t let them explode in the first place (one blunt pattern for this below)

I’ve seen sites bleed crawl budget hard from faceted nav. Fixing that alone can improve indexation on the real pages pretty quickly.
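On keeping combinations from exploding: one blunt robots.txt pattern people sometimes use is blocking anything with two or more query parameters. Treat it as a sketch and test it against your real URLs first, since it also catches any legitimate multi-parameter page:

```txt
User-agent: *
# Any URL whose query string contains "&" has at least two parameters,
# which on faceted nav is usually a combination page, not a landing page.
Disallow: /*?*&*
```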
Anonymous
Guest
For a large SEO site, I’d keep robots.txt pretty lean and only block the stuff that’s clearly wasting crawl.
My usual setup:
– **Block obvious junk**
– internal search pages
– faceted/filter URLs that create endless combinations
– session IDs, tracking params, sort/order params if they explode crawl
– **Don’t go crazy with wildcard blocking** unless you’re sure it won’t catch legit URLs
– **Use canonicals + noindex** for pages you want crawled but not indexed
– **Keep XML sitemaps clean** and only include URLs you actually want indexed

### On parameter URLs
If the parameter pages are truly duplicate or low-value, blocking them in robots is fine.
But if Google needs to see the page to understand canonicals or discover links, sometimes **noindex is better than disallow**.

That’s the part a lot of people mess up. If you block a URL in robots.txt, Google can’t fetch it, so it never sees the noindex, canonical, or internal links on that page.
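If editing the templates for those parameter URLs is awkward, the same noindex can be sent as an HTTP header instead of a meta tag. The response for the parameter URL would roughly look like:

```txt
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
X-Robots-Tag: noindex, follow
```

Google treats X-Robots-Tag the same as the meta tag, and it’s usually easier to set per URL pattern in the server or CDN config.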
### Crawl budget
For bigger sites, crawl budget usually gets improved more by:
– removing junk links from the site
– tightening internal linking
– cutting faceted crawl paths
– making sure important pages are linked closer to the homepage/category hubs

Robots.txt helps, but it’s not the main fix. I’ve seen sites waste crawl because their nav/filter system was basically generating infinite URLs.
### Sitemap structure
I’d split sitemaps by type:
– main pages
– categories
– articles/posts
– product pages
– maybe images/video if relevant

And keep the sitemap URLs aligned with what you actually want indexed. Don’t dump every thin filter page in there.
### AI crawler access
On **GPTBot** / **ChatGPT-User**: I usually allow them unless there’s a reason not to.
Do I think it gives some direct visibility boost? Maybe indirectly, but I wouldn’t treat it like a ranking lever. (If you do want to allow them explicitly, there’s a small robots.txt snippet after the list below.)

Real talk: I haven’t seen any solid proof that allowing those bots moves the needle for Google rankings or “AI search” visibility in a meaningful way. If you want to be listed in AI answers, the bigger wins are still:
– strong topical content
– clear entity signals
– good internal linking
– being cited/linked elsewhere
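On the allow side, those crawlers are allowed by default unless a broader Disallow catches them; being explicit about it only takes a few lines:

```txt
# OpenAI's crawler used for training / data collection
User-agent: GPTBot
Allow: /

# The agent used when ChatGPT fetches pages on behalf of a user
User-agent: ChatGPT-User
Allow: /
```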
### My practical take

If I were setting up robots.txt for a large SEO site, I’d focus on blocking only the obvious junk (internal search, session/tracking params, runaway facets) and leave the rest to canonicals, noindex, and clean sitemaps.