Robots.txt Analyzer - Validate Crawl Rules

Analyzing robots.txt for SEO and Crawl Control

The robots.txt file is typically the first file a search engine crawler requests when it visits your site. It controls which URLs Googlebot, Bingbot, and other compliant crawlers may fetch, which directly affects your site's search visibility. A single misconfigured Disallow rule can keep entire sections from being crawled, while contradictory rules produce crawler behavior that is hard to predict. The Robots.txt Analyzer parses your file structure, detects problematic rules, simulates URL matching for specific crawlers, and identifies configurations that may harm your SEO performance.

Beyond basic syntax validation, the analyzer accounts for crawler-specific behavior: crawlers differ in how they resolve overlapping rules, crawl-delay is honored by some crawlers but ignored by Google, and sitemap declarations help crawlers discover content that is not linked from your main navigation. The tool flags dangerous blocks (disallowing CSS/JS files, blocking entire directories unintentionally) and suggests corrections to maximize crawl efficiency while protecting sensitive paths.

User-Agent Groups and Rule Precedence

robots.txt organizes rules into user-agent groups, each targeting specific crawlers. Understanding precedence is critical for correct configuration:

User-agent: * — Default rules applying to all crawlers without specific groups
User-agent: Googlebot — Rules specific to Google's primary crawler
User-agent: Bingbot — Rules specific to Microsoft's crawler
User-agent: Googlebot-Image — Google's image-specific crawler

When a specific user-agent group exists, crawlers use ONLY those rules and ignore the wildcard group entirely. This means a permissive Googlebot section overrides restrictive wildcard rules — a common source of unintended exposure. The analyzer detects these precedence conflicts and warns when crawler-specific rules contradict the general configuration.
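
The precedence rule above can be sketched in a few lines of Python. This is a minimal illustration, not the analyzer's actual code: the function names `parse_groups` and `select_group` are hypothetical, and the substring-based token matching is a simplifying assumption about how a crawler name is compared with a User-agent line.

```python
def parse_groups(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []              # a new group starts here
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
    return groups


def select_group(groups: dict[str, list[tuple[str, str]]], crawler: str):
    """Return the most specific group for `crawler`, else the '*' group."""
    crawler = crawler.lower()
    # Assumption: a group applies when its token is a substring of the name.
    matches = [a for a in groups if a and a != "*" and a in crawler]
    if matches:
        return groups[max(matches, key=len)]     # longest token wins
    return groups.get("*", [])


SAMPLE = """User-agent: *
Disallow: /admin/

User-agent: Googlebot
Allow: /
"""
groups = parse_groups(SAMPLE)
print(select_group(groups, "Googlebot"))   # only the Googlebot group applies
print(select_group(groups, "Bingbot"))     # falls back to the wildcard group
```

In this sample the Googlebot group contains no Disallow for /admin/, so the wildcard block does not apply to Googlebot, which is exactly the unintended exposure described above.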

Dangerous Disallow Patterns

The analyzer detects Disallow rules that commonly cause SEO damage:

Site-wide block: Disallow: / stops compliant crawlers from fetching any page on the site. Occasionally intentional for staging sites but catastrophic if deployed to production.
Blocking CSS/JS: Disallow: /assets/ or /static/ prevents crawlers from rendering pages properly, harming mobile-friendliness scores.
Overly broad patterns: Disallow: /p blocks /products/, /pages/, /pricing/ — anything starting with /p.
Query parameter blocking: Disallow: /*? blocks all URLs with parameters, including paginated content and filtered category pages that should be indexed.
Contradictory Allow/Disallow: Overlapping rules whose outcome depends on how each crawler resolves precedence, so the same URL may be blocked for one bot and allowed for another.
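
A rough sketch of how checks like these might be written follows. The finding codes, the `ASSET_PREFIXES` list, and the two-character threshold for "overly broad" are assumptions chosen for illustration; the analyzer's real heuristics are not documented here.

```python
import re

# Assumed asset directories; adjust to the site's real layout.
ASSET_PREFIXES = ("/assets/", "/static/")


def lint_disallow(path: str) -> list[str]:
    """Return hypothetical finding codes for a single Disallow value."""
    findings = []
    if path == "/":
        findings.append("blocks-entire-site")
    if path.startswith(ASSET_PREFIXES):
        findings.append("blocks-assets")
    # Heuristic: a one- or two-character prefix with no trailing slash
    # matches every path that starts with those characters.
    if re.fullmatch(r"/[A-Za-z0-9_-]{1,2}", path):
        findings.append("overly-broad-prefix")
    if path in ("/*?", "/*?*"):
        findings.append("blocks-all-query-urls")
    return findings


for value in ["/", "/assets/", "/p", "/*?", "/admin/"]:
    print(f"Disallow: {value:<10} -> {lint_disallow(value) or 'ok'}")
```

Contradictory rules are left out of this sketch because they need the whole group rather than one value at a time.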

URL Simulation and Crawl Testing

The analyzer includes URL simulation — test whether a specific URL would be allowed or blocked for a given user-agent. This is essential for verifying that:

Important pages (homepage, product pages, blog posts) are accessible to crawlers
Admin panels, user account pages, and internal tools are properly blocked
API endpoints that should not be indexed are covered by Disallow rules
Paginated URLs and filtered views follow your intended crawl policy

URL matching in robots.txt uses path prefix matching with limited wildcard support (* for any sequence, $ for end-of-URL anchor). The simulator applies the same matching logic that Google's crawler uses, revealing whether complex wildcard patterns actually match the URLs you intend to block or allow.
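
One common way to implement this matching is to translate each rule path into a regular expression. The sketch below assumes that approach; `pattern_to_regex` and `rule_matches` are illustrative names, and the code is not taken from Google's parser or from the analyzer.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")             # trailing $ pins the end
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then rejoin the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def rule_matches(pattern: str, url_path: str) -> bool:
    """True when the pattern matches from the start of the URL path."""
    # re.match anchors at the start only, which gives prefix semantics.
    return pattern_to_regex(pattern).match(url_path) is not None


print(rule_matches("/p", "/products/shoes"))        # prefix match
print(rule_matches("/*?", "/shop?page=2"))          # any URL with a query
print(rule_matches("/*.pdf$", "/docs/a.pdf?x=1"))   # $ rejects the suffix
```

An empty pattern would match every path in this sketch, so a caller should skip empty Disallow values before testing them.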

Code Examples

robots.txt with Issues Detected by Analyzer

# Problematic robots.txt example
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /assets/       # ISSUE: Blocks CSS/JS rendering
Disallow: /p             # ISSUE: Overly broad, blocks /products/

User-agent: Googlebot
Allow: /                 # ISSUE: Overrides ALL wildcard rules for Googlebot
                         # including the /admin/ block

Sitemap: http://example.com/sitemap.xml  # ISSUE: HTTP, site uses HTTPS

# Analyzer findings:
# - WARN: Disallow /assets/ may block CSS/JS rendering
# - ERROR: Disallow /p is overly broad (matches /products/, /pricing/, etc.)
# - ERROR: Googlebot group allows / without re-specifying needed blocks
# - WARN: Sitemap uses HTTP but site likely uses HTTPS
# - INFO: No crawl-delay specified (acceptable for most sites)

Well-Structured robots.txt

# Production robots.txt: clean configuration
User-agent: *
Allow: /
# Block admin and internal paths
Disallow: /admin/
Disallow: /api/internal/
Disallow: /account/
Disallow: /checkout/
# Block sorted and filtered views (matches only when the parameter comes first)
Disallow: /*?sort=
Disallow: /*?filter=
# Keep assets crawlable for rendering
Allow: /assets/
Allow: /static/

# Sitemap declaration
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml

Frequently Asked Questions

What does the Robots.txt Analyzer detect?

The analyzer parses user-agent groups, detects contradictory rules (same path both allowed and disallowed), simulates URL allow/block for specific crawlers, identifies sitemap declarations, flags dangerous blocks (blocking entire site, important paths, or major search engine bots), and checks for structural issues like missing sitemaps or excessive crawl-delay values.

What are contradictory rules in robots.txt?

Contradictory rules occur when the same URL path is both allowed and disallowed for the same user-agent, for example 'Allow: /blog' and 'Disallow: /blog' in the same group. Googlebot resolves conflicts by longest match and, because these two rules are the same length, falls back to the least restrictive one (Allow). Other crawlers may decide differently, so the conflict should be cleaned up.
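
Detecting the simplest form of contradiction, the identical path under both directives, can be sketched as a set intersection. The function name and the `(directive, path)` tuple shape are assumptions made for this example.

```python
def find_contradictions(rules: list[tuple[str, str]]) -> list[str]:
    """Return paths that appear under both Allow and Disallow in one group."""
    allowed = {path for directive, path in rules if directive.lower() == "allow"}
    blocked = {path for directive, path in rules if directive.lower() == "disallow"}
    return sorted(allowed & blocked)


group = [("Allow", "/blog"), ("Disallow", "/blog"), ("Disallow", "/admin/")]
print(find_contradictions(group))
```

This only catches exact duplicates; overlapping patterns such as /blog and /blog/* would need the matching logic as well.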

Why is 'Disallow: /' flagged as critical?

Disallow: / blocks access to the entire website for the specified user-agent. When applied to all crawlers (User-agent: *) or to major search engines, it prevents them from crawling any page, which effectively removes your content from search results (a blocked URL that other sites link to can still be listed, but without a description). Use it only on staging or private environments.

How does URL allow/block simulation work?

The analyzer uses the longest-match-wins strategy (the same approach used by Googlebot). For a given URL and user-agent, it checks all matching rules and the one with the longest path prefix wins. On equal-length matches, Allow takes priority over Disallow. If no rules match, the URL is allowed by default.
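
The resolution order described above can be sketched as follows. To keep the example short it uses plain prefix matching and ignores the * and $ wildcards; `is_allowed` is a hypothetical name, and the rules are assumed to be the (directive, path) pairs of the group already selected for the crawler.

```python
def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """Apply longest-match-wins; Allow wins ties; no match means allowed."""
    best_len = -1
    allowed = True                          # default when nothing matches
    for directive, path in rules:
        if not path or not url_path.startswith(path):
            continue                        # empty or non-matching rule
        is_allow = directive.lower() == "allow"
        longer = len(path) > best_len
        tie_for_allow = len(path) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(path)
            allowed = is_allow
    return allowed


rules = [("Allow", "/"), ("Disallow", "/admin/")]
print(is_allowed(rules, "/admin/users"))    # the longer Disallow wins
print(is_allowed(rules, "/pricing"))        # only Allow: / matches
```

Because rule order never enters the comparison, reordering lines within a group does not change the result under this strategy.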

Is my robots.txt sent to any server?

No. All analysis happens entirely in your browser using JavaScript. Your robots.txt content — which may reveal internal URL structures and website architecture — never leaves your device. No data is stored, logged, or transmitted.

Why should I include a Sitemap directive?

The Sitemap directive tells crawlers where to find your XML sitemap, helping them discover pages more efficiently. While crawlers may find your sitemap through other means (like Google Search Console), including it in robots.txt provides an additional discovery mechanism that works across all compliant crawlers.

What format should the input be?

Paste your robots.txt file content exactly as it appears on your server (typically at /robots.txt). The analyzer expects standard robots.txt syntax with User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives. Comments (lines starting with #) are preserved but ignored during analysis.

What is crawl-delay and why is a high value flagged?

Crawl-delay tells bots how many seconds to wait between requests. Values above 10 seconds are flagged because they severely limit how quickly search engines can crawl your site, potentially causing pages to be indexed slowly or not at all. Note that Googlebot ignores crawl-delay, and the crawl rate limiter in Google Search Console was retired in early 2024; Google's documented way to slow Googlebot temporarily is to return 500, 503, or 429 responses.
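
A check for the 10-second threshold might look like the sketch below. The function name, the message wording, and the handling of non-numeric values are assumptions for illustration.

```python
def check_crawl_delay(text: str, max_seconds: float = 10.0) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for questionable Crawl-delay lines."""
    findings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()          # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "crawl-delay":
            continue
        try:
            seconds = float(value)
        except ValueError:
            findings.append((number, "invalid value"))
            continue
        if seconds > max_seconds:
            findings.append((number, f"{seconds:g}s exceeds {max_seconds:g}s"))
    return findings


print(check_crawl_delay("User-agent: *\nCrawl-delay: 30\n"))
```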