
The `robots.txt` file is used by websites to communicate with web crawlers and bots about which parts of the site should not be crawled or accessed. Here are the main directives you can use in a `robots.txt` file, along with detailed explanations and examples:
1. User-agent
- Description: This directive specifies which web crawler the rules apply to. You can name a particular user agent or use `*` to apply the rules to all crawlers.
- Example:

```plaintext
User-agent: *
```

This means that the following rules apply to all web crawlers.
2. Disallow
- Description: This directive tells the web crawler which parts of the site should not be accessed. It can specify a path to a particular file or directory.
- Example:
```plaintext
User-agent: *
Disallow: /private/
```

This tells all user agents that they should not access any URL whose path begins with `/private/`.
3. Allow
- Description: This directive is used to override a `Disallow` directive, typically when a directory is blocked but you want to allow a specific page within that directory.
- Example:

```plaintext
User-agent: *
Disallow: /private/
Allow: /private/public-info.html
```

This means that all user agents are disallowed from accessing `/private/` but can access `/private/public-info.html`.
4. Sitemap
- Description: This directive tells crawlers where to find the site's sitemap, which helps them discover all the important pages on the site.
- Example:
```plaintext
Sitemap: https://www.example.com/sitemap.xml
```

This line tells crawlers where to find the sitemap.
5. Crawl-delay
- Description: This instructs the web crawler to wait a specified number of seconds between requests to the server. Note that not all search engines respect this directive; Bing and Yandex support it, while Google ignores it.
- Example:
```plaintext
User-agent: Bingbot
Crawl-delay: 10
```

This asks Bingbot to wait 10 seconds between requests.
6. Noindex (not standard)
- Description: While not part of the official `robots.txt` specification, some users include a `Noindex` directive to indicate that a page should not be indexed. A more reliable way to prevent indexing is through meta tags or HTTP headers (see the snippet below).
- Example:

```plaintext
User-agent: *
Disallow: /noindex-directory/
Noindex: /noindex-directory/page.html
```

This is not recommended, as the `Noindex` directive will not be recognized by most crawlers; Google, for example, stopped honoring it in 2019.
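For reference, here is what the more reliable alternatives mentioned above look like: a `robots` meta tag placed in the page's HTML, or an `X-Robots-Tag` header sent with the HTTP response.

```plaintext
<!-- In the page's HTML <head> -->
<meta name="robots" content="noindex">

# Or sent as an HTTP response header
X-Robots-Tag: noindex
```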
7. Multiple User-Agents
- Description: You can specify multiple user agents and have different rules for each.
- Example:

```plaintext
User-agent: Googlebot
Disallow: /nogoogle/

User-agent: Bingbot
Disallow: /nobing/

User-agent: *
Disallow: /private/
```

Here, Googlebot is disallowed from accessing `/nogoogle/`, Bingbot is disallowed from `/nobing/`, and all other bots are disallowed from `/private/`. Note that each crawler follows only the group that best matches its user agent, not the combination of all groups.
8. Wildcards
- Description: You can use asterisks (`*`) as wildcards in `Disallow` paths to match any sequence of characters, and a trailing `$` to anchor the pattern to the end of the URL. Major search engines support these patterns, but not all crawlers do.
- Example:

```plaintext
User-agent: *
Disallow: /*.jpg$
```

This disallows all URLs that end with `.jpg`.
9. Comments
- Description: Comments can be added to the `robots.txt` file for clarification. Comments start with a `#`.
- Example:

```plaintext
# This is a comment
User-agent: *
Disallow: /private/
```
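Putting several of these directives together, a complete `robots.txt` might look like the following sketch. The paths and the sitemap URL are placeholders for illustration only.

```plaintext
# Block a private area for all crawlers, but allow one page inside it
User-agent: *
Disallow: /private/
Allow: /private/public-info.html

# Extra rules for one specific crawler
User-agent: Bingbot
Crawl-delay: 10
Disallow: /nobing/

Sitemap: https://www.example.com/sitemap.xml
```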
This comprehensive guide to `robots.txt` directives should cover most scenarios you'll encounter. Remember that while `robots.txt` provides direction to crawlers, it is not a security feature and should not be relied upon to protect sensitive information.
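If you want to check how a `robots.txt` file will be interpreted, Python's standard-library `urllib.robotparser` offers a simple way to test rules locally. The sketch below is a minimal example using placeholder rules and URLs; note that this parser applies rules in the order they appear and does not implement Google-style wildcards, so its answers can differ from how major search engines evaluate the same file.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt using the directives described above.
# Paths and the sitemap URL are placeholders.
robots_txt = """
User-agent: *
Allow: /private/public-info.html
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() accepts the file's contents as a list of lines,
# so no network request is needed for local testing.
parser.parse(robots_txt.splitlines())

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("*", "https://www.example.com/private/secret.html"))       # False
print(parser.can_fetch("*", "https://www.example.com/private/public-info.html"))  # True
print(parser.can_fetch("*", "https://www.example.com/index.html"))                # True

# List the Sitemap URLs declared in the file (Python 3.8+).
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```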