It is common knowledge that numerous websites provide indexing or searching services – Google and Yahoo being the two most popular. A less commonly known fact is that their robots (also called bots or crawlers) index every file they can access.
There are two issues that can arise from this:
- Bots can index sensitive data, making it easy for the public to access it
- The more files bots process, the higher the load on the server
The good news is that there is a way to constrain crawlers: a robots.txt file. This file consists of instructions for search robots and must be placed in the domain’s root directory. These instructions can include:
- Disallowing access to some parts of a website
- Indicating how to “mirror” a website accurately
- Limiting how often, or at what times, robots may download files from the server
Creating a robots.txt file is as straightforward as creating a regular .txt file. You can either use your favourite text editor and upload the file to the server, or create it with the cPanel File Manager or console tools, provided you have shell access enabled.
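As a minimal sketch, the file can also be generated programmatically. Here it is written to the current directory so the example runs anywhere; on a real server the target would be the domain’s document root (the path is an assumption, not something this article prescribes):

```python
# A minimal sketch: generate a simple robots.txt programmatically.
# DOCROOT is a stand-in for the domain's root directory ('.' here so
# the example runs anywhere; on a real server it might differ).
from pathlib import Path

DOCROOT = Path(".")

rules = "\n".join([
    "User-agent: *",
    "Disallow: /tmp/",
]) + "\n"

# Write the file and echo it back to confirm the contents.
(DOCROOT / "robots.txt").write_text(rules)
print((DOCROOT / "robots.txt").read_text())
```

The same lines could just as well be typed into any text editor; the point is only that robots.txt is an ordinary plain-text file.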
It is still recommended that you create an empty robots.txt file even if you have no intention of adding any disallow rules for indexers.
Helpful Examples of robots.txt Files
Denying indexing of the entire website for all bots:
User-agent: *
Disallow: /
Allowing access to a single robot:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Denying access for all robots to a part of a website:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
Denying indexing of some files in a directory for all bots:
User-agent: *
Disallow: /~user/junk.html
Disallow: /~user/playlist.html
Disallow: /~user/photos.html
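One way to sanity-check rules like the ones above is Python’s standard-library urllib.robotparser. This is a sketch, not part of the robots.txt standard itself; the paths are taken from the example above:

```python
# A sketch: verify the per-file rules above with Python's
# standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /~user/junk.html",
    "Disallow: /~user/playlist.html",
    "Disallow: /~user/photos.html",
]

parser = RobotFileParser()
parser.parse(rules)

# Listed files are blocked; anything else stays accessible.
print(parser.can_fetch("*", "/~user/junk.html"))    # False
print(parser.can_fetch("*", "/~user/resume.html"))  # True
```

can_fetch takes a user-agent string and a URL (or path) and returns whether that agent may retrieve it under the parsed rules.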
You can find a list of robots here: http://www.robotstxt.org/db.html
Some directives, such as Request-rate and Visit-time below, are not supported by all robots. Also note that text after a # (pound sign) is considered a comment and does not affect the indexing rules:
User-agent: *          # rules applied to all robots
Disallow: /downloads/  # disallowing access to the ‘downloads’ directory
Request-rate: 1/5      # limits crawlers to loading 1 page per 5 seconds
Visit-time: 0600-0845  # pages may be indexed from 6:00 a.m. till 8:45 a.m. only
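Python’s urllib.robotparser happens to understand Request-rate (though, as noted, not every crawler does; it ignores Visit-time entirely). A small sketch using the directives above, which also shows that text after # is stripped as a comment:

```python
# A sketch: parse the example directives above, comments included.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *          # rules applied to all robots",
    "Disallow: /downloads/  # disallowing access to the downloads directory",
    "Request-rate: 1/5      # 1 page per 5 seconds",
]

parser = RobotFileParser()
parser.parse(rules)

# The comments do not affect the rules: the directory is still blocked.
print(parser.can_fetch("*", "/downloads/file.zip"))  # False

# Request-rate is exposed as a (requests, seconds) named tuple.
rate = parser.request_rate("*")
print(rate.requests, rate.seconds)  # 1 5
```

Crawlers that do not support Request-rate simply skip the line, which is why such directives are best treated as hints rather than guarantees.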
While these directives are quite simple, they make effective management of robots’ access to a website possible.
Note: The robots.txt file may be ignored by some small crawlers.