Understanding robots.txt Files

23 Oct, 2012 Tips and Tricks

It is common knowledge that numerous websites provide indexing or searching services, with Google and Yahoo among the most popular. A less commonly known fact is that their robots (also called bots or crawlers) index every file they can access.

There are two issues that can arise from this:

Bots can index sensitive data, making it easy for the public to access; and
The more files bots process, the higher the load on the server.

The good news is that there is a way to constrain the crawlers: a robots.txt file. This file consists of instructions for search robots and has to be placed in a domain’s root directory. These instructions can include:

Disallowing access to some parts of a website
Indicating how to “mirror” a website accurately
Determining a time limit for robots to download a file from the server

Creating a robots.txt file is as straightforward as creating a regular .txt file. You can either use your favourite text editor and upload it to the server, or create the file with the help of cPanel file manager or console tools, provided you have shell access enabled.

It is still recommended that you create an empty robots.txt file even if you do not intend to restrict indexers in any way.
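If you prefer to script the step, the following is a minimal sketch in Python of creating the file, assuming you can write to the site's document root. The function name `create_robots_txt` and its parameters are hypothetical and used only for illustration.

```python
from pathlib import Path


def create_robots_txt(document_root: str, rules: str = "") -> Path:
    """Write robots.txt into the given document root and return its path.

    An empty `rules` string produces an empty file, which places no
    restrictions on crawlers.
    """
    target = Path(document_root) / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target
```

Calling the function with only the document root creates an empty file; passing a rules string as the second argument writes those instructions instead.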

Helpful Examples of robots.txt Files

Denying indexing of the entire website for all bots:

User-agent: *
Disallow: /

Allowing access to a single robot:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
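One way to check how rules like these are interpreted is Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the helper `is_allowed`, the host example.com, and the agent name "SomeOtherBot" are assumptions, and the single-robot rule set uses the token Googlebot. Real crawlers may match user-agent names differently from this parser.

```python
from urllib.robotparser import RobotFileParser

DENY_ALL = """\
User-agent: *
Disallow: /
"""

ALLOW_ONE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and "SomeOtherBot" are placeholders for illustration.
print(is_allowed(DENY_ALL, "Googlebot", "https://example.com/index.html"))      # False
print(is_allowed(ALLOW_ONE, "Googlebot", "https://example.com/index.html"))     # True
print(is_allowed(ALLOW_ONE, "SomeOtherBot", "https://example.com/index.html"))  # False
```

An empty Disallow line means nothing is disallowed for that robot, which is why the named robot is let through while every other robot falls back to the catch-all rule.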

Denying access for all robots to a part of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/

Denying indexing of some files in a directory for all bots:

User-agent: *
Disallow: /~user/junk.html
Disallow: /~user/playlist.html
Disallow: /~user/photos.html
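Directory and file rules are matched as path prefixes. The sketch below, again using Python's standard `urllib.robotparser`, lists which of a few sample paths would be blocked. The helper `blocked_paths` and the sample paths are hypothetical, and the rule set is a shortened mix of the two examples above.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~user/junk.html
"""


def blocked_paths(rules, paths, user_agent="*"):
    """Return the subset of `paths` that `user_agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


# Sample paths chosen for illustration only.
sample = [
    "/cgi-bin/search.cgi",  # inside a disallowed directory
    "/tmp/cache.dat",       # inside a disallowed directory
    "/~user/junk.html",     # a disallowed file
    "/~user/index.html",    # same directory, but not listed
    "/index.html",          # not covered by any rule
]
print(blocked_paths(RULES, sample))
```

Only the first three paths are reported as blocked: a directory rule covers everything beneath it, while a file rule leaves the other files in that directory accessible.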

You can find a list of robots here: http://www.robotstxt.org/db.html

Some directives, such as Request-rate and Visit-time in the example below, are not supported by all robots. Text after a # (pound sign) is treated as a comment and therefore does not affect the indexing rules.

User-agent: *	# rules apply to all robots
Disallow: /downloads/	# disallows access to the ‘downloads’ directory
Request-rate: 1/5	# limits crawlers to loading 1 page per 5 seconds
Visit-time: 0600-0845	# allows pages to be indexed only from 6:00 a.m. to 8:45 a.m.
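To show how a well-behaved crawler might read these two directives, here is a minimal sketch in Python. `RobotFileParser.request_rate` is part of the standard library (Python 3.6 and later). The standard parser does not expose Visit-time, so the `visit_window` helper is a hand-rolled assumption that ignores user-agent groups and simply reads the first Visit-time line it finds.

```python
import re
from datetime import time
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /downloads/
Request-rate: 1/5
Visit-time: 0600-0845
"""


def seconds_between_requests(rules, user_agent="*"):
    """Return the pause implied by Request-rate, or None if it is absent."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    rate = parser.request_rate(user_agent)
    if rate is None or rate.requests == 0:
        return None
    return rate.seconds / rate.requests


def visit_window(rules):
    """Return (start, end) from the first Visit-time line, or None.

    Simplification: user-agent groups are not taken into account.
    """
    match = re.search(
        r"^Visit-time:\s*(\d{2})(\d{2})-(\d{2})(\d{2})",
        rules,
        re.MULTILINE | re.IGNORECASE,
    )
    if match is None:
        return None
    start_h, start_m, end_h, end_m = map(int, match.groups())
    return time(start_h, start_m), time(end_h, end_m)


print(seconds_between_requests(RULES))  # 5.0
print(visit_window(RULES))
```

A crawler that honours these directives would wait the returned number of seconds between requests and fetch pages only inside the returned time window.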

Although these directives are quite simple, they allow you to manage robots' access to your website effectively.

Note: The robots.txt file may be ignored by some small crawlers.
