SE Submission

Thursday, July 22, 2010

Robots.txt: Learn about the robots.txt file

What is a robots.txt file?

The robots.txt file tells search engine robots (also called spiders) which pages of a site they may access and index and which pages they may not.

In other words, the robots.txt file restricts search engine robots from accessing your site, either entirely or in certain sections. The file serves as a set of instructions telling robots which pages they can visit and which they cannot.

The robots.txt file is placed at the root of your server, in the same location as your index or home page. Only the pages or areas you want to keep away from search engine spiders need to be listed in it.

Whenever a robot wants to visit a site, it first checks the site's robots.txt file. In that file the robot may find:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots.

The "Disallow: /" tells the robot that it should not visit any pages on the site.

The simplest robots.txt file uses two rules:

  • User-agent: the robot the following rule applies to
  • Disallow: the URL you want to block

These two lines together are considered a single entry in the file. You can include as many entries as you want, and a single entry can contain multiple Disallow lines and multiple User-agent lines.
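These rules can be checked programmatically. Python's standard-library urllib.robotparser module implements the original robots.txt rules; the sketch below (with a hypothetical example.com URL) parses the two-line file shown above and asks whether a robot may fetch a page:

```python
import urllib.robotparser

# A minimal robots.txt: one entry that blocks every robot from the whole site.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# No robot may fetch any page on the site.
print(parser.can_fetch("SomeBot", "http://www.example.com/index.html"))  # False
```

With an empty Disallow line instead, the same call would return True for every page.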

Each section in the robots.txt file is separate and does not build upon previous sections. For example:

User-agent: *

Disallow: /folder1/

User-Agent: Googlebot

Disallow: /folder2/

In this example, only URLs matching /folder2/ would be disallowed for Googlebot; Googlebot ignores the entry for all robots, so /folder1/ remains open to it.
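Python's urllib.robotparser applies the same most-specific-entry logic, so the example above can be verified directly (the page paths are hypothetical):

```python
import urllib.robotparser

# Two separate entries: a default one and a Googlebot-specific one.
robots_txt = """\
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot obeys only its own entry: /folder1/ is open, /folder2/ is blocked.
print(parser.can_fetch("Googlebot", "/folder1/page.html"))  # True
print(parser.can_fetch("Googlebot", "/folder2/page.html"))  # False
# Every other robot falls back to the '*' entry.
print(parser.can_fetch("OtherBot", "/folder1/page.html"))   # False
```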

You need to put robots.txt in the top-level directory of your web server for the resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page.

Always write the filename in all lower case: "robots.txt", never "ROBOTS.TXT" or "Robots.txt".

A robots.txt file can contain one or more records, but each line must contain only a single directive. Do not put multiple paths in a single line; use a separate Disallow line for each path.

Example:

User-agent: *
Disallow: /skin/
Disallow: /user/
Disallow: /~style/

Not like this:

"Disallow: /skin/ /user/"

"Disallow: /skin/, /user/"
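A parser makes the cost of the malformed form concrete: everything after "Disallow:" is treated as one path prefix, so the rule silently blocks nothing useful. A small sketch with Python's urllib.robotparser (the paths are hypothetical):

```python
import urllib.robotparser

def blocked(robots_txt, path):
    """Return True if the '*' robot is blocked from the given path."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", path)

# Correct form: one path per Disallow line.
good = "User-agent: *\nDisallow: /skin/\nDisallow: /user/\n"
# Malformed form: two paths crammed into one line.
bad = "User-agent: *\nDisallow: /skin/ /user/\n"

print(blocked(good, "/user/profile.html"))  # True  - blocked as intended
print(blocked(bad, "/user/profile.html"))   # False - the rule silently fails
```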

Remember that robots.txt does not support regular expressions, so you can't write rules like these (some crawlers, such as Googlebot, do support limited wildcard patterns; see the pattern matching section below):

“User-agent: *bot*"

"Disallow: /user/*"

"Disallow: *.js"

Note that in "User-agent: *" the * is not a regular expression; it is a special value that means "any robot".
 

1. Exclude a file from an individual Search Engine

You have a file, privatefile.htm, in a directory called 'private' that you do not wish to be indexed by Google. You know that the spider that Google sends out is called 'Googlebot'. You would add these lines to your robots.txt file:

User-Agent: Googlebot
Disallow: /private/privatefile.htm

2. Exclude a section of your site from all spiders and bots

You are building a new section of your site in a directory called 'newsection' and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude; you can simply use the wildcard character, '*', to exclude them all.

User-Agent: *
Disallow: /newsection/

Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.

3. Allow all spiders to index everything

Once again you can use the wildcard, '*', to let all spiders know they are welcome. This time you leave the Disallow line empty: nothing is disallowed.

User-agent: *
Disallow:

4. Allow no spiders to index any part of your site

This requires just a tiny change from the command above - be careful!

User-agent: *
Disallow: /

If you use this command while building your site, don't forget to remove it once your site is live!

To exclude all robots from part of the server
User-agent: *
Disallow: /skin/
Disallow: /user/
Disallow: /jobs/admin.php
 
To exclude a single robot
User-agent: BadBot
Disallow: /
 
To allow a single robot

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
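Note that on its own, an entry with an empty Disallow changes nothing, because robots are allowed by default; a second entry is needed to block everyone else. A sketch of the full two-entry file, checked with Python's urllib.robotparser (the crawler names are illustrative):

```python
import urllib.robotparser

# Allow only Googlebot; every other robot is blocked from the whole site.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Googlebot", "/index.html"))  # True
print(parser.can_fetch("BadBot", "/index.html"))     # False
```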

To prevent all robots from indexing a page on your site, place the following meta tag into the <head> section of your page:

<meta name="robots" content="noindex">

To allow other robots to index the page while preventing only Google's robots from indexing it, place this tag in the <head> section instead:

<meta name="googlebot" content="noindex">

User-agents and bots

A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:

User-agent: *

Google uses several different bots (user-agents). The bot used for web search is Googlebot. Google's other bots, such as Googlebot-Mobile and Googlebot-Image, follow the rules you set up for Googlebot, but you can also set up specific rules for those bots.

Blocking user-agents

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).

  • To block the entire site, use a forward slash.
Disallow: /
  • To block a directory and everything in it, follow the directory name with a forward slash.
Disallow: /junk-directory/ 
  • To block a page, list the page.
Disallow: /private_file.html
  • To remove a specific image from Google Images, add the following:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg 
  • To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: / 
  • To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
  • To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /

Note that directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white-space (in particular, empty lines) and unknown directives in robots.txt.
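The case-sensitivity of paths is easy to confirm with Python's urllib.robotparser (the example.com URLs are hypothetical):

```python
import urllib.robotparser

robots_txt = "User-agent: *\nDisallow: /junk_file.asp\n"

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Paths in robots.txt are case-sensitive: only the exact spelling is blocked.
print(parser.can_fetch("*", "http://www.example.com/junk_file.asp"))  # False
print(parser.can_fetch("*", "http://www.example.com/Junk_file.asp"))  # True
```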

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.

  • To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
  • To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
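Python's standard urllib.robotparser follows the original specification and ignores these wildcard extensions, so the sketch below shows one way a Google-style pattern could be interpreted, by translating it into a regular expression (an illustrative assumption, not Google's actual implementation):

```python
import re

def google_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Illustrative sketch only, not Google's real matcher.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as 'match anything'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

xls_rule = google_pattern_to_regex("/*.xls$")
print(bool(xls_rule.match("/reports/budget.xls")))   # True  - blocked
print(bool(xls_rule.match("/reports/budget.xlsx")))  # False - not blocked
```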

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
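Google resolves conflicts between Allow and Disallow by letting the most specific (longest) matching pattern win, with Allow winning ties. A self-contained sketch of that rule applied to the session-ID example (an illustration, not Google's actual code):

```python
import re

def to_regex(pattern):
    """Google-style pattern -> regex ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

def allowed(path, allow_patterns, disallow_patterns):
    """Longest matching pattern wins; Allow wins a length tie."""
    best_len, best_allow = -1, True  # no match at all means allowed
    for patterns, is_allow in ((allow_patterns, True), (disallow_patterns, False)):
        for pat in patterns:
            if to_regex(pat).match(path):
                if len(pat) > best_len or (len(pat) == best_len and is_allow):
                    best_len, best_allow = len(pat), is_allow
    return best_allow

# Allow: /*?$ together with Disallow: /*?
print(allowed("/page?", ["/*?$"], ["/*?"]))               # True  - ends with '?'
print(allowed("/page?sessionid=123", ["/*?$"], ["/*?"]))  # False - '?' mid-URL
```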

Help taken from:

http://www.outfront.net/tutorials_02/adv_tech/robots.htm

http://www.robotstxt.org/robotstxt.html

http://www.google.com/support/webmasters/


©SEO and Online Marketing in Bangladesh - All rights reserved.
