How to use the robots.txt file
Project: Create a page explaining what you need to know to block a site using the robots.txt file. |
Description: A short tutorial on what the robots.txt file does. |
Project Goals: The few lines of text in the robots.txt file help determine what the search engines index on your website. This page explains how to write them, using material taken from robotstxt.org, one of the best sources for learning more about what this file does. The page is here so that the information stays easily accessible to me and to others. |
About /robots.txt

In a nutshell

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

    User-agent: *
    Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:
The /robots.txt is a de-facto standard, and is not owned by any standards body. There are two historical descriptions:
The /robots.txt standard is not actively developed. See "What about further development of /robots.txt?" for more discussion.

The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more, see also the FAQ.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer: when a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place. For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".

So, as a web site owner, you need to put the file in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".
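The lookup described above can be sketched in Python. This is only an illustration, not how any particular robot is implemented: the function names robots_url and may_visit and the robot name "ExampleBot" are made up for this example, and the rule check relies on the parser in Python's standard library (urllib.robotparser), which may differ in details from the behaviour described on this page.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Replace the path of page_url with /robots.txt, keeping scheme and host."""
    parts = urlsplit(page_url)
    # Drop the path, query string, and fragment; only scheme and host survive.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_visit(robots_txt: str, user_agent: str, page_url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch page_url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# The two-line record from the "In a nutshell" example: keep every robot out.
EXAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

print(robots_url("http://www.example.com/shop/index.html"))
# prints http://www.example.com/robots.txt

print(may_visit(EXAMPLE_ROBOTS_TXT, "ExampleBot",
                "http://www.example.com/welcome.html"))
# prints False
```

Passing the file content in as text keeps the sketch runnable offline; a real robot would first download the file from the URL that robots_url returns.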
The "/robots.txt" file is a text file, with one or more records. It usually contains a single record looking like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude; you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server

    User-agent: *
    Disallow: /

To allow all robots complete access

    User-agent: *
    Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /junk/

To exclude a single robot

    User-agent: BadBot
    Disallow: /

To allow a single robot

    User-agent: Google
    Disallow:

    User-agent: *
    Disallow: /

To exclude all files except one

This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

    User-agent: *
    Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

    User-agent: *
    Disallow: /~joe/junk.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html

Clipped on 28-December-2010, 3:53 PM from http://www.robotstxt.org/robotstxt.html

Other Notes |
Author Contact: Stan Bush Contact Phone: (478) 227-8260 or (478) 227-4SEO |
Contact Email : stan@salestrainingandnetworking.com |
Contact Twitter : stanbush Contact Facebook : http://profile.to/stanbush |
Company Name : Sales Training And Networking Company Twitter: StanInc |