As with a lot of the Advanced Website Development topics, this tutorial should not be read until you have officially published your website on the net. No webmaster should worry about a robots.txt file until they have a decent site security policy as well as a site map put together. Secondly, you need to have your website set up as far as structure is concerned.
Robots.txt files are only one way of telling spiders how to index your site. The robots.txt counterpart is the HTML meta tag. The advantage of using a robots.txt file is having a central location to edit and control spidering capabilities; however, the HTML meta tag used for spidering can be much more descriptive. First we're going to discuss the robots.txt file; if you would like to skip ahead to the meta tag descriptions, they are covered further below.
So a lot of people ask about robots.txt files. Some ask whether it is a rumor that search engines actually use them when spidering your site. Others say robots.txt files don't work. Well, the fact of the matter is that search engines (at least all of the big names: Google, Yahoo, Ask, etc.) read and respect robots.txt files. Google even has pages on its website that discuss the issue.
What can a robots.txt file do for your site? It can tell a single search engine, or all search engines, how to treat your website's individual files and directories. That is, it can prevent a search engine from building the list of keywords it uses to link to certain pages of your site. A robots.txt file can also tell a search engine not to follow the links in a webpage; usually a search engine finds all of your site by going from the links on your main page to links on other pages, and so on. Lastly, some engines accept hints about how often they should update their keyword tables for a webpage, though support for this varies from engine to engine. These are very important and useful abilities.
To create a robots.txt file, all you need is a text editor like Notepad or WordPad. You simply create the robots.txt file and save it in plain text format to the root directory (the directory that contains your homepage) of your website. Unlike HTML, you don't need to declare anything about the contents of your robots.txt file; search engines automatically request the file from your root directory, read it, and use it.
Now let's get to the meat of it all: the robots.txt file codes. They are a set standard and work across platforms for every computer and every search engine. The first step in a robots.txt command is to declare which search engine your website's attributes are meant for. Text that goes after a # sign is a comment that search engines do not pay any attention to:
# This says that every search engine that spiders the site...
User-agent: *
# ...is not allowed to spider the http://www.yoursite.com/private-stuff.html page.
Disallow: /private-stuff.html
So our two main commands will be User-agent, which declares an engine (or all engines) that should pay attention to the record, and Disallow, which prevents that engine from spidering a certain part of the site. Here are some examples of more command options:
# This is Google's spidering engine...
User-agent: Googlebot
# ...and it is not allowed to spider ANY of the website.
Disallow: /
After understanding the basics of the code, let us go through all of the options. You've seen all of the options for User-agent: declaring a specific engine like "Googlebot" or declaring all engines with "*". For the Disallow command you've learned that you can target a specific file as well as the entire site. You can also specify a particular directory.
# Spiders are not allowed to view any contents of this directory.
# ("/private-directory/" is just an example name; use your own.)
User-agent: *
Disallow: /private-directory/
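Rules like these can be double-checked before you publish them. Python's standard urllib.robotparser module implements the same robots.txt syntax that the spiders read (this is just a convenient way to test rules; the tutorial itself doesn't require Python, and the domain and directory name below are placeholders):

```python
import urllib.robotparser

# Feed a robots.txt body straight to the parser instead of fetching it from a site.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private-directory/",
])

# can_fetch(useragent, url) reports whether the rules permit that URL.
print(rp.can_fetch("Googlebot", "http://www.yoursite.com/index.html"))
# -> True: the homepage is not blocked
print(rp.can_fetch("Googlebot", "http://www.yoursite.com/private-directory/page.html"))
# -> False: anything under /private-directory/ is disallowed
```

If a rule you thought was blocking a page comes back True here, the spiders will almost certainly index it too.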
Here are a few common examples of what some webmasters prefer to use for their robots.txt files. Remember, these aren't meant for all sites, and if they are used as the base of a robots.txt file then more entries should probably be added.
# This allows all bots everywhere:
User-agent: *
Disallow:
# This disallows all bots everywhere:
User-agent: *
Disallow: /
# This disallows all bots from several common directories and files that are private
# (the paths are examples; substitute your own):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /admin/
Disallow: /stats.html
# This disallows access to engine-specific pages
# (each User-agent line starts a new record that applies to that engine only;
# the page names are examples):
User-agent: Googlebot
Disallow: /yahoo-version.html
User-agent: Slurp
Disallow: /google-version.html
# Here we are disallowing a whole list of directories and pages to Googlebot
# (the entries are examples):
User-agent: Googlebot
Disallow: /images/
Disallow: /downloads/
Disallow: /cgi-bin/
Disallow: /old-news.html
We've seen some examples; now we want to see some real, live uses that serve a purpose:
Great, that's what we're here for. Blazedent is going to show you a few very purposeful robots.txt entries. Again, these entries vary from site to site, but it's a great place to start.
# Robots.txt created by Blazedent.com
# Last modified 04.13.2006
# This tells all bots not to spider the listed entries:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
# Disallow harmful search engines
# (replace "BadBot" with the engine's actual user-agent string):
User-agent: BadBot
Disallow: /
While the robots.txt file above is simple, it still remains effective.
Lastly, we're going to show you the HTML meta version of the robots.txt file. While it takes more administrative work, since a tag must be added to every page of your site, the possibilities are greater than those of the robots.txt file alone.
The only way to work with robots from your HTML is to include one of the tags below in the <head> of your webpage. Seeing the commands below clearly shows you the advantages of using <meta ...> tags as opposed to the robots.txt file.
The robot-associated meta tags are:
<!--// all robots, index this page and follow the links to other pages to spider //-->
<meta name="robots" content="index,follow">
<!--// all robots, do not index this page, but follow the links to other pages to spider //-->
<meta name="robots" content="noindex,follow">
<!--// all robots, index this page, but do not follow the links to other pages to spider //-->
<meta name="robots" content="index,nofollow">
<!--// all robots, do not index this page and do not follow the links to other pages to spider //-->
<meta name="robots" content="noindex,nofollow">
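Whichever tag you choose goes inside the <head> of each page, alongside your title and any other meta tags. A minimal sketch (the title and the choice of "noindex,follow" are just examples):

```html
<html>
<head>
<title>My Private Page</title>
<!--// keep this page out of the index, but let spiders follow its links //-->
<meta name="robots" content="noindex,follow">
</head>
<body>
...
</body>
</html>
```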
Pretty self-explanatory. Just be careful when using the robots.txt file and/or meta tags. Using them the wrong way could cause all search engines to completely disregard your site and not spider anything (meaning it will be hard to find your site). So be careful about the directories and engines you list, and also be careful not to contradict yourself (that is, to have entries that say different things for the same search engine or the same part of your site).
Now that you know how to direct the search engines that spider your website, you should put a security policy in place. A security policy would include a list of files and folders that should be blocked. It is usually difficult to think of all the files and folders that you would want to list.
Like a security policy, figuring out which fake, malicious search engines to block should come when you're first developing your robots.txt file. Across the web there are plenty of websites that offer such lists. Try not to use every commonly listed bad engine, because engines like Google only read so much of a robots.txt file. To figure out which bad search engine(s) might visit your site, view your site's stats. If some get through, you will know what to add to your robots.txt file.
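If your host gives you access to a raw Apache-style "combined" access log rather than just a stats page, you can tally the user-agents yourself. A small Python sketch (the log lines here are made up, and real logs vary by server configuration):

```python
import re
from collections import Counter

# A few sample lines in Apache "combined" log format (made-up entries).
log_lines = [
    '1.2.3.4 - - [13/Apr/2006:10:00:00] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [13/Apr/2006:10:01:00] "GET /a HTTP/1.0" 200 128 "-" "EvilScraper/0.1"',
    '5.6.7.8 - - [13/Apr/2006:10:02:00] "GET /b HTTP/1.0" 200 128 "-" "EvilScraper/0.1"',
]

def agent_counts(lines):
    # In combined format the user-agent is the last double-quoted field on each line.
    counts = Counter()
    for line in lines:
        fields = re.findall(r'"([^"]*)"', line)
        if fields:
            counts[fields[-1]] += 1
    return counts

print(agent_counts(log_lines).most_common())
```

Any user-agent that shows up heavily and isn't one of the big-name spiders is a candidate for a Disallow record in your robots.txt file.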
- Joseph Lookabaugh