How to use Robots.txt
October 24th, 2008 by J.K.
The robots.txt file is a plain text file that search engine bots read before they crawl your site. In simple terms, you use robots.txt to tell search engine robots how to crawl your site. If you want to keep robots out of certain pages or directories, you can list them in robots.txt, and robots that honor the file will not crawl them. Keep in mind that the file is a request rather than a lock: a disallowed page is not fetched, but its URL can still show up in search results if other sites link to it.
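Robots look for the file at the root of your host, at /robots.txt. As a minimal sketch, the helper below works out where a robot would look for a given page. The function name `robots_txt_url` and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url):
    """Return the robots.txt address a robot would check for page_url.

    Hypothetical helper: it keeps the scheme and host of the page
    and replaces everything after them with /robots.txt.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("https://www.example.com/blog/post.html"))
# https://www.example.com/robots.txt
```

A robots.txt placed in a subdirectory, such as /blog/robots.txt, is not where robots look, so it has no effect.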
You can also disallow specific search engine robots from crawling your site. You might wonder why anyone would turn a search engine away, but this feature can come in handy. You might be getting crawled by search engines that are not in the same language as your site, or by search engines that are hotlinking your content, or you might have any number of other reasons to keep a particular robot out. While you may not be able to block everything you don’t want accessing your site, since badly behaved robots can simply ignore the file, disallowing access through robots.txt can cut down on unwanted crawling considerably.
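Here is a rough sketch of what blocking a single robot can look like, checked with Python's standard `urllib.robotparser` module. "BadBot" and "OtherBot" are hypothetical robot names, not real crawlers, and `is_allowed` is a helper written just for this example. The User-agent and Disallow lines themselves are explained in the sections that follow.

```python
from urllib.robotparser import RobotFileParser

# "BadBot" is a made-up robot name used only for illustration.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(ROBOTS_TXT, "BadBot", "/index.html"))    # False
print(is_allowed(ROBOTS_TXT, "OtherBot", "/index.html"))  # True
```

The first group turns BadBot away from the whole site, while the second group leaves every other robot free to crawl.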
A robots.txt file can be very basic or very complex. The most basic robots.txt file will look like this:
User-agent: *
Disallow:
This tells the search engines that all robots can crawl your site and that they have full access to every page and directory on it. The asterisk matches every robot, and an empty Disallow line blocks nothing. This is one of the most common forms of robots.txt you will find.
One simple modification, a single slash, disallows access to your entire site:
User-agent: *
Disallow: /
When you create your robots.txt, make sure that slash is not in there if you want search engines to crawl your site. That one character will turn the search engine robots away at the door.
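If you want to double-check which of the two versions you have, a quick test along these lines can help. This is only a sketch using Python's standard `urllib.robotparser` module. "ExampleBot" is a placeholder robot name and `can_crawl` is a helper invented for this example.

```python
from urllib.robotparser import RobotFileParser

# The two files from above, differing only by the slash.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

def can_crawl(robots_txt, url, user_agent="ExampleBot"):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(can_crawl(ALLOW_ALL, "/about.html"))  # True
print(can_crawl(BLOCK_ALL, "/about.html"))  # False
```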
Another basic robots.txt file would be this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
This robots.txt file allows all robots to crawl your site but tells them not to crawl the cgi-bin, temp and private directories. You can block individual pages the same way:
User-agent: *
Disallow: /stuff.html
Disallow: /food.html
Disallow: /junk.html
This version of the robots.txt gives access to all robots and lets them crawl the entire site except for the stuff.html, food.html and junk.html pages.
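To see how robots read these rules, here is a small sketch that runs the directory and page rules from the two examples above through Python's standard `urllib.robotparser` module. The rules are merged into one file just for the test. "ExampleBot", the test URLs and the `blocked_urls` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules from the two examples above, merged into one file for this test.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
Disallow: /stuff.html
Disallow: /food.html
Disallow: /junk.html
"""

def blocked_urls(robots_txt, urls, user_agent="ExampleBot"):
    """Return the urls that the rules tell user_agent not to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

urls = ["/index.html", "/private/notes.html", "/stuff.html", "/stuffing.html"]
print(blocked_urls(ROBOTS_TXT, urls))
# ['/private/notes.html', '/stuff.html']
```

Each Disallow value is matched against the start of the URL path, which is why everything under /private/ is blocked while /stuffing.html, which merely looks similar to /stuff.html, is not.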
You can combine the disallows and tell the robots not to crawl any combination of directories or pages. There are many different ways you can use the robots.txt file to help the search engines crawl your site properly, whether to avoid duplicate content penalties or to keep content you don’t want in the search engines from being crawled. Even a simple robots.txt can be a major part of properly optimizing your site for search engines.
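If you keep a list of the directories and pages you want robots to skip, you could generate the file instead of typing it by hand. The sketch below is one possible way to do that. `build_robots_txt` is a hypothetical helper written for this example, not part of any library.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt text with one Disallow line per path.

    Hypothetical helper: paths can be directories ("/temp/") or
    single pages ("/stuff.html"), mixed in any combination.
    """
    lines = [f"User-agent: {user_agent}"]
    if disallowed_paths:
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
    else:
        # An empty Disallow value blocks nothing.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"

print(build_robots_txt(["/cgi-bin/", "/temp/", "/stuff.html"]))
# User-agent: *
# Disallow: /cgi-bin/
# Disallow: /temp/
# Disallow: /stuff.html
```

Passing an empty list falls back to the allow-everything file, so an empty list never produces a bare User-agent line with no rule under it.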