Here are some important ideas about what a robots.txt file is, why you want one and what to expect.
Why a Robots.txt File?
The web is built on standards, often called protocols. These “rules” let us all communicate and let the various systems work together. The rules behind the robots.txt file are formally known as the Robots Exclusion Protocol or the Robots Exclusion Standard, and sometimes simply as the robots.txt protocol.
The purpose of the robots.txt protocol is to let website owners give instructions to the web crawlers, spiders and other robots that search the web for data. This file is meant to tell them where NOT to go. It is the no trespassing sign for your files.
The Sitemap, on the other hand, is the welcome sign for those same crawlers, spiders and robots.
Robots.txt = exclusion
Sitemaps = inclusion
For this reason, the Allow syntax is not normally needed in a robots.txt file. The intent is exclusion; the rest of the site is assumed to be free and open for searching.
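As a minimal sketch of that open-by-default behaviour, the snippet below feeds a one-rule file to Python's standard urllib.robotparser module. The rule, the bot name "AnyBot" and the paths are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a single exclusion and nothing else.
RULES = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Only the listed path is off limits; everything else is open by default,
# so no Allow line is needed for the blog post.
private_ok = parser.can_fetch("AnyBot", "/private/notes.html")
public_ok = parser.can_fetch("AnyBot", "/blog/hello-world")
print(private_ok, public_ok)
```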
Most thorough site evaluations check to ensure that a Robots.txt file is present, Web Page Advisor included.
What Goes In a Robots.txt File?
- User-agent: *
- Disallow: /cgi-bin
- Disallow: /wp-admin
- Disallow: /wp-includes
- Disallow: /wp-content/plugins
- Disallow: /wp-content/cache
- Disallow: /wp-content/themes
- Disallow: /trackback
- Disallow: /feed
- Disallow: /comments
- Disallow: /category/*/*
- Disallow: */trackback
- Disallow: */feed
- Disallow: */comments
- Disallow: /*?*
- Disallow: /*?
- Allow: /wp-content/uploads
- # Google Image
- User-agent: Googlebot-Image
- Allow: /*
- # Google AdSense
- User-agent: Mediapartners-Google*
- Allow: /*
- # digg mirror
- User-agent: duggmirror
- Disallow: /
- Sitemap: http://www.example.com/sitemap.xml
What Does that Mean?
Line 1 above names the search bots that these instructions apply to. In this case the asterisk * means that they apply to all bots.
Line 2 above is the first of many lines specifying which directories to stay out of using the Disallow syntax.
Line 18 is the first time we see a comment. Any text after a # is a comment. In this case the comment tells us that the following section is specifically for the Google Images bot.
# Google Image
The last line shows the path to your sitemap.
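The walkthrough above can be checked with a short Python sketch. It uses the standard urllib.robotparser module on a trimmed copy of the example file. The wildcard lines are left out because, as far as I know, that module matches plain path prefixes and does not interpret * inside a path. The helper name allowed and the test paths are hypothetical, and site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the example file, wildcard lines omitted (see note above).
EXAMPLE = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Allow: /wp-content/uploads
# digg mirror
User-agent: duggmirror
Disallow: /
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

def allowed(agent, path):
    """Return True if `agent` may crawl `path` under EXAMPLE."""
    return parser.can_fetch(agent, path)

# The * group applies to any bot that has no group of its own.
print(allowed("Googlebot", "/wp-admin/options.php"))
print(allowed("Googlebot", "/wp-content/uploads/photo.jpg"))
# duggmirror has its own group, which shuts it out completely.
print(allowed("duggmirror", "/about"))
# Comment lines are ignored; the Sitemap line is collected separately.
print(parser.site_maps())
```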
What Should Your Robots.txt File Contain?
You can survey your colleagues and competition to get ideas to consider. Visit any site and append “/robots.txt” to the URL. So, for this site, you would visit www.WebPageAdvisor.com/robots.txt to see my Robots.txt file.
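If you want to survey more than a handful of sites, building the address by hand gets tedious. Here is a small sketch of the "append /robots.txt" step in Python. The function name robots_url is my own, and it assumes plain http when you give it a bare host name.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(site):
    """Return the robots.txt URL for a site given as a URL or bare host name."""
    if "://" not in site:
        site = "http://" + site  # assumption: plain HTTP when no scheme is given
    parts = urlsplit(site)
    # robots.txt sits at the root of the host, whatever page you start from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("www.WebPageAdvisor.com"))
print(robots_url("http://www.example.com/blog/some-post?page=2"))
```

Paste the printed address into your browser to read the file.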
Here are a few ideas to consider excluding for personal or security reasons:
- Personal photos stored on your hosting account
- Password files
- Backups or previous versions of your website
- Sensitive e-commerce data
- PayPal connection strings
- Admin and User account files
- Plugin, cache and theme files
The other exclusions included above are for SEO purposes, specifically to reduce duplicate content. That is a valid goal, but keep in mind that Google sees thousands of WordPress sites and has probably figured out that the feed, category and comment pages are not meant to be the canonical URLs for your content.
Google Image Search can bring a hefty amount of traffic to many sites. However, if you sell photos, you may not want that kind of traffic. You then have a reason to exclude the Google Images bot.
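As a rough sketch, the rules for that case might look like the ones below, checked here with Python's standard urllib.robotparser module. The Googlebot-Image name comes from the example file earlier on this page; the rest of the rules and the photo path are hypothetical, so test against your own site before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a photo seller: keep the image crawler out,
# leave every other crawler unrestricted.
RULES = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

image_bot_ok = parser.can_fetch("Googlebot-Image", "/photos/sunset.jpg")
web_bot_ok = parser.can_fetch("Googlebot", "/photos/sunset.jpg")
print(image_bot_ok, web_bot_ok)
```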
Some people don’t want to appear in the Internet Wayback Machine, which is also called the Internet Archive. The following will block the Wayback Machine from visiting your site.
- See what your current Robots.txt file looks like (type in yourdomain.com/robots.txt)
- See what your competitors and colleagues are doing
- Think about what is on your account that you don’t want to see in the Search Results
It is interesting to look around and find sites with a Robots.txt file such as this example. It may be part of a default install on that host.
For example, I happened across this one:
User-agent: *
Disallow:
Sitemap: http://www.woothemes.com/sitemap.xml.gz
That is pretty useless. It basically means “All user-agents are disallowed nowhere. Here is my sitemap.” I think all of that is implied by default. Just as the robots.txt file is found in the same location on most servers, the sitemap.xml file is also found in a standard location, so the file above serves no practical purpose. You can view the sitemap of most sites by following this example: http://webpageadvisor.com/sitemap.xml
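To see that the empty Disallow line changes nothing, here is a small Python sketch that compares the file quoted above with a completely empty one, using the standard urllib.robotparser module. The helper name load, the bot name "AnyBot" and the sample paths are my own choices for illustration.

```python
from urllib.robotparser import RobotFileParser

def load(text):
    """Parse robots.txt text and return the parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

found = load("""\
User-agent: *
Disallow:
Sitemap: http://www.woothemes.com/sitemap.xml.gz
""")
empty = load("")

# Hypothetical paths; any path would give the same answer.
paths = ["/", "/blog/", "/wp-admin/", "/feed"]
both_allow_everything = all(
    found.can_fetch("AnyBot", p) and empty.can_fetch("AnyBot", p) for p in paths
)
print(both_allow_everything)
print(found.site_maps())
```

As far as crawl rules go the two files behave the same; the only thing the first one carries is the sitemap address.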
Be careful. If you copy an example, be sure to replace the example.com domain name with your own.
Be sure you know what you are excluding. Many a site has accidentally excluded all search engines from all or part of a public website. Not good.
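One quick sanity check is to ask whether the major crawlers can still fetch your home page. The sketch below does that with Python's standard urllib.robotparser module. The function name blocked_everywhere and the list of crawler names are assumptions for illustration, and checking "/" only catches a whole-site block, not a mistake deeper in the site.

```python
from urllib.robotparser import RobotFileParser

# Crawler names to check; an illustrative list, adjust to taste.
AGENTS = ["Googlebot", "Bingbot"]

def blocked_everywhere(robots_text, agents=AGENTS):
    """Return the agents that are not even allowed to fetch the home page."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]

oops = "User-agent: *\nDisallow: /\n"          # one stray slash blocks the whole site
fine = "User-agent: *\nDisallow: /wp-admin\n"  # only the admin area is excluded
print(blocked_everywhere(oops))
print(blocked_everywhere(fine))
```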
Test your Robots.txt file immediately. Create an account with Google WebMaster Tools, go into Crawler Access, open the Test Robots.txt tab and follow the instructions. Or you could use this site, http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php, which looks helpful for testing your Robots.txt file. Both sites will also help you create the file.
Here is the kicker: just because you tell them to stay away doesn’t mean that they will stay away. There are bots that ignore the Robots.txt file, so please do not think of it as a “security measure”. This file provides advisory information and has no capacity to prevent access. To prevent access you need to look elsewhere: consider moving the sensitive files off the public site or adding security measures that restrict access. For example, on an Apache server you can use the .htaccess file to restrict access. Password-protecting the files, ideally served over https (SSL), can also help.
Here is a list of web robots with a fairly extensive set of data for each one. Maybe you want to research a bot that has been sucking up your bandwidth and then block it after review.
As mentioned above, AskApache has a wealth of information on many subjects, including Robots.txt.
WordPress.org offers an example and a few links for more information which is helpful.
And as a comparative resource you may find the Robots.txt file entry at Wikipedia helpful.