When search-engine robots are hammering your site...

SEO is good. SEO is great. Making search-engine spiders visit your site is even better! Or is it? Recently one of my clients complained that he was almost banned by his host, because the search-engine bots were "hammering" - read: overloading - the server. Ouch...

So, if this happens to you, what can you do?

There are a couple of things you can do, and a good mix of them can solve most of the problems - you can easily obtain a sharp decline in the server load caused by these search-engine bots.

1. SEF

Use a good SEF component to control your URLs, and set it up carefully to keep duplicates under control. Otherwise the bots will crawl the same content far too many times - without any real benefit, just overusing your server's resources. Also set your preferred URL for the site there (with or without www in front). This trick alone can cut the "hammering" roughly in half. Allow both forms of the URL to be crawled, but use a canonical tag to specify that you want the rewritten version to be indexed, as shown in the example below.
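
For illustration, a canonical tag placed in the <head> of a page might look like this (the domain and path are just placeholders - use your own preferred URL):

<link rel="canonical" href="http://www.yourdomain.com/your-article.html" />

Whichever form of the URL the bot arrives on, this tells the search engine which version should actually be indexed.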

2. Sitemap

Use a good sitemap component, and set it up properly. Be sure to add only your important pages to it - the ones you want to highlight - and set the key parameters sensibly. Recommended values are 0.5 for priority and 2 weeks to 1 month for change frequency. A sample entry is shown below.
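
In the generated sitemap XML, an entry with these values would look roughly like this (the URL is a placeholder; most sitemap components write this for you once you set priority and frequency in their options):

<url>
  <loc>http://www.yourdomain.com/important-page.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>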

3. Robots.txt

Tweak your default Joomla robots.txt - there is much more power here than you might think.

First of all, try to slow down the search-engine bots which comply with the standards, by placing an extra line:

Crawl-delay: 10

This line instructs the robots to request/crawl at most one page every 10 seconds. Unfortunately Google ignores it, but that can be solved too (see the Webmaster account section below).

Next, add your sitemap(s) to the robots.txt like this:

Sitemap: http://the.exact.url/to/your_sitemap.xml
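
Put together, a tweaked Joomla robots.txt might look roughly like this (the Disallow lines below are only a few of the typical Joomla defaults - keep whatever your own file already ships with, and the sitemap URL is of course a placeholder):

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /logs/
Disallow: /tmp/
Crawl-delay: 10

Sitemap: http://the.exact.url/to/your_sitemap.xml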

4. Webmaster account

Set up a Google Webmaster account (and a Bing Webmaster account too, although Google is generally the culprit for hammering) and tweak it!

a. First of all, set up your URL preference (as described in the SEF section above). This might require adding two separate entries, for www.yourdomain.com and yourdomain.com, and setting the preference in both. The rest of the settings can be done only for the preferred variant - but it is better to do them for both.

b. Add your sitemap (submit it), and check whether it was accepted. This is important: if a correct sitemap is found, the bots will spider these URLs first and foremost.

c. Tweak your crawl rate. In Google Webmaster Tools this is under Configuration > Settings, in the same place where you set your URL preference. Don't set it too low, because Google will ignore it; a value of 0.2 is probably optimal.

d. Go to the URL Parameters section, see what parameters Google has discovered, and tweak how they are handled. Joomla has tons of these parameters, so just eliminate the unneeded ones - remember, you want Google to see your content, not your tricks ;) A few typical examples are listed below.
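
To give an idea of what you may find there (the exact list depends on your extensions), typical Joomla query parameters include things like:

option, view, layout, task, Itemid, id, catid, limit, limitstart, tmpl, print, format

Parameters such as print, tmpl or limitstart usually only produce duplicate or stripped-down versions of existing pages, so they are good candidates to exclude from crawling.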

5. (Re)organize your site

If you are having issues with over-crawling, I would advise you to first check your site's structure to see whether the problem is due to bad structure, invalid sitemap values or over-categorization before changing the crawl rate. Remember, a site's SEO is not tied to the amount of crawler activity, and more is not necessarily better. It is not the number of crawled pages that counts, but the quality of the content the crawlers find when they visit.

Avoid dead links and links that lead to pages with no content. If you have a category index page and some categories have no content, don't turn those categories into links - or at least link to a page that can show related content rather than nothing.

Avoid circular references, such as placing links to a site index or category listings index in the footer of every page. It makes it hard for the bot to determine the site structure, as every path it drills down lets it find the parent page again. Although I suspect the bots' technology is clever enough to realize it has found a link it already spidered and not crawl it again, I have heard that it looks bad in terms of site structure.

6. Ban crawlers that misbehave

You don't really need every search-engine bot to spider your site, so punish those that misbehave. If you want to identify the bots that ignore the robots.txt rules, there are various ways to do it, such as parsing your web-server log files or using a dynamic robots.txt file to record the agents that access it. Another option is the IsBanned flag available in the Browscap.ini file, but this relies on the user agent being correct, and more and more people spoof their agent nowadays. Banning bots is not only good for your server's performance, since it reduces load; it is also good for your site's security, because bots that ignore robots.txt are more likely to hack, spam, and scrape your site's content. A simple way to actually block them is sketched below.
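
As an illustration, one way to ban a misbehaving bot on an Apache server is a couple of lines in your .htaccess (the user-agent string "BadBot" is only a placeholder - replace it with the agents you actually caught in your logs, and keep in mind that spoofed agents will slip past a name-based rule like this):

SetEnvIfNoCase User-Agent "BadBot" bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>

If the offending bot comes from a fixed IP range, a plain "Deny from" rule on that range is even more reliable than matching the user agent.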