Bad WebBots
If you run a website, you need a robots.txt file at the root of your site to indicate which pages may be scanned by bots and which may not. Good bots like Googlebot obey your robots.txt instructions and do not index things you forbid. Bad bots ignore your robots.txt file and look at everything they can find.
But it gets worse. Bad bots can cost you money. They come back to your site many times an hour, using up your bandwidth. Or they scan all of your pages many times a minute, which is effectively a denial-of-service attack: the bot monopolises your server and blocks others from visiting your site. Some companies running these bots are not nice. They steal your content and ignore copyright. For example, if you have a terms of use policy on your site, they ignore it. If you have a copyright notice, they ignore it. And so on.
Sadly, the number of bad bots on the net has been rapidly increasing. Most are sourced from three countries - the United States, Russia, and China. But there are others - Brazil and Korea (North and South) are home to lots of these. There are automated scripts which will recognise the activity of most bad bots and automatically add the IPs they use to a deny list. Unfortunately, they only work up to a point. Sometimes you just have to get your hands dirty and block the bots manually.
Here's how to do it with the world's best web server software, Nginx. This little bit of code checks the User-Agent header of each request as it arrives at your site, before a page is served. If any of the bot names in the list is found in that header, further activity of the bot is blocked by returning an error page.
To use this script, just put it in a file somewhere above your Nginx root - a good place is the 'CONF' directory. Call the file 'badbotblock.txt'. Then in your sites-available directory (or wherever you have placed the config file for your virtual host), add the line 'include ../CONF/badbotblock.txt;' (note the trailing semicolon, which Nginx requires). Easy.
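As a rough sketch of where that include line might sit, here is a minimal virtual host. The server name and document root are made-up placeholders, and the assumption is that the include belongs inside the server block so the check applies to every request for that host.

```nginx
server {
    listen 80;
    server_name example.com;          # placeholder host name (assumption)
    root /var/www/example;            # placeholder document root (assumption)

    # Pull in the bad-bot check. Nginx resolves a relative include path
    # against its configuration prefix (normally the directory holding
    # nginx.conf), not against the file that contains this line.
    include ../CONF/badbotblock.txt;
}
```

After editing, running 'nginx -t' checks the configuration for syntax errors, and 'nginx -s reload' applies it without dropping existing connections.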
# Note that '~*' means a case-insensitive regular expression match, so a bot is caught if the sub-string appears anywhere in the field, regardless of case
# Just add the names you want to the list below, with a '|' between them - my own list contains a few hundred bad bots
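A minimal sketch of what 'badbotblock.txt' could contain, assuming a plain 'if' test on Nginx's $http_user_agent variable and a 403 response as the error page. The three bot names are invented placeholders for illustration, not entries from any real list.

```nginx
# Placeholder names only (assumption): replace with the bots you want to block.
# The pattern is quoted so the whole alternation is read as one regex argument.
if ($http_user_agent ~* "(examplebadbot|fakescraper|contentgrabber)") {
    # Refuse the request before any page is served.
    return 403;
}
```

Because the file is included inside a server block, the 'if' runs at server level for every request. If you would rather not send an error page at all, Nginx's special 'return 444;' closes the connection without a response.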