www.80legs.com – stopping them – technical
Recently, one of the sites I administer came under what I would call a DDoS attack by http://www.80legs.com/webcrawler.html.
This claims to be an ordinary web spider, but it does some things that other web spiders don’t:
1) It makes between 20 and 100 connections to the server, from different IP addresses
2) It makes requests as fast as the server will answer
Now, for a web server with flat files, this is fine. But this particular web server had very complex database content that involved a lot of joins and multiple queries to build each page. It runs on a fairly powerful box – four of them, actually – but it still wasn’t up for 100 connections querying as fast as it would respond. I suspect most database-driven sites would have some problems with this.
As 80legs points out on their web site, blocking them by IP will not work because they are a distributed engine spanning thousands of IPs. Kind of like a botnet. And their indexing is user-driven: you can pay them to index a particular site for you. Good way to mess with your competitors. 😉
Anyway, my solution was simple and elegant. We already use haproxy to distribute load among the web servers, so I just pulled out the ‘tarpit’ and wrote a quick regex. For those of you not familiar with haproxy, it’s a single-threaded, non-blocking (STNB) daemon (Oh, I love those! Just like ew-too!) that proxies web requests to servers, automatically adjusts when servers go down, and has a bunch of neat features. It’s free software, and it has worked extremely well for us.
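For anyone who hasn't seen one, here is a rough sketch of what the load-distribution part of a haproxy config can look like. This is not our actual config: the backend name, server names, and addresses are all made up for illustration, and it assumes the usual defaults section with timeouts is defined elsewhere in the file.

```
backend web_servers                     # hypothetical name
    mode http
    balance roundrobin                  # hand requests to each server in turn
    # "check" turns on periodic health checks; a server that fails them is
    # taken out of rotation and put back once it passes again
    server web1 10.0.0.11:80 check      # hypothetical addresses
    server web2 10.0.0.12:80 check
    server web3 10.0.0.13:80 check
    server web4 10.0.0.14:80 check
```

The `check` keyword is what gives you the "automatically adjusts when servers go down" behaviour.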
Anyway, I stuck the following in haproxy.cfg:
reqitarpit ^User-Agent:.*www\.80legs\.com
Goodbye, 80legs. Have fun hanging out in 30-second-delay-for-any-request land 😉
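A note for newer installs: as far as I know, the `reqitarpit` family of directives was dropped in HAProxy 2.1, and the equivalent there is an `http-request tarpit` rule. The sketch below is an assumption about how that would look rather than something taken from our config; the frontend and backend names are hypothetical, and the 30-second hold is set explicitly with `timeout tarpit` (if that is left unset, haproxy falls back to the connect timeout).

```
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    # how long a tarpitted request is held open before haproxy
    # finally answers it with an error
    timeout tarpit  30s

frontend www                            # hypothetical name
    bind :80
    # hold any request whose User-Agent contains "80legs"
    # (-i makes the substring match case-insensitive)
    http-request tarpit if { hdr_sub(User-Agent) -i 80legs }
    default_backend web_servers         # hypothetical backend, defined elsewhere
```

Matching on the substring `80legs` instead of the full URL is a little looser than the regex, which seems acceptable for a crawler that identifies itself by name.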
For those of you who haven’t set up haproxy before, it’s pretty trivial. It can run on the same box as your web server and just attach to a different interface (e.g. bind the web server to localhost and haproxy to the outside interface), or on a different machine, or whatever. It’s a very lightweight load, as STNB things tend to be.
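A minimal sketch of the same-box layout, assuming the web server has been moved to listen on 127.0.0.1:8080. The public address, the port, and all the names here are placeholders, not values from our setup.

```
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend public                         # hypothetical name
    bind 203.0.113.10:80                # the outside interface (placeholder address)
    default_backend local_web

backend local_web                       # hypothetical name
    # the web server itself only listens on localhost
    server web1 127.0.0.1:8080 check
```

Moving haproxy to a separate machine only changes the `server` line, which would then point at the web server's address instead of localhost.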
Random factoid for those of you not familiar with ew-too – the reason ew-too was written STNB is that it was originally designed to run on university computers, and to be such a light load that the administrators never noticed it, on machines that were the equivalent of a 486 and with a hundred or more people connected. STNB is a very clever approach for the situations where it works.