This website and one or two others I run recently experienced what appeared to be a denial-of-service attack.
Looking at the access logs, I could see several tens of thousands of requests, all originating from a range of amazonaws.com IP addresses, and all with the user agent "bitlybot".
This post is a quick postmortem of what went wrong and why.
So what happened?
I've been happily using the excellent bit.ly URL shortening API on the Readability Test Tool website for over a year with no problems at all. Whenever a user checks the readability of a web page using the Readability Test Tool, a convenient "tweet this" link is provided for the results page.
My bit.ly link also innocently appended a query string — &utm_source=twitter&utm_medium=retweet — so that I could track click-throughs from Twitter in Google Analytics.
Looking back, it wasn't that clever a thing to do, but it only took a couple of minutes to implement, so it was very little effort for a nice bit of analytics/measurement return.
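For context, the link-building looked roughly like this. This is a minimal sketch, not my actual code: the names ($checkedUri, BITLY_LOGIN, BITLY_API_KEY) are placeholders, and it assumes the bit.ly v3 shorten endpoint that was current at the time.

// Build the results URL, bolt on the tracking parameters, then shorten
// the whole thing via the bit.ly API (illustrative names throughout).
$resultsUrl = 'http://www.read-able.com/check.php?uri=' . urlencode($checkedUri);
$trackedUrl = $resultsUrl . '&utm_source=twitter&utm_medium=retweet';
$shortUrl = trim(file_get_contents(
    'http://api.bitly.com/v3/shorten?login=' . BITLY_LOGIN
    . '&apiKey=' . BITLY_API_KEY
    . '&longUrl=' . urlencode($trackedUrl)
    . '&format=txt'
));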
All was good for a year. Google Analytics tracking worked well and there were no problems. Indeed, looking back at the access logs, the bitlybot user agent hadn't so much as sniffed the website once in that time.
One day, something changed. Overnight, bitlybot started crawling my website for all the links it had created over the year. Unfortunately, for every link it crawled, it also created another link, appending more parameters to the query string.
Which it then crawled, creating another link with yet more appended query parameters. Ouch.
e.g.
http://www.read-able.com/check.php?uri=http%3A%2F%2Fwww.example.com%2F&utm_source=twitter&utm_medium=retweet
http://www.read-able.com/check.php?uri=http%3A%2F%2Fwww.example.com%2F&utm_source=twitter&utm_medium=retweet&utm_source=twitter&utm_medium=retweet
http://www.read-able.com/check.php?uri=http%3A%2F%2Fwww.example.com%2F&utm_source=twitter&utm_medium=retweet&utm_source=twitter&utm_medium=retweet&utm_source=twitter&utm_medium=retweet
And so on.
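In effect, each crawl fed the next one. A toy loop reproduces the same growth as the URLs above, one extra tracking pair per generation:

// Each generation appends another utm_source/utm_medium pair to the
// previous URL, which then gets shortened and crawled all over again.
$url = 'http://www.read-able.com/check.php?uri=' . urlencode('http://www.example.com/');
for ($generation = 1; $generation <= 3; $generation++) {
    $url .= '&utm_source=twitter&utm_medium=retweet';
    echo $url . "\n";
}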
What did I do?
Initially I ranted on Twitter.
Then I removed the "tweet this" link to prevent further bit.ly URLs from being created. This wouldn't stop the crawling already under way, but it would at least prevent the problem from getting any worse.
Then I edited robots.txt:
# Tell "bitlybot" not to come here at all
User-agent: bitlybot
Disallow: /
This didn't take effect straight away - bitlybot only checks robots.txt once a day, so it wasn't going to improve matters instantly.
Then I redirected the traffic to bit.ly:
if ($_SERVER['HTTP_USER_AGENT'] == 'bitlybot')
{
    // Send the bot back where it came from, and stop rendering the page
    header('Location: http://bit.ly/', true, 301);
    exit;
}
That slowed it a bit. Admittedly, I was still blaming them at this point.
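(In hindsight, a sturdier version of that check would match the user agent case-insensitively and refuse the request outright rather than bouncing the bot back at bit.ly. A sketch of what I'd do now, not what I ran at the time:

// Illustrative hardening: case-insensitive UA match, 403 instead of a redirect.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'bitlybot') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}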
Then I reached out to @bitly.
They were very responsive. I sent them a detailed email with a section of access logs and they fixed it. Quickly.
They disabled my account, preventing me from causing any further mischief. They stopped bitlybot's crawling and reported progress back to me.
Each contact with bit.ly via Twitter or email resulted in a positive response — they were very quick to reply, and my websites were soon back to their usual somewhat diminutive volume of traffic.
Conclusions
- Bit.ly has excellent support - they are very responsive and my little server was soon back to normal
- Think before you write code that uses other people's APIs - you may not fully understand the consequences of your actions
- My Google Analytics tracking parameters were added with little thought - I really ought to have tried a bit harder to weigh up the implications
- My small VPS does nicely, thank you, for the limited traffic it usually sees. Given any serious volume of traffic, it will fall over.