If you’ve ever written a piece of code that communicates with other pieces of code, and your code wasn’t just a simple test, you’ve had to handle the concept of reconnection. Usually, this is handled in a not particularly elegant fashion, which can lead to lots of interesting issues, such as DDoSing yourself. Let me explain:
This is some code extracted from a production system at work. It uses the venerable POE framework, in particular the POE::Component::Client::TCP module. This module will trigger either a Disconnected or ServerError state when something has happened and the server can’t be communicated with. This could be because the server is down, because someone unplugged the network, sunspots, all kinds of stuff. So, in this case I at least had a reconnect delay of 60 seconds, but there’s plenty of code out there with a hardcoded reconnect of 1 second, or 5 seconds, or something like that. Let’s say there are 50–60 instances of this code running throughout the enterprise, and the backend service they communicate with goes down. What’s going to happen?
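The original Perl isn’t reproduced here, but the shape of the pattern is easy to sketch. Here’s a rough Python rendition of the same fixed-delay reconnect loop; the host, port, and handle() callback are all made up for illustration:

```python
import socket
import time

# Everything here is illustrative: the host, port, and handle() callback
# stand in for the real production code.
HOST, PORT = "backend.example.com", 4242
RECONNECT_DELAY = 60  # hard-coded: same delay for every failure, every client


def next_delay(consecutive_failures):
    # The naive policy: no matter how many times we've failed in a row,
    # wait the same fixed interval before trying again.
    return RECONNECT_DELAY


def run_client(handle):
    failures = 0
    while True:
        try:
            with socket.create_connection((HOST, PORT)) as conn:
                failures = 0
                handle(conn)  # read/write until the peer goes away
        except OSError:
            failures += 1  # server down, cable pulled, sunspots...
        # Every instance sleeps the same interval, so after an outage
        # they all wake up and retry at roughly the same moment.
        time.sleep(next_delay(failures))
```

The problem is entirely in next_delay(): every client computes the same answer, so every client retries in lockstep.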
All of these instances are going to zerg rush the poor server. They are going to ceaselessly and mercilessly attempt to connect over and over again. This is a really good thing to do to a server that has already failed, as we want to make sure that when it comes back up, it immediately experiences the maximum possible load, to make it nice and easy to debug. This will make your server logs more fun to understand and the behavior will be awesome.
So, everyone knows that the right thing to do is to add some sort of “backoff”. But that requires effort and thinking. You’ve got to care enough to worry about this condition which totally will never happen (because backend services never fail, and the network is infinite and has zero latency), and then you have to code a solution that handles it gracefully. That sucks, and it really doesn’t add any of the features that your boss actually cares about.
So, here is where code reuse and CPAN come to the rescue. I have recently started using the Perl module Proc::BackOff. This module nicely encapsulates the logic required for Linear, Exponential, and Random backoff. This module is really pretty easy to use and straightforward:
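In lieu of the original Perl, here’s a minimal sketch of the linear idea in Python; the class and method names are mine, not Proc::BackOff’s actual API:

```python
class LinearBackoff:
    """Sketch of the linear-backoff idea: delay = slope * failures + intercept,
    capped at a maximum.  Names are illustrative, not the CPAN module's API."""

    def __init__(self, slope=5, intercept=0, max_delay=60 * 5):
        self.slope = slope
        self.intercept = intercept
        self.max_delay = max_delay
        self.failures = 0

    def failure(self):
        """Record a failed attempt and return the next delay in seconds."""
        self.failures += 1
        return min(self.slope * self.failures + self.intercept, self.max_delay)

    def success(self):
        """A good connection resets the failure counter."""
        self.failures = 0


b = LinearBackoff(slope=5)
delays = [b.failure() for _ in range(4)]  # 5, 10, 15, 20
```

The important design point is the success() reset: once you reconnect for real, the next outage starts back at the short delay.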
In this snippet we are declaring an object of Linear type, with the equation y = mx + b being y = 5x. So, the delays after successive failures are going to be 5, 10, 15, 20… seconds, until we hit our maximum sleep threshold of 60 * 5 = 300 seconds. This will at least stop the non-stop hammering effect and give your servers an opportunity to recover. There is an exponential backoff version of the module, but to be honest, I think that linear backoff is sufficient when your primary goal is to make your reconnects smart, but not hammer the server.
However, I like Random backoff even better. The code is frightfully similar:
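Again sketching the idea in Python rather than the module’s real API (names are mine):

```python
import random


class RandomBackoff:
    """Sketch of the random variant: each failure waits a delay drawn
    uniformly from [min_delay, max_delay].  Names are illustrative."""

    def __init__(self, min_delay=1, max_delay=60):
        self.min_delay = min_delay
        self.max_delay = max_delay

    def failure(self):
        """Return a fresh random delay; no per-failure state needed."""
        return random.uniform(self.min_delay, self.max_delay)


# Fifty clients that all died at the same instant now come back
# scattered across the whole window instead of in one thundering herd.
delays = sorted(RandomBackoff(1, 60).failure() for _ in range(50))
```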
The only difference here is that we have replaced the Linear object with a Random object, and given it a range in which to pick a random reconnection time. Though I have not tested it, this should spread the reconnection times uniformly across that range, which will not only allow us to *not* hammer the service, but will also desynchronize the clients, so the zerg rush problem of reconnections doesn’t DDoS our poor hurt service.
There is further work I want to do, however. I am playing around with making a Proc::BackOff::Linear::Random. I think that it would be cool to have a version of Proc::BackOff where each failure waits a random amount of time, but the random range scales up linearly with the number of consecutive failures. Basically, the best of both worlds. This code is still in the play phase, and if it ends up not being garbage, look for it on my GitHub!
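The rough shape of that hybrid, sketched in Python (all names hypothetical, since this module is still just an idea):

```python
import random


class LinearRandomBackoff:
    """Sketch of the linear-plus-random hybrid: each failure waits a random
    delay, but the range it is drawn from grows linearly with the number of
    consecutive failures.  All names are hypothetical."""

    def __init__(self, slope=5, max_delay=60 * 5):
        self.slope = slope
        self.max_delay = max_delay
        self.failures = 0

    def failure(self):
        """Record a failure; draw from [0, slope * failures], capped."""
        self.failures += 1
        ceiling = min(self.slope * self.failures, self.max_delay)
        return random.uniform(0, ceiling)

    def success(self):
        self.failures = 0
```

Early failures retry quickly, persistent outages back off further and further, and the randomness keeps the clients from ever synchronizing.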