Vivek Haldar

More machines, more reliability

Let’s say your application is buggy and crashes every now and then, with a mean time to failure (or in this case, to crash) of ’t’. Time to failure is usually modeled as an exponential probability distribution, so that after time ’t’, the probability of your application still being up is only about 37% (that is, 1/e), or conversely, the probability of it having crashed is about 63%. If your application runs on only one machine, then those are your probabilities.
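To make the 37% and 63% figures concrete, here is a minimal sketch of the exponential model. The function names `p_up` and `p_crashed` are made up for illustration, and the model assumes a constant crash rate, which real software may not have.

```python
import math


def p_up(elapsed, mttf):
    """Probability that one instance is still running after `elapsed`,
    assuming time-to-crash is exponential with mean `mttf`."""
    return math.exp(-elapsed / mttf)


def p_crashed(elapsed, mttf):
    """Probability that the instance has crashed at least once by `elapsed`."""
    return 1.0 - p_up(elapsed, mttf)


# Setting elapsed equal to the mean time to failure gives 1/e and 1 - 1/e.
print(round(p_up(1.0, 1.0), 2))       # 0.37
print(round(p_crashed(1.0, 1.0), 2))  # 0.63
```

Only the ratio of elapsed time to mean time to failure matters here, so the units (hours, days) drop out.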

But if your application is a server that is load-balanced across two machines, the definition of success is different. Assuming either machine can handle the full load, your application will continue working even if it crashes on one of the machines. Your entire application fails only if both machines do. Assuming the two machines crash independently, each with a 63% chance of having crashed by time ’t’, the probability of failure after time ’t’ now becomes 0.63 x 0.63 = 0.40, or 40%.

If your application is load-balanced across three machines, and it crashes on one, then the remaining two servers only have to bear 50% more load each (as opposed to 100% more load in the two-machine case). With the same 63% chance of each machine having crashed, the probability of failure after time ’t’ is 0.63 x 0.63 x 0.63 = 0.25, or 25%.

With just ten instances, you’re down to roughly a 1% chance of failure after time ’t’.
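The same multiplication works for any number of machines. The sketch below assumes crashes are independent, that any single surviving machine can carry the full load, and that each machine has a 1 - 1/e (about 63%) chance of having crashed by time ’t’. The name `p_all_crashed` is made up for illustration.

```python
import math


def p_all_crashed(n, elapsed=1.0, mttf=1.0):
    """Probability that all `n` independent instances have crashed by
    `elapsed`, with no restarts. Each instance is assumed to follow the
    same exponential time-to-crash model with mean `mttf`."""
    p_one = 1.0 - math.exp(-elapsed / mttf)
    return p_one ** n


# Compare one, two, three and ten machines at elapsed == mttf.
for n in (1, 2, 3, 10):
    print(n, round(p_all_crashed(n), 3))
```

The failure probability shrinks geometrically with each added machine, which is why ten instances already get you to about 1%. Correlated failures (a bad deploy, a shared dependency going down) break the independence assumption and would make the real number worse.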

(If you’ve been paying very close attention, you’ll realize that the probability of failure above is conservative, because it assumes a failed server never comes back. In real life, some sort of daemon will restart crashed servers, making the probability of real failure very small. In the one-machine case, a crash will be visible as downtime while the server is restarting, but in the multiple-machine case it will usually not even be visible to the outside world.)
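One rough way to put a number on the effect of restarts is the standard steady-state availability formula, MTTF / (MTTF + MTTR). This is a sketch under assumptions that are not in the text above: a mean restart time `mttr`, independent machines, and a hypothetical 24-hour mean time to crash with a one-minute restart. The function name is also made up.

```python
def p_all_down(n, mttf, mttr):
    """Long-run fraction of time that all `n` instances are down at once,
    assuming each one independently alternates between running (mean `mttf`)
    and restarting (mean `mttr`). Both values are illustrative assumptions."""
    p_one_down = mttr / (mttf + mttr)
    return p_one_down ** n


# Hypothetical numbers: crash about once a day, restart takes about a minute.
mttf_hours = 24.0
mttr_hours = 1.0 / 60.0

for n in (1, 2, 3):
    print(n, p_all_down(n, mttf_hours, mttr_hours))
```

With one machine, every restart is visible downtime. With two or more, an outage needs every machine to be restarting at the same moment, which under these assumptions is rarer by several orders of magnitude.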

This means that the reliability or uptime of a single instance of your program is a much bigger problem for client software that runs on a desktop or a phone than for servers that run in giant datacenters on many, many machines. Client software is the one-machine case, where every crash is a bad impression and lost business. Scaling an application across a large number of machines is certainly not easy, but it does have this one saving grace.