Jun 27, 2011

Large Computer Systems are Organic

Large computer systems, the kind that span multiple datacenters, are rapidly approaching organicity. This may not be readily apparent if you look at the systems themselves, which display only limited forms of adaptive behavior and almost none of the autonomy that marks living things, though the trend is positive. But it is more obvious when you look at the way humans, i.e. engineers, tend to these systems.

And the way engineers tend to planet-scale systems has many parallels with the way doctors treat patients.

As in biology (and unlike in physics), very few things about large systems can be predicted or calculated precisely. There are no closed-form solutions. Often, one has to work with estimates that carry an order-of-magnitude error margin.

Like doctors prescribing drugs, engineers treat day-to-day failures and problems with playbook solutions that have been known to work in the past, without necessarily understanding the root cause or the exact mechanism that explains why the solution works. Many drugs are the same: they work, but it's not perfectly understood why.

Solutions to problems also often work like drugs. They take time to propagate through the system. They might trigger unexpected side effects.

In the fog of complexity, visibility is paramount. Often the first thing a doctor does when treating a patient is order a battery of tests and scans. As these systems grow larger and more complex, more and more effort is spent simply on getting visibility into how they operate, tracing something up and down the layers of abstraction. Without that, you have no hope of making even a feeble attempt at a fix when something goes wrong.

There is an upside: large numbers bring resilience. An application running on one machine has a much higher probability of failure than one running across a hundred, provided it can keep going when individual machines are lost.
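The arithmetic behind that claim can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not a description of any real system: it assumes machines fail independently with the same probability, and the function name and the `needed` parameter (how many machines must be up for the application to count as alive) are made up for illustration.

```python
import math


def outage_probability(p_machine: float, machines: int, needed: int = 1) -> float:
    """Probability that fewer than `needed` of `machines` are up.

    Assumes (hypothetically) that each machine is down independently
    with probability `p_machine`, so the number of machines that are up
    follows a binomial distribution.
    """
    p_up = 1.0 - p_machine
    # Sum the probability of every outcome with too few machines up.
    return sum(
        math.comb(machines, up) * p_up**up * p_machine ** (machines - up)
        for up in range(needed)
    )


if __name__ == "__main__":
    # One machine with a 1% chance of being down.
    print(outage_probability(0.01, machines=1))
    # A hundred such machines, any one of which can carry the application.
    print(outage_probability(0.01, machines=100, needed=1))
    # A hundred such machines, all of which must be up.
    print(outage_probability(0.01, machines=100, needed=100))
```

The last case is the caveat. If the application needs every one of the hundred machines, the outage probability rises to about 0.63 instead of falling, so the resilience comes from redundancy rather than from the numbers alone.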