Erik Larson

Apr 10, 2010

Systems Accidents

Laurence Gonzales, in his best-selling Deep Survival, discusses a class of accidents that happen in complex, real-world systems, called systems accidents. The term was introduced by Charles Perrow in his now famous work Normal Accidents, published in 1984. Perrow argued that in certain kinds of systems, catastrophic accidents, while rare, are inevitable. What’s worse, and what made Perrow’s work so controversial, is that attempts to avoid such accidents can make them more likely, because the safety measures put in place to prevent them also make the systems where they occur more complex. It is precisely the complexity of such systems that makes them prone, one way or another, to rare but catastrophic failure.

Gonzales (citing Perrow) points to two features of complex systems that make them prone to systems accidents. One, the systems must support “unintended complex interactions among components and forces”. Two, they must be “tightly coupled”, or constructed in such a way that forces, even if initially small, can magnify as they propagate from component to component. In systems with such features, the result of forces acting on or in the system cannot be exactly known. Seemingly insignificant events can chain together in unexpected ways, leading to unintended results (feature one), and some of these interactions can have global effects: the system itself can be “blown apart” by ostensibly innocuous forces that, by cause and effect, release the energy in the whole system destructively (feature two). Given these features, systems accidents will occur; they are a natural result of the complexity of such systems.
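The second feature, magnification through tight coupling, can be illustrated with a toy model. The sketch below is not from Perrow or Deep Survival; it simply assumes a chain of components in which each one passes a disturbance to the next, scaled by a made-up `coupling` factor. The function name `propagate` and the numbers are hypothetical, chosen only to show the contrast between a chain that damps a disturbance and one that amplifies it.

```python
def propagate(disturbance, coupling, n_components):
    """Pass a disturbance down a chain of components.

    Each component hands the disturbance to the next one multiplied by
    `coupling` (a hypothetical stand-in for how tightly the parts are
    linked). Returns the size of the disturbance at every step.
    """
    history = [disturbance]
    for _ in range(n_components):
        history.append(history[-1] * coupling)
    return history


# Same initial nudge, same chain length, different coupling.
loose = propagate(1.0, coupling=0.5, n_components=10)  # each link damps
tight = propagate(1.0, coupling=2.0, n_components=10)  # each link amplifies

print(f"loosely coupled, after 10 components: {loose[-1]:.4f}")
print(f"tightly coupled, after 10 components: {tight[-1]:.1f}")
```

In the loosely coupled chain the nudge has shrunk to about a thousandth of its original size by the tenth component; in the tightly coupled chain it has grown about a thousandfold. Real systems are of course not simple chains with a single gain, which is part of why their accidents are so hard to anticipate.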

Gonzales mentions a few examples of Perrow’s complex systems, like the modern airliner, a “large mass containing explosive fuel, flying at high speeds, and operating along a fine boundary between stability and instability”. Something seemingly insignificant happens (say, the onboard toilet malfunctions), and it leads to a catastrophic failure, one that is nearly impossible to predict in advance and, as Perrow argued, cannot be eliminated in kind, because the safety components one might introduce to solve some problems will themselves create complexities that lead to other problems, which in turn won’t be understood until another accident occurs. (The toilet leading to a catastrophic crash is a real example.)

Gonzales connects Perrow’s work on systems accidents with branches of mathematics and science concerned with so-called nonlinear systems, like chaos theory. Chaos theory has entered the popular or “pop” science vocabulary (it was mentioned in the movie Jurassic Park, for instance) as a catch-all phrase for systems with states that are difficult or impossible to predict because they have a “sensitive dependence on initial conditions”, as science writer James Gleick put it in his 1987 book Chaos. Turbulence is a classic example of a chaotic system. Applying physics equations to predict where a cubic inch of water in a turbulent stream will be in, say, 10 seconds is effectively impossible, because the initial conditions of the cube of water (including all of the factors that can act on it) have to be specified so exactly, and again and again as the cube moves through the turbulent stream, that no amount of computation can determine its later position. True, the prediction is possible in principle, since the laws governing dynamic systems like flowing water are known. It is not possible in practice, because the calculations require such exactitude and are so complex that not even fantastically powerful computers (say, composed of all of the matter in the universe) could predict the position of the cube beyond some relatively small window of time. This realization, that chaotic systems give rise to unpredictable behavior, led to notions about unpredictability like the “butterfly effect”, where it’s quipped that the air displaced by a butterfly’s wings in Tokyo might, a month later, create a hurricane in the Gulf of Mexico.
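Sensitive dependence on initial conditions is easy to see in a much simpler system than a turbulent stream. The sketch below uses the logistic map, a standard textbook example of chaos that is swapped in here for illustration; it is not discussed in Deep Survival. The function name `logistic_trajectory`, the starting value 0.2, and the perturbation of one part in ten billion are all assumptions made for the example.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0.

    With r = 4.0 the map is chaotic on the interval [0, 1].
    Returns the list of states visited, including the starting point.
    """
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two starting points that differ by one part in ten billion.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Distance between the two trajectories at every step.
gaps = [abs(p - q) for p, q in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap roughly doubles with each step on average, so a difference far too small to measure at the start becomes as large as the system’s whole range within a few dozen iterations. Specifying the starting point more exactly only postpones the divergence by a few steps, which is the same predicament as the cube of water.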

Chaos theory applies to Perrow’s complex systems because, just like with the butterfly effect, such systems have a sensitive dependence on the initial conditions that might lead, ultimately, to systems accidents. No one thought about the toilet malfunction beginning a chain that led to the crash; similarly, the sensitivity of O-rings (remember them?) to cold weather was known before the Challenger explosion, but no one put together that the particular O-rings installed, along with the exact temperature in Cape Canaveral the morning of the fateful 1986 launch, would by cause and effect result in the explosion. But it happened, and if Perrow is right, such systems accidents will always happen, one way or the other. Some types of systems accidents will be analyzed and fixed, but others will emerge, and while they will remain rare, they will remain inevitable, too.

What’s interesting about Gonzales’s treatment of systems accidents via Perrow is his application of them to human hobbies and activities. He tells stories of experienced mountain climbers who are killed precisely because they applied their experience to new situations, and put in place safety measures based on what worked in past attempts. By and large, the safety measures climbers employ do work: teams of climbers roped together help reduce the likelihood that a particular climber will fall, because when someone slips, the other climbers can “self-arrest”, which means they plant their ice axes in the snow and help break the fall of the climber who slipped. But sometimes, the measures don’t work.

The key to understanding Gonzales’s examples is his distinction between general trends and specific situations. Safety systems address general risk: mountain climbing is in general safer today than, say, a hundred years ago. Like a seatbelt worn in a car, the safety procedures climbers employ on mountains reduce the number of accidents and the damage they do when they occur. But, as Gonzales points out, sometimes the safety system itself causes or exacerbates an accident in a specific (unpredictable) situation, just as Perrow argued it will. The safety system introduces massive amounts of energy into systems of climbers, making them tightly coupled, and it introduces the possibility that some situations will lead, by cause and effect, to the “blowing apart” of the constructed system, as all of that energy put into the system (for safety) suddenly becomes the energy that magnifies the accident. As Gonzales describes so well in Deep Survival, this is exactly what happened to a group of experienced climbers who self-arrested on Mount Hood to save a member of their team from falling. Several people died in the accident. Had the safety measures not been used, one person would have fallen, and though this would of course have been tragic, it would not have been worse than what did happen, which became one of the worst mountain climbing disasters ever.

The point of Deep Survival is not that we should throw out safety measures, or give up on making them better. The point is that systems create complexities, whether in nature or by our own design, that limit prediction and make deterministic solutions impossible. Chaos and complexity guarantee that life will remain messy, incomplete, and in large part unpredictable. They guarantee that, try as we might, we’ll never, as Dostoevsky once remarked, erect a “crystal palace of reason”, where everything is in place, and understood, and predictable. Systems accidents are here to stay, and though our best efforts help reduce the inherent risk of living and acting in the world, they can’t eliminate it, and in many cases they succeed in making the risk that remains even graver, and the accidents we occasionally do have even more catastrophic.