Thursday, August 28, 2003

Who's To Blame?

The big news yesterday was the release of the Columbia space shuttle disaster report. Check it out here. It makes for fascinating reading, especially Chapter 8, which was written by Diane Vaughan, who wrote the classic work on the original Challenger disaster. In Chapter 8, "History as Cause: Columbia and Challenger," she explores the systemic failures of the NASA safety system and how the problems uncovered after the Challenger disaster reappeared to cause the Columbia's problems.

The most interesting parts of the report focus on the management system problems rather than individual failures. Vaughan cautions, however, that
the Board's focus on the context in which decision making occurred does not mean that individuals are not responsible and accountable. To the contrary, individuals always must assume responsibility for their actions. What it does mean is that NASA's problems cannot be solved simply by retirements, resignations, or transferring personnel.
The footnote accompanying this paragraph states
Changing personnel is a typical response after an organization has some kind of harmful outcome. It has great symbolic value. A change in personnel points to individuals as the cause and removing them gives the false impression that the problems have been solved, leaving unresolved organizational system problems.
Which makes the following headline from the New York Times "interesting":

Human Error Likely Cause of Blackout, Timeline Says

So, let me get this straight. Does this mean that a simple human booboo resulted in the gigantic blackout that swept parts of eight states and eastern Canada, cost billions of dollars and darkened the homes of millions of people? And does this imply that a slap on the hand (or maybe even jail time) will fix the electrical grid problem?

The only "substance" behind that headline is a quote from an unnamed investigator:
"Had all of the existing policies been followed, this would not have developed into a cascading event," the investigator said. "What we see are institutional breakdowns, not a breakdown of the system itself."
Those who do incident investigations realize, however, that the fact that procedures were not followed is rarely due to human failure. It is far more likely that the procedures were confusing or didn't anticipate the situation that the operators found themselves in.

Some people also blamed the Three Mile Island accident on the plant operators: If proper procedures had been followed, the near-disaster would have been a small, unremarkable incident. But the failure at TMI can more accurately be blamed on the fact that the information that the operators had available to them at the time was confusing, conflicting and inaccurate, and they had not been trained to address the specific situation they were facing. In other words, given the knowledge they had, the "standard operating procedures" were almost useless.

It may be theoretically possible to trace everything that happens in the world to humans (or nature). But in reality, barring sabotage or horseplay, there are few, if any, cases where the root cause of an incident -- workplace injury, space shuttle disaster, or huge blackout -- could be blamed on "human error."

Human error may be one of the "direct causes" of an incident. A direct cause is the action that directly results in the occurrence, while root causes are usually management system problems which, if corrected, would have prevented not only that specific problem but other similar problems as well.

Rather than focusing on the operators who make the errors, modern accident analysis looks for the conditions -- or root causes -- that made the errors possible.

And now check this out:

AK Steel suspends 11 workers after fatal accident

MIDDLETOWN, Ohio - AK Steel Corp. has suspended 11 workers in connection with an overhead crane accident that killed a worker last month at the company's Middletown Works mill, a union official said.

Now I don't know any more about the details of this incident than what you can read in this article, but let me just suggest that you go check out that footnote above again.