When bugs really do matter: 22 years after the Therac 25

At the moment my master’s work has me in the middle of what seems like a never-ending sea of runtime errors. It’s the largest Python project I’ve ever worked on, and I’ve been forced to re-learn a lot of practices that I should have internalized a long time ago but never did. I’ve also been able to play around with interesting meta-programming techniques (i.e., modifying the language on the fly, so to speak, though not as much as a genuine macro system would allow) to stop certain types of bugs from recurring.

As things progress, the new code settles down into something that actually runs more often than not, and becomes predictable. One thing I’ve relearned the value of is the sanity check: I’ve been throwing assert statements around like they’re going out of style, and I’ve found myself almost wishing Python had Eiffel’s native contracts[1]. Almost. I’m willing to accept the trade-off of development time for code stability that this approach brings.
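As a rough illustration of the assert-heavy style, here is a minimal sketch of a contract-like check in Python. The `contract` decorator and the `mean_abs` function are hypothetical names invented for this example. They are not part of the standard library and do not follow the PEP 316 syntax.

```python
import functools


def contract(pre=None, post=None):
    """Hypothetical helper: assert a precondition on the arguments
    and a postcondition on the return value."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None:
                assert pre(*args, **kwargs), f"precondition failed for {func.__name__}"
            result = func(*args, **kwargs)
            if post is not None:
                assert post(result), f"postcondition failed for {func.__name__}"
            return result
        return wrapper
    return decorate


# Illustrative function: the input must be non-empty, the result non-negative.
@contract(pre=lambda values: len(values) > 0, post=lambda result: result >= 0)
def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)
```

Calling `mean_abs([])` fails loudly at the precondition instead of surfacing later as a confusing division error. Python removes assert statements when run with the `-O` flag, so checks like these are a development aid rather than a guarantee in production.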

But it’s a whole other matter when you consider articles like this one from the Baltimore Sun (June 30). It’s about a forensic bug-hunter team at the FDA set up to address the growing dangers as software runs more and more medical systems. It talks up static code analysis as a possible solution, though after passing through the mainstream journalism medium it probably comes across as much more of a panacea than it really is. After all, there is no single magic pill for complex software like this.

One passage stood out though.

FDA officials declined to name the maker of the infusion pumps. (In 2006, Cardinal Health, of Dublin, Ohio, stopped production of its Alaris SE pumps because of a key-bounce error that reportedly killed two patients, including a 16-day-old baby that got 44.8 milliliters of intravenous nutrition, rather than 4.8 milliliters.)

During the investigation into the malfunctioning pumps, nurses complained about frequent keyboard errors, while the manufacturer blamed nurses for entering the wrong drug information and then failing to double-check, said Brian Fitzgerald, who heads the FDA’s software specialists.

A shade over 23 years ago, an equally tragic disregard for a reported malfunction that appeared to have no effect led to radiation overdoses for six people. The Therac 25, a radiation therapy machine with a subtle bug that caused massive overdoses, went down in software engineering history as one of the most infamous cases of a race condition. There is a comprehensive (and quite readable) report on the accidents that is well worth the time of any developer working on non-trivial software or electronics. It makes for spine-chilling reading.

From the Boston Globe, June 20, 1986 (source):

MAN KILLED BY ACCIDENT WITH MEDICAL RADIATION

by Richard Saltos, Globe Staff

A series of accidental radiation overdoses from identical cancer therapy machines in Texas and Georgia has left one person dead and two others with deep burns and partial paralysis, according to federal investigators.

Evidently caused by a flaw in the computer program controlling the highly automated devices, the overdoses – unreported until now – are believed to be the worst medical radiation accidents to date.

The malfunctions occurred once last year and twice in March and April of this year in two of the Canadian-built linear accelerators, sold under the name Therac 25.

Two patients were injured, one who died three weeks later, at the East Texas Cancer Center in Tyler, Texas, and another at the Kennestone Regional Oncology Center in Marietta, Ga.

The defect in the machines was a “bug” so subtle, say those familiar with the cases, that although the accident occurred in June 1985, the problem remained a mystery until the third, most serious accident occurred on April 11 of this year.

Late that night, technicians at the Tyler facility discovered the cause of that accident and notified users of the device in other cities.

The US Food and Drug Administration, which regulates medical devices, has not yet completed its investigation. However, sources say that discipline or penalty for the manufacturer is unlikely.

Modern cancer radiation treatment is extremely safe, say cancer specialists. “This is the first time I’ve ever heard of a death” from a therapeutic radiation accident, said FDA official Edwin Miller. “There have been overtreatments to various degrees, but nothing quite as serious as this that I’m aware of.”

Physicians did not at first suspect a radiation overdose because the injuries appeared so soon after treatment and were far more serious than an overexposure would ordinarily have produced.

“It was certainly not like anything any of us have ever seen,” said Dr. Kenneth Haile, director of radiation oncology of the Kennestone radiation facility. “We had never seen an overtreatment of that magnitude.”

Estimates are that the patients received 17,000 to 25,000 rads to very small body areas. Doses of 1,000 rads can be fatal if delivered to the whole body.

The software fault has since been corrected by the manufacturer, according to FDA and Texas officials, and some of the machines have been returned to service.

… (description of the accidents)

The Therac 25 is designed so that the operator selects either X-ray or electron-beam treatment, as well as a series of other items, by typing on a keyboard and watching a video display screen for verification of the orders.

It was revealed that if an extremely fast-typing operator inadvertently selected the X-ray mode, then used an editing key to correct the command and select the electron mode instead, it was possible for the computer to lag behind the orders. The result was that the device appeared to have made the correct adjustment but in fact had an improper setting so it focussed electrons at full power to a tiny spot on the body.

David Parnas, a programming specialist at Queens University in Kingston, Ontario, said that from a description of the problem, it appeared there were two types of programming errors.

First, he said, the machine should have been programmed to discard “unreasonable” readings – as the injurious setting presumably would have been. Second, said Parnas, there should have been no way for the computer’s verifications on the video screen to become unsynchronized from the keyboard commands.

As the report makes devastatingly clear, there was far more wrong with the system than just the two issues Parnas identifies. The code-base as a whole was shoddy, there were no hardware failsafes to stop unreasonable behaviour, and sanity-checking (i.e., checking that the internal state of the program is sensible) was insufficient or non-existent. From the report:

The operator can later edit the mode and energy separately. If the keyboard handler sets the Data Entry Complete flag before the operator changes the data in MEOS, Datent will not detect the changes because it has already exited and will not be reentered again. The upper collimator (turntable), on the other hand, is set to the position dictated by the low-order byte of MEOS by another concurrently running task (Hand) and can therefore be inconsistent with the parameters set in accordance with the information in the high-order byte. The software appears to contain no checks to detect such an incompatibility.

Take away point (and again, reading the report will give you several): Sanity checks are good. Make use of assertions. Code defensively.
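To make the point concrete, here is a minimal sketch of the kind of consistency check the report says was missing: before acting, verify that two pieces of state that must agree actually do. Everything here is hypothetical and invented for illustration (the `MachineState` class, the mode names, the turntable positions). It is not a reconstruction of the Therac 25 software.

```python
from dataclasses import dataclass

# Hypothetical mapping: for each treatment mode, the only turntable
# position considered safe. The names are made up for this sketch.
SAFE_TURNTABLE = {
    "xray": "xray_target",
    "electron": "electron_scan",
}


class InconsistentStateError(RuntimeError):
    """Raised when the machine's settings disagree with each other."""


@dataclass
class MachineState:
    mode: str
    turntable: str


def check_consistent(state):
    """Refuse to continue if mode and turntable position do not match."""
    expected = SAFE_TURNTABLE.get(state.mode)
    if expected is None:
        raise InconsistentStateError(f"unknown mode {state.mode!r}")
    if state.turntable != expected:
        raise InconsistentStateError(
            f"mode {state.mode!r} requires turntable {expected!r}, "
            f"but it is at {state.turntable!r}"
        )


def fire_beam(state):
    # Check immediately before acting, not only at data-entry time.
    check_consistent(state)
    return f"beam on in {state.mode} mode"
```

The sketch raises an explicit exception rather than using assert, on the assumption that a check guarding something dangerous should not be something an optimisation flag can strip out. A software check like this complements hardware failsafes and does not replace them.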

But most staggering was the way the manufacturer, AECL, reacted to the news. The blame for the two years it took for the bug to be recognised can be laid at their feet: they did not respond adequately to reports of malfunctions. Their attitude can be summed up in this paragraph from the letter they sent to hospitals after the bug was found, describing how to keep the machine safely operational.

Effective immediately, and until further notice, the key used for moving the cursor back through the prescription sequence (i.e, cursor “UP” inscribed with an upward pointing arrow) must not be used for editing or any other purpose.
To avoid accidental use of this key, the key cap must be removed and the switch contacts fixed in the open position with electrical tape or other insulating material. For assistance with the latter you should contact your local AECL service representative.

Take away point: If problems occur because your software’s allowed to drift into a nonsensical state when the users use the keyboard too fast, it’s actually not really a fix to tell them to remove buttons so they can’t use the keyboard too fast. It seems comical in hindsight. But in 2006, the FDA noted in a press release how Alaris responded to their faulty infusion pumps:

In an August 15 recall letter, Alaris informed customers that it will provide a warning label for the pumps and a permanent correction for the key bounce problem once it is available.  In the letter, Alaris also provided recommendations to pump users on steps they can take to minimize key entry errors until the problem can be corrected. The steps are as follows:

Proper Stance
When programming pumps, stand squarely in front of the keypad (ideally with the pump at eye level for best visibility) to facilitate proper depth of depressing each key.

Listen
Focus on listening to the number of beeps while programming IV pumps; each beep will correspond to a single digit entry.  Unexpected double tone could indicate an unintended entry.

Verify Screen Display
When programming the pump or changing settings, always compare the patient’s prescribed therapy or the medication administration record, original order, or bar code device to the displayed pump settings for verification before starting or re-starting the infusion.

Independent Double Check
Request an independent double check of pump settings by another practitioner before starting or changing infusions with hospital-selected high alert drugs.

Look
Before leaving the patient’s room, observe the IV tubing drip chamber to see if the observed rate of infusion looks faster or slower than expected.  Adjust accordingly.

In a way, it’s worse than “take the key off the keyboard and tape it over”. At least that stopped the error as known from happening. So here’s a final take away point: Telling your users to constantly check that the system hasn’t malfunctioned, while promising a warning sticker and eventually a real fix, doesn’t count as a fix. Relying on human vigilance isn’t a solution. Not when bugs really do matter.

[1] PEP 316 actually proposed that contracts be added to Python. Current status: Deferred.

7 responses to “When bugs really do matter: 22 years after the Therac 25”

  1. Support for contracts has been officially proposed for Python (PEP 316). There is also a reference implementation available:
    http://www.wayforward.net/pycontract/

  2. As I noted at the bottom. 😉

    Unfortunately (or not. I like the idea, but the actual implementation of sticking all that code inside the docstring doesn’t amuse me) its deferred status probably means we’re not going to see it any time soon (especially if it was first proposed in 2003).

  3. Please do not link to wikipedia all the time. Why? When I, as your reader, encounter a link, my assumption is that you refer to some other valuable source or citation of information that I might not know myself. However, wikipedia is known to everyone and even included in my browser’s context menu upon selecting some text. Therefore, links to wikipedia are pointless and annoying.

  4. Thanks Adrian, that’s a fair point. I’ll strive to be less lazy about finding useful “further reading” links in future.

  5. Nathan, the vast majority of your readers don’t mind and appreciate links for further reading, whether they be from wikipedia or elsewhere, as long as they are relevant. Most folks don’t know wikipedia inside and out and frankly don’t have time to do so. Remember the overall general audience, not just the overly critical folks.

  6. Pingback: Bob on Medical Device Software » Blog Archive » More Software Forensics and Why Analogies Suck

  7. Pingback: What price a byline? (Or: what’s wrong with Knol) « Paragraft
