Monthly Archives: July 2008

What price a byline? (Or: what’s wrong with Knol)

A reader criticised my frequent referencing of Wikipedia in my last post, on the basis that everyone knows what WP is and that indeed some of us have Firefox extensions[1] to make consulting it quick and easy. I admitted he had a point, prompting another reader to protest that it doesn’t matter where the links go, as long as they’re informative and well-written. The degree to which they were both right was strikingly indicative of how far WP has come. Given that it’s so often the first search result on Google for a huge number of queries, making explicit links to it can seem like adding links to dictionary.com for longer semantemes[2]. And the reason I reference it so often is that its collective writing style and usual accuracy are ideal for a quick introduction to what may be unfamiliar ground.

But its status as the #1 go-to place for so many Google queries didn’t go unnoticed in Mountain View. Yesterday Google finally released their long-in-development About.com mimic, Knol. A “knol” is an unnecessary neologism coined by Google to mean a “unit of knowledge”, but the basic idea seems to be to compete with Wikipedia on the authoritative content front, by meeting one of the oft-heard (albeit less so now, if only due to exhaustion) criticisms of WP: that you can’t trust it because you don’t know who wrote it. Knol’s points of difference with WP are then as follows:

  • You can have more than one article on a topic.
  • Articles are signed by their authors.
  • Advertising will be displayed, and the revenue split with authors.
  • The level of collaborative editing allowed on each article is controlled by the author, as is the licensing.

I’ve been reading through a few of its articles, and what strikes me is what they’ve lost by not having universal editing. So often WP was compared to the Encyclopaedia Britannica. Knol tries to compromise between the two, but in doing so completely erodes the role of the editor: the person who doesn’t actually write the content, but polishes it to a publishable standard and makes it consistent with the rest of the corpus of work. Today’s featured Knol is on Migraines and Migraine Management. It’s written by a neurologist, so you know it’s authoritative, and it doesn’t allow public editing, so you know it hasn’t been tampered with.

But compare it with WP’s article on Migraines, and you’ll see just how badly it needs an editor. It’s written as if it were intended for paper, with non-hyperlinked cross-references: “Migraine is defined as a headache that [TABLE 2]:”. “[TABLE 2]” is a JPEG image at reduced size. There’s no reason for that rather than an actual HTML table. (Additionally, Google, there’s no reason for the images to be inline with the content like that. Consider a Tufte-like layout, where the tables, references and footnotes can go out to the side.)

Throughout Knol you’ll find all sorts of bad design practice. I swear I saw image hotlinking in one article. In particular, a lot of the seed articles seem to be HTML dumps of articles already written by medical professionals, like this one. It’s closed collaboration, so unlike on WP, you can’t just drop in and quickly format that into something presentable (at present there’s no change in style, the intra-page headings are just capitalised, there’s an odd amount of whitespace, and the page title itself isn’t capitalised).

There are two big surprises here, given that this is a Google project, and given how long it’s been in development. If they don’t fix them, I fear an epic failure.

The first is that they’ve provided such an unstructured writing environment. If you’re trying to create a body of high-quality written material, there are ways you can structure your editing environment so that content conforms to certain styles and expectations. It’s particularly in Google’s interest to do so, since as they keep telling the SEO world, well-structured HTML documents are easier for them to index and search. And yet Knol’s featured Migraines article has swathes of tabular content in the un-indexable, inaccessible JPEG format.
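To make the point concrete: any parser can pull searchable text straight out of a real HTML table, while a JPEG of the same table gives an indexer nothing without OCR. A minimal sketch using Python’s standard library (the table fragment here is made up):

    from HTMLParser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the text content of whatever HTML is fed in."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.words = []

        def handle_data(self, data):
            self.words.extend(data.split())

    extractor = TextExtractor()
    extractor.feed("<table><tr><th>Criterion</th><th>Threshold</th></tr>"
                   "<tr><td>Attacks</td><td>at least 5</td></tr></table>")
    print extractor.words
    # ['Criterion', 'Threshold', 'Attacks', 'at', 'least', '5']
    # Feed an indexer the JPEG version of the same table and it gets
    # nothing back at all.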

The second is much more subtle, and can’t be fixed with a technology patch as readily as the first. Google have failed to realise that the most expert authors will often be simultaneously the least equipped to properly format and polish their own documents (whether due to lack of technical skill or lack of time), and the least willing to submit their work to editorial changes from the unwashed anonymous masses. The fix, I think, will involve recognising and separating the two types of editing that happen on Wikipedia: authoring or fixing content; and editing for quality control (fixing grammar, spelling and style, and adding useful metadata to a document). Then build a system to acknowledge the good editors, not just the good authors. Then encourage authors to allow editorial changes from recognised quality editors. In fact, drop the “closed collaboration” option altogether.

This is even harder than getting good quality content in the first place. Writing is glamorous. Editing isn’t, but it’s so very important. Knol’s only got half the problem solved.

[1] Certainly one of my favourite extensions is the perennially useful Googlepedia, which remixes your Google search results to embed the first returned WP article on the right (it’s particularly nice on widescreen monitors).

[2] So it’s not a directly applicable synonym of ‘word’, but it was the best the thesaurus could give me.

When bugs really do matter: 22 years after the Therac 25

At the moment my master’s work has me in the middle of what seems like a never-ending sea of runtime errors. It’s the largest Python project I’ve ever worked on, and I’ve been forced to re-learn a lot of practices that I really should have internalised a long time ago but never really did. I’ve also been able to play around with interesting meta-programming techniques (i.e., modifying the language on the fly, so to speak, though not as much as a genuine macro system would allow) to stop certain types of bugs recurring.
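As a sketch of the kind of thing I mean (the class names here are illustrative, not from my actual project): a metaclass can freeze an instance’s attribute set once __init__ has finished, so assigning to a misspelled attribute raises immediately instead of silently creating a new one.

    class Strict(type):
        """Metaclass that closes off an instance's attributes once
        construction has finished."""
        def __call__(cls, *args, **kwargs):
            instance = super(Strict, cls).__call__(*args, **kwargs)
            instance.__dict__['_frozen'] = True
            return instance

    class Base(object):
        __metaclass__ = Strict

        def __setattr__(self, name, value):
            if getattr(self, '_frozen', False) and name not in self.__dict__:
                raise AttributeError("no such attribute: %r" % name)
            object.__setattr__(self, name, value)

    class Point(Base):
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Point(1, 2)
    p.x = 10       # updating an existing attribute: fine
    p.lenght = 5   # typo: AttributeError here, not a silent bug later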

As things progress, the new code is settling down into something that actually runs more often than not, and is becoming predictable. One thing I’ve relearned the value of is the sanity check: throwing assert statements around like it’s going out of style. I’ve found myself almost wishing Python had Eiffel’s native contracts[1]. Almost. The trade-off of this approach, development time spent for code stability gained, is one I’m willing to make.
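For the curious, a decorator built from plain asserts gets you a crude approximation of contracts. A rough sketch, with a made-up function and constants (real contracts, as in Eiffel or PEP 316, do considerably more):

    MAX_SAFE_DOSE = 50.0        # illustrative values only
    CALIBRATION_FACTOR = 1.08

    def contract(requires=None, ensures=None):
        """Check a precondition on the arguments and a postcondition on
        the result. Like any assert, both vanish under python -O."""
        def decorate(f):
            def wrapper(*args, **kwargs):
                if requires is not None:
                    assert requires(*args, **kwargs), "precondition failed"
                result = f(*args, **kwargs)
                if ensures is not None:
                    assert ensures(result), "postcondition failed"
                return result
            return wrapper
        return decorate

    @contract(requires=lambda dose: dose > 0,
              ensures=lambda result: 0 < result <= MAX_SAFE_DOSE)
    def scale_dose(dose):
        return dose * CALIBRATION_FACTOR

    scale_dose(10.0)    # passes both checks
    scale_dose(-1.0)    # AssertionError: precondition failed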

But it’s a whole other matter when you consider articles like this one from the Baltimore Sun (June 30). It’s about a forensic bug-hunting team at the FDA, set up to address the growing dangers as software runs more and more medical systems. It talks up static code analysis as a possible solution, though after passing through the mainstream journalism filter it probably comes across as much more of a panacea than it really is. After all, there is no single magic pill for complex software like this.
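For the unfamiliar, the appeal of static analysis is that it never has to run the code, so it catches branches your tests don’t. A contrived example: a checker like pyflakes flags the undefined name below without executing anything, whereas a test suite that always passes in data would never hit it.

    def average_dose(readings):
        if not readings:
            # A typo hiding in a rarely-exercised branch: 'defualt' is
            # defined nowhere. Only a test that happens to pass in an
            # empty list trips over it at runtime; static analysis
            # sees it unconditionally.
            return defualt
        return sum(readings) / float(len(readings))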

One passage stood out though.

FDA officials declined to name the maker of the infusion pumps. (In 2006, Cardinal Health, of Dublin, Ohio, stopped production of its Alaris SE pumps because of a key-bounce error that reportedly killed two patients, including a 16-day-old baby that got 44.8 milliliters of intravenous nutrition, rather than 4.8 milliliters.)

During the investigation into the malfunctioning pumps, nurses complained about frequent keyboard errors, while the manufacturer blamed nurses for entering the wrong drug information and then failing to double-check, said Brian Fitzgerald, who heads the FDA’s software specialists.

A shade over 23 years ago, an equally tragic disregard for reported malfunctions that appeared to have no effect gave radiation overdoses to six people. The Therac 25, a radiation therapy machine with a subtle bug that led to massive overdoses, went down in software engineering history as one of the most infamous cases of a race condition. There is a comprehensive (and quite readable) report on the accidents that’s very much worth the time of any developer working on non-trivial software or electronics. It makes for spine-chilling reading.

From the Boston Globe, June 20, 1986 (source):

MAN KILLED BY ACCIDENT WITH MEDICAL RADIATION

by Richard Saltos, Globe Staff

A series of accidental radiation overdoses from identical cancer therapy machines in Texas and Georgia has left one person dead and two others with deep burns and partial paralysis, according to federal investigators.

Evidently caused by a flaw in the computer program controlling the highly automated devices, the overdoses – unreported until now – are believed to be the worst medical radiation accidents to date.

The malfunctions occurred once last year and twice in March and April of this year in two of the Canadian-built linear accelerators, sold under the name Therac 25.

Two patients were injured, one who died three weeks later, at the East Texas Cancer Center in Tyler, Texas, and another at the Kennestone Regional Oncology Center in Marietta, Ga.

The defect in the machines was a “bug” so subtle, say those familiar with the cases, that although the accident occurred in June 1985, the problem remained a mystery until the third, most serious accident occurred on April 11 of this year.

Late that night, technicians at the Tyler facility discovered the cause of that accident and notified users of the device in other cities.

The US Food and Drug Administration, which regulates medical devices, has not yet completed its investigation. However, sources say that discipline or penalty for the manufacturer is unlikely.

Modern cancer radiation treatment is extremely safe, say cancer specialists. “This is the first time I’ve ever heard of a death” from a therapeutic radiation accident, said FDA official Edwin Miller. “There have been overtreatments to various degrees, but nothing quite as serious as this that I’m aware of.”

Physicians did not at first suspect a radiation overdose because the injuries appeared so soon after treatment and were far more serious than an overexposure would ordinarily have produced.

“It was certainly not like anything any of us have ever seen,” said Dr. Kenneth Haile, director of radiation oncology of the Kennestone radiation facility. “We had never seen an overtreatment of that magnitude.”

Estimates are that the patients received 17,000 to 25,000 rads to very small body areas. Doses of 1,000 rads can be fatal if delivered to the whole body.

The software fault has since been corrected by the manufacturer, according to FDA and Texas officials, and some of the machines have been returned to service.

… (description of the accidents)

The Therac 25 is designed so that the operator selects either X-ray or electron-beam treatment, as well as a series of other items, by typing on a keyboard and watching a video display screen for verification of the orders.

It was revealed that if an extremely fast-typing operator inadvertently selected the X-ray mode, then used an editing key to correct the command and select the electron mode instead, it was possible for the computer to lag behind the orders. The result was that the device appeared to have made the correct adjustment but in fact had an improper setting so it focussed electrons at full power to a tiny spot on the body.

David Parnas, a programming specialist at Queen’s University in Kingston, Ontario, said that from a description of the problem, it appeared there were two types of programming errors.

First, he said, the machine should have been programmed to discard “unreasonable” readings – as the injurious setting presumably would have been. Second, said Parnas, there should have been no way for the computer’s verifications on the video screen to become unsynchronized from the keyboard commands.

As the report makes devastatingly clear, there was far more wrong with the system than just the two issues in that last paragraph. The code-base as a whole was shoddy, there were no hardware failsafes to stop unreasonable behaviour, and there was insufficient or non-existent sanity-checking (i.e., checking that the program’s internal state is sensible). From the report:

The operator can later edit the mode and energy separately. If the keyboard handler sets the Data Entry Complete flag before the operator changes the data in MEOS, Datent will not detect the changes because it has already exited and will not be reentered again. The upper collimator (turntable), on the other hand, is set to the position dictated by the low-order byte of MEOS by another concurrently running task (Hand) and can therefore be inconsistent with the parameters set in accordance with the information in the high-order byte. The software appears to contain no checks to detect such an incompatibility.

Takeaway point (and again, reading the report will give you several): sanity checks are good. Make use of assertions. Code defensively.
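The Therac 25’s actual software was PDP-11 assembly, but the shape of the missing check is easy to show in Python (all the names below are made up for illustration):

    XRAY, ELECTRON = 'xray', 'electron'
    MAX_DOSE_RATE = 100.0   # illustrative limit, not a real machine value

    def fire_beam(mode, collimator_position, dose_rate):
        # The mode and the collimator were set by concurrently running
        # tasks, so assume nothing: verify they agree at the last
        # responsible moment, before anything irreversible happens.
        assert mode == collimator_position, (
            "beam mode %r inconsistent with collimator at %r"
            % (mode, collimator_position))
        assert 0 < dose_rate <= MAX_DOSE_RATE, "dose rate out of range"
        # ... only now actually fire ...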

But most staggering was the way the manufacturer, AECL, reacted to the news. The blame for the two years it took for the bug to be recognised can be laid at their feet, for not responding adequately to reports of malfunctions. Their attitude is summed up in this paragraph from the letter they sent to hospitals after the bug was found, describing how to keep the machine safely operational.

Effective immediately, and until further notice, the key used for moving the cursor back through the prescription sequence (i.e, cursor “UP” inscribed with an upward pointing arrow) must not be used for editing or any other purpose.
To avoid accidental use of this key, the key cap must be removed and the switch contacts fixed in the open position with electrical tape or other insulating material. For assistance with the latter you should contact your local AECL service representative.

Takeaway point: if problems occur because your software is allowed to drift into a nonsensical state when users type too fast, it’s not really a fix to tell them to remove buttons so they can’t type too fast. It seems comical in hindsight. But in 2006, the FDA noted in a press release how Alaris responded to their faulty infusion pumps:

In an August 15 recall letter, Alaris informed customers that it will provide a warning label for the pumps and a permanent correction for the key bounce problem once it is available.  In the letter, Alaris also provided recommendations to pump users on steps they can take to minimize key entry errors until the problem can be corrected. The steps are as follows:

Proper Stance
When programming pumps, stand squarely in front of the keypad (ideally with the pump at eye level for best visibility) to facilitate proper depth of depressing each key.

Listen
Focus on listening to the number of beeps while programming IV pumps; each beep will correspond to a single digit entry.  Unexpected double tone could indicate an unintended entry.

Verify Screen Display
When programming the pump or changing settings, always compare the patient’s prescribed therapy or the medication administration record, original order, or bar code device to the displayed pump settings for verification before starting or re-starting the infusion.

Independent Double Check
Request an independent double check of pump settings by another practitioner before starting or changing infusions with hospital-selected high alert drugs.

Look
Before leaving the patient’s room, observe the IV tubing drip chamber to see if the observed rate of infusion looks faster or slower than expected.  Adjust accordingly.

In a way, it’s worse than “take the key off the keyboard and tape it over”. At least that stopped the known error from happening. So here’s a final takeaway point: telling your users to constantly check that the system hasn’t malfunctioned, while promising a warning sticker now and a real fix eventually, doesn’t really count as a fix. Relying on human vigilance isn’t a solution. Not when bugs really do matter.
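For contrast, the software fix for key bounce itself is almost embarrassingly small: drop a repeat of the same key that arrives faster than a human could physically press it twice. A sketch (the 50 ms threshold is illustrative, not Alaris’s figure):

    import time

    DEBOUNCE_INTERVAL = 0.05   # seconds; an illustrative threshold

    class DebouncedKeypad(object):
        def __init__(self):
            self.last_key = None
            self.last_time = 0.0

        def accept(self, key, now=None):
            """Return True if the keypress should be registered."""
            if now is None:
                now = time.time()
            if key == self.last_key and (now - self.last_time) < DEBOUNCE_INTERVAL:
                return False   # same key, implausibly fast: treat as bounce
            self.last_key, self.last_time = key, now
            return True

    pad = DebouncedKeypad()
    assert pad.accept('4', now=0.000)        # real press
    assert not pad.accept('4', now=0.010)    # 10 ms later: bounce, dropped
    assert pad.accept('.', now=0.200)        # different key: registered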

[1] PEP 316 actually proposed that contracts be added to Python. Current status: Deferred.