Life Saver or Life Taker ? (Therac-25) – Impact of poor testing (Testing Tragedies #1: Learning from past)

This blog is for everyone who wants to know how software testing job touches human lives and why defects in applications such as healthcare cant be ignored.

History:

The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited (AECL) It was involved with at least six accidents between 1985 and 1987, in which patients were given massive overdoses of radiation, approximately 100 times the intended dose.

Loss:  

Four of the six patients died as a direct result of poor design, coding and testing

 

therac

Company’s Response

After careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error

[raj]: Only if the company had taken very first of these incidents seriously they could have saved 3 precious lives. Every critical issue which customer finds should be given utmost priority before it becomes much worse.  

Facts

Only One person did the programming for this system and he largely did all the testing.

Therac-25 was tested as a whole machine rather then in separate modules.

[raj]: Yes, that was my reaction too. We left the lives of so many hundreds in the hands of ‘One/ person only . We are humans and to err is humans. Plus we humans are not so good in finding errors in our own work. 

If System and Integration testing is important, so is Unit testing, we can undermine the importance of any of these. They are meant to complement each other. becomes much worse.  

Incident Log & My Observations

Severity 1 Production Defect #1:  
A 40 year old women was receiving her 24th Therac-25 treatment. The machine stopped 5 seconds into the treatment with an error. The technician seeing that "No Dose" had been administered (according to the computer) hit the 'P' key thus proceeding with the dose. This was done a total of 5 times giving the patient 13 000 - 17 000 rads. To give an idea of how much of an overdose this is; a regular treatment is around 200 rads and 1000 rads of radiation to the entire body can be fatal. The patient died 3 months after the overdose.
Severity 1 Production Defect #4:  
The patient required only a small dose and according to the machine that is all he received. Yet again when the treatment was underway and error paused the machine and the technician hit the 'P' key to proceed. A overdose was administered and the man died just 3 months later.
[raj]: Better testability could have warned the technician that the dose had already been delivered. Misleading information and lack of transparency through the system confused him and he went on repeating the procedure again and again which made it fatal.
 Severity 1 Production Defect #2:  
Severity 1 Production Defect #3:  
A month later at the same hospital, with the same technician another fatal dosage was given. The technician made the same error of quickly changing the mode from X-ray mode to Electron mode using the 'cursor up' key. This again caused "Malfunction 54". The patient this time was receiving treatment on his face. When the overdose was administered he yelled and then began to moan. The audio equipment was working this time but the initial dose was too much for the man. He received severe neurological damage, fell into a coma and died only 3 weeks later.
[raj]: If system was designed considering that a simple wrong choice can have such adverse effects then a choice made by technician could have warned him and possibly stopped him from making that mistake.
 

Learning

Learning #1: Never dismiss any failure without reaching the bottom of it. Over confidence about your quality can take you and your customer down

Learning #2: Never depend on just one resource for the entire functionality. It’s dangerous. and it takes two to tango (certain activities can’t be achieved singly like arguing, fighting, dancing, making love :))

Learning #3: Unit, Integration and System testing, they all are equally important and one shouldn’t undermine importance of any of these.

Learning #4: Poor testability is extremely fatal. Lacks of user’s ability to validate the completion of the software operation/task can take lives as we have seen above

Learning #5: Don’t repeat any important function/operation/task without confirming the behaviour of the previous operation. Many times we think that running the software function again is case of failure is perfectly fine but that can be risky if the last operation resulted into corruption or left the machine in inconsistent state

Learning #6: For critical functions in your software, ensure there are provisions to handle silly human errors where we perform an action what we don't intend. Design should consider that humans can make mistakes and for important tasks, there should be a warning/message confirming the change (that can possibly warn him and correct the action as he intended).

Example of such human mistakes.

we want to click on Checkbox Yes but because page scroll happens and we click on NO and we don't even notice it

or

we are not 100 % concentrating and our brain is lost in it thoughts and we humans are sometime unaware of the the action performed.

e.g. I bet you would have felt the same more than once “Have i left the tap open after the bath?” when you would have closed it.

Thoughts for you

If you are thinking,this was a rare scenarios and example of worst engineering and the machine would have got retired for ever then i want to leave you with this fact that this machine is still in use today and there might be someone you know who might be sitting in front of the machine as we speak and that’s why it is important to find defects before a life-saver turns life-taker