That's not meant to happen
Is your software confusing operators into doing something disastrous?
As the plane sped down the runway at Puerto Plata in the Dominican Republic, the captain noticed a problem. The Boeing 757 was accelerating but his airspeed indicator was doing nothing. "Is yours working?" he asked the co-pilot as the plane passed the point of no return - the captain could not brake and expect to have enough runway left to stop. The only option was to keep accelerating and get into the air.
"Yes sir," the copilot replied a few seconds before the 757 parted company with the runway. As the aircraft climbed, the captain noticed that his airspeed indicator suddenly started working again. Maybe it was just an intermittent fault, never to be seen again. As the plane climbed, the centre autopilot was activated - a system that used the suspect airspeed indicator as one of its sources of data.
Twenty seconds later, as the plane reached an altitude of 1km, warning signs flashed up in the cockpit. And the captain realised that his airspeed indicator was acting strangely. It was not the only one. "There is something crazy there. Two hundred [knots] only is mine and decreasing, sir," the co-pilot announced.
"Both of the them are wrong," the captain exclaimed. "What can we do?"
As alarms went off in the cockpit, the captain's airspeed indicator showed the plane had reached a speed of 350 knots. The autopilot reacted, pointing the 757 up at an angle of close to 20° and reducing engine thrust. But that simply made the pilot's control stick shake from side to side with a loud rattling noise - the warning sign of an imminent stall, the symptom of flying too slowly at too high an angle.
The stall warning was not in error. Within seconds, BirgenAir flight 301 was falling, its autopilot fighting against the human pilots' attempts to regain control. At a point where the aircraft needed full power to climb, the autopilot inexplicably cut power. Automated warnings to pull up were in vain as the autopilot was disconnected too late. At an angle of 80°, the 757 powered into the ocean 20km northeast of Puerto Plata, disintegrating on impact.
What brought down flight 301 on 6 February 1996? A damaged sensor was the root cause. Investigators suspected the pitot tube, which measures differences in air pressure, had become blocked. The aircraft had stood idle for days while it was being repaired, giving local insects an opportunity to build a nest in the tube. However, the tubes were never recovered from the ocean and this remains just a plausible hypothesis.
The system's designers had not anticipated the effect of a blocked pitot tube or failed sensor on the autopilot - a situation compounded by warnings from other systems that the pilots were not trained to understand. Another computer on board was able to determine that the airspeed indicators were not working properly, but the warnings that flashed up seemed to have little to do with the problem.
The aircraft operations manual at the time did not explain that the warnings "MACH/SPD TRIM" and "RUDDER RATIO" appeared together when the plane's airspeed sensors disagreed with each other by more than 10 knots. Unaware of their true meaning, the cockpit crew puzzled over these cryptic messages as the 757 headed towards its fatal destination.
Ultimately, the investigation into the accident blamed the pilot for the crash. Although the crew had been misled by the system, the investigators considered that the pilot made decisions that reduced the chance of recovering from what was, initially, a mild problem. The co-pilot's speed indicator seemed to be working while the pilot's was malfunctioning, yet the pilot still engaged the autopilot that relied on his own airspeed indicator. The flight crew realised too late what had happened.
However, there were systemic issues in the design of the autopilot software and the user interfaces of the plane's control systems that exacerbated the flight crew's confusion. In fact, researchers have come across many situations where poor user-interface design has caused problems for pilots, although luckily most crews have realised the problem before it became fatal.
One common problem was even given a name in the 1990s by Nasa Ames Research Center scientist Everett Palmer: a 'kill-the-capture bust'. In these incidents, an automatic transition between system states leads to a different outcome from the one the operator expected.
John Rushby of the computer science laboratory at SRI International later softened the term to 'kill-the-capture surprise', given that many of the problems are noticed before they become dangerous, often because the operators are able to respond to the surprise before the worst happens.
Palmer relayed a couple of events captured from flight simulators. One, which took less than 20 seconds to play out, involved confusion over autopilot settings. The crew had just missed an approach to land and had climbed back to around 600m before they received an instruction from air traffic control to climb to 1,500m. The pilot set the autopilot to climb to that height but, as it did so, changed some other settings. One of them caused the autopilot to switch from climbing to some specific height to climbing at a constant speed.
A light showing that the target altitude was approaching lit up - and stayed lit as the aircraft sailed past 1,500m. "Five thousand...oops, it didn't arm," remarked the captain. And then the altitude alarms went off as the craft approached 1,700m.
Palmer noted that one aspect of these aviation incidents is not so much what went wrong but how disaster was averted. "The aviation system is very much an error-tolerant system," he wrote, "with almost all errors being rapidly detected and corrected."
The pilots frequently noticed problems from the readings on simple instruments, such as the airspeed indicator or the altimeter, rather than from the flight-mode annunciators that show the state of the automation. "The crews were apparently aware of the state of the aircraft but not aware of the state of the automation," Palmer noted.
The aircraft climbed past its target height because the captain had inadvertently selected the wrong type of thrust - a setting that effectively confused the autopilot. He failed to notice, according to Palmer, that he had selected go-around thrust - appropriate for the burst of power needed after a missed approach - rather than regular climb thrust.
For the autopilot that controls a climb to work correctly, the pilot is meant to check what the thrust reference panel says before it is engaged, and then check that the flight-mode annunciators show the mode that was actually selected. This design means that the button to engage the autopilot has a 'meta meaning' - a term coined by cognitive scientist Edwin Hutchins to describe controls whose behaviour changes with the state of the system. In the case of this control, that state was shown on a panel some way away from the button itself.
Professor Nancy Leveson of the Massachusetts Institute of Technology, who worked with Palmer on a later paper on automation surprises, has logged a number of software problems that have interfaces - both between the user and the computer and between software modules - at their heart. In either case, a surprise happens when the automated system behaves differently from what its operator expects.
According to Rushby, who published a paper on the problem and potential ways round it in Reliability Engineering and System Safety in 2002, it is possible to describe the operator's mental model of a system and the actual system behaviour as finite state machines. The problems occur when those two models go out of step. The solution, for Rushby, was to use model checking to search for those situations automatically.
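The idea can be illustrated with a heavily simplified, hypothetical sketch - the state and event names below are invented for illustration and bear no relation to Rushby's actual models. The automation and the operator's mental model are each a small finite state machine; driving both with the same event sequence and comparing their states reveals the point at which an 'automation surprise' occurs:

```python
# Hypothetical sketch: the actual automation and the operator's mental model
# as finite state machines (state -> event -> next state). The automation
# quietly enters a speed-climb mode that the pilot's model does not contain.
actual = {
    "ALT_CAPTURE": {"set_speed": "SPEED_CLIMB", "reach_alt": "ALT_HOLD"},
    "SPEED_CLIMB": {"reach_alt": "SPEED_CLIMB"},  # climbs straight past target
    "ALT_HOLD":    {},
}
mental = {
    "ALT_CAPTURE": {"set_speed": "ALT_CAPTURE", "reach_alt": "ALT_HOLD"},
    "ALT_HOLD":    {},
}

def find_surprise(actual, mental, start, events):
    """Return (event, actual state, believed state) at the first divergence,
    or None if the two machines stay in step for the whole sequence."""
    a = m = start
    for e in events:
        a = actual[a].get(e, a)   # unknown events leave the state unchanged
        m = mental[m].get(e, m)
        if a != m:
            return e, a, m
    return None

# The pilot changes a speed setting, then the aircraft reaches the target
# altitude: the two models diverge at the very first event.
print(find_surprise(actual, mental, "ALT_CAPTURE", ["set_speed", "reach_alt"]))
```

A real model checker explores every possible event sequence rather than a single scripted one, but the principle - flagging any reachable state where the machine and the operator's model disagree - is the same.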
Palmer argued that the user interfaces need to improve: "What is needed is a 'what you see is what you will get' type of display of the aircraft's predicted vertical path", in place of the cryptic messages that the flight-mode annunciators relayed.
Unfortunate interactions between the user interface and the underlying automation are by no means confined to aviation. They are simply more noticeable there because of the open way in which the results of investigations into disasters and near-misses are published. Leveson covered the story of the Therac-25 radiation-therapy machine in her 1995 book 'Safeware'.
Six people suffered massive radiation overdoses from the Therac-25 between June 1985 and January 1987. No formal investigation was ever carried out into the accidents - Leveson pieced the story together from lawsuits, depositions and government records.
The Therac-25, controlled by a PDP-11 minicomputer, incorporated pieces and routines from earlier machines - something that the quality assurance manager on the project was unaware of until a bug encountered on the Therac-25 was also found on the earlier Therac-20.
The computer was used to position a turntable so that the powerful X-ray beam from the machine could be attenuated correctly. But this attenuator was only positioned for one of the two modes the machine could adopt. As Leveson pointed out in her account: "This is the basic hazard of dual-mode machines: if the turntable is in the wrong position, the beam flattener will not be in place."
Before the Therac-25, electromechanical interlocks were used to ensure that the X-ray beam could not strike the patient unless the attenuating beam flattener was in place. In the Therac-25, many were replaced by software checks.
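The kind of check that replaced the hardware interlock can be sketched in heavily simplified, hypothetical form - the mode and position names below are invented, not taken from the actual Therac-25 code:

```python
# Hypothetical sketch of a software interlock for a dual-mode machine:
# the beam may only fire when the turntable position matches the mode,
# so the attenuating flattener sits in the path for high-power X-ray mode.
def beam_permitted(mode, turntable_position):
    """Allow the beam only when mode and turntable position agree."""
    required = {"xray": "flattener", "electron": "scanning_magnets"}
    return turntable_position == required[mode]

print(beam_permitted("xray", "flattener"))          # consistent setup: fire
print(beam_permitted("xray", "scanning_magnets"))   # the Leveson hazard: block
```

The check itself is trivial; the Therac-25's failures came not from the logic of such a check but from the surrounding software sometimes bypassing or corrupting it - which is precisely why replacing an electromechanical interlock with code raised the stakes.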
To start therapy, operators had to set up the machine manually in the right position and then enter matching patient data and system settings at a computer console. Operators complained, and the manufacturer, AECL, agreed to modify the software to make the process easier: a carriage return would copy across the necessary data.
When things went wrong, the machine would flag up a malfunction and a code, but no additional information. A memo from the Food and Drug Administration claimed: "The operator's manual supplied with the machine does not explain nor even address the malfunction codes...[they] give no indication that these malfunctions could place a patient at risk."
One operator admitted that she had become insensitive to machine malfunctions: the messages were commonplace and most did not affect the patient. In one case, the therapist unwittingly delivered several bursts of radiation, believing that malfunctions had interrupted all but one of the attempts. The patient later died of a virulent cancer, and an AECL technician estimated she had received as much as 17,000 rad - a single dose should be around 200 rad.
Problems with the Therac-25 seemed to revolve around race conditions in the software, according to Leveson. In one accident, it was found that a flaw allowed the machine to activate even when an error was flagged: if the operator hit a button at the precise moment a counter rolled over to zero, the machine could turn on the full 25MeV beam without any attenuator in the way. As operators learned to run through the setup sequence more quickly, this previously hidden problem surfaced. The result was a highly concentrated electron beam. The problem was exacerbated by the software's failure to record what the machine had actually done.
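The counter that rolled over to zero belongs to a well-known class of bug. A minimal, hypothetical sketch - the names and structure are invented, not the machine's real code - shows how a one-byte counter used as a boolean flag can silently wrap and defeat a safety check:

```python
# Hypothetical sketch of the flaw's class: a routine signals "setup is
# inconsistent" by INCREMENTING a one-byte counter instead of assigning a
# fixed non-zero value. Every 256th increment wraps the counter back to
# zero, and zero is read as "everything is consistent".
class SetupCheck:
    def __init__(self):
        self.flag = 0  # one-byte counter misused as a boolean flag

    def mark_inconsistent(self):
        # Bug: increment, with 8-bit wrap-around, rather than flag = 1.
        self.flag = (self.flag + 1) % 256

    def safe_to_fire(self):
        # Zero is interpreted as "no inconsistency detected".
        return self.flag == 0

check = SetupCheck()
for _ in range(256):         # after 256 increments the counter wraps to zero...
    check.mark_inconsistent()
print(check.safe_to_fire())  # ...and the check wrongly reports it is safe
```

An operator pressing the button at exactly the wrong tick hits the one-in-256 window where the flag reads zero despite an inconsistent setup - a failure that only shows itself once operators become fast enough to make the timing likely.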
Leveson wrote: "The Therac-25 software 'lied' to the operators, and the machine itself was not capable of detecting that a massive overdose had occurred. The ion chambers on the Therac-25 could not handle the high density of ionisation from the unscanned electron beam at high current; they thus became saturated and gave an indication of a low dosage. Engineers need to design for the worst case."
And without any backup indication of what was going on, the operators were completely in the dark as to what the Therac-25 was really doing, so, unlike the pilots, they were unable to even attempt to rectify the situation. As Leveson pointed out: "Safety is a quality of the system in which the software is used; it is not a quality of the software itself."