Saturday, October 16, 2021

This software supply chain flaw killed 346 people. That's a lot.

I have been following the saga of the 737 MAX with fear and loathing since it started. As you probably remember, the story became public with a crash in Indonesia in October 2018 that killed 189 people. As this NY Times article from last November says, Boeing “quickly diagnosed the problem: faulty software. The company immediately promised to fix the code, reminded pilots how to handle a potential malfunction and insisted the Max was safe.” But Boeing didn’t ground the plane worldwide, and the FAA didn’t force them to.

When a second 737 MAX crashed five months later in Ethiopia – costing 157 more lives – the FAA finally ordered the planes to be grounded. Since both crashes happened shortly after takeoff and were preceded by a dive caused by the nose of the plane being pushed down against the pilots’ will, attention had immediately focused on the software that caused this to happen, called MCAS. In both crashes, the pilots fought mightily with the software, repeatedly pulling the nose up after MCAS had pushed it down. But MCAS won the battle in both cases.

Clearly, nobody at Boeing designed MCAS so that it would cause crashes. Nor was there ever any question of a nefarious actor penetrating the development process and implanting the code that caused MCAS to behave as it did. It seemed that – for some reason – the software’s designers didn’t properly consider all the cases in which it might be activated. Even more importantly, they didn’t tell Boeing’s customers that there was a simple way to fix the problem if it ever occurred: just turn MCAS off.

The story came back into the headlines this week when Mark Forkner, who was chief technical pilot of the MAX, was indicted for misleading the FAA about MCAS. One of the more appalling stories that came out of this disaster last year was that he had bragged in emails that he’d “inadvertently” misled the FAA. If he’d misled the FAA inadvertently, why was he indicted? And how did it happen that such a terribly dangerous piece of software was developed and deployed in the first place? This clearly wasn’t just a case of a semicolon being inserted at the wrong place in the code.

Today Holman Jenkins, Jr. of the Wall Street Journal editorial page published a good – up to a point – analysis that asks both of these questions. Spoiler alert: Mr. Jenkins answers the first question, but considers the second to be a mystery that will probably never be solved – even though it was already answered in the Times article linked above. But of course, people who work for the WSJ editorial page would no more read a Times article for its content than they would have read Pravda (the official mouthpiece of the Soviet government) during the bad old days of the Cold War. (Note that my lack of regard for the WSJ editorial page writers is the opposite of the high regard in which I hold the WSJ news reporters, especially those who cover cybersecurity and electric power issues – and the places where they intersect – as Rebecca Smith does. BTW, there’s still a story to be told about the article I just linked and the quite unfortunate chain of events it led to. If there’s interest, I’ll tell it sometime.)

Jenkins explains that it was a change in the software that caused MCAS to become so deadly, as well as why Forkner didn’t know about this when he talked to the FAA:

MCAS had been designed to counter the tendency of the plane’s nose to rise in a maneuver that would never be experienced in normal operations, a high-speed, ever-tightening turn that would cause pandemonium in the passenger cabin if it were tried during a commercial flight. Mr. Forkner discovered only in the simulator, and then by contacting a Boeing engineer, that the system had been belatedly altered to intervene even during low-speed maneuvers.

So why was Forkner indicted? Because he subsequently did learn about the change, as shown by the famous email I mentioned above. But did he call up the FAA and tell them he’d been wrong? He didn’t. Had he done so, it might well have led to an investigation of MCAS, and to the discovery that Boeing had made a major software change without doing much, if any, analysis of what might happen if a pilot hadn’t been alerted to the possibility of MCAS pushing the nose down soon after takeoff.

Now we get to the second question, which is such a huge mystery to Mr. Jenkins: Why was this change made? Here’s what the Times article says:

In 2011, Boeing learned that American Airlines, one of its most important customers, was poised to place a major order for new jets with Airbus. Boeing had been considering designing a new midsize passenger jet, but the threat of losing out on the American deal forced the company’s hand.

Boeing decided to redesign the 737 — a plane that was introduced in 1967 — once more. To make the new Max more fuel efficient, Boeing needed bigger engines. But because the 737 was an old plane that sat low to the ground, those engines needed to be mounted farther forward on the wings, changing the aerodynamics of the plane. To compensate, Boeing introduced the ill-fated MCAS.

And even when the company knew that the software had caused the first crash, Boeing kept the Max flying until another plane fell from the sky.

But this change in itself wouldn’t have caused the accidents if the pilots had been trained in simulators to recognize the problem and turn MCAS off. It turns out, though, that the FAA didn’t require any simulator training for the 737 MAX, as long as pilots had already been trained for “regular” 737s – even though the MCAS problem could only occur in the MAX. Of course, this doesn’t mean the airlines couldn’t have required the training on their own, but few (if any?) did; it would have cost a lot of money and delayed the date when they could start flying the MAX.

But even without simulator training, it would have been very helpful if Boeing had put out a detailed alert about the problem. That would certainly have gotten the attention of airline management, who would have made sure pilots knew what to do when MCAS kicked in at the wrong time. But Boeing didn’t do that either, perhaps because it might have made the airlines realize they did need MAX simulator training, even though it wasn’t required.

And Mr. Jenkins points out another reason why not requiring simulator training was such a serious mistake (if “mistake” is the right word – the phrase used by an Ethiopian man in the Times article is “corporate manslaughter”): in developing that training for customers, Boeing managers might have learned of the problems that some engineers – and Forkner – hadn’t told them about. (Boeing obviously had a simulator for their own employees, since that’s how Forkner learned about the change in MCAS.) They might then have realized that having a couple of planes fall out of the sky would probably be a little more expensive for Boeing than the cost of requiring simulator training and notifying their customers of the problem.

So Boeing created MCAS to compensate for the fact that their redesign of the 737 – which they undertook to prevent American Airlines from placing a big order with Airbus – had made the plane likely to go nose-up and stall in certain high-speed situations (well after takeoff). But the later change that allowed MCAS to activate in some low-speed situations turned it into a deadly weapon, aimed directly at pilots who didn’t know about these problems – and, of course, at the unsuspecting crew members and passengers.

As I asked in this post recently (although in a very different context, I admit), “Is there any doubt that software is the biggest source of supply chain cyber risk?” True, the MAX crashes weren’t due to a supply chain cyberattack, but they could have been (perhaps one launched by hackers who had shorted Boeing’s stock and were looking for a huge negative event to make it tank). When you’re buying planes, the impact of those risks being realized can be much greater than in most other purchases, but there’s risk in any software used for any purpose.

Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. Nor are they shared by the National Telecommunications and Information Administration’s Software Component Transparency Initiative, for which I volunteer as co-leader of the Energy SBOM Proof of Concept. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.

 
