MAX’s Return Delayed by FAA Reevaluation of 737 Safety Procedures

1st Aug 2019, 19:12

#1681 (permalink)

Zeffy

Join Date: Feb 2006

Location: USA

Posts: 487

Likes: 0

Received 0 Likes on 0 Posts

https://www.seattletimes.com/busines...ight-controls/

Quote:

Newly stringent FAA tests spur a fundamental software redesign of 737 MAX flight controls
Aug. 1, 2019 at 11:18 am Updated Aug. 1, 2019 at 11:59 am

By Dominic Gates
Seattle Times aerospace reporter

After two deadly crashes of Boeing’s 737 MAX and the ensuing heavy criticism of the Federal Aviation Administration (FAA) for its limited oversight of the jet’s original certification, the agency conducted newly stringent tests that in June uncovered a potential flaw and have spurred Boeing to make a fundamental software-design change.

As the FAA re-evaluates and recertifies the updated flight-control systems, it has specifically rejected Boeing’s assumption that the plane’s pilots can be relied upon as the backstop safeguard in scenarios such as the uncommanded movement of the horizontal tail involved in both the Indonesian and Ethiopian crashes. That notion was ruled out by FAA pilots in June when, during testing of the effect of a glitch in the computer hardware, one out of three pilots in a simulation failed to save the aircraft.

The thoroughness of the ongoing review of the MAX flight controls in light of the two crashes is apparent in how a new potential fault with a microprocessor in the flight-control computer was discovered during the June testing. Details of that fault not previously reported were confirmed both by an FAA official and by a person at Boeing familiar with the tests.

And in response to finding that new glitch, Boeing has developed a plan to fundamentally change the software architecture of the MAX flight-control system so that it will take input from both flight-control computers at once instead of using only one on a flight.

“This is a huge deal,” said Peter Lemme, a former flight-controls engineer at Boeing and avionics expert.

The 737 has two flight-control computers, but in the architecture that has been in place for decades, only one computer is used at a time on a flight, with systems switching to use the other computer on the next flight.

Lemme said the proposed software architecture switch to a “fail-safe,” two-channel system, with each of the computers operating from an independent set of sensors, will not only address the new microprocessor issue but will also make the flawed Maneuvering Characteristics Augmentation System (MCAS) that went haywire on the two crash flights more reliable and safe.
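To make the architectural difference concrete, here is a minimal sketch contrasting "one computer per flight" with a two-channel cross-check. It is not Boeing's design: the `ChannelOutput` type, the function names and the 0.5 degree disagreement threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelOutput:
    """Trim command computed by one flight-control computer (hypothetical type)."""
    trim_command_deg: float
    valid: bool = True


def alternating_select(flight_number: int,
                       left: ChannelOutput,
                       right: ChannelOutput) -> ChannelOutput:
    """One computer per flight, swapping on the next flight; no cross-check."""
    return left if flight_number % 2 == 0 else right


def cross_checked_select(left: ChannelOutput,
                         right: ChannelOutput,
                         max_disagreement_deg: float = 0.5) -> Optional[float]:
    """Use a command only if both channels are valid and agree.

    Returns None to signal "inhibit automatic trim". The threshold is an
    assumed value, not a real certification number.
    """
    if not (left.valid and right.valid):
        return None
    if abs(left.trim_command_deg - right.trim_command_deg) > max_disagreement_deg:
        return None
    return (left.trim_command_deg + right.trim_command_deg) / 2.0


healthy = ChannelOutput(trim_command_deg=0.0)
corrupted = ChannelOutput(trim_command_deg=-2.5)  # e.g. output after a bit flip

# Alternating scheme: whichever channel is active is trusted, faulty or not.
print(alternating_select(0, corrupted, healthy).trim_command_deg)
# Cross-checked scheme: the disagreement is caught and the command is dropped.
print(cross_checked_select(corrupted, healthy))
```

The point of the sketch is only that a fault confined to one channel shows up as a disagreement in the second scheme, whereas in the first it passes straight through whenever the faulty computer happens to be the active one.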

“I’m overjoyed to hear Boeing is doing this,” Lemme said. “It’s absolutely the right thing to do.”

According to a third person familiar with the details, Boeing expects to have this new software architecture ready for testing toward the end of September. Meanwhile, it will continue certification activities in parallel so that it can stick to its announced schedule and hope for clearance from the FAA and other regulators in October.

Flipping bits
When Boeing announced June 26 that a new potential flaw had been discovered on the MAX — this time in a microprocessor in the jet’s flight-control computer — it even caught Boeing CEO Dennis Muilenburg by surprise.

Speaking at a conference in Aspen that morning, Muilenburg reiterated a prior projection that the MAX could be carrying passengers again by “the end of summer.” Later that day, Boeing announced the problem in a Securities and Exchange Commission filing, and soon after projected that the issue could add a further three months’ delay.

What the FAA was testing when it discovered this new vulnerability was esoteric and remote. According to the person familiar with the details, who asked for anonymity because of the sensitivity of the ongoing investigations, the specific fault that showed up has “never happened in 200 million flight hours on this same flight-control computer in (older model) 737 NGs.”

In sessions in a Boeing flight simulator in Seattle, two FAA engineering test pilots, typically ex-military test pilots, and a pilot from the FAA’s Flight Standards Aircraft Evaluation Group (AEG), typically an ex-airline pilot, set up a session to test 33 different scenarios that might be sparked by a rare, random microprocessor fault in the jet’s flight-control computer.

This was standard testing that’s typically done in certifying an airplane, but this time it was deliberately set up to produce specific effects similar to what happened on the Lion Air and Ethiopian flights.

The fault occurs when bits inside the microprocessor are randomly flipped from 0 to 1 or vice versa. This is a known phenomenon that can happen due to cosmic rays striking the circuitry. Electronics inside aircraft are particularly vulnerable to such radiation because they fly at high altitudes and high geographic latitudes where the rays are more intense.

A neutron hitting a cell on a microprocessor can change the cell’s electrical charge, flipping its binary state from 0 to 1 or from 1 to 0. The result is that although the software code is right and the inputs to the computer are correct, the output is corrupted by this one wrong bit.

So for example, a value of 1 on a single bit might indicate that the jet’s wing flaps are up, while a 0 would mean they are down. A value of 1 on a different bit might tell the computer that the MAX’s problematic flight-control system called MCAS is engaged, while a 0 would indicate it is not.

This isn’t as alarming as it may sound. There are standard ways to protect against such bit flips having any dangerous impact on an airplane system, and FAA regulations require that this possibility be accounted for in the design of all critical electronics on board aircraft. The simulator sessions in June were designed to test for any such vulnerability.

During the tests, 33 different scenarios were artificially induced by deliberately flipping five bits on the microprocessor, an error rate determined appropriate by prior analysis. For all five bits, each 1 became a 0 and each 0 became a 1. This is considered a single fault, on the assumption that some cause, whether cosmic rays or something else, might cause the five bits to all flip at once.
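The flipping described here amounts to XOR-ing a status word with a mask. The sketch below assumes a made-up five-bit status word; the flag names and bit positions are hypothetical and do not come from the real flight-control computer.

```python
# Hypothetical status flags; names and bit positions are invented.
FLAPS_UP = 1 << 0
MCAS_ENGAGED = 1 << 1
AUTOPILOT_ON = 1 << 2
STAB_TRIM_ACTIVE = 1 << 3
COLUMN_CUTOUT_ENABLED = 1 << 4


def flip_bits(word: int, mask: int) -> int:
    """Invert every bit of `word` selected by `mask` (1 becomes 0, 0 becomes 1)."""
    return word ^ mask


def is_set(word: int, flag: int) -> bool:
    """Report whether a given flag bit is 1 in the status word."""
    return bool(word & flag)


# Correct state: flaps up, MCAS not engaged, column cutout switches enabled.
status = FLAPS_UP | COLUMN_CUTOUT_ENABLED

# A single upset on the MCAS bit makes the computer believe MCAS is engaged.
single_upset = flip_bits(status, MCAS_ENGAGED)
print(is_set(status, MCAS_ENGAGED), is_set(single_upset, MCAS_ENGAGED))

# Five bits inverted at once, treated as one fault as in the simulator tests.
five_bit_upset = flip_bits(status, 0b11111)
print(format(status, "05b"), format(five_bit_upset, "05b"))
```

Note that the software reading the word afterwards behaves exactly as designed; only the data it is reading has changed.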

For these simulations, the five bits flipped were chosen in light of the two deadly crashes to create the worst possible combinations of failures to test if the pilots could cope.

In one scenario, the bits chosen first told the computer that MCAS was engaged when it wasn’t. This had the effect of disabling the cut-off switches inside the pilot-control column, which normally stop any uncommanded movement of the horizontal tail if the pilot pulls in the opposite direction. MCAS cannot work with those cut-off switches active and so the computer, fooled into thinking MCAS was operating, disabled them.

1st Aug 2019, 19:23

#1682 (permalink)

groundbum

Join Date: Dec 2001

Location: Leeds, UK

Posts: 281

Likes: 0

Received 0 Likes on 0 Posts

As an IT engineer, I'd say this change from flip-flop to dual redundancy is massive, and unless work started years ago it will NOT be ready for September. And if it has been rushed, then it needs full end-to-end testing, as it's such a fundamental change. There's a philosophy in coding that for every two things you fix, you break something else. It's the nature of the beast.

G

1st Aug 2019, 19:51

#1683 (permalink)

BDAttitude

Join Date: Apr 2019

Location: EDSP

Posts: 334

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by groundbum

Agreed.
The big unknown is whether there is a code baseline that already incorporates this feature (e.g. from other projects) and only needs to be configured for the 737 and built. I still can't see when the validation is supposed to take place. Certification must not replace validation.
Highly dubious schedule.

Edit:
From the original fault description (AP disconnect sluggish after one of the two uCs in an FCC faulted), this does not sound like a fix to the actual problem ... now they let the second FCC take over. I'm not all that impressed. Not that I think the architecture change isn't a leap forward, but for the specific problem it seems like curing symptoms.

Last edited by BDAttitude; 1st Aug 2019 at 20:02.

1st Aug 2019, 20:04

#1684 (permalink)

royalflash

Join Date: Aug 2019

Location: Above Ground, Under Sky

Posts: 1

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by groundbum

Very true, but maybe there is something about the scope that makes Boeing believe it can be done faster. Full monitoring over everything would surely take years to develop and certify, but if they are selectively applying cross-monitoring to certain functions, it could be done fairly quickly.

1st Aug 2019, 21:04

#1685 (permalink)

thf

Join Date: May 2014

Location: living room

Posts: 47

Likes: 1

Received 0 Likes on 0 Posts

Quote:

Originally Posted by Seattle Times, Aug 1 2019

As the FAA re-evaluates and recertifies the updated flight-control systems, it has specifically rejected Boeing’s assumption that the plane’s pilots can be relied upon as the backstop safeguard in scenarios such as the uncommanded movement of the horizontal tail involved in both the Indonesian and Ethiopian crashes. That notion was ruled out by FAA pilots in June when, during testing of the effect of a glitch in the computer hardware, one out of three pilots in a simulation failed to save the aircraft.

That shouldn't be overlooked, after all these long discussions about pilot error.

1st Aug 2019, 21:10

#1686 (permalink)

tdracer

Join Date: Jul 2013

Location: Everett, WA

Age: 68

Posts: 4,395

Likes: 97

Received 180 Likes on 88 Posts

Quote:

Originally Posted by groundbum

This is known as "Single Event Upset" or SEU. Although rare, as processors and memory get smaller and more dense, the probability of SEU goes up. In early FADECs, SEU was pretty much unheard of; during the 747-8 flight testing we found evidence of SEU a little less than once every 100 aircraft flight hours (four engines, two channels per FADEC).
SEU is fairly easy to deal with in modern FADECs. Basically, you do continuous parity or checksum checks - if it suddenly fails the check it's assumed to be SEU and the channel resets (a reset takes about a second, the other channel will take over if necessary - worst case there may be a short, temporary thrust loss if the opposite channel is incapable of taking over). However, the continuous checking has a small impact on the processor throughput capability - if they are already close on throughput margin, adding the checks could be a problem.
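A rough sketch of the check-reset-failover idea described above, assuming a CRC-32 over a block of channel data and a "reload from a known-good copy" reset. The `Channel` class and `monitor_step` function are hypothetical and are not taken from any real FADEC.

```python
import zlib


class Channel:
    """One control channel with a data block and a stored CRC (illustrative)."""

    def __init__(self, name: str, data: bytes):
        self.name = name
        self.data = bytearray(data)       # working memory, may get corrupted
        self._known_good = bytes(data)    # assumed protected copy used on reset
        self.crc = zlib.crc32(self.data)  # reference checksum
        self.resets = 0

    def check(self) -> bool:
        """Continuous integrity check: recompute the CRC and compare."""
        return zlib.crc32(self.data) == self.crc

    def reset(self) -> None:
        """Model a channel reset by reloading the known-good data."""
        self.data = bytearray(self._known_good)
        self.resets += 1


def monitor_step(active: Channel, standby: Channel) -> Channel:
    """Return the channel that should be in control after one check pass."""
    if active.check():
        return active
    active.reset()                        # assume an upset and reset the channel
    return standby if standby.check() else active


a = Channel("A", b"\x10\x20\x30\x40")
b = Channel("B", b"\x10\x20\x30\x40")

a.data[2] ^= 0x10                         # inject a single bit flip into channel A
in_control = monitor_step(a, b)
print(in_control.name, a.resets, a.check())
```

Every call to `check()` costs processor time, which is the throughput concern raised above: the more data is covered and the more often it is checked, the less margin is left for everything else.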

1st Aug 2019, 22:05

#1687 (permalink)

OldnGrounded

Thread Starter

Join Date: Apr 2015

Location: Under the radar, over the rainbow

Posts: 788

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by groundbum

Yup. It is almost inconceivable that an architectural change as fundamental as this could be implemented, tested and placed into service in anything like the reported time frame -- at least not properly implemented and tested. And the FAA would be way out on a limb if it permitted such a thing to happen.

1st Aug 2019, 22:40

#1688 (permalink)

mrdeux

Join Date: Nov 2008

Location: Melbourne, Australia

Age: 68

Posts: 365

Likes: 2

Received 7 Likes on 1 Post

Quote:

Originally Posted by Zeffy

From the AvWeek article:
Sounds as if the "bypass" is done via software and could be re-enabled via software, no?

Some years ago, I discovered a way to reliably, and repeatedly, make a 767, with autopilot engaged in VNAV, fly through the MCP altitude. I reported this to our tech people, who passed it along to Boeing. Very quickly I heard that they’d been able to replicate it in a system sim, and a red bulletin was soon issued. It was fixed in an update a few months later.

Fast forward ten years, and I was now flying the 747. An update came out, and lo and behold, the MCP bug had reappeared. Apparently the software had simply been modified to bypass the offending code, and a later update had removed the bypass.

The point is that the software fix itself was not permanent.

1st Aug 2019, 23:06

#1689 (permalink)

david340r

Join Date: Aug 2015

Location: UK

Posts: 0

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by Zeffy

https://www.seattletimes.com/busines...ight-controls/

A small point, but this seems to imply that neutrons are cosmic rays. I believe primary cosmic rays are rarely neutrons. However, when cosmic rays strike the upper atmosphere they can create a shower of secondary particles, commonly including neutrons. As semiconductor devices have become smaller, the charge they hold is smaller and therefore a "bit flip" is easier.

1st Aug 2019, 23:06

#1690 (permalink)

Loose rivets

Psychophysiological entity

Join Date: Jun 2001

Location: Tweet Rob_Benham Famous author. Well, slightly famous.

Age: 84

Posts: 3,270

Likes: 11

Received 33 Likes on 16 Posts

Clementine, a mass of thread history has been collated on PPRuNe's Tech Log section, one click down, and in the form of 'Stickies'. There were many thousands of posts prior to this thread starting.

One of the most extraordinary things in this whole sad affair is the high probability that the two detector faults were not mechanically/electrically the same, which in a way makes the second accident a bewildering coincidence. That a single failure of this detector information, for whatever reason, could cause such chaos is of course a prime issue.

Over the last months I have found the reports in the Seattle Times to be of outstanding quality.

1st Aug 2019, 23:29

#1691 (permalink)

groundbum

Join Date: Dec 2001

Location: Leeds, UK

Posts: 281

Likes: 0

Received 0 Likes on 0 Posts

Re the suggestion that the FCC changes could be done quickly, in place and piecemeal:

It was this philosophy of quickly dropping MCAS on top of existing hardware, rather than doing a clean sheet design that started the whole bloodshed. Surely B would have learnt the lesson already, that critical changes need doing properly?

G

2nd Aug 2019, 00:23

#1692 (permalink)

Fly Aiprt

Join Date: Mar 2019

Location: French Alps

Posts: 326

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by david340r

Right about the neutron.
Bit flips are rather caused by high energy charged particles or nuclei. Or X-rays, Gamma etc.
But this is no big deal in an aviation related article ;-)

2nd Aug 2019, 00:35

#1693 (permalink)

Loose rivets

Psychophysiological entity

Join Date: Jun 2001

Location: Tweet Rob_Benham Famous author. Well, slightly famous.

Age: 84

Posts: 3,270

Likes: 11

Received 33 Likes on 16 Posts

Quote:

A neutron hitting a cell on a microprocessor can change the cell’s electrical charge, flipping its binary state from 0 to 1 or from 1 to 0. The result is that although the software code is right and the inputs to the computer are correct, the output is corrupted by this one wrong bit.

So for example, a value of 1 on a single bit might indicate that the jet’s wing flaps are up, while a 0 would mean they are down. A value of 1 on a different bit might tell the computer that the MAX’s problematic flight-control system called MCAS is engaged, while a 0 would indicate it is not.

This isn’t as alarming as it may sound . . .

. . . for these simulations, the five bits flipped were chosen in light of the two deadly crashes to create the worst possible combinations of failures to test if the pilots could cope.

I'm only an electronics amateur, but for such important information, wouldn't entire packets of data be needed to convey say, flap position or things of equivalent importance? How else would you do a checksum?
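As a rough illustration of the question, a single discrete such as flap position is typically carried inside a larger word that also holds check bits. The sketch below assumes a generic 32-bit word with 31 payload bits and one odd-parity bit; the layout is invented here and is not the 737's actual data format.

```python
def parity_of(value: int) -> int:
    """Return 1 if `value` has an odd number of 1 bits, else 0."""
    return bin(value).count("1") % 2


def encode(payload: int) -> int:
    """Pack a 31-bit payload with an odd-parity bit in bit 31 (assumed layout)."""
    if not 0 <= payload < (1 << 31):
        raise ValueError("payload must fit in 31 bits")
    parity_bit = 1 - parity_of(payload)   # make the total count of 1s odd
    return (parity_bit << 31) | payload


def is_valid(word: int) -> bool:
    """A word is accepted only if its total number of 1 bits is odd."""
    return parity_of(word) == 1


word = encode(0b101)                      # payload bit 0 could mean "flaps up"
one_flip = word ^ (1 << 1)                # a single upset: detected
two_flips = word ^ 0b110                  # two upsets: parity alone misses it
print(is_valid(word), is_valid(one_flip), is_valid(two_flips))
```

A single parity bit catches any odd number of flipped bits and misses any even number, which is one reason stronger checks such as CRCs, or comparison against a second channel, are used for critical data.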

Quote:

In one scenario, the bits chosen first told the computer that MCAS was engaged when it wasn’t. This had the effect of disabling the cut-off switches inside the pilot-control column, which normally stop any uncommanded movement of the horizontal tail if the pilot pulls in the opposite direction. MCAS cannot work with those cut-off switches active and so the computer, fooled into thinking MCAS was operating, disabled them.

I'm at a loss. It has to be said that in the early days of these threads, there were clear statements that one of the column switches had been removed, but then a skilled poster showed the circuitry which seemed to say it was just circuit logic that had been altered. However, I don't recall a time when MCAS was disabled by pulling back, it being said that it would mean pulling would negate the very function needed at that time. Of course, we'd have to be completely sure what 'uncommanded' means. And so it goes on.

2nd Aug 2019, 02:42

#1694 (permalink)

Mad (Flt) Scientist

Join Date: Sep 2002

Location: La Belle Province

Posts: 2,179

Likes: 0

Received 0 Likes on 0 Posts

I think some people are overestimating the difficulty of a software design change.

Having been part of a significant software redesign of a flight control system for a part 25 aircraft, which addressed a multitude of failure cases (including some we found in the course of the redesign and the associated design reviews) and which included some fundamental architectural changes, easily of greater scope than going from flip-flop alternating single input to dual inputs, and which took us from incident, through grounding, return to test flight, (re)certification and EIS inside a 12 month period, with frankly an order of magnitude less resources than Boeing can put on this task, I have to say that the timescales are more than achievable.

What appears (from the outside) to be delaying a return to flight status isn't the complexity of the task, frankly. It's FAA now going into complete CYA mode and every other decision during the MAX certification being dragged out and placed under a microscope. With the people looking through the microscope (who are not just the FAA, or even industry authorities, but every politician or journo sensing a news opportunity) sometimes having little conception of how the delegated/overseen certification process is supposed to work. (And has worked well for years)

2nd Aug 2019, 09:56

#1695 (permalink)

clearedtocross

Join Date: Jul 2007

Location: Switzerland

Age: 78

Posts: 109

Likes: 6

Received 7 Likes on 2 Posts

Sorry, Mad Scientist, but I have to disagree with your optimistic view of the timescale. What you did - obviously successfully - in 12 months cannot be done in three months just by multiplying the resources by any factor > 4. You cannot get a baby within one month by getting nine ladies pregnant. There are steps in the redesign process that need time, like finding out what was really wrong in the first place, specifying the changes, specifying the required tests (on all levels implicated), changing the code, and testing stand-alone and integrated on all levels affected. The fix has been promised ready (we are in the holding now!) by the management of Boeing about as many times as the new airport BER in Berlin was promised to open. None of those promises had any bearing on reality, because the management either had no clue or consists of compulsive liars, possibly both. How can they fix a time schedule when the scope of problems is getting wider and wider at a daily rate?

2nd Aug 2019, 10:35

#1696 (permalink)

BDAttitude

Join Date: Apr 2019

Location: EDSP

Posts: 334

Likes: 0

Received 0 Likes on 0 Posts

Not to forget the tendency to gather uncritical yes-men throughout the management hierarchy.
That Aspen incident speaks volumes.

2nd Aug 2019, 10:50

#1697 (permalink)

etudiant

Join Date: May 2011

Location: NEW YORK

Posts: 1,352

Likes: 0

Received 1 Like on 1 Post

Quote:

Originally Posted by Mad (Flt) Scientist

That all may be quite true, but the CYA mode was surely engaged by not only the FAA, but also by the other regulators, following an egregious misuse of the delegation/certification process.
So Boeing is now the crash test dummy for the new regulatory regime that is under construction by the relevant authorities. Getting everyone on the same page in this will likely be a slow process.
It is not apparent how sensitive those authorities will be to commercial pressures, but those will diminish over time as airlines adjust to the absence of the MAX in their fleet planning.
I'd not hold my breath waiting for a return to flight.

2nd Aug 2019, 10:58

#1698 (permalink)

RickNRoll

Join Date: Jul 2013

Location: Australia

Posts: 305

Likes: 1

Received 7 Likes on 5 Posts

Quote:

Originally Posted by Fly Aiprt

Right about the neutron.
Bit flips are rather caused by high energy charged particles or nuclei. Or X-rays, Gamma etc.
But this is no big deal in an aviation related article ;-)

I was scratching my head. A neutron has no charge but it has energy. A proton or antiproton has charge.

2nd Aug 2019, 12:27

#1699 (permalink)

Peter H

Join Date: Jun 2008

Location: Cambridge UK

Posts: 192

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by thcrozier

Clementine Cheetham:

I suggest you read the following for some insight on how these kinds of things happen. It's not an isolated incident.

https://science.ksc.nasa.gov/shuttle...-contents.html

From Chapter 6:

An interesting observation on the decision process:
The Challenger: An Information Disaster https://www.asktog.com/books/challengerExerpt.html

2nd Aug 2019, 15:08

#1700 (permalink)

hoistop

Join Date: Apr 2008

Location: Europe

Posts: 162

Likes: 6

Received 0 Likes on 0 Posts

I couldn't help adding this passage to the debate. I took it from the BEA serious incident report on Falcon 7X HB-JFN in May 2011 (runaway THS on descent to Kuala Lumpur).
A single cold solder joint in the THS control computer nearly killed them. Disaster was avoided by the copilot, a former military pilot, who recognised the problem, reacted correctly, yelled at his captain (who had instinctively tried to intervene but only made things worse) and eventually saved the plane.
How was such a deadly dormant failure allowed to happen? Think about this:

1.18.1.5 Human factors affecting safety analysis
An enormous amount of effort has been put into studying the human factors issues for crews, air traffic controllers and aircraft maintenance personnel. However the ATSB report found very little research that has examined the human factors issues affecting design engineers and safety analysts or the factors likely to lead to errors in design.
The ability to detect errors in design or judge whether a fault tree is complete can be affected by a range of different factors, such as experience, available knowledge, task complexity and the fact that omissions are relatively difficult to detect. The final result can also be affected by time-related pressure from the organisation’s activity and industrial programme deadlines.

hoistop