PPRuNe Forums

MAX’s Return Delayed by FAA Reevaluation of 737 Safety Procedures (https://www.pprune.org/rumours-news/621879-max-s-return-delayed-faa-reevaluation-737-safety-procedures.html)

Piper_Driver 5th Aug 2019 21:23

Back in ancient times I designed memory systems to be resistant to bit flipping involving cosmic rays. It was in conjunction with a computer that used a 286 processor, the same one in use in the suspect flight systems. Our solution was to use error detecting and correcting memory. This architecture used extra memory bits that would allow any single bit error in a memory word to be corrected on the fly. The technology was mature at the time the flight control computers were developed. Does anyone know if memory correction technology was used on the Boeing flight control computers? If so it would rule out random bit flips as an error condition.
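As an illustration of the single-bit correction described in the post above, here is a minimal sketch of a Hamming(7,4) code in Python. It is not the scheme used in any particular memory system or flight control computer; the function names are made up for this example, and real ECC memory typically uses wider codes with an extra overall parity bit so that double-bit errors are detected rather than miscorrected.

```python
from typing import Tuple


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword positions (1-indexed): p1 p2 d0 p4 d1 d2 d3.
    Bit i of the returned integer holds position i + 1.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(word: int) -> Tuple[int, int]:
    """Return (data nibble, syndrome) for a 7-bit word.

    A syndrome of 0 means no error was seen; otherwise it is the
    1-indexed position of the single bit that was flipped back.
    """
    bits = [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # correct the single-bit error in place
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


# Usage: store a value, flip one bit "in memory", read it back corrected.
stored = hamming74_encode(0b1011)
corrupted = stored ^ (1 << 4)  # simulate a single upset at position 5
value, syndrome = hamming74_decode(corrupted)
```

With this particular code, two simultaneous flips in the same word would be "corrected" to the wrong value, which is one reason no such scheme can rule out every bit-flip scenario on its own.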

HighWind 5th Aug 2019 22:17


Originally Posted by Piper_Driver (Post 10537633)
Our solution was to use error detecting and correcting memory. This architecture used extra memory bits that would allow any single bit error in a memory word to be corrected on the fly.
…. Does anyone know if memory correction technology was used on the Boeing flight control computers? If so it would rule out random bit flips as an error condition.

No, this only protects against bit flips in memory; there is also a risk that the CPU registers get corrupted.
For this you need two ordinary CPUs, or a lock-step CPU like the TMS570.
A lock-step CPU has error correction (ECC) on memory, and two CPUs running as one.
The second CPU runs the same machine code instructions as the first one, but delayed by one clock cycle.
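The lock-step idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the TMS570 or any flight control computer is implemented: the "cores" are just two copies of a register, the instruction set is a list of plain functions, the one-cycle delay is not modelled, and `run_lockstep`, `fault_step` and `fault_mask` are hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple

Instruction = Callable[[int], int]  # maps a register value to its new value


def run_lockstep(
    program: Sequence[Instruction],
    initial: int = 0,
    fault_step: Optional[int] = None,
    fault_mask: int = 0,
) -> Tuple[Optional[int], Optional[int]]:
    """Run the same program on two simulated cores and compare every step.

    Returns (final_value, None) when the cores agree throughout, or
    (None, step) at the first step where their registers differ.
    """
    reg_a = reg_b = initial
    for step, instr in enumerate(program):
        reg_a = instr(reg_a)
        reg_b = instr(reg_b)
        if step == fault_step:
            reg_a ^= fault_mask  # hypothetical upset that hits core A only
        if reg_a != reg_b:
            return None, step  # comparator trips: result must not be used
    return reg_a, None


program = [lambda r: r + 5, lambda r: r * 3, lambda r: r - 1]
clean = run_lockstep(program)
upset = run_lockstep(program, fault_step=1, fault_mask=0b100)
```

Note that the comparison only detects a divergence; it does not say which core is right, so the usual response is to flag the output as invalid rather than to correct it.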

GordonR_Cape 5th Aug 2019 22:24


Originally Posted by Piper_Driver (Post 10537633)
Back in ancient times I designed memory systems to be resistant to bit flipping involving cosmic rays. It was in conjunction with a computer that used a 286 processor, the same one in use in the suspect flight systems. Our solution was to use error detecting and correcting memory. This architecture used extra memory bits that would allow any single bit error in a memory word to be corrected on the fly. The technology was mature at the time the flight control computers were developed. Does anyone know if memory correction technology was used on the Boeing flight control computers? If so it would rule out random bit flips as an error condition.

Please read the link quoted above, it gives a lot of useful details:

Originally Posted by BDAttitude (Post 10537583)
http://www.cs.toronto.edu/~bianca/pa...gmetrics09.pdf
For those interested in memory corruption.

Specifically:
There are error correction schemes that can detect and correct single bit flips (and schemes for multiple bit errors).
DRAM errors come in many types, some correctable, others not, and no scheme is foolproof.
In some cases there may be permanently stuck bits (on or off).
There is a substantial correlation between errors, such that they are neither random nor independent.
Cosmic rays are only one potential cause, simple ageing of hardware is also a factor.

Given these facts, it seems unreasonable to rely on statistical improbability to excuse a single point of failure with catastrophic consequences.

All of the news reports indicate that Boeing have accepted the FAA ruling on this. IMO further argument is futile, though more details might be enlightening.

etudiant 5th Aug 2019 23:27

Before everyone gets sidetracked on fixing hugely unlikely problems such as 5 bits flipping exactly wrong, should one not focus on the elephants in the room, such as the trim wheel that is too hard to turn and too slow to act, or the prospect of sensor failure/misinstallation/miswiring etc.?
I don't believe there is any indication that either crash was due to a flight computer upset, so it is irrelevant whether the flight computer will be improved until the rest of the system has been brought up to snuff.
Yet there is thus far no substantial public comment on the proposed fixes of the lethal defects demonstrated to exist.
If Boeing wants to ensure the plane does not return to service, they are going about it the right way.

phylosocopter 6th Aug 2019 01:48

As far as I am concerned, the "elephant in the room" has been obscured for so long that it has been forgotten about: the state of alarms and displays when the computer gives up. Why is no one talking about this? Why is this not regulated? This matter underlies so many major accidents, both A and B, and yet it is not being addressed. There NEEDS to be a STANDARD display and alarm state for computer disconnect. It is ridiculous to expect pilots to troubleshoot IT systems before they can start on aviate, navigate, communicate. How is it possible that aircraft are allowed to dump control without silencing every alarm that might possibly be spurious and blanking every display that may be incorrect? What is going on?!

Tomaski 6th Aug 2019 01:55

Interesting technical discussion as to how the faults happen but as an operator I'm more interested in what I'm likely to see if the FCC starts going sideways and what I can do about it.

Okay, first stipulating that the recent test failure had nothing to do with what happened to the accident aircraft, it does bring up an interesting issue. The test simulated a runaway trim with the A/P engaged. The first sign of this problem would likely be a "Stab Out of Trim" light illuminating on the forward instrument panel. This light indicates that the elevator deflection has exceeded a certain amount because the stabilizer is not properly trimmed. The structure of the non-normal checklist seems to assume that the problem may be that the A/P is not trimming sufficiently for the airspeed changes and NOT that there is a runaway trim. Specifically, if the trim wheel is moving, the checklist basically says do nothing because it assumes that the A/P is catching up with the trim changes. Otherwise, if the trim is not moving, then disengage the A/P and trim with the yoke switches. This action assumes the A/P was not trimming when it should have (during a change in speed, a turn, etc.), rather than trimming when it shouldn't have, as in a runaway trim.

Thus the current procedures actually lead down the path of not intervening if the trim wheel is moving with the A/P engaged and a "Stab Out of Trim" light illuminated, which would actually delay the pilot response to a runaway trim. At some point the pilot would have to make the determination that it is actually a runaway, OR the A/P will eventually disconnect with a gross out-of-trim condition. I could see how this could create a potentially hazardous condition that would actually be harder to recover from than a runaway with the A/P off. Perhaps this NNC (non-normal checklist) now needs to be reevaluated.

Tomaski 6th Aug 2019 01:57


Originally Posted by phylosocopter (Post 10537740)
As far as I am concerned, the "elephant in the room" has been obscured for so long that it has been forgotten about: the state of alarms and displays when the computer gives up. Why is no one talking about this? Why is this not regulated? This matter underlies so many major accidents, both A and B, and yet it is not being addressed. There NEEDS to be a STANDARD display and alarm state for computer disconnect. It is ridiculous to expect pilots to troubleshoot IT systems before they can start on aviate, navigate, communicate. How is it possible that aircraft are allowed to dump control without silencing every alarm that might possibly be spurious and blanking every display that may be incorrect? What is going on?!

^^^^^^^^^^^^
Absolutely needs to be addressed. Nothing more annoying than a false alert that cannot be silenced.

safetypee 6th Aug 2019 06:48

#1786, “At some point the pilot would have to make the determination that it is actually a runaway…”
This is central to the concern about piloting contributions in alleviating malfunctions.
How is that point determined? How long does it take, how severe is the malfunction during that period, and by what means is the failure detected: explicit alerting, deduction, knowledge, … experience?
Training does not guarantee that the required activity will be recalled or actioned, particularly in rare and surprising situations.
It is difficult to change the human condition; thus, eliminate the extreme situations to which pilots could be exposed.

Seamless 6th Aug 2019 07:40

This might be a stupid question, but anyway: as far as I understand, the two FCCs of any 737 in service do not check each other. Boeing now wants to change this, so that both FCCs are constantly checking each other. If so, won't they have twice the workload, or rather: won't they need to be checked / serviced twice as often as before (twice the running time)?

There are so many more aspects which would scare me, just because they now need to change a proven system. And changing things always carries the danger of new faults.

PerPurumTonantes 6th Aug 2019 07:45


Originally Posted by phylosocopter (Post 10537740)
As far as I am concerned, the "elephant in the room" has been obscured for so long that it has been forgotten about: the state of alarms and displays when the computer gives up. Why is no one talking about this? Why is this not regulated? This matter underlies so many major accidents, both A and B, and yet it is not being addressed. There NEEDS to be a STANDARD display and alarm state for computer disconnect. It is ridiculous to expect pilots to troubleshoot IT systems before they can start on aviate, navigate, communicate. How is it possible that aircraft are allowed to dump control without silencing every alarm that might possibly be spurious and blanking every display that may be incorrect? What is going on?!

You have hit a particularly important nail right on the head, Sir. Ever seen any stage magicians work? Everything they do is misdirection. They can fool people right up close to them, or even an entire audience of hundreds. Humans will always fall for it. We are terrible at noticing the gorilla in the room.

With ET and Lion Air, the misdirection was the alarms and stick shaker, while MCAS sneakily ran the trim down in the background. If the computers had dumped control with a sensible, calm "AOA disagree" message, I think there's a good chance those crews would have saved the aircraft.

This does not absolve Boeing in any way of responsibility for the MCAS (and other systems) design and implementation shambles. But it is a very good point that phylosocopter makes, and one that urgently needs addressing.

autoflight 6th Aug 2019 11:01

Is anybody actually considering the full outcome of the MAX never returning to service?

sky9 6th Aug 2019 11:09


Originally Posted by autoflight (Post 10538039)
Is anybody actually considering the full outcome of the MAX never returning to service?

Airbus perhaps?

Luc Lion 6th Aug 2019 11:29


Originally Posted by etudiant (Post 10537699)
Before everyone gets sidetracked on fixing hugely unlikely problems such as 5 bits flipping exactly wrong,

If you re-read the Seattle Times article, you'll notice that only 2 bit flips are required to enact the scenario.
Five bits were actually flipped because this is the standard procedure for this category of tests: flipping what is considered the most extreme and improbable data corruption, 5 bits simultaneously.

Peter H 6th Aug 2019 11:29


Originally Posted by GordonR_Cape (Post 10537958)
AFAIK the dual FCC redesign on the MAX would revert to the NG (grandfather) scenario with the trim wheels. What the fix should do is stop any computer-generated runaway trim (whether from MCAS or autopilot).

As SLF and a retired s/w engineer I find the discussion of high-reliability computer systems interesting, but feel that in this case it is misplaced. MCAS's use of trim to provide feel may have "seemed like a good idea at the time", a cheap and cheerful sticking plaster. But once you consider the costs of implementing it to certifiable levels of safety, it's simply a blind alley.

If you don't use trim to solve a feel requirement, then MCAS-triggered trim runaway is impossible. Use a stick-pusher type mechanism and surely inappropriate MCAS activation becomes a minor embarrassment. (All the crews seem to have managed the situation for several minutes; if all they had to do was pull back strongly on the stick...)

Regards, Peter

PS Of course this would still leave a few issues that need attention, such as:
- Due process for any criminal/professional derelictions of duty.
- Reducing the myriad of warnings triggered by a failing AoA probe, or at least simplifying their handling.
- Telling the pilots about MCAS (and providing at least minimally adequate training).
- Fix the non-functioning AoA-disagree warning (and ensure that maintenance protocols would actually test that it works?).
- Discover why MAX AoA probes seem to be failing so frequently (and if this high rate invalidates any probability-based safety assessments).
- Fix the emergency-trim system so it can be operated in emergencies.

PPS If you want to argue that the MAX needs high-reliability computers to run any software handling of the stab, surely this argument also applies to other variants of the 737.

MurphyWasRight 6th Aug 2019 11:57


Originally Posted by Luc Lion (Post 10538068)
If you re-read the Seattle Times article, you'll notice that only 2 bit flips are required to enact the scenario.
Five bits were actually flipped because this is the standard procedure for this category of tests: flipping what is considered the most extreme and improbable data corruption, 5 bits simultaneously.

Also, before getting deep into the probabilities of N bits flipping, it seems to me that the '5 bit flip' test is also a rational stand-in for "something happens we have not thought about".

In other words, it is impossible to contemplate, model and test every failure path, but it is still vital to know the outcome of failures no matter what the triggering event.
There could be any number of paths that result in the Seattle Times scenario, most of which would be hard or impossible to predict in advance.
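To make the scale of "test every failure path" concrete, here is a small fault-injection sketch in Python. Everything in it is an assumption for illustration only: the 8-bit state word, the flag positions and the `is_hazardous` predicate are invented and have nothing to do with the actual FCC memory layout or the tests reported by the Seattle Times.

```python
from itertools import combinations
from math import comb
from typing import Callable, Iterable, Tuple


def flip_bits(word: int, positions: Iterable[int]) -> int:
    """Return word with the bits at the given positions inverted."""
    for p in positions:
        word ^= 1 << p
    return word


def count_hazardous(
    word: int, width: int, k: int, is_hazardous: Callable[[int], bool]
) -> Tuple[int, int]:
    """Try every k-bit flip of a width-bit word.

    Returns (number of flips judged hazardous, number of flips tried).
    """
    hits = sum(
        1
        for combo in combinations(range(width), k)
        if is_hazardous(flip_bits(word, combo))
    )
    return hits, comb(width, k)


# Hypothetical predicate: the state is hazardous when made-up flags at
# bit 0 and bit 3 are both set.
def is_hazardous(state: int) -> bool:
    return bool(state & 0b0001) and bool(state & 0b1000)


two_bit = count_hazardous(0, 8, 2, is_hazardous)
five_bit = count_hazardous(0, 8, 5, is_hazardous)
```

Exhaustive enumeration is only practical for tiny state: a single 32-bit word already has 201,376 distinct five-bit flips, which is why a deliberately chosen worst-case combination is a reasonable stand-in for the paths nobody thought of.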

---
On the dual FCC fix:
Reading between the lines I sense that this might be implemented with simple "both say move or nothing happens" logic, either internal or could even be done with external HW (relay or other) close to the trim motor inputs.
This would be a lot less work than truly coupling the 2 in a high reliability cross checking mode.
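The "both say move or nothing happens" logic described above amounts to an agreement gate. A minimal Python sketch follows; the `Trim` enum and `gated_trim` function are hypothetical names, and this is a guess at the general shape of such logic, not Boeing's actual design.

```python
from enum import Enum


class Trim(Enum):
    """Hypothetical trim commands a single FCC channel might issue."""

    NONE = 0
    NOSE_UP = 1
    NOSE_DOWN = -1


def gated_trim(cmd_fcc_a: Trim, cmd_fcc_b: Trim) -> Trim:
    """Pass a trim command to the motor only when both channels agree.

    Any disagreement results in no movement (fail-passive behaviour).
    """
    return cmd_fcc_a if cmd_fcc_a == cmd_fcc_b else Trim.NONE


# One channel corrupted, one healthy: the motor does not move.
blocked = gated_trim(Trim.NOSE_DOWN, Trim.NONE)
# Both channels agree: the command goes through.
passed = gated_trim(Trim.NOSE_UP, Trim.NOSE_UP)
```

A gate like this stops a runaway commanded by one corrupted channel, but it does nothing if both channels are fed the same bad input, and it trades away availability, since a single failed channel now blocks all automatic trim.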



Notanatp 6th Aug 2019 12:33


Originally Posted by Luc Lion (Post 10538068)
If you re-read the Seattle Times article, you'll notice that only 2 bit flips are required to enact the scenario.
Five bits were actually flipped because this is the standard procedure for this category of tests: flipping what is considered the most extreme and improbable data corruption, 5 bits simultaneously.

That's not what the article says:

"For these simulations, the five bits flipped were chosen in light of the two deadly crashes to create the worst possible combinations of failures to test if the pilots could cope."

"In one scenario, the bits chosen first told the computer that MCAS was engaged when it wasn’t. This had the effect of disabling the cut-off switches inside the pilot-control column, . . . A second bit was chosen to make the horizontal tail, also known as the stabilizer, swivel upward uncommanded by the pilot, which has the effect of pitching the plane’s nose down. Other bits were flipped to add three more complications."

So the article clearly says that all five bits were specifically selected to create the simulated scenario. The other three bits were not selected at random.

What is the basis for your statement that flipping five bits "is the standard procedure for this category of tests"?

As for MurphyWasRight's statement that the five-bit failure was a "rational stand-in for 'something happens we have not thought about'," my understanding is that the regulatory issue is that you cannot have a single failure that is catastrophic, and the FAA considered five simultaneous bit flips a single failure. If the only way to create the Seattle Times-reported scenario is to fail multiple components, then is it still a single failure?

bill fly 6th Aug 2019 12:48


Originally Posted by Peter H (Post 10538069)
As SLF and a retired s/w engineer I find the discussion of high-reliability computer systems interesting, but feel that in this case it is misplaced. MCAS's use of trim to provide feel may have "seemed like a good idea at the time", a cheap and cheerful sticking plaster. But once you consider the costs of implementing it to certifiable levels of safety, it's simply a blind alley.

If you don't use trim to solve a feel requirement, then MCAS-triggered trim runaway is impossible. Use a stick-pusher type mechanism and surely inappropriate MCAS activation becomes a minor embarrassment. (All the crews seem to have managed the situation for several minutes; if all they had to do was pull back strongly on the stick...)

Regards, Peter

PS Of course this would still leave a few issues that need attention, such as:
- Due process for any criminal/professional derelictions of duty.
- Reducing the myriad of warnings triggered by a failing AoA probe, or at least simplifying their handling.
- Telling the pilots about MCAS (and providing at least minimally adequate training).
- Fix the non-functioning AoA-disagree warning (and ensure that maintenance protocols would actually test that it works?).
- Discover why MAX AoA probes seem to be failing so frequently (and if this high rate invalidates any probability-based safety assessments).
- Fix the emergency-trim system so it can be operated in emergencies.

PPS If you want to argue that the MAX needs high-reliability computers to run any software handling of the stab, surely this argument also applies to other variants of the 737.

Been said many times on these threads, Peter. The logic of a separate dedicated feel trim compensator is indisputable.
But the software specialists are keeping their claws in deep.
Way back, British Airways helped certify the first European Cat III system. It took many actual approaches and a lot of flight hours until the reliability was proved (and incidentally held many folk up behind their min speed approaches) and the whole thing ran on relays.
Surprising what can/could be achieved without software - even in this day and age...

GlobalNav 6th Aug 2019 14:49


Originally Posted by bill fly (Post 10538136)


Been said many times on these threads, Peter. The logic of a separate dedicated feel trim compensator is indisputable.
But the software specialists are keeping their claws in deep.
Way back, British Airways helped certify the first European Cat III system. It took many actual approaches and a lot of flight hours until the reliability was proved (and incidentally held many folk up behind their min speed approaches) and the whole thing ran on relays.
Surprising what can/could be achieved without software - even in this day and age...

Bottom line? MCAS could have been designed with satisfactory architecture, integrity and reliability. It was not. A CAT I approach system cannot be made CAT III capable, certainly not by a few software changes or even the addition of a CPU or sensor. It has to be designed for the task from the ground up, with qualified subsystems, redundancies, and appropriate design assurance levels of software and complex hardware. I am not implying that MCAS is equivalent to a CAT III autoland system, either, but neither will MCAS become what it must be by a few tweaks.

OldnGrounded 6th Aug 2019 17:03


Originally Posted by Notanatp (Post 10538128)
As for MurphyWasRight's statement that the five-bit failure was a "rational stand-in for 'something happens we have not thought about'," my understanding is that the regulatory issue is that you cannot have a single failure that is catastrophic, and the FAA considered five simultaneous bit flips a single failure. If the only way to create the Seattle Times-reported scenario is to fail multiple components, then is it still a single failure?

Well, it's fairly likely that five simultaneous bit flips would have a single cause (cosmic rays/EMP), so yes, it should be considered a single failure. Pretty rare (at least), of course.


HighWind 6th Aug 2019 17:45


Originally Posted by Peter H (Post 10538069)
If you don't use trim to solve a feel requirement, then MCAS-triggered trim runaway is impossible.
PPS If you want to argue that the MAX needs high-reliability computers to run any software handling of the stab, surely this argument also applies to other variants of the 737.

If you remove the code for MCAS then it can't generate a runaway :)
But the Flight Control System has been connected to the trim for decades, and a bit flip in other functions of the Flight Control Computers might be able to generate a 'continuous' runaway. (Way easier to diagnose quickly than an intermittent runaway.)
I see MCAS as a change that took the reliability of the FCS from being several orders of magnitude better than specified by DAL C to about the limit of DAL C.

You can't complain when a DAL C design changes from e.g. DAL B to DAL C reliability following a design change. (The same could have happened due to e.g. a die shrink.)
Using one AoA sensor might be sufficient for DAL C, but not for the required DAL A.


All times are GMT.

