PPRuNe Forums (https://www.pprune.org/)
-   Rumours & News (https://www.pprune.org/rumours-news-13/)
-   -   American investigates as 777 engine fails to respond to throttle (https://www.pprune.org/rumours-news/315967-american-investigates-777-engine-fails-respond-throttle.html)

lomapaseo 16th Mar 2008 00:02


A small step in prevention would be to prohibit updating the software in both ECUs on the aircraft simultaneously.
Would you put two different tires on a dragster race car, or two different brands of spark plugs in a rotary engine?

The answer obviously depends on reliability versus the differences in performance. In the case of FADECs, the differences in performance are sure to be noticed and would need to be accommodated by specific pilot workload changes (not practical), while the change in reliability was expected to stay within a tolerance band well below the threshold of being noticed over the lifespan of the two FADECs. The decision is weighted and obvious given these conditions.

HarryMann 16th Mar 2008 00:52


Would you put two different tires on a dragster race car, or two different brands of spark plugs in a rotary engine?
IMHO, quite dreadful analogies :O

Not that I necessarily subscribe to the original suggestion either... but it deserves a better rebuttal, I think.

chksix 16th Mar 2008 09:28

Was just suggesting it as a precaution against installing fresh software with an unknown bug on both engines simultaneously.

aviate1138 16th Mar 2008 10:00

chksix said

"A small step in prevention would be to prohibit updating the software in both ECU's on the aircraft simultaneously."

Aviate opines....

Spot-on! When do updates ever work perfectly from the start? Some snag always crops up, doesn't it?

BEagle 16th Mar 2008 12:08

In Fate is the Hunter, Ernest K Gann describes an alarming incident in a fully-laden C-54 out of La Guardia when 3 out of 4 engines either failed or were on the point of failing - backfiring and overrevving like crazy. Somehow they got it on the ground after something like 3 minutes total flight time, never climbing above 50 ft.

It seems that the engineers had changed the plugs on 3 engines for a new experimental type...

And they later apologised because they hadn't had time to change the plugs on the fourth...:(

NEVER do simultaneous upgrades - whether hardware or software!!

Although the RN had a novel QA technique for double engine changes at sea - they would 'invite' the engineering officer to occupy the looker's seat for the flight test...:ok:

llondel 16th Mar 2008 15:40


It seems that the engineers had changed the plugs on 3 engines for a new experimental type...

And they later apologised because they hadn't had time to change the plugs on the fourth...:(

NEVER do simultaneous upgrades - whether hardware or software!!
What about changing the oil on all engines at the same time? I'm sure that got tried and screwed up once, so now, with perfect hindsight, it's not done to all engines at the same time. There are reasons why space shuttle flight computers have two different implementations of the spec - usually one contractor provides most of them and a second contractor does an independent implementation to the same spec, in case there are bugs. Of course, if the spec is wrong...
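
For those who haven't met the idea, a toy sketch in Python of that dual-implementation cross-check (all names and numbers invented, nothing to do with any real flight code): two teams code the same spec independently, and a comparator refuses to trust either answer silently.

Code:
def thrust_limit_team_a(temp_c, altitude_ft):
    # Team A's implementation of the (made-up) derate spec.
    return max(0.0, 100.0 - 0.002 * altitude_ft - 0.1 * temp_c)

def thrust_limit_team_b(temp_c, altitude_ft):
    # Team B's independent implementation of the same spec.
    return max(0.0, 100.0 - (altitude_ft / 500.0 + temp_c / 10.0))

def cross_check(temp_c, altitude_ft, tolerance=0.01):
    a = thrust_limit_team_a(temp_c, altitude_ft)
    b = thrust_limit_team_b(temp_c, altitude_ft)
    if abs(a - b) > tolerance:
        # Disagreement between independent implementations: flag it,
        # don't silently trust either answer.
        raise RuntimeError("implementations disagree: %s vs %s" % (a, b))
    return a

print(cross_check(15.0, 1000.0))   # 96.5

Of course, if the spec itself is wrong, both copies will cheerfully agree on the wrong answer.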

boaclhryul 16th Mar 2008 17:03


Originally Posted by aviate1138
When do updates ever work perfectly from the start? Some snag always crops up, doesn't it?

I don't read Tech Log all that often, but I don't recall seeing a raft of entries on aircraft software update issues. Are you making a broad comment that covers MS Windows etc., or is it truly your experience that all aircraft updates demonstrate some sort of "snag"? Not being critical, just curious...

aviate1138 16th Mar 2008 19:10

boaclhryul asked if I was making a broad comment....

It was broad, but look at the Chinook, the JSF and other machines that have had problems with software updates to their electronics.

The 777......

"If you read the entire AD, what it says is that the anomaly was introduced by an error in a certain software update for the FADEC on the LR. However, because the -300 has an identical software update, it could presumably cause the same problems on that aircraft as well......"
"Some would be surprised that the FAA should allow an ETOPS aircraft with a defect that reduces engine power by up to 77 percent on takeoff to be considered serviceable. In theory, an engine with FADEC version A.0.4.5 installed has a defect that can't be cleared and is therefore unserviceable. Others might wonder how such safety critical software can make it through the validation and verification regime into world-wide fleet service. Overall, it's shades of the previous GE90 "rollback" and IFSDs (inflight shutdowns) from earlier days. The only difference was in those cases, it was in cruise and was caused by moisture freezing in the P3B and PS3 lines to the FADEC, and it was resolved by increasing the tubing diameters. Perhaps the software now needs uppercase zeroes and ones in its coding -- or a larger pitch font."

Air Safety Week, Oct 9, 2006

PBL 17th Mar 2008 21:21


Originally Posted by llondel
There are reasons why space shuttle flight computers have two different implementations of the spec - usually one contractor provides most of them and a second contractor does an independent implementation to the same spec, in case there are bugs. Of course, if the spec is wrong...

Not so. It is quad-redundant with identical SW. There is a fifth computer, with different HW and SW, which can conduct an abort should everything go pear-shaped.
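
To caricature the difference in a few lines of Python (invented numbers, nothing remotely like the actual shuttle code): the primary set is several identical copies voted against one another, while a separately written backup sits outside the vote, available on demand.

Code:
def primary_channel(x):
    # One of four identical copies of the same primary software.
    return 2 * x + 1

def backup_channel(x):
    # Independently written to the same spec; engaged only on demand.
    return (x << 1) + 1

def primary_set_output(x, copies=4):
    outputs = [primary_channel(x) for _ in range(copies)]
    agreed = max(set(outputs), key=outputs.count)
    if outputs.count(agreed) < 3:
        raise RuntimeError("primary set lost - crew may engage backup")
    return agreed

print(primary_set_output(10))   # 21 from the voted primary set
print(backup_channel(10))       # 21, standing by outside the vote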

For the architecture of the space shuttle primary control, see various articles at NASA Office of Logic Design, in particular The Space Shuttle Primary Computer System, Communications of the ACM, 27(9), September 1984; Madden and Rone, Design, Development, Integration: Space Shuttle Primary Flight Software System, also CACM 27(9), September 1984. This is direct from the people who designed and built this system.

For a very much shorter comment, see Epstein, Risks 24.71; Morton, Risks 24.73; but especially Neumann, Risks 24.73 and Passy, Risks 24.74.

PBL

PBL 19th Mar 2008 08:29

aviate1138 quoted an article from Air Safety Week about SW problems. I don't recall the article, and David often talked to me before writing about computer-related things, so it may have been written after he retired from ASW.

There are some facts about SW quality which you won't find in textbooks and which it may be worthwhile to mention here.

First, most problems with critical SW (which includes all SW covered by DO-178B Levels A and B) arise from a mismatch between the operational environment and the environments envisaged in the specification of the (sub)system requirements. A study by Robin Lutz in the early 1990s of mission-critical failures of NASA systems came up with a figure of over 95%; studies by the UK HSE of all types of critical systems (simple to complex, mechanical to computer-based) showed 70%.

Second, the quality of critical SW itself (that is, the extent to which the SW will *not* fulfil its requirements) typically lies in the region of one error or more per KLOC (thousand lines of source code). The very best measured quality of which I know has been attained by SW written by the UK systems house Praxis HIS, which has a largish system that has demonstrated only one error per 25 KLOC in service. They also have one small system (10K-ish LOC) in service which has demonstrated no errors in a number of years of operation. Praxis uses so-called "Correct by Construction" (CbC) methods.

So in the EECs of the B777, which I am told contain on the order of hundreds of KLOC, we can quite normally expect a handful of errors in every release, somewhere between ten and a hundred. That is the reality. Most of those errors will not be discovered.
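
To put rough numbers on that (the 300 KLOC size is assumed purely for illustration; the densities are the ones above):

Code:
# Back-of-the-envelope estimate using the defect densities quoted above.
# The 300 KLOC size is assumed purely for illustration.
kloc = 300
typical_density = 1.0          # errors per KLOC: typical critical code
best_measured   = 1.0 / 25.0   # errors per KLOC: best measured in-service figure

print("typical process: ~%d latent errors" % (kloc * typical_density))
print("best measured:   ~%d latent errors" % (kloc * best_measured))

That prints roughly 300 and 12; the "ten to a hundred" expectation sits between those two figures.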

Many of us are not content with this reality and are actively working to propagate approaches to the quality assurance of critical SW-based systems different from those currently used.

PBL

arcniz 20th Mar 2008 08:47

As PBL describes, real errors will creep into software designs and updates with a certain inexorability, despite well-funded efforts by the smartest and best to prevent them during development and find them during deployment. The raw complexity of possible interactions in even moderately demanding real-time environments, combined with the fallibility of testing models, the limits of schedules and the push of operational pressures, increases the likelihood that slightly erroneous or simply mismatched software components will crawl into mission-critical systems. The staffing levels versus frequency of ops in Space and Military environments allow much higher levels of scrutiny than in Commercial and General aviation, but even the most careful systems teams occasionally discover unhappy surprises in software-hardware computing system designs and updates.

After an initial learning curve of months to years, the residual probability of in-use discovery of serious design flaws comes mostly from mods, changes and revisions. These are often installed "in the field" by workers who are less well trained and equipped than their fellows "at the factory". Variability of individual aircraft (differences between units creating subtle compatibility issues) is inherently greater in a working fleet for many reasons. Testing of mods and updates prior to release and during installation is necessarily limited and does not begin to approach the exhaustive top-to-bottom tests applied to system designs before initial operational release.

We have just about reached the point where electronic, component, and software technologies can provide high enough physical density and low enough incremental cost (per million gates, per terraflop or whatever) to permit greater redundancy in the mission-critical systems for commercial aircraft. Some of this bounty should be applied to simple redundancy - at least 3-way for anything important - but some of it should also be applied to more novel approaches to reliability. One of these would be to make systems more generally self-aware, detecting and tracking even small inconsistencies in operation among peers, for local alarm sensing and more comprehensive later analysis by info-grinding systems elsewhere.

If a triple-redundant control system were made quadruple or quintuple in design, then it might well be possible to have copies of one or two levels of prior maintenance revision running alongside the current one in a "monitoring" capacity (i.e. not in the active control triad, but still able to flag inconsistencies) for validating the correctness (or at least consistency) of the system. A single large aircraft might contain hundreds of quintuply redundant systems like this, with the extra processing engines serving mostly as monitors and quality-control devices, but also potentially available to provide critical decision data when things are not entirely in whack.

Whether one would ever want to default back to an earlier revision of a software release while in mid-mission is a question that could only be answered in the context of systems with that capability installed. Likely some fuzzy logic would be needed in the mix, for similarity analysis in process streams that are ever so slightly differently tasked.
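
In code terms, the monitoring idea might look something like this toy sketch (interfaces invented, not any real avionics API): the active triad of the current revision is voted for control, while a shadow copy of an earlier revision sees the same inputs and only logs disagreements.

Code:
from collections import Counter

def vote(outputs):
    # Exact-match majority vote over the active triad.
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority in active triad")
    return value

def control_step(sensor_input, active_triad, shadow_monitors, log):
    commanded = vote([channel(sensor_input) for channel in active_triad])
    for name, monitor in shadow_monitors.items():
        shadow_out = monitor(sensor_input)
        if shadow_out != commanded:
            # Shadow revisions never drive the actuators; they only log
            # inconsistencies for later analysis on the ground.
            log.append((name, sensor_input, shadow_out, commanded))
    return commanded

# Toy revisions: the current one rounds, the previous one truncated.
rev_n      = lambda x: round(x)
rev_n_min1 = lambda x: int(x)

log = []
cmd = control_step(4.6, [rev_n, rev_n, rev_n], {"rev N-1": rev_n_min1}, log)
print(cmd, log)   # 5 [('rev N-1', 4.6, 4, 5)]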

The real gold in higher levels of redundancy for self-diagnosis during operation would come from the 'consistency' log data, I reckon, and the occasional rare advisory to stay on the ground and sort something out before it can get you.

boaclhryul 20th Mar 2008 09:52


Originally Posted by arcniz
...per terraflop...

Now you're referring to BA38 :) .

Michael

arcniz 20th Mar 2008 17:02


Quote:

Originally Posted by arcniz
...per terraflop...
Now you're referring to BA38 .

Michael
... or could be a moniker for a noticeably awkward Danish gymnast ...


We apologise for our excessive r's.

For those normal folks who do not dwell amidst 1's and 0's, some clarification may be appropriate: Teraflop is computerese for a processing capability of 1,000,000,000,000 floating-point operations per second. My comment above was slightly hyperbolic, in that an extra gigaflop or two would do very nicely in nearly any practical control system of current style, and a megaflop is often more than enough. With teraflops you're in the present-day world of the code-breakers, but it mayn't be so long before they're ubiquitous in things that can use 'em.

Many controls are designed (for cost, simplicity, reliability) to deliberately avoid floating-point math in calculations; the appropriate metrics for them are MIPS, GIPS, TIPS, etc.
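
A trivial sketch of what that looks like in practice (made-up scaling, purely illustrative): values held as scaled integers, so the arithmetic never touches a floating-point unit and instructions-per-second rather than FLOPS is the yardstick that matters.

Code:
SCALE = 100                      # 1 unit = 0.01 degree (assumed resolution)

def to_fixed(deg):
    # Convert a degree value to the scaled-integer representation.
    return int(round(deg * SCALE))

def fixed_add(a, b):
    # Plain integer add - no floating-point unit required.
    return a + b

angle = fixed_add(to_fixed(12.34), to_fixed(0.07))
print(angle, angle / 100.0)      # 1241 12.41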

The advantage of increasing speed per processor is that it slices time into ever finer chunks, so that diagnostic and comparison housekeeping can be woven more extensively into the execution stream without affecting the mainstream control program. This is accomplished by stealing a microsecond now and again for the overhead stuff, in an invisible and demonstrably non-interfering manner. A similar effect can be achieved with multiple processors sharing a single process stream as if they were one processor. When time-stealing is done right, the revenue process sees only a slightly slower execution universe, with no possibility of interaction and therefore no need for expensive case-by-case testing of changes to the monitoring process. It is there, watching things and doing work, but the time-slicing mechanism acts as an effective firewall. (insert theory/practice disclaimer here)
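
A toy picture of the time-stealing idea (nothing avionics-specific, just the shape of it): the control task runs on every frame, and the monitor is handed an occasional frame in which it only reads shared state.

Code:
def run(frames, monitor_every=10):
    state = {"ticks": 0, "anomalies": 0}
    for frame in range(frames):
        # Revenue work: the control task runs on every frame.
        state["ticks"] += 1
        # Stolen slice: housekeeping runs on a small fraction of frames
        # and only reads shared state, so it cannot corrupt the control loop.
        if frame % monitor_every == 0:
            if state["ticks"] != frame + 1:
                state["anomalies"] += 1
    return state

print(run(100))   # {'ticks': 100, 'anomalies': 0}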

