
Dreamliner in emergency landing at Dublin Airport


Brian McGrath
23rd Oct 2015, 10:36
A US-bound Ethiopian Boeing 787 Dreamliner took off from Dublin at 6.10am en route to Washington. Info below.

Emergency landing at Dublin Airport | BreakingNews.ie (http://www.breakingnews.ie/ireland/emergency-landing-at-dublin-airport-702052.html)

Some more info from RTE

The Dreamliner had been cruising at 40,000 feet and was about 600km north west of Donegal when the pilot declared an emergency at around 7.30am.

The plane was then forced to dump thousands of litres of fuel so it could land within safe weight limits.

The crew had been in contact with controllers at the Irish Aviation Authority's North Atlantic Communications Service centre at Ballygirreen, Co Clare and advised them that they had to shut down one of the jet's two engines.

Several units of Dublin Fire Brigade along with HSE ambulances and an incident officer were mobilised to the airport.

Engineers are now investigating the problem.

tubby linton
23rd Oct 2015, 10:50
Thomsonfly binliner had engine problems a week ago.
Incident: Thomson B788 near Gander on Oct 14th 2015, engine rolled back (http://avherald.com/h?article=48e310a2&opt=0)

lilflyboy262...2
23rd Oct 2015, 10:55
And Royal Brunei had an engine shutdown the other day too.

Max Contingency
23rd Oct 2015, 11:10
Brian

This is a professional pilots' forum. I know you just quoted directly, but can we leave the dramatics to the press and consider re-titling your thread? Perhaps "Boeing 787 engine failure - return to Dublin 23 Oct 2015" would be more appropriate.

RF4
23rd Oct 2015, 11:51
I am curious which of Ethiopian's Dreamliners was involved. Was this one of their earlier purchases or one of the six "Terrible Teens" that they recently purchased? Actually, I don't know if the "Terrible Teens" have yet been delivered to them and are in service.

Selfloading
23rd Oct 2015, 12:25
I am curious which of Ethiopian's Dreamliners was involved. Was this one of their earlier purchases or one of the six "Terrible Teens" that they recently purchased? Actually, I don't know if the "Terrible Teens" have yet been delivered to them and are in service.

ET-ARF according to FR24.

evansb
23rd Oct 2015, 19:52
ET-ARF. So, what is with the GEnx-1B (or not 2B) engines?
http://i1047.photobucket.com/albums/b477/gumpjr_bucket/Resize.jpg

tdracer
23rd Oct 2015, 22:59
ET-ARF. So, what is with the GEnx-1B (or not 2B) engines?


The shutdown rate for the GEnx engines (both -1B and -2B) is running around 2 per million engine operating hours. That's only about ten times better than what's required for 180 minute ETOPS.:=


We better ground the fleet :ugh::ugh::ugh:
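
For context, a rough check of the arithmetic behind that comparison, assuming the commonly cited 180-minute ETOPS guideline of an in-flight shutdown (IFSD) rate no worse than about 0.02 per 1,000 engine hours:

0.02 per 1,000 engine hours = 20 per million engine hours (guideline)
2 per million engine hours (the GEnx rate quoted above)
20 / 2 = 10, i.e. roughly an order of magnitude of margin.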

llondel
23rd Oct 2015, 23:07
It is still valid to ask why they shut down. If there was no apparent defect then clearly something is not quite right and the technical team needs to learn from that.

Una Due Tfc
24th Oct 2015, 00:33
I must say, as a controller who deals with ETOPS diversions regularly, I was finding the original, uncensored thread both educational and interesting. Bizarre.

Back to the specifics: it's implied that there were three separate incidents of GEnx engines rolling back uncommanded, with no other abnormal indications, within a few days, at different stages of flight in different parts of the world. Any new FADEC software upgrades in the last couple of weeks?

roulishollandais
24th Oct 2015, 00:51
Too many people assert it is impossible to write zero-bug software. And that is wrong. As a result we are left with unanswered questions about how to avoid repeated failures... Until we discover a hidden Volkswagen syndrome? :}

tdracer
24th Oct 2015, 03:25
UDT - as is sometimes the case, I know more than I can probably repeat. But we have a pretty good idea what's causing the rollbacks (all recoverable, BTW), and it's not software as such (although the fix will likely include a s/w change).

wanabee777
24th Oct 2015, 05:26
I've always been under the impression that if I could avoid shutting down an errant engine (say, just keep it at idle), this would prevent possible adverse ETOPS penalties down the road, based on my company's reliability program for that particular model of engine.

At least that's what I've been led to believe from our line check airmen (LCAs).

msbbarratt
24th Oct 2015, 05:34
UDT - as is sometimes the case, I know more than I can probably repeat. But we have a pretty good idea what's causing the rollbacks (all recoverable, BTW), and it's not software as such (although the fix will likely include a s/w change).

Go on, go on, do tell!!! :)

msbbarratt
24th Oct 2015, 05:42
Too many people assert it is impossible to write zero-bug software. And that is wrong. As a result we are left with unanswered questions about how to avoid repeated failures... Until we discover a hidden Volkswagen syndrome? :}

It is possible to write bug free software, but proving that that's what's been achieved is basically impossible except in trivial examples.

Mostly we rely on a whole lot of very carefully designed testing and many hours of logged trouble-free running before reluctantly concluding that it might be ok... That's why making changes to this kind of software is so expensive - All the software tests have to be repeated.

beardy
24th Oct 2015, 07:24
IF it's a software problem AND it's the same software on both engines, I would expect that to impinge on ETOPS certification, since the risk of the second engine doing the same thing is higher than if it were a mechanical (as opposed to design) fault. ETOPS is defined by an acceptable risk of the other engine failing within a set time period, and whilst demonstrated failure rate is a very good metric it should not, IMHO, be considered in isolation.

underfire
24th Oct 2015, 07:38
What about the issue where the aircraft engines need to be "re-booted" every 248 days...perhaps something left over from this issue?
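
For anyone wondering where an oddly specific number like 248 days comes from: the widely reported explanation was a signed 32-bit counter ticking in hundredths of a second, which runs out of range after roughly 248.5 days of continuous power. A minimal sketch of that arithmetic in C (illustrative only, not the actual avionics code):

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of the widely reported 248-day issue: an uptime
   counter held in a signed 32-bit integer and incremented every 10 ms
   (one centisecond). It runs out of range after INT32_MAX centiseconds,
   i.e. roughly 248.5 days of continuous power, at which point the next
   tick wraps it to a large negative number and downstream comparisons
   against it go wrong. */
int main(void)
{
    const double ticks_per_second = 100.0;           /* 10 ms tick */
    const double seconds_per_day  = 86400.0;

    double days_until_overflow =
        (double)INT32_MAX / ticks_per_second / seconds_per_day;

    printf("signed 32-bit centisecond counter overflows after %.1f days\n",
           days_until_overflow);                      /* prints ~248.6 */
    return 0;
}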

Until we discover a hidden Volkswagen syndrome?

A Boeing defeat mechanism to make the fuel burn look better in testing!

Nialler
24th Oct 2015, 09:51
Too many people assert it is impossible to write zero-bug software. And that is wrong. As a result we are left with unanswered questions about how to avoid repeated failures... Until we discover a hidden Volkswagen syndrome?

I've never ever seen a piece of bug-free software in 30 years of working with mission-critical systems.

Heathrow Harry
24th Oct 2015, 12:06
Agreed - it is not possible to make software totally bug-free.

Chris Scott
24th Oct 2015, 12:46
Quote from beardy:
"IF it's a software problem AND it's the same software on both engines I would have expected that to impinge on ETOPs certification since the risk of the second engine doing the same thing is higher than if it were a mechanical (as opposed to design) fault. ETOPs is defined by acceptable risk of the other engine failing within a set time period and whilst demonstrated failure rate is a very good metric it should not, IMHO, be considered in isolation."

That's a very powerful argument, IMHO.

In which case, are FADEC software updates permitted to be introduced simultaneously on the two engines of a given a/c? (Or, for that matter, simultaneously on all the engines of a/c with 3 or 4 engines?) As beardy implies, introducing faulty software could cause a failure on the first flight, whereas a mechanical failure caused by faulty manufacture and/or wear and tear is a different matter. It would be comparable to allowing one mechanic to change the chip detectors on both engines on the same turnround...

One hopes the loss of the A400M on a test flight due to a FADEC problem has focussed minds across the industry?

b1lanc
24th Oct 2015, 13:43
It is possible to write bug free software, but proving that that's what's been achieved is basically impossible except in trivial examples.

Mostly we rely on a whole lot of very carefully designed testing and many hours of logged trouble-free running before reluctantly concluding that it might be ok... That's why making changes to this kind of software is so expensive - All the software tests have to be repeated.

No amount of testing ever identifies all bugs. Around 1980, Airbus shared their early static fly-by-wire flight test results with the DoD program I was supporting. If memory serves, they contracted with three different companies in three different countries to perform a full suite of testing of the fly-by-wire SW, with the thought that the different companies would find different bugs - at least a few. None of the companies knew any of the others existed; the test protocols were unique to each company, and results were not shared. The hope was each company might ferret out major flaws that others might not catch. Much to their chagrin (or maybe just due to good SW coding practices), well over 90% of flaws were found by all three companies and only a few less-than-critical bugs were identified uniquely. The results surprised a number of people and, as events would later prove, not all major issues were uncovered.

Alain67
24th Oct 2015, 16:23
Not every software failure is a bug.
You analyse a problem.
If the code you write doesn't behave as expected in some cases, that's a bug.
However, if you have forgotten some cases in your analysis, I wouldn't call that a bug, because writing the procedure down on a sheet of paper would have produced the same error... so it has nothing to do, specifically, with software production.

lomapaseo
24th Oct 2015, 19:28
In all likelihood software didn't cause the problem; however, it can work wonders to fix it.

There is nothing new about rollback, just the assumptions about what caused it this time. Typically one looks at the input signals to the FADEC.

No need to even argue about who screwed up; just address it and move on.

red.sky@night
24th Oct 2015, 19:43
Meanwhile, ET-ARH calls to DUB with a spare donk.

G-CPTN
24th Oct 2015, 23:13
ET-ARH calls to DUB with a spare donk.
As internal cargo or on a pylon?

DaveReidUK
25th Oct 2015, 07:03
As internal cargo or on a pylon?

The 777 doesn't have an external engine-ferrying capability.

The GEnx fits (just) on the main cargo deck of the 777F: https://www.youtube.com/watch?v=X2jZp35BvjU

roulishollandais
25th Oct 2015, 13:24
I used the word "bug" to adress any failure in the software chain to do it shorter, I hope you will accept that word generalization inside that part of that thread.

Testing software is never a cheap operation, it is at contrary the most expensive part of building a software (1>100 is not unusual). And we have to repeat that testing some times in critical designs connected to human life or strategic aims.

That amount of work to debug really softwares is not abnormal if we accept that rule well known from the begin of software use by engineers and managers : a software is not chosen for fashion or modernity reason but to get money ! If you use your software only a little number of time it is much cheaper to do it by hand ! You have to calculate the benefit of the long, expensive, hard work creating and using software's. A sort software is used millions of times, which is much more than FADECs use. Our traditional Aeronautics ingineers had a very safe score, thank you to them.

Arrival of Personal computer induced a image of toy to the computer and the idea that everything gets magic like a play and everybody could rebuild the world in the secrecy of his office without to need acceptance by others or at least to listen their opinion about your Ptoleme's fantasma.

That experience from Airbus with DOD and some other companies is interesting but not more than that : Testing must be done by the team who elaborated the software. It is nearly impossible to debug a software yourself were not the conceiver.

Testing is a long chain of logic, with the same level of logic you used in the software conception. So you may use statistics only if you used statistics inside the software, ie theory of games to solve a great amount of equations, and respecting strticly the rules of statistics (ie if you are not sure of the distribution law, don't use tests adapted to Gaussian law).

Bug free software exists ie bootstraps and the multiple layers above the bootstrap to build operating systems or nets like the web (despite you may find some bugs in some OS...)

Ian W
25th Oct 2015, 15:08
roulishollandais
Testing must be done by the team who built the software. It is nearly impossible to debug software if you were not its designer.

I would disagree with this, as most major 'software' problems are not faults in the code but faults in the design, due to poor or incorrect systems analysis. As a simple case, if the systems analyst thought that a wind of 360 at 10 kts is blowing to the north rather than from the north, and designed the software with that in mind, the software could be tested repeatedly and the fault would not be found. This is called verification testing: all it does is confirm that the software does what it was designed to do, without errors. Design faults are only found in validation testing, where the user confirms that all the functional requirements of the system perform correctly. The user tests the system as a black box - i.e. has, and needs, no knowledge of the design or the software. This kind of design fault, caused by a misunderstanding, would only be found in validation testing. These design faults are also the most expensive to fix, as the system is usually close to complete when validation testing is carried out, and fixes will require a large amount of repeat (regression) testing to ensure the fix has not broken anything else.
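
To make the wind example above concrete, here is a minimal hypothetical sketch in C (invented function names, nothing from any real system): the code faithfully implements a spec whose author believed "wind 360 at 10 kt" means air moving towards the north, so a verification test derived from that same spec passes while the sign of the headwind is wrong.

#include <math.h>
#include <stdio.h>

#define DEG2RAD (3.14159265358979323846 / 180.0)

/* Hypothetical sketch of a requirements error that verification cannot
   catch. The spec author believed "wind 360 at 10 kt" means air moving
   TOWARDS the north (it actually means FROM the north), so the specified
   headwind formula has its sign flipped. The code implements that spec
   exactly. */
static double headwind_kt(double runway_hdg_deg,
                          double wind_dir_deg,   /* direction per the spec */
                          double wind_speed_kt)
{
    return -wind_speed_kt * cos((wind_dir_deg - runway_hdg_deg) * DEG2RAD);
}

int main(void)
{
    /* Verification test, derived from the same (wrong) spec: runway 36,
       wind 360 at 10 kt, expected value -10 kt. The test PASSES, so the
       code is "verified" -- yet any pilot doing validation would expect
       a 10 kt HEADwind here, not a tailwind. */
    double result = headwind_kt(360.0, 360.0, 10.0);
    printf("computed headwind: %.1f kt (spec expected -10.0)\n", result);
    return 0;
}

The test suite confirms the code matches the spec; only user-level validation exposes that the spec itself was wrong.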

Unfortunately, while it was seen as exceptionally important in the early days of computing, systems analysis and design has been somewhat trivialized in recent years into a semi-automated process and this has led to some very costly development failures.

roulishollandais
25th Oct 2015, 17:52
Ian,
Complying with the user's request does indeed require validation by somebody who did not conceive the software and whose profession it is not. But that user must be involved all along the steps of building the system, from his first request to the analyst. That request cannot be just a mute folder. Real cooperation between analyst and customer leads to a safe product. The story of the Seville test pilots confronting a faulty system during validation, or Warner's death, shows the danger of asking the user to test-validate the system. Here we are walking on eggshells; small steps must be validated progressively.

I agree totally with your last sentence. That fact is really worrying. We are seeing there a negative evolution called "progress" :}, similar to pilots losing basic flying skills and fundamentals.
rh

peekay4
25th Oct 2015, 20:59
Those pilots were effectively performing verification, not validation. They were testing whether or not their aircraft performed to specs, not whether the specs were correct.

NASA did many studies over the decades and surprisingly (?) found that it is actually impossible to find all safety-critical software bugs by testing!

That's because as complexity increases, the time required to test all possible conditions rises exponentially. Completely and exhaustively testing an entire suite of avionics software could literally take thousands of years.

Therefore, instead of full exhaustive testing, we selectively test what we determine to be the most important conditions to test. Metrics are gathered and analysis is performed to provide the required test coverage, check boundary conditions, ensure that there are no regressions, etc.

However, one can't prove that a piece of software is "bug free" this way, because not all possible conditions are tested.

Today as an alternative, the most critical pieces of software are verified using formal methods (i.e., using mathematical proofs) to augment -- or entirely replace -- functional testing. Unlike testing, formal methods can prove design/implementation correctness to specifications. Unfortunately, formal methods verification is a very costly process and thus is not used for the vast majority (>99.9%) of code.

The rest of the code relies on fault-tolerance. Instead of attempting to write "zero bug" software, safety is "assured" by having multiple independent modules voting for an outcome, and/or having many defensive layers so failure of one piece of code doesn't compromise the safety of the entire system (swiss-cheese model applied to software).

This "fault-tolerance" approach isn't perfect but provides an "acceptable" level of risk.

roulishollandais
25th Oct 2015, 21:38
That complexity is excessive, like a human who would taste everything in the grocery before buying his daily bread! We are using too much software, too much energy, and so on, that we don't need!

Of course, using tested modules and defensive layers is not far from mandatory if you are chasing the zero-bug level.

"99.9"... Once again, that number is just terrifying. We had that discussion with PJ2 already!
Zero is really much lower than 0.1!!! The limit towards zero we have to seek is given by the Planck time! Most people don't know which "zero" they need and decide to reach. Having a one-in-a-hundred chance of dying on the first trip to the moon is acceptable, but 1/100 or 1/1000 for a FADEC failure in an airliner engine is not acceptable. You cannot say you have debugged your software if you don't know exactly the risk and cost of that software's failure.

peekay4
25th Oct 2015, 22:11
Welcome to reality!!!!

For example, if you look at the avionics software standard (RTCA DO-178C), even at the most stringent "Level A" (where a system failure would be catastrophic == complete loss of aircraft and occupants) the test coverage required is simply that each code decision has been exercised at least once and that each condition feeding those decisions is shown to independently affect the outcome (so-called MC/DC test coverage).

This does not require that all possible input / output / reference data combinations have been exhaustively verified and validated.
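
To illustrate what that coverage criterion does and does not demand, a tiny hypothetical example in C (invented names, nothing from a real FADEC): for a simple two-condition decision, MC/DC is satisfied by three test cases, each showing one condition independently flipping the outcome - far fewer than exhaustively exercising every possible input.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical two-condition decision, used only to illustrate MC/DC.
   Exhaustive testing of the booleans alone would need 4 cases; MC/DC is
   satisfied with 3, each chosen so that flipping exactly one condition
   flips the decision:
     (A=true,  B=true)  -> true
     (A=false, B=true)  -> false   (A independently affects the outcome)
     (A=true,  B=false) -> false   (B independently affects the outcome)
   Note what is NOT required: exercising every combination of the
   underlying numeric inputs that feed A and B. */
static bool permit_relight(bool n2_above_min /* A */, bool fuel_valve_open /* B */)
{
    return n2_above_min && fuel_valve_open;
}

int main(void)
{
    printf("%d %d %d\n",
           permit_relight(true,  true),    /* 1 */
           permit_relight(false, true),    /* 0 */
           permit_relight(true,  false));  /* 0 */
    return 0;
}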

Unfortunately the A400M crash is an example of such catastrophic failure, reportedly caused by missing reference (configuration) data.

MG23
25th Oct 2015, 23:15
I've never ever seen a piece of bugfree software in 30 years of working with mission-critical systems.

You can write bug-free software, so long as it doesn't have to do anything useful. And, even if the software doesn't have bugs, the hardware it's running on does. And, even if the software doesn't have bugs and the hardware doesn't have bugs, the specifications for the software have bugs. And, even if none of them have bugs, the other systems you're talking to have bugs.

lomapaseo
26th Oct 2015, 00:01
You can write bug-free software, so long as it doesn't have to do anything useful. And, even if the software doesn't have bugs, the hardware it's running on does. And, even if the software doesn't have bugs and the hardware doesn't have bugs, the specifications for the software have bugs. And, even if none of them have bugs, the other systems you're talking to have bugs.

Which is likely what this thread subject is all about. They will probably just tweak the FADEC to accommodate it somehow. It will be a little harder to modify an input, but that is always an option.

barit1
26th Oct 2015, 01:43
And, even if the software doesn't have bugs, the hardware it's running on does.

I give you the T972 fan engine on the A380. The hardware had a small flaw (oil nozzle) that revealed a big flaw (IP turbine fatal failure mode), as QF32 revealed a couple of years ago.

But the intel we outsiders have is that a FADEC fix was implemented - I suspect an N2/N3 mismatch detector. It's a logical approach, but it introduces new failure modes to the system. Life is never simple, is it? :ugh:

tdracer
26th Oct 2015, 02:42
I deal with "Design Assurance Level A" or DAL A software regularly. Nearly all the "software errors" we see are not really software errors - they are requirements errors. The software is doing exactly what we told it to do in the requirements, but the requirements were not representative of what was really wanted.
What's particularly common is the requirements - as written - are not clear to the people that are implementing them. The problem is that the people writing the requirements know the system intimately, and they write requirements that are clear and make perfect sense to them - but the people who implement those requirements don't know the system and what it's expected to do, and they don't interpret those requirements as the writers intended :sad:

Barit1, on most engines, if the shaft breaks the turbine will move aft and clash with the stators - it's not pretty, but it prevents a turbine overspeed and uncontained failure (or if bits do escape, they are not "high energy" and don't do significant damage). For some reason, Rolls engines don't tend to do that. This problem showed up on the RB211-524 engine, where a few fan shafts broke - one event was on the center engine of an L1011, and the fan came down through the fuselage and tried to cut the aircraft in half. Rolls came up with a 'fan catcher' that would prevent the fan from leaving the engine. The next failure was on a 747; the fan catcher worked as intended, but the unloaded LP turbine oversped and exploded, cutting the rear of the engine off (and peppering the aircraft with shrapnel).
The Trent engine was developed with "LPTOS" - Low Pressure Turbine OverSpeed. Basically, the FADEC monitors the LP shaft speed at both ends, and if they disagree (by more than a small tolerance) it will shut off the fuel. In the aftermath of the A380 event, Rolls has been implementing "IPTOS" (Intermediate Pressure TOS) on the various Trent models.
Software is not perfect, but it has often been successfully used to address various hardware shortcomings.
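
A very rough sketch in C of that kind of shaft-break monitor (names and the tolerance are invented for illustration; this is not Rolls-Royce's actual LPTOS logic): speed sensed at the two ends of the LP shaft is compared, and a disagreement beyond a small tolerance is treated as a broken shaft and triggers a fuel cut before the unloaded turbine can overspeed.

#include <stdbool.h>
#include <math.h>
#include <stdio.h>

/* Illustrative sketch only -- invented names and thresholds, not actual
   FADEC code. Speed is sensed at both ends of the LP shaft; if the two
   readings diverge by more than a small tolerance, the shaft is assumed
   broken and fuel is cut before the unloaded turbine can overspeed. */
#define LP_MISMATCH_TOLERANCE 0.05   /* 5% of rated N1, hypothetical */

static bool lp_shaft_break_detected(double n1_fan_end, double n1_turbine_end)
{
    return fabs(n1_fan_end - n1_turbine_end) > LP_MISMATCH_TOLERANCE;
}

int main(void)
{
    /* Normal running: both ends read essentially the same speed. */
    printf("normal: cut fuel? %d\n", lp_shaft_break_detected(0.92, 0.93));

    /* Shaft failure: turbine end accelerates while the fan end decays. */
    printf("broken: cut fuel? %d\n", lp_shaft_break_detected(0.60, 1.05));
    return 0;
}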

The A400 crash may well be the first known accident due entirely to a problem with DAL A software. All I know about it is what I've read in news accounts and I'm anxiously awaiting the official report (hopefully Airbus/Rolls won't use the military aspects of the A400 to make the report confidential). But the news reports point to a glaring requirements error - properly designed FADEC software should have put up a 'no dispatch' warning if a critical calibration was undefined.

peekay4
26th Oct 2015, 03:36
The A400 crash may well be the first known accident due entirely to a problem with DAL A software. All I know about it is what I've read in news accounts and I'm anxiously awaiting the official report (hopefully Airbus/Rolls won't use the military aspects of the A400 to make the report confidential). But the news reports point to a glaring requirements error - properly designed FADEC software should have put up a 'no dispatch' warning if a critical calibration was undefined.
Yea, although that might be indicative of something more than just a requirements error -- pointing to a larger process breakdown.

Typically there are high level requirements, specific system / software requirements, low-level requirements, etc., which all need to be traceable up and down between them, and also have full traceability to the code, to the binary, and to all the test cases (and/or formal methods verifications as applicable).

For all data elements, there should be specifications to check for valid ranges of values (data domain), missing values (null checks), etc. Functions also need to have preconditions & postconditions on what parameter values are acceptable as part of the interface contract, and assertions which must hold true.
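
A small hypothetical sketch in C of what such an interface contract looks like (invented names; real avionics code would route violations to a fault handler rather than bare assert()): the function checks its preconditions on the input's data domain and its postcondition on the output.

#include <assert.h>
#include <stdio.h>

/* Hypothetical example of precondition / postcondition checking on a
   data element with a specified valid range. Real safety-critical code
   would typically route violations to a fault handler rather than
   assert(), but the contract idea is the same. */
static double throttle_to_fuel_fraction(double throttle_percent)
{
    /* Preconditions: input must lie in its specified data domain. */
    assert(throttle_percent >= 0.0 && throttle_percent <= 100.0);

    double fuel_fraction = throttle_percent / 100.0;

    /* Postcondition: output is a fraction in [0, 1]. */
    assert(fuel_fraction >= 0.0 && fuel_fraction <= 1.0);
    return fuel_fraction;
}

int main(void)
{
    printf("%.2f\n", throttle_to_fuel_fraction(75.0));  /* 0.75 */
    return 0;
}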

There should've also been models of both the specifications and the design and processes to check these models for completeness.

And even if there are data errors, as mentioned before the software should be designed to be fault-tolerant and fail safe instead of simply freezing up at 400' AGL.

What you don't want to do is to fix this one specific requirement while there may be other missing/incomplete/incorrect requirements out there. So you have to take a look into the SDLC process and figure out why the requirement was missed to begin with.

wanabee777
26th Oct 2015, 03:51
All this software talk is starting to make me long for plain old Jurassic J57 or JT8D technology.:)

Magplug
26th Oct 2015, 11:52
I witnessed the Ethiopian 787 arrival into DUB recently, and ATC at that awful place never cease to amaze me........

The arrival of the 787 with an IFESD was known well in advance, as he was returning off the ocean. We arrived in the half-hour before he landed, and all the airport and external units for the emergency were already in place around the airport.

The Ethiopian 787 landed without incident, vacated, talked to the Fire Service and when all was established as safe they elected to proceed to the terminal. Up to that point it all seemed to be handled rather well...... (Congratulations lads!).

With the Fire Service following the 787 was directed by ATC to a stand on the main terminal. For reasons best known to themselves the Fire Service then rolled out multiple hoses around the area in preparation to fight any fire they might have missed earlier. As a precautionary measure this would have been all well and good except that ATC had parked the aircraft on stand 204 in very close proximity to a BA A320 being refuelled and boarded for departure. The sight of the fire-fighters rolling out their hoses around an aircraft with 200 pax on board that was blissfully loading fuel from an airport bowser was rather comical.

I probably visit 20-30 different European airports every month, some of them with diverse terrain, ATC or cultural challenges. However, DUB has to be near the top of my dangerous-avoid list. The ground environment is absolutely appalling, with no fewer than 4 different taxiway nomenclatures guaranteed to confuse the visiting pilot, and clearances delivered at break-neck speed. There is no recognition that visitors to DUB might be complete strangers, and if you question a clearance the reply elicits a degree of arrogance found in very few places elsewhere. I can quite understand why they have a history of ground incidents.

There is little logic to the way things happen in DUB. Whilst the rest of the world is pretty much standard ICAO, Dublin carries on business in its own little bubble.

oldoberon
26th Oct 2015, 13:15
The 777 doesn't have an external engine-ferrying capability.

The GEnx fits (just) on the main cargo deck of the 777F: https://www.youtube.com/watch?v=X2jZp35BvjU

Wow, a tight fit indeed - I can think of a few places where you wouldn't want local loaders putting that on/off.

oldoberon
26th Oct 2015, 13:20
IF it's a software problem AND it's the same software on both engines, I would expect that to impinge on ETOPS certification, since the risk of the second engine doing the same thing is higher than if it were a mechanical (as opposed to design) fault. ETOPS is defined by an acceptable risk of the other engine failing within a set time period, and whilst demonstrated failure rate is a very good metric it should not, IMHO, be considered in isolation.


When put that way it is so logical; I do hope the certifying bodies used the same logic. It's a long time since I did a long flight over water, and I always preferred 4 engines to 2; having read your post, that preference will remain in place.

Ian W
26th Oct 2015, 13:56
Those pilots were effectively performing verification, not validation. They were testing whether or not their aircraft performed to specs, not whether the specs were correct.

NASA did many studies over the decades and surprisingly (?) found that it is actually impossible to find all safety-critical software bugs by testing!

That's because as complexity increases, the time required to test all possible conditions rises exponentially. Completely and exhaustively testing an entire suite of avionics software could literally take thousands of years.

Therefore, instead of full exhaustive testing, we selectively test what we determine to be the most important conditions to test. Metrics are gathered and analysis is performed to provide the required test coverage, check boundary conditions, ensure that there are no regressions, etc.

However, one can't prove that a piece of software is "bug free" this way, because not all possible conditions are tested.

Today as an alternative, the most critical pieces of software are verified using formal methods (i.e., using mathematical proofs) to augment -- or entirely replace -- functional testing. Unlike testing, formal methods can prove design/implementation correctness to specifications. Unfortunately, formal methods verification is a very costly process and thus is not used for the vast majority (>99.9%) of code.

The rest of the code relies on fault-tolerance. Instead of attempting to write "zero bug" software, safety is "assured" by having multiple independent modules voting for an outcome, and/or having many defensive layers so failure of one piece of code doesn't compromise the safety of the entire system (swiss-cheese model applied to software).

This "fault-tolerance" approach isn't perfect but provides an "acceptable" level of risk.

Exhaustive testing: when either the tester or the funds are exhausted. It has no bearing on the number of bugs yet to be found.

Mathematical proof of software is an example of the 'streetlight effect': more and more effort being expended looking for bugs in an area where they are simple to find but very unlikely - in the code that can be mathematically checked - rather than where they most often are, which is in system design. However, it makes some companies a lot of money and delays, and even prevents, implementation of modern hardware and software.

Fault tolerance by triplex voting is fine until there is a three-way disagreement, and/or the voting software makes a mistake and shuts down the process whose software is correct while following the output of the other two processes whose software is incorrect. This happens surprisingly often.

Ian W
26th Oct 2015, 14:01
When put that way it is so logical; I do hope the certifying bodies used the same logic. It's a long time since I did a long flight over water, and I always preferred 4 engines to 2; having read your post, that preference will remain in place.

Unfortunately, if all four have the same software version then all four could in theory crash, and such faults do happen even on fully tested systems. Such as the F-22 Squadron Shot Down by the International Date Line :eek:
(http://www.defenseindustrydaily.com/f22-squadron-shot-down-by-the-international-date-line-03087/)
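
The date-line class of bug is easy to sketch in C (hypothetical code, not the F-22's): a navigation routine that subtracts two longitudes without normalising the result works fine everywhere it was tested, and then produces nonsense the first time the track crosses the 180th meridian.

#include <stdio.h>

/* Hypothetical illustration of a date-line bug: the "shortest turn"
   between two longitudes computed by naive subtraction. */
static double delta_lon_naive(double from_deg, double to_deg)
{
    return to_deg - from_deg;                  /* no wrap handling */
}

/* Corrected version: normalise the difference into (-180, +180]. */
static double delta_lon_wrapped(double from_deg, double to_deg)
{
    double d = to_deg - from_deg;
    while (d > 180.0)   d -= 360.0;
    while (d <= -180.0) d += 360.0;
    return d;
}

int main(void)
{
    /* Heading from 179 E towards 179 W: really only 2 degrees east. */
    printf("naive:   %.1f deg\n", delta_lon_naive(179.0, -179.0));   /* -358 */
    printf("wrapped: %.1f deg\n", delta_lon_wrapped(179.0, -179.0)); /*    2 */
    return 0;
}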

oldoberon
26th Oct 2015, 15:38
Yes, most of them have been around longer, so in theory the SW is more proven (I hope).

Your link - wow, a close one!!

peekay4
26th Oct 2015, 16:10
There is an overarching software design & architecture requirement that any "catastrophic failure" -- a failure resulting in the loss of the airplane and deaths of its occupants -- must be "extremely improbable".

For FAR 25 aircraft, "extremely improbable" is defined as a failure rate of no more than 1 per billion flight hours (1E-9), established by a quantitative safety assessment.

However, as we found out with the Challenger shuttle disaster, this kind of quantitative assessment can be a bit pie in the sky. Still, critical software does tend to be extremely reliable. Just remember to reboot from time to time........

lomapaseo
26th Oct 2015, 16:20
For FAR 25 aircraft, "extremely improbable" is defined as a failure rate of no more than 1 per billion flight hours (1E-9), established by a quantitative safety assessment.


In general:

To put this into perspective, catastrophic failures (Part 25) from all causes occur at a rate about 100 times more likely (1E-7).

I'm a lot less worried about the system causing the crash than I am about the pilot's contribution.

Liffy 1M
26th Oct 2015, 16:32
There is little logic to the way things happen in DUB. Whilst the rest of the world is pretty much standard ICAO, Dublin carries on business in its own little bubble.

Most of the issues you raise about taxiway nomenclature and layout and also where the Ethiopian 787 was directed to park can hardly be laid at the door of ATC. Those are all matters for the airport authority. Amongst the recommendations of a recent AAIU report into a ground collision between two 737s at Dublin was that:

"The Dublin Airport Authority (DAA) conduct a critical review of the taxiway system at Dublin Airport, to ensure that taxiway routes are as simple as possible in order to avoid pilot confusion and the need for complicated instructions."

The report also states that Dublin Airport accepts the recommendation and will undertake a critical review of the taxiway system to ensure that taxiway routes are as simple as possible.

tatelyle
26th Oct 2015, 18:37
Exhaustive testing: Is when either the tester or the funds are exhausted, it has no bearing on the number of bugs yet to be found.



Nothing new. The problem of bugs in systems has always been with us in design, it is just that computers probably give more opportunities for error.

Ask the captain of the BA flight that lost two donkeys on short finals into LHR. That bug in the fuel system had managed to hide itself for several years before rearing its ugly head.


http://i.dailymail.co.uk/i/pix/2013/07/06/article-2357585-1AB177DD000005DC-636_638x339.jpg

Infieldg
26th Oct 2015, 23:31
I've been a software developer for 26 years, and '100% bug-free' software can simply mean that you and whoever tested the software both misinterpreted the spec in the same way, OR the analyst misinterpreted the requirements and you coded their mistake perfectly and the tester agreed. This is (literally, not kidding) why I genuinely fear being a passenger on an Airbus. Nothing can ever replace you guys and we shouldn't be trying.

AR1
27th Oct 2015, 15:36
You really need to get out more.

Before SW failure there was mechanical failure. And that's not gone away either. Despite the fact that software can never be 100% bug-free (not my assertion), you fly in an era of unprecedented safety in air travel.

Unfortunately those same technical advances also give us the ability to spout tripe in an unprecedented way. And that scares me.

MG23
27th Oct 2015, 16:41
Before SW failure there was mechanical failure. And that's not gone away either.

But you can usually detect mechanical problems early: for example, wear or cracks in metal parts. Software may work perfectly for ten years, then finally hit the rare bug that causes it to fail for no apparent reason. Worse, every instance of that software may fail at the same time all over the world (e.g. the various leap second bugs). Gears don't even know about leap seconds.

The other issue is that a third party can examine all the mechanical parts for cracks, and tell you there's a problem. A third party usually can't examine the software that runs those parts, because it's closed source. They can only test it as a black box.

There's a truly scary document online from one of the software guys who was given access to Toyota's software as an expert witness for the 'unintended acceleration' trials. Some of the things in there are quite mind-boggling, but no-one knew about them because they had no access to the software.

Software has definitely made many things far more reliable. But it's also replaced many predictable failures with unpredictable ones.

beardy
27th Oct 2015, 18:07
Who judges the acceptable risk of software bugs?

GlobalNav
27th Oct 2015, 18:24
I doubt there is any such thing. Software just is whatever has been coded. Unless the memory media on which it is stored fails somehow, the software remains intact, just as coded and compiled.

Software error? Yes, and as so many have explained, testing to find every software error (AKA bug) is fairly impractical in the large bodies of complex code used in avionics, engines and such. So rather than completely exhaustive testing - though some testing indeed is done - what is required is a disciplined software development process, the rigour of which is driven by the safety effects that a function affected by software error might be considered to have.

Highly critical functions with potentially catastrophic effects from software errors must have a "design assurance level" of A, which of course is the highest and most expensive development process.

llondel
27th Oct 2015, 20:26
But you can usually detect mechanical problems early: for example, wear or cracks in metal parts. Software may work perfectly for ten years, then finally hit the rare bug that causes it to fail for no apparent reason.

It can happen with mechanical stuff too - Air Midwest 5481 back in 2003. Someone introduced a mechanical 'bug' in that they rigged the elevator cables incorrectly. It flew OK for several flights until circumstances conspired to trip the bug, in the form of a CofG too far aft, and it pitched up and stalled. OK, not quite ten years, but no one detected the error, and had the error not been made it would have been recoverable - the limited elevator travel due to the error meant it couldn't cope.

roulishollandais
28th Oct 2015, 02:46
Who judges the acceptable risk of software bugs?

Refer to the Ariane 501 report (on the 4 June 1996 crash) by Jacques-Louis Lions on best practices.

andrasz
28th Oct 2015, 04:10
Software just is whatever has been coded.
That is very simplistic and incorrect. Software comprises the original set of specifications on what the system is supposed to achieve, the algorithm (which is a translation of the specs into the particulars of the coding language used), the actual code, the set of static and dynamic data which are used by the code, and the user instructions/manual on how to operate the software.


"Bugs" can be introduced everywhere along this process, and coding bugs (where there is an actual syntax or logical error in the code) are usually the smallest percent of them, and the easiest to catch. The most difficult part are the specifications, where a professional in a particular subject needs to describe his/her knowledge to someone who is at best marginally versed in the profession, however is able to develop efficient algorithms to achieve what the specifications say. There are many things which may get lost in translation here, and the most dangerous are which were 'forgotten' from the specifications simply because a particular scenario was not considered. These scenarios are usually in the realm of valid data, as basic software design principles mandate that invalid data ranges must be considered and treated (eg. if a parameter must be positive, in a critical system there MUST be a loop which handles the case if that parameter is negative).


A further layer of "bugs" are as Microsoft once famously said, not bugs but features. Errors can be introduced in the user manual which may not correctly describe how the system works, especially in remote and unlikely scenarios. This causes the software to behave as specified, but differently than what users expect. More issues are introduced through the user interface, when the software users do things which are explicitly disallowed in the manual, but try it anyway, with totally unpredictable outcomes as those scenarios were neither considered nor tested.


From the user perspective all above are "bugs", but only a very small portion are actually attributable to the code itself.

DType
28th Oct 2015, 12:38
Whenever I wrote in a manual "Whatever you do, don't press button 'A'", I had to go back to the product design and delete or protect button 'A'. Eventually, I got round to writing the manual before I started the design. That only took half a lifetime to figure out, but then I'm not the sharpest knife in the box!

wanabee777
28th Oct 2015, 15:08
I never could keep my fingers off the bloody buttons. Especially on long haul flights.

Used to drive my F/O's nuts!:\

Nialler
28th Oct 2015, 17:03
@peekay4:

Yea, although that might be indicative of something more than just a requirements error -- pointing to a larger process breakdown.

Typically there are high level requirements, specific system / software requirements, low-level requirements, etc., which all need to be traceable up and down between them, and also have full traceability to the code, to the binary, and to all the test cases (and/or formal methods verifications as applicable).

For all data elements, there should be specifications to check for valid ranges of values (data domain), missing values (null checks), etc. Functions also need to have preconditions & postconditions on what parameter values are acceptable as part of the interface contract, and assertions which must hold true.

There should've also been models of both the specifications and the design and processes to check these models for completeness.

And even if there are data errors, as mentioned before the software should be designed to be fault-tolerant and fail safe instead of simply freezing up at 400' AGL.

What you don't want to do is to fix this one specific requirement while there may be other missing/incomplete/incorrect requirements out there. So you have to take a look into the SDLC process and figure out why the requirement was missed to begin with.

You may have worked in the past with Orthogonal Defect Classification. This is where things get scary. In nailing down a coding error at one stage, we drilled through to the conclusion that the error was a "missing typo". At the meeting we collapsed in laughter. The problem essentially consisted of the fact that a typo hadn't been propagated right throughout the development cycle. When we recovered ourselves we realised how utterly catastrophic such an error might be.

With teams using US and UK English there were multiple risks of variant typos, each separately close enough to the other to pass muster, but with as-yet-untested fallback routines failing in the event.

Avionic software at least appears to fall back to the backstop of handing things over to the pilot(s). The day that they stop doing so is the day that I keep my feet on the ground.

Systems are never perfect, and they don't exist in a vacuum; parallel systems may make undesired demands of them.

I'm not flying when the person in the seat is a systems administrator; I want a pilot up there. One who can override every damn system. Yes, they make mistakes, but at least they can react according to their skills, and at least their ass is on the line too.

esa-aardvark
29th Oct 2015, 05:21
Hello peekay4,
I think you will find the Challenger disaster was in the numbers. The relevant engineers voted against flight. It flew because of the common management idea that if it (they) had flown several times then they were OK.

msbbarratt
29th Oct 2015, 06:38
Unfortunately, if all four have the same software version then all four could in theory crash, and such faults do happen even on fully tested systems.

The type of error encountered in the F-22 squadron is the type of error that could be detected in well-designed testing regimes. Spherical navigation calculations have many well-known traps for software programmers. I wonder if they have flown an F-22 directly over the north pole yet?

Where it is difficult to fully test a piece of software tends to be in exploring the full state map of a system. This results in many possibilities for very, very subtle errors.

The current state of the piece of software in an engine control will depend on all that has happened to that engine in all preceding flights. Even for two identical engines bolted to the same aircraft at the same time they will both experience slightly different operating conditions throughout their service life. This may, though it is not guaranteed, result in a bug occurring in one engine but not another. Operating an aircraft with two engines of significantly different service times is a way of improving this chance (though of course there is still no guarantee).

Anyway, aren't we wildly speculating here? Just because someone has said that a software update might be one of the changes made does not mean that it was the software that went wrong. It is more likely that a software change is needed as a result of a change in the mechanical design of the engine.

tdracer
29th Oct 2015, 14:30
Anyway, aren't we wildly speculating here? Just because someone has said that a software update might be one of the changes made does not mean that it was the software that went wrong. It is more likely that a software change is needed as a result of a change in the mechanical design of the engine.

Three pages back, I wrote this:
I know more than I can probably repeat. But we have a pretty good idea what's causing the rollbacks (all recoverable, BTW), and it's not software as such (although the fix will likely include a s/w change).

As you've noted, they'll likely change the s/w to be more tolerant of the hardware issue that's causing the rollback, but once again, the rollbacks were not caused by the software.
I've come to the conclusion that reading comprehension is not a forte of some of the posters on this forum :ugh:

safetypee
29th Oct 2015, 16:31
Notwithstanding td's input, it's still not clear what the actual problem was.
If it was a straightforward mechanical failure then the probability of a dual event is very remote like that of an improbable software malfunction – i.e. a failure rate as assumed for ETOPS.
However, if as intimated the engine experienced a ‘recoverable ice crystal encounter’ (ICE) then the subsequent shutdown could indicate damage greater than that assumed in the safety case for operations near such conditions.

Although the risk of a dual shutdown due to ICE damage has been accepted, this incident could call that judgement into question, given the apparent ineffectiveness of flight restrictions and the extent - or the crew's perception - of whatever damage occurred; i.e. a shutdown occurred when not expected, and the cause could have affected both engines simultaneously!

Some operators relate ICE with the tropics and have planned their operations to avoid these routes, but ICE is associated with large storms in a tropical air mass. Did this event, albeit in the N Atlantic, involve a tropical air mass, perhaps the remnants of a hurricane?

tdracer
29th Oct 2015, 17:52
safetypee, it's not ice crystal icing (no meaningful ICI events with the latest altitude restrictions and software). The rollbacks are related to a sensor, but I'd rather not elaborate at this time. When GE sees fit to share what's known with the operators, then I'll be willing to elaborate.

lomapaseo
29th Oct 2015, 22:25
safetypee

You keep using the term "shutdown", which is a pilot action, when I have not seen any info to say this was anything but a rollback in power.

tdracer
29th Oct 2015, 23:19
Lomapaseo, in defense of safetypee, there are really two discussions going on that have become intertwined. The first, which was the original subject of this thread, was a shutdown and air turnback. GE has reported to the operators that it was due to a cracked AGB. Not good, but the GEnx-1B shutdown rate remains impressively good (especially given it's a brand new engine) - much better than that required for 180-minute (or even 330-minute) ETOPS.
The second discussion arose when another poster pointed out that there were also three uncommanded rollbacks during the preceding week or so, which someone speculated were due to a software error (they most definitely were not).
One of the rollbacks was initially reported as suspected Ice Crystal Icing; however, that was later determined to be incorrect.

Now back to your regularly scheduled debates :rolleyes:

MarkD
31st Oct 2015, 11:00
The Royal Brunei incident noted on page 1 involved a Trent 1000, not a GEnx.

safetypee
1st Nov 2015, 08:56
td, thank you very much for the clarification.
You imply that a cracked AGB is ‘not good’ on a new engine, but is this ‘not good’ of any greater significance than any component failure on an ETOPS engine?
Is AGB – Auxiliary Gear Box as in Air France Industries KLM Engineering & Maintenance - GE90 On-Wing AGB Replacement (http://www.afiklmem.com/AFIKLMEM/en/g_page_standard/MRO_lab_Innovations/GE90_ON_WING_AGB_REPLACEMENT.html) ?

If the other rollbacks, unrelated to this thread (and to any software error) did not involve ICI, but possibly a sensor, then perhaps the concerns about the potential simultaneous reduction of power are valid and relevant in all flight conditions, not just ETOPS.
I’m not ‘fishing’ here, thus a philosophical closure to this aspect might be sufficient. However, I suspect that many operators would be uneasy with the recent occurrences of apparently related rollbacks, in a highly reliable engine, without explanation.

wanabee777
1st Nov 2015, 09:00
I thought AGB stood for Angled Gear Box.

http://www.peerlessgear.com/storage/pim/plcs/product/1000_Series_Right_Angle_Drives_PL110-0041_product_details.gif

Una Due Tfc
1st Nov 2015, 10:13
I thought AGB stood for Angled Gear Box.

http://www.peerlessgear.com/storage/pim/plcs/product/1000_Series_Right_Angle_Drives_PL110-0041_product_details.gif

Surely you have realised by now that in this industry, the more abbreviations you know by heart, the smarter you are? Especially the ones with two meanings....

OAT: Operational air Traffic

OAT: Outside air temperature

wanabee777
1st Nov 2015, 12:25
True...

I hate it when either one of those AGB's goes T.U.

TURIN
9th Nov 2015, 11:12
Here's me thinking AGB was Accessory Gear Box. :O