ATSB probes 'cosmic rays' link to QF72 A330 jet upset

20th Nov 2009, 20:16

#21 (permalink)

tailstrikecharles

Join Date: Oct 2009

Location: Greece

Age: 84

Posts: 63

Likes: 0

Received 0 Likes on 0 Posts

Quote:

I am not sure, but you would have to think for this to be published (Cosmic Rays) that it is designed to take the blame away from a nasty - potentially catastrophic software/hardware fault within the ADIRUs.

Yet surely, saying that the 330 is subject to random cosmic rays would have to be even less reassuring. If they had said the ADIRU can be replaced due 'this' (ie whatever fault they find) particular hardware fault, then most people would be satisfied - but now the whole jet can be susceptible to complete lack of control from unseen random cosmic rays! FFS...Really?

I can just imagine the punters now (or the random sandwich shop worker interview) "I can't hop on a Q Airbus again now due to cosmic rays"

I agree entirely. If you peeked at the source code you would see how they make certain assumptions (a previous poster alluded to the same, I think; not sure if he had an inside look as well).
In any event, "cosmic rays" is utter bollocks. It's not like they were suddenly invented. Stay tuned.

21st Nov 2009, 20:48

#22 (permalink)

SDFlyer

Join Date: Mar 2008

Location: San Diego

Posts: 60

Likes: 0

Received 0 Likes on 0 Posts

Simple solution for those Oz types: why speculate about such important safety matters when the technology is available (and has been for decades)? Just mandate a Wilson Cloud Chamber in every cockpit, with a detection system linked to the flight computer for instant pilot awareness.

Better yet, wrap the plane with said Cloud Chamber, for more complete coverage.

ayyyyyy ....

21st Nov 2009, 20:57

#23 (permalink)

goldfish85

Join Date: Nov 2009

Location: Near Puget Sound

Age: 86

Posts: 88

Likes: 0

Received 0 Likes on 0 Posts

Before we all talk the cosmic ray theory to death, remember that a great deal of accident investigation is going down many leads to see what didn't happen. When I first began working on NTSB teams, I was surprised by how much effort was spent explaining what couldn't have happened.

After you have discarded all the impossible explanations, whatever is left, no matter how improbable must be the truth --- Sherlock Holmes.

Dick Newman

21st Nov 2009, 21:13

#24 (permalink)

blueloo

Join Date: Oct 2002

Location: In Frozen Chunks (Cloud Cuckoo Land)

Age: 17

Posts: 1,521

Likes: 0

Received 0 Likes on 0 Posts

....But Sherlock Holmes was a figment of someone's imagination, wasn't he?

22nd Nov 2009, 10:25

#25 (permalink)

Rightbase

Join Date: Feb 2008

Location: UK

Posts: 117

Likes: 0

Received 1 Like on 1 Post

But Dick, we haven't ...

... eliminated all the possible explanations.

We are rightly concerned when a pilot dozes off without warning, but when an ADIRU goes into dozing mode it is a non-critical rare event?

In my experience, when a computer goes into doze mode and has to be rebooted, either the hardware failed, something in the software did other than what the programmer intended, or the system design failed to take into account all the possible consequences of all the programmers' different intentions.

Hardware faults caused by cosmic rays should happen at a statistically predictable rate depending on known parameters.

Dozing faults can be caused by software. For example, a process may end up in a tight loop (unintended) or when memory is tight, several processes may end up waiting (intended) for other processes to release memory - and they don't release it (also possibly intended). This type of fault is statistically more likely on computers that run for longer than average between reboots.

If something like dozing can happen, how can we be sure enough that something else other than what is intended will not happen?

22nd Nov 2009, 12:15

#26 (permalink)

cwatters

Join Date: Dec 2001

Location: England

Posts: 1,389

Likes: 0

Received 0 Likes on 0 Posts

One way is to use more sophisticated watchdog timers that check the computer is awake and not spending all its time looping. If the correct actions aren't taken the hardware gets reset (or something less drastic).

22nd Nov 2009, 13:54

#27 (permalink)

ChristiaanJ

Join Date: Jan 2005

Location: France

Posts: 2,315

Likes: 0

Received 0 Likes on 0 Posts

Rightbase,

cwatters has the right answer...

The kind of real-time software in digital autopilots, etc. is very different from 'data-processing' software, be it PCs or mainframes, which is mostly interrupt-driven.

Watchdog timers are small bits of independent hardware which have to be reset at regular intervals (say 100 msec, possibly less). Any fault, software or hardware, that results in the watchdog not being reset in time (such as "hanging up" in a loop, as you mentioned), will promptly produce a failure warning, and cause the computer to disconnect.
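To make the mechanism concrete, here is a minimal simulation of the idea described above. It is only a sketch: a real watchdog is independent hardware on its own clock, and the class name, the 20 ms loop period and the `run` helper are assumptions made up for illustration, not anything taken from an actual autopilot.

```python
class Watchdog:
    """Simulated watchdog: trips if it is not kicked within timeout_ms."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = 0
        self.tripped = False

    def kick(self, now_ms):
        # Healthy software calls this once per control cycle.
        if not self.tripped:
            self.last_kick_ms = now_ms

    def check(self, now_ms):
        # In real equipment this comparison is done by separate hardware.
        if now_ms - self.last_kick_ms > self.timeout_ms:
            self.tripped = True  # latched: raise warning, disconnect computer
        return self.tripped


def run(cycles, hang_at, period_ms=20, timeout_ms=100):
    """Simulate a control loop that stops kicking the dog at cycle `hang_at`.

    Returns the simulated time in ms at which the watchdog trips,
    or None if it never trips. Pass hang_at=None for a healthy run.
    """
    dog = Watchdog(timeout_ms)
    for cycle in range(cycles):
        now = cycle * period_ms
        if hang_at is None or cycle < hang_at:
            dog.kick(now)
        if dog.check(now):
            return now
    return None
```

With these assumed numbers, a loop that hangs at cycle 10 last kicks the dog at 180 ms, so the first check more than 100 ms later is the one that trips.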

CJ

22nd Nov 2009, 14:50

#28 (permalink)

Rightbase

Join Date: Feb 2008

Location: UK

Posts: 117

Likes: 0

Received 1 Like on 1 Post

Quote:

The kind of real-time software in digital autopilots, etc. is very different from 'data-processing' software, be it PCs or mainframes, which is mostly interrupt-driven.

Hmmm ....

OKAY, but from the report:

Quote:

One type of fault event associated with the ADIRU model is known as ‘dozing’. Once ‘dozing’ commences, the ADIRU stops outputting data for the remainder of the flight.

I suggest perhaps not different enough.

22nd Nov 2009, 17:37

#29 (permalink)

fc101

Join Date: Jan 2008

Location: Scandinavia

Posts: 98

Likes: 0

Received 0 Likes on 0 Posts

"Dozing" ... not really a technical term any of my more knowledgeable software engineer friends have heard of. From what they say:

There are without a doubt watchdog timers which reset parts of the system and restore the system into a meaningful and known state - known here means stable.

The way processes in these systems are organised is NOT the same as a home PC but more or less fixed at design-time so timing and other interrelations are known and can be tested for or even proven.

Dozing appears to mean - according to some - that the ADIRU placed itself into a known state where the functions provided are effectively suspended. Why it ended up in such a state is the question - that is, what set of events caused the ADIRU to "fail" in that way. Fail means "fail safe".

As I understand it there are two other ADIRUs and voters - were there failures there as well? Failure of one ADIRU shouldn't cause an upset.
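For anyone unfamiliar with what a "voter" does with three sources, a generic mid-value-select sketch is below. This is the textbook technique, not the A330's actual logic (which this thread does not describe); the function names and the tolerance value are assumptions for illustration.

```python
def median_vote(a, b, c):
    """Mid-value select: return the middle of three readings."""
    return sorted([a, b, c])[1]


def vote_with_flags(values, tolerance):
    """Return the voted value plus the indices of sources that disagree.

    `values` holds three readings of the same parameter (for example AOA
    in degrees); a source is flagged when it differs from the voted value
    by more than `tolerance`.
    """
    voted = median_vote(*values)
    suspects = [i for i, v in enumerate(values) if abs(v - voted) > tolerance]
    return voted, suspects


# One source spiking to 50.6 is outvoted and flagged as source index 1.
result = vote_with_flags([2.1, 50.6, 2.0], tolerance=1.0)
```

The point of the middle value is that a single wild source, high or low, can never be the one selected.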

fc101
E145 driver
--- some text rephrased from sources who know more about safety-critical systems than me.

23rd Nov 2009, 02:58

#30 (permalink)

goldfish85

Join Date: Nov 2009

Location: Near Puget Sound

Age: 86

Posts: 88

Likes: 0

Received 0 Likes on 0 Posts

Of course, I don't mean to suggest that we shouldn't worry about the software code. ADIRUs have a failure rate of the order of 1/1000 to 1/10000. We still need triple redundancy to avoid a catastrophe. We need to ensure that independent computer software errors do not go uncorrected, whether caused by an ADIRU failure or by a cosmic ray upsetting a single bit.

In general, we've done a pretty good job of not having the software make mistakes in calculations. Where we may have fallen short is in writing our requirements to take these ADIRU failures or other single events into account. I was distressed during my previous employment when my boss reacted to the QANTAS upset with "Well, it was only an ADIRU failure" when the response should have been "How could an ADIRU failure make its way through to the flight control surfaces?"

Dick

23rd Nov 2009, 03:27

#31 (permalink)

GarageYears

Join Date: Jun 2009

Location: VA, USA

Age: 58

Posts: 578

Likes: 1

Received 0 Likes on 0 Posts

Hmm, Intel thinks this is a real problem...

From an article in New Scientist (March 2008):

"But Intel thinks we may still be living on borrowed time:

"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade. "

Their patent suggests built-in cosmic ray detectors may be the best option. The detector would either spot cosmic ray hits on nearby circuits, or directly on the detector itself.

When triggered, it could activate error-checking circuits that refresh the nearby memory, repeat the most recent actions, or ask for the last message from outside circuits to be sent again.

But if cosmic ray detectors make it into desktops, would we get to know when they find something? It would be fun to suddenly see a message pop up informing a cosmic ray had been detected. I haven't seen any recent figures on how often they happen, but back in 1996 IBM estimated you would see one a month for every 256MB of RAM."
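The 1996 IBM figure quoted above (one event a month per 256MB of RAM) lends itself to a quick back-of-envelope calculation. The sketch below assumes the rate scales linearly with memory size and that events arrive as a Poisson process; both are simplifying assumptions, and the figure is a desktop estimate, not an avionics one.

```python
import math


def expected_upsets(ram_mb, months, per_month_per_256mb=1.0):
    """Expected number of events, scaling the quoted rate linearly."""
    return per_month_per_256mb * (ram_mb / 256.0) * months


def prob_at_least_one(ram_mb, months):
    """Poisson probability of one or more events in the period."""
    return 1.0 - math.exp(-expected_upsets(ram_mb, months))


# 4 GB of RAM at the quoted rate: 16 expected events per month.
per_month_4gb = expected_upsets(4096, 1)
```

On those assumptions a single 256MB machine has roughly a 63% chance of at least one event in any given month.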

Although I'm not directly involved in aircraft avionics, the problem of cosmic ray effects on computing devices is REAL. Don't dismiss this as goofy pseudo-science - there is a lot of money being spent investigating this.

- GY

23rd Nov 2009, 10:36

#32 (permalink)

JenCluse

Join Date: Jul 2006

Location: Brisbane, Oz

Age: 82

Posts: 46

Likes: 0

Received 0 Likes on 0 Posts

If I may put my 5 cents worth in (used to be a penny)? There is a general misrepresentation of the colloquial term 'cosmic rays'. Did I say anything about the 'media stock phrases and cliches' handbook? Wash my mouth out!

This discussion concerns high energy particles, and a reading of the Wikipedia article on cosmic rays will bring one up to speed.

They are singularities, and although they can occur in 'showers' (read: high incidence), they are problematic, and how much so depends on each individual particle's very variable energy level. They are not just a threat to electronics, but also to DNA and indeed any of your cells.

On the well-known-in-the-trade basis that such a particle can 'take out' an individual electronic component, whether temporarily if low energy or sometimes permanently if high energy, any problem should be an isolated event that can in no way known to wo/man be specifically guarded against, short of using lead wrapping on all boxes.

As another ancient here says, the design must fully guard against any individual failure.

On a related matter, here’s a snippet of information related to Airbus’s design philosophy. I haven’t seen this mentioned since my engineering course on the second lot of free range A320s. (Gosh! Have they been flying for that long?) It was stated then that Airbus went to what I would have thought were excessive pains to diversify the build parameters and supply sources of all duplicated equipment.

We were told by an Airbus rep that duplicate suppliers were given design parameters which they were free to achieve electronically any way they chose, but obviously to tight aviation constraints. The ultimate black boxes. The idea was that a design flaw in one element of the control architecture would be isolated to one item in the control chain by default.

To the best of my recall this philosophy was applied across the entire airframe, and I have been surprised at reports that certain Airbus aircraft have finished up flying with all pitots from the same manufacturer. That certainly was not the original designers' intent.

No doubt the cost of extensive duplication of non-identical but similarly functioning components has attracted the attention of the financial fine tuners. <sigh>

(Written from the future as this appears, the comment re pitots seems rather relevant to the current (20100820) thread mulling the AF447 loss. Amended by JenCluse.)

Last edited by JenCluse; 20th Aug 2011 at 10:58.

23rd Nov 2009, 16:48

#33 (permalink)

vovachan

Join Date: Mar 2009

Location: us

Age: 63

Posts: 206

Likes: 0

Received 0 Likes on 0 Posts

PS I stand corrected:

The flight computer does filter and compare AOA data coming from the 3 ADIRUs, but there is a scenario when it can be fooled:

Quote:

• there were at least two short duration, high amplitude spikes
• the first spike was shorter than 1 second
• the second spike occurred and was still present 1.2 seconds after the detection of the first spike.

During the incident flight the bad ADIRU produced 42 data spikes, 40 of which were caught by the computer; only 2 got through, and those caused the upset. All these dozens of spikes did not make the computer realize the ADIRU was bad.
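To see how a filter that catches 40 spikes can still be fooled by the pattern in the quote, here is a toy model. It is emphatically not the real flight computer algorithm, which this thread does not give. It assumes a filter that, on detecting a spike, holds the last good value for a fixed window and then accepts whatever sample is present when the window expires without re-checking it. The fixed `reference` value, the 0.1 s sample rate and all names are assumptions for illustration.

```python
def filter_aoa(samples, reference, threshold, hold_samples):
    """Toy spike filter with a hold window (illustrative only).

    A sample further than `threshold` from `reference` is treated as a
    spike: the last valid value is output for the next samples, and the
    sample arriving `hold_samples` after detection is accepted unchecked
    (the ASSUMED weak point of this toy design).
    """
    out = []
    last_valid = reference
    hold_left = 0  # samples remaining in the current hold window
    for s in samples:
        if hold_left > 1:
            hold_left -= 1
            out.append(last_valid)      # still holding the old value
        elif hold_left == 1:
            hold_left = 0
            out.append(s)               # window expired: accepted unchecked
        elif abs(s - reference) > threshold:
            hold_left = hold_samples    # spike detected: start holding
            out.append(last_valid)
        else:
            last_valid = s
            out.append(s)
    return out


# Assumed 0.1 s sampling, so a 1.2 s hold window is 12 samples.
one_spike = [2.0] * 5 + [50.0] + [2.0] * 20
two_spikes = [2.0] * 5 + [50.0] + [2.0] * 11 + [50.0] + [2.0] * 8

clean = filter_aoa(one_spike, reference=2.0, threshold=5.0, hold_samples=12)
fooled = filter_aoa(two_spikes, reference=2.0, threshold=5.0, hold_samples=12)
```

In this toy the single spike is masked completely, while a second spike that happens to be present exactly as the hold window expires passes straight through to the output.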

23rd Nov 2009, 18:45

#34 (permalink)

CONF iture

Join Date: Jan 2005

Location: W of 30W

Posts: 1,916

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by Dick Newman

I was distressed during my previous employment when my boss reacted to the QANTAS upset with "Well, it was only an ADIRU failure" when the response should have been "How could an ADIRU failure make its way through to the flight control surfaces"

Especially when the aircraft is in a nice level flight with AP selected OFF and sidesticks NOT solicited ...

23rd Nov 2009, 21:20

#35 (permalink)

xetroV

Join Date: Jan 2005

Location: Europe

Posts: 260

Likes: 0

Received 4 Likes on 2 Posts

Quote:

Originally Posted by Lookleft

xetroV- The 777 does indeed only have one ADIRU unit but that unit consists of multiple accelerometers and laser gyros. This redundancy within the unit didn't prevent an incident to an MAS 777 doing something very similar to the QF incident, also off the coast of WA. I don't think it's a problem unique to one manufacturer or another but an indication of the lack of understanding of how software interacts.

Thanks for that information, makes sense! Interesting to know that an "ADIRU" in one aircraft may be much more elaborate than an "ADIRU" in another. Apart from the obvious difference with the A330, it should be noted that Boeing itself uses dual (or optionally triple) ADIRUs with less built-in redundancy in their own B737 NGs. Strange that nobody at Boeing thought of a more sexy acronym for the B777. AADIRU, anyone?

I agree with your statement about software interaction. This is becoming increasingly important as more and more aircraft systems are being integrated and interconnected, while at the same time the required navigation performance and vertical separation are continuously being reduced, as the skies get busier. At the very least, accurate and quick internal error detection algorithms should provide smooth systems degradation that is immediately obvious and totally transparent to the flight crew. Sudden uncommanded autopilot upsets are not what I call "fail passive" (let alone "fail safe").

Last edited by xetroV; 24th Nov 2009 at 20:41.

24th Nov 2009, 00:02

#36 (permalink)

Pugilistic Animus

Join Date: Dec 2006

Location: The No Transgression Zone

Posts: 2,483

Likes: 1

Received 5 Likes on 3 Posts

A&WST had an article about the criteria for 'space hardened' electronics; if I can remember to look I'll post it. But CJ has given me a lot to think about.

24th Nov 2009, 06:28

#37 (permalink)

cwatters

Join Date: Dec 2001

Location: England

Posts: 1,389

Likes: 0

Received 0 Likes on 0 Posts

Quote:

It's very important to know if your error correction circuit is being triggered and take some action. Some decades ago I worked for a company that made "mini computers". These had error correcting memory boards. Some two year old units sent back for repair were discovered to have been incorrectly manufactured. This led to the identification of a whole batch that had been built with one of the memory chips in backwards....and the error correcting circuit had been correcting the consequences for two years.
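A small sketch of that lesson: error correction that also counts how often it fires. The Hamming(7,4) code used here is a standard textbook single-error-correcting code standing in for whatever those memory boards actually used, and the `MonitoredMemory` class and its threshold are hypothetical names for illustration.

```python
def encode(data_bits):
    """Hamming(7,4): 4 data bits -> 7-bit codeword (parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, corrected); fixes any single flipped bit."""
    c = list(codeword)  # work on a copy, stored word is left untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit
    corrected = syndrome != 0
    if corrected:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], corrected


class MonitoredMemory:
    """ECC memory that counts corrections instead of hiding them."""

    def __init__(self, alarm_threshold):
        self.words = {}
        self.corrections = 0
        self.alarm_threshold = alarm_threshold

    def write(self, addr, data_bits):
        self.words[addr] = encode(data_bits)

    def read(self, addr):
        data, corrected = decode(self.words[addr])
        if corrected:
            self.corrections += 1  # the part the old boards lacked
        return data

    def needs_maintenance(self):
        return self.corrections >= self.alarm_threshold
```

A persistent fault, like the backwards chip, shows up here as a correction on every read, so the counter climbs quickly even though every read still returns the right data.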

24th Nov 2009, 21:00

#38 (permalink)

Nemrytter

Join Date: Apr 2008

Location: .

Posts: 309

Likes: 0

Received 0 Likes on 0 Posts

Quote:

A&WST had an article about the criteria for 'space hardened' electronics; if I can remember to look I'll post it. But CJ has given me a lot to think about.

The criteria are incredibly strict, you can find them here:
ECSS-E-ST-40C - Software general requirements
ECSS-Q-ST-60C Rev.1 - Electrical, electronic and electromechanical (EEE) components
Although you have to log in to see them (registration is free).

Slightly more generalised versions are online here: https://escies.org/ReadArticle?docId=167

25th Nov 2009, 14:18

#39 (permalink)

vovachan

Join Date: Mar 2009

Location: us

Age: 63

Posts: 206

Likes: 0

Received 0 Likes on 0 Posts

Quote:

It's very important to know if your error correction circuit is being triggered and take some action.

I agree. Right now it seems like they have the worst of all possible worlds: an error correcting system which is not 100% foolproof, a computer which can override the pilot and start flying the plane based on erroneous AOA inputs, and a pilot who sits there and has no clue what's going on.

25th Nov 2009, 14:36

#40 (permalink)

jimworcs

Join Date: Oct 2004

Location: England

Age: 65

Posts: 87

Likes: 0

Received 0 Likes on 0 Posts

I am not a pilot, just an interested observer. Is this incident relevant to what happened on the Qantas aircraft?

Incident: US Airways A333 over Atlantic on Nov 17th 2009, computer issues
By Simon Hradecky, created Friday, Nov 20th 2009 14:30Z, last updated Friday, Nov 20th 2009 14:30Z

A US Airways Airbus A330-300, flight US-740 from Philadelphia, PA (USA) to Madrid, SP (Spain), was enroute at FL390 about 350nm east of Philadelphia overhead the Atlantic about 40 minutes into the flight, when the crew announced they needed to return and was cleared to turn to the left. About 40 seconds later during the turn the crew declared emergency and requested to descend. About another 5 minutes later while levelling at FL300 the crew reported that everything had returned to normal, explaining that they had experienced computer problems they were unable to resolve and they had been "missing control". The emergency was cancelled, the airplane continued back to Philadelphia. The airplane landed safely on Philadelphia's runway 09R about 75 minutes after the onset of trouble.