View Full Version : ATSB probes 'cosmic rays' link to QF72 A330 jet upset


BorneoFly
18th Nov 2009, 00:45
This was reported in the West Australian newspaper today as "breaking news". I'm not a pilot nor involved in the aviation industry but merely a passenger with more than a passing interest in aviation. My question to those in the know is, "Is this possible, feasible or just another ho hum theory"?:confused:

grizzled
18th Nov 2009, 01:42
Another Factual Report was released by the ATSB today. Here is the link:

MEDIA RELEASE : 18 November 2009 - 2009/16: ATSB Second Interim Factual Report into the Qantas Airbus A330-303 in-flight upset, 154 km west of Learmonth WA, on 7 October 2008 (http://www.atsb.gov.au/newsroom/2009/release/2009_16.aspx)

jcjeant
18th Nov 2009, 02:16
Hi,

That's very interesting ...
I wonder when the International Space Station will go upside down .. and fall on the Earth ? :}
But maybe there they know .. and have some protections ? :rolleyes:

Rolling-Thunderbird
18th Nov 2009, 02:40
To save others the time....link to article

ATSB probes 'cosmic rays' link to Qantas jet plunge - The West Australian (http://au.news.yahoo.com/thewest/a/-/newshome/6486629/atsb-probes-cosmic-rays-link-to-qantas-jet-plunge/)

Deaf
18th Nov 2009, 03:33
"Is this possible, feasible or just another ho hum theory"?

Crudely:

We want to do more calculations so need more transistors running faster = more power
more power = more heat
To keep everything working OK the individual transistors must be smaller.
A side effect of smaller transistors is they are affected more by radiation and are more likely to flip a bit.

What happens next depends on which bit is flipped. It can be short term if the bit is in RAM, and recovered by rebooting; alternatively it can be long term if the bit is in ROM or flash, affecting the program or stored data, e.g. locations.

It is known to be a serious problem for space applications and special chips are used although they don't have the computing power of newer consumer type chips. The RCA 1802 chips in Voyager have outlasted RCA.

TWT
18th Nov 2009, 05:09
Link to report

Interim Factual Report No.2 (http://www.atsb.gov.au/media/748444/ao2008070_ifr_2.pdf)

training wheels
18th Nov 2009, 05:37
My question to those in the know is, "Is this possible, feasible or just another ho hum theory"?:confused:

Well, I guess that's why the ATSB is conducting the investigation, to see whether it's possible, feasible or a ho hum theory.

dkaarma
18th Nov 2009, 05:47
hahaha.. and everybody said I was crazy with my tin foil hat!

On a serious note, it will be interesting to read the report after they finish this tangent of their investigation. I would be flabbergasted if a solar flare on the sun could bring an aircraft down..

Putting my tin foil hat back on... I always knew CASA/ATSB were apologists for Qantas... but attributing an incident to cosmic forces would be an interesting low! :hmm:

Nemrytter
18th Nov 2009, 06:28
I wonder when the International Space Station will go upside down .. and fall on the Earth ?

The ISS is heavily affected by cosmic ray interactions, but as it has triplicated backups for everything this normally isn't a problem and the affected bit of kit just resets itself. The only problem noticeable to the crew is when it happens to one of their laptop computers.

I assumed that aircraft would have similar protection in place to the ISS with regards to multiple backup flight control systems, but if they're engineered differently then cosmic rays could still pose a problem. Not particularly likely though, the odds of a cosmic ray hitting the wrong thing are spectacularly tiny.

cwatters
18th Nov 2009, 06:32
Los Alamos Helps Industry By Simulating Circuit Failures From Cosmic Rays (http://www.spacedaily.com/news/cosmicrays-04d.html)

Los Alamos Simulates Circuit Failures From Cosmic Rays....snip...

"We can't fully predict the effect of these interactions, which makes having a standardized way to test circuits extremely valuable," Wender said. "Very similar devices show radically different failure rates due to neutron interactions, and we have some evidence that the smaller transistors and lower operating voltages in newer devices produce higher failure rates....snip....

In the case of the latest, totally computer-controlled aircraft, these tiny cosmic gremlins could cause trouble, especially because the problem gets worse as atmospheric shielding dwindles at higher altitudes. At sea level, the shielding provided by the air is equivalent to more than ten feet of concrete shielding. The neutron flux at LANSCE, 7,000 feet above sea level is approximately three times greater than at sea level; and at 40,000 feet, the cosmic-ray neutron flux is several hundred times greater than the neutron flux seen on the earth's surface....snip...

The Laboratory and NASA recently placed a complete aircraft control system in the LANSCE beam and linked it locally with a computer simulation for a Boeing 737.

A future experiment will examine whether pilots can compensate for control system upsets during simulated flight, by remotely linking a computer undergoing tests in the ICE House to the flight simulator located in the NASA System Airframe Failure Emulation Testing and Integration Laboratory at the Langley Research Center. Los Alamos is collaborating in NASA's development of the SAFETI Laboratory, with networked links to individual NASA labs for aircraft structures, cockpit motion and propulsion systems


"Cosmic rays cause computer downtime" (very technical slideshow)..
http://www.ewh.ieee.org/r6/scv/rl/articles/ser-050323-talk-ref.pdf

Taildragger67
18th Nov 2009, 06:42
dkaarma,

I always knew CASA/ATSB were apologists for Qantas

CASA maybe - but I think there are numerous instances where the ATSB has called a spade a spade with respect to Qantas. Aspects of the airline certainly didn't come out looking too good after the golf buggy incident.

goldfish85
18th Nov 2009, 23:11
Actually, this isn't too far fetched. Many years ago, in a previous life, we had a civil computer being adapted to a military program. One of the tests was to change every bit in the program code from a one to a zero or vice versa and ensure that nothing bad would happen. (The system halting in this test was not considered "bad.")
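
A rough idea of what such a flip-every-bit test looks like, as a minimal Python sketch. It is not the test described above: the real test would have run the actual program, whereas here a hypothetical loader simply checks a CRC-32 and halts on a mismatch. The names `flip_bit`, `load_program` and `exhaustive_flip_test`, and the 64-byte stand-in image, are all made up for illustration.

```python
import zlib


def flip_bit(image: bytes, bit_index: int) -> bytes:
    """Return a copy of image with exactly one bit inverted."""
    corrupted = bytearray(image)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def load_program(image: bytes, expected_crc: int) -> str:
    """Hypothetical loader: halt instead of running when the CRC does not match."""
    if zlib.crc32(image) != expected_crc:
        return "halt"
    return "run"


def exhaustive_flip_test(image: bytes) -> dict:
    """Flip every bit in turn and tally how the loader reacts."""
    expected_crc = zlib.crc32(image)
    outcomes = {"halt": 0, "run": 0}
    for bit_index in range(len(image) * 8):
        outcomes[load_program(flip_bit(image, bit_index), expected_crc)] += 1
    return outcomes


program_image = bytes(range(64))  # stand-in for real program code (assumption)
results = exhaustive_flip_test(program_image)
print(results)
```

Because a CRC catches every single-bit error, all 512 corrupted images end in "halt" and none is allowed to run, which matches the "halting is not bad" acceptance rule.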

I've read the material on the QANTAS A-330 upset and am favorably impressed with the ATSB's work in this case.

Dick Newman

vovachan
18th Nov 2009, 23:50
This is all very cute but beside the point. The thing is any component may fail and the Airbus has more than one ADIRU. A failure of a single component whether caused by cosmic rays or little green men should never lead to near-catastrophic results. The computer should have been able to detect an ADIRU disagree and identify the bad data, or if not possible just discard the AOA data altogether.
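
The kind of cross-check meant here can be sketched as a generic mid-value select over three sources. This is only an illustration of the principle, not the A330's actual logic; the function name `vote`, the 1-degree threshold and the sample numbers are assumptions.

```python
import statistics


def vote(values, threshold):
    """Mid-value select over three sources; list any source far from the median."""
    mid = statistics.median(values)
    suspects = [i for i, v in enumerate(values) if abs(v - mid) > threshold]
    return mid, suspects


aoa_deg = [50.6, 2.1, 2.3]  # made-up readings: source 0 is spiking
selected, suspects = vote(aoa_deg, threshold=1.0)
print(selected, suspects)
```

With three sources the median ignores a single wild value, and the list of suspects tells the system which unit to distrust.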

What happened is simply unacceptable

xetroV
19th Nov 2009, 18:47
"the Airbus has more than one ADIRU"

Interesting. The B777 has only one ADIRU, plus a secondary unit called SAARU. The latter will take over attitude and airdata indications (but no lat/long IRS data) in case of a(n) (partial) ADIRU failure, but I wonder how Boeing has solved the problem of identifying an ADIRU failure in the first place. Majority vote is not an option with only one unit installed, is it?

Or does the B777's ADIRU have more elaborate internal fault recognition capabilities and more built-in redundancy than the Airbus's? (I can imagine that it is probably merely a matter of defining what components constitute a 'unit', but then the use of the same acronym is sort of puzzling.)

Sorry for being off-topic...

ChristiaanJ
19th Nov 2009, 20:12
vovachan and xetroV,
You have me scratching my head....

I'm an ancient from the Concorde era, when the integrated circuits were still so huge that a single cosmic ray, neutron or alpha particle couldn't really upset the electronics.

But components could fail. So (to stay with flight control computers, analog in those far-off days), each computer had two virtually identical channels, dubbed "command" and "monitor", and comparators all down the chain (and those were duplicated too) checked that "C" and "M" told exactly the same story. If they didn't ... "boing" and the computer disengaged. Then, on the other side of the aircraft, a second computer, until then in standby, would take over.
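
For the digitally minded, the command/monitor idea can be sketched in a few lines of Python. This is a toy model only: the class name, the tolerance and the sample values are invented, and the real thing was analog hardware with duplicated comparators.

```python
from dataclasses import dataclass


@dataclass
class DualChannelComputer:
    tolerance: float       # largest allowed |command - monitor| difference
    engaged: bool = True

    def step(self, command: float, monitor: float):
        """Return the command output, or None once the comparator has tripped."""
        if self.engaged and abs(command - monitor) > self.tolerance:
            self.engaged = False  # latched: the computer stays disengaged
        return command if self.engaged else None


def fly(primary, standby, samples):
    """Use the primary computer until it disengages, then the standby."""
    outputs = []
    for (c1, m1), (c2, m2) in samples:
        out = primary.step(c1, m1)
        if out is None:
            out = standby.step(c2, m2)
        outputs.append(out)
    return outputs


primary = DualChannelComputer(tolerance=0.5)
standby = DualChannelComputer(tolerance=0.5)
samples = [
    ((1.0, 1.0), (1.0, 1.0)),
    ((5.0, 1.0), (1.1, 1.1)),  # primary "C" and "M" disagree: "boing"
    ((1.2, 1.2), (1.3, 1.3)),
]
outputs = fly(primary, standby, samples)
print(outputs, primary.engaged)
```

The point of the latch is that a computer which has disagreed with itself once is never trusted again in flight, even if its two channels later agree.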

Checks for passive failures (like a comparator failing "healthy") were dealt with by preflight BITE (built-in tests), and some of those tests were repeated just before an autoland, reducing the "period at risk" to only minutes.

I wasn't directly involved in the earliest DAFS, but by looking over their shoulder, I saw most of the same principles were applied.

So what's been going on since?
Even with ROMs, RAMs and everything else today being far far smaller, I still would think the probability of two particles hitting the same spot in two "halves" of a system provoking an identical spike, that then would be missed by the comparators, would be infinitesimal.

So has there been a fundamental change in architecture?
And wouldn't that kind of change be equally bad at catching component and software failures as "cosmic ray" events?

I can see the line of thinking of the ATSB... I would have been tempted too, if everything afterwards worked perfectly, and there was no way of reproducing the fault. But there still seems to be something wrong with that reasoning....

CJ

Lookleft
19th Nov 2009, 20:52
xetroV- The 777 does indeed only have one ADIRU unit but that unit consists of multiple accelerometers and laser gyros. This redundancy within the unit didn't prevent an incident to an MAS 777 doing something very similar to the QF incident, also off the coast of WA. I don't think it's a problem unique to one manufacturer or another but an indication of the lack of understanding of how software interacts.

steamchicken
20th Nov 2009, 17:27
"Cosmic ray" = IT equivalent of "Gremlins", i.e. joking term for an unexplained failure and especially one caused by human factors.

vovachan
20th Nov 2009, 18:21
vovachan and xetroV,
You have me scratching my head....

Me too. The Airbus had 3 ADIRUs. That's why, while the readouts on one side were all over the place, the other pilot's screen was showing perfectly fine: they are each fed by their own independent units. However the guy on the bad side was flying the plane, and the good ADIRU and the standby unit had no effect whatsoever. Now the question is how come, since there are 3 redundant ADIRUs on board, there is no cross-checking of data between them?

Also electronics are prone to transient failures. You take it to the repair shop, they plug it in and it works fine.

ChristiaanJ
20th Nov 2009, 18:59
OK, ancient here again....

In my days, when two 'halves' of a computer disagreed, it was "ping", "boing", "click", and the (analogue) computer took itself off-line, with usually a blinking light on the CWS (central warning system) as well, and handed over to the pilot, who then had the choice of staying in manual, or engaging the standby on the 'other side'.
Only during the last minutes of an autoland, the failed computer would hand over automatically to n° 2, which would already be synchronised, and would already have been tested and found healthy.

It woiked well, mostly because the probability of two identical components on two sides failing in the same way within a few minutes could be shown to be in the order of 10^-9 to 10^-12, depending on the "time at risk".
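
To show where numbers of that size come from, here is a back-of-envelope sketch in Python. It assumes two independent components with a constant failure rate; the rate of 1e-4 per hour and the 5-minute window are illustrative assumptions, not figures from any certification document.

```python
import math


def p_fail(rate_per_hour: float, hours: float) -> float:
    """Probability that one component fails within the window (constant rate)."""
    return -math.expm1(-rate_per_hour * hours)


def p_both_fail(rate_per_hour: float, hours: float) -> float:
    """Probability that two independent, identical components both fail."""
    return p_fail(rate_per_hour, hours) ** 2


time_at_risk_h = 5 / 60  # five minutes, e.g. the last part of an autoland (assumed)
p = p_both_fail(1e-4, time_at_risk_h)
print(f"{p:.2e}")
```

With these made-up inputs the result is about 7e-11. Since it scales with the square of the time at risk, shortening the window by a factor of ten buys a factor of a hundred.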

From the little I know about DAFS, much the same was achieved initially with the two 'halves' using different processors, different languages for the software, and different compilers.
Sure, if the software spec was wrong, there could still be problems, but that was no different in the analogue-and-logic world.

So what's happened since?
Leaving a computer in control of an aircraft while responding to "data spikes" gives me the cold shivers...... yet that seems what has been happening....

Can anybody elucidate....?

CJ

blueloo
20th Nov 2009, 20:11
"Cosmic ray" = IT equivalent of "Gremlins", i.e. joking term for an unexplained failure and especially one caused by human factors.


I am not sure, but you would have to think for this to be published (Cosmic Rays) that it is designed to take the blame away from a nasty - potentially catastrophic software/hardware fault within the ADIRUs.

Yet surely, saying that the 330 is subject to random cosmic rays would have to be even less reassuring. If they had said the ADIRU can be replaced due 'this' (ie whatever fault they find) particular hardware fault, then most people would be satisfied - but now the whole jet can be susceptible to complete lack of control from unseen random cosmic rays! FFS...Really?

I can just imagine the punters now (or the random sandwich shop worker interview) "I can't hop on a Q airbus again now due to cosmic rays"

tailstrikecharles
20th Nov 2009, 20:16
I am not sure, but you would have to think for this to be published (Cosmic Rays) that it is designed to take the blame away from a nasty - potentially catastrophic software/hardware fault within the ADIRUs.

Yet surely, saying that the 330 is subject to random cosmic rays would have to be even less reassuring. If they had said the ADIRU can be replaced due 'this' (ie whatever fault they find) particular hardware fault, then most people would be satisfied - but now the whole jet can be susceptible to complete lack of control from unseen random cosmic rays! FFS...Really?

I can just imagine the punters now (or the random sandwich shop worker interview) "I can't hop on a Q airbus again now due to cosmic rays"

I agree entirely. If you peeked at the source code you would see how they make certain assumptions (a previous poster alluded to the same, I think; not sure if he had inside looks as well).
In any event, "cosmic rays" is utter bullocks. It's not like they were suddenly invented. Stay tuned

SDFlyer
21st Nov 2009, 20:48
Simple solution for those Oz types: why speculate about such important safety matters when the technology is available (and has been for decades) ? Just mandate a Wilson Cloud Chamber in every cockpit, with detection system linked to the flight computer for instant pilot awareness.

Better yet, wrap the plane with said Cloud Chamber, for more complete coverage.

ayyyyyy ....
:ugh:

goldfish85
21st Nov 2009, 20:57
Before we all talk the cosmic ray theory to death, remember that a great deal of accident investigation is going down many leads to see what didn't happen. When I first began working on NTSB teams, I was surprised by how much effort was spent explaining what couldn't have happened.

After you have discarded all the impossible explanations, whatever is left, no matter how improbable must be the truth --- Sherlock Holmes.


Dick Newman

blueloo
21st Nov 2009, 21:13
....But Sherlock Holmes was a figment of someone's imagination, wasn't he? :}

Rightbase
22nd Nov 2009, 10:25
... eliminated all the possible explanations.

We are rightly concerned when a pilot dozes off without warning, but when an ADIRU goes into dozing mode it is a non-critical rare event?

In my experience, when a computer goes into doze mode and has to be rebooted, either the hardware failed, something in the software did other than what the programmer intended, or the system design failed to take into account all the possible consequences of all the programmers' different intentions.

Hardware faults caused by cosmic rays should happen at a statistically predictable rate depending on known parameters.

Dozing faults can be caused by software. For example, a process may end up in a tight loop (unintended) or when memory is tight, several processes may end up waiting (intended) for other processes to release memory - and they don't release it (also possibly intended). This type of fault is statistically more likely on computers that run for longer than average between reboots.

If something like dozing can happen, how can we be sure enough that something else other than what is intended will not happen?

cwatters
22nd Nov 2009, 12:15
One way is to use more sophisticated watchdog timers that check the computer is awake and not spending all its time looping. If the correct actions aren't taken the hardware gets reset (or something less drastic).

ChristiaanJ
22nd Nov 2009, 13:54
Rightbase,

cwatters has the right answer...

The kind of real-time software in digital autopilots, etc. is very different from 'data-processing' software, be it PCs or mainframes, which is mostly interrupt-driven.

Watchdog timers are small bits of independent hardware which have to be reset at regular intervals (say 100 msec, possibly less). Any fault, software or hardware, that results in the watchdog not being reset in time (such as "hanging up" in a loop, as you mentioned), will promptly produce a failure warning, and cause the computer to disconnect.
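
As an illustration only, the watchdog principle can be simulated in Python with an explicit clock. A real watchdog is separate hardware; the class and function names, the 10 msec tick and the kick schedule below are all assumptions made for the sketch.

```python
class Watchdog:
    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = 0
        self.tripped = False

    def kick(self, now_ms: int) -> None:
        """Called by healthy software to restart the countdown."""
        if not self.tripped:
            self.last_kick_ms = now_ms

    def check(self, now_ms: int) -> bool:
        """Called by the independent timer; latches on a missed deadline."""
        if now_ms - self.last_kick_ms > self.timeout_ms:
            self.tripped = True
        return self.tripped


def run(kick_times_ms, end_ms, timeout_ms=100, tick_ms=10):
    """Return the time at which the computer disconnects, or None if it never does."""
    dog = Watchdog(timeout_ms)
    kicks = set(kick_times_ms)
    for now in range(0, end_ms + 1, tick_ms):
        if now in kicks:
            dog.kick(now)
        if dog.check(now):
            return now
    return None


healthy = run(range(0, 501, 50), end_ms=500)  # software kicks every 50 msec
hung = run([0, 50, 100], end_ms=500)          # software hangs after 100 msec
print(healthy, hung)
```

The healthy run never trips, while the hung run disconnects at 210 msec, the first tick more than 100 msec after the last kick.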

CJ

Rightbase
22nd Nov 2009, 14:50
The kind of real-time software in digital autopilots, etc. is very different from 'data-processing' software, be it PCs or mainframes, which is mostly interrupt-driven.

Hmmm ....

OKAY, but from the report:

One type of fault event associated with the ADIRU model is known as ‘dozing’. Once ‘dozing’ commences, the ADIRU stops outputting data for the remainder of the flight.

I suggest perhaps not different enough.

fc101
22nd Nov 2009, 17:37
"Dozing" ... not really a technical term any of my more knowledgeable software engineer friends have heard of. From what they say:

There are without a doubt watchdog timers which reset parts of the system and restore the system into a meaningful and known state - known here means stable.

The way processes in these systems are organised is NOT the same as a home PC but more or less fixed at design-time so timing and other interrelations are known and can be tested for or even proven.

Dozing appears to mean - according to some - that the ADIRU placed itself into a known state where the functions provided are effectively suspended. Why it ended up in such a state is the question - that is, what set of events caused the ADIRU to "fail" in that way. Fail means "fail safe".

As I understand it, there are two other ADIRUs and voters. Were there failures there as well? Failure of one ADIRU shouldn't cause an upset.

fc101
E145 driver
--- some text rephrased from sources who know more about safety-critical systems than me.

goldfish85
23rd Nov 2009, 02:58
Of course, I don't mean to suggest that we shouldn't worry about the software code. ADIRUs have a failure rate of the order between 1/1000 and 1/10000. We still need triple redundancy to avoid a catastrophe. We need to ensure that independent computer software errors do not go uncorrected, whether caused by an ADIRU failure or by a cosmic ray upsetting a single bit.

In general, we've done a pretty good job of not having the software make mistakes in calculations. Where we may have fallen short is in writing our requirements to take these ADIRU failures or other single events into account. I was distressed during my previous employment when my boss reacted to the QANTAS upset with "Well, it was only an ADIRU failure" when the response should have been "How could an ADIRU failure make its way through to the flight control surfaces?"

Dick

GarageYears
23rd Nov 2009, 03:27
From an article in New Scientist (March 2008):

"But Intel thinks we may still be living on borrowed time:"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade. "
Their patent suggests built-in cosmic ray detectors may be the best option. The detector would either spot cosmic ray hits on nearby circuits, or directly on the detector itself.

When triggered, it could activate error-checking circuits that refresh the nearby memory, repeat the most recent actions, or ask for the last message from outside circuits to be sent again.

But if cosmic ray detectors make it into desktops, would we get to know when they find something? It would be fun to suddenly see a message pop up informing a cosmic ray had been detected. I haven't seen any recent figures on how often they happen, but back in 1996 IBM estimated you would see one a month for every 256MB of RAM."

Although I'm not directly involved in aircraft avionics, the problem of cosmic ray effects on computing devices is REAL. Don't dismiss this as goofy pseudo-science - there is a lot of money being spent investigating this.
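
For a feel of what that 1996 IBM figure implies, here is a trivial Python scaling sketch. It assumes the rate grows linearly with memory size and that the 1996 estimate still applied, both of which are assumptions for illustration only; the 4 GB machine is a made-up example.

```python
def expected_upsets_per_month(ram_mb: float, per_256mb: float = 1.0) -> float:
    """Scale the quoted estimate of one upset per month per 256MB of RAM."""
    return ram_mb / 256.0 * per_256mb


def days_between_upsets(ram_mb: float, days_per_month: float = 30.0) -> float:
    """Average spacing between upsets, assuming a 30-day month."""
    return days_per_month / expected_upsets_per_month(ram_mb)


ram_mb = 4096  # hypothetical 4 GB machine
print(expected_upsets_per_month(ram_mb), days_between_upsets(ram_mb))
```

On those assumptions a 4 GB machine would see 16 upsets a month, or one roughly every two days.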

- GY :ooh:

JenCluse
23rd Nov 2009, 10:36
If I may put my 5 cents worth in (used to be a penny)? There is a general misrepresentation of the colloquial term 'cosmic rays'. Did I say anything about the 'media stock phrases and cliches' handbook? Wash my mouth out!

This discussion concerns high energy particles, and a reading of Cosmic ray - Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Cosmic_rays) will bring one up to speed.

They are singularities, and although they can occur in 'showers', read high_incidence_of, they *are problematic, and how much so depends on each individual particle's very variable energy level. They are not just a threat to electronics, but also to DNA and indeed any of your cells.

On the well-known-in-the-trade basis that such a particle can 'take out' an individual electronic component, whether temporarily if low energy or sometimes permanently if high energy, any problem should be an isolated event that can in no way known to wo/man be specifically guarded against, short of using lead wrapping on all boxes.

As another ancient here says, the design must fully guard against any individual failure.

On a related matter, here's a snippet of information related to Airbus's design philosophy. I haven't seen this mentioned since my engineering course on the second lot of free range A320s. (Gosh! Have they been flying for *that long?) It was stated then that Airbus went to what I would have thought were excessive pains to diversify the build parameters and supply sources of all duplicated equipment.

We were told by an Airbus rep that duplicate suppliers were given design parameters which they were free to achieve electronically anyway they chose, but obviously to tight aviation constraints. The ultimate black boxes. The idea was that a *design flaw in one element of the control architecture would be isolated to one item in the control chain by default.

To the best of my recall this philosophy was applied across the entire airframe, and I have been surprised at reports that certain Airbus aircraft have finished up flying with all pitots from the same manufacturer. That certainly was not the original designers' intent.

No doubt the cost of extensive duplication of non-identical but similarly functioning components has attracted the attention of the financial fine tuners. <sigh>

(Written from the future as this appears, the comment re pitots seems rather relevant to the current (20100820) thread mulling the AF447 loss. Amended by Jencluse.)

vovachan
23rd Nov 2009, 16:48
PS I stand corrected:

The flight computer does filter and compare AOA data coming from the 3 ADIRUs, but there is a scenario when it can be fooled:

• there were at least two short duration, high amplitude spikes
• the first spike was shorter than 1 second
• the second spike occurred and was still present 1.2 seconds after the detection of the first spike.

During the incident flight the bad ADIRU produced 42 data spikes, 40 of which were caught by the computer; the other 2 caused the upset. All these dozens of spikes did not make the computer realize the ADIRU was bad.
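
To see how a filter can be fooled by exactly that pattern, here is a toy model in Python. It is emphatically not the real flight computer algorithm; the report linked earlier in the thread is the authority on that. The hold-then-resample behaviour, the 0.1 second sample period, the 5 degree step limit and all the sample values are assumptions chosen only to reproduce the three conditions listed above.

```python
def filter_aoa(samples, max_step=5.0, hold_samples=12):
    """Toy spike filter. samples are AOA readings at an assumed 0.1 s period.

    On a sudden jump the last good value is held for hold_samples (1.2 s);
    when the hold expires the current input is trusted without re-checking.
    """
    output = []
    last_good = samples[0]
    hold_left = 0
    for value in samples:
        if hold_left > 0:
            hold_left -= 1
            if hold_left == 0:
                last_good = value     # end of hold: input accepted as-is
        elif abs(value - last_good) > max_step:
            hold_left = hold_samples  # spike detected: freeze on last good value
        else:
            last_good = value
        output.append(last_good)
    return output


one_spike = [2.0] * 5 + [50.0] * 3 + [2.0] * 20
two_spikes = [2.0] * 5 + [50.0] * 3 + [2.0] * 8 + [50.0] * 3 + [2.0] * 10

out_one = filter_aoa(one_spike)
out_two = filter_aoa(two_spikes)
print(max(out_one), max(out_two))
```

A single short spike is masked completely, but a second spike that happens to be present when the 1.2 second hold expires is passed straight through as if it were good data.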

CONF iture
23rd Nov 2009, 18:45
I was distressed during my previous employment when my boss reacted to the QANTAS upset with "Well, it was only an ADIRU failure" when the response should have been "How could an ADIRU failure make its way through to the flight control surfaces"
Especially when the aircraft is in a nice level flight with AP selected OFF and sidesticks NOT solicited ...

xetroV
23rd Nov 2009, 21:20
xetroV- The 777 does indeed only have one ADIRU unit but that unit consists of multiple accelerometers and laser gyros. This redundancy within the unit didn't prevent an incident to an MAS 777 doing something very similar to the QF incident, also off the coast of WA. I don't think it's a problem unique to one manufacturer or another but an indication of the lack of understanding of how software interacts.
Thanks for that information, makes sense! Interesting to know that an "ADIRU" in one aircraft may be much more elaborate than an "ADIRU" in another. Apart from the obvious difference with the A330, it should be noted that Boeing itself uses dual (or optionally triple) ADIRUs with less built-in redundancy in their own B737 NGs. Strange that nobody at Boeing thought of a more sexy acronym for the B777. AADIRU, anyone? ;)

I agree with your statement about software interaction. This is becoming increasingly important as more and more aircraft systems are being integrated and interconnected, while at the same time the required navigation performance and vertical separation are continuously being reduced, as the skies get busier. At the very least, accurate and quick internal error detection algorithms should provide smooth systems degradation that is immediately obvious and totally transparent to the flight crew. Sudden uncommanded autopilot upsets are not what I call "fail passive" (let alone "fail safe").

Pugilistic Animus
24th Nov 2009, 00:02
AW&ST had an article about the criteria for 'space hardened' electronics; if I can remember to look I'll post it -- but CJ has given me a lot to think about :)

PA

cwatters
24th Nov 2009, 06:28
During the incident flight the bad ADIRU produced 42 data spikes, 40 of which were caught by the computer; the other 2 caused the upset. All these dozens of spikes did not make the computer realize the ADIRU was bad.

It's very important to know if your error correction circuit is being triggered and take some action. Some decades ago I worked for a company that made "mini computers". These had error correcting memory boards. Some two year old units sent back for repair were discovered to have been incorrectly manufactured. This led to the identification of a whole batch that had been built with one of the memory chips in backwards....and the error correcting circuit had been correcting the consequences for two years.
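
The lesson can be sketched with a small Hamming(7,4) example in Python: the memory corrects single-bit errors, but it also counts them and raises an alarm when they keep coming. This is an illustration, not the design of those memory boards; the class name, the alarm threshold and the "stuck bit" scenario are assumptions.

```python
def encode(data_bits):
    """Hamming(7,4): 4 data bits -> 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


class EccMemory:
    def __init__(self, alarm_threshold: int):
        self.corrections = 0
        self.alarm_threshold = alarm_threshold

    def read(self, word):
        """Correct a single-bit error, count it, and return (data_bits, alarm)."""
        w = list(word)
        s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
        s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
        s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
        syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if clean
        if syndrome:
            w[syndrome - 1] ^= 1
            self.corrections += 1
        data = [w[2], w[4], w[5], w[6]]
        return data, self.corrections >= self.alarm_threshold


memory = EccMemory(alarm_threshold=3)
stored = encode([1, 0, 1, 1])
faulty = list(stored)
faulty[4] ^= 1  # one bit always reads back wrong, like a chip fitted backwards

reads = [memory.read(faulty) for _ in range(3)]
print(reads, memory.corrections)
```

Every read returns the right data, so the user never notices anything. Only the correction counter reveals that the same fault is being papered over again and again.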

Nemrytter
24th Nov 2009, 21:00
AW&ST had an article about the criteria for 'space hardened' electronics; if I can remember to look I'll post it -- but CJ has given me a lot to think about :)

The criteria are incredibly strict, you can find them here:
ECSS-E-ST-40C - Software general requirements (http://www.ecss.nl/forums/ecss/dispatch.cgi/standards/docProfile/100741/d20090306202209/No/t100741.htm)
ECSS-Q-ST-60C Rev.1 - Electrical, electronic and electromechanical (EEE) components (http://www.ecss.nl/forums/ecss/dispatch.cgi/standards/docProfile/100749/d20090327143145/No/t100749.htm)
Although you have to log in to see them (registration is free).

Slightly more generalised versions are online here: https://escies.org/ReadArticle?docId=167

vovachan
25th Nov 2009, 14:18
It's very important to know if your error correction circuit is being triggered and take some action.

I agree. Right now it seems like they have the worst of all possible worlds: an error correcting system which is not 100% foolproof, a computer which can override the pilot and start flying the plane based on erroneous AOA inputs, and a pilot who sits there and has no clue what's going on.

jimworcs
25th Nov 2009, 14:36
I am not a pilot, just an interested observer. Is this incident relevant to what happened on the Qantas aircraft?

Incident: US Airways A333 over Atlantic on Nov 17th 2009, computer issues
By Simon Hradecky, created Friday, Nov 20th 2009 14:30Z, last updated Friday, Nov 20th 2009 14:30Z

A US Airways Airbus A330-300, flight US-740 from Philadelphia,PA (USA) to Madrid,SP (Spain), was enroute at FL390 about 350nm east of Philadelphia overhead the Atlantic about 40 minutes into the flight, when the crew announced they needed to return and was cleared to turn to the left. About 40 seconds later during the turn the crew declared emergency and requested to descend. About another 5 minutes later while levelling at FL300 the crew reported, that everything had returned to normal explaining, that they had experienced computer problems they were unable to resolve and they had been "missing control". The emergency was cancelled, the airplane continued back to Philadelphia. The airplane landed safely on Philadelphia's runway 09R about 75 minutes after the onset of trouble.