ATSB probes 'cosmic rays' link to QF72 A330 jet upset
Join Date: Oct 2009
Location: Greece
Age: 84
Posts: 63
I am not sure, but for this (cosmic rays) to be published, you would have to think it is designed to take the blame away from a nasty, potentially catastrophic software/hardware fault within the ADIRUs.
Yet surely, saying that the 330 is subject to random cosmic rays would have to be even less reassuring. If they had said the ADIRU can be replaced due to 'this' (i.e. whatever fault they find) particular hardware fault, then most people would be satisfied - but now the whole jet can be susceptible to complete lack of control from unseen random cosmic rays! FFS...Really?
I can just imagine the punters now (or the random sandwich shop worker interview): "I can't hop on a Q Airbus again now due to cosmic rays"
In any event, "cosmic rays" is utter bollocks. It's not like they were suddenly invented. Stay tuned
Join Date: Mar 2008
Location: San Diego
Posts: 60
Simple solution for those Oz types: why speculate about such important safety matters when the technology is available (and has been for decades)? Just mandate a Wilson Cloud Chamber in every cockpit, with the detection system linked to the flight computer for instant pilot awareness.
Better yet, wrap the plane with said Cloud Chamber, for more complete coverage.
ayyyyyy ....
Join Date: Nov 2009
Location: Near Puget Sound
Age: 86
Posts: 88
Before we all talk the cosmic ray theory to death, remember that a great deal of accident investigation is going down many leads to see what didn't happen. When I first began working on NTSB teams, I was surprised by how much effort was spent explaining what couldn't have happened.
After you have discarded all the impossible explanations, whatever is left, no matter how improbable must be the truth --- Sherlock Holmes.
Dick Newman
But Dick, we haven't ...
... eliminated all the possible explanations.
We are rightly concerned when a pilot dozes off without warning, but when an ADIRU goes into dozing mode it is a non-critical rare event?
In my experience, when a computer goes into doze mode and has to be rebooted, either the hardware failed, something in the software did other than what the programmer intended, or the system design failed to take into account all the possible consequences of all the programmers' different intentions.
Hardware faults caused by cosmic rays should happen at a statistically predictable rate depending on known parameters.
Dozing faults can be caused by software. For example, a process may end up in a tight loop (unintended) or when memory is tight, several processes may end up waiting (intended) for other processes to release memory - and they don't release it (also possibly intended). This type of fault is statistically more likely on computers that run for longer than average between reboots.
If something like dozing can happen, how can we be sure enough that something else other than what is intended will not happen?
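For what it's worth, the second case described above (processes waiting on each other for something that never gets released) is easy to reproduce in miniature. The sketch below is only an illustration: the two locks stand in for blocks of scarce memory, and `run_demo`, `block_a`, `block_b` and the timeout value are names made up for the example, not anything taken from a real ADIRU. Each worker holds one block and waits for the other's; without the timeout both would wait forever.

```python
import threading

def run_demo(timeout_s=0.2):
    """Two workers each hold one resource and wait for the other's."""
    block_a = threading.Lock()      # hypothetical stand-in for one block of memory
    block_b = threading.Lock()      # hypothetical stand-in for another block
    barrier = threading.Barrier(2)  # used only to make the demo repeatable
    got_second = {}

    def worker(name, mine, theirs):
        with mine:
            barrier.wait()                          # both now hold their own block
            ok = theirs.acquire(timeout=timeout_s)  # wait for the other block
            if ok:
                theirs.release()
            got_second[name] = ok
            barrier.wait()                          # keep 'mine' until both have tried

    t1 = threading.Thread(target=worker, args=("p1", block_a, block_b))
    t2 = threading.Thread(target=worker, args=("p2", block_b, block_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return got_second

if __name__ == "__main__":
    print(run_demo())
```

Neither worker ever gets its second block. The timeout is the only thing that lets the demo finish, which is roughly the job a watchdog does for a real system.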
Join Date: Dec 2001
Location: England
Posts: 1,389
One way is to use more sophisticated watchdog timers that check the computer is awake and not spending all its time looping. If the correct actions aren't taken the hardware gets reset (or something less drastic).
Join Date: Jan 2005
Location: France
Posts: 2,315
Rightbase,
cwatters has the right answer...
The kind of real-time software in digital autopilots, etc. is very different from 'data-processing' software, be it PCs or mainframes, which is mostly interrupt-driven.
Watchdog timers are small bits of independent hardware which have to be reset at regular intervals (say 100 msec, possibly less). Any fault, software or hardware, that results in the watchdog not being reset in time (such as "hanging up" in a loop, as you mentioned) will promptly produce a failure warning and cause the computer to disconnect.
CJ
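A rough software model of the watchdog idea described above, for anyone who hasn't met one. This is a sketch under stated assumptions, not how any particular avionics box does it: the `Watchdog` class, the 50 ms cycle and the `simulate` helper are all invented for illustration, and the 100 ms timeout is just the example figure from the post. Time is simulated with integer milliseconds so the behaviour is repeatable.

```python
class Watchdog:
    """Hypothetical model of an independent watchdog timer."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = 0
        self.tripped = False

    def kick(self, now_ms):
        # Healthy software calls this once per cycle to reset the timer.
        if not self.tripped:
            self.last_kick_ms = now_ms

    def tick(self, now_ms):
        # The independent timer checks how long it has been since the last kick.
        if now_ms - self.last_kick_ms > self.timeout_ms:
            self.tripped = True  # latched: warning raised, computer disconnects
        return self.tripped


def simulate(hang_at_ms, total_ms=1000, cycle_ms=50, timeout_ms=100):
    """Return the time at which the watchdog trips, or None if it never does."""
    wd = Watchdog(timeout_ms)
    for now in range(0, total_ms + 1, cycle_ms):
        software_alive = hang_at_ms is None or now < hang_at_ms
        if software_alive:
            wd.kick(now)
        if wd.tick(now):
            return now
    return None


if __name__ == "__main__":
    print(simulate(None))  # software never hangs
    print(simulate(300))   # software stops kicking from t = 300 ms
```

With these numbers a hang at 300 ms is caught at 400 ms, the first check at which more than 100 ms has passed since the last kick (at 250 ms). A watchdog like this only catches "stopped" or "stuck in a loop"; it says nothing about software that keeps running but outputs wrong data.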
The kind of real-time software in digital autopilots, etc. is very different from 'data-processing' software, be it PCs or mainframes, which is mostly interrupt-driven.
OKAY, but from the report:
One type of fault event associated with the ADIRU model is known as ‘dozing’. Once ‘dozing’ commences, the ADIRU stops outputting data for the remainder of the flight.
Join Date: Jan 2008
Location: Scandinavia
Posts: 98
"Dozing" ... not really a technical term any of my more knowledgeable software engineer friends have heard of. From what they say:
There are without a doubt watchdog timers which reset parts of the system and restore the system into a meaningful and known state - known here means stable.
The way processes in these systems are organised is NOT the same as a home PC but more or less fixed at design-time, so timing and other interrelations are known and can be tested for or even proven.
Dozing appears to mean - according to some - that the ADIRU placed itself into a known state where the functions provided are effectively suspended. Why it ended up in such a state is the question - that is, what set of events caused the ADIRU to "fail" in that way. Fail means "fail safe".
As I understand it there are two other ADIRUs and voters - were there failures there as well? Failure of one ADIRU shouldn't cause an upset.
fc101
E145 driver
--- some text rephrased from sources who know more about safety-critical systems than me.
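To make the "three ADIRUs and voters" point concrete, here is a minimal sketch of one common voting scheme, a mid-value (median) select with a disagreement check. It is a generic textbook voter, not a description of the Airbus logic; the function name `vote`, the tolerance and the sample readings are assumptions for illustration.

```python
from statistics import median

def vote(values, tolerance):
    """Return (voted_value, suspects) for three redundant readings.

    voted_value is the middle reading; suspects lists the index of any
    source that disagrees with it by more than the tolerance.
    """
    mid = median(values)
    suspects = [i for i, v in enumerate(values) if abs(v - mid) > tolerance]
    return mid, suspects

if __name__ == "__main__":
    # Hypothetical angle-of-attack readings in degrees; the third source is spiking.
    print(vote([2.1, 2.0, 50.6], tolerance=1.0))
```

With one wild source the middle value is untouched and the bad source is flagged, which is the sense in which failure of a single unit should not reach the control surfaces.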
Join Date: Nov 2009
Location: Near Puget Sound
Age: 86
Posts: 88
Of course, I don't mean to suggest that we shouldn't worry about the software code. ADIRUs have a failure rate somewhere between 1/1000 and 1/10000. We still need triple redundancy to avoid a catastrophe. We need to ensure that independent computer software errors do not go uncorrected, whether caused by an ADIRU failure or by a cosmic ray upsetting a single bit.
In general, we've done a pretty good job of not having the software make mistakes in calculations. Where we may have fallen short is in writing our requirements to take these ADIRU failures or other single events into account. I was distressed during my previous employment when my boss reacted to the QANTAS upset with "Well, it was only an ADIRU failure" when the response should have been "How could an ADIRU failure make its way through to the flight control surfaces?"
Dick
Hmm, Intel thinks this is a real problem...
From an article in New Scientist (March 2008):
"But Intel thinks we may still be living on borrowed time:
"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade."
Their patent suggests built-in cosmic ray detectors may be the best option. The detector would either spot cosmic ray hits on nearby circuits, or directly on the detector itself. When triggered, it could activate error-checking circuits that refresh the nearby memory, repeat the most recent actions, or ask for the last message from outside circuits to be sent again.
But if cosmic ray detectors make it into desktops, would we get to know when they find something? It would be fun to suddenly see a message pop up informing a cosmic ray had been detected. I haven't seen any recent figures on how often they happen, but back in 1996 IBM estimated you would see one a month for every 256MB of RAM."
Although I'm not directly involved in aircraft avionics, the problem of cosmic ray effects on computing devices is REAL. Don't dismiss this as goofy pseudo-science - there is a lot of money being spent investigating this.
- GY
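As a back-of-envelope check on what the IBM figure quoted above would imply, here is a small sketch that treats upsets as a Poisson process (independent events at a constant average rate). That model, the 30-day month and the function names are my assumptions; the only number taken from the thread is "one a month for every 256MB of RAM", which was a 1996 estimate and says nothing about the rate at cruise altitude or in avionics-grade parts.

```python
import math

UPSETS_PER_MONTH_PER_256MB = 1.0  # the 1996 IBM estimate quoted in the article
HOURS_PER_MONTH = 30 * 24         # assumption: a 30-day month

def expected_upsets(ram_mb, hours):
    """Expected number of upsets for a given amount of RAM and exposure time."""
    rate_per_hour = UPSETS_PER_MONTH_PER_256MB * (ram_mb / 256) / HOURS_PER_MONTH
    return rate_per_hour * hours

def prob_at_least_one(ram_mb, hours):
    """Poisson probability of seeing one or more upsets in the period."""
    return 1.0 - math.exp(-expected_upsets(ram_mb, hours))

if __name__ == "__main__":
    print(expected_upsets(256, 10))
    print(prob_at_least_one(256, 10))
```

On those assumptions, 256MB over a 10 hour period gives a chance of roughly 1.4% of at least one upset. The point is only that a random hardware cause should show up at a rate you can predict and compare against the observed fault history.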
Join Date: Jul 2006
Location: Brisbane, Oz
Age: 82
Posts: 46
If I may put my 5 cents worth in (used to be a penny)? There is a general misrepresentation of the colloquial term 'cosmic rays'. Did I say anything about the 'media stock phrases and cliches' handbook? Wash my mouth out!
This discussion concerns high energy particles, and a reading of the Wikipedia article 'Cosmic ray' will bring one up to speed.
They are singularities, and although they can occur in 'showers' (read: high incidence of), they *are* problematic, and how much so depends on each individual particle's very variable energy level. They are not just a threat to electronics, but also to DNA and indeed any of your cells.
On the well-known-in-the-trade basis that such a particle can 'take out' an individual electronic component, whether temporarily if low energy or sometimes permanently if high energy, any problem should be an isolated event that can in no way known to wo/man be specifically guarded against, short of using lead wrapping on all boxes.
As another ancient here says, the design must fully guard against any individual failure.
On a related matter, here’s a snippet of information related to Airbus’s design philosophy. I haven’t seen this mentioned since my engineering course on the second lot of free range A320s. (Gosh! Have they been flying for *that* long?) It was stated then that Airbus went to what I would have thought were excessive pains to diversify the build parameters and supply sources of all duplicated equipment.
We were told by an Airbus rep that duplicate suppliers were given design parameters which they were free to achieve electronically any way they chose, but obviously to tight aviation constraints. The ultimate black boxes. The idea was that a *design* flaw in one element of the control architecture would be isolated to one item in the control chain by default.
To the best of my recall this philosophy was applied across the entire airframe, and I have been surprised at reports that certain Airbus aircraft have finished up flying with all pitots from the same manufacturer. That certainly was not the original designers' intent.
No doubt the cost of extensive duplication of non-identical but similarly functioning components has attracted the attention of the financial fine tuners. <sigh>
(Written from the future as this appears, the comment re pitots seems rather relevant to the current (20100820) thread mulling the AF447 loss. Amended by Jencluse.)
Last edited by JenCluse; 20th Aug 2011 at 10:58.
Join Date: Mar 2009
Location: us
Age: 63
Posts: 206
PS I stand corrected:
The flight computer does filter and compare AOA data coming from the 3 ADIRUs, but there is a scenario when it can be fooled:
• there were at least two short duration, high amplitude spikes
• the first spike was shorter than 1 second
• the second spike occurred and was still present 1.2 seconds after the detection of the first spike.
During the incident flight the bad ADIRU produced 42 data spikes, 40 of which were caught by the computer except the 2 which caused the upset. All these dozens of spikes did not make the computer realize the ADIRU was bad.
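One way to picture the scenario in the bullet points is a filter that, on seeing a spike, holds the last good value for a fixed window and then simply accepts whatever is on the input when the window ends. To be clear, the hold-then-accept logic below is my assumption about the kind of filter that would have this weak point, not a transcription of the real flight computer code. Only the 1.2 second window comes from the post; the 0.1 s sample interval, the `spike_limit` threshold and the sample values are invented.

```python
def filter_aoa(samples, hold_ticks=12, spike_limit=10.0):
    """Hold-last-value spike filter (illustrative only).

    samples: raw readings at a fixed interval; 12 ticks of 0.1 s = 1.2 s.
    """
    out = []
    last_valid = samples[0]
    release_tick = None  # tick at which the current hold window ends
    for tick, raw in enumerate(samples):
        if release_tick is not None:
            if tick < release_tick:
                out.append(last_valid)  # still holding: raw input is ignored
                continue
            # Window over: the current input is accepted without a re-check.
            # This is the weak point the scenario exploits.
            release_tick = None
            last_valid = raw
            out.append(raw)
            continue
        if abs(raw - last_valid) > spike_limit:
            release_tick = tick + hold_ticks  # spike seen: start holding
            out.append(last_valid)
        else:
            last_valid = raw
            out.append(raw)
    return out


if __name__ == "__main__":
    one_spike = [2.0] * 5 + [50.0] * 3 + [2.0] * 20
    two_spikes = [2.0] * 5 + [50.0] * 3 + [2.0] * 8 + [50.0] * 3 + [2.0] * 10
    print(max(filter_aoa(one_spike)))   # single short spike is masked
    print(filter_aoa(two_spikes)[17])   # second spike present when the hold ends
```

In the first trace the 0.3 s spike never reaches the output. In the second, a second spike happens to be present 1.2 s after the first was detected, so the bad value 50.0 is accepted as genuine at tick 17.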
Join Date: Jan 2005
Location: W of 30W
Posts: 1,916
Originally Posted by Dick Newman
I was distressed during my previous employment when my boss reacted to the QANTAS upset with "Well, it was only an ADIRU failure" when the response should have been "How could an ADIRU failure make its way through to the flight control surfaces?"
Originally Posted by Lookleft
xetroV- The 777 does indeed only have one ADIRU unit but that unit consists of multiple accelerometers and laser gyros. This redundancy within the unit didn't prevent an incident to an MAS 777 doing something very similar to the QF incident - also off the coast of WA. I don't think it's a problem unique to one manufacturer or another but an indication of the lack of understanding of how software interacts.
I agree with your statement about software interaction. This is becoming increasingly important as more and more aircraft systems are being integrated and interconnected, while at the same time the required navigation performance and vertical separation are continuously being reduced, as the skies get busier. At the very least, accurate and quick internal error detection algorithms should provide smooth systems degradation that is immediately obvious and totally transparent to the flight crew. Sudden uncommanded autopilot upsets are not what I call "fail passive" (let alone "fail safe").
Last edited by xetroV; 24th Nov 2009 at 20:41.
Join Date: Dec 2001
Location: England
Posts: 1,389
during the incident flight the bad ADIRU produced 42 data spikes, 40 of which were caught by the computer except the 2 which caused the upset. All these dozens of spikes did not make the computer realize the ADIRU was bad.
Join Date: Apr 2008
Location: .
Posts: 309
A&WST had an article about the criteria for 'space hardened' electronics; if I can remember to look I'll post -- but CJ has given me a lot to think about
ECSS-E-ST-40C - Software general requirements
ECSS-Q-ST-60C Rev.1 - Electrical, electronic and electromechanical (EEE) components
Although you have to log in to see them (registration is free).
Slightly more generalised versions are online here: https://escies.org/ReadArticle?docId=167
Join Date: Mar 2009
Location: us
Age: 63
Posts: 206
It's very important to know if your error correction circuit is being triggered and take some action.
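A small sketch of what "knowing your error correction is being triggered" could look like. It uses a textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word, and wraps it in a memory that counts every correction. Real ECC memory uses wider codes and hardware counters; the `MonitoredMemory` class and the alarm threshold are hypothetical, purely to show the idea of counting corrections and acting on them.

```python
def encode(d):
    """Hamming(7,4): 4 data bits -> codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(code):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based position of the bad bit
    if pos:
        c[pos - 1] ^= 1         # flip it back
    return [c[2], c[4], c[5], c[6]], pos

class MonitoredMemory:
    """Hypothetical ECC memory that keeps a count of corrected errors."""

    def __init__(self, alarm_threshold=3):
        self.words = {}
        self.corrections = 0
        self.alarm_threshold = alarm_threshold

    def write(self, addr, data_bits):
        self.words[addr] = encode(data_bits)

    def read(self, addr):
        data, pos = decode(self.words[addr])
        if pos:
            self.corrections += 1            # record that ECC had to step in
            self.words[addr] = encode(data)  # rewrite the repaired word
        return data

    def needs_attention(self):
        # Repeated corrections suggest something other than a one-off hit.
        return self.corrections >= self.alarm_threshold

if __name__ == "__main__":
    mem = MonitoredMemory(alarm_threshold=2)
    mem.write(0, [1, 0, 1, 1])
    mem.words[0][5] ^= 1  # simulate a single-bit upset
    print(mem.read(0), mem.corrections, mem.needs_attention())
```

The data comes back intact either way; the counter is what tells you the correction circuit is working for its living, so that a unit throwing repeated errors can be flagged rather than silently tolerated.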
Join Date: Oct 2004
Location: England
Age: 65
Posts: 87
I am not a pilot, just an interested observer. Is this incident relevant to what happened on the Qantas aircraft?
Incident: US Airways A333 over Atlantic on Nov 17th 2009, computer issues
By Simon Hradecky, created Friday, Nov 20th 2009 14:30Z, last updated Friday, Nov 20th 2009 14:30Z
A US Airways Airbus A330-300, flight US-740 from Philadelphia, PA (USA) to Madrid, SP (Spain), was enroute at FL390 about 350nm east of Philadelphia overhead the Atlantic about 40 minutes into the flight, when the crew announced they needed to return and was cleared to turn to the left. About 40 seconds later during the turn the crew declared emergency and requested to descend. About another 5 minutes later while levelling at FL300 the crew reported that everything had returned to normal, explaining that they had experienced computer problems they were unable to resolve and they had been "missing control". The emergency was cancelled, the airplane continued back to Philadelphia. The airplane landed safely on Philadelphia's runway 09R about 75 minutes after the onset of trouble.