Boeing 787 integer overflow bug
Join Date: Jul 2002
Location: UK
Posts: 3,093
...Certainly that's how they would have been done 30 years ago. Generator control didn’t need software for the first 80 years of flight, and not a lot has changed.
However, kids coming out of university have almost no idea what an analogue control circuit is.
But they do know what software is. So guess what they choose to use when they're a bit older and end up designing GCUs? Software. Is it overkill? Possibly. Would they ever think of building it a different way? No.
The only thing software has going for it in such a situation is that a whole ton of functionality can be implemented with a very low size/weight penalty.
Though in the case of a GCU I'm struggling to see what that extra functionality might be.
Join Date: Sep 2014
Location: Canada
Posts: 1,257
I don't know too much about a GCU, but I can't imagine that there is much about them that couldn't be run with a good old fashioned analogue control circuit.
When you combine an inherently analogue mechanical system with an analogue control system, you are essentially fighting a losing battle because both the system under control and the system doing the controlling will independently and jointly go out of tolerance. So analogue systems constantly require external tuning to keep their performance within an acceptable range.
Analogue systems are also susceptible to changing environmental conditions. The change in temperature from a hot tarmac to freezing flight levels is more than enough to affect the performance of analogue components (capacitors, resistors, amplifiers, etc.). So again, it's very difficult to maintain tight tolerances with analogue controls.
Digital systems, on the other hand, are not susceptible to these problems. If I set a certain digital memory parameter to the value of 128 decimal, it will not vary to become 129 or 127 over time. It will also remain exactly at 128 over its entire design temperature & environmental range. While memory can get corrupted -- and there are ways to automatically detect & correct this -- bits and bytes don't age, freeze or boil over.
The need to meet modern tolerance / precision requirements alone justifies the motivation to use digital vs. analogue techniques.
Plus there are many other advantages to using digital control systems. A single integrated digital controller can monitor, control and tune hundreds of parameters simultaneously in real time -- something that's impossible (or at least impractical) to do with analogue controllers.
Analogue controllers also tend to be fairly "dumb". With digital controllers, manufacturers can implement sophisticated control algorithms to increase performance, economy, reliability, etc.
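To illustrate the point about detecting and correcting corrupted memory, here is a minimal sketch of one well-known approach: keep three copies of a parameter and take a bitwise majority vote. This is only an illustration. The function name and values are made up for the example, and nothing here is claimed to be how any real GCU protects its memory (production hardware more commonly relies on ECC memory).

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three stored copies of one value.

    Each output bit takes the value held by at least two of the copies,
    so a corruption confined to a single copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


# Three copies of the parameter 128 (hypothetical storage layout).
copies = [128, 128, 128]

# Simulate a single bit flip in one copy: 128 becomes 129.
copies[1] ^= 0b0000_0001

recovered = majority_vote(*copies)
print(copies, "->", recovered)
```

The vote only helps while at most one copy is damaged at any given bit position, which is why such schemes usually rewrite the corrected value back to all three copies periodically.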
Thread Starter
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Probability vs Possibility
I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...
The fact that a (simple?) (single?) counter could trip ALL systems (including the APU, 6 generators in total) off line until some sort of ground access reset doesn't say much for redundancy or safety.
And while the RAT could keep the controls running, it is NOT clear if the engines could be restarted (an air start takes electrical power along with windmilling).
Granted it is very, very, very unlikely, but the probability and possibility are NOT ZERO.
Join Date: Sep 2014
Location: Canada
Posts: 1,257
And it is still NOT clear if only one continuous run generator/power system/ground power connection for that time would trip ALL off line.
Each GCU is independent. The counter is internal to each GCU. Failure of one GCU only affects that particular GCU, and will not affect any others.
The danger is if all four GCUs were powered up at the same time, then all four would fail at the same time.
This is all clearly discussed in the original FAA AD under the "Supplementary Information" section:
https://s3.amazonaws.com/public-insp...2015-10066.pdf
Also the APUs are not affected (different make/model).
I have no problem with digital systems. I have no problem with computers (I fly Airbuses!), and I agree that extra sophistication can be achieved with computers or micro-controllers. I also agree that analogue systems need regular manual adjustment and calibration, which is a pain. (My first profession was as an electronics engineer, and believe me, I have spent plenty of time doing just that.)
However, I do have a problem with a vital system shutting down because a mere clock or counter has reached a particular limit. All a generator, GCU, or hydraulic pump needs is a simple logic circuit to determine if it is working within its parameters, and warn the pilots if it is not. It can be monitored by a computer by all means, but a computer should not have executive control, unless there is a catastrophic situation developing.
I am a line pilot, and if we have just gone around at Innsbruck because we've lost both GCUs in our No 1 engine, and then the two GCUs in our No 2 engine quit in the climb out, and I later discover that all four GCUs quit; not because they overheated, not because they oversped, not because the voltage or frequency was wrong but because a bloody clock said so... That is really going to xxxx me off! - assuming of course that a quadruple genny failure did not distract us so much in that valley that we flew into a mountain!
I like having FADECs to look after the engines and help me prevent exceedances, but I don't expect one to shut down simply because a register has become full.
System software designers must not lose sight of how their systems will be used, or the fact that such systems need to keep running unless a catastrophic or potentially catastrophic situation has arisen.
Clock/counter overflow is not catastrophic. Nor should the shut down potential of a clock/counter/register overflow need to be carefully checked for, because it should never be an issue in the first place.
Pilots don't have the luxury of being able to go through hundreds/thousands of lines of code at their desks all morning, coffee at their side, and eventually saying, "oh, here we go, I found the problem". They just need that genny or hydraulic pump to keep going.
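On the point that a counter overflow "should never be an issue in the first place": a common way to achieve that is to let a free-running counter wrap and compute elapsed time with modular arithmetic. The sketch below assumes a hypothetical 32-bit unsigned tick counter; the names and width are chosen for illustration and are not taken from the AD or from any GCU software.

```python
MASK32 = 0xFFFFFFFF  # assumed width: a 32-bit unsigned tick counter


def elapsed_ticks(now: int, start: int) -> int:
    """Ticks between two readings of a free-running 32-bit counter.

    The subtraction is done modulo 2**32, so the result stays correct
    across a wrap as long as the true interval is below 2**32 ticks.
    """
    return (now - start) & MASK32


# A reading taken shortly before the counter wraps...
start = 0xFFFFFFF0
# ...and one taken 0x20 ticks later, after it has wrapped round to 0x10.
now = (start + 0x20) & MASK32

naive = now - start  # plain subtraction goes hugely negative
print("naive:", naive)
print("wrap-safe:", elapsed_ticks(now, start))
```

The design choice is that nothing special happens at the wrap point, so there is no limit value for the software to react to.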
Join Date: Nov 2009
Location: flying by night
Posts: 500
Regarding the purported need for analog controls, and whether we need software... I came to think of lead software engineer Margaret Hamilton, the code she wrote for Apollo 11, and a picture of it on a stack of paper as tall as her... the code which basically saved the moon landing by recovering from a malfunction: https://medium.com/@verne/margaret-h...n-7d550c73d3fa Among other things, she also coined the term "software engineering" (it was treated more like an esoteric art previously, which it isn't).
Computers and aircraft seem to be a combination that scares a lot of people (the impression is that bugs and "hackers" are everywhere), but I think it's a really cool combination, and the rare glitches such as this potential thing here haven't really caused any accidents yet, have they? On the contrary, software prevents accidents every day.
Join Date: Jun 2009
Location: Canada
Posts: 464
We also had a similar issue a few years ago with some hardware we bought which started spewing errors after being in operation for about four years. Turned out that, although their clock was 64 bits, someone had unintentionally copied it to a 32-bit variable and then copied that back to the 64-bit variable, so the top bits were always zero. The things you don't test are usually the things that don't work.
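A rough sketch of that kind of bug, assuming nothing about the actual hardware: the post doesn't say what the clock's tick rate was, so the values below are arbitrary and the function name is invented for the example. The mask stands in for what a copy into a 32-bit variable does to a 64-bit value.

```python
MASK32 = 0xFFFFFFFF


def round_trip_through_32_bits(clock64: int) -> int:
    """Mimic copying a 64-bit clock into a 32-bit variable and back.

    Only the low 32 bits survive the first copy, so the value copied
    back always has its top 32 bits set to zero.
    """
    low32 = clock64 & MASK32  # the narrowing copy drops the top bits
    return low32              # the widening copy cannot restore them


just_before = 2**32 - 1  # largest value that still fits in 32 bits
just_after = 2**32 + 5   # six ticks later, the first value that needs bit 32

for value in (just_before, just_after):
    print(value, "->", round_trip_through_32_bits(value))
```

Everything looks fine in testing until the clock first needs more than 32 bits, at which point it appears to jump back to near zero.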
Join Date: Jun 2009
Location: Canada
Posts: 464
So what's that got to do with the control of a generator? It's still all about monitoring.
One of the benefits of software control systems over analogue is that you can make them as complex as you want. One of the downsides is that they may contain completely unexpected failure modes, while analogue systems tend to fail in predictable ways. Just because the software has worked perfectly for 248 days doesn't mean it won't fail completely after 249, whereas an analogue system will usually degrade before it fails.
Thread Starter
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
About ALL Generators /Power on line for 248 days
Peekay4 said:
"No, it is perfectly clear. Each GCU is independent. The counter is internal to each GCU. Failure of one GCU only affects that particular GCU, and will not affect any others."
Yet the first page of the FAA doc says very clearly:
https://s3.amazonaws.com/public-insp...2015-10066.pdf
"This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power. We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane"
Loss of ALL AC power includes the APU.
But further on it says:
"If the four main GCUs were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase."
Now granted the 248 days is a stretch, but it's still a good bet that both engines (4 gens total) are started within minutes of each other and, absent some special maintenance on one engine, the time counts would be within minutes of each other.
The role of the APU in that case is not well defined.
And is there a common GCU counter for all in the system?
Can you start any engine via windmilling with a 'bricked' GEN, as in the case where both engines drop off within a few minutes?
Or can the APU cross-feed to the engine ignition system?
Anyone KNOW for sure?
Thread Starter
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
About Engine Ignition
Sure, but how is the 27-28 V DC generated?? AFAIK it is via inverter from the AC system.
IOW if GEN is shut down ( still rotates but no output ), what about DC ??
Last edited by SAMPUBLIUS; 28th May 2015 at 03:10. Reason: corrected to DC via inverter
No, 28 Vdc for the igniters comes off the hot battery bus (igniter power available from the hot battery bus is a common design feature on Boeings, although on most the igniter power only drops to the battery bus when the main AC bus goes down). The whole reason they used DC for the igniters on the 787 is so they wouldn't need that current going through a DC-to-AC inverter.
Join Date: Sep 2014
Location: Canada
Posts: 1,257
"Loss of ALL AC power includes APU"
The AD does NOT mean that integer overflow in one GCU will shut down the other three GCUs plus the APU, as you seem to think. That is simply wrong!
The 787 has six generators: 2 per engine (4 total) plus 2 more on the APU. The FAA statement makes clear that the issue affects the four GCUs associated with the engines, not any GCUs related to the APU.
Read the following sections from the FAA AD carefully:
The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.
If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time
And notice this simultaneous failure only occurs "if the four main GCUs ... were powered up at the same time".
Translation: if the four main GCUs were not powered up at the same time, failure in one GCU will not cause all others to fail.
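For anyone wondering where a figure like 248 days could come from, and why power-up timing matters, here is a small sketch. The AD gives the 248-day figure but does not state the counter's width or tick rate; a widely repeated reading, treated here purely as an assumption, is a signed 32-bit counter incremented 100 times a second. The power-up offsets are made-up numbers.

```python
TICKS_PER_SECOND = 100  # assumption: counter ticks every 10 ms
COUNTER_LIMIT = 2**31   # assumption: signed 32-bit counter overflows here
SECONDS_PER_DAY = 86_400


def days_until_overflow(ticks_per_second: int = TICKS_PER_SECOND,
                        limit: int = COUNTER_LIMIT) -> float:
    """Days of continuous power before the assumed counter overflows."""
    return limit / ticks_per_second / SECONDS_PER_DAY


def overflow_times_hours(power_up_offsets_hours: list[float]) -> list[float]:
    """Overflow time of each independent counter, in hours after the
    first power-up, given when each unit was powered up."""
    life_hours = days_until_overflow() * 24
    return [offset + life_hours for offset in power_up_offsets_hours]


print(f"{days_until_overflow():.2f} days")

# All four units powered up together: every counter expires at the same moment.
together = overflow_times_hours([0, 0, 0, 0])
# Two units powered up 6 hours later: failures arrive in two groups 6 hours apart.
staggered = overflow_times_hours([0, 0, 6, 6])
print(len(set(together)), "distinct failure time(s) when powered together")
print(len(set(staggered)), "distinct failure time(s) when staggered")
```

Under those assumptions the counters are independent, and any simultaneous failure comes from them having been started together rather than from one unit tripping the others.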
Thread Starter
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
re 28 VDC from battery
28 Vdc for the igniters comes off the hot battery bus
Even so - IMO the timer shutoff of GENs although unlikely is still a dumb idea.
And it takes x months to correct ??
Peekay, correct me if I am wrong but, "when power is applied" obviously means external power, and when that is plugged in, the engines will not be running. So the GCUs are powered, even though their engines are not running. Surely the APU GCUs are in the same boat? "Powered" without the APU running? Scenario: cold ship, APU is started (or external power applied), as soon as power comes on-line, all 6 GCUs will be powered. 248 days later, all 6 GCUs will die.
You seem to be suggesting that the engine GCUs remain powered when the engines are stopped but the APU GCUs are "depowered" when the APU is stopped.
Tinfoil hat donned!
Thread Starter
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
RE powered GCU
Capn boggs said:
"Scenario: cold ship, APU is started (or external power applied), as soon as power comes on-line, all 6 GCUs will be powered. 248 days later, all 6 GCUs will die."
Capn may have a valid point!!
IMO, since the APU is usually shut down in mid flight, it is unlikely that the 2 APU GENS will have the same time count.
Meanwhile, since each engine has 2 GENS, those 2 GENS will usually have the same powered time. And since both engines are often started within minutes of each other, it is quite likely that 4 GENS could be bricked by a counter within minutes of each other.
BUT what is unclear is whether the GEN COUNTER TIME counts SYSTEM powered time, such as when under ground power. If it does, then even though the APU is off, time for the 2 APU GENS may also be counted as 'powered' time.
One would certainly hope that Boeing would by now have clarified just how 'powered' time is counted by each GEN.
Last edited by SAMPUBLIUS; 28th May 2015 at 14:31. Reason: capn may have valid point plus typo
Join Date: Jul 2002
Location: UK
Posts: 3,093
The best ones get the highest paid jobs.
And, actually, the best ones know that the most expensive part of building a safety critical system is passing certification.
So what's that got to do with the control of a generator? It's still all about monitoring.
Join Date: Jan 2011
Location: on the cusp
Age: 52
Posts: 217
This issue is not about the software as far as I see it. I am pretty sure that the software module was implemented and tested successfully against the module requirements. It is the systems engineering process that is the concern.
During design, implementation and certification the analysis never revealed that an important component is guaranteed to fail every 248 days. This is evident because had it been analysed then there would already be an aircrew/maintenance action in place. The aim of the certification and analysis process is to drive out and mitigate all potential latent defects to an acceptable probability/consequence level. In this case the process failed spectacularly. Luckily the outcome was unspectacular and easy to manage, but the failure of the process is still of great concern.