Boeing 787 integer overflow bug

Old 27th May 2015, 00:15
  #81 (permalink)  
 
Join Date: Jul 2002
Location: UK
Posts: 3,093
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by msbbarratt
You're referring to what I call the Too-Much-Technology problem. There is an unsettling fashion for using technology, especially software, to solve problems that don't really exist.
Not applicable here.

...Certainly that's how they would have been done 30 years ago. Generator control didn’t need software for the first 80 years of flight, and not a lot has changed.
Would they have been consistently running for 8 months in that period?

However, kids coming out of university have almost no idea what an analogue control circuit is.
Actually, a decent Software Engineering graduate will be *well* aware of the fundamentals of control circuits and logic paths. Furthermore only the best of those graduates usually end up specialising in real-time/safety-critical work.

But they do know what software is. So guess what they choose to use when they're a bit older and end up designing GCUs? Software. Is it overkill? Possibly. Would they ever think of building it a different way? No.
Unfair assumption. Boeing sold this particular product on the basis of being the most energy-efficient airliner that technology could devise. Computer-controlled and regulated technology was and is the only practical method for achieving that aim (and hopefully proving it).

The only thing software has going for it in such a situation is that a whole ton of functionality can be implemented with a very low size/weight penalty.
Again, incorrect. Among other factors it is the most practical way of assessing the product's ability to meet its design requirements, and furthermore it is a far more practical method in terms of revising the systems design when problems do arise (it's much easier/cheaper to flash an EEPROM and change the systems' programming than it is to replace a physical TLA board).

Though in the case of a GCU I'm struggling to see what that extra functionality might be.
Providing a method of assessing and measuring the efficiency of the aircraft's systems, and providing a straightforward method of tweaking and improving their behaviour, for starters.
DozyWannabe is offline  
Old 27th May 2015, 01:57
  #82 (permalink)  
 
Join Date: Sep 2014
Location: Canada
Posts: 1,257
Likes: 0
Received 0 Likes on 0 Posts
I don't know too much about a GCU, but I can't imagine that there is much about them that couldn't be run with a good old fashioned analogue control circuit.
All analogue systems -- whether mechanical or electrical -- suffer from aging (yes, that is a technical term). Simply put, analogue systems always drift over time, continuously varying their performance until tolerance limits are exceeded.

When you combine an inherently analogue mechanical system with an analogue control system, you are essentially fighting a losing battle because both the system under control and the system doing the controlling will independently and jointly go out of tolerance. So analogue systems constantly require external tuning to keep their performance levels within an acceptable range.

Analogue systems are also susceptible to changing environmental conditions. The change in temperature from a hot ground tarmac to freezing flight levels is more than enough to affect the performance of analogue components (capacitors, resistors, amplifiers, etc.). So again, it's very difficult to maintain tight tolerances with analogue controls.

Digital systems, on the other hand, are not susceptible to these problems. If I set a certain digital memory parameter to the value of 128 decimal, it will not vary to become 129 or 127 over time. It will also remain exactly at 128 over its entire design temperature & environmental range. While memory can get corrupted -- and there are ways to automatically detect & correct this -- bits and bytes don't age, freeze or boil over.
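
As a rough illustration of the "detect & correct" point, one common idea is to keep three copies of a value and read it back through a bitwise majority vote. This is only a sketch: the function name and the three-copy layout are assumptions made for the example, and nothing here is a claim about how any real GCU stores its parameters.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three stored copies of the same value."""
    # A bit is set in the result when it is set in at least two copies.
    return (a & b) | (a & c) | (b & c)


stored = 128
# One copy suffers a single flipped bit (128 -> 129); the other two are intact.
copies = [stored, stored ^ 0b1, stored]
recovered = majority_vote(*copies)
print(recovered)  # 128
```

A single corrupted copy is outvoted by the other two. If two copies were corrupted in the same bit, the vote would return the wrong value, so this only covers isolated faults.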

The need to meet modern tolerance / precision requirements alone justifies the motivation to use digital vs. analogue techniques.

Plus there are many other advantages to using a digital control system. A single integrated digital controller can monitor, control and tune hundreds of parameters simultaneously in real time -- something that's impossible (or at least impractical) to do with analogue controllers.

Analogue controllers also tend to be fairly "dumb". With digital controllers, manufacturers can implement sophisticated control algorithms to increase performance, economy, reliability, etc.
peekay4 is offline  
Old 27th May 2015, 04:26
  #83 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Likes: 0
Received 0 Likes on 0 Posts
Danger Probability vs Possibility

I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...
Well the FAA did get concerned. And it is still NOT clear if only one continuous run generator/power system/ground power connection for that time would trip ALL off line.

The fact that a (simple?) (single?) counter could trip ALL systems (including the APU, 6 generators in total) off line until some sort of ground access reset doesn't say much for redundancy or safety.

And while the RAT could keep the controls running, it is NOT clear if engines could be restarted [an air start takes electrical power along with windmilling].

Granted it is very very very unlikely, but the probability and possibility are NOT ZERO.
SAMPUBLIUS is offline  
Old 27th May 2015, 05:37
  #84 (permalink)  
 
Join Date: Sep 2014
Location: Canada
Posts: 1,257
Likes: 0
Received 0 Likes on 0 Posts
And it is still NOT clear if only one continuous run generator/power system/ground power connection for that time would trip ALL off line.
No, it is perfectly clear.

Each GCU is independent. The counter is internal to each GCU. Failure of one GCU only affects that particular GCU, and will not affect any others.

The danger is if all four GCUs were powered up at the same time, then all four would fail at the same time.

This is all clearly discussed in the original FAA AD under the "Supplementary Information" section:

https://s3.amazonaws.com/public-insp...2015-10066.pdf

Also the APUs are not affected (different make/model).
peekay4 is offline  
Old 27th May 2015, 12:24
  #85 (permalink)  
 
Join Date: Oct 2005
Location: Classified
Posts: 314
Likes: 0
Received 0 Likes on 0 Posts
..........

Last edited by Radix; 18th Mar 2016 at 01:39.
Radix is offline  
Old 27th May 2015, 15:43
  #86 (permalink)  
 
Join Date: Nov 1999
Location: UK
Posts: 2,492
Received 101 Likes on 61 Posts
I have no problem with digital systems. I have no problem with computers (I fly Airbuses!), and I agree that extra sophistication can be achieved with computers or micro-controllers. I also agree that analogue systems need regular manual adjustment and calibration, which is a pain. (My first profession was as an electronics engineer, and believe me, I have spent plenty of time doing just that.)

However, I do have a problem with a vital system shutting down because a mere clock or counter has reached a particular limit. All a generator, GCU, or hydraulic pump needs is a simple logic circuit to determine if it is working within its parameters, and warn the pilots if it is not. It can be monitored by a computer by all means, but a computer should not have executive control, unless there is a catastrophic situation developing.

I am a line pilot, and if we have just gone around at Innsbruck because we've lost both GCUs in our No 1 engine, and then the two GCUs in our No 2 engine quit in the climb out, and I later discover that all four GCUs quit, not because they overheated, not because they oversped, not because the voltage or frequency was wrong, but because a bloody clock said so... That is really going to xxxx me off! Assuming, of course, that a quadruple genny failure did not distract us so much in that valley that we flew into a mountain!

I like having FADECs to look after the engines and help me prevent exceedances, but I don't expect one to shut down simply because a register has become full.

System software designers must not lose sight of how their systems will be used, or the fact that such systems need to keep running unless a catastrophic or potentially catastrophic situation has arisen.

Clock/counter overflow is not catastrophic. Nor should the shut down potential of a clock/counter/register overflow need to be carefully checked for, because it should never be an issue in the first place.

Pilots don't have the luxury of being able to go through hundreds/thousands of lines of code at their desks all morning, coffee at their side, and eventually saying, "oh, here we go, I found the problem", they just need that genny or hydraulic pump to keep going.
Uplinker is offline  
Old 27th May 2015, 18:07
  #87 (permalink)  
 
Join Date: Nov 2009
Location: flying by night
Posts: 500
Likes: 0
Received 0 Likes on 0 Posts
regarding the purported need for analog controls, and whether we need software...I came to think of lead software engineer Margaret Hamilton, and the code she wrote for Apollo 11, and a picture of it on a stack of paper as tall as her...the code which basically saved the moon landing by recovering from a malfunction: https://medium.com/@verne/margaret-h...n-7d550c73d3fa among other things, she also coined the term "software engineering" (it was treated more like an esoteric art previously, which it isn't)

computers and aircraft seem to be a combination that scares a lot of people (the impression is that bugs and "hackers" are everywhere), but I think it's a really cool combination, and the rare glitches such as this potential thing here haven't really caused any accidents yet, have they? on the contrary, software prevents accidents every day
deptrai is offline  
Old 27th May 2015, 18:11
  #88 (permalink)  
 
Join Date: Jun 2009
Location: Canada
Posts: 464
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Uplinker
Clock/counter overflow is not catastrophic. Nor should the shut down potential of a clock/counter/register overflow need to be carefully checked for, because it should never be an issue in the first place.
Computer systems failed all over the world a couple of years ago when there was a leap second. More will fail this year when the next leap second happens (I know a couple of mine will, so I'll have to shut them down before midnight, and restart after). Clocks suddenly jumping back can have really bad consequences on all kinds of code which expects the clock to start at 0 and only ever increment.
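
For code that only needs to measure intervals, the usual defence is to read a monotonic clock rather than wall-clock time. A minimal Python sketch follows; `time.monotonic()` is a standard-library call, while `measure` is a hypothetical helper named just for this example.

```python
import time


def measure(work) -> float:
    """Return elapsed seconds for work(), using a clock that never runs backwards."""
    start = time.monotonic()  # not affected by system clock updates
    work()
    return time.monotonic() - start


elapsed = measure(lambda: sum(range(100_000)))
print(elapsed >= 0.0)  # True
```

This does not help code that genuinely needs calendar time, but it removes the "clock jumped back" case from timeouts and rate calculations.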

We also had a similar issue a few years ago with some hardware we bought which started spewing errors after being in operation for about four years. Turned out that, although their clock was 64 bits, someone had unintentionally copied it to a 32-bit variable and then copied that back to the 64-bit variable, so the top bits were always zero. The things you don't test are usually the things that don't work.
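
The 64-to-32-bit round trip is easy to model. The sketch below assumes an unsigned counter and uses a mask to stand in for the narrower variable; the tick rate of the actual hardware isn't known here, so only the truncation itself is shown.

```python
MASK32 = 0xFFFFFFFF


def through_32bit(value64: int) -> int:
    """Model a 64-bit value copied into a 32-bit variable and back again."""
    return value64 & MASK32  # upper 32 bits are lost


small = 1_000_000
large = (1 << 32) + 5  # just past the range a 32-bit variable can hold

print(through_32bit(small) == small)  # True
print(through_32bit(large))           # 5
```

Everything looks fine until the counter first needs the 33rd bit, which is why a fault like this can sit unnoticed for years.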
MG23 is offline  
Old 27th May 2015, 21:40
  #89 (permalink)  
 
Join Date: Jun 2009
Location: Canada
Posts: 464
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by msbbarratt
Does your little Honda petrol generator need software to reliably produce 240V 50Hz or 110V 60Hz? No.
New ones apparently have electronic fuel injection, so probably... yes.

So what's that got to do with the control of a generator? It's still all about monitoring.
And, apparently, shutting it down when the monitoring says it exceeded safe parameters. As someone touched on earlier in the thread, if the 'safe parameter' is 'voltage didn't exceed X for Y seconds' and the time since the last check is calculated as -20,000,000 seconds because the counter just jumped back to zero, then the software may well barf and shut down because it doesn't know what's going on. For something that 'can never happen', that behaviour makes sense... until it happens.
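
A small sketch of that failure and the usual fix, assuming an unsigned 32-bit tick counter (the width and the function names are assumptions for illustration, not anything taken from the GCU software):

```python
WRAP = 1 << 32  # assumed width: unsigned 32-bit tick counter


def naive_elapsed(now: int, last: int) -> int:
    return now - last  # goes hugely negative right after a wrap


def wrap_safe_elapsed(now: int, last: int) -> int:
    return (now - last) % WRAP  # correct while the true gap is under 2**32 ticks


last_check = WRAP - 10  # ten ticks before the counter rolls over
now = 5                 # fifteen ticks later, after the wrap

print(naive_elapsed(now, last_check))      # -4294967281
print(wrap_safe_elapsed(now, last_check))  # 15
```

Taking the difference modulo the counter width gives the right interval across the rollover, provided checks happen more often than one full wrap.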

One of the benefits of software control systems over analogue is that you can make them as complex as you want. One of the downsides is that they may contain completely unexpected failure modes, while analogue systems tend to fail in predictable ways. Just because the software has worked perfectly for 248 days doesn't mean it won't fail completely after 249, whereas an analogue system will usually degrade before it fails.
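
For what it's worth, the 248-day figure is at least consistent with a signed 32-bit counter incremented every 10 ms. The AD does not state the counter width or tick rate, so both are assumptions in this bit of arithmetic:

```python
INT32_MAX = 2**31 - 1      # largest value of a signed 32-bit integer
TICKS_PER_SECOND = 100     # assumed: one tick every 10 ms
SECONDS_PER_DAY = 86_400

days_to_overflow = (INT32_MAX + 1) / TICKS_PER_SECOND / SECONDS_PER_DAY
print(round(days_to_overflow, 2))  # 248.55
```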
MG23 is offline  
Old 27th May 2015, 21:52
  #90 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Likes: 0
Received 0 Likes on 0 Posts
Question About ALL Generators /Power on line for 248 days

Peekay4 said
No, it is perfectly clear.

Each GCU is independent. The counter is internal to each GCU. Failure of one GCU only affects that particular GCU, and will not affect any others.
yet the FAA doc first page says very clearly


https://s3.amazonaws.com/public-insp...2015-10066.pdf

This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power. We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane.

Loss of ALL AC power includes APU

But further on says
If the four main GCUs were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.
Now granted the 248 days is a stretch, but it's still a good bet that both engines (4 gens total) are started within minutes of each other, and absent some special maintenance on one engine, the time counts would be within minutes of each other.

The role of the APU in that case is not well defined.

And is there a common GCU counter for all in the system ?

Can you start any engine via windmilling with a 'bricked' GEN, as in the case where both engines drop off within a few minutes?
Or can the APU cross feed to the engine ignition system ??

Anyone KNOW for sure ?
SAMPUBLIUS is offline  
Old 28th May 2015, 02:16
  #91 (permalink)  
 
Join Date: Jul 2013
Location: Everett, WA
Age: 68
Posts: 4,407
Received 180 Likes on 88 Posts
On the 787, engine ignition is 28 Vdc, not AC.
tdracer is online now  
Old 28th May 2015, 03:01
  #92 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Likes: 0
Received 0 Likes on 0 Posts
About Engine Ignition

Sure, but how is the 27-28 V DC generated ?? - AFAIK it is via inverter from AC system.

IOW if GEN is shut down ( still rotates but no output ), what about DC ??

Last edited by SAMPUBLIUS; 28th May 2015 at 03:10. Reason: corrected to DC via inverter
SAMPUBLIUS is offline  
Old 28th May 2015, 03:16
  #93 (permalink)  
 
Join Date: Jul 2013
Location: Everett, WA
Age: 68
Posts: 4,407
Received 180 Likes on 88 Posts
No, 28 Vdc for the igniters comes off the hot battery bus (igniter power available from the hot battery bus is a common design feature on Boeings, although on most the igniter power only drops to the battery bus when the main AC bus goes down). The whole reason they used DC for the igniters on the 787 is so they wouldn't need that current going through a DC to AC inverter.
tdracer is online now  
Old 28th May 2015, 03:28
  #94 (permalink)  
 
Join Date: Sep 2014
Location: Canada
Posts: 1,257
Likes: 0
Received 0 Likes on 0 Posts
Loss of ALL AC power includes APU
No. In the context of the FAA AD, the APU is not considered, as presumably the APU may not be operating at the time of the GCU failures (see below for exact wording).

The AD does NOT mean integer overflow in one GCU will shut down the other three GCUs plus the APU as you seem to think. That is simply wrong!

The 787 has six generators: 2 per engine (4 total) plus 2 more on the APU. The FAA statement makes clear that the issue affects the four GCUs related to the engines, not any GCUs related to the APU.

Read the following sections from the FAA AD carefully:

The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.
Translation: overflow in one GCU only causes THAT one unit to go into failsafe mode. The FAA wrote "that GCU", not "all GCUs".

If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time
Translation: clearly here the FAA is only talking about the four GCUs related to the engines, not the other two GCUs related to the APU.

And notice this simultaneous failure only occurs "if the four main GCUs ... were powered up at the same time".

Translation: if the four main GCUs were not powered up at the same time, failure in one GCU will not cause all others to fail.
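
To make the timing argument concrete, here is a toy model in which each GCU simply fails a fixed number of days after its own power-on. The 248.55-day lifetime assumes 2**31 ticks at 100 Hz, and the power-on days are invented for the example.

```python
OVERFLOW_DAYS = 248.55  # assumed counter lifetime, from 2**31 ticks at 100 Hz


def failsafe_days(power_on_days):
    """Day on which each GCU's counter overflows, given its power-on day."""
    return [start + OVERFLOW_DAYS for start in power_on_days]


together = failsafe_days([0, 0, 0, 0])
staggered = failsafe_days([0, 0, 3, 3])  # hypothetical: one pair power-cycled on day 3

print(len(set(together)))   # 1 -> all four drop out at once
print(len(set(staggered)))  # 2 -> the pairs drop out three days apart
```

Each counter is independent, so the only thing that lines the failures up is a shared power-on time.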
peekay4 is offline  
Old 28th May 2015, 04:00
  #95 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Likes: 0
Received 0 Likes on 0 Posts
Question re 28 VDC from battery

28 Vdc for the igniters comes off the hot battery bus
So if 4 gens get bricked due to the timer, then the APU must be started via battery, which also powers the igniter, and then the APU can keep the igniter and battery charged and ?? going.

Even so - IMO the timer shutoff of GENs although unlikely is still a dumb idea.

And it takes x months to correct ??
SAMPUBLIUS is offline  
Old 28th May 2015, 05:40
  #96 (permalink)  
 
Join Date: Mar 2002
Location: Seat 1A
Posts: 8,552
Received 73 Likes on 42 Posts
Peekay, correct me if I am wrong but, "when power is applied" obviously means external power, and when that is plugged in, the engines will not be running. So the GCUs are powered, even though their engines are not running. Surely the APU GCUs are in the same boat? "Powered" without the APU running? Scenario: cold ship, APU is started (or external power applied), as soon as power comes on-line, all 6 GCUs will be powered. 248 days later, all 6 GCUs will die.

You seem to be suggesting that the engine GCUs remain powered when the engines are stopped but the APU GCUs are "depowered" when the APU is stopped.

Tinfoil hat donned!
Capn Bloggs is offline  
Old 28th May 2015, 14:25
  #97 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Likes: 0
Received 0 Likes on 0 Posts
Unhappy RE powered GCU

Capn Bloggs said
Scenario: cold ship, APU is started (or external power applied), as soon as power comes on-line, all 6 GCUs will be powered. 248 days later, all 6 GCUs will die.
Capn may have a valid point !!

IMO, since the APU is usually shut down in mid flight, it may be unlikely that the APU 2 GENS will have the same time count.

Meanwhile since each engine has 2 GENS, then those 2 GENS will usually have the same time powered. And since both engines are often started within minutes of each other, then it is quite likely that 4 GENS could be bricked by a counter within minutes of each other.

BUT what is unclear is IF the GEN COUNTER TIME counts SYSTEM powered time, such as when under Ground Power. If it does, then even though the APU is off, the 2 APU GENS may also be accumulating 'powered' time.

One would certainly hope that Boeing would by now have clarified just how 'powered' time is counted by each GEN.

Last edited by SAMPUBLIUS; 28th May 2015 at 14:31. Reason: capn may have valid point plus typo
SAMPUBLIUS is offline  
Old 29th May 2015, 02:01
  #98 (permalink)  
 
Join Date: Jul 2002
Location: UK
Posts: 3,093
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by msbbarratt
Perhaps, but when did you last hear one say "I know, I'll do this control system as a synchronous state machine"? Not once in the last 20 years I bet.
As I've stated many times, I don't work in real-time systems myself (Software Engineering is a job for me, not a calling), so it's unlikely that I'd be hearing it in the work environment. But I only graduated 14 years ago, and I still remember what a finite state machine is, and with a bit of revision could probably explain the difference between an NFA and a DFA.

The best ones get the highest paid jobs.
Not true based on the experiences of people that I know. For one thing, the kind of personality which tends to occur in those who have an innate talent for low-level bit-flipping doesn't usually lend itself to the corporate politics involved in climbing the career ladder.

And, actually, the best ones know that the most expensive part of building a safety critical system is passing certification.
OK, but a lot of the B787's systems were self-certified, so how does that fit the pattern?

So what's that got to do with the control of a generator? It's still all about monitoring.
Right - it's all about monitoring. Presumably (and this is marginally-educated guesswork on my part), the failsafe mode exists for when the monitoring software detects a problem in the unit. The counter presumably exists in order to timestamp the monitoring operations as they happen. Prudent engineering practice would also likely have any unknown error conditions cause the unit to enter failsafe mode because it's better to be safe than sorry. Therefore the signal to enter failsafe mode would come from the monitoring system, and the counter overflow - being an unexpected error condition - would cause the monitoring system to do so. So even if the control system was implemented in hardware, it wouldn't alter the situation.
DozyWannabe is offline  
Old 30th May 2015, 10:02
  #99 (permalink)  
 
Join Date: Jan 2011
Location: on the cusp
Age: 52
Posts: 217
Likes: 0
Received 0 Likes on 0 Posts
This issue is not about the software as far as I see it. I am pretty sure that the software module was implemented and tested successfully against the module requirements. It is the systems engineering process that is a concern.

During design, implementation and certification the analysis never revealed that an important component is guaranteed to fail every 248 days. This is evident because had it been analysed then there would already be an aircrew/maintenance action in place. The aim of the certification and analysis process is to drive out and mitigate all potential latent defects to an acceptable probability/consequence level. In this case the process failed spectacularly. Luckily the outcome was unspectacular and easy to manage, but the failure of the process is still of great concern.
dClbydalpha is offline  
