Boeing 787 integer overflow bug

Old 3rd May 2015, 02:01
  #41 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Likes: 0
Received 0 Likes on 0 Posts
RE OVERFLOW BUG

High and Flighty/FAA said
'...That'll be bad news if all four of the GCUs aboard a 787 were powered up at the same time, because all will then shut down, “resulting in a loss of all AC electrical power regardless of flight phase.”'
But normally there are a few seconds or minutes between the start of one engine and the start of the second. And the APU will be shut off after climb-out, which then leads to the following:

One engine's GCU times out and shuts off its two generators. No biggie: the other engine and the APU can easily carry the load. But a few minutes later, the 2nd engine's generator system cuts out. Oh well, we still have the APU to start engines? Then a few minutes later, the APU generator times out??

Or is the GCU involved a single common point, so that when its timer overflows the battery system cuts in with 'Nearer, My God, to Thee'??
SAMPUBLIUS is offline  
Old 3rd May 2015, 03:02
  #42 (permalink)  
 
Join Date: Jun 2011
Location: france
Posts: 760
Likes: 0
Received 0 Likes on 0 Posts
Building software is like building a house: the first thing you have to do is list all the materials/variables that you need, defining size, use, purpose, movement, range and so on. The integer counter is one of the easiest variables to verify throughout software design and implementation. No need for sophisticated statistics, only very basic methods with paper and pencil in your armchair. No excuse after the Ariane 501 crash and report. NO!
roulishollandais is offline  
Old 3rd May 2015, 03:35
  #43 (permalink)  
 
Join Date: Apr 1998
Location: Mesopotamos
Posts: 5
Likes: 0
Received 0 Likes on 0 Posts
Congratulations to the tester who found the bug. Good testers think outside the box, as this one did. 248 days was perhaps an unlikely scenario, but bringing it up as a test failure really drew everyone's attention to what a simple oversight can do.

Once fixed it's going to take another 248 days to re-run the test.

Anyhow, methinks a 787 version 1.0 probably flies better if you reboot it first.
cattletruck is offline  
Old 3rd May 2015, 05:30
  #44 (permalink)  
 
Join Date: Jan 2008
Location: Los Angeles
Posts: 168
Likes: 0
Received 0 Likes on 0 Posts
Once fixed it's going to take another 248 days to re-run the test.
I'm pretty sure it can be qualified "by inspection".
poorjohn is offline  
Old 3rd May 2015, 09:46
  #45 (permalink)  
 
Join Date: Sep 2002
Location: Oz
Posts: 297
Likes: 0
Received 1 Like on 1 Post
Very interesting thread, something I had never even thought about. I just wish I had even the faintest idea of what you guys are talking about.

So, can someone answer a couple of questions for a simple guy.

Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.

I gather, from googling it (nothing of which I understood anyway), integer counters are fairly common in computing software, so how do banks not have this problem as their computer hardware boxes have times between power downs measured in years.

If software is not hand coded, ie someone pounding away on a keyboard writing lines of code, how is it written if it's not hand coded.

All this stuff, sounds like you're talking about the warp drive of the Starship Enterprise.

I have such a headache now!

Last edited by Guptar; 3rd May 2015 at 09:46. Reason: spelling
Guptar is offline  
Old 3rd May 2015, 11:04
  #46 (permalink)  
 
Join Date: Jun 2014
Location: Village of Santo Poco
Posts: 869
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by p.j.m
Boeing must be using Windows programmers these days.

Pilot: "Hello Help desk - the aircraft has lost power"
Indian "have you rebooted?"
Hey, that's racist!
Amadis of Gaul is offline  
Old 3rd May 2015, 13:26
  #47 (permalink)  
 
Join Date: Jan 2008
Location: Los Angeles
Posts: 168
Likes: 0
Received 0 Likes on 0 Posts
Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?
That's a question for the hardware guys.

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.
A "counter" is basically a unit of data storage in memory and the associated software that manipulates the value being stored, e.g. increments the value and tests it against limit(s). Computers typically have instructions that let them access memory in chunks smaller than the default size, and to not waste memory (which for critical real-time devices can be expensive) the programmer selects a size appropriate to the need.
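The description above can be sketched in a few lines of Python. This is only an illustration, not the GCU's code: the class name FixedWidthCounter, its methods, and the 16-bit/500-count figures are all hypothetical. It shows a stored value, an increment, a test against a limit, and a design-time check that the chosen width is big enough for the need.

```python
class FixedWidthCounter:
    """Hypothetical signed counter of a chosen bit width with a limit test."""

    def __init__(self, bits: int, limit: int):
        # Largest value a signed two's-complement integer of this width can hold.
        self.max_value = 2 ** (bits - 1) - 1
        if limit > self.max_value:
            # The "paper and pencil" check: the limit must fit in the storage.
            raise ValueError(f"limit {limit} does not fit in {bits} signed bits")
        self.limit = limit
        self.value = 0

    def increment(self) -> bool:
        """Add one (stopping at the limit); return True once the limit is reached."""
        if self.value < self.limit:
            self.value += 1
        return self.value >= self.limit


counter = FixedWidthCounter(bits=16, limit=500)
fired = [counter.increment() for _ in range(600)]
print(fired.index(True))  # → 499 (the 500th increment reaches the limit)
print(counter.value)      # → 500 (the counter stops there instead of running on)
```

Stopping at the limit is one design choice among several; a counter that is left free-running has to be sized for the longest time the device can stay powered.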

I gather, from googling it (nothing of which I understood anyway), integer counters are fairly common in computing software, so how do banks not have this problem as their computer hardware boxes have times between power downs measured in years.
This 787 counter counted units of time, so it was a timer. You'd have to know what it was for, and why it was designed to force the hardware it controlled into an inoperative mode when the value became zero. There could have been a valid reason, e.g. the device had reached a critical time limit at which it had to be shut down and lubricated. What the program designer didn't allow for was that the service could take place without powering off the device and resetting the timer/counter.

If software is not hand coded, ie someone pounding away on a keyboard writing lines of code, how is it written if it's not hand coded.
Programmers may insert into their own program software modules written by other programmers. Hand-coded, but by others' hands.
(The design fault here is that the software does not count characters I've typed within a quote, so I have to say something I didn't need to say outside the quote or it will flog me because my message was "too short".)
poorjohn is offline  
Old 3rd May 2015, 18:33
  #48 (permalink)  
 
Join Date: May 2008
Location: denmark
Posts: 9
Likes: 0
Received 0 Likes on 0 Posts
Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?
That's a question for the hardware guys.
I'm not working in aerospace, but in my field of engineering (wind turbines) it definitely would have integer counters.
Often those systems run at a constant scan rate, and software filters and timers are used to slow down the reaction of the system in a configurable manner.
Since systems have boxes connected together with communication links, monitoring for broken communication links has to be implemented (typically as timeouts).
Another purpose could be shutting things down in case of faults, e.g. stop the engine if the lubrication pressure is lower than 2 bar for 5 seconds.
Timers are also used to delay and prevent erratic state changes of an output, e.g. to prevent a valve from being turned off/on on every 10 ms scan.
(Persistent) Counters are also used for statistics for maintenance and trouble shooting.
It is good system-engineering practice to separate/compartmentalize safety critical control, from datalogging for diagnostics.
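As a rough sketch of the lubrication example above, here is what a scan-based fault timer might look like in Python. The 2 bar and 5 second figures come from the post; the 10 ms scan period, the class name LowPressureTrip and its method are assumptions made for illustration only, not taken from any real controller.

```python
SCAN_PERIOD_MS = 10       # assumed constant scan rate
LOW_PRESSURE_BAR = 2.0    # threshold from the example above
TRIP_DELAY_MS = 5000      # "for 5 seconds"


class LowPressureTrip:
    """Hypothetical fault timer: trips after 5 s of continuously low pressure."""

    def __init__(self):
        self.trip_scans = TRIP_DELAY_MS // SCAN_PERIOD_MS  # 500 scans
        self.low_scans = 0

    def scan(self, pressure_bar: float) -> bool:
        """Call once per scan; returns True when the engine should be stopped."""
        if pressure_bar < LOW_PRESSURE_BAR:
            # Count consecutive low scans, but never past the trip value,
            # so this counter cannot overflow however long the fault lasts.
            self.low_scans = min(self.low_scans + 1, self.trip_scans)
        else:
            # Any healthy reading restarts the delay.
            self.low_scans = 0
        return self.low_scans >= self.trip_scans


trip = LowPressureTrip()
states = [trip.scan(1.5) for _ in range(500)]
print(states[498], states[499])  # → False True
```

The counter here only ever holds values from 0 to 500, so a 16-bit variable is ample. The trouble starts with counters that are allowed to run for as long as the box is powered.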
If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit
The size of a counter value is primarily related to software, not hardware architecture.
Using a 64-bit desktop microprocessor in such an environment is often a bad idea; if possible, microcontrollers like ARM Cortex are used instead.
A bigger, more complex CPU uses more power, generates more heat, and is 100 times less reliable than a small microcontroller.
An Intel desktop CPU is only on the market for 3 years, whereas industrial/aerospace products have to be supported for 20 years.
Some of the newer micro controllers like the TMS570 have features that make safety certification easier.
HighWind is offline  
Old 3rd May 2015, 20:56
  #49 (permalink)  
 
Join Date: Mar 2009
Location: Perth Western Australia
Age: 57
Posts: 808
Likes: 0
Received 0 Likes on 0 Posts
What type of variable you declare can depend upon several things:

1) You misunderstand the requirements.

2) you are sloppy.

3)architecture coupled with the above.

It used to be that people tried to keep their code small; with today's CPUs and resources, people have become very sloppy. But there are a couple of situations (probably more) that I know of which force me to declare small variable types.

1) Where the output of the variable results in an excessive amount of data to capture and store.

2) Using micro controllers. These usually have limited on board space. I would imagine in an environment such as these, with access to the best, they would still be constrained.

And there could be many other reasons.
rh200 is offline  
Old 3rd May 2015, 22:40
  #50 (permalink)  
 
Join Date: Aug 2013
Location: PA
Age: 59
Posts: 30
Likes: 0
Received 0 Likes on 0 Posts
FAA directive issued for 787

(CNN)The headaches for Boeing over its 787 Dreamliner continue.

The Federal Aviation Administration on Friday issued a directive mandating "a repetitive maintenance task" for that model of airliners due to issues with its power supply. Specifically, the FAA explained testing revealed that 787s could lose all AC electrical power after being continuously powered for 248 days, a problem that -- if left unchecked -- would leave an aircrew unable to control the plane.

The order took effect immediately, with the federal agency finding that there's no good reason to delay the decision.

FAA finds Boeing Dreamliner could lose all power, issues maintenance mandate
underfire is offline  
Old 4th May 2015, 05:22
  #51 (permalink)  
 
Join Date: Oct 2005
Location: Classified
Posts: 314
Likes: 0
Received 0 Likes on 0 Posts
..........

Last edited by Radix; 18th Mar 2016 at 01:42.
Radix is offline  
Old 4th May 2015, 06:47
  #52 (permalink)  
 
Join Date: Sep 2014
Location: Canada
Posts: 1,257
Likes: 0
Received 0 Likes on 0 Posts
Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?
The purpose of an integer counter is to provide a standard measurement of time.

Remember that hardware can run at varying speeds, so we can't rely on hardware cycle speed to measure time. E.g., suppose today a CPU runs at 1 GHz, but tomorrow a replacement CPU comes out at 2 GHz, so each hardware cycle is now twice as fast. We don't want all of our time measurements to suddenly be off by a factor of two!

Therefore a counter is provided which always increases at a predictable, set time period (called the tick time period) regardless of the underlying hardware speed.

A common tick rate is 100 Hz, i.e. a tick period of 10 ms. The time counter will always increment once every 1/100th of a second, regardless of the speed of the hardware. An elapsed time of 100 ticks means 1 second has passed, on any hardware.

Most real-time systems are completely tick based. At each and every tick, the system "kernel" is activated and every running task re-scheduled for execution based on their priority and allocated processing time budget (also measured in ticks).

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.
Boeing probably had little to do with this bug. The affected GCUs would have been supplied by a third-party company.

And that third-party company probably used a Real Time Operating System (RTOS) supplied by yet another company.

My guess is this integer overflow is probably in the RTOS or related code. The bug might have been discovered in some completely unrelated software (maybe not even aviation software) using the same RTOS.

The speculation is that the buggy code is a 32-bit signed counter measuring 100 Hz ticks. So with one bit taken for the sign (+/-), that leaves 31 bits for the counter, and 2^31/(60*60*24*100) = 248.55 days.
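The arithmetic above can be checked with a short Python sketch. It assumes the speculated setup (a signed 32-bit counter incremented at 100 Hz), which is a guess in this thread and not something the FAA directive confirms. The function names are made up for illustration, and Python integers do not overflow, so the 32-bit wraparound is emulated explicitly.

```python
TICK_HZ = 100                    # assumed tick rate (speculation from the thread)
SECONDS_PER_DAY = 60 * 60 * 24


def days_until_overflow(bits: int, tick_hz: int = TICK_HZ, signed: bool = True) -> float:
    """Days of continuous power before a tick counter of this width runs out of range."""
    usable_bits = bits - 1 if signed else bits   # one bit is lost to the sign
    return 2 ** usable_bits / (tick_hz * SECONDS_PER_DAY)


def to_int32(value: int) -> int:
    """Reinterpret an integer as a two's-complement signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


print(round(days_until_overflow(32), 2))   # → 248.55
print(to_int32(2**31 - 1 + 1))             # → -2147483648
```

The second line of output shows what one tick too many does on typical two's-complement hardware: the largest positive value becomes the most negative one, so any code that assumes time only moves forward suddenly sees a huge step backwards.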
peekay4 is offline  
Old 5th May 2015, 01:33
  #53 (permalink)  
 
Join Date: Jun 2011
Location: france
Posts: 760
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Peekay4
Boeing probably had little to do with this bug. The affected GCUs would have been supplied by a third-party company.

And that third-party company probably used a Real Time Operating System (RTOS) supplied by yet another company.

My guess is this integer overflow is probably in the RTOS or related code. The bug might have been discovered in some completely unrelated software (maybe not even aviation software) using the same RTOS.
If you use software from a third party, you need not only the software or the RTOS but all of its documentation and all of the test data. The supplier of the RTOS or the software may have designed it for a toy, but Boeing uses it for an aircraft.
The certifiers are at fault too: they have to verify that the documentation and test data are there, and that the tests have actually been done after implementation. It seems easy to ask a third party to share the work, but in fact you have to verify all the links.
To be sure the work is done, you should pay only once you have received everything and it is OK. Everyone must sign off their work as complete. The certifiers should not have certified the B787 before all the tests were done and on the table.

Last edited by roulishollandais; 5th May 2015 at 01:36. Reason: bugs from my spellchecker which was still in french!
roulishollandais is offline  
Old 6th May 2015, 10:16
  #54 (permalink)  
 
Join Date: Feb 2002
Location: UK
Age: 58
Posts: 3,500
Received 165 Likes on 89 Posts
Back in the real world...

It takes about 20 minutes to downpower and reboot the a/c. Not good on a quick turn round but if the a/c has just come out of the shed after an A check, no big issue.
It is common practice to park the a/c without power if it is not required for several hours.

Just another card on the check.
TURIN is offline  
Old 6th May 2015, 12:10
  #55 (permalink)  
 
Join Date: Aug 2005
Location: fairly close to the colonial capitol
Age: 55
Posts: 1,693
Likes: 0
Received 0 Likes on 0 Posts
The 'lazy' certification issue RH mentions is truer today than ever before. The regulators' growing reliance on manufacturer-designed testing regimes for airborne computer systems has had the odd chicken coming home to roost in recent times (the past few decades).

This lack of complete knowledge of the widest operational range (extremes/faulty sensors/etc) at the confluence of hardware/software interface has the potential for the occasional dicey consequence - particularly after human factors are added into the melange.
vapilot2004 is offline  
Old 10th May 2015, 16:41
  #56 (permalink)  
 
Join Date: Jun 2011
Location: france
Posts: 760
Likes: 0
Received 0 Likes on 0 Posts
integer overflow

If you don't want to read that report in full:
Originally Posted by Ariane 501 full report (12 pages only to read)
The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error.

The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected.

The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose.

The alignment function is operative for 50 seconds after starting of the Flight Mode of the SRIs which occurs at H0 - 3 seconds for Ariane 5. Consequently, when lift-off occurs, the function continues for approx. 40 seconds of flight. This time sequence is based on a requirement of Ariane 4 and is not required for Ariane 5.

The Operand Error occurred due to an unexpected high value of an internal alignment function result called BH, Horizontal Bias, related to the horizontal velocity sensed by the platform. This value is calculated as an indicator for alignment precision over time.

The value of BH was much higher than expected because the early part of the trajectory of Ariane 5 differs from that of Ariane 4 and results in considerably higher horizontal velocity values.
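To make the quoted failure mode concrete, here is a small Python sketch of converting a 64-bit float to a 16-bit signed integer. It is not the SRI's Ada code and does not claim to show how the real software was or should have been protected. The function names are hypothetical, and the two variants only illustrate the difference between a conversion that raises an error when the value is out of range and one that is guarded by clamping.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} cannot be represented in 16 signed bits")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_checked(1234.7))      # → 1234
print(to_int16_saturating(40000.0))  # → 32767
```

Calling to_int16_checked(40000.0) raises OverflowError, which is roughly the position the quoted report describes: a value larger than expected, and nothing around the conversion to handle it. Whether clamping is acceptable depends entirely on what the value is used for.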
roulishollandais is offline  
Old 10th May 2015, 18:51
  #57 (permalink)  
 
Join Date: Feb 2009
Location: RSW & Europe
Posts: 61
Likes: 0
Received 0 Likes on 0 Posts
Ariane 501 and Cluster

There is no such thing as a "free launch" or lunch, I was involved in Cluster and almost 10 years of my life went up in smoke.
Cluster (spacecraft) - Wikipedia, the free encyclopedia
blackbeard1 is offline  
Old 12th May 2015, 10:23
  #58 (permalink)  
 
Join Date: Jun 2011
Location: france
Posts: 760
Likes: 0
Received 0 Likes on 0 Posts
@blackbeard1
10 years of your life up in smoke because of that crazy overflow, a tiny fault with such wide consequences for that rocket project! Condolences!

Boeing will probably find other things to worry about…
roulishollandais is offline  
Old 12th May 2015, 11:21
  #59 (permalink)  
 
Join Date: Jun 2002
Location: Geneva, Switzerland
Age: 58
Posts: 1,907
Received 3 Likes on 3 Posts
Wasn't Cluster II re-launched a few years later, and isn't it still happily operating?
atakacs is offline  
Old 12th May 2015, 12:21
  #60 (permalink)  
 
Join Date: Feb 2009
Location: RSW & Europe
Posts: 61
Likes: 0
Received 0 Likes on 0 Posts
Cluster

Cluster was rebuilt and launched from Baikonur, and is still working and giving good scientific results, as you said. I am now retired, as are most of the original team; sadly some have died, but it is good to know that the original design and objectives are still giving good scientific results.

ESA Science & Technology: Cluster
blackbeard1 is offline  

