Boeing 787 integer overflow bug

Old 2nd May 2015, 10:46
  #21 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Originally Posted by peekay4
Please don't leave your 787 powered on for 248 days straight...

New FAA AD:

https://s3.amazonaws.com/public-insp...2015-10066.pdf
A counter overflow has been found. Someone asked how long we would need to keep a generator running for the overflow problem to show: 248 days!

Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small; it is zero.

Yet the FAA felt they had to issue an AD!? Really???
Ian W is offline  
Old 2nd May 2015, 11:12
  #22 (permalink)  
 
Join Date: Apr 2007
Location: scotland
Age: 65
Posts: 7
Ian W, could it be that although the gennies aren't running, the GCUs are still powered? Given that it is apparently a real b..l ache if the 787 gets electrically unpowered, and that it takes hours to "restart", it may well be that this issue is more probable than we think.
startall4 is offline  
Old 2nd May 2015, 11:57
  #23 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Originally Posted by startall4
Ian W, could it be that although the gennies aren't running, the GCUs are still powered? Given that it is apparently a real b..l ache if the 787 gets electrically unpowered, and that it takes hours to "restart", it may well be that this issue is more probable than we think.
Well, good luck with the B check maintenance.
Ian W is offline  
Old 2nd May 2015, 15:27
  #24 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
ON POWERED TIME

As the probability of generators being kept running for that long is zero,
Methinks thou art missing a point. Power-on time also includes time when ground power is connected. Early on, the 787 took a long time to boot up its computers and systems from power off. So, as I understand it, airlines were encouraged to connect ground power and transfer before shutting down the APU. Also, power-up seemed to give a bundle of false alarms that needed to be reset and sorted out.

Thus it was (is?) not unusual for a 787 to be powered on for several months, at least to the point of triggering the "counter" involved.

And redundant, fail-safe systems should NOT have a common tie point which can trigger such an event.
SAMPUBLIUS is offline  
Old 2nd May 2015, 15:35
  #25 (permalink)  
Thread Starter
 
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
Yes, really

The counter involved also counts while ground power is connected, for example prior to APU shutdown.
SAMPUBLIUS is offline  
Old 2nd May 2015, 16:08
  #26 (permalink)  
 
Join Date: Jan 2011
Location: Seattle
Posts: 715
DozyWannabe

Your analysis sounds about right, but from what I've been told real-time aviation software isn't usually hand-coded in the manner most other software is.
Sadly, this is exactly the case I ran into many times while at Boeing. One would expect embedded controllers to be based on tested and stable RTOSs and libraries, where an uptime of 248 days is no big deal for 32-bit controllers because the overflow and wrap-around issues have already been addressed. But I've worked with people who insisted on writing every line of code from scratch, just because of NIH (not invented here).
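
One common way such wrap-around issues are handled in tick-based timing code is to compute elapsed time with modular (unsigned) subtraction, so the result stays correct across a single wrap of the counter. A minimal sketch, assuming a free-running 32-bit tick counter; the names are illustrative and nothing here is taken from the actual GCU software:

```python
# Illustrative only: assumes a free-running unsigned 32-bit tick counter.
MASK32 = 0xFFFFFFFF

def elapsed_ticks(now, start):
    """Ticks elapsed from `start` to `now`, correct across one wrap.

    Valid as long as the true interval is shorter than 2**32 ticks.
    """
    return (now - start) & MASK32

# Ordinary case: no wrap between the two readings.
print(elapsed_ticks(100, 40))

# Wrapped case: the counter started near its maximum and rolled over to 5.
print(elapsed_ticks(5, 0xFFFFFFFB))
```

The point of the design is that code comparing two timestamps never needs to know whether the counter has wrapped in between; only intervals longer than the full counter period are ambiguous.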
EEngr is offline  
Old 2nd May 2015, 16:28
  #27 (permalink)  
 
Join Date: Jan 2008
Location: uk
Posts: 857
Originally Posted by Ian W
A counter overflow has been found. Someone asked how long we would need to keep a generator running for the overflow problem to show: 248 days!
The history of software development is littered with problems caused by people who thought counters were "big enough" that overflow would never be a problem, or that they would never overflow in the expected life of the software, or that the programmer would be retired / dead by the time the problem hit. This sort of thing really should _not_ be happening in safety-critical software in this century.

Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small; it is zero.

Yet the FAA felt they had to issue an AD!? Really???
The AD seems to just say "mandatory restart every 120 days" - I guess that gives two chances to catch it plus a bit of margin. If everyone is doing this anyway - if there is zero chance as you say - then I'm not sure why they included a cost of compliance...
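
The "two chances" reading can be checked with simple arithmetic. A minimal sketch, taking the 248-day and 120-day figures as quoted in this thread; the function names are made up for illustration:

```python
# Figures as quoted in this thread; function names are illustrative only.
OVERFLOW_DAYS = 248          # continuous power-on time before the counter overflows
RESTART_INTERVAL_DAYS = 120  # mandated power-cycle interval

def scheduled_restarts_before_overflow(overflow_days, interval_days):
    """Number of scheduled restarts that fall before the overflow point."""
    return overflow_days // interval_days

def margin_after_missed(overflow_days, interval_days, missed):
    """Days left before overflow at the restart that follows `missed` skipped ones."""
    return overflow_days - interval_days * (missed + 1)

print(scheduled_restarts_before_overflow(OVERFLOW_DAYS, RESTART_INTERVAL_DAYS))
print(margin_after_missed(OVERFLOW_DAYS, RESTART_INTERVAL_DAYS, missed=1))
```

With these numbers, one scheduled restart can be missed entirely and the next one still lands 8 days before the overflow; missing two in a row runs past it.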

It is also implied that this was "found" and therefore was not previously documented - as it should have been. To me, this indicates a non-zero risk that in some future change someone will make the counter value persistent (no resets), or make it effectively smaller (and overflow sooner), assuming (because it is not documented) that overflow causes no problems. The AD serves, in part, to document it.

I am more interested in what remains unsaid, namely why this software was/is being tested "in laboratory testing" _now_ - inevitable suspicion is that it is because of a real in-service problem (most likely not this one as you say). It also raises the question of why the software was _not_ tested "in the lab" before flight (or maybe it was but not fully / correctly). I don't suppose we'll ever know...
infrequentflyer789 is offline  
Old 2nd May 2015, 17:05
  #28 (permalink)  
 
Join Date: Jun 2011
Location: france
Posts: 760
INTEGER OVERFLOW WAS ALREADY THE MAIN CAUSE OF THE ARIANE 501 ROCKET CRASH!
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAJOR ACCIDENT (cost: 8 billion FF)

The report also showed many other IT faults (I counted 99, of 80 different types).

Many people think it is impossible to write and implement bug-free code. That must end. It is possible to write 1+1=2 and not 1+1=3, it is possible to debug IT.

Again:
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAJOR ACCIDENT (cost: 8 billion FF)! Shame on our aerospace community.
roulishollandais is offline  
Old 2nd May 2015, 17:33
  #29 (permalink)  
ImageGear
Guest
 
Posts: n/a
"NO LESSON HAS BEEN LEARNED..."

Indeed, in the rush to push any hardware/firmware/software out of the door and start bringing in revenue (to fund the completion of the design), just about the last programme task to be started is the error recovery routines.

In the meantime, the infamous "Bit Bucket" is the black hole where subroutines die when a line of code has failed to execute correctly.
A Programme Director who has the guts to put the brakes on rollout until elegant error recovery schemes are completed will not occupy his role for long.

In the meantime, pilots, engineers and airlines must pick up the tab for debugging the supplier's code in more ways than one.
 
Old 2nd May 2015, 17:43
  #30 (permalink)  
 
Join Date: Jun 2011
Location: france
Posts: 760
In the meantime, pilots, engineers and airlines must pick up the tab for debugging the supplier's code in more ways than one.
... And passengers

roulishollandais is offline  
Old 2nd May 2015, 18:47
  #31 (permalink)  
 
Join Date: Oct 2004
Location: California
Posts: 385
All counters roll over (overflow) at some point. Some more often than others. The GPS Week counter rolls over every 1024 weeks (19+ years). The first rollover happened in 1999 (just before the year 2000 rollover problem) and caused many GPS systems to fail. The next GPS week rollover will happen in 2019. The date format of Digital Certificates will have a rollover situation in 2049.
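
The GPS dates follow directly from the counter width. A minimal sketch, assuming the commonly documented background facts that the broadcast week number is a 10-bit field and that GPS week 0 began on 6 January 1980; the function names are illustrative:

```python
from datetime import date, timedelta

# Background assumptions (not from this thread): GPS week 0 started on
# 6 January 1980 and the broadcast week number is 10 bits wide.
GPS_EPOCH = date(1980, 1, 6)
WEEK_BITS = 10
WEEKS_PER_ROLLOVER = 2 ** WEEK_BITS  # 1024 weeks

def rollover_date(n):
    """Calendar date on which the n-th week-number rollover occurs."""
    return GPS_EPOCH + timedelta(weeks=n * WEEKS_PER_ROLLOVER)

def broadcast_week(day):
    """Week number as a receiver would see it: full weeks modulo 1024."""
    return ((day - GPS_EPOCH).days // 7) % WEEKS_PER_ROLLOVER

print(rollover_date(1), rollover_date(2))
print(broadcast_week(date(1999, 8, 21)), broadcast_week(date(1999, 8, 22)))
```

A receiver that sees only the 10-bit value cannot tell week 0 of 1980 from week 0 of 1999 without some other source of date information, which is where the failures came from.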
MarcK is offline  
Old 2nd May 2015, 18:57
  #32 (permalink)  
 
Join Date: Mar 2015
Location: XFW, Germany
Posts: 128
and several UNIX systems will have a nice rollover in 2038

Yet that very 248 days (32 bits at 1/100 s) has happened before, and it's just insane that they "reintroduce" it (ask Win95 users who happened to have such an uptime).
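
The 248-day figure can be reproduced from the counter width and tick rate. A minimal sketch, assuming a signed 32-bit counter incremented 100 times per second; that width and rate are inferred from the 248-day figure, not something stated in the AD, and the function name is illustrative:

```python
SECONDS_PER_DAY = 86_400

def days_until_wrap(bits, ticks_per_second, signed):
    """Days of continuous running before a counter of this size overflows."""
    usable_bits = bits - 1 if signed else bits  # a signed counter loses one bit to the sign
    return 2 ** usable_bits / ticks_per_second / SECONDS_PER_DAY

# Assumed GCU-style case: signed 32-bit counter at 100 ticks per second.
print(round(days_until_wrap(32, 100, signed=True), 2))

# Same tick rate, unsigned: twice as long.
print(round(days_until_wrap(32, 100, signed=False), 2))

# Unsigned 32-bit millisecond counter, for comparison.
print(round(days_until_wrap(32, 1000, signed=False), 2))
```

The last case, about 49.7 days, is the figure usually associated with the Windows 95/98 uptime hang, so that bug is a close cousin of the 248-day one rather than the same number.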
PAXfips is offline  
Old 2nd May 2015, 19:44
  #33 (permalink)  
 
Join Date: Nov 2009
Location: flying by night
Posts: 500
this is the 3rd thread about the same subject, the other 2 have been merged here: http://www.pprune.org/tech-log/56075...-new-post.html
deptrai is offline  
Old 2nd May 2015, 20:57
  #34 (permalink)  
 
Join Date: Jun 2002
Location: Avon, CT, USA
Age: 68
Posts: 470
Planes have turned into flying computers, so I guess they need to be rebooted often to prevent problems. Hopefully, at cruise, a pilot won't have the computer try to calculate pi to infinity. I believe that's how they overloaded the computer in a Star Trek episode.
ATPMBA is offline  
Old 2nd May 2015, 22:11
  #35 (permalink)  
 
Join Date: Jun 2001
Location: Rockytop, Tennessee, USA
Posts: 5,898
Lead story on the U.S. edition of CNN's web page:

FAA finds Boeing Dreamliner could lose all power, issues maintenance mandate

By Greg Botelho, CNN
Updated 5:25 PM ET, Sat May 2, 2015


(CNN)—The headaches for Boeing over its 787 Dreamliner continue.

The Federal Aviation Administration on Friday issued a directive mandating "a repetitive maintenance task" for that model of airliners due to issues with its power supply. Specifically, the FAA explained testing revealed that 787s could lose all AC electrical power after being continuously powered for 248 days, a problem that -- if left unchecked -- would leave an aircrew unable to control the plane.

The order took effect immediately, with the federal agency finding that there's no good reason to delay the decision.

"The FAA has found that the risk to the flying public justifies waiving notice and comment," the agency said.

The maintenance mandate was characterized as temporary, until software is developed to resolve the problem.

This marks the latest setback for Boeing over its 787, which debuted in 2011 in Asia and a year later in the United States amid much fanfare...
FAA: Dreamliner battery could lose all power - CNN.com

Oh, the horror...
Airbubba is offline  
Old 2nd May 2015, 22:35
  #36 (permalink)  
 
Join Date: Dec 2014
Location: USA
Posts: 41
It is possible to write 1+1=2 and not 1+1=3, it is possible to debug IT.
Of course the systems on the 787 are a bit more complicated than an adding machine. What happens when you enter 9999999999 + 1 on your calculator?

It is no more possible to write defect-free complex software than it is to design a hydraulic pump that never fails. The proper approach is therefore to expect failures and design systems that are fault-tolerant. Maybe Boeing didn't do that here, or could have done better, but the general idea that we can achieve perfection in complex software is wishful thinking.

Even with our best engineering we find problems that go undetected during design and testing. We deal with those when we discover them, as Boeing are doing now (they are updating the software).
ams6110 is offline  
Old 3rd May 2015, 00:20
  #37 (permalink)  
 
Join Date: Aug 2013
Location: Land of 1,000 Dances
Age: 63
Posts: 68
Slightly tongue-in-cheek post from IT journal The Register:


The US Federal Aviation Administration (FAA) has issued a new airworthiness directive (PDF) for Boeing's 787 because a software bug shuts down the plane's electricity generators every 248 days.
“We have been advised by Boeing of an issue identified during laboratory testing,” the directive says. That issue sees “The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.”
When the GCU is in failsafe mode it isn't making any power. That'll be bad news if all four of the GCUs aboard a 787 were powered up at the same time, because all will then shut down, “resulting in a loss of all AC electrical power regardless of flight phase.”
And presumably also turning the 787 into a brick with no power for its fly-by-wire systems, lighting, climate control or in-flight movies.
The fix outlined in the directive is pretty simple: make sure you turn the GCUs off before 248 days elapse.
Boeing is working on a fix and the FAA says “Once this software is developed, approved, and available, we might consider additional rulemaking.”
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again
Have you turned it off and on again? That's the way to stop the plane becoming a brick! The Register
HighAndFlighty is offline  
Old 3rd May 2015, 01:00
  #38 (permalink)  
 
Join Date: Mar 2015
Location: Ottawa, ON, Canada
Posts: 11
re: Perfect Software

ams6110: I'm not a commercial pilot but I do have 25+ years in software. It's not so much "wishful thinking" about perfect software but the cost of attaining it. The Space Shuttle flight control software team was one group that was famous for attaining near zero defects, but they did so with incredible rigour and an associated slow pace.

Can anyone shed some light on the language used in the GCU's system? Was it Ada or C/C++? Regardless, it's pretty common to do "what happens when this value hits its maximum + 1" testing. Even in simple web systems I often do that in unit tests to verify that nothing breaks in an unexpected way.

The difference in this case is probably that it's an internal counter rather than a parameter being passed about. A parameter would probably be subjected to boundary value tests, but perhaps not a global counter. There are also tools for what's called "fuzz testing" that will inject invalid values to catch just these sorts of problems. Again, though, that may have been done but just not in the right place.
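
A "maximum + 1" test on an internal counter might look something like the sketch below. It simulates a signed 32-bit increment with `ctypes.c_int32`, which truncates without overflow checking, and contrasts it with a saturating alternative. Both functions are hypothetical stand-ins for illustration, not the GCU code, and in real C a signed overflow is undefined behaviour rather than a guaranteed wrap:

```python
import ctypes

INT32_MAX = 2 ** 31 - 1
INT32_MIN = -(2 ** 31)

def tick_wrapping(counter):
    """Increment the way a 32-bit two's-complement register typically wraps."""
    return ctypes.c_int32(counter + 1).value

def tick_saturating(counter):
    """Increment but stop at the maximum instead of wrapping."""
    return min(counter + 1, INT32_MAX)

# Boundary-value checks: one step below the limit, then at the limit.
print(tick_wrapping(INT32_MAX - 1) == INT32_MAX)
print(tick_wrapping(INT32_MAX) == INT32_MIN)      # elapsed time suddenly goes negative
print(tick_saturating(INT32_MAX) == INT32_MAX)    # stays put instead
```

The catch described above is that a test like this only helps if the counter can be set to the boundary from the test harness; a counter that can only be advanced by waiting 248 days is unlikely to be exercised.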

One hope I do have, though, is that Boeing treats this like a mechanical issue and does proper root cause analysis and deals with the human aspects as much as the technical ones. That doesn't happen very often at all in the software world.
dave.rooney is offline  
Old 3rd May 2015, 01:14
  #39 (permalink)  
 
Join Date: Mar 2009
Location: Perth Western Australia
Age: 57
Posts: 808
I think they need something like the display Concorde used to have in the cabin, but instead of showing speed, it would count down to power-off.
rh200 is offline  
Old 3rd May 2015, 01:19
  #40 (permalink)  
 
Join Date: Jun 2001
Location: Rockytop, Tennessee, USA
Posts: 5,898
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again
And, actually, as discussed on another thread, not being able to power things down for a reset is a challenge on 'modern' aircraft.

Back in the not so distant FE days pulling a breaker and resetting it would cure a lot of mysterious faults. You were supposed to use that superior systems knowledge to psychoanalyze the electrical system to figure out what relay would be unpowered on what bus etc., etc., etc...

Later, in the early glass days, inoperative computer boxes like FCC's would sometimes be cured back on the ground if you removed all power to the plane and started up again. The feds would also look the other way if, for example, you reset a yaw damper that didn't come online during start. You've seen it before and know how to fix it, no need to pull out the book, right?

Now, in this enlightened era of fly-by-wire and electric airliners, you don't touch a button in case of a fault until you've run through the QRH, then got a phone patch and done a kumbaya session with the dispatcher and maintenance. Which is probably a good thing given the history of creative systems analysis by flight crews. Even after you land, more effort often seems to be put into finding the right deferral code than fixing what might be a simple problem.

Anyway, stuff like a frozen ACARS screen in flight that in the past you would be expected to troubleshoot and pull and reset a breaker is now something that you need to document, advise the company and live with unless you somehow get special dispensation from somebody on the ground.

Sweeping language suggesting a general prudential approach in the preamble to the abnormal section of the flight manual has been replaced by paragraphs of CYA verbiage to insulate the company from liability if you make a command decision to try something not specifically authorized in the book.

So, it seems to me that increasingly, the pilot can't 'turn it off and turn it on again' even on the older aircraft.
Airbubba is offline  

