Boeing 787 integer overflow bug
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Please don't leave your 787 powered on for 248 days straight...
New FAA AD:
https://s3.amazonaws.com/public-insp...2015-10066.pdf
Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small; it is zero.
Yet the FAA felt they had to issue an AD!? Really???
Join Date: Apr 2007
Location: scotland
Age: 65
Posts: 7
Ian W, could it be that although the gennys aren't running, the GCUs are powered? Given that it is apparently a real b..l ache if the 787 gets electrically unpowered and takes hours to "restart", it may well be that this issue is more probable than we think.
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Well good luck with the B check maintenance
Thread Starter
Join Date: Apr 2014
Location: Washstate
Age: 79
Posts: 0
ON POWERED TIME
As the probability of generators being kept running for that long is zero,
Thus it was (is?) not unusual for a 787 to be powered on for several months, at least to the point of triggering the "counter" involved.
And redundant, fail-safe systems should NOT have a common tie point which can trigger such an event.
DozyWannabe
Sadly, this is exactly the case I ran into many times while at Boeing. One would expect embedded controllers to be based on tested and stable RTOSs and libraries, where an uptime of 248 days is no big deal for 32-bit controllers because the overflow and wrap-around issues have already been addressed. But I've worked with people who insisted on writing every line of code from scratch, just because of NIH (not invented here).
Your analysis sounds about right, but from what I've been told real-time aviation software isn't usually hand-coded in the manner most other software is.
Join Date: Jan 2008
Location: uk
Posts: 857
Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small; it is zero.
Yet the FAA felt they had to issue an AD!? Really???
It is also implied that this was "found" and therefore was not previously documented - as it should have been. To me, this indicates a non-zero risk that in some future change someone will make the counter value persistent (no resets), or make it effectively smaller (and overflow sooner), assuming (because it is not documented) that overflow causes no problems. The AD serves, in part, to document it.
I am more interested in what remains unsaid, namely why this software was/is being tested "in laboratory testing" _now_ - inevitable suspicion is that it is because of a real in-service problem (most likely not this one as you say). It also raises the question of why the software was _not_ tested "in the lab" before flight (or maybe it was but not fully / correctly). I don't suppose we'll ever know...
Join Date: Jun 2011
Location: france
Posts: 760
INTEGER OVERFLOW WAS ALREADY THE MAIN CAUSE OF THE ARIANE 501 ROCKET CRASH!
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAJOR ACCIDENT (cost: 8 billion FF)
The report also showed many other IT faults (I counted 99 of 80 different types).
Many people think it is impossible to write and implement bug-free lines. That must end. It is possible to write 1+1=2 and not 1+1=3; it is possible to debug IT.
Again:
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAJOR ACCIDENT (cost: 8 billion FF)! Shame on our aerospace community
Guest
Posts: n/a
"NO LESSON HAS BEEN LEARNED..."
Indeed, in the rush to push any hardware/firmware/software out of the door and start bringing in revenue (to fund the completion of the design), just about the last programme task to be started is writing the error recovery routines.
In the meantime, the infamous "Bit Bucket" is the black hole where subroutines die when a line of code has failed to execute correctly.
A Programme Director who has the guts to put the brakes on rollout until elegant error recovery schemes are completed will not occupy his role for long.
In the meantime, Pilots, Engineers and Airlines must pick up the tab for debugging the supplier's code in more ways than one.
All counters roll over (overflow) at some point. Some more often than others. The GPS Week counter rolls over every 1024 weeks (19+ years). The first rollover happened in 1999 (just before the year 2000 rollover problem) and caused many GPS systems to fail. The next GPS week rollover will happen in 2019. The date format of Digital Certificates will have a rollover situation in 2049.
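For anyone who wants to check the GPS figures, here is a rough sketch. It assumes the legacy 10-bit broadcast week field counted from the GPS epoch of 6 January 1980, and it works in the GPS time scale, ignoring the leap-second offset to UTC. The function names are made up for illustration.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)             # start of GPS week 0
WEEK_FIELD_BITS = 10                     # legacy broadcast week number is 10 bits wide
WEEKS_PER_CYCLE = 2 ** WEEK_FIELD_BITS   # 1024 weeks, roughly 19.6 years


def rollover_date(n: int) -> date:
    """Date (GPS time scale) on which the week field wraps for the n-th time."""
    return GPS_EPOCH + timedelta(weeks=n * WEEKS_PER_CYCLE)


def broadcast_week(d: date) -> int:
    """Week number as the 10-bit field would carry it on date d."""
    full_weeks = (d - GPS_EPOCH).days // 7
    return full_weeks % WEEKS_PER_CYCLE


print(rollover_date(1))  # 1999-08-22
print(rollover_date(2))  # 2019-04-07
```

A receiver that trusts the 10-bit value alone cannot tell week 0 of 1980 from week 0 of 1999, which is why the rollover has to be handled outside the counter itself.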
Join Date: Mar 2015
Location: XFW, Germany
Posts: 128
and several UNIX systems will have a nice rollover in 2038
Yet.. that very 248 days (32-bit for 1/100 s) happened before and it's just insane that they "reintroduce" that (ask Win95 users that happened to have such an uptime).
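The 248-day figure is easy to check. The sketch below assumes, as the parenthetical above suggests but the AD does not confirm, a signed 32-bit counter incremented 100 times a second. The function name and its parameters are invented for illustration. For comparison it also works out an unsigned 32-bit counter of milliseconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(bits: int, ticks_per_second: int, signed: bool = True) -> float:
    """Days until a counter that starts at zero runs past its maximum value."""
    max_value = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    # The increment that follows max_value is the one that overflows.
    return (max_value + 1) / ticks_per_second / SECONDS_PER_DAY


# Assumed case: signed 32-bit counter of hundredths of a second.
print(f"{days_until_overflow(32, 100, signed=True):.2f}")     # 248.55

# Comparison case: unsigned 32-bit counter of milliseconds.
print(f"{days_until_overflow(32, 1000, signed=False):.2f}")   # 49.71
```

On those assumptions the centisecond counter runs out after about 248.55 days, which matches the number in the directive, while a millisecond counter of the same width would last only about 49.71 days.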
Join Date: Nov 2009
Location: flying by night
Posts: 500
this is the 3rd thread about the same subject, the other 2 have been merged here: http://www.pprune.org/tech-log/56075...-new-post.html
Join Date: Jun 2002
Location: Avon, CT, USA
Age: 68
Posts: 470
Planes have turned into flying computers, so I guess they need to be rebooted often to prevent problems. Hopefully at cruise a pilot won't have the computer try to calculate pi to infinity. I believe that's how they overloaded the computer on a Star Trek episode.
Join Date: Jun 2001
Location: Rockytop, Tennessee, USA
Posts: 5,898
Lead story on the U.S. edition of CNN's web page:
FAA: Dreamliner battery could lose all power - CNN.com
Oh, the horror...
FAA finds Boeing Dreamliner could lose all power, issues maintenance mandate
By Greg Botelho, CNN
Updated 5:25 PM ET, Sat May 2, 2015
(CNN)—The headaches for Boeing over its 787 Dreamliner continue.
The Federal Aviation Administration on Friday issued a directive mandating "a repetitive maintenance task" for that model of airliners due to issues with its power supply. Specifically, the FAA explained testing revealed that 787s could lose all AC electrical power after being continuously powered for 248 days, a problem that -- if left unchecked -- would leave an aircrew unable to control the plane.
The order took effect immediately, with the federal agency finding that there's no good reason to delay the decision.
"The FAA has found that the risk to the flying public justifies waiving notice and comment," the agency said.
The maintenance mandate was characterized as temporary, until software is developed to resolve the problem.
This marks the latest setback for Boeing over its 787, which debuted in 2011 in Asia and a year later in the United States amid much fanfare...
Join Date: Dec 2014
Location: USA
Posts: 41
It is possible to write 1+1=2 and not 1+1=3 , it is possible to debug IT.
It is no more possible to write defect-free complex software than it is to design a hydraulic pump that never fails. The proper approach is therefore to expect failures and design systems that are fault-tolerant. Maybe Boeing didn't do that here, or could have done better, but the general idea that we can achieve perfection in complex software is wishful thinking.
Even with our best engineering we find problems that go undetected during design and testing. We deal with those when we discover them, as Boeing are doing now (they are updating the software).
Join Date: Aug 2013
Location: Land of 1,000 Dances
Age: 63
Posts: 68
Slightly tongue-in-cheek post from IT journal The Register:
Have you turned it off and on again? That's the way to stop the plane becoming a brick! The Register
The US Federal Aviation Administration (FAA) has issued a new airworthiness directive (PDF) for Boeing's 787 because a software bug shuts down the plane's electricity generators every 248 days.
“We have been advised by Boeing of an issue identified during laboratory testing,” the directive says. That issue sees “The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.”
When the GCU is in failsafe mode it isn't making any power. That'll be bad news if all four of the GCUs aboard a 787 were powered up at the same time, because all will then shut down, “resulting in a loss of all AC electrical power regardless of flight phase.”
And presumably also turning the 787 into a brick with no power for its fly-by-wire systems, lighting, climate control or in-flight movies.
The fix outlined in the directive is pretty simple: make sure you turn the GCUs off before 248 days elapse.
Boeing is working on a fix and the FAA says “Once this software is developed, approved, and available, we might consider additional rulemaking.”
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again.
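The point about all four GCUs being powered up at the same time can be put in a few lines. This is only a toy model, assuming each unit drops to failsafe exactly 248 days after its own power-up and nothing else; the function names and the one-day window are invented for illustration and have nothing to do with the real GCU software.

```python
OVERFLOW_DAYS = 248  # figure quoted in the directive


def days_to_failsafe(days_powered: list[int]) -> list[int]:
    """Days left before each unit's counter overflows (0 means already due)."""
    return [max(OVERFLOW_DAYS - d, 0) for d in days_powered]


def all_fail_together(days_powered: list[int], window_days: int = 1) -> bool:
    """True if every unit would reach failsafe within the same window."""
    remaining = days_to_failsafe(days_powered)
    return max(remaining) - min(remaining) < window_days


# Four units powered up together: a common-mode failure in 48 days.
print(all_fail_together([200, 200, 200, 200]))  # True

# Power-ups staggered by maintenance resets: the failures are spread out.
print(all_fail_together([200, 150, 90, 10]))    # False
```

In this model the periodic power cycle in the directive works simply by keeping every unit's powered time below 248 days.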
Join Date: Mar 2015
Location: Ottawa, ON, Canada
Posts: 11
re: Perfect Software
ams6110: I'm not a commercial pilot but I do have 25+ years in software. It's not so much "wishful thinking" about perfect software but the cost of attaining it. The Space Shuttle flight control software team was one group that was famous for attaining near zero defects, but they did so with incredible rigour and an associated slow pace.
Can anyone shed some light on the language used in the GCU's system? Was it Ada or C/C++? Regardless, it's pretty common to do "what happens when this value hits its maximum + 1" testing. Even in simple web systems I often do that in unit tests to verify that nothing breaks in an unexpected way.
The difference in this case is probably that it's an internal counter rather than a parameter being passed about. A parameter would probably be subjected to boundary value tests, but perhaps not a global counter. There are also tools for what's called "fuzz testing" that will inject invalid values to catch just these sorts of problems. Again, though, that may have been done but just not in the right place.
One hope I do have, though, is that Boeing treats this like a mechanical issue and does proper root cause analysis and deals with the human aspects as much as the technical ones. That doesn't happen very often at all in the software world.
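To make the "maximum + 1" idea concrete, here is a minimal sketch of a boundary test on an internal counter. Nothing is publicly known about the GCU code, so the class, the function and the 32-bit unsigned width are all hypothetical. The trick is to start the counter just below its limit so the test crosses the boundary in a few ticks instead of 248 days.

```python
MASK32 = 0xFFFFFFFF  # largest value an unsigned 32-bit counter can hold


class TickCounter:
    """Hypothetical free-running 32-bit tick counter."""

    def __init__(self, start: int = 0) -> None:
        self.value = start & MASK32

    def tick(self, n: int = 1) -> None:
        # Wrap explicitly, the way fixed-width unsigned arithmetic would.
        self.value = (self.value + n) & MASK32


def elapsed_ticks(now: int, then: int) -> int:
    """Ticks between two readings, valid across a single wrap."""
    return (now - then) & MASK32


def test_rollover_boundary() -> None:
    counter = TickCounter(start=MASK32 - 1)  # two ticks short of wrapping to zero
    then = counter.value
    counter.tick(3)                          # crosses maximum + 1
    assert counter.value == 1
    assert elapsed_ticks(counter.value, then) == 3
    # A naive subtraction goes hugely negative here, which is the bug to catch.
    assert counter.value - then < 0


test_rollover_boundary()
```

Being able to seed the counter near its limit is the design choice that matters: a global counter that always starts at zero cannot be boundary-tested without waiting out the full period.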
Join Date: Jun 2001
Location: Rockytop, Tennessee, USA
Posts: 5,898
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again
Back in the not so distant FE days pulling a breaker and resetting it would cure a lot of mysterious faults. You were supposed to use that superior systems knowledge to psychoanalyze the electrical system to figure out what relay would be unpowered on what bus etc., etc., etc...
Later, in the early glass days, inoperative computer boxes like FCCs would sometimes be cured back on the ground if you removed all power to the plane and started up again. The feds would also look the other way if, for example, you reset a yaw damper that didn't come online during start. You've seen it before and know how to fix it, no need to pull out the book, right?
Now, in this enlightened era of fly-by-wire and electric airliners, you don't touch a button in case of a fault until you've run through the QRH, then got a phone patch and done a kumbaya session with the dispatcher and maintenance. Which is probably a good thing given the history of creative systems analysis by flight crews. Even after you land, more effort often seems to be put into finding the right deferral code than fixing what might be a simple problem.
Anyway, stuff like a frozen ACARS screen in flight, which in the past you would be expected to troubleshoot by pulling and resetting a breaker, is now something that you need to document, advise the company of, and live with unless you somehow get special dispensation from somebody on the ground.
Sweeping language suggesting a general prudential approach in the preamble to the abnormal section of the flight manual has been replaced by paragraphs of CYA verbiage to insulate the company from liability if you make a command decision to try something not specifically authorized in the book.
So, it seems to me that increasingly, the pilot can't 'turn it off and turn it on again' even on the older aircraft.