PPRuNe Forums

Tech Log: Boeing 787 integer overflow bug (https://www.pprune.org/tech-log/560793-boeing-787-integer-overflow-bug.html)

SAMPUBLIUS 30th Apr 2015 17:39

787 ELECTRICAL ISSUE
 
GEEZ. IMO any software that allows all systems to fail at the same time, even under extremely unlikely events, is fubar!
FAA orders new 787 electrical fix to prevent power failure - 4/30/2015 - Flight Global

All Boeing 787 operators will be required to periodically deactivate the electrical system to avoid a problem with a newly-discovered software bug that could cause the aircraft to lose alternating current (AC) power, the US Federal Aviation Administration says in a new airworthiness directive.

The agency adopted the final rule after Boeing reported the results of a laboratory test showing a total loss of power is possible if the generator control units run continuously for eight months, says the FAA’s 30 April notice in the Federal Register. :eek:

The binding airworthiness directive is being published less than two weeks after Boeing privately alerted operators about the problem, the company says in a statement to Flightglobal.

It is rare for a commercial aircraft to remain powered on for eight months with no interruptions.

Goes on !

...

All six power generating systems are managed by a corresponding generator control unit (GCU). Boeing’s laboratory testing discovered that an internal software counter in the GCU overflows after running continuously for 248 days, according to the FAA. The overflow causes all four GCUs on the engine-mounted generators to enter failsafe mode at the same time.

fleigle 30th Apr 2015 19:07

Yeah, that would be a hell of a long-distance flight. The blue-screen-of-death app for toilet overflow would probably happen before that, though!!!
:E

SAMPUBLIUS 1st May 2015 04:36

and from WSJ
 
Yikes !

from WSJ extract

A Federal Aviation Administration safety directive that became public on Thursday reveals that Boeing’s laboratory tests discovered that under certain circumstances, all of the 787’s power systems can suddenly shut down entirely during a flight.

Such a problem, which the FAA said risks “loss of control of the airplane,” can occur after a jetliner remains connected to onboard or ground-based electric power without a break for a stretch of 248 consecutive days, the agency said. The FAA is ordering airlines to shut down power systems periodically to alleviate the hazard.

Boeing said such shutdowns are part of regular maintenance, and it would be rare for a jet to have power uninterrupted for so long. The plane maker roughly a week ago recommended that airlines voluntarily turn off power systems at least every four months.

During the early stages of the plane’s introduction, Boeing drafted an internal report concluding that Dreamliners experienced most of their reliability problems just after being powered up. The company recommended adding additional time before flights to deal with erroneous “nuisance” messages.


chrissw 1st May 2015 06:49

787 software problem?
 
Just hope you're not flying on the 248th day! (Although admittedly the fix isn't difficult...)

787 software bug can shut down planes' generators ? The Register

FE Hoppy 1st May 2015 08:06

Back in the real world, when was the last time an aircraft was continuously powered for 248 days?


The E-Jets had a similar problem when first introduced, but for them it was an uncommanded RAT deployment on the ground after 40 hours.

Quick software update and all was well.

chrissw 1st May 2015 09:25

Indeed, in the real world it's never going to happen. Nevertheless, the FAA clearly thought it was significant enough to issue a directive about it.

Also I suspect that software updates are far from trivial where the software is safety-critical with multiple redundancies and parallel processing.

Basil 1st May 2015 10:28


Have you turned it off and on again?
Did that a couple of times when the B747-400 first entered service.

Ian W 1st May 2015 11:29


Originally Posted by chrissw (Post 8961747)
Indeed, in the real world it's never going to happen. Nevertheless, the FAA clearly thought it was significant enough to issue a directive about it.

Also I suspect that software updates are far from trivial where the software is safety-critical with multiple redundancies and parallel processing.

As the probability of generators being kept running for that long is zero, it may not even need a fix. Yes it is poor programming practice but it is not an issue that will affect the aircraft. It's like saying the aircraft can run out of fuel if it flies for more than 16 hours!! :eek:

Dan Winterland 1st May 2015 14:05


Have you turned it off and on again?

Did that a couple of times when the B747-400 first entered service.
A relatively common Airbus fix!

SAMPUBLIUS 1st May 2015 14:41

about power on issues 787
 
Actually, it's not just the generators being on, it's also ground power.

from WSJ extract
Quote:
A Federal Aviation Administration safety directive that became public on Thursday reveals that Boeing’s laboratory tests discovered that under certain circumstances, all of the 787’s power systems can suddenly shut down entirely during a flight.

Such a problem, which the FAA said risks “loss of control of the airplane,” can occur after a jetliner remains connected to onboard or ground-based electric power without a break for a stretch of 248 consecutive days, the agency said. The FAA is ordering airlines to shut down power systems periodically to alleviate the hazard.

Boeing said such shutdowns are part of regular maintenance, and it would be rare for a jet to have power uninterrupted for so long. The plane maker roughly a week ago recommended that airlines voluntarily turn off power systems at least every four months.

During the early stages of the plane’s introduction, Boeing drafted an internal report concluding that Dreamliners experienced most of their reliability problems just after being powered up. The company recommended adding additional time before flights to deal with erroneous “nuisance” messages.

tubby linton 1st May 2015 14:53

Turning electrical equipment off then on is known as a Ferranti reset.

Gertrude the Wombat 1st May 2015 15:03

I've written software like that. Just try to get your boss to let you fix it!


"But I've got to fix it, else it'll crash after 248 days."


"Who cares? - there's no chance of it staying up for that long anyway, it'll have crashed for some other reason long before then. Go and do something actually useful instead."

ion_berkley 1st May 2015 20:44

So what's the bet then?
32bit signed value used as a counter running at 100Hz?
Pretty damn close to exactly 248 days (21427200 secs), 2^31 = 2147483648
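
For anyone who wants to check that guess, here is a rough sketch of the arithmetic in Python. The 100 Hz tick rate and the signed 32-bit width are only the guess made in this post; neither figure is confirmed by the AD or by Boeing.

```python
# Assumption: a signed 32-bit counter incremented 100 times per second.
# Both figures are the guess from the post, not a confirmed GCU design.
TICK_HZ = 100
OVERFLOW_TICKS = 2**31  # first value a signed 32-bit integer cannot hold

SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(tick_hz: int = TICK_HZ, limit: int = OVERFLOW_TICKS) -> float:
    """Days of continuous counting before the counter reaches `limit`."""
    return limit / tick_hz / SECONDS_PER_DAY


print(f"248 days is {248 * SECONDS_PER_DAY} seconds")
print(f"counter overflows after {days_until_overflow():.2f} days")
```

Under those assumptions the overflow lands a little past the 248-day mark, which is consistent with the figure quoted by the FAA.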

peekay4 2nd May 2015 01:06

Boeing 787 integer overflow bug
 
Please don't leave your 787 powered on for 248 days straight...

New FAA AD:

https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf


This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power.

DozyWannabe 2nd May 2015 01:24

OK, so based on the articles it looks to me that this issue was discovered through some kind of regression testing (for non-software folks, this is essentially a form of testing which continually runs scenarios against the software throughout the life of the product, in particular checking that fixes and updates don't break existing code). The reason this is important is because testing of this kind is and always has been mandatory for aviation/safety-critical systems - in fact many of the methods were invented and perfected by the aviation software pioneers. It doesn't matter that a real-world occurrence of this scenario is very unlikely, for this software specialty that's not good enough. By the sound of things, it seems this scenario was encountered in testing by Boeing's software team/contractors, and the FAA was immediately notified. In short, this is what's supposed to happen and - if anything - only serves to prove that the system for finding and resolving this kind of issue is working as it should.

@Gertrude the Wombat - As a more mundane software engineer myself, I can only repeat that your hypothetical management dismissal simply won't fly in the aviation software world.

@ion_berkley - Your analysis sounds about right, but from what I've been told real-time aviation software isn't usually hand-coded in the manner most other software is. I know that Airbus's development environment is essentially a graphical system with discrete blocks of tested and approved code underpinning the graphical logic structure. That said, I don't have any info on how this specific system on the B787 was put together.

[EDIT : As far as finding the issue now goes - one aspect of this kind of testing in terms of scientific software reliability is that the engineers will continue adding scenarios to the suite of tests, and if the scenario is considered unlikely in the field it is usually called an "edge case" in software terminology. I suspect that this particular edge case was added to the suite fairly recently.]

p.j.m 2nd May 2015 02:11


Originally Posted by peekay4 (Post 8962474)
Please don't leave your 787 powered on for 248 days straight...

Boeing must be using Windows programmers these days.

Pilot: "Hello Help desk - the aircraft has lost power"
Indian "have you rebooted?"

Radix 2nd May 2015 02:35

Boeing 787 integer overflow bug
 
.............

DouglasFlyer 2nd May 2015 04:34

Now I'm definitely going to buy a "If It's Not Boeing, I'm Not Going" T-Shirt :rolleyes:

No Fly Zone 2nd May 2015 08:44

And the Time?
 
OK; have seen this notice a couple of times. Using normal procedures, how long does it take to do a FULL electrical shut down on a 787? And once 'cold', how long to reboot from the cold state?
Is there any reason that this cannot become a scheduled, monthly or even A-level Mx procedure? So, how long to "cold-boot" a 787?
I cannot imagine that a 787 in commercial service could go 248 days without some reason to de-power the works. More likely might be the rarely used 787-BBJ (what, two of them currently?) [[and my only concern there is protecting the crew. The world already has enough yokels that own/ride their own 787BBJs]]
Any ideas about the cold-boot time? Thanks.

STBYRUD 2nd May 2015 08:55

I know the 777 takes a few minutes to wake up, nothing that you can't fit into a normal daily cycle somewhere, and I doubt the 787 will be any slower. Let's see; this will probably just make it into a Bulletin. Boeing doesn't have the best track record in fixing software bugs, unfortunately (especially when existing airframes are to be rid of the problem)...

Ian W 2nd May 2015 10:46


Originally Posted by peekay4 (Post 8962474)
Please don't leave your 787 powered on for 248 days straight...

New FAA AD:

https://s3.amazonaws.com/public-insp...2015-10066.pdf

The overflow of a counter has been found, someone said how long would we need to keep a generator running for the counter overflow problem to show - 248 days!

Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small it is zero.

Yet the FAA felt they had to issue an AD!? :D Really???

startall4 2nd May 2015 11:12

Ian W. Could it be that although the gennies aren't running, the GCUs are powered? Given that it is apparently a real b..l ache if the 787 gets electrically unpowered and takes hours to "restart", it may well be that this issue is more probable than we think.

Ian W 2nd May 2015 11:57


Originally Posted by startall4 (Post 8962810)
Ian W. Could it be that although the gennies aren't running, the GCUs are powered? Given that it is apparently a real b..l ache if the 787 gets electrically unpowered and takes hours to "restart", it may well be that this issue is more probable than we think.

Well good luck with the B check maintenance :)

SAMPUBLIUS 2nd May 2015 15:27

ON POWERED TIME
 

As the probability of generators being kept running for that long is zero,
Methinks thou art missing a point. Power-on time also includes when ground power is connected. Early on the 787 took a long time to boot up computers and systems from power off. So, as I understand it, airlines were encouraged to connect ground power and transfer before shutting down the APU. Also, power-up seemed to give a bundle of false alarms that needed to be reset or sorted out.

Thus it was (is?) not unusual for a 787 to be powered on for several months, at least to the point of triggering the "counter" involved.

And redundant- fail safe systems should NOT have a common tie point which can trigger such an event. :mad:

SAMPUBLIUS 2nd May 2015 15:35

yes really
 
The counter involved also counts when ground power is connected as in prior to APU shutdown.
:O

EEngr 2nd May 2015 16:08

DozyWannabe


Your analysis sounds about right, but from what I've been told real-time aviation software isn't usually hand-coded in the manner most other software is.
Sadly, this is exactly the case I ran into many times while at Boeing. One would expect that embedded controllers would be based on tested and stable RTOSs and libraries, where an uptime of 248 days is no big deal for 32-bit controllers because the overflow and wrap-around issues have already been addressed. But I've worked with people who insisted on writing every line of code from scratch. Just because NIH.
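
As an illustration of what "addressed" can look like, here is a minimal sketch of one common wrap-tolerant technique: measuring elapsed ticks with modular subtraction on an unsigned counter, instead of comparing raw counter values. It is written in Python with an explicit 32-bit mask to imitate a fixed-width register. The helper names `tick` and `elapsed_ticks` are made up for this example and do not come from any particular RTOS or from the 787's GCU software.

```python
MASK32 = 0xFFFFFFFF  # keeps values within the range of an unsigned 32-bit register


def tick(counter: int, step: int = 1) -> int:
    """Advance a simulated unsigned 32-bit tick counter, wrapping at 2**32."""
    return (counter + step) & MASK32


def elapsed_ticks(now: int, start: int) -> int:
    """Ticks from `start` to `now`; still correct if the counter wrapped once."""
    return (now - start) & MASK32


start = 0xFFFFFFF0      # 16 ticks before the counter wraps to zero
now = tick(start, 40)   # 40 ticks later the raw value is small again

print("raw comparison says time moved forward:", now > start)
print("elapsed ticks via modular subtraction:", elapsed_ticks(now, start))
```

The raw comparison gives the wrong answer once the counter has wrapped, while the modular difference stays valid as long as the interval being measured is shorter than one full counter period.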

infrequentflyer789 2nd May 2015 16:28


Originally Posted by Ian W (Post 8962780)
The overflow of a counter has been found, someone said how long would we need to keep a generator running for the counter overflow problem to show - 248 days!

The history of software development is littered with problems caused by people who thought counters were "big enough" that overflow would never be a problem, or that they would never overflow in the expected life of the software, or that the programmer would be retired / dead by the time the problem hit. This sort of thing really should _not_ be happening in safety critical software in this century.


Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small it is zero.

Yet the FAA felt they had to issue an AD!? :D Really???
The AD seems to just say "mandatory restart every 120 days" - I guess that gives two chances to catch it plus a bit of margin. If everyone is doing this anyway - if there is zero chance as you say - then I'm not sure why they included a cost of compliance...

It is also implied that this was "found" and therefore was not previously documented - as it should have been. To me, this indicates a non-zero risk that in some future change someone will make the counter value persistent (no resets), or make it effectively smaller (and overflow sooner), assuming (because it is not documented) that overflow causes no problems. The AD serves, in part, to document it.

I am more interested in what remains unsaid, namely why this software was/is being tested "in laboratory testing" _now_ - inevitable suspicion is that it is because of a real in-service problem (most likely not this one as you say). It also raises the question of why the software was _not_ tested "in the lab" before flight (or maybe it was but not fully / correctly). I don't suppose we'll ever know...

roulishollandais 2nd May 2015 17:05

INTEGER OVERFLOW WAS ALREADY THE MAIN CAUSE OF THE ARIANE 501 ROCKET CRASH !
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAJOR ACCIDENT (Cost 8 billion FF)

The report also showed many other IT faults (I counted 99, of 80 different types).

Many people think it is impossible to write and implement bug-free code. That must end. It is possible to write 1+1=2 and not 1+1=3, it is possible to debug IT.

Again :
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAJOR ACCIDENT (Cost 8 billion FF) ! Shame on our aerospace community :mad:

ImageGear 2nd May 2015 17:33

"NO LESSON HAS BEEN LEARNED..."
 
Indeed, in the rush to push any hardware/firmware/software out of the door and start bringing in revenue (to fund the completion of the design), just about the last programme task to be started are the error recovery routines.

In the meantime, the infamous "Bit Bucket" is the black hole where subroutines die when a line of code has failed to execute correctly.
A Programme Director who has the guts to put the brakes on rollout until elegant error recovery schemes are completed will not occupy his role for long.

In the meantime, Pilots, Engineers and Airlines must pick up the tab for debugging the supplier's code in more ways than one. :=

roulishollandais 2nd May 2015 17:43


In the meantime, Pilots, Engineers and Airlines must pick up the tab for debugging the supplier's code in more ways than one.
... And passengers :ugh:


MarcK 2nd May 2015 18:47

All counters roll over (overflow) at some point. Some more often than others. The GPS Week counter rolls over every 1024 weeks (19+ years). The first rollover happened in 1999 (just before the year 2000 rollover problem) and caused many GPS systems to fail. The next GPS week rollover will happen in 2019. The date format of Digital Certificates will have a rollover situation in 2049.
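
A quick sketch of the GPS week arithmetic, for reference. It assumes the 10-bit week field of the legacy navigation message and the GPS epoch of 6 January 1980, and it works on calendar dates only, so the small leap-second offset between GPS time and UTC is ignored. The function names are just for this example.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)         # start of GPS week 0
WEEK_BITS = 10                       # width of the legacy broadcast week field
WEEKS_PER_ROLLOVER = 2 ** WEEK_BITS  # 1024 weeks per cycle


def rollover_date(n: int) -> date:
    """Calendar date on which the n-th week-number rollover falls."""
    return GPS_EPOCH + timedelta(weeks=n * WEEKS_PER_ROLLOVER)


def broadcast_week(day: date) -> int:
    """Week number as a 10-bit field would report it on the given date."""
    full_weeks = (day - GPS_EPOCH).days // 7
    return full_weeks % WEEKS_PER_ROLLOVER


for n in (1, 2):
    print(f"rollover {n}: {rollover_date(n).isoformat()}")
print("week on 1999-08-21:", broadcast_week(date(1999, 8, 21)))
print("week on 1999-08-22:", broadcast_week(date(1999, 8, 22)))
```

A receiver that only sees the 10-bit field cannot tell week 0 of one cycle from week 0 of the next without some outside hint about the approximate date, which is where the 1999 failures came from.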

PAXfips 2nd May 2015 18:57

and several UNIX systems will have a nice rollover in 2038 :p

Yet that very 248 days (32 bits counting 1/100 s) has happened before, and it's just insane that they "reintroduce" it (ask Win95 users who happened to have such an uptime).
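
The 2038 rollover can be worked out the same way. This sketch assumes a signed 32-bit count of seconds since 1 January 1970 UTC, which is the classic `time_t` layout on older UNIX systems; the helper `to_int32` is a made-up name that imitates two's-complement wrap-around in Python.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def to_int32(value: int) -> int:
    """Wrap an integer the way a signed 32-bit two's-complement register would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


# Last second that still fits in the counter.
last_ok = UNIX_EPOCH + timedelta(seconds=INT32_MAX)

# One second later the stored value wraps to a large negative number.
wrapped = to_int32(INT32_MAX + 1)
after_wrap = UNIX_EPOCH + timedelta(seconds=wrapped)

print("last representable time:", last_ok.isoformat())
print("stored value after wrap:", wrapped)
print("time that value decodes to:", after_wrap.isoformat())
```

The point is the same as with the 248-day counter: nothing crashes at the moment of the wrap, but every later calculation that trusts the value is suddenly working with a time more than a century in the past.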

deptrai 2nd May 2015 19:44

this is the 3rd thread about the same subject, the other 2 have been merged here: http://www.pprune.org/tech-log/56075...-new-post.html

ATPMBA 2nd May 2015 20:57

Planes have turned into flying computers, so I guess they need to be re-booted often to prevent problems. Hopefully at cruise a pilot won't have the computer try to calculate PI to infinity. I believe that's how they overloaded the computer on a Star Trek episode.

Airbubba 2nd May 2015 22:11

Lead story on the U.S. edition of CNN's web page:


FAA finds Boeing Dreamliner could lose all power, issues maintenance mandate

By Greg Botelho, CNN
Updated 5:25 PM ET, Sat May 2, 2015


(CNN)—The headaches for Boeing over its 787 Dreamliner continue.

The Federal Aviation Administration on Friday issued a directive mandating "a repetitive maintenance task" for that model of airliners due to issues with its power supply. Specifically, the FAA explained testing revealed that 787s could lose all AC electrical power after being continuously powered for 248 days, a problem that -- if left unchecked -- would leave an aircrew unable to control the plane.

The order took effect immediately, with the federal agency finding that there's no good reason to delay the decision.

"The FAA has found that the risk to the flying public justifies waiving notice and comment," the agency said.

The maintenance mandate was characterized as temporary, until software is developed to resolve the problem.

This marks the latest setback for Boeing over its 787, which debuted in 2011 in Asia and a year later in the United States amid much fanfare...
FAA: Dreamliner battery could lose all power - CNN.com

Oh, the horror... :eek:

ams6110 2nd May 2015 22:35


It is possible to write 1+1=2 and not 1+1=3 , it is possible to debug IT.
Of course the systems on the 787 are a bit more complicated than an adding machine. What happens when you enter 9999999999 + 1 on your calculator?

It is no more possible to write defect-free complex software than it is to design a hydraulic pump that never fails. The proper approach is therefore to expect failures and design systems that are fault-tolerant. Maybe Boeing didn't do that here, or could have done better, but the general idea that we can achieve perfection in complex software is wishful thinking.

Even with our best engineering we find problems that go undetected during design and testing. We deal with those when we discover them, as Boeing are doing now (they are updating the software).

HighAndFlighty 3rd May 2015 00:20

Slightly tongue-in-cheek post from IT journal The Register:



The US Federal Aviation Administration (FAA) has issued a new airworthiness directive (PDF) for Boeing's 787 because a software bug shuts down the plane's electricity generators every 248 days.
“We have been advised by Boeing of an issue identified during laboratory testing,” the directive says. That issue sees “The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.”
When the GCU is in failsafe mode it isn't making any power. That'll be bad news if all four of the GCUs aboard a 787 were powered up at the same time, because all will then shut down, “resulting in a loss of all AC electrical power regardless of flight phase.”
And presumably also turning the 787 into a brick with no power for its fly-by-wire systems, lighting, climate control or in-flight movies.
The fix outlined in the directive is pretty simple: make sure you turn the GCUs off before 248 days elapse.
Boeing is working on a fix and the FAA says “Once this software is developed, approved, and available, we might consider additional rulemaking.”
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again
Have you turned it off and on again? That's the way to stop the plane becoming a brick! The Register

dave.rooney 3rd May 2015 01:00

re: Perfect Software
 
ams6110: I'm not a commercial pilot but I do have 25+ years in software. It's not so much "wishful thinking" about perfect software but the cost of attaining it. The Space Shuttle flight control software team was one group that was famous for attaining near zero defects, but they did so with incredible rigour and an associated slow pace.

Can anyone shed some light on the language used in the GCU's system? Was it Ada or C/C++? Regardless, it's pretty common to do "what happens when this value hits its maximum + 1" testing. Even in simple web systems I often do that in unit tests to verify that nothing breaks in an unexpected way.

The difference in this case is probably that it's an internal counter rather than a parameter being passed about. A parameter would probably be subjected to boundary value tests, but perhaps not a global counter. There are also tools for what's called "fuzz testing" that will inject invalid values to catch just these sorts of problems. Again, though, that may have been done but just not in the right place.
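
To make the "maximum + 1" idea concrete, here is a minimal sketch of a boundary test on an internal counter. Everything in it is hypothetical: `UptimeCounter`, its `saturate` option and `step_across_limit` are invented for illustration and say nothing about how the GCU software is actually written. Python integers do not overflow, so the 32-bit wrap is simulated with a mask.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def wrap_int32(value: int) -> int:
    """Imitate signed 32-bit two's-complement wrap-around."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


class UptimeCounter:
    """Hypothetical internal tick counter; not the GCU's actual design."""

    def __init__(self, ticks: int = 0, saturate: bool = False) -> None:
        self.ticks = ticks
        self.saturate = saturate

    def tick(self) -> None:
        if self.saturate and self.ticks == INT32_MAX:
            return  # hold at the maximum instead of wrapping
        self.ticks = wrap_int32(self.ticks + 1)


def step_across_limit(make_counter) -> int:
    """Seed the counter one tick below the limit, then step over it."""
    counter = make_counter(INT32_MAX - 1)
    counter.tick()  # reaches the maximum
    counter.tick()  # the "maximum + 1" case
    return counter.ticks


wrapping = step_across_limit(lambda t: UptimeCounter(t))
saturating = step_across_limit(lambda t: UptimeCounter(t, saturate=True))
print("wrapping counter ends at:", wrapping)
print("saturating counter ends at:", saturating)
```

The useful trick is that the test seeds the counter just below its limit, so the overflow case runs in microseconds rather than after 248 days. That only works if the counter's starting value can be injected, which is often not the case for a global that is only ever initialised to zero.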

One hope I do have, though, is that Boeing treats this like a mechanical issue and does proper root cause analysis and deals with the human aspects as much as the technical ones. That doesn't happen very often at all in the software world.

rh200 3rd May 2015 01:14

I think they need something like the Concorde used to have in the cabin, but instead of displaying speed, it would count down to power off :E

Airbubba 3rd May 2015 01:19


For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again
And, actually, as discussed on another thread, not being able to power things down for a reset is a challenge on 'modern' aircraft.

Back in the not so distant FE days pulling a breaker and resetting it would cure a lot of mysterious faults. You were supposed to use that superior systems knowledge to psychoanalyze the electrical system to figure out what relay would be unpowered on what bus etc., etc., etc...

Later, in the early glass days, inoperative computer boxes like FCC's would sometimes be cured back on the ground if you removed all power to the plane and started up again. The feds would also look the other way if, for example, you reset a yaw damper that didn't come online during start. You've seen it before and know how to fix it, no need to pull out the book, right?

Now, in this enlightened era of fly-by-wire and electric airliners, you don't touch a button in case of a fault until you've run through the QRH, then got a phone patch and done a kumbaya session with the dispatcher and maintenance. Which is probably a good thing given the history of creative systems analysis by flight crews. Even after you land, more effort often seems to be put into finding the right deferral code than fixing what might be a simple problem.

Anyway, stuff like a frozen ACARS screen in flight that in the past you would be expected to troubleshoot and pull and reset a breaker is now something that you need to document, advise the company and live with unless you somehow get special dispensation from somebody on the ground.

Sweeping language suggesting a general prudential approach in the preamble to the abnormal section of the flight manual has been replaced by paragraphs of CYA verbiage to insulate the company from liability if you make a command decision to try something not specifically authorized in the book.

So, it seems to me that increasingly, the pilot can't 'turn it off and turn it on again' even on the older aircraft.


All times are GMT.

