Boeing 787 integer overflow bug


SAMPUBLIUS
30th Apr 2015, 17:39
Geez. IMO any software that allows all systems to fail at the same time, even under extremely unlikely events, is fubar!
FAA orders new 787 electrical fix to prevent power failure - 4/30/2015 - Flight Global (http://www.flightglobal.com/news/articles/faa-orders-new-787-electrical-fix-to-prevent-power-failure-411794/)

All Boeing 787 operators will be required to periodically deactivate the electrical system to avoid a problem with a newly-discovered software bug that could cause the aircraft to lose alternating current (AC) power, the US Federal Aviation Administration says in a new airworthiness directive.

The agency adopted the final rule after Boeing reported the results of a laboratory test showing a total loss of power is possible if the generator control units run continuously for eight months, says the FAA’s 30 April notice in the Federal Register. :eek:

The binding airworthiness directive is being published less than two weeks after Boeing privately alerted operators about the problem, the company says in a statement to Flightglobal.

It is rare for a commercial aircraft to remain powered on for eight months with no interruptions.

Goes on !

...All six power generating systems are managed by a corresponding generator control unit (GCU). Boeing’s laboratory testing discovered that an internal software counter in the GCU overflows after running continuously for 248 days, according to the FAA. The overflow causes all four GCUs on the engine-mounted generators to enter failsafe mode at the same time.

fleigle
30th Apr 2015, 19:07
Yeah, that would be a hell of a long-distance flight, probably the blue-screen-of-death app. for toilet overflow would happen before that though!!!.
:E

SAMPUBLIUS
1st May 2015, 04:36
Yikes !

from WSJ extract A Federal Aviation Administration safety directive that became public on Thursday reveals that Boeing’s laboratory tests discovered that under certain circumstances, all of the 787’s power systems can suddenly shut down entirely during a flight.

Such a problem, which the FAA said risks “loss of control of the airplane,” can occur after a jetliner remains connected to onboard or ground-based electric power without a break for a stretch of 248 consecutive days, the agency said. The FAA is ordering airlines to shut down power systems periodically to alleviate the hazard.

Boeing said such shutdowns are part of regular maintenance, and it would be rare for a jet to have power uninterrupted for so long. The plane maker roughly a week ago recommended that airlines voluntarily turn off power systems at least every four months.

During the early stages of the plane’s introduction, Boeing drafted an internal report concluding that Dreamliners experienced most of their reliability problems just after being powered up. The company recommended adding additional time before flights to deal with erroneous “nuisance” messages.

chrissw
1st May 2015, 06:49
Just hope you're not flying on the 248th day! (Although admittedly the fix isn't difficult...)

787 software bug can shut down planes' generators - The Register (http://www.theregister.co.uk/2015/05/01/787_software_bug_can_shut_down_planes_generators/)

FE Hoppy
1st May 2015, 08:06
Back in the real world, when was the last time an aircraft was continuously powered for 248 days?


The E-Jets had a similar problem when first introduced, but for them it was an uncommanded RAT deployment on the ground after 40 hours.

Quick software update and all was well.

chrissw
1st May 2015, 09:25
Indeed, in the real world it's never going to happen. Nevertheless, the FAA clearly thought it was significant enough to issue a directive about it.

Also I suspect that software updates are far from trivial where the software is safety-critical with multiple redundancies and parallel processing.

Basil
1st May 2015, 10:28
Have you turned it off and on again?
Did that a couple of times when the B747-400 first entered service.

Ian W
1st May 2015, 11:29
Indeed, in the real world it's never going to happen. Nevertheless, the FAA clearly thought it was significant enough to issue a directive about it.

Also I suspect that software updates are far from trivial where the software is safety-critical with multiple redundancies and parallel processing.

As the probability of generators being kept running for that long is zero, it may not even need a fix. Yes it is poor programming practice but it is not an issue that will affect the aircraft. It's like saying the aircraft can run out of fuel if it flies for more than 16 hours!! :eek:

Dan Winterland
1st May 2015, 14:05
Have you turned it off and on again?

Did that a couple of times when the B747-400 first entered service.


A relatively common Airbus fix!

SAMPUBLIUS
1st May 2015, 14:41
Actually, it's not just the generators being on, it's also ground power.

from WSJ extract
Quote:
A Federal Aviation Administration safety directive that became public on Thursday reveals that Boeing’s laboratory tests discovered that under certain circumstances, all of the 787’s power systems can suddenly shut down entirely during a flight.

Such a problem, which the FAA said risks “loss of control of the airplane,” can occur after a jetliner remains connected to onboard or ground-based electric power without a break for a stretch of 248 consecutive days, the agency said. The FAA is ordering airlines to shut down power systems periodically to alleviate the hazard.

Boeing said such shutdowns are part of regular maintenance, and it would be rare for a jet to have power uninterrupted for so long. The plane maker roughly a week ago recommended that airlines voluntarily turn off power systems at least every four months.

During the early stages of the plane’s introduction, Boeing drafted an internal report concluding that Dreamliners experienced most of their reliability problems just after being powered up. The company recommended adding additional time before flights to deal with erroneous “nuisance” messages.

tubby linton
1st May 2015, 14:53
Turning electrical equipment off then on is known as a Ferranti reset.

Gertrude the Wombat
1st May 2015, 15:03
I've written software like that. Just try to get your boss to let you fix it!


"But I've got to fix it, else it'll crash after 248 days."


"Who cares? - there's no chance of it staying up for that long anyway, it'll have crashed for some other reason long before then. Go and do something actually useful instead."

ion_berkley
1st May 2015, 20:44
So what's the bet then?
32bit signed value used as a counter running at 100Hz?
Pretty damn close to exactly 248 days (21427200 secs), 2^31 = 2147483648
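The guess above can be checked with a few lines of arithmetic. This is only a sketch under the poster's assumption (a signed 32-bit counter ticking at 100 Hz); neither the counter width nor the tick rate is stated in the AD, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(bits: int, ticks_per_second: int, signed: bool = True) -> float:
    """Days of continuous counting before a counter of the given width overflows."""
    max_value = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    # The overflow happens on the tick after max_value has been reached.
    return (max_value + 1) / ticks_per_second / SECONDS_PER_DAY


def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


# Assumed: signed 32-bit counter at 100 Hz (hundredths of a second).
print(f"signed 32-bit @ 100 Hz:   {days_until_overflow(32, 100):.2f} days")
print(f"unsigned 32-bit @ 100 Hz: {days_until_overflow(32, 100, signed=False):.2f} days")
print("value after one tick past the maximum:", wrap_int32(2 ** 31 - 1 + 1))
```

Under those assumptions the signed case comes out at roughly 248.55 days, which matches the figure quoted by the FAA, and the value one tick past the maximum becomes a large negative number rather than a slightly larger positive one.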

peekay4
2nd May 2015, 01:06
Please don't leave your 787 powered on for 248 days straight...

New FAA AD:

https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf

This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power.

DozyWannabe
2nd May 2015, 01:24
OK, so based on the articles it looks to me that this issue was discovered through some kind of regression testing (for non-software folks, this is essentially a form of testing which continually runs scenarios against the software throughout the life of the product, in particular checking that fixes and updates don't break existing code). The reason this is important is because testing of this kind is and always has been mandatory for aviation/safety-critical systems - in fact many of the methods were invented and perfected by the aviation software pioneers. It doesn't matter that a real-world occurrence of this scenario is very unlikely, for this software specialty that's not good enough. By the sound of things, it seems this scenario was encountered in testing by Boeing's software team/contractors, and the FAA was immediately notified. In short, this is what's supposed to happen and - if anything - only serves to prove that the system for finding and resolving this kind of issue is working as it should.

@Gertrude the Wombat - As a more mundane software engineer myself, I can only repeat that your hypothetical management dismissal simply won't fly in the aviation software world.

@ion_berkley - Your analysis sounds about right, but from what I've been told real-time aviation software isn't usually hand-coded in the manner most other software is. I know that Airbus's development environment is essentially a graphical system with discrete blocks of tested and approved code underpinning the graphical logic structure. That said, I don't have any info on how this specific system on the B787 was put together.

[EDIT : As far as finding the issue now goes - one aspect of this kind of testing in terms of scientific software reliability is that the engineers will continue adding scenarios to the suite of tests, and if the scenario is considered unlikely in the field it is usually called an "edge case" in software terminology. I suspect that this particular edge case was added to the suite fairly recently.]

p.j.m
2nd May 2015, 02:11
Please don't leave your 787 powered on for 248 days straight...

Boeing must be using Windows programmers these days.

Pilot: "Hello Help desk - the aircraft has lost power"
Indian "have you rebooted?"

Radix
2nd May 2015, 02:35
.............

DouglasFlyer
2nd May 2015, 04:34
Now I'm definitely going to buy a "If It's Not Boeing, I'm Not Going" T-Shirt :rolleyes:

No Fly Zone
2nd May 2015, 08:44
OK; have seen this notice a couple of times. Using normal procedures, how long does it take to do a FULL electrical shutdown on a 787? And once 'cold', how long to reboot from the cold state?
Is there any reason that this cannot become a scheduled, monthly or even A-level Mx procedure? So, how long to "cold-boot" a 787?
I cannot imagine that a 787 in commercial service could go 248 days without some reason to de-power the works. More likely might be the rarely used 787-BBJ (what, two of them currently?) [[and my only concern there is protecting the crew. The world already has enough yokels that own/ride their own 787BBJs]]
Any ideas about the cold-boot time? Thanks.

STBYRUD
2nd May 2015, 08:55
I know the 777 takes a few minutes to wake up, nothing that you can't fit into a normal daily cycle somewhere, and I doubt the 787 will be any slower. Let's see, this will probably just make it into a Bulletin. Boeing doesn't have the best track record in fixing software bugs unfortunately (especially when existing airframes are to be rid of the problem)...

Ian W
2nd May 2015, 10:46
Please don't leave your 787 powered on for 248 days straight...

New FAA AD:

https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf

The overflow of a counter has been found, someone said how long would we need to keep a generator running for the counter overflow problem to show - 248 days!

Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small it is zero.

Yet the FAA felt they had to issue an AD!? :D Really???

startall4
2nd May 2015, 11:12
Ian W. Could it be that although the gennies aren't running, the GCUs are powered, and given that apparently it is a real b..l ache if the 787 gets electrically unpowered and takes hours to "restart", it may well be that this issue is more probable than we think?

Ian W
2nd May 2015, 11:57
Ian W. Could it be that although the gennies aren't running, the GCUs are powered, and given that apparently it is a real b..l ache if the 787 gets electrically unpowered and takes hours to "restart", it may well be that this issue is more probable than we think?

Well good luck with the B check maintenance :)

SAMPUBLIUS
2nd May 2015, 15:27
As the probability of generators being kept running for that long is zero,

Methinks thou art missing a point. Power-on time also includes when ground power is connected. Early on, the 787 took a long time to boot up computers and systems from power off. So as I understand it, airlines were encouraged to connect ground power and transfer before shutting down the APU. Also, power-up seemed to give a bundle of false alarms that needed to be reset and sorted out.

Thus it was (is?) not unusual for a 787 to be powered on for several months, at least to the point of triggering the involved "counter".

And redundant- fail safe systems should NOT have a common tie point which can trigger such an event. :mad:

SAMPUBLIUS
2nd May 2015, 15:35
The counter involved also counts when ground power is connected as in prior to APU shutdown.
:O

EEngr
2nd May 2015, 16:08
DozyWannabe (http://www.pprune.org/members/54871-dozywannabe)

Your analysis sounds about right, but from what I've been told real-time aviation software isn't usually hand-coded in the manner most other software is.

Sadly, this is exactly the case I ran into many times while at Boeing. One would expect that embedded controllers would be based on tested and stable RTOSs and libraries, where an uptime of 248 days is no big deal for 32 bit controllers and the overflow and wrap-around issues have been addressed. But I've worked with people who insisted on writing every line of code from scratch. Just because NIH.

infrequentflyer789
2nd May 2015, 16:28
The overflow of a counter has been found, someone said how long would we need to keep a generator running for the counter overflow problem to show - 248 days!

The history of software development is littered with problems caused by people who thought counters were "big enough" that overflow would never be a problem, or that they would never overflow in the expected life of the software, or that the programmer would be retired / dead by the time the problem hit. This sort of thing really should _not_ be happening in safety critical software in this century.


Presumably, there is a requirement to report such software issues even though the chance of keeping a specific generator running for 248 days is zero. The chance of all generators on an aircraft being kept running for the same 248 days is less than zero. It is not even vanishingly small it is zero.

Yet the FAA felt they had to issue an AD!? :D Really???

The AD seems to just say "mandatory restart every 120 days" - I guess that gives two chances to catch it plus a bit of margin. If everyone is doing this anyway - if there is zero chance as you say - then I'm not sure why they included a cost of compliance...

It is also implied that this was "found" and therefore was not previously documented - as it should have been. To me, this indicates a non-zero risk that in some future change someone will make the counter value persistent (no resets), or make it effectively smaller (and overflow sooner), assuming (because it is not documented) that overflow causes no problems. The AD serves, in part, to document it.

I am more interested in what remains unsaid, namely why this software was/is being tested "in laboratory testing" _now_ - inevitable suspicion is that it is because of a real in-service problem (most likely not this one as you say). It also raises the question of why the software was _not_ tested "in the lab" before flight (or maybe it was but not fully / correctly). I don't suppose we'll ever know...

roulishollandais
2nd May 2015, 17:05
INTEGER OVERFLOW WAS ALREADY THE MAIN CAUSE OF THE ARIANE 501 ROCKET CRASH !
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAIN ACCIDENT (Cost 8 billions FF)

The report showed also many other IT faults (I counted 99 of 80 different types).

Many people think it is impossible to write and implement bug-free lines. That must end. It is possible to write 1+1=2 and not 1+1=3, it is possible to debug IT.

Again :
NO LESSON HAS BEEN LEARNED BY BOEING FROM THAT MAIN ACCIDENT (Cost 8 billions FF) ! Shame to our airspace community :mad:

ImageGear
2nd May 2015, 17:33
Indeed, in the rush to push any hardware/firmware/software out of the door and start bringing in revenue (to fund the completion of the design), just about the last programme task to be started are the error recovery routines.

In the meantime, the infamous "Bit Bucket" is the black hole where subroutines die when a line of code has failed to execute correctly.
A Programme Director who has the guts to put the brakes on rollout until elegant error recovery schemes are completed will not occupy his role for long.

In the meantime, Pilots, Engineers and Airlines must pick up the tab for debugging the suppliers code in more ways than one. :=

roulishollandais
2nd May 2015, 17:43
In the meantime, Pilots, Engineers and Airlines must pick up the tab for debugging the suppliers code in more ways than one.

... And passengers :ugh:

MarcK
2nd May 2015, 18:47
All counters roll over (overflow) at some point. Some more often than others. The GPS Week counter rolls over every 1024 weeks (19+ years). The first rollover happened in 1999 (just before the year 2000 rollover problem) and caused many GPS systems to fail. The next GPS week rollover will happen in 2019. The date format of Digital Certificates will have a rollover situation in 2049.
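The GPS dates follow directly from the width of the week field. A small sketch, assuming the legacy 10-bit broadcast week number and the GPS epoch of 6 January 1980; the dates are in GPS time, which differs from UTC by a number of leap seconds, and the helper names are hypothetical.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)          # start of GPS week 0
WEEK_BITS = 10                        # assumed width of the legacy week field
WEEKS_PER_ROLLOVER = 2 ** WEEK_BITS   # 1024 weeks


def rollover_date(n: int) -> date:
    """Date (GPS time) on which the week field wraps to zero for the n-th time."""
    return GPS_EPOCH + timedelta(weeks=WEEKS_PER_ROLLOVER * n)


def broadcast_week(day: date) -> int:
    """Week number as a receiver would see it in the 10-bit field."""
    return ((day - GPS_EPOCH).days // 7) % WEEKS_PER_ROLLOVER


print("first rollover: ", rollover_date(1))
print("second rollover:", rollover_date(2))
print("week field the day before / on the first rollover:",
      broadcast_week(rollover_date(1) - timedelta(days=1)),
      broadcast_week(rollover_date(1)))
```

A receiver that only looks at the 10-bit field cannot tell week 0 of 1980 from week 0 of 1999, which is why the rollover has to be handled in software rather than by making the counter "big enough".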

PAXfips
2nd May 2015, 18:57
and several UNIX system will have a nice rollover in 2038 :p

Yet.. that very 248 days (32bit for 1/100s) happened before and it's just insane that they "reintroduce" that (ask Win95 users that happened to have such an uptime).
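For the 2038 case the arithmetic is the same idea with a one-second tick. A minimal sketch, assuming a signed 32-bit count of seconds since the Unix epoch; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2 ** 31 - 1

# Last moment a signed 32-bit seconds counter can represent.
last_representable = UNIX_EPOCH + timedelta(seconds=INT32_MAX)
print("last representable time:", last_representable.isoformat())
```

One second after that instant a signed 32-bit value wraps to its most negative number, which a naive conversion would interpret as a date in December 1901.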

deptrai
2nd May 2015, 19:44
this is the 3rd thread about the same subject, the other 2 have been merged here: http://www.pprune.org/tech-log/560755-787-software-problem-new-post.html

ATPMBA
2nd May 2015, 20:57
Planes have turned into flying computers so I guess they need to be re-booted often to prevent problems. Hopefully at cruise a pilot won't have the computer try to calculate PI to infinity. I believe that's how they overloaded the computer on a Star Trek episode.

Airbubba
2nd May 2015, 22:11
Lead story on the U.S. edition of CNN's web page:

FAA finds Boeing Dreamliner could lose all power, issues maintenance mandate

By Greg Botelho, CNN
Updated 5:25 PM ET, Sat May 2, 2015


(CNN)—The headaches for Boeing over its 787 Dreamliner continue.

The Federal Aviation Administration on Friday issued a directive mandating "a repetitive maintenance task" for that model of airliners due to issues with its power supply. Specifically, the FAA explained testing revealed that 787s could lose all AC electrical power after being continuously powered for 248 days, a problem that -- if left unchecked -- would leave an aircrew unable to control the plane.

The order took effect immediately, with the federal agency finding that there's no good reason to delay the decision.

"The FAA has found that the risk to the flying public justifies waiving notice and comment," the agency said.

The maintenance mandate was characterized as temporary, until software is developed to resolve the problem.

This marks the latest setback for Boeing over its 787, which debuted in 2011 in Asia and a year later in the United States amid much fanfare...


FAA: Dreamliner battery could lose all power - CNN.com (http://www.cnn.com/2015/05/02/us/boeing-787-dreamliner-faa-directive/index.html)

Oh, the horror... :eek:

ams6110
2nd May 2015, 22:35
It is possible to write 1+1=2 and not 1+1=3 , it is possible to debug IT.

Of course the systems on the 787 are a bit more complicated than an adding machine. What happens when you enter 9999999999 + 1 on your calculator?

It is no more possible to write defect-free complex software than it is to design a hydraulic pump that never fails. The proper approach is therefore to expect failures and design systems that are fault-tolerant. Maybe Boeing didn't do that here, or could have done better, but the general idea that we can achieve perfection in complex software is wishful thinking.

Even with our best engineering we find problems that go undetected during design and testing. We deal with those when we discover them, as Boeing are doing now (they are updating the software).

HighAndFlighty
3rd May 2015, 00:20
Slightly tongue-in-cheek post from IT journal The Register:



The US Federal Aviation Administration (FAA) has issued a new airworthiness directive (PDF) (https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf) for Boeing's 787 because a software bug shuts down the plane's electricity generators every 248 days.
“We have been advised by Boeing of an issue identified during laboratory testing,” the directive says. That issue sees “The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.”
When the GCU is in failsafe mode it isn't making any power. That'll be bad news if all four of the GCUs aboard a 787 were powered up at the same time, because all will then shut down, “resulting in a loss of all AC electrical power regardless of flight phase.”
And presumably also turning the 787 into a brick with no power for its fly-by-wire systems, lighting, climate control or in-flight movies.
The fix outlined in the directive is pretty simple: make sure you turn the GCUs off before 248 days elapse.
Boeing is working on a fix and the FAA says “Once this software is developed, approved, and available, we might consider additional rulemaking.”
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again.

Have you turned it off and on again? That's the way to stop the plane becoming a brick! - The Register (http://m.theregister.co.uk/2015/05/01/787_software_bug_can_shut_down_planes_generators/)

dave.rooney
3rd May 2015, 01:00
ams6110: I'm not a commercial pilot but I do have 25+ years in software. It's not so much "wishful thinking" about perfect software but the cost of attaining it. The Space Shuttle flight control software team was one group that was famous for attaining near zero defects, but they did so with incredible rigour and an associated slow pace.

Can anyone shed some light on the language used in the GCU's system? Was it Ada or C/C++? Regardless, it's pretty common to do "what happens when this value hits its maximum + 1" testing. Even in simple web systems I often do that in unit tests to verify that nothing breaks in an unexpected way.

The difference in this case is probably that it's an internal counter rather than a parameter being passed about. A parameter would probably be subjected to boundary value tests, but perhaps not a global counter. There are also tools for what's called "fuzz testing" that will inject invalid values to catch just these sorts of problems. Again, though, that may have been done but just not in the right place.
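A boundary value test for an internal counter might look something like the sketch below. Everything here is hypothetical: the two counter classes are stand-ins written for illustration, not anything from the GCU software, and the point is only that starting the counter just below its maximum makes the overflow testable in milliseconds rather than 248 days.

```python
INT32_MAX = 2 ** 31 - 1
INT32_MIN = -2 ** 31


class WrappingCounter:
    """Hypothetical counter that wraps like a signed 32-bit integer."""

    def __init__(self, start: int = 0) -> None:
        self.value = start

    def tick(self) -> int:
        self.value = (self.value + 1 + 2 ** 31) % 2 ** 32 - 2 ** 31
        return self.value


class SaturatingCounter:
    """Hypothetical alternative that stops at the maximum instead of wrapping."""

    def __init__(self, start: int = 0) -> None:
        self.value = start

    def tick(self) -> int:
        self.value = min(self.value + 1, INT32_MAX)
        return self.value


def test_counter_at_boundary() -> None:
    # Seed the counter one tick below the maximum instead of waiting for it.
    wrapping = WrappingCounter(start=INT32_MAX - 1)
    assert wrapping.tick() == INT32_MAX
    assert wrapping.tick() == INT32_MIN      # the failure mode under discussion

    saturating = SaturatingCounter(start=INT32_MAX - 1)
    assert saturating.tick() == INT32_MAX
    assert saturating.tick() == INT32_MAX    # stays put, never goes negative


test_counter_at_boundary()
```

The catch, as the post says, is that this only works if the test harness can reach the counter to seed it; a value buried inside a module with no way to preset it is much harder to exercise.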

One hope I do have, though, is that Boeing treats this like a mechanical issue and does proper root cause analysis and deals with the human aspects as much as the technical ones. That doesn't happen very often at all in the software world.

rh200
3rd May 2015, 01:14
I think they need something like the Concorde used to have in the cabin, but instead of displaying speed, counting down to power off :E

Airbubba
3rd May 2015, 01:19
For now, before you board a 787 it's probably worth asking the pilot if he can turn it off and turn it on again

And, actually, as discussed on another thread, not being able to power things down for a reset is a challenge on 'modern' aircraft.

Back in the not so distant FE days pulling a breaker and resetting it would cure a lot of mysterious faults. You were supposed to use that superior systems knowledge to psychoanalyze the electrical system to figure out what relay would be unpowered on what bus etc., etc., etc...

Later, in the early glass days, inoperative computer boxes like FCC's would sometimes be cured back on the ground if you removed all power to the plane and started up again. The feds would also look the other way if, for example, you reset a yaw damper that didn't come online during start. You've seen it before and know how to fix it, no need to pull out the book, right?

Now, in this enlightened era of fly-by-wire and electric airliners, you don't touch a button in case of a fault until you've run through the QRH, then got a phone patch and done a kumbaya session with the dispatcher and maintenance. Which is probably a good thing given the history of creative systems analysis by flight crews. Even after you land, more effort often seems to be put into finding the right deferral code than fixing what might be a simple problem.

Anyway, stuff like a frozen ACARS screen in flight that in the past you would be expected to troubleshoot and pull and reset a breaker is now something that you need to document, advise the company and live with unless you somehow get special dispensation from somebody on the ground.

Sweeping language suggesting a general prudential approach in the preamble to the abnormal section of the flight manual has been replaced by paragraphs of CYA verbiage to insulate the company from liability if you make a command decision to try something not specifically authorized in the book.

So, it seems to me that increasingly, the pilot can't 'turn it off and turn it on again' even on the older aircraft.

SAMPUBLIUS
3rd May 2015, 02:01
High and Flighty/FAA said '...That'll be bad news if all four of the GCUs aboard a 787 were powered up at the same time, because all will then shut down, “resulting in a loss of all AC electrical power regardless of flight phase.”>>

But normally there are a few seconds or minutes between the start of one engine and the second engine. And then the APU will be shut off after climbout. Which then leads to the following:

One engine GCU times out and shuts off its two generators. No biggie, the other engine and APU can easily carry the load. But a few minutes later, the 2nd engine generator system cuts out. Oh well, we still have the APU to start engines? Then a few minutes later, the APU generator times out??

Or is the GCU involved a single point join so that its timer overloads - and the battery system cuts in with ' nearer my god to thee ' ??:8

roulishollandais
3rd May 2015, 03:02
Building software is like building a house: the first thing you have to do is to list all the materials/variables that you need, defining size, use, purposes, movement, range and so on. The integer counter is one of the easiest variables to verify all along the software design and realisation. No need for sophisticated stats, only very basic methods with paper and pencil in your armchair. No excuse after the Ariane 501 crash and report. NO!

cattletruck
3rd May 2015, 03:35
Congratulations to the tester that found the bug. Good testers think outside the box as this one had done, 248 days was perhaps an unlikely scenario but by bringing it up as a test fail it really got everyone's attention of what a simple oversight can do.

Once fixed it's going to take another 248 days to re-run the test.

Anyhow, methinks a 787 version 1.0 probably flies better if you reboot it first.

poorjohn
3rd May 2015, 05:30
Once fixed it's going to take another 248 days to re-run the test.

I'm pretty sure it can be qualified "by inspection".

Guptar
3rd May 2015, 09:46
Very interesting thread, something I had never even thought about. I just wish I had even the faintest idea of what you guys are talking about.

So, can someone answer a couple of questions for a simple guy.

Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.

I gather, from googling it (nothing of which I understood anyway), integer counters are fairly common in computing software, so how do banks not have this problem as their computer hardware boxes have times between power downs measured in years.

If software is not hand coded, ie someone pounding away on a keyboard writing lines of code, how is it written if it's not hand coded.

All this stuff, sounds like you're talking about the warp drive of the Starship Enterprise.

I have such a headache now!

Amadis of Gaul
3rd May 2015, 11:04
Boeing must be using Windows programmers these days.

Pilot: "Hello Help desk - the aircraft has lost power"
Indian "have you rebooted?"

Hey, that's racist!

poorjohn
3rd May 2015, 13:26
Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?
That's a question for the hardware guys.

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.
A "counter" is basically a unit of data storage in memory and the associated software that manipulates the value being stored, e.g. increments the value and tests it against limit(s). Computers typically have instructions that let them access memory in chunks smaller than the default size, and to not waste memory (which for critical real-time devices can be expensive) the programmer selects a size appropriate to the need.
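To put numbers on the size trade-off: the sketch below uses Python's ctypes integer types purely as a way to get real storage sizes, and assumes a 100 Hz tick rate (an assumption from earlier in the thread, not from the AD). The function name is made up.

```python
import ctypes

TICKS_PER_SECOND = 100  # assumed tick rate, for illustration only


def counter_capacity(ctype, ticks_per_second: int = TICKS_PER_SECOND):
    """Return (bytes of storage, largest signed value, seconds until overflow)."""
    size_bytes = ctypes.sizeof(ctype)
    max_value = 2 ** (size_bytes * 8 - 1) - 1
    return size_bytes, max_value, (max_value + 1) / ticks_per_second


for ctype in (ctypes.c_int8, ctypes.c_int16, ctypes.c_int32, ctypes.c_int64):
    size_bytes, max_value, seconds = counter_capacity(ctype)
    print(f"{size_bytes * 8:2d}-bit: {size_bytes} bytes, max {max_value}, "
          f"overflows after {seconds / 86400:.6g} days at {TICKS_PER_SECOND} Hz")
```

Each doubling of the storage squares the range, so the choice between 4 and 8 bytes is the difference between months and billions of years at this tick rate. The smaller size is only "appropriate to the need" if something else guarantees the counter is reset, or the code tolerates the wrap.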

I gather, from googling it (nothing of which I understood anyway), integer counters are fairly common in computing software, so how do banks not have this problem as their computer hardware boxes have times between power downs measured in years.
This 787 counter counted units of time, so it was a timer. You'd have to know what it was for and why it was designed to force the hardware it controlled into some inoperational mode when the value became zero. It could have been a valid reason e.g. the device had reached a critical time limit where it had to be shut down and lubricated and the problem the program designer didn't allow for was that that service could have taken place without powering off the device and resetting the timer/counter.

If software is not hand coded, ie someone pounding away on a keyboard writing lines of code, how is it written if it's not hand coded.
Programmers may insert into their own program software modules written by other programmers. Hand-coded, but by others' hands.
(The design fault here is that the software does not count characters I've typed within a quote, so I have to say something I didn't need to say outside the quote or it will flog me because my message was "too short".)

HighWind
3rd May 2015, 18:33
Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?
That's a question for the hardware guys.

I'm not working with aerospace, but in my field of engineering (wind turbines) it definitely would have integer counters.
Often those systems run at a constant scan rate, and software filters and timers are used to slow down the reaction of the system in a configurable manner.
Since systems have boxes connected together with communication links, monitoring of broken communication links has to be implemented (typically as timeouts).
Another purpose could be for shutting things down in case of faults, e.g. stop the engine if the lubrication pressure is lower than 2 bars for 5 secs.
Timers are also used to delay, and prevent erratic state change of an output, i.e. prevent a valve from being turned off/on every 10 ms scan.
(Persistent) Counters are also used for statistics for maintenance and trouble shooting.
It is good system-engineering practice to separate/compartmentalize safety critical control, from datalogging for diagnostics.
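The lubrication example above can be sketched as a scan-based timer. This is an illustration only: the 10 ms scan period, the 32-bit free-running tick counter and all names are assumptions, and the elapsed-time calculation uses modular subtraction so that it keeps working when the tick counter wraps.

```python
SCAN_PERIOD_MS = 10          # assumed constant scan rate
TRIP_DELAY_MS = 5000         # "lower than 2 bars for 5 secs"
PRESSURE_LIMIT_BAR = 2.0
COUNTER_MODULUS = 2 ** 32    # assumed unsigned 32-bit free-running tick counter


def elapsed_ticks(now: int, start: int) -> int:
    """Ticks since start, correct across one wrap of the tick counter."""
    return (now - start) % COUNTER_MODULUS


class LowPressureTrip:
    """Requests a stop once pressure has stayed below the limit for the delay."""

    def __init__(self) -> None:
        self.low_since = None  # tick at which pressure first went low

    def scan(self, tick: int, pressure_bar: float) -> bool:
        if pressure_bar >= PRESSURE_LIMIT_BAR:
            self.low_since = None          # condition cleared, restart the timer
            return False
        if self.low_since is None:
            self.low_since = tick
        return elapsed_ticks(tick, self.low_since) * SCAN_PERIOD_MS >= TRIP_DELAY_MS


# Run the timer across a counter wrap: start 100 ticks before the modulus.
trip = LowPressureTrip()
start = COUNTER_MODULUS - 100
results = [trip.scan((start + i) % COUNTER_MODULUS, 1.5) for i in range(501)]
print("tripped on scan:", results.index(True))
```

Comparing absolute tick values (for example `tick >= low_since + 500`) would misbehave at the wrap, whereas the difference taken modulo the counter size does not, as long as the interval being timed is much shorter than the wrap period.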
If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.

The size of a counter value is primarily related to software, and not hardware architecture.
Using a 64bit desktop microprocessor in such an environment is often a bad idea, if possible micro-controllers like ARM cortex is used instead.
A bigger complex CPU use more power, generates more heat, and is 100 times more unreliable than a small microcontroller.
An Intel desktop CPU is only on the market for 3 years, and industrial/aerospace products have to be supported for 20 years.
Some of the newer micro controllers like the TMS570 have features that make safety certification easier.

rh200
3rd May 2015, 20:56
What type of variable you declare can depend upon several things:

1) You misunderstand the requirements.

2) You are sloppy.

3) Architecture, coupled with the above.

It used to be that people tried to keep their code small; with today's CPUs and resources, people have become very sloppy. But there are a couple of situations (probably more) that I know of which force me to declare small variable types.

1) Where the output of the variable results in an excessive amount of data to capture and store.

2) Using micro controllers. These usually have limited on board space. I would imagine in an environment such as these, with access to the best, they would still be constrained.

And there could be many other reasons.

underfire
3rd May 2015, 22:40
(CNN)The headaches for Boeing over its 787 Dreamliner continue.

The Federal Aviation Administration on Friday issued a directive mandating "a repetitive maintenance task" for that model of airliners due to issues with its power supply. Specifically, the FAA explained testing revealed that 787s could lose all AC electrical power after being continuously powered for 248 days, a problem that -- if left unchecked -- would leave an aircrew unable to control the plane.

The order took effect immediately, with the federal agency finding that there's no good reason to delay the decision.

FAA finds Boeing Dreamliner could lose all power, issues maintenance mandate (http://edition.cnn.com/2015/05/02/us/boeing-787-dreamliner-faa-directive/)

Radix
4th May 2015, 05:22
..........

peekay4
4th May 2015, 06:47
Why does a GCU have an integer counter, does it need to count something to measure time or cycles of something?
The purpose of an integer counter is to provide a standard measurement of time.

Remember that hardware can run at varying speeds, so we can't rely on hardware cycle speed to measure time. E.g., suppose today a CPU runs at 1 GHz, but tomorrow a replacement CPU comes out at 2 GHz, so each hardware cycle is now twice as fast. We don't want all of our time measurements to suddenly be off by a factor of two!

Therefore a counter is provided which always increases at a predictable, set time period (called the tick time period) regardless of the underlying hardware speed.

A common tick rate is 100 Hz, i.e. the time counter will always increment once every 1/100th of a second, regardless of the speed of the hardware. An elapsed time of 100 ticks means 1 second has passed, on any hardware.

Most real-time systems are completely tick based. At each and every tick, the system "kernel" is activated and every running task is re-scheduled for execution based on its priority and allocated processing time budget (also measured in ticks).

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.

Boeing probably had little to do with this bug. The affected GCUs would have been supplied by a third-party company.

And that third-party company probably used a Real Time Operating System (RTOS) supplied by yet another company.

My guess is this integer overflow is probably in the RTOS or related code. The bug might have been discovered in some completely unrelated software (maybe not even aviation software) using the same RTOS.

The speculation is that the buggy code is a 32-bit signed counter measuring 100 Hz ticks. So with one bit taken for the sign (+/-), that leaves 31 bits for the counter, and 2^31/(60*60*24*100) = 248.55 days.
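The arithmetic can be checked with a few lines of Python. This is only a sketch of the speculated scenario (a signed 32-bit counter at 100 Hz); the function names are made up for illustration and the wraparound is simulated, since Python integers do not overflow by themselves:

```python
TICK_HZ = 100                      # speculated tick rate
SECONDS_PER_DAY = 60 * 60 * 24


def days_until_overflow(counter_bits, signed, tick_hz=TICK_HZ):
    """Days of continuous running before a tick counter runs out of range."""
    usable_bits = counter_bits - 1 if signed else counter_bits
    return 2 ** usable_bits / (tick_hz * SECONDS_PER_DAY)


def add_tick_int32(counter):
    """Increment the way a 32-bit two's-complement register would (wraps)."""
    wrapped = (counter + 1) & 0xFFFFFFFF
    return wrapped - 0x100000000 if wrapped >= 0x80000000 else wrapped


print(round(days_until_overflow(32, signed=True), 2))   # 248.55
print(add_tick_int32(2 ** 31 - 1))                      # -2147483648
```

The second line of output shows what "overflow" means here: one tick after the largest positive value, the counter suddenly reads as a large negative number.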

roulishollandais
5th May 2015, 01:33
Boeing probably had little to do with this bug. The affected GCUs would have been supplied by a third-party company.

And that third-party company probably used a Real Time Operating System (RTOS) supplied by yet another company.

My guess is this integer overflow is probably in the RTOS or related code. The bug might have been discovered in some completely unrelated software (maybe not even aviation software) using the same RTOS.
If you use software from a third party, you need not only the software or the RTOS but all of its documentation and the complete test data. The supplier of the RTOS or the software may have designed it for a toy, but Boeing uses it for an aircraft.
The certifiers are at fault too: they have to verify that the documentation and test data are there and that the tests have actually been done after implementation. It seems easy to ask a third party to share the work, but in fact you have to verify all the links.
To be sure the work is done, you should pay only once you have received everything and it is OK. Everyone must sign off their work as complete. The certifiers should not have certified the B787 before all the tests were done and on the table.

TURIN
6th May 2015, 10:16
Back in the real world...

It takes about 20 minutes to downpower and reboot the a/c. Not good on a quick turn round but if the a/c has just come out of the shed after an A check, no big issue.
It is common practice to park the a/c without power if it is not required for several hours.

Just another card on the check.

vapilot2004
6th May 2015, 12:10
The 'lazy' certification issue RH mentions is truer today than ever before. More and more reliance on manufacturer-designed testing regimes for the regulators regarding airborne computer systems has the odd chicken coming home to roost in times recent. (past few decades)

This lack of complete knowledge of the widest operational range (extremes/faulty sensors/etc) at the confluence of hardware/software interface has the potential for the occasional dicey consequence - particularly after human factors are added into the melange.

roulishollandais
10th May 2015, 16:41
If you don't want to read the whole of that report:
The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error.

The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected.

The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose.

The alignment function is operative for 50 seconds after starting of the Flight Mode of the SRIs which occurs at H0 - 3 seconds for Ariane 5. Consequently, when lift-off occurs, the function continues for approx. 40 seconds of flight. This time sequence is based on a requirement of Ariane 4 and is not required for Ariane 5.

The Operand Error occurred due to an unexpected high value of an internal alignment function result called BH, Horizontal Bias, related to the horizontal velocity sensed by the platform. This value is calculated as an indicator for alignment precision over time.

The value of BH was much higher than expected because the early part of the trajectory of Ariane 5 differs from that of Ariane 4 and results in considerably higher horizontal velocity values.
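To make the quoted failure mode concrete, here is a small Python sketch of what "protecting" a 64-bit float to 16-bit signed integer conversion can look like. It is not the Ada code from the SRI; the function names are hypothetical, and saturating at the limits is just one possible protection, assumed here for illustration:

```python
INT16_MIN, INT16_MAX = -32768, 32767


def fits_int16(value):
    """True if the float, truncated toward zero, is representable in 16 bits."""
    return INT16_MIN <= int(value) <= INT16_MAX


def to_int16_saturating(value):
    """Protected conversion: clamp to the 16-bit range instead of faulting."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# A value within range converts normally; an unexpectedly large one
# (like the horizontal bias on the new trajectory) is clamped.
print(to_int16_saturating(1234.7))    # 1234
print(to_int16_saturating(40000.0))   # 32767
```

Whether clamping is the right response depends on what the value is used for; the point is only that the out-of-range case is decided on purpose rather than left to raise an unhandled error.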

blackbeard1
10th May 2015, 18:51
There is no such thing as a "free launch" or lunch, I was involved in Cluster and almost 10 years of my life went up in smoke.
Cluster (spacecraft) - Wikipedia, the free encyclopedia (http://en.m.wikipedia.org/wiki/Cluster_(spacecraft))

roulishollandais
12th May 2015, 10:23
@blackbeard1
Ten years of your life up in smoke because of that crazy overflowing bit and the wider malfunction of that rocket project! Condolences!

Boeing will probably find other things to worry about… :ugh:

atakacs
12th May 2015, 11:21
Wasn't Cluster II re-launched a few years later and still happily operating?

blackbeard1
12th May 2015, 12:21
Cluster was rebuilt and launched from Baikonur and is still working and giving good scientific results, as you said. I am now retired, as are most of the original team; sadly some have died, but it is good to know that the original design and objectives are still giving good scientific results.

ESA Science & Technology: Cluster (http://sci.esa.int/cluster/)

roulishollandais
22nd May 2015, 00:38
Thank you blackbeard1 for that wonderful Cluster and link.
I had the pleasure of learning more about the aurora borealis and the latest studies.

Sunamer
22nd May 2015, 08:16
"32bit signed value used as a counter running at 100Hz?"

I was wondering, why would you need to have a signed value if it is a simple counter...
An unsigned one would give twice the range of the signed one... 248*2 = more than a year. :}

"If your aircraft has been powered for more than a year, don't forget to power cycle it..." kind of thing

The reaction to this non issue from outlets like CNN was just... :yuk:

EEngr
22nd May 2015, 15:34
I was wondering, why would you need to have a signed value if it is a simple counter...
Because signed integer math and conditional logic can give you positive/negative interval values. As in one event occurred before or after another. And there may be places in the code where this would be expected.:8
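A sketch of the signed-interval idea, with 32-bit arithmetic simulated in Python. The function name is hypothetical; the technique (take the difference modulo 2^32 and read it as a signed value) is a common way to compare tick stamps, and it stays correct across a counter rollover as long as the two events are less than 2^31 ticks apart:

```python
def ticks_between(earlier, later):
    """Signed 32-bit difference `later - earlier` of two tick stamps.

    Positive means `later` really is after `earlier`; negative means it
    came before. Inputs are raw 32-bit counter readings (0..2**32 - 1).
    """
    diff = (later - earlier) & 0xFFFFFFFF
    return diff - 0x100000000 if diff >= 0x80000000 else diff


print(ticks_between(100, 250))                 # 150
print(ticks_between(0xFFFFFFF0, 0x00000010))   # 32, spans the rollover
print(ticks_between(0x00000010, 0xFFFFFFF0))   # -32, event order reversed
```

Written this way the rollover itself is harmless; trouble only starts when code compares the raw counter values directly.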

dClbydalpha
24th May 2015, 09:43
non issue

Sorry, but this is anything but a non-issue. Looking at the information in the public domain, this is a systematic design failure.

1. The GCU control system fails after ~7000 hours.
2. It is a common mode failure so no credit can be given to multiple systems.
3. The failure leads to loss of all AC.
4. Loss of all AC is at least HAZARDOUS.

Therefore a target of 1x10^-7 is supposed to be fulfilled by a design struggling to meet 1x10^-4.
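A rough way to see where the 1x10^-4 order of magnitude comes from, as a sketch only: treat a failure that is certain after N hours of continuous running as an average rate of 1/N per hour. That simplification is an assumption made here for illustration, not a formal safety-assessment method, and the function name is made up:

```python
def average_failure_rate(hours_to_failure):
    """Crude average failures per hour for a unit certain to fail at N hours."""
    return 1 / hours_to_failure


rate_poster = average_failure_rate(7000)                  # the ~7000 h figure above
rate_counter = average_failure_rate(2 ** 31 / (100 * 3600))  # 2^31 ticks at 100 Hz

print(f"{rate_poster:.2e}")    # 1.43e-04
print(f"{rate_counter:.2e}")   # 1.68e-04
```

Either way the figure lands around 10^-4 per hour, roughly three orders of magnitude away from a 10^-7 target.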

Firstly the overflow error should be trapped at source. It adds complexity to design, but it needs to be done in safety critical systems.
Secondly it appears the safety analysis has not fully analysed all the software failures ... if the software design process guidelines for safety critical systems had been followed then this should have stood out like a sore thumb. This is the kind of thing that happens when people use the analysis from old designs, without re-validating the original assumptions against the new design.

In mechanical terms, if a fastener repeatedly loosens in flight there is something wrong. It is not acceptable to say that it didn't come totally undone, so as long as we tighten it up each time it is OK; the fastener should be redesigned.

I have not seen a statement from Boeing that denies any of the 4 assumptions I have made, but I emphasise that I have no detailed knowledge so this is based only on the public domain information ... but based on that it really worries me, because it isn't a "bug" it is a systematic failure.

roulishollandais
24th May 2015, 17:51
dClbydalpha,
I agree with that.
Once again, learn the lesson from the Ariane 501 report: not only is it easy to avoid integer overflow by being very methodical, but the report showed that a long list of other failures led to the fatal 37th second. Fixing any item on that list would have avoided the rocket's destruction.

DeafOldFart
24th May 2015, 22:06
Er..... how about tacking another 32 bit address on, to make it 64 bit count.... should take us into intergalactic durations....
Or running a slower clock speed, like the 1khz machines I cut my teeth on...
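For scale, a quick sketch of how far each of those two suggestions would push the overflow out, assuming a signed counter and the speculated 100 Hz tick as the baseline. The function name and the 1 Hz alternative are illustrative assumptions:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_until_overflow(usable_bits, tick_hz):
    """Years of continuous running before a counter with that many bits wraps."""
    return 2 ** usable_bits / (tick_hz * SECONDS_PER_YEAR)


print(years_until_overflow(31, 100))   # about 0.68 years (the 248 days)
print(years_until_overflow(31, 1))     # about 68 years with a 1 Hz tick
print(years_until_overflow(63, 100))   # about 2.9 billion years with 64 bits
```

Neither change removes the overflow; it only moves it, which is why the handling of the rollover still has to be designed and tested.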

roulishollandais
25th May 2015, 08:44
Hello msbbarratt,
Excellent post ! :)
In which case the spec was junk
In the case of Ariane 501, somebody said the spec stated that in case of a double IRS failure the trajectory calculation should stop... So that spec was not very smart!

And we often read on PPRuNe "It worked as designed".
The difficulty for the IT analyst is to guess where something could be missing or wrong in the specification! And we have to warn the people who are building the spec: "That could happen, do you want to accept that?", because we know the hidden side of the system and architecture that the final boss is not aware of (like DeafOldFart suggesting to replace the B787's overflowed 32-bit integer with a 64-bit integer, or to modify the frequency :})

Let us hope this is the cheapest case for Boeing, but it probably will not be, as the certifiers missed the bug too..:{

dClbydalpha
25th May 2015, 10:53
I have just finished reading the linked item below.

http://www.faa.gov/about/plans_reports/media/787_report_final.pdf


As suspected, the usual observations are there. Lack of ownership of requirements, inadequate v&v coverage and use of previous design experience without re-validating the design assumptions.

Nothing new, and that is what concerns me. Not disastrous as an individual item, no outright condemnations, but as the report shows that the GCUs were a deep-dive item, the process seems to be struggling with managing the complexity and nature of these next generation projects. In this case the inevitable system level impact of a low-level design decision was not spotted, perhaps due to the number of responsibility boundaries that had to be crossed.

Uplinker
25th May 2015, 11:20
Forgive me because I am not a software programmer, but any airborne safety critical system - such as a GCU - that is required to work should not be even slightly open to being compromised or shut down by just a clock, or a clock malfunction.

The GCUs in this case do not fail, they are switched off because a clock says so. What does a mere clock know about the generator load, the CSD oil temperature and pressure, the serviceability of the other electrical systems in the network etc?

To have a healthy system shut down because a mere timer or a timer fault says so is crazy!!

How was it ever allowed to be designed this way?

EEngr
25th May 2015, 14:53
However this all seems to have been some sort of surprise, and it shouldn't be.
This is the primary problem as I see it. The fact that the spec/design/test process appears to have a large hole in it through which this bug slipped needs to be investigated further.

The whole GCU reset every 248 days, by itself, is a non issue. That (like many other maintenance items) can easily be taken care of once the issue is known and few people would care. Some might. Every maintenance step, no matter how trivial, incurs a cost to document and track at the operator's expense. So even one extra check box would raise a few questions. Particularly if they understood how trivial the fix would have been back in the design stage.

But what with the industry's increasing reliance on manufacturers' self-certification and the regulators' hesitance at questioning anything process related within a company, I'm not hopeful that other bugs haven't slipped through as well.

roulishollandais
25th May 2015, 17:52
Thank you dClbydalpha

roulishollandais
25th May 2015, 18:01
EEngr
The issue may be very different depending on whether your system is analogue (Concorde) or digital (Ariane 5, Airbus 320 family, B787).
In the former you may get saturation; in the latter, unknown consequences of a carry/overflow indicator, like the destruction of the rocket (8 billion FF) for an unused variable, BH.

roulishollandais
26th May 2015, 06:53
"Tout va très bien, Madame la Marquise"!

Although some people had retired, other teams had been working on Ariane 4 V33 and on Ariane 5... But they were focused on terrorism instead of science! They did not have a sufficient level of IT knowledge :mad: and were pursuing a hidden geopolitical aim... :suspect:

There was confusion between the fact that both IRS do not work and how that diagnosis is made: with a double crazy carry, followed by a long list of failures and a loss of rigour with an excess of optimism, trusting the first positive statistical results instead of seeking the best proof.

Uplinker
26th May 2015, 06:55
Clock, counter, whatever.

My point remains the same. We simply cannot have safety critical and perfectly functional systems shutting down because of mere "housekeeping trivia". This needs to be addressed. Safety critical systems should never be shut down by mere admin processes.

If it overheats: maybe. If the oil pressure drops: maybe. If it over speeds: yes. But an overflowing clock/counter? Definitely not!

I am a current line pilot, and although I am not a software programmer, I have written simple software programs, so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions. Something as important as a main generator should not be subject to anything more than a simple logic network which keeps it operational as long as its basic parameters remain within limits.

SAMPUBLIUS
26th May 2015, 14:29
so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions.


Amen Amen.

To err is human- to really screw up takes a computer.

The above comment is/was the point of my initial post in this thread.
Other comments along that line also apply.

There should/must be NO way an ' administrator ' should be able to shut down a critical system without recourse. PERIOD:mad:

EEngr
26th May 2015, 15:55
Something as important as a main generator should not be subject to anything more than a simple logic network
Good luck with that. Modern aircraft have electrical systems far too complex to be operated by a 'simple logic network'. And airlines are not going back to the days of a flight engineer with a panel full of gauges and switches. 'Software' is the only practical way of controlling and reconfiguring such a system to account for generators going on or off line and bus reconfiguring for various external power or autoland configurations.

What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed. Or at least exposed to examination before a product is put into service. This isn't a big deal in the embedded s/w world. The RTOS (Real Time Operating Systems) vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes?

roulishollandais
26th May 2015, 17:36
vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes?
Don't dream, other sectors are not perfect! Fukushima is a disaster where for years they refused to heed the warning of some hydrologists who said that water was the first threat to the plant!
Wherever a mistake can be made, somebody will eventually make it.
We have to learn from our faults, sins and other mistakes...

Radix
26th May 2015, 20:45
..........

peekay4
26th May 2015, 21:03
There is no such thing as a perfect process or a perfect system. And furthermore, expecting (or depending on) perfection is the wrong thing to do, because it is unrealistic.

In fact, during certification of (new) aircraft, there is an acknowledgement that some defects will remain.

Hence, defects such as these -- while they should have been caught -- are not indicative of a process breakdown, certification breakdown, etc., but simply a reflection of reality.

The effects of any potential defect, however, should not be catastrophic. So what should be expected is a "graceful degradation" when failures do occur.

Actually a better analogy might be "defense in depth" used in security practice -- having multiple layers so that even a complete failure of one layer does not bring down the entire system.

The real question is then: even given a quadruple GCU failure taking down all four AC busses (due to this bug or some other malfunction) -- will that crash a 787?

Someone more familiar with 787s can correct me, but I think the answer is generally NO, as there is still the DC bus which will automatically run from batteries, before the ram air system kicks in (or possibly from the APU as well).

DozyWannabe
26th May 2015, 23:49
Time for a bit of a reality check, I feel...

No excuse after Ariane501 crash and report.NO !
In all fairness, the Ariane 501 scenario is a completely different kettle of fish from what we're talking about here. The former was a case of a hard-coded logical error involving number format translation and bit-depth conversion, whereas the latter was a case of integer counter overflow - however the crucial difference is that the former error occurred in a part of the program which was always expected to be executed, whereas the latter is very much an edge case (i.e. a scenario which is unlikely to occur in the real world). In practical terms we're talking about a scenario in which the aircraft has not once been in a "cold and dark" state for two-thirds of a year (ref: TURIN at post 54).

Congratulations to the tester that found the bug. Good testers think outside the box as this one had done...
Heh - I very much doubt that it was a single tester. Real-time software testing works rather differently from other disciplines. I suspect that it would more likely have been part of a suite of edge-case regressions intended to be added from the start.

Once fixed it's going to take another 248 days to re-run the test.

Nope, far more likely that the testing suite can increment the counter at any rate desired. :) Remember, it's not the counter itself that is the root of the issue as much as it is the dependent systems' ability to interpret the rollover correctly.
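As a sketch of what "increment the counter at any rate desired" can mean in practice: a test can simply start the counter a few ticks short of its limit rather than waiting 248 days. The names below are hypothetical and the 32-bit wraparound is simulated; a real test rig would do the equivalent on the target hardware or a simulator:

```python
INT32_MAX = 2 ** 31 - 1


def next_tick_int32(value):
    """One tick of a signed 32-bit counter, with two's-complement wraparound."""
    wrapped = (value + 1) & 0xFFFFFFFF
    return wrapped - 2 ** 32 if wrapped >= 2 ** 31 else wrapped


def first_negative_tick(start_value, ticks_to_run):
    """Run the counter from a preset value; return the tick at which it
    first goes negative, or None if it never does in the run."""
    value = start_value
    for n in range(1, ticks_to_run + 1):
        value = next_tick_int32(value)
        if value < 0:
            return n
    return None


# Preset the counter five ticks short of the limit: the rollover shows up
# after 6 ticks (60 ms at 100 Hz) instead of after 248 days.
print(first_negative_tick(INT32_MAX - 5, 100))   # 6
```

The interesting part of such a test is then what the code that reads the counter does on that sixth tick.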

If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.
Home/business computing and real-time/safety-critical computing are entirely different worlds. I'm not going to go into detail now, but it's worth pointing out that safety-critical systems tend to use obsolete hardware because of its proven nature and significantly lesser complexity. (Engineering maxim : more complexity means more things that can go wrong). More to the point, using a 64-bit signed integer would just have kicked the "can" (problem) down the road.

2) the specification defines an up time consistent with normal aircraft operations, < 248 days, in which case the software would have to have been tested against it before it was certified

3) as per 2) but someone has also taken the trouble to go beyond the spec in their testing and discovered the true system up time.
My "money" would be on this.

However this all seems to have been some sort of surprise, and it shouldn't be. It should be there in black and white in the paperwork. And it may well be the case that it is all written down in the right place, but that someone else simply hasn't read it. I'm expecting that to be the case, actually.
Possibly - I was thinking that they were applying additional "layers" of edge case testing based on the likelihood of the scenario occurring as development time became less critical.

We can't even keep a GCU running for 249 days.
Er, I'd argue not only that we can, but also that we just did - by applying standardised software reliability metrics and techniques that we've been developing and perfecting for decades.

As another software person, I'm also well aware of the limitations you're talking about - but we're not talking about the same kind of inherently dynamic logic required for a "self-driving" car or a fully-automated aircraft here, we're talking about bog-standard systems monitoring logic behaviour in scenarios which are extremely unlikely to occur in the real world.

What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed.
...
Who has the clout to hold Boeing's feet to the fire to adopt such processes?
Again, I'd argue that the very fact we're discussing this now means that Boeing (and/or their subcontractors) already have those processes in place. We're not talking about a glaring software mistake that slipped through the cracks, it's far more likely to be a missed edge-case in the specification - and the reason it wasn't covered until now is precisely because we're talking about an extremely unlikely real-world scenario. As I said above, we're in the realms of a hypothetical scenario in which the aircraft has not been powered down ("cold and dark") for *eight months*. Furthermore that each of the power units were brought online around the same time and all of them were kept running for the entirety of those eight months.

As you quite rightly state, modern aircraft systems are incredibly complex these days, and it's therefore much more sensible to focus testing on the most likely scenarios first and then adding layers of testing for less likely scenarios as the development and lifecycle of the product continues.

Don't get me wrong, this was undoubtedly an "oops" - I'm sure that several people who worked on these systems are now a little wiser and will swear to be more thorough in their work for the rest of their lives. Nevertheless, it's important that we all try to retain a little bit of perspective!

[I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...]

DozyWannabe
27th May 2015, 00:15
You're referring to what I call the Too-Much-Technology problem. There is an unsettling fashion for using technology, especially software, to solve problems that don't really exist.
Not applicable here.

...Certainly that's how they would have been done 30 years ago. Generator control didn’t need software for the first 80 years of flight, and not a lot has changed.
Would they have been consistently running for 8 months in that period?

However, kids coming out of university have almost no idea what an analogue control circuit is.
Actually, a decent Software Engineering graduate will be *well* aware of the fundamentals of control circuits and logic paths. Furthermore only the best of those graduates usually end up specialising in real-time/safety-critical work.

But they do know what software is. So guess what they choose to use when they're a bit older and end up designing GCUs? Software. Is it overkill? Possibly. Would they ever think of building it a different way? No.
Unfair assumption. Boeing sold this particular product on the basis of being the most energy-efficient airliner that technology could devise. Computer-controlled and regulated technology was and is the only practical method for achieving that aim (and hopefully proving it).

The only thing software has going for it in such a situation is that a whole ton of functionality can be implemented with a very low size/weight penalty.
Again, incorrect. Among other factors it is the most practical way of assessing the product's ability to meet its design requirements, and furthermore it is a far more practical method in terms of revising the systems design when problems do arise (it's much easier/cheaper to flash an EEPROM and change the systems' programming than it is to replace a physical TLA board).

Though in the case of a GCU I'm struggling to see what that extra functionality might be.
Providing a method of assessing and measuring the efficiency of the aircraft's systems, and providing a straightforward method of tweaking and improving their behaviour, for starters.

peekay4
27th May 2015, 01:57
I don't know too much about a GCU, but I can't imagine that there is much about them that couldn't be run with a good old fashioned analogue control circuit.
All analogue systems -- whether mechanical or electrical -- suffer from aging (yes, that is a technical term). Simply put, analogue systems always drift over time, continuously varying their performance until tolerance limits are exceeded.

When you combine an inherently analogue mechanical system with an analogue control system, you are essentially fighting a losing battle because both the system under control and the system doing the controlling will independently and jointly go out of tolerance. So analogue systems constantly require external tuning to keep their performance levels within an acceptable range.

Analogue systems are also susceptible to changing environmental conditions. The change in temperature from a hot ground tarmac to freezing flight levels are more than enough to affect the performance of analogue components (capacitors, resistors, amplifiers, etc.) So again, it's very difficult to maintain tight tolerances with analogue controls.

Digital systems, on the other hand, are not susceptible to these problems. If I set a certain digital memory parameter to the value of 128 decimal, it will not vary to become 129 or 127 over time. It will also remain exactly at 128 over its entire design temperature & environmental range. While memory can get corrupted -- and there are ways to automatically detect & correct this -- bits and bytes don't age, freeze or boil over.

The need to meet modern tolerance / precision requirements alone justifies the motivation to use digital vs. analogue techniques.

Plus there are many other advantages to using digital control systems. A single integrated digital controller can monitor, control and tune hundreds of parameters simultaneously in real time -- something that's impossible (or at least impractical) to do with analogue controllers.

Analogue controllers also tend to be fairly "dumb". With digital controllers, manufacturers can implement sophisticated control algorithms to increase performance, economy, reliability, etc.

SAMPUBLIUS
27th May 2015, 04:26
[I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...]

Well the FAA did get concerned. And it is still NOT clear if only one continuous run generator/power system/ground power connection for that time would trip ALL off line.

The fact that a ( simple ? ) ( single / ) counter could trip ALL systems ( including the APU ( total 6 generators ) off line until some sort of ground access reset doesn't say much for redundancy or safety.

And while the RAT could keep the controls running, it is NOT clear if engines could be restarted [takes electrical along with windmilling ( air start ) ]

Granted it is very very very unlikely- but the probability and possibility is NOT ZERO.

peekay4
27th May 2015, 05:37
And it is still NOT clear if only one continuous run generator/power system/ground power connection for that time would trip ALL off line.
No, it is perfectly clear.

Each GCU is independent. The counter is internal to each GCU. Failure of one GCU only affects that particular GCU, and will not affect any others.

The danger is if all four GCUs were powered up at the same time, then all four would fail at the same time.
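A small sketch of that point: if every unit overflows a fixed time after its own power-up, then whether the failures coincide depends only on whether the power-ups coincided. The 100 Hz signed 32-bit counter is the speculated figure from earlier in the thread, and the names and the one-minute stagger are hypothetical, used only to illustrate the independence:

```python
OVERFLOW_TICKS = 2 ** 31   # speculated: signed 32-bit counter
TICK_HZ = 100              # speculated tick rate


def overflow_times_s(power_up_times_s):
    """Time (in seconds) at which each independently counting unit overflows."""
    return [t + OVERFLOW_TICKS / TICK_HZ for t in power_up_times_s]


def max_simultaneous_failures(power_up_times_s):
    """Largest number of units that overflow at exactly the same moment."""
    times = overflow_times_s(power_up_times_s)
    return max(times.count(t) for t in times)


print(max_simultaneous_failures([0, 0, 0, 0]))        # 4: all powered up together
print(max_simultaneous_failures([0, 60, 120, 180]))   # 1: power-ups a minute apart
```

This only illustrates why simultaneous power-up is the dangerous case; it is not a suggested operating procedure.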

This is all clearly discussed in the original FAA AD under the "Supplementary Information" section:

https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf

Also the APUs are not affected (different make/model).

Radix
27th May 2015, 12:24
..........

Uplinker
27th May 2015, 15:43
I have no problem with digital systems. I have no problem with computers (I fly Airbus's !), and I agree that extra sophistication can be achieved with computers, or micro-controllers. I also agree that analogue systems need regular manual adjustment and calibration, which is a pain. (my first profession was as an electronics engineer, and believe me, I have spent plenty of time doing just that).

However, I do have a problem with a vital system shutting down because a mere clock or counter has reached a particular limit. All a generator, GCU, or hydraulic pump needs is a simple logic circuit to determine if it is working within its parameters, and warn the pilots if it is not. It can be monitored by a computer by all means, but a computer should not have executive control, unless there is a catastrophic situation developing.

I am a line pilot, and if we have just gone around at Innsbruck because we've lost both GCU's in our No 1 engine, and then the two GCU's in our No 2 engine quit in the climb out, and I later discover that all four GCU's quit; not because they overheated, not because they oversped, not because the voltage or frequency was wrong but because a bloody clock said so....That is really going to xxxx me off ! - assuming of course that a quadruple genny failure did not distract us so much in that valley that we flew into a mountain!

I like having FADECs to look after the engines and help me prevent exceedances, but I don't expect one to shut down simply because a register has become full.

System software designers must not lose sight of how their systems will be used, or the fact that such systems need to keep running unless a catastrophic or potentially catastrophic situation has arisen.

Clock/counter overflow is not catastrophic. Nor should the shut down potential of a clock/counter/register overflow need to be carefully checked for, because it should never be an issue in the first place.

Pilots don't have the luxury of being able to go through hundreds/thousands of lines of code at their desks all morning, coffee at their side, and eventually saying, "oh, here we go, I found the problem", they just need that genny or hydraulic pump to keep going.

deptrai
27th May 2015, 18:07
regarding the purported need for analog controls, and whether we need software...I came to think of lead software engineer Margaret Hamilton, and the code she wrote for Apollo 11, and a picture of it on a stack of paper as tall as her...the code which basically saved the moon landing by recovering from a malfunction: https://medium.com/@verne/margaret-hamilton-the-engineer-who-took-the-apollo-to-the-moon-7d550c73d3fa among other things, she also coined the term "software engineering" (it was treated more like an esoteric art previously, which it isn't)

computers and aircraft seem to be a combination that scares a lot of people (the impression is that bugs and "hackers" are everywhere), but I think it's a really cool combination, and the rare glitches such as this potential thing here haven't really caused any accidents yet, have they? on the contrary, software prevents accidents every day :)

MG23
27th May 2015, 18:11
Clock/counter overflow is not catastrophic. Nor should the shut down potential of a clock/counter/register overflow need to be carefully checked for, because it should never be an issue in the first place.

Computer systems failed all over the world a couple of years ago when there was a leap second. More will fail this year when the next leap second happens (I know a couple of mine will, so I'll have to shut them down before midnight, and restart after). Clocks suddenly jumping back can have really bad consequences on all kinds of code which expects the clock to start at 0 and only ever increment.

We also had a similar issue a few years ago with some hardware we bought which started spewing errors after being in operation for about four years. Turned out that, although their clock was 64 bits, someone had unintentionally copied it to a 32-bit variable and then copied that back to the 64-bit variable, so the top bits were always zero. The things you don't test are usually the things that don't work.
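
The 64-bit to 32-bit round trip described above is easy to reproduce. A minimal sketch in Python, where the function name and the sample tick values are hypothetical and nothing here is taken from the actual hardware: masking with 0xFFFFFFFF stands in for the accidental copy into a 32-bit variable.

```python
MASK32 = 0xFFFFFFFF  # the low 32 bits are all that a 32-bit variable can hold


def round_trip_through_32_bits(ticks64: int) -> int:
    """Simulate copying a 64-bit tick count into a 32-bit variable and back.

    The top 32 bits are silently discarded, so the result is only correct
    while the true count still fits in 32 bits.
    """
    return ticks64 & MASK32


# Hypothetical sample values, chosen only to straddle the 32-bit limit.
before_limit = 2**32 - 1
after_limit = 2**32 + 5

survives = round_trip_through_32_bits(before_limit)   # unchanged
truncated = round_trip_through_32_bits(after_limit)   # top bits lost
```

Everything looks fine in testing until the count first exceeds 2**32 - 1, which is why this kind of bug only shows up after a long time in service.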

MG23
27th May 2015, 21:40
Does your little Honda petrol generator need software to reliably produce 240V 50Hz or 110V 60Hz? No.

New ones apparently have electronic fuel injection, so probably... yes.

So what's that got to do with the control of a generator? It's still all about monitoring.

And, apparently, shutting it down when the monitoring says it exceeded safe parameters. As someone touched on earlier in the thread, if the 'safe parameter' is 'voltage didn't exceed X for Y seconds' and the time since the last check is calculated as -20,000,000 seconds because the counter just jumped back to zero, then the software may well barf and shut down because it doesn't know what's going on. For something that 'can never happen', that behaviour makes sense... until it happens.
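
To illustrate the "time since the last check" problem, here is a minimal sketch in Python. It assumes a hypothetical 32-bit unsigned tick counter that wraps to zero; the real GCU counter width and tick rate are not given in this thread, and the function names are made up for the example.

```python
COUNTER_BITS = 32                 # assumption: counter width is not stated in the thread
MODULUS = 1 << COUNTER_BITS       # number of distinct counter values


def naive_elapsed(now: int, last: int) -> int:
    """Plain subtraction: goes hugely negative once the counter wraps."""
    return now - last


def wrap_safe_elapsed(now: int, last: int) -> int:
    """Subtract modulo the counter size, so a single wrap is handled.

    Only valid if the true interval is shorter than one full counter period.
    """
    return (now - last) % MODULUS


# Last check was 10 ticks before the wrap, current reading is 5 ticks after it.
last_check = MODULUS - 10
current = 5

naive = naive_elapsed(current, last_check)       # large negative number
safe = wrap_safe_elapsed(current, last_check)    # 15 ticks
```

Code that only ever uses the naive form will work for the whole counter period and then see one nonsensical interval, which is exactly the kind of input a monitoring routine may treat as a fault.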

One of the benefits of software control systems over analogue is that you can make them as complex as you want. One of the downsides is that they may contain completely unexpected failure modes, while analogue systems tend to fail in predictable ways. Just because the software has worked perfectly for 248 days doesn't mean it won't fail completely after 249, whereas an analogue system will usually degrade before it fails.

SAMPUBLIUS
27th May 2015, 21:52
Peekay4 said No, it is perfectly clear.

Each GCU is independent. The counter is internal to each GCU. Failure of one GCU only affects that particular GCU, and will not affect any others.

yet the FAA doc first page says very clearly


https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf

This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power. We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane.


Loss of ALL AC power includes APU :ugh:

But further on says If the four main GCUs were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.

Now granted the 248 day is a stretch- but it's still a good bet that both engines ( 4 gen total ) are started within minutes of each other, and absent some special maintenance on one engine, the time count would be within minutes.

The role of the APU in that case is not well defined.

And is there a common GCU counter for all in the system ?

Can you start via windmill any engine with a ' bricked ' GEN ? as in the case where both engines drop off within a few minutes
Or can the APU cross feed to the engine ignition system ?? :confused:

Anyone KNOW for sure ?

tdracer
28th May 2015, 02:16
On the 787, engine ignition is 28 Vdc, not AC.

SAMPUBLIUS
28th May 2015, 03:01
Sure, but how is the 27-28 V DC generated ?? - AFAIK it is via inverter from AC system.

IOW if GEN is shut down ( still rotates but no output ), what about DC ??

tdracer
28th May 2015, 03:16
No, 28 Vdc for the igniters comes off the hot battery bus (igniter power available from the hot battery bus is a common design feature on Boeings, although on most the igniter power only drops to the battery bus when the main AC bus goes down). The whole reason they used DC for the igniters on the 787 is so they wouldn't need that current going through a DC to AC inverter.

peekay4
28th May 2015, 03:28
Loss of ALL AC power includes APU
No. In the context of the FAA AD, the APU is not considered, as presumably the APU may not be operating at the time of the GCU failures (see below for exact wording).

The AD does NOT mean integer overflow in one GCU will shut down the other three GCUs plus the APU as you seem to think. That is simply wrong!

The 787 has six generators: 2 per engine (4 total) plus 2 more on the APU. The FAA statement makes clear that the issue affects the four GCUs related to the engine, not any GCUs related to the APU.

Read the following sections from the FAA AD carefully:

The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing that GCU to go into failsafe mode.

Translation: overflow in one GCU unit only causes THAT one unit to go into failsafe mode. FAA wrote "that unit", not "all units".

If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time


Translation: clearly here the FAA is only talking about the four GCUs related to the engines, not the other two GCUs related to the APU.

And notice this simultaneous failure only occurs "if the four main GCUs ... were powered up at the same time".

Translation: if the four main GCUs were not powered up at the same time, failure in one GCU will not cause all others to fail.
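
The reading above (each GCU counts from its own power-up, so only units powered up together fail together) can be sketched in a few lines of Python. The tick rate and counter width are assumptions: the AD does not state them, but a signed 32-bit counter incremented every 10 ms is one combination that overflows after roughly 248.55 days. All names are hypothetical.

```python
TICKS_PER_SECOND = 100    # assumption: one tick every 10 ms, not stated in the AD
COUNTER_LIMIT = 2**31     # assumption: signed 32-bit counter overflows here
SECONDS_PER_DAY = 86_400


def days_until_overflow() -> float:
    """Days of continuous power before the assumed counter overflows."""
    return COUNTER_LIMIT / TICKS_PER_SECOND / SECONDS_PER_DAY


def overflow_days(power_up_days: list[float]) -> list[float]:
    """Day on which each independent GCU overflows, given its power-up day."""
    period = days_until_overflow()
    return [start + period for start in power_up_days]


# Four main GCUs powered up together: they all overflow at the same moment.
together = overflow_days([0.0, 0.0, 0.0, 0.0])

# Two of them power-cycled three days later: failures are three days apart.
staggered = overflow_days([0.0, 0.0, 3.0, 3.0])
```

Under these assumptions, a power cycle of any single GCU resets only that unit's own overflow date, which is why the periodic power-down in the AD works as an interim fix.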

SAMPUBLIUS
28th May 2015, 04:00
28 Vdc for the igniters comes off the hot battery bus

So if 4 gens get bricked due to timer- then the APU must be started via battery, which also powers igniter - and then APU can keep the igniter and battery charged and ?? going.

Even so - IMO the timer shutoff of GENs although unlikely is still a dumb idea.

And it takes x months to correct ?? :sad::mad:

Capn Bloggs
28th May 2015, 05:40
Peekay, correct me if I am wrong but, "when power is applied" obviously means external power, and when that is plugged in, the engines will not be running. So the GCUs are powered, even though their engines are not running. Surely the APU GCUs are in the same boat? "Powered" without the APU running? Scenario: cold ship, APU is started (or external power applied), as soon as power comes on-line, all 6 GCUs will be powered. 248 days later, all 6 GCUs will die.

You seem to be suggesting that the engine GCUs remain powered when the engines are stopped but the APU GCUs are "depowered" when the APU is stopped.

Tinfoil hat donned! ;)

SAMPUBLIUS
28th May 2015, 14:25
Capn Bloggs said Scenario: cold ship, APU is started (or external power applied), as soon as power comes on-line, all 6 GCUs will be powered. 248 days later, all 6 GCUs will die.

Capn may have a valid point !!

IMO, since the APU is usually shut down in mid flight, it may be unlikely that the APU 2 GENS will have the same time count.

Meanwhile since each engine has 2 GENS, then those 2 GENS will usually have the same time powered. And since both engines are often started within minutes of each other, then it is quite likely that 4 GENS could be bricked by a counter within minutes of each other.

BUT what is unclear is IF the GEN COUNTER TIME counts SYSTEM powered time such as when under Ground Power- then even though APU is off, the 2 APU GENS may also be counted as ' powered ' time.

One would certainly hope that Boeing would by now have clarified just how ' powered' time is counted by each GEN. :(

DozyWannabe
29th May 2015, 02:01
Perhaps, but when did you last hear one say "I know, I'll do this control system as a synchronous state machine"? Not once in the last 20 years I bet.
As I've stated many times, I don't work in real-time systems myself (Software Engineering is a job for me, not a calling), so it's unlikely that I'd be hearing it in the work environment. But I only graduated 14 years ago, and I still remember what a finite state machine is, and with a bit of revision could probably explain the difference between an NFA and a DFA.

The best ones get the highest paid jobs.
Not true based on the experiences of people that I know. For one thing, the kind of personality which tends to occur in those who have an innate talent for low-level bit-flipping doesn't usually lend itself to the corporate politics involved in climbing the career ladder.

And, actually, the best ones know that the most expensive part of building a safety critical system is passing certification.
OK, but a lot of the B787's systems were self-certified, so how does that fit the pattern?

So what's that got to do with the control of a generator? It's still all about monitoring.
Right - it's all about monitoring. Presumably (and this is marginally-educated guesswork on my part), the failsafe mode exists for when the monitoring software detects a problem in the unit. The counter presumably exists in order to timestamp the monitoring operations as they happen. Prudent engineering practice would also likely have any unknown error conditions cause the unit to enter failsafe mode because it's better to be safe than sorry. Therefore the signal to enter failsafe mode would come from the monitoring system, and the counter overflow - being an unexpected error condition - would cause the monitoring system to do so. So even if the control system was implemented in hardware, it wouldn't alter the situation.

dClbydalpha
30th May 2015, 10:02
This issue is not about the software as far as I see it. I am pretty sure that the software module was implemented and tested successfully against the module requirements. It is a systems engineering process issue, and that is a concern.

During design, implementation and certification the analysis never revealed that an important component is guaranteed to fail every 248 days. This is evident because had it been analysed then there would already be an aircrew/maintenance action in place. The aim of the certification and analysis process is to drive out and mitigate all potential latent defects to an acceptable probability/consequence level. In this case the process failed spectacularly. Luckily the outcome was unspectacular and easy to manage, but the failure of the process is still of great concern.