Boeing 787 integer overflow bug
Joined: Aug 2012
Posts: 1
Likes: 0
"32bit signed value used as a counter running at 100Hz?"
I was wondering, why would you need to have a signed value if it is a simple counter...
An unsigned one would give twice the range of the signed one... 248*2 = 496 days, more than a year.
"If your aircraft has been powered for more than a year, don't forget to power cycle it..." kind of thing
The reaction to this non-issue from outlets like CNN was just...
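For anyone wanting to check the numbers in the post above, here is a minimal sketch. It assumes the setup quoted in this thread (a 32-bit counter ticking at 100 Hz); the function name is made up for illustration and has nothing to do with the real GCU software.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def days_until_overflow(value_bits: int, tick_hz: int) -> float:
    """Days until a counter with `value_bits` usable bits runs out at `tick_hz`."""
    max_ticks = 2 ** value_bits
    return max_ticks / tick_hz / SECONDS_PER_DAY

# Assumption from the thread: 100 Hz tick rate.
signed_days = days_until_overflow(31, 100)    # signed 32-bit: 31 value bits
unsigned_days = days_until_overflow(32, 100)  # unsigned 32-bit: 32 value bits

print(f"signed 32-bit:   {signed_days:.1f} days")
print(f"unsigned 32-bit: {unsigned_days:.1f} days")
```

The signed case comes out at roughly 248.6 days and the unsigned case at exactly twice that, so "more than a year" holds for the unsigned counter.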

Joined: Jan 2011
Posts: 780
Likes: 89
From: Seattle
I was wondering, why would you need to have a signed value if it is a simple counter...
Joined: Jan 2011
Posts: 217
Likes: 0
From: on the cusp
non issue
1. The GCU control system fails after ~7000 hours.
2. It is a common mode failure so no credit can be given to multiple systems.
3. The failure leads to loss of all AC.
4. Loss of all AC is at least HAZARDOUS.
Therefore a target of 1x10^-7 has to be fulfilled by a design struggling to meet 1x10^-4.
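A rough check of the orders of magnitude in the list above, under stated assumptions: overflow after 2^31 ticks at 100 Hz (the figures quoted in this thread), and one guaranteed failure per overflow interval averaged out per hour. That averaging is a simplification for illustration, not a formal safety assessment.

```python
TICK_HZ = 100              # assumed counter rate (from the thread)
OVERFLOW_TICKS = 2 ** 31   # signed 32-bit counter

hours_to_overflow = OVERFLOW_TICKS / TICK_HZ / 3600

# Simplification: one certain failure per overflow interval, averaged per hour.
rate_per_hour = 1 / hours_to_overflow

print(f"hours to overflow:    {hours_to_overflow:.0f}")
print(f"average failure rate: {rate_per_hour:.1e} per hour")
print(f"ratio to 1e-7 target: {rate_per_hour / 1e-7:.0f}x")
```

Under these assumptions the interval is a little under 6,000 hours, the same order as the ~7000 quoted, and the averaged rate lands in the 1x10^-4 region rather than anywhere near 1x10^-7.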
Firstly the overflow error should be trapped at source. It adds complexity to design, but it needs to be done in safety critical systems.
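To show what "trapped at source" could look like, here is a minimal sketch. It is not the GCU code; Python integers never overflow, so the 32-bit limit is checked explicitly, and all names are hypothetical.

```python
INT32_MAX = 2 ** 31 - 1

class CounterOverflow(Exception):
    """Raised instead of letting the counter wrap silently."""

def checked_increment(ticks: int) -> int:
    """Return ticks + 1, or raise if the result would not fit in a signed 32-bit value."""
    if ticks >= INT32_MAX:
        raise CounterOverflow("tick counter would exceed INT32_MAX")
    return ticks + 1

def safe_tick(ticks: int):
    """Increment; on overflow hold the last value and report a fault flag."""
    try:
        return checked_increment(ticks), False
    except CounterOverflow:
        # No silent wrap: the caller gets an explicit fault and decides how to degrade.
        return ticks, True

print(safe_tick(41))
print(safe_tick(INT32_MAX))
```

The point of the design is only that the overflow becomes a visible, handled event at the place it occurs, rather than a corrupted value discovered downstream.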
Secondly it appears the safety analysis has not fully analysed all the software failures ... if the software design process guidelines for safety critical systems had been followed then this should have stood out like a sore thumb. This is the kind of thing that happens when people use the analysis from old designs, without re-validating the original assumptions against the new design.
In mechanical terms, if a fastener repeatedly loosens in flight there is something wrong. It is not acceptable to say that it didn't come totally undone, so as long as we tighten it up each time it is OK; the fastener should be redesigned.
I have not seen a statement from Boeing that denies any of the 4 assumptions I have made, but I emphasise that I have no detailed knowledge, so this is based only on public domain information ... but based on that it really worries me, because it isn't a "bug", it is a systematic failure.
Joined: Jun 2011
Posts: 760
Likes: 0
From: france
dClbydalpha,
I agree with that.
Once again, learn the lesson from the Ariane 501 report: not only is integer overflow easy to avoid by being methodical, but the report showed that a long list of other failures led up to the fatal 37th second. Fixing any one item on that list would have avoided the rocket's destruction.
Joined: Jul 2013
Posts: 40
Likes: 0
From: Medway towns
Er..... how about tacking another 32 bit address on, to make it 64 bit count.... should take us into intergalactic durations....
Or running a slower clock speed, like the 1khz machines I cut my teeth on...
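A quick sketch of what the two suggestions above would buy, assuming the 100 Hz figure quoted in this thread. The 1 Hz case is just an illustrative slower rate, not a proposal from anyone here.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_until_overflow(value_bits: int, tick_hz: float) -> float:
    """Years until a counter with `value_bits` usable bits runs out at `tick_hz`."""
    return 2 ** value_bits / tick_hz / SECONDS_PER_YEAR

# (usable bits, tick rate in Hz): signed 32-bit at 100 Hz and 1 Hz, signed 64-bit at 100 Hz
for bits, hz in [(31, 100), (31, 1), (63, 100)]:
    print(f"{bits} value bits at {hz} Hz: {years_until_overflow(bits, hz):.3g} years")
```

A signed 64-bit count at 100 Hz lasts on the order of billions of years, and even a signed 32-bit count at 1 Hz lasts decades. Whether either change is acceptable for the rest of the system is a separate question this sketch does not address.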
Joined: Jun 2011
Posts: 760
Likes: 0
From: france
Hello msbbarratt,
Excellent post!
In which case the spec was junk
In the case of Ariane 501, somebody said the spec required the trajectory calculation to stop in the event of a double IRS failure... So that spec was not very smart!
And we often read on PPRuNe "It worked as designed".
The difficulty for the IT analyst is guessing where something could be missing or wrong in the specification! And we have to warn the people who are writing the spec: "That could happen, do you want to accept that?", because we know the hidden side of the system and architecture that the final boss is not aware of (like DeafOldFart suggesting replacing the B787's overflowing 32-bit integer with a 64-bit integer, or changing the frequency).
Let us hope this turns out to be the cheapest case for Boeing, but it probably will not be, as the certifiers skipped over the bug too.
Joined: Jan 2011
Posts: 217
Likes: 0
From: on the cusp
I have just finished reading the linked item below.
http://www.faa.gov/about/plans_repor...port_final.pdf
As suspected, the usual observations are there. Lack of ownership of requirements, inadequate V&V coverage and use of previous design experience without re-validating the design assumptions.
Nothing new, and that is what concerns me. Not disastrous as an individual item, no outright condemnations, but as the report shows that the GCUs were a deep-dive item, the process seems to be struggling with managing the complexity and nature of these next generation projects. In this case the inevitable system level impact of a low-level design decision was not spotted, perhaps due to the number of responsibility boundaries that had to be crossed.
Last edited by dClbydalpha; 25th May 2015 at 10:56. Reason: Correcting link



Joined: Nov 1999
Aviation Qualifications: ATPL
Posts: 3,150
Likes: 744
From: UK
Forgive me because I am not a software programmer, but any airborne safety critical system - such as a GCU - that is required to work should not be even slightly open to being compromised or shut down by just a clock, or a clock malfunction.
The GCUs in this case do not fail, they are switched off because a clock says so. What does a mere clock know about the generator load, the CSD oil temperature and pressure, the serviceability of the other electrical systems in the network etc?
To have a healthy system shut down because a mere timer or a timer fault says so is crazy!!
How was it ever allowed to be designed this way?

Joined: Jan 2011
Posts: 780
Likes: 89
From: Seattle
However this all seems to have been some sort of surprise, and it shouldn't be.
The whole GCU reset every 248 days, by itself, is a non-issue. That (like many other maintenance items) can easily be taken care of once the issue is known and few people would care. Some might. Every maintenance step, no matter how trivial, incurs a cost to document and track at the operator's expense. So even one extra check box would raise a few questions. Particularly if they understood how trivial the fix would have been back in the design stage.
But what with the industry's increasing reliance on manufacturers' self-certification and the regulators' hesitance to question anything process related within a company, I'm not hopeful that other bugs haven't slipped through as well.
Joined: Jun 2011
Posts: 760
Likes: 0
From: france
EEngr
The issue may be very different depending on whether your system is analogue (Concorde) or digital (Ariane 5, Airbus A320 family, B787).
In the former you may get saturation; in the latter, the unknown consequences of a carry/overflow, like the destruction of the rocket (8 billion FF) over an unused variable, BH.
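The difference between the two behaviours can be sketched as follows. This simulates signed 32-bit arithmetic in Python (whose integers are unbounded), and the helper names are hypothetical; it is not taken from any of the systems mentioned.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

def wrapping_add_i32(a: int, b: int) -> int:
    """Two's-complement wraparound, as plain 32-bit integer hardware behaves."""
    return (a + b - INT32_MIN) % 2 ** 32 + INT32_MIN

def saturating_add_i32(a: int, b: int) -> int:
    """Clamp at the limits, which is closer to how an analogue signal saturates."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

print(wrapping_add_i32(INT32_MAX, 1))    # jumps to the most negative value
print(saturating_add_i32(INT32_MAX, 1))  # stays pinned at the maximum
```

With wraparound a large positive count suddenly becomes a large negative one, which is the kind of discontinuity that downstream logic rarely expects; saturation at least keeps the value near the truth.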
Joined: Jun 2011
Posts: 760
Likes: 0
From: france
"Tout va très bien, Madame la Marquise"!
Although some people had retired, other teams had been working on Ariane 4 V33 and on Ariane 5... But they were focused on terrorism instead of science! They did not have a sufficient level of IT knowledge
and were pursuing hidden geopolitical aims...
There was confusion between the fact that both IRS were not working and how that diagnosis was made: with a double crazy carry, followed by a long list of failures and a loss of rigour through excess optimism, trusting the first positive statistical results instead of seeking the best proof.



Joined: Nov 1999
Aviation Qualifications: ATPL
Posts: 3,150
Likes: 744
From: UK
Clock, counter, whatever.
My point remains the same. We simply cannot have safety critical and perfectly functional systems shutting down because of mere "housekeeping trivia". This needs to be addressed. Safety critical systems should never be shut down by mere admin processes.
If it overheats: maybe. If the oil pressure drops: maybe. If it over speeds: yes. But an overflowing clock/counter? Definitely not!
I am a current line pilot, and although I am not a software programmer, I have written simple software programs, so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions. Something as important as a main generator should not be subject to anything more than a simple logic network which keeps it operational as long as its basic parameters remain within limits.
Thread Starter
Joined: Apr 2014
Posts: 0
Likes: 0
From: Washstate
so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions.
To err is human; to really screw up takes a computer.
The above comment is/was the point of my initial post in this thread.
Other comments along that line also apply.
There should/must be NO way an 'administrator' is able to shut down a critical system without recourse. PERIOD

Joined: Jan 2011
Posts: 780
Likes: 89
From: Seattle
Something as important as a main generator should not be subject to anything more than a simple logic network
What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed. Or at least exposed to examination before a product is put into service. This isn't a big deal in the embedded s/w world. The RTOS (Real Time Operating Systems) vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes?
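As an illustration of the kind of "trivial thing" such libraries take care of, here is a minimal sketch of a rollover-safe elapsed-time calculation on a free-running 32-bit tick counter. It is a generic technique written for this thread, not any vendor's API, and it is only valid when the real interval is shorter than one full wrap (2^32 ticks).

```python
MASK32 = 0xFFFFFFFF

def elapsed_ticks(now: int, start: int) -> int:
    """Ticks from `start` to `now` on an unsigned 32-bit counter, correct across one wrap."""
    return (now - start) & MASK32

def has_timed_out(now: int, start: int, timeout_ticks: int) -> bool:
    """Compare intervals, never absolute counter values."""
    return elapsed_ticks(now, start) >= timeout_ticks

start = 0xFFFFFFF0   # shortly before the counter wraps
now = 0x00000010     # shortly after it has wrapped
print(elapsed_ticks(now, start))
print(has_timed_out(now, start, timeout_ticks=100))
```

Because only differences are ever used, the wrap itself stops being an event the rest of the software has to know about.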
Joined: Jun 2011
Posts: 760
Likes: 0
From: france
vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes?
Where a fault can be made, sooner or later somebody will make it.
We have to learn from our faults, sins and other mistakes...
Joined: Sep 2014
Posts: 1,256
Likes: 0
From: Canada
There is no such thing as a perfect process or a perfect system. And furthermore, expecting (or depending on) perfection is the wrong thing to do, because it is unrealistic.
In fact, during certification of (new) aircraft, there is an acknowledgement that some defects will remain.
Hence, defects such as these -- while they should have been caught -- are not indicative of a process breakdown, certification breakdown, etc., but simply a reflection of reality.
The effects of any potential defect, however, should not be catastrophic. So what should be expected is a "graceful degradation" when failures do occur.
Actually a better analogy might be "defense in depth" used in security practice -- having multiple layers so that even a complete failure of one layer does not bring down the entire system.
The real question is then: even given a quadruple GCU failure taking down all four AC busses (due to this bug or some other malfunction) -- will that crash a 787?
Someone more familiar with 787s can correct me, but I think the answer is generally NO, as there is still the DC bus, which will automatically run from batteries before the ram air system kicks in (or possibly from the APU as well).
Joined: Jul 2002
Posts: 3,093
Likes: 0
From: UK
Time for a bit of a reality check, I feel...
In all fairness, the Ariane 501 scenario is a completely different kettle of fish from what we're talking about here. The former was a case of a hard-coded logical error involving number format translation and bit-depth conversion, whereas the latter was a case of integer counter overflow - however the crucial difference is that the former error occurred in a part of the program which was always expected to be executed, whereas the latter is very much an edge case (i.e. a scenario which is unlikely to occur in the real world). In practical terms we're talking about a scenario in which the aircraft has not once been in a "cold and dark" state for two-thirds of a year (ref: TURIN at post 54).
Heh - I very much doubt that it was a single tester. Real-time software testing works rather differently from other disciplines. I suspect that it would more likely have been part of a suite of edge-case regressions intended to be added from the start.
Nope, far more likely that the testing suite can increment the counter at any rate desired.
Remember, it's not the counter itself that is the root of the issue as much as it is the dependent systems' ability to interpret the rollover correctly.
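The two points above (a test rig that can set the counter to whatever it likes, and the dependent logic being what matters) can be sketched together. Everything here is hypothetical and written for illustration; it assumes a free-running 32-bit counter and is not how Boeing or its suppliers actually test.

```python
MASK32 = 0xFFFFFFFF

class TickCounter:
    """Hypothetical free-running 32-bit counter whose start value a test can choose."""

    def __init__(self, start: int = 0):
        self.value = start & MASK32

    def tick(self, n: int = 1) -> int:
        self.value = (self.value + n) & MASK32
        return self.value

def naive_elapsed(now: int, start: int) -> int:
    return now - start             # goes negative once the counter has wrapped

def wrap_safe_elapsed(now: int, start: int) -> int:
    return (now - start) & MASK32  # still correct across a single wrap

# No need to wait 248 days: start the counter five ticks before rollover.
counter = TickCounter(start=MASK32 - 4)
start = counter.value
now = counter.tick(10)

print(naive_elapsed(now, start))
print(wrap_safe_elapsed(now, start))
```

Run this way, the rollover case takes milliseconds to exercise, and it shows directly whether the consumer of the counter copes with it.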
Home/business computing and real-time/safety-critical computing are entirely different worlds. I'm not going to go into detail now, but it's worth pointing out that safety-critical systems tend to use obsolete hardware because of its proven nature and significantly lesser complexity. (Engineering maxim : more complexity means more things that can go wrong). More to the point, using a 64-bit signed integer would just have kicked the "can" (problem) down the road.
My "money" would be on this.
Possibly - I was thinking that they were applying additional "layers" of edge case testing based on the likelihood of the scenario occurring as development time became less critical.
Er, I'd argue not only that we can, but also that we just did - by applying standardised software reliability metrics and techniques that we've been developing and perfecting for decades.
As another software person, I'm also well aware of the limitations you're talking about - but we're not talking about the same kind of inherently dynamic logic required for a "self-driving" car or a fully-automated aircraft here, we're talking about bog-standard systems monitoring logic behaviour in scenarios which are extremely unlikely to occur in the real world.
Again, I'd argue that the very fact we're discussing this now means that Boeing (and/or their subcontractors) already have those processes in place. We're not talking about a glaring software mistake that slipped through the cracks, it's far more likely to be a missed edge-case in the specification - and the reason it wasn't covered until now is precisely because we're talking about an extremely unlikely real-world scenario. As I said above, we're in the realms of a hypothetical scenario in which the aircraft has not been powered down ("cold and dark") for *eight months*. Furthermore, that each of the power units was brought online around the same time and all of them were kept running for the entirety of those eight months.
As you quite rightly state, modern aircraft systems are incredibly complex these days, and it's therefore much more sensible to focus testing on the most likely scenarios first and then add layers of testing for less likely scenarios as the development and lifecycle of the product continues.
Don't get me wrong, this was undoubtedly an "oops" - I'm sure that several people who worked on these systems are now a little wiser and will swear to be more thorough in their work for the rest of their lives. Nevertheless, it's important that we all try to retain a little bit of perspective!
[I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...]
Once fixed it's going to take another 248 days to re-run the test.
2) the specification defines an up time consistent with normal aircraft operations, < 248 days, in which case the software would have to have been tested against it before it was certified
3) as per 2) but someone has also taken the trouble to go beyond the spec in their testing and discovered the true system up time.
However this all seems to have been some sort of surprise, and it shouldn't be. It should be there in black and white in the paperwork. And it may well be the case that it is all written down in the right place, but that someone else simply hasn't read it. I'm expecting that to be the case, actually.



