Boeing 787 integer overflow bug (https://www.pprune.org/tech-log/560793-boeing-787-integer-overflow-bug.html)

roulishollandais 22nd May 2015 00:38

Thank you blackbeard1 for that wonderful Cluster material and the link.
It was a pleasure to learn more about the boreal auroras and the latest studies.

Sunamer 22nd May 2015 08:16

"32bit signed value used as a counter running at 100Hz?"

I was wondering why you would need a signed value for a simple counter...
An unsigned one would give twice the range of the signed one... 248×2 ≈ 497 days = more than a year. :}
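
For what it's worth, a quick back-of-the-envelope check of those figures (a sketch assuming the widely reported 100 Hz tick rate):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const double tick_hz = 100.0;  /* reported counter tick rate */

        /* days to rollover = max count / ticks per second / seconds per day */
        printf("signed 32-bit:   %.1f days\n", (double)INT32_MAX  / tick_hz / 86400.0);  /* ~248.6 */
        printf("unsigned 32-bit: %.1f days\n", (double)UINT32_MAX / tick_hz / 86400.0);  /* ~497.1 */
        return 0;
    }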

"If you aircraft was powered for more than a year, don't forget to power cycle it..." kind of thing

The reaction to this non-issue from outlets like CNN was just... :yuk:

EEngr 22nd May 2015 15:34


I was wondering, why would you need to have a signed value if it is a simple counter...
Because signed integer math and conditional logic can give you positive/negative interval values, i.e. whether one event occurred before or after another. And there may be places in the code where this would be expected. :8
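
A minimal sketch of the idiom EEngr describes (illustrative names, not the actual GCU code): on two's-complement hardware, subtracting two free-running tick stamps as unsigned and reinterpreting the result as signed gives a positive or negative interval, valid while the two events are less than 2^31 ticks apart.

    #include <stdint.h>

    /* Negative result => t_a precedes t_b; positive => t_a follows it. */
    static int32_t interval_ticks(uint32_t t_a, uint32_t t_b)
    {
        return (int32_t)(t_a - t_b);   /* unsigned subtraction wraps mod 2^32 */
    }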

dClbydalpha 24th May 2015 09:43


non issue
Sorry, but this is anything but a non-issue. Looking at the information in the public domain, this is a systematic design failure.

1. The GCU control system fails after ~6,000 hours (248 days) of continuous power.
2. It is a common mode failure so no credit can be given to multiple systems.
3. The failure leads to loss of all AC.
4. Loss of all AC is at least HAZARDOUS.

Therefore a target of 1×10^-7 is fulfilled by a design struggling to meet 1×10^-4.

Firstly, the overflow error should be trapped at source (a sketch follows below). It adds complexity to the design, but it needs to be done in safety-critical systems.
Secondly, it appears the safety analysis has not fully analysed all the software failure modes ... if the software design process guidelines for safety-critical systems had been followed, this should have stood out like a sore thumb. This is the kind of thing that happens when people reuse the analysis from old designs without re-validating the original assumptions against the new design.
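
On the first point, a minimal sketch of trapping the overflow at source, assuming a simple 100 Hz tick handler (not the actual GCU implementation): the counter saturates at its maximum and raises a fault flag instead of wrapping negative.

    #include <stdint.h>
    #include <stdbool.h>

    static int32_t tick_count;      /* 100 Hz up-time counter  */
    static bool    tick_saturated;  /* illustrative fault flag */

    /* Checked increment: clamp at INT32_MAX and flag the condition
       for the health monitor rather than letting the value wrap. */
    static void on_tick(void)
    {
        if (tick_count < INT32_MAX)
            tick_count++;
        else
            tick_saturated = true;
    }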

In mechanical terms, if a fastener repeatedly loosens in flight, something is wrong. It is not acceptable to say that as long as it never comes totally undone we can just tighten it up each time; the fastener should be redesigned.

I have not seen a statement from Boeing that denies any of the four assumptions I have made, but I emphasise that I have no detailed knowledge, so this is based only on the public-domain information ... and on that basis it really worries me, because it isn't a "bug", it is a systematic failure.

roulishollandais 24th May 2015 17:51

dClbydalpha,
I agree with that.
Once again, learn the lesson from the Ariane 501 report: not only is it easy to avoid integer overflow by being very methodical, but the report showed a long list of other failures leading up to the fatal 37th second. Fixing any item on that list would have avoided the destruction of the rocket.

DeafOldFart 24th May 2015 22:06

Er..... how about tacking another 32 bits on, to make it a 64-bit count.... should take us into intergalactic durations....
Or running a slower clock speed, like the 1 kHz machines I cut my teeth on...
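
For scale, assuming the same 100 Hz tick, a 64-bit count does indeed approach the intergalactic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* lifetime of a 64-bit signed counter at a 100 Hz tick */
        double years = (double)INT64_MAX / 100.0 / 86400.0 / 365.25;
        printf("%.2g years\n", years);   /* ~2.9e9, about 2.9 billion years */
        return 0;
    }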

roulishollandais 25th May 2015 08:44

Hello msbbarratt,
Excellent post! :)

In which case the spec was junk
In the case of Ariane 501, somebody said the spec required stopping the trajectory calculation in the event of a double IRS failure... So that spec was not very smart!

And we often read on PPRuNe "It worked as designed".
The difficulty for the IT analyst is to guess where something could be missing or wrong in the specification! And we have to warn the people building the spec: "This could happen, do you want to accept it?", because we know the hidden side of the system and architecture that the final boss is not aware of (like DeafOldFart suggesting replacing the B787's overflowing 32-bit integer with a 64-bit one, or modifying the tick frequency :})

Let us hope this is the cheapest case for Boeing, but it probably will not be, as the certifiers jumped over the bug too.. :{

dClbydalpha 25th May 2015 10:53

I have just finished reading the linked item below.

http://www.faa.gov/about/plans_repor...port_final.pdf


As suspected, the usual observations are there: lack of ownership of requirements, inadequate V&V coverage, and the use of previous design experience without re-validating the design assumptions.

Nothing new, and that is what concerns me. Not disastrous as an individual item, no outright condemnations, but as the report shows the GCUs were a deep-dive item, the process seems to be struggling to manage the complexity and nature of these next-generation projects. In this case the inevitable system-level impact of a low-level design decision was not spotted, perhaps because of the number of responsibility boundaries that had to be crossed.

Uplinker 25th May 2015 11:20

Forgive me because I am not a software programmer, but any airborne safety-critical system - such as a GCU - that is required to work should not be even slightly open to being compromised or shut down by just a clock, or a clock malfunction.

The GCUs in this case do not fail; they are switched off because a clock says so. What does a mere clock know about the generator load, the CSD oil temperature and pressure, the serviceability of the other electrical systems in the network, etc.?

To have a healthy system shut down because a mere timer or a timer fault says so is crazy!!

How was it ever allowed to be designed this way?

EEngr 25th May 2015 14:53


However this all seems to have been some sort of surprise, and it shouldn't be.
This is the primary problem as I see it. The fact that the spec/design/test process appears to have a large hole in it, through which this bug slipped, needs to be investigated further.

The GCU reset every 248 days is, by itself, a non-issue. Like many other maintenance items, it can easily be taken care of once the issue is known, and few people would care. Some might: every maintenance step, no matter how trivial, incurs a cost to document and track at the operator's expense. So even one extra checkbox would raise a few questions, particularly if people understood how trivial the fix would have been back at the design stage.

But given the industry's increasing reliance on manufacturers' self-certification, and the regulators' hesitance to question anything process-related within a company, I'm not hopeful that other bugs haven't slipped through as well.

roulishollandais 25th May 2015 17:52

Thank you dClbydalpha

roulishollandais 25th May 2015 18:01

EEngr
The issue may be very different depending on whether your system is analogue (Concorde) or digital (Ariane 5, Airbus A320 family, B787).
In the former you may get saturation; in the latter, the unknown consequences of a carry/overflow flag, like the destruction of the rocket (8 billion FF) over the unused variable BH.
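
The contrast can be put in code. A minimal sketch of a saturating add, the analogue-style behaviour, as opposed to what a plain machine add does (wrap round to a hugely wrong value):

    #include <stdint.h>

    /* Widen to 64 bits, then clamp: the result pins at the "rail"
       (INT32_MAX/INT32_MIN) instead of wrapping like a raw add. */
    static int32_t add_saturating(int32_t a, int32_t b)
    {
        int64_t sum = (int64_t)a + (int64_t)b;
        if (sum > INT32_MAX) return INT32_MAX;
        if (sum < INT32_MIN) return INT32_MIN;
        return (int32_t)sum;
    }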

roulishollandais 26th May 2015 06:53

"Tout va très bien, Madame la Marquise"!

Although some people had retired, other teams had been working on Ariane 4 V33 and on Ariane 5... But they were focused on terrorism instead of science! They did not have a sufficient level of IT knowledge :mad: and were pursuing hidden geopolitical aims... :suspect:

There was confusion between the fact that both IRSs were not working and the way that diagnosis was made - with a double crazy overflow, followed by a long list of failures and a loss of rigour through excess optimism, trusting the first positive statistical results instead of tracking down the best proof.

Uplinker 26th May 2015 06:55

Clock, counter, whatever.

My point remains the same. We simply cannot have safety-critical and perfectly functional systems shutting down because of mere "housekeeping trivia". This needs to be addressed. Safety-critical systems should never be shut down by mere admin processes.

If it overheats: maybe. If the oil pressure drops: maybe. If it over speeds: yes. But an overflowing clock/counter? Definitely not!

I am a current line pilot, and although I am not a software programmer, I have written simple software programs, so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions. Something as important as a main generator should not be subject to anything more than a simple logic network which keeps it operational as long as its basic parameters remain within limits.
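
A minimal sketch of the kind of logic Uplinker means; every name and limit here is invented for illustration, not taken from the 787 GCU:

    #include <stdbool.h>

    /* The stay-online decision depends only on physical parameters,
       never on housekeeping state such as an elapsed-time counter. */
    typedef struct {
        double oil_temp_c;     /* hypothetical oil temperature, deg C */
        double oil_press_kpa;  /* hypothetical oil pressure, kPa      */
        double freq_hz;        /* generator output frequency          */
    } gen_params_t;

    static bool generator_stays_online(const gen_params_t *p)
    {
        return p->oil_temp_c < 130.0                     /* invented limit */
            && p->oil_press_kpa > 150.0                  /* invented limit */
            && p->freq_hz > 360.0 && p->freq_hz < 800.0; /* invented band  */
    }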

SAMPUBLIUS 26th May 2015 14:29

Computers are Super-Fast Idiots
 

so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions.
Amen Amen.

To err is human; to really screw up takes a computer.

The above comment is/was the point of my initial post in this thread.
Other comments along that line also apply.

There should/must be NO way an 'administrator' can shut down a critical system without recourse. PERIOD :mad:

EEngr 26th May 2015 15:55


Something as important as a main generator should not be subject to anything more than a simple logic network
Good luck with that. Modern aircraft have electrical systems far too complex to be operated by a 'simple logic network'. And airlines are not going back to the days of a flight engineer with a panel full of gauges and switches. Software is the only practical way of controlling and reconfiguring such a system to account for generators going on or off line and bus reconfiguration for various external power or autoland configurations.

What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed, or at least exposed to examination before a product is put into service. This isn't a big deal in the embedded s/w world: RTOS (real-time operating system) vendors have been producing libraries that handle such trivial things for years, in everything from my TV set to a controller in a nuclear power plant. My question is: who has the clout to hold Boeing's feet to the fire to adopt such processes?

roulishollandais 26th May 2015 17:36


vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes?
Don't dream, other sectors are not perfect! Fukushima is a disaster where, for years, they refused to heed the warnings of the hydrologists who said that water was the primary threat to the plant!
Where a fault can be made, someone will eventually make it.
We have to learn from our faults, sins and other mistakes...

Radix 26th May 2015 20:45

..........

peekay4 26th May 2015 21:03

There is no such thing as a perfect process or a perfect system. And furthermore, expecting (or depending on) perfection is the wrong thing to do, because it is unrealistic.

In fact, during certification of (new) aircraft, there is an acknowledgement that some defects will remain.

Hence, defects such as these -- while they should have been caught -- are not indicative of a process breakdown, certification breakdown, etc., but simply a reflection of reality.

The effects of any potential defect, however, should not be catastrophic. So what should be expected is a "graceful degradation" when failures do occur.

Actually a better analogy might be "defense in depth" used in security practice -- having multiple layers so that even a complete failure of one layer does not bring down the entire system.

The real question is then: even given a quadruple GCU failure taking down all four AC busses (due to this bug or some other malfunction) -- will that crash a 787?

Someone more familiar with 787s can correct me, but I think the answer is generally NO, as there is still the DC bus, which will automatically run from the batteries before the ram air turbine kicks in (or possibly from the APU as well).

DozyWannabe 26th May 2015 23:49

Time for a bit of a reality check, I feel...


Originally Posted by roulishollandais (Post 8963377)
No excuse after the Ariane 501 crash and report. NO!

In all fairness, the Ariane 501 scenario is a completely different kettle of fish from what we're talking about here. The former was a case of a hard-coded logical error involving number format translation and bit-depth conversion, whereas the latter was a case of integer counter overflow - however the crucial difference is that the former error occurred in a part of the program which was always expected to be executed, whereas the latter is very much an edge case (i.e. a scenario which is unlikely to occur in the real world). In practical terms we're talking about a scenario in which the aircraft has not once been in a "cold and dark" state for two-thirds of a year (ref: TURIN at post 54).
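
For comparison, a rough sketch (in C rather than the original Ada) of the Ariane 501 failure mode just described: the 64-bit floating-point horizontal bias, BH, was narrowed to a 16-bit signed integer without range protection.

    #include <stdint.h>

    static int16_t convert_bh(double bh)
    {
        /* Undefined for |bh| > 32767; in the Ada original the equivalent
           conversion raised an unhandled Operand Error that shut down
           both inertial reference units. */
        return (int16_t)bh;
    }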


Originally Posted by cattletruck (Post 8963384)
Congratulations to the tester that found the bug. Good testers think outside the box as this one had done...

Heh - I very much doubt that it was a single tester. Real-time software testing works rather differently from other disciplines. I suspect that it would more likely have been part of a suite of edge-case regressions intended to be added from the start.


Once fixed it's going to take another 248 days to re-run the test.
Nope, far more likely that the testing suite can increment the counter at any rate desired. :) Remember, it's not the counter itself that is the root of the issue as much as it is the dependent systems' ability to interpret the rollover correctly.
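
A sketch of that kind of regression, seeding the counter just short of the boundary instead of waiting 248 days; set_tick_count(), advance_one_tick() and gcu_healthy() are hypothetical test hooks, not a real GCU API.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical hooks into the unit under test */
    void set_tick_count(int32_t v);
    void advance_one_tick(void);
    bool gcu_healthy(void);

    static void test_rollover_boundary(void)
    {
        set_tick_count(INT32_MAX - 10);  /* 100 ms before the wrap */
        for (int i = 0; i < 20; i++)
            advance_one_tick();          /* crosses the boundary   */
        assert(gcu_healthy());           /* must survive the wrap  */
    }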


Originally Posted by Guptar (Post 8963617)
If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.

Home/business computing and real-time/safety-critical computing are entirely different worlds. I'm not going to go into detail now, but it's worth pointing out that safety-critical systems tend to use obsolete hardware because of its proven nature and significantly lesser complexity. (Engineering maxim : more complexity means more things that can go wrong). More to the point, using a 64-bit signed integer would just have kicked the "can" (problem) down the road.


Originally Posted by msbbarratt (Post 8988412)
2) the specification defines an up time consistent with normal aircraft operations, < 248 days, in which case the software would have to have been tested against it before it was certified

3) as per 2) but someone has also taken the trouble to go beyond the spec in their testing and discovered the true system up time.

My "money" would be on this.


However this all seems to have been some sort of surprise, and it shouldn't be. It should be there in black and white in the paperwork. And it may well be the case that it is all written down in the right place, but that someone else simply hasn't read it. I'm expecting that to be the case, actually.
Possibly - I was thinking that they were applying additional "layers" of edge case testing based on the likelihood of the scenario occurring as development time became less critical.


Originally Posted by msbbarratt (Post 8989650)
We can't even keep a GCU running for 249 days.

Er, I'd argue not only that we can, but also that we just did - by applying standardised software reliability metrics and techniques that we've been developing and perfecting for decades.

As another software person, I'm also well aware of the limitations you're talking about - but we're not talking about the same kind of inherently dynamic logic required for a "self-driving" car or a fully-automated aircraft here, we're talking about bog-standard systems monitoring logic behaviour in scenarios which are extremely unlikely to occur in the real world.


Originally Posted by EEngr (Post 8990217)
What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed.
...
Who has the clout to hold Boeing's feet to the fire to adopt such processes?

Again, I'd argue that the very fact we're discussing this now means that Boeing (and/or their subcontractors) already have those processes in place. We're not talking about a glaring software mistake that slipped through the cracks; it's far more likely to be a missed edge case in the specification, and the reason it wasn't covered until now is precisely because we're talking about an extremely unlikely real-world scenario. As I said above, we're in the realms of a hypothetical scenario in which the aircraft has not been powered down ("cold and dark") for *eight months*, and furthermore one in which each of the power units was brought online at around the same time and all of them were kept running for the entirety of those eight months.

As you quite rightly state, modern aircraft systems are incredibly complex these days, and it's therefore much more sensible to focus testing on the most likely scenarios first and then add layers of testing for less likely scenarios as the development and lifecycle of the product continue.

Don't get me wrong, this was undoubtedly an "oops" - I'm sure that several people who worked on these systems are now a little wiser and will swear to be more thorough in their work for the rest of their lives. Nevertheless, it's important that we all try to retain a little bit of perspective!

[I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...]

