PPRuNe Forums - View Single Post - Boeing 787 integer overflow bug
Old 26th May 2015, 23:49
  #80
DozyWannabe
 
Join Date: Jul 2002
Location: UK
Posts: 3,093
Time for a bit of a reality check, I feel...

Originally Posted by roulishollandais
No excuse after Ariane501 crash and report.NO !
In all fairness, the Ariane 501 scenario is a completely different kettle of fish from what we're talking about here. The former was a case of a hard-coded logical error involving number format translation and bit-depth conversion, whereas the latter is a case of integer counter overflow. The crucial difference, however, is that the former error occurred in a part of the program which was always expected to be executed, whereas the latter is very much an edge case (i.e. a scenario which is unlikely to occur in the real world). In practical terms we're talking about a scenario in which the aircraft has not once been in a "cold and dark" state for two-thirds of a year (ref: TURIN at post 54).

Originally Posted by cattletruck
Congratulations to the tester that found the bug. Good testers think outside the box as this one had done...
Heh - I very much doubt that it was a single tester. Real-time software testing works rather differently from other disciplines. I suspect that it would more likely have been part of a suite of edge-case regressions intended to be added from the start.

Once fixed it's going to take another 248 days to re-run the test.
Nope, far more likely that the testing suite can increment the counter at any rate desired. Remember, it's not the counter itself that is the root of the issue as much as it is the dependent systems' ability to interpret the rollover correctly.
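To make that concrete, here is a minimal Python sketch of how a test harness could preset the counter just short of rollover rather than waiting for it. Everything in it is hypothetical: the signed 32-bit counter, the names `wrap_int32`, `naive_elapsed` and `wrap_safe_elapsed`, and the tick values are assumptions for illustration only, not anything taken from the actual GCU software.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Wrap an arbitrary integer into the signed 32-bit range (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


def naive_elapsed(now, then):
    """Plain subtraction on the raw counter values: goes wrong across a rollover."""
    return now - then


def wrap_safe_elapsed(now, then):
    """Difference taken modulo 2**32, so a single rollover is handled correctly."""
    return (now - then) % 2**32


# The harness presets the counter 5 ticks before the maximum, then advances 10 ticks.
then = INT32_MAX - 5
now = wrap_int32(then + 10)  # the counter has rolled over to a large negative value

print(naive_elapsed(now, then))      # negative: a dependent system would misread this
print(wrap_safe_elapsed(now, then))  # the true elapsed tick count
```

The counter rolls over in both cases; what differs is whether the code consuming it copes, which is the point about dependent systems above.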

Originally Posted by Guptar
If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit.
Home/business computing and real-time/safety-critical computing are entirely different worlds. I'm not going to go into detail now, but it's worth pointing out that safety-critical systems tend to use obsolete hardware because of its proven nature and significantly lesser complexity. (Engineering maxim: more complexity means more things that can go wrong.) More to the point, using a 64-bit signed integer would just have kicked the "can" (problem) down the road.
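For a rough sense of the numbers, here is a back-of-envelope Python sketch. It assumes the counter ticks every 10 ms (100 ticks per second); that rate is not stated anywhere in this thread and is only an assumption that happens to line up with the 248-day figure. The function name `days_until_overflow` is made up for the example.

```python
SECONDS_PER_DAY = 86_400
TICKS_PER_SECOND = 100  # assumption: 10 ms tick, not stated in this thread


def days_until_overflow(bits, ticks_per_second=TICKS_PER_SECOND):
    """Days until a signed counter of the given width exceeds its maximum."""
    max_ticks = 2 ** (bits - 1)
    return max_ticks / ticks_per_second / SECONDS_PER_DAY


days_32 = days_until_overflow(32)
years_64 = days_until_overflow(64) / 365.25

print(f"32-bit signed: about {days_32:.2f} days")
print(f"64-bit signed: about {years_64:.2e} years")
```

Under that assumed tick rate the 32-bit case comes out at a little over 248 days, while the 64-bit case lands in the billions of years, so the rollover is moved a very long way off rather than removed.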

Originally Posted by msbbarratt
2) the specification defines an up time consistent with normal aircraft operations, < 248 days, in which case the software would have to have been tested against it before it was certified

3) as per 2) but someone has also taken the trouble to go beyond the spec in their testing and discovered the true system up time.
My "money" would be on this.

However this all seems to have been some sort of surprise, and it shouldn't be. It should be there in black and white in the paperwork. And it may well be the case that it is all written down in the right place, but that someone else simply hasn't read it. I'm expecting that to be the case, actually.
Possibly - I was thinking that they were applying additional "layers" of edge case testing based on the likelihood of the scenario occurring as development time became less critical.

Originally Posted by msbbarratt
We can't even keep a GCU running for 249 days.
Er, I'd argue not only that we can, but also that we just did - by applying standardised software reliability metrics and techniques that we've been developing and perfecting for decades.

As another software person, I'm also well aware of the limitations you're talking about - but we're not talking about the same kind of inherently dynamic logic required for a "self-driving" car or a fully-automated aircraft here, we're talking about the behaviour of bog-standard systems monitoring logic in scenarios which are extremely unlikely to occur in the real world.

Originally Posted by EEngr
What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed.
...
Who has the clout to hold Boeing's feet to the fire to adopt such processes?
Again, I'd argue that the very fact we're discussing this now means that Boeing (and/or their subcontractors) already have those processes in place. We're not talking about a glaring software mistake that slipped through the cracks, it's far more likely to be a missed edge-case in the specification - and the reason it wasn't covered until now is precisely because we're talking about an extremely unlikely real-world scenario. As I said above, we're in the realms of a hypothetical scenario in which the aircraft has not been powered down ("cold and dark") for *eight months*. Furthermore, the scenario requires that each of the power units was brought online around the same time and that all of them were kept running for the entirety of those eight months.

As you quite rightly state, modern aircraft systems are incredibly complex these days, and it's therefore much more sensible to focus testing on the most likely scenarios first and then add layers of testing for less likely scenarios as the development and lifecycle of the product continues.

Don't get me wrong, this was undoubtedly an "oops" - I'm sure that several people who worked on these systems are now a little wiser and will swear to be more thorough in their work for the rest of their lives. Nevertheless, it's important that we all try to retain a little bit of perspective!

[I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...]