Thank you blackbeard1 for that wonderful Cluster and link.
I had the pleasure of learning more about the aurora borealis and the latest studies. |
"32bit signed value used as a counter running at 100Hz?"
I was wondering, why would you need a signed value if it is a simple counter... An unsigned one would give twice the range of the signed one... 248*2 = more than a year. :} "If your aircraft was powered for more than a year, don't forget to power cycle it..." kind of thing. The reaction to this non issue from outlets like CNN was just... :yuk: |
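The 248-day figure follows directly from the numbers quoted above. A minimal sketch of the arithmetic, assuming the counter really is a 32-bit value incremented at exactly 100 Hz from power-up (the function name is illustrative, not taken from any GCU source):

```python
SECONDS_PER_DAY = 86_400
TICK_HZ = 100  # counter rate quoted in the thread


def days_until_overflow(max_count: int, tick_hz: int = TICK_HZ) -> float:
    """Days of continuous power before a counter reaches max_count."""
    return max_count / tick_hz / SECONDS_PER_DAY


signed_days = days_until_overflow(2**31 - 1)    # signed 32-bit maximum
unsigned_days = days_until_overflow(2**32 - 1)  # unsigned 32-bit maximum

print(f"signed 32-bit:   {signed_days:.1f} days")
print(f"unsigned 32-bit: {unsigned_days:.1f} days")
```

On these assumptions the signed counter lasts a little over 248 days and the unsigned one roughly twice that, so going unsigned only postpones the problem.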
I was wondering, why would you need to have a signed value if it is a simple counter... |
non issue
1. The GCU control system fails after ~7000 hours.
2. It is a common mode failure, so no credit can be given to multiple systems.
3. The failure leads to loss of all AC.
4. Loss of all AC is at least HAZARDOUS.
Therefore a target of 1x10^-7 is fulfilled by a design struggling to meet 1x10^-4.
Firstly, the overflow error should be trapped at source. It adds complexity to the design, but it needs to be done in safety critical systems. Secondly, it appears the safety analysis has not fully analysed all the software failures ... if the software design process guidelines for safety critical systems had been followed, then this should have stood out like a sore thumb. This is the kind of thing that happens when people use the analysis from old designs without re-validating the original assumptions against the new design.
In mechanical terms, if a fastener repeatedly loosens in flight there is something wrong. It is not acceptable to say that it didn't come totally undone, so as long as we tighten it up each time it is ok; the fastener should be redesigned.
I have not seen a statement from Boeing that denies any of the 4 assumptions I have made, but I emphasise that I have no detailed knowledge, so this is based only on the public domain information ... but based on that it really worries me, because it isn't a "bug", it is a systematic failure. |
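What "trapped at source" could look like, as a hedged sketch only. Python integers do not overflow, so the 32-bit signed behaviour is simulated here, and the names `wrapping_increment`, `checked_increment` and `CounterOverflow` are hypothetical, not taken from the actual GCU software:

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


class CounterOverflow(Exception):
    """Raised instead of letting the counter wrap silently."""


def wrapping_increment(count: int) -> int:
    """Simulate an unchecked two's-complement 32-bit increment."""
    return (count + 1 - INT32_MIN) % 2**32 + INT32_MIN


def checked_increment(count: int) -> int:
    """Increment, but report the overflow at the point where it occurs."""
    if count >= INT32_MAX:
        raise CounterOverflow("uptime counter reached INT32_MAX")
    return count + 1


# Unchecked: the counter silently jumps to a large negative value.
print(wrapping_increment(INT32_MAX))

# Checked: the caller is forced to decide what happens next.
try:
    checked_increment(INT32_MAX)
except CounterOverflow as exc:
    print("trapped:", exc)
```

The check itself is trivial. The design work is in deciding what the system should do once the condition is trapped, which is exactly what a safety analysis is supposed to force.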
dClbydalpha,
I agree with that. Once again, learn the lesson of the Ariane501 report: not only is integer overflow easy to avoid by being methodical, but the report showed that a long list of other failures led up to the fatal 37th second. Getting any one item on that list right should have avoided the destruction of the rocket. |
Er..... how about tacking another 32 bit address on, to make it 64 bit count.... should take us into intergalactic durations....
Or running a slower clock speed, like the 1 kHz machines I cut my teeth on... |
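For a rough sense of scale for both suggestions, here is a small sketch assuming a signed counter that simply counts ticks from power-up. The widths and tick rates are illustrative only, and the helper name is made up for this example:

```python
SECONDS_PER_YEAR = 365.25 * 86_400


def years_until_overflow(bits: int, tick_hz: float) -> float:
    """Years of continuous power before a signed counter of `bits` width overflows."""
    max_count = 2 ** (bits - 1) - 1
    return max_count / tick_hz / SECONDS_PER_YEAR


for bits, tick_hz in [(32, 100), (32, 10), (64, 100)]:
    years = years_until_overflow(bits, tick_hz)
    print(f"{bits}-bit signed at {tick_hz} Hz: {years:.3g} years")
```

A slower tick buys a few years at the cost of timing resolution, while a 64-bit count at 100 Hz lasts billions of years. Neither removes the need to decide what the software does if the limit is ever reached.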
Hello msbbarratt,
Excellent post! :) "In which case the spec was junk" And we often read on PPRuNe "It worked as designed". The difficulty for the IT analyst is to guess where something could be missing or wrong in the specification! And we have to warn the people who are building the spec: "That could happen, do you want to accept that?" because we know the hidden side of the system and architecture that the final boss is not aware of (like DeafOldFart suggesting replacing the B787's overflowed 32-bit integer with a 64-bit integer or modifying the frequency :}). Let us hope it is the cheapest case for Boeing, but it probably will not be, as the certifiers jumped over the bug too.. :{ |
I have just finished reading the linked item below.
http://www.faa.gov/about/plans_repor...port_final.pdf As suspected, the usual observations are there. Lack of ownership of requirements, inadequate v&v coverage and use of previous design experience without re-validating the design assumptions. Nothing new, and that is what concerns me. Not disastrous as an individual item, no outright condemnations, but as the report shows that the GCUs were a deep-dive item, the process seems to be struggling with managing the complexity and nature of these next generation projects. In this case the inevitable system level impact of a low-level design decision was not spotted, perhaps due to the number of responsibility boundaries that had to be crossed. |
Forgive me because I am not a software programmer, but any airborne safety critical system - such as a GCU - that is required to work should not be even slightly open to being compromised or shut down by just a clock, or a clock malfunction.
The GCUs in this case do not fail, they are switched off because a clock says so. What does a mere clock know about the generator load, the CSD oil temperature and pressure, the serviceability of the other electrical systems in the network etc? To have a healthy system shut down because a mere timer or a timer fault says so is crazy!! How was it ever allowed to be designed this way? |
However this all seems to have been some sort of surprise, and it shouldn't be. The whole GCU reset every 248 days, by itself, is a non issue. That (like many other maintenance items) can easily be taken care of once the issue is known and few people would care. Some might. Every maintenance step, no matter how trivial, incurs a cost to document and track at the operator's expense. So even one extra check box would raise a few questions. Particularly if they understood how trivial the fix would have been back in the design stage. But what with the industry's increasing reliance on manufacturers' self certification and the regulators' hesitance at questioning anything process related within a company, I'm not hopeful that other bugs haven't slipped through as well. |
Thank you dClbydalpha
|
EEngr
The issue may be very different depending on whether your system is analog (Concorde) or digital (Ariane5, Airbus320 family, B787). In the former you may get saturation; in the latter, the unknown consequences of a carry/overflow indicator, like the destruction of the rocket (8 billion FF) because of an unused variable, BH. |
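The contrast between saturation and an unhandled overflow can be sketched in a few lines. In the Ariane 501 case the inquiry report describes a 64-bit floating point value (related to BH, horizontal bias) being converted to a 16-bit signed integer. The function names below are hypothetical, and this is only an illustration of the two behaviours, not the flight code:

```python
INT16_MIN = -2**15
INT16_MAX = 2**15 - 1


def to_int16_saturating(value: float) -> int:
    """Clamp to the 16-bit range, much as an analog signal would saturate."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def to_int16_checked(value: float) -> int:
    """Refuse to convert a value that does not fit, so the caller must handle it."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return result


print(to_int16_saturating(40_000.0))  # clamped to INT16_MAX
print(to_int16_checked(1_234.0))      # fits, converted normally
```

Saturation keeps the system running on a wrong but bounded value, while the checked version makes the error explicit. Which is appropriate depends on what the value is used for, and that is a specification decision rather than a coding one.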
"Tout va très bien, Madame la Marquise"!
Although some people had retired, other teams had been working on Ariane4 V33 and on Ariane5... But they were focused on terrorism instead of science! They did not have a sufficient level of IT knowledge :mad: and were pursuing hidden geopolitical aims... :suspect: There was confusion between the fact that both IRS did not work and how that diagnosis was made: with a double crazy carry, followed by a long list of failures and a loss of rigor with an excess of optimism, trusting the first positive statistical results instead of pursuing the best proof. |
Clock, counter, whatever.
My point remains the same. We simply cannot have safety critical and perfectly functional systems shutting down because of mere "housekeeping trivia". This needs to be addressed. Safety critical systems should never be shut down by mere admin processes. If it overheats: maybe. If the oil pressure drops: maybe. If it over speeds: yes. But an overflowing clock/counter? Definitely not! I am a current line pilot, and although I am not a software programmer, I have written simple software programs, so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions. Something as important as a main generator should not be subject to anything more than a simple logic network which keeps it operational as long as its basic parameters remain within limits. |
Computers are Super-Fast Idiots
so I know all too well that a computer will very literally only do what you tell it to. It will not do what a human would do. It will not make assumptions or "know" the consequences of its actions or non actions. To err is human, to really screw up takes a computer. The above comment is/was the point of my initial post in this thread. Other comments along that line also apply. There should/must be NO way an 'administrator' should be able to shut down a critical system without recourse. PERIOD :mad: |
Something as important as a main generator should not be subject to anything more than a simple logic network What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed. Or at least exposed to examination before a product is put into service. This isn't a big deal in the embedded s/w world. The RTOS (Real Time Operating Systems) vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes? |
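One common idiom for this is to compare tick counts with modular arithmetic, so that a wrapped counter still yields the correct elapsed time. This is a generic sketch, not any particular RTOS vendor's API, and the helper names are made up for illustration. It assumes an unsigned 32-bit tick counter and intervals shorter than 2**32 ticks:

```python
MASK32 = 0xFFFFFFFF  # simulate unsigned 32-bit arithmetic in Python


def ticks_elapsed(now: int, start: int) -> int:
    """Elapsed ticks between two counter readings, correct across one wrap."""
    return (now - start) & MASK32


def timeout_expired(now: int, start: int, timeout_ticks: int) -> bool:
    """True once at least timeout_ticks have passed since start."""
    return ticks_elapsed(now, start) >= timeout_ticks


# The counter wrapped between the two readings:
start = 0xFFFFFFF0
now = 0x00000010
print(ticks_elapsed(now, start))        # 32 ticks, despite now < start
print(timeout_expired(now, start, 50))  # False, only 32 ticks have passed
```

Code written this way never depends on the absolute value of the counter, so the moment of wraparound becomes a non-event.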
vendors have been producing libraries that handle such trivial things for years. In everything from my TV set to a controller in a nuclear power plant. My question is: Who has the clout to hold Boeing's feet to the fire to adopt such processes? Where a mistake can be made, sooner or later somebody will make it. We have to learn from our faults, sins and other mistakes... |
There is no such thing as a perfect process or a perfect system. And furthermore, expecting (or depending on) perfection is the wrong thing to do, because it is unrealistic.
In fact, during certification of (new) aircraft, there is an acknowledgement that some defects will remain. Hence, defects such as these -- while they should have been caught -- are not indicative of a process breakdown, certification breakdown, etc., but simply a reflection of reality. The effects of any potential defect, however, should not be catastrophic. So what should be expected is a "graceful degradation" when failures do occur. Actually a better analogy might be "defense in depth" used in security practice -- having multiple layers so that even a complete failure of one layer does not bring down the entire system. The real question is then: even given a quadruple GCU failure taking down all four AC buses (due to this bug or some other malfunction) -- will that crash a 787? Someone more familiar with 787s can correct me, but I think the answer is generally NO, as there is still the DC bus, which will automatically run from batteries before the ram air system kicks in (or possibly from the APU as well). |
Time for a bit of a reality check, I feel...
Originally Posted by roulishollandais
(Post 8963377)
No excuse after Ariane501 crash and report. NO!
Originally Posted by cattletruck
(Post 8963384)
Congratulations to the tester that found the bug. Good testers think outside the box as this one had done...
Once fixed it's going to take another 248 days to re-run the test.
Originally Posted by Guptar
(Post 8963617)
If all modern computers are coded in 64 bit sizes, why did Boeing stick with 32 bit?
Originally Posted by msbbarratt
(Post 8988412)
2) the specification defines an up time consistent with normal aircraft operations, < 248 days, in which case the software would have to have been tested against it before it was certified
3) as per 2) but someone has also taken the trouble to go beyond the spec in their testing and discovered the true system up time. However this all seems to have been some sort of surprise, and it shouldn't be. It should be there in black and white in the paperwork. And it may well be the case that it is all written down in the right place, but that someone else simply hasn't read it. I'm expecting that to be the case, actually.
Originally Posted by msbbarratt
(Post 8989650)
We can't even keep a GCU running for 249 days.
As another software person, I'm also well aware of the limitations you're talking about - but we're not talking about the same kind of inherently dynamic logic required for a "self-driving" car or a fully-automated aircraft here, we're talking about bog-standard systems monitoring logic behaviour in scenarios which are extremely unlikely to occur in the real world.
Originally Posted by EEngr
(Post 8990217)
What we need are sound software development processes that catch these simple kinds of mistakes and get them fixed.
... Who has the clout to hold Boeing's feet to the fire to adopt such processes? As you quite rightly state, modern aircraft systems are incredibly complex these days, and it's therefore much more sensible to focus testing on the most likely scenarios first and then add layers of testing for less likely scenarios as the development and lifecycle of the product continues. Don't get me wrong, this was undoubtedly an "oops" - I'm sure that several people who worked on these systems are now a little wiser and will swear to be more thorough in their work for the rest of their lives. Nevertheless, it's important that we all try to retain a little bit of perspective! [I'd also be willing to bet money that this would barely have troubled the media had a few journalists not been fishing for B787 issues in the wake of the battery problems...] |