PPRuNe Forums - View Single Post - New Software Issues Found on the MAX
Old 20th Jan 2020, 00:38
#10
MechEngr
 
Join Date: Oct 2019
Location: USA
Posts: 876
Originally Posted by OldnGrounded
Indeed. How can you possibly not notice that a system under active development is failing (or failing to perform) a POST (power-on self-test) or equivalent? If this is really true, it's a pretty remarkable oversight.

And then there's this:
<embedded quote>
Well, umm . . . if a system doesn't properly run its validation checks on startup, why would we trust that it will properly perform "its function during flight?"
There's no telling what the specific problem is, but hardware tolerances are not easy to simulate, and a fault might not show up on one system yet appear on another.

In a system I was involved with, a top-level CPU controlled slave CPUs via dual-port memory. One day we got a call from the lab that the test system had gone crazy, out of control and far faster than ever before. It turned out the speed was the same top speed as always; they thought it was fast because normally it didn't move at all during boot-up. They hit the master panic stop switch, which did a lot of damage to other things. We all wondered how this had gone unnoticed for so long, since the prototype had been in flight test for some time. We realized that the prototype in flight test was out of view of the users, did not send data during boot-up to let them know it was doing anything, and the plane was so loud they could not have heard it. The only damage was from the sudden stop, something the flight test guys would never have known to use.

It turned out to be a race condition: the dual-port RAM would sometimes retain certain bits for a while, so the slave CPUs would sometimes read a nonzero speed value from RAM before the top-level CPU could set valid values. Since no one was doing an electron-by-electron simulation of the RAM, there was no way to know that it would wake up with random values, and that was certainly not mentioned in the documentation. The fix was for the slave CPUs to delay on boot-up and for the top-level CPU to clear all the bits before doing anything else.
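
A rough sketch of that fix as a toy Python simulation, for anyone who wants to see the ordering problem. Everything named here (DualPortRam, SPEED_ADDR, the tick counts) is made up for illustration and is not from the actual system. The RAM powers up holding garbage; the only thing that changes between the two runs is who gets there first.

```python
import random

SPEED_ADDR = 0x10      # hypothetical address of the speed command word
RAM_SIZE = 64          # hypothetical size of the shared region, in words


class DualPortRam:
    """Shared memory that wakes up holding arbitrary bits."""

    def __init__(self, seed):
        rng = random.Random(seed)
        # Power-on contents are not guaranteed to be zero. randrange starts
        # at 1 so every word is nonzero, which keeps the demo deterministic.
        self.words = [rng.randrange(1, 1 << 16) for _ in range(RAM_SIZE)]

    def clear(self):
        for i in range(RAM_SIZE):
            self.words[i] = 0


def boot(master_clear_tick, slave_read_tick, seed=1):
    """Return the speed word the slave sees on its first read.

    master_clear_tick: tick at which the top-level CPU clears the RAM.
    slave_read_tick:   tick at which the slave CPU first reads the speed word.
    """
    ram = DualPortRam(seed)
    first_read = None
    for tick in range(max(master_clear_tick, slave_read_tick) + 1):
        if tick == master_clear_tick:
            ram.clear()                      # master acts first within a tick
        if tick == slave_read_tick:
            first_read = ram.words[SPEED_ADDR]
    return first_read


# Slave boots faster than the master: it acts on stale garbage.
racy = boot(master_clear_tick=5, slave_read_tick=2)

# Fix: master clears before anything else, slave delays its first read.
fixed = boot(master_clear_tick=0, slave_read_tick=10)
```

Here racy comes back nonzero and fixed comes back zero. A real system would more likely add a "valid" flag or handshake on top of the delay, but the delay plus clear is what the fix described above amounted to.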

One weird area in embedded systems development is called "fuzzing", where random inputs are shoved into the system to ensure it rejects all the invalid ones and doesn't choke on too much input, like putting 20 characters into a 16-character field. An offshoot is to gradually lower the supply voltage to see when certain integrated circuits misbehave. The most obnoxious technique is to have an external computer gradually slow the system clock until the internal memory in the CPU is not refreshed fast enough and starts to fail. The latter two are usually used to crack encryption by getting the processors to fail and leave intermediate results in cache, but I suppose they could be used on a multi-processor system to force certain modules to boot faster or slower than nominal.
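
For the input side of fuzzing, a minimal sketch in Python, assuming a hypothetical store_field handler that is supposed to accept 1 to 16 printable characters and reject everything else. The handler, its rules, and the run count are invented for the example; the point is only the shape of the loop: generate junk, feed it in, and check that whatever gets through really is valid.

```python
import random
import string

FIELD_LEN = 16  # field width from the 16-character example

# Printable ASCII minus the control-type whitespace characters.
ALLOWED = set(string.printable) - set("\t\n\r\x0b\x0c")


def store_field(text):
    """Hypothetical handler under test: accept 1..FIELD_LEN allowed
    characters, raise ValueError for anything else."""
    if not isinstance(text, str):
        raise ValueError("not a string")
    if not 1 <= len(text) <= FIELD_LEN:
        raise ValueError("bad length")
    if any(ch not in ALLOWED for ch in text):
        raise ValueError("bad character")
    return text


def fuzz(handler, runs=2000, seed=0):
    """Throw random strings at handler and count accepted vs rejected.

    Any exception other than ValueError is treated as a crash and
    propagates to the caller.
    """
    rng = random.Random(seed)
    accepted = rejected = 0
    for _ in range(runs):
        length = rng.randrange(0, 40)        # deliberately allows oversize input
        candidate = "".join(chr(rng.randrange(0, 256)) for _ in range(length))
        try:
            result = handler(candidate)
        except ValueError:
            rejected += 1
            continue
        # Anything accepted must actually satisfy the field contract.
        assert 1 <= len(result) <= FIELD_LEN
        assert all(ch in ALLOWED for ch in result)
        accepted += 1
    return accepted, rejected


accepted, rejected = fuzz(store_field)
```

Real fuzzers are a lot smarter about picking inputs than a plain random number generator, but even this crude version will catch a handler that lets a 20-character string through.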

One great way to jam up a system is for a module to send a message to another module that isn't ready to receive any messages, and then camp out waiting for a reply that will never come. Once again, timing is everything. It might be that the message is so important that timing out and trying again isn't going to work; alternatively, if it is allowed to try again, it can overwhelm the recipient with so many retries of a high-priority message that the recipient cannot act on the message and generate a reply before the originator dumps another one in the queue. An example is asking for status more rapidly than it takes to check the status, a variation on "are we there yet, are we there yet, are we there yet..."
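
The "are we there yet" problem can be shown with a toy tick-by-tick model in Python. This is only a sketch under assumptions of my own: one requester, one recipient, every request is high priority so a new arrival makes the recipient start over, and the function name and numbers are invented. Backing off between retries is one common way out, not necessarily what any particular system does.

```python
def ticks_until_reply(service_ticks, retry_interval, backoff=1, max_ticks=200):
    """Simulate one requester and one recipient.

    The recipient needs service_ticks uninterrupted ticks to answer.
    Each new request preempts the work in progress. The requester
    resends after retry_interval ticks without a reply and multiplies
    that interval by backoff after every send.
    Returns the tick at which the reply is produced, or None if no
    reply is produced within max_ticks.
    """
    next_send = 0
    interval = retry_interval
    work_done = 0
    for tick in range(max_ticks):
        if tick == next_send:
            work_done = 0                # new request restarts the status check
            next_send = tick + interval
            interval *= backoff
        work_done += 1
        if work_done == service_ticks:
            return tick + 1
    return None


# Asking every 3 ticks for something that takes 5 ticks: no reply, ever.
storm = ticks_until_reply(service_ticks=5, retry_interval=3)

# Same impatient start, but doubling the wait after each send.
patient = ticks_until_reply(service_ticks=5, retry_interval=3, backoff=2)
```

With these numbers storm is None, because the recipient never gets five quiet ticks in a row, while patient gets its reply at tick 8, after the second request is finally left alone long enough.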