
New Software Issues Found on the MAX


Old 18th Jan 2020, 13:12
  #1 (permalink)  
Thread Starter
 
Join Date: Dec 2018
Location: Florida
Posts: 106
New Software Issues Found on the MAX

Not sure where we are supposed to be keeping up to date on MAX developments. This is the first I have heard that Boeing rewrote the entire software for the flight control computer, not just the MCAS code. Several carriers have now pushed MAX return back to June.


https://www.wsj.com/articles/boeing-...rn-11579290347

https://abcnews.go.com/Politics/soft...ry?id=68357961
https://apnews.com/c8cfe82b6ab25a788b42eab1e8e47a3a
Lake1952 is offline  
Old 18th Jan 2020, 16:02
  #2 (permalink)  
 
Join Date: Sep 2014
Location: UK
Posts: 81
"Rewrote" here is quite misleading, and usual journalistic hyperbole. In fact Boeing will be creating an updated version by amending existing software. Those amendments may be more or less extensive, but they are not starting again from scratch.
donotdespisethesnake is offline  
Old 18th Jan 2020, 16:54
  #3 (permalink)  
 
Join Date: Apr 2019
Location: EDSP
Posts: 223
Ooooopsi

Discovered during rollout:
The issue is in the plane’s flight-control computer software. It was confined to how it performs validation checks during startup and doesn’t involve its function during flight, the people said. The problem came to light when the latest version of the software was loaded onto an actual aircraft, according to one of the people. While it has been tested on planes in flight, most of the software reviews have occurred in a special simulator used by engineers on the ground.
https://www.seattletimes.com/busines...eings-737-max/

How was that: after all that scrutiny, the most extensively tested and safest piece of software ever!
BDAttitude is offline  
Old 19th Jan 2020, 18:53
  #4 (permalink)  
 
Join Date: Apr 2015
Location: Under the radar, over the rainbow
Posts: 707
Originally Posted by BDAttitude View Post
Discovered during rollout:

https://www.seattletimes.com/busines...eings-737-max/

How was that: after all that scrutiny, the most extensively tested and safest piece of software ever!
Indeed. How can you possibly not notice that a system under active development is failing (or failing to perform) a POST (power-on self-test) or equivalent? If this is really true, it's a pretty remarkable oversight.

And then there's this:

It was confined to how it performs validation checks during startup and doesn’t involve its function during flight . . .
Well, umm . . . if a system doesn't properly run its validation checks on startup, why would we trust that it will properly perform "its function during flight?"
OldnGrounded is offline  
Old 19th Jan 2020, 19:39
  #5 (permalink)  
 
Join Date: Apr 2019
Location: EDSP
Posts: 223
Well, TBH I would hope no one would have gone airborne with an FCC not completing its POST.

This confirms (once again) that the changes done to implement intercommunication and health checking between the two boxes, and to fix this dubious AP disconnect issue, which must have something to do with the task scheduling, were in fact open heart surgery on these old architectures.

What could possibly go wrong?

To discover this half a year down the road - presumably after the fix itself has been reviewed and was close to approval - is pretty embarrassing.
However, knowing the timelines from comparable hot fixes in automotive, I still think that this one is rather rushed.
With sound testing one would not expect these problems.
After all, these FCCs are not PCs, where every unit is different with regard to hardware, user-installed software and configuration.
In such a well controlled environment as transport category airplanes straight from the production line one should be able to do better, if testing and validation was sufficient.
BDAttitude is offline  
Old 19th Jan 2020, 20:14
  #6 (permalink)  
 
Join Date: Jan 2008
Location: uk
Posts: 802
Originally Posted by BDAttitude View Post
Well, TBH I would hope no one would have gone airborne with an FCC not completing its POST.
Well, yeah. It isn't quite "nothing to see here", but this is exactly the sort of failure you can sometimes get with any software system going from test or staging environments (eng sim) to production - test never quite does things exactly the same. Looks like failure happened at the right place/time (ie. on the ground) and was caught by the existing self-checks.

The surprising thing for me is that this appears to mean they have not yet flown the final fix. So are all those previous test flights useless now? Was this the reason for the test-flight hiatus - i.e. not that they'd finished testing (as some said) but that the software wasn't final yet?

This confirms (once again) that the changes done to implement intercommunication and health checking between the two boxes, and to fix this dubious AP disconnect issue, which must have something to do with the task scheduling, were in fact open heart surgery on these old architectures.

What could possibly go wrong?
Well, it hints rather than confirms, but I'd bet that it was that change causing problems as well.

Reading between the lines, I get the impression they are now seriously short of spare CPU cycles in the FCC (my guess is that they were before this issue, possibly even before MAX, and were just hoping they wouldn't need any...). If so, they will now be trying to save cycles anywhere they can (been there, done that, thankfully not on passenger aircraft code...), which is really, really bad news. What could possibly go wrong? Anything. Anything they mess with to save cycles, and everything else as well if they actually have started to mess with task scheduling: latent bugs, race conditions, new timing issues, things that haven't surfaced in the life of the NG and would have stayed hidden, could now be unleashed. Or of course it could all be just fine, nothing to see here, feel the force...
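
To put a number on "spare CPU cycles": a rough sketch of a throughput-margin check for periodic tasks. Everything here is an assumption for illustration. The task set is hypothetical, and nothing in the reporting says the FCC uses rate-monotonic scheduling; the Liu and Layland bound is simply a well-known sufficient test for that scheme.

```python
def utilization(tasks):
    """Total CPU load: sum of worst-case execution time / period per task."""
    return sum(wcet / period for wcet, period in tasks)


def rm_bound(n):
    """Liu and Layland sufficient bound for n tasks under rate-monotonic scheduling."""
    return n * (2 ** (1 / n) - 1)


# Hypothetical task set: (worst-case execution time in ms, period in ms)
tasks = [(2, 10), (5, 20), (10, 50)]

u = utilization(tasks)  # 0.2 + 0.25 + 0.2
print(round(u, 3), round(rm_bound(len(tasks)), 3))  # 0.65 0.78

# Adding one more periodic job (say, a cross-channel comparison) eats the margin.
more = tasks + [(4, 20)]
print(utilization(more) > rm_bound(len(more)))  # True: no longer guaranteed schedulable
```

Exceeding the bound does not prove deadlines will be missed, only that this simple test can no longer guarantee they won't. That is the position where people start shaving cycles.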

Wondering now how many test flights the new new fixed software will need (once it actually boots properly), and how long after that to certify it?

infrequentflyer789 is offline  
Old 19th Jan 2020, 23:05
  #7 (permalink)  
 
Join Date: Apr 2015
Location: Under the radar, over the rainbow
Posts: 707
Originally Posted by infrequentflyer789 View Post
Well, yeah. It isn't quite "nothing to see here", but this is exactly the sort of failure you can sometimes get with any software system going from test or staging environments (eng sim) to production - test never quite does things exactly the same.
I dunno. I can think of lots of failures that popped up late in the game and brought us up short when we thought we were almost home free, but I can't remember not noticing that a system wasn't running its POST -- or failing it if it was running it.

Maybe the reporting is garbled and that's not really the issue here, but, if it is, it seems rather alarming to me.
OldnGrounded is offline  
Old 19th Jan 2020, 23:56
  #8 (permalink)  
 
Join Date: Jul 2013
Location: Everett, WA
Age: 64
Posts: 2,649
Initialization issues are not all that uncommon - even in DAL A software. You can get inadvertent 'race' conditions happening during initialization where tasks don't happen in the order that was planned (sometimes inconsistently, depending on some of the other external conditions during initialization). This can get particularly tricky when some of the interfacing systems may not always be alive yet during initialization. If they are short on throughput margin it can be even worse because you can't afford the processing necessary to do multiple validity checks.
These issues can also be hard to find during development and rig testing because you often have simulations of the interfacing systems, rather than the actual systems, and the simulations may not completely mimic the initialization characteristics of the actual systems.
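
A minimal sketch of that point, with hypothetical names throughout (`PeerChannel`, `initialize`, and the poll counts are not from any real FADEC or FCC). It shows initialization code that waits a bounded number of polls for an interfacing system, and how a simulation that is alive immediately never exercises the slow path.

```python
import time


class PeerChannel:
    """Hypothetical stand-in for an interfacing system that takes a while to come alive."""

    def __init__(self, ready_after_polls):
        self._polls_until_ready = ready_after_polls

    def is_alive(self):
        # Each poll moves the peer one step closer to being ready.
        if self._polls_until_ready > 0:
            self._polls_until_ready -= 1
            return False
        return True

    def read_status(self):
        return "OK"


def initialize(peer, max_polls=10, poll_interval_s=0.0):
    """Wait a bounded number of polls for the peer, then fail explicitly."""
    for _ in range(max_polls):
        if peer.is_alive():
            return peer.read_status()
        time.sleep(poll_interval_s)
    return "INIT_FAULT"  # never read from a peer that has not reported alive


# A rig simulation that is alive immediately hides the ordering problem...
print(initialize(PeerChannel(ready_after_polls=0)))   # OK
# ...while hardware that wakes slowly exposes it.
print(initialize(PeerChannel(ready_after_polls=25)))  # INIT_FAULT
```

If the rig only ever models the first case, the second path is untested until the software meets the real box.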
We ran into an initialization issue with some DAL A FADEC software after it had been flying around in service for over 20 years that prevented the normal channel alternation logic from working properly during engine start. After we figured out what was happening, it turned out the issue had been there - latent - from day one. But it took a change to an interface to bring it to the surface.
tdracer is offline  
Old 20th Jan 2020, 00:37
  #9 (permalink)  
 
Join Date: Jul 2014
Location: Harbour Master Place
Posts: 662
it turned out the issue had been there - latent - from day one. But it took a change to an interface to bring it to the surface.

Not a computer person, but I have been tinkering with computers since the early 80s, and have some good friends with PhDs in the area. That is exactly the sort of bug that concerns pilots with only a dual FCC system moving a very large control surface. The latent one that may surface much much much later and catch someone out.


Much earlier in the thread I posted a link to one of the original forensic computer accident investigations. It too involved this exact type of latent bug sitting silently, waiting to reveal itself with an "interface change" (removal of a physical mechanical interlock preventing a lethal dose of radiation): Nancy Leveson: Therac-25 Accident. That investigation described both a data-entry race condition and a one-byte counter that rolled over to zero, flaws that had remained dormant for years.
CurtainTwitcher is offline  
Old 20th Jan 2020, 00:38
  #10 (permalink)  
 
Join Date: Oct 2019
Location: USA
Posts: 133
Originally Posted by OldnGrounded View Post
Indeed. How can you possibly not notice that a system under active development is failing (or failing to perform) a POST (power-on self-test) or equivalent? If this is really true, it's a pretty remarkable oversight.

And then there's this:
<embedded quote>
Well, umm . . . if a system doesn't properly run its validation checks on startup, why would we trust that it will properly perform "its function during flight?"
No telling on the specific problem, but there are tolerances in hardware that are not easy to simulate and that might not show up on one system but will on another.

In a system I was involved with a top level CPU was controlling slave CPUs via dual port memory. One day we get a call from the lab that the test system had gone crazy out of control and far faster than ever before. Turned out the speed was the same top speed as always and that they thought it was fast because normally it didn't move during boot-up. They hit the master panic stop switch which did a lot of damage to other things. We all wondered how this had gone unnoticed for so long; the prototype had been in flight test for some time. We realized that the prototype in flight test was out of view of the users, did not send data during boot-up to let them know it was doing anything, and the plane was so loud they would be unable to hear it. The only damage was from the sudden stop, something the flight test guys would never know to use.

Turned out it was a race condition where the dual port RAM would sometimes retain certain bits for a while and then the slave CPUs would sometimes get a speed RAM value that wasn't zero before the top level CPU could set valid values. Since no one was doing an electron-by-electron simulation of the RAM it wasn't possible to know that it would wake up with random values; certainly not mentioned in the documentation. The fix was for the slave CPUs to delay on boot-up and for the top level CPU to clear all the bits before doing anything else.
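
A rough sketch of that race and one way to close it. Note the swap: instead of the fixed boot delay described above, this uses an explicit "valid" word that the master writes last. The offsets, the marker value and the function names are all made up for illustration.

```python
SPEED, VALID = 0, 1   # hypothetical word offsets in the shared memory
VALID_MAGIC = 0xA5A5  # hypothetical marker, written last by the master


def master_init(ram, commanded_speed=0):
    """Clear everything first, then publish the value, then mark it valid."""
    for i in range(len(ram)):
        ram[i] = 0
    ram[SPEED] = commanded_speed
    ram[VALID] = VALID_MAGIC


def slave_read_speed(ram):
    """Treat anything not explicitly marked valid as 'do not move'."""
    if ram[VALID] != VALID_MAGIC:
        return 0
    return ram[SPEED]


ram = [0x3FF, 0x1234]          # stale bits left over at power-up
print(ram[SPEED])              # 1023: what a naive slave would act on
print(slave_read_speed(ram))   # 0: slave ran first, garbage ignored
master_init(ram, commanded_speed=7)
print(slave_read_speed(ram))   # 7: now a deliberate value
```

A single marker word can still match stale bits by bad luck, so a real design would want something stronger (a checksum, a sequence counter, or a hardware ready line). The point is only that the slave should never trust memory it has not been told is initialized.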

One weird area in embedded systems development is called "fuzzing" where random inputs are shoved into the system to ensure it rejects all the invalid ones and doesn't choke on too much - like putting 20 characters into a 16 character field. An offshoot is to gradually lower the voltage to see when certain integrated circuits misbehave and the most obnoxious technique is to have an external computer gradually slow the system clock until the internal memory in CPUs is not refreshed fast enough and it starts to fail. The latter are usually used to crack encryption by getting the processors to fail and leave intermediate results in cache, but I suppose it could be used on a multi-processor system to force certain modules to boot faster or slower than nominal.
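
For the input side of fuzzing, a minimal sketch using the 16-character field from the example above. `parse_field` is a hypothetical parser under test, and the fuzzer is deliberately crude: random strings, a fixed seed so runs are repeatable, and a list of anything that was neither cleanly accepted nor cleanly rejected.

```python
import random

FIELD_LEN = 16  # the 16-character field from the example above


def parse_field(text):
    """Hypothetical parser under test: accepts 1..16 printable ASCII characters."""
    if not 1 <= len(text) <= FIELD_LEN:
        raise ValueError("bad length")
    if any(not (32 <= ord(ch) < 127) for ch in text):
        raise ValueError("bad character")
    return text


def fuzz(parser, runs=1000, seed=0):
    """Feed random strings; anything but a clean accept or reject is a finding."""
    rng = random.Random(seed)
    findings = []
    for _ in range(runs):
        length = rng.randrange(0, 40)
        candidate = "".join(chr(rng.randrange(0, 256)) for _ in range(length))
        try:
            result = parser(candidate)
            if len(result) > FIELD_LEN:
                findings.append(candidate)  # oversized input slipped through
        except ValueError:
            pass                            # clean rejection is the desired outcome
        except Exception:
            findings.append(candidate)      # the parser "choked"
    return findings


print(len(fuzz(parse_field)))  # 0
```

Swapping in a parser with no length check (for instance `lambda s: s`) makes the findings list fill up immediately, which is the behaviour the fuzzer exists to catch.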

One great way to jam up a system is for a module to send a message to another module that isn't ready to receive any messages and then camp out and wait for a reply which will never come. Once again, timing is everything and it might be the importance of the message is so high that timing out and trying again isn't going to work; alternatively, if it is allowed to try again it can overwhelm the recipient with so many retries on a high-priority message that the recipient cannot act on the message and generate a reply before the originator dumps another one in the queue. For example, asking for status more rapidly than it takes to check the status; a variation on "are we there yet, are we there yet, are we there yet..."
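
One common way to avoid both traps is a bounded number of retries with growing waits. The sketch below runs on a simulated clock (ticks) so it is repeatable; `Receiver`, `request_with_backoff` and the numbers are hypothetical.

```python
class Receiver:
    """Hypothetical module that ignores messages until it has finished booting."""

    def __init__(self, ready_at_tick):
        self.ready_at_tick = ready_at_tick
        self.requests_seen = 0

    def handle(self, tick, message):
        self.requests_seen += 1
        if tick < self.ready_at_tick:
            return None  # not ready: this request never gets a reply
        return f"ack:{message}"


def request_with_backoff(receiver, message, max_attempts=5, first_wait=1):
    """Bounded retries with doubling waits, on a simulated clock."""
    tick, wait = 0, first_wait
    for attempt in range(1, max_attempts + 1):
        reply = receiver.handle(tick, message)
        if reply is not None:
            return reply, attempt
        tick += wait  # give up on this attempt after `wait` ticks
        wait *= 2     # back off so retries do not flood the receiver
    return None, max_attempts  # report failure instead of waiting forever


print(request_with_backoff(Receiver(ready_at_tick=6), "status"))    # ('ack:status', 4)
print(request_with_backoff(Receiver(ready_at_tick=100), "status"))  # (None, 5)
```

This does not help where a time-out is simply unacceptable for the message in question; in that case the fix has to be in the start-up ordering, not in the retry policy.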
MechEngr is offline  
Old 20th Jan 2020, 01:06
  #11 (permalink)  
 
Join Date: Jan 2013
Location: Seattle Area
Posts: 182
Originally Posted by infrequentflyer789 View Post

The surprising thing for me is that this appears to mean they have not yet flown the final fix. So are all those previous test flights useless now? Was this the reason for the test-flight hiatus - i.e. not that they'd finished testing (as some said) but that the software wasn't final yet?
My understanding is that much of that flying was to demonstrate the characteristics of the airplane with MCAS disabled, ostensibly to "demonstrate system failure conditions (to classify failure effects)," but I suspect also to satisfy all the questions from foreign CAAs about unaugmented handling characteristics.
Dave Therhino is offline  
Old 20th Jan 2020, 06:21
  #12 (permalink)  
 
Join Date: Jan 2008
Location: Hawaii
Age: 72
Posts: 35
Do you all understand that this is a common non-issue?

Many times after a software upload we would have this problem and would just revert to the last update.

You are talking about an airplane that is grounded. NON issue !
hunbet is offline  
Old 20th Jan 2020, 06:59
  #13 (permalink)  
 
Join Date: Apr 2019
Location: EDSP
Posts: 223
Originally Posted by hunbet View Post
Do you all understand that this is a common non-issue?

Many times after a software upload we would have this problem and would just revert to the last update.

You are talking about an airplane that is grounded. NON issue !
Well, it's back to the developers, getting a build from RC, doing the quality assurance, regression testing, reviews ... not before May, probably June would be my guess.
NON issue for an increasingly desperate company?
BDAttitude is offline  
Old 20th Jan 2020, 07:07
  #14 (permalink)  
 
Join Date: Apr 2019
Location: EDSP
Posts: 223
Originally Posted by tdracer View Post
These issues can also be hard to find during development and rig testing because you often have simulations of the interfacing systems, rather than the actual systems, and the simulations may not completely mimic the initialization characteristics of the actual systems.
Very true! However, if you mess about with task scheduling these things can happen, and if one is being rushed they most certainly will happen. These kinds of errors have to be tested out, unfortunately.
BDAttitude is offline  
Old 20th Jan 2020, 12:25
  #15 (permalink)  
 
Join Date: Apr 2015
Location: Under the radar, over the rainbow
Posts: 707
Originally Posted by hunbet View Post
Do you all understand that this is a common non-issue?
Nope, I don't understand that, at all. For this to crop up at this point in the process (when Boeing has been telling the world for some time that MCAS 2.0 is ready to go and just needs FAA approval) suggests that (a) either the Boeing folks who have been talking to the world don't know what they're talking about or haven't been candid; and/or (b) the Boeing people actually doing the work mistakenly thought they had completed testing when they had not.
OldnGrounded is offline  
Old 20th Jan 2020, 12:30
  #16 (permalink)  
 
Join Date: Apr 2015
Location: Under the radar, over the rainbow
Posts: 707
Quote:
Originally Posted by tdracer View Post
These issues can also be hard to find during development and rig testing because you often have simulations of the interfacing systems, rather than the actual systems, and the simulations may not completely mimic the initialization characteristics of the actual systems.

Originally Posted by BDAttitude View Post
Very true! However, if you mess about with task scheduling these things can happen, and if one is being rushed they most certainly will happen. These kinds of errors have to be tested out, unfortunately.
All true, both of you. That said, there's really no escaping the fact that having this issue pop up at this time, after Boeing has told the world that all is ready for regulatory blessing, is a strong indication that something is not as it should be in the development-testing process.
OldnGrounded is offline  
Old 20th Jan 2020, 12:42
  #17 (permalink)  
 
Join Date: Jul 2007
Location: Switzerland
Age: 74
Posts: 96
The problem with interactive computer systems with both synchronous and async interactions is that no amount of testing can guarantee they always work. You have to get it right by design, not by testing. In earlier days there were computer languages like Ada that supported (but did not guarantee) a proper design with special constructs like the "rendezvous" and "resource locking". Such a design never works in a hurry and is extremely difficult to fix without creating more problems. This is the sort of task that a single clever software engineer is better suited to solve than a hundred programmers working for a few dollars (or more).
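
For anyone who hasn't met the rendezvous: a rough Python analogue, not Ada and not a full model of Ada's semantics. The caller blocks until the server task has accepted the call and finished the body, so the two sides meet at a defined point instead of racing. The `Entry` class is an invention for this sketch.

```python
import queue
import threading


class Entry:
    """Rough analogue of an Ada task entry (an approximation for illustration)."""

    def __init__(self):
        self._calls = queue.Queue()

    def call(self, value):
        # Caller blocks until the server has accepted and completed the call.
        done = threading.Event()
        box = {}
        self._calls.put((value, box, done))
        done.wait()
        return box["result"]

    def accept(self, body):
        # Server blocks until some caller arrives, then runs the body for it.
        value, box, done = self._calls.get()
        box["result"] = body(value)
        done.set()


entry = Entry()
server = threading.Thread(target=lambda: entry.accept(lambda x: x * 2))
server.start()
print(entry.call(21))  # 42
server.join()
```

It works whichever side arrives first, which is the property that ad hoc flags and delays tend not to have.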
I had a laugh last summer when Muilenburg boasted he could change the single sensor FCC into a communicating and comparing dual channel system in a matter of weeks. The laugh is still echoing.

Last edited by clearedtocross; 20th Jan 2020 at 17:35.
clearedtocross is offline  
Old 20th Jan 2020, 13:14
  #18 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,346
There are always problems like this that surface if you make changes to geriatric code. Quite often they are minor, with just parameters needing to be increased because the system is now doing extra work or has more interfaces, so a time-out needs to be longer or a count needs to allow a higher value. When you are coding down at machine code level, or if you are lucky at assembler level, and you are working on the physical machine, these small things can bite you. Had the original designer still been around you might have been warned: don't change that value without altering this other, apparently unrelated, parameter. The problem with maintaining embedded code that was written before ideas of structure, and written to fit inside the space/run-time available, is that whatever you change may cause an error somewhere. Maintenance programming at machine level is not a skill that is taught any more.
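
In newer code one way to capture that kind of warning is to write the coupling down as a check that runs at start-up. A small sketch, with every name and number hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkConfig:
    # All names and numbers are hypothetical.
    peers: int        # number of interfaces polled each cycle
    poll_ms: int      # time budget to poll one peer
    watchdog_ms: int  # time-out that must cover one full polling cycle

    def check(self):
        """Return the violated couplings instead of relying on tribal knowledge."""
        problems = []
        if self.watchdog_ms <= self.peers * self.poll_ms:
            problems.append("watchdog_ms must exceed peers * poll_ms")
        return problems


print(LinkConfig(peers=1, poll_ms=20, watchdog_ms=30).check())  # []
print(LinkConfig(peers=2, poll_ms=20, watchdog_ms=30).check())
# ['watchdog_ms must exceed peers * poll_ms']
```

Adding a second interface without touching the time-out is then caught by the check rather than by the first boot on a real aircraft. None of this is available when the values are scattered through decades-old assembler, which is the point being made above.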
Ian W is offline  
Old 20th Jan 2020, 13:24
  #19 (permalink)  
 
Join Date: Mar 2019
Location: French Alps
Posts: 326
Originally Posted by OldnGrounded View Post
That said, there's really no escaping the fact that having this issue pop up at this time, after Boeing has told the world that all is ready for regulatory blessing, is a strong indication that something is not as it should be in the development-testing process.
So true !
Things like that do happen, but what is unclear is why tests in a real aircraft come so late into the development timeline ?
Not a specialist, but is flight software so complicated as compared to - for instance - autonomous ground vehicle software ?
Fly Aiprt is offline  
Old 20th Jan 2020, 15:06
  #20 (permalink)  
 
Join Date: May 2004
Location: Toronto, Canada
Age: 69
Posts: 57
Originally Posted by Fly Aiprt View Post
...what is unclear is why tests in a real aircraft come so late into the development timeline ?...
@clearedtocross has it:

Originally Posted by clearedtocross View Post
The problem with interactive computer systems with both synchronous and async interactions is that any amount of testing does not make sure they always work. You have to get it right by design, not by testing...
I've been in the computer data communications field for half a century now, not in aeronautics (that was my dad's job, on the Avro Arrow - though I gather that aircraft is also still grounded...). The design of interdependent, asynchronous systems (whether multiple tasks on a single piece of hardware, or inter-system communication via data buses) is not the proverbial "rocket science" but the result of careful understanding of each component, what it's dependent on, and how that dependence is handled.

Capable practitioners in the field, who we hope are integral to aeronautical design, understand critical sections, race conditions and how to avoid them, etc. ... and apply that to their own designs, their overall contributions to their own firm's designs, and their specifications to suppliers.
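
For readers outside the field, the textbook form of a critical section, as a minimal sketch (the class and function names are made up). The read-modify-write on the shared value is the part that must not be interleaved, so it is wrapped in a lock.

```python
import threading


class SharedCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Read-modify-write is the critical section; the lock keeps it indivisible.
        with self._lock:
            current = self.value
            self.value = current + 1


def run(workers=4, per_worker=10_000):
    counter = SharedCounter()

    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run())  # 40000
```

Without the lock the result depends on how the threads happen to interleave, and it can be correct for a very long time before it isn't, which is the latent-bug pattern discussed earlier in the thread.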
boaclhryul is offline  

