New Software Issues Found on the MAX [Archive]

Lake1952

18th Jan 2020, 13:12

Not sure where we are supposed to be keeping up to date on MAX developments. This is the first I have heard that Boeing rewrote the entire software for the flight control computer, not just the MCAS code. Several carriers have now pushed MAX return back to June.

https://www.wsj.com/articles/boeing-finds-new-software-problem-that-could-complicate-737-max-return-11579290347

https://abcnews.go.com/Politics/software-issues-delay-return-boeings-737-max/story?id=68357961
https://apnews.com/c8cfe82b6ab25a788b42eab1e8e47a3a

donotdespisethesnake

18th Jan 2020, 16:02

"Rewrote" here is quite misleading, and usual journalistic hyperbole. In fact Boeing will be creating an updated version by amending existing software. Those amendments may be more or less extensive, but they are not starting again from scratch.

BDAttitude

18th Jan 2020, 16:54

Discovered during rollout:
The issue is in the plane’s flight-control computer software. It was confined to how it performs validation checks during startup and doesn’t involve its function during flight, the people said. The problem came to light when the latest version of the software was loaded onto an actual aircraft, according to one of the people. While it has been tested on planes in flight, most of the software reviews have occurred in a special simulator used by engineers on the ground.

https://www.seattletimes.com/business/boeing-aerospace/new-software-flaw-could-further-delay-boeings-737-max/

How was that: After all that scrutiny, most extensively tested and safest piece of software ever!

OldnGrounded

19th Jan 2020, 18:53

Discovered during rollout:

https://www.seattletimes.com/business/boeing-aerospace/new-software-flaw-could-further-delay-boeings-737-max/

How was that: After all that scrutiny, most extensively tested and safest piece of software ever!

Indeed. How can you possibly not notice that a system under active development is failing (or failing to perform) a POST (power-on self-test) or equivalent? If this is really true, it's a pretty remarkable oversight.

And then there's this:

It was confined to how it performs validation checks during startup and doesn’t involve its function during flight . . .

Well, umm . . . if a system doesn't properly run its validation checks on startup, why would we trust that it will properly perform "its function during flight?"

BDAttitude

19th Jan 2020, 19:39

Well, TBH I would hope nobody would have gone airborne with an FCC not completing its POST.

This confirms (once again) that the changes made to implement intercommunication and health checking between the two boxes, and to fix this dubious AP disconnect issue, which must have something to do with the task scheduling, were in fact open-heart surgery on these old architectures.

What could possibly go wrong?

To discover this half a year down the road - presumably after the fix itself has been reviewed and was close to approval - is pretty embarrassing.
However knowing the timelines from comparable hot fixes in automotive I still think that this one is rather rushed.
With sound testing one would not expect these problems.
After all, these FCCs are no PCs where every unit is different with regard to hardware, user installed software and configuration.
In such a well controlled environment as transport category airplanes straight from the production line one should be able to do better, if testing and validation was sufficient.

infrequentflyer789

19th Jan 2020, 20:14

Well, TBH I would hope nobody would have gone airborne with an FCC not completing its POST.

Well, yeah. It isn't quite "nothing to see here", but this is exactly the sort of failure you can sometimes get with any software system going from test or staging environments (eng sim) to production - test never quite does things exactly the same. Looks like failure happened at the right place/time (ie. on the ground) and was caught by the existing self-checks.

The surprising thing for me is that this appears to mean they have not yet flown the final fix. So are all those previous test flights useless now? Was this the reason for the test-flight hiatus - ie. not that they'd finished testing (as some said) but that the software wasn't final yet?

This confirms (once again) that the changes made to implement intercommunication and health checking between the two boxes, and to fix this dubious AP disconnect issue, which must have something to do with the task scheduling, were in fact open-heart surgery on these old architectures.

What could possibly go wrong?

Well, it hints rather than confirms, but I'd bet that it was that change causing problems as well.

Reading between the lines, I get the impression they are now seriously short of spare CPU cycles in the FCC (my guess is that they were before this issue, possibly even before MAX, and were just hoping they wouldn't need any...). If so, they will now be trying to save cycles anywhere they can (been there, done that, thankfully not on passenger aircraft code...), which is really really bad news. What could possibly go wrong? Anything. Anything they mess with to save cycles, and everything else as well if they actually have started to mess with task scheduling: latent bugs, race conditions, new timing issues, things that haven't surfaced in the life of the NG and would have stayed hidden, could now be unleashed. Or of course it could all be just fine, nothing to see here, feel the force...

Wondering now how many test flights the new new fixed software will need (once it actually boots properly), and how long after that to certify it?

OldnGrounded

19th Jan 2020, 23:05

Well, yeah. It isn't quite "nothing to see here", but this is exactly the sort of failure you can sometimes get with any software system going from test or staging environments (eng sim) to production - test never quite does things exactly the same.

I dunno. I can think of lots of failures that popped up late in the game and brought us up short when we thought we were almost home free, but I can't remember not noticing that a system wasn't running its POST -- or failing it if it was running it.

Maybe the reporting is garbled and that's not really the issue here, but, if it is, it seems rather alarming to me.

tdracer

19th Jan 2020, 23:56

Initialization issues are not all that uncommon - even in DAL A software. You can get inadvertent 'race' conditions happening during initialization where tasks don't happen in the order that was planned (sometimes inconsistently, depending on some of the other external conditions during initialization). This can get particularly tricky when some of the interfacing systems may not always be alive yet during initialization. If they are short on throughput margin it can be even worse because you can't afford the processing necessary to do multiple validity checks.
These issues can also be hard to find during development and rig testing because you often have simulations of the interfacing systems, rather than the actual systems, and the simulations may not completely mimic the initialization characteristics of the actual systems.
We ran into an initialization issue with some DAL A FADEC software after it had been flying around in service for over 20 years that prevented the normal channel alternation logic from working properly during engine start. After we figured out what was happening, it turned out the issue had been there - latent - from day one. But it took a change to an interface to bring it to the surface.
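
To illustrate the point about interfacing systems that are not alive yet, here is a minimal sketch in Python. It is not FCC or FADEC code; `SimulatedPeer`, `initialise` and the tick counts are hypothetical names and numbers chosen for illustration. It shows how a rig simulation that answers immediately can hide an ordering problem that slower real hardware exposes.

```python
from dataclasses import dataclass


@dataclass
class SimulatedPeer:
    """Hypothetical interfacing system that only comes alive after some ticks."""
    alive_at_tick: int

    def is_alive(self, tick: int) -> bool:
        return tick >= self.alive_at_tick


def initialise(peer: SimulatedPeer, timeout_ticks: int) -> str:
    """Poll the peer once per tick; never assume it is already up."""
    for tick in range(timeout_ticks):
        if peer.is_alive(tick):
            return f"ready at tick {tick}"
    # The peer never answered inside the window: report the failure
    # instead of carrying on with data that was never valid.
    return "init failed: peer not alive"


# A rig simulation that is alive immediately hides the ordering problem...
print(initialise(SimulatedPeer(alive_at_tick=0), timeout_ticks=5))
# ...while a slower real unit falls outside the same window.
print(initialise(SimulatedPeer(alive_at_tick=8), timeout_ticks=5))
```

The validity check costs one poll per tick, which is the kind of processing a system short on throughput margin may struggle to afford.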

CurtainTwitcher

20th Jan 2020, 00:37

it turned out the issue had been there - latent - from day one. But it took a change to an interface to bring it to the surface.

Not a computer person, but have been tinkering with computers since the early 80's, and have some good friends with PhD's in the area. That is exactly the sort of bug that concerns pilots with only a dual FCC system moving a very large control surface. The latent one that may surface much much much later and catch someone out.

Much earlier in the thread I posted a link to one of the original forensic computer accident investigations. It too involved this exact type of latent bug sitting silently, waiting to reveal itself with an "interface change" (removal of a physical mechanical interlock preventing a lethal dose of radiation): Nancy Leveson: Therac-25 Accident (http://sunnyday.mit.edu/papers/therac.pdf). This case wasn't a race condition, rather a simple "out by one" coding error that remained dormant for years.

MechEngr

20th Jan 2020, 00:38

Indeed. How can you possibly not notice that a system under active development is failing (or failing to perform) a POST (power-on self-test) or equivalent? If this is really true, it's a pretty remarkable oversight.

And then there's this:
<embedded quote>
Well, umm . . . if a system doesn't properly run its validation checks on startup, why would we trust that it will properly perform "its function during flight?"

No telling on the specific problem, but there are tolerances in hardware that are not easy to simulate and might not present on one system but will be on another.

In a system I was involved with, a top level CPU was controlling slave CPUs via dual port memory. One day we got a call from the lab that the test system had gone crazy, out of control and far faster than ever before. It turned out the speed was the same top speed as always, and that they thought it was fast because normally it didn't move during boot-up. They hit the master panic stop switch, which did a lot of damage to other things. We all wondered how this had gone unnoticed for so long; the prototype had been in flight test for some time. We realized that the prototype in flight test was out of view of the users, did not send data during boot-up to let them know it was doing anything, and the plane was so loud they would be unable to hear it. The only damage was from the sudden stop, something the flight test guys would never know to use.

Turned out it was a race condition where the dual port RAM would sometimes retain certain bits for a while and then the slave CPUs would sometimes get a speed RAM value that wasn't zero before the top level CPU could set valid values. Since no one was doing an electron-by-electron simulation of the RAM it wasn't possible to know that it would wake up with random values; certainly not mentioned in the documentation. The fix was for the slave CPUs to delay on boot-up and for the top level CPU to clear all the bits before doing anything else.
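
A toy model of that boot-up race, sketched in Python rather than the original embedded code. The memory layout (`SPEED_ADDR`), the RAM size and both boot functions are assumptions made up for illustration. The only thing taken from the story is the ordering: in the broken case the slave reads before the master has written, and in the fixed case the master clears the shared memory first and the slave's delay means it reads afterwards.

```python
import random

SPEED_ADDR = 0   # hypothetical location of the speed command in shared RAM
RAM_WORDS = 4    # hypothetical size of the shared region


def power_on_ram(seed: int) -> list:
    """Model dual-port RAM waking up with leftover, non-zero bits."""
    rng = random.Random(seed)
    return [rng.randrange(1, 256) for _ in range(RAM_WORDS)]


def naive_boot(ram: list) -> int:
    """Slave reads the speed word before the master has initialised it."""
    slave_speed = ram[SPEED_ADDR]   # stale power-on value
    ram[SPEED_ADDR] = 0             # master's write arrives too late
    return slave_speed


def fixed_boot(ram: list) -> int:
    """Master clears everything first; the slave's boot delay makes it read after."""
    for addr in range(len(ram)):
        ram[addr] = 0
    return ram[SPEED_ADDR]


print("naive boot commanded speed:", naive_boot(power_on_ram(seed=1)))
print("fixed boot commanded speed:", fixed_boot(power_on_ram(seed=1)))
```

In this sketch the power-on values are forced to be non-zero so the fault always shows; on the real hardware it only showed sometimes, which is what made it hard to catch.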

One weird area in embedded systems development is called "fuzzing" where random inputs are shoved into the system to ensure it rejects all the invalid ones and doesn't choke on too much - like putting 20 characters into a 16 character field. An offshoot is to gradually lower the voltage to see when certain integrated circuits misbehave and the most obnoxious technique is to have an external computer gradually slow the system clock until the internal memory in CPUs is not refreshed fast enough and it starts to fail. The latter are usually used to crack encryption by getting the processors to fail and leave intermediate results in cache, but I suppose it could be used on a multi-processor system to force certain modules to boot faster or slower than nominal.
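
The "20 characters into a 16 character field" idea can be sketched in a few lines of Python. `accept_field` is a hypothetical validator, not taken from any real system, and the fuzz loop only checks one property: nothing longer than the field is ever accepted.

```python
import random
import string

FIELD_WIDTH = 16  # the 16-character field from the example above


def accept_field(text: str) -> bool:
    """Hypothetical validator: accept only printable ASCII that fits the field."""
    return len(text) <= FIELD_WIDTH and all(c in string.printable for c in text)


def fuzz(rounds: int, seed: int = 0) -> int:
    """Throw random strings at the validator; count overlong inputs it accepted."""
    rng = random.Random(seed)
    wrongly_accepted = 0
    for _ in range(rounds):
        length = rng.randrange(0, 41)
        candidate = "".join(chr(rng.randrange(0, 256)) for _ in range(length))
        if accept_field(candidate) and len(candidate) > FIELD_WIDTH:
            wrongly_accepted += 1
    return wrongly_accepted


print("overlong inputs wrongly accepted:", fuzz(rounds=1000))
```

A real fuzzing campaign would also watch for crashes and hangs, not just wrong answers; this only shows the shape of the loop.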

One great way to jam up a system is for a module to send a message to another module that isn't ready to receive any messages and then camp out and wait for a reply which will never come. Once again, timing is everything and it might be the importance of the message is so high that timing out and trying again isn't going to work; alternatively, if it is allowed to try again it can overwhelm the recipient with so many retries on a high-priority message that the recipient cannot act on the message and generate a reply before the originator dumps another one in the queue. For example, asking for status more rapidly than it takes to check the status; a variation on "are we there yet, are we there yet, are we there yet..."
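
One common way to avoid both traps (waiting forever, or flooding the recipient) is a bounded number of attempts with a growing wait between them. The sketch below is a deterministic Python model, not real bus code: `request_status`, the tick units and the doubling back-off are all assumptions for illustration, and whether giving up is acceptable at all depends on how important the message is.

```python
def request_status(ticks_until_ready: int, max_attempts: int = 4, first_wait: int = 1):
    """Ask, wait, back off; give up after max_attempts rather than hang forever.

    Returns (answered, attempts_used, ticks_elapsed).
    """
    elapsed = 0
    wait = first_wait
    for attempt in range(1, max_attempts + 1):
        elapsed += wait                   # wait this long for a reply
        if elapsed >= ticks_until_ready:  # recipient has had time to answer
            return True, attempt, elapsed
        wait *= 2                         # back off so retries do not flood it
    return False, max_attempts, elapsed


print(request_status(ticks_until_ready=5))
print(request_status(ticks_until_ready=100))
```

The caller still has to decide what to do when the answer is "no reply", which is a design question rather than a coding one.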

Dave Therhino

20th Jan 2020, 01:06

The surprising thing for me is that this appears to mean they have not yet flown the final fix. So are all those previous test flights useless now? Was this the reason for the test-flight hiatus - ie. not that they'd finished testing (as some said) but that the software wasn't final yet?

My understanding is that much of that flying was to demonstrate the characteristics of the airplane with MCAS disabled, ostensibly to "demonstrate system failure conditions (to classify failure effects)," but I suspect also to satisfy all the questions from foreign CAAs about unaugmented handling characteristics.

hunbet

20th Jan 2020, 06:21

Do you all understand that this is a common non-issue?

Many times after a software upload we would have this problem and would just revert to the last update.

You are talking about an airplane that is grounded. NON issue !

BDAttitude

20th Jan 2020, 06:59

Do you all understand that this is a common non-issue?

Many times after a software upload we would have this problem and would just revert to the last update.

You are talking about an airplane that is grounded. NON issue !
Well, it's back to the developers, getting a build from RC, doing the quality assurance, regression testing, reviews ... not before May, probably June would be my guess.
NON issue for an increasingly desperate company?

BDAttitude

20th Jan 2020, 07:07

These issues can also be hard to find during development and rig testing because you often have simulations of the interfacing systems, rather than the actual systems, and the simulations may not completely mimic the initialization characteristics of the actual systems.
Very true! However, if you mess about with task scheduling these things can happen, and if one is being rushed they most certainly will happen. These kinds of errors have to be tested out, unfortunately.

OldnGrounded

20th Jan 2020, 12:25

Do you all understand that this is a common non-issue?

Nope, I don't understand that, at all. For this to crop up at this point in the process (when Boeing has been telling the world for some time that MCAS 2.0 is ready to go and just needs FAA approval) suggests that (a) either the Boeing folks who have been talking to the world don't know what they're talking about or haven't been candid; and/or (b) the Boeing people actually doing the work mistakenly thought they had completed testing when they had not.

OldnGrounded

20th Jan 2020, 12:30

Quote:
Originally Posted by tdracer View Post (https://www.pprune.org/rumours-news/628965-new-software-issues-found-max.html#post10667048)
These issues can also be hard to find during development and rig testing because you often have simulations of the interfacing systems, rather than the actual systems, and the simulations may not completely mimic the initialization characteristics of the actual systems.

Very true! However, if you mess about with task scheduling these things can happen, and if one is being rushed they most certainly will happen. These kinds of errors have to be tested out, unfortunately.

All true, both of you. That said, there's really no escaping the fact that having this issue pop up at this time, after Boeing has told the world that all is ready for regulatory blessing, is a strong indication that something is not as it should be in the development-testing process.

clearedtocross

20th Jan 2020, 12:42

The problem with interactive computer systems with both synchronous and async interactions is that any amount of testing does not make sure they always work. You have to get it right by design, not by testing. In earlier days there were computer languages like Ada that supported (not guaranteed) a proper design with special system calls like the "rendezvous" and "resource locking". Such a design never works in a hurry and is extremely difficult to fix without creating more problems. This is the sort of a task that a clever single software engineer is better suited to solve than a hundred programmers working for a few dollars (or more).
I had a laugh last summer when Muilenburg boasted he could change the single sensor FCC into a communicating and comparing dual channel system in a matter of weeks. The laugh is still echoing.

Ian W

20th Jan 2020, 13:14

There are always problems like this that surface if you make changes to geriatric code. Quite often they are minor, with just parameters needing to be increased because the system is now doing extra work or has more interfaces, so a time-out needs to be longer or a count needs to allow a higher value. When you are down coding at machine code or, if you are lucky, at assembler level, and you are working on the physical machine, these small things can bite you. Had the original designer still been around, you might have been warned: don't change that value without altering this other - apparently unrelated - parameter. The problem with maintaining embedded code that was written before ideas of structure took hold, and written to fit inside the space/run-time available, is that whatever you change may cause an error somewhere. Maintenance programming at machine level is not a skill that is taught any more.

Fly Aiprt

20th Jan 2020, 13:24

That said, there's really no escaping the fact that having this issue pop up at this time, after Boeing has told the world that all is ready for regulatory blessing, is a strong indication that something is not as it should be in the development-testing process.

So true !
Things like that do happen, but what is unclear is why tests in a real aircraft come so late into the development timeline ?
Not a specialist, but is flight software so complicated as compared to - for instance - autonomous ground vehicle software ?

boaclhryul

20th Jan 2020, 15:06

...what is unclear is why tests in a real aircraft come so late into the development timeline ?...

@clearedtocross has it:

The problem with interactive computer systems with both synchronous and async interactions is that any amount of testing does not make sure they always work. You have to get it right by design, not by testing...

I've been in the computer data communications field for half a century now, not in aeronautics (that was my dad's job, on the Avro Arrow - though I gather that aircraft is also still grounded...). The design of interdependent, asynchronous systems (whether multiple tasks on a single piece of hardware, or inter-system communication via data buses) is not the proverbial "rocket science" but the result of careful understanding of each component, what it's dependent on, and how that dependence is handled.

Capable practitioners in the field, who we hope are integral to aeronautical design, understand critical sections, race conditions and how to avoid, etc. ... and apply that to their own designs, their overall contributions to their own firm's designs, and their specifications to suppliers.

OldnGrounded

20th Jan 2020, 16:19

@clearedtocross has it:

Quote:
Originally Posted by Fly Aiprt View Post (https://www.pprune.org/rumours-news/628965-new-software-issues-found-max.html#post10667369)
...what is unclear is why tests in a real aircraft come so late into the development timeline ?...

Quote:
Originally Posted by clearedtocross View Post (https://www.pprune.org/rumours-news/628965-new-software-issues-found-max.html#post10667345)
The problem with interactive computer systems with both synchronous and async interactions is that any amount of testing does not make sure they always work. You have to get it right by design, not by testing...

Yes, I think most of us here understand the above. The point I've been trying to make is that you learn whether or not you've gotten it right by testing and, if your system is failing to run, or failing, a POST/initialization check, you should probably notice that well before anyone on the team suggests that "it's finished/almost finished."

The situation in which Boeing and U.S. aviation in general find themselves today is one in which announcing vaporware is probably a serious mistake.

clearedtocross

20th Jan 2020, 18:06

For those who were trained to fly rather than to write real time software just a little example that shows the problem:
Imagine a single track railway connects two very remote stations A and B where there is usually only one or two trains travelling in each direction per day. The single track is protected by a signal light at both ends which shows red by default. When the driver at A wants to leave for B, he presses a "start" button and gets a green light at A while the light at B remains red even if the driver at B presses the button too. When driver A arrives at B, he presses the "end" button to release the line (and the lights). Obviously a driver at B would do the same in the opposite direction. Now this is tested and it works perfectly, again and again... Until one day, the buttons at both ends are pressed at exactly the same moment (let's discard Einstein's relativity theory and Heisenberg's uncertainty). What will happen? It depends on the guy who programmed the light control systems. If both lights remain red, you will get angry drivers. If both lights go green, you will get dead drivers and SLF. So the programmer must have thought about this possible problem and implemented some solution (like priority scheduling, look ahead locking etc.).

This is what I meant when I wrote that making the design safe is vital before something gets tested, because tests will not always reveal unlikely but still possible events (like the failure of a sensor). And in a complex system, it's far from easy and not to be done in a hurry.
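
The railway example maps directly onto an atomic "check and claim" operation. The Python sketch below is one possible arbitration, assumed for illustration and not anybody's real signalling logic; `SingleTrack`, `press_start` and `press_end` are hypothetical names. A non-blocking lock acquire either claims the track or fails, with nothing in between, so two simultaneous presses can never both get green.

```python
import threading


class SingleTrack:
    """Hypothetical signal controller for the single-track line in the example."""

    def __init__(self):
        self._track = threading.Lock()  # held while a train occupies the line

    def press_start(self) -> bool:
        """Return True (green) only if the track was free; check and claim are atomic."""
        return self._track.acquire(blocking=False)

    def press_end(self) -> None:
        """Release the line once the train has arrived."""
        self._track.release()


def simultaneous_press() -> list:
    """Both drivers press 'start' as close to simultaneously as threads allow."""
    track = SingleTrack()
    gate = threading.Barrier(2)
    lights = {}

    def driver(station: str) -> None:
        gate.wait()  # line both drivers up before they press
        lights[station] = track.press_start()

    threads = [threading.Thread(target=driver, args=(s,)) for s in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(lights.values())


# Exactly one driver gets green, whichever thread wins the race.
print(simultaneous_press())
```

Which driver wins is not defined here; a real design would also have to decide on priority and on what happens if the "end" button is never pressed.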

Ian W

20th Jan 2020, 18:11

Yes, I think most of us here understand the above. The point I've been trying to make is that you learn whether or not you've gotten it right by testing and, if your system is failing to run, or failing, a POST/initialization check, you should probably notice that well before anyone on the team suggests that "it's finished/almost finished."

The situation in which Boeing and U.S. aviation in general find themselves today is one in which announcing vaporware is probably a serious mistake.

The Boeing engineers did not have the luxury of starting from a clean sheet and "getting it right by design". They had stiffware that has been operational for years and had to modify the code so that the FCCs operated in a different way without changing anything that was not essential to change for the task at hand and without breaking any current functions. Maintenance programming especially of embedded code and modifying the code so it does things differently without affecting anything else is nothing like simple code writing. It is possible that some very basic timing issue made the live aircraft slightly different to the avionics test bench. This is the reason regression tests are run when the new code is ported to and implemented in the aircraft - and the tests found an issue - that is what the tests are for.

Fly Aiprt

20th Jan 2020, 18:32

Thanks for all who responded and provided examples.
Indeed my Java teachers taught us to consider and catch exceptions.

The nagging question is, considering the differences between their "engineering cab" and a real airplane, why were the real software flight tests performed so late (December)?
What kept them from flying the thing and doing ramp tests 6 or 9 months ago?

Or is it another aspect of the "no need to fly the real thing before" mentality?
"No need for sim time, just a tablet", "no need to test fly, just run the cab"...
Makes you wonder..

OldnGrounded

20th Jan 2020, 19:59

The Boeing engineers did not have the luxury of starting from a clean sheet and "getting it right by design". They had stiffware that has been operational for years and had to modify the code so that the FCCs operated in a different way without changing anything that was not essential to change for the task at hand and without breaking any current functions. Maintenance programming especially of embedded code and modifying the code so it does things differently without affecting anything else is nothing like simple code writing. It is possible that some very basic timing issue made the live aircraft slightly different to the avionics test bench. This is the reason regression tests are run when the new code is ported to and implemented in the aircraft - and the tests found an issue - that is what the tests are for.

I don't have any argument with what you have written, except that it really isn't responsive to my post to which you are responding. Maybe you clicked in the wrong post?

fergusd

20th Jan 2020, 22:00

Thanks for all who responded and provided examples.
Indeed my Java teachers taught us to consider and catch exceptions.

The nagging question is, considering the differences between their "engineering cab" and a real airplane, why were the real software flight tests performed so late (December)?
What kept them from flying the thing and doing ramp tests 6 or 9 months ago?

Or is it another aspect of the "no need to fly the real thing before" mentality?
"No need for sim time, just a tablet", "no need to test fly, just run the cab"...
Makes you wonder..

Your experience of safety critical software development must be a little rusty . . .

Fly Aiprt

20th Jan 2020, 23:03

Your experience of safety critical software development must be a little rusty . . .

Indeed, I only developed non safety critical simple software.

But does that imply that it is valid to defer an aircraft critical software testing in the real airplane until the last moment?
That is if Boeing now considers the MCAS as a safety critical software.

MechEngr

21st Jan 2020, 02:36

Thanks for all who responded and provided examples.
Indeed my Java teachers taught us to consider and catch exceptions.

The nagging question is, considering the differences between their "engineering cab" and a real airplane, why were the real software flight tests performed so late (December)?
What kept them from flying the thing and doing ramp tests 6 or 9 months ago?

Or is it another aspect of the "no need to fly the real thing before" mentality?
"No need for sim time, just a tablet", "no need to test fly, just run the cab"...
Makes you wonder..

Java? Java does a ton of work and hides a lot of details behind a pile of software that won't fit on an FCC and may or may not be busy doing something that you don't want to do when something important needs to be done. Welcome garbage collect.

Want to learn programming - C or assembler or hand code machine language. Java is to programming what MS Flight Simulator is to an F-15 Eagle.

Tobin

21st Jan 2020, 04:50

No need to pile on Fly Aiprt. True, coding Java isn't really comparable to real-time systems in assembler (something I haven't done, either) but the very valid point stands:

Why wasn't this being tested on real hardware before? Any new internal software version, and certainly any "release candidate" build, should be moved from the developer's machine, to some production test lab, to a real system, expeditiously.

BDAttitude

21st Jan 2020, 06:29

Indeed, as flying would not even have been necessary. Loading and starting up would have been enough.

Overconfidence in lab setups is, by the way, another common theme. The Ariane 5 failure was in that category (posted link before, but don't have it at hand right now). And it looks like the Starliner failure will be as well.

crankyanker

21st Jan 2020, 07:18

Indeed my Java teachers taught us to consider and catch exceptions.

Hopefully the compiler did the same, as Java will be somewhat forceful about handling exceptions and requires you to be somewhat explicit about which exceptions will be thrown.

The nagging question is, considering the differences between their "engineering cab" and a real airplane, why were the real software flight tests performed so late (December)?
What kept them from flying the thing and doing ramp tests 6 or 9 months ago?

From an outside POV Boeing appears to have rather lax software development processes (and nearly no QA) in place.

Java? Java does a ton of work and hides a lot of details behind a pile of software that won't fit on an FCC and may or may not be busy doing something that you don't want to do when something important needs to be done. Welcome garbage collect.

Want to learn programming - C or assembler or hand code machine language. Java is to programming what MS Flight Simulator is to an F-15 Eagle.
Valiant attempt at gatekeeping aside, there is indeed a formal variant of Java designed for real-time systems (RTSJ). For something developed in the 90s based on older hardware and software I'd expect that Ada was the language of choice. The DoD developed Ada explicitly for real-time safety-critical systems, and that's why Boeing chose it for the 777. But I digress.

clearedtocross

21st Jan 2020, 08:26

The nagging question is, considering the differences between their "engineering cab" and a real airplane, why were the real software flight tests performed so late (December)?
What kept them from flying the thing and doing ramp tests 6 or 9 months ago?

I guess the answer to this question is easy: there was no final version and there still is none. Of course, there was a lot of testing going on, even many flight tests, but they possibly failed to meet requirements. Which means back to the desk to make corrections and try again. Programming by trial and error. Not the safest and not the fastest method, could well be never-ending.

By the way, object oriented languages (like Java, C++ etc.) cannot be used to program controllers because objects need dynamic memory allocation, and lots of memory and address space are simply not available in controllers. But - used with care - there is nothing wrong with programming in C (like the ubiquitous Arduino) or standard Fortran 4 or even an Assembler, provided the specifications and the code are well documented and kept up to date.

n5296s

21st Jan 2020, 09:03

object oriented languages (like Java, C++ etc. ) cannot be used to program controllers
Rubbish. You can do anything in C++ that you can do in C, and a great deal safer. I speak from much experience. People often confuse C++ with more "automagic" languages like Java and Ruby. C++ has no garbage collection, and if you want to manage memory allocation yourself, including doing everything statically, it's easy. In 2020 doing any major new code in C is a sign of insanity. Of course if you already have a legacy of 10 million lines of C code, like my erstwhile employer, you're a bit stuck with it.

crankyanker

21st Jan 2020, 09:16

By the way, object oriented languages (like Java, C++ etc.) cannot be used to program controllers because objects need dynamic memory allocation, and lots of memory and address space are simply not available in controllers. But - used with care - there is nothing wrong with programming in C (like the ubiquitous Arduino) or standard Fortran 4 or even an Assembler, provided the specifications and the code are well documented and kept up to date.

Rubbish. You can do anything in C++ that you can do in C, and a great deal safer. I speak from much experience. People often confuse C++ with more "automagic" languages like Java and Ruby. C++ has no garbage collection, and if you want to manage memory allocation yourself, including doing everything statically, it's easy.

Arduino is largely based on a bastardized C++ environment, but there are indeed C and Java odds and ends for it. C++ is not a superset of C (that's Objective C/C++), so you can do most but not all things you can do in C (although the gap is closing with the more recent C standards). It's highly unlikely that the 737 made large use of assembly, C, or any C-like derivatives though as none will offer the guarantees you want for safety-critical systems.

You absolutely can allocate memory statically in C++ although microcontrollers these days are more capable than desktop processors were thirty years ago. Hell, you can run Java on the 8-bit AVR microcontrollers that made Arduino famous (as well as newer ARM based stuff like the STM32 line).

In 2020 doing any major new code in C is a sign of insanity. Of course if you already have a legacy of 10 million lines of C code, like my erstwhile employer, you're a bit stuck with it.

On a microcontroller that's not necessarily true. The DoD walked away from Ada around the time the NG was released. It's too strict, too archaic, and schools have largely moved away from Pascal based languages (e.g. Modula-3) for teaching in favor of languages with a C-like syntax (e.g. Java). These days you absolutely will see safety-critical systems written with *NEW* C/C++ code that attempt to adhere to the MISRA-C standards. That's not the trade off I would've made but finding competent Ada programmers is an expensive endeavor.

Programming by trial and error. Not the safest and not the fastest method, could well be never-ending.
Ultimately this is likely the problem, and a rather horrifying one at that. That style (and level) of testing wouldn't pass muster at some wanky startup developing a social network for your pet and it ought not be the way that a company selling $50 million flying aluminum tubes conducts business.

Fly Aiprt

21st Jan 2020, 09:17

Thanks Tobin, BDAttitude and clearedtocross.
Of course nobody suggested any FCC could be programmed in Java. That was an example to show that the basics of specifying code are all about managing exceptions and crosschecks. Sorry if it wasn't clear.

Here is a link to some research done in programming a 737 version some years ago.
http://www.cse.cuhk.edu.hk/~lyu/paper_pdf/00005291.pdf

Interesting to see what languages were tried, and what delays it takes for even some experimental version.
C, C++, Pascal and Ada are not uncommon.

And yes it appears the MCAS "fix" has been rushed and not really tested in the real world.
One wonders what would have happened if FAA was still under the influence of Boeing and had returned the airplane to flight...

Imagegear

21st Jan 2020, 09:59

At the outset, I should say I have no experience of coding FCC stuff, so I could be talking through the wrong orifice however:

Reading the above, if the stability of the FCC cannot be assured using C++, which I know generally to be very stable, could a problem be occurring in the interface between the FCC hardware and the code? Is it possible that an asynchronous interrupt from an external sensor (or sensors) is not being set consistently by the hardware, and when the code goes to look for the bit(s), they are not there? Of course when you are in a hurry, the last thing to get done is the error reporting and recovery code for a missing interrupt.
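For what it's worth, the "missing interrupt" case can be sketched in a few lines. This is only an illustration of the idea, assuming a flag set by an interrupt handler and a bounded wait on the reading side; `sensor_isr`, `read_sensor` and the status names are hypothetical and say nothing about how the real FCC does it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum class ReadStatus { Ok, MissingInterrupt };

// Set by a (hypothetical) interrupt handler when the sensor has new data.
static std::atomic<bool> g_data_ready{false};
static std::atomic<std::int32_t> g_latest_value{0};

// Stand-in for the interrupt service routine.
void sensor_isr(std::int32_t value) {
    g_latest_value.store(value);
    g_data_ready.store(true);
}

// Poll the flag a bounded number of times; never spin forever.
ReadStatus read_sensor(std::int32_t& out, unsigned max_polls) {
    assert(max_polls > 0);
    for (unsigned i = 0; i < max_polls; ++i) {
        // exchange(false) consumes the flag so stale data is not read twice.
        if (g_data_ready.exchange(false)) {
            out = g_latest_value.load();
            return ReadStatus::Ok;
        }
    }
    // The bit never appeared: report it so the caller can recover.
    return ReadStatus::MissingInterrupt;
}
```

The point is that the "not there" case gets its own return value, so the recovery path has to be written and can be tested, instead of being the thing left until last.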

IG

threep

21st Jan 2020, 10:58

Well, yeah. It isn't quite "nothing to see here", but this is exactly the sort of failure you can sometimes get with any software system going from test or staging environments (eng sim) to production - test never quite does things exactly the same. Looks like failure happened at the right place/time (ie. on the ground) and was caught by the existing self-checks.

The surprising thing for me is that this appears to mean they have not yet flown the final fix. So are all those previous test flights useless now? Was this the reason for the test-flight hiatus - ie. not that they'd finished testing (as some said) but that the software wasn't final yet?

You can go to flight test with control software which hasn't yet been certified (the certification candidate software for example). It would have to pass a whole raft of testing and quality gates to show it is fit for flight test. Unit test, software bench test, hardware/software integration testing, black-box systems test and iron-bird test. You hope to catch any problems/bugs as early in that process as you can, because it's quicker, simpler and cheaper to fix it at that stage. But it's the nature of engineering complex systems such as aircraft that some problems only manifest when you hook everything together.

Flight test results which allow system parameters to be optimised will still be valid even if you have to go back and fix some built-in-test code. But the unit test, software bench test, hardware/software integration testing etc would all have to be repeated for the functions affected by any software change.

Turb

21st Jan 2020, 10:59

I feel sorry for the code-writers who are working on this. I mean the ones actually doing the job, trying to understand and solve the problems of the system they need to change, while the rest of the company and its suppliers, thousands and thousands of people mostly with families and mortgages, just stand around twiddling their thumbs and praying that today is the day the damn thing actually works. Just think what that must feel like for the coders, knowing that all those thousands of other people are kind of peering over their shoulders and willing them to stop messing about and just fix this, right now. And I bet the team of coders is tiny. It has to be tiny. The job can't be done by a huge team, it's not that sort of job. Throwing extra resources at it would be nonsense - in fact it would be counter-productive. If I'm right there's this tiny group of very clever people coming in to work each day, to spend another day struggling to dig the whole Boeing company out of the hole it's in, none of them making the sort of money the Boeing board does, and none of them responsible for the design errors which caused the problem in the first place. So I feel sorry for them.

Ian W

21st Jan 2020, 12:31

Thanks Tobin, BDAttitude and clearedtocross.
Of course nobody suggested any FCC could be programmed in Java. That was an example to show that the basics of specifying code are all about managing exceptions and crosschecks. Sorry if it wasn't clear.

Here is a link to some research done in programming a 737 version some years ago.
http://www.cse.cuhk.edu.hk/~lyu/paper_pdf/00005291.pdf

Interesting to see what languages were tried, and what delays it takes for even some experimental version.
C, C++, Pascal and Ada are not uncommon.

And yes it appears the MCAS "fix" has been rushed and not really tested in the real world.
One wonders what would have happened if FAA was still under the influence of Boeing and had returned the airplane to flight...

I think it is a little unfair to say that the MCAS fix has been rushed. The requirement on the FCCs was completely changed; this had nothing to do with MCAS, apart from MCAS being the reason the FCC design assumptions were revisited. So now, from what has been said, they don't run independently; they have to work with one shadowing the other. This in some areas is a large logic change. There will be a standard development test sequence: software unit test, bench test with avionics connected, then an aircraft simulation cab, then a live aircraft. At each stage testing is carried out at a different level, followed by full regression testing to ensure nothing in existing code has been broken. It looks like a regression test on the live aircraft in one of the recent tests found a problem. That is why regression testing is done. As someone who has spent many 'happy' hours regression testing systems I can assure you that, whatever the expertise of the programmer and whatever the pressure from management, it is rare that there will not be problems moving to the live environment, especially in real time systems. The subsequent fault fixes may even cause new issues to be found in what had been working code.
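For readers who have not done it, regression testing in its simplest form looks something like the sketch below: every case recorded from the last accepted build is re-run after each change, not just the case for the new fix. The function `clamp_command` and the numbers in the table are invented for illustration and have no connection to the actual FCC code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical function under maintenance: limit a command to +/- limit.
double clamp_command(double cmd, double limit) {
    assert(limit >= 0.0);
    if (cmd > limit) return limit;
    if (cmd < -limit) return -limit;
    return cmd;
}

struct RegressionCase {
    double cmd;
    double limit;
    double expected;  // output recorded from the last accepted build
};

// The table only ever grows: each fixed bug adds a case.
static const RegressionCase kCases[] = {
    {0.5, 2.5, 0.5},
    {3.0, 2.5, 2.5},
    {-3.0, 2.5, -2.5},
    {2.5, 2.5, 2.5},
};

// Returns the number of cases whose output no longer matches the record.
std::size_t run_regression() {
    std::size_t failures = 0;
    for (const RegressionCase& c : kCases) {
        if (clamp_command(c.cmd, c.limit) != c.expected) {
            ++failures;
        }
    }
    return failures;
}
```

A non-zero count from `run_regression` after a change means something that used to work has been broken, which is exactly the news nobody wants to deliver to the team.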

etudiant

21st Jan 2020, 12:34

I feel sorry for the code-writers who are working on this. I mean the ones actually doing the job, trying to understand and solve the problems of the system they need to change, while the rest of the company and its suppliers, thousands and thousands of people mostly with families and mortgages, just stand around twiddling their thumbs and praying that today is the day the damn thing actually works. Just think what that must feel like for the coders, knowing that all those thousands of other people are kind of peering over their shoulders and willing them to stop messing about and just fix this, right now. And I bet the team of coders is tiny. It has to be tiny. The job can't be done by a huge team, it's not that sort of job. Throwing extra resources at it would be nonsense - in fact it would be counter-productive. If I'm right there's this tiny group of very clever people coming in to work each day, to spend another day struggling to dig the whole Boeing company out of the hole it's in, none of them making the sort of money the Boeing board does, and none of them responsible for the design errors which caused the problem in the first place. So I feel sorry for them.

Think you are spot on!
We know that the MAX still uses 286s, very limited computers dating back to the 1980s. I suspect that the software is written in assembler language rather than some modern high level language, because that allows the limited hardware to be exploited to the maximum.
Unfortunately, such near machine language programming is a bear to write correctly and a beast to debug. So it is not a popular pursuit, especially as computing resources are normally dirt cheap compared to software writers.
If this supposition is correct, the work now falls on the small bunch of surviving veterans left over after the waves of 'efficiency improvements' cut the headcounts. There is no backup Team B available, nor could one create such from scratch.

Ian W

21st Jan 2020, 12:37

I feel sorry for the code-writers who are working on this. I mean the ones actually doing the job, trying to understand and solve the problems of the system they need to change, while the rest of the company and its suppliers, thousands and thousands of people mostly with families and mortgages, just stand around twiddling their thumbs and praying that today is the day the damn thing actually works. Just think what that must feel like for the coders, knowing that all those thousands of other people are kind of peering over their shoulders and willing them to stop messing about and just fix this, right now. And I bet the team of coders is tiny. It has to be tiny. The job can't be done by a huge team, it's not that sort of job. Throwing extra resources at it would be nonsense - in fact it would be counter-productive. If I'm right there's this tiny group of very clever people coming in to work each day, to spend another day struggling to dig the whole Boeing company out of the hole it's in, none of them making the sort of money the Boeing board does, and none of them responsible for the design errors which caused the problem in the first place. So I feel sorry for them.

In some respects it's worse than that. Each time they 'fix' the code they then have to test the fix and regression test the system to try to break what they have just built. It's a difficult task telling the team that the system failed one of the final sequence of regression tests.

clearedtocross

21st Jan 2020, 13:25

Rubbish. You can do anything in C++ that you can do in C, and a great deal safer. I speak from much experience. People often confuse C++ with more "automagic" languages like Java and Ruby. C++ has no garbage collection, and if you want to manage memory allocation yourself, including doing everything statically, it's easy. In 2020 doing any major new code in C is a sign of insanity. Of course if you already have a legacy of 10 million lines of C code, like my erstwhile employer, you're a bit stuck with it.

Ok, rubbish if you like. But nobody suggested the FCC should be written on a 286 in C in 2020. That would be truly insane. It's questionable if the FCC should have been developed from scratch on a new platform and new soft and comms resources, but here we are in the old hamster wheel territory regarding re-do or update.

The last thing I would like to do is trigger off a debate in this forum about what's the best program language for this and that. However, I would bite my tongue off without the remark that if you do not use all the fancy ++ stuff of C++ then you end up rather close to ... you guessed it.

But thank you very much, Fly Aiprt, for the link, very interesting read and the results are no surprise to me.

Luc Lion

21st Jan 2020, 15:10

I believe that the B737 FCC software is written in Ada.
I am not 100% certain of it, but I know that the FCC of the Classic series (300, 400, 500) was written in Ada, that the FMS of nowadays B737 (NG and MAX) is in Ada, that Collins Aerospace (the FCC subcontractor) works mostly in Ada, and 99% of the B777 code is written in Ada.

Note: I believe that earlier versions of the FCC were written in FORTRAN.

FrequentSLF

21st Jan 2020, 16:00

I would assume that the main issues are the interrupt handling, working with the interrupts on the 286 has always been a pain.
What surprises me is that they did not set up a fully functional rig, they have 400 737s parked, and could have used one as a lab rig for the development of the SW, is that something difficult to do?

MechEngr

21st Jan 2020, 16:13

I would assume that the main issues are the interrupt handling, working with the interrupts on the 286 has always been a pain.
What surprises me is that they did not set up a fully functional rig, they have 400 737s parked, and could have used one as a lab rig for the development of the SW, is that something difficult to do?
That would only handle the first 5 seconds or so of the software operation. As soon as any mechanical system was required you need a plane that is ready to be flown, as in fueled and generators and engines running. Otherwise it is stuck like a non-running car turned to the Run position, but not started, and is just an idiot light check. I think the risk of a ready to fly with engines-running plane in a parking lot is more of a problem.

n5296s

21st Jan 2020, 16:58

set up a fully functional rig
Yep. Heck, when I wrote software to run my garden railway, I wrote a simulator for it that emulated everything the actual hardware would do. Nobody was likely to get killed, maybe the cat would have got a nasty surprise, but it just makes everything a great deal easier.

Of course you CAN still be surprised by something the simulator didn't do accurately (just like a flight sim) but you reduce the chances and you pay special attention to the places where that might make a difference.

Luc Lion

21st Jan 2020, 16:59

FrequentSLF please see here for interrupt support in Ada.
https://www.adaic.org/resources/add_content/standards/12rm/html/RM-C-3.html

OldnGrounded

21st Jan 2020, 17:07

That would only handle the first 5 seconds or so of the software operation. As soon as any mechanical system was required you need a plane that is ready to be flown, as in fueled and generators and engines running. Otherwise it is stuck like a non-running car turned to the Run position, but not started, and is just an idiot light check. I think the risk of a ready to fly with engines-running plane in a parking lot is more of a problem.

I seriously doubt that Boeing would find it difficult to arrange to test on a ready-to-fly MAX in a safe and convenient location.

MarcK

21st Jan 2020, 17:23

The 286 was introduced 2/1982, and notified as end-of-life (last ship) 3/1999.

fergusd

21st Jan 2020, 17:54

I would assume that the main issues are the interrupt handling, working with the interrupts on the 286 has always been a pain.
What surprises me is that they did not set up a fully functional rig, they have 400 737s parked, and could have used one as a lab rig for the development of the SW, is that something difficult to do?

What surprises me is that they don't have a full simulator, which simulates the aircraft to the control system (not the same as the simulator the fly-boys sit in and play). In this day and age it is criminally inept not to do so. 30 years ago we fully simulated trains to the train control systems to ensure that when the control system hit the tracks in a train it worked, including everything from signalling systems to wheel slip on wet rails, same goes on today in the automotive industry, down to the simulation of the chassis/road/surface/tyre, etc, etc. I have never delivered a control system capable of killing people which was not substantially tested on a full capability simulator which could inject any fault or situation possible to the software to ensure it dealt with it, however impossible to believe.

A commercial aircraft is not any more complex than a modern fly by wire train . . . probably not even an order of magnitude more complex than a modern car . . .

Surely . . Boeing don't do this on a real plane . . . like people used to do in the stone age . . .

Edited to add : testing as described is only a small part of the testing required, factor the cost of testing being >2-4x the cost of writing the software in the first place . . . and that is conservative . . . which answers the Java programmer's question of why it's taken so long to get round to testing it on an aircraft . . .
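As a toy illustration of what "inject any fault" means in practice, here is a minimal sketch in C++. It assumes two simulated sensor channels and a simple agreement check as the software under test; every name here (`SimSensor`, `Fault`, `reading_is_trusted`, `step`) and the 1.0 tolerance are invented for the example and are not a description of any real rig.

```cpp
#include <cassert>
#include <cmath>

enum class Fault { None, StuckHigh, Dropout };

// Simulated sensor: reports the true value unless a fault is injected.
struct SimSensor {
    Fault fault = Fault::None;

    double read(double true_value) const {
        switch (fault) {
            case Fault::StuckHigh: return 99.0;          // frozen at an implausible value
            case Fault::Dropout:   return std::nan("");  // no valid data at all
            case Fault::None:      break;
        }
        return true_value;
    }
};

// Hypothetical software under test: trust the reading only if both
// channels are valid and agree within a tolerance.
bool reading_is_trusted(double a, double b, double tolerance) {
    assert(tolerance >= 0.0);
    if (std::isnan(a) || std::isnan(b)) return false;
    return std::fabs(a - b) <= tolerance;
}

// One simulation step: feed the same "true" value through both sensors.
bool step(const SimSensor& left, const SimSensor& right, double true_value) {
    return reading_is_trusted(left.read(true_value), right.read(true_value), 1.0);
}
```

A real simulator does this for every input the control system sees, and the test campaign walks through the fault combinations one by one, which is a large part of why the testing costs more than the coding.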

Ian W

21st Jan 2020, 18:03

The 286 was introduced 2/1982, and notified as end-of-life (last ship) 3/1999.

Possibly as a commercial PC chip but there are manufacturers making 80286 boards as aviation uses a lot of them. The reason is that in the late 1980's the safety people in aviation were really taken with the idea of 'formal proof' of all the processing in avionics. I can remember briefings back then on how this would remove errors - totally disregarding the fact that hardware manufacturing errors and even software coding errors were even then getting rarer and could be handled. The worst kind of error was, as in MCAS, a perfectly implemented poor design. Nevertheless, it became a requirement that the hardware had to be formally proved. Then along came multi-core chips with intelligent prefetch and with preemption and multiprocessing... and the 'formal proof' rapidly became an extremely hard NP problem. So 80286's continue to be used as they can be formally proved. And this in itself is a problem as they are 'beasts of very little brain' especially when trying to handle a multiplicity of interrupts.
Is Ancient Silicon The Root Of Boeing’s Problems? (https://www.palisadeshudson.com/2019/07/is-ancient-silicon-the-root-of-boeings-problems/)
It is about time that the entire area is revisited as the handheld game player the 4 year old has in seat 28G is many times more powerful than the FCCs. All smart phones are way way above the capabilities of the CPU in the FCC. Like communications the time has come to abandon some of these old 'safety' ideas and move to a more commercial approach as is used in other areas. Size, Weight and Power are not an issue these days.

Ian W

21st Jan 2020, 18:12

What surprises me is that they don't have a full simulator, which simulates the aircraft to the control system (not the same as the simulator the fly-boys sit in and play). In this day and age it is criminally inept not to do so. 30 years ago we fully simulated trains to the train control systems to ensure that when the control system hit the tracks in a train it worked, including everything from signalling systems to wheel slip on wet rails, same goes on today in the automotive industry, down to the simulation of the chassis/road/surface/tyre, etc, etc. I have never delivered a control system capable of killing people which was not substantially tested on a full capability simulator which could inject any fault or situation possible to the software to ensure it dealt with it, however impossible to believe.

A commercial aircraft is not any more complex than a modern fly by wire train . . . probably not even an order of magnitude more complex than a modern car . . .

Surely . . Boeing don't do this on a real plane . . . like people used to do in the stone age . . .

Edited to add : testing as described is only a small part of the testing required, factor the cost of testing being >2-4x the cost of writing the software in the first place . . . and that is conservative . . . which answers the Java programmer's question of why it's taken so long to get round to testing it on an aircraft . . .

Like this system you mean? (https://www.thepeaches.com/bios/Boeing.htm)

MechEngr

21st Jan 2020, 18:14

I seriously doubt that Boeing would find it difficult to arrange to test on a ready-to-fly MAX in a safe and convenient location.

Which is what they did. But it isn't something that seems reasonably arranged for the multiple compilations each day that might be done while working with simulator testing.

MarcK

21st Jan 2020, 18:27

Like this system you mean? (https://www.thepeaches.com/bios/Boeing.htm)

Where the simulation computers are 1990's vintage?

Imagegear

21st Jan 2020, 18:31

So 80286's continue to be used as they can be formally proved

….and therein I suspect lies the problem, being "formally proved" may not necessarily mean "Failsafe". I would like to think that error reporting and recovery rapidly returns the system (hardware and software) to a state in which the integrity of the platform is not compromised.

What of "formally proved" now? How much fix testing and regression testing is required to regain the "formally proved" status? It is not sufficient to prove just the hardware and software since a layer of microcode exists as the sandwich filling within the ALU and PROM/ low order RAM portion of the 286 which may not be tolerating some minor timing discrepancy on either side of the interface to cause the failure. Is the 286 perhaps being driven to its critical limits?

IG

MechEngr

21st Jan 2020, 18:37

The 286 was introduced 2/1982, and notified as end-of-life (last ship) 3/1999.
That would be the Intel 80286, I expect. However Intel would certainly license the core to embedded systems developers for production at rates and for markets in which Intel could no longer afford to maintain fabrication. For example: Renesas 80C286 went EOL in 2017. Would not be surprised if someone bought 5000 of them for the shelf.

I would expect the next generation processor to be an FPGA rather than a microprocessor. I see some links exploring Ada on FPGAs, but it seems to be programming the FPGA to act like a microprocessor.

fergusd

21st Jan 2020, 19:44

Like this system you mean? (https://www.thepeaches.com/bios/Boeing.htm)

Nope . . .

Fd

crankyanker

21st Jan 2020, 21:48

I would assume that the main issues are the interrupt handling, working with the interrupts on the 286 has always been a pain.

In what sense? A lot of what made the 286 unpopular was its poor backwards compatibility with DOS which was designed for real mode, that ought not be an issue on a system that's hopefully not trying to run DOS.

That would be the Intel 80286, I expect. However Intel would certainly license the core to embedded systems developers for production at rates and for markets in which Intel could no longer afford to maintain fabrication. For example: Renesas 80C286 went EOL in 2017. Would not be surprised if someone bought 5000 of them for the shelf.

Boeing almost certainly uses a radiation hardened processor (MIL-STD-883) and not a standard desktop CPU.

I would expect the next generation processor to be an FPGA rather than a microprocessor. I see some links exploring Ada on FPGAs, but it seems to be programming the FPGA to act like a microprocessor.

I'd be surprised here too. These days you can get hardened RISC and PPC based processors that would provide a ton more power and flexibility than a 286-based design. There's also probably a push to use lower cost off-the-shelf components.

clearedtocross

22nd Jan 2020, 08:42

Now the poor software guys working on the max get more time. The first move of B's new management in the right direction to get the max flying again. I hope they clear up some other issues as well until summer (2020 !).
And thank you all for your valuable contributions and explanations! Has been very enlightening and is good to see that pprune also hosts know-how complementary to stick and rudder.

RomeoTangoFoxtrotMike

22nd Jan 2020, 09:30

The Boeing engineers did not have the luxury of starting from a clean sheet and "getting it right by design". They had stiffware that has been operational for years and had to modify the code so that the FCCs operated in a different way without changing anything that was not essential to change for the task at hand and without breaking any current functions. Maintenance programming especially of embedded code and modifying the code so it does things differently without affecting anything else is nothing like simple code writing. It is possible that some very basic timing issue made the live aircraft slightly different to the avionics test bench. This is the reason regression tests are run when the new code is ported to and implemented in the aircraft - and the tests found an issue - that is what the tests are for.

If it helps, think of it like this...

https://twitter.com/campuscodi/status/1190906457954365440

aviatorhi

23rd Jan 2020, 18:24

Boeing will never get it right... the "new" management is still just as investor focused as the old. One year of dividends and buybacks could have financed a clean sheet design. Sad to see an engineering icon like Boeing continue to bury itself.