PPRuNe Forums - MAX’s Return Delayed by FAA Reevaluation of 737 Safety Procedures
8th Nov 2019, 23:00
  #3827
Water pilot
 
Join Date: Mar 2015
Location: Washington state
Posts: 209
Quote:

By contrast, the 737 Max had two separate computers. One operated the flight systems and another was available if the first one failed, with the roles switching on each flight. But they interacted only minimally.

Boeing decided to make the two systems monitor each other so that each computer can halt an erroneous action by the other. This change is an important modernization that brings the plane more in line with the latest safety technology but raised highly complex software and hardware issues.
What Boeing is trying to do here is really hard, even leaving out the fact that they are using a 286 (which is not ancient, you damn whippersnappers!), a homebrew 'operating system', and probably have zero institutional memory of the baseline code. The estimate "Where before you may have had 10 scenarios to test, I could see that being 100" is off by several orders of magnitude. I know, I did this kind of stuff and it is some of the most frustrating programming you can imagine. (It is a constant war between "data corruption" and "everything locks up"; anyone can understand the theory of critical sections, but in practice...) It sounds so simple in the boardroom and looks so neat on the whiteboard, but in real life it is a nightmare. I was part of a team that was pretty familiar with this sort of issue, and a new major release would take about two years to get properly tested; even after release, for the next three months or so you could count on getting a call in the middle of the night about data corruption. (I won't go much further, but let's just say I have concerns about our financial system as well as other data-critical enterprises such as elections.)
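To make the "corruption versus lockup" war concrete, here is a toy pthreads sketch (plain C, entirely my own, nothing to do with Boeing's code, their 286, or their homebrew OS): the same pair of shared values, mishandled in the two classic ways.

```c
/* Toy illustration of the "data corruption" vs. "everything locks up" war.
 * This is a generic pthreads sketch -- nothing here resembles Boeing's
 * actual code. */
#include <pthread.h>
#include <stdio.h>

static long balance_a = 1000, balance_b = 1000;
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* "Data corruption" flavour: no locking at all.  Two threads interleave
 * the read-modify-write cycles and silently lose updates. */
static void *transfer_unlocked(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        balance_a -= 1;
        balance_b += 1;
    }
    return NULL;
}

/* "Everything locks up" flavour: each thread takes the two locks in the
 * opposite order.  Run two of these against each other and sooner or later
 * each holds the lock the other is waiting for -- a classic deadlock.
 * (Deliberately never launched below, because it would eventually hang.) */
static void *transfer_deadlocky(void *arg) {
    int reversed = (arg != NULL);
    pthread_mutex_t *first  = reversed ? &lock_b : &lock_a;
    pthread_mutex_t *second = reversed ? &lock_a : &lock_b;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(first);
        pthread_mutex_lock(second);
        balance_a -= 1;
        balance_b += 1;
        pthread_mutex_unlock(second);
        pthread_mutex_unlock(first);
    }
    return NULL;
}

int main(void) {
    (void)transfer_deadlocky;   /* see comment above */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, transfer_unlocked, NULL);
    pthread_create(&t2, NULL, transfer_unlocked, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With correct locking this would print a=-199000 b=201000; racy runs
     * routinely lose updates and print something else. */
    printf("unlocked: a=%ld b=%ld\n", balance_a, balance_b);
    return 0;
}
```

Run the unlocked version a few times and you get a different wrong answer each time; swap in the lock-reversing version and sooner or later the whole thing simply stops. That is the war in miniature, and it is why the "10 scenarios becomes 100" estimate is so badly low.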

The plan to have computer #1 inspect computer #2 and halt it if it thinks it is in error is about as asinine as swapping the roles of the computers each flight. As someone who moved on to playing with real-world systems with big things that sometimes stop moving, the last thing you want when trying to figure out what went wrong is a system that swaps which processor/sensor is in use based on some unknown criteria. Flight #1 exhibits a critical problem; test flight #2 shows 'no trouble found'; Flight #3 goes back to the bad processor and crashes. What frigging genius came up with that? Oh, the same genius who swapped the 'active' AOA indicator for MCAS.
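Here is that scenario as a ten-line toy (hypothetical unit numbering and swap rule, my own, not the actual FCC alternation logic): a latent fault in one unit shows up only on the flights where that unit happens to be active.

```c
/* Hypothetical sketch of the role-swapping problem -- invented numbering
 * and logic, not the 737's actual scheme.  A latent fault in unit 1 only
 * matters on the flights where unit 1 happens to be the active one. */
#include <stdio.h>

int main(void) {
    int faulty_unit = 1;                 /* assume a latent fault in FCC #1 */
    for (int flight = 1; flight <= 6; flight++) {
        int active = flight % 2;         /* roles swap every flight: 1,0,1,0,... */
        printf("Flight %d: active FCC #%d -> %s\n",
               flight, active,
               active == faulty_unit ? "problem reported" : "no trouble found");
    }
    /* The troubleshooting flight lands on the healthy unit, the snag gets
     * closed as 'no trouble found', and the next flight is back on the bad
     * one. */
    return 0;
}
```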

However, aside from that, what they are talking about is not only really hard, but now you have to test scenarios of erroneous computer shutdown at any frigging point in the flight. This is really the same rancid logic behind MCAS: a solution for an extremely rare event now creates its own problem in much more common situations. How many benign problems in the processing code are now going to trigger this 'kill' subroutine? What happens if the two computers get into a war with each other? How robust is the communication line between the computers, which was probably never designed to deal with the amount of data that now has to be transferred?
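To put the 'kill' subroutine and "war between the computers" worries in concrete terms, here is a hypothetical cross-monitor sketch (invented thresholds, invented glitch, plain C, not Boeing's monitor design). Each channel can only see that the other one disagrees with it, not which of the two is actually wrong, so a benign five-cycle sensor glitch is enough for both to pull the trigger:

```c
/* Hypothetical sketch of the cross-kill failure mode.  The thresholds,
 * cycle counts, and transient are invented for illustration; none of this
 * reflects Boeing's actual monitor design. */
#include <stdio.h>

#define DISAGREE_LIMIT 3     /* consecutive disagreeing cycles tolerated */
#define THRESHOLD      0.5   /* max allowed difference between commands  */

int main(void) {
    double cmd[2];
    int disagree[2] = {0, 0};
    int halted[2]   = {0, 0};

    for (int cycle = 0; cycle < 20 && !(halted[0] && halted[1]); cycle++) {
        /* Both channels nominally compute the same command...            */
        cmd[0] = 1.0;
        cmd[1] = 1.0;
        /* ...but channel 1 briefly works from a stale sensor sample: a
         * benign, self-clearing glitch.                                   */
        if (cycle >= 5 && cycle <= 9)
            cmd[1] = 2.0;

        /* Each channel independently decides whether the OTHER one looks
         * broken.  Neither can tell which of the two is actually wrong.   */
        int wants_kill[2] = {0, 0};
        for (int me = 0; me < 2; me++) {
            if (halted[me])
                continue;
            double diff = cmd[1 - me] - cmd[me];
            if (diff < 0.0)
                diff = -diff;
            disagree[me] = (diff > THRESHOLD) ? disagree[me] + 1 : 0;
            if (disagree[me] > DISAGREE_LIMIT)
                wants_kill[me] = 1;
        }

        /* The decisions were made concurrently, so both can fire at once. */
        for (int me = 0; me < 2; me++) {
            if (wants_kill[me]) {
                halted[1 - me] = 1;
                printf("cycle %2d: channel %d halts channel %d\n",
                       cycle, me, 1 - me);
            }
        }
    }

    if (halted[0] && halted[1])
        printf("both channels halted each other -- nobody left flying\n");
    else if (halted[0] || halted[1])
        printf("channel %d permanently halted by a transient glitch\n",
               halted[0] ? 0 : 1);
    return 0;
}
```

Disagreement alone cannot tell you which channel is at fault; resolving that needs a third opinion or a tie-break rule, and either way that means still more traffic over a link that, as noted, was probably never sized for it.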

No wonder they did not want to completely document what they did.

In my opinion, the right solution for their computer-failure scenario is to figure out how to enable the pilot to resolve it, not to try to come up with a patched-together fail-safe computer system that wasn't designed to be fail-safe from the beginning. I have no idea whether this involves more indicator lights (I know, I know), or some sort of manual override, or some kind of hardware solution (like physically preventing the computer from misconfiguring the plane), but the right solution is surely not to create an exponentially more complex computer system in a hurry.