Blackout Bug: Boeing 737 cockpit screens go blank if pilots land on specific runways


slfool
8th Jan 2020, 11:56
https://www.theregister.co.uk/2020/01/08/boeing_737_ng_cockpit_screen_blank_bug/

Seven runways trigger the bug: five in the US and two in South America, in Colombia and Guyana respectively. Instrument approach procedures guide pilots to safe landings in all weather conditions regardless of visibility. "All six display units (DUs) blanked with a selected instrument approach to a runway with a 270-degree true heading, and all six DUs stayed blank until a different runway was selected," noted the FAA's airworthiness directive (https://www.federalregister.gov/documents/2019/12/27/2019-27966/airworthiness-directives-the-boeing-company-airplanes), summarising three incidents that occurred on scheduled 737 flights to Barrow, Alaska, in 2019.

https://www.federalregister.gov/documents/2019/12/27/2019-27966/airworthiness-directives-the-boeing-company-airplanes

This AD requires revising the airplane flight manual (AFM) to prohibit selection of certain runways for airplanes equipped with certain software. This AD was prompted by reports of display electronic unit (DEU) software errors on airplanes with a selected instrument approach to a specific runway. The FAA is issuing this AD to address the unsafe condition on these products.

BDAttitude
8th Jan 2020, 12:38
https://www.theregister.co.uk/2020/01/08/boeing_737_ng_cockpit_screen_blank_bug/



https://www.federalregister.gov/documents/2019/12/27/2019-27966/airworthiness-directives-the-boeing-company-airplanes
OMG!!!
This is symptomatic treatment at its best. What about fixing that f***ing display software?

malanda
8th Jan 2020, 13:05
OMG!!!
This is symptomatic treatment at its best. What about fixing that f***ing display software?
Seems they already have
"The FAA has confirmed that the faulty version of DEU software has already been removed from all airplanes conducting scheduled airline service into the affected airports. This AD is intended to address unscheduled diversions and Boeing Business Jet (BBJ) flights into the affected airports"

turbidus
8th Jan 2020, 13:13
This is the problem when you keep building code on top of legacy code. Certain combinations can trigger unintended consequences.

BDAttitude
8th Jan 2020, 13:36
Seems they already have
"The FAA has confirmed that the faulty version of DEU software has already been removed from all airplanes conducting scheduled airline service into the affected airports. This AD is intended to address unscheduled diversions and Boeing Business Jet (BBJ) flights into the affected airports"
So there's a fix and they do not mandate it?!

Some combination of data has triggered this bug and 7 approaches have been identified.
So who guarantees that no 7 + x approaches exist right now or are inserted later with DB updates?

Still find it strange.

blue up
8th Jan 2020, 16:12
747s could lose all 6 DUs until the EIUs got changed. Happened twice in-flight, IIRC.

DaveReidUK
8th Jan 2020, 17:01
So who guarantees that no 7 + x approaches exist right now or are inserted later with DB updates?

Runways tend not to move overnight. :O

It would be a fairly trivial task to crunch one of the FMS data providers' runway databases and determine which ones are aligned at exactly 270° (plus or minus whatever the tolerance is).
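
As a rough illustration of the database crunch described above, here is a minimal Python sketch. The `Runway` record, the sample entries and the 0.05° tolerance are all hypothetical; a real FMS navigation database has its own format, and the actual tolerance is not stated in the thread.

```python
from dataclasses import dataclass


@dataclass
class Runway:
    airport: str             # hypothetical placeholder identifier
    ident: str               # runway designator
    true_heading_deg: float  # true (not magnetic) heading in degrees


def heading_offset(heading_deg, target_deg=270.0):
    """Signed angular difference heading - target, wrapped into [-180, 180)."""
    return (heading_deg - target_deg + 180.0) % 360.0 - 180.0


def runways_near(runways, target_deg=270.0, tolerance_deg=0.05):
    """Return the runways whose true heading is within the tolerance of the target."""
    return [
        r for r in runways
        if abs(heading_offset(r.true_heading_deg, target_deg)) <= tolerance_deg
    ]


# Made-up sample data, not taken from any real database.
sample = [
    Runway("AAAA", "27", 269.968),
    Runway("BBBB", "27", 270.012),
    Runway("CCCC", "27R", 267.3),
    Runway("DDDD", "09", 90.0),
]

for r in runways_near(sample):
    print(r.airport, r.ident, r.true_heading_deg)
```

Wrapping the difference before comparing keeps the filter correct for targets near 0°/360° as well, which a plain subtraction would not.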

BDAttitude
8th Jan 2020, 17:15
Runways tend not to move overnight. :O

It would be a fairly trivial task to crunch one of the FMS data providers' runway databases and determine which ones are aligned at exactly 270° (plus or minus whatever the tolerance is).
That's just about every second round here - the others being 07 to 09 - predominantly westerly winds :O. Who needs FMS approach page anyway :}.

Arydberg
8th Jan 2020, 18:03
You have to be careful that you do not fix the problem but cause a worse one.

malanda
8th Jan 2020, 18:49
That's just about every second round here - the others being 07 to 09 - predominantly westerly winds :O. Who needs FMS approach page anyway :}.
From the AD:
"Not all runways with a 270-degree true heading are susceptible; only seven runways worldwide, as identified in this AD, have latitude and longitude values that cause the blanking behavior"

I don't see any obvious pattern in the lat/long values. A very curious bug.
Great Circle Mapper (http://www.gcmap.com/mapui?P=pabr-kcnm-82v-kciu-kbjj-sklm-sycj)

BDAttitude
8th Jan 2020, 19:38
From the AD:
"Not all runways with a 270-degree true heading are susceptible; only seven runways worldwide, as identified in this AD, have latitude and longitude values that cause the blanking behavior"

I don't see any obvious pattern in the lat/long values. A very curious bug.
Great Circle Mapper (http://www.gcmap.com/mapui?P=pabr-kcnm-82v-kciu-kbjj-sklm-sycj)
Western longitude and rather large variation is common.

tdracer
8th Jan 2020, 20:51
So there's a fix and they do not mandate it?!

Some combination of data has triggered this bug and 7 approaches have been identified.
So who guarantees that no 7 + x approaches exist right now or are inserted later with DB updates?

Still find it strange.
No direct knowledge of this case, but sometimes a software update may require an associated hardware update. So while installing the new s/w is easy and cheap, the associated h/w may not be. If an operator never plans to operate into one of the seven affected airports, they may not want to be bothered with (or pay for) the associated changes.
Shortly after EIS of the 747-400, Boeing came out with an EICAS s/w update to correct some issues with the original s/w. However it wasn't just a s/w update, it required a hardware change as well. Well, there was a certain operator who couldn't be bothered to update the hardware and so wouldn't incorporate the update. This operator also happened to be the launch customer and so had most of the affected aircraft. So finally the FAA issued an AD mandating the new s/w. Due to the unfortunate wording of the AD, it meant that every single time Boeing updated the 747-400 EICAS software, Boeing had to obtain an 'Alternate Method of Compliance' (AMOC) to the AD (which isn't trivial).

Dualbleed
8th Jan 2020, 21:20
I’ve been operating strictly to company and Boeing SOPs for the last 30 years on different Boeing products (to cover my own and everyone else’s back, including Boeing’s). I’m old enough and have so much experience that I’m beginning to not trust certain things that I always took for granted, and would maybe now go for my own very experienced seat-of-the-pants flying instinct instead of the Boeing QRH etc. to save the day. Not the way it should be, I know! Maybe Boeing can’t be trusted anymore, and that makes it up to me to be safe these days. Any thoughts?

Tobin
9th Jan 2020, 03:12
I'm a software developer, and this makes me very angry. The fact that a "magic value" can trigger a bug like this indicates that there is something rotten in the fundamental approach the developers of this particular software took.

There had to be multiple pairs of eyes looking it over, too, and either no one was confident enough to speak up and say "Hey, this isn't the right way to do this", or they were and were then overruled by someone with more authority than sense.

To be clear, I see this kind of thing all the time. I just don't work with safety-critical software, and I had hoped the standards were materially different for that.

I'd love to see the technical explanation of the bug and/or overview of the code, but I suspect it would just make me angrier.

I also quote this post from The Register article's comment page, with which I agree completely:
The bug itself is very troubling, but what is much more troubling is what the bug implies about the quality of the software at large. First, it implies a lack of bounds checking on the display unit. Second, a lack of testing on the display data inputs. Third, no checking of what the FMS is pushing out. Fourth, no one bothered to write rational error handling on the display unit in case values were out of bounds (especially since this causes the entire display to go dark instead of just a single value).

But what really worries me is that a software glitch can trigger a failure of all display units simultaneously, rather than just ones showing specific pages. So would a bad value that is only displayed on one of those tertiary EIS display pages cause the displays to go out as well? What about garbage from the WX Radar?

MechEngr
9th Jan 2020, 03:18
Runways tend not to move overnight. :O

It would be a fairly trivial task to crunch one of the FMS data providers' runway databases and determine which ones are aligned at exactly 270° (plus or minus whatever the tolerance is).

True, but the magnetic heading can effectively change overnight. It's a weird side effect of the magnetic poles not staying put.

Edit: Came across the following website because a certain drone suddenly was unable to fly when the magnetic database changed on Jan 1. https://www.ngdc.noaa.gov/geomag/WMM/ I don't know the particular correlation to the avionics side of this but looking at the loops and whorls of the magnetic field leaves me dizzy. They track not only the current offsets but the rate at which the offsets change; apparently 9 dimensions altogether.

ReturningVector
9th Jan 2020, 05:27
True, but the magnetic heading can effectively change overnight. It's a weird side effect of the magnetic poles not staying put.

Edit: Came across the following website because a certain drone suddenly was unable to fly when the magnetic database changed on Jan 1. https://www.ngdc.noaa.gov/geomag/WMM/ I don't know the particular correlation to the avionics side of this but looking at the loops and whorls of the magnetic field leaves me dizzy. They track not only the current offsets but the rate at which the offsets change; apparently 9 dimensions altogether.

I think the article related the bug to the true direction, not the magnetic direction of a runway.

Uplinker
9th Jan 2020, 06:56
So a combination of particular magnetic headings, latitude, longitude and mag variation might cause the Nav display to output nonsense, but all six displays blanked??

Why the PFD? Why the Engine/system display?

Please tell me this is untrue or that at least the PFDs are given a higher level of coding security.

MechEngr
9th Jan 2020, 07:18
I think the article related the bug to the true direction, not the magnetic direction of a runway.
Ahh, true heading - OK.

What is unclear is why there is also a connection to latitude and longitude.

I looked at the airports - based on data from https://www.airnav.com these are the deviations in degrees from 270:

82V -0.009377087
KBJJ 0.001984966
KCIU -0.024015687
KCNM 0.017476942
PABR -0.00926391

AirNav provides Lat/Long for each end of the runway which I converted with ATAN to a degrees variation. It did not list the last two.

SKLM is listed with a runway true heading of 272.4 on SKLM - Jorge Isaac Airport (http://www.pilotnav.com/airport/SKLM) and
SYCJ is listed at 271.9

So it's a mystery. While I can imagine truncating the true heading to one place would be a problem, I can't see where they would truncate the true heading by more than 2 degrees.
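
For readers who want to repeat the ATAN conversion mentioned above, here is a minimal Python sketch of one way to do it. It uses a flat-earth approximation that is only reasonable over runway-length distances, and the coordinates in the usage lines are made up rather than taken from AirNav.

```python
import math


def approx_true_bearing(lat1, lon1, lat2, lon2):
    """Approximate true bearing in degrees from point 1 to point 2.

    Flat-earth approximation: longitude differences are scaled by the
    cosine of the mean latitude. Only suitable for short distances.
    """
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    d_north = lat2 - lat1
    d_east = (lon2 - lon1) * math.cos(mean_lat)
    # atan2 takes (east, north) so that 0 deg is north and 90 deg is east.
    return math.degrees(math.atan2(d_east, d_north)) % 360.0


# Hypothetical thresholds: same latitude, second point further west.
bearing = approx_true_bearing(40.0, -100.0, 40.0, -100.02)
print(f"bearing {bearing:.6f} deg, deviation from 270: {bearing - 270.0:+.6f}")
```

Scaling the longitude difference by the cosine of the latitude matters more the further the airport is from the equator; omitting it would skew the result noticeably at a place like Barrow.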

MechEngr
9th Jan 2020, 07:53
So a combination of particular magnetic headings, latitude, longitude and mag variation might cause the Nav display to output nonsense, but all six displays blanked??

Why the PFD? Why the Engine/system display?

Please tell me this is untrue or that at least the PFDs are given a higher level of coding security.
I expect it failed in some section that would ordinarily be labeled "THIS CANNOT FAIL," not because it is impervious to failure but because there is no sensible response if it does. AT&T managed to crash almost all of its communications network because of that kind of bug and, unlike airport information, they were in control of every detail of their hardware operation. See All Circuits are Busy Now: The 1990 AT&T Long Distance Network Collapse (http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse)

The 737 is rumored to be running an 80286 processor which is a well understood design. However I don't see the architecture for the software. It should have an independent watchdog timer processor to automatically reboot if there is a fundamental software failure, but maybe only the thread/task/subsystem to update the displays failed and the watchdog reset and all other controls are still running. I've seen software fail because it was awaiting feedback that never happened. It may be that one display got a message it could not handle and never responded. That would quickly stop the task to update the displays. If the displays are programmed right they should have their own watchdog timers to stop displaying when too long a time has passed without an update rather than freezing and giving the impression they are still functioning.
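
To make the display-side watchdog idea concrete, here is a minimal Python sketch. The `DisplayWatchdog` class, its method names and the 0.5-second timeout are hypothetical illustrations, not a description of how the 737 display units actually work.

```python
class DisplayWatchdog:
    """Tracks the time of the last update so a display can detect stale data."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_update_s = None  # no update received yet

    def feed(self, now_s):
        """Record that fresh data arrived at time now_s (seconds)."""
        self.last_update_s = now_s

    def is_stale(self, now_s):
        """True if no update has arrived within the timeout window."""
        if self.last_update_s is None:
            return True
        return (now_s - self.last_update_s) > self.timeout_s


# Timestamps are passed in explicitly so the behaviour is easy to check.
wd = DisplayWatchdog(timeout_s=0.5)
wd.feed(now_s=1.0)
print(wd.is_stale(now_s=1.2))  # recent update: keep drawing
print(wd.is_stale(now_s=2.0))  # too old: blank or flag rather than freeze
```

The point of the design is the one made in the post: a display that stops or flags itself when its data goes stale is safer than one that keeps showing a frozen picture.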

There was a question of why not every runway was tried on the software. While hindsight is great on this sort of problem, it also suggests trying every combination of: altitude, airspeed, pitch, AoA, lat, long, N1, fuel load, weight, true airspeed, relative airspeed, and on and on.

The reason I wondered about the magnetic heading earlier is because this last year cannot be the first time a 737NG flew into Barrow, Alaska. Can it?

The last two airports on the list are nearly 2 degrees from a true heading of 270 degrees. It's not likely for that to result in a divide by zero problem.

It's a shame, but I doubt the actual source of the problem will be revealed.

DaveReidUK
9th Jan 2020, 08:26
Ahh, true heading - OK.

What is unclear is why there is also a connection to latitude and longitude.

I looked at the airports - based on data from https://www.airnav.com these are the deviations in degrees from 270:

82V -0.009377087
KBJJ 0.001984966
KCIU -0.024015687
KCNM 0.017476942
PABR -0.00926391

AirNav provides Lat/Long for each end of the runway which I converted with ATAN to a degrees variation. It did not list the last two.

SKLM is listed with a runway true heading of 272.4 on SKLM - Jorge Isaac Airport (http://www.pilotnav.com/airport/SKLM) and
SYCJ is listed at 271.9

So it's a mystery. While I can imagine truncating the true heading to one place would be a problem, I can't see where they would truncate the true heading by more than 2 degrees.

It probably depends which FMS database you use - I've just run some numbers from the one I use (not AirNav) and the true headings for the 7 runways in question range from 269.968° to 270.012°.

As for the FAA's assertion that "only seven runways worldwide, as identified in this AD, have latitude and longitude values that cause the blanking behavior", I suspect that only runways of over a certain length (5000 feet?) at civil airports have been considered. There are around 150 runways in total with true headings in the above range, but once short runways and military fields are excluded (and, for some reason, airports with parallel runways) there are only about a dozen left worldwide.

I don't think that mag variation has anything to do with the criteria.

MechEngr
9th Jan 2020, 08:47
That's even weirder. How can there be significantly different values for a fundamental measurement? Tiny fractions of a degree are one thing, but 2 degrees? I guess the two airports aren't going to have automated landings so a couple of degrees isn't a problem, but this seems like purposeful misinformation in comparison.

Anyway, if the values they are using are from the same source as yours, then rounding to the nearest 0.1 degree would set them to 270.

That still loops around to why the software is having a problem. The reverse approaches are 0 degrees true, and 0 is just as problematic for trigonometry as 180 is. I can see where the opposite problem would be more likely to exist - a database of the lat/long coordinates for each end and a failure to determine the correct arctangent, but the ones I looked at aren't exactly 270 so the arctangent should not be infinite nor zero.

Puzzling.
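
Purely to illustrate the arctangent concern raised above, and not as a claim about the actual DEU fault (whose cause is not known in this thread), here is a Python sketch of how a single-argument arctangent can misbehave for an exactly due-west track while `atan2` does not. Both function names are hypothetical.

```python
import math


def naive_heading(d_north, d_east):
    """Heading from atan(east / north).

    Raises ZeroDivisionError when d_north is exactly zero (due east or
    due west), and cannot distinguish opposite quadrants.
    """
    return math.degrees(math.atan(d_east / d_north)) % 360.0


def robust_heading(d_north, d_east):
    """Heading from atan2, which handles d_north == 0 and all quadrants."""
    return math.degrees(math.atan2(d_east, d_north)) % 360.0


print(robust_heading(0.0, -1.0))      # due west
try:
    naive_heading(0.0, -1.0)          # same input, naive formula
except ZeroDivisionError as exc:
    print("naive formula failed:", exc)
```

As noted in the post, the runways examined are not at exactly 270°, so a simple division by zero like this would not by itself explain the behaviour.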

malanda
9th Jan 2020, 09:05
It should have an independent watchdog timer processor to automatically reboot if there is a fundamental software failure, but maybe only the thread/task/subsystem to update the displays failed and the watchdog reset and all other controls are still running
The fact that "all six DUs stayed blank until a different runway was selected" suggests it wasn't dead, just thinking.

I recall a widely-distributed maths library function that could take literally years to execute with certain near-zero inputs. Could be something like that.

cattletruck
9th Jan 2020, 09:14
Didn't a squadron of F22s (7, I recall) have their screens "dump" when they all crossed the international dateline?
No doubt the word "dump" (which probably refers to an OS dump) is now a common term in FJ territory and will probably also become standard in civvie street.
And to think some are espousing autonomous flight using these very same software developers.

cats_five
9th Jan 2020, 09:15
At least it's reproducible; that is a big head start over something intermittent.

DaveReidUK
9th Jan 2020, 09:17
That's even weirder. How can there be significantly different values for a fundamental measurement? Tiny fractions of a degree are one thing, but 2 degrees?

Your guess is as good as mine. :O

The link you supplied for SKLM (http://www.pilotnav.com/airport/SKLM) shows the same latitude for both thresholds. If somebody can explain to me how two points with the same latitude and less than a mile apart can have a relative bearing other than due east/west (unless they're close to the N/S Poles) then I'd be extremely grateful.

Or just plug those provided threshold values into any GC calculator, such as https://www.cactus2000.de/uk/unit/massgrk.shtml
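
For anyone without a GC calculator to hand, here is a minimal Python sketch of the standard great-circle initial-bearing formula on a spherical earth. The coordinates used are hypothetical, not the published SKLM thresholds.

```python
import math


def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing in degrees true from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_lambda = math.radians(lon2 - lon1)
    y = math.sin(d_lambda) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(d_lambda))
    return math.degrees(math.atan2(y, x)) % 360.0


# Two made-up points at the same low latitude, under a mile apart.
print(initial_bearing(3.5, -76.0, 3.5, -76.015))

# Same idea very close to the pole, where equal latitudes no longer
# imply a bearing near due east/west.
print(initial_bearing(89.99, 0.0, 89.99, -60.0))
```

The two cases bear out the point in the post: at ordinary latitudes, thresholds sharing a latitude give a bearing within a tiny fraction of a degree of 270°, and only near the poles does that stop being true.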

MechEngr
9th Jan 2020, 09:41
Maybe SKLM is an unreliable narrator.

DaveReidUK
9th Jan 2020, 10:08
Maybe SKLM is an unreliable narrator.

Other random samples on the same site show fairly consistent runway heading errors, best seen on airports with long runways (London Heathrow (http://www.pilotnav.com/airport/EGLL), for example).

As an alternative to GC calculators, a quicker way to get a reasonably precise true heading for a runway is simply to use the ruler on Google Earth. For example EGLL 27R is oriented at 267.3° true, according to the above website, whereas the actual value is 269.7°.

https://cimg5.ibsrv.net/gimg/pprune.org-vbulletin/541x213/egll_runway_27r_c590871749da1ec2a9d26a4adf15ace0885ed668.jpg

BDAttitude
9th Jan 2020, 10:30
Embedded processors can usually be configured for how they behave if a division by zero occurs: ignore it, or trap, dump and reset. When you choose the latter option and feed them the same div/0 after restart, they are trapped in a reset loop until you remove the offending input, which seems easy in this case but is difficult if the 0 is read from memory.
Therefore we would not do that, at the expense of having a not well-defined state if a div/0 occurred.
The least significant bit of a heading represented as a 16-bit signed integer would be 0.0109866°.
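
The LSB figure above can be reproduced with a short Python sketch, assuming the full 360° is spread over the 32767 positive counts of a signed 16-bit integer. That scaling is an assumption that happens to match the quoted number; the encoding actually used in the DEU is not known from this thread.

```python
FULL_CIRCLE_DEG = 360.0
INT16_MAX = 32767  # largest value of a signed 16-bit integer


def lsb_degrees(max_count=INT16_MAX, span_deg=FULL_CIRCLE_DEG):
    """Size in degrees of one count under the assumed scaling."""
    return span_deg / max_count


def quantise(heading_deg, lsb):
    """Nearest integer count for a heading under the assumed scaling."""
    return round(heading_deg / lsb)


lsb = lsb_degrees()
print(f"LSB = {lsb:.7f} deg")
for heading in (269.968, 270.0, 270.012):
    print(heading, "->", quantise(heading, lsb))
```

With a step of roughly 0.011°, the headings discussed earlier in the thread (269.968° to 270.012°) would fall within a handful of adjacent counts around 270°.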

visibility3miles
9th Jan 2020, 15:51
If it has a serious bug, it's better to have the screens go blank than provide erroneous information.

EEngr
9th Jan 2020, 16:19
The 737 is rumored to be running an 80286 processor which is a well understood design.

This might be a part of the problem. The '286 isn't quite adequate to support fully multitasking operating systems. The memory management hardware isn't quite up to snuff*. So whatever O/S the DEUs are running, the failure of one process thread can hang the entire system.

*One of the reasons Linux and other true multitasking systems had to wait for the '386.

Uplinker
9th Jan 2020, 17:06
If it has a serious bug, it's better to have the screens go blank than provide erroneous information.

What, ALL the screens? - leaving you with just the standby instruments and nothing else?

I disagree. A navigation problem alone should not blank the attitude, IAS, TAS, altitude, V/S, A/P status, A/THR status, N1, EGT, etc, etc. You should just get a red flag for; heading, G/S, wind vector, ILS, VOR, map, compass, course, or whichever Nav function is unavailable, or disagrees with the other side. The whole bloody screen shouldn’t blank, nor should the screens which don’t display any Nav info.
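
The per-parameter flagging argued for above could look something like the following Python sketch. The `Field` record, the limits and the `"FLAG"` marker are hypothetical and only illustrate the principle of isolating a bad value instead of blanking everything.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    name: str
    value: Optional[float]  # None means no data received
    lo: float               # lowest acceptable value
    hi: float               # highest acceptable value


def render(fields):
    """Return display text per field, flagging only the invalid ones."""
    out = {}
    for f in fields:
        invalid = (
            f.value is None
            or math.isnan(f.value)
            or not (f.lo <= f.value <= f.hi)
        )
        out[f.name] = "FLAG" if invalid else f"{f.value:g}"
    return out


screen = render([
    Field("heading", float("nan"), 0.0, 360.0),  # bad nav value
    Field("altitude", 3000.0, -2000.0, 60000.0),
    Field("ias", 140.0, 0.0, 500.0),
])
print(screen)
```

Only the heading is flagged here; altitude and airspeed are still shown, which is the behaviour the post argues a nav fault should produce.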

squidie
9th Jan 2020, 21:03
What, ALL the screens? - leaving you with just the standby instruments and nothing else?

I disagree. A navigation problem alone should not blank the attitude, IAS, TAS, altitude, V/S, A/P status, A/THR status, N1, EGT, etc, etc. You should just get a red flag for; heading, G/S, wind vector, ILS, VOR, map, compass, course, or whichever Nav function is unavailable, or disagrees with the other side. The whole bloody screen shouldn’t blank, nor should the screens which don’t display any Nav info.
That’s one serious bug if all the panel screens go blank! Would love to see a sim video with it reproduced.

fizz57
9th Jan 2020, 21:47
What, ALL the screens? - leaving you with just the standby instruments and nothing else?

I disagree. A navigation problem alone should not blank the attitude, IAS, TAS, altitude, V/S, A/P status, A/THR status, N1, EGT, etc, etc. You should just get a red flag for; heading, G/S, wind vector, ILS, VOR, map, compass, course, or whichever Nav function is unavailable, or disagrees with the other side. The whole bloody screen shouldn’t blank, nor should the screens which don’t display any Nav info.

I have problems understanding how a fault in the nav data or its processing can cause displays to blank. I always understood from the various schematics that the Flight Management Computers, Flight Control Computers and Display Processors are physically separate boxes connected by a data bus. In this case the displays should only get turned off if they received an "off" command - surely loss of signal from a crashed FMC should produce an error display rather than a blank screen?

On the other hand if the boxes on the schematics represent logical components run on the same hardware, the plot thickens. What else stopped working when the displays blanked?

Perhaps someone in the know can chip in?

turbidus
10th Jan 2020, 00:46
Well, in programming procedures, we have to deal with the legacy code, and code that exists. The lookup feature in the coding will take you to the next line. It is not always where you want to be. This is why we test the new code as much as we can.
That being said, there will always be certain inputs that will cascade to unintended consequences. I have seen many, many strange sequences in the coded flightpath, mostly in the flyby vs flyover code. I have seen, under certain flightpaths, where the AP is locked on, and no matter where you try to disengage, that internal command locks the 1 and 0 to 1, i.e. on.

best advice is to always report these anomalies, with the combinations thereof.....

All that being said, it is very easy to see how this happened, sorry it got to the flightdeck, but, well damn.

BDAttitude
10th Jan 2020, 07:05
From the limited information freely available, there are two DEUs providing data to the six displays. Now if both DEUs are fed with coordinates to display the same runway on a nav screen, causing a powerlatch because they trigger some exception, six blank screens is what you would expect.
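
The common-mode argument above can be shown with a toy Python sketch. Everything here is hypothetical: the `process` function, the `"BAD"` input standing in for a triggering runway, and the even split of three displays per DEU are illustrative assumptions, not the real architecture.

```python
def process(runway):
    """Hypothetical DEU processing step that faults on one specific input."""
    if runway == "BAD":
        raise ValueError("unhandled input")
    return f"map for {runway}"


def cockpit_displays(selected_runway, deu_count=2, displays_per_deu=3):
    """Return what each display shows; None stands for a blank screen."""
    screens = []
    for _ in range(deu_count):
        try:
            image = process(selected_runway)
        except ValueError:
            image = None  # this DEU faulted, so its displays go blank
        screens.extend([image] * displays_per_deu)
    return screens


print(cockpit_displays("OK"))   # all six displays driven normally
print(cockpit_displays("BAD"))  # identical software, identical input: all blank
```

Duplicating the hardware protects against a box failing, but not against both boxes running the same code on the same input.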

fizz57
10th Jan 2020, 08:48
Sounds reasonable, thanks. I wasn't thinking of the bug being in the DEUs.