
U.K. NATS Systems Failure

Old 31st Aug 2023, 21:01
  #181 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,711
Received 1,012 Likes on 602 Posts
It would be very hard to develop two systems of that complexity that did the same thing in a different way and keep them both current with user requirements. You could have a current version and a version one iteration older as a backup. That would protect you from failures arising from upgrades and changes, but it is unlikely to protect the system from a rare, untested error: both iterations are likely to be carrying the same fault in the software if they have both been running successfully for many years.
Ninthace is offline  
Old 31st Aug 2023, 21:13
  #182 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Whilst I agree with what you say, this isn't about building two entirely similar, but subtly different, systems. It's about how the failover works: if the system is allowed to crash in the event of this very rare but erroneous data being input, then what happens? It goes offline for three hours, we then wait another four hours for the erroneous data to be 'washed out of the system', aka cleared from local memory, and then seven hours later we start clearing a backlog of aircraft that will take a week?
Safety and mission critical systems - think power systems or train signalling, or even mobile network operations - just can't work like that. The logic has to be 'if route A fails, switch to route B' (which can't just be a carbon copy of route A); that should allow the erroneous data to be isolated without crashing the entire system. And in the (really!) very exceptional circumstance that both route A and route B end up failing at the same time, there should be a route C failover to cover that contingency as well.

Just having three identical route As is asking to crash the system, which has now happened, repeatedly.
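
To make the shape of that argument concrete, here is a minimal Python sketch of a failover chain that quarantines the poison input instead of feeding it to the backup. Everything in it - the names, the routes, the exception - is invented for illustration; it is not a claim about how NATS is actually built.

Code:
# Hypothetical sketch only: none of these names refer to real NATS components.
class RouteFailed(Exception):
    pass

def supervise(message, routes, quarantine):
    """Try each independent processing route in turn; isolate the input
    rather than letting one bad message take down every route."""
    for route in routes:
        try:
            return route(message)
        except RouteFailed:
            continue                  # fall through to the next route
    quarantine.append(message)        # isolate the poison input
    return None                       # degrade for one message, keep running

def route_a(msg):
    raise RouteFailed("primary parser rejected message")

def route_b(msg):
    return f"processed by fallback: {msg}"

quarantined = []
print(supervise("FPL-ABC123 ...", [route_a, route_b], quarantined))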
Neo380 is offline  
Old 31st Aug 2023, 21:44
  #183 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by Neo380
The logic has to be 'if route A fails, switch to route B' (which can't just be a carbon copy of route A)
There was a 'Route B' - manual ops. Very different, and safe (a point that NATS labours). Nobody died.

Last edited by CBSITCB; 1st Sep 2023 at 11:00. Reason: Typo
CBSITCB is offline  
Old 31st Aug 2023, 21:48
  #184 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Manual ops is only a fail-safe if you expect your entire (mission-critical!) ATC system to catastrophically collapse - by which time 'mission critical' no longer applies.
Neo380 is offline  
Old 31st Aug 2023, 22:47
  #185 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Abrahn
The scenario you quoted wasn't an extreme edge case. The system was specified to be able to deal with 193 controllers but was only tested with 130. And broke at 153.

To use your analogy that's buying a 4 seater car and only bothering to check that there are 3 seats in it.
To sharpen the analogy: the system was specified for 193 controllers, was tested for 130, and broke at 153 (unsurprisingly); it then immediately broke again, and then immediately broke again (this is the piece that should be safeguarded against), leading to a complete system collapse, which should absolutely never happen.
Neo380 is offline  
Old 1st Sep 2023, 01:11
  #186 (permalink)  
 
Join Date: Oct 2004
Location: California
Posts: 386
Likes: 0
Received 11 Likes on 8 Posts
Originally Posted by Murty
AFTN Message Format
The Message Text ends with the End-of-Message Signal, which is the four characters NNNN. The Ending itself comprises twelve letter shift signals which also represent a Message-Separation Signal.
Twelve "letter shift signals"? We're talking Baudot teletype code (5-level, ITA2) here, which has been obsolete for about 60 years. If the system had upgraded to 8-bit codes you could use the specialized message-separation characters SOH, STX/ETX, ETB/EOT, not to mention FS/GS/RS/US. I suppose no one wanted to take the hit for suggesting such a radical change.
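
As a toy illustration of the difference (nothing to do with any real AFTN switch - the example stream and function names are made up): splitting on the literal text NNNN versus splitting on a dedicated control character such as ETX.

Code:
# Toy illustration only: splitting a stream of AFTN-style messages.
ETX = "\x03"   # ASCII end-of-text control character

def split_on_nnnn(stream):
    # Legacy style: the end-of-message signal is the literal text 'NNNN',
    # so message content must never contain that sequence.
    return [m.strip() for m in stream.split("NNNN") if m.strip()]

def split_on_etx(stream):
    # 8-bit style: a control character that cannot appear in printable
    # text unambiguously marks the end of each message.
    return [m.strip() for m in stream.split(ETX) if m.strip()]

raw = "FPL-ABC123-IS ... NNNN FPL-DEF456-IS ... NNNN"
print(split_on_nnnn(raw))   # two messages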
MarcK is online now  
Old 1st Sep 2023, 07:25
  #187 (permalink)  
 
Join Date: Aug 2003
Location: FR
Posts: 234
Likes: 0
Received 0 Likes on 0 Posts
Re. #179 (Los Angeles 2014). I would like to read the full report, if it is available somewhere?

This in particular:
... exceeded the amount of memory allotted to handling the flight’s data, which in turn resulted in system errors and restarts ...

OK, no good protection against excessive memory usage, resulting in system restarts ... But what seems even worse is the part about "system restartS" (more than once). It suggests that it can be acceptable to have multiple restarts in a row before some other action is taken (such as identifying which data input is causing the problem). This, in turn, makes me wonder how many such self-restarts occur on a regular basis ... these would normally be analyzed and their cause addressed, even in the absence of a visible incident?
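
Just to sketch the kind of containment I have in mind (purely hypothetical - the function names, the retry limit and the 'quarantine' idea are all my own invention, not the FAA design): cap the retries for one input and set it aside, rather than letting it restart the whole system repeatedly.

Code:
# Illustrative only: contain a per-message failure instead of letting it
# restart the whole system repeatedly. Names and limits are invented.
MAX_RETRIES = 2

def process(flight_plan):
    # Stand-in for the real trajectory expansion; imagine it can blow up.
    if flight_plan.get("altitude") is None:
        raise MemoryError("unbounded route expansion")
    return "ok"

def handle(flight_plan, quarantine):
    for _ in range(MAX_RETRIES):
        try:
            return process(flight_plan)
        except MemoryError:
            continue                   # a bounded number of retries...
    quarantine.append(flight_plan)     # ...then isolate the offending input
    return "rejected"

quarantine = []
print(handle({"callsign": "U2", "altitude": None}, quarantine))
print(len(quarantine))   # 1 - the bad plan is parked, the system keeps running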
pax2908 is offline  
Old 1st Sep 2023, 09:36
  #188 (permalink)  
 
Join Date: Aug 2010
Location: UK
Age: 67
Posts: 170
Received 36 Likes on 21 Posts
Originally Posted by Neo380
That's really missing the point, as has been said a number of times.

This isn't an 'infinitesimal circumstance' that could never be tested for (unless all human inputs have become 100% reliable, which they are not, and never can be).

This is all about how a failover works, or doesn't work to be more precise. The system should have switched to an iteration of the software that wasn't identical, so wasn't bound to immediately fail again. Because if it does, the system has no fail-safe and is bound to collapse, catastrophically.

That is the issue that is being skirted around, and was the core fault of the 2014 failure - very bad IT architecture.

Describing 'infinitesimal circumstances' and the '100 years of testing that couldn't identify them' has nothing to do with building failovers that fail, and then fail over and fail again, through poor design.

Please see my post #74. Software testing is costly both in terms of time and resource. For complex systems it is impossible to test every combination of input data to see what fails. Automated testing is also not the panacea that many think it is. To get a test script automated, you end up manually running it to make sure the element of software under test works. Once you have a test that passes, you run it again, this time using the auto test suite to record the steps you take. Once you've done that, you then run a confirmatory test. So for every element of the requirement you end up with at least three runs of a single test script (which can have many stages).

Then the developer has an update to code, so your automated test fails, then you start all over again.

The problem with old and complex systems is that updates and improvements are usually a bolt-on to the original; it isn't very often that you redesign from a clean sheet of paper. The result is that you end up testing the areas that your update has "touched", with a quick sanity regression test of the main use cases. You just don't have the time, resource or money to fully test everything each time an update is carried out.

Even then, there will be an edge case you just don't consider, or haven't even thought of, or have dismissed as "won't ever happen" because of checks in other systems that you use as a data source, where you assume that the input data has been properly validated and is therefore correct.

golfbananajam is offline  
Old 1st Sep 2023, 10:47
  #189 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
In the case of the 2014 Los Angeles failure, the crash stemmed from “The system evaluated all possible altitudes along the U-2’s planned flight path for potential collisions with other aircraft.”

Say there were 100 aircraft in the LA centre’s airspace, all (obviously!) in different places, going in different directions, at different speeds and climbing/descending. How can you possibly write a test case that exercises all possibilities?
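
A back-of-the-envelope illustration of the scale of the problem (the bucket counts below are invented purely for the arithmetic):

Code:
# Rough combinatorics: 100 aircraft, each crudely bucketed into a few states.
positions, headings, speeds, vertical = 20, 8, 5, 3    # invented bucket counts
states_per_aircraft = positions * headings * speeds * vertical   # 2,400
aircraft = 100

total_pictures = states_per_aircraft ** aircraft
print(states_per_aircraft, "coarse states per aircraft")
print("about 10^%d possible traffic pictures" % (len(str(total_pictures)) - 1))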
CBSITCB is offline  
Old 1st Sep 2023, 10:52
  #190 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by pax2908
Re. #179 (Los Angeles 2014). I would like to read the full report, if it is available somewhere?
I’m not aware of a detailed report (I haven’t searched for one). I used these two sources:

https://www.oig.dot.gov/sites/defaul...5E11-07-18.pdf

https://arstechnica.com/information-...mputer-system/

CBSITCB is offline  
Old 1st Sep 2023, 14:04
  #191 (permalink)  
 
Join Date: Jan 2017
Location: UK
Posts: 65
Received 2 Likes on 1 Post
Originally Posted by Abrahn
The scenario you quoted wasn't an extreme edge case. The system was specified to be able to deal with 193 controllers but was only tested with 130. And broke at 153.

To use your analogy that's buying a 4 seater car and only bothering to check that there are 3 seats in it.
There was a little bit more to it than that. The other issue at play was that the controller had made a mode error in selecting a soft key that put them in "Watching Mode" (a rare and obsolete mode), and only then did the comparison 153 < 151 (in a different code path) fail. It was the combination of errors, both in software and by the operator, that on their own were inconsequential but when combined became significant.

The final report paras ES8. and ES9. give an introduction to this. The report then goes on to look at why this mode was still present, how this (understandable) mode error could have been detected (it was being selected accidentally almost every other day) or prevented and the trade-offs in testing and so on.

Much of software testing is about using your imagination: "what can go wrong?" So the 2014 failure could be regarded as a failure of imagination.
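
To make the mechanism concrete, here is a deliberately over-simplified, hypothetical reconstruction of that kind of defect - a rarely exercised code path checking against the wrong one of two limits. Only the two limits (151 and 193) come from the report's figures; everything else (function name, structure) is invented and bears no relation to the real code.

Code:
# Hypothetical, heavily simplified sketch of a "wrong limit in a rare
# code path" defect. Not the real code; only the two limits are real.
MAX_CIVIL = 151   # civil controller positions
MAX_TOTAL = 193   # all controller and supervisor roles

def sign_in(active_roles, watching_mode):
    if watching_mode:
        # Defect: the rarely used path checks the civil limit, not the total.
        if active_roles > MAX_CIVIL:
            raise RuntimeError("capacity check failed")   # forces a restart
    else:
        if active_roles > MAX_TOTAL:
            raise RuntimeError("capacity check failed")
    return "signed in"

# Fine for years at ~130 roles; fails the first time more than 151 are
# active *and* someone happens to select the obsolete watching mode.
print(sign_in(130, watching_mode=True))    # signed in
print(sign_in(153, watching_mode=False))   # signed in
print(sign_in(153, watching_mode=True))    # raises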
paulross is offline  
Old 1st Sep 2023, 15:25
  #192 (permalink)  
 
Join Date: May 2016
Location: UK
Posts: 6
Likes: 0
Received 1 Like on 1 Post
Originally Posted by paulross
There was a little bit more to it than that. The other issue at play was that the controller had made a mode error in selecting a soft key that put them in "Watching Mode" (a rare and obsolete mode), and only then did the comparison 153 < 151 (in a different code path) fail. It was the combination of errors, both in software and by the operator, that on their own were inconsequential but when combined became significant.
I think this is rather a favourable way of looking at it.

To me the real cause of the failure was introducing new software, onto both SFS Servers, that had not been adequately tested (or rather, whose testing had not been adequately specified). The inadequacy of that testing was shown when, whether or not "Watching Mode" needed to be selected, it took only one day for the "new" software to bring down UK ATC for a period.

The report refers to "needles" and "haystacks" and how hard it is to find errors, including latent errors (as here, from maybe 20 years earlier). However, the upgrade is described as being specifically to "add military controller roles". Therefore, to me, in addition to whatever normal test functions an upgrade requires, specific testing should have been specified to "stress test" the number of workstations. The testing should be intended to verify not only the upgrade changes but the whole system, to expose (as here) related latent errors that had been "got away with" to date - especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military).
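
By way of illustration, the sort of limit-plus-mode test matrix I mean might look like the hypothetical pytest sketch below. The sign_in() stub and the numbers are mine; the point is simply that the published limits, in every mode, appear in the parameter list rather than only the traffic levels seen day to day.

Code:
# Hypothetical pytest sketch: exercise the published limits in every mode,
# not just the traffic levels seen in day-to-day operation.
import pytest

MAX_TOTAL = 193    # assumed specified capacity

def sign_in(active_count, watching_mode):
    # Trivial stub standing in for the real system under test.
    return "signed in" if active_count <= MAX_TOTAL else "rejected"

@pytest.mark.parametrize("watching_mode", [False, True])
@pytest.mark.parametrize("count", [130, 151, 152, 153, 192, 193])
def test_accepts_load_up_to_specified_limit(count, watching_mode):
    assert sign_in(count, watching_mode) == "signed in"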

The bigger picture is: should the upgrade have been debugged on a live system, or a test system? NATS of course will keep banging the safety drum, which might be accurate, but irrelevant. The question is whether the airline industry, travelling public and Govt find it acceptable for the system to grind to a halt every 10 years or so while latent errors are worked out. If it is not acceptable, then a different (and no doubt more costly) approach is required... We'll see later whether the report on Monday's issue has any parallels.
Gupeg is offline  
Old 1st Sep 2023, 18:41
  #193 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,711
Received 1,012 Likes on 602 Posts
Welcome to the real world of IT. The bean counters and the managers want it yesterday, the developers want it forever; everything is a trade-off in the end. The trick is to know when to stop tweaking and testing and actually deliver.
Ninthace is offline  
Old 1st Sep 2023, 21:04
  #194 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by golfbananajam
Please see my post #74. Software testing is costly both in terms of time and resource. For complex systems it is impossible to test every combination of input data to see what fails. Automated testing is also not the panacea that many think it is. To get a test script automated, you end up manually running it to make sure the element of software under test works. Once you have a test that passes, you run it again, this time using the auto test suite to record the steps you take. Once you've done that, you then run a confirmatory test. So for every element of the requirement you end up with at least three runs of a single test script (which can have many stages).

Then the developer has an update to code, so your automated test fails, then you start all over again.

The problem with old and complex systems is that updates and improvements are usually a bolt-on to the original; it isn't very often that you redesign from a clean sheet of paper. The result is that you end up testing the areas that your update has "touched", with a quick sanity regression test of the main use cases. You just don't have the time, resource or money to fully test everything each time an update is carried out.

Even then, there will be an edge case you just don't consider, or haven't even thought of, or have dismissed as "won't ever happen" because of checks in other systems that you use as a data source, where you assume that the input data has been properly validated and is therefore correct.
Inputting the wrong data - often as little as a missed full stop - is not an 'edge case'; it's normal human behaviour. This has nothing to do with failovers that don't work.
Neo380 is offline  
Old 1st Sep 2023, 21:22
  #195 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
I've reread #74 and concur! We are not trying to test every combination of variables, like the U2 flight plan (with no altitude data!) and its impact on the FAA system.

I agree that task is never-ending. But you say it yourself: "failure testing is often limited to defined alternate path (within the software) testing". That path CAN'T be the already-failed path, because it's bound to fail again - especially if the circumstances are more operators than the system was stress tested for, many in new (military) roles. This is the smoking gun, and the cover-up (or at least what is not being discussed): the lack of alternate paths.

You go on: "critical systems like this should ALWAYS [my emphasis] fail safe [that's what I've been saying!], ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time". EXACTLY. All this talk about edge cases, and French data etc etc, is really just BS...

"Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash'. But it has, repeatedly now. I wonder about BC testing too!
Neo380 is offline  
Old 1st Sep 2023, 21:46
  #196 (permalink)  
 
Join Date: Aug 2023
Location: England
Posts: 7
Likes: 0
Received 0 Likes on 0 Posts
Ref: The major error seems to be the upgrade shortly before, adding potential (military) inputs. A correct upgrade process should have identified those added inputs, and stress tested the system with those added inputs against maximum previous inputs.

I don't know the details but suspect that it was thought that the few added military consoles would still be well within the limits. It was not at that time recognised that leaving consoles in watching mode made the software see more than the limits. Thus they did not test for the max number of consoles plus the max number of watching consoles. A failure of imagination, or just a complete lack of knowledge of what watching consoles did?

Engineer39 is offline  
Old 2nd Sep 2023, 07:48
  #197 (permalink)  
 
Join Date: Nov 1999
Location: London UK
Posts: 531
Received 2 Likes on 2 Posts
Originally Posted by Neo380
I've reread #74 and concur! We are not trying to test every combination of variables, like the U2 flight plan (with no altitude data!) and its impact on the FAA system.

I agree that task is never-ending. But you say it yourself: "failure testing is often limited to defined alternate path (within the software) testing". That path CAN'T be the already-failed path, because it's bound to fail again - especially if the circumstances are more operators than the system was stress tested for, many in new (military) roles. This is the smoking gun, and the cover-up (or at least what is not being discussed): the lack of alternate paths.

You go on: "critical systems like this should ALWAYS [my emphasis] fail safe [that's what I've been saying!], ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time". EXACTLY. All this talk about edge cases, and French data etc etc, is really just BS...

"Similarly for hardware and connectivity of critical systems, no one failure should cause a system-wide crash". But it has, repeatedly now. I wonder about BC testing too!
There are cases where one invalid or rejected input means subsequent inputs cannot be processed properly, e.g. running totals or counts may be inaccurate. Certainly in the case of a control system it's generally better to keep going, but from the developer's point of view it isn't always clear whether it's a 'keep running regardless' scenario or a 'once you're on the wrong line, every station is likely to be the wrong station' scenario.
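
A trivial, made-up example of why 'skip it and keep going' isn't automatically safe:

Code:
# If a rejected message carried an update, every later total is wrong.
updates = ["+3", "+2", "oops", "+1"]    # third message is malformed

airborne_count = 0
for u in updates:
    try:
        airborne_count += int(u)
    except ValueError:
        pass    # 'keep running regardless' silently loses one update

print(airborne_count)   # 6 - but the true count is now anyone's guess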
Dr Jekyll is offline  
Old 2nd Sep 2023, 08:24
  #198 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Dr Jekyll
There are cases where one invalid or rejected input means subsequent inputs cannot be processed properly, e.g. running totals or counts may be inaccurate. Certainly in the case of a control system it's generally better to keep going, but from the developer's point of view it isn't always clear whether it's a 'keep running regardless' scenario or a 'once you're on the wrong line, every station is likely to be the wrong station' scenario.
So in mission-critical systems it's like this: a car breaks down at the traffic lights - it happens, even if the car has already been checked - but the traffic lights shouldn't then fail across the entire city, leaving every road crossing to be handled manually. Moreover, you've got a car blocking the traffic lights now, so there's only one thing you can do, and that's reroute the traffic around the obstacle. That's a fail-safe, and you normally need two of them, for fairly obvious reasons: the second route is likely to come under pressure pretty fast too. But don't ever, ever just assume that you can push the traffic through a blocked route - that's what causes the system to crash. This has NOTHING to do with 'the chances of your car breaking down', especially, coming back to reality, when we know this issue is highly likely to be attributable to human error, ie faulty data input. And that's before adding all the military traffic and not stress testing the system properly - ever, it seems.
The key characteristics of this incident seem to be a lack of competence and wishful thinking. Only saved, btw, because 'the car was eventually moved out of the way' and the only route available was restarted.
Neo380 is offline  
Old 2nd Sep 2023, 08:42
  #199 (permalink)  
 
Join Date: Apr 2002
Location: UK
Posts: 1,223
Received 9 Likes on 7 Posts
A good few years ago I was testing a piece of commercial, non safety critical, software. It failed at a specific point in a way I considered "interesting". I described the failure to the programmer. He looked quizzical and said "I wondered if it would do that".

Then there was a recurring fault. It happened in client systems all over the country. Nobody experienced it frequently or consistently; in fact, across the country, it happened about twice a year. Most people never had the problem. Try as we might, we never got to the bottom of it (believe me, we really tried).

Software is written by humans, tested by humans (test scripts for automated systems are written by humans) and used by humans. Humans are error prone and, in the end, it means software will be error prone.
Hartington is offline  
Old 2nd Sep 2023, 09:12
  #200 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 485
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Gupeg
I think this is rather a favourable way of looking at it.

To me the real cause of the failure was introducing new software, onto both SFS Servers, that had not been adequately tested (or rather, whose testing had not been adequately specified). The inadequacy of that testing was shown when, whether or not "Watching Mode" needed to be selected, it took only one day for the "new" software to bring down UK ATC for a period.
We continue to discuss an earlier failure of a system that almost certainly wasn't the one involved in this case, although of course we currently don't know which system was involved.

It wasn't new software. It was the original software; it had been there for years. The change introduced was to start using it nearer the limits of the system, of which there were two: 151 civil positions and 193 overall. The verification of those limits and acceptance of them happened years before. To use the poor analogy previously introduced, it is akin to buying a 5-seat car, only using 4 seats for several years, and one day having a need to use all 5 - in my case, discovering that, if Isofix is in use on 2 of the seats, it is actually a 4.5-seat car, not 5.

Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only opportunity to do so was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable and available resources at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that, but this is a system developed long before that was an option. And a simple test up to 193 would not have uncovered the issue: you would need to invoke watching mode when more than 151 were in use, and any other mode added above 151 would not have triggered the issue. If your aim was to fully stress the system, it is likely that you would have invoked the more demanding modes to do that.

Should they have spotted the error on code review? This is a bad case for humans. There are two limits in use. I'd probably spot a completely incorrect limit but I'd be far less likely to spot that the wrong one was being used.

Should SFS have 2 completely different sets of software so an error would only affect one? Ideally yes, but as I've said before that is also impractical. The supplier struggled to produce one set of software in the timescale and cost originally estimated. Even if you doubled your estimate, producing two would, in the end, cost considerably more than twice as much, even if you ever managed to actually deliver.

Business criticality is a different matter from safety criticality, but for all systems in the flight data thread you can make an adequate safety case with redundancy provided by an identical system, provided you have a means of ensuring that, at all times from inception of failure, you can safely handle the level of traffic that might be present. In the case of Monday, the level of traffic at failure was safely handled, and the reduction of traffic as data degraded ensured that continued to be the case.

If your safety case is made, then business criticality becomes purely a matter of cost-benefit.

Last edited by eglnyt; 2nd Sep 2023 at 09:32. Reason: Grammar
eglnyt is offline  

