
U.K. NATS Systems Failure


Old 31st Aug 2023, 08:01
  #161 (permalink)  
 

Originally Posted by eglnyt
NATS investment is not from the public purse. One of the reasons it was privatised was to remove its borrowing requirements from the accounts, in the days when the Public Sector Borrowing Requirement figure was thought to be important and Government hadn't realised it could just keep printing money and nobody actually cared.
It has been investing a lot since privatisation and continues to do so; whether all that investment has been wise is difficult to judge, but it's kept a lot of people employed for a while.
If it was NAS that failed, then NATS has been trying to replace that system since the 90s and has failed to do so. It's a bit like trying to replace the foundations of a tower block whilst keeping the block intact, because NAS is so integrated into all NATS systems.
Having finally realised that couldn't be done, it embarked on replacing both NAS and the overlying systems a few years ago. It's a huge project that does literally rip it out & start again.
NATS is currently asking for (very large) sums to run a system for uncrewed aircraft (drones), a service most people in the sector think should be provided by industry, so it won't do the NAS for free!
Neo380 is offline  
Old 31st Aug 2023, 08:17
  #162 (permalink)  
 
Slightly off topic, but NATS is funded by its users. Why should those users fund services provided to an emerging industry? Again the 2 vocal Irishmen would have something to say. All NATS expenditure is governed by the licence negotiations. There are some things it funds as a quid pro quo for its right to run the UK airspace, but that currently isn't one of them.
eglnyt is offline  
Old 31st Aug 2023, 08:30
  #163 (permalink)  
 
Originally Posted by eglnyt
Slightly off topic, but NATS is funded by its users. Why should those users fund services provided to an emerging industry? Again the 2 vocal Irishmen would have something to say. All NATS expenditure is governed by the licence negotiations. There are some things it funds as a quid pro quo for its right to run the UK airspace, but that currently isn't one of them.
Because it believes it should be 'controlling' UAS (of course it can't).
Neo380 is offline  
Old 31st Aug 2023, 08:42
  #164 (permalink)  
 
'Just culture is not what you describe, & not limited to NATS. What you describe is a "no blame" culture, which has been out of favour for decades, for the reasons you suggest. A just culture draws a distinction between honest mistakes & errors, which occur in any environment, & non conformance. The second is definitely not acceptable nor accepted.'

Agreed. So if you build a safety critical system that won't 'fail over' successfully, that's ok, according to 'just culture' - because NATS is saying they've 'conformed'?
Neo380 is offline  
Old 31st Aug 2023, 09:32
  #165 (permalink)  
 
I still don't know what failed, but although NATS has some safety-critical systems, a loss of service in the flight plan chain should not be safety critical. Corruption of data could be, but not failure. Business critical, yes, and demonstrably so, but not safety critical.
eglnyt is offline  
Old 31st Aug 2023, 09:54
  #166 (permalink)  
Very interesting discussion. Learning a few things along the way. On just culture, I wonder if some are thinking that the concept, initially designed for front-line operators, would apply to top management staff.
Now CANSO has just issued a statement on the UK failure:
CANSO (Civil Air Navigation Services Organisation), the global and regional voice of Air Traffic Management, has provided its views on the system failure that caused significant disruption in the United Kingdom earlier this week. Simon Hocquard, Director General, CANSO, said: “First and foremost, my thoughts are with those air passengers that have had their travel plans affected by this incident. Air Traffic Management organisations across the globe rely on a network of complex systems to safely maintain the separation of aircraft at all times. In the rare instances where a system fails, it can often be due to a seemingly small problem. Whenever a failure does occur the number one priority has to be, and always is, safety.”

The disruption to UK air traffic this week was caused by a failure of NATS’ flight data processing system. In order for the global air traffic system to work, any commercial or civilian aircraft flying from one airport to another needs to file a flight plan. These flight plans contain a lot of information including the route the flight will take – this is essential as each flight invariably crosses different sections of airspace often under several jurisdictions. This important information allows sectors of airspace to ensure the safety of each aircraft entering and exiting it by maintaining separation from other flights and ensuring a smooth flow of traffic.

The processing of this essential data between sectors is done automatically, and there are millions of flight plans filed globally every month without disruption due to system failures. As an example, the UK has had a decade of flight plans filed with no technical issues. In the very rare instances where technology fails, Air Navigation Service Providers revert to the manual processing of flight data. This requires a lot of manpower and cannot be done as quickly as the technology processes it and so it is necessary to reduce the number of flights in and out of airspace sectors so that the information can be accurately processed manually in a timely manner. This slowing down of the number of flights is to ensure the safety of aircraft. Once the system is back up and running and fully tested, capacity can once again be restored.

Simon Hocquard added: “ANSPs around the globe are built on a century of safety and significant investment in their people, technology and processes. NATS is one of the leading ANSPs with a very high level of performance and reputation, and the steps it took to fix the issue had safety at their very heart.”
It says a few things when one reads between the lines.
ATC Watcher is online now  
Old 31st Aug 2023, 10:00
  #167 (permalink)  
 
Originally Posted by eglnyt
The NATS licence includes a penalty scheme whereby a certain level of delay triggers a reduction in future charges. It is deliberately not punitive to avoid influence on operational decision making so it is unlikely to cover the airline's costs.
Thanks, I'd forgotten that!
Expatrick is online now  
Old 31st Aug 2023, 10:13
  #168 (permalink)  
 
Originally Posted by Neo380
NATS has already floated replacing the (1970s based!) Swanwick main ATC system.
I strongly refute the (widely held) belief that the main Swanwick ATC system is based on 1970s technology!

IBM installed a prototype 9020 system at the Jacksonville centre in 1967. The first operational 9020 system, running the NAS En Route Stage A software (progenitor to the current Swanwick system) went live on 18 February 1970.

Therefore the Swanwick main ATC system is clearly based on 1960s technology.

The UK 9020/NAS first went live at West Drayton on 2 December 1974, running the NAS En Route Stage A software.
CBSITCB is offline  
Old 31st Aug 2023, 10:26
  #169 (permalink)  
 
Originally Posted by golfbananajam
The problem with testing software is that you can't test all combinations of input values to ensure the required output values are correct, certainly not in very large or complex systems. Failure testing is often limited to defined alternate path (within the software) testing, as defined in the requirements/specification. Edge cases will always catch you out.

With that in mind, critical systems like this should always fail safe, ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time.

Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash.

I wonder how often, if ever, business continuity testing is performed which should have enabled quick recovery.
I was one of the people who signed off the upgrade in 2014 as being OK to implement. We were right as it was safe, just not resilient.

As the linked report here shows, it was all due to the 154th workstation being turned on and crashing the system. Of course some may say “How can NATS be so stupid as to not spot this and test for it?” Well, the test suite has around 90 workstations, so there is no way you can turn on 154 stations to test the software past its 153 limit. And there is no chance of getting time in >100 ATCOs’ schedules to get them all to come in and exercise all the stations, even if you had >153 to test. And you can’t test on the live system with that many stations, as of course it’s not possible to find space in the schedule to do this on a system that is live 24/7/365. So instead it relies on software engineers understanding code that was rewritten >10 years before, to work out what the 153 number meant. Obviously in that case no one understood it, or if they did, they thought it meant active stations and forgot about the ones in a half-on state. It’s not possible to retain all the knowledge from years ago unless no one resigns, retires or is made redundant, and you don’t outsource anything.
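
(Purely as an illustrative aside, and not NATS code: the names below are invented, and there is no claim that the real system could have been exercised this way. It is just a sketch of how a hard capacity limit like the 153-station one can, in principle, be probed at its boundary in a simulated harness with no physical workstations, and of how a fail-safe variant would refuse the 154th registration rather than bring the system down.)

Code:
MAX_WORKSTATIONS = 153            # hypothetical legacy limit buried in old code

_active = []                      # currently registered workstations

def register_failsafe(station_id):
    """Fail-safe variant: refuse registrations beyond the limit, keep running."""
    if len(_active) >= MAX_WORKSTATIONS:
        return False              # the 154th station is rejected, system stays up
    _active.append(station_id)
    return True

# Boundary test in simulation - no physical workstations or ATCOs needed.
if __name__ == "__main__":
    results = [register_failsafe(f"WS{i:03d}") for i in range(1, 155)]
    print(results.count(True), "accepted,", results.count(False), "rejected")
    # expected output: 153 accepted, 1 rejected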

All in all, I can't see how that incident could practically have been avoided.

I have no knowledge of this new incident, but I suspect the causes are rather similar and that, practically, it should be possible to eliminate this particular case from happening again; but you can’t say “We will never have a crash again”. Those Irishmen are whistling in the wind if they think you can. As others have pointed out, you might be able to spend a ton of money to have duplicate software, but it’s just not worth the expense. I am baffled, though, why this case took so long to recover from.

Upgrading it all to brand new software may help long term, but there will likely be more short-term disruption due to newly introduced bugs.

I think NATS compares very favourably on disruption with other ANSPs. But some (small) improvements will hopefully come out of all this.



Originally Posted by CBSITCB
I strongly refute the (widely held) belief that the main Swanwick ATC system is based on 1970s technology!

IBM installed a prototype 9020 system at the Jacksonville centre in 1967. The first operational 9020 system, running the NAS En Route Stage A software (progenitor to the current Swanwick system) went live on 18 February 1970.

Therefore the Swanwick main ATC system is clearly based on 1960s technology.

The UK 9020/NAS first went live at West Drayton on 2 December 1974, running the NAS En Route Stage A software.
I am confused. I thought it was stated here that the NAS (or is it the FPS?) hardware system had been replatformed onto new processors. Thus you are wrong, except in the sense that “I speak English and Shakespeare also spoke English, therefore my mentality is 400 years old.” Just having one's roots in something old does not make you old.

Last edited by Engineer39; 31st Aug 2023 at 16:51. Reason: Factual error
Engineer39 is offline  
Old 31st Aug 2023, 11:32
  #170 (permalink)  
 
Originally Posted by CBSITCB
I strongly refute the (widely held) belief that the main Swanwick ATC system is based on 1970s technology!

IBM installed a prototype 9020 system at the Jacksonville centre in 1967. The first operational 9020 system, running the NAS En Route Stage A software (progenitor to the current Swanwick system) went live on 18 February 1970.

Therefore the Swanwick main ATC system is clearly based on 1960s technology.

The UK 9020/NAS first went live at West Drayton on 2 December 1974, running the NAS En Route Stage A software.
Good point, 1970 installed (1960s technology)…
Neo380 is offline  
Old 31st Aug 2023, 13:29
  #171 (permalink)  
 
Nothing wrong with 1960s tech, especially computer tech. The now scrapped Type 42 Air Defence Destroyers and the CVS Aircraft Carriers (Invincible Class) had the Ferranti FM1600B computer and its operating software, ADAWS, and this was used up until 2013-14. The front end had relatively modern interfaces but the core system was run by the FM1600. It is not a direct comparison but it gives some idea of the complexity of the issues.

Programming in the early years was by punched paper tape and a laborious system of typing code which was then translated into punched holes. If you got one element of code wrong, the system would stop loading immediately. As one was typing you would occasionally type the wrong letter or number and have to start all over again, to much cursing. Programming airways, hazards etc. took hours. Towards the end of its life, programming was via floppy disk, which at least gave you editing control, but the core system would still grind to a halt if you got the syntax wrong. So the front end would have modern PC screens and keyboards, and the memory capacity was much larger, enabling things like airspace reservations and navaids to be displayed, but the core system was still there lurking in the background.

It was reliable and capable, helping the RN to achieve the first missile-to-missile engagement in GW1 in what was a very tricky shot. The only real way around the issue was a new ship, with a new command system, and that was achieved with the Type 45 Destroyer. I imagine the NAS and NERL issues to be very similar and equally complex. I very much recall the nervousness when leaving West Drayton about which plugs to turn off, as there might well have been dependencies on hardware lurking in the basement.
Widger is offline  
Old 31st Aug 2023, 13:36
  #172 (permalink)  
 
Originally Posted by Engineer39
I am confused. I thought it was stated here that the NAS (or is it the FPS) hardware system had been replatformed onto new processors.
Originally the 9020 hardware running the NAS En Route Stage A software - collectively the Flight Plan Processing System (FPPS) - was part of the wider US National Airspace System. Hence "NAS".

The UK obtained a copy of the whole shebang in the 1970s. The mouthful NAS En Route Stage A software became known as just NAS.

Over the years the hardware has been replaced several times, but always with a 9020-compatible IBM system that still runs NAS. The operating system (or "monitor" in NAS-speak) is of course tweaked to fit the upgraded hardware, but the core ATC application programmes remain essentially the same, with numerous enhancements over the years of course. They are mainly written in JOVIAL, which was the US government language of choice at the time for such systems.


CBSITCB is offline  
Old 31st Aug 2023, 13:52
  #173 (permalink)  
 
Well, every day is a learning day. I have just found out that the FM1600 also used a derivative of the JOVIAL language, namely CORAL, which was developed by the Royal Radar Establishment at Malvern. They were clever people, these '60s software engineers, doing so much with 256K.

https://en.wikipedia.org/wiki/JOVIAL
Widger is offline  
Old 31st Aug 2023, 14:03
  #174 (permalink)  
 
The NERC system at Swanwick originally had replacement of NAS in scope. It replaced lots of the functions of NAS but in the end the core Flight Data Processing functions were descoped and NAS was retained eventually moving down to Swanwick when West Drayton closed. It's been replatformed a few times but NAS at Swanwick is still there and has code in it that ran on the 9020. Exactly how much is difficult to quantify. A lot of the functionality of the original NAS is no longer used but is still in there because removing it posed a lot of risk. A lot of the more modern functionality was added since the 9020 was withdrawn and all the hardware drivers etc have been replaced.
The NERC system may have some components in it which were inspired by NAS but most of it was originally late 80s early 90s. It has also been replatformed, some core functionality has been totally replaced, and a lot of functionality has been added since it first handled ops.
eglnyt is offline  
Old 31st Aug 2023, 14:20
  #175 (permalink)  
 
An example of how a flight plan could trip out the UK system, while maybe not affecting IFPS, can also lie in the callsign/flight number.

The UK system has trouble (there is a safety net now, if everything is followed) with any callsign containing "NNN", either in the callsign itself or in Item 18 of the FPL.

To date only 2 of the 3 known aircraft have tripped the UK system, despite NNNN being the problem in AFTN messaging. But:

AFTN Message Format

The message format of AFTN messages is defined in ICAO Annex 10 Aeronautical Telecommunications Volume II.[4]

AFTN messages consist of a Heading, the Message Text and a message Ending.

The message Heading comprises a Heading Line, the Address and the Origin. The Heading Line comprises the Start-of-Message Signal which is the four characters ZCZC, the Transmission Identification, an Additional Service Indication (if necessary) and a Spacing Signal.

The Message Text ends with the End-of-Message Signal, which is the four characters NNNN. The Ending itself comprises twelve letter shift signals which represent also a Message-Separation Signal.



1. DC Aviation have a C56X registered DCNNN, but it files under the fixed callsign DCS705; however, since the hex code and registration now have to be included in the UK system, this is still a trip-out risk.

The other 2 known ones have passed:

2. JYNNN, a C172, was delivered to Bournemouth back in 2020 and tripped our system (which was rectified fairly quickly).

3. MNNNN, a Gulfstream 6, was registered by a Russian owner on 16/7/2014. It became a regular visitor to the UK and I exchanged many e-mails with the operator, pointing out that their ID was going to be a problem at all airports, as the flight plan could not reach addressees, stopping after MNNNN. The owner re-registered the aircraft as MNGNG.

The fact that our system stops after 3 N's is strange.

Other countries have quirks too: the FAA system does not like callsigns beginning with a number, as I was advised by a colleague in the FAA while carrying out an investigation, which is awkward for Barbados (8P-). This may have been noticed at airports with 8P-ASD, a Gulfstream 6, which will often file as "X8PASD", as does the Malaysian Government's 9MNAA A320, which files with a leading letter.
Luckily most Maltese bizjets fly with an RTF tri-graph, i.e. VJT for Vistajet; the same goes for the large number of Guernsey aircraft.
This may have been sorted, but the owner of 8PASD carries on.
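
(To make the terminator clash concrete, here is a deliberately simplified sketch, not the actual NATS or IFPS code and using a fictitious message, of how a parser that hunts for the first "NNNN" in the raw stream truncates a message whose own text contains that sequence, such as the MNNNN callsign above, and of how a more defensive parser would reject the message for manual handling rather than silently truncate it.)

Code:
def extract_text_naive(stream):
    """Treat the first 'NNNN' found anywhere as the end-of-message signal."""
    end = stream.find("NNNN")
    return stream if end == -1 else stream[:end]

def extract_text_defensive(stream):
    """Accept NNNN only when it stands alone on a line; otherwise reject the
    message for manual handling rather than silently truncating it."""
    lines = stream.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "NNNN":
            return "\n".join(lines[:i])
    raise ValueError("no end-of-message signal found - reject, do not guess")

# Fictitious AFTN-style message purely for illustration.
msg = ("ZCZC ABC0123\n"
       "FF EGZYIFPS\n"
       "(FPL-MNNNN-IG ... rest of the plan omitted)\n"
       "NNNN")
print(extract_text_naive(msg))      # truncated inside the callsign MNNNN
print(extract_text_defensive(msg))  # full text, terminator on its own line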








Last edited by Murty; 31st Aug 2023 at 14:27. Reason: edit
Murty is offline  
Old 31st Aug 2023, 15:49
  #176 (permalink)  
 
Originally Posted by Abrahn
You mean that there's no automated testing? No regression testing? And the system has never been tested to its design limit?
Probably not the system that caused an issue on Monday but the investigation into that incident discussed the suitability of the Test Regime.

The report is here: https://www.caa.co.uk/media/r42hircd...port-2-0-1.pdf
Sections 2.5 and G4 are the relevant parts.
eglnyt is offline  
Old 31st Aug 2023, 15:53
  #177 (permalink)  
 
Beginning in 2011, after several decades of abortive attempts, the FAA successfully introduced a new ground-up FDPS that included replacement of the 1960s En Route Stage A software. This was the $2.1 billion ERAM (En Route Automation Modernization) system. Roll-out to all the US centres took several more years.

Despite the presumed use of modern software engineering techniques by Lockheed Martin, the new FDPS is not immune to latent bugs in its flight plan processing software. To illustrate some of the complexity involved, and the impossibility of testing for every possible failure mode, here is an account of an FDPS system crash at the Los Angeles centre in 2014 (edited for brevity):

ERAM has a capability called “look-ahead" which searches for potential conflicts between aircraft based on their projected course, speed, and altitude. Because of the computing requirements for handling look-ahead for all flights within a given region of controlled airspace, Lockheed Martin designed the system to limit the amount of data that could be input by air traffic controllers for each flight. And since most flights tend to follow a specific point-to-point course or request operation within a limited altitude and geographic area, this hadn't caused a problem for ERAM during testing.

A flaw in the system was exposed when a U-2 spy plane entered the air traffic zone managed by the system in Los Angeles. The aircraft had a complex flight plan, entering and leaving the zone of control multiple times. On top of that, the data set for the U-2 flight plan came close to the size limit for flight plan data imposed by the design of the ERAM system. Even so, the flight plan data lacked altitude data, so it was manually entered by an air traffic controller as 60,000 feet.

The system evaluated all possible altitudes along the U-2's planned flight path for potential collisions with other aircraft. That caused the system to exceed the amount of memory allotted to handling the flight's data, which in turn resulted in system errors and restarts. It eventually crashed the ERAM look-ahead system, affecting the FAA's conflict-handling for all the other aircraft in the zone controlled out of its Los Angeles facility.

As a result, facility managers declared ATC Zero, which suspended operations and cleared the Centre's airspace. The event impacted air traffic operations with over 400 flight delays reported throughout the NAS and as many as 365 cancellations just in the Los Angeles Centre airspace alone. According to FAA the event lasted for about 2 hours, but the impact on the traveling public throughout the National Airspace System lasted for over 24 hours.
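
(A very rough sketch of that failure mode, with the obvious caveat that this is not ERAM code and the numbers are invented: the probe expands route-point and altitude candidates, and a plan with a very long route and no filed altitude forces the widest possible expansion, which can blow through a fixed memory allotment unless the expansion is capped and the plan handed off for manual handling.)

Code:
MEMORY_BUDGET = 50_000            # invented cap on candidate records

def expand_candidates(route_points, altitudes):
    """Expand (route point, altitude) pairs for the look-ahead conflict probe."""
    candidates = []
    for p in range(route_points):
        for alt in altitudes:
            candidates.append((p, alt))
            if len(candidates) > MEMORY_BUDGET:
                # Defensive alternative to what actually happened: stop expanding
                # and flag the plan for manual handling instead of exhausting
                # memory and restarting the subsystem.
                raise MemoryError("flight plan too complex for automated probe")
    return candidates

# A typical point-to-point flight: modest expansion.
print(len(expand_candidates(40, range(100, 410, 10))))   # 1,240 candidates
# A U-2-style plan: very long route, altitude unknown so every level is probed.
try:
    expand_candidates(600, range(0, 601))                # FL000 to FL600
except MemoryError as e:
    print("rejected:", e)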


Last edited by CBSITCB; 1st Sep 2023 at 10:59. Reason: Typo
CBSITCB is offline  
Old 31st Aug 2023, 16:44
  #178 (permalink)  
 
Originally Posted by Abrahn
You mean that there's no automated testing? No regression testing? And the system has never been tested to its design limit?
Of course there is lots of testing. I am not a software engineer so can’t state the specifics. I’m just saying that, as in all complex software, there is no way to exercise every single combination of inputs to find out if one combination that crops up once in 10 or 100 years will trigger something undesirable. To use an analogy: would a car manufacturer test a car at 12,000 feet altitude with 2 adults and 2 children on board when it is raining and the tyres are a little flat? It will test each case separately but not all in combination. Maybe in 10 years’ time someone who meets those conditions finds that the car is a bit unstable and slides off the road; then he tries to blame the manufacturer for not testing that condition.
Engineer39 is offline  
Old 31st Aug 2023, 20:37
  #179 (permalink)  
 
Originally Posted by Engineer39
I am not a software engineer so can’t state the specifics.
Understood

Originally Posted by Engineer39
I’m just saying that, like in all complex software, there is no way to exercise every single combination of inputs to find out if one combination of inputs that crops up once in 10 or 100 years will trigger something undesirable.
As detailed above, a "test mode" (one that does not require real hardware or personnel inputs) can feed a module a wide range of simulated inputs, crucially including testing at the limits.

To refer to the 2014 case, there was a specified limit of 193 "atomic functions". There appears to have been no testing done to stress the system at 193, or to see how it dealt with an input of 194. Such testing could have revealed not only that the limit was erroneously at 152 (or 153?), but also that on the increment to 154 the system does not reject the change: it falls over.

The major error seems to be the upgrade shortly before, adding potential (military) inputs. A correct upgrade process should have identified those added inputs and stress-tested the system with them against the maximum previous inputs. Had that testing been done, it hopefully would have included flexing the civil inputs as well. The testing is there not just to exercise the added inputs (which the upgrade design should hopefully allow for), but to expose any latent errors that "survived" until now but fail in these conditions (as happened).
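
(For what it's worth, a boundary test of the kind being argued for might look like this in outline. This is a pytest-style sketch only; SPEC_LIMIT and load_plan are hypothetical stand-ins for the specified 193-function limit and the module under test, not the real system.)

Code:
import pytest

SPEC_LIMIT = 193   # specified maximum number of "atomic functions" per plan

def load_plan(atomic_functions):
    """Stand-in for the module under test: accept up to the limit, reject beyond."""
    if atomic_functions > SPEC_LIMIT:
        raise ValueError("plan exceeds specified limit - rejected, not processed")
    return "accepted"

@pytest.mark.parametrize("n", [0, 1, SPEC_LIMIT - 1, SPEC_LIMIT])
def test_within_limit_is_accepted(n):
    assert load_plan(n) == "accepted"

@pytest.mark.parametrize("n", [SPEC_LIMIT + 1, SPEC_LIMIT * 10])
def test_over_limit_is_rejected_cleanly(n):
    with pytest.raises(ValueError):
        load_plan(n)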

I suspect the real problem is that a system which has grown over decades, with multitudes of different sub-modules from different suppliers / languages / standards, sort of "works" until some combination of inputs / events causes an inappropriately handled exception (as in 2014, and on Monday?). Yes, it is good to then close the loophole (as in the 2014 CAA report), but that does nothing to identify / remove the next latent error. The 2014 report seems rather self-congratulatory on the CAA's part, dwelling on how hard software errors are to detect rather than discussing a systematic way to find the errors before real-life carnage.
Gupeg is online now  
Old 31st Aug 2023, 20:39
  #180 (permalink)  
 

Originally Posted by Engineer39
Of course there is lots of testing. I am not a software engineer so can’t state the specifics. I’m just saying that, as in all complex software, there is no way to exercise every single combination of inputs to find out if one combination that crops up once in 10 or 100 years will trigger something undesirable. To use an analogy: would a car manufacturer test a car at 12,000 feet altitude with 2 adults and 2 children on board when it is raining and the tyres are a little flat? It will test each case separately but not all in combination. Maybe in 10 years’ time someone who meets those conditions finds that the car is a bit unstable and slides off the road; then he tries to blame the manufacturer for not testing that condition.
That's really missing the point, as has been said a number of times.

This isn't an 'infinitesimal circumstance' that could never be tested for (unless all human inputs have become 100% reliable, which they are not, and never can be).

This is all about how a failover works, or rather doesn't work. The system should have switched to an iteration of the software that wasn't identical, and so wasn't bound to immediately fail again; because if it does, the system has no fail-safe and is bound to collapse, catastrophically.
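
(As a minimal sketch of that argument, and nothing more: this is hypothetical code, not NATS architecture, and the 'bug' here is invented purely for illustration. If the standby runs byte-identical software, the same poison input kills it too; a diverse, or even deliberately cruder, fallback path can quarantine the input and keep the service up.)

Code:
def parse_strict(plan):
    """Primary parser. Hypothetical latent defect: chokes on an over-long token."""
    points = plan.split()
    if any(len(p) > 5 for p in points):
        raise RuntimeError("unhandled over-long waypoint token")   # the latent bug
    return {"route": points}

def parse_fallback(plan):
    """Diverse fallback: cruder, but tolerant - flags the plan for manual review."""
    return {"route": plan.split(), "needs_manual_review": True}

def process(plan):
    try:
        return parse_strict(plan)
    except Exception:
        # An identical-copy failover would simply run parse_strict again and
        # fail the same way; a diverse path at least keeps the system running.
        return parse_fallback(plan)

print(process("DVR KONAN KOK"))           # normal plan, primary path
print(process("DVR OVERLONGFIX KOK"))     # poison input, handled by diverse path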

That is the issue that is being skirted around, and was the core fault of the 2014 failure - very bad IT architecture.

Describing 'infinitesimal circumstances' and the '100 years of testing that couldn't identify them' has nothing to do with building failovers that fail, and then fail over and fail again, through poor design.
Neo380 is offline  

