U.K. NATS Systems Failure
Join Date: Nov 2018
Location: UK
Posts: 82
Join the dots...
There are myriad issues running here, but there won't be compensation under the Transport Act because this incident is being classed as an 'exceptional situation' - but is it?
Short answer: no. It's a repeat of the 2014 incident (interim and final reports are available - they wouldn't attach for some reason), but as mentioned, like Martin Rolfe's statement, there's 'a lot of puff and very little explanation' in them. The CAA never got to the root cause of the issue. I know less about the 2009 failover, as it was before my time.
As context, describing wide-scale, safety-critical IT systems is a bit like trying to give a headline summary of War and Peace: basically you can't. But there are certain key IT principles that should be present, such as: so long as your safety-critical system is still within its capacity parameters, it should not fail over unsuccessfully (it should 'stay up', as the old IBM 9020 system did, 100%). Think about it for a moment: if the Hinkley Point nuclear power station had infrequent but repeated 'unsuccessful failovers', we would have had two, potentially three, Fukushimas by now! But note, it is the flight planning system that is failing, not the radar links or voice comms - yet. That would be a complete disaster.
Another critical IT principle is not having backups with the exact same code as the main system - again, when you think about it, this is totally obvious. If a tube train runs through a signalling junction because of a 'software glitch', you don't want the train after it, and the one after that, to go piling into the first train! And this is the core issue: the age of the Swanwick ATC system notwithstanding, it has the same code in the backup, and in the backup's backup! This is pure mismanagement, and why the incident is likely to recur.
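For the software-minded, the common-mode failure being described can be sketched in a few lines of Python. This is an illustrative toy, not NATS code: two identical replicas of a buggy parser fall over on the same poisoned input, while an independently written backup keeps the service up.

```python
def parse_v1(plan):
    """Primary parser with a hidden bug: assumes the route field is usable."""
    fields = plan.split("-")
    if not fields[1]:                # blows up on a plan with an empty route
        raise ValueError("empty route")
    return fields

def parse_v2(plan):
    """Independently written backup: validates before trusting the input."""
    fields = plan.split("-")
    if len(fields) < 2 or not fields[1]:
        return "REJECT"              # reject the one bad message, stay up
    return fields

def run_with_failover(plan, parsers):
    """Try each implementation in turn; a crash triggers failover."""
    for parse in parsers:
        try:
            return ("ok", parse(plan))
        except Exception:
            continue                 # this replica is down, try the next
    return ("system down", None)

poisoned = "FPL-"                    # malformed message with an empty route

# Same code in main, backup and backup's backup: all fall over together.
print(run_with_failover(poisoned, [parse_v1, parse_v1, parse_v1]))
# → ('system down', None)

# A diversely implemented backup survives the very same input.
print(run_with_failover(poisoned, [parse_v1, parse_v2]))
# → ('ok', 'REJECT')
```

The point is not the parsing itself but the failover topology: however many copies of parse_v1 you stack up, the same bad data takes them all down in sequence.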
Lastly, culture has a lot to answer for here. NATS' well-publicised 'just culture' is known internally as the 'nobody can be wrong culture'. Of course, if you make a mistake when in position, like falling asleep (a real incident, btw), lessons need to be learned: more sleep provisioned for, proper rest breaks, procedures for if you suddenly feel very tired, etc - that's all fine. But in encouraging people to come forward when incidents occur, the promise is 'you won't be actioned (disciplined) for what happened', and this has leaked into other areas, like IT governance, where no one can be blamed for mistakes that have been made, even in critical failover architecture. And this is a highly risky position, hence all the 'puff'.
Failsafes should absolutely work, period. Typos in FPLs should be caught, but if they are not, the system should reject them, not collapse - and critically, neither should both backups.
Join Date: Nov 2018
Location: UK
Posts: 82
From one of our most esteemed aviation colleagues, (Professor S):
'I just uploaded a postcode in the wrong format into Google Maps and all the traffic lights in London have stopped working. Ha.'
Sums it up nicely.
Not sure why you think they didn't get to the root cause of the 2014 problem: it was clearly identified. The problem this time may well be unrelated; time will tell. Just culture is not what you describe, & not limited to NATS. What you describe is a "no blame" culture, which has been out of favour for decades, for the reasons you suggest. A just culture draws a distinction between honest mistakes & errors, which occur in any environment, & non-conformance. The latter is definitely neither acceptable nor accepted.
Join Date: Aug 2003
Location: FR
Posts: 234
That Flight Plan was "checked" by at least 2 other systems before it got to where it caused the issue. There are perfectly valid flight plans which are known to cause the UK flight data processor issues and they are screened before they get there. It could be that a new one has now been added to the list.
Sorry, I imagine more than I should ... just curious!
Join Date: Oct 2004
Location: Southern England
Posts: 484
I don't think this time it was the Swanwick system, but the previous review following the 2014 incident pointed out that fully testing every state of that system would take over 100 years. There will be bugs in any complex system. You can't eliminate them with a bit more testing.
Join Date: Feb 2007
Location: Central Scotland
Posts: 61
The 9020D was special....
....when it came into service in the early 1970s.
Bought from the Americans, who were the only other users at the time, it was cutting edge, and took a lot of manpower to maintain, and a lot of power and cooling.
Decades later, it wasn't of course, but the new FDP system basically re-platformed the old system.
" Stuck flighplans were common, and " Restarts and " Flops" a weekly occurrence.
Extra functionality on top of the original code was supposed to check the plan for "legality" to prevent bad data crashing the system....and generally did. Failures went from being weekly to yearly or longer.
The NATS CEO indicated this morning that a piece of the system (which has to be the FPPS) failed because it didn’t recognize a message, which was almost certainly an FPL.
People are questioning how a “bad” FPL came to be accepted into the FPPS. It is important to recognize that an FPL has syntax (format) and semantics (meaning).
If the syntax is correct, it is a valid FPL. By far the most complex element of an FPL is the route. The other elements are just parameters that are checked for validity. For example, if aircraft type is stated as “C172” the FPPS checks this against a list of valid aircraft types in its database.
The route syntax is checked to make sure the expression follows the rules of how the route elements should be constructed. Whilst there are many different types of route element and many rules to follow, this checking is relatively straightforward. If a “bad” FPL in terms of format is recognized it will be rejected at this stage. If not, the FPL will be passed to route conversion, where the semantics are extracted.
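As a rough illustration of the two-stage format check described above, here is a hedged Python sketch. The aircraft-type list and route grammar are invented for the example; the real FPPS rules are far richer.

```python
import re

# Illustrative subset of a reference database of valid aircraft types.
VALID_AIRCRAFT_TYPES = {"C172", "A320", "B738"}

# Toy route grammar: a route is space-separated elements, each either a
# named waypoint (2-5 letters, which also covers DCT), an airway
# (1-2 letters plus digits), or a lat/long fix such as 5130N00028W.
# Purely illustrative, not the real syntax rules.
ROUTE_ELEMENT = re.compile(
    r"^(?:[A-Z]{2,5}|[A-Z]{1,2}\d{1,3}|\d{4}[NS]\d{5}[EW])$"
)

def syntax_check(aircraft_type, route):
    """Stage one: format only. Semantics come later, in route conversion."""
    if aircraft_type not in VALID_AIRCRAFT_TYPES:
        return False                          # parameter fails the lookup
    elements = route.split()
    if not elements:
        return False                          # an FPL must carry a route
    return all(ROUTE_ELEMENT.match(e) for e in elements)

print(syntax_check("C172", "LAM L9 5130N00028W DCT"))  # → True
print(syntax_check("C172", "lam??"))                   # → False (bad format)
print(syntax_check("ZZZZ9", "LAM"))                    # → False (unknown type)
```

The key point the post makes is that passing this stage only means the plan is well-formed; a syntactically valid plan can still carry meaning the downstream converter has never seen.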
This is why I can’t see how the statement “it didn’t recognize a message” could lead to a processing failure. If it didn’t recognize a message, it would be rejected at this stage – business as usual. I can only assume the message was recognized as valid and passed on.
The FPPS must now work out what the actual route is within its airspace (the semantic meaning), and this is the really difficult bit. There is an infinite number of possibilities. For example, route fixes can be expressed as lat/long coordinates which could be literally anywhere. The programme works out what it needs to do, in terms of outputting information to controllers and adjacent centres, and my guess is that this is the source of the problem that caused the FPPS to crash.
The programme came across an unusual route it had not encountered before (and had not been programmed to expect), didn’t know what to do, and a graceful recovery was not available. In other words, it encountered a bug and did something unpredictable.
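That "no graceful recovery" failure mode can be sketched too. Again a toy, assuming a hypothetical dispatch-by-element-kind converter: an element kind the programmer never anticipated raises an unhandled error, and whether that takes out one plan or the whole processor depends entirely on where the recovery boundary sits.

```python
def convert_route(elements):
    """Toy semantic conversion: dispatch each route element to a handler."""
    handlers = {
        "WPT": lambda v: ("fix", v),
        "AWY": lambda v: ("airway", v),
    }
    # An element kind missing from the table raises KeyError here - the
    # "didn't know what to do" case, with no graceful recovery of its own.
    return [handlers[kind](value) for kind, value in elements]

def process_plans(plans):
    """Recovery boundary per plan: quarantine the bad one, stay up."""
    converted, quarantined = [], []
    for plan in plans:
        try:
            converted.append(convert_route(plan))
        except Exception:
            quarantined.append(plan)
    return converted, quarantined

plans = [
    [("WPT", "LAM"), ("AWY", "L9")],
    [("LATLON", "5130N00028W")],     # syntactically fine, unanticipated kind
]
converted, quarantined = process_plans(plans)
print(len(converted), len(quarantined))  # → 1 1
```

Without the try/except around each plan, the second entry would propagate its exception upward and take the whole loop - the whole processor - down with it.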
Just my guess.
Last edited by CBSITCB; 30th Aug 2023 at 11:30. Reason: Typo.
Join Date: Feb 2023
Location: UK
Posts: 2
Nobody else think this is a massive over-reaction by all involved? One outage every nine years? That's not bad, is it?
It didn't even really fail. The service they provide was significantly slower for a few hours while they went manual. Is it NATS' fault that R***air run their operation with no margin on crew hours? Or that Karen Pax would rather screech on Twitter than find a hotel for an extra night while it all blows over?
"How did we ever win the war?" etc.
Join Date: Jun 2006
Location: Luton
Posts: 12
My wife told me that on Jeremy Vine's show at lunchtime it was stated that the faulty flight plan should have been submitted as a PDF document but wasn't. Surely not!? Someone please tell me that the system does not rely on machine-reading of a format specifically designed to be presented to, and seen by, a human the same way regardless of platform. I initially thought the Daily Mail's report, that a faulty plan submitted by a French carrier was the problem, was rubbish, and pooh-poohed it because it was the Mail. It seems now they were probably right. These days, it seems, even the incredible is real.
Join Date: May 2001
Location: England
Posts: 1,904
Whatever's going on, we have evidence that there are no independent backup systems at NATS. Whatever processes they have go through a single point of failure. That can't be news to the developers and managers at NATS. They will have known about it.
Join Date: Oct 2004
Location: Southern England
Posts: 484
If I have the world's most sophisticated system & it cost me, for the sake of argument, £500 million, a truly independent backup would more than double that cost, even if you could get 2 capable of handling all the traffic. So I go to the two very vocal Irishmen & tell them I want to double the bit of their fees that covers systems & interest on capital investment, to avoid a disruption every 10 years or so, but I can't guarantee that there will never be any disruption even if I spend that money. What do you think their response would be?
Pegase Driver
That said, today if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your backup system is likely to be used only a few minutes a year. Also, we do not have backup systems only to maintain capacity; they are primarily there to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard); it caused delays, diversions and cancellations, but that is a capacity consequence, not a safety one.
What caused the system to crash is not really the important issue - nice to know, to learn from and to prevent it happening again - but why it took so long to restart the primary system is the question I would like to ask if I had the chance.
An independent backup will help in the event of hardware or power failure, but if both systems are using identical software they will react in the same way to the same input. If one crashes because of bad data, so will any backup.
Join Date: Oct 2004
Location: Southern England
Posts: 484
There is a big difference between a functional backup & a truly independent backup. The latter implies different software from the main. I don't know any ANSP that has a truly independent backup capable of handling the same traffic as the main. If you just have a second copy of the main then when you feed it the same data it will do exactly the same.
Last edited by Ninthace; 30th Aug 2023 at 17:56.
Join Date: Aug 2015
Location: In the mist
Posts: 16
Some systems will run with the primary on version n and the backup on version n-1 so that the backup won’t be affected by a newly introduced bug.
That falls down of course if an undetected bug was in version n-15.
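A minimal sketch of that staggered scheme, with invented version labels: promoting release n+1 to the primary leaves the backup on n, so a bug new in n+1 cannot hit both at once - but, as noted, a bug latent since n-15 is present in both regardless.

```python
def rollout(state, new_version):
    """Promote the new release to primary; the old primary becomes backup.

    The backup therefore always lags one release behind, so a freshly
    introduced bug cannot be running on both sides at the same time.
    """
    return {"primary": new_version, "backup": state["primary"]}

state = {"primary": "v16", "backup": "v15"}
state = rollout(state, "v17")
print(state)                                 # → {'primary': 'v17', 'backup': 'v16'}
assert state["primary"] != state["backup"]   # never the same release twice

# The limitation: a bug that shipped back in v2 lives on in v16 AND v17,
# so version staggering alone is no defence against it.
```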
Misty.