
U.K. NATS Systems Failure

Rumours & News Reporting Points that may affect our jobs or lives as professional pilots. Also, items that may be of interest to professional pilots.

Old 30th Aug 2023, 09:59
  #121 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 57
Received 0 Likes on 0 Posts
Originally Posted by BristolScout
I seem to remember the Chippy having a vertical DI?
A compass and a DI are separate and different things.
CBSITCB is offline  
Old 30th Aug 2023, 10:08
  #122 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 81
Likes: 0
Received 0 Likes on 0 Posts
Join the dots...

There are myriad issues running here. There won't be compensation under the Transport Act because this incident is being classed as an 'exceptional situation', but is it?

Short answer: no. It's a repeat of the 2014 incident (interim and final reports are available; they wouldn't attach for some reason), but, as mentioned, like Martin Rolfe's statement, there's 'a lot of puff, and very little explanation' in them. The CAA never got to the root cause of the issue. I know less about the 2009 failover, as it was before my time.

As context, describing wide-scale, safety-critical IT systems is a bit like trying to give a headline summary of War and Peace: basically, you can't. But there are certain key IT principles that should be present. One is that, so long as your safety-critical system is still within its capacity parameters, it should not fail over unsuccessfully (it should 'stay up', as the old IBM 9020 system did, 100%). Think about it for a moment: if the Hinkley Point nuclear power station had infrequent but repeated 'unsuccessful failovers' we would have had two, potentially three, Fukushimas by now! But note, it is the flight planning system that is failing, not the radar links or voice comms, yet. That would be a complete disaster.

Another critical IT principle is not having backups with the exact same code as the main net. Again, when you think about it, this is totally obvious. If a tube train continues through a signalling junction because of a 'software glitch', you don't want the train after it, and the one after that, to go piling into the first train! And this is the core issue: the age of the Swanwick ATC system notwithstanding, it has the same code in the backup, and in the backup's backup! This is pure mismanagement, and why the incident is likely to recur.

Lastly, culture has a lot to answer for here. NATS's well-publicised 'just culture' is known internally as the 'nobody can be wrong culture'. Of course, if you make a mistake when in position, like falling asleep (a real incident, btw), lessons need to be learned: more sleep provisioned for, proper rest breaks, procedures for if you suddenly feel very tired, etc. That's all fine. But in encouraging people to come forward when incidents occur, the promise is 'you won't be actioned (disciplinary) for what happened', and this has leaked into other areas, like IT governance, where no one can be blamed for mistakes that have been made, even in critical failover architecture. And this is a highly risky position, hence all the 'puff'.

Failsafes should absolutely work, period. Typos in FPLs should be caught, but if they are not, the system should reject them, not collapse. And critically, nor should both backups!
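
To put "reject, not collapse" in concrete terms, here is a minimal sketch in Python. It is not anything NATS actually runs: the three-field message format and the names `Processor` and `RejectedMessage` are invented purely for illustration. The idea is only that one bad message gets set aside and the next one is still processed.

```python
from dataclasses import dataclass, field


class RejectedMessage(Exception):
    """Raised when a single message cannot be processed."""


@dataclass
class Processor:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)

    def process(self, message: str) -> None:
        # Hypothetical rule: exactly three non-empty dash-separated fields.
        # Real flight plan rules are far richer than this.
        fields = message.split("-")
        if len(fields) != 3 or not all(fields):
            raise RejectedMessage(f"malformed message: {message!r}")
        self.accepted.append(fields)

    def handle(self, message: str) -> bool:
        """Process one message; contain any failure to that message."""
        try:
            self.process(message)
            return True
        except Exception as exc:  # unexpected errors are contained too
            self.quarantined.append((message, str(exc)))
            return False


proc = Processor()
results = [proc.handle(m) for m in ["ABC123-EGLL-LFPG", "garbage", "XYZ9-EGCC-EHAM"]]
```

The broad `except` is deliberate in this sketch: the message that nobody anticipated is exactly the one that must not take the whole service down with it.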
Neo380 is offline  
Old 30th Aug 2023, 10:15
  #123 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 81
Likes: 0
Received 0 Likes on 0 Posts
From one of our most esteemed aviation colleagues, (Professor S):

'I just uploaded a postcode in the wrong format into google maps and all the traffic lights in London have stopped working. Ha.'

Sums it up nicely.
Neo380 is offline  
Old 30th Aug 2023, 10:17
  #124 (permalink)  
 
Join Date: Nov 2006
Location: UK
Age: 58
Posts: 244
Received 18 Likes on 7 Posts
Not sure why you think they didn't get to the root cause of the 2014 problem: it was clearly identified. The problem this time may well be unrelated, time will tell. Just culture is not what you describe, & not limited to NATS. What you describe is a "no blame" culture, which has been out of favour for decades, for the reasons you suggest. A just culture draws a distinction between honest mistakes & errors, which occur in any environment, & non conformance. The second is definitely not acceptable nor accepted.
alfaman is offline  
Old 30th Aug 2023, 10:22
  #125 (permalink)  
 
Join Date: Aug 2003
Location: FR
Posts: 233
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
That Flight Plan was "checked" by at least 2 other systems before it got to where it caused the issue. There are perfectly valid flight plans which are known to cause the UK flight data processor issues and they are screened before they get there. It could be that a new one has now been added to the list.
This may suggest that a given "problematic" FPL will always, or almost always, trigger a problem across a multitude of possible "environments" (e.g. the rest of the traffic and other dynamic data)? It would then be conceivable to do one more "dry" test before the data is sent to the real live system? (Or, on the contrary, how "new" was the problem FPL ... how many days does it take for a given "bad" FPL to trigger something like this?)
Sorry, I imagine more than I should ... just curious!

pax2908 is offline  
Old 30th Aug 2023, 10:33
  #126 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 470
Likes: 0
Received 0 Likes on 0 Posts
I don't think this time it was the Swanwick system but the previous review following the 2014 incident pointed out that to fully test every state of that system would take over 100 years. There will be bugs in any complex system. You can't eliminate them by a bit more testing.
eglnyt is offline  
Old 30th Aug 2023, 11:15
  #127 (permalink)  
 
Join Date: Feb 2007
Location: Central Scotland
Posts: 61
Likes: 0
Received 0 Likes on 0 Posts
The 9020D was special....

....when it came into service in the early 1970's.

Bought from the Americans, who were the only other users at the time, it was cutting edge, and took a lot of manpower to maintain, and a lot of power and cooling.

Decades later, it wasn't of course, but the new FDP system basically re-platformed the old system.

"Stuck" flight plans were common, and "Restarts" and "Flops" were a weekly occurrence.

Extra functionality on top of the original code was supposed to check the plan for "legality" to prevent bad data crashing the system....and generally did. Failures went from being weekly to yearly or longer.



Originally Posted by chevvron
The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.
FlyingApe is offline  
Old 30th Aug 2023, 12:22
  #128 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 57
Received 0 Likes on 0 Posts
The NATS CEO indicated this morning that a piece of the system (which has to be the FPPS) failed because it didn’t recognize a message, which was almost certainly an FPL.

People are questioning how a “bad” FPL came to be accepted into the FPPS. It is important to recognize that an FPL has syntax (format) and semantics (meaning).

If the syntax is correct, it is a valid FPL. By far the most complex element of an FPL is the route. The other elements are just parameters that are checked for validity. For example, if aircraft type is stated as “C172” the FPPS checks this against a list of valid aircraft types in its database.

The route syntax is checked to make sure the expression follows the rules of how the route elements should be constructed. Whilst there are many different types of route element and many rules to follow, this checking is relatively straightforward. If a “bad” FPL in terms of format is recognized it will be rejected at this stage. If not, the FPL will be passed to route conversion, where the semantics are extracted.

This is why I can’t see how the statement “it didn’t recognize a message” could lead to a processing failure. If it didn’t recognize a message, it would be rejected at this stage – business as usual. I can only assume the message was recognized as valid and passed-on.

The FPPS must now work out what the actual route is within its airspace (the semantic meaning), and this is the really difficult bit. There is an infinite number of possibilities. For example, route fixes can be expressed as lat/long coordinates which could be literally anywhere. The programme works out what it needs to do, in terms of outputting information to controllers and adjacent centres, and my guess is that this is the source of the problem that caused the FPPS to crash.

The programme came across an unusual route it had not encountered before (and had not been programmed to expect), didn’t know what to do, and a graceful recovery was not available. In other words, it encountered a bug and did something unpredictable.

Just my guess.
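
To illustrate the syntax/semantics split described above, here is a rough Python sketch. Everything in it is an assumption made for the example: the tiny `KNOWN_TYPES` and `KNOWN_FIXES` tables, the simplified route grammar, and the function names are hypothetical and bear no relation to the real FPPS. The point is only that a message can pass the format check and still fail when its meaning is worked out, and that the second stage needs its own graceful way out.

```python
import re

# Hypothetical reference data; a real system would hold far more.
KNOWN_TYPES = {"C172", "A320", "B738"}
KNOWN_FIXES = {"DVR": (51.16, 1.36), "LAM": (51.65, 0.15)}

FIX_RE = re.compile(r"^[A-Z]{2,5}$")
LATLON_RE = re.compile(r"^(\d{2})([NS])(\d{3})([EW])$")


def check_syntax(aircraft_type: str, route: str) -> list[str]:
    """Stage 1: reject anything that is not well formed."""
    if aircraft_type not in KNOWN_TYPES:
        raise ValueError(f"unknown aircraft type {aircraft_type!r}")
    elements = route.split()
    if not elements:
        raise ValueError("empty route")
    for element in elements:
        if not (FIX_RE.match(element) or LATLON_RE.match(element)):
            raise ValueError(f"bad route element {element!r}")
    return elements


def convert_route(elements: list[str]) -> list[tuple[float, float]]:
    """Stage 2: work out what the well-formed route actually means."""
    points = []
    for element in elements:
        m = LATLON_RE.match(element)
        if m:
            lat = float(m.group(1)) * (1 if m.group(2) == "N" else -1)
            lon = float(m.group(3)) * (1 if m.group(4) == "E" else -1)
            points.append((lat, lon))
        elif element in KNOWN_FIXES:
            points.append(KNOWN_FIXES[element])
        else:
            # Syntactically fine, but meaningless to this system.
            raise LookupError(f"fix {element!r} not in database")
    return points


def accept(aircraft_type: str, route: str) -> str:
    try:
        elements = check_syntax(aircraft_type, route)
    except ValueError as exc:
        return f"rejected (syntax): {exc}"
    try:
        convert_route(elements)
    except Exception as exc:
        return f"referred for manual handling (semantics): {exc}"
    return "accepted"
```

In this sketch, `accept("C172", "DVR ZZZZZ")` passes stage 1, because `ZZZZZ` looks like a fix name, and is only caught in stage 2, which is the kind of case being guessed at here.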

Last edited by CBSITCB; 30th Aug 2023 at 12:30. Reason: Typo.
CBSITCB is offline  
Old 30th Aug 2023, 12:36
  #129 (permalink)  
 
Join Date: Feb 2023
Location: UK
Posts: 1
Likes: 0
Received 0 Likes on 0 Posts
Nobody else think this is a massive over-reaction by all involved? One outage every 9 years? That's not bad is it.
It didn't even really fail. The service they provide was significantly slower for a few hours, while they went manual. Is it NATS fault that R***air run their operation with no margin on crew hours? Or that Karen Pax would rather screech on twitter than find a hotel for an extra night while it all blows over.
"How did we ever win the war?" etc.
eekeek is offline  
Old 30th Aug 2023, 14:12
  #130 (permalink)  
 
Join Date: Jun 2006
Location: Luton
Posts: 12
Likes: 0
Received 0 Likes on 0 Posts
My wife told me that on Jeremy Vine's show at lunchtime it was stated that the faulty flight plan should have been submitted as a pdf document but wasn't. Surely not!? Someone please tell me that the system does not rely on machine reading of a format specifically designed to be presented to and seen by a human the same regardless of platform. I initially thought the Daily Mail's report, that a faulty plan submitted by a French carrier was the problem, was rubbish and pooh-poohed it as it was the Mail. Seems now they were probably right. These days it seems even the incredible is real.
LTNABZ is offline  
Old 30th Aug 2023, 14:32
  #131 (permalink)  
 
Join Date: Nov 2006
Location: UK
Age: 58
Posts: 244
Received 18 Likes on 7 Posts
Originally Posted by LTNABZ
My wife told me that on Jeremy Vine's show at lunchtime it was stated that the faulty flight plan should have been submitted as a pdf document but wasn't. Surely not!? Someone please tell me that the system does not rely on machine reading of a format specifically designed to be presented to and seen by a human the same regardless of platform. I initially thought the Daily Mail's report, that a faulty plan submitted by a French carrier was the problem, was rubbish and pooh-poohed it as it was the Mail. Seems now they were probably right. These days it seems even the incredible is real.
I wouldn't trust anything said on the Vine show, frankly.
alfaman is offline  
Old 30th Aug 2023, 15:01
  #132 (permalink)  
 
Join Date: Jun 2006
Location: Luton
Posts: 12
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by alfaman
I wouldn't trust anything said on the Vine show, frankly.
Agree, but ditto the Mail, though
LTNABZ is offline  
Old 30th Aug 2023, 15:05
  #133 (permalink)  
 
Join Date: Jun 2006
Location: Luton
Posts: 12
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eekeek
Nobody else think this is a massive over-reaction by all involved? One outage every 9 years? That's not bad is it.
It didn't even really fail. The service they provide was significantly slower for a few hours, while they went manual. Is it NATS fault that R***air run their operation with no margin on crew hours? Or that Karen Pax would rather screech on twitter than find a hotel for an extra night while it all blows over.
"How did we ever win the war?" etc.
Yes. The "Efficiency" ideal everywhere is removing all chance anyone has of fixing problems which are inevitable, not just flights, but rail, roads, NHS, Councils, just-in-time supply chains, etc etc.
LTNABZ is offline  
Old 30th Aug 2023, 16:05
  #134 (permalink)  
 
Join Date: May 2001
Location: England
Posts: 1,904
Likes: 0
Received 0 Likes on 0 Posts
Whatever's going on, we have evidence that there are no independent backup systems at NATS. Whatever processes they have go through a single point of failure. That can't be news to the developers and managers at NATS. They will have known about it.
Superpilot is offline  
Old 30th Aug 2023, 16:15
  #135 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 470
Likes: 0
Received 0 Likes on 0 Posts
If I have the World's most sophisticated system & it cost me, for the sake of argument, 500 million, a truly independent backup would more than double that cost even if you could get 2 capable of handling all the traffic. So I go to the two very vocal Irishmen & tell them I want to double the bit of their fees that cover systems & interest on capital investment to avoid a disruption every 10 years or so but I can't guarantee that there will never be any disruption even if I spend that money. What do you think their response would be?
eglnyt is offline  
Old 30th Aug 2023, 18:04
  #136 (permalink)  
Pegase Driver
 
Join Date: May 1997
Location: Europe
Age: 73
Posts: 3,657
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
If I have the World's most sophisticated system & it cost me, for the sake of argument, 500 million, a truly independent backup would more than double that cost even if you could get 2 capable of handling all the traffic. So I go to the two very vocal Irishmen & tell them I want to double the bit of their fees that cover systems & interest on capital investment to avoid a disruption every 10 years or so but I can't guarantee that there will never be any disruption even if I spend that money. What do you think their response would be?
Never listen to the very vocal Irishmen. If you follow their logic, that of MOL at least, we should all work 12 hours a day, 7 days a week, for half the money so that his airline can make more profit. But on the cost of backup systems you are taking the problem the wrong way round. A functioning backup in an ATC centre is part of the system from the outset, regardless of the end price. You do not buy a system and later ask for a backup.
That said, today, if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your backup system is likely to be used for only some minutes a year. Also, we do not have a backup system only to maintain capacity; it is primarily there to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard). It caused delays, diversions and cancellations, but this is a capacity consequence, not a safety one.

What caused the system to crash is not really the important issue. It is nice to know, in order to learn and prevent it from happening again, but why it took so long to restart the primary system is the question I would like to ask if I had the chance.
ATC Watcher is offline  
Old 30th Aug 2023, 18:12
  #137 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,575
Received 505 Likes on 279 Posts
Originally Posted by Superpilot
Whatever's going on, we have evidence that there are no independent backup systems at NATS. Whatever processes they have go through a single point of failure. That can't be news to the developers and managers at NATS. They will have known about it.
An independent backup will help in the event of hardware or power failure, but if both systems are using identical software they will react in the same way to the same input. If one crashes because of bad data, so will any backup.
Ninthace is offline  
Old 30th Aug 2023, 18:17
  #138 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 470
Likes: 0
Received 0 Likes on 0 Posts
There is a big difference between a functional backup & a truly independent backup. The latter implies different software from the main. I don't know any ANSP that has a truly independent backup capable of handling the same traffic as the main. If you just have a second copy of the main then when you feed it the same data it will do exactly the same.
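
A toy Python illustration of that difference, with the usual caveats: `parse_v1`, `parse_independent` and `run_with_failover` are made-up names, and the "bug" is a deliberately silly one chosen so the example stays short. A second copy of the same code fails on the same input; a separately written version may or may not.

```python
def parse_v1(message: str) -> int:
    # Hypothetical latent bug: an empty message divides by zero.
    return 100 // len(message)


def parse_independent(message: str) -> int:
    # A separately written implementation that happens to guard the edge case.
    return 100 // len(message) if message else 0


def run_with_failover(message, primary, backup):
    """Try the primary; on any failure hand the same input to the backup."""
    for name, system in (("primary", primary), ("backup", backup)):
        try:
            return name, system(message)
        except Exception:
            continue
    return "both failed", None


normal = run_with_failover("abcd", parse_v1, parse_v1)
same_code = run_with_failover("", parse_v1, parse_v1)
diverse = run_with_failover("", parse_v1, parse_independent)
```

Note that the diverse backup only helps here because its author happened to handle the empty case. Independence makes a shared bug less likely; it does not guarantee there isn't one.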
eglnyt is offline  
Old 30th Aug 2023, 18:31
  #139 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,575
Received 505 Likes on 279 Posts
Originally Posted by eglnyt
There is a big difference between a functional backup & a truly independent backup. The latter implies different software from the main. I don't know any ANSP that has a truly independent backup capable of handling the same traffic as the main. If you just have a second copy of the main then when you feed it the same data it will do exactly the same.
Writing and testing the software once is hard enough. Writing and testing a second, different, version that does the same thing would be "interesting". Now throw in the need to keep both systems current with evolving user demands and constantly changing data within the system. Then add the demands of upgrades to the hardware and software from the manufacturers and the associated testing. Finally, it has to work and it has to make money. Hands up who wants to be the manager answerable to the CEO for that.

Last edited by Ninthace; 30th Aug 2023 at 18:56.
Ninthace is offline  
Old 30th Aug 2023, 18:56
  #140 (permalink)  
 
Join Date: Aug 2015
Location: In the mist
Posts: 16
Likes: 0
Received 0 Likes on 0 Posts
Some systems will run with the primary on version n and the backup on version n-1 so that the backup won’t be affected by a newly introduced bug.

That falls down of course if an undetected bug was in version n-15.
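
As a small worked example of both cases, in Python: the release numbers are arbitrary, `backup_survives` is a hypothetical helper, and it assumes a bug stays in every release after the one that introduced it until someone finds it.

```python
def backup_survives(primary: int, backup: int, bug_introduced_in: int) -> bool:
    """True if the bug takes down the primary but not the backup.

    Assumes, for illustration only, that a bug introduced in some release
    is present in every later release until it is discovered.
    """
    primary_has_bug = primary >= bug_introduced_in
    backup_has_bug = backup >= bug_introduced_in
    return primary_has_bug and not backup_has_bug


n = 20  # hypothetical current release number
new_bug = backup_survives(primary=n, backup=n - 1, bug_introduced_in=n)
old_bug = backup_survives(primary=n, backup=n - 1, bug_introduced_in=n - 15)
```

Here `new_bug` comes out `True` (the backup on n-1 is unaffected) and `old_bug` comes out `False` (both releases carry the old bug).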

Misty.
EGPI10BR is offline  
