
U.K. NATS Systems Failure

Rumours & News Reporting Points that may affect our jobs or lives as professional pilots. Also, items that may be of interest to professional pilots.

Old 14th Mar 2024, 16:01
  #401 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 480
Likes: 0
Received 0 Likes on 0 Posts
I would expect several layers of protection and restrictions that mean certain actions can only be undertaken from certain physical terminals. That hint as to issues with authentication is annoying because they've raised the issue but not explained it at all. It might however explain why the 2nd line needed to go to site and couldn't just talk someone through the steps to restart. It's a blind alley though because without removing the errant data it would have crashed again on restart.

I would also expect all remote access to the system for all foreseeable tasks to have been tested when commissioned & checked regularly since. Having raised the issue in the interim report they will hopefully cover it in detail in the main report
eglnyt is offline  
Old 15th Mar 2024, 02:13
  #402 (permalink)  
 
Join Date: Dec 2006
Location: Whanganui, NZ
Posts: 279
Received 5 Likes on 4 Posts
Originally Posted by eglnyt
I would expect several layers of protection and restrictions that mean certain actions can only be undertaken from certain physical terminals. That hint as to issues with authentication is annoying because they've raised the issue but not explained it at all. It might however explain why the 2nd line needed to go to site and couldn't just talk someone through the steps to restart. It's a blind alley though because without removing the errant data it would have crashed again on restart.

I would also expect all remote access to the system for all foreseeable tasks to have been tested when commissioned & checked regularly since. Having raised the issue in the interim report they will hopefully cover it in detail in the main report
The management of a Critical Infrastructure provider that operates 24x7x365 - all day, every day, without exception - has security arrangements that preclude off-site operation of certain second-level support and maintenance aspects of their system. That is probably good practice.
However, they set up the second-line support so that the person who would be required to carry out these on-site-only tasks was on call, not working on site. Doubtless this is a much cheaper solution than a continuous on-site presence. And since the circumstances that would require the engineer to attend were regarded as highly unlikely, it looked like a good bet - or it would have done, so long as you ignored the impact of delayed rectification of the fault.
To add insult to injury, as I read the account, the on-call engineer was permitted, when on call, to be at a location that required a ninety-minute commute to get on site to carry out the work. That seems to me extraordinary: when I worked for a much smaller Critical Infrastructure provider, part of the on-call responsibility was to be able to attend site within twenty minutes of notification.

I would classify this as Management Failure
kiwi grey is offline  
Old 15th Mar 2024, 11:37
  #403 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 480
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by kiwi grey
The management of a Critical Infrastructure provider that operates 24x7x365 - all day, every day, without exception - has security arrangements that preclude off-site operation of certain second-level support and maintenance aspects of their system. That is probably good practice.
However, they set up the second-line support so that the person who would be required to carry out these on-site-only tasks was on call, not working on site. Doubtless this is a much cheaper solution than a continuous on-site presence. And since the circumstances that would require the engineer to attend were regarded as highly unlikely, it looked like a good bet - or it would have done, so long as you ignored the impact of delayed rectification of the fault.
To add insult to injury, as I read the account, the on-call engineer was permitted, when on call, to be at a location that required a ninety-minute commute to get on site to carry out the work. That seems to me extraordinary: when I worked for a much smaller Critical Infrastructure provider, part of the on-call responsibility was to be able to attend site within twenty minutes of notification.

I would classify this as Management Failure
There is a 24Hr on-site presence; there were eyes and hands on the ground. The person who was called in was their more "experienced" and better-trained escalation support. The report doesn't really explain what that person did in the hour before they went to site, why it was decided they had to go to site, or why they didn't simply talk the on-site presence through the steps to do what they intended to do when they got there.

The main issue for me is why they didn't call the 3rd level at the time they made the decision for the 2nd level to attend. I'm not sure what the protocols referenced in the report are, but surely one hour into a fault, with a minimum of 90 minutes before you can make any progress, is the point at which you throw that protocol away and call anybody you think can help. The idea that you would convene the Bronze and Silver teams and drag all the senior managers away from their Bank Holiday activities, but not call the system expert at the same time, is just strange.

The other issue is why the 2nd level didn't, or couldn't, access the system logs remotely. I wondered for some time why it took so long to identify the errant flight plan and remove it, and the answer seems to be that they didn't even try until nearly 4 hours in.
eglnyt is offline  
Old 15th Mar 2024, 11:49
  #404 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
There are many errors, obfuscations, and contradictions in this Interim Report. I hope some of the comments in this whole thread get back to the panel so they can address them in the final report, and not sweep them under the carpet. I am only concerned with the technical aspects relating to the failure.

1 – The very first substantive sentence in the report shows a lack of understanding of the technical ‘system’. “The cause of the failure of the NERL flight plan processing system (FPRSA-R)”. The flight plan processing system is the NAS FPPS – not FPRSA-R.

2 – “Critical exception errors being generated which caused each system to place itself into maintenance mode”. Is there really a documented and intentional FPRSA-R Maintenance Mode? Or is it just a euphemism for “it crashed” (ie, encountered a situation it had not been programmed for and executed some random code or a catch-all “WTF just happened?” dead stop).

Such euphemisms are not uncommon. The NAS FPPS has (or at least did have) a fancy-sounding documented state called Functional Lapse of the Operational Program, or FLOP. Of course, we operational engineers just said it had crashed. More recently there is SpaceX’s “Rapid Unscheduled Disassembly”.

3 – If there was an intentional Maintenance State, why on earth did the system allow both processors to deliberately enter that state at the same time? Even so, as it was foreseeable, there should have been a documented procedure to recover from it (see the sketch at the end of this post).

4 – IFPS adds supplementary waypoints - please explain why. Presumably, inter alia, to identify boundary crossing points. If so, why does FPRSA-R not identify it as an exit point, if that is what it was inserted for?

5 – “Recognising this as being not credible, a critical exception error was generated, and the primary FPRSA-R system, as it is designed to do, disconnected itself from NAS and placed itself into maintenance mode”. So it recognised the problem, and was designed to react to it, yet didn’t output a message such as “Non-credible route for FPXXXX at DVL”. Pull the other one – it crashed!

6 – “Processed flight data is presented to NAS four hours in advance of the data being required” and “The repeated cycle that occurred each time a connection was re-established between the AMS-UK and FPRSA-R ended with the assistance of system supplier, Frequentis, four hours after the event.”

So the Frequentis guys fixed the problem at the exact time the FP was due to be activated – what a coincidence! Could it be that the AMS-UK system recognized that the errant FP was now history (ie stale) and purged it from its Pending Queue without human intervention?

7 – At para 2.22 the report states Frequentis fixed the problem. In the timeline it states it was Comsoft that fixed the problem.

8 – “Adherence to escalation protocols meant that the assistance of Frequentis was not sought for more than four hours after the initial failure.” But you’ve already stated that Frequentis fixed the problem at the four hour mark??? And they would have needed time to diagnose the problem.

9 – At elapsed time 00:27 "Level 1 engineer attempts reboot FPRSA-R software". Attempts? Presumably successfully, as the report says it continues to fail when it repeatedly gets the errant FP. But the report also says it needs a Level 2 engineer to do a restart. What is a reboot if it's not a restart?
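
To be concrete about point 3: below is the sort of guard I would expect around a redundant pair, sketched in Python purely for illustration. Every name in it is mine and it makes no claim whatsoever about how FPRSA-R is actually coded; the point is only that an input which has just taken out one processor should be quarantined, not offered to the other.

Code:
# Illustrative only - invented names, no relation to the real FPRSA-R code.

class CriticalException(Exception):
    """Catch-all for input the converter cannot make sense of."""

class Lane:
    """One of the two redundant processors."""
    def __init__(self, name):
        self.name = name
        self.available = True

    def convert(self, flight_plan):
        # A real implementation would translate the filed plan here.
        if flight_plan.get("poison"):
            raise CriticalException(f"non-credible route for {flight_plan['callsign']}")
        return {"nas_message": flight_plan["callsign"]}

    def enter_maintenance_mode(self):
        self.available = False

class RedundantPair:
    def __init__(self, primary, secondary):
        self.lanes = [primary, secondary]
        self.quarantine = []

    def process(self, flight_plan):
        for lane in self.lanes:
            if not lane.available:
                continue
            try:
                return lane.convert(flight_plan)
            except CriticalException as err:
                lane.enter_maintenance_mode()
                # The crucial bit: do NOT hand the same message to the other
                # lane. Park it, raise an alert, keep the service up.
                self.quarantine.append(flight_plan)
                print(f"ALERT {lane.name}: {err} - plan quarantined")
                return None
        print("ALERT: no processing lane available")
        return None

if __name__ == "__main__":
    pair = RedundantPair(Lane("primary"), Lane("secondary"))
    pair.process({"callsign": "ABC123", "poison": True})  # quarantined; secondary untouched
    print(pair.process({"callsign": "DEF456"}))           # still processed normally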
CBSITCB is offline  
Old 15th Mar 2024, 12:14
  #405 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 480
Likes: 0
Received 0 Likes on 0 Posts
Section 2.21 states:

There is no functionality in the AMS-UK or connected systems to remove a message from the pending queue in the event of repeated unsuccessful transmission attempts.

Section 2.22 then explains that once the problem was identified by Frequentis the errant plan was quickly removed from the pending queue. So Section 2.21 really should have clarified that there is no automated functionality to remove the message from the queue.
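
For what it's worth, the automated safeguard that 2.21 says is absent is a bog-standard pattern in message queuing: give up on a message after a handful of failed delivery attempts and park it for a human to look at. A minimal sketch follows; the names are mine and it assumes nothing about how AMS-UK is actually built.

Code:
from collections import deque

MAX_ATTEMPTS = 3

class PendingQueue:
    """Toy pending queue with the poison-message safeguard 2.21 says is missing."""
    def __init__(self):
        self.pending = deque()
        self.dead_letter = []     # parked messages awaiting human attention
        self.attempts = {}

    def push(self, msg_id, payload):
        self.pending.append((msg_id, payload))

    def transmit_next(self, send):
        """send(payload) returns True on success, False (or raises) on failure."""
        if not self.pending:
            return
        msg_id, payload = self.pending[0]
        try:
            ok = send(payload)
        except Exception:
            ok = False
        if ok:
            self.pending.popleft()
            self.attempts.pop(msg_id, None)
            return
        self.attempts[msg_id] = self.attempts.get(msg_id, 0) + 1
        if self.attempts[msg_id] >= MAX_ATTEMPTS:
            # Stop one message from blocking everything queued behind it.
            self.pending.popleft()
            self.dead_letter.append((msg_id, payload))
            print(f"ALERT: {msg_id} parked after {MAX_ATTEMPTS} failed attempts")

if __name__ == "__main__":
    q = PendingQueue()
    q.push("FP0001", "rogue plan")
    q.push("FP0002", "good plan")
    for _ in range(4):
        q.transmit_next(lambda payload: payload == "good plan")
    print(q.pending, q.dead_letter)

Whether silently parking a flight plan is acceptable in this domain is a separate argument, but the final report should at least explain why the retry loop was allowed to run unbounded.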

The Comsoft or Frequentis mix-up is to be expected, as the company formerly known as Comsoft is now Frequentis Comsoft, but if you are going to shorten it you should choose one name and stick to it.
eglnyt is offline  
Old 15th Mar 2024, 14:34
  #406 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
The Comsoft or Frequentis mix-up is to be expected, as the company formerly known as Comsoft is now Frequentis Comsoft, but if you are going to shorten it you should choose one name and stick to it.
The wider point I was trying to make (perhaps badly) is that FPRSA was delivered to Swanwick by Frequentis while AMS-UK was delivered to Heathrow by Comsoft. Two completely different systems. The report gives the impression that FPRSA was somehow fixed. It wasn’t – removing the errant FP at Heathrow, whether manually or by automatic purge, just removed the cause. The ‘bug’ in FPRSA was still there and at the time nobody knew what it was. Sure, they knew what caused the failure, but not how. All IMHO of course.
CBSITCB is offline  
Old 15th Mar 2024, 14:48
  #407 (permalink)  
 
Join Date: Jan 2011
Location: winchester
Posts: 33
Received 0 Likes on 0 Posts
Presumably deputy heads have rolled while the next to useless head remains in his very well paid post?
andymartin is offline  
Old 17th Mar 2024, 14:05
  #408 (permalink)  
 
Join Date: Feb 2007
Location: GLASGOW
Posts: 1,289
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by andymartin
Presumably deputy heads have rolled while the next to useless head remains in his very well paid post?
Not as yet. From the mug shot gallery on the website:

'There is an amazing group of people in the NATS Board and Executive teams who dedicate themselves to the company’s purpose, goals and success.'
maxred is offline  
Old 17th Mar 2024, 17:05
  #409 (permalink)  
 
Join Date: Jan 2016
Location: Going left then going right
Posts: 101
Likes: 0
Received 0 Likes on 0 Posts
Sounds like the makings of an old, out-of-date lynch mob.
nohold is offline  
Old 18th Mar 2024, 10:19
  #410 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 480
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by CBSITCB
There are many errors, obfuscations, and contradictions in this Interim Report. I hope some of the comments in this whole thread get back to the panel so they can address them in the final report, and not sweep them under the carpet. I am only concerned with the technical aspects relating to the failure.

1 – The very first substantive sentence in the report shows a lack of understanding of the technical ‘system’. “The cause of the failure of the NERL flight plan processing system (FPRSA-R)”. The flight plan processing system is the NAS FPPS – not FPRSA-R.

2 – “Critical exception errors being generated which caused each system to place itself into maintenance mode”. Is there really a documented and intentional FPRSA-R Maintenance Mode? Or is it just a euphemism for “it crashed” (ie, encountered a situation it had not been programmed for and executed some random code or a catch-all “WTF just happened?” dead stop).

3 – If there was an intentional Maintenance State, why on earth did the system allow both processors to deliberately enter that state at the same time? Even so, as it was foreseeable, there should have been a documented procedure to recover from it.

4 – IFPS adds supplementary waypoints - please explain why. Presumably, inter alia, to identify boundary crossing points. If so, why does FPRSA-R not identify it as an exit point, if that is what it was inserted for?

5 – “Recognising this as being not credible, a critical exception error was generated, and the primary FPRSA-R system, as it is designed to do, disconnected itself from NAS and placed itself into maintenance mode”. So it recognised the problem, and was designed to react to it, yet didn’t output a message such as “Non-credible route for FPXXXX at DVL”. Pull the other one – it crashed!

9 – At elapsed time 00:27 "Level 1 engineer attempts reboot FPRSA-R software". Attempts? Presumably successfully, as the report says it continues to fail when it repeatedly gets the errant FP. But the report also says it needs a Level 2 engineer to do a restart. What is a reboot if it's not a restart?
FPRSA is a flight plan processing system, and they used lower case, so it's reasonable to use that term. It isn't the UK's Flight Data Processing System (FDP); NAS and the ITEC FDP system have that role.

In any system I have ever worked on, "critical exception error" is the catch-all error handling for conditions the system recognises as errors but has not been specifically programmed to handle. The system has to have such processing for the unexpected things the programmer hasn't thought about. But even if they hadn't considered this particular error, they should have considered the generic case in which a single flight plan chokes the system, and handled that through a programmed error path rather than the critical one. The nature of the catch-all handler is that it will behave exactly the same on both processors; for it to behave differently would mean the programmer had considered similar cases, in which case it shouldn't flip to the other processor anyway.
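
To make that distinction concrete, here is a minimal sketch of the pattern I mean. It is generic Python with invented names, not a description of the actual FPRSA-R code; the point is simply that a recognisable "one bad plan" case gets its own error path and the rest of the stream keeps moving.

Code:
class FlightPlanError(Exception):
    """An anticipated problem with a single flight plan: reject it, carry on."""

def process_batch(flight_plans, convert):
    """convert() turns one filed plan into a downstream message, or raises."""
    converted, rejected = [], []
    for fp in flight_plans:
        try:
            converted.append(convert(fp))
        except FlightPlanError as err:
            # Programmed error path: the one bad plan is rejected and reported,
            # everything else keeps flowing.
            rejected.append((fp, str(err)))
        # Anything other than FlightPlanError falls through to the caller's
        # catch-all critical-exception handling - which should be reserved for
        # states nobody anticipated, not for a single odd route.
    return converted, rejected

if __name__ == "__main__":
    def convert(fp):
        if fp == "BAD":
            raise FlightPlanError("non-credible route")
        return f"NAS({fp})"
    print(process_batch(["AAA111", "BAD", "CCC333"], convert))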

IFPS is a cousin of FPRSA but seems to have parsed the route and found the actual exit points without problem; presumably it does this to work out which FIRs it needs to send the flight plan to. Why FPRSA has to ignore the route already generated by IFPS and instead process the original flight plan is a mystery. And why it doesn't do that the same way as IFPS is an even bigger mystery, as the two systems were almost certainly written in the same building and perhaps the same office.
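
As a toy illustration of why that matters - made-up coordinates, and no claim about either system's internals - an exit point found by walking the expanded route geographically can't be fooled in the way a bare match on a waypoint designator can be if the same designator happens to exist in two very different places.

Code:
# Toy data: an IFPS-style expanded route in flight order, with a designator
# that occurs twice. Nothing here reflects the real route or the real code.
ROUTE = [
    ("LFPG", 49.0, 2.5),
    ("DVL",  49.3, 0.2),     # a DVL on the continent
    ("SFD",  50.8, 0.1),
    ("LON",  51.5, -0.5),
    ("DVL",  48.1, -98.9),   # a second, geographically absurd DVL
]

def exit_point_by_geometry(route, in_fir):
    """Walk the route in order; the exit is the last point inside the FIR."""
    exit_pt = None
    for name, lat, lon in route:
        if in_fir(lat, lon):
            exit_pt = (name, lat, lon)
    return exit_pt

def exit_point_by_name(route, designator):
    """Match on designator alone - ambiguous when the name occurs twice."""
    matches = [p for p in route if p[0] == designator]
    return matches[-1] if matches else None

def uk_fir(lat, lon):
    # Crude bounding box standing in for the real FIR boundary.
    return 49.5 < lat < 61.0 and -8.0 < lon < 2.0

print(exit_point_by_geometry(ROUTE, uk_fir))  # a sensible exit point
print(exit_point_by_name(ROUTE, "DVL"))       # picks the absurd one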

In modernish systems the process can be restarted independently of the hardware. I suspect the Level 1 engineer can do this but not restart the hardware, which requires a higher level of access. I really want to know the reasoning that led them to think that restarting the hardware might help when restarting the software didn't. At the start the issue crashed two separate sets of hardware, so it is unlikely that anything would be flushed out by doing that. In desperation you might try switching it off and on, but would you really wait 90 minutes in that hope?

Most of all I'm amazed they didn't suspect what was going on. I dislike the term "rogue flight plan" because it seems to place the blame on the flight plan, when what you usually mean is a flight plan that exposes a bug in your system; but at one time a rogue flight plan would have been top of everybody's mind in a situation like this, even when it wasn't the issue. I would have expected identifying which flight plan seemed to be the problem, and using either AMS or IFPS to remove it from the queue, to be high on the list of things to do when FPRSA crashes, but for some reason they didn't try for nearly 4 hours.
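
Even without the supplier on the phone, the triage is not sophisticated. A sketch with a completely invented log format (I have no idea what FPRSA-R actually logs): if the same plan is the last thing seen before every crash, that is the one to pull from the queue first.

Code:
import re
from collections import Counter

RECEIVED = re.compile(r"RECEIVED plan=(\S+)")
CRASH = re.compile(r"CRITICAL EXCEPTION")

def suspects(log_lines):
    """Count which plan was the last one received before each crash."""
    last_plan = None
    seen_before_crash = Counter()
    for line in log_lines:
        m = RECEIVED.search(line)
        if m:
            last_plan = m.group(1)
        elif CRASH.search(line) and last_plan:
            seen_before_crash[last_plan] += 1
    return seen_before_crash.most_common()

demo = [
    "10:44:01 RECEIVED plan=AAA111",
    "10:44:02 SENT plan=AAA111",
    "10:44:20 RECEIVED plan=BBB222",
    "10:44:21 CRITICAL EXCEPTION - entering maintenance mode",
    "10:49:30 RECEIVED plan=BBB222",
    "10:49:31 CRITICAL EXCEPTION - entering maintenance mode",
]
print(suspects(demo))   # [('BBB222', 2)] - the prime suspect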
eglnyt is offline  
Old 23rd Apr 2024, 23:55
  #411 (permalink)  
 
Join Date: Jun 2022
Location: England
Posts: 43
Received 36 Likes on 27 Posts
All credit to Martin Rolfe. He could have thrown his engineers under the bus but didn't.
Abrahn is online now  
Old 25th Apr 2024, 19:12
  #412 (permalink)  
 
Join Date: Apr 2010
Location: Europe
Posts: 61
Likes: 0
Received 1 Like on 1 Post
Happens elsewhere too...

Our Norwegian friends did not have a lot of fun this morning: all air traffic stopped in Southern Norway from roughly 06:30 local time to roughly 09:30 local time.

Link to Norwegian news: https://www.nrk.no/stor-oslo/avinor_...rge-1.16858495

A rough translation of the relevant bit:

- How can a technical fault paralyse air traffic across the whole of southern Norway?
- This fault affected both the control centre in Røyken and the control centre in Stavanger, which are responsible for air traffic across the whole of southern Norway, says Pedersen.
He explains that when they get this type of fault, they have to safeguard flight safety by stopping traffic and winding down the traffic they already have. The priority is then to find the fault and get it checked.
I don't know what system the Norwegians use, but obviously something that shouldn't happen did. It doesn't only happen in the UK.

And while nothing to do with air-transport, the Norwegian rail system has almost country wide stoppages as a result of using a private mobile phone/data network for signalling which fails all too frequently, and the only option is to stop the trains.

Norwegian infrastructure projects are not blessed with flawless implementations: the replacement helicopters for the Sea King SAR helicopters can't land at many hospital helipads because their downdraft is too great. Given how long the replacement project ran for, it is surprising this wasn't identified and dealt with far earlier.
Semreh is offline  
Old 26th Apr 2024, 20:03
  #413 (permalink)  
 
Join Date: Apr 2010
Location: Europe
Posts: 61
Likes: 0
Received 1 Like on 1 Post
Originally Posted by Semreh
Our Norwegian friends did not have a lot of fun this morning: all air traffic stopped in Southern Norway from roughly 06:30 local time to roughly 09:30 local time.

Link to Norwegian news: https://www.nrk.no/stor-oslo/avinor_...rge-1.16858495

I don't know what system the Norwegians use, but obviously something that shouldn't happen did. It doesn't only happen in the UK.

And while nothing to do with air-transport, the Norwegian rail system has almost country wide stoppages as a result of using a private mobile phone/data network for signalling which fails all too frequently, and the only option is to stop the trains.

Norwegian infrastructure projects are not blessed with flawless implementations: the replacement helicopters for the Sea King SAR helicopters can't land at many hospital helipads because their downdraft is too great. Given how long the replacement project ran for, it is surprising this wasn't identified and dealt with far earlier.
Our Norwegian friends are having a torrid time. Today a 6-hour outage for Northern Norway, caused by a fibre break in Telenor's telecommunications network.

https://www.nrk.no/nordland/luftromm...rge-1.16861310

I suspect this shows how thinly spread Norway's infrastructure is in reality. I would have naively thought that ATC would have duplicated, independent, telecommunications connections. Perhaps the harsh reality of the economics is that it is not worth it, and having an outage every so often is more affordable.
Semreh is offline  
Old 26th Apr 2024, 22:20
  #414 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 480
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Semreh
Our Norwegian friends are having a torrid time. Today a 6-hour outage for Northern Norway, caused by a fibre break in Telenor's telecommunications network.

https://www.nrk.no/nordland/luftromm...rge-1.16861310

I suspect this shows how thinly spread Norway's infrastructure is in reality. I would have naively thought that ATC would have duplicated, independent, telecommunications connections. Perhaps the harsh reality of the economics is that it is not worth it, and having an outage every so often is more affordable.
It is a fact of life that in any country the infrastructure becomes weaker the further you get from the main centres of business. Most Western states rely on separate telecom suppliers to provide their underlying ATC infrastructure, and those suppliers are not going to run two links if they can't make a return on both of them. There are certainly parts of the UK where you'll be lucky to get one network connection, let alone two. Microwave links have been used in the past for inaccessible areas, but they are expensive and not very reliable in the sort of wet, cold weather seen in Northern Europe. Longer term, Starlink and its competitors might provide a solution to this problem.
eglnyt is offline  
Old 27th Apr 2024, 03:50
  #415 (permalink)  
 
Join Date: Dec 2006
Location: Whanganui, NZ
Posts: 279
Received 5 Likes on 4 Posts
Originally Posted by eglnyt
It is a fact of life that in any country the infrastructure becomes weaker the further you get from the main centres of business. Most Western states rely on separate telecom suppliers to provide their underlying ATC infrastructure, and those suppliers are not going to run two links if they can't make a return on both of them. There are certainly parts of the UK where you'll be lucky to get one network connection, let alone two. Microwave links have been used in the past for inaccessible areas, but they are expensive and not very reliable in the sort of wet, cold weather seen in Northern Europe. Longer term, Starlink and its competitors might provide a solution to this problem.
(emphasis added)
Not longer term, this resilience issue could be addressed right now.
Every significant point in their network could have a Starlink dish and/or a OneWeb dish and use it to provide a backup channel. The costs would be 'petty cash' compared to their current annual spend.
Such a solution could be put in operation in a few days for some sites to a few months for the more difficult to access places.
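
The switching logic at each site need be nothing cleverer than a watchdog that probes the fibre path and moves the default route to the satellite terminal when it stops answering. A crude sketch only - made-up addresses, and a placeholder where the real routing change would be driven:

Code:
import subprocess
import time

PRIMARY_GW = "10.0.0.1"   # the fibre path (made-up address)
BACKUP_GW = "10.1.0.1"    # the Starlink/OneWeb terminal (made-up address)

def reachable(host):
    """A single ICMP probe; a real monitor would also probe the application layer."""
    cmd = ["ping", "-c", "1", "-W", "2", host]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def set_default_route(gateway):
    # Placeholder: in reality this would drive the site router or a routing daemon.
    print(f"switching default route to {gateway}")

current = None
while True:
    wanted = PRIMARY_GW if reachable(PRIMARY_GW) else BACKUP_GW
    if wanted != current:
        set_default_route(wanted)
        current = wanted
    time.sleep(10)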

Of course, this isn't "the way we do things" - not just in the UK and Norway, but all over the world - so there will often be great organisational resistance to such a radical change.

kiwi grey is offline  
Old 28th Apr 2024, 16:39
  #416 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 480
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by kiwi grey
(emphasis added)
Not longer term, this resilience issue could be addressed right now.
Every significant point in their network could have a Starlink dish and/or a OneWeb dish and use it to provide a backup channel. The costs would be 'petty cash' compared to their current annual spend.
Such a solution could be put in operation in a few days for some sites to a few months for the more difficult to access places.

Of course, this isn't "the way we do things" - not just in the UK and Norway, but all over the world - so there will often be great organisational resistance to such a radical change.
Straying a bit from the thread title, but the connectivity is the easy bit, and even then it's not entirely straightforward. Even the most successful of the possibilities has only just reached the point where it can provide a dependable service in most areas, and even then it's a proprietary system that might at any time be switched off in your area by the person who controls it. The others have had a rocky path and are probably some way from the point where you can take them seriously.
The more difficult part is integrating them into your network, and the most difficult bit is the security. Effectively connecting the entire Internet to your critical infrastructure takes a lot of nerve, and trust that your security suppliers can outpace the bad actors elsewhere. For some reason, in the UK at least, cyber attacks are considered always to be a failing of the target, and the CEO is unlikely to survive a successful attack even if everything possible had been done to prevent it. The Technical Director will follow them out of the door. I can understand the reluctance to make this sort of change.
eglnyt is offline  
