PPRuNe Forums (https://www.pprune.org/)
-   Rumours & News (https://www.pprune.org/rumours-news-13/)
-   -   U.K. NATS Systems Failure (https://www.pprune.org/rumours-news/654461-u-k-nats-systems-failure.html)

eglnyt 14th Mar 2024 16:01

I would expect several layers of protection and restrictions that mean certain actions can only be undertaken from certain physical terminals. That hint as to issues with authentication is annoying because they've raised the issue but not explained it at all. It might however explain why the 2nd line needed to go to site and couldn't just talk someone through the steps to restart. It's a blind alley though because without removing the errant data it would have crashed again on restart.

I would also expect all remote access to the system for all foreseeable tasks to have been tested when commissioned and checked regularly since. Having raised the issue in the interim report, they will hopefully cover it in detail in the main report.

kiwi grey 15th Mar 2024 02:13


Originally Posted by eglnyt (Post 11615746)
I would expect several layers of protection and restrictions that mean certain actions can only be undertaken from certain physical terminals. That hint as to issues with authentication is annoying because they've raised the issue but not explained it at all. It might however explain why the 2nd line needed to go to site and couldn't just talk someone through the steps to restart. It's a blind alley though because without removing the errant data it would have crashed again on restart.

I would also expect all remote access to the system for all foreseeable tasks to have been tested when commissioned & checked regularly since. Having raised the issue in the interim report they will hopefully cover it in detail in the main report

The management of a Critical Infrastructure provider that operates 24x7x365 - all day, every day, without exception - has security arrangements that preclude off-site operation of certain second-line support maintenance aspects of their system. That is probably good practice.
However, they set up the second-line support so that the person who would be required to activate these on-site-only aspects was on call, not working on site. Doubtless this is a much cheaper solution than having a continuous on-site presence. And since the circumstances that would require the engineer to be on site were regarded as highly unlikely, it looked like a good bet - or it would have done, so long as you ignored the impact of delayed rectification of the fault.
To add insult to injury, as I read the account, the on-call engineer was permitted, when on call, to be at a location that required a ninety-minute commute before they could be on site to carry out the work. That seems to me to be extraordinary: when I worked for a much smaller Critical Infrastructure provider, part of the on-call responsibility was to be able to attend on site within twenty minutes of notification.

I would classify this as a Management Failure.

eglnyt 15th Mar 2024 11:37


Originally Posted by kiwi grey (Post 11616047)
The management of a Critical Infrastructure provider that operates 24x7x365 - all day every single day without exception - has security arrangements that precluded off-site operation of certain second level support maintenance aspects of their system. That is probably a good practice.
However, they set up the second line support so that the person who would be required to activate these on-site-only aspects was on call, not working on site. Doubtless this is a much cheaper solution than having a continuous on site presence. And since the circumstances that would require the engineer to be on-site were regarded as improbably unlikely, it looked like a good bet, or it would have so long as you ignored the impact of delayed rectification of the fault.
To add insult to injury, as I read the account, the on call engineer was permitted to be when on call at a location that required a ninety minute commute for them to be on site to carry out the work. That seems to me to be extraordinary: when I worked for a much smaller Critical Infrastructure provider, part of the on call responsibility was to be able to attend on site within twenty minutes of notification.

I would classify this as Management Failure

There is a 24-hour on-site presence; there were eyes and hands on the ground. This person was their more "experienced" and better-trained escalation support. The report doesn't really explain what that person did in the hour before they went to site, why it was decided they had to go to site, or why they didn't just talk the on-site presence through the steps to do what they intended to do when they got there.

The main issue for me is why they didn't call the 3rd level at the time they made the decision for the 2nd level to attend. I'm not sure what the protocols referenced in the report are, but surely 1 hour into a fault, with a minimum of 90 minutes before you can make any progress, is the point at which you throw that protocol away and call anybody you think can help. The idea that you convened the Bronze and Silver teams and dragged all the senior managers away from their Bank Holiday activities but didn't call the system expert at the same time is just strange.

The other issue is why the 2nd Level didn't, or couldn't, access the system logs remotely. I wondered for some time why it took so long to identify the errant flight plan and remove it, and the answer to that question seems to be that they didn't even try until nearly 4 hours in.

CBSITCB 15th Mar 2024 11:49

There are many errors, obfuscations, and contradictions in this Interim Report. I hope some of the comments in this whole thread get back to the panel so they can address them in the final report, and not sweep them under the carpet. I am only concerned with the technical aspects relating to the failure.

1 – The very first substantive sentence in the report shows a lack of understanding of the technical ‘system’. “The cause of the failure of the NERL flight plan processing system (FPRSA-R)”. The flight plan processing system is the NAS FPPS – not FPRSA-R.

2 – “Critical exception errors being generated which caused each system to place itself into maintenance mode”. Is there really a documented and intentional FPRSA-R Maintenance Mode? Or is it just a euphemism for “it crashed” (ie, encountered a situation it had not been programmed for and executed some random code or a catch-all “WTF just happened?” dead stop).

Such euphemisms are not uncommon. The NAS FPPS has (or at least did have) a fancy-sounding documented state called Functional Lapse of the Operational Program, or FLOP. Of course, we operational engineers just said it had crashed. More recently there is SpaceX’s “Rapid Unscheduled Disassembly”.

3 – If there was an intentional Maintenance State, why on earth did the system allow both processors to deliberately enter that state at the same time? Even so, as it was foreseeable, there should have been a documented procedure to recover from it.

4 – IFPS adds supplementary waypoints; please explain why. Presumably, inter alia, to identify boundary crossing points. If so, why does FPRSA-R not identify the added waypoint as an exit point, if that is what it was inserted for?

5 – “Recognising this as being not credible, a critical exception error was generated, and the primary FPRSA-R system, as it is designed to do, disconnected itself from NAS and placed itself into maintenance mode”. So it recognised the problem, and was designed to react to it, yet didn’t output a message such as “Non-credible route for FPXXXX at DVL”. Pull the other one – it crashed!

6 – “Processed flight data is presented to NAS four hours in advance of the data being required” and “The repeated cycle that occurred each time a connection was re-established between the AMS-UK and FPRSA-R ended with the assistance of system supplier, Frequentis, four hours after the event.”

So the Frequentis guys fixed the problem at the exact time the FP was due to be activated – what a coincidence! Could it be that the AMS-UK system recognised that the errant FP was now history (ie stale) and purged it from its Pending Queue without human intervention? (A rough sketch of that idea is at the end of this post.)

7 – At para 2.22 the report states Frequentis fixed the problem. In the timeline it states it was Comsoft that fixed the problem.

8 – “Adherence to escalation protocols meant that the assistance of Frequentis was not sought for more than four hours after the initial failure.” But you’ve already stated that Frequentis fixed the problem at the four hour mark??? And they would have needed time to diagnose the problem.

9 – At elapsed time 00:27 “Level 1 engineer attempts reboot FPRSA-R software”. Attempts? Presumably successfully, as the report says it continued to fail when it repeatedly got the errant FP. But the report says it needs a Level 2 engineer to do a restart. What is a reboot if it's not a restart?
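
To illustrate the hypothesis in point 6 (and it is only a hypothesis on my part, not anything the report states): if AMS-UK holds plans in a pending queue and simply drops any plan whose activation time has passed, the "fix" would arrive exactly four hours after the errant FP was first presented, with no human in the loop. A minimal sketch in Python, with invented names and an assumed four-hour look-ahead:

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    # Purely illustrative names and numbers - not the actual AMS-UK design.
    LOOK_AHEAD = timedelta(hours=4)   # "presented to NAS four hours in advance"

    @dataclass
    class PendingFlightPlan:
        callsign: str
        activation_time: datetime     # the time the plan becomes live

    class PendingQueue:
        def __init__(self):
            self._items = []

        def add(self, fp):
            self._items.append(fp)

        def purge_stale(self, now):
            """Drop plans whose activation time has already passed - no human needed."""
            stale = [fp for fp in self._items if fp.activation_time <= now]
            self._items = [fp for fp in self._items if fp.activation_time > now]
            return stale

    # A plan queued at T0 with activation at T0 + LOOK_AHEAD would be purged
    # automatically at T0 + 4 hours, which is exactly when the report says the
    # repeated crash cycle ended.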

eglnyt 15th Mar 2024 12:14

Section 2.21 states:

There is no functionality in the AMS-UK or connected systems to remove a message from the pending queue in the event of repeated unsuccessful transmission attempts.

Section 2.22 then explains that once the problem was identified by Frequentis the errant plan was quickly removed from the pending queue. So Section 2.21 really should have clarified that there is no automated functionality to remove the message from the queue.
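
Put another way, the behaviour described in 2.21 and 2.22 amounts to something like the sketch below. The names are invented and this is not the actual AMS-UK code: the sender simply replays the head of the queue after every reconnection, and the only way out is for somebody to identify the offending message and remove it by hand.

    import collections
    import time

    class PendingQueue:
        """Sketch of a store-and-forward sender with no automated removal path."""

        def __init__(self):
            self._queue = collections.deque()

        def put(self, message):
            self._queue.append(message)

        def transmit_loop(self, send):
            # Keep retrying the head of the queue until the receiver accepts it.
            # Nothing here discards a message after repeated failures, so a plan
            # the receiver cannot digest is replayed on every reconnection.
            while self._queue:
                message = self._queue[0]
                if send(message):             # receiver accepted it
                    self._queue.popleft()
                else:
                    time.sleep(1)             # back off, wait for the link to return

        def remove_matching(self, match):
            # The manual escape hatch: an engineer (or the supplier) identifies
            # the errant message and deletes it from the queue by hand.
            self._queue = collections.deque(m for m in self._queue if not match(m))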

The Comsoft or Frequentis mix-up is to be expected as the company formerly known as Comsoft is now Frequentis Comsoft, but if you are going to shorten it you should choose one and stick to it.

CBSITCB 15th Mar 2024 14:34


Originally Posted by eglnyt (Post 11616321)
The Comsoft or Frequentis mix up is to be expected as the company formally known as Comsoft is now Frequentis Comsoft but if you are going to shorten it you should chose one and stick to it.

The wider point I was trying to make (perhaps badly) is that FPRSA was delivered to Swanwick by Frequentis while AMS-UK was delivered to Heathrow by Comsoft. Two completely different systems. The report gives the impression that FPRSA was somehow fixed. It wasn’t – removing the errant FP at Heathrow, whether manually or by automatic purge, just removed the cause. The ‘bug’ in FPRSA was still there and at the time nobody knew what it was. Sure, they knew what caused the failure, but not how. All IMHO of course.

andymartin 15th Mar 2024 14:48

Presumably deputy heads have rolled while the next to useless head remains in his very well paid post?

maxred 17th Mar 2024 14:05


Originally Posted by andymartin (Post 11616431)
Presumably deputy heads have rolled while the next to useless head remains in his very well paid post?

Not as yet. From the mug shot gallery on the website:

'There is an amazing group of people in the NATS Board and Executive teams who dedicate themselves to the company’s purpose, goals and success.'

nohold 17th Mar 2024 17:05

Sounds like the makings of an old, out-of-date lynch mob.

eglnyt 18th Mar 2024 10:19


Originally Posted by CBSITCB (Post 11616306)
There are many errors, obfuscations, and contradictions in this Interim Report. I hope some of the comments in this whole thread get back to the panel so they can address them in the final report, and not sweep them under the carpet. I am only concerned with the technical aspects relating to the failure.

1 – The very first substantive sentence in the report shows a lack of understanding of the technical ‘system’. “The cause of the failure of the NERL flight plan processing system (FPRSA-R)”. The flight plan processing system is the NAS FPPS – not FPRSA-R.

2 – “Critical exception errors being generated which caused each system to place itself into maintenance mode”. Is there really a documented and intentional FPRSA-R Maintenance Mode? Or is it just a euphemism for “it crashed” (ie, encountered a situation it had not been programmed for and executed some random code or a catch-all “WTF just happened?” dead stop).

3 – If there was an intentional Maintenance State why on earth did the system allow both processors to deliberately enter that state at the same time? Even so, as it was foreseeable there should have been a documented procedure to recover from it.

4 – IFPS adds supplementary way points, please explain why. Presumably, inter alia, to identify boundary crossing points. If so why does FRPRSA-R not identify it as an exit point if that is what it was inserted for?

5 – “Recognising this as being not credible, a critical exception error was generated, and the primary FPRSA-R system, as it is designed to do, disconnected itself from NAS and placed itself into maintenance mode”. So it recognised the problem, and was designed to react to it, yet didn’t output a message such as “Non-credible route for FPXXXX at DVL”. Pull the other one – it crashed!

9 – At elapsed time 00:27 “Level 1 engineer attempts reboot FPRSA-R software”. Attempts? Presumably successfully as the report says it continues to fail when it repeatedly gets the errant FP. But the report says it needs a Level 2 engineer to do a restart. What is a reboot if its not a restart?

FPRSA is a flight plan processing system, and they used lower case, so it's reasonable to use that term. It isn't the UK's Flight Data Processing System (FDP); NAS and the ITEC FDP system have that role.

In any system I have ever worked on, a critical exception error is the catch-all error handling for errors the system recognises as errors but has not been specifically programmed to handle. The system has to have such processing for the unexpected things the programmer hasn't thought about, but even if they hadn't considered this particular error they should have considered a generic case in which a single flight plan chokes the system, and handled that through a programmed error path rather than the critical error path. The nature of the catch-all exception handler is that it functions exactly the same on both processors; for it to handle this differently would mean the programmer had considered similar cases, in which case it shouldn't flip to the other processor anyway.
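
The distinction I'm drawing looks roughly like this in code. It is only a sketch with invented names, not a claim about how FPRSA-R is actually written: the programmed error path quarantines the one plan it can't convert and carries on, while the catch-all path behaves identically on both processors and so takes each of them out in turn.

    class FlightPlanError(Exception):
        """A per-message failure the programmer has anticipated."""

    def process_stream(queue, convert, quarantine, enter_maintenance_mode):
        for fp in queue:
            try:
                convert(fp)
            except FlightPlanError as err:
                # Programmed error path: reject the single offending plan,
                # set it aside for manual attention, and keep processing.
                quarantine(fp, reason=str(err))
            except Exception:
                # Catch-all "critical exception" path: something nobody
                # anticipated. Both processors run identical code, so the same
                # poison flight plan drives them both down this branch.
                enter_maintenance_mode()
                raise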

IFPS is a cousin of FPRSA but seems to have parsed the route and found the actual exit points without a problem; presumably it does this to work out which FIRs it needs to send the flight plan to. Why FPRSA has to ignore the route already generated by IFPS and instead process the original flight plan is a mystery. And why it doesn't do that the same way as IFPS is an even bigger mystery, as the two systems were almost certainly written in the same building and perhaps the same office.
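
On the exit point question, one plausible way for a parser to cope with a duplicate waypoint designator is to prefer the candidate nearest the previous point on the route. The sketch below shows that idea only; I have no idea whether IFPS actually does anything of the sort.

    import math

    def distance_nm(a, b):
        """Approximate great-circle distance in nautical miles between two
        (latitude, longitude) points given in degrees."""
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        c = (math.sin(lat1) * math.sin(lat2) +
             math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
        return math.acos(max(-1.0, min(1.0, c))) * 3440.065

    def resolve_route(identifiers, fix_db, origin):
        """Resolve each route identifier to coordinates. Where a designator
        matches more than one fix, pick the candidate closest to the previous
        point rather than the first match in the database."""
        previous = origin
        resolved = []
        for ident in identifiers:
            best = min(fix_db[ident], key=lambda fix: distance_nm(previous, fix))
            resolved.append((ident, best))
            previous = best
        return resolved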

In modernish systems the process can be restarted independently of the hardware. I suspect the Level 1 can do this but not restart the hardware, which requires a higher level of access. I really want to know the reasoning that led them to think that restarting the hardware might help when restarting the software didn't. At the start the issue crashed two separate sets of hardware, so it is unlikely that anything would be flushed out by doing that. In desperation you might try switching it off and on, but would you really wait 90 minutes in that hope?

Most of all I'm amazed they didn't suspect what was going on. I dislike the term "Rogue Flight Plan" because it seems to place the blame on the flight plan when what you usually mean is a flight plan that exposes a bug in your system, but at one time a rogue flight plan would have been at the top of everybody's mind in a situation like this, even when it wasn't the issue. I would have expected identifying which flight plan seemed to be the issue, and using either AMS or IFPS to remove it from the queue, to be high up on the list of things to do when FPRSA crashes, but for some reason they didn't do so for nearly 4 hours.

Abrahn 23rd Apr 2024 23:55

All credit to Martin Rolfe. He could have thrown his engineers under the bus but didn't.

Semreh 25th Apr 2024 19:12

Happens elsewhere too...
 
Our Norwegian friends did not have a lot of fun this morning: all air traffic stopped in Southern Norway from roughly 06:30 local time to roughly 09:30 local time.

Link to Norwegian news: https://www.nrk.no/stor-oslo/avinor_...rge-1.16858495

A rough translation of the relevant bit:


– How can a technical failure paralyse air traffic in the whole of Southern Norway?

– This failure affected both the control centre in Røyken and the control centre in Stavanger, which are responsible for air traffic in the whole of Southern Norway, says Pedersen.

He explains that when they get that type of failure, they have to protect flight safety by stopping the traffic and clearing the traffic they already have. The primary task is then to find the fault and get it under control.

I don't know what system the Norwegians use, but obviously something that shouldn't happen did. It doesn't only happen in the UK.

And while it is nothing to do with air transport, the Norwegian rail system suffers almost country-wide stoppages as a result of using a private mobile phone/data network for signalling, which fails all too frequently; the only option then is to stop the trains.

Norwegian infrastructure projects are not blessed with flawless implementations: the replacement helicopters for the Sea King SAR helicopters can't land at many hospital helipads because their downdraft is too great. Given how long the replacement project ran for, it is surprising this wasn't identified and dealt with far earlier.

Semreh 26th Apr 2024 20:03


Originally Posted by Semreh (Post 11643060)
Our Norwegian friends did not have a lot of fun this morning: all air traffic stopped in Southern Norway from roughly 06:30 local time to roughly 09:30 local time.

Link to Norwegian news: https://www.nrk.no/stor-oslo/avinor_...rge-1.16858495

I don't know what system the Norwegians use, but obviously something that shouldn't happen did. It doesn't only happen in the UK.

And while nothing to do with air-transport, the Norwegian rail system has almost country wide stoppages as a result of using a private mobile phone/data network for signalling which fails all too frequently, and the only option is to stop the trains.

Norwegian infrastructure projects are not blessed with flawless implementations: the replacement helicopters for the Sea King SAR helicopters can't land at many hospital helipads because their downdraft is too great. Given how long the replacement project ran for, it is surprising this wasn't identified and dealt with far earlier.

Our Norwegian friends are having a torrid time. Today a 6-hour outage for Northern Norway, caused by a fibre break in Telenor's telecommunications network.

https://www.nrk.no/nordland/luftromm...rge-1.16861310

I suspect this shows how thinly spread Norway's infrastructure is in reality. I would have naively thought that ATC would have duplicated, independent telecommunications connections. Perhaps the harsh reality of the economics is that it is not worth it, and having an outage every so often is more affordable.

eglnyt 26th Apr 2024 22:20


Originally Posted by Semreh (Post 11643782)
Our Norwegian friends are having a torrid time. Today a 6-hour outage for Northern Norway, caused by a fibre break in Telenor's telecommunications network.

https://www.nrk.no/nordland/luftromm...rge-1.16861310

I suspect this shows how thinly spread Norway's infrastructure is in reality. I would have naively thought that ATC would have duplicated, independent, telecommunications connections. Perhaps the harsh reality of the economics is that it is not worth it, and having an outage every so often is more affordable.

It is a fact of life that in any country the infrastructure becomes weaker the further you get from the main centres of business. Most Western states rely on separate telecom suppliers to provide their underlying ATC infrastructure, and those suppliers are not going to run two links if they can't make a return on both of them. There are certainly parts of the UK where you'll be lucky to get one network connection, let alone two. Microwave links have been used in the past for inaccessible areas, but they are expensive and not very reliable in the sort of wet, cold weather seen in Northern Europe. Longer term, Starlink and its competitors might provide a solution to this problem.

kiwi grey 27th Apr 2024 03:50


Originally Posted by eglnyt (Post 11643844)
It is a fact of life that in any country the Infrastructure becomes weaker the further you get from the main centres of business. Most Western states rely on separate Telecom Suppliers to provide their underlying ATC Infrastructure and those suppliers are not going to run two links if they can't make a return on both of them. There are certainly parts of the UK where you'll be lucky to get one network connection let alone two. MIcrowave links have been used in the past for inaccessible areas but they are expensive and not very reliable in the sort of wet cold weather seen in Northern Europe. Longer term Starlink and its competitors might provide a solution to this problem.

(emphasis added)
Not longer term, this resilience issue could be addressed right now.
Every significant point in their network could have a Starlink dish and/or a OneWeb dish and use it to provide a backup channel. The costs would be 'petty cash' compared to their current annual spend.
Such a solution could be put into operation in anything from a few days for some sites to a few months for the more difficult-to-access places.

Of course, this isn't "the way we do things" - not just in the UK and Norway, but all over the world - so there will often be great organisational resistance to such a radical change.


eglnyt 28th Apr 2024 16:39


Originally Posted by kiwi grey (Post 11643910)
(emphasis added)
Not longer term, this resilience issue could be addressed right now.
Every significant point in their network could have a Starlink dish and/or a OneWeb dish and use it to provide a backup channel. The costs would be 'petty cash' compared to their current annual spend.
Such a solution could be put in operation in a few days for some sites to a few months for the more difficult to access places.

Of course, this isn't "the way we do things" - not just in the UK and Norway, but all over the world - so there will often be great organisational resistance to such a radical change.

Straying a bit from the thread title, but the connectivity is the easy bit, and even then it's not entirely straightforward. Even the most successful of the possibilities has only just reached the point where it can provide a dependable service in most areas, and even then it's a proprietary system that might at any time be switched off in your area by the person who controls it. The others have had a rocky path and are probably a little way from reaching the point where you can take them seriously.
The more difficult part is integrating them into your network, and the most difficult bit is the security. Effectively connecting the entire Internet to your critical infrastructure takes a lot of nerve, and trust that your security suppliers can outpace the bad actors elsewhere. For some reason, in the UK at least, cyber attacks are considered to always be a failing of the target, and the CEO is unlikely to survive a successful attack even if everything possible had been done to prevent it. The Technical Director will follow them out of the door. I can understand the reluctance to make this sort of change.

Lascaille 21st May 2024 01:20


Originally Posted by eglnyt (Post 11644846)
Straying a bit from the thread title but the connectivity is the easy bit and even then not entirely straightforward.

It is actually quite straightforward - the normal solution for this type of thing is a VSAT (very small aperture terminal), a dedicated satellite uplink/downlink channel with bandwidth as per your requirements.

This is commonly used to provide backup for emergency service radio base stations located in hilly or remote areas - the primary connection being point-to-point microwave which hops between towers, the VSAT being the backup. You will see many of these in Wales and Scotland if you know what you're looking at.

These are commercial-grade services with SLAs etc, available basically globally apart from extreme polar regions.
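
For completeness, the failover logic that sits on top of such a setup is conceptually simple. The toy monitor below assumes one primary (microwave) link and one VSAT backup; the check and switch functions are placeholders, not any real product's API.

    import time

    def monitor(check_primary, check_backup, switch_to, interval_s=5.0):
        """Poll both links and keep traffic on the primary whenever it is healthy."""
        active = "primary"
        while True:
            if active == "primary" and not check_primary():
                if check_backup():
                    switch_to("backup")        # primary down, backup healthy
                    active = "backup"
            elif active == "backup" and check_primary():
                switch_to("primary")           # fail back once the main link recovers
                active = "primary"
            time.sleep(interval_s)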

