
U.K. NATS Systems Failure

Old 2nd Sep 2023, 09:30
  #201 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Gupeg
. We'll later see if the report on Monday's issue has any parallels?
The review report took several months to amass the evidence and compile the report. I doubt very much that anything that comes out on Monday will be sufficient to decide if there are any parallels.

I'm expecting, or at least hoping for, an identification of the system that failed, probably accompanied by a poor description of its function and how it works; an identification of this apparently errant data and, in particular, why the NATS system didn't like it; and a timeline of events, hopefully with a hint as to why it took as long as it did to fix. It would be a bonus if it explains how long they previously thought it would take.

That should be sufficient to keep PPRUNE going for a while based on how much discussion we have had in the absence of any of that. I then expect there to be an inquiry of some sort because the Secretary of State has to be seen to be doing something. In the Autumn once MPs come back from their lengthy holidays and have finished their conference season there will probably be a Select Committee hearing because they also want to be seen to be doing something.

In the meantime out of sight somebody will fix the issue, hopefully properly, and wait for the next one in 10 years.
eglnyt is offline  
Old 2nd Sep 2023, 10:24
  #202 (permalink)  
 
Join Date: Jan 2017
Location: UK
Posts: 65
Received 2 Likes on 1 Post
Originally Posted by eglnyt
We continue to discuss an earlier failure on a system that almost certainly wasn't the one involved in this case although of course currently we don't know which system was.

It wasn't new software. It was the original software; it had been there for years. The change introduced was to start using it nearer the limits of the system, of which there are two: 151 civil positions and 193 overall. The verification of those limits and acceptance of them happened years before. To use the poor analogy previously introduced, it is akin to buying a 5 seat car, only using 4 seats for several years and one day having a need to use all 5. In my case, discovering that if Isofix is in use on 2 of the seats it is actually a 4.5 seat car, not 5.

Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only time that could be done was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable resources available at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that but this is a system developed long before that was an option. And a simple test up to 193 would not have uncovered the issue: you would need to invoke watching mode when more than 151 were in use; any other mode added above 151 would not have triggered the issue. If your aim was to fully stress the system it is likely that you would have invoked the more demanding modes to do that.

Should they have spotted the error on code review? This is a bad case for humans. There are two limits in use. I'd probably spot a completely incorrect limit but I'd be far less likely to spot that the wrong one was being used.
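To make the "wrong limit" point concrete, here is a minimal sketch of how such a defect can sit unnoticed for years. Everything in it is an assumption for illustration: the function names, the mode strings and the structure are invented, and only the two figures (151 and 193) come from the post. It is not a reconstruction of the actual NERC code.

```python
# Hypothetical sketch: names, modes and structure are assumptions made for
# illustration. Only the two numeric limits come from the post above.
MAX_CIVIL_POSITIONS = 151
MAX_TOTAL_POSITIONS = 193


class CapacityError(Exception):
    """Raised when a workstation cannot be admitted."""


def admit(mode: str, in_use: int, fixed: bool = False) -> int:
    """Admit one more workstation and return the new number in use."""
    if mode == "watching" and not fixed:
        # Latent defect: this path compares against the civil limit,
        # although the count covers every workstation.
        limit = MAX_CIVIL_POSITIONS
    else:
        limit = MAX_TOTAL_POSITIONS
    if in_use >= limit:
        raise CapacityError(f"{in_use} in use, limit {limit}")
    return in_use + 1


def fill(mode: str, target: int, fixed: bool = False) -> int:
    """Admit workstations one at a time in the given mode up to target."""
    in_use = 0
    while in_use < target:
        in_use = admit(mode, in_use, fixed)
    return in_use
```

In this toy version, a load test that fills all 193 positions in any mode other than "watching" passes, and so does every day of operation at or below 151. Both constants are plausible values, which is why a reviewer looking at the faulty branch would have little reason to stop on it.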

Should SFS have 2 completely different sets of software so an error would only affect one? Ideally yes, but as I've said before that is also impractical. The supplier struggled to produce one set of software in the timescale and cost originally estimated. Even if you doubled your estimate, producing two would, in the end, cost considerably more than twice as much, if you ever managed to deliver at all.
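The trade-off between an identical backup and a second, independently written set of software can be sketched in a few lines. The parsers and the failing input below are invented purely for illustration and do not describe any real NATS system; the point is only that an identical backup repeats the primary's failure on the same input, while a diverse one might not.

```python
# Hypothetical sketch: the parsers and the failing input are invented for
# illustration and do not describe any real NATS system.
from typing import Callable, Optional

Parser = Callable[[str], int]


def parser_a(text: str) -> int:
    # Invented latent defect: an empty field is not handled.
    return int(text)


def parser_b(text: str) -> int:
    # Independently written variant that treats an empty field as zero.
    return int(text) if text.strip() else 0


def run_with_backup(text: str, primary: Parser, backup: Parser) -> Optional[int]:
    """Try the primary channel, then the backup; None means both failed."""
    for channel in (primary, backup):
        try:
            return channel(text)
        except ValueError:
            continue
    return None
```

With `parser_a` on both channels an empty field defeats primary and backup alike, whereas pairing `parser_a` with `parser_b` survives it. The sketch says nothing about cost, which is the objection raised in the post: the second implementation has to be specified, written, tested and maintained separately.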

Business criticality is a different matter from safety criticality but for all systems in the flight data thread you can make an adequate safety case with redundancy provided with an identical system provided you have a means of ensuring that, at all times from inception of failure, you can safely handle the level of traffic that might be present. In the case of Monday the level of traffic at failure was safely handled and the reduction of traffic as data degraded ensured that continued to be the case.

If your safety case is made then business criticality becomes purely a matter of cost benefit.
Agreed.

I often refer people to this article when asked "why do you ship software with bugs?":
Short form: https://www.theguardian.com/technolo...hnologysection
The long(er) form original article: https://ericsink.com/articles/Four_Questions.html
paulross is offline  
Old 2nd Sep 2023, 11:33
  #203 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Hartington
A good few years ago I was testing a piece of commercial, non safety critical, software. It failed at a specific point in a way I considered "interesting". I described the failure to the programmer. He looked quizzical and said "I wondered if it would do that".

Then there was a recurring fault. It happened in client systems all over the country. Nobody experienced it frequently or consistently. In fact, across the country, it happened about twice a year. Most people never had the problem. Try as we might, we never got to the bottom of it (believe me, we really tried).

Software is written by humans, tested by humans (test scripts for automated systems are written by humans) and used by humans. Humans are error prone and, in the end, it means software will be error prone.
I think you said it, 'not in mission critical systems'. Would you build a management system for a nuclear reactor this way? No.
Neo380 is offline  
Old 2nd Sep 2023, 11:42
  #204 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
That all sounds realistic. I'm just surprised at the apparent expectation of having no fail-safes in a mission critical system, but perhaps they were deemed unnecessary with the 1960s technology, and at the time, because the old system never failed over (except that's happened a couple of times now)?
Neo380 is offline  
Old 2nd Sep 2023, 11:51
  #205 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Neo380
I think you said it, 'not in mission critical systems'. Would you build a management system for a nuclear reactor this way? No.
What you may or may not do for a nuclear reactor control system isn't really relevant unless what you are building has hazards with the same order of harm. The design of your system and the rigour of the processes surrounding that design and build have to be consistent with the level of harm not the best practice employed where the level of harm is very high.
eglnyt is offline  
Old 2nd Sep 2023, 12:50
  #206 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
What you may or may not do for a nuclear reactor control system isn't really relevant unless what you are building has hazards with the same order of harm. The design of your system and the rigour of the processes surrounding that design and build have to be consistent with the level of harm not the best practice employed where the level of harm is very high.
That’s fine, if you want to run the National Airspace System (which is defined as ‘critical national infrastructure’ btw) on manual every time your computer trips out, then no issues.

But when you come back to the public purse (as you will, despite what you might read here) and ask for £1bn+ for a new system ‘for the safe and smooth running of the UK’s CNI’, expect to be reminded of what you said.

Btw, this is the same organisation that in RP3, so only a couple of years ago, said ‘accepting any performance improvements (the point of reporting periods!) was against the business’s and national interest’.

If I was on the Parliamentary Committee I would ensure that sort of comment (and several of the above) was recorded as ‘breathtakingly arrogant!’
Neo380 is offline  
Old 2nd Sep 2023, 13:35
  #207 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Neo380
That’s fine, if you want to run the National Airspace System (which is defined as ‘critical national infrastructure’ btw) on manual every time your computer trips out, then no issues.

But when you come back to the public purse (as you will, despite what you might read here) and ask for £1bn+ for a new system ‘for the safe and smooth running of the UK’s CNI’, expect to be reminded of what you said.

Btw, this is the same organisation that in RP3, so only a couple of years ago, said ‘accepting any performance improvements (the point of reporting periods!) was against the business’s and national interest’.

If I was on the Parliamentary Committee I would ensure that sort of comment (and several of the above) was recorded as ‘breathtakingly arrogant!’
So this future demand on the public purse is invented by you? The last £1bn didn't come from the public purse so why should the next?
If anything like last time I would expect the Committee to be hell bent on making sound bites, most of which have no relevance to the situation, rather than actually discussing the issue at hand. Last time they continually demanded the CEO promise something neither he nor anybody else could promise and contrived to get him replaced when he stuck to the only realistic response.
The RPs are a negotiation between multiple parties and I'd expect NATS to forcefully fight its corner. You might consider that arrogant but actually as directors of a PLC the board is probably legally required to do so.
As I said before, once the safety case is established the rest is a business matter. It may be CNI but the decision as to what investment happens is part of the licence negotiations, and in each case so far the customers, i.e. the airlines, have been offered options with more resilience and, after predictably initially demanding low prices and more resilience, have tended towards lower cost. If you don't want those who pay to define your CNI you need to give the Regulator different powers.
If Monday was a result of the failure of an old system then the initial report next week might include some information on the investment to date, why that system is still in use, and current plans and timescales for replacement.
eglnyt is offline  
Old 2nd Sep 2023, 14:07
  #208 (permalink)  
 
Join Date: Aug 2023
Location: England
Posts: 7
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Gupeg
I think this is rather a favourable way of looking at it
However, the upgrade is described as being specifically to "add military controller roles". ..... but the whole system to expose (as here) related latent errors that had been "got away with" to date - especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military).
Not correct I am afraid. It's always been a dual type system. It was just adding more military workstations on this occasion. As far as I know it was not the specific military functionality that was the problem but the total number of active and watching stations.

As someone said the safety case was fine but the resilience (which has often little to do with safety) failed on that occasion. It would indeed cost a ton of money to make it more resilient with dual software and who is willing to pay for that? Many of the NATS shareholders are UK airlines. Will they pay? As BA has found with its booking system resilience can be hard to buy.
Engineer39 is offline  
Old 2nd Sep 2023, 16:14
  #209 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
So this future demand on the public purse is invented by you? The last £1bn didn't come from the public purse so why should the next?
If anything like last time I would expect the Committee to be hell bent on making sound bites, most of which have no relevance to the situation, rather than actually discussing the issue at hand. Last time they continually demanded the CEO promise something neither he nor anybody else could promise and contrived to get him replaced when he stuck to the only realistic response.
The RPs are a negotiation between multiple parties and I'd expect NATS to forcefully fight its corner. You might consider that arrogant but actually as directors of a PLC the board is probably legally required to do so.
As I said before, once the safety case is established the rest is a business matter. It may be CNI but the decision as to what investment happens is part of the licence negotiations, and in each case so far the customers, i.e. the airlines, have been offered options with more resilience and, after predictably initially demanding low prices and more resilience, have tended towards lower cost. If you don't want those who pay to define your CNI you need to give the Regulator different powers.
If Monday was a result of the failure of an old system then the initial report next week might include some information on the investment to date, why that system is still in use, and current plans and timescales for replacement.
That is the current 'word' in NATS, so you'd have to ask them, not me (it sounds like you're pretty close too). But if there's going to be no call at all on the public purse that's great - we can meet back here and you can say 'I told you so'. I'd like to read the Hansard report of the last Parliamentary Committee. Do you have it? It would be interesting to see what was actually said. (I thought) Richard Deakin, the then CEO, was replaced; are you saying this is why?
Come on, NATS didn't 'forcefully fight its corner' (you must be an insider); it picked its ball up and refused to play. What other government department, agency or PPP is immune from scrutiny and from improving itself?
NATS is not a PLC; the PPP status confers some significant privileges on it - not least of which is a monopoly that is normally banned under Competition Law. CNI is not a 'business matter', it just doesn't work like that - would you be happy if we deployed your children to Iraq or Afghanistan and said, 'oh, sorry, the business decided against buying you protective equipment and ammunition'? The analogy shows how ridiculous this line of thinking is. The Regulator should absolutely be defining what fail-safe procedures it wants to see, I agree with you there (there's little point checking 'amendments to MATS' if the country's critical national infrastructure is susceptible to catastrophic collapse). I expect the initial report to reveal very little again - if you suspend your disbelief/denial for just a few seconds it's clear that NATS isn't coming clean on this one.
Neo380 is offline  
Old 2nd Sep 2023, 16:51
  #210 (permalink)  
 
Join Date: May 2000
Location: Living In The Past
Age: 76
Posts: 299
Received 1 Like on 1 Post
The Transport Committee needs a Gwyneth Dunwoody clone at the helm - weak excuses would not be tolerated !
Eric T Cartman is offline  
Old 2nd Sep 2023, 16:57
  #211 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
NATS has a very complex structure of intertwined companies but the holder of the licence is NATS (En-Route) PLC. Whether it is a "proper" PLC is a debatable point, not least because the shares aren't publicly traded, but it is constituted as one and the rules for PLCs apply. The En-Route operation is, as you say, a regulated monopoly.
The Hansard report of the Select Committee is available at the official source and the proceedings are still available on Parliament TV.
I am closer than some, lived through much of what we have been discussing, have a reasonable but possibly dated understanding of the systems in the Flight Data thread but have no current connection with NATS. I have no more knowledge of what happened than anybody else outside NATS although I can deduce some things from the regulations imposed.
eglnyt is offline  
Old 2nd Sep 2023, 17:01
  #212 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
NATS has a very complex structure of intertwined companies but the holder of the licence is NATS (En-Route) PLC. Whether it is a "proper" PLC is a debatable point, not least because the shares aren't publicly traded, but it is constituted as one and the rules for PLCs apply. The En-Route operation is, as you say, a regulated monopoly.
The Hansard report of the Select Committee is available at the official source and the proceedings are still available on Parliament TV.
I am closer than some, lived through much of what we have been discussing, have a reasonable but possibly dated understanding of the systems in the Flight Data thread but have no current connection with NATS. I have no more knowledge of what happened than anybody else outside NATS although I can deduce some things from the regulations imposed.
That’s useful, thanks. Good chat btw.
Neo380 is offline  
Old 2nd Sep 2023, 17:01
  #213 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Eric T Cartman
The Transport Committee needs a Gwyneth Dunwoody clone at the helm - weak excuses would not be tolerated !
Fully agree. Having previously seen the MP for Crewe & Nantwich in action, well briefed and focussed and downright scary, I was disappointed to see what the Select Committee had become. I now despair that the malaise has infected the whole Parliamentary system.
eglnyt is offline  
Old 3rd Sep 2023, 11:43
  #214 (permalink)  
 
Join Date: May 2016
Location: UK
Posts: 6
Likes: 0
Received 1 Like on 1 Post
E39:
Not correct I am afraid. It's always been a dual type system. It was just adding more military workstations on this occasion. As far as I know it was not the specific military functionality that was the problem but the total number of active and watching stations.
My comment:
especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military).
is based on the report G.3.19:
The software had its origins in an earlier development in the USA that did not support military Controllers, and this might help to explain the original program design, although it is unlikely that the underlying cause for the software fault can be found at this time.
Reading elsewhere in G there is reference to 'poor' naming of a variable, the poor being because it was not written to cover civil and military.

Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only time that could be done was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable resources available at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that but this is a system developed long before that was an option.
I doubt we differ much overall, some detail maybe. What you allude to is that in the 2020s we are using a system that by current standards is not fit for purpose in that it cannot be tested. The reference above shows there is code written decades ago that is not amenable to even "visual checking", and there is no practical test system in existence.

egl:
It wasn't new software. It was the original software, it had been there for years.
Sorry - by "new" software, I meant a new version introduced 1 day prior to the failure... From my own software experience, it tends to be the minor upgrades that bring the most grief.

We'll later see if the report on Monday's issue has any parallels?
... I doubt very much that anything that comes out on Monday will be sufficient to decide if there are any parallels.
Slight misunderstanding. I appreciate there will be no report out Monday (tomorrow) - I was referring to (last) Monday's issue. When we do see a report, it will be interesting to see if there are parallels between 2023 and 2014...
Gupeg is offline  
Old 3rd Sep 2023, 12:34
  #215 (permalink)  
 
Join Date: Jun 2009
Location: Bedford, UK
Age: 70
Posts: 1,319
Received 24 Likes on 13 Posts
Originally Posted by eglnyt
Fully agree. Having previously seen the MP for Crewe & Nantwich in action, well briefed and focussed and downright scary, I was disappointed to see what the Select Committee had become. I now despair that the malaise has infected the whole Parliamentary system.
Perhaps we blame government and parliament too much. I suspect it's the great bureaucratic infrastructure where the fault might lie and there is no curing that.

As for nuclear power stations and mission critical systems, from what I remember it was more a case of monitoring physical parameters and shutting it down if things went out of whack. Ten to the minus 9 only gets you so far.
Mr Optimistic is offline  
Old 3rd Sep 2023, 12:57
  #216 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Mr Optimistic
Perhaps we blame government and parliament too much. I suspect it's the great bureaucratic infrastructure where the fault might lie and there is no curing that.

As for nuclear power stations and mission critical systems, from what I remember it was more a case of monitoring physical parameters and shutting it down if things went out of whack. Ten to the minus 9 only gets you so far.
Whereas putting proper fail safes in place, as would be the case in other mission critical systems, does properly manage the problem - but that’s the point NATS is very keen to avoid discussing.
Neo380 is offline  
Old 3rd Sep 2023, 13:17
  #217 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Neo380
Whereas putting proper fail safes in place, as would be the case in other mission critical systems, does properly manage the problem - but that’s the point NATS is very keen to avoid discussing.
You seem to be interpreting the silence so far as avoiding discussion. That may be the case but from previous experience a lot of detailed work needs to happen to prepare the background material before you can start the discussion. In this case there appear to be external parties involved and possibly suppliers and you have to give them time and opportunity to prepare their responses as well. How many organisations do you know that can lay their hands on 20 year old records quickly?
eglnyt is offline  
Old 3rd Sep 2023, 14:07
  #218 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
You seem to be interpreting the silence so far as avoiding discussion. That may be the case but from previous experience a lot of detailed work needs to happen to prepare the background material before you can start the discussion. In this case there appear to be external parties involved and possibly suppliers and you have to give them time and opportunity to prepare their responses as well. How many organisations do you know that can lay their hands on 20 year old records quickly?
I do, as it’s much more like obfuscation, at least on this channel, which has no time constraints.

This is sixty year old technology.

The question is ‘where were the fail safes?’

I suspect the 2014 catastrophic failure would be a good starting point for a proper investigation.

And btw (as I’ve said before, for good reason) I don’t expect NATS to come clean on this issue.
Neo380 is offline  
Old 3rd Sep 2023, 14:57
  #219 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Neo380
I do, as it’s much more like obfuscation, at least on this channel, which has no time constraints.

This is sixty year old technology.

The question is ‘where were the fail safes?’

I suspect the 2014 catastrophic failure would be a good starting point for a proper investigation.

And btw (as I’ve said before, for good reason) I don’t expect NATS to come clean on this issue.
Nobody who knows anything will be posting on "social media" channels. Like most organisations NATS has policies about that and will regularly remind its staff of their obligations even when there hasn't been an "incident".
With a reasonable knowledge of the systems involved I can't tell you which system it was. It's only speculation that it's the ageing Flight Data system although it does have previous. Not all the systems involved are as old although all, I think, have redundancy provided by a backup running the same or very similar software. If there is an investigation I would hope that is discussed although most of the systems other than NAS are used at multiple ANSPs in exactly the same way.
2014 was not catastrophic. It was of quite short duration and over the course of the day NATS handled a higher percentage of planned traffic than most businesses would expect to handle in a fallback mode. The response to 2014 was rather over the top given the actual impact. If we did the same for the railway Network Rail would be forever at the Committee and in a permanent state of review.
This one was much worse in terms of impact for which reason a similar review should be the minimum, I'd argue for one with a bit more independence. There were independent experts who wrote most of the meaningful content but CAA/NATS were allowed to lead it last time.
Some themes will be similar but if this was a different system much of the detail from 2014 will be irrelevant.
eglnyt is offline  
Old 3rd Sep 2023, 18:18
  #220 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
Nobody who knows anything will be posting on "social media" channels. Like most organisations NATS has policies about that and will regularly remind its staff of their obligations even when there hasn't been an "incident".
With a reasonable knowledge of the systems involved I can't tell you which system it was. It's only speculation that it's the ageing Flight Data system although it does have previous. Not all the systems involved are as old although all, I think, have redundancy provided by a backup running the same or very similar software. If there is an investigation I would hope that is discussed although most of the systems other than NAS are used at multiple ANSPs in exactly the same way.
2014 was not catastrophic. It was of quite short duration and over the course of the day NATS handled a higher percentage of planned traffic than most businesses would expect to handle in a fallback mode. The response to 2014 was rather over the top given the actual impact. If we did the same for the railway Network Rail would be forever at the Committee and in a permanent state of review.
This one was much worse in terms of impact for which reason a similar review should be the minimum, I'd argue for one with a bit more independence. There were independent experts who wrote most of the meaningful content but CAA/NATS were allowed to lead it last time.
Some themes will be similar but if this was a different system much of the detail from 2014 will be irrelevant.
They may not be posting on social media but they discussed it quite freely with me when I was working there shortly after the investigation - hence I know what they think the real cause of the issue is.

The system in question is also already in the public domain. Very interesting that you should then say 'Not all the systems involved are as old although all, I think, have redundancy provided by a backup running the same or very similar software.' Let's see. I just don't know, but other commentators have said that Swanwick Centre is the only centre that is still operating this particular (version of this?) system.

I agree with the need for an independent review - look at the 'dodgy French data' PR, 'Martin Rolfe hasn't been seen at home since it happened', the Radio 4 and The Times pieces - all masterful stuff, but mostly 'fluff'.

Last edited by Neo380; 3rd Sep 2023 at 19:00.
Neo380 is offline  

