U.K. NATS Systems Failure


Old 30th Aug 2023, 19:05
  #141 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,575
Received 512 Likes on 286 Posts
Originally Posted by EGPI10BR
Some systems will run with the primary on version n and the backup on version n-1 so that the backup won't be affected by a newly introduced bug.

That falls down of course if an undetected bug was in version n-15.

Misty.
Fine for software tweaks, not so good when the change is caused by a shift in user requirements. Then you end up with a system that should do what you want but doesn't run and a system that doesn't do what you want but runs.
Ninthace is offline  
Old 30th Aug 2023, 19:15
  #142 (permalink)  
 
Join Date: Jan 2006
Location: Cyprus
Age: 76
Posts: 270
Likes: 0
Received 0 Likes on 0 Posts
Walnut

Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don't believe him.
I know from airline sources that the problem was known at 0800, 4hrs before NATS admitted to the problem. That's why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4hrs later, around 1500, NATS announced they had fixed the problem. They hadn't; the bug had washed out because the system holds about 4hrs of data.
So my analysis is that every time there is a faulty input of data the system resets itself to protect itself. Fine, but surely an airline, on discovering its flight plan had been rejected, will check and input its data correctly. So I believe a DOS input by a third party was reintroduced, causing the NATS system to go down again. As so many flight plans were being inputted it was very difficult to detect; only when NATS admitted there was a problem did the malevolent data cease being inputted, and it eventually washed out.
So, CEO, please explain the delay.





Walnut is offline  
Old 30th Aug 2023, 19:21
  #143 (permalink)  
 
Join Date: Aug 2007
Location: Wilds of Warwickshire
Posts: 239
Received 8 Likes on 6 Posts
I suspect that an underlying reason for the severity of this breakdown was that the ATC system has been quietly 'redlining' for some time. The post-Covid boom in travel numbers is now making the decisions airlines made wrt putting more emphasis on narrow-bodies look very suspect. The math is simple: for a given number of passengers, smaller A/C mean more movements.
And it is a problem that will take quite some time to unscramble.
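The arithmetic behind that point can be sketched roughly as follows. The passenger total and seat counts are invented round numbers chosen only to show the effect, not figures for any real airline or aircraft type, and `movements_needed` is a hypothetical helper.

```python
import math

def movements_needed(passengers: int, seats_per_aircraft: int,
                     load_factor: float = 1.0) -> int:
    """Flights required to carry `passengers`, rounding up to whole flights."""
    if seats_per_aircraft <= 0 or not 0.0 < load_factor <= 1.0:
        raise ValueError("seats must be positive and load factor in (0, 1]")
    return math.ceil(passengers / (seats_per_aircraft * load_factor))

# Illustrative numbers only (assumptions, not real fleet data).
PASSENGERS = 18_000
narrow_body = movements_needed(PASSENGERS, seats_per_aircraft=180)
wide_body = movements_needed(PASSENGERS, seats_per_aircraft=300)
extra_movements = narrow_body - wide_body
```

With these made-up inputs the smaller type needs 100 movements against 60 for the larger one, so the same demand puts 40 more flights through the ATC system.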
KiloB is online now  
Old 30th Aug 2023, 19:27
  #144 (permalink)  
 
Join Date: Feb 2008
Location: uk
Posts: 48
Received 0 Likes on 0 Posts
Well, almost: larger aircraft need longer turnrounds and cause bigger headaches when they go tech.
speed13ird is offline  
Old 30th Aug 2023, 19:54
  #145 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 471
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Walnut
Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don't believe him.
I know from airline sources that the problem was known at 0800, 4hrs before NATS admitted to the problem. That's why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4hrs later, around 1500, NATS announced they had fixed the problem. They hadn't; the bug had washed out because the system holds about 4hrs of data.
So my analysis is that every time there is a faulty input of data the system resets itself to protect itself. Fine, but surely an airline, on discovering its flight plan had been rejected, will check and input its data correctly. So I believe a DOS input by a third party was reintroduced, causing the NATS system to go down again. As so many flight plans were being inputted it was very difficult to detect; only when NATS admitted there was a problem did the malevolent data cease being inputted, and it eventually washed out.
So, CEO, please explain the delay.
The originator would never know its plan was rejected because it wasn't. It was filed with Eurocontrol who had no reason to reject it. It passed through several systems without incident before it caused harm.

The NATS systems beyond the one we are assuming was the issue also have a buffer of data of up to 4 hours. This data will start to go stale as amendments, coordinations and new plans don't arrive but is a good basis to continue to operate without flow if you expect the technical issue to be fixed. At some point somebody will decide it isn't coming back and will make the call to impose flow. That was done before 10:00 UTC although it was about 2 hours later before NATS itself published anything.

Eventually all will be in the public domain however much anybody tries to stop it so why would they bother to lie?

eglnyt is offline  
Old 30th Aug 2023, 21:20
  #146 (permalink)  
 
Join Date: Nov 2006
Location: UK
Age: 58
Posts: 244
Received 19 Likes on 8 Posts
Originally Posted by LTNABZ
Agree, but ditto the Mail, though
Aimed at the same demographic
alfaman is online now  
Old 30th Aug 2023, 21:29
  #147 (permalink)  
 
Join Date: Jul 2020
Location: Beds
Posts: 9
Received 3 Likes on 1 Post
Originally Posted by eglnyt
The originator would never know its plan was rejected because it wasn't. It was filed with Eurocontrol who had no reason to reject it. It passed through several systems without incident before it caused harm.
So the flight plan was fine for Eurocontrol and other systems but not for NATS. And NATS uses a modified US software for flight handling which likely isn't the same as Eurocontrol et al? But this system has worked flawlessly until now? Does that point to something (software/firmware?) having changed recently in Euro land that hasn't been changed here, thus causing the failure?
Seaking74 is offline  
Old 30th Aug 2023, 21:43
  #148 (permalink)  
 
Join Date: Aug 2007
Location: not where I want to be
Posts: 516
Received 40 Likes on 27 Posts
Originally Posted by Superpilot
...There is something else behind this that they've not divulged yet.

Uh-oh, surely they're not using a M$ product?!

Given the importance of this system you'd have to hope there's a decent investigation and full public report that would reveal both the cause and the underlying products in use, along with how redundancy is managed. To my mind there are certain things that bear critical scrutiny; with decent information this should allay unhelpful speculation, but allow competent people to assess the risk faced in being involved with said system.

FP.
First_Principal is offline  
Old 30th Aug 2023, 22:01
  #149 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 471
Likes: 0
Received 0 Likes on 0 Posts
Last time there was an "independent" review & a public report which is still available on the CAA website. And a Parliamentary Select Committee hearing which it would be fair to say didn't really add much of value. This incident was orders of magnitude more disruptive so you'd expect at least the same although this Government isn't too hot on transparency and rather dismissive of experts.
eglnyt is offline  
Old 30th Aug 2023, 22:16
  #150 (permalink)  
 
Join Date: Dec 2015
Location: Budapest
Posts: 276
Received 111 Likes on 61 Posts
There has to be a system whereby the ANSP is penalised in favour of the airlines, not the Exchequer. It would be tricky to design & implement, but surely doable: a formula based initially on minutes of delay, distributed according to the impact on individual airlines (not necessarily the largest operators, but those on whom the event has had the largest impact). Most importantly, it must be a charge that the ANSP cannot simply claw back from the airlines in future.

Expatrick is offline  
Old 30th Aug 2023, 22:35
  #151 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 471
Likes: 0
Received 0 Likes on 0 Posts
The NATS licence includes a penalty scheme whereby a certain level of delay triggers a reduction in future charges. It is deliberately not punitive, to avoid influencing operational decision making, so it is unlikely to cover the airlines' costs.
eglnyt is offline  
Old 30th Aug 2023, 22:44
  #152 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 81
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by ATC Watcher
Never listen to the very vocal Irishmen; if you follow their logic, that of MOL at least, we should all work 12h a day, 7 days a week, for half of the money so that his airline can make more profit. But on the cost of back-up systems you take the problem the wrong way. A functioning back-up in an ATC centre is part of the system from the outset, regardless of the end price. You do not buy a system and later ask for a back-up.
That said, today if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your back-up system is likely to be used only some minutes a year. Also, we do not have a back-up system only to maintain capacity; it is primarily to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard); it caused delays, diversions and cancellations, but this is a capacity consequence, not a safety one.

What caused the system to crash is not really the important issue. It is nice to know, to learn and prevent it from happening again, but why it took so long to restart the primary system is the question I would like to ask if I had the chance.
The system crashed, for 7 hours, because it failed to 'fail over', which is what it's supposed to do - take a look at thread 122 to understand why. This isn't an 'IT glitch', it's 'building a house with a trap door in the floor, with the same trap door behind it, and an identical trap door under that' - fall through one and you fall through all three!
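The trap-door analogy can be sketched in a few lines. This is a toy model, assuming a processor that chokes on one particular input; `process_plan`, `quarantine_plan` and the "BAD" marker are all hypothetical and say nothing about how the real NATS software is built.

```python
class FlightPlanError(Exception):
    """Raised when a processor cannot handle a plan."""

def process_plan(plan: str) -> str:
    # Hypothetical processor with a latent bug triggered by one kind of input.
    if "BAD" in plan:
        raise FlightPlanError(f"cannot process {plan!r}")
    return f"accepted:{plan}"

def quarantine_plan(plan: str) -> str:
    # Hypothetical *different* fallback: set the plan aside for manual review.
    return f"quarantined:{plan}"

def run_with_failover(plan: str, primary, backup) -> str:
    """Try the primary, then the backup; report if both fall over."""
    try:
        return primary(plan)
    except FlightPlanError:
        try:
            return backup(plan)
        except FlightPlanError:
            return "both failed: manual handling required"

# Identical code on both sides: the same input trips both trap doors.
same_code = run_with_failover("BAD-PLAN", process_plan, process_plan)
# A backup that behaves differently survives the input that felled the primary.
diverse = run_with_failover("BAD-PLAN", process_plan, quarantine_plan)
```

Failing over only helps when the backup does not share the fault that took out the primary; a hardware fault usually qualifies, a bad input fed to identical code does not.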

The other serious managerial error was stopping the backup that used to occur through the Prestwick Centre, in the same way that Eurocontrol does (see above), but 'the two centres went their separate ways...'
Neo380 is offline  
Old 30th Aug 2023, 22:48
  #153 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 81
Likes: 0
Received 0 Likes on 0 Posts
'Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don’t believe him.
I know from airline sources that the problem was known at 0800, 4hrs before NATS admitted to the problem. That’s why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4hrs later, around 1500, NATS announced they had fixed the problem. They hadn’t; the bug had washed out because the system holds about 4hrs of data... So, CEO, please explain the delay'

Exactly - he can't (without removing the cover up).
Neo380 is offline  
Old 30th Aug 2023, 23:06
  #154 (permalink)  
 
Join Date: Dec 2006
Location: Whanganui, NZ
Posts: 276
Received 5 Likes on 4 Posts
Originally Posted by Abrahn
This is clearly a plausible explanation, but at first glance it doesn't seem as complex a task as you suggest
Whilst there are clearly an infinite number of locations, lots of them can be ruled out simply. Anything outside of bounding boxes for Shanwick, Scottish and London can go immediately. Within that you've got bounding boxes for the areas controlled and only then do you actually have to start considering the real geography. The graph of controllers must be fairly small (otherwise the crew would be constantly on the radio) and the number of possible graphs isn't that big (in computer terms), so anything getting stuck should be detectable. Your satnav can cope with a much bigger problem.
(emphasis added)
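The cheap first pass described in that quote might look roughly like this. It is only a sketch: the rectangles are invented, very coarse numbers that are NOT real airspace boundaries, and `Box`, `REGIONS` and the two helper functions are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned latitude/longitude rectangle, in degrees."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.south <= lat <= self.north
                and self.west <= lon <= self.east)

# Invented rectangles for illustration only; not real airspace boundaries.
REGIONS = {
    "London": Box(49.0, -8.0, 55.0, 2.0),
    "Scottish": Box(53.0, -10.0, 61.0, 0.0),
    "Shanwick": Box(45.0, -30.0, 61.0, -8.0),
}

def candidate_regions(lat: float, lon: float) -> list[str]:
    """Regions whose box contains the point; empty means 'rule it out now'."""
    return [name for name, box in REGIONS.items() if box.contains(lat, lon)]

def route_needs_detailed_check(waypoints: list[tuple[float, float]]) -> bool:
    """True if any waypoint survives the coarse filter."""
    return any(candidate_regions(lat, lon) for lat, lon in waypoints)
```

Only points that survive this filter would go on to the expensive test against the real sector geometry. Note that checking waypoints alone is itself a simplification, since a leg between two outside points can still cross a box.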

The amount of data and the complexity of the calculations / algorithm to do comprehensive - even exhaustive - checking may indeed be relatively trivial for a modern system, but the NATS system doesn't seem like that from what I've read. It appears to be an older (ancient?) system re-hosted onto a more modern platform.
The NATS base system may, for example, be a 32-bit architecture and simply unable to address 'modern' amounts of memory (anything beyond 4GB), no matter the capacity of the hosting platform. Or it may be that (parts of) the system are inherently single-threaded, so that it matters not how many CPU instances / threads you throw at it, it only goes as fast as a single CPU. Or perhaps the overheads of emulating the original code on a less elderly system, which is in turn being emulated on a quite modern platform, are such as to ensure that the best you can hope for is 'quite a lot faster' than the original implementation, but nothing approaching a modern definition of 'high performance'.
Or if you're really lucky, all three apply.
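Two of those limits are easy to put numbers on. The sketch below assumes a flat 32-bit address space and uses Amdahl's law as a stand-in for the single-threading point; neither is a statement about the actual NATS architecture, which the thread does not document.

```python
# A flat 32-bit address can name 2**32 distinct bytes, i.e. 4 GiB.
ADDRESS_BITS = 32
max_bytes = 2 ** ADDRESS_BITS
max_gib = max_bytes / 1024 ** 3

def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup when only part of the work parallelises."""
    if not 0.0 <= parallel_fraction <= 1.0 or workers < 1:
        raise ValueError("fraction must be in [0, 1] and workers >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Fully single-threaded code gains nothing from extra CPUs.
serial_on_64_cpus = amdahl_speedup(0.0, 64)
# Even half-parallel code can never exceed 2x, however many CPUs are added.
half_parallel_on_64_cpus = amdahl_speedup(0.5, 64)
```

The fully serial case stays at a speedup of 1.0 whatever the worker count, which is the 'only goes as fast as a single CPU' situation.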

The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.
kiwi grey is offline  
Old 31st Aug 2023, 01:13
  #155 (permalink)  
 
Join Date: Dec 2007
Location: Itinerant
Posts: 826
Received 45 Likes on 10 Posts
Originally Posted by eglnyt

Eventually all will be in the public domain however much anybody tries to stop it so why would they bother to lie?
Maybe yes, maybe no. NATS is a public / private partnership so, as such, is it subject to the UK Freedom of Information Act?
grizzled is offline  
Old 31st Aug 2023, 03:18
  #156 (permalink)  
 
Join Date: Jul 2013
Location: Within AM radio broadcast range of downtown Chicago
Age: 71
Posts: 782
Received 0 Likes on 0 Posts
Originally Posted by kiwi grey

The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.
Difficult to dispute that description of a rip-out and replace project.

But would it be worse than the aftermath of a strong cybersecurity breach, or cyber attack? A version of, "if you think safety is expensive, try having an accident"
WillowRun 6-3 is offline  
Old 31st Aug 2023, 06:39
  #157 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 81
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by grizzled
Maybe yes, maybe no. NATS is a public / private partnership so, as such, is it subject to the UK Freedom of Information Act?
You would think so, but as a PPP (not a publicly listed entity, or a government department/agency) NATS considers that it is NOT bound by the FOI Act, and therefore since the establishment of the Act (2000) 'picks and chooses' which FOI subject access requests (requests for information) it responds to. You can bet your pension that NATS won't be responding to FOI requests on this one!
Neo380 is offline  
Old 31st Aug 2023, 06:45
  #158 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 81
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by kiwi grey

The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.
Originally Posted by WillowRun 6-3
Difficult to dispute that description of a rip-out and replace project.

But would it be worse than the aftermath of a strong cybersecurity breach, or cyber attack? A version of, "if you think safety is expensive, try having an accident"
NATS has already floated replacing the (1970s based!) Swanwick main ATC system, at a tentative £1bn+ price tag, that they will ask to be paid for from the public purse. The issue is making the case for a 'safety critical' main system, when you have said that you can 'prioritise safety', and manage (highly reduced!) flows safely by operating manually - a bit of a contradiction in terms.
Neo380 is offline  
Old 31st Aug 2023, 08:31
  #159 (permalink)  
 
Join Date: Oct 2018
Location: Ferrara
Posts: 8,191
Received 321 Likes on 187 Posts
Today's "Times" notes that they're upping the money they (the shareholders - the Govt & the airlines) take out as dividends and reducing the amount they planned to invest...
Asturias56 is offline  
Old 31st Aug 2023, 08:53
  #160 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 471
Likes: 0
Received 0 Likes on 0 Posts
NATS investment is not from the public purse. One of the reasons it was privatised was to remove its borrowing requirements from the accounts in the days when the Public Sector Borrowing Requirement figure was thought to be important and Government hadn't realised it could just keep printing money and nobody actually cared.
It has been investing a lot since privatisation and continues to do so. Whether all that investment has been wise is difficult to judge, but it's kept a lot of people employed for a while.
If it was NAS that failed, then NATS has been trying to replace that system since the 90s and has failed to do so. Because it's so integrated into all NATS systems, it's a bit like trying to replace the foundations of a tower block whilst keeping the block intact.
Having finally realised that couldn't be done, it embarked a few years ago on replacing both NAS and the overlying systems. It's a huge project that does literally rip it out & start again.
eglnyt is offline  

