U.K. NATS Systems Failure

30th Aug 2023, 18:05

#141 (permalink)

Ninthace

Join Date: Jan 2008

Location: Glorious Devon

Posts: 2,699

Likes: 262

Received 937 Likes on 555 Posts

Quote:

Originally Posted by EGPI10BR

Some systems will run with the primary on version n and the backup on version n-1 so that the backup won’t be affected by a newly introduced bug.

That falls down of course if an undetected bug was in version n-15.

Misty.

Fine for software tweaks, not so good when the change is caused by a shift in user requirements. Then you end up with a system that should do what you want but doesn't run and a system that doesn't do what you want but runs.

30th Aug 2023, 18:15

#142 (permalink)

Walnut

Join Date: Jan 2006

Location: Cyprus

Age: 76

Posts: 270

Likes: 0

Received 0 Likes on 0 Posts

Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don’t believe him.
I know from airline sources that the problem was known at 0800, 4hrs before NATS admitted to the problem. That’s why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4hrs later, around 1500, NATS announced they had fixed the problem. They hadn’t; the bug had washed out because the system holds about 4hrs of data.
So my analysis is that every time there is a faulty input of data the system resets itself to protect itself. Fine, but surely an airline, on discovering its flight plan had been rejected, will check and input its data correctly. So I believe a DOS input by a third party was reintroduced, causing the NATS system to go down again. As so many flight plans were being inputted it was very difficult to detect; only when NATS admitted there was a problem did the malevolent data cease being inputted, and it eventually washed out.
So CEO, please explain the delay.

30th Aug 2023, 18:21

#143 (permalink)

KiloB

Join Date: Aug 2007

Location: Wilds of Warwickshire

Posts: 240

Likes: 19

Received 8 Likes on 6 Posts

I suspect that an underlying reason for the severity of this breakdown was that the ATC system has been quietly ‘redlining’ for some time. The post-Covid boom in travel numbers is now making the decisions airlines made wrt putting more emphasis on narrow-bodies look very suspect. The math is simple: for a given number of passengers, smaller A/C mean more movements.
And it is a problem that will take quite some time to unscramble.
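
As a rough illustration of that arithmetic, here is a minimal sketch. The passenger total and seat counts are made-up round figures chosen only to show the ratio, not data from any airline, and it assumes every seat is filled.

```python
import math


def movements_needed(passengers: int, seats_per_aircraft: int) -> int:
    """Flights required to carry `passengers`, assuming every seat is filled."""
    return math.ceil(passengers / seats_per_aircraft)


# Hypothetical round numbers (assumptions, not real fleet data).
PASSENGERS = 36_000
NARROW_BODY_SEATS = 180
WIDE_BODY_SEATS = 300

narrow = movements_needed(PASSENGERS, NARROW_BODY_SEATS)
wide = movements_needed(PASSENGERS, WIDE_BODY_SEATS)

print(f"narrow-body movements: {narrow}")
print(f"wide-body movements:   {wide}")
print(f"extra movements for ATC to handle: {narrow - wide}")
```

With these assumed figures the same passengers need 200 narrow-body movements against 120 wide-body ones, so the load on ATC scales with the seat ratio rather than with passenger numbers alone.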

30th Aug 2023, 18:27

#144 (permalink)

speed13ird

Join Date: Feb 2008

Location: uk

Posts: 48

Likes: 1

Received 0 Likes on 0 Posts

Well, almost: larger aircraft need longer turn rounds and cause bigger headaches when they go tech.

30th Aug 2023, 18:54

#145 (permalink)

eglnyt

Join Date: Oct 2004

Location: Southern England

Posts: 483

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by Walnut

The originator would never know its plan was rejected because it wasn't. It was filed with Eurocontrol who had no reason to reject it. It passed through several systems without incident before it caused harm.

The NATS systems beyond the one we are assuming was the issue also have a buffer of data of up to 4 hours. This data will start to go stale as amendments, coordinations and new plans don't arrive but is a good basis to continue to operate without flow if you expect the technical issue to be fixed. At some point somebody will decide it isn't coming back and will make the call to impose flow. That was done before 10:00 UTC although it was about 2 hours later before NATS itself published anything.

Eventually all will be in the public domain however much anybody tries to stop it so why would they bother to lie?

30th Aug 2023, 20:20

#146 (permalink)

alfaman

Join Date: Nov 2006

Location: UK

Age: 59

Posts: 247

Likes: 4

Received 23 Likes on 11 Posts

Quote:

Originally Posted by LTNABZ

Agree, but ditto the Mail, though

Aimed at the same demographic

30th Aug 2023, 20:29

#147 (permalink)

Seaking74

Join Date: Jul 2020

Location: Beds

Posts: 9

Likes: 4

Received 3 Likes on 1 Post

Quote:

Originally Posted by eglnyt

So the flight plan was fine for Eurocontrol and other systems but not for NATS. And NATS uses modified US software for flight handling which likely isn’t the same as Eurocontrol et al? But this system has worked flawlessly until now? Does that point to something (software/firmware?) having changed recently in Euro land that hasn’t been changed here, thus causing the failure?

30th Aug 2023, 20:43

#148 (permalink)

First_Principal

Join Date: Aug 2007

Location: not where I want to be

Posts: 521

Likes: 14

Received 49 Likes on 32 Posts

Quote:

Originally Posted by Superpilot

...There is something else behind this that they've not divulged yet.

Uh-oh, surely they're not using a M$ product?!

Given the importance of this system you'd have to hope there's a decent investigation and full public report that would reveal both the cause and the underlying products in use, along with how redundancy is managed. To my mind there are certain things that bear critical scrutiny; decent information should allay unhelpful speculation while allowing competent people to assess the risk faced in being involved with said system.

FP.

30th Aug 2023, 21:01

#149 (permalink)

eglnyt

Join Date: Oct 2004

Location: Southern England

Posts: 483

Likes: 0

Received 0 Likes on 0 Posts

Last time there was an "independent" review & a public report which is still available on the CAA website. And a Parliamentary Select Committee hearing which it would be fair to say didn't really add much of value. This incident was orders of magnitude more disruptive so you'd expect at least the same although this Government isn't too hot on transparency and rather dismissive of experts.

30th Aug 2023, 21:16

#150 (permalink)

Expatrick

Join Date: Dec 2015

Location: Budapest

Posts: 315

Likes: 163

Received 216 Likes on 129 Posts

There has to be a system whereby the ANSP is penalised in favour of the airlines, not the Exchequer. It would be tricky to design & implement, but surely doable: a formula based initially on minutes of delay, distributed according to the impact on individual airlines (not necessarily the largest operators, but those on whom the event has had the largest impact). Most importantly, it must be a charge that the ANSP is not simply able to claw back from the airlines in future.

30th Aug 2023, 21:35

#151 (permalink)

eglnyt

Join Date: Oct 2004

Location: Southern England

Posts: 483

Likes: 0

Received 0 Likes on 0 Posts

The NATS licence includes a penalty scheme whereby a certain level of delay triggers a reduction in future charges. It is deliberately not punitive, to avoid influencing operational decision making, so it is unlikely to cover the airlines' costs.

30th Aug 2023, 21:44

#152 (permalink)

Neo380

Join Date: Nov 2018

Location: UK

Posts: 82

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by ATC Watcher

Never listen to the very vocal Irishmen; if you follow their logic, that of MOL at least, we should all work 12h a day, 7 days a week, for half of the money so that his airline can make more profit. But on the cost of backup systems you take the problem the wrong way. A functioning backup in an ATC centre is part of the system from the outset, and this regardless of the end price. You do not buy a system and later ask for a backup.
That said, today if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your backup system is likely to be used only some minutes a year. Also, we do not have a backup system only to maintain capacity; it is primarily to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard); it caused delays, diversions and cancellations, but this is a capacity consequence, not a safety one.

What caused the system to crash is not really the important issue. Nice to know, to learn and prevent it from happening again, but why it took so long to restart the primary system is the question I would like to ask if I had the chance.

The system crashed, for 7 hours, because it failed to 'fail over', which is what it's supposed to do - take a look at thread 122 to understand why. This isn't an 'IT glitch', it's 'building a house with a trap door in the floor, with the same trap door behind it, and an identical trap door under that' - fall through one and you fall through all three!
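
The 'same trap door' point can be sketched in a few lines. This is only an illustration under stated assumptions: the class, the function names and the bug itself (choking on an empty waypoint list) are hypothetical and say nothing about how the real NATS software is built.

```python
from typing import Optional


class FlightPlanProcessor:
    """Hypothetical processor; primary and backup run this same code."""

    def __init__(self, name: str) -> None:
        self.name = name

    def process(self, plan: dict) -> str:
        # Assumed latent bug: the code takes the first waypoint without
        # checking that the list is non-empty, so an empty list raises.
        first = plan["waypoints"][0]
        return f"{self.name} accepted {plan['callsign']} via {first}"


def process_with_failover(plan: dict, replicas: list) -> Optional[str]:
    """Try each replica in turn; return None if every one of them fails."""
    for replica in replicas:
        try:
            return replica.process(plan)
        except Exception:
            # Fail over to the next replica and feed it the SAME input.
            continue
    return None  # no replica coped: fall back to manual handling


replicas = [FlightPlanProcessor("primary"), FlightPlanProcessor("backup")]
good_plan = {"callsign": "ABC123", "waypoints": ["WPT1", "WPT2"]}
bad_plan = {"callsign": "XYZ789", "waypoints": []}

print(process_with_failover(good_plan, replicas))
print(process_with_failover(bad_plan, replicas))
```

Failing over protects against a hardware fault on the primary, but an identical backup handed the identical input falls through the identical hole. Only a different implementation, or setting the offending plan aside before retrying, would break that chain.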

The other serious managerial error was stopping the backup that used to occur through the Prestwick Centre, in the same way that Eurocontrol does (see above), but 'the two centres went their separate ways...'

30th Aug 2023, 21:48

#153 (permalink)

Neo380

Join Date: Nov 2018

Location: UK

Posts: 82

Likes: 0

Received 0 Likes on 0 Posts

'Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don’t believe him.
I know from airline sources that the problem was known at 0800, 4hrs before NATS admitted to the problem. That’s why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4hrs later, around 1500, NATS announced they had fixed the problem. They hadn’t; the bug had washed out because the system holds about 4hrs of data... So CEO, please explain the delay.'

Exactly - he can't (without removing the cover up).

30th Aug 2023, 22:06

#154 (permalink)

kiwi grey

Join Date: Dec 2006

Location: Whanganui, NZ

Posts: 279

Likes: 43

Received 5 Likes on 4 Posts

Quote:

Originally Posted by Abrahn

This is clearly a plausible explanation, but at first glance it doesn't seem as complex a task as you suggest
Whilst there are clearly an infinite number of locations, lots of them can be ruled out simply. Anything outside of bounding boxes for Shanwick, Scottish and London can go immediately. Within that you've got bounding boxes for the areas controlled and only then do you actually have to start considering the real geography. The graph of controllers must be fairly small (otherwise the crew would be constantly on the radio) and the number of possible graphs isn't that big (in computer terms), so anything getting stuck should be detectable. Your satnav can cope with a much bigger problem.

(emphasis added)
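
The bounding-box pre-filter described in the quote can be sketched as follows. The box coordinates are very rough made-up rectangles, not real airspace boundaries, and the function names are hypothetical. It shows only the cheap first pass, not the real geometry that would follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    name: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)


# Illustrative rectangles only (assumed values, NOT the real FIR/OCA limits).
REGIONS = [
    BoundingBox("Shanwick", 45.0, 61.0, -30.0, -8.0),
    BoundingBox("Scottish", 53.0, 61.0, -10.0, 5.0),
    BoundingBox("London", 49.0, 55.5, -8.0, 3.0),
]


def candidate_regions(lat: float, lon: float) -> list:
    """Names of the boxes a point falls in; an empty list means 'ignore it'."""
    return [r.name for r in REGIONS if r.contains(lat, lon)]


def relevant_waypoints(route: list) -> list:
    """Keep only waypoints that land in at least one box."""
    return [wp for wp in route if candidate_regions(wp[1], wp[2])]


# Hypothetical route as (name, latitude, longitude) tuples.
route = [
    ("FAR_AWAY", 40.6, -73.8),   # outside every box, dropped immediately
    ("MID_OCEAN", 52.0, -20.0),  # inside the assumed Shanwick box
    ("NEAR_LON", 51.5, -0.1),    # inside the assumed London box
]
print(relevant_waypoints(route))
```

Each check is four comparisons, so points outside the area of interest are discarded for almost no cost, and only the survivors need the detailed sector geometry.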

The amount of data and the complexity of the calculations / algorithm to do comprehensive - even exhaustive - checking may indeed be relatively trivial for a modern system, but the NATS system doesn't seem like that from what I've read. It appears to be an older (ancient?) system re-hosted onto a more modern platform.
The NATS base system may, for example, be a 32-bit architecture and simply unable to use 'modern' amounts of memory (anything beyond 4GB) no matter the capacity of the hosting platform. Or it may be that (parts of) the system are inherently single-threaded, so that it matters not how many CPU instances / threads you throw at it, it only goes as fast as a single CPU. Or perhaps the overheads of emulating the original code on a less elderly system, which is in turn being emulated on a quite modern platform, are such as to ensure that the best you can hope for is 'quite a lot faster' than the original implementation, but nothing approaching a modern definition of 'high performance'.
Or if you're really lucky, all three apply.

The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.

31st Aug 2023, 00:13

#155 (permalink)

grizzled

Join Date: Dec 2007

Location: Itinerant

Posts: 828

Likes: 82

Received 79 Likes on 14 Posts

Quote:

Originally Posted by eglnyt

Eventually all will be in the public domain however much anybody tries to stop it so why would they bother to lie?

Maybe yes, maybe no. NATS is a public / private partnership so, as such, is it subject to the UK Freedom of Information Act?

31st Aug 2023, 02:18

#156 (permalink)

WillowRun 6-3

Join Date: Jul 2013

Location: Within AM radio broadcast range of downtown Chicago

Age: 71

Posts: 851

Likes: 2

Received 0 Likes on 0 Posts

Quote:

Originally Posted by kiwi grey

Difficult to dispute that description of a rip-out and replace project.

But would it be worse than the aftermath of a strong cybersecurity breach, or cyber attack? A version of, "if you think safety is expensive, try having an accident"

31st Aug 2023, 05:39

#157 (permalink)

Neo380

Join Date: Nov 2018

Location: UK

Posts: 82

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by grizzled

Maybe yes, maybe no. NATS is a public / private partnership so, as such, is it subject to the UK Freedom of Information Act?

You would think so, but as a PPP (not a publicly listed entity, or a government department/agency) NATS considers that it is NOT bound by the FOI Act, and therefore since the establishment of the Act (2000) 'picks and chooses' which FOI subject access requests (requests for information) it responds to. You can bet your pension that NATS won't be responding to FOI requests on this one!

31st Aug 2023, 05:45

#158 (permalink)

Neo380

Join Date: Nov 2018

Location: UK

Posts: 82

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by kiwi grey

Quote:

Originally Posted by WillowRun 6-3

NATS has already floated replacing the (1970s based!) Swanwick main ATC system, at a tentative £1bn+ price tag, that they will ask to be paid for from the public purse. The issue is making the case for a 'safety critical' main system, when you have said that you can 'prioritise safety', and manage (highly reduced!) flows safely by operating manually - a bit of a contradiction in terms.

31st Aug 2023, 07:31

#159 (permalink)

Asturias56

Join Date: Oct 2018

Location: Ferrara

Posts: 8,464

Likes: 441

Received 364 Likes on 213 Posts

Today's "Times" notes that they're upping the money they (the shareholders - the Govt & the airlines) take out as dividends and reducing the amount they planned to invest...

31st Aug 2023, 07:53

#160 (permalink)

eglnyt

Join Date: Oct 2004

Location: Southern England

Posts: 483

Likes: 0

Received 0 Likes on 0 Posts

NATS investment is not from the public purse. One of the reasons it was privatised was to remove its borrowing requirements from the accounts in the days when the Public Sector Borrowing Requirement figure was thought to be important and Government hadn't realised it could just keep printing money and nobody actually cared.
It has been investing a lot since privatisation and continues to do so. Whether all that investment has been wise is difficult to judge, but it's kept a lot of people employed for a while.
If it was NAS that failed, then NATS has been trying to replace that system since the 90s and has failed to do so. Because it's so integrated into all NATS systems, it's a bit like trying to replace the foundations on a tower block whilst keeping the block intact.
Having finally realised that couldn't be done, it embarked on replacing both NAS and the overlying systems a few years ago. It's a huge project that does literally rip it out & start again.
