
U.K. NATS Systems Failure



ORAC
28th Aug 2023, 11:05
U.K. NATS down, Major system failure.

Major disruption to both incoming and departing flights.

https://news.sky.com/story/network-wide-failure-of-uk-air-traffic-control-systems-loganair-reports-12949638

'Network-wide failure of UK air traffic control systems', Loganair reports

There has been a 'network-wide failure of UK air traffic control systems', according to the airline Loganair.

Sky News has had reports of passengers on other airlines outside of the UK being told that the air traffic control network is down and their flight will be delayed.

NATS, the national air traffic controllers, said in a statement: "We are currently experiencing a technical issue and have applied traffic flow restrictions to maintain safety. Engineers are working to find and fix the fault.

"We apologise for any inconvenience this may cause."

hunterboy
28th Aug 2023, 11:06
Hilarious…another example of the UK’s crumbling infrastructure. Don’t we build resilience in anymore?

Flying Wild
28th Aug 2023, 11:16
12+ hour slot delays in some cases.

Flying Wild
28th Aug 2023, 11:21
28/08/2023 11:12z
United Kingdom is experiencing a Flight Data Processing System failure.
Scottish and London FIRs are regulated at low rates with very high individual delays.

Currently there is no indication of when a solution for the failure will be available so no improvements for flights entering UK airspace are foreseen in the near future.

Further information will be published when available.

NMOC Brussels

Auxtank
28th Aug 2023, 11:24
Hacked?

https://www.bbc.co.uk/news/uk-66637156

Mind you - doesn't exactly look quiet ...

https://cimg8.ibsrv.net/gimg/pprune.org-vbulletin/1541x1017/atc12_fd25ff79a4ac71092a4e70a761c84ba8cc4526dd.jpg

hobbit1983
28th Aug 2023, 11:47
Yes, but flights will continue to destination if they're already airborne. Others won't be able to depart. Shouldn't it thin out over the next few hours?

wiggy
28th Aug 2023, 11:50
From Eurocontrol public portal at 1112 UTC..

United Kingdom is experiencing a Flight Data Processing System failure.
Scottish and London FIRs are regulated at low rates with very high individual delays.

Currently there is no indication of when a solution for the failure will be available so no improvements for flights entering UK airspace are foreseen in the near future.

Further information will be published when available.

mikemmb
28th Aug 2023, 12:05
Hilarious…another example of the UK’s crumbling infrastructure. Don’t we build resilience in anymore?

But surely when we build stuff, resilience is not in the current year's budget ……so just pass the parcel.

RetiredBA/BY
28th Aug 2023, 12:21
Is this possibly Russian Hacking after Ukrainian drones caused delays at Moscow’s airports ?

hobbit1983
28th Aug 2023, 12:23
Is this possibly Russian Hacking after Ukrainian drones caused delays at Moscow’s airports ?

I doubt it

sixgee
28th Aug 2023, 12:24
I for one am glad I’m not flying today, inbound on an RCF flight plan.

skidbuggy
28th Aug 2023, 12:26
https://cimg2.ibsrv.net/gimg/pprune.org-vbulletin/400x202/59ca03b6_3561_4ade_ac12_8a3e117381a5_d657b634b40ca5fff57fc20afbebe0b86ff41f1a.gif

RetiredBA/BY
28th Aug 2023, 12:44
I doubt it
Really? Seems too much of a coincidence that this has happened on one of the busiest days of the year, aviation-wise.
Can't remember such a serious system crash in my 25 years of civil flying, not counting French mischief!

kcockayne
28th Aug 2023, 12:45
Here in Jersey we are starting to see transatlantic westbound routing over us, which don’t usually come this way. So far, there have been at least 4 KLM EHAM departures overhead plus a Brussels A/L Washington flight. This almost never happens, even when NAT tracks are a bit southerly. Also apparent are German transatlantic, although this is fairly common with southerly tracks. Looks like we’ll see a lot of different a/c this afternoon & evening. Although, “see” is not quite accurate - as it is quite cloudy at the moment.

hobbit1983
28th Aug 2023, 12:47
Really? Seems too much of a coincidence that this has happened on one of the busiest days of the year, aviation-wise.
Can't remember such a serious system crash in my 25 years of civil flying, not counting French mischief!

I can't rule it out. But I'd apply Occam's razor; more likely to be Murphy's Law combined with a lack of resilience/cockup in IT systems...

bubbs
28th Aug 2023, 13:30
CFMU NOP now recommending operators consider cancelling flights from EG and EI airports to EU.

Merry Christmas 😬

MissChief
28th Aug 2023, 13:36
CFMU NOP now recommending operators consider cancelling flights from EG and EI airports to EU.

Merry Christmas 😬

CFMU? NOP?

Expatrick
28th Aug 2023, 13:39
CFMU NOP now recommending operators consider cancelling flights from EG and EI airports to EU.

Merry Christmas 😬

Not been CFMU for years!

Network Operations Portal.

drazziweht
28th Aug 2023, 13:45
Really? Seems too much of a coincidence that this has happened on one of the busiest days of the year, aviation-wise.
Can't remember such a serious system crash in my 25 years of civil flying, not counting French mischief!

Do you not remember Dec 12 2014 ?

Expatrick
28th Aug 2023, 13:46
Do you not remember Dec 12 2014 ?

And 2009?

Denti
28th Aug 2023, 13:48
CFMU? NOP?

Everyone in the business should know those acronyms, CFMU used to be the central flow management unit, NOP is the network operations portal, which you can access here (https://www.public.nm.eurocontrol.int/PUBPORTAL/gateway/spec/index.html).

Expatrick
28th Aug 2023, 13:51
Everyone in the business should know those acronyms, CFMU is the central flow management unit, NOP is the network operations portal, which you can access here (https://www.public.nm.eurocontrol.int/PUBPORTAL/gateway/spec/index.html).

CFMU was...

NMOC since about 2012 (I think).

Del Prado
28th Aug 2023, 13:54
Roma ACC experiencing high delays due to FDP failure also.

ATC Watcher
28th Aug 2023, 13:56
From the look of it so far, between the diversions and the cancellations, there are hundreds of aircraft which are going to be in the wrong place tomorrow when this is over; it will take days to recover. These outages show how tight the whole system is: a single computer (most likely software) failure anywhere and the whole network is in shambles. In just a few hours we reached over 1 million minutes of delay already :rolleyes:

N4565L
28th Aug 2023, 14:15
Not been CFMU for years!

Network Operations Portal.

Us oldtimers still refer to it as CFMU, proving you can't teach old dogs new tricks.

We are now looking at crew duty times running out & aircraft out of position, getting ugly.

blind pew
28th Aug 2023, 14:28
Was talking to an ex controller Saturday who is writing a masters thesis on the previous two major shutdowns. Jesus, the systems and backups are so complicated my brain gave up trying to understand it. The data links, no paper strips, looking at radar returns without secondary idents, and you just can't phone the bloke on the next sector.
Gone are the days of whacking a TV and wiggling the aerial.
Best of luck.

ORAC
28th Aug 2023, 14:43
NATS have announced that the fault has been identified and rectified.

The delays and knock-on effects will, of course, take several days to sort out.

piper mohawk
28th Aug 2023, 14:54
It is a 'rumour network' so my guesses are:

1. Software update which hasn't gone through the agreed change control process, or
2. Third party hosting provider outage, or
3. Hardware issue at a 'single point of failure', or
4. Undetected software bug.
All mitigatable given adequate resources.

Expatrick
28th Aug 2023, 15:04
From NMOC

Message from NATS regarding FPDS situation.


We are continuing to monitor the situation and improve this where possible.

London and Scottish are in a recovery process.
Sector regulations at reduced rates are applied.
However High delays continue during this period.

EGLL Arrivals is now available at a reduced rate.

Customers are requested to continue keeping EOBTs and flight plans up-to-date.

No further ATICCC customer teleconferences are planned. If you have any issues with individual flights please contact the UK Airspace and Capacity Management (ACM) team as normal.
We apologise for the impact of this technical failure to your operation and passengers. A thorough investigation is underway and we will update customers as soon as it has been completed.


NMOC Brussels on behalf of NATS

ShyTorque
28th Aug 2023, 15:04
Maybe someone tripped over a cable?

Flying Wild
28th Aug 2023, 15:07
Easyjet have apparently cancelled all UK outbound departures for the rest of the day as per an ACARS message to all aircraft at 1450Z

chevvron
28th Aug 2023, 15:24
It is a 'rumour network' so my guesses are:

1. Software update which hasn't gone through the agreed change control process, or
2. Third party hosting provider outage, or
3. Hardware issue at a 'single point of failure', or
4. Undetected software bug.
All mitigatable given adequate resources.
A software update was certainly the cause of one outage; the system wasn't 'dumping' FPL data and was overloading itself by randomly re-using 'old' flight plans; don't ask me how, I'm useless at computers.
NATS went over to electronic flight progress strips some years ago (about 2014, I think); however, they do have paper strips as backup.

RHagrid
28th Aug 2023, 16:00
More ATSA/ATCA's required obviously!!

compton3bravo
28th Aug 2023, 16:14
Sorry Flying Wild, an easyJet left Luton at 1640 en route to Palma.
Two words sum up this situation - broken Britain.

Apron Artist
28th Aug 2023, 16:21
147 departure cancellations from LGW (so far) and rising. My sympathies to my former colleagues...

Midland 331
28th Aug 2023, 16:40
I'm sure that, despite the details given in the link, the old LATCC IBM9020 is in a lock-up in West London, and could be fired up..... :-)

https://atchistory.files.wordpress.com/2016/05/ibm-9020-goodbye.pdf

kcockayne
28th Aug 2023, 16:51
Ah, the good old 9020. Even that used to go down, at times. I remember (I think it was on a Wednesday) having to revert to “manual” & write out all the Flt. Progress Strips by hand. Don’t think that it caused too many delays, though!
Just ATCOs trying to decipher what was written on them !
Incidentally, my son’s first flight on the A380 today. Chose a good day to start !

R T Jones
28th Aug 2023, 17:30
Two words sum up this situation - broken Britain.

Haha! I used this phrase at work in the last few days to describe how I felt about the U.K. at the moment…

Flying Wild
28th Aug 2023, 17:52
Sorry Flying Wild an easyJet left Luton at 1640 enroute to Palma.
Two words sum up this situation - broken Britain.
Should have clarified. The message related to Easy UK AOC aircraft. Was the LTN-PMI a G-reg?

SWBKCB
28th Aug 2023, 18:04
Should have clarified. The message related to Easy UK AOC aircraft. Was the LTN-PMI a G-reg?

Yes - G-EZUW operating U22331/EZY58RG. 10 EZY UK a/c currently airborne

CBSITCB
28th Aug 2023, 18:24
I'm sure that, despite the details given in the link, the old LATCC IBM9020 is in a lock-up in West London, and could be fired up.....

You don't know how close to the truth that is - VERY close!😁

Midland 331
28th Aug 2023, 18:47
Working with mission-critical legacy I.T. has taught me "Always give yourself a way back"....

Diff Tail Shim
28th Aug 2023, 18:51
More ATSA/ATCA's required obviously!!
An issue that affects a good day's operation. Screws up my lates more often than not.

Diff Tail Shim
28th Aug 2023, 18:52
Haha! I used this phrase at work in the last few days to describe how I felt about the U.K. at the moment…
Ditto. Nigel voted for it.

DogTailRed2
28th Aug 2023, 18:58
There have been several of these `outages` over the last few years. One would hope that there is an SLA with the people who supply and maintain the system and that any loss can be claimed? (no, I doubt that as well).
I wonder how many of these outages the UK can suffer before carriers, countries start to look at alternative ways of routing through, around UK airspace or even flying here at all?

Expatrick
28th Aug 2023, 19:07
As at 21:00 CET

UK Flights
1,699 Departed today
1,814 Landed today
242 Currently airborne
Handled yesterday 5,572

chevvron
28th Aug 2023, 19:44
Ah, the good old 9020. Even that used to go down, at times. I remember (I think it was on a Wednesday) having to revert to “manual” & write out all the Flt. Progress Strips by hand. Don’t think that it caused too many delays, though!
Just ATCOs trying to decipher what was written on them !
Incidentally, my son’s first flight on the A380 today. Chose a good day to start !
The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.

togsdragracing
28th Aug 2023, 19:48
The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.

In my mid-1980s tecchie days we used to joke that IBM stood for "It's being mended". ICL (if anyone remembers them) was "I can't logon".

Gupeg
28th Aug 2023, 20:08
https://www.caa.co.uk/media/r42hircd/nats-system-failure-12-12-14-independent-enquiry-final-report-2-0-1.pdf

14,863 "delay minutes", 353 flights (directly?) affected - will leave others to draw comparisons :oh:

WHBM
28th Aug 2023, 22:59
Interesting to compare the effects on the roughly 90 departures from Palma back to the UK, leaving there after 1100 local, when things seem to have set in. Compared by airline - of course, all airlines are treated equally by NATS ...

BA ran just 2 out of 8, both LCY flights, rest cancelled.

Easyjet ran 6 out of 18, rest cancelled.

TUI only had a few today, managed 2 out of 3.

Ryanair operated 17 out of 20.

and Jet2 managed 35 out of 37.

Now it's just FR24 data, which is often not spot on. But there is a general trend. And there were a number of significant delays, but at least things got away. So why did BA seem to scrub pretty much everything (a trend seemingly repeated out of London to all over Europe)? Easyjet likewise. How did Jet2, the biggest Palma-UK operation of the day, manage to do almost everything? How did Ryanair manage similar?

Link Kilo
29th Aug 2023, 07:16
Government sources and aviation officials ruled out a cyberattack. Sources suggested the issue could be the result of an incorrectly filed plan by a French airline, although Nats would not comment.

https://www.thetimes.co.uk/article/uk-air-traffic-flight-delays-latest-news-bank-holiday-travel-2023-0s56fq8z2

ATC Watcher
29th Aug 2023, 07:34
Sources suggested the issue could be the result of an incorrectly filed plan by a French airline,
Ah, the usual convenient UK excuse :) but regardless of nationality, this is an interesting comment, since in Europe all flight plans are centralized and go through IFPS, checked and redistributed to the ATC centers by Eurocontrol. So if there was an initial filing error by the airline it should have been spotted at IFPS level; but assuming it was not, then every other ATC center receiving the FPL should have been affected, and if not, why only the UK system?

callum91
29th Aug 2023, 08:09
It’s a pretty vulnerable system if it can be brought down by filing an incorrect FPL.

Sneezy24
29th Aug 2023, 09:24
CFM (maintenance company) - 'Can't Fix Much'
STC (Storage Technology Corporation) - 'Short-Term Cowboys'
.. and probably many more!

pax britanica
29th Aug 2023, 09:43
ATC Watcher, was the source the Daily Express, as they blame the French for everything?

Having spent 40-plus years in telecoms, we seem in that industry to have avoided catastrophes on this level, but by and large are very focussed on avoiding any possible single-source, single-mode failure that impacts a wide area. It would seem absurd that something on this scale could be triggered by an erroneous flight plan, something that must happen a dozen times a day or more across Europe just through typos or errors. NATS has a complicated ownership structure, but one must suspect the usual cheeseparing, penny-pinching UK management style has something to do with this having such serious impacts.

PB

Uplinker
29th Aug 2023, 10:28
............Can't remember such a serious system crash in my 25 years of civil flying, not counting French mischief!

Back in the '70's, my Dad, who was then a senior ATCO at West Drayton, had to "ring up France" one day and tell them not to let anything take off, after a UK ATC system failure. So not a new or unique phenomenon, just very very rare.

Anybody who has attended an ATC pilot liaison day in Swanwick, cannot have failed to be impressed by our UK ATC. Working in their ATC simulator, we pilots deliberately selected incorrect headings and wrong altitudes etc, ha ha ha !,.......but all were picked up within moments. Oh....... They are actually very good these guys and gals.

Then, in a later exercise the simulator instructors failed the main ATC computer, leaving the ATC crew back to paper strips and primary radar. The way the UK ATC staff calmly coped was very impressive, and quite humbling to be honest.

Timmy Tomkins
29th Aug 2023, 10:30
ATC Watcher, was the source the Daily Express, as they blame the French for everything?

Having spent 40-plus years in telecoms, we seem in that industry to have avoided catastrophes on this level, but by and large are very focussed on avoiding any possible single-source, single-mode failure that impacts a wide area. It would seem absurd that something on this scale could be triggered by an erroneous flight plan, something that must happen a dozen times a day or more across Europe just through typos or errors. NATS has a complicated ownership structure, but one must suspect the usual cheeseparing, penny-pinching UK management style has something to do with this having such serious impacts.

PB
My thoughts also, and I recall this being predicted when NATS was privatised. Ironically some airlines own part of it, so 'shot in own foot' comes to mind.

classicwings
29th Aug 2023, 10:51
Back in the '70's, my Dad, who was then a senior ATCO at West Drayton, had to "ring up France" one day and tell them not to let anything take off, after a UK ATC system failure. So not a new or unique phenomenon, just very very rare.

Anybody who has attended an ATC pilot liaison day in Swanwick, cannot have failed to be impressed by our UK ATC. Working in their ATC simulator, we pilots deliberately selected incorrect headings and wrong altitudes etc, ha ha ha !,.......but all were picked up within moments. Oh....... They are actually very good these guys and gals.

Then, in a later exercise the simulator instructors failed the main ATC computer, leaving the ATC crew back to paper strips and primary radar. The way the UK ATC staff calmly coped was very impressive, and quite humbling to be honest.

Love it!! Back in the 90's, my Dad, who was then an approach radar ATCO at West Drayton told me when he occasionally used to practice doing SRA's with pilots, he was certain that some crews had the ILS dialled up and followed that rather than ATC instructions! :E:E:ok:

BigDaddyBoxMeal
29th Aug 2023, 10:54
Government sources and aviation officials ruled out a cyberattack. Sources suggested the issue could be the result of an incorrectly filed plan by a French airline, although Nats would not comment. https://www.thetimes.co.uk/article/uk-air-traffic-flight-delays-latest-news-bank-holiday-travel-2023-0s56fq8z2

"Incorrect flight plan" Either I'm getting de ja vu, or we've heard that excuse before?

NWSRG
29th Aug 2023, 11:06
Don't believe the incorrect flight plan malarky...seems an incredulous reason for a whole system failure. And if it was cyber, would they actually admit it?

Truth is probably much more mundane though...in my old company, the entire IT infrastructure was collapsed for a number of days after an employee tripped in the server room and landed against an emergency stop button (you could not have written it, but it was captured on CCTV). The resulting shutdown was 'dirty' and caused absolute chaos...

CBSITCB
29th Aug 2023, 11:13
Flight Plans (FPLs) are filed in ICAO standard format. Part of the FPL is the route. Each national ANSP must, within its own specific flight plan processing system (FPPS), convert the FPL route into a route expressed in terms of its own national airspace computer model.

The UK airspace model is built using the same architecture as the US one. It is part of the National Airspace System (NAS), which the UK obtained from the US FAA in the 70s. The UK NAS airspace architecture is essentially the same today as it was in the 70s, though of course the model itself has changed to reflect the changing airspace over time.

A key part of the NAS software is a sub-programme called Route Conversion and Posting. This converts the ICAO FPL route into the internal format of the FPPS. It then determines which sectors (ATCOs) need to be provided with the FPL information (the “posting” part). NATS does not publish details of this sub-programme, but the FAA does. It is documented in a volume called NAS-MD-312. These details have naturally diverged over time, but the essentials remain the same.

To quote NAS-MD-312 “The 3-dimensional volumes of airspace that comprise an [airspace model] are described by points and lines with a specified altitude range for each. These volumes of airspace are fix posting areas (FPAs). Geographic points are described in source information in terms of latitude and longitude in units of degrees, minutes, and seconds and are converted to conventional X and Y coordinates, in units of one-eighth mile, and stored in that form. A boundary line is described by its geographic end points. Since each line segment has a specified altitude range adapted to it, a series of connected lines is used to describe a 3-dimensional volume of airspace. An FPA is the fundamental unit of airspace within the [airspace model]. Other volumes of airspace within a centre, such as sector or approach control areas, are described in terms of FPAs that comprise them. A fix posting area is a volume of airspace identified by a series of connected line segments that form a polygon when viewed in the horizontal and vertical plane, with each boundary line having a specified altitude range. The polygon may be convex or concave, permitting a variety of geometric shapes.”

This process of route conversion is obviously a very complex exercise. I understand at least one major UK NAS outage in the past was caused by errors in this process. Someone had managed to input an FPL route that passed NAS route validation (described in NAS-MD-311 Message Entry and Checking) but “did not compute” when route conversion was attempted. Of course, all possible errors should be trapped, but…

A flavour of the complexity of route conversion can be had here:
https://www.tc.faa.gov/its/worldpac/standards/nas/nas-md-312.pdf

I am not saying such a problem was the cause of the outage yesterday – though it could be. It is just some background to the way NATS processes flight plans.
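
To make the geometry above concrete, here is a minimal sketch (Python, invented names, a crude flat-earth conversion - nothing to do with the real NAS code) of turning a lat/long into the eighth-mile X/Y grid NAS-MD-312 describes and testing it against a fix posting area defined as a polygon plus an altitude band:

import math

# Illustrative only - not NATS/FAA code. Converts a lat/long to a grid measured
# in eighths of a mile (as NAS-MD-312 describes) and checks whether a point sits
# inside a fix posting area (FPA): a polygon with an altitude range attached.

EIGHTHS_PER_NM = 8.0   # grid units per nautical mile; the real scaling may differ

def to_grid(lat_deg, lon_deg, origin_lat, origin_lon):
    """Crude flat-earth conversion of a geographic point to X/Y grid units."""
    y = (lat_deg - origin_lat) * 60.0 * EIGHTHS_PER_NM
    x = (lon_deg - origin_lon) * 60.0 * math.cos(math.radians(origin_lat)) * EIGHTHS_PER_NM
    return round(x), round(y)

def point_in_polygon(x, y, poly):
    """Standard ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    j = len(poly) - 1
    for i, (xi, yi) in enumerate(poly):
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def in_fpa(point_xy, flight_level, fpa):
    """fpa = {'poly': [(x, y), ...], 'fl_min': int, 'fl_max': int}."""
    x, y = point_xy
    return fpa["fl_min"] <= flight_level <= fpa["fl_max"] and point_in_polygon(x, y, fpa["poly"])

fpa = {"poly": [(0, 0), (100, 0), (100, 100), (0, 100)], "fl_min": 55, "fl_max": 245}
print(in_fpa(to_grid(51.6, -0.5, 51.5, -0.6), 180, fpa))   # True for this toy FPA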

Equivocal
29th Aug 2023, 11:15
There have been several of these `outages` over the last few years. One would hope that there is an SLA with the people who supply and maintain the system and that any loss can be claimed? (no, I doubt that as well).
I wonder how many of these outages the UK can suffer before carriers, countries start to look at alternative ways of routing through, around UK airspace or even flying here at all?
Maybe this illustrates part of the problem. The first reaction seems to be who to blame and can we claim compensation. Not invalid questions, but in my experience, which includes ATM systems, the problems often arise from failing properly to define what the system must do and to manage/handle errors and unexpected data in a robust way. Time will tell what went wrong yesterday, but I would be surprised if it wasn't something that could have been predicted (but maybe wasn't handled in a managed way).

Ninthace
29th Aug 2023, 11:17
We had an MoD system brought to a shuddering halt because of a single email sent by one user. OK, she managed to somehow Ctrl-A, Ctrl-C, Ctrl-V the entire address book, including dist lists, into the To line and add a read and delivery receipt. The resultant cascade, plus all the out-of-office responses that were de rigueur in those days, froze the system to the point where we could not even talk to it. IIRC it took 2 days to effect a full recovery.

JumpJumpJump
29th Aug 2023, 11:43
Have we ruled oit alien invasion?

Ninthace
29th Aug 2023, 11:45
Have we ruled oit alien invasion?
I am not sure. Is there intelligent life on Oit?

CBSITCB
29th Aug 2023, 11:46
Just listening to the usual journalistic blather about "why no backup system?". Of course there is an online backup system to automatically switch to, but this almost exclusively caters for only hardware errors. In the case of the "erroneous FPL" scenario I described earlier <https://www.pprune.org/rumours-news/654461-u-k-nats-systems-failure.html#post11493467> it was a problem in the software logic, and the backup system runs the same software.

What happened (AIUI) was the NAS programme "FLOPed" (Functional Lapse of the Operational Programme) – aka crashed. The programme was restarted successfully, but when the recovery data (the data that is in the system – including FPLs – that is recorded as a backup from time to time) was read in it FLOPed again. The rogue FPL was in the recovery data.

Eventually the system was restarted using a recovery data set recorded before the erroneous FPL was entered. Phew, everything hunky-dory again.

Until the originator of the rogue FPL realised it was no longer in the system so re-entered it....:D
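
A toy model of that failure mode, for anyone who wants to see it in miniature (Python, entirely invented names - this is not how NAS recovery actually works): if the checkpoint used for restart still contains the record that crashes the processor, every restart dies the same way until that record is quarantined.

# Toy model of a restart loop whose checkpoint contains a "poison" record.
# Not NAS code - just shows why restoring from recovery data that includes the
# offending flight plan reproduces the crash, and why quarantining it helps.

class FlopError(Exception):
    """Stand-in for the processor crashing on one particular record."""

def process(fpl):
    if fpl.get("route") is None:          # the hypothetical un-convertible plan
        raise FlopError(f"route conversion failed for {fpl['callsign']}")
    return f"posted {fpl['callsign']}"

def restart_from_checkpoint(checkpoint, quarantine=None):
    quarantine = quarantine or set()
    posted, rejected = [], []
    for fpl in checkpoint:
        if fpl["callsign"] in quarantine:
            rejected.append(fpl["callsign"])  # set aside for manual handling
            continue
        posted.append(process(fpl))          # FlopError here = another full crash
    return posted, rejected

checkpoint = [
    {"callsign": "ABC123", "route": "EGGW..CPT..EGLF"},
    {"callsign": "XYZ999", "route": None},   # the rogue plan, captured in the checkpoint
]

try:
    restart_from_checkpoint(checkpoint)      # first restart: crashes again
except FlopError as e:
    print("restart failed:", e)

print(restart_from_checkpoint(checkpoint, quarantine={"XYZ999"}))  # recovers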

Noxegon
29th Aug 2023, 11:51
So why did BA seem to scrub pretty much everything (a trend seemingly repeated out of London to all over Europe ? )

Perhaps they were told that they could operate X number of flights and they decided to prioritize their long haul operation?

pax2908
29th Aug 2023, 12:22
CBSit, somebody mentioned that something similar ("bad FPL") happened in the past. You would then expect that particular FPL to be identified soon after the first restart (and second crash) so that it's not re-entered a third time?

togsdragracing
29th Aug 2023, 12:41
Until the originator of the rogue FPL realised it was no longer in the system so re-entered it…:D

An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.

WB627
29th Aug 2023, 12:46
OK, for what it's worth, my money is on someone digging through the main fibre-optic cable into Swanwick, somewhere outside the site. It happens. It would explain why the backup system didn't work - no comms in/out of the building. With infrastructure that critical, there should be a second FO cable with a completely different route to Swanwick and a different entry point into the building.

Masterful decision to blame the French :ok:

CBSITCB
29th Aug 2023, 12:46
The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.

Whilst the 9020 was indeed based on IBM S/360 technology it was rather more than "just six 360s". The 9020 was being designed for the FAA well before the new S/360 line was announced. In fact, this was a major stumbling block (overcome) as the FAA contract specified "off-the-shelf" hardware.

The 9020 CPUs were based on what was to become the S/360 commercial product line but they incorporated many unique hardware features to facilitate their intended use as an ATC system. These included multiprocessing (a simplified version later added to some ‘standard’ processors), address translation, and specific ATC instructions.

When the UK 9020 was retired in 1989 it was coping very well. Capacity limitations were on the horizon, which is the main reason it was replaced, but not yet causing delays. The hardware was still reliable, though maintainability was becoming difficult. It was a source of pride to the engineers concerned that when it was shut down for the final time the whole system was 100% serviceable.

I am referring to the 9020 system itself, not the NAS software or station power supplies.

CBSITCB
29th Aug 2023, 12:50
CBSit, somebody mentioned that something similar ("bad FPL") happened in the past. You would then expect that particular FPL to be identified soon after the first restart (and second crash) so that it's not re-entered a third time?
I think that's exactly what happened.

CBSITCB
29th Aug 2023, 12:52
With infrastructure that critical, there should be a second FO cable with a completely different route to Swanwick and a different entry point into the building.
There is a redundant communication path.

golfbananajam
29th Aug 2023, 12:54
The problem with testing software is that you can't test all combinations of input values to ensure the required output values are correct, certainly not in very large or complex systems. Failure testing is often limited to defined alternate-path (within the software) testing as defined in the requirements/specification. Edge cases will always catch you out.

With that in mind, critical systems like this should always fail safe, ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time.

Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash.

I wonder how often, if ever, business continuity testing is performed which should have enabled quick recovery.
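
One cheap way of chasing those edge cases is to throw large volumes of random input at the validator and assert that it only ever rejects cleanly, never raises. A minimal sketch, with parse_route a made-up stand-in for whatever the real front-end check is:

# Minimal fuzz-style check: whatever garbage comes in, the parser must return a
# clean accept/reject result rather than raise. parse_route is a made-up stand-in.
import random
import string

def parse_route(text):
    """Hypothetical validator: returns (ok, reason); must never raise."""
    try:
        if not text or any(c not in string.ascii_uppercase + ". " for c in text):
            return False, "invalid characters"
        if ".." not in text:
            return False, "no fix-to-fix separator"
        return True, ""
    except Exception as exc:              # belt and braces: convert surprises to rejects
        return False, f"internal error: {exc}"

def fuzz(iterations=10_000):
    alphabet = string.printable
    for _ in range(iterations):
        junk = "".join(random.choice(alphabet) for _ in range(random.randint(0, 40)))
        ok, reason = parse_route(junk)    # must not raise, whatever we feed it
        assert isinstance(ok, bool)

fuzz()
print("parser survived fuzzing")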

WB627
29th Aug 2023, 12:58
There is a redundant communication path.

Well in that case, it must have been the French ;) more revenge for Brexit :oh:

CBSITCB
29th Aug 2023, 13:12
An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.
There is, and it is extensive.

It would have been better to have called it the trigger FPL rather than the erroneous FPL. It was “bad” in that it caused the crash, but not necessarily incorrectly formatted. I don’t recall the precise details of the event.

The 28-day AIRAC cycle introduces new airspace, reporting points, airfields, etc.. Thus the set of valid FPL routes also changes every 28 days. This means the NAS software has to be modified (“adapted”) to accept and process the new reporting points and routes.

So the FPL probably got through the NMOC IFPS (or whatever the equivalent was back then) as a valid route, and the error was in the updated UK route conversion process adaptation.
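
A sketch of that mismatch (invented fix names, nothing like the real adaptation data format): the same route checked against the network's current data and against a local adaptation set that hasn't caught up.

# Toy illustration: the same route checked against two adaptation data sets.
# If the local set lags (or was mis-adapted) a plan that IFPS accepts can still
# fail local route conversion. All identifiers are invented.

ifps_points = {"CPT", "WOD", "OCK", "NEWFX"}      # network data, current AIRAC
local_points = {"CPT", "WOD", "OCK"}              # local adaptation missing NEWFX

def convertible(route, known_points):
    fixes = [tok for tok in route.replace("..", " ").split() if tok]
    unknown = [f for f in fixes if f not in known_points]
    return (len(unknown) == 0), unknown

route = "CPT..NEWFX..OCK"
print(convertible(route, ifps_points))    # (True, [])   - passes network validation
print(convertible(route, local_points))   # (False, ['NEWFX']) - fails local conversion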

moosepileit
29th Aug 2023, 13:28
https://cimg1.ibsrv.net/gimg/pprune.org-vbulletin/640x640/just_kidding_airplane_f0ef791e67d2dc3c3937f8c8943b95d072aa70d7.gif
y'all stop watching the classic aviation movies across the pond?

CBSITCB
29th Aug 2023, 13:37
y'all stop watching the classic aviation movies across the pond?
No - https://www.pprune.org/rumours-news/654461-u-k-nats-systems-failure.html?ispreloading=1#post11492878

DaveReidUK
29th Aug 2023, 14:00
Don't believe the incorrect flight plan malarky...seems an incredulous reason for a whole system failure.

Neither do I - it's inconceivable.

Murty
29th Aug 2023, 14:20
Adding to CBSITCB's excellent run-down (no pun intended) of the system, and answering a previous question by someone on the thread.

All airlines/operators do indeed submit flight plans in advance for flights in Europe to IFPS (Integrated Initial Flight Plan Processing System), primarily located at Haren, with back-up at Bretigny.

Haren then forwards the flight plans to each ANSP (sovereign state) 4 hours before the flight reaches each FIR boundary.
It has to be noted that although most units are heading towards free filing direct, this is still not practical in some of the UK low level airspace. Unfortunately the IFPS system can be "got round" by operators by filing DCT point to point, providing it is not too far, but this is usually caught by Swanwick's FPRS system.
When a flight plan arrives in the FPRS system, if everything is correct it is automatically processed; on a normal day this is approx 87-90%, leaving the team to manually process the remaining 13% or more into the system.

13% of FPLs is still a large amount in a day, and these are a mixture of complex military, air tests, many badly filed GA flights in CAS, and badly filed airline flight plans - and there are still many of them, especially from companies that get an agency to file their plans for them, some of the worst in recent years (EKBICPUF) being PPS and RocketRoute.
It should be noted that the way flight plans are processed into the UK NAS system uses an old, unique input format using "dots" between fixes and routes, which is why only routes that match what's in the UK SRD (Standard Route Document) are processed automatically.
e.g. a flight from EGGW to EGLF (not far), simplified:
Filed would be EGGW DCT CPT DCT EGLF
Into NAS: EGGW..CPT..EGLF
This is the same all over the network: TWO dots between 2 fixes or 2 airways, or ONE dot between fix and airway, airway and fix.


It sounds (theory at present) like the rogue FPL would be one of the 13%, as our system should reject the bad ones, so possibly it was input manually.

That said, when the system failed it would hold 4 hours of traffic, unless they were late files, changes, cancellations and refiles.

This is when the FPRS team (4 normally, with possibly 3 others on shift elsewhere) would be building up to input THE OTHER 87-90% of flight plans, plus all other messages, e.g. DLA (and you can guess the number of these, though by prioritising these may just have to be left), CHG messages, etc.

These staff would also need breaks.
The resilience question is a fair one: in the past Prestwick Centre had their own FPRS suite, and the CTC at Whiteley had an emergency system in the depths of that building back in 2014.
Fortunately or unfortunately it was decided to have a "One Centre" policy to centralise the planning, especially as Scottish went for the more direct filing that has also now come into the London FIR (west of London) high level.

All this said, as CBSITCB stated, all these newer systems are being added to the 1970s-framework NAS system that was supposed to be replaced but appears to have a never-ending lifespan (until - dread to think), with a new Combined Ops room that lies idle still at Swanwick, years overdue in a way CalMac ferries would be proud of.
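
For illustration, a rough sketch of the dot convention Murty describes (Python, with a deliberately naive way of telling fixes from airways - not NATS code): DCT between two fixes becomes two dots, and a fix joined to an airway, or an airway to a fix, gets a single dot.

# Rough converter from a filed ICAO route string to the NAS "dot" style described
# above: two dots between consecutive fixes (or consecutive airways), one dot
# between a fix and an airway. Token classification here is deliberately naive.

def is_airway(token):
    # crude heuristic: airways like "L9", "UL9", "N866" mix letters and digits
    return any(c.isdigit() for c in token)

def icao_to_nas(dep, route, dest):
    tokens = [t for t in route.split() if t != "DCT"]   # DCT is implied by ".."
    items = [dep] + tokens + [dest]
    out = [items[0]]
    for prev, cur in zip(items, items[1:]):
        same_kind = is_airway(prev) == is_airway(cur)
        out.append(".." if same_kind else ".")
        out.append(cur)
    return "".join(out)

print(icao_to_nas("EGGW", "DCT CPT DCT", "EGLF"))   # EGGW..CPT..EGLF
print(icao_to_nas("EGGW", "CPT L9 WOD", "EGLF"))    # EGGW..CPT.L9.WOD..EGLF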

Kiltrash
29th Aug 2023, 14:32
I once had to deal with a major retail park where all the stores were reporting a similar problem at the same time. Their card transactions were not going through. I got called out at silly o'clock to go there and have a look as the techy boys off site could not see the problem....
it was a wireless link to a 3G mast that someone had built a building near and blocked the signal...
Solution was a fibre cable run of about 3 miles... So we dug two different routes in case someone else decided to put a digger through one... and yes, someone did 😊

Ninthace
29th Aug 2023, 15:02
Neither do I - it's inconceivable.
Why inconceivable? Do none of the explanations of how it might happen, offered by other posters, persuade you?
Do you have another idea that we might discuss?

pax britanica
29th Aug 2023, 15:22
Multiple fibre routes to mission-critical buildings have been pretty much the norm for many years now; even if the building owner doesn't spec it, the comms provider is likely to insist. Sometimes there are more than two routes, but they do need to be clearly separated internally and externally.

CBSITCB gave a very eloquent and concise description of non-hardware failure modes, and software is usually a more difficult fix than hardware. However, in a really mission-critical situation a huge amount of effort has to go into software resilience - think Airbus - and for that reason it would seem close to nonsensical that mis-entered flight plans could cause such chaos.
I have worked with telecoms network management and switching systems that had very high levels of resilience; at the national or regional level of the telecoms network, including in today's world internet hardware, things really cannot be allowed to fail. To be fair they do not have the functional complexity of a system like NATS has; they mostly just switch paths around and manage digital data flows. We did have one system that used a comparator function, where two processors controlling one system constantly compared operating states. If a failure mode occurred this would determine which system had in some way been corrupted and switch it out, leaving the unchanged processor in charge. However, we did have a short outage due to the comparator failing, so nothing's completely foolproof. Personally I think the Daily Mail and Express sabotaged it to stop people travelling to Europe, or vice versa.
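
For the curious, a minimal sketch of that comparator idea (generic Python, invented fault, not the actual telecoms kit): two channels compute the same answer, a comparator checks they agree, and on a mismatch the channel disagreeing with a reference check is switched out.

# Toy comparator / duplicated-processor scheme: two channels compute the same
# output; a mismatch flags a fault and the channel disagreeing with a reference
# check is taken offline. Purely illustrative; real systems differ.

def channel_a(x):
    return x * 2

def channel_b(x):
    return x * 2 if x != 7 else 13           # injected fault for demonstration

def reference(x):                             # simplified arbiter
    return x * 2

def compare_and_select(x):
    a, b = channel_a(x), channel_b(x)
    if a == b:
        return a, "both healthy"
    good = reference(x)
    if a == good:
        return a, "channel B switched out"
    return b, "channel A switched out"

for value in (3, 7):
    print(value, compare_and_select(value))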

grizzled
29th Aug 2023, 16:07
Why inconceivable? Do none of the explanations of how it might happen, offered by other posters, persuade you?
Do you have another idea that we might discuss?

DRUK said he doesn't believe the reason provided by NATS. I too find it inconceivable in this instance. See ATC Watcher's post #52 for an explanation of why NATS statement is at least highly suspect. Many of us who have significant experience in this realm (as does ATC Watcher) would call "BS" on NATS in this case.

eglnyt
29th Aug 2023, 16:20
If we assume the problem was with NAS then the explanation of an "unusual" flight plan as the initiator is not inconceivable, that system has a 40 year history of similar events. Read the lengthy contributions above from those with an understanding of the system to see why. What that doesn't explain is why the controls & processes which have controlled that risk for the last 20 years didn't yesterday.

c52
29th Aug 2023, 16:30
I worked on a system at a large airport that crashed every time the QNH reached something like 1037.5. A rare enough event and I am full of admiration for whoever it was who spotted the link between all the failures. I would have said it was a system that had no need to know about QNH.

grizzled
29th Aug 2023, 16:31
that system has a 40 year history of similar events.

Hint as to the root cause of this event: If your statement is correct then the root cause of the system failure is NOT a specific "French airline's flight plan".

CBSITCB
29th Aug 2023, 16:35
Don't believe the incorrect flight plan malarky...seems an incredulous reason for a whole system failure. And if it was cyber, would they actually admit it?
Neither do I - it's inconceivable.

Saving this post just in case I need to refer to it later...;)

fitliker
29th Aug 2023, 16:38
Did you turn it off and on again ?
As per the IT Crowd best practices advice

eglnyt
29th Aug 2023, 16:50
Hint as to the root cause of this event: If your statement is correct then the root cause of the system failure is NOT a specific "French airline's flight plan".
Correct. I have no idea yet whether it was an issue with NAS or a French flight plan. But if that was the case it is a, probably unusual, flight plan that has been through multiple layers of validation that happens to expose an issue in the FDP system. The flight plan is the initiator and, using some methods, a cause but not the real issue.

pax britanica
29th Aug 2023, 17:10
Government's immediate response: 'It's not a cyber attack.'

NATS' somewhat later response: 'We don't know what caused it yet.'

Perhaps HMG jumping the gun, i.e. nothing to do with us.

Can out-of-pocket pax sue NATS? Are members of the 'Airline Group', owning 42% of NATS, liable even if the airlines are not; after all, they are a significant part of the management.

eglnyt
29th Aug 2023, 17:42
Can out-of-pocket pax sue NATS? Are members of the 'Airline Group', owning 42% of NATS, liable even if the airlines are not; after all, they are a significant part of the management.

I'm no lawyer but there is no contract between NATS and the pax so what would be the basis of that action? There is a contract between the airline and the pax but what pax can claim is limited by legislation. There is a contract between NATS and the Airline but as the charging regime provides for penalties it would be unlikely that the airlines could sue NATS.

This is all quite deliberate. NATS was privatised just after the Hatfield rail crash where penalties for service standards had produced an adverse effect on safety. The Government of the day wanted to avoid any financial imperative that would have any effect on day to day operational decisions affecting safety. Hence yesterday the flow regulation could be imposed purely with regard to safety.

Note that the Airline Group is a separate entity from the Airlines who hold shares in it, indeed not all shareholders in the Airline Group are airlines. The Airline Group is a PLC limiting the liability of its shareholders and in turn NATS is a collection of Limited Companies limiting the liability of its shareholders.

Private jet
29th Aug 2023, 17:55
As I recall, Eurocontrol had then/has now, their system totally backed up, duplicated in two different control centres. As others have said, deteriorating infrastructure in the UK (or rather lack of it from the get go), but who pays to improve it & why should they pay???

Equivocal
29th Aug 2023, 18:08
This is all quite deliberate. NATS was privatised just after the Hatfield rail crash where penalties for service standards had produced an adverse effect on safety. The Government of the day wanted to avoid any financial imperative that would have any effect on day to day operational decisions affecting safety. Hence yesterday the flow regulation could be imposed purely with regard to safety.

This takes me back….. I had a peripheral involvement with the original NATS licence and I can recall lots of debate about closure of airspace - principally that the civil service/government people never wanted UK airspace to be closed (unless they said so, of course). Lots of people involved in day to day running things said that there were very rare, but valid reasons that this might happen. But applying a 0 aircraft per hour flow rate was fine. Is this what they call spin?

Link Kilo
29th Aug 2023, 18:13
Some interesting discussion on this forum: https://forums.theregister.com/forum/all/2023/08/28/uk_flights_disrupted/

Longtimer
29th Aug 2023, 18:15
UK air travel disruption may last for days, says British transport minister Mark Harper
https://www.firstpost.com/world/uk-air-travel-disruption-may-last-for-days-says-british-transport-minister-mark-harper-13053662.html

eglnyt
29th Aug 2023, 18:17
As I recall, Eurocontrol had then/has now, their system totally backed up, duplicated in two different control centres. As others have said, deteriorating infrastructure in the UK (or rather lack of it from the get go), but who pays to improve it & why should they pay???

Would that be the system that failed in 2018 when they thought it had switched to the other centre but hadn't, and because the phones didn't switch over as expected, nobody could ring them to tell them?

NATS has a lot of redundancy and duplication and is currently investing in even more but please read the great explanations by others above as to why it might not always help.

Private jet
29th Aug 2023, 18:29
Would that be the system that failed in 2018 when they thought it had switched to the other centre but hadn't, and because the phones didn't switch over as expected, nobody could ring them to tell them?

NATS has a lot of redundancy and duplication and is currently investing in even more but please read the great explanations by others above as to why it might not always help.

Thankyou, & with respect, point taken!

DaveReidUK
29th Aug 2023, 19:23
Some interesting discussion on this forum: https://forums.theregister.com/forum/all/2023/08/28/uk_flights_disrupted/

Hmmm. That's 10 minutes of my life I'll never get back, and I'm none the wiser. :O

ATC Watcher
29th Aug 2023, 20:02
As far as I know, at this time no-one really knows what caused the system to crash - the root cause, I mean - but likely by tomorrow or in the coming days we'll know. From my experience over the last 40 years or so, flight plan processing systems (FDPs) generally crash following a system update: one line of programming is wrong, and when an external factor comes in it causes the issue. This could happen at any time, generally when the system is peaking. Typically system updates are done at night, tested for a few hours, then if OK put on line in the morning - at least that's how we do it in most centers. Done it for years in my own center. If it crashed later in the day we just reverted to the previous level, which is on standby on the backup computers; the whole thing takes no more than minutes, an hour max, to be back to normal. When it takes half a day or more then something is wrong with your processes or your system architecture. Could also simply be the result of cost-cutting measures, like not replacing backup computers, or outsourcing maintenance and code writing to far-away countries with cheaper labor, etc. I am not saying that this was the case here in NATS, but I have seen this happening in other places recently.

Finally, FDP system failures are not a unique UK/NATS issue: Geneva had a major failure some time ago, Brussels a couple of years back, etc. Even Roma had one yesterday at the same time as London, so we feared a wider cyber attack. But so far it looks like the two were not connected. But if the investigation later shows they were, then we really are in the sh*t.

Janet Spongthrush
29th Aug 2023, 21:51
As far as I know, at this time no-one really knows what caused the system to crash - the root cause, I mean - but likely by tomorrow or in the coming days we'll know

I'm a little confused by what you say. As mentioned upthread, it appears that the issue was a corrupt message from IFPS that the local (NATS) automated flight plan processing system couldn't reconcile. So it fell over. Without automated processing of flight plan data the NATS systems can't distribute flight data to the required sectors thus manual reversion and manual processing of flight plans is required.

Clearly, why a single point of failure has such an impact is a reasonable question, especially as several similar events have previously occurred (google for public domain examples of investigations). One hopes that e.g. a corrupt datalink message to an A350 wouldn't cause reversion to manual operation.

eglnyt
29th Aug 2023, 22:01
I would hope that a number of people had a very good idea of what happened & why before the decision was made to remove flow yesterday afternoon. If not then NATS crisis management has changed for the worse since 2014. They will hopefully share that with the wider populace in due course although I'd love to be in the room when the Transport Secretary is presented with the report if it's as technical as they normally are.

SLXOwft
29th Aug 2023, 22:06
Not sure when this was released; lots of puff, little substance - report to Sec of State for Transport to be delivered on Monday.

Statement from NATS CEO Martin Rolfe, 29 August “I would like to apologise again for our technical failure yesterday. While we resolved the problem quickly, I am very conscious that the knock-on effects at such a busy time of year are still being felt by many people travelling in and out of the UK.

“I would like to reassure everyone that since yesterday afternoon all of our systems have been running normally to support airline and airport operations as they recover from this incident.

“NATS exists to allow everyone flying in UK airspace to do so safely. Our systems enable our air traffic controllers to deliver this service all year round. These have several levels of backup and allow us to manage around 2 million flights per year in some of the busiest and most complex airspace in the world safely and efficiently.

“Very occasionally technical issues occur that are complex and take longer to resolve. In the event of such an issue our systems are designed to isolate the problem and prioritise continued safe air traffic control.

“This is what happened yesterday. At no point was UK airspace closed but the number of flights was significantly reduced.

“Initial investigations into the problem show it relates to some of the flight data we received. Our systems, both primary and the back-ups, responded by suspending automatic processing to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system. There are no indications that this was a cyber-attack.

“We have well established procedures, overseen by the CAA, to investigate incidents. We are already working closely with them to provide a preliminary report to the Secretary of State for Transport on Monday. The conclusions of this report will be made public.

“I would like again to apologise to everyone who has been affected.”

kghjfg
29th Aug 2023, 22:25
A variation of Bobby Drop Tables?

https://imgs.xkcd.com/comics/exploits_of_a_mom.png


https://cimg6.ibsrv.net/gimg/pprune.org-vbulletin/666x205/exploits_of_a_mom_264b8b0195ff7f1fe6287947d47ad00f40a1cf2b.png
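
For anyone who hasn't met the cartoon: it is about SQL injection, where untrusted text gets spliced straight into a query. A minimal sketch of the difference between string concatenation and a parameterized query, using Python's built-in sqlite3 with invented table names:

# The point of the "Bobby Tables" cartoon: never splice untrusted text into SQL.
# Parameterized queries keep data as data. Table/column names here are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

hostile = "Robert'); DROP TABLE students;--"

# Unsafe pattern (shown, not executed): the input becomes part of the SQL text.
unsafe_sql = "INSERT INTO students (name) VALUES ('%s')" % hostile
print("would have run:", unsafe_sql)

# Safe pattern: the driver passes the value separately from the statement text.
conn.execute("INSERT INTO students (name) VALUES (?)", (hostile,))
print(conn.execute("SELECT name FROM students").fetchall())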

Clare Prop
29th Aug 2023, 22:30
https://cimg5.ibsrv.net/gimg/pprune.org-vbulletin/278x182/img_8631_57007a695856ff0bf6ea52d19031fde29d008a01.jpeg
Hal. HAL!

moosepileit
29th Aug 2023, 23:16
No - https://www.pprune.org/rumours-news/654461-u-k-nats-systems-failure.html?ispreloading=1#post11492878
You actually replied by linking back to this thread, itself. Was that to be substantive via ipso facto just because it's on post 11?

It still plays on any page, Oveur...

Underflow, overflow, division by zero...

Dannyboy39
30th Aug 2023, 06:05
Ref the recovery from these ATC strikes, I’m presuming there are no changes to the curfew timings?

If not, considering probably 500,000 passengers minimum have been affected by this issue, the airports, not just Heathrow, should be cut some slack. And it should be the same in any other exceptional event, like snow for example.

Ryanair for example are operating some “rescue flights” but with every flight running at capacity it’s going to be days before everyone gets back. Being the last week of the school holidays, flights were probably already at 95% of capacity at least in some cases and you could genuinely see some people not getting back within 7 days.

You need to try and get some more capacity in the system somewhere.

CW247
30th Aug 2023, 06:45
Pull the other one Rolfe, there is no one at the CAA who competently oversees what you do. They are expired relics and ex librarians who don't understand and see anything beyond their nose.

CBSITCB
30th Aug 2023, 06:47
Was that to be substantive via ipso facto just because it's on post 11?
The link was actually to post 12. I don’t understand the rest of your comments.

118.70
30th Aug 2023, 07:37
The Walmsley Report after the 2014 incident https://www.caa.co.uk/media/r42hircd/nats-system-failure-12-12-14-independent-enquiry-final-report-2-0-1.pdf included this snippet on NATS paying compensation
https://cimg5.ibsrv.net/gimg/pprune.org-vbulletin/642x177/compensation_nats_77ecadfb608da24f3dee2c755d6f6c7468279dc1.jpg

safetypee
30th Aug 2023, 07:40
For a techie, human-system view of problems with high tech failures; complexity, see:-

https://snafucatchers.github.io

jumpseater
30th Aug 2023, 07:47
Ref the recovery from these ATC strikes, I’m presuming there are no changes to the curfew timings?



Correct, it's unlikely there will be any changes to curfew. The same with non-UK airports, which restricts flexibility in running extra flights, or any delayed legs, e.g. Amsterdam, Geneva, Zurich, etc.

Captain Boycott
30th Aug 2023, 08:09
If airlines had enough resilience to produce manual flight plans, would this have reduced their own impact, and reputation hit?

Depending heavily on an automated system, with little back-up and resilience pared down to aggressively minimal levels, really doesn't seem to make much sense.

Suppose finger-pointing at NATS in a blame game post-event would be about the only defence airlines would have if they were inept enough to have little resilience.

No automated system is infallible. It's unlikely any ever will be, either.

100Series
30th Aug 2023, 08:22
Love it!! Back in the 90's, my Dad, who was then an approach radar ATCO at West Drayton told me when he occasionally used to practice doing SRA's with pilots, he was certain that some crews had the ILS dialled up and followed that rather than ATC instructions! :E:E:ok:
Back in the 70s on approach to Cork in the dear old gripper, we were asked if we would accept a PAR for controller training. My gallant captain said "yes", then leant across and turned off my ILS display. "Good training for you, boy". It would have been too, but I never flew another PAR.
First did a PAR in a Chipmunk. "Turn 2 degrees right". With a big glass compass between my legs calibrated every 10 degrees I complained to my instructor. "Well turn 12 degrees right then 10 degrees left".
Not strictly on topic but happy days.

Superpilot
30th Aug 2023, 08:25
So, a faulty bit of data basically bombed out the application service responsible for deciphering flight plan requests? In software development we have something called "Exception Handling". You will be familiar with the below kind of message. It basically causes the application to crash because the error type was never coded for/handled. Let's say you entered a "O" where a "0" (zero) was expected. It seems highly unlikely to me that in 10+ years, a corrupt bit of data didn't find its way to the same application service before. There is something else behind this that they've not divulged yet.

https://cimg1.ibsrv.net/gimg/pprune.org-vbulletin/448x331/lac3u_295f0e190da665d85138385b4dfe1609ac92a662.png
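
A minimal sketch of the exception-handling pattern being described (Python, with an invented message format): the service catches the failure, logs and quarantines the bad message, and carries on with the rest of the queue rather than letting one malformed item take the whole process down.

# Sketch of defensive message handling: one malformed flight-plan message is
# logged and quarantined instead of killing the whole processing loop.
# Message format and field names are invented for illustration.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fpl")

def decode(msg):
    callsign, eobt = msg.split("|")           # raises ValueError on malformed input
    return {"callsign": callsign, "eobt": int(eobt)}

def run(queue):
    accepted, quarantined = [], []
    for msg in queue:
        try:
            accepted.append(decode(msg))
        except (ValueError, TypeError) as exc:
            log.warning("rejecting message %r: %s", msg, exc)
            quarantined.append(msg)           # held for manual processing
    return accepted, quarantined

print(run(["BAW123|0930", "garbage-with-no-separator", "EZY45|O930"]))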

sky9
30th Aug 2023, 08:35
"If Airlines had enough resilience to produce manual flight plans would this have reduced their own impact. And reputation hit."

In my day they were called Airline Pilots. Normal operation in the 1970's.

FlightCosting
30th Aug 2023, 08:43
An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.
My PACS system, which has been in use by airlines for 30 years, has always had a simple check to see if the flight data is viable before making the calculation; any fault is flagged and the sector not allowed, with the point of failure displayed.

IFPS man
30th Aug 2023, 08:48
Re Murty’s post #80
IFPS had, until recently, two Flight Plan Processing Units - IFPU1 at Haren (Brussels) and IFPU2 at Bretigny (Paris). Each unit was responsible for half of Europe. We at Bretigny were the responsible Unit for aircraft departing from the Azores in the west, through Portugal, Spain, France, Italy, Greece, Turkey and the Med, along with airfields around the Med, for traffic entering European airspace.

We were not a back-up Unit for Brussels, rather a stand-alone Unit. Each Unit, however, backed up the other in case of a failure at either of the two Units. In fact, there were more “outages” at IFPU1 in Brussels than at Bretigny…
In 2006/7, the designators of the Units were changed to FP1 & FP2. During 2020, FP2 was closed down and the total airspace commitment transferred to FP1.

eglnyt
30th Aug 2023, 08:51
That Flight Plan was "checked" by at least 2 other systems before it got to where it caused the issue. There are perfectly valid flight plans which are known to cause the UK flight data processor issues and they are screened before they get there. It could be that a new one has now been added to the list.

BristolScout
30th Aug 2023, 08:53
Back in the 70s on approach to Cork in the dear old gripper, we were asked if we would accept a PAR for controller training. My gallant captain said "yes", then leant across and turned off my ILS display. "Good training for you, boy". It would have been too, but I never flew another PAR.
First did a PAR in a Chipmunk. "Turn 2 degrees right". With a big glass compass between my legs calibrated every 10 degrees I complained to my instructor. "Well turn 12 degrees right then 10 degrees left".
Not strictly on topic but happy days.

I seem to remember the Chippy having a vertical DI?

CBSITCB
30th Aug 2023, 08:59
I seem to remember the Chippy having a vertical DI?
A compass and a DI are separate and different things.

Neo380
30th Aug 2023, 09:08
There are myriad issues running here, but there won't be compensation under the Transport Act because this incident is being classed as an 'exceptional situation', but is it..?

Short answer, no. It's a repeat of the 2014 incident (interim and final reports are available - they wouldn't attach for some reason), but as mentioned, like Martin Rolfe's statement, there's 'a lot of puff, and very little explanation' in them. The CAA never got to the root cause of the issue. I know less about the 2009 fail over, as it was before my time.

As context, describing wide-scale, safety-critical IT systems is a bit like trying to give a headline summary of War and Peace: basically you can't. But there are certain key IT principles that should be present, such as: so long as your safety-critical system is still within its capacity parameters it should not fail over unsuccessfully (it should 'stay up', as the old IBM 9020 system did, 100%). Think about it for a moment: if the Hinkley Point nuclear power station had infrequent, but repeated, 'unsuccessful fail overs' we would have had two, potentially three, Fukushimas by now! But note, it is the flight planning system that is failing, not the radar links or voice comms, yet - that would be a complete disaster.

Another critical IT principle is not having backups with the exact same code as the main net - again, when you think about it this is totally obvious. If a tube train continues through a signalling junction because of a 'software glitch', you don't want the train after it, and the one after that, to go piling into the first train! And this is the core issue: the age of the Swanwick ATC system notwithstanding, it has the same code in the back up, and in the back up's back up! This is pure mismanagement, and why the incident is likely to recur.

Lastly, culture has a lot to answer for here. NATS' well-publicised 'just culture' is known internally as the 'nobody can be wrong culture'. Of course, if you make a mistake when in position, like falling asleep (a real incident btw), lessons need to be learned, more sleep provisioned for, proper rest breaks, procedures for if you suddenly feel very tired etc - that's all fine. But in encouraging people to come forward when incidents occur, the promise is 'you won't be actioned (disciplined) for what happened', and this has leaked into other areas, like IT governance, where no one can be blamed for mistakes that have been made, even in critical fail over architecture. And this is a highly risky position, hence all the 'puff'.

Failsafes should absolutely work, period. Typos in FPLs should be caught, but if they are not, the system should reject them, not collapse - and critically, neither should both backups.

Neo380
30th Aug 2023, 09:15
From one of our most esteemed aviation colleagues, (Professor S):

'I just uploaded a post- code in the wrong format into google maps and all the traffic lights in London have stopped working. Ha.'

Sums it up nicely.

alfaman
30th Aug 2023, 09:17
Not sure why you think they didn't get to the root cause of the 2014 problem: it was clearly identified. The problem this time may well be unrelated, time will tell. Just culture is not what you describe, & not limited to NATS. What you describe is a "no blame" culture, which has been out of favour for decades, for the reasons you suggest. A just culture draws a distinction between honest mistakes & errors, which occur in any environment, & non conformance. The second is definitely not acceptable nor accepted.

pax2908
30th Aug 2023, 09:22
That Flight Plan was "checked" by at least 2 other systems before it got to where it caused the issue. There are perfectly valid flight plans which are known to cause the UK flight data processor issues and they are screened before they get there. It could be that a new one has now been added to the list.
This may suggest that a given "problematic" FPL will always, or almost always, trigger a problem, for a multitude of possible "environments" (e.g. the rest of the traffic and other dynamic data)? It would then be conceivable to do one more "dry" test before the data gets to the real live system? (Or, on the contrary, how "new" was the problem FPL ... how many days does it take for a given "bad" FPL to trigger something like this?)
Sorry, I imagine more than I should ... just curious!

eglnyt
30th Aug 2023, 09:33
I don't think this time it was the Swanwick system but the previous review following the 2014 incident pointed out that to fully test every state of that system would take over 100 years. There will be bugs in any complex system. You can't eliminate them by a bit more testing.

FlyingApe
30th Aug 2023, 10:15
....when it came into service in the early 1970's.

Bought from the Americans, who were the only other users at the time, it was cutting edge, and took a lot of manpower to maintain, and a lot of power and cooling.

Decades later, it wasn't of course, but the new FDP system basically re-platformed the old system.

" Stuck flighplans were common, and " Restarts and " Flops" a weekly occurrence.

Extra functionality on top of the original code was supposed to check the plan for "legality" to prevent bad data crashing the system....and generally did. Failures went from being weekly to yearly or longer.



The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.

CBSITCB
30th Aug 2023, 11:22
The NATS CEO indicated this morning that a piece of the system (which has to be the FPPS) failed because it didn’t recognize a message, which was almost certainly an FPL.

People are questioning how a “bad” FPL came to be accepted into the FPPS. It is important to recognize that an FPL has syntax (format) and semantics (meaning).

If the syntax is correct, it is a valid FPL. By far the most complex element of an FPL is the route. The other elements are just parameters that are checked for validity. For example, if aircraft type is stated as “C172” the FPPS checks this against a list of valid aircraft types in its database.

The route syntax is checked to make sure the expression follows the rules of how the route elements should be constructed. Whilst there are many different types of route element and many rules to follow, this checking is relatively straightforward. If a “bad” FPL in terms of format is recognized it will be rejected at this stage. If not, the FPL will be passed to route conversion, where the semantics are extracted.

This is why I can’t see how the statement “it didn’t recognize a message” could lead to a processing failure. If it didn’t recognize a message, it would be rejected at this stage – business as usual. I can only assume the message was recognized as valid and passed-on.

The FPPS must now work out what the actual route is within its airspace (the semantic meaning), and this is the really difficult bit. There is an infinite number of possibilities. For example, route fixes can be expressed as lat/long coordinates which could be literally anywhere. The programme works out what it needs to do, in terms of outputting information to controllers and adjacent centres, and my guess is that this is the source of the problem that caused the FPPS to crash.

The programme came across an unusual route it had not encountered before (and had not been programmed to expect), didn’t know what to do, and a graceful recovery was not available. In other words, encountered a bug and did something unpredictable.

Just my guess.
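To illustrate that syntax-versus-semantics split, a toy sketch (Python; the fixes, tokens and rules are invented for the example, not the FPPS design). The point is that a route can pass the format check and still defeat the conversion stage:

```python
# Illustrative only: two-stage handling of a toy route string.
import re

KNOWN_FIXES = {"LAM": (51.65, 0.15), "BNN": (51.73, -0.55)}   # invented subset

TOKEN = re.compile(r"^([A-Z]{2,5}|[A-Z]\d{1,3}|\d{4}[NS]\d{5}[EW])$")

def syntax_check(route: str) -> bool:
    # Stage 1: does every element look like a fix, an airway or a lat/long?
    return all(TOKEN.match(elem) for elem in route.split())

def route_conversion(route: str) -> list:
    # Stage 2: work out what the route means inside our airspace.
    points = []
    for elem in route.split():
        if elem in KNOWN_FIXES:
            points.append(KNOWN_FIXES[elem])
        elif re.match(r"^\d{4}[NS]\d{5}[EW]$", elem):
            lat = int(elem[:4]) / 100 * (1 if elem[4] == "N" else -1)   # crude conversion
            lon = int(elem[5:10]) / 100 * (1 if elem[10] == "E" else -1)
            points.append((lat, lon))
        else:
            # Syntactically valid element we have no handling for: if this
            # case is not dealt with gracefully, this is where the bug bites.
            raise KeyError(f"no handling for route element {elem!r}")
    return points

plan = "LAM L608 5130N00015W"
print(syntax_check(plan))        # True - passes the format check
print(route_conversion(plan))    # KeyError on 'L608' - the semantics defeat us
```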

eekeek
30th Aug 2023, 11:36
Nobody else think this is a massive over-reaction by all involved? One outage every 9 years? That's not bad is it.
It didn't even really fail. The service they provide was significantly slower for a few hours, while they went manual. Is it NATS fault that R***air run their operation with no margin on crew hours? Or that Karen Pax would rather screech on twitter than find a hotel for an extra night while it all blows over.
"How did we ever win the war?" etc.

LTNABZ
30th Aug 2023, 13:12
My wife told me that on Jeremy Vine's show at lunchtime it was stated that the faulty flight plan should have been submitted as a pdf document but wasn't. Surely not!? Someone please tell me that the system does not rely on machine reading of a format specifically designed to be presented to and seen by a human the same regardless of platform. I initially thought the Daily Mail's report, that a faulty plan submitted by a French carrier was the problem, was rubbish and pooh-poohed it as it was the Mail. Seems now they were probably right. These days it seems even the incredible is real.

alfaman
30th Aug 2023, 13:32
My wife told me that on Jeremy Vine's show at lunchtime it was stated that the faulty flight plan should have been submitted as a pdf document but wasn't. Surely not!? Someone please tell me that the system does not rely on machine reading of a format specifically designed to be presented to and seen by a human the same regardless of platform. I initially thought the Daily Mail's report, that a faulty plan submitted by a French carrier was the problem, was rubbish and pooh-poohed it as it was the Mail. Seems now they were probably right. These days it seems even the incredible is real.
I wouldn't trust anything said on the Vine show, frankly.

LTNABZ
30th Aug 2023, 14:01
I wouldn't trust anything said on the Vine show, frankly.
Agree, but ditto the Mail, though

LTNABZ
30th Aug 2023, 14:05
Nobody else think this is a massive over-reaction by all involved? One outage every 9 years? That's not bad is it.
It didn't even really fail. The service they provide was significantly slower for a few hours, while they went manual. Is it NATS fault that R***air run their operation with no margin on crew hours? Or that Karen Pax would rather screech on twitter than find a hotel for an extra night while it all blows over.
"How did we ever win the war?" etc.

Yes. The "Efficiency" ideal everywhere is removing all chance anyone has of fixing problems which are inevitable, not just flights, but rail, roads, NHS, Councils, just-in-time supply chains, etc etc.

Superpilot
30th Aug 2023, 15:05
Whatever's going on, we have evidence that there are no independent backup systems at NATS. Whatever processes they have go through a single point of failure. That can't be news to the developers and managers at NATS. They will have known about it.

eglnyt
30th Aug 2023, 15:15
If I have the World's most sophisticated system & it cost me, for the sake of argument, £500 million, a truly independent backup would more than double that cost even if you could get 2 capable of handling all the traffic. So I go to the two very vocal Irishmen & tell them I want to double the bit of their fees that cover systems & interest on capital investment to avoid a disruption every 10 years or so but I can't guarantee that there will never be any disruption even if I spend that money. What do you think their response would be?

ATC Watcher
30th Aug 2023, 17:04
If I have the World's most sophisticated system & it cost me, for the sake of argument, £500 million, a truly independent backup would more than double that cost even if you could get 2 capable of handling all the traffic. So I go to the two very vocal Irishmen & tell them I want to double the bit of their fees that cover systems & interest on capital investment to avoid a disruption every 10 years or so but I can't guarantee that there will never be any disruption even if I spend that money. What do you think their response would be?
Never listen to the very vocal Irishmen; if you follow their logic, that of MOL at least, we should all work 12h a day, 7 days a week, for half the money so that his airline can make more profit. But on the cost of back-up systems you take the problem the wrong way: a functioning back-up in an ATC Centre is part of the system, from the outset, and this regardless of the end price. You do not buy a system and later ask for a back-up.
That said, today if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your back-up system is likely to be used only some minutes a year. Also, we do not have back-up systems only to maintain capacity; they are primarily there to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard); it caused delays, diversions and cancellations, but this is a capacity consequence, not a safety one.

What caused the system to crash is not really the important issue - nice to know, to learn from and prevent it from happening again - but why it took so long to restart the primary system is the question I would like to ask if I had the chance.

Ninthace
30th Aug 2023, 17:12
Whatever's going on, we have evidence that there are no independent backup systems at NATS. Whatever processes they have go through a single point of failure. That can't be news to the developers and managers at NATS. They will have known about it.
An independent backup will help in the event of hardware or power failure, but if both systems are using identical software they will react in the same way to the same input. If one crashes because of bad data, so will any backup.
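A crude sketch of that point (Python, illustrative only - not how the NATS failover actually works): if the standby runs the same code and is handed the same queued data, switching over just reproduces the crash.

```python
# Illustrative only: identical primary and standby fed the same message queue.

def process(message: str) -> str:
    if "POISON" in message:                  # stand-in for the untested input
        raise RuntimeError("unhandled message type")
    return f"processed {message}"

def run_with_failover(queue):
    for node in ("primary", "standby", "standby-of-standby"):
        try:
            for msg in queue:
                process(msg)
            print(f"{node}: all messages handled")
            return
        except RuntimeError as exc:
            # Same code plus same data equals the same failure on every node.
            print(f"{node} failed ({exc}); failing over...")
    print("all nodes down: revert to manual ops")

run_with_failover(["FPL-1", "FPL-2", "POISON-FPL", "FPL-3"])
```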

eglnyt
30th Aug 2023, 17:17
There is a big difference between a functional backup & a truly independent backup. The latter implies different software from the main. I don't know any ANSP that has a truly independent backup capable of handling the same traffic as the main. If you just have a second copy of the main then when you feed it the same data it will do exactly the same.

Ninthace
30th Aug 2023, 17:31
There is a big difference between a functional backup & a truly independent backup. The latter implies different software from the main. I don't know any ANSP that has a truly independent backup capable of handling the same traffic as the main. If you just have a second copy of the main then when you feed it the same data it will do exactly the same.

Writing and testing the software once is hard enough. Writing and testing a second, different, version that does the same thing would be "interesting". Now throw in the need to keep both systems current with evolving user demands and constantly changing data within the system. Then add the demands of upgrades to the hardware and software from the manufacturers and the associated testing. Finally it has to work and it has to make money. Hands up who wants to be the manager answerable to the CEO for that.

EGPI10BR
30th Aug 2023, 17:56
Some systems will run with the primary on version n and the backup on version n-1 so that the backup won’t be affected by a newly introduced bug.

That falls down of course if an undetected bug was in version n-15.

Misty.
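A toy illustration of that version-skew idea (Python, invented function names, not any ANSP's real arrangement): a regression introduced in release n is masked by keeping the standby on n-1, but a defect that has been present since an old release takes both out.

```python
# Illustrative only: primary on release n, standby on release n-1.

def shared_handler(msg):           # long-standing code path present in both
    if msg == "rare-old-input":
        raise RuntimeError("latent defect carried since an old release")
    return "ok"

def release_n(msg):                # newest release: adds a fresh bug
    if msg == "new-input":
        raise RuntimeError("regression introduced in release n")
    return shared_handler(msg)

def release_n_minus_1(msg):        # standby: one release behind
    return shared_handler(msg)

for msg in ("normal", "new-input", "rare-old-input"):
    for name, node in (("primary (n)", release_n), ("standby (n-1)", release_n_minus_1)):
        try:
            print(f"{name}: {msg} -> {node(msg)}")
        except RuntimeError as exc:
            print(f"{name}: {msg} -> FAILED ({exc})")
```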

Ninthace
30th Aug 2023, 18:05
Some systems will run with the primary on version n and the backup on version n-1 so that the backup won’t be affected by a newly introduced bug.

That falls down of course if an undetected bug was in version n-15.

Misty.
Fine for software tweaks, not so good when the change is caused by a shift in user requirements. Then you end up with a system that should do what you want but doesn't run and a system that doesn't do what you want but runs. :confused:

Walnut
30th Aug 2023, 18:15
Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don't believe him.
I know from airline sources that the problem was known at 0800, 4 hrs before NATS admitted to the problem. That's why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4 hrs later around 1500, NATS announced they had fixed the problem. They hadn't; the bug had washed out because the system holds about 4 hrs of data.
So my analysis is that every time there is a faulty input of data the system resets itself to protect itself. Fine, but surely an airline, on discovering its flight plan had been rejected, would check and input its data correctly. So I believe a DOS input by a third party was re-introduced, causing the NATS system to go down again. As so many flight plans were being inputted it was very difficult to detect; only when NATS admitted there was a problem did the malevolent data cease being inputted, and it eventually washed out.
So CEO, please explain the delay.

KiloB
30th Aug 2023, 18:21
I suspect that an underlying reason for the severity of this breakdown was that the ATC system has been quietly 'redlining' for some time. The post-Covid boom in travel numbers is now making the decisions airlines made about putting more emphasis on narrow-bodies look very suspect. The maths is simple: for a given number of passengers, smaller aircraft mean more movements.
And it is a problem that will take quite some time to unscramble.

speed13ird
30th Aug 2023, 18:27
Well, almost: larger aircraft need longer turnrounds and cause bigger headaches when they go tech.

eglnyt
30th Aug 2023, 18:54
Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don't believe him.
I know from airline sources that the problem was known at 0800, 4 hrs before NATS admitted to the problem. That's why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4 hrs later around 1500, NATS announced they had fixed the problem. They hadn't; the bug had washed out because the system holds about 4 hrs of data.
So my analysis is that every time there is a faulty input of data the system resets itself to protect itself. Fine, but surely an airline, on discovering its flight plan had been rejected, would check and input its data correctly. So I believe a DOS input by a third party was re-introduced, causing the NATS system to go down again. As so many flight plans were being inputted it was very difficult to detect; only when NATS admitted there was a problem did the malevolent data cease being inputted, and it eventually washed out.
So CEO, please explain the delay.

The originator would never know its plan was rejected because it wasn't. It was filed with Eurocontrol who had no reason to reject it. It passed through several systems without incident before it caused harm.

The NATS systems beyond the one we are assuming was the issue also have a buffer of data of up to 4 hours. This data will start to go stale as amendments, coordinations and new plans don't arrive but is a good basis to continue to operate without flow if you expect the technical issue to be fixed. At some point somebody will decide it isn't coming back and will make the call to impose flow. That was done before 10:00 UTC although it was about 2 hours later before NATS itself published anything.

Eventually all will be in the public domain however much anybody tries to stop it so why would they bother to lie?

alfaman
30th Aug 2023, 20:20
Agree, but ditto the Mail, though
Aimed at the same demographic ;)

Seaking74
30th Aug 2023, 20:29
The originator would never know its plan was rejected because it wasn't. It was filed with Eurocontrol who had no reason to reject it. It passed through several systems without incident before it caused harm.

So the flight plan was fine for Eurocontrol and other systems but not for NATS. And NATS uses a modified US software for flight handling which likely isn’t the same as Eurocontrol et al? But this system has worked flawlessly until now? Does that point to something (software/firmware?) having changed recently in Euro land that hasn’t been changed here thus causing the failure?

First_Principal
30th Aug 2023, 20:43
...There is something else behind this that they've not divulged yet.

https://cimg1.ibsrv.net/gimg/pprune.org-vbulletin/448x331/lac3u_295f0e190da665d85138385b4dfe1609ac92a662.png

Uh-oh, surely they're not using a M$ product?! :zzz:

Given the importance of this system you'd have to hope there's a decent investigation and full public report that would reveal the cause as well as the underlying products in use, along with how redundancy is managed. To my mind there are certain things that bear critical scrutiny; with decent information this should allay unhelpful speculation, but allow competent people to assess the risk faced in being involved with said system.

FP.

eglnyt
30th Aug 2023, 21:01
Last time there was an "independent" review & a public report which is still available on the CAA website. And a Parliamentary Select Committee hearing which it would be fair to say didn't really add much of value. This incident was orders of magnitude more disruptive so you'd expect at least the same although this Government isn't too hot on transparency and rather dismissive of experts.

Expatrick
30th Aug 2023, 21:16
There has to be a system whereby the ANSP is penalised in favour of the airlines, not the Exchequer. Tricky to design and implement, but surely doable: a formula based initially on minutes of delay, distributed according to the impact on individual airlines - not necessarily the largest operators, but those on whom the event has had the largest impact - and, most importantly, a charge that the ANSP is not simply able to claw back from the airlines in future.

eglnyt
30th Aug 2023, 21:35
The NATS licence includes a penalty scheme whereby a certain level of delay triggers a reduction in future charges. It is deliberately not punitive to avoid influence on operational decision making so it is unlikely to cover the airline's costs.

Neo380
30th Aug 2023, 21:44
Never listen to the very vocal Irishmen; if you follow their logic, that of MOL at least, we should all work 12h a day, 7 days a week, for half the money so that his airline can make more profit. But on the cost of back-up systems you take the problem the wrong way: a functioning back-up in an ATC Centre is part of the system, from the outset, and this regardless of the end price. You do not buy a system and later ask for a back-up.
That said, today if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your back-up system is likely to be used only some minutes a year. Also, we do not have back-up systems only to maintain capacity; they are primarily there to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard); it caused delays, diversions and cancellations, but this is a capacity consequence, not a safety one.

What caused the system to crash is not really the important issue - nice to know, to learn from and prevent it from happening again - but why it took so long to restart the primary system is the question I would like to ask if I had the chance.

The system crashed, for 7 hours, because it failed to 'fail over', which is what it's supposed to do - take a look at thread 122 to understand why. This isn't an 'IT glitch', it's 'building a house with a trap door in the floor, with the same trap door behind it, and an identical trap door under that' - fall through one and you fall through all three!

The other serious managerial error was stopping the backup that used to occur through the Prestwick Centre, in the same way that Eurocontrol does (see above), but 'the two centres went their separate ways...'

Neo380
30th Aug 2023, 21:48
'Having just seen the CEO of NATS say a single piece of faulty data caused the meltdown, I don't believe him.
I know from airline sources that the problem was known at 0800, 4 hrs before NATS admitted to the problem. That's why flights were initially posted as being delayed before being canx around midday, i.e. NATS covered up the problem.
Eventually, circa 4 hrs later around 1500, NATS announced they had fixed the problem. They hadn't; the bug had washed out because the system holds about 4 hrs of data...So CEO, please explain the delay.'

Exactly - he can't (without removing the cover up).

kiwi grey
30th Aug 2023, 22:06
This is clearly a plausible explanation, but at first glance it doesn't seem as complex a task as you suggest
Whilst there are clearly an infinite number of locations, lots of them can be ruled out simply. Anything outside of bounding boxes for Shanwick, Scottish and London can go immediately. Within that you've got bounding boxes for the areas controlled and only then do you actually have to start considering the real geography. The graph of controllers must be fairly small (otherwise the crew would be constantly on the radio) and the number of possible graphs isn't that big (in computer terms), so anything getting stuck should be detectable. Your satnav can cope with a much bigger problem.
(emphasis added)

The amount of data and the complexity of the calculations / algorithm to do comprehensive - even exhaustive - checking may indeed be relatively trivial for a modern system, but the NATS system doesn't seem like that from what I've read. It appears to be an older (ancient?) system re-hosted onto a more modern platform.
The NATS base system may, for example, be a 32-bit architecture and simply unable to use 'modern' amounts of memory (4GB) no matter the capacity of the hosting platform. Or it may be that (parts of) the system are inherently single-threaded, so that it matters not how many CPU instances / threads you throw at it, it only goes as fast as a single CPU. Or perhaps the overheads of emulating the original code on a less elderly system, which is in turn being emulated on a quite modern platform, are such as to ensure that the best you can hope for is 'quite a lot faster' than the original implementation, but nothing approaching a modern definition of 'high performance'.
Or if you're really lucky, all three apply. :(

The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.
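On the bounding-box point in the quote above, a rough sketch of the kind of cheap pre-filter being described (Python; the boxes are invented, not the real FIR geometry):

```python
# Illustrative only: crude lat/lon bounding boxes, not real FIR boundaries.
FIR_BOXES = {
    "London":   (49.0, 55.0, -8.0, 2.0),    # lat_min, lat_max, lon_min, lon_max
    "Scottish": (54.5, 61.5, -10.0, 0.0),
}

def firs_possibly_crossed(route_points):
    """Return FIRs whose bounding box contains at least one route point.

    Anything that never enters a box can be discarded before the expensive
    route-conversion and conflict logic ever runs.
    """
    hits = set()
    for lat, lon in route_points:
        for name, (lat0, lat1, lon0, lon1) in FIR_BOXES.items():
            if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
                hits.add(name)
    return hits

print(firs_possibly_crossed([(50.1, -1.2), (57.0, -4.5)]))  # {'London', 'Scottish'}
print(firs_possibly_crossed([(40.0, 15.0)]))                # set(): not our problem
```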

grizzled
31st Aug 2023, 00:13
Eventually all will be in the public domain however much anybody tries to stop it so why would they bother to lie?

Maybe yes, maybe no. NATS is a public / private partnership so, as such, is it subject to the UK Freedom of Information Act?

WillowRun 6-3
31st Aug 2023, 02:18
The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.

Difficult to dispute that description of a rip-out and replace project.

But would it be worse than the aftermath of a strong cybersecurity breach, or cyber attack? A version of, "if you think safety is expensive, try having an accident"

Neo380
31st Aug 2023, 05:39
Maybe yes, maybe no. NATS is a public / private partnership so, as such, is it subject the UK Freedom of Information Act?

You would think so, but as a PPP (not a publicly listed entity, or a government department/agency) NATS considers that it is NOT bound by the FOI Act, and therefore since the establishment of the Act (2000) 'picks and chooses' which FOI subject access requests (requests for information) it responds to. You can bet your pension that NATS won't be responding to FOI requests on this one!

Neo380
31st Aug 2023, 05:45
The only actual solution to such a problem is to rip out the old system and replace it, and that is one of those Major I.T. Systems Project behemoths that are rightly feared. They are inevitably extremely high profile, politically sensitive, long duration, extraordinarily complex with many many stakeholders, and extremely risky.
Difficult to dispute that description of a rip-out and replace project.

But would it be worse than the aftermath of a strong cybersecurity breach, or cyber attack? A version of, "if you think safety is expensive, try having an accident"

NATS has already floated replacing the (1970s based!) Swanwick main ATC system, at a tentative £1bn+ price tag, that they will ask to be paid for from the public purse. The issue is making the case for a 'safety critical' main system, when you have said that you can 'prioritise safety', and manage (highly reduced!) flows safely by operating manually - a bit of a contradiction in terms.

Asturias56
31st Aug 2023, 07:31
todays "Times" notes that they're upping the money they (the shareholders - the Govt & the airlines) take out as dividends and reducing the amount they planned to invest...............

eglnyt
31st Aug 2023, 07:53
NATS investment is not from the public purse. One of the reasons it was privatised was to remove its borrowing requirements from the accounts, in the days when the Public Sector Borrowing Requirement figure was thought to be important and Government hadn't realised it could just keep printing money and nobody actually cared.
It has been investing a lot since privatisation and continues to do so; whether all that investment has been wise is difficult to judge, but it's kept a lot of people employed for a while.
If it was NAS that failed, then NATS has been trying to replace that system since the 90s and has failed to do so; it's a bit like trying to replace the foundations of a tower block whilst keeping the block intact, because it's so integrated into all NATS systems.
Having finally realised that couldn't be done, it embarked on replacing both NAS and the overlying systems a few years ago. It's a huge project that does literally rip it out and start again.

Neo380
31st Aug 2023, 08:01
NATS investment is not from the public purse. One of the reasons it was privatised was to remove its borrowing requirements from the accounts, in the days when the Public Sector Borrowing Requirement figure was thought to be important and Government hadn't realised it could just keep printing money and nobody actually cared.
It has been investing a lot since privatisation and continues to do so; whether all that investment has been wise is difficult to judge, but it's kept a lot of people employed for a while.
If it was NAS that failed, then NATS has been trying to replace that system since the 90s and has failed to do so; it's a bit like trying to replace the foundations of a tower block whilst keeping the block intact, because it's so integrated into all NATS systems.
Having finally realised that couldn't be done, it embarked on replacing both NAS and the overlying systems a few years ago. It's a huge project that does literally rip it out and start again.

NATS is currently asking for (very large) sums to run a system for uncrewed aircraft (drones), which most people in the sector think should be provided by industry - it won't do the NAS for free!

eglnyt
31st Aug 2023, 08:17
Slightly off topic but NATS is funded by its users. Why should those users fund services provided to an emerging industry? Again the 2 vocal Irishmen would have something to say. All NATS expenditure is governed by the licence negotiations. There are some things it funds as a quid pro quo for its right to run the UK airspace but that currently isn't one of them

Neo380
31st Aug 2023, 08:30
Slightly off topic but NATS is funded by its users. Why should those users fund services provided to an emerging industry? Again the 2 vocal Irishmen would have something to say. All NATS expenditure is governed by the licence negotiations. There are some things it funds as a quid pro quo for its right to run the UK airspace but that currently isn't one of them

Because it believes it should be 'controlling' UAS (of course it can't).

Neo380
31st Aug 2023, 08:42
'Just culture is not what you describe, & not limited to NATS. What you describe is a "no blame" culture, which has been out of favour for decades, for the reasons you suggest. A just culture draws a distinction between honest mistakes & errors, which occur in any environment, & non conformance. The second is definitely not acceptable nor accepted.'

Agreed. So if you build a safety critical system that won't 'fail over' successfully, that's ok, according to 'just culture' - because NATS is saying they've 'conformed'?

eglnyt
31st Aug 2023, 09:32
I still don't know what failed, but although NATS has some safety-critical systems, no loss of service in the flight plan thread should be safety critical. Corruption of data could be, but not loss of service. Business critical, yes, and demonstrably so, but not safety critical.

ATC Watcher
31st Aug 2023, 09:54
Very interesting discussion. Learning a few things along the way. On just culture, I wonder if some are thinking that the concept, initially designed for front-line operators, would apply to top management staff :)
Now CANSO has just issued a statement on the UK failure: CANSO (Civil Air Navigation Services Organisation), the global and regional voice of Air Traffic Management, has provided its views on the system failure that caused significant disruption in the United Kingdom earlier this week. Simon Hocquard, Director General, CANSO said: “First and foremost, my thoughts are with those air passengers that have had their travel plans affected by this incident. Air Traffic Management organisations across the globe rely on a network of complex systems to safely maintain the separation of aircraft at all times. In the rare instances where a system fails, it can often be due to a seemingly small problem. Whenever a failure does occur the number one priority has to be, and is always safety.”

The disruption to UK air traffic this week was caused by a failure of NATS’ flight data processing system. In order for the global air traffic system to work, any commercial or civilian aircraft flying from one airport to another needs to file a flight plan. These flight plans contain a lot of information including the route the flight will take – this is essential as each flight invariably crosses different sections of airspace often under several jurisdictions. This important information allows sectors of airspace to ensure the safety of each aircraft entering and exiting it by maintaining separation from other flights and ensuring a smooth flow of traffic.

The processing of this essential data between sectors is done automatically, and there are millions of flight plans filed globally every month without disruption due to system failures. As an example, the UK has had a decade of flight plans filed with no technical issues. In the very rare instances where technology fails, Air Navigation Service Providers revert to the manual processing of flight data. This requires a lot of manpower and cannot be done as quickly as the technology processes it and so it is necessary to reduce the number of flights in and out of airspace sectors so that the information can be accurately processed manually in a timely manner. This slowing down of the number of flights is to ensure the safety of aircraft. Once the system is back up and running and fully tested, capacity can once again be restored.

Simon Hocquard added: “ANSPs around the globe are built on a century of safety and significant investment in their people, technology and processes. NATS is one of the leading ANSPs with a very high level of performance and reputation and the steps it took to fix the issue had safety at their very heart.”

It says a few things when one reads between the lines.

Expatrick
31st Aug 2023, 10:00
The NATS licence includes a penalty scheme whereby a certain level of delay triggers a reduction in future charges. It is deliberately not punitive to avoid influence on operational decision making so it is unlikely to cover the airline's costs.

Thanks, I'd forgotten that!

CBSITCB
31st Aug 2023, 10:13
NATS has already floated replacing the (1970s based!) Swanwick main ATC system.

I strongly refute the (widely held) belief that the main Swanwick ATC system is based on 1970s technology!

IBM installed a prototype 9020 system at the Jacksonville centre in 1967. The first operational 9020 system, running the NAS En Route Stage A software (progenitor to the current Swanwick system) went live on 18 February 1970.

Therefore the Swanwick main ATC system is clearly based on 1960s technology.

The UK 9020/NAS first went live at West Drayton on 2 December 1974, running the NAS En Route Stage A software.

Engineer39
31st Aug 2023, 10:26
The problem with testing software is that you can't test all combinations of input values to ensure the required output values are correct, certainly not in very large or complex systems. Failure testing is often limited to alternate-path (within the software) testing as defined in the requirements/specification. Edge cases will always catch you out.

With that in mind, critical systems like this should always fail safe, ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time.

Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash.

I wonder how often, if ever, business continuity testing is performed which should have enabled quick recovery.

I was one of the people who signed off the upgrade in 2014 as being OK to implement. We were right as it was safe, just not resilient.

As the linked report here shows, it was all due to the 154th workstation being turned on and crashing the system. Of course some may say “How can NATS be so stupid as to not spot this and test for it?” Well, the test suite has around 90 workstations so there is no way you can turn on 154 stations to test the software past its 153 limit. And no chance of getting time in >100 ATCOs’ schedules to get them all to come in and exercise all the stations even if you had >153 to test. And you can’t test on the live system with its many stations, as of course it’s not possible to find space in the schedule to do this on a system that is live 24/7/365. So it instead relies on software engineers understanding code that was rewritten >10 years before, to understand what the 153 number meant. Obviously in that case no one understood it, or if they did, thought it meant active stations and forgot about the ones in a half-on state. It’s not possible to retain all the knowledge from years ago unless no one resigns, retires or is made redundant. And you don’t outsource anything.

All in all I can't see practically how that incident could have been avoided.

I have no knowledge of this new incident but suspect the causes are all rather similar, and that practically it should be possible to eliminate this particular case from happening again; but you can’t say “We will never have a crash again”. Those Irishmen are whistling in the wind if they think it can be. As others have pointed out, you might be able to spend a ton of money to have duplicate software but it’s just not worth the expense. I am baffled though why this case took so long to recover from.

Upgrading it all to brand new software may help long term, but likely there will be more short-term disruption due to new bugs introduced.

I think NATS compares very favourably in disruption compared with other ANSPs. But some (small) improvements will hopefully come out of all this.



I strongly refute the (widely held) belief that the main Swanwick ATC system is based on 1970s technology!

IBM installed a prototype 9020 system at the Jacksonville centre in 1967. The first operational 9020 system, running the NAS En Route Stage A software (progenitor to the current Swanwick system) went live on 18 February 1970.

Therefore the Swanwick main ATC system is clearly based on 1960s technology.

The UK 9020/NAS first went live at West Drayton on 2 December 1974, running the NAS En Route Stage A software.

I am confused. I thought it was stated here that the NAS (or is it the FPS) hardware system had been replatformed onto new processors. Thus you are wrong, except in the sense that “I speak English and Shakespeare also spoke English, therefore my mentality is 400 years old.” Just having one's roots in something old does not make you old.

Neo380
31st Aug 2023, 11:32
I strongly refute the (widely held) belief that the main Swanwick ATC system is based on 1970s technology!

IBM installed a prototype 9020 system at the Jacksonville centre in 1967. The first operational 9020 system, running the NAS En Route Stage A software (progenitor to the current Swanwick system) went live on 18 February 1970.

Therefore the Swanwick main ATC system is clearly based on 1960s technology.

The UK 9020/NAS first went live at West Drayton on 2 December 1974, running the NAS En Route Stage A software.

Good point, 1970 installed, (1960s technology)…

Widger
31st Aug 2023, 13:29
Nothing wrong with 1960s tech, especially computer tech. The now scrapped Type 42 Air Defence Destroyers and the CVS Aircraft Carriers (Invincible Class) had the Ferranti FM1600B (https://www.computinghistory.org.uk/det/16840/Ferranti-FM1600-B/) computer and its operating software, ADAWS, and this was used up until 2013-14. The front end had relatively modern interfaces but the core system was run by the FM1600. It is not a direct comparison but gives some idea of the complexity of issues. Programming in the early years was by punched paper tape and a laborious system of typing code which was then translated into punched holes. If you got one element of code wrong, the system would stop loading immediately. As one was typing you would occasionally type the wrong letter or number and have to start all over again, to much cursing. Programming airways, hazards etc. took hours. Towards the end of its life, programming was via floppy disk, which at least gave you editing control, but the core system would still grind to a halt if you got the syntax wrong. So the front end would have modern PC screens and keyboards, and the memory capacity was much larger, enabling things like airspace reservations and navaids to be displayed, but the core system was still there lurking in the background. It was reliable and capable, helping the RN to achieve the first missile-to-missile engagement in GW1 in what was a very tricky shot. The only real way around the issue was a new ship, with a new command system, and that was achieved with the Type 45 Destroyer. I imagine the NAS and NERL issues to be very similar and equally complex. I very much recall the nervousness when leaving West Drayton about which plugs to turn off, as there might well have been dependencies on hardware lurking in the basement.

CBSITCB
31st Aug 2023, 13:36
I am confused. I thought it was stated here that the NAS (or is it the FPS) hardware system had been replatformed onto new processors.

Originally the 9020 hardware running the NAS En Route Stage A software - collectively the Flight Plan Processing System(FPPS) - was part of the wider US National Airspace System. Hence "NAS".

The UK obtained a copy of the whole shebang in the 1970s. The mouthful NAS En Route Stage A software became known as just NAS.

Over the years the hardware has been replaced several times, but always with a 9020-compatible IBM system that still runs NAS. The operating system (or "monitor" in NAS-speak) is of course tweaked to fit the upgraded hardware, but the core ATC application programmes remain essentially the same, with numerous enhancements over the years of course. They are mainly written in Jovial, which was the US government language of choice at the time for such systems.

Widger
31st Aug 2023, 13:52
Well, every day is a learning day. I have just found out that the FM1600 also used a derivative of the JOVIAL language, namely CORAL, which was developed by the Royal Radar Establishment at Malvern. They were clever people, these 60s software engineers, doing so much with 256K.

https://en.wikipedia.org/wiki/JOVIAL

eglnyt
31st Aug 2023, 14:03
The NERC system at Swanwick originally had replacement of NAS in scope. It replaced lots of the functions of NAS but in the end the core Flight Data Processing functions were descoped and NAS was retained eventually moving down to Swanwick when West Drayton closed. It's been replatformed a few times but NAS at Swanwick is still there and has code in it that ran on the 9020. Exactly how much is difficult to quantify. A lot of the functionality of the original NAS is no longer used but is still in there because removing it posed a lot of risk. A lot of the more modern functionality was added since the 9020 was withdrawn and all the hardware drivers etc have been replaced.
The NERC system may have some components in it which were inspired by NAS but most of it was originally late 80s early 90s. It has also been replatformed, some core functionality has been totally replaced, and a lot of functionality has been added since it first handled ops.

Murty
31st Aug 2023, 14:20
An example of how a flight plan could trip out the UK system, if not isolated, but maybe not affect the IFPS, can also be in the callsign/flight number.

The UK system has trouble (which has a safety net now, if everything is followed) with any callsign with "NNN" in it, or in Item 18 of the FPL.

To date there have only been 2 aircraft out of 3 that have tripped the UK system, despite NNNN being the problem in AFTN messaging. Consider the AFTN message format:

The message format of AFTN messages is defined in ICAO Annex 10 Aeronautical Telecommunications Volume II (https://en.wikipedia.org/wiki/Aeronautical_Fixed_Telecommunication_Network#cite_note-4).

AFTN messages consist of a Heading, the Message Text and a message Ending.

The message Heading comprises a Heading Line, the Address and the Origin. The Heading Line comprises the Start-of-Message Signal which is the four characters ZCZC, the Transmission Identification, an Additional Service Indication (if necessary) and a Spacing Signal.

The Message Text ends with the End-of-Message Signal, which is the four characters NNNN. The Ending itself comprises twelve letter shift signals which represent also a Message-Separation Signal.


1. DC Aviation have a C56X registered DCNNN, but it files under the fixed callsign DCS705; however, since the hex code and registration now have to be included in the UK system, this is still a trip-out risk.

The other 2 known ones have passed.

JYNNN, a C172, was delivered to Bournemouth back in 2020 and tripped our system (which was rectified fairly quickly).

MNNNN, a Gulfstream 6: this was registered by a Russian on 16/7/2014. It became a regular visitor to the UK and I exchanged many e-mails with the operator, pointing out that their ID was going to be a problem at all airports, as the flight plan could not reach addresses, STOPPING after MNNNN.
The owner re-registered the aircraft MNGNG.

The fact our system stops after 3 N's is strange

Other countries have quirks: the FAA system does not like callsigns beginning with a number, as I was advised by a colleague in the FAA while carrying out an investigation, which is awkward for Barbados (8P-). This may have been noticed at airports with 8P-ASD, a Gulfstream 6, which will often file as "X8PASD", as does the Malaysian Government 9MNAA A320, which files with a letter.
Luckily most Maltese bizjets fly with an RTF tri-graph, i.e. VJT - Vistajet; the same goes for the large number of Guernsey aircraft.
This may have been sorted but the owner of 8PASD carries on.
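A rough illustration of why a registration containing that run of N's is dangerous in this environment (Python; a toy version of AFTN-style framing, not the actual UK software): a naive scan for the end-of-message signal truncates the message at the callsign, whereas framing on a whole line does not.

```python
# Illustrative only: toy AFTN-style message framing.
RAW = "ZCZC ABC123\n(FPL-MNNNN-IG ...route and other fields...)\nNNNN"

def naive_message_text(raw: str) -> str:
    # Treat the first occurrence of "NNNN" as end-of-message: the
    # registration MNNNN contains that run, so the text is cut short.
    return raw[:raw.find("NNNN")]

def framed_message_text(raw: str) -> str:
    # Safer: only treat NNNN as the terminator when it stands alone on a line.
    body = []
    for line in raw.splitlines()[1:]:        # drop the ZCZC heading line
        if line.strip() == "NNNN":
            break
        body.append(line)
    return "\n".join(body)

print(repr(naive_message_text(RAW)))    # truncated inside the callsign
print(repr(framed_message_text(RAW)))   # the full flight plan text
```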

eglnyt
31st Aug 2023, 15:49
You mean that there's no automated testing? No regression testing? And the system has never been tested to its design limit?
Probably not the system that caused an issue on Monday but the investigation into that incident discussed the suitability of the Test Regime.

The report is here: https://www.caa.co.uk/media/r42hircd/nats-system-failure-12-12-14-independent-enquiry-final-report-2-0-1.pdf
Section 2.5 and G4 are the relevant parts

CBSITCB
31st Aug 2023, 15:53
Beginning in 2011, after several decades of abortive attempts, the FAA successfully introduced a new ground-up FDPS that included replacement of the 1960s En Route Stage A software. This was the $2.1 billion ERAM (En Route Automation Modernization) system. Roll-out to all the US centres took several more years.

Despite the presumed use of modern software engineering techniques by Lockheed Martin, the new FDPS is not immune to latent bugs in its flight plan processing software. To illustrate some of the complexity involved, and the impossibility of testing for every possible failure mode, here is an account of a FDPS system crash at the Los Angeles centre in 2014 (edited for brevity):

ERAM has a capability called “look-ahead" which searches for potential conflicts between aircraft based on their projected course, speed, and altitude. Because of the computing requirements for handling look-ahead for all flights within a given region of controlled airspace, Lockheed Martin designed the system to limit the amount of data that could be input by air traffic controllers for each flight. And since most flights tend to follow a specific point-to-point course or request operation within a limited altitude and geographic area, this hadn't caused a problem for ERAM during testing.

A flaw in the system was exposed when a U-2 spy plane entered the air traffic zone managed by the system in Los Angeles. The aircraft had a complex flight plan, entering and leaving the zone of control multiple times. On top of that, the data set for the U-2 flight plan came close to the size limit for flight plan data imposed by the design of the ERAM system. Even so, the flight plan data lacked altitude data, so it was manually entered by an air traffic controller as 60,000 feet.

The system evaluated all possible altitudes along the U-2's planned flight path for potential collisions with other aircraft. That caused the system to exceed the amount of memory allotted to handling the flight's data, which in turn resulted in system errors and restarts. It eventually crashed the ERAM look-ahead system, affecting the FAA's conflict-handling for all the other aircraft in the zone controlled out of its Los Angeles facility.

As a result, facility managers declared ATC Zero, which suspended operations and cleared the Centre's airspace. The event impacted air traffic operations with over 400 flight delays reported throughout the NAS and as many as 365 cancellations just in the Los Angeles Centre airspace alone. According to FAA the event lasted for about 2 hours, but the impact on the traveling public throughout the National Airspace System lasted for over 24 hours.
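A very rough sketch of the failure mode in that account (Python; the numbers and limit are invented, not ERAM's actual design): the look-ahead work grows with the number of boundary crossings and the altitude band being evaluated, and without a cap one flight can blow through its allotment.

```python
# Illustrative only: toy look-ahead whose working set grows with the number
# of sector crossings and the altitude band to be checked.
MEMORY_LIMIT = 100_000           # pretend per-flight allotment, arbitrary units

def lookahead_cells(crossings: int, min_fl: int, max_fl: int) -> int:
    levels = (max_fl - min_fl) // 10 + 1       # flight levels to evaluate
    return crossings * levels * 50             # 50 cells per crossing per level

def evaluate(callsign: str, crossings: int, min_fl: int, max_fl: int) -> None:
    cells = lookahead_cells(crossings, min_fl, max_fl)
    if cells > MEMORY_LIMIT:
        # Graceful option: shed the look-ahead for this one flight and flag it
        # for manual attention, rather than letting the whole subsystem restart.
        print(f"{callsign}: look-ahead too large ({cells} cells), degraded handling")
    else:
        print(f"{callsign}: look-ahead OK ({cells} cells)")

evaluate("AAL123", crossings=4,  min_fl=280, max_fl=380)   # routine airliner
evaluate("U2",     crossings=40, min_fl=0,   max_fl=600)   # complex plan, altitude assumed
```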

Engineer39
31st Aug 2023, 16:44
You mean that there's no automated testing? No regression testing? And the system has never been tested to its design limit?

Of course there is lots of testing. I am not a software engineer so can’t state the specifics. I’m just saying that, like in all complex software, there is no way to exercise every single combination of inputs to find out if one combination of inputs that crops up once in 10 or 100 years will trigger something undesirable. To use an analogy would a car manufacturer test a car at 12,000 feet altitude with 2 adults and 2 children on board when it is raining and the tyres are a little flat? He will do a test for each case separately but not all in combination. Maybe in 10 years’ time someone who met those conditions would find that the car was a bit unstable and slide off the road; then he tries to blame the manufacturer for not testing that condition.

Gupeg
31st Aug 2023, 20:37
I am not a software engineer so can’t state the specifics.
Understood

I’m just saying that, like in all complex software, there is no way to exercise every single combination of inputs to find out if one combination of inputs that crops up once in 10 or 100 years will trigger something undesirable.
As detailed above, a "test mode" (that does not require real hardware / personnel inputs) can simulate to a module a wide range of inputs, but crucially, testing at the limits.

To refer to the 2014 case, there was a specified limit of 193 "atomic functions". There appears to have been no testing done to stress it at 193, or of how it dealt with an input of 194. Such testing could have revealed not only that the limit was erroneously at 152 (or 153?), but also that on the increment to 154 it does not reject the change to 154 - it falls over.

The major error seems to be the upgrade shortly before, adding potential (military) inputs. A correct upgrade process should have identified those added inputs, and stress tested the system with those added inputs against maximum previous inputs. Had the testing been explored it hopefully would have included flexing the civil inputs as well. The testing is not just to test the added inputs (which hopefully the upgrade design should allow for), but expose any latent errors that "survived" until now but failed in these conditions (as happened).

I suspect the real problem is a system that has grown over decades, with multitudes of different sub-modules from different suppliers / languages / standards, which sort of "works" until some combination of inputs / events causes an inappropriately handled exception (as in 2014 / Monday?). Yes - good to then close the loophole (as in the 2014 CAA report), but that does nothing to identify / remove the next latent error. The 2014 report seems rather self-congratulatory on the CAA's part, about how software errors are hard to detect, rather than discussing a system to find the errors before real-life carnage.
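To make the stress-testing point concrete, a minimal sketch (Python; the limits and function names are hypothetical, borrowing the 193 figure from the 2014 report's terms) of the kind of boundary test being described - exercising the specified limit and one beyond it, with no physical workstations needed:

```python
# Illustrative only: a toy capacity check plus the boundary tests that would
# have exposed an erroneous internal limit.
SPECIFIED_LIMIT = 193
_INTERNAL_LIMIT = 153            # the latent, wrong constant buried in old code

def activate_workstations(n: int) -> int:
    if n > _INTERNAL_LIMIT:      # should have been SPECIFIED_LIMIT
        raise SystemError("capacity table overflow")
    return n

def test_at_specified_limit():
    # The system must cope with the number it was specified for.
    assert activate_workstations(SPECIFIED_LIMIT) == SPECIFIED_LIMIT

def test_one_beyond_limit_is_rejected_gracefully():
    try:
        activate_workstations(SPECIFIED_LIMIT + 1)
    except ValueError:
        return                   # a graceful rejection would be acceptable
    except SystemError:
        raise AssertionError("hard crash instead of graceful rejection")

if __name__ == "__main__":
    for test in (test_at_specified_limit, test_one_beyond_limit_is_rejected_gracefully):
        try:
            test()
            print(test.__name__, "PASSED")
        except Exception as exc:     # both tests fail against the buggy toy
            print(test.__name__, "FAILED:", exc)
```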

Neo380
31st Aug 2023, 20:39
Of course there is lots of testing. I am not a software engineer so can’t state the specifics. I’m just saying that, like in all complex software, there is no way to exercise every single combination of inputs to find out if one combination of inputs that crops up once in 10 or 100 years will trigger something undesirable. To use an analogy would a car manufacturer test a car at 12,000 feet altitude with 2 adults and 2 children on board when it is raining and the tyres are a little flat? He will do a test for each case separately but not all in combination. Maybe in 10 years’ time someone who met those conditions would find that the car was a bit unstable and slide off the road; then he tries to blame the manufacturer for not testing that condition.

That's really missing the point, as has been said a number of times.

This isn't an 'infinitesimal circumstance', that could never be tested for (unless all human inputs have become 100% reliable, which they are not, and never can be).

This is all about how a fail over works, or doesn't work to be more precise. The system should have switched to an iteration of the software that wasn't identical, so wasn't bound to immediately fail again. Because if it does, the system has no fail safe and is bound to collapse, catastrophically.

That is the issue that is being skirted around, and was the core fault of the 2014 failure - very bad IT architecture.

Describing 'infinitesimal circumstances' and the '100 years of testing that couldn't identify them' has nothing to do with building fail overs that fail again, and then fail over again, through poor design.

Ninthace
31st Aug 2023, 21:01
It would be very hard to develop two systems of that complexity that did the same thing in a different way and keep them both current with user requirements. You could have a current version and a version that is one iteration older as a back up. That would protect you from failure arising from upgrades and changes but it is unlikely to protect the system from a rare, untested error as both iterations are likely to be carrying the same fault in the software if they have both been running successfully for many years.

Neo380
31st Aug 2023, 21:13
Whilst I agree with what you say, this isn't about building two entirely similar, but subtly different, systems. It's about how the fail over works: if the system is allowed to crash in the event of this very rare but erroneous data being inputted, then what happens? It goes offline for three hours, we then wait another four hours for the erroneous data to be 'washed out of the system', aka cleared from local memory, and then, seven hours later, we start clearing the backlog of aircraft, which will take a week?
Safety and mission critical systems - think power systems or train signalling, or even mobile network operations - just can't work like that. The logic has to be 'if route A fails, switch to route B' (which can't just be a carbon copy of route A); that should allow the erroneous data to be isolated without crashing the entire system. And in the (really!) very exceptional circumstance that both route A and route B end up failing at the same time, there should be a route C fail over to cover that contingency as well.

Just having three identical route As is asking to crash the system, which has now happened, repeatedly.
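
To illustrate the failover argument above in code: below is a minimal, hypothetical Python sketch - not a description of any real NATS design - of the difference between replaying the failing input into an identical standby and quarantining it before carrying on. All names and messages are invented.

class ReplicaCrash(Exception):
    pass

def identical_replica(message: str) -> str:
    # "Primary" and "standby" run exactly the same logic, so a poison message
    # that crashes one is guaranteed to crash the other in the same way.
    if "POISON" in message:
        raise ReplicaCrash("unhandled input: " + message)
    return "processed " + message

def naive_failover(messages):
    # Replays the whole feed, including the message that just crashed the
    # primary, into an identical standby: the standby dies too.
    for replica in ("primary", "standby"):
        try:
            return [identical_replica(m) for m in messages]
        except ReplicaCrash as exc:
            print(f"{replica} crashed: {exc}")
    raise SystemExit("total outage: every replica failed on the same input")

def quarantining_failover(messages):
    # Isolates the offending message for manual review and keeps going,
    # so one bad input degrades the service instead of collapsing it.
    processed, quarantined = [], []
    for m in messages:
        try:
            processed.append(identical_replica(m))
        except ReplicaCrash:
            quarantined.append(m)
    return processed, quarantined

if __name__ == "__main__":
    traffic = ["FPL1", "FPL2", "POISON-FPL", "FPL3"]
    print(quarantining_failover(traffic))   # degrades gracefully
    naive_failover(traffic)                 # ends in a total outage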

CBSITCB
31st Aug 2023, 21:44
The logic has to be 'if route A fails, switch to route B' (which can't just be a carbon copy of route A)
There was a ‘Route B' - manual ops. Very different, and safe (a point that NATS labours). Nobody died.

Neo380
31st Aug 2023, 21:48
Manual ops is only a fail safe if you expect your entire (mission critical!) ATC system to catastrophically collapse - by which time 'mission critical' no longer applies.

Neo380
31st Aug 2023, 22:47
The scenario you quoted wasn't an extreme edge case. The system was specified to be able to deal with 193 controllers but was only tested with 130. And broke at 153.

To use your analogy that's buying a 4 seater car and only bothering to check that there are 3 seats in it.

To sharpen the analogy, the system was specified for 193 controllers, was tested for 130, and broke at 153 (unsurprisingly), and then immediately broke again, and then immediately broke again (this is the piece that should be safeguarded against), leading to a complete system collapse, which should absolutely never happen.

MarcK
1st Sep 2023, 01:11
AFTN Message Format
The Message Text ends with the End-of-Message Signal, which is the four characters NNNN. The Ending itself comprises twelve letter shift signals, which also represent a Message-Separation Signal.
twelve "letter shift signals"? We're talking Baudot teletype code (5-level, ITA2) here, which has been obsolete for about 60 years.If the system had upgraded to 8-bit codes you could use the specalized message separation characters SOH, STX/ETX, ETB/EOT, not to mention FS/GS/RS/US. I suppose no one wanted to take the hit for suggesting such a radical change.

pax2908
1st Sep 2023, 07:25
Re. #179 (Los Angeles 2014). I would like to try to read the full report if available somewhere?

This in particular:
... exceeded the amount of memory allotted to handling the flight’s data, which in turn resulted in system errors and restarts ...

OK, no good protection against excessive memory usage, resulting in system restarts ... But what seems even worse is the part about "system restartS" (more than once). It suggests that it can be acceptable to have multiple restarts in a row before some other action is taken (such as identifying which data input is causing the problem). This, in turn, makes me wonder how many such self-restarts occur on a regular basis ... these would normally be analyzed and their cause addressed, even in the absence of a visible incident?
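
On the repeated restarts: a common supervision pattern - sketched hypothetically below in Python, with invented names, and not a claim about how ERAM or any NATS system actually behaves - is to count restarts within a time window and escalate for human attention rather than retry the same bad input indefinitely.

import time

class Supervisor:
    def __init__(self, max_restarts=3, window_s=60.0):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.restart_times = []

    def run(self, process, work_item):
        while True:
            try:
                return process(work_item)
            except Exception as exc:
                now = time.monotonic()
                # Keep only the restarts that happened within the recent window.
                self.restart_times = [t for t in self.restart_times
                                      if now - t < self.window_s]
                self.restart_times.append(now)
                if len(self.restart_times) > self.max_restarts:
                    # Crash-looping: stop retrying and flag the input for humans.
                    raise RuntimeError(
                        f"escalating after repeated restarts on {work_item!r}: {exc}")
                print(f"restarting after failure: {exc}")

def flaky_processor(item):
    # Stand-in for a process that always dies on the same problematic input.
    raise MemoryError("allotted memory exceeded for " + item)

if __name__ == "__main__":
    try:
        Supervisor().run(flaky_processor, "problem flight plan")
    except RuntimeError as exc:
        print(exc)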

golfbananajam
1st Sep 2023, 09:36
That's really missing the point, as has been said a number of times.

This isn't an 'infinitesimal circumstance', that could never be tested for (unless all human inputs have become 100% reliable, which they are not, and never can be).

This is all about how a fail over works, or doesn't work to be more precise. The system should have switched to an iteration of the software that wasn't identical, and so wasn't bound to immediately fail again. Because if it does, the system has no fail safe and is bound to collapse, catastrophically.

That is the issue that is being skirted around, and was the core fault of the 2014 failure - very bad IT architecture.

Describing 'infinitesimal circumstances' and the '100 years of testing that couldn't identify them' has nothing to do with building fail overs that fail again, and then fail over again, through poor design.


Please see my post #74. Software testing is costly both in terms of time and resource. For complex systems it is impossible to test every combination of input data to see what fails. Automated testing is also not the panacea that many think it is. To get a test script automated, you end up manually running it to make sure the element of software under test works. Once you have a test that passes, you run it again, this time using the auto test suite to record the steps you take. Once you've done that, you then run a confirmatory test. So for every element of the requirement you end up with at least three runs of a single test script (which can have many stages).

Then the developer has an update to code, so your automated test fails, then you start all over again.

The problem with old and complex systems is that updates and improvements are usually a bolt-on to the original; it isn't very often you redesign from a clean sheet of paper. The result is that you end up testing the areas that your update has "touched", with a quick sanity regression test of the main use cases. You just don't have the time, resource or money to fully test everything each time an update is carried out.

Even then, there will be an edge case you just don't consider, or haven't even thought of, or have dismissed as "won't ever happen" because of checks in other systems that you use as a data source, where you assume that the input data has been properly validated and is therefore correct.
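
As a toy illustration of that last point (all names invented, nothing to do with any NATS code): a regression suite only ever covers the cases someone thought to write down, and the surprises live in the combinations nobody listed.

import unittest

def validate_callsign(callsign: str) -> bool:
    # Accepts 2-7 alphanumeric characters, e.g. "BAW123".
    return callsign.isalnum() and 2 <= len(callsign) <= 7

class TestValidateCallsign(unittest.TestCase):
    def test_typical_callsign(self):
        self.assertTrue(validate_callsign("BAW123"))

    def test_too_long(self):
        self.assertFalse(validate_callsign("ABCDEFGH"))

    def test_empty(self):
        self.assertFalse(validate_callsign(""))

    # Nothing here exercises, say, lower-case input or non-Latin digits:
    # the untested combinations are exactly where the surprises live.

if __name__ == "__main__":
    unittest.main()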

CBSITCB
1st Sep 2023, 10:47
In the case of the 2014 Los Angeles failure, the crash stemmed from “The system evaluated all possible altitudes along the U-2’s planned flight path for potential collisions with other aircraft.”

Say there were 100 aircraft in the LA centre’s airspace, all (obviously!) in different places, going in different directions, at different speeds and climbing/descending. How can you possibly write a test case that exercises all possibilities?
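
You cannot enumerate every traffic picture, but one standard mitigation is randomised, property-based testing: generate thousands of random scenarios and assert that the probe never throws. The Python sketch below is a toy with invented data - not the FAA or NATS conflict logic - and shows how an occasional missing altitude surfaces an unhandled case.

import random

def conflict_probe(tracks):
    # Toy probe: flags pairs closer than 5 units horizontally and 1000 ft
    # vertically. A real probe is far more complex.
    conflicts = []
    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            (x1, y1, alt1), (x2, y2, alt2) = tracks[i], tracks[j]
            if abs(x1 - x2) < 5 and abs(y1 - y2) < 5 and abs(alt1 - alt2) < 1000:
                conflicts.append((i, j))
    return conflicts

def random_track(rng):
    # Occasionally emit a missing altitude (None), mimicking a flight plan
    # with no altitude data, to see how the probe copes.
    alt = None if rng.random() < 0.01 else rng.randint(0, 60000)
    return (rng.uniform(-100, 100), rng.uniform(-100, 100), alt)

if __name__ == "__main__":
    rng = random.Random(0)
    for trial in range(1000):
        tracks = [random_track(rng) for _ in range(rng.randint(1, 100))]
        try:
            conflict_probe(tracks)
        except Exception as exc:
            print(f"trial {trial}: probe failed with {exc!r}")
            break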

CBSITCB
1st Sep 2023, 10:52
Re. #179 (Los Angeles 2014). I would like to try to read the full report if available somewhere?

I’m not aware of a detailed report (I haven’t searched for one). I used these two sources:

https://www.oig.dot.gov/sites/default/files/FAA%20Actions%20to%20Address%20ERAM%20Outages%20Final%20Report%5E11-07-18.pdf

https://arstechnica.com/information-technology/2014/05/u-2s-flight-plan-was-like-malware-to-faa-computer-system/

paulross
1st Sep 2023, 14:04
The scenario you quoted wasn't an extreme edge case. The system was specified to be able to deal with 193 controllers but was only tested with 130. And broke at 153.

To use your analogy that's buying a 4 seater car and only bothering to check that there are 3 seats in it.

There was a little bit more to it than that. The other issue at play was that the controller had made a mode error in selecting a soft key that put them in "Watching Mode" (a rare and obsolete mode), and only then did the comparison of 153 against the 151 limit (in a different code path) fail. It was the combination of errors, both in software and by the operator, that on their own were inconsequential but when combined became significant.

The final report paras ES8. and ES9. give an introduction to this. The report then goes on to look at why this mode was still present, how this (understandable) mode error could have been detected (it was being selected accidentally almost every other day) or prevented and the trade-offs in testing and so on.

Much of software testing is about using your imagination; "what can go wrong?" so the 2014 failure could be regarded as failure in imagination.
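
A deliberately over-simplified Python sketch of the kind of latent wrong-limit check being described, using the two limits quoted in this thread (151 civil, 193 total); the code itself is invented and is not the actual NERC logic.

MAX_CIVIL_ROLES = 151   # limit intended for civil controller positions only
MAX_TOTAL_ROLES = 193   # limit intended for civil + military positions

class CapacityError(Exception):
    pass

def register_roles(civil: int, military: int, watching: int) -> int:
    total = civil + military + watching
    # Latent bug: the total is compared against the civil-only limit. For as
    # long as the total never exceeds 151, the error is never exercised.
    if total > MAX_CIVIL_ROLES:        # should have been MAX_TOTAL_ROLES
        raise CapacityError(f"{total} roles exceeds configured maximum")
    return total

if __name__ == "__main__":
    print(register_roles(civil=120, military=10, watching=0))   # 130: fine for years
    try:
        register_roles(civil=130, military=20, watching=3)      # 153: latent bug fires
    except CapacityError as exc:
        print("dormant for years, then:", exc)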

Gupeg
1st Sep 2023, 15:25
There was a little bit more to it than that. The other issue at play was that the controller had made a mode error in selecting a soft key that put them in "Watching Mode" (a rare and obsolete mode) and only then did the comparison 153 < 151 (in a different code path) fail. It was the combination of errors both in software and by the operator that, on their own were inconsequential, but when combined became significant.

I think this is rather a favourable way of looking at it :eek:

To me the real cause of the failure was introducing new software, onto both SFS servers, that had not been adequately tested (or rather, the testing had not been adequately specified). The inadequacy of that testing was shown by the fact that, whether or not "Watching Mode" needed to be selected, it took only one day for the "new" software to bring down UK ATC for a period :ugh:

The report refers to "needles" and "haystacks" and how hard it is to find errors, including latent errors (as here from maybe 20 years earlier). However, the upgrade is described as being specifically to "add military controller roles". Therefore, to me, in addition to whatever normal test functions an upgrade requires, specific testing should have been specified "stress testing" the number of workstations. The testing should be intended to verify not only the upgrade changes, but the whole system to expose (as here) related latent errors that had been "got away with" to date - especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military).

The bigger picture is should the upgrade have been debugged on a live system? Or a test system? NATS of course will keep banging the safety drum which might be accurate, but irrelevant. It is whether the airline industry, travelling public and Govt find it acceptable for the system to grind to a halt every 10 years or so while latent errors are worked out. If it is not acceptable, then a different (and no doubt more costly) approach is required... We'll later see if the report on Monday's issue has any parallels?

Ninthace
1st Sep 2023, 18:41
Welcome to the real world of IT. The bean counters and the managers want it yesterday, the developers want it forever, everything is a trade off in the end. The trick is to know when to stop tweaking and testing and actually deliver.

Neo380
1st Sep 2023, 21:04
Please see my post #74. Software testing is costly both in terms of time and resource. For complex systems it is impossible to test every combination of input data to see what fails. Automated testing is also not the panacea that many think it is. To get a test script automated, you end up manually running it to make sure the element of software under test works. Once you have a test that passes, you run it again, this time using the auto test suite to record the steps you take. Once you've done that, you then run a confirmatory test. So for every element of the requirement you end up with at least three runs of a single test script (which can have many stages).

Then the developer has an update to code, so your automated test fails, then you start all over again.

The problem with old and complex systems, is that updates and improvements are usually a bolt-on to the original, it isn't very often you redesign from a clean sheet of paper. The result is that you end up testing the areas that your update has "touched" with a quick sanity regression test of the main use cases. You just don't have the time, resource or money to fully test everything each time an update is carried out.

Even then, there will be an edge case you just don't consider, or haven't even thought of, or have dismissed as "won't ever happen" because of checks in other systems that you use as a data source, where you assume that the input data has been properly validated and is therefore correct.

Inputting the wrong data - often as little as a missed full stop - is not an 'edge case', actually it's normal human behaviour. This has nothing to do with fail overs that don't work.

Neo380
1st Sep 2023, 21:22
I've reread #74 and concur! We are not trying to test every combination of variables, like the U2 flight plan (with no altitude data!) and its impact on the FAA system.

I agree that task is never ending. But you say it yourself: "failure testing is often limited to defined alternate path (within the software) testing". That path CAN'T be the already failed path, because it's bound to fail again. Especially if the circumstances are more operators than the system was stress tested for, many in new (military) roles. This is the smoking gun, and the cover up (or at least what isn't being discussed) is the lack of alternate paths.

You go on "critical systems like this should ALWAYS [my emphasis] fail safe [that's what I've been saying!], ie reject any invalid input , or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time'. EXACTLY. All this talk about edge cases, and French data etc etc is really just BS...

"Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash'. But it has, repeatedly now. I wonder about BC testing too!

Engineer39
1st Sep 2023, 21:46
Ref: The major error seems to be the upgrade shortly before, adding potential (military) inputs. A correct upgrade process should have identified those added inputs, and stress tested the system with those added inputs against maximum previous inputs.

I don't know the details but suspect that it was thought that the few added military consoles would still be well within the limits. It was not at that time recognised that leaving consoles in watching mode made the software see more than the limits. Thus they did not test for the max number of consoles plus the max number of watching consoles. A failure of imagination, or just a complete lack of knowledge of what watching consoles did??

Dr Jekyll
2nd Sep 2023, 07:48
I've reread #74 and concur! We are not trying to test every combination of variables, like the U2 flight plan (with no altitude data!) and its impact on the FAA system.

I agree that task is never ending. But you say it yourself "failure testing is often limited to defined alternate path (within the software) testing" that path CAN'T be the already failed path, because it's bound to fail again. Especially if the circumstances are more operators than the system was stress tested for, many in new (military) roles. This is the smoking gun, and the cover up (or at least not being discussed) the lack of alternate paths.

You go on "critical systems like this should ALWAYS [my emphasis] fail safe [that's what I've been saying!], ie reject any invalid input , or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time'. EXACTLY. All this talk about edge cases, and French data etc etc is really just BS...

"Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash'. But it has, repeatedly now. I wonder about BC testing too!

There are cases where one invalid or rejected input means subsequent inputs cannot be processed properly, e.g. running totals or counts may be inaccurate. Certainly in the case of a control system it's generally better to keep going, but from the developer's point of view it isn't always clear whether it's a 'keep running regardless' scenario or a 'once you're on the wrong line every station is likely to be the wrong station' scenario.
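
A small Python sketch of that trade-off, with invented data: skipping the bad record keeps the system running, but any state that assumes every record was seen - here a running total - quietly drifts, while the fail-stop variant protects the totals at the cost of stopping.

def process_stream(records, keep_running: bool):
    total, processed, rejected = 0, 0, []
    for rec in records:
        try:
            total += int(rec)          # stand-in for "update shared state"
            processed += 1
        except ValueError:
            if not keep_running:
                raise                  # fail-stop: nothing after this is processed
            rejected.append(rec)       # keep running: quarantine and carry on
    return {"total": total, "processed": processed, "rejected": rejected}

if __name__ == "__main__":
    stream = ["10", "20", "oops", "30"]
    print(process_stream(stream, keep_running=True))
    try:
        print(process_stream(stream, keep_running=False))
    except ValueError as exc:
        print("fail-stop:", exc)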

Neo380
2nd Sep 2023, 08:24
There are cases where one invalid or rejected input means subsequent inputs cannot be processed properly, e.g. running totals or counts may be inaccurate. Certainly in the case of a control system it's generally better to keep going, but from the developer's point of view it isn't always clear whether it's a 'keep running regardless' scenario or a 'once you're on the wrong line every station is likely to be the wrong station' scenario.

So in mission critical systems it's like this: a car breaks down at the traffic lights - it happens, even if the car has already been checked - but the traffic lights shouldn't then fail across the entire city, leaving every road crossing to be handled manually. Moreover, you've got a car blocking the junction now, so there's only one thing you can do, and that's reroute the traffic around the obstacle. That's a fail safe, and you normally need two of them, for fairly obvious reasons: the second route is likely to come under pressure pretty fast too. But don't ever, ever, just assume that you can push the traffic through a blocked route - that's what causes the system to crash. This has NOTHING to do with 'the chances of your car breaking down', especially, coming back to reality, when we know this issue is highly likely to be attributable to human error, ie faulty data input. And that's before adding all the military traffic and not stress testing the system properly, ever, it seems.
The key characteristics of this incident seem to be lack of competence and wishful thinking. Only saved, btw, because 'the car was eventually moved out of the way', and the only route available was restarted.

Hartington
2nd Sep 2023, 08:42
A good few years ago I was testing a piece of commercial, non safety critical, software. It failed at a specific point in a way I considered "interesting". I described the failure to the programmer. He looked quizzical and said "I wondered if it would do that".

Then there was a recurring fault. It happened in client systems all over the country. Nobody experienced it frequently or consistently. In fact, across the country, it happened about twice a year. Most people never had the problem. Try as we might, we never got to the bottom of it (believe me, we really tried).

Software is written by humans, tested by humans (test scripts for automated systems are written by humans) and used by humans. Humans are error prone and, in the end, it means software will be error prone.

eglnyt
2nd Sep 2023, 09:12
I think this is rather a favourable way of looking at it :eek:

To me the real cause of the failure was introducing new software, onto both SFS Servers, that had not been adequately tested (or rather the testing had not been adequately specified). The inadequacy of that testing was shown, whether by "Watching Mode" being needed / selected, it took only one day for the "new" software to then bring down UK ATC for a period :ugh:

We continue to discuss an earlier failure on a system that almost certainly wasn't the one involved in this case, although of course we currently don't know which system it was.

It wasn't new software. It was the original software; it had been there for years. The change introduced was to start using it nearer the limits of the system, of which there were two: 151 civil positions and 193 overall. The verification of those limits and acceptance of them happened years before. To use the poor analogy previously introduced, it is akin to buying a 5 seat car, only using 4 seats for several years and one day having a need to use all 5. In my case discovering that, if isofix is in use on 2 of the seats, it is actually a 4.5 seat car not 5.

Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only time that could be done was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable resources available at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that, but this is a system developed long before that was an option. And a simple test up to 193 would not have uncovered the issue: you would need to invoke watching mode when more than 151 were in use; any other mode added above 151 would not have triggered the issue. If your aim was to fully stress the system it is likely that you would have invoked the more demanding modes to do that.

Should they have spotted the error on code review? This is a bad case for humans. There are two limits in use. I'd probably spot a completely incorrect limit but I'd be far less likely to spot that the wrong one was being used.

Should SFS have 2 completely different sets of software so an error would only affect one? Ideally yes, but as I've said before that is also impractical. The supplier struggled to produce one set of software in the timescale and cost originally estimated. Even if you doubled your estimate, producing two would, in the end, cost considerably more than twice as much, even if you ever actually managed to deliver.

Business criticality is a different matter from safety criticality, but for all systems in the flight data thread you can make an adequate safety case with redundancy provided by an identical system, provided you have a means of ensuring that, at all times from inception of failure, you can safely handle the level of traffic that might be present. In the case of Monday, the level of traffic at failure was safely handled and the reduction of traffic as data degraded ensured that continued to be the case.

If your safety case is made then business criticality becomes purely a matter of cost benefit.

eglnyt
2nd Sep 2023, 09:30
We'll later see if the report on Monday's issue has any parallels?

The review report took several months to amass the evidence and compile the report. I doubt very much that anything that comes out on Monday will be sufficient to decide if there are any parallels.

I'm expecting, hoping for, an identification of the system that failed, probably accompanied by a poor description of its function and how it works; identification of this apparently errant data, and in particular why the NATS system didn't like it; and a timeline of events, hopefully with a hint as to why it took as long as it did to fix - a bonus if it explains how long they previously thought it would take.

That should be sufficient to keep PPRUNE going for a while based on how much discussion we have had in the absence of any of that. I then expect there to be an inquiry of some sort because the Secretary of State has to be seen to be doing something. In the Autumn once MPs come back from their lengthy holidays and have finished their conference season there will probably be a Select Committee hearing because they also want to be seen to be doing something.

In the meantime out of sight somebody will fix the issue, hopefully properly, and wait for the next one in 10 years.

paulross
2nd Sep 2023, 10:24
We continue to discuss an earlier failure on a system that almost certainly wasn't the one involved in this case, although of course we currently don't know which system it was.

It wasn't new software. It was the original software; it had been there for years. The change introduced was to start using it nearer the limits of the system, of which there were two: 151 civil positions and 193 overall. The verification of those limits and acceptance of them happened years before. To use the poor analogy previously introduced, it is akin to buying a 5 seat car, only using 4 seats for several years and one day having a need to use all 5. In my case discovering that, if isofix is in use on 2 of the seats, it is actually a 4.5 seat car not 5.

Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only time that could be done was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable resources available at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that, but this is a system developed long before that was an option. And a simple test up to 193 would not have uncovered the issue: you would need to invoke watching mode when more than 151 were in use; any other mode added above 151 would not have triggered the issue. If your aim was to fully stress the system it is likely that you would have invoked the more demanding modes to do that.

Should they have spotted the error on code review? This is a bad case for humans. There are two limits in use. I'd probably spot a completely incorrect limit but I'd be far less likely to spot that the wrong one was being used.

Should SFS have 2 completely different sets of software so an error would only affect one. Ideally yes but as I've said before that is also impractical. The supplier struggled to produce one set of software in the timescale and cost originally estimated. Even if you doubled your estimate producing two would, in the end, cost considerably more than twice as much even if you managed to ever actually deliver.

Business criticality is a different matter from safety criticality but for all systems in the flight data thread you can make an adequate safety case with redundancy provided with an identical system provided you have a means of ensuring that, at all times from inception of failure, you can safely handle the level of traffic that might be present. In the case of Monday the level of traffic at failure was safely handled and the reduction of traffic as data degraded ensured that continued to be the case.

If your safety case is made then business criticality becomes purely a matter of cost benefit.

Agreed.

I often refer people to this article when asked "why do you ship software with bugs?":
Short form: https://www.theguardian.com/technology/2006/may/25/insideit.guardianweeklytechnologysection
The long(er) form original article: https://ericsink.com/articles/Four_Questions.html

Neo380
2nd Sep 2023, 11:33
A good few years ago I was testing a piece of commercial, non safety critical, software. It failed at a specific point in a way I considered "interesting". I described the failure to the programmer. He looked quizzical and said "I wondered if it would do that".

Then there was a recurring fault. It happened in client systems all over the country. Nobody experienced it frequently or consistently. In fact, across the country, It happened about twice a year. Most people never had the problem. Try as we may we never got to the bottom of it (believe me, we really tried).

Software is written by humans, tested by humans (test scripts for automated systems are written by humans) and used by humans. Humans are error prone and, in the end, it means software will be error prone.
I think you said it, 'not in mission critical systems'. Would you build a management system for a nuclear reactor this way? No.

Neo380
2nd Sep 2023, 11:42
That all sounds realistic. I'm just surprised at the apparent expectation of having no fail-safes in a mission critical system, but perhaps they were deemed unnecessary at the time, with 1960s technology, because the old system never failed over (except that's now happened a couple of times)?

eglnyt
2nd Sep 2023, 11:51
I think you said it, 'not in mission critical systems'. Would you build a management system for a nuclear reactor this way? No.
What you may or may not do for a nuclear reactor control system isn't really relevant unless what you are building has hazards with the same order of harm. The design of your system and the rigour of the processes surrounding that design and build have to be consistent with the level of harm not the best practice employed where the level of harm is very high.

Neo380
2nd Sep 2023, 12:50
What you may or may not do for a nuclear reactor control system isn't really relevant unless what you are building has hazards with the same order of harm. The design of your system and the rigour of the processes surrounding that design and build have to be consistent with the level of harm not the best practice employed where the level of harm is very high.

That’s fine, if you want to run the National Airspace System (which is defined as ‘critical national infrastructure’, btw) on manual every time your computer trips out, then no issues.

But when you come back to the public purse (as you will, despite what you might read here) and ask for £1bn+ for a new system ‘for the safe and smooth running of the UK’s CNI’, expect to be reminded of what you said.

Btw, this is the same organisation that in RP3, so only a couple of years ago, said ‘accepting any performance improvements (the point of reporting periods!) was against the business’s and national interest’.

If I was on the Parliamentary Committee I would ensure that sort of comment (and several of the above) was recorded as ‘breathtakingly arrogant!’

eglnyt
2nd Sep 2023, 13:35
That’s fine, if you want to run the National Airspace System (which is defined as ‘critical national infrastructure’, btw) on manual every time your computer trips out, then no issues.

But when you come back to the public purse (as you will, despite what you might read here) and ask for £1bn+ for a new system ‘for the safe and smooth running of the UK’s CNI’, expect to be reminded of what you said.

Btw, this is the same organisation that in RP3, so only a couple of years ago, said ‘accepting any performance improvements (the point of reporting periods!) was against the business’s and national interest’.

If I was on the Parliamentary Committee I would ensure that sort of comment (and several of the above) was recorded as ‘breathtakingly arrogant!’
So this future demand on the public purse is invented by you? The last £1bn didn't come from the public purse so why should the next?
If anything like last time, I would expect the Committee to be hell bent on making sound bites, most of which have no relevance to the situation, rather than actually discussing the issue at hand. Last time they continually demanded the CEO promise something neither he nor anybody else could promise, and contrived to get him replaced when he stuck to the only realistic response.
The RPs are a negotiation between multiple parties and I'd expect NATS to forcefully fight its corner. You might consider that arrogant but actually as directors of a PLC the board is probably legally required to do so.
As I said before, once the safety case is established the rest is a business matter. It may be CNI, but the decision as to what investment happens is part of the licence negotiations, and in each case so far the customers, ie the airlines, have been offered options with more resilience and, after predictably demanding both low prices and more resilience at first, have tended towards lower cost. If you don't want those who pay to define your CNI you need to give the Regulator different powers.
If Monday was a result of the failure of an old system then the initial report next week might include some information on the investment to date, why that system is still in use, and current plans and timescales for replacement.

Engineer39
2nd Sep 2023, 14:07
I think this is rather a favourable way of looking at it :eek:
However, the upgrade is described as being specifically to "add military controller roles". ..... but the whole system to expose (as here) related latent errors that had been "got away with" to date - especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military).

Not correct I am afraid. It's always been a dual type system. It was just adding more military workstations on this occasion. As far as I know it was not the specific military functionality that was the problem but the total number of active and watching stations.

As someone said, the safety case was fine but the resilience (which often has little to do with safety) failed on that occasion. It would indeed cost a ton of money to make it more resilient with dual software, and who is willing to pay for that? Many of the NATS shareholders are UK airlines. Will they pay? As BA has found with its booking system, resilience can be hard to buy.

Neo380
2nd Sep 2023, 16:14
So this future demand on the public purse is invented by you? The last £1bn didn't come from the public purse so why should the next?
If anything like last time, I would expect the Committee to be hell bent on making sound bites, most of which have no relevance to the situation, rather than actually discussing the issue at hand. Last time they continually demanded the CEO promise something neither he nor anybody else could promise, and contrived to get him replaced when he stuck to the only realistic response.
The RPs are a negotiation between multiple parties and I'd expect NATS to forcefully fight its corner. You might consider that arrogant but actually as directors of a PLC the board is probably legally required to do so.
As I said before, once the safety case is established the rest is a business matter. It may be CNI, but the decision as to what investment happens is part of the licence negotiations, and in each case so far the customers, ie the airlines, have been offered options with more resilience and, after predictably demanding both low prices and more resilience at first, have tended towards lower cost. If you don't want those who pay to define your CNI you need to give the Regulator different powers.
If Monday was a result of the failure of an old system then the initial report next week might include some information on the investment to date, why that system is still in use, and current plans and timescales for replacement.
That is the current 'word' in NATS, so you'd have to ask them, not me (it sounds like you're pretty close too). But if there's going to be no call at all on the public purse, that's great - we can meet back here and you can say 'I told you so'. I'd like to read the Hansard report of the last Parliamentary Committee - do you have it? - because it would be interesting to see what was actually said. (I thought) Richard Deakin, the then CEO, was replaced - are you saying this is why?
Come on, NATS didn't 'forcefully fight its corner' (you must be an insider); it picked its ball up and refused to play. What other government department, agency or PPP is immune from scrutiny and improving itself??
NATS is not a PLC; the PPP status confers some significant privileges on it - not least of which is a monopoly that is normally banned under Competition Law. CNI is not a 'business matter', it just doesn't work like that - would you be happy if we deployed your children to Iraq or Afghanistan and said, 'oh, sorry, the business decided against buying you protective equipment and ammunition'?? The analogy shows how ridiculous this line of thinking is. The Regulator should absolutely be defining what fail safe procedures it wants to see, I agree with you there (there's little point checking 'amendments to MATS' if the country's critical national infrastructure is susceptible to catastrophic collapse). I expect the initial report to reveal very little again - if you suspend your disbelief/denial for just a few seconds it's clear that NATS isn't coming clean on this one.

Eric T Cartman
2nd Sep 2023, 16:51
The Transport Committee needs a Gwyneth Dunwoody clone at the helm - weak excuses would not be tolerated !

eglnyt
2nd Sep 2023, 16:57
NATS has a very complex structure of intertwined companies but the holder of the licence is NATS (En-Route) PLC. Whether it is a "proper" PLC is a debatable point, not least because the shares aren't publicly traded, but it is constituted as one and the rules for PLCs apply. The En-Route operation is, as you say, a regulated monopoly.
The Hansard report of the Select Committee is available at the official source and the proceedings are still available on Parliament TV.
I am closer than some, lived through much of what we have been discussing, have a reasonable but possibly dated understanding of the systems in the Flight Data thread but have no current connection with NATS. I have no more knowledge of what happened than anybody else outside NATS although I can deduce some things from the regulations imposed.

Neo380
2nd Sep 2023, 17:01
NATS has a very complex structure of intertwined companies but the holder of the licence is NATS (En-Route) PLC. Whether it is a "proper" PLC is a debatable point, not least because the shares aren't publicly traded, but it is constituted as one and the rules for PLCs apply. The En-Route operation is, as you say, a regulated monopoly.
The Hansard report of the Select Committee is available at the official source and the proceedings are still available on Parliament TV.
I am closer than some, lived through much of what we have been discussing, have a reasonable but possibly dated understanding of the systems in the Flight Data thread but have no current connection with NATS. I have no more knowledge of what happened than anybody else outside NATS although I can deduce some things from the regulations imposed.

That’s useful, thanks. Good chat btw.

eglnyt
2nd Sep 2023, 17:01
The Transport Committee needs a Gwyneth Dunwoody clone at the helm - weak excuses would not be tolerated !
Fully agree. Having previously seen the MP for Crewe & Nantwich in action, well briefed and focussed and downright scary, I was disappointed to see what the Select Committee had become. I now despair that the malaise has infected the whole Parliamentary system.

Gupeg
3rd Sep 2023, 11:43
E39: "Not correct I am afraid. It's always been a dual type system. It was just adding more military workstations on this occasion. As far as I know it was not the specific military functionality that was the problem but the total number of active and watching stations."

My comment - "especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military)" - is based on the report, G.3.19: "The software had its origins in an earlier development in the USA that did not support military Controllers, and this might help to explain the original program design, although it is unlikely that the underlying cause for the software fault can be found at this time." Reading elsewhere in section G there is reference to 'poor' naming of a variable, the 'poor' being because it was not written to cover civil and military.

egl: "Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only time that could be done was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable resources available at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that, but this is a system developed long before that was an option."

I doubt we differ much overall :) some detail maybe. What you allude to is that in the 2020s we are using a system that by current standards is not fit for purpose, in that it cannot be tested. The reference above shows there is code written decades ago that is not amenable even to "visual checking", and there is no practical test system in existence.

egl: "It wasn't new software. It was the original software; it had been there for years." Sorry - by "new" software, I meant a new version introduced one day prior to the failure... From my own software experience, it tends to be the minor upgrades that bring the most grief :{

"We'll later see if the report on Monday's issue has any parallels? ... I doubt very much that anything that comes out on Monday will be sufficient to decide if there are any parallels." Slight misunderstanding: appreciate there will be no report out Monday (tomorrow) - I was referring to (last) Monday's issue. When we do see a report, it will be interesting to see if there are parallels between 2023 and 2014...

Mr Optimistic
3rd Sep 2023, 12:34
Fully agree. Having previously seen the MP for Crewe & Nantwich in action, well briefed and focussed and downright scary, I was disappointed to see what the Select Committee had become. I now despair that the malaise has infected the whole Parliamentary system.

Perhaps we blame government and parliament too much. I suspect it's the great bureaucratic infrastructure where the fault might lie, and there is no curing that.

As for nuclear power stations and mission critical systems, from what I remember it was more a case of monitoring physical parameters and shutting it down if things went out of whack. Ten to the minus 9 only gets you so far.

Neo380
3rd Sep 2023, 12:57
Perhaps we blame government and parliament too much. I suspect it's the great bureaucratic infrastructure where the fault might lie, and there is no curing that.

As for nuclear power stations and mission critical systems, from what I remember it was more a case of monitoring physical parameters and shutting it down if things went out of whack. Ten to the minus 9 only gets you so far.

Whereas putting proper fail safes in place, as would be the case in other mission critical systems, does properly manage the problem - but that’s the point NATS is very keen to avoid discussing.

eglnyt
3rd Sep 2023, 13:17
Whereas putting proper fail safes in place, as would be the case in other mission critical systems, does properly manage the problem - but that’s the point NATS is very keen to avoid discussing.
You seem to be interpreting the silence so far as avoiding discussion. That may be the case, but from previous experience a lot of detailed work needs to happen to prepare the background material before you can start the discussion. In this case there appear to be external parties involved, and possibly suppliers, and you have to give them time and opportunity to prepare their responses as well. How many organisations do you know that can lay their hands on 20-year-old records quickly?

Neo380
3rd Sep 2023, 14:07
​​​You seem to be interpreting the silence so far as avoiding discussion. That may be the case but from previous experience a lot of detailed work needs to happen to prepare the background material before you can start the discussion. In this case there appear to be external parties involved and possibly suppliers and you have to give them time an opportunity to prepare their responses as well. How many organisations do you know that can lay their hands on 20 year records quickly?

I do, as it’s much more like obfuscation, at least on this channel, which has no time constraints.

This is sixty year old technology.

The question is ‘where were the fail safes?’

I suspect the 2014 catastrophic failure would be a good starting point for a proper investigation.

And btw (as I’ve said before, for good reason) I don’t expect NATS to come clean on this issue.

eglnyt
3rd Sep 2023, 14:57
I do, as it’s much more like obfuscation, at least on this channel, which has no time constraints.

This is sixty year old technology.

The question is ‘where were the fail safes?’

I suspect the 2014 catastrophic failure would be a good starting point for a proper investigation.

And btw (as I’ve said before, for good reason) I don’t expect NATS to come clean on this issue.
Nobody who knows anything will be posting on "social media" channels. Like most organisations NATS has policies about that and will regularly remind its staff of their obligations even when there hasn't been an "incident".
With a reasonable knowledge of the systems involved I can't tell you which system it was. It's only speculation that it's the ageing Flight Data system although it does have previous. Not all the systems involved are as old although all, I think, have redundancy provided by a backup running the same or very similar software. If there is an investigation I would hope that is discussed although most of the systems other than NAS are used at multiple ANSPs in exactly the same way.
2014 was not catastrophic. It was of quite short duration and over the course of the day NATS handled a higher percentage of planned traffic than most businesses would expect to handle in a fallback mode. The response to 2014 was rather over the top given the actual impact. If we did the same for the railway Network Rail would be forever at the Committee and in a permanent state of review.
This one was much worse in terms of impact for which reason a similar review should be the minimum, I'd argue for one with a bit more independence. There were independent experts who wrote most of the meaningful content but CAA/NATS were allowed to lead it last time.
Some themes will be similar but if this was a different system much of the detail from 2014 will be irrelevant.

Neo380
3rd Sep 2023, 18:18
Nobody who knows anything will be posting on "social media" channels. Like most organisations NATS has policies about that and will regularly remind its staff of their obligations even when there hasn't been an "incident".
With a reasonable knowledge of the systems involved I can't tell you which system it was. It's only speculation that it's the ageing Flight Data system although it does have previous. Not all the systems involved are as old although all, I think, have redundancy provided by a backup running the same or very similar software. If there is an investigation I would hope that is discussed although most of the systems other than NAS are used at multiple ANSPs in exactly the same way.
2014 was not catastrophic. It was of quite short duration and over the course of the day NATS handled a higher percentage of planned traffic than most businesses would expect to handle in a fallback mode. The response to 2014 was rather over the top given the actual impact. If we did the same for the railway Network Rail would be forever at the Committee and in a permanent state of review.
This one was much worse in terms of impact for which reason a similar review should be the minimum, I'd argue for one with a bit more independence. There were independent experts who wrote most of the meaningful content but CAA/NATS were allowed to lead it last time.
Some themes will be similar but if this was a different system much of the detail from 2014 will be irrelevant.

They may not be posting on social media but they discussed it quite freely with me when I was working there shortly after the investigation - hence I know what they think the real cause of the issue is.

The system in question is also already in the public domain. Very interesting that you should then say 'Not all the systems involved are as old although all, I think, have redundancy provided by a backup running the same or very similar software.' Let's see. I just don't know, but other commentators have said that Swanwick Centre is the only centre that is still operating this particular (version of this?) system.

I agree with the need for an independent review - look at the 'dodgy French data' PR, 'Martin Rolfe hasn't been seen at home since it happened', the Radio 4 and The Times pieces - all masterful stuff, but mostly 'fluff'.

eglnyt
3rd Sep 2023, 19:00
NATS is the only ANSP that has ever operated the "Swanwick system" which may also be referred to as NERC or the LACC system. It was part of that system which failed in 2014. I can't rule out a failure of that system as the issue this time but I would expect the regulations and operational effect to be different than they were. The En-Route operation has several different civil operations, all seem to have been affected but only one of those uses that system. There are several other systems in the Flight Data thread before it gets to the "Swanwick system". My money is on one of those. One of them is NAS which is only used by NATS in its current form. The others are variations of systems used all over the World.

I expect the CEO has spent the weekend locked in a room with a few people attempting the futile task of condensing a complex technical report into something the Secretary of State and the average Daily Mail reader can understand. Pointless really, because the minister's catchy three-word response and the Mail editorial have probably already been written.

Neo380
3rd Sep 2023, 19:12
I expect the CEO has spent the weekend locked in a room with a few people attempting the futile task of condensing a complex technical report into something the Secretary of State and the average Daily Mail reader can understand. Pointless really, because the minister's catchy three-word response and the Mail editorial have probably already been written.

Well, it's caught public interest, and rightly so.

But you have to be careful, eglnyt, it sounds like you just want this problem to go away. What if post-enquiry the headline is (justifiably): NATS System Failure 'Inevitable'?

eglnyt
3rd Sep 2023, 19:32
That headline is justified. The NATS system failed. Dodgy French data or not, this was a failing of a NATS system. We know it failed; the investigation will need to explain why, but far more important is which of the following three cases applies.
Either:
That failure was unforeseen, in my experience unlikely for NAS for which there are precursor failures, and I would hope unlikely for the other systems in the thread given the obvious possibility, however rare, of common software failure.
The failure was foreseen but the impact was not correctly assessed
The failure was foreseen and the impact correctly assessed but the controls expected to contain such a failure didn't work as expected.
It's always a bit tricky for systems like this because of the interaction between flow regulation to maintain safety and the business impact that results from that regulation.

Neo380
3rd Sep 2023, 22:36
That headline is justified. The NATS system failed. Dodgy French data or not, this was a failing of a NATS system. We know it failed; the investigation will need to explain why, but far more important is which of the following three cases applies.
Either:
That failure was unforeseen, in my experience unlikely for NAS for which there are precursor failures, and I would hope unlikely for the other systems in the thread given the obvious possibility, however rare, of common software failure.
The failure was foreseen but the impact was not correctly assessed
The failure was foreseen and the impact correctly assessed but the controls expected to contain such a failure didn't work as expected.
It's always a bit tricky for systems like this because of the interaction between flow regulation to maintain safety and the business impact that results from that regulation.

The word is there were no fallbacks. So, assuming that’s correct, the options don’t look good:

a. Not possible, you can’t build without fallbacks and say a crash can’t happen, NATS was living on hope (and tbh, tempting fate)

b. Ditto, as the system was never fully stress tested

c. What controls; there were no fallbacks (tbc)?

Conclusion, the headline’s fully justified. Let’s see.

eglnyt
3rd Sep 2023, 23:06
The word is there were no fallbacks. So, assuming that’s correct, the options don’t look good:

a. Not possible, you can’t build without fallbacks and say a crash can’t happen, NATS was living on hope (and tbh, tempting fate)

b. Ditto, as the system was never fully stress tested

c. What controls; there were no fallbacks (tbc)?

Conclusion, the headline’s fully justified. Let’s see.
We don't know which system it was yet but apparently it had no fallback and wasn't stress tested?
We don't know if operating at its limits was an issue so the latter point may or may not be relevant. We know the number of atomic functions was the issue in 2014 but that limitation is unique to the NERC system and different boundary conditions apply elsewhere in the Flight Data thread. We don't know which system so how do we know whether those conditions were tested?
We know any fallback was ineffective but not what it was. I would expect the first level to be provided by an identical system because that is normal in this thread. I'd expect the last fallback to be manual fallback. We don't know what if anything was expected to happen between those.

Engineer39
4th Sep 2023, 10:19
The failure was foreseen and the impact correctly assessed but the controls expected to contain such a failure didn't work as expected.
It's always a bit tricky for systems like this because of the interaction between flow regulation to maintain safety and the business impact that results from that regulation.

Any engineer worth their salt expects failure, as you can never say, "My great design will never fail". E.g. even the most solid bridge can fall down one day. What you don't know is which of the many possibilities that theoretically occur once in 1000 years may crop up in 10 years' time. For software, if you knew exactly what could happen it's likely you could change the code to eliminate it.

What is less certain is how the backup plan worked this time. It looks to me like the normal backup plan of manual operation, whilst safe, went a bit awry in this case as the impact was longer lasting. Or was it just that being a bank holiday, with its high pax volumes, delays were longer?

I fail to see how parliament can really help here. NATS is semi-private and semi-controlled by the UK airlines. Unless government said "Here is £1b to build a duplicate system, or let's structure NATS in a completely different way", it's up to the airlines to decide how much NATS spends to upgrade its systems and pass that cost back to the airlines and customers through the route charges. Neither government (even a Labour one) nor the airlines are likely to agree to, say, £1b when NATS operations are actually amongst the most reliable in the world. Especially as the ATCOs don't go on strike, unlike many others.

By the way, a week on, have all the pax that were delayed got alternative flights now?

golfbananajam
4th Sep 2023, 14:28
Inputting the wrong data - often as little as a missed full stop - is not an 'edge case', actually it's normal human behaviour. This has nothing to do with fail overs that don't work.

If I understand it correctly, the data provided to NATS has been pre-processed, and so any errors (mistakes) should have been caught by the system doing the pre-processing. If my understanding is correct, I would expect that the assumption is that the data is now ONLY valid data, so this failure case isn't as straightforward as you may believe.

I work in the world of complex software; testing is a nightmare. As others have asked, at what point do you stop? How many bugs do you release (bearing in mind you may not know about them)? These are questions we're constantly asking, and we never ever get to a stage where we satisfy everyone.
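
A hypothetical Python sketch of the two postures being described: trusting that upstream pre-processing caught everything, versus re-validating at your own boundary and rejecting bad messages cleanly. The field names are invented.

def trusting_ingest(message: dict) -> str:
    # Assumes the upstream pre-processor has already validated everything.
    return f"{message['callsign']} at FL{message['level'] // 100}"

def defensive_ingest(message: dict) -> str:
    # Re-checks the fields it depends on and rejects the message cleanly
    # instead of letting a malformed one propagate into the core system.
    callsign = message.get("callsign")
    level = message.get("level")
    if not isinstance(callsign, str) or not callsign.isalnum():
        raise ValueError(f"rejected message: bad callsign {callsign!r}")
    if not isinstance(level, int) or not 0 <= level <= 66000:
        raise ValueError(f"rejected message: bad level {level!r}")
    return f"{callsign} at FL{level // 100}"

if __name__ == "__main__":
    good = {"callsign": "BAW123", "level": 35000}
    bad = {"callsign": "BAW123", "level": None}   # slipped through upstream
    print(defensive_ingest(good))
    try:
        defensive_ingest(bad)
    except ValueError as exc:
        print(exc)
    try:
        trusting_ingest(bad)
    except TypeError as exc:
        print("core crashed on trusted input:", exc)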

Neo380
4th Sep 2023, 16:42
If I understand it correctly, the data provided to NATS has been pre-processed, and so any errors (mistakes) should have been caught by the system doing the pre-processing. If my understanding is correct, I would expect that the assumption is that the data is now ONLY valid data, so this failure case isn't as straightforward as you may believe.

I work in the world of complex software; testing is a nightmare. As others have asked, at what point do you stop? How many bugs do you release (bearing in mind you may not know about them)? These are questions we're constantly asking, and we never ever get to a stage where we satisfy everyone.

This answer (aka 'excuse') has been repeated so often, ie 'testing's soooo complicated, we couldn't possibly capture every permutation' (which actually means not processing an incorrectly formatted message). Really??

'at what point do you stop?'. Er, when you've built one fall back, and preferably two - if, as you say, this is really a mission critical system.

eglnyt
4th Sep 2023, 17:33
This answer (aka 'excuse') has been repeated so often, ie 'testing's soooo complicated, we couldn't possibly capture every permutation (which actually means not process an incorrectly formatted message). Really??

'at what point do you stop?'. Er, when you've built one fall back, and preferably two - if, as you say, this is really a mission critical system.
You've made the same point several times, but if you had a solution to the issue where the customer says they don't think that's necessary & they won't pay for it, I must have missed it. What do you do about that?

Neo380
4th Sep 2023, 17:52
You've made the same point several times, but if you had a solution to the issue where the customer says they don't think that's necessary & they won't pay for it, I must have missed it. What do you do about that?

Let’s see what the final bill is (the small scale drone scare at Gatwick ran into tens of millions of pounds).

Your scenario is surely hypothetical, as I can’t imagine anyone - with the facts - voting down building proper fall backs when it would mean saving hundreds of millions of pounds in lost business and costs, let alone more in reputational damage.

In fact, I’m sure they’d assume no one would consider building such a system WITHOUT fall backs in place.

Neo380
4th Sep 2023, 17:59
NATS wrote their own service standards, so that's a very circular argument. An independent inquiry might question whether having NATS write their own exam paper was the right thing to do.

Do you think?! Maybe as much as NATS running their own enquiry…

eglnyt
4th Sep 2023, 18:03
I must have imagined the scrutiny applied to the NATS Investment Plan at every licence renewal. Maybe it was just a bad dream. Similarly the need to halt the investment programme at the start of the pandemic because without income there was no way to fund it.

If it turns out the system - which is still unknown - didn't have appropriate fallbacks, then any inquiry should examine why that is, including any issues around funding the resilience required.

Neo380
4th Sep 2023, 18:19
I must have imagined the scrutiny applied to the NATS Investment Plan at every licence renewal. Maybe it was just a bad dream. Similarly the need to halt the investment programme at the start of the pandemic because without income there was no way to fund it.

If it turns out the system - which is still unknown - didn't have appropriate fallbacks, then any inquiry should examine why that is, including any issues around funding the resilience required.

Er, ‘no way to pay for it’. Now I do believe you’re actually the NATS Corporate Comms team, eglnyt!

NATS took £1.5BILLION in debt financing during Covid (and still sacked all 120 ‘apprentice’ ATCOs)!

Neo380
4th Sep 2023, 18:45
NATS Holdings profit before tax:

2013 £215m
2014 £240m
2015 £252m
2016 £137m
2017 £136m
2018 £132m
2019 £ 98m
2020 £ 25m
2021 £-38m
2022 £ 8m
2023 £148m

You forgot to add £1,500m to 2021…

(Cash flow, of course, but the odd billion or two does make a difference).

eglnyt
4th Sep 2023, 19:34
I'm a bit old for Corporate Comms and lacking in the Facebook/Twitter skills probably required nowadays.
I wouldn't claim to understand NATS financing; I got confused when the Airline Group bought half of NATS for several hundred million and somehow NATS paid the Treasury. However, I think you need to look at total borrowing before and after that deal, and at how much of that profit over the years was reinvested.
I am however in awe of the gentleman who somehow sold bonds for a company with no real income in the middle of a pandemic. I'm not surprised they gave him a pay rise.

Neo380
4th Sep 2023, 20:05
I'm a bit old for Corporate Comms and lacking in the Facebook/Twitter skills probably required nowadays.
I wouldn't claim to understand NATS financing; I got confused when the Airline Group bought half of NATS for several hundred million and somehow NATS paid the Treasury. However, I think you need to look at total borrowing before and after that deal, and at how much of that profit over the years was reinvested.
I am however in awe of the gentleman who somehow sold bonds for a company with no real income in the middle of a pandemic. I'm not surprised they gave him a pay rise.

It's because of the monopoly - ask any banker whether they prefer one-off or annuity revenues. The 'cause célèbre' was the seven-figure bonus when business dropped off 80% (albeit not his fault)...

ATC Watcher
4th Sep 2023, 20:13

NATS took £1.5BILLION in debt financing during Covid (and still sacked all 120 ‘apprentice’ ATCOs)!
A bit off topic, but does anyone know what happened to those 120 trainees? Were they ever rehired now that traffic is back to pre-Covid 2019 levels? If not, how is NATS going to cope with the planned retirements in the next 2-3 years?
That is one of the main issues of privatised ATC: no revenue, so reduce the controller workforce; then traffic restarts and there is no staff, as it takes a minimum of 3 years from recruitment to a fully validated controller. Same old, same old....

eglnyt
4th Sep 2023, 20:26
It's because of the monopoly - ask any banker whether they prefer one-off or annuity revenues. The 'cause célèbre' was the seven-figure bonus when business dropped off 80% (albeit not his fault)...
You won't find me defending Executive pay, but we may not be talking about the same person.
The CEO remuneration did dip when income dipped - remember most of financial year 2019/2020 wasn't affected, and the critical summer period was the best ever. It has rebounded as traffic did, to my mind rather better than for NATS staff in general, who actually gave back their 2019/20 pay rise when the pandemic hit.

eglnyt
4th Sep 2023, 20:30
A bit off topic, but does anyone know what happened to those 120 trainees? Were they ever rehired now that traffic is back to pre-Covid 2019 levels? If not, how is NATS going to cope with the planned retirements in the next 2-3 years?
That is one of the main issues of privatised ATC: no revenue, so reduce the controller workforce; then traffic restarts and there is no staff, as it takes a minimum of 3 years from recruitment to a fully validated controller. Same old, same old....
I believe those who wished to were put at the front of the queue when ab initio recruitment was restarted recently. Trainees have to factor in living on a minimal income during training for later reward, and not all of the 120 would have been in a position to have another go.

Neo380
4th Sep 2023, 21:00
I believe those who wished to were put at the front of the queue when ab initio recruitment was restarted recently. Trainees have to factor in living on a minimal income during training for later reward, and not all of the 120 would have been in a position to have another go.

Why sack an entire cohort, some of whom were only 2 weeks away from qualifying?

You put a fantastic burnish on everything NATS does, with real detail (hence 'Corporate Comms'!), but this was clearly the wrong thing to do.

Neo380
4th Sep 2023, 21:02
You won't find me defending Executive pay, but we may not be talking about the same person.
The CEO remuneration did dip when income dipped - remember most of financial year 2019/2020 wasn't affected, and the critical summer period was the best ever. It has rebounded as traffic did, to my mind rather better than for NATS staff in general, who actually gave back their 2019/20 pay rise when the pandemic hit.

We're talking about the (current) CEO and his seven-figure bonus. This is supposed to be 'performance related', but that's a joke when your business drops by 80% (I know they've now covered over the payouts).

eglnyt
4th Sep 2023, 21:18
Why sack an entire cohort, some of whom were only 2 weeks away from qualifying?

You put a fantastic burnish on everything NATS does, with real detail (hence 'Corporate Comms'!), but this was clearly the wrong thing to do.
Heading way off topic now. Not my decision to defend, but those were very strange times. We now know how the future of aviation panned out, but at the time, pre-vaccines, it didn't look anywhere near as positive. The furlough scheme offered a reprieve to many NATS jobs, but it was quite clear that was going to run out long before income returned. NATS prioritised keeping its validated controllers. Others were not so lucky. A large proportion of technical staff were on short-term contracts through complex multi-company arrangements, and they were let go straight away. NATS had long since outsourced many ancillary functions to organisations that weren't so good to their staff. A fair few NATS staff were made redundant, although the terms were pretty good, so they fared pretty well.

Neo380
4th Sep 2023, 21:23
Heading way off topic now. Not my decision to defend, but those were very strange times. We now know how the future of aviation panned out, but at the time, pre-vaccines, it didn't look anywhere near as positive. The furlough scheme offered a reprieve to many NATS jobs, but it was quite clear that was going to run out long before income returned. NATS prioritised keeping its validated controllers. Others were not so lucky. A large proportion of technical staff were on short-term contracts through complex multi-company arrangements, and they were let go straight away. NATS had long since outsourced many ancillary functions to organisations that weren't so good to their staff. A fair few NATS staff were made redundant, although the terms were pretty good, so they fared pretty well.

Yes and no: voluntary redundancy was taken by all (bar one) volunteers. The £1.5bn debt financing prevented the culling of any non-volunteers - and NATS is already well back into the 'feast and famine' cycle of controller staffing.

The subjects may change a little, but the theme throughout is identical: NATS is taking no blame at all for a catastrophic systems failure that even you accept was inevitable - I suppose complexity will fool the tabloids.

eglnyt
4th Sep 2023, 21:36
Yes and no: voluntary redundancy was taken by all (bar one) volunteers. The £1.5bn debt financing prevented the culling of any non-volunteers - and NATS is already well back into the 'feast and famine' cycle of controller staffing.

As I said, those made redundant fared quite well, but 'voluntary' is an interesting term. NATS had served notice that it was going to unilaterally withdraw the relatively good redundancy scheme. In mid-2020 it looked quite possible that income would not recover and that redundancies would be required further down the track, possibly on far worse terms. Many jumped because of that. Others realised that if a lot of people jumped, they would be left in an under-resourced organisation that wouldn't be a nice place to be.

Neo380
4th Sep 2023, 21:44
As I said, those made redundant fared quite well, but 'voluntary' is an interesting term. NATS had served notice that it was going to unilaterally withdraw the relatively good redundancy scheme. In mid-2020 it looked quite possible that income would not recover and that redundancies would be required further down the track, possibly on far worse terms. Many jumped because of that. Others realised that if a lot of people jumped, they would be left in an under-resourced organisation that wouldn't be a nice place to be.

Hahahahaha - under-resourced?! And voluntary means all (bar one) volunteers. No compulsory redundancies were made, despite the spin at the time (and clearly now). Sorry, but this is turning into a fantasy cover-up.

eglnyt
4th Sep 2023, 21:46
The subjects may change a little, but the theme throughout is identical: NATS is taking no blame at all for a catastrophic systems failure that even you accept was inevitable - I suppose complexity will fool the tabloids.
We don't know who is taking what blame. The report was supposed to be with the CAA today; the Secretary of State said it would be published later in the week. Until that runs its course we have no idea what happened or how much blame is to be apportioned.
Every software fault is inevitable; we just don't know why this one played out with the impact it did. I expect every system to fail sooner or later, but I wouldn't expect the failure of a system in the Flight Data thread to take so long to fix or to have that impact, so I'm waiting for the report to see why it did.
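To illustrate why containment matters - and this is a generic pattern, purely hypothetical, not a description of the NATS architecture or the actual cause - here is a sketch of a processing chain that sets a "poison" item aside for manual review instead of halting, on the reasoning that a standby running identical code would fail on the same input, so simply failing over to it buys nothing:

import queue

def handle(plan: dict) -> None:
    # Stand-in for the real flight data processing; raises on certain inputs.
    if "route" not in plan:
        raise ValueError("no route in flight plan")
    print(f"processed {plan['callsign']}")

def run(pending: "queue.Queue[dict]") -> None:
    # Process items one by one; a failing item is quarantined (and would be
    # alerted on) rather than bringing the whole processing chain to a stop.
    quarantined = []
    while not pending.empty():
        plan = pending.get()
        try:
            handle(plan)
        except Exception as err:
            quarantined.append((plan, str(err)))
    for plan, reason in quarantined:
        print(f"quarantined {plan.get('callsign', '?')}: {reason}")

q: "queue.Queue[dict]" = queue.Queue()
for p in ({"callsign": "BAW123", "route": ["DVR", "KONAN"]},
          {"callsign": "EZY45"},                      # the item that blows up
          {"callsign": "RYR8", "route": ["LAM"]}):
    q.put(p)
run(q)

Whether something like this was appropriate, or even possible, for the system that failed is exactly the sort of thing the report should tell us.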

Ninthace
5th Sep 2023, 00:09
When you think that they started with a system that had been running successfully for many years, that the fault occurred on a Bank Holiday weekend, and take into account all the steps they would have to work through just to isolate the faulty code, identify the cause, work out a fix and then test it, in many ways I am surprised how quickly they did fix it.

kiwi grey
5th Sep 2023, 04:24
Parliament can help in quite a lot of ways.
Firstly, it can compel NATS management to tell the truth in public, which would otherwise be sadly lacking.
Secondly, HMG holds a 49% share, and parliament could require that stake be exercised to improve things.

They could also commission their own report and come up with another number for the cost, one that might be less susceptible to manipulation by vested interests (or at least manipulated by different vested interests).

Or they could legislate to compulsorily acquire, say, 10% of each of the other shareholders' shares, thus leaving HMG as a 54.1% majority shareholder. Then require the Board (or a new Board, if the current Directors seem reluctant) to immediately and comprehensively address the underlying issues.


Or they could modify the terms of the licence to require whatever they see fit.

They might even look at a company that makes huge profits and then comes begging for public subsidy and think, you know what, maybe they should reinvest some of that profit rather than beg from the public purse. Or they could propose a tax on NATS profit to subsidise the investment if NATS won't do it themselves.

eglnyt
5th Sep 2023, 09:35
NATS Holdings profit before tax:

2013 £215m
2014 £240m
2015 £252m
2016 £137m
2017 £136m
2018 £132m
2019 £ 98m
2020 £ 25m
2021 £-38m
2022 £ 8m
2023 £148m
As I've previously stated, the whole world of NATS finance remains a mystery to me, but shouldn't you use the profit after tax as a meaningful measure of what might be available to invest? In fact, isn't the amount paid out in dividends the only real indicator of cash lost to investment, as any other "profit" is effectively retained within the business?

eglnyt
5th Sep 2023, 09:47
Or they could legislate to compulsorily acquire say 10% of each of the other shareholders' shares, thus leaving HMG as a 54.1% majority shareholder. Then require the Board (or a new Board if the current Directors seem reluctant) to immediately and comprehensively address the underlying issues.

I'm not sure that's a valid way forward. There are rules about acquiring shareholdings in public companies. If you introduce legislation that steamrollers through those rules, even for a limited scope, you are setting a dangerous precedent. For a country totally reliant on other people's investment any hint that it might go all "Socialist" would be very problematic.

It actually probably doesn't need to do that. It holds more shares than anybody else, and the 5% in the Employee Sharetrust is effectively out of play, so it already has the power to control the company by asking its appointed directors to intervene - but again, that would send shockwaves through the markets. Still, we currently have a Government that takes decisions based on "focus groups" regardless of the effect on the economy, so who knows.