PPRuNe Forums (https://www.pprune.org/)
-   Rumours & News (https://www.pprune.org/rumours-news-13/)
-   -   U.K. NATS Systems Failure (https://www.pprune.org/rumours-news/654461-u-k-nats-systems-failure.html)

CBSITCB 29th Aug 2023 11:13

Flight Plans (FPLs) are filed in ICAO standard format. Part of the FPL is the route. Each national ANSP must, within its own specific flight plan processing system (FPPS), convert the FPL route into a route expressed in terms of its own national airspace computer model.

The UK airspace model is built using the same architecture as the US one. It is part of the National Airspace System (NAS), which the UK obtained from the US FAA in the 70s. The UK NAS airspace architecture is essentially the same today as it was in the 70s, though of course the model itself has changed to reflect the changing airspace over time.

A key part of the NAS software is a sub-programme called Route Conversion and Posting. This converts the ICAO FPL route into the internal format of the FPPS. It then determines which sectors (ATCOs) need to be provided with the FPL information (the “posting” part). NATS does not publish details of this sub-programme, but the FAA does. It is documented in a volume called NAS-MD-312. These details have naturally diverged over time, but the essentials remain the same.

To quote NAS-MD-312 “The 3-dimensional volumes of airspace that comprise an [airspace model] are described by points and lines with a specified altitude range for each. These volumes of airspace are fix posting areas (FPAs). Geographic points are described in source information in terms of latitude and longitude in units of degrees, minutes, and seconds and are converted to conventional X and Y coordinates, in units of one-eighth mile, and stored in that form. A boundary line is described by its geographic end points. Since each line segment has a specified altitude range adapted to it, a series of connected lines is used to describe a 3-dimensional volume of airspace. An FPA is the fundamental unit of airspace within the [airspace model]. Other volumes of airspace within a centre, such as sector or approach control areas, are described in terms of FPAs that comprise them. A fix posting area is a volume of airspace identified by a series of connected line segments that form a polygon when viewed in the horizontal and vertical plane, with each boundary line having a specified altitude range. The polygon may be convex or concave, permitting a variety of geometric shapes.”
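To make the quoted description a little more concrete, here is a minimal sketch of an FPA as a polygon with an altitude range, with a test for whether a point lies inside it. It is not taken from NAS-MD-312: the class name, the single floor/ceiling per FPA (the quote gives each boundary line its own altitude range) and the demo coordinates are all assumptions made for illustration. Coordinates are X/Y in units of one-eighth mile, as in the quote.

```python
from dataclasses import dataclass


@dataclass
class FixPostingArea:
    """Hypothetical model of an FPA: a polygon plus one altitude range."""
    name: str
    vertices: list   # [(x, y), ...] in units of one-eighth mile
    floor_ft: int    # simplification: one range for the whole FPA
    ceiling_ft: int

    def contains(self, x, y, alt_ft):
        # Vertical check first: outside the altitude range means outside the volume.
        if not (self.floor_ft <= alt_ft <= self.ceiling_ft):
            return False
        # Ray casting: count boundary crossings of a horizontal ray towards +x.
        # An odd count means the point is inside; works for concave polygons too.
        inside = False
        n = len(self.vertices)
        for i in range(n):
            x1, y1 = self.vertices[i]
            x2, y2 = self.vertices[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside


# A concave (L-shaped) demo FPA, 10 miles by 10 miles with one corner cut out.
demo_fpa = FixPostingArea(
    name="DEMO",
    vertices=[(0, 0), (80, 0), (80, 40), (40, 40), (40, 80), (0, 80)],
    floor_ft=0,
    ceiling_ft=24500,
)
print(demo_fpa.contains(20, 60, 10000))  # inside the upper arm of the L
print(demo_fpa.contains(60, 60, 10000))  # in the cut-out corner
```

Posting would then amount to asking, for each point along the converted route, which FPAs (and hence which sectors) contain it.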

This process of route conversion is obviously a very complex exercise. I understand at least one major UK NAS outage in the past was caused by errors in this process. Someone had managed to input an FPL route that passed NAS route validation (described in NAS-MD-311 Message Entry and Checking) but “did not compute” when route conversion was attempted. Of course, all possible errors should be trapped, but…

A flavour of the complexity of route conversion can be had here:
https://www.tc.faa.gov/its/worldpac/...nas-md-312.pdf

I am not saying such a problem was the cause of the outage yesterday – though it could be. It is just some background to the way NATS processes flight plans.

Equivocal 29th Aug 2023 11:15


Originally Posted by munnst (Post 11493068)
There have been several of these `outages` over the last few years. One would hope that there is an SLA with the people who supply and maintain the system and that any loss can be claimed? (no, I doubt that as well).
I wonder how many of these outages the UK can suffer before carriers, countries start to look at alternative ways of routing through, around UK airspace or even flying here at all?

Maybe this illustrates part of the problem. The first reaction seems to be who to blame and can we claim compensation. Not invalid questions, but in my experience, which includes ATM systems, the problems often arise from failing properly to define what the system must do and to manage/handle errors and unexpected data in a robust way. Time will tell what went wrong yesterday, but I would be surprised if it wasn't something that could have been predicted (but maybe wasn't handled in a managed way).

Ninthace 29th Aug 2023 11:17

We had an MoD system brought to a shuddering halt because of a single email sent by one user. OK, she managed to somehow Ctrl+A, Ctrl+C, Ctrl+V the entire address book, including dist lists, into the To line and add a read and delivery receipt. The resultant cascade, plus all the out-of-office responses that were de rigueur in those days, froze the system to the point where we could not even talk to it. IIRC it took 2 days to effect a full recovery.

JumpJumpJump 29th Aug 2023 11:43

Have we ruled oit alien invasion?

Ninthace 29th Aug 2023 11:45


Originally Posted by JumpJumpJump (Post 11493496)
Have we ruled oit alien invasion?

I am not sure. Is there intelligent life on Oit?

CBSITCB 29th Aug 2023 11:46

Just listening to the usual journalistic blather about "why no backup system?". Of course there is an online backup system to automatically switch to, but this caters almost exclusively for hardware errors. In the case of the "erroneous FPL" scenario I described earlier <https://www.pprune.org/rumours-news/...l#post11493467> it was a problem in the software logic, and the backup system runs the same software.

What happened (AIUI) was the NAS programme "FLOPed" (Functional Lapse of the Operational Programme) – aka crashed. The programme was restarted successfully, but when the recovery data (the data that is in the system – including FPLs – that is recorded as a backup from time to time) was read in it FLOPed again. The rogue FPL was in the recovery data.

Eventually the system was restarted using a recovery data set recorded before the erroneous FPL was entered. Phew, everything hunky-dory again.

Until the originator of the rogue FPL realised it was no longer in the system so re-entered it....:D
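The sequence above (crash, restart, crash again on the recovery data, restart from an older recording) can be sketched roughly as follows. This is only an illustration of the pattern, not how NAS does it: `Flop`, `convert_route`, `restart_from`, `recover` and the plan names are all hypothetical.

```python
class Flop(Exception):
    """Stand-in for the programme crashing while processing a plan (hypothetical)."""


def convert_route(fpl):
    # Hypothetical stand-in for route conversion: one particular plan triggers the fault.
    if fpl == "ROGUE":
        raise Flop(fpl)
    return fpl.lower()


def restart_from(recovery_data):
    # A restart reads every recorded plan back in, so a trigger plan in the
    # recovery data brings the system down again.
    return [convert_route(fpl) for fpl in recovery_data]


def recover(recordings_newest_first):
    # Try the most recent recording first, then fall back to older ones.
    for age, recording in enumerate(recordings_newest_first):
        try:
            return age, restart_from(recording)
        except Flop:
            continue
    raise RuntimeError("no usable recovery data")


recordings = [
    ["FPL-A", "ROGUE", "FPL-B"],  # newest, contains the trigger plan
    ["FPL-A", "ROGUE"],           # older, still contains it
    ["FPL-A"],                    # recorded before the trigger plan was entered
]
age, state = recover(recordings)
print(age, state)  # which recording finally worked, and what was restored
```

Note that nothing in this sketch stops the same plan being entered again afterwards, which is exactly the punchline above.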

Noxegon 29th Aug 2023 11:51


Originally Posted by WHBM (Post 11493194)
So why did BA seem to scrub pretty much everything (a trend seemingly repeated out of London to all over Europe ? )

Perhaps they were told that they could operate X number of flights and they decided to prioritize their long haul operation?

pax2908 29th Aug 2023 12:22

CBSit, somebody mentioned that something similar ("bad FPL") happened in the past. You would then expect that particular FPL to be identified soon after the first restart (and second crash), so that it's not re-entered a 3rd time?

togsdragracing 29th Aug 2023 12:41


Originally Posted by CBSITCB (Post 11493499)
Until the originator of the rogue FPL realised it was no longer in the system so re-entered it…:D

An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.

WB627 29th Aug 2023 12:46

OK, for what it's worth, my money is on someone digging through the main fibre optic cable into Swanwick, somewhere outside of the site. It happens. It would explain why the backup system didn't work - no comms in/out of the building. With infrastructure that critical, there should be a second FO cable with a completely different route to Swanwick and a different entry point into the building.

Masterful decision to blame the French :ok:


CBSITCB 29th Aug 2023 12:46


Originally Posted by chevvron (Post 11493098)
The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.

Whilst the 9020 was indeed based on IBM S/360 technology it was rather more than "just six 360s". The 9020 was being designed for the FAA well before the new S/360 line was announced. In fact, this was a major stumbling block (overcome) as the FAA contract specified "off-the-shelf" hardware.

The 9020 CPUs were based on what was to become the S/360 commercial product line but they incorporated many unique hardware features to facilitate their intended use as an ATC system. These included multiprocessing (a simplified version later added to some ‘standard’ processors), address translation, and specific ATC instructions.

When the UK 9020 was retired in 1989 it was coping very well. Capacity limitations were on the horizon, which is the main reason it was replaced, but not yet causing delays. The hardware was still reliable, though maintainability was becoming difficult. It was a source of pride to the engineers concerned that when it was shut down for the final time the whole system was 100% serviceable.

I am referring to the 9020 system itself, not the NAS software or station power supplies.

CBSITCB 29th Aug 2023 12:50


Originally Posted by pax2908 (Post 11493519)
CBSit, somebody mentioned that something similar ("bad FPL") happened in the past. You would then expect that particular FPL to be identified soon after the first restart (and second crash), so that it's not re-entered a 3rd time?

I think that's exactly what happened.

CBSITCB 29th Aug 2023 12:52


Originally Posted by WB627 (Post 11493533)
With infrastructure that critical, there should be a second FO cable with a completely different route to Swanwick and a different entry point into the building.

There is a redundant communication path.

golfbananajam 29th Aug 2023 12:54

The problem with testing software is that you can't test all combinations of input values to ensure the required output values are correct, certainly not in very large or complex systems. Failure testing is often limited to defined alternate path (within the software) testing as defined in the requirements/specification. Edge cases will always catch you out.

With that in mind, critical systems like this should always fail safe, ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time.
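A minimal sketch of what "fail safe" could look like at the level of a single input, assuming the fault surfaces as an exception that can be caught (a real crash may not be so polite). The fix table, `convert` and `process_all` are hypothetical names for illustration only.

```python
KNOWN_FIXES = {"EGGW", "CPT", "EGLF"}  # hypothetical adaptation data


def convert(route):
    # Hypothetical conversion step: fails on any element it does not know.
    elements = [e for e in route.split() if e != "DCT"]
    for element in elements:
        if element not in KNOWN_FIXES:
            raise KeyError(element)
    return elements


def process_all(routes):
    accepted, quarantined = [], []
    for route in routes:
        try:
            accepted.append(convert(route))
        except Exception as exc:
            # Reject this one input and carry on, rather than taking
            # the whole system down with it.
            quarantined.append((route, repr(exc)))
    return accepted, quarantined


accepted, quarantined = process_all([
    "EGGW DCT CPT DCT EGLF",
    "EGGW DCT ZZZZZ DCT EGLF",  # made-up fix that is not in the table
])
print(len(accepted), "accepted;", len(quarantined), "set aside for a human to look at")
```

The design choice is that an unexpected failure on one item is treated as a reason to set that item aside, not as a reason to stop processing everything else.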

Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash.

I wonder how often, if ever, business continuity testing is performed which should have enabled quick recovery.


WB627 29th Aug 2023 12:58


Originally Posted by CBSITCB (Post 11493541)
There is a redundant communication path.

Well in that case, it must have been the French ;) more revenge for Brexit :oh:



CBSITCB 29th Aug 2023 13:12


Originally Posted by togsdragracing (Post 11493532)
An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.

There is, and it is extensive.

It would have been better to have called it the trigger FPL rather than the erroneous FPL. It was “bad” in that it caused the crash, but not necessarily incorrectly formatted. I don’t recall the precise details of the event.

The 28-day AIRAC cycle introduces new airspace, reporting points, airfields, etc. Thus the set of valid FPL routes also changes every 28 days. This means the NAS software has to be modified (“adapted”) to accept and process the new reporting points and routes.

So the FPL probably got through the NMOC IFPS (or whatever the equivalent was back then) as a valid route, and the error was in the updated UK route conversion process adaptation.
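For anyone wanting to see how the 28-day AIRAC cycle mentioned above lines up with a given date, here is a small sketch. The reference date is an assumption on my part (I believe 26 January 2023 was the effective date of the first cycle of 2023) and should be checked against the published AIRAC schedule; the function name is made up for this example.

```python
from datetime import date, timedelta

# Assumption: an AIRAC cycle became effective on this date. Verify before relying on it.
AIRAC_REFERENCE = date(2023, 1, 26)
CYCLE_DAYS = 28


def airac_effective_date(day):
    """Return the effective date of the AIRAC cycle in force on `day`."""
    # Floor division also gives the right answer for dates before the reference.
    cycles_elapsed = (day - AIRAC_REFERENCE).days // CYCLE_DAYS
    return AIRAC_REFERENCE + timedelta(days=cycles_elapsed * CYCLE_DAYS)


# Cycle in force on 28 Aug 2023, under the assumed reference date.
print(airac_effective_date(date(2023, 8, 28)))
```

Every such effective date is a point at which the adaptation data changes, and so a point at which a previously unseen route/adaptation combination can first appear.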

moosepileit 29th Aug 2023 13:28

https://cimg1.ibsrv.net/gimg/pprune....d072aa70d7.gif
ya'll stop watching the classic aviation movies across the pond?

CBSITCB 29th Aug 2023 13:37


Originally Posted by moosepileit (Post 11493570)
ya'll stop watching the classic aviation movies across the pond?

No - https://www.pprune.org/rumours-news/...1#post11492878

DaveReidUK 29th Aug 2023 14:00


Originally Posted by NWSRG (Post 11493457)
Don't believe the incorrect flight plan malarky...seems an incredulous reason for a whole system failure.

Neither do I - it's inconceivable.

Murty 29th Aug 2023 14:20

Adding to CBSITCB's excellent run down (no pun intended) of the system, and answering a previous question by someone on the thread.

All airlines/operators do indeed submit flight plans in advance for flights in Europe to IFPS (Integrated Initial Flight Plan Processing System), primarily located at Haren, with backup at Bretigny.

Haren then forwards the flight plans to each ANSP (sovereign state) 4 hours before the flight reaches each FIR boundary.
It has to be noted that although most units are heading towards free filing of direct routes, this is still not practical in some of the UK low-level airspace. Unfortunately the IFPS system can be "got round" by operators filing DCT point to point, providing it is not too far, but this is usually caught by Swanwick's FPRS system.
When a flight plan arrives in the FPRS system, if everything is correct it is processed automatically. On a normal day this is approx 87-90%, leaving the team to manually process the remaining 10-13% into the system.
Even that is still a large number of FPLs in a day, and these are a mixture of complex military flights, air tests, many badly filed GA flights in CAS and badly filed airline flight plans. There are still many of them, especially from companies that get agencies to file their plans for them, among the worst in recent years being (EKBICPUF) - PPS and RocketRoute.
It should be noted that flight plans are processed into the UK NAS system using an old, unique input format with "dots" between fixes and routes, which is why only the ones that match what's in the UK SRD (Standard Route Document) are processed automatically.
e.g. a flight from EGGW to EGLF (not far), simplified:
Filed would be: EGGW DCT CPT DCT EGLF
Into NAS: EGGW..CPT..EGLF
This is the same all over the network: TWO dots between two fixes or two airways, and ONE dot between a fix and an airway (fix to airway, or airway to fix).
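The dot rule above can be sketched in a few lines. This is only an illustration of the rule as stated, not the actual FPRS or NAS logic: the function names are made up, and the test for "is this an airway?" is a crude assumption (one or two letters followed by digits).

```python
import re

# Heuristic (assumption): an airway designator is 1-2 letters followed by digits,
# e.g. "L9" or "UN862"; anything else is treated as a fix or aerodrome.
AIRWAY_PATTERN = re.compile(r"[A-Z]{1,2}\d{1,3}")


def is_airway(token):
    return AIRWAY_PATTERN.fullmatch(token) is not None


def to_nas_route(icao_route):
    # DCT only says "direct", so it contributes no element of its own.
    elements = [t for t in icao_route.upper().split() if t != "DCT"]
    if not elements:
        return ""
    parts = [elements[0]]
    for previous, current in zip(elements, elements[1:]):
        # Two dots between like elements (fix-fix or airway-airway),
        # one dot between a fix and an airway.
        same_kind = is_airway(previous) == is_airway(current)
        parts.append(".." if same_kind else ".")
        parts.append(current)
    return "".join(parts)


print(to_nas_route("EGGW DCT CPT DCT EGLF"))  # EGGW..CPT..EGLF
```

With a route that joins an airway, e.g. a made-up `AAAAA L9 BBBBB`, the same function gives single dots either side of the airway.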


It sounds (theory at present) as though the rogue FPL would be one of the manually processed 10-13%, as our system should reject the bad ones, so it was possibly input manually.

That said, when the system failed it would hold 4 hours of traffic, apart from late files, changes, cancellations and refiles.

This is when the FPRS team, normally 4 with possibly 3 others on shift elsewhere, would be building up to input THE OTHER 87-90% of flight plans, plus all the other messages, e.g. DLA (you can guess the number of these, though with prioritising these may just have to be left), CHG messages etc.

These staff would also need breaks.
The resilience is a fair question. In the past, Prestwick Centre had its own FPRS suite, and the CTC at Whiteley had an emergency system in the depths of that building back in 2014.
Fortunately or unfortunately, it was decided to have a "One Centre" policy to centralise the planning, especially as Scottish went for the more direct filing that has now also come into the London FIR (west of London) at high level.

All this said, as CBSITCB stated, all these newer systems are being added to the 1970s-framework NAS system, which was supposed to be replaced but appears to have a never-ending lifespan (UNTIL - dread to think), with a new Combined Ops Room that still lies idle at Swanwick, years overdue in a way that CalMac ferries would be proud of (not).





All times are GMT.

