
U.K. NATS Systems Failure

Rumours & News Reporting Points that may affect our jobs or lives as professional pilots. Also, items that may be of interest to professional pilots.

Old 29th Aug 2023, 11:13
  #61 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Flight Plans (FPLs) are filed in ICAO standard format. Part of the FPL is the route. Each national ANSP must, within its own specific flight plan processing system (FPPS), convert the FPL route into a route expressed in terms of its own national airspace computer model.

The UK airspace model is built using the same architecture as the US one. It is part of the National Airspace System (NAS), which the UK obtained from the US FAA in the 70s. The UK NAS airspace architecture is essentially the same today as it was in the 70s, though of course the model itself has changed to reflect the changing airspace over time.

A key part of the NAS software is a sub-programme called Route Conversion and Posting. This converts the ICAO FPL route into the internal format of the FPPS. It then determines which sectors (ATCOs) need to be provided with the FPL information (the “posting” part). NATS does not publish details of this sub-programme, but the FAA does. It is documented in a volume called NAS-MD-312. These details have naturally diverged over time, but the essentials remain the same.

To quote NAS-MD-312 “The 3-dimensional volumes of airspace that comprise an [airspace model] are described by points and lines with a specified altitude range for each. These volumes of airspace are fix posting areas (FPAs). Geographic points are described in source information in terms of latitude and longitude in units of degrees, minutes, and seconds and are converted to conventional X and Y coordinates, in units of one-eighth mile, and stored in that form. A boundary line is described by its geographic end points. Since each line segment has a specified altitude range adapted to it, a series of connected lines is used to describe a 3-dimensional volume of airspace. An FPA is the fundamental unit of airspace within the [airspace model]. Other volumes of airspace within a centre, such as sector or approach control areas, are described in terms of FPAs that comprise them. A fix posting area is a volume of airspace identified by a series of connected line segments that form a polygon when viewed in the horizontal and vertical plane, with each boundary line having a specified altitude range. The polygon may be convex or concave, permitting a variety of geometric shapes.”
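As a rough illustration of the geometry in that quote, here is a minimal Python sketch of a fix posting area as a polygon with an altitude range, plus a containment test. It is not taken from NAS-MD-312: the class and function names are made up, the FPA is given a single altitude range rather than one per boundary line, and the projection from latitude/longitude to X/Y is left out because the quote does not describe it. The ray-casting test is a standard technique that works for convex and concave polygons alike.

```python
from dataclasses import dataclass


def dms_to_degrees(degrees, minutes, seconds, hemisphere="N"):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


@dataclass
class FixPostingArea:
    """Hypothetical FPA: polygon in X/Y units of one-eighth mile plus an altitude band."""
    name: str
    vertices: list      # [(x, y), ...] in boundary order
    floor_ft: int       # simplification: one altitude range for the whole FPA
    ceiling_ft: int

    def contains(self, x, y, altitude_ft):
        """Return True if the point lies inside the polygon and the altitude band."""
        if not (self.floor_ft <= altitude_ft <= self.ceiling_ft):
            return False
        inside = False
        n = len(self.vertices)
        for i in range(n):
            x1, y1 = self.vertices[i]
            x2, y2 = self.vertices[(i + 1) % n]
            # Does this edge straddle the horizontal line through the point?
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside  # ray to the right crosses this edge
        return inside


# An L-shaped (concave) area, 10 miles by 10 miles with a notch removed.
fpa = FixPostingArea(
    name="DEMO",
    vertices=[(0, 0), (80, 0), (80, 40), (40, 40), (40, 80), (0, 80)],
    floor_ft=5000,
    ceiling_ft=24500,
)
```

With this shape, `fpa.contains(20, 60, 10000)` is inside the area, while `fpa.contains(60, 60, 10000)` falls in the notch and is outside. Sector or approach control areas would then be sets of such FPAs.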

This process of route conversion is obviously a very complex exercise. I understand at least one major UK NAS outage in the past was caused by errors in this process. Someone had managed to input an FPL route that passed NAS route validation (described in NAS-MD-311 Message Entry and Checking) but “did not compute” when route conversion was attempted. Of course, all possible errors should be trapped, but…

A flavour of the complexity of route conversion can be had here:
https://www.tc.faa.gov/its/worldpac/...nas-md-312.pdf

I am not saying such a problem was the cause of the outage yesterday – though it could be. It is just some background to the way NATS processes flight plans.
CBSITCB is offline  
Old 29th Aug 2023, 11:15
  #62 (permalink)  
 
Join Date: Dec 2020
Location: Home
Posts: 134
Received 44 Likes on 12 Posts
Originally Posted by munnst
There have been several of these `outages` over the last few years. One would hope that there is an SLA with the people who supply and maintain the system and that any loss can be claimed? (no, I doubt that as well).
I wonder how many of these outages the UK can suffer before carriers, countries start to look at alternative ways of routing through, around UK airspace or even flying here at all?
Maybe this illustrates part of the problem. The first reaction seems to be who to blame and can we claim compensation. Not invalid questions, but in my experience, which includes ATM systems, the problems often arise from failing properly to define what the system must do and to manage/handle errors and unexpected data in a robust way. Time will tell what went wrong yesterday, but I would be surprised if it wasn't something that could have been predicted (but maybe wasn't handled in a managed way).
Equivocal is offline  
Old 29th Aug 2023, 11:17
  #63 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,816
Received 1,476 Likes on 891 Posts
We had an MoD system brought to a shuddering halt because of a single email sent by one user. OK, she somehow managed to Ctrl-A, Ctrl-C, Ctrl-V the entire address book, including dist lists, into the To line and add a read and delivery receipt. The resultant cascade, plus all the out-of-office responses that were de rigueur in those days, froze the system to the point where we could not even talk to it. IIRC it took 2 days to effect a full recovery.
Ninthace is offline  
Old 29th Aug 2023, 11:43
  #64 (permalink)  
 
Join Date: Jan 2011
Location: Brasil
Age: 42
Posts: 145
Likes: 0
Received 0 Likes on 0 Posts
Have we ruled oit alien invasion?
JumpJumpJump is offline  
Old 29th Aug 2023, 11:45
  #65 (permalink)  
 
Join Date: Jan 2008
Location: Glorious Devon
Posts: 2,816
Received 1,476 Likes on 891 Posts
Originally Posted by JumpJumpJump
Have we ruled oit alien invasion?
I am not sure. Is there intelligent life on Oit?
Ninthace is offline  
Old 29th Aug 2023, 11:46
  #66 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Just listening to the usual journalistic blather about "why no backup system?". Of course there is an online backup system to switch to automatically, but this caters almost exclusively for hardware errors. In the "erroneous FPL" scenario I described earlier, the problem was in the software logic, and the backup system runs the same software.

What happened (AIUI) was that the NAS programme "FLOPed" (Functional Lapse of the Operational Programme) – aka crashed. The programme was restarted successfully, but when the recovery data (the data in the system, including FPLs, that is recorded as a backup from time to time) was read in, it FLOPed again. The rogue FPL was in the recovery data.

Eventually the system was restarted using a recovery data set recorded before the erroneous FPL was entered. Phew, everything hunky-dory again.

Until the originator of the rogue FPL realised it was no longer in the system so re-entered it....
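For anyone wondering what a defence against that restart loop might look like, here is a minimal Python sketch. It is purely illustrative and says nothing about how NAS actually handles recovery: `convert_route`, `replay_recovery_data` and `submit` are hypothetical names, and the "rogue" route is simulated with a marker string. The idea is to replay recovery records one at a time, set aside any record that fails conversion instead of letting it take down the restart, and refuse the same plan if it is entered again.

```python
class RouteConversionError(Exception):
    """Raised by the hypothetical convert_route when a route cannot be converted."""


def convert_route(fpl):
    # Stand-in for route conversion: any route containing "ROGUE" fails.
    if "ROGUE" in fpl["route"]:
        raise RouteConversionError(f"cannot convert route for {fpl['callsign']}")
    return fpl["route"].split()


def replay_recovery_data(records):
    """Reload saved flight plans, quarantining any that fail conversion."""
    loaded, quarantined = {}, set()
    for fpl in records:
        try:
            loaded[fpl["callsign"]] = convert_route(fpl)
        except RouteConversionError:
            # Remember the offending plan instead of crashing the restart.
            quarantined.add((fpl["callsign"], fpl["route"]))
    return loaded, quarantined


def submit(fpl, loaded, quarantined):
    """Accept a new plan unless it is known, or found, to break conversion."""
    key = (fpl["callsign"], fpl["route"])
    if key in quarantined:
        return False  # same plan re-entered after being quarantined
    try:
        loaded[fpl["callsign"]] = convert_route(fpl)
    except RouteConversionError:
        quarantined.add(key)
        return False
    return True


records = [
    {"callsign": "ABC123", "route": "CPT KENET"},
    {"callsign": "XYZ9", "route": "CPT ROGUE KENET"},
]
loaded, quarantined = replay_recovery_data(records)
```

After the replay, `ABC123` is loaded and `XYZ9` is quarantined, and `submit(records[1], loaded, quarantined)` is refused rather than bringing the system down a third time. The catch, of course, is that this only helps if the failure surfaces as a trappable error rather than a hard crash.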

Last edited by CBSITCB; 29th Aug 2023 at 16:45. Reason: Replace bad glyphs (composed in Word and PPrune converts curly quote marks to square boxes - grrr)
CBSITCB is offline  
Old 29th Aug 2023, 11:51
  #67 (permalink)  
 
Join Date: Mar 2008
Location: Dublin
Posts: 415
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by WHBM
So why did BA seem to scrub pretty much everything (a trend seemingly repeated out of London to all over Europe ? )
Perhaps they were told that they could operate X number of flights and they decided to prioritize their long haul operation?
Noxegon is offline  
Old 29th Aug 2023, 12:22
  #68 (permalink)  
 
Join Date: Aug 2003
Location: FR
Posts: 235
Likes: 0
Received 0 Likes on 0 Posts
CBSit, somebody mentioned that something similar ("bad FPL") happened in the past. You would then expect that particular FPL to be identified soon after the first restart (and second crash), so that it's not re-entered a third time?
pax2908 is offline  
Old 29th Aug 2023, 12:41
  #69 (permalink)  
 
Join Date: Aug 2008
Location: Purfleet
Posts: 88
Received 5 Likes on 1 Post
Originally Posted by CBSITCB
Until the originator of the rogue FPL realised it was no longer in the system so re-entered it…
An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.
togsdragracing is offline  
Old 29th Aug 2023, 12:46
  #70 (permalink)  
 
Join Date: Jun 2009
Location: East Sussex
Posts: 499
Received 68 Likes on 24 Posts
OK, for what it's worth, my money is on someone digging through the main fibre optic cable into Swanwick, somewhere outside the site. It happens. It would explain why the backup system didn't work - no comms in/out of the building. With infrastructure that critical, there should be a second FO cable with a completely different route to Swanwick and a different entry point into the building.

Masterful decision to blame the French

WB627 is offline  
Old 29th Aug 2023, 12:46
  #71 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by chevvron
The 9020 was nothing really special, just 6 x IBM 360s linked together, however it coped - just.
Whilst the 9020 was indeed based on IBM S/360 technology it was rather more than "just six 360s". The 9020 was being designed for the FAA well before the new S/360 line was announced. In fact, this was a major stumbling block (overcome) as the FAA contract specified "off-the-shelf" hardware.

The 9020 CPUs were based on what was to become the S/360 commercial product line but they incorporated many unique hardware features to facilitate their intended use as an ATC system. These included multiprocessing (a simplified version later added to some ‘standard’ processors), address translation, and specific ATC instructions.

When the UK 9020 was retired in 1989 it was coping very well. Capacity limitations were on the horizon, which is the main reason it was replaced, but not yet causing delays. The hardware was still reliable, though maintainability was becoming difficult. It was a source of pride to the engineers concerned that when it was shut down for the final time the whole system was 100% serviceable.

I am referring to the 9020 system itself, not the NAS software or station power supplies.

Last edited by CBSITCB; 29th Aug 2023 at 16:42. Reason: Replace bad glyphs (composed in Word and PPrune converts curly quote marks to square boxes - grrr)
CBSITCB is offline  
Old 29th Aug 2023, 12:50
  #72 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by pax2908
CBSit, somebody mentioned that something similar ("bad FPL") happened in the past. You would then expect that particular FPL to be identified soon after the first restart (and second crash), so that it's not re-entered a third time?
I think that's exactly what happened.
CBSITCB is offline  
Old 29th Aug 2023, 12:52
  #73 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by WB627
With infrastructure that critical, there should be a second FO cable with a completely different route to Swanwick and a different entry point into the building.
There is a redundant communication path.
CBSITCB is offline  
Old 29th Aug 2023, 12:54
  #74 (permalink)  
 
Join Date: Aug 2010
Location: UK
Age: 67
Posts: 181
Received 52 Likes on 33 Posts
The problem with testing software is that you can't test all combinations of input values to ensure the required output values are correct, certainly not in very large or complex systems. Failure testing is often limited to testing the alternate paths (within the software) defined in the requirements/specification. Edge cases will always catch you out.

With that in mind, critical systems like this should always fail safe, ie reject any invalid input, or input which causes invalid output, rather than fail catastrophically, which appears to be the case this time.
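A minimal Python sketch of that fail-safe idea, with made-up names (`process_safely`, `handler`, `output_is_valid`) and no claim that any real ATC system is structured this way: the wrapper rejects a message if the handler raises or if the handler's output fails a sanity check, and carries on instead of bringing everything down.

```python
def process_safely(message, handler, output_is_valid):
    """Run handler on message; reject rather than crash on bad input or output."""
    try:
        result = handler(message)
    except Exception as exc:  # deliberately broad: one message must not stop the system
        return ("rejected", f"handler failed: {exc!r}")
    if not output_is_valid(result):
        return ("rejected", "handler produced invalid output")
    return ("accepted", result)


def has_two_points(route_elements):
    # Example output check: a converted route needs at least two elements.
    return len(route_elements) >= 2


ok = process_safely("CPT KENET", str.split, has_two_points)
bad_input = process_safely(None, str.split, has_two_points)
bad_output = process_safely("CPT", str.split, has_two_points)
```

Here `ok` is accepted, while `bad_input` (the handler raises) and `bad_output` (the handler returns something implausible) are both rejected with a reason that an operator could look at later. The hard part in practice is deciding what a valid output looks like, which is exactly where the edge cases hide.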

Similarly for hardware and connectivity of critical systems, no one failure should cause a system wide crash.

I wonder how often, if ever, business continuity testing is performed which should have enabled quick recovery.

golfbananajam is offline  
Old 29th Aug 2023, 12:58
  #75 (permalink)  
 
Join Date: Jun 2009
Location: East Sussex
Posts: 499
Received 68 Likes on 24 Posts
Originally Posted by CBSITCB
There is a redundant communication path.
Well, in that case, it must have been the French: more revenge for Brexit


WB627 is offline  
Old 29th Aug 2023, 13:12
  #76 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by togsdragracing
An old tecchie writes: is there not some sort of validity check carried out on submitted plans? Be it syntax, format, whatever.
There is, and it is extensive.

It would have been better to have called it the trigger FPL rather than the erroneous FPL. It was “bad” in that it caused the crash, but not necessarily incorrectly formatted. I don’t recall the precise details of the event.

The 28-day AIRAC cycle introduces new airspace, reporting points, airfields, etc. Thus the set of valid FPL routes also changes every 28 days. This means the NAS software has to be modified (“adapted”) to accept and process the new reporting points and routes.

So the FPL probably got through the NMOC IFPS (or whatever the equivalent was back then) as a valid route, and the error was in the updated UK route conversion process adaptation.
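To put dates on the 28-day cycle, here is a small Python sketch that finds the AIRAC effective date in force on a given day. It assumes 2 January 2020 as a known cycle effective date and simply steps in 28-day increments from there; the function name is made up, and the published AIRAC schedule is the thing to check against.

```python
from datetime import date, timedelta

# Assumption: 2 January 2020 was an AIRAC effective date.
AIRAC_REFERENCE = date(2020, 1, 2)
CYCLE_DAYS = 28


def airac_effective_date(day):
    """Return the effective date of the AIRAC cycle in force on the given day."""
    cycles = (day - AIRAC_REFERENCE).days // CYCLE_DAYS
    return AIRAC_REFERENCE + timedelta(days=cycles * CYCLE_DAYS)


in_force = airac_effective_date(date(2023, 8, 28))
next_cycle = in_force + timedelta(days=CYCLE_DAYS)
```

On that assumption, the cycle in force on 28 August 2023 took effect on 10 August 2023, with the next one due on 7 September 2023.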
CBSITCB is offline  
Old 29th Aug 2023, 13:28
  #77 (permalink)  
 
Join Date: Jan 2008
Location: USA
Posts: 36
Received 0 Likes on 0 Posts

ya'll stop watching the classic aviation movies across the pond?
moosepileit is offline  
Old 29th Aug 2023, 13:37
  #78 (permalink)  
 
Join Date: Mar 2016
Location: Location: Location
Posts: 59
Received 0 Likes on 0 Posts
Originally Posted by moosepileit
ya'll stop watching the classic aviation movies across the pond?
No - U.K. NATS Systems Failure
CBSITCB is offline  
Old 29th Aug 2023, 14:00
  #79 (permalink)  
 
Join Date: Jan 2008
Location: Reading, UK
Posts: 15,984
Received 305 Likes on 158 Posts
Originally Posted by NWSRG
Don't believe the incorrect flight plan malarky...seems an incredulous reason for a whole system failure.
Neither do I - it's inconceivable.
DaveReidUK is offline  
Old 29th Aug 2023, 14:20
  #80 (permalink)  
 
Join Date: Apr 2006
Location: England
Age: 62
Posts: 14
Likes: 0
Received 0 Likes on 0 Posts
Adding to CBSIT's excellent run-down (no pun intended) of the system, and answering a previous question by someone on the thread.

All airlines/operators do indeed submit flight plans in advance for flights in Europe to IFPS (Integrated Initial Flight Plan Processing System), primarily located at Haren, with a backup at Bretigny.

Haren then forwards the flight plans to each ANSP (sovereign state) 4 hours before the flight reaches each FIR boundary.
It has to be noted that although most units are heading towards free filing of directs, this is still not practical in some of the UK low-level airspace. Unfortunately the IFPS system can be "got round" by operators filing DCT point to point, provided it is not too far, but this is usually caught by Swanwick's FPRS system.
When a flight plan arrives in the FPRS system and everything is correct, it is processed automatically. On a normal day this accounts for approx 87-90%, leaving the team to manually process the remaining 10-13% into the system.

Up to 13% of FPLs is still a large amount in a day, and these are a mixture of complex military flights, air tests, many badly filed GA flights in CAS and badly filed airline flight plans. There are still many of the latter, especially from companies that get an agency to file their plans for them, one of the worst in recent years being (EKBICPUF) - PPS and RocketRoute.
It should be noted that the way flight plans are processed into the UK NAS system uses an old, unique input format with "dots" between fixes and routes, which is why only ones that match what's in the UK SRD (Standard Route Document) are processed automatically.
e.g. a flight from EGGW to EGLF (not far), simplified:
Filed would be: EGGW DCT CPT DCT EGLF
Into NAS: EGGW..CPT..EGLF
This is the same all over the network: TWO dots between two fixes or two airways, and ONE dot between fix and airway or airway and fix.
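The dot rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the FPRS or NAS logic: the function names are made up, `DCT` tokens are simply dropped, and an element is guessed to be an airway if it looks like one or two letters followed by digits (a heuristic assumed here, not a rule from the SRD).

```python
import re

# Heuristic assumption: airway designators look like L9, N866 or UL9;
# anything else (CPT, KENET, EGGW) is treated as a fix.
AIRWAY_PATTERN = re.compile(r"^[A-Z]{1,2}\d{1,3}[A-Z]?$")


def is_airway(element):
    return bool(AIRWAY_PATTERN.match(element))


def to_nas_route(departure, icao_route, destination):
    """Join route elements with '..' between like kinds and '.' between unlike kinds."""
    elements = [departure]
    elements += [tok for tok in icao_route.split() if tok != "DCT"]
    elements.append(destination)

    nas = elements[0]
    for previous, current in zip(elements, elements[1:]):
        same_kind = is_airway(previous) == is_airway(current)
        nas += (".." if same_kind else ".") + current
    return nas


direct = to_nas_route("EGGW", "DCT CPT DCT", "EGLF")
via_airway = to_nas_route("EGGW", "CPT L9 KENET", "EGLF")
```

`direct` reproduces the example above, `EGGW..CPT..EGLF`. The second call uses an airway segment chosen purely for illustration and gives `EGGW..CPT.L9.KENET..EGLF`, with single dots on either side of the airway.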


It sounds (theory at present) as if the rogue FPL would be one of the 13%, as our system should reject the bad ones, so it was possibly input manually.

That said, when the system failed it would hold 4 hours of traffic, unless they were late files, changes, cancellations and refiles.

This is when the FPRS team (4 normally, with possibly 3 others on shift elsewhere) would be building up to input THE OTHER 87-90% of flight plans, plus all other messages, e.g. DLA (you can guess the number of these, but with prioritising these may just have to be left), CHG messages etc.

These staff would also need breaks.
The resilience is a fair question. In the past Prestwick Centre had its own FPRS suite, and the CTC at Whiteley had an emergency system in the depths of that building back in 2014.
Fortunately or unfortunately, it was decided to have a "One Centre" policy to centralise the planning, especially as Scottish went for the more direct filing that has also now come into the London FIR (west of London) at high level.

All this said, as CBSITCB stated, all these newer systems are being added to the 1970s-framework NAS system that was supposed to be replaced but appears to have a never-ending lifespan (UNTIL - dread to think), with a new Combined Ops room at Swanwick that still lies idle, years overdue in a way that CalMac ferries would be proud of (not).




Last edited by Murty; 29th Aug 2023 at 15:02.
Murty is offline  
