
U.K. NATS Systems Failure

Old 12th Sep 2023, 23:21
  #361 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by steamchicken
Probably worth reading this:

https://jameshaydon.github.io/nats-fail/

The algorithm used to find the UK portion is, not to put too fine a point on it, dumb. It works by searching forwards through the flight plan until it finds the entry point, then skips to the end and searches backwards, which will have all kinds of exciting consequences if:

- the route leaves UK airspace and re-enters it (the leg or legs outside the airspace will be wrongly included as part of the UK portion)
- the route exits UK airspace through the same waypoint it entered it (the leg within UK airspace will disappear)
- the exit point is not explicitly stated and a duplicate is present (what happened this time)
- probably more cases I haven't thought of

Compare the following program: from the beginning of the plan, check each waypoint to see if it's in the UK. When the first UK waypoint is found, create a UK leg starting at that waypoint. Add subsequent waypoints to the leg if they are in the UK. When a foreign waypoint is found, end the UK leg after the preceding waypoint and add it to the list "UK legs" under the flight plan ID. Continue until you have no more waypoints, and move to the next flight plan.

You'll observe that this copes fine with loops and missing entry/exit points, although we still need to check for duplicates explicitly; a simple dupe catcher would be to flag any UK leg that contains exactly one waypoint for review, because either it's a route that passes in and out over the same point without going anywhere else in the UK (weird but I suppose just possible) or it's a duplicate.

Even if those edge cases are rare and weird, they're not impossible, and of course someone might file a malformed plan maliciously now they know how to break the system.
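
The single-pass scan described in the quoted post can be sketched roughly as below. This is only an illustration under stated assumptions, not the FPRSA algorithm: the waypoint names, the `UK_NAMES` set and the function names are all invented, and membership is tested by name alone, which is exactly why the single-waypoint check is still needed.

```python
from typing import Callable


def uk_legs(route: list[str], in_uk: Callable[[str], bool]) -> list[list[str]]:
    """Split a route into maximal runs of consecutive UK waypoints."""
    legs: list[list[str]] = []
    current: list[str] = []
    for waypoint in route:
        if in_uk(waypoint):
            current.append(waypoint)   # start a UK leg or extend the open one
        elif current:
            legs.append(current)       # a foreign waypoint closes the open leg
            current = []
    if current:                        # route ended while still inside the UK
        legs.append(current)
    return legs


def flag_for_review(legs: list[list[str]]) -> list[list[str]]:
    """A one-waypoint leg is either an in-and-out route or a duplicate name."""
    return [leg for leg in legs if len(leg) == 1]


# Hypothetical data: these names are invented, not real fixes.
UK_NAMES = {"UKA", "UKB", "UKC", "UKD"}
route = ["FR1", "UKA", "UKB", "IE1", "UKC", "UKD", "FR2", "UKA", "FR3"]

legs = uk_legs(route, lambda name: name in UK_NAMES)
suspect = flag_for_review(legs)
print(legs)
print(suspect)
```

The route above leaves and re-enters the UK and then hits a stray "UKA" much later; the scan keeps the two genuine legs separate and the stray one is the only leg flagged for review.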
The Author's general analysis which accompanies that extract is one of the better ones I've seen. Unlike most of those which start "As a software engineer" or "As a test manager" they have actually read the detail in the report and taken the trouble to do further research. Some of the assumptions are off but based on what is in the public domain and without the context of domain knowledge it's a reasonable assessment.

I'm not sure about the part copied above though. The description in the original report isn't very detailed and jumps around a lot. It isn't clear when it is using ADEXP or the raw ICAO format, and it makes no attempt to explain why it might need to do the latter. It also isn't clear why, if it is searching backwards, it somehow finds the first of the duplicates rather than the second if they are both in the plan. The algorithm may be dumb, and I'd hope most could improve it knowing what we know now, but I don't think you can make that judgement based on the report alone. I'd like to know whether there was a design rationale for searching backwards before I'd make any comment on the merit of doing so.

I'd be surprised if FPRSA can't cope with re-entrant flight plans as that was a shortcoming of previous systems so it would almost certainly have a specific requirement with associated test to cope with that. It is quite possible, depending upon where the flight came out of Oceanic Airspace, that it could have passed out of UK airspace and back into it before it got anywhere near the exit point with the duplicate a little way beyond.

I also don't share the author's faith in fuzzing for this type of system. As it gets cheaper and easier to do, you'll probably do it for most of these systems in the future, but it's a bit like infinite monkeys typing the complete works of Shakespeare. It might eventually hit on the combination that breaks the system, but how long do you give it?

eglnyt is offline  
Old 13th Sep 2023, 07:30
  #362 (permalink)  
 
Join Date: Mar 2002
Location: Surrey, UK
Posts: 898
Received 12 Likes on 7 Posts
Originally Posted by eglnyt
The Author's general analysis which accompanies that extract is one of the better ones I've seen. Unlike most of those which start "As a software engineer" or "As a test manager" they have actually read the detail in the report and taken the trouble to do further research. Some of the assumptions are off but based on what is in the public domain and without the context of domain knowledge it's a reasonable assessment.

I'm not sure about the part copied above though. The description in the original report isn't very detailed and jumps around a lot. It isn't clear when it is using ADEXP or the raw ICAO format, and it makes no attempt to explain why it might need to do the latter. It also isn't clear why, if it is searching backwards, it somehow finds the first of the duplicates rather than the second if they are both in the plan. The algorithm may be dumb, and I'd hope most could improve it knowing what we know now, but I don't think you can make that judgement based on the report alone. I'd like to know whether there was a design rationale for searching backwards before I'd make any comment on the merit of doing so.

I'd be surprised if FPRSA can't cope with re-entrant flight plans as that was a shortcoming of previous systems so it would almost certainly have a specific requirement with associated test to cope with that. It is quite possible, depending upon where the flight came out of Oceanic Airspace, that it could have passed out of UK airspace and back into it before it got anywhere near the exit point with the duplicate a little way beyond.

I also don't share the author's faith in fuzzing for this type of system. As it gets cheaper and easier to do, you'll probably do it for most of these systems in the future, but it's a bit like infinite monkeys typing the complete works of Shakespeare. It might eventually hit on the combination that breaks the system, but how long do you give it?
I think the problem might be premature optimization. Skipping to the end is quicker if the flight ends in the UK, so that there are only a few waypoints after entry, and slower if it continues somewhere else. If you expect a large majority of flight plans coming from Eurocontrol to terminate here, you might see this as a performance speedup. However, the operation described is just iterating through a list of strings, checking if each string is in a hash table, and appending to another list, all extremely fast operations, so any performance gain would be a very small change in a very small number and not really worth the bother or, more to the point, the complexity.
steamchicken is offline  
Old 13th Sep 2023, 12:22
  #363 (permalink)  
 
Join Date: Jun 2008
Location: Cambridge UK
Posts: 192
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
The Author's general analysis which accompanies that extract is one of the better ones I've seen. Unlike most of those which start "As a software engineer" or "As a test manager" they have actually read the detail in the report and taken the trouble to do further research. Some of the assumptions are off but based on what is in the public domain and without the context of domain knowledge it's a reasonable assessment.
...
I stated I was a software engineer before asking a question about the context of the error handling. I intended to indicate the nature of the answer I could handle: the s/w aspects could be reasonably complex, while the flight-plan aspects would need to be pretty simplistic.

Perhaps I was too polite, and should have said what seemed all too possible. A situation occurred that the code recognised it couldn't handle, and the programmer initiated a controlled shutdown without logging the identity of the associated flight-plan somewhere where it would be immediately seen by those responding to the shutdown. But it seemed pretty rude to suggest that until it was confirmed that it was a controlled shutdown and there was no such logging (in a place that would get checked quickly).

As a friend used to say "They wouldn't do that. Would they?".

Sadly from steamchicken's postings this seems to have been exactly what happened. Although it seems a blindingly obvious system requirement, and easy to check at a code reading. Obviously the choice of log requires coordination between the coding specs and the recovery procedure.

PS
1) Re-writing old code is of course fraught with costs and the possibility of new mistakes, but code-reading to confirm that all "fail-safe" requests are preceded by appropriate logging of the identity of the flight plan seems potentially manageable.

2) Am I right in thinking that the problem couldn't have occurred if the algorithm had operated on a data structure using some sort of unique waypoint-id (e.g. map location) rather than an unqualified waypoint-name?
Perhaps a similar "shooting yourself in the foot" to the problems described in:
U.K. NATS Systems Failure &
U.K. NATS Systems Failure.
The latter of course involves historic code development, but I think this error was in relatively recent code.
Peter H is offline  
Old 13th Sep 2023, 13:06
  #364 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Peter H
I stated I was a software engineer before asking a question about the context of the error handling. I intended to indicate the nature of the answer I could handle: the s/w aspects could be reasonably complex, while the flight-plan aspects would need to be pretty simplistic.

Perhaps I was too polite, and should have said what seemed all too possible. A situation occurred that the code recognised it couldn't handle, and the programmer initiated a controlled shutdown without logging the identity of the associated flight-plan somewhere where it would be immediately seen by those responding to the shutdown. But it seemed pretty rude to suggest that until it was confirmed that it was a controlled shutdown and there was no such logging (in a place that would get checked quickly).

As a friend used to say "They wouldn't do that. Would they?".

Sadly from steamchicken's postings this seems to have been exactly what happened. Although it seems a blindingly obvious system requirement, and easy to check at a code reading. Obviously the choice of log requires coordination between the coding specs and the recovery procedure.

PS
1) Re-writing old code is of course fraught with costs and the possibility of new mistakes, but code-reading to confirm that all "fail-safe" requests are preceded by appropriate logging of the identity of the flight plan seems potentially manageable.

2) Am I right in thinking that the problem couldn't have occurred if the algorithm had operated on a data structure using some sort of unique waypoint-id (e.g. map location) rather than an unqualified waypoint-name?
Perhaps a similar "shooting yourself in the foot" to the problems described in:
U.K. NATS Systems Failure &
U.K. NATS Systems Failure.
The latter of course involves historic code development, but I think this error was in relatively recent code.
The problem wouldn't have occurred with a unique waypoint data structure. Not having one is an international feature of the environment in which the system operates, and neither NATS nor the software supplier can change that. Many of those war stories in earlier posts are generally controlled within newer systems such as FPRSA but are still inherent in the ageing message structure.

We won't know until the fuller report what the system wrote where when it shut down. I would expect the logs with that info to be readily available to the people you expect to react reasonably quickly if you have a system you expect to restore quickly. We know the logs were accessed by those further down the support chain; we don't know yet why it took so long to identify the errant plan, isolate it and restore the system.

I have said earlier I would expect a system of this type to react differently in those circumstances. Isolating flight plans the system doesn't like and throwing them out for human attention is standard practice for most of the flight data systems I know. The suggestion you can't do that for safety reasons is daft. Hopefully, rather than double down on that, the next report will examine why this system didn't. If they want to double down they might need to check whether any of their other systems do. You don't need to be able to identify the issue and fix it that quickly; most systems have the ability to quickly create a minimal flight plan which can be put in the system so controllers downstream have an awareness of the aircraft without corrupt data.
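
For the software people following along, the "isolate the plan and carry on" behaviour, with the plan identity logged first, might look something like the sketch below. Everything in it is an assumption for illustration: the logger name, the plan IDs, and `extract_uk_portion`, which is a toy stand-in that simply rejects any route containing a repeated name. It says nothing about how FPRSA is actually structured.

```python
import logging

logger = logging.getLogger("flight_plan_processor")  # hypothetical logger name


def extract_uk_portion(route: list[str]) -> list[str]:
    """Toy stand-in for the real extraction step; rejects repeated names."""
    if len(set(route)) != len(route):
        raise ValueError("duplicate waypoint name in route")
    return route


def process_plans(plans: dict[str, list[str]]):
    """Process every plan; set failures aside instead of stopping the system."""
    accepted: dict[str, list[str]] = {}
    quarantined: list[tuple[str, str]] = []
    for plan_id, route in plans.items():
        try:
            accepted[plan_id] = extract_uk_portion(route)
        except Exception as exc:
            # Log the identity of the offending plan before doing anything else,
            # so whoever responds can find it immediately.
            logger.error("flight plan %s set aside for manual handling: %s",
                         plan_id, exc)
            quarantined.append((plan_id, str(exc)))
    return accepted, quarantined


# Hypothetical plan IDs and waypoint names.
plans = {
    "FP001": ["AAA", "BBB", "CCC"],
    "FP002": ["AAA", "DDD", "AAA"],   # the plan the processor cannot handle
    "FP003": ["EEE", "FFF"],
}
accepted, quarantined = process_plans(plans)
```

Catching every exception per plan is a deliberate choice in this sketch: the unit of failure is one flight plan rather than the whole process. Whether that is acceptable for a given fault is a safety-case question, which is the point being argued in this thread.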
eglnyt is offline  
Old 13th Sep 2023, 15:46
  #365 (permalink)  
 
Join Date: Jun 2008
Location: Cambridge UK
Posts: 192
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
The problem wouldn't have occurred with a unique waypoint data structure. Not having one is an international feature of the environment in which the system operates, and neither NATS nor the software supplier can change that. Many of those war stories in earlier posts are generally controlled within newer systems such as FPRSA but are still inherent in the ageing message structure.
I fully appreciate that waypoint names aren't unique. Which is why I suggested that the software should convert the non-unique names to a unique-id of some sort in the internal data structures it uses for generating routes. Purely as an example it could use a unique-map-reference instead of the name. Obviously converting back to the non-unique name when producing the required message structure for the generated route.
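
A rough sketch of what is meant by a unique internal id, assuming for illustration that a fix is identified by its name plus its position. The `Waypoint` type, the names and the coordinates are all invented; the point is only that two fixes sharing a name stay distinct internally and are turned back into plain names on output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    """Internal identity: name plus position, so same-named fixes differ."""
    name: str
    lat: float
    lon: float


# Hypothetical fixes: "BBB" exists twice, once in the UK and once far away.
AAA = Waypoint("AAA", 51.0, -1.0)
BBB_UK = Waypoint("BBB", 52.0, -2.0)
BBB_FAR = Waypoint("BBB", 45.0, -100.0)
XXX = Waypoint("XXX", 49.0, 2.0)

UK_FIXES = {AAA, BBB_UK}


def in_uk(wp: Waypoint) -> bool:
    # Membership is decided on the full identity, not on the bare name.
    return wp in UK_FIXES


def to_message(route: list[Waypoint]) -> str:
    """Convert back to the non-unique names the message format requires."""
    return " ".join(wp.name for wp in route)


route = [XXX, AAA, BBB_UK, BBB_FAR]
uk_part = [wp for wp in route if in_uk(wp)]
print(to_message(route))    # names only, duplicates and all
print(to_message(uk_part))
```

This only covers the internal representation. Getting from the incoming names to the right `Waypoint` objects is a separate resolution step, and it needs handling of its own.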
Peter H is offline  
Old 13th Sep 2023, 16:45
  #366 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Peter H
I fully appreciate that waypoint names aren't unique. Which is why I suggested that the software should convert the non-unique names to a unique-id of some sort in the internal data structures it uses for generating routes. Purely as an example it could use a unique-map-reference instead of the name. Obviously converting back to the non-unique name when producing the required message structure for the generated route.
The problem is that you would need to make sure that the system you use to resolve a stream of non-unique names to a set of unique names within the context of your airspace doesn't have a bug which causes it to stop working when it fails to resolve that stream.
eglnyt is offline  
Old 14th Sep 2023, 17:15
  #367 (permalink)  
 
Join Date: Mar 2006
Location: Vance, Belgium
Age: 62
Posts: 271
Likes: 0
Received 5 Likes on 3 Posts
What is particularly strange is that there IS a database with unique 5-character waypoint designators. This is the ICARD 5LNC database promoted by ICAO and Eurocontrol.
The purpose of this database and initiative is to progressively eliminate all duplicate 5LNCs, as all countries are invited to register the en-route designators documented in their AIP. Of course, no duplicate is allowed in the database. I understand that all countries participating in Eurocontrol have already completed the process.

As the message format ADEXP originates from Eurocontrol, it is strange that the paragraph RTEPTS does not contain an indicator stating for each 5LNC whether it is a unique ICARD identifier or not. The non-unique identifiers could then simply be ignored by the NATS software.

https://www.icao.int/MID/Documents/2...520SG8/IP6.pdf
https://www.icao.int/EURNAT/Document...l%25202017.pdf
Luc Lion is offline  
Old 15th Sep 2023, 09:59
  #368 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Luc Lion
What is particularly strange is that there IS a database with unique 5-character waypoint designators. This is the ICARD 5LNC database promoted by ICAO and Eurocontrol.
The purpose of this database and initiative is to progressively eliminate all duplicate 5LNCs, as all countries are invited to register the en-route designators documented in their AIP. Of course, no duplicate is allowed in the database. I understand that all countries participating in Eurocontrol have already completed the process.

As the message format ADEXP originates from Eurocontrol, it is strange that the paragraph RTEPTS does not contain an indicator stating for each 5LNC whether it is a unique ICARD identifier or not. The non-unique identifiers could then simply be ignored by the NATS software.

https://www.icao.int/MID/Documents/2...520SG8/IP6.pdf
https://www.icao.int/EURNAT/Document...l%25202017.pdf
ICARD contains duplications that previously existed. It won't let you introduce new ones but not all states used it the last time I saw anything official on this matter. Eurocontrol tried very hard to make ICARD a complete list but accepted it wasn't although it should be complete for those states with an effective and up to date AIP.

It might get rather more difficult to take out duplications in the 3 letter designators for Navaids. I'm not aware that anybody considered those a problem before. In the UK at least the name is chosen to avoid confusion with similar sounding beacons but once you've chosen a name the available letter combinations are usually quite limited.

Why is everybody so determined to change systems other than FPRSA and any from the same family with the same flaw? There are good reasons to try and eliminate duplicates, but this was a bug in a system used by NATS, so let's fix that rather than change something else. It is perfectly possible to resolve this issue with the information already in the flight plan: every waypoint in the plan exists in the context of other waypoints, which would allow you to do that.
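
One way of using that context, sketched under loud assumptions: when a name has several candidate positions, take the one nearest to the previously resolved point. The greedy nearest-neighbour rule, the candidate table and all names and coordinates are my own invention for illustration; a real system would need something more careful (a long oceanic leg could fool this), and nothing here is claimed to be what FPRSA does.

```python
import math

Position = tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Position, b: Position) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def resolve(names: list[str],
            candidates: dict[str, list[Position]],
            start: Position) -> list[tuple[str, Position]]:
    """Resolve each name to the candidate closest to the previous point."""
    resolved: list[tuple[str, Position]] = []
    previous = start
    for name in names:
        options = candidates.get(name)
        if not options:
            # Raise something the caller can catch and route to a human.
            raise LookupError(f"unknown waypoint name: {name}")
        best = min(options, key=lambda pos: haversine_km(previous, pos))
        resolved.append((name, best))
        previous = best
    return resolved


# Hypothetical candidate table: "BBB" is ambiguous.
candidates = {
    "AAA": [(51.0, -1.0)],
    "BBB": [(52.0, -2.0), (45.0, -100.0)],
}
resolved = resolve(["AAA", "BBB"], candidates, start=(49.0, 2.0))
print(resolved)
```

An unresolvable name raises an exception naming the waypoint, so the caller can set that one plan aside rather than having the resolver itself stop working.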
eglnyt is offline  
Old 27th Sep 2023, 15:09
  #369 (permalink)  
 
Join Date: Sep 2006
Location: UK - Hants
Posts: 151
Likes: 0
Received 0 Likes on 0 Posts
I actually raised a question years ago when I worked in ATM about how aircraft FMS can tell the difference between BCN (Brecon) and BCN (Barcelona). The former a VOR the latter a DVOR.
I was informed there are some geographical interpretations made by those systems - but it was not a very satisfactory answer. It is not like they are particularly far apart in the bigger scheme of things.

The two still exist today and I assume the ATC systems can make the distinction when processing the flight plans...but perhaps not?
11K-AVML is offline  
Old 27th Sep 2023, 15:44
  #370 (permalink)  
 
Join Date: May 2001
Location: England
Posts: 1,904
Likes: 0
Received 0 Likes on 0 Posts
Within most FMSs/FMGCs, when you enter a waypoint that's ambiguous, you are presented with a page that allows you to select the correct one from a list of choices. The LAT/LONG is written next to each choice along with the distance from your present position. That gives you a fighting chance to select the right one. Duplicate beacon identifiers are very common, but as most predate modern civil aviation, I suppose it can be forgiven that no one paid attention to what they were doing. However, what's truly surprising is that ICAO somehow allowed duplicate waypoint names to exist in the first place. As mentioned above, they now seem to be doing something about a problem that people from the god damn Flight Sim community have been talking about for 25 years (back in the day I was involved with writing FMS software for sims).

Edit: NATS failure Thoroughly explained here
Superpilot is offline  
Old 27th Sep 2023, 16:28
  #371 (permalink)  
 
Join Date: Jan 2008
Location: Reading, UK
Posts: 15,822
Received 206 Likes on 94 Posts
Originally Posted by Superpilot
Edit: NATS failure Thoroughly explained here
Thanks for that - proof that the DVL is indeed in the detail.
DaveReidUK is offline  
Old 28th Sep 2023, 13:28
  #372 (permalink)  
 
Join Date: Jan 2008
Location: USA
Posts: 34
Received 0 Likes on 0 Posts
Pun of the year !

Originally Posted by DaveReidUK
Thanks for that - proof that the DVL is indeed in the detail.
Well played!
moosepileit is offline  
Old 6th Oct 2023, 07:41
  #373 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Details of the Independent Review Chair and Terms of Reference now on the CAA site.
eglnyt is offline  
Old 6th Oct 2023, 08:01
  #374 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by eglnyt
Details of the Independent Review Chair and Terms of Reference now on the CAA site.
Ha! You couldn’t make it up! So they’ve put the former head of the UK’s slot monopoly body in charge of the review of the airspace and ATC monopoly - or at least the incident on the monopoly system. No chance of any independent thinking there then.

I hope readers managed to see what the CEO of Europe’s largest airline had to say about all this.
Neo380 is offline  
Old 6th Oct 2023, 10:03
  #375 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Neo380
Ha! You couldn’t make it up! So they’ve put the former head of the UK’s slot monopoly body in charge of the review of the airspace and ATC monopoly - or at least the incident on the monopoly system. No chance of any independent thinking there then.

I hope readers managed to see what the CEO of Europe’s largest airline had to say about all this.
It was always going to be one of the gang and the choice probably had more to do with who was available. It is slightly more independent than last time. The critical bit is who the two assistants are and their technical and relevant expertise as they are likely to do most of the work. It is a shame they haven't been named yet.
eglnyt is offline  
Old 17th Oct 2023, 11:56
  #376 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
The Select Committee will be looking at this on Wednesday 18th October
eglnyt is offline  
Old 20th Oct 2023, 15:24
  #377 (permalink)  
 
Join Date: Jul 2022
Location: Up Narf
Posts: 437
Received 137 Likes on 68 Posts
Originally Posted by eglnyt
The Select Committee will be looking at this on Wednesday 18th October
And the CEO of Ryanair was pulling no punches. Nor was the CEO of Loganair (who was slightly more diplomatic).

https://parliamentlive.tv/event/inde...1-da5ff4053cdb
Diff Tail Shim is offline  
Old 17th Nov 2023, 23:00
  #378 (permalink)  
 
Join Date: Jan 2008
Location: Wintermute
Posts: 76
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Peter H
I stated I was a software engineer before asking a question about the context of the error handling. I intended to indicate the nature of the answer I could handle: the s/w aspects could be reasonably complex, while the flight-plan aspects would need to be pretty simplistic.

Perhaps I was too polite, and should have said what seemed all too possible. A situation occurred that the code recognised it couldn't handle, and the programmer initiated a controlled shutdown without logging the identity of the associated flight-plan somewhere where it would be immediately seen by those responding to the shutdown. But it seemed pretty rude to suggest that until it was confirmed that it was a controlled shutdown and there was no such logging (in a place that would get checked quickly).

As a friend used to say "They wouldn't do that. Would they?".

Sadly from steamchicken's postings this seems to have been exactly what happened. Although it seems a blindingly obvious system requirement, and easy to check at a code reading. Obviously the choice of log requires coordination between the coding specs and the recovery procedure.

PS
1) Re-writing old code is of course fraught with costs and the possibility of new mistakes, but code-reading to confirm that all "fail-safe" requests are preceded by appropriate logging of the identity of the flight plan seems potentially manageable.

2) Am I right in thinking that the problem couldn't have occurred if the algorithm had operated on a data structure using some sort of unique waypoint-id (e.g. map location) rather than an unqualified waypoint-name?
Perhaps a similar "shooting yourself in the foot" to the problems described in:
U.K. NATS Systems Failure &
U.K. NATS Systems Failure.
The latter of course involves historic code development, but I think this error was in relatively recent code.
This system failed precisely as designed. Its backup system then took over and failed precisely as designed. The system then fell back to a manual process, as designed, and continued to function safely. At no time were any lives at risk, as designed. Capacity was reduced, as expected. To the ignorati: the behaviour of safety-critical systems can sometimes be beyond your very simple world view and experience. Understand that you are way, way out of your depth and continue living in ignorance. Alternatively, spend 20 years learning the hows and whys . . .
fergusd is offline  
Old 18th Nov 2023, 11:56
  #379 (permalink)  
 
Join Date: Nov 2018
Location: UK
Posts: 82
Likes: 0
Received 0 Likes on 0 Posts
Peter H raised legitimate questions, he does it well, and politely. To try to style him as ‘ignorati’ speaks much more about the respondent than the original question asker.
There are huge amounts of spin going on around this issue, in exactly the same way that occurred in 2014 when NATS was allowed to investigate itself for that critical failure - it’s almost as if former engineers have taken a ‘vow of omertà’, and will never admit a mistake!
But the spin doesn’t work.
Failover criteria for a critical safety system are not measured in the number of lives lost - no regulator would sign up to that in a million years - they are measured in the system safely failing over to a backup, or a backup to a backup, and that didn’t happen. Period.
Reverting to manual handling of aircraft at hugely reduced capacity is crisis management; safety-critical systems are never meant to work that way, i.e. to fail completely.
NATS knows that the system is ancient, it’s 1960s technology implemented in 1970 after all. They also know that it’s at capacity limits and at risk of failing over again, as evidenced by the fact it has done so repeatedly.
NATS is also forever lobbying the DfT for major funding for system upgrades (despite its internal battle on whether these are even necessary, or can be achieved). This is also less well known publicly. What is not being acknowledged at all, though, is that NATS HAS had the funding to cover this gap through the ITEC programme, which has a very large team (look online, before NATS removes the post) and HAS FAILED TO DELIVER FOR 15 YEARS!
It’s all very well keyboard warriors browbeating reasonable questions to deflect from what has really been going on at NATS, but if they are incapable of telling the whole truth themselves their opinion is really of no value.

Last edited by Neo380; 20th Nov 2023 at 10:01.
Neo380 is offline  
Old 20th Nov 2023, 09:43
  #380 (permalink)  
 
Join Date: Oct 2004
Location: Southern England
Posts: 483
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by fergusd
This system failed precisely as designed. Its backup system then took over and failed precisely as designed. The system then fell back to a manual process, as designed, and continued to function safely. At no time were any lives at risk, as designed. Capacity was reduced, as expected. To the ignorati: the behaviour of safety-critical systems can sometimes be beyond your very simple world view and experience. Understand that you are way, way out of your depth and continue living in ignorance. Alternatively, spend 20 years learning the hows and whys . . .
Whilst all efforts have to be made to ensure that what appears in front of a controller can be trusted, and sometimes that may mean bringing the system to a halt, the idea that that was an appropriate thing to do in this case because it is a "safety critical" system is, as somebody else has remarked, just "spin". If it was "designed" to do this in this specific instance then that is poor design. This system is designed to throw out flight plans for human response in many different cases without any detriment to safety, so it could, and should, have been designed to do so for any issue affecting a single flight plan even if it didn't know what was wrong with it.
eglnyt is offline  

