U.K. NATS Systems Failure
Join Date: Oct 2004
Location: Southern England
Posts: 466
Probably worth reading this:
https://jameshaydon.github.io/nats-fail/
The algorithm used to find the UK portion is, not to put too fine a point on it, dumb. It works by searching forwards through the flight plan until it finds the entry point, then skips to the end and searches backwards, which will have all kinds of exciting consequences if:
- the route leaves UK airspace and re-enters it (the leg or legs outside the airspace will be wrongly included as part of the UK portion)
- the route exits UK airspace through the same waypoint by which it entered (the leg within UK airspace will disappear)
- the exit point is not explicitly stated and a duplicate is present (what happened this time)
- probably more cases I haven't thought of
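To make the first case concrete, here is a toy model of a forward-then-backward search. It is only a sketch of the behaviour as described in this post, not the actual FPRSA logic; the waypoint names and the `UK_WAYPOINTS` set are invented for illustration.

```python
# Toy model only: names and UK membership are made up for this example.
UK_WAYPOINTS = {"UK1", "UK2", "UK3"}

def uk_portion_naive(route):
    """Slice from the first UK waypoint (scanning forwards) to the
    last UK waypoint (scanning backwards from the end)."""
    entry = next((i for i, wp in enumerate(route) if wp in UK_WAYPOINTS), None)
    if entry is None:
        return []  # route never touches the UK
    # A UK waypoint exists, so the backward scan is guaranteed to find one.
    exit_ = next(i for i in range(len(route) - 1, -1, -1)
                 if route[i] in UK_WAYPOINTS)
    return route[entry:exit_ + 1]

# Route that leaves the UK (IE1) and re-enters it.
route = ["FR1", "UK1", "UK2", "IE1", "UK3", "FR2"]
print(uk_portion_naive(route))  # ['UK1', 'UK2', 'IE1', 'UK3']
```

The foreign waypoint `IE1` ends up inside the "UK portion", which is the re-entry problem from the first bullet.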
Compare the following program: from the beginning of the plan, check each waypoint to see if it's in the UK. When the first UK waypoint is found, create a UK leg starting at that waypoint. Add subsequent waypoints to the leg if they are in the UK. When a foreign waypoint is found, end the UK leg after the preceding waypoint and add it to the list "UK legs" under the flight plan ID. Continue until you have no more waypoints, and move to the next flight plan.
You'll observe that this copes fine with loops and missing entry/exit points, although we still need to check for duplicates explicitly; a simple dupe catcher would be to flag any UK leg that contains exactly one waypoint for review, because either it's a route that passes in and out over the same point without going anywhere else in the UK (weird but I suppose just possible) or it's a duplicate.
Even if those edge cases are rare and weird, they're not impossible, and of course someone might file a malformed plan maliciously now they know how to break the system.
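The single-pass scan proposed above could be sketched roughly as follows. This is a minimal illustration under the assumption that "is this waypoint in the UK" is a simple lookup; `UK_WAYPOINTS`, `uk_legs` and `legs_needing_review` are hypothetical names, not anything from the report.

```python
# Hypothetical data: which waypoint names count as being in the UK.
UK_WAYPOINTS = {"UK1", "UK2", "UK3"}

def uk_legs(route, in_uk=lambda wp: wp in UK_WAYPOINTS):
    """Split a route into maximal runs of consecutive UK waypoints."""
    legs, current = [], []
    for wp in route:
        if in_uk(wp):
            current.append(wp)      # extend the open UK leg
        elif current:
            legs.append(current)    # a foreign waypoint closes the leg
            current = []
    if current:                     # plan ended while still inside the UK
        legs.append(current)
    return legs

def legs_needing_review(legs):
    """Flag single-waypoint legs for a human to look at."""
    return [leg for leg in legs if len(leg) == 1]

route = ["FR1", "UK1", "UK2", "IE1", "UK3", "FR2"]
legs = uk_legs(route)
print(legs)                       # [['UK1', 'UK2'], ['UK3']]
print(legs_needing_review(legs))  # [['UK3']]
```

A route that re-enters the UK simply yields two legs, and a plan with no explicit exit point is closed off when the waypoints run out.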
The author's general analysis which accompanies that extract is one of the better ones I've seen. Unlike most of those which start "As a software engineer" or "As a test manager", they have actually read the detail in the report and taken the trouble to do further research. Some of the assumptions are off, but based on what is in the public domain and without the context of domain knowledge it's a reasonable assessment.
I'm not sure about the part copied above though. The description in the original report isn't very detailed and jumps around a lot. It isn't clear on when it is using ADEXP or the raw ICAO format and makes no attempt to explain why it might need to do the latter. It also isn't clear why, if it is searching backwards, it somehow finds the first of the duplicates rather than the second if they are both in the plan. The algorithm may be dumb, and I'd hope most could improve it knowing what we know now, but I don't think you can make that judgement based on the report alone. I'd like to know whether there was a design rationale for searching backwards before I'd make any comment on the merit of doing so.
I'd be surprised if FPRSA can't cope with re-entrant flight plans, as that was a shortcoming of previous systems, so it would almost certainly have a specific requirement with an associated test to cope with that. It is quite possible, depending upon where the flight came out of Oceanic Airspace, that it could have passed out of UK airspace and back into it before it got anywhere near the exit point with the duplicate a little way beyond.
I also don't share the author's faith in fuzzing for this type of system. As it gets cheaper and easier to do you'll probably do it for most of these systems in the future, but it's a bit like infinite monkeys typing the complete works of Shakespeare. It might eventually hit on that combination that breaks the system, but how long do you give it?
Join Date: Jun 2008
Location: Cambridge UK
Posts: 184
Perhaps I was too polite, and should have said what seemed all too possible. A situation occurred that the code recognised it couldn't handle, and the programmer initiated a controlled shutdown without logging the identity of the associated flight-plan somewhere where it would be immediately seen by those responding to the shutdown. But it seemed pretty rude to suggest that until it was confirmed that it was a controlled shutdown and there was no such logging (in a place that would get checked quickly).
As a friend used to say "They wouldn't do that. Would they?".
Sadly from steamchicken's postings this seems to have been exactly what happened. Although it seems a blindingly obvious system requirement, and easy to check at a code reading. Obviously the choice of log requires coordination between the coding specs and the recovery procedure.
PS
1) Re-writing old code is of course fraught with costs and the possibility of new mistakes, but code-reading to confirm that all "fail-safe" requests are preceded by appropriate logging of the identity of the flight plan seems potentially manageable.
2) Am I right in thinking that the problem couldn't have occurred if the algorithm had operated on a data structure using some sort of unique waypoint-id rather than an unqualified waypoint-name (e.g. a map location)?
Perhaps a similar "shooting yourself in the foot" to the problems described in:
U.K. NATS Systems Failure &
U.K. NATS Systems Failure.
The latter of course involves historic code development, but I think this error was in relatively recent code.
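As an illustration of point 1, a "log the plan identity, then fail safe" pattern might look something like the sketch below. Everything here is hypothetical (the logger name, `FailSafeShutdown`, `request_fail_safe`); it says nothing about how the real system is structured.

```python
import logging

# Hypothetical logger name; the real choice of log would have to be agreed
# with whoever writes the recovery procedure.
logger = logging.getLogger("flight_plan_processor")

class FailSafeShutdown(Exception):
    """Raised to request a controlled shutdown (illustrative only)."""
    def __init__(self, plan_id, reason):
        super().__init__(f"fail-safe requested for plan {plan_id}: {reason}")
        self.plan_id = plan_id
        self.reason = reason

def request_fail_safe(plan_id, reason):
    # Log first, so the offending plan is on record before anything stops.
    logger.critical("FAIL-SAFE plan_id=%s reason=%s", plan_id, reason)
    raise FailSafeShutdown(plan_id, reason)
```

If every fail-safe goes through one function like this, the code reading reduces to checking that nothing raises the shutdown directly.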
Join Date: Oct 2004
Location: Southern England
Posts: 466
I stated I was a software engineer prior to asking a question about the context of the error handling, intending to indicate the nature of the answer I could handle: the s/w aspects could be reasonably complex, while the flight-plan aspects would need to be pretty simplistic.
We won't know until the fuller report what the system wrote where when it shut down. I would expect the logs with that info to be readily available to the people you expect to react reasonably quickly if you have a system you expect to restore quickly. We know the logs were accessed by those further down the support chain; we don't know yet why it took so long to identify the errant plan, isolate it and restore the system.
I have said earlier I would expect a system of this type to react differently in those circumstances. Isolating flight plans the system doesn't like and throwing them out for human attention is standard practice for most of the flight data systems I know. The suggestion you can't do that for safety reasons is daft. Hopefully, rather than double down on that, the next report will examine why this system didn't. If they want to double down they might need to check whether any of their other systems do. You don't need to be able to identify the issue and fix it that quickly; most systems have the ability to quickly create a minimal flight plan which can be put in the system so controllers downstream have an awareness of the aircraft without corrupt data.
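The "isolate the bad plan and carry on" approach can be sketched in a few lines. This is a generic illustration, not a description of any real flight data system: `PlanRejected`, `process_all` and the duplicate-name check in `demo_process` are all invented for the example.

```python
from dataclasses import dataclass, field

class PlanRejected(Exception):
    """Raised by the (hypothetical) processing step for a plan it can't handle."""

@dataclass
class Result:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)  # (plan_id, reason) for human attention

def process_all(plans, process_one):
    """Process each plan independently; a bad plan is set aside, not fatal."""
    result = Result()
    for plan_id, plan in plans:
        try:
            result.accepted.append((plan_id, process_one(plan)))
        except PlanRejected as exc:
            result.quarantined.append((plan_id, str(exc)))
    return result

def demo_process(plan):
    # Stand-in rule purely for the demo: reject plans with a repeated name.
    if len(set(plan)) != len(plan):
        raise PlanRejected("duplicate waypoint name")
    return plan

plans = [("A", ["X", "Y"]), ("B", ["X", "Y", "X"]), ("C", ["Z"])]
result = process_all(plans, demo_process)
print([pid for pid, _ in result.accepted])  # ['A', 'C']
print(result.quarantined)                   # [('B', 'duplicate waypoint name')]
```

The quarantined entry carries the plan identity and the reason, which is what whoever picks it up needs in order to create a minimal replacement plan by hand.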
Join Date: Jun 2008
Location: Cambridge UK
Posts: 184
The problem wouldn't have occurred with a unique waypoint data structure. Not having one is an international feature of the environment in which the system operates, and neither NATS nor the software supplier can change that. Many of those war stories in earlier posts are generally controlled within newer systems such as FPRSA but are still inherent in the ageing message structure.
Join Date: Oct 2004
Location: Southern England
Posts: 466
I fully appreciate that waypoint names aren't unique. Which is why I suggested that the software should convert the non-unique names to a unique-id of some sort in the internal data structures it uses for generating routes. Purely as an example it could use a unique-map-reference instead of the name. Obviously converting back to the non-unique name when producing the required message structure for the generated route.
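A minimal sketch of that idea, assuming the internal key is simply the published name plus its position. The `Waypoint` class, the `uid` format and the coordinates are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    name: str   # published designator, possibly shared with another point
    lat: float  # degrees, north positive
    lon: float  # degrees, east positive

    @property
    def uid(self):
        # Hypothetical internal key: name qualified by position.
        return f"{self.name}@{self.lat:+.4f},{self.lon:+.4f}"

def to_message_names(route):
    """Convert back to the published names when emitting a message."""
    return [wp.name for wp in route]

# Two different points that share one published name (invented coordinates).
a = Waypoint("ALPHA", 50.0, -1.0)
b = Waypoint("ALPHA", 10.0, 20.0)
print(a.uid)                     # ALPHA@+50.0000,-1.0000
print(a.uid == b.uid)            # False
print(to_message_names([a, b]))  # ['ALPHA', 'ALPHA']
```

Internally the two points can no longer be confused, while the outgoing message still carries the names the format requires.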
Join Date: Mar 2006
Location: Vance, Belgium
Age: 61
Posts: 242
What is particularly strange is that there IS a database with unique 5-character waypoint designators. This is the ICARD database of 5LNCs (five-letter name-codes) promoted by ICAO and Eurocontrol.
The purpose of this database and initiative is to progressively eliminate all duplicate 5LNCs, as all countries are invited to register the en-route designators documented in their AIP. Of course, no duplicate is allowed in the database. I understand that all countries participating in Eurocontrol have already completed the process.
As the message format ADEXP originates from Eurocontrol, it is strange that the paragraph RTEPTS does not contain an indicator stating for each 5LNC whether it is a unique ICARD identifier or not. The non-unique identifiers could then simply be ignored by the NATS software.
https://www.icao.int/MID/Documents/2...520SG8/IP6.pdf
https://www.icao.int/EURNAT/Document...l%25202017.pdf
Join Date: Oct 2004
Location: Southern England
Posts: 466
It might get rather more difficult to take out duplications in the 3 letter designators for Navaids. I'm not aware that anybody considered those a problem before. In the UK at least the name is chosen to avoid confusion with similar sounding beacons but once you've chosen a name the available letter combinations are usually quite limited.
Why is everybody so determined to change systems other than FPRSA and any from the same family with the same flaw? There are good reasons to try and eliminate duplicates, but this was a bug in a system used by NATS, so let's fix that rather than change something else. It is perfectly possible to resolve this issue with the information already in the flight plan; every waypoint in the plan exists in the context of other waypoints, which would allow you to do that.
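One way to use that context, sketched under the assumption that an ambiguous name should resolve to whichever candidate lies closest to the previous waypoint in the plan. That nearest-neighbour rule is only an illustrative heuristic, and the names and coordinates are invented.

```python
import math

def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

def resolve(name, previous_pos, candidates):
    """Pick the position for `name` that is closest to the previous waypoint."""
    return min(candidates[name],
               key=lambda pos: great_circle_km(previous_pos, pos))

# Hypothetical duplicate: two points both published as "ALPHA".
candidates = {"ALPHA": [(50.0, -1.0), (10.0, 20.0)]}
print(resolve("ALPHA", (51.0, 0.0), candidates))  # (50.0, -1.0)
```

A production rule would need more care (both neighbours, plausible leg lengths, the filed route as a whole), but the information needed is already in the plan.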
Join Date: Sep 2006
Location: UK - Hants
Posts: 151
I actually raised a question years ago when I worked in ATM about how aircraft FMS can tell the difference between BCN (Brecon) and BCN (Barcelona). The former a VOR the latter a DVOR.
I was informed there are some geographical interpretations made by those systems - but it was not a very satisfactory answer. It is not like they are particularly far apart in the bigger scheme of things.
The two still exist today and I assume the ATC systems can make the distinction when processing the flight plans...but perhaps not?
Join Date: May 2001
Location: England
Posts: 1,901
Within most FMSs/FMGCs, when you enter a waypoint that's ambiguous, you are presented with a page that allows you to select the correct one from a list of choices. The LAT/LONG is written next to each choice along with the distance from your present? position. Gives you a fighting chance to select the right one. Duplicate beacon identifiers are very common but as most predate modern civil aviation, I suppose it can be forgiven that no one paid attention to what they were doing. However, what's truly surprising is that somehow ICAO have allowed duplicate waypoint names to exist in the first place. As mentioned above, they now seem to be doing something about a problem that people from the god damn Flight Sim community have been talking about for 25 years (back in the day I was involved with writing FMS software for sims).
Edit: NATS failure Thoroughly explained here
Join Date: Nov 2018
Location: UK
Posts: 79
I hope readers managed to see what the CEO of Europe’s largest airline had to say about all this.
Join Date: Oct 2004
Location: Southern England
Posts: 466
Ha! You couldn’t make it up! So they’ve put the former head of the UK’s slot monopoly body in charge of the review of the airspace and ATC monopoly - or at least the incident on the monopoly system. No chance of any independent thinking there then.
And the CEO of Ryanair was pulling no punches. Nor was the CEO of Loganair (who was slightly more diplomatic).
https://parliamentlive.tv/event/inde...1-da5ff4053cdb
Join Date: Jan 2008
Location: Wintermute
Posts: 76
Join Date: Nov 2018
Location: UK
Posts: 79
Peter H raised legitimate questions, he does it well, and politely. To try to style him as ‘ignorati’ speaks much more about the respondent than the original question asker.
There are huge amounts of spin going on around this issue, in exactly the same way that occurred in 2014 when NATS was allowed to investigate itself for that critical failure - it’s almost as if former engineers have taken a ‘vow of omertà’, and will never admit a mistake!
But the spin doesn’t work.
Failover criteria for a critical safety system are not measured in the number of lives lost - no regulator would sign up to that in a million years - they are measured in the system safely failing over to a backup, or a backup to a backup, and that didn’t happen. Period.
Reverting to manual handling of aircraft at hugely reduced capacity is crisis management; safety-critical systems are never meant to need to work that way, i.e. to completely fail.
NATS knows that the system is ancient, it’s 1960s technology implemented in 1970 after all. They also know that it’s at capacity limits and at risk of failing over again, as evidenced by the fact it has done so repeatedly.
NATS is also forever lobbying the DfT for major funding for system upgrades (despite its internal battle on whether these are even necessary, or can be achieved), this is also publicly less well known. What is not being acknowledged, at all though, is that NATS HAS had the funding to cover this gap through the ITEC programme, which has a very large team (look online, before NATS removes the post) and HAS FAILED TO DELIVER FOR 15 YEARS!
It’s all very well keyboard warriors browbeating reasonable questions to deflect from what has really been going on at NATS, but if they are incapable of telling the whole truth themselves their opinion is really of no value.
Last edited by Neo380; 20th Nov 2023 at 11:01.
Join Date: Oct 2004
Location: Southern England
Posts: 466
This system failed precisely as designed; its backup system then took over and failed precisely as designed. The system then failed back to a manual process, as designed, and continued to function safely. At no time were any lives at risk, as designed. Capacity was reduced, as expected. To the ignorati: the behaviour of safety-critical systems can sometimes be beyond your very simple world view and experience. Understand that you are way, way out of your depth and continue living in ignorance. Alternatively, spend 20 years learning the hows and whys . . .