U.K. NATS Systems Failure
Join Date: Oct 2004
Location: Southern England
Posts: 479
Likes: 0
Received 0 Likes
on
0 Posts
Well, "it" (the software) did not decide (that would be AI); the designers / specifiers did.
Likely a question for eglynt: the FPRSA-R seems to be planned on falling over, since there is a manual input system. Can the two work alongside each other? i.e. could the FPRSA-R throw up an error message (rather than a full exception) saying "I cannot deal with Flt Plan XYZ, meanwhile I will carry on"? Then an ATCO can get Flt Plan XYZ and manually input it.
Martin Rolfe seems keen to say this has never happened before in 15 million flights. Someone inclined to believe him might understand that to mean it has flawlessly processed every Flt Plan since 2018 without a single hiccup. From this thread, it seems it has always been temperamental; it is just that the 4-hour buffer has, to date, proved adequate to solve the issue.
(p9) looks forward through flight to identify UK entry point. Then goes to end of whole route and works back to find UK exit point, which it could not find because it was not (no need for it to be) specified. It then looked for points near to UK airspace to try and work out an exit - but seems it found the duplicate name point and picked the wrong one, and now "the software could not extract a valid UK portion of flight plan between these two points" at which point it threw a wobbly.
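As a rough illustration of the search described above (a minimal sketch, not the FPRSA-R logic; the fix names, the `uk_fixes` set and both functions are made up for the example), a lookup that resolves the exit point by name alone can land on the wrong copy of a duplicated name, leaving an exit that sits before the entry:

```python
class NoValidUKPortion(Exception):
    """Raised when no entry/exit pair brackets a UK segment."""


def uk_entry_index(route, uk_fixes):
    # Look forward through the route for the first fix inside UK airspace.
    for i, name in enumerate(route):
        if name in uk_fixes:
            return i
    raise NoValidUKPortion("no UK entry point on route")


def extract_by_name_only(route, uk_fixes, exit_name):
    entry = uk_entry_index(route, uk_fixes)
    # list.index returns the FIRST fix with this name, wherever it is on the route.
    exit_ = route.index(exit_name)
    if exit_ <= entry:
        raise NoValidUKPortion(f"exit {exit_name!r} resolves to a point before UK entry")
    return route[entry:exit_ + 1]


def extract_searching_backwards(route, uk_fixes, exit_name):
    entry = uk_entry_index(route, uk_fixes)
    # Walk back from the destination and stop at the first match after the entry.
    for i in range(len(route) - 1, entry, -1):
        if route[i] == exit_name:
            return route[entry:i + 1]
    raise NoValidUKPortion(f"no {exit_name!r} after UK entry")


# Hypothetical route: "DUP" names two different points, one near each end.
route = ["DEP", "DUP", "OCEAN", "UKIN", "UKMID", "DUP", "ARR"]
uk_fixes = {"UKIN", "UKMID"}
```

With this made-up route, `extract_by_name_only(route, uk_fixes, "DUP")` raises because the name resolves to the copy near the departure end, while `extract_searching_backwards` finds the copy that follows the UK entry.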
I am not entirely convinced by "However, since flight data is safety critical information that is passed to ATCOs the system must be sure it is correct and could not do so in this case. It therefore stopped operating, avoiding any opportunity for incorrect data being passed to a controller. The change to the software will now remove the need for a critical exception to be raised in these specific circumstances." - since the software correctly identified the dodgy Flt Plan, and could just not have passed it on, letting someone manually do it.
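The alternative being argued for here, reject the one doubtful plan and keep going, might look something like this. It is only a sketch under my own assumptions: `PlanRejected`, `convert` and `process_all` are hypothetical names, and the duplicate-name check is a stand-in for whatever validation the real system performs.

```python
from collections import deque


class PlanRejected(Exception):
    """Hypothetical error type: this one plan cannot be converted safely."""


def convert(plan_id, route):
    # Stand-in check: refuse any route that names the same fix twice.
    if len(set(route)) != len(route):
        raise PlanRejected(f"{plan_id}: duplicate fix name in route")
    return {"id": plan_id, "route": list(route)}


def process_all(plans):
    passed_on = []          # plans sent downstream automatically
    manual_queue = deque()  # plans held back for a person to enter by hand
    for plan_id, route in plans:
        try:
            passed_on.append(convert(plan_id, route))
        except PlanRejected as exc:
            # Nothing from the doubtful plan is passed on; the rest carry on.
            manual_queue.append((plan_id, str(exc)))
    return passed_on, manual_queue


plans = [
    ("XYZ1", ["AAA", "BBB", "CCC"]),
    ("XYZ2", ["AAA", "DUP", "BBB", "DUP"]),
    ("XYZ3", ["CCC", "DDD"]),
]
passed_on, manual_queue = process_all(plans)
```

Only the expected rejection is caught; any other exception would still stop the loop, which corresponds to the case where the system cannot tell which plan is at fault.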
My guess is this now becomes the work of spin doctors who can only state "safety - worked as designed", and it will require leaks from inside NATS as to how "wonderful" FPRSA-R really was, and whether MR is speaking the truth in that it had never got confused or stopped since 2018. If it was temperamental, requiring manual interventions, then how these were reported, investigated and solved would be interesting...
To be fair to all involved, as far as I know FPRSA has not been temperamental in either its current or pre-2018 forms. NAS has, but not for a very long time; ironically, lots of similar issues are controlled for NAS purely because it is so old and they occurred previously. I would be surprised if all the "issues" associated with NAS in the past did not form the basis of the requirements and subsequent tests placed on FPRSA in 2018.
Join Date: Jun 2008
Location: Cambridge UK
Posts: 192
FPRSA stopped, it seems to have been designed to do so when it found something like this. That meant the UK effectively stopped receiving IFR Flight Plans, all of them regardless of the planned routing within the UK. No problem initially as the downstream systems have a buffer. As that buffer came close to depletion and with no fix in sight flow was imposed to limit the number of incoming flights to the UK to that number which could be manually created beyond FPRSA.
I'm finding it hard to understand why no attempt was made to continue with the offending flight-plan handled by the manual system. For argument's sake, say after a cold restart of the system in case there had been any corruption.
Presumably some sort of fix or all-clear was eventually issued (I assume "continue but handle the offending flight-plan by hand"). I'm having difficulty imagining what type of issue(s) would require 4 hours to find or exclude.
PS I'm in some of the outer circles of confusion shown in the latest New Scientist cartoon.
Join Date: Oct 2004
Location: Southern England
Posts: 479
Non-aviation lurker & retired software engineer.
I'm finding it hard to understand why no attempt was made to continue with the offending flight-plan handled by the manual system. For argument's sake, say after a cold restart of the system in case there had been any corruption.
Presumably some sort of fix or all-clear was eventually issued (I assume "continue but handle the offending flight-plan by hand"). I'm having difficulty imagining what type of issue(s) would require 4 hours to find or exclude.
PS I'm in some of the outer circles of confusion shown in the latest New Scientist cartoon.
If you can't isolate that data then it's going to fail every time. If you take all the data out then your system is equally ineffective.
"I'm finding it hard to understand why no attempt was made to continue with the offending flight-plan handled by the manual system."
Possibly that the errant flight plan could not be identified - if it could then why didn't the software throw it back.
Some very odd numbers / usage in the public domain.
"1 in 15m flight plans over five years" could be 1 in 3m per year; either value is meaningless because they would be forecasts - technical guess, objectives - which were not met.
interesting views in https://snafucatchers.github.io/ #111
Join Date: Oct 2004
Location: Southern England
Posts: 479
It may be a deliberate omission to avoid the "blame the French" type headlines we saw last week.
The expectation would seem to have been that they would do exactly that: identify the errant data, isolate it and move on. They seem to have struggled to identify the errant data and possibly then struggled to remove it. Exactly why isn't really covered in the prelim report; hopefully it will come later.
If you can't isolate that data then it's going to fail every time. If you take all the data out then your system is equally ineffective.
Telnet randomly lose option
...which addressed the following:
Several hosts appear to provide random lossage, such as system crashes, lost data, incorrectly functioning programs, etc., as part of their services. These services are often undocumented and are in general quite confusing to the novice user.
A general means is needed to allow the user to disable these features.
Join Date: Mar 2008
Location: London
Age: 69
Posts: 148
I see that Simon Calder speculates on the identity of the flight with the flight plan that triggered the failure:
"The report does not reveal the airline or route, merely saying: “The flight was planned to depart at around 4am [BST] on 28 August, and arrive at around 3pm.”
The service that most closely matches these timings, and passed over UK airspace, is Air France flight AF85 from San Francisco to Paris CDG. It is scheduled to depart daily at 4am British time and arrive in the French capital at 2.50pm, BST. This is speculation by The Independent and has not been confirmed."
Is that flight plan accessible to determine the duplicated waypoints?
Join Date: Mar 2008
Location: London
Age: 69
Posts: 148
FlightAware gives https://www.flightaware.com/live/fli...310Z/KSFO/LFPG
And there appears to be the possibility of duplicated MORAG waypoints (at least according to the OpenNav database)
Join Date: Oct 2004
Location: Southern England
Posts: 479
It may be that flight but only if it had never filed that route recently. Dave might have info on how long it has been operating and where it normally goes.
AFR085 has been operating at least as far back as 2011. Routeing varies all the way from overflying the whole of the UK from the Hebrides south, to not overflying at all, dependent obviously on the day's NAT tracks.
So doesn't really help, I'm afraid.
The airline sent the plan to IFPS which checked it was in the correct format (which it was) and accepted it. IFPS passed it to Swanwick at the appropriate time. There was an anomaly in the route (duplicate fixes) which by NATS' admission the FPRSA-R program logic couldn't handle. NATS says this "led to a 'critical exception' whereby both the primary system and its backup entered a fail-safe mode". Personally I find this hard to believe. If the FPRSA-R was still in control of itself surely it would say "Look chaps - this route looks rather weird. There appears to be a duplicate fix. I'll ignore it for now until you guys figure it out. In the meantime I'll carry on processing all the other flight plans". IMHO 'critical exception' and 'fail-safe mode' are spin for "it crashed".
Last edited by CBSITCB; 6th Sep 2023 at 22:40. Reason: Typo
Join Date: Oct 2004
Location: Southern England
Posts: 479
The airline sent the plan to IFPS which checked it was in the correct format (which it was) and accepted it. IFPS passed it to Swanwick at the appropriate time. There was an anomaly in the route (duplicate fixes) which by NATS' admission the FPRSA-R program logic couldn't handle. NATS says this "led to a ‘critical exception’ whereby both the primary system and its backup entered a fail-safe mode". Personally I find this hard to believe. If the FPRSA-R was still in control of itself surely it would say "Look chaps - this route looks rather weird. There appears to be a duplicate fix. I'll ignore it for now until you guys figure it out. In the meantime I'll carry on processing all the other flight plans". IMHO 'critical exception' and 'fail-safe mode' are spin for "it crashed".
Only the design authority can explain why it handled things the way it did. The nuclear option is appropriate for some errors but rather overkill in this case.
As you say, describing a crash of your system as fail-safe is rather desperate spin.
Ryanair’s Michael O’Leary said [https://corporate.ryanair.com/news/ryanair-rejects-nats-whitewash-report/?market=en&fbclid=IwAR3prT8auG3wrZGt_sHH1fFvA0lQJAyJWUKm1HLp ioYJBwVTP5bez-46-xU]:
“Finally, we do not accept NATS claim that it is “not within remit” to provide cost reimbursement to customers. Ryanair pays NATS almost €100m p.a. for an ATC service that is repeatedly short staffed and on the 28thAug, collapsed altogether. The least NATS could and should do is to reimburse its airline customers for the tens of millions of pounds they have spent reimbursing passengers for their hotel, meals and transport expenses, which were entirely due to NATS system failure, and NATS backup system failure on Mon 28th Aug last.
If NATS fail to reimburse its customers for these expenses, then Secretary of Transport Mark Harper should intervene (as the largest shareholder in NATS), and instruct NATS to reimburse NATS airline customers for these right to care expenses.
This Report, which is full of false figures about flight cancellations and delays, and avoids any explanation of why NATS backup system failed so spectacularly will not solve this problem unless NATS accepts responsibility for its incompetence and reimburses airlines and passengers for the avoidable right to care expenses they suffered due to NATS failure on Mon 28th Aug last.”
I bet the contract and legal people at Frequentis are keeping their heads well down...
He has a very strong argument in my opinion, however all that will happen is the taxpayer footing the bill eventually. The Treasury paying out £10Ms to a company already making several billion in profit isn’t going to look great optically.
I suspect the answer to this question lies in the revenue raised by taxation on these activities, by analogy with road fund licence and fuel duty BUT a) none of these revenue streams are hypothecated for the spend and b) the sorts of sums being claimed in compensation are beyond anything already budgeted for and so WOULD fall on the public purse.
Join Date: Oct 2004
Location: Southern England
Posts: 479
Currently the "system" places the obligation to look after customers onto the airline, that obligation is limited in cases such as this one. Airlines know they have that obligation and have the opportunity to price that risk and factor it into the ticket price. Whether they do is entirely a business decision on their part, ATC failures are not unheard of so they can hardly claim them to be a total surprise when they happen.
In the UK at least the ANSP doesn't have that obligation and hasn't had the opportunity to price that risk and factor it into the fees.
I don't think it is right to change the rules of the game half way through even if you think the rules are wrong. If you want to change the rules you have to allow the parties to change their game plan based on the new rules. That would mean a discussion on what ATC fees should be to allow the ANSP to price in that risk.
Discuss
Join Date: Nov 2018
Location: UK
Posts: 82
Does he have a strong argument?
Currently the "system" places the obligation to look after customers onto the airline, that obligation is limited in cases such as this one. Airlines know they have that obligation and have the opportunity to price that risk and factor it into the ticket price. Whether they do is entirely a business decision on their part, ATC failures are not unheard of so they can hardly claim them to be a total surprise when they happen.
In the UK at least the ANSP doesn't have that obligation and hasn't had the opportunity to price that risk and factor it into the fees.
I don't think it is right to change the rules of the game half way through even if you think the rules are wrong. If you want to change the rules you have to allow the parties to change their game plan based on the new rules. That would mean a discussion on what ATC fees should be to allow the ANSP to price in that risk.
Discuss