U.K. NATS Systems Failure
Join Date: Nov 2018
Location: UK
Posts: 82
Join the dots...
There are myriad issues running here, but there won't be compensation under the Transport Act because this incident is being classed as an 'exceptional situation' - but is it?
Short answer: no. It's a repeat of the 2014 incident (interim and final reports are available - they wouldn't attach for some reason), but as mentioned, like Martin Rolfe's statement, there's 'a lot of puff and very little explanation' in them. The CAA never got to the root cause of the issue. I know less about the 2009 failover, as it was before my time.
As context, describing wide-scale, safety-critical IT systems is a bit like trying to give a headline summary of War and Peace: basically you can't. But there are certain key IT principles that should be present, such as: so long as your safety-critical system is still within its capacity parameters, it should not fail over unsuccessfully (it should 'stay up', as the old IBM 9020 system did, 100%). Think about it for a moment: if the Hinkley Point nuclear power station had infrequent but repeated 'unsuccessful failovers', we would have had two, potentially three, Fukushimas by now! But note, it is the flight planning system that is failing, not the radar links or voice comms - yet. That would be a complete disaster.
Another critical IT principle is not having backups with the exact same code as the main system - again, when you think about it, this is totally obvious. If a tube train runs through a signalling junction because of a 'software glitch', you don't want the train after it, and the one after that, to go piling into the first train! And this is the core issue: the age of the Swanwick ATC system notwithstanding, it has the same code in the backup, and in the backup's backup! This is pure mismanagement, and why the incident is likely to recur.
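For the software-minded, the common-mode failure being described can be sketched in a few lines of Python. This is an illustrative toy, not NATS code: two identical replicas of a buggy parser fall over on the same poisoned input, while an independently written backup keeps the service up.

```python
def parse_v1(plan):
    """Primary parser with a hidden bug: assumes the route field is usable."""
    fields = plan.split("-")
    if not fields[1]:                # blows up on a plan with an empty route
        raise ValueError("empty route")
    return fields

def parse_v2(plan):
    """Independently written backup: validates before trusting the input."""
    fields = plan.split("-")
    if len(fields) < 2 or not fields[1]:
        return "REJECT"              # reject the one bad message, stay up
    return fields

def run_with_failover(plan, parsers):
    """Try each implementation in turn; a crash triggers failover."""
    for parse in parsers:
        try:
            return ("ok", parse(plan))
        except Exception:
            continue                 # this replica is down, try the next
    return ("system down", None)

poisoned = "FPL-"                    # malformed message with an empty route

# Same code in main, backup and backup's backup: all fall over together.
print(run_with_failover(poisoned, [parse_v1, parse_v1, parse_v1]))
# → ('system down', None)

# A diversely implemented backup survives the very same input.
print(run_with_failover(poisoned, [parse_v1, parse_v2]))
# → ('ok', 'REJECT')
```

The point is not the parsing itself but the failover topology: however many copies of parse_v1 you stack up, the same bad data takes them all down in sequence.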
Lastly, culture has a lot to answer for here. NATS' well-publicised 'just culture' is known internally as the 'nobody can be wrong culture'. Of course, if you make a mistake when in position, like falling asleep (a real incident, btw), lessons need to be learned: more sleep provisioned for, proper rest breaks, procedures for if you suddenly feel very tired, etc - that's all fine. But in encouraging people to come forward when incidents occur, the promise is 'you won't be actioned (disciplined) for what happened', and this has leaked into other areas, like IT governance, where no one can be blamed for mistakes that have been made, even in critical failover architecture. And this is a highly risky position, hence all the 'puff'.
Failsafes should absolutely work, period. Typos in FPLs should be caught, but if they are not, the system should reject them, not collapse - and critically, neither should both backups.
Join Date: Nov 2018
Location: UK
Posts: 82
From one of our most esteemed aviation colleagues, (Professor S):
'I just uploaded a postcode in the wrong format into Google Maps and all the traffic lights in London have stopped working. Ha.'
Sums it up nicely.
Not sure why you think they didn't get to the root cause of the 2014 problem: it was clearly identified. The problem this time may well be unrelated; time will tell. Just culture is not what you describe, & not limited to NATS. What you describe is a "no blame" culture, which has been out of favour for decades, for the reasons you suggest. A just culture draws a distinction between honest mistakes & errors, which occur in any environment, & non-conformance. The latter is definitely neither acceptable nor accepted.
Join Date: Aug 2003
Location: FR
Posts: 234
That Flight Plan was "checked" by at least 2 other systems before it got to where it caused the issue. There are perfectly valid flight plans which are known to cause the UK flight data processor issues and they are screened before they get there. It could be that a new one has now been added to the list.
Sorry, I imagine more than I should ... just curious!
Join Date: Oct 2004
Location: Southern England
Posts: 484
I don't think this time it was the Swanwick system, but the previous review following the 2014 incident pointed out that fully testing every state of that system would take over 100 years. There will be bugs in any complex system. You can't eliminate them with a bit more testing.
Join Date: Feb 2007
Location: Central Scotland
Posts: 61
The 9020D was special....
....when it came into service in the early 1970s.
Bought from the Americans, who were the only other users at the time, it was cutting edge, and took a lot of manpower to maintain, and a lot of power and cooling.
Decades later, it wasn't of course, but the new FDP system basically re-platformed the old system.
" Stuck flighplans were common, and " Restarts and " Flops" a weekly occurrence.
Extra functionality on top of the original code was supposed to check the plan for "legality" to prevent bad data crashing the system....and generally did. Failures went from being weekly to yearly or longer.
The NATS CEO indicated this morning that a piece of the system (which has to be the FPPS) failed because it didn’t recognize a message, which was almost certainly an FPL.
People are questioning how a “bad” FPL came to be accepted into the FPPS. It is important to recognize that an FPL has syntax (format) and semantics (meaning).
If the syntax is correct, it is a valid FPL. By far the most complex element of an FPL is the route. The other elements are just parameters that are checked for validity. For example, if aircraft type is stated as “C172” the FPPS checks this against a list of valid aircraft types in its database.
The route syntax is checked to make sure the expression follows the rules of how the route elements should be constructed. Whilst there are many different types of route element and many rules to follow, this checking is relatively straightforward. If a “bad” FPL in terms of format is recognized it will be rejected at this stage. If not, the FPL will be passed to route conversion, where the semantics are extracted.
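As a rough illustration of the two-stage format check described above, here is a hedged Python sketch. The aircraft-type list and route grammar are invented for the example; the real FPPS rules are far richer.

```python
import re

# Illustrative subset of a reference database of valid aircraft types.
VALID_AIRCRAFT_TYPES = {"C172", "A320", "B738"}

# Toy route grammar: a route is space-separated elements, each either a
# named waypoint (2-5 letters, which also covers DCT), an airway
# (1-2 letters plus digits), or a lat/long fix such as 5130N00028W.
# Purely illustrative, not the real syntax rules.
ROUTE_ELEMENT = re.compile(
    r"^(?:[A-Z]{2,5}|[A-Z]{1,2}\d{1,3}|\d{4}[NS]\d{5}[EW])$"
)

def syntax_check(aircraft_type, route):
    """Stage one: format only. Semantics come later, in route conversion."""
    if aircraft_type not in VALID_AIRCRAFT_TYPES:
        return False                          # parameter fails the lookup
    elements = route.split()
    if not elements:
        return False                          # an FPL must carry a route
    return all(ROUTE_ELEMENT.match(e) for e in elements)

print(syntax_check("C172", "LAM L9 5130N00028W DCT"))  # → True
print(syntax_check("C172", "lam??"))                   # → False (bad format)
print(syntax_check("ZZZZ9", "LAM"))                    # → False (unknown type)
```

The key point the post makes is that passing this stage only means the plan is well-formed; a syntactically valid plan can still carry meaning the downstream converter has never seen.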
This is why I can’t see how the statement “it didn’t recognize a message” could lead to a processing failure. If it didn’t recognize a message, it would be rejected at this stage – business as usual. I can only assume the message was recognized as valid and passed on.
The FPPS must now work out what the actual route is within its airspace (the semantic meaning), and this is the really difficult bit. There is an infinite number of possibilities. For example, route fixes can be expressed as lat/long coordinates which could be literally anywhere. The programme works out what it needs to do, in terms of outputting information to controllers and adjacent centres, and my guess is that this is the source of the problem that caused the FPPS to crash.
The programme came across an unusual route it had not encountered before (and had not been programmed to expect), didn’t know what to do, and a graceful recovery was not available. In other words, it encountered a bug and did something unpredictable.
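That "no graceful recovery" failure mode can be sketched too. Again a toy, assuming a hypothetical dispatch-by-element-kind converter: an element kind the programmer never anticipated raises an unhandled error, and whether that takes out one plan or the whole processor depends entirely on where the recovery boundary sits.

```python
def convert_route(elements):
    """Toy semantic conversion: dispatch each route element to a handler."""
    handlers = {
        "WPT": lambda v: ("fix", v),
        "AWY": lambda v: ("airway", v),
    }
    # An element kind missing from the table raises KeyError here - the
    # "didn't know what to do" case, with no graceful recovery of its own.
    return [handlers[kind](value) for kind, value in elements]

def process_plans(plans):
    """Recovery boundary per plan: quarantine the bad one, stay up."""
    converted, quarantined = [], []
    for plan in plans:
        try:
            converted.append(convert_route(plan))
        except Exception:
            quarantined.append(plan)
    return converted, quarantined

plans = [
    [("WPT", "LAM"), ("AWY", "L9")],
    [("LATLON", "5130N00028W")],     # syntactically fine, unanticipated kind
]
converted, quarantined = process_plans(plans)
print(len(converted), len(quarantined))  # → 1 1
```

Without the try/except around each plan, the second entry would propagate its exception upward and take the whole loop - the whole processor - down with it.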
Just my guess.
Last edited by CBSITCB; 30th Aug 2023 at 11:30. Reason: Typo.
Join Date: Feb 2023
Location: UK
Posts: 2
Nobody else think this is a massive over-reaction by all involved? One outage every nine years? That's not bad, is it?
It didn't even really fail. The service they provide was significantly slower for a few hours while they went manual. Is it NATS' fault that R***air run their operation with no margin on crew hours? Or that Karen Pax would rather screech on Twitter than find a hotel for an extra night while it all blows over?
"How did we ever win the war?" etc.
Join Date: Jun 2006
Location: Luton
Posts: 12
My wife told me that on Jeremy Vine's show at lunchtime it was stated that the faulty flight plan should have been submitted as a PDF document but wasn't. Surely not!? Someone please tell me that the system does not rely on machine-reading of a format specifically designed to be presented to, and seen by, a human the same way regardless of platform. I initially thought the Daily Mail's report, that a faulty plan submitted by a French carrier was the problem, was rubbish, and pooh-poohed it because it was the Mail. It seems now they were probably right. These days, it seems, even the incredible is real.
Join Date: May 2001
Location: England
Posts: 1,904
Whatever's going on, we have evidence that there are no independent backup systems at NATS. Whatever processes they have go through a single point of failure. That can't be news to the developers and managers at NATS. They will have known about it.
Join Date: Oct 2004
Location: Southern England
Posts: 484
If I have the world's most sophisticated system & it cost me, for the sake of argument, £500 million, a truly independent backup would more than double that cost, even if you could get 2 capable of handling all the traffic. So I go to the two very vocal Irishmen & tell them I want to double the bit of their fees that covers systems & interest on capital investment, to avoid a disruption every 10 years or so, but I can't guarantee that there will never be any disruption even if I spend that money. What do you think their response would be?
Pegase Driver
That said, today if you have a very stable, modern, advanced system built by any of the well-known established manufacturers, with good preventive maintenance and well-paid technical staff doing the upgrades when necessary, your backup system is likely to be used only a few minutes a year. Also, we do not have backup systems only to maintain capacity; they are primarily there to maintain the same level of safety. In ATC we are in the safety business, even if our vocal Irishmen claim we are there to provide endless capacity and no delays.
In this incident safety was not compromised (at least as far as I heard); it caused delays, diversions and cancellations, but that is a capacity consequence, not a safety one.
What caused the system to crash is not really the important issue - nice to know, to learn from and to prevent it happening again - but why it took so long to restart the primary system is the question I would like to ask if I had the chance.
An independent backup will help in the event of hardware or power failure, but if both systems are using identical software they will react in the same way to the same input. If one crashes because of bad data, so will any backup.
Join Date: Oct 2004
Location: Southern England
Posts: 484
There is a big difference between a functional backup & a truly independent backup. The latter implies different software from the main. I don't know any ANSP that has a truly independent backup capable of handling the same traffic as the main. If you just have a second copy of the main then when you feed it the same data it will do exactly the same.
Last edited by Ninthace; 30th Aug 2023 at 17:56.
Join Date: Aug 2015
Location: In the mist
Posts: 16
Some systems will run with the primary on version n and the backup on version n-1 so that the backup won’t be affected by a newly introduced bug.
That falls down of course if an undetected bug was in version n-15.
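A minimal sketch of that staggered scheme, with invented version labels: promoting release n+1 to the primary leaves the backup on n, so a bug new in n+1 cannot hit both at once - but, as noted, a bug latent since n-15 is present in both regardless.

```python
def rollout(state, new_version):
    """Promote the new release to primary; the old primary becomes backup.

    The backup therefore always lags one release behind, so a freshly
    introduced bug cannot be running on both sides at the same time.
    """
    return {"primary": new_version, "backup": state["primary"]}

state = {"primary": "v16", "backup": "v15"}
state = rollout(state, "v17")
print(state)                                 # → {'primary': 'v17', 'backup': 'v16'}
assert state["primary"] != state["backup"]   # never the same release twice

# The limitation: a bug that shipped back in v2 lives on in v16 AND v17,
# so version staggering alone is no defence against it.
```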
Misty.