PPRuNe Forums - View Single Post - U.K. NATS Systems Failure
30th Aug 2023, 09:08
Neo380
 
Join Date: Nov 2018
Location: UK
Posts: 82
Join the dots...

There are myriad issues in play here. There won't be compensation under the Transport Act because this incident is being classed as an 'exceptional situation', but is it?

Short answer: no. It's a repeat of the 2014 incident (interim and final reports are available - they wouldn't attach for some reason), but as mentioned, like Martin Rolfe's statement, there's 'a lot of puff, and very little explanation' in them. The CAA never got to the root cause of the issue. I know less about the 2009 failover, as it was before my time.

As context, describing large-scale, safety-critical IT systems is a bit like trying to give a headline summary of War and Peace: basically, you can't. But there are certain key IT principles that should be present. One is that, so long as your safety-critical system is still within its capacity parameters, it should never fail over unsuccessfully (it should 'stay up', as the old IBM 9020 system did, 100%). Think about it for a moment: if the Hinkley Point nuclear power station had infrequent but repeated 'unsuccessful failovers', we would have had two, potentially three, Fukushimas by now! But note, it is the flight planning system that is failing, not the radar links or voice comms, yet - that would be a complete disaster.

Another critical IT principle is not having backups that run the exact same code as the main system - again, when you think about it, this is totally obvious. If a tube train continues through a signalling junction because of a 'software glitch', you don't want the train after it, and the one after that, to go piling into the first train! And this is the core issue: the age of the Swanwick ATC system notwithstanding, it has the same code in the backup, and in the backup's backup! Any input that brings down the main system will bring down both backups in exactly the same way. This is pure mismanagement, and it is why the incident is likely to recur.
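
For anyone who wants the same-code point spelled out, here is a toy Python sketch. To be clear, this is purely illustrative: the parsers, the message format and the 'bug' are all made up by me for the example and have nothing to do with the real Swanwick code, which I am not describing here. It just shows that a backup running identical code hits the identical defect, while an independently written backup at least has a chance.

```python
# Hypothetical illustration only: every name here is invented for the example.

def parse_v1(message: str) -> list[str]:
    """Toy 'main' parser with a latent defect: it blows up on an unexpected character."""
    tokens = message.split()
    for token in tokens:
        if not token.isalnum():
            raise RuntimeError(f"unhandled token {token!r}")  # the latent defect
    return tokens


def parse_v2(message: str) -> list[str]:
    """Independently written toy parser: it drops tokens it cannot handle."""
    return [token for token in message.split() if token.isalnum()]


def run_with_failover(message, channels):
    """Try each channel in order; return (name, result), or (None, None) if all fail."""
    for name, parser in channels:
        try:
            return name, parser(message)
        except RuntimeError:
            continue  # this channel is down, fail over to the next one
    return None, None


bad_message = "AAAA WP#1 BBBB"  # made-up message containing a 'typo'

# Same code in the main system, the backup, and the backup's backup.
identical = [("main", parse_v1), ("backup", parse_v1), ("backup2", parse_v1)]
# Same main system, but a backup written independently.
diverse = [("main", parse_v1), ("backup", parse_v2)]

same_code_result = run_with_failover(bad_message, identical)
diverse_result = run_with_failover(bad_message, diverse)

print(same_code_result)
print(diverse_result)
```

With identical code in all three channels, every failover fails for the same reason, so the triple redundancy buys nothing against a software defect. Whether dropping a bad token is the right behaviour for the diverse channel is a separate design question; the point is only that it does not share the main channel's failure.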

Lastly, culture has a lot to answer for here. NATS's well-publicised 'just culture' is known internally as the 'nobody can be wrong culture'. Of course, if you make a mistake when in position, like falling asleep (a real incident btw), lessons need to be learned: more sleep provisioned for, proper rest breaks, procedures for if you suddenly feel very tired etc - that's all fine. But in encouraging people to come forward when incidents occur, the promise is 'you won't face disciplinary action for what happened', and this has leaked into other areas, like IT governance, where no one can be blamed for mistakes that have been made, even in critical failover architecture. And this is a highly risky position, hence all the 'puff'.

Failsafes should absolutely work, period. Typos in FPLs should be caught, but if they are not, the system should reject them, not collapse. And critically, neither should both backups!
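
A minimal Python sketch of what I mean by 'reject, not collapse'. Again, this is hypothetical: the `PlanProcessor` class, the validation rule and the message strings are invented for illustration and are not how any real flight planning system works. The idea is simply that a bad message gets set aside with a reason, and the next message is processed as normal.

```python
from dataclasses import dataclass, field


@dataclass
class PlanProcessor:
    """Hypothetical processor: bad plans are quarantined, good ones carry on."""
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def validate(self, plan: str) -> list[str]:
        # Assumed toy rule: a plan is a non-empty run of alphanumeric tokens.
        tokens = plan.split()
        if not tokens:
            raise ValueError("empty plan")
        for token in tokens:
            if not token.isalnum():
                raise ValueError(f"bad token {token!r}")
        return tokens

    def submit(self, plan: str) -> bool:
        """Return True if accepted; on a validation error, record it and return False."""
        try:
            tokens = self.validate(plan)
        except ValueError as err:
            # Reject this one message and keep running.
            self.rejected.append((plan, str(err)))
            return False
        self.accepted.append(tokens)
        return True


processor = PlanProcessor()
plans = ["AAAA WPT1 BBBB", "AAAA WP#1 BBBB", "CCCC WPT2 DDDD"]  # made-up messages
results = [processor.submit(plan) for plan in plans]

print(results)
print(processor.rejected)
```

The second message is rejected and logged with its reason, and the third is still processed afterwards. A rejected plan can then be handled by a person rather than taking the whole system down with it.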