PPRuNe Forums - View Single Post - U.K. NATS Systems Failure
Old 1st Sep 2023, 15:25
Gupeg
 
Join Date: May 2016
Location: UK
There was a little bit more to it than that. The other issue at play was that the controller had made a mode error in selecting a soft key that put them in "Watching Mode" (a rare and obsolete mode), and only then did the comparison 153 < 151 (in a different code path) fail. It was the combination of errors, both in software and by the operator, that on their own were inconsequential but when combined became significant.
I think this is rather a favourable way of looking at it.

To me the real cause of the failure was introducing new software, onto both SFS Servers, that had not been adequately tested (or rather, the testing had not been adequately specified). The inadequacy of that testing was shown by the fact that, whether "Watching Mode" was needed or selected, it took only one day for the "new" software to bring down UK ATC for a period.

The report refers to "needles" and "haystacks" and how hard it is to find errors, including latent errors (as here from maybe 20 years earlier). However, the upgrade is described as being specifically to "add military controller roles". Therefore, to me, in addition to whatever normal test functions an upgrade requires, specific testing should have been specified to "stress test" the number of workstations. The testing should be intended to verify not only the upgrade changes but the whole system, to expose (as here) related latent errors that had been "got away with" to date - especially since it was a "one type" system (civil) that had been transferred and adapted into a "two type" system (civil and military).
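As a rough illustration of the kind of stress test meant here, the sketch below ramps a workstation count upwards until a latent hard-coded limit trips. Everything in it is hypothetical (the function names, the exception, and the way the limit is checked); only the figures 151 and 153 come from the discussion above, and it is not a description of the actual SFS code.

```python
# Toy sketch only: all names are hypothetical, not taken from the real system.
# The figures 151 and 153 are the only values drawn from the post.
HARD_CODED_LIMIT = 151  # assumed latent limit baked into one code path


class CapacityExceeded(Exception):
    """Raised when the requested count is above the limit."""


def register_workstations(count, limit=HARD_CODED_LIMIT):
    """Stand-in for a server code path that rejects counts above a limit."""
    if count > limit:
        raise CapacityExceeded(f"{count} exceeds limit {limit}")
    return count


def find_first_failure(max_count, limit=HARD_CODED_LIMIT):
    """Ramp the count from 1 to max_count; return the first count that fails."""
    for count in range(1, max_count + 1):
        try:
            register_workstations(count, limit)
        except CapacityExceeded:
            return count
    return None  # no failure seen within the tested range
```

Under these assumptions, a test that only ramps to 151 reports nothing, while one that ramps to 153 finds the failure at 152. The point is simply that the ramp has to go past the largest configuration the upgrade makes possible, not just the largest one seen in service so far.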

The bigger question is whether the upgrade should have been debugged on a live system or on a test system. NATS of course will keep banging the safety drum, which might be accurate but is irrelevant. The issue is whether the airline industry, travelling public and Govt find it acceptable for the system to grind to a halt every 10 years or so while latent errors are worked out. If it is not acceptable, then a different (and no doubt more costly) approach is required... We'll see later whether the report on Monday's issue has any parallels.