PPRuNe Forums - View Single Post - U.K. NATS Systems Failure
Old 2nd Sep 2023, 09:12
  #200
eglnyt
 
Originally Posted by Gupeg
I think this is rather a favourable way of looking at it

To me the real cause of the failure was introducing new software, onto both SFS Servers, that had not been adequately tested (or rather the testing had not been adequately specified). The inadequacy of that testing was shown, whether by "Watching Mode" being needed / selected, it took only one day for the "new" software to then bring down UK ATC for a period
We continue to discuss an earlier failure on a system that almost certainly wasn't the one involved in this case, although of course we don't currently know which system was.

It wasn't new software. It was the original software, and it had been there for years. The change introduced was to start using it nearer the limits of the system, of which there are two: 151 civil positions and 193 overall. The verification and acceptance of those limits happened years before. To use the poor analogy previously introduced, it is akin to buying a 5 seat car, only using 4 seats for several years, and one day needing to use all 5. In my case, discovering that if ISOFIX is in use on 2 of the seats it is actually a 4.5 seat car, not 5.

Should they have tested up to 193 when the software was written? The review report discusses the impracticality of that. At that time the only place it could be done was on the actual system before it was handed over to the customer, and even then it was unlikely that the system would have had sufficient serviceable resources available at the same time to do that test. Once it enters service you no longer have the production system available for test. The test system for NERC includes a complete representation of all the servers and most of the external inputs but can't replicate the entire set of workstations. To do so would require another room the same size and a lot more hardware, with the cost, energy and cooling requirement that brings. In modern times we might use virtualisation to address that, but this is a system developed long before that was an option. And a simple test up to 193 would not have uncovered the issue. You would need to invoke watching mode when more than 151 were in use; any other mode added above 151 would not have triggered it. If your aim was to fully stress the system, it is likely that you would have invoked the more demanding modes to do that.

Should they have spotted the error in code review? This is a bad case for humans. There are two limits in use. I'd probably spot a completely incorrect limit, but I'd be far less likely to spot that the wrong one of the two was being used.
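To make the last two points concrete, here is a minimal sketch of that kind of slip. Only the two figures, 151 and 193, come from the discussion above. The function names, the mode names and the structure are all invented for illustration and are not the real code.

```python
# Hypothetical sketch: every name here is illustrative, not from the real system.
MAX_CIVIL_POSITIONS = 151   # civil position limit quoted above
MAX_TOTAL_POSITIONS = 193   # overall limit quoted above


class CapacityError(Exception):
    """Raised when another position cannot be added."""


def add_position(in_use: int, mode: str) -> int:
    """Return the number of positions in use after adding one more."""
    if mode == "watching":
        # The slip: a valid, long-accepted limit, but the wrong one for this path.
        limit = MAX_CIVIL_POSITIONS
    else:
        limit = MAX_TOTAL_POSITIONS
    if in_use >= limit:
        raise CapacityError(f"{mode}: {in_use} in use, limit {limit}")
    return in_use + 1


def ramp(target: int, mode: str) -> int:
    """Add positions one at a time in a single mode until target is reached."""
    in_use = 0
    while in_use < target:
        in_use = add_position(in_use, mode)
    return in_use


# A straightforward load test in a demanding mode reaches 193 without complaint.
ramp(193, "controlling")

# Only watching mode, selected with more than 151 already in use, fails.
try:
    add_position(152, "watching")
except CapacityError as err:
    print("failed:", err)
```

Both constants are legitimate values that appear elsewhere in the same code, so nothing on the offending line looks wrong in isolation, and the ramp to 193 passes because it never takes the watching path.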

Should SFS have 2 completely different sets of software so an error would only affect one? Ideally yes, but as I've said before that is also impractical. The supplier struggled to produce one set of software in the timescale and cost originally estimated. Even if you doubled your estimate, producing two would, in the end, cost considerably more than twice as much, if you ever managed to deliver at all.
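A rough sketch of why identical software on both servers doesn't help with this class of fault, and what a second, independently written version would buy you. Everything here is hypothetical: the function names, the "edge-case" input and the failover loop are assumptions for illustration, not a description of SFS.

```python
# Hypothetical sketch: illustrative names only, not a model of the real servers.
from typing import Callable, Optional


def version_a(message: str) -> str:
    """Original implementation with a latent defect on one particular input."""
    if message == "edge-case":
        raise RuntimeError("unhandled input")
    return message.upper()


def version_b(message: str) -> str:
    """Independently written implementation that happens to handle that input."""
    return message.upper()


def run_with_failover(
    message: str,
    primary: Callable[[str], str],
    standby: Callable[[str], str],
) -> Optional[str]:
    """Try the primary, then the standby; return None if both fail."""
    for server in (primary, standby):
        try:
            return server(message)
        except RuntimeError:
            continue  # this server is down, fall through to the next
    return None


# Identical pair: the same input takes out both servers.
print(run_with_failover("edge-case", version_a, version_a))

# Diverse pair: the standby survives the input that stopped the primary.
print(run_with_failover("edge-case", version_a, version_b))
```

An identical standby still covers hardware faults, which is what it is there for. The diverse pair only helps if the second version really is independent, and that is where the cost goes.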

Business criticality is a different matter from safety criticality. For all systems in the flight data thread you can make an adequate safety case with redundancy from an identical system, provided you have a means of ensuring that, at all times from the inception of a failure, you can safely handle the level of traffic that might be present. On Monday the level of traffic at failure was safely handled, and the reduction of traffic as data degraded ensured that continued to be the case.

If your safety case is made, then business criticality becomes purely a matter of cost benefit.

Last edited by eglnyt; 2nd Sep 2023 at 09:32. Reason: Grammar