PPRuNe Forums - View Single Post - U.K. NATS Systems Failure
1st Sep 2023, 09:36
#188
golfbananajam
 
Originally Posted by Neo380
That's really missing the point, as has been said a number of times.

This isn't an 'infinitesimal circumstance' that could never be tested for (unless all human inputs have become 100% reliable, which they are not and never can be).

This is all about how a failover works, or, to be more precise, doesn't work. The system should have switched to an iteration of the software that wasn't identical, so it wasn't bound to immediately fail again. If it does, the system has no fail-safe and is bound to collapse catastrophically.

That is the issue being skirted around, and it was the core fault of the 2014 failure: very bad IT architecture.

Describing 'infinitesimal circumstances' and the '100 years of testing that couldn't identify them' has nothing to do with building failovers that fail, and then fail over and fail again, through poor design.
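
As an illustration of the failover point above, here is a hypothetical Python sketch (invented names and inputs, not NATS's actual design): if the standby runs identical code, the input that crashed the primary crashes the standby too, whereas a deliberately different implementation at least has a chance of carrying on.

Code:
def primary_handler(message: str) -> str:
    """Primary implementation with a latent bug on one rare input."""
    if message == "POISON":                  # hypothetical unanticipated input
        raise RuntimeError("unhandled edge case")
    return f"processed:{message}"

identical_standby = primary_handler          # same code, so the same bug

def diverse_standby(message: str) -> str:
    """Independently written fallback: different code, different failure modes."""
    return f"processed-by-fallback:{message}"

def process(message: str, standby) -> str:
    try:
        return primary_handler(message)
    except RuntimeError:
        # With identical_standby this raises the very same error again and the
        # whole chain collapses; with diverse_standby it degrades gracefully.
        return standby(message)

print(process("POISON", diverse_standby))    # survives on the fallback path
# process("POISON", identical_standby)       # would fail in exactly the same way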

Please see my post #74. Software testing is costly in terms of both time and resources. For complex systems it is impossible to test every combination of input data to see what fails. Automated testing is also not the panacea that many think it is. To get a test script automated, you first run it manually to make sure the element of software under test works. Once you have a test that passes, you run it again, this time using the auto test suite to record the steps you take. Once you've done that, you then run a confirmatory test. So for every element of the requirement you end up with at least three runs of a single test script (which can have many stages).
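
To put a rough number on "impossible to test every combination", here is a back-of-the-envelope Python sketch; the field names and counts are invented purely for scale.

Code:
import math

# Invented field counts, purely for illustration; real systems have far more inputs.
field_options = {
    "aircraft_type": 300,
    "departure_aerodrome": 1500,
    "destination_aerodrome": 1500,
    "route_waypoint_strings": 10_000,
    "requested_flight_level": 60,
}

combinations = math.prod(field_options.values())
print(f"{combinations:,} input combinations")          # 405,000,000,000,000

# Even at 1,000 automated tests per second that is roughly 12,800 years.
years = combinations / 1000 / (3600 * 24 * 365)
print(f"~{years:,.0f} years of continuous test execution")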

Then the developer has an update to the code, your automated test fails, and you start all over again.

The problem with old and complex systems is that updates and improvements are usually a bolt-on to the original; it isn't very often that you redesign from a clean sheet of paper. The result is that you end up testing the areas your update has "touched", with a quick sanity regression test of the main use cases. You just don't have the time, resource or money to fully test everything each time an update is carried out.
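
In practice that "test what the update touched, plus a smoke run" approach looks something like this hypothetical Python sketch (module and suite names are made up):

Code:
# Which full test suites exist for which modules; names are invented.
SUITES_BY_MODULE = {
    "flight_plan_parser": ["tests/parser_full"],
    "route_validator":    ["tests/validator_full"],
    "display_layer":      ["tests/display_full"],
}
SMOKE_SUITE = ["tests/smoke_main_use_cases"]   # quick sanity regression, always run

def select_suites(changed_modules):
    """Run full suites only for the modules this update has touched."""
    selected = list(SMOKE_SUITE)
    for module in changed_modules:
        # Anything not mapped gets only the smoke tests -- exactly the gap
        # where untested edge cases survive an update.
        selected.extend(SUITES_BY_MODULE.get(module, []))
    return selected

print(select_suites(["route_validator"]))
# ['tests/smoke_main_use_cases', 'tests/validator_full']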

Even then, there will be an edge case you just don't consider, haven't even thought of, or have dismissed as "won't ever happen" because of checks in other systems that you use as a data source, where you assume the input data has been properly validated and is therefore correct.
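
A hypothetical Python sketch of that last trap (the function and field names are invented): the downstream code trusts that the upstream feed has already validated the data, so the "won't ever happen" input has no handling at all when it finally arrives.

Code:
def cruise_level_feet(message: dict) -> int:
    """Assumes the upstream flight-data feed has already validated this field."""
    # No defensive check: the source system is trusted to only send numeric levels.
    return int(message["flight_level"]) * 100

print(cruise_level_feet({"flight_level": "350"}))   # 35000 -- fine for years

try:
    cruise_level_feet({"flight_level": "VFR"})      # the "won't ever happen" input
except ValueError as exc:
    print(f"unhandled edge case reached production: {exc}")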
