Originally Posted by procede
Fact of the matter is that every backup system will introduce new failure modes.
It happens that everything stops because of an inconsistency between the primary and secondary systems. Systems can become unavailable while they re-synchronize (a common case is a drive in a RAID array failing and the server starting to rebuild onto the hot spare). The best one I ever experienced was a UPS that failed: everything had power, except the systems behind the UPS...
That is why you do not have 'backup systems'; you have a widely distributed, fault-tolerant system. Yes, they are a real pain to test, as Scott says above, especially the regression testing after every change and fix. But Delta is probably wishing it had spent the money on a distributed system.
And Neilki, I was talking with your compatriots at around 4am this morning - they are doing a good job. I can only imagine the workload in ops and dispatch over the last few days.