Totally agree with PAXboy at #40 and others who have pointed out that this is likely to be, at its source, a management failure.
If your IT system is mission-critical, then you have to do what it takes to make failure genuinely improbable (as in a 0.0001% chance). Bean-counters are very likely to balk at the expense until they have an expensive failure. Even then they may make a trade-off: if the cost of a failure is significantly less than the cost of a resilient system, then "we can live with a failure now and then". Having said that, even mission-critical systems that one would expect to be fully resilient do fail now and then (ATC comes to mind).
In theory one should be able to fall back to manual operation (if you have manual procedures available, and you've tested them, and your people have not only been trained in them but have actually used them in real life in the recent past). But it may be that we're now so dependent on interlinked IT systems that manual operation is no longer practical.