PPRuNe Forums - View Single Post - BA delays at LHR - Computer issue
Old 6th June 2017 | 15:38
#564
Ian W
 
Joined: Dec 2006
Posts: 1,350
Likes: 0
From: Florida and wherever my laptop is
Originally Posted by PAXboy
Restarting a data centre is just like starting an aircraft: There is a sequence that has been tested and proved correct. Any component/generator/system that depends on another item already running will be set to start after it. The links to other systems are tested - just like checking 'full and free'.

It used to be that you started your car by setting manual controls and then going to the front of the car to swing a handle. Once it had 'caught', you jumped into the seat to adjust choke and mixture etc. Now the car does it all for you when you turn the key/push the button and it sequences everything in the right order.

Wrong order for anything and the flight crew have to go back to the top of the checklist before calling for push or moving under own power, or turning onto the active. So the question is: What state was BA's startup list in and when was it last read?
But there are geographically separate data centers backing each other up - or so we are told. That is obviously not the case. What they appear to have really been operating is a tightly coupled distributed system which provides (provided) no redundancy or fault tolerance. It would appear someone has implemented something that turned an otherwise redundant system into a single monolith, so that failing any part of the monolith results in a total system crash. That is consistent with what was reported: the phones and display boards failed as well.

This was not a power supply fault, although that exposed it - it was a gross system architecture design failure. I can't imagine that it was originally set up like that. It is more likely that someone has removed the redundancy from the system in some way, possibly through ignorance of how the fault tolerance operated.
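
To make the distinction concrete, here is a minimal Python sketch of the difference between a redundant pair of sites and a "monolith" that needs every part running. It is purely illustrative: the site names `primary_dc` and `backup_dc` and both availability rules are my own assumptions, not a description of BA's actual architecture.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical data centre; the names used below are illustrative only.
    name: str
    up: bool = True


def coupled_available(sites):
    # Tightly coupled "monolith": service needs every site to be up.
    return all(s.up for s in sites)


def redundant_available(sites):
    # Redundant design: any one surviving site can carry the service.
    return any(s.up for s in sites)


def single_points_of_failure(sites, available):
    """Fail each site in turn and list those whose loss alone stops service."""
    fatal = []
    for victim in sites:
        trial = [Site(s.name, up=(s.name != victim.name)) for s in sites]
        if not available(trial):
            fatal.append(victim.name)
    return fatal


sites = [Site("primary_dc"), Site("backup_dc")]
coupled_spofs = single_points_of_failure(sites, coupled_available)
redundant_spofs = single_points_of_failure(sites, redundant_available)
print("coupled:", coupled_spofs)
print("redundant:", redundant_spofs)
```

Under the coupled rule every site is a single point of failure, whereas under the redundant rule none is. Adding a second data centre only helps if the availability rule really is "any", not "all".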