PPRuNe Forums - View Single Post - BA delays at LHR - Computer issue
Old 30th May 2017, 10:11
#339
Ian W
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Originally Posted by fchan
One thing I can’t understand in this scenario is that it was reported that everything went down for hours last Friday, even the PA in Terminal 5. That may not be true. Surely a good IT architecture is one like most ATC, where flight data, voice, radar, planning etc. are on separate servers and networks. Thus if one server and its backup go down you still have some picture of what is happening from the others. It may not enable full ops but you have some capability.

Thus if BA check-in or baggage goes down, at least the supervisor should have a working screen derived from a different server/network, on which management can issue messages to staff, who can relay them to the public in the terminals with PA or whiteboards. Thus the 0800 number can be disseminated along with "sit there because we can't check in anyone for a while". And the BA.com site could be redirected to a completely different site outside the BA server cluster that gives simple messages like the 0800 number, rather than just showing a 404 error message.
Yes, in theory subsystems should be loosely coupled so that a domino effect is avoided. However, that is obviously not the case in BA (actually IAG) systems, as this event demonstrated. Events found a single point of failure. As I said before, "single points of failure always fail".

It may be that someone has unthinkingly created a single point of failure by changing something while not (fully) understanding the interconnections and interfaces between the interworking subsystems. Or perhaps the problem has always been there in some common code library, and when a value that could never possibly be exceeded was exceeded, everything using that library went down, with subsystems trying to fail over to each other. One problem with pure replication to provide fault tolerance is that it only gives hardware fault tolerance. An input that breaks one replicated system's software can break them all. I have seen that happen, and still people create systems like it.
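
To make that last point concrete, here is a minimal Python sketch. Everything in it is hypothetical (the shared_library_parse function, the Replica class, the 255 limit and the node names are all made up for illustration, and none of it is BA or IAG code). It only shows the general pattern: replicas running identical software survive the loss of one node, but an input that breaks the shared code breaks every replica it is failed over to.

```python
# Hypothetical illustration only: names and the limit are invented, not BA/IAG code.
MAX_SEGMENTS = 255  # "a value that could never possibly be exceeded"


def shared_library_parse(segment_count: int) -> int:
    """Stand-in for a common code library used by every subsystem."""
    if segment_count > MAX_SEGMENTS:
        raise OverflowError(f"segment count {segment_count} exceeds {MAX_SEGMENTS}")
    return segment_count


class Replica:
    """One node of a replicated service; every replica runs identical software."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, segment_count: int) -> int:
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        try:
            return shared_library_parse(segment_count)
        except OverflowError:
            self.alive = False  # the software fault takes this node out
            raise


def handle_with_failover(replicas, segment_count):
    """Try each replica in turn, as a naive failover scheme would."""
    for replica in replicas:
        try:
            return replica.handle(segment_count)
        except (OverflowError, RuntimeError):
            continue  # fail over to the next replica, replaying the same input
    return None  # every replica failed


replicas = [Replica("primary"), Replica("standby-1"), Replica("standby-2")]

# Hardware-style fault: one node dies, the others carry on.
replicas[0].alive = False
ok_result = handle_with_failover(replicas, 10)

# Software fault: the same bad input is replayed to every surviving replica.
bad_result = handle_with_failover(replicas, 256)
survivors = [r.name for r in replicas if r.alive]

print(ok_result, bad_result, survivors)
```

The first request succeeds on a standby even though the primary is down, which is the hardware fault tolerance replication buys you. The second request knocks out each remaining replica in turn, because failover just hands the same poisonous input to another copy of the same code.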