PPRuNe Forums - BA delays at LHR - Computer issue
29th May 2017, 13:50
#277
Nialler
 
Join Date: May 2008
Location: Paris
Age: 60
Posts: 101
Originally Posted by yoganmahew
This, on The Register, in the comments:
"
Comment from a Times article.

From the IT rumour mill
Allegedly, the staff at the Indian data centre were told to apply some security fixes to the computers in the data centre. The BA IT systems have two parallel systems to cope with updates. What was supposed to happen was that they apply the fixes to the computers of the secondary system and, when all is working, apply them to the computers of the primary system. In this way, the programs all keep running without any interruption.
What they actually did was apply the patches to _all_ the computers. Then they shut down and restarted the entire data centre. Unfortunately, computers in these data centres are used to being up and running for lengthy periods of time. That means that when you restart them, components like memory chips and network cards fail. Compounding this, if you start all the systems at once, the power drain is immense and you may end up with not enough power going to the computers - this can also cause components to fail. It takes quite a long time to identify all the hardware that failed and replace it.
So the claim that it was caused by "power supply issues" is not untrue. Bluntly - some idiot shut down the power.
Would this have happened if outsourcing had not been done? Probably not, because prior to outsourcing you had BA employees who were experienced in maintaining BA computer systems and knew, without thinking, what the proper procedures were. To the offshore staff, there is no context; they've no idea what they're dealing with - it's just a bunch of computers that need to be patched. Job done, get bonus for doing it quickly, move on."
https://forums.theregister.co.uk/for..._supply_issue/

I'm astonished, absolutely gobsmacked, if this is true - never mind cocking up the change, that happens all the time; it's doing the change on a busy weekend at all! This speaks volumes about commodity staff, unfamiliar with airlines - for everyone else, a bank holiday is an opportunity to do housekeeping at a quiet time. In an airline, though, a bank holiday is the busiest time of the year. The US company I work for has change freezes around every major US holiday. When I worked somewhere close to BA many moons ago, there were even freezes for large customer demos and trade shows.


Words fail me. Well, after the words above and some more I won't repeat here!
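Purely as an illustration of what the quoted comment is claiming should have happened - patch the standby pool first, verify it, and only then touch the live pool, restarting machines a few at a time rather than the whole estate at once - a rough sketch might look like the following. Everything in it (hostnames, the patch and health-check commands, the timings) is hypothetical and not anyone's actual tooling:

import subprocess
import time

# Hypothetical host pools: the standby pool is patched and verified first,
# the live pool only afterwards.
STANDBY_POOL = ["app-b1.example.internal", "app-b2.example.internal"]
LIVE_POOL = ["app-a1.example.internal", "app-a2.example.internal"]

RESTART_STAGGER_SECONDS = 300  # restart hosts minutes apart, never all at once

def run(host, command):
    """Run a command on a host over ssh; True if it exited cleanly."""
    result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
    return result.returncode == 0

def patch_pool(pool):
    """Patch and restart one pool, one host at a time, verifying each host."""
    for host in pool:
        if not run(host, "sudo apply-security-fixes"):    # hypothetical patch command
            print(f"patch failed on {host}; stopping here")
            return False
        run(host, "sudo shutdown -r now")                 # operating-system restart only
        time.sleep(RESTART_STAGGER_SECONDS)               # stagger restarts to avoid a power surge
        if not run(host, "/usr/local/bin/health-check"):  # hypothetical health check
            print(f"{host} unhealthy after restart; stopping here")
            return False
    return True

if __name__ == "__main__":
    # Only touch the live pool once the standby pool has been patched and verified.
    if patch_pool(STANDBY_POOL):
        patch_pool(LIVE_POOL)
    else:
        print("standby pool not healthy; live pool left untouched")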
I would dismiss that rumour as the work of someone who has little clue about large-scale high-availability IT (such skills not being a prerequisite for life). Systems are not patched and tested on production environments (for the record, a failover mirror site is most certainly a production system). There will be a chain of systems for testing: from a sandpit environment, through test, development, pre-production and user acceptance testing, to production itself. These types of fixes are usually applied completely dynamically, and those that do require restarts require only that the operating system be restarted - not the actual hardware. There should be no power issues, and certainly none where remote, distinct sites are involved.
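As a rough sketch of that chain of environments - the stage names follow the post, while the check itself is just a stand-in for whatever regression and acceptance testing actually runs at each stage - promotion of a fix might be gated like this:

# The chain of environments, in order; a fix only moves forward when the
# current stage's checks pass, and production is always last.
STAGES = ["sandpit", "test", "development", "pre-production",
          "user acceptance testing", "production"]

def stage_tests_pass(stage, fix):
    """Stand-in for whatever verification each environment actually runs."""
    print(f"running {stage} checks for {fix} ...")
    return True  # in reality: regression suites, sign-off, and so on

def promote(fix):
    """Walk a fix through every stage in order; stop at the first failure."""
    for stage in STAGES:
        if not stage_tests_pass(stage, fix):
            print(f"{fix} failed in {stage} and never reaches production")
            return False
        print(f"{fix} cleared {stage}")
    return True

if __name__ == "__main__":
    promote("security-fix-2017-05")  # hypothetical change name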


Finally, given that they're still running on a backbone of TPF, the machines running TPF are typically z-Series enterprise servers from IBM, i.e. designed with internal redundancy and with continuous uptime as one of the core aspects of their architecture. Their power requirements have shrunk from a time when, yes, the airport lights might flicker as the beast was woken up, through to today's models, which are CMOS-based and run off little more than a kettle connection. The mean time between failures on these machines is measured in years. They do not fail in the type of circumstances described.


Thanks for posting it, though.