BA delays at LHR - Computer issue
Join Date: Jan 2010
Location: London
Posts: 379
A contractor doing maintenance work at a BA data centre is said to have inadvertently switched off the power supply, knocking out the airline’s computer systems, according to a report in The Times.
Quoting a BA source, the newspaper said the power supply unit that prompted the IT failure was working perfectly but was accidentally shut down by a worker. An investigation into the power outage is likely to focus on human error rather than any equipment failure, it said.
One is why there had not been a freeze on such work during one of the busiest periods of the year.
Another is why an apparently competent person with a key would shut down the power supply, which presumably was the UPS.
And when it happened, why didn't the backup data centre at nearby Cranebank take over? I hope that one UPS isn't used for both data centres.
"Passengers who wish to make a claim for out-of-pocket expenses or compensation following this disruption should contact British Airways directly in the first instance. If they are not satisfied with the response, they should refer their claim to British Airways' appointed dispute resolution service, CEDR. Dispute resolution services provide independent decisions on passengers' claims that the airline is contractually obliged to abide by."
BA were going round posting that overnight hotel expenses UP TO £200 for a double room (ie £100 each) would be reimbursed. None of the hotels around Heathrow normally charge such a low rate, and there were press accounts that on the weekend in question they were quoting four figure sums for a room for a night.
Join Date: Sep 2015
Location: UK
Posts: 110
In an internal email to staff, Bill Francis, the head of IT for BA’s parent company, IAG, said an uninterruptible power supply to a core data centre at Heathrow was over-ridden.
He said: “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries … After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system, and significantly exacerbated the problem.”
BA were going round posting that overnight hotel expenses UP TO £200 for a double room (ie £100 each) would be reimbursed. None of the hotels around Heathrow normally charge such a low rate,
Join Date: Jan 2006
Location: Gatwick
Posts: 117
This is why in situations like this, I pass the onus back to the bill payer to book and pay for my room rather than try to claim it back myself. That way, they get to pay the whole bill direct and they benefit from their discount arrangements that we don't have access to.
As a general rule, companies work for the people that pay their invoices. It would be surprising if this was not true of CEDR.
It is noteworthy that they provide no information on the percentage of cases settled in favour of the claimant and the percentage in favour of the airline. Or the value of the settlements.
Is this simply a wheeze to discourage people from going to the small claims court?
Join Date: Aug 2008
Location: UK
Posts: 3
It is simply not feasible to require a domestic broadband router to be connected in such a way that it cannot be unplugged or switched off.
Indeed it might even be against the domestic electrical regulations to fit such an arrangement
The only method which I have seen in large houses where staff are used is indeed to label the plug/socket as "do not switch off broadband router" and maybe similarly on the fuseboard.
I’m still amazed that a FTSE100 company, almost all of whose business relies on 100% uptime from its IT systems, could be vulnerable to a cascading failure like this.
I remember watching a demo a long time ago, could have been the early to mid 90s, where they actually physically destroyed a live server by dropping an anvil on it, and everything carried on running from backup/replication servers without any sign of distress. OK, it was smaller-scale, but that’s the kind of contingency planning you need: datacenter taken out by earthquake, fire, bomb, whatever, you just lose redundancy for a while, not everything that’s running in there.
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
This is far, far worse than human error regarding a single event. It's not just one contractor. I hate to repeat that I have decades of experience in this form of computing. If I operated at my current client site and had malicious intent, it would take me about an hour to take their systems down. That is with access-all-areas clearance.
It would be quicker if there were inherent architectural design flaws.
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Someone will have already done a back-of-an-envelope risk analysis and reckoned that the cost of changing employment practice/organization/systems exceeds the probability*cost of a repeat failure. They will then be advising the CFO that the costs of change outweigh the potential cost of inaction, as there is a low probability of recurrence. The do-nothing option will be taken, with suitable PR campaign reports fed to the press about inquiries by experts that will take long enough for even PPRuNe to forget. Luckily the IT support reports to a head of Communications with full PR experience, so this will be the preferred option.
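For anyone who wants to see what that envelope calculation looks like, here is a minimal sketch. Every figure in it is invented purely for illustration (nobody outside BA knows the real ones), and the function names are my own. It simply compares probability*cost of a repeat against the cost of fixing things.

```python
def expected_annual_loss(p_failure_per_year: float, cost_per_failure: float) -> float:
    """Expected yearly loss: probability of a failure times what one failure costs."""
    return p_failure_per_year * cost_per_failure


def do_nothing_is_cheaper(p_failure_per_year: float,
                          cost_per_failure: float,
                          annual_cost_of_change: float,
                          p_failure_after_change: float = 0.0) -> bool:
    """True if the expected loss from inaction is below the cost of changing.

    The cost of changing includes whatever residual risk remains afterwards.
    """
    loss_if_nothing = expected_annual_loss(p_failure_per_year, cost_per_failure)
    loss_if_change = annual_cost_of_change + expected_annual_loss(
        p_failure_after_change, cost_per_failure)
    return loss_if_nothing < loss_if_change


if __name__ == "__main__":
    # Hypothetical numbers only: a 1-in-50-years event costing 100m,
    # against a 5m-a-year programme of changes.
    print(do_nothing_is_cheaper(0.02, 100_000_000, 5_000_000))
    # Same sums if someone believes a repeat is a 1-in-5-years event.
    print(do_nothing_is_cheaper(0.2, 100_000_000, 5_000_000))
```

The whole argument hinges on the probability you feed in, which is exactly the number nobody can measure and everybody has an incentive to guess low.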
Join Date: May 2008
Location: Paris
Age: 60
Posts: 101
I think the comment was in respect of domestic/home equipment connected.
It is simply not feasible to require a domestic broadband router to be connected in such a way that it cannot be unplugged or switched off.
Indeed it might even be against the domestic electrical regulations to fit such an arrangement
The only method which I have seen in large houses where staff are used is indeed to label the plug/socket as "do not switch off broadband router" and maybe similarly on the fuseboard.
I was asked to look at ways to reduce fault rates for a famous global telecoms provider. I established that a large proportion of faults were due to kit being unplugged at the customer premises. The cleaner and the hoover were regular culprits. My solution was to provide stickers that read "Do not turn off. Call 0800-123-4567 before disconnecting this power supply".
Join Date: Sep 2011
Location: FL390
Posts: 238
The point remains that the second (or third?) datacenter should have immediately picked up the load. It would have been an expensive and time-consuming mistake, but it shouldn't have impacted operations.
When is a power outage not a power outage? When someone from outside switches the system off then on again? Did Messrs Walsh and Cruz lie?
BA to blame computer meltdown on IT engineer | Daily Mail Online
Join Date: Apr 2007
Location: London
Age: 64
Posts: 118
Join Date: Jun 2016
Location: Cheshire
Posts: 141
More lies from BA? Make your own mind up.
BA corrected by insurance firms over passenger claims wording
Join Date: Apr 2008
Location: Durham
Age: 62
Posts: 187
Permit to work systems
I am familiar with this type of operational management, but did BA have any system requiring a permit to work, so that various systems did not become overwhelmed or melt down?
"British Airways has changed its advice to customers who claim expenses for the weekend's travel chaos after a row with insurers. The BA website initially suggested that customers should make a claim on their travel insurance for expenses such as meals during the delays. But the Association of British Insurers (ABI) and consumer rights experts say responsibility is with the airline. BA has now updated the language, removing any reference to insurance."
BA delays: Airline changes advice over claims for expenses - BBC News
More complete incompetence. Cruz must have ok'd this approach, presumably it accorded with his "every day I think of a way to save us money" mantra.
When will the non-execs on the board, and/or the major institutional shareholders, finally suss Willie and Cruz, and put someone with some customer-focused ability in place?
Plastic PPRuNer
A few pages back Andy D posted a link to "Flying Squirrels and Unspun Gyros", an excellent 10 min talk by Mike Christian (then of Yahoo) about how these systems can fail & yes, power issues are a high factor.
OTOH, the more complicated you make your failsafe protection systems the more failure modes you have (rather like AB...)
Sometimes turning it all back on immediately can make a bad situation a great deal worse.
https://www.youtube.com/watch?v=iO2z3ttlpi4
Very much worth watching if you want to educate yourself a bit (I learned some new things), rather than just cursing the Spaniards.
[And I've also written enough code to know that there'll always be an edge-case that only turns up every 20 years...]
Join Date: Mar 2008
Location: South London
Posts: 35
Nice little par in the Guardian (ignore the ref to Waterside)
BA’s creaking IT infrastructure includes over 500 data cabinets across six halls around its Waterside base, northwest of Heathrow. A contractor with knowledge of Boadicea House said: “It’s a very old facility, there are lots and lots of problems with it. We weren’t particularly surprised, knowing the set-up there.” He added that a number of senior managers at the data centre have retired or left in the past three years.
Join Date: Apr 2007
Location: UK
Posts: 38
At the Kegworth incident engine X lost power due to mechanical failure. Pilot looked at the poorly designed instruments and decided engine Y was the culprit so shut it down. Later, realising the error, he tried to relight engine Y but too late to avoid the ground short of the diversion field.
To translate that to the current IT incident what about this possible scenario?
Maintainer sees that UPS X has fully or partially failed, as they occasionally do. He does not have to do anything, as UPS Y continues smooth ops. But it’s good practice to investigate and fix (though not on a Bank Hol), as losing a second one may be serious. So he goes to shut it down whilst he works on it. He accidentally shuts down UPS Y after misreading the diagnosis screen, some labels on the switches, etc. Or the maintenance screen gives the wrong info (note it was all upgraded quite recently, so an error in logic or labelling may have recently been introduced). Now, with two down, warning alarms and messages are seriously starting to sound, and maybe a third UPS is rapidly draining its batteries whilst taking the entire load. So he panics and turns the wrong switches to get UPS Y back on air quickly. Or the reconnect sequence is partly or fully computer controlled and gets it wrong. Damage ensues for reasons stated elsewhere here, although I don’t really see how a well-designed system would cause this.
Does not explain why data centre 2 did not take over. IT not my thing but UPSs and reliability modelling are.
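To put rough numbers on why a scenario like this matters in reliability modelling, here is a minimal sketch. It assumes a textbook parallel-redundancy model with a single common-cause term added on top (for instance, one person switching off the wrong unit). The probabilities and function names are illustrative assumptions of mine, not anything known about BA's installation.

```python
def independent_failure_probability(q_unit: float, n_units: int) -> float:
    """Probability that all n redundant units are down at once,
    assuming each is down independently with probability q_unit."""
    return q_unit ** n_units


def failure_probability_with_common_cause(q_unit: float,
                                          n_units: int,
                                          p_common: float) -> float:
    """Add a common-cause event (e.g. human error) that defeats every
    unit at once, regardless of how many redundant units are fitted."""
    independent = independent_failure_probability(q_unit, n_units)
    # System fails if the common-cause event occurs, or if it does not
    # occur but all units happen to fail independently.
    return p_common + (1.0 - p_common) * independent


if __name__ == "__main__":
    q = 0.01        # hypothetical chance a single UPS is unavailable
    p_cc = 0.001    # hypothetical chance of a wrong-switch style error
    for n in (1, 2, 3):
        print(n,
              independent_failure_probability(q, n),
              failure_probability_with_common_cause(q, n, p_cc))
```

With these made-up inputs, adding a second or third UPS shrinks the independent term very quickly, but the total never drops below the common-cause term. That is the usual reason extra hardware alone does not buy much once procedures and labelling become the weak point.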
Last edited by fchan; 5th Jun 2017 at 10:31.