
BA delays at LHR - Computer issue


Old 2nd Jun 2017, 11:07
  #461 (permalink)  
 
Join Date: Jan 2010
Location: London
Posts: 379
Originally Posted by Epsomdog View Post
A contractor doing maintenance work at a BA data centre is said to have inadvertently switched off the power supply, knocking out the airline’s computer systems, according to a report in The Times.

Quoting a BA source, the newspaper said the power supply unit that prompted the IT failure was working perfectly but was accidentally shut down by a worker. An investigation into the power outage is likely to focus on human error rather than any equipment failure, it said.
If this article is correct, then several questions have to be answered.

One is why there had not been a freeze on such work during one of the busiest periods of the year.

Another is why would an apparently competent person with a key shut down the power supply, which presumably was the UPS.

And when it happened, why didn't the backup data centre at nearby Cranebank take over? I hope that one UPS isn't used for both data centres.
Caribbean Boy is offline  
Old 2nd Jun 2017, 11:22
  #462 (permalink)  
 
Join Date: Oct 2002
Location: London UK
Posts: 6,406
"Passengers who wish to make a claim for out-of-pocket expenses or compensation following this disruption should contact British Airways directly in the first instance. If they are not satisfied with the response, they should refer their claim to British Airways' appointed dispute resolution service, CEDR. Dispute resolution services provide independent decisions on passengers' claims that the airline is contractually obliged to abide by."
That seems very commercially unwise for BA. The "contract" is of BA's making and entirely weighted towards the BA point of view, ie 'we don't owe nobody nothing for anything'. If all the dispute resolution contractor (selected and paid for by BA) is going to do is look at the BA contract and conclude people aren't entitled to anything more, not only do they add nothing to the situation (apart from pocketing their fee from BA) but are going to lead to a lot more hacked-off people, with their stories in the Daily Mail, etc.

BA were going round posting that overnight hotel expenses UP TO £200 for a double room (ie £100 each) would be reimbursed. None of the hotels around Heathrow normally charge such a low rate, and there were press accounts that on the weekend in question they were quoting four figure sums for a room for a night.
WHBM is online now  
Old 2nd Jun 2017, 11:30
  #463 (permalink)  
 
Join Date: Sep 2015
Location: UK
Posts: 86
Originally Posted by Superpilot View Post
Am I missing something here? The (probably TCS) contractor shut down a power supply and no one knew how to switch it on again?
I think you'll find that switching it back on was what caused most of the damage. From the Guardian website:

In an internal email to staff, Bill Francis, the head of IT for BA’s parent company, IAG, said an uninterruptible power supply to a core data centre at Heathrow was over-ridden.
He said: “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries … After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system, and significantly exacerbated the problem.”
Joe_K is offline  
Old 2nd Jun 2017, 11:30
  #464 (permalink)  
 
Join Date: Feb 2001
Location: The Winchester
Posts: 5,643
BA were going round posting that overnight hotel expenses UP TO £200 for a double room (ie £100 each) would be reimbursed. None of the hotels around Heathrow normally charge such a low rate,
Point of order: Not defending BA's policy on this and I certainly can't speak for what horse trading went on over the weekend, but there are certainly a few decent hotels around LHR that normally charge under 100 STG for a double room for a night.
wiggy is offline  
Old 2nd Jun 2017, 11:37
  #465 (permalink)  
 
Join Date: Jan 2006
Location: Gatwick
Posts: 111
Originally Posted by wiggy View Post
Point of order: Not defending BA's policy on this and I certainly can't speak for what horse trading went on over the weekend, but there are certainly a few decent hotels around LHR that normally charge under 100 STG for a double room for a night.

This is why in situations like this, I pass the onus back to the bill payer to book and pay for my room rather than try to claim it back myself. That way, they get to pay the whole bill direct and they benefit from their discount arrangements that we don't have access to.
bbrown1664 is offline  
Old 2nd Jun 2017, 11:43
  #466 (permalink)  
 
Join Date: May 2000
Location: London, UK
Posts: 337
As a general rule, companies work for the people that pay their invoices. It would be surprising if this was not true of CEDR.

It is noteworthy that they provide no information on the percentage of cases settled in favour of the claimant and the percentage in favour of the airline. Or the value of the settlements.

Is this simply a wheeze to discourage people from going to the small claims court?
SLF3 is offline  
Old 2nd Jun 2017, 12:01
  #467 (permalink)  
 
Join Date: Aug 2008
Location: UK
Posts: 3
Originally Posted by Nialler View Post
To be honest, you failed in your remit.
I've designed several datacenters. You simply make it impossible for such unplugging. I'd have been horrified if you made this suggestion to me.
I think the comment was in respect of domestic/home equipment connected.
It is simply not feasible to require a domestic broadband router to be connected in such a way that it cannot be unplugged or switched off.
Indeed, it might even be against the domestic electrical regulations to fit such an arrangement.
The only method which I have seen in large houses where staff are used is indeed to label the plug/socket as "do not switch off broadband router" and maybe similarly on the fuseboard.
dsc810 is offline  
Old 2nd Jun 2017, 12:29
  #468 (permalink)  
 
Join Date: Dec 2003
Location: Tring, UK
Posts: 1,451
I’m still amazed that a FTSE100 company whose business relies almost entirely on 100% uptime from its IT systems could be vulnerable to a cascading failure like this.

I remember watching a demo a long time ago, could have been the early-to-mid 90s, where they actually physically destroyed a live server by dropping an anvil on it, and everything carried on running from backup/replication servers without any sign of distress. OK, it was smaller-scale, but that’s the kind of contingency planning you need: datacenter taken out by earthquake, fire, bomb, whatever, you just lose redundancy for a while, not everything that’s running in there.
FullWings is offline  
Old 2nd Jun 2017, 12:29
  #469 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,347
Originally Posted by Nialler View Post
This is far far worse than human error regarding a single event. It's not just one contractor. I hate to repeat that I have decades of experience in this form of computing. If I operated at my current client site and had malicious intent it would take me about an hour to take their systems down. That is with access all areas clearance.


It would be quicker if there were inherent architectural design flaws.
I have to agree with this. It should not be possible to 'take down' a complete distributed system by crashing one subsystem, however 'dirty' the crash was and whatever process was being run. The entire point of having a distributed system is to prevent such total system crashes. Either the system has been so poorly designed or maintained that it is no longer fault tolerant, or someone spent that hour killing the system with nobody alert enough to stop them. Personally, I would go for poor maintenance or poor maintenance practices. It could be that it has been maintained by people who have not carried out adequate regression testing of the impact of system or software changes on fault tolerance, or they were carrying out a procedure that deliberately ignored the impact on fault tolerance, or even disabled it. This was probably not malicious but down to lack of understanding. For the system the effect is the same. Lack of understanding would then explain the failure to recover the system rapidly and gracefully. IAG have a major problem on their hands, but it appears that the engineering-illiterate management is unaware of it.
Ian W is offline  
Old 2nd Jun 2017, 12:43
  #470 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,347
Originally Posted by pax2908 View Post
That comment (and a previous one I made about this event) was meant to be sarcastic ...
Unfortunately, it is all too true.
Someone will have already done a back-of-an-envelope risk analysis and reckoned that the cost of changing employment practice/organization/systems exceeds the probability*cost of a repeat failure. They will then be advising the CFO that the costs of change outweigh the potential cost of inaction, as there is a low probability of recurrence. The do-nothing option will be taken, with suitable PR campaign reports fed to the press about inquiries by experts that will take long enough for even PPRuNe to forget. Luckily the IT support reports to a head of Communications with full PR experience, so this will be the preferred option.
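For anyone who wants to see how crude that envelope calculation is, here is a minimal sketch of the probability*cost comparison. Every figure in it is hypothetical and chosen only to show the arithmetic; none of it comes from BA or IAG, and the function names are made up for illustration.

```python
def expected_loss(p_per_year: float, cost_per_incident: float, years: int) -> float:
    """Expected cost of repeat failures over the horizon (probability * cost * years)."""
    return p_per_year * cost_per_incident * years


def do_nothing_looks_cheaper(cost_of_change: float, p_per_year: float,
                             cost_per_incident: float, years: int) -> bool:
    """True when the up-front cost of change exceeds the expected loss from inaction."""
    return cost_of_change > expected_loss(p_per_year, cost_per_incident, years)


# Hypothetical inputs, for illustration only.
COST_OF_CHANGE = 50_000_000     # assumed one-off cost of fixing practice/systems
P_REPEAT_PER_YEAR = 0.05        # assumed 1-in-20 chance of a repeat each year
COST_PER_INCIDENT = 80_000_000  # assumed cost of one outage
HORIZON_YEARS = 5               # assumed planning horizon

loss = expected_loss(P_REPEAT_PER_YEAR, COST_PER_INCIDENT, HORIZON_YEARS)
print(f"Expected loss from inaction: {loss:,.0f}")
print("Do nothing looks cheaper:",
      do_nothing_looks_cheaper(COST_OF_CHANGE, P_REPEAT_PER_YEAR,
                               COST_PER_INCIDENT, HORIZON_YEARS))
```

The weakness is obvious once it is written down: the answer is only as good as the guessed probability, and reputational damage never makes it into the cost figure.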
Ian W is offline  
Old 2nd Jun 2017, 14:23
  #471 (permalink)  
 
Join Date: May 2008
Location: Paris
Age: 56
Posts: 101
Originally Posted by dsc810 View Post
I think the comment was in respect of domestic/home equipment connected.
It is simply not feasible to require a domestic broadband router to be connected in such a way that it cannot be unplugged or switched off.
Indeed, it might even be against the domestic electrical regulations to fit such an arrangement.
The only method which I have seen in large houses where staff are used is indeed to label the plug/socket as "do not switch off broadband router" and maybe similarly on the fuseboard.
The poster mentioned:


I was asked to look at ways to reduce fault rates for a famous global telecoms provider. I established that a large proportion of faults were due to kit being unplugged at the customer premises. The cleaner and the hoover were regular culprits. My solution was to provide stickers that read "Do not turn off. Call 0800-123-4567 before disconnecting this power supply".
As far as I am concerned (and I maintain very large mission-critical systems), I'd treat your average small system for a tiny company in the same way. You have a small system and it requires uptime from a critical system? I'll give you that. I'm not putting labels on plugs, though. The defences will be a bit stronger than post-its.
Nialler is offline  
Old 2nd Jun 2017, 15:09
  #472 (permalink)  
 
Join Date: Sep 2011
Location: FL390
Posts: 19
Originally Posted by Joe_K View Post
I think you'll find that switching it back on was what caused most of the damage. From the Guardian website:
The point remains that the second (or third?) datacenter should have immediately picked up the load. It would have been an expensive and time-consuming mistake, but it shouldn't have impacted operations.
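As a rough sketch of what "picking up the load" means in its simplest form: traffic goes to the first site in priority order that passes a health check. The site names and the `Site` type here are hypothetical, and a real setup also has to deal with data replication and split-brain, which this ignores entirely.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Site:
    name: str      # hypothetical label, not a real BA facility name
    healthy: bool  # result of whatever health check is in use


def pick_active_site(sites: Sequence[Site]) -> Optional[Site]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if site.healthy:
            return site
    return None


# The primary has lost power; the secondary should take the load.
sites = [Site("primary", healthy=False), Site("secondary", healthy=True)]
active = pick_active_site(sites)
print(active.name if active else "total outage")
```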
Fursty Ferret is offline  
Old 2nd Jun 2017, 15:19
  #473 (permalink)  
 
Join Date: Feb 2003
Location: BHX LXR ASW
Posts: 2,191
When is a power outage not a power outage? When someone from outside switches the system off then on again? Did Messrs Walsh and Cruz lie?

BA to blame computer meltdown on IT engineer | Daily Mail Online
crewmeal is offline  
Old 2nd Jun 2017, 15:45
  #474 (permalink)  
 
Join Date: Apr 2007
Location: London
Age: 60
Posts: 95
Some more analysis from El Reg ...

https://www.theregister.co.uk/2017/0...configuration/
Alanwsg is online now  
Old 2nd Jun 2017, 16:03
  #475 (permalink)  
 
Join Date: Jun 2016
Location: Cheshire
Posts: 138
More lies from BA? Make your own mind up.

BA corrected by insurance firms over passenger claims wording
Trav a la is offline  
Old 2nd Jun 2017, 16:43
  #476 (permalink)  
 
Join Date: Apr 2008
Location: Durham
Age: 58
Posts: 183
Permit to work systems

I am familiar with this type of operational management, but did BA have any system which required a permit to work, so that various systems did not become overwhelmed or melt down?
mercurydancer is offline  
Old 2nd Jun 2017, 17:25
  #477 (permalink)  
 
Join Date: Oct 2002
Location: London UK
Posts: 6,406
"British Airways has changed its advice to customers who claim expenses for the weekend's travel chaos after a row with insurers. The BA website initially suggested that customers should make a claim on their travel insurance for expenses such as meals during the delays. But the Association of British Insurers (ABI) and consumer rights experts say responsibility is with the airline. BA has now updated the language, removing any reference to insurance."

BA delays: Airline changes advice over claims for expenses - BBC News


More complete incompetence. Cruz must have ok'd this approach, presumably it accorded with his "every day I think of a way to save us money" mantra.

When will the non-execs on the board, and/or the major institutional shareholders, finally suss Willie and Cruz, and put someone with some customer-focused ability in place?
WHBM is online now  
Old 2nd Jun 2017, 17:44
  #478 (permalink)  

Plastic PPRuNer
 
Join Date: Sep 2000
Location: Cape Town
Posts: 1,891
A few pages back Andy D posted a link to "Flying Squirrels and Unspun Gyros", an excellent 10 min talk by Mike Christian (then of Yahoo) about how these systems can fail & yes, power issues are a high factor.

OTOH, the more complicated you make your failsafe protection systems the more failure modes you have (rather like AB...)

Sometimes turning it all back on immediately can make a bad situation a great deal worse.

https://www.youtube.com/watch?v=iO2z3ttlpi4

Very much worth watching if you want to educate yourself a bit (I learned some new things), rather than just cursing the Spaniards.


[And I've also written enough code to know that there'll always be an edge-case that only turns up every 20 years...]
Mac the Knife is offline  
Old 2nd Jun 2017, 18:37
  #479 (permalink)  
 
Join Date: Mar 2008
Location: South London
Posts: 35
Originally Posted by gordonroxburgh View Post
Nice little par in the Guardian (ignore the ref to Waterside)

BA’s creaking IT infrastructure includes over 500 data cabinets across six halls around its Waterside base, northwest of Heathrow. A contractor with knowledge of Boadicea House said: “It’s a very old facility, there are lots and lots of problems with it. We weren’t particularly surprised, knowing the set-up there.” He added that a number of senior managers at the data centre have retired or left in the past three years.
I think you will find that other firms have creaking IT as well; BA wouldn't be unique in that regard.
Tight Accountant is offline  
Old 2nd Jun 2017, 19:16
  #480 (permalink)  
 
Join Date: Apr 2007
Location: UK
Posts: 38
At the Kegworth incident engine X lost power due to mechanical failure. Pilot looked at the poorly designed instruments and decided engine Y was the culprit so shut it down. Later, realising the error, he tried to relight engine Y but too late to avoid the ground short of the diversion field.

To translate that to the current IT incident what about this possible scenario?

Maintainer sees that UPS X has fully or partially failed, as they occasionally do. He does not have to do anything, as UPS Y continues smooth ops. But it’s good practice to investigate and fix (but not on a Bank Hol), as losing a second one may be serious. So he goes to shut it down whilst he works on it. He accidentally shuts down UPS Y after misreading the diagnosis screen/some labels on the switches etc. Or the maintenance screen gives the wrong info (note it was all upgraded quite recently, so an error in logic/labelling may have recently been introduced). Now, with two down, warning alarms and messages are seriously starting to sound, and maybe a third UPS is rapidly draining its batteries whilst taking the entire load. So he panics and turns the wrong switches to get UPS Y back on air quickly. Or the reconnect sequence is partly or fully computer controlled and gets it wrong. Damage ensues for reasons stated elsewhere here, although I don’t really see how a well-designed system would cause this.

Does not explain why data centre 2 did not take over. IT not my thing but UPSs and reliability modelling are.
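To put the reliability modelling point in numbers, here is a minimal sketch comparing redundant UPS units that fail independently with the same units exposed to a common cause, such as one person at the wrong switch. The probabilities are hypothetical, and the common-cause term is a deliberately simplified assumption, not a model of the BA installation.

```python
def p_all_fail_independent(p_unit: float, n_units: int) -> float:
    """Probability that all n redundant units fail, assuming independent failures."""
    return p_unit ** n_units


def p_all_fail_with_common_cause(p_unit: float, n_units: int, p_common: float) -> float:
    """As above, plus a single common-cause event that takes out every unit at once."""
    return p_common + (1 - p_common) * p_unit ** n_units


# Hypothetical figures, for illustration only.
P_UNIT = 0.01     # assumed chance that one UPS is unavailable
N_UNITS = 3       # assumed number of redundant units
P_COMMON = 0.001  # assumed chance of a common-cause event (e.g. operator error)

print(p_all_fail_independent(P_UNIT, N_UNITS))
print(p_all_fail_with_common_cause(P_UNIT, N_UNITS, P_COMMON))
```

With these made-up inputs the common-cause term dominates by roughly three orders of magnitude, which is why adding more UPS units does little once human error is the main risk.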

Last edited by fchan; 5th Jun 2017 at 10:31.
fchan is offline  
