BA delays at LHR - Computer issue

29th May 2017 | 15:04

#281 (permalink)

Self Loading Freight

None but a blockhead

Joined: Nov 1999

Posts: 534

Likes: 0

From: London, UK

I think that rumour is credible too - absent a knowledge of how modern the BA data centre's infrastructure actually is. Many cascading faults in corporate infrastructure wouldn't happen if said infrastructure had no legacy systems; but much corporate infrastructure is heavily dependent on legacy systems. And say it didn't happen that way because it couldn't - OK, so how could a modern, well-engineered, inherently reliable system fail so badly? Because it did.

29th May 2017 | 15:06

#282 (permalink)

Nialler

Joined: May 2008

Posts: 101

Likes: 0

From: Paris

Quote:

Originally Posted by MG23

If BA had high-availability IT, we wouldn't be having this discussion.

That rumour seems dubious to me, but having worked with Indian outsourcing in a previous life, it's not that dubious.

I've refused contracts where one particular outsourcer was involved. No names, obviously. The practice in the very large systems I've worked on has been to keep the crown jewels at home. Functions such as security, systems design and management remain in the home territory, while some application development and support can more easily be taken on by an appropriate outsourcer with the right skills. The problem is that the skills required for very high-end computing are relatively new to India. It's also a mindset. You've found or, worse, caused a problem on one of these machines? First step: push away from the keyboard. Second step: begin ringing teammates and the boss. Deliberate. Running at the problem almost always makes it worse. Declare a disaster if needed. The latter is the problem. Everyone is afraid of that word. Over decades I've seen that anything short of a 737 crashing into a data center will not be treated as a disaster. Surely it can be fixed?

It isn't like that. That single message about a pointer error can proliferate rapidly and be compounded by errant efforts to airbrush it away.

Sorry for using a flight forum for going on about this, but I so enjoy reading what the flight jocks have to say that I can't help contributing from inside the climate rooms I haunt.

29th May 2017 | 15:13

#283 (permalink)

KelvinD

Joined: May 2011

Posts: 822

Likes: 0

From: Hampshire

While Senor Cruz is still going on about a power failure being the culprit, it struck me that everyone, including me, has been assuming this refers to the big electric coming through the wall (230V, 440V, 3kV or whatever). If any of these fails, one would expect the UPS to take over immediately (a "no break" supply).
What about the internal power supplies driving the servers etc., i.e. the bits that turn the incoming electricity into 5V, 12V, 24V etc.? If you let the smoke out of these, no amount of UPS backup is going to help.
Just a thought.

29th May 2017 | 15:17

#284 (permalink)

Nialler

Joined: May 2008

Posts: 101

Likes: 0

From: Paris

Quote:

Originally Posted by Self Loading Freight

My experience based on decades as a consultant operating directly in the fields of resilience, availability, disaster recovery and business continuity planning is that the legacy systems are always the most robust.

29th May 2017 | 15:21

#285 (permalink)

Twiglet1

Joined: Jan 2015

Posts: 387

Likes: 1

From: Centre of Universe

I was at LHR on Saturday trying to use Staff Travel. With that clearly uppermost in my mind, my experiences:
1. The BA Staff at check in were superb. When the back up "system" came on line they had to share PCs / work in twos and threes to get individual bookings ticketed and tagged. Accessing their systems was slow and time consuming and familiarity an issue - Overall 10 out of 10.

2. Early on they cancelled all Staff Travel to concentrate on fare paying punters - can't argue with that. It didn't materialise much - the order came from upon high - we just hung around as it was ever changing.

3. Their Manager came round giving directions and updates. He missed a couple of chances to say thanks, and I could see the opportunity go by. This was down to pressure I'm sure, but the relationship with the Manager is the number 1 factor in staff engagement.

4. Checked in at STD -30 mins (having been in the queue for 2hrs 30mins), finally got boarding passes and legged it to drop bags off (bag belts u/s also; is that BA or BAA?).
Got airside - no gate on, FFS, just a security man helping.

5. Got to gate 10B no BA staff for some time. They came and gave a few updates. After some time the BBC news TV on the gate said all flights up to 1800 cancelled. The gate staff hadn't heard this though (as comms was all down to phones). The one particular BA lady kept us updated every 30 mins or so, great communications with as much as she could give - again 10/10. When we gave up she was being hassled by about 20 pax so we just shouted well done and she thanked us.

6. Getting out was another issue. Everyone exited via gate 12 (hundreds) and only two small exit lanes. Could have gone wrong big time but old bill helped out. Quick dash through immigration and into bag hall - BA staff said don't bother so we just went out.

Apart from the disappointment / long day the only negative to me was some other BA Staff Traveller trying to "jump" the queue as his flight was going soon (just like rest).

There is always one, trying to go away on school hols with kids - well when it goes wrong you'll only do that once

And finally at my work we use mainly Indians and they are better by a country mile

29th May 2017 | 15:23

#286 (permalink)

Nialler

Joined: May 2008

Posts: 101

Likes: 0

From: Paris

Quote:

Originally Posted by KelvinD

No. The UPS is more than a bank of batteries. It's an expensive piece of kit which smooths the supply during brownouts, during spikes and during the absence of any power at all. The electrical input to a properly specced enterprise server should never fluctuate.

29th May 2017 | 15:24

#287 (permalink)

ILS27LEFT

Joined: Aug 2002

Posts: 7

Likes: 4

From: Europe

Cost cutting...indefinitely

From BBC News:
"Earlier this year, Mr Cruz told Skift magazine: 'We're always going to be reducing costs... It's now injected into the DNA. If one particular day we don't come up with an idea to reduce our costs, then we're not doing our job.'"

This IT global mess is the result of the above corporate philosophy.
Lives have been ruined. Millions of pounds wasted.
Constant and indefinite cost cutting is a corporate suicide. There is a limit. BA is showing the first signs of this suicide mission. It must be stopped.

29th May 2017 | 15:52

#288 (permalink)

Ian W

Joined: Dec 2006

Posts: 1,350

Likes: 0

From: Florida and wherever my laptop is

Quote:

Originally Posted by Nialler

This kind of crash after outsourcing and losing experienced staff has happened before. You would think that CEOs would learn from others' disasters but apparently not.

It was precisely the reason that the patched and kludged together RBS/Nat West banking system fell over...and the 'inexperienced' operatives in Hyderabad were the likely culprits in screwing up an upgrade backout.

RBS computer failure 'caused by inexperienced operative in India' - Telegraph
https://www.theregister.co.uk/2012/0...at_went_wrong/

This kind of thing should never ever happen, but if you are unaware of the particular foibles of what is otherwise a fully fault tolerant system it can be surprisingly easy to break the system when you have full SysAdmin privileges and have finger trouble trying to stop the system going down.

29th May 2017 | 15:55

#289 (permalink)

Heathrow Harry

Thread Starter

Joined: Apr 2010

Posts: 7,056

Likes: 2

From: London

"Constant and indefinite cost cutting is a corporate suicide. There is a limit."

Not so - every organisation should always be looking to cut costs - working practices change, technology changes

BUT you have to do it while maintaining or improving the product - cost cutting as a sole driver is very very bad business practice

29th May 2017 | 15:56

#290 (permalink)

Tight Accountant

Joined: Mar 2008

Posts: 35

Likes: 0

From: South London

Quote:

Originally Posted by aox

"a friend told me".

I really dislike non-attributable stories. Sure, cock-ups occur in business all the time and I've seen plenty by Accountants and Non-Accountants alike.

29th May 2017 | 15:58

#291 (permalink)

yoganmahew

Joined: Aug 2007

Posts: 27

Likes: 0

From: Tullamore

Quote:

Originally Posted by Nialler

I would dispense with that rumour as the work of someone who has little clue about large scale high-availability IT (such skills not being a prerequisite for life). Systems are not patched/tested on production environments (for the record a failback mirror site is most certainly a production system). There will be a chain of systems for testing, from a sandpit environment, through test, development, pre-production, user acceptance testing then production itself. These types of fixes are usually completely dynamic, but those that require restarts require only that the operating system be restarted - not the actual hardware. There should be no power issues and certainly none where remotely distinct sites are involved.

Finally, given that they're still on a background of TPF, the machines running TPF are typically z-Series enterprise servers from IBM, i.e. designed with internal redundancy and with continuous uptime as one of the core aspects of their architecture. Their power requirements have shrunk from a time when, yes, the airport lights might flicker as the beast was woken up, through to today's models, which are CMOS based and run off little more than a kettle connection. The mean time between failures on these machines is measured in years. They do not fail in the type of circumstances described.

Thanks for posting it, though.

Hi Nialler. The rumour is not suggesting that the patch itself was faulty, just that the restart procedure was inadequately careful.

BA have no TPF, either in their own site or, if Amadeus are to be believed, in the underlying Amadeus architecture. This, I'm afraid, is all 'modern' stuff with hundreds of boxen performing trivial proportions of the overall workload.

If the fix is the SMB fix for WannaCry to the server, it could require an OS restart, not just an application restart (depending on the OS). Even if it didn't, hundreds of applications starting will draw more power as they reload, rebuild caches etc.
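A rough sketch of that power argument, with every figure assumed purely for illustration (none comes from BA or from this thread): if each box draws more while booting than when settled, restarting everything at once stacks the peaks, while restarting in batches caps them. The function name and the wattages are hypothetical.

```python
def peak_power_watts(servers, batch_size, steady_w=300.0, boot_w=450.0):
    """Worst-case draw while restarting `servers` machines in batches.

    Simplifying assumption: at any moment one batch is booting (drawing
    boot_w each) and every other machine is drawing steady_w. The wattages
    are hypothetical placeholders, not measurements.
    """
    if servers <= 0 or batch_size <= 0:
        raise ValueError("servers and batch_size must be positive")
    booting = min(batch_size, servers)
    settled = servers - booting
    return booting * boot_w + settled * steady_w


# Compare restarting 500 boxes together against batches of 25.
all_at_once = peak_power_watts(500, batch_size=500)
staggered = peak_power_watts(500, batch_size=25)
print(f"all at once:   {all_at_once / 1000:.1f} kW")
print(f"batches of 25: {staggered / 1000:.1f} kW")
```

Whether the difference matters depends on how much headroom the supply has, which nobody outside BA knows at this point.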

Anyway, the whole thing is so unclear, and this from a man who claims to be digital to the core, that you have to think it was something enormously f'd up.

29th May 2017 | 15:59

#292 (permalink)

rideforever

Joined: Mar 2014

Posts: 54

Likes: 0

From: UK

Quote:

Originally Posted by ILS27LEFT

Constant and indefinite cost cutting is a corporate suicide. There is a limit. BA is showing the first signs of this suicide mission. It must be stopped.

Yes, our whole species is undergoing this new philosophy. And is failing, but nobody notices ... what people notice are the promises of extreme savings and extreme profits.
At the same time the lack of challenge in our society is creating a new level of incompetence. Not only is there incompetence, but nobody really cares. Conscience is a long way away.
Does anyone have a goal except mortgage payments and facebook?
Without a goal there is no reason to do any more than this.
Why did Victorian engineers build so well ?
What was inside them that is not inside this generation ?
Quite a lot, methinks.

29th May 2017 | 15:59

#293 (permalink)

Ian W

Joined: Dec 2006

Posts: 1,350

Likes: 0

From: Florida and wherever my laptop is

Quote:

Originally Posted by Nialler

Many of the major systems I have dealt with run on batteries all the time. The choice is which system: grid, standby grid, standby generators to trickle charge the batteries.
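As a minimal sketch of that arrangement (the names and the function are hypothetical, and the priority order is simply the order given above): the load always sits on the batteries, and the only decision is which source keeps them charged.

```python
from typing import Mapping, Optional

# Priority order as listed in the post: grid, standby grid, generators.
CHARGE_SOURCES = ("grid", "standby grid", "standby generators")


def pick_charge_source(healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first healthy charging source in priority order, or None.

    In this model the load is always fed from the batteries; the chosen
    source only keeps them topped up. A source missing from `healthy` is
    treated as unavailable.
    """
    for name in CHARGE_SOURCES:
        if healthy.get(name, False):
            return name
    return None  # nothing is charging: the batteries are now running down


# Main grid lost, everything else fine: charging moves to the standby grid.
status = {"grid": False, "standby grid": True, "standby generators": True}
print(pick_charge_source(status))
```

The point of the design is that a source change never touches the load directly, so there is no switchover moment for the servers to notice.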

29th May 2017 | 15:59

#294 (permalink)

Tight Accountant

Joined: Mar 2008

Posts: 35

Likes: 0

From: South London

Quote:

Originally Posted by Ian W

RBS computer failure 'caused by inexperienced operative in India' - Telegraph
https://www.theregister.co.uk/2012/0...at_went_wrong/

Ian - I don't know your background but I understand that RBS has a whole host of legacy systems which need considerable TLC. It will be interesting to understand whether legacy systems fell over at BA; I suspect not.

Last edited by Tight Accountant; 29th May 2017 at 16:01. Reason: Grammar and improved clarity.

29th May 2017 | 16:01

#295 (permalink)

Nialler

Joined: May 2008

Posts: 101

Likes: 0

From: Paris

Quote:

Originally Posted by Ian W

This kind of crash after outsourcing and losing experienced staff has happened before. You would think that CEOs would learn from others' disasters but apparently not.

I've seen the complete history which led to the problem. The problem was most certainly not one caused by outsourcing. The ad referenced in the article is for an experienced admin. That is way below the level where this problem occurred. I've spoken with the principals involved.

29th May 2017 | 16:08

#296 (permalink)

Nialler

Joined: May 2008

Posts: 101

Likes: 0

From: Paris

Quote:

Originally Posted by Ian W

Many of the major systems I have dealt with run on batteries all the time. The choice is which system: grid, standby grid, standby generators to trickle charge the batteries.

Exactly. They're not just there for hard outages. In the past, with older mainframes, I have wanted to limit the UPS to twenty minutes. If air-handling is gone too, a system will cook itself.
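A back-of-envelope sketch of why running on without cooling buys so little. The assumptions are crude and stated in the code: all the IT power goes into heating the room air only, and the load, room size and temperature limits are hypothetical.

```python
AIR_DENSITY_KG_M3 = 1.2      # approximate, air at room temperature
AIR_HEAT_CAP_J_KG_K = 1005   # approximate specific heat of air


def minutes_until_too_hot(it_load_kw, room_volume_m3, start_c=22.0, limit_c=40.0):
    """Minutes for room air to heat from start_c to limit_c with no cooling.

    Crude assumption: all IT power heats only the air in the room. The
    thermal mass of equipment, floor and walls is ignored, so a real room
    warms more slowly than this estimate.
    """
    if it_load_kw <= 0 or room_volume_m3 <= 0:
        raise ValueError("load and volume must be positive")
    air_mass_kg = room_volume_m3 * AIR_DENSITY_KG_M3
    joules_needed = air_mass_kg * AIR_HEAT_CAP_J_KG_K * (limit_c - start_c)
    seconds = joules_needed / (it_load_kw * 1000.0)
    return seconds / 60.0


# Hypothetical hall: 200 kW of kit in 1000 cubic metres of air.
estimate = minutes_until_too_hot(it_load_kw=200, room_volume_m3=1000)
print(f"roughly {estimate:.1f} minutes of air-only headroom")
```

Even allowing generously for everything the estimate ignores, the batteries are rarely the first thing to run out once the air handling has stopped.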

29th May 2017 | 16:08

#297 (permalink)

yoganmahew

Joined: Aug 2007

Posts: 27

Likes: 0

From: Tullamore

Quote:

Originally Posted by Tight Accountant

Once it has been in long enough for the original staff to have moved on to other projects, it is legacy. If it's 5 years old, it's probably legacy. That the legacy can extend back many years beyond that really just increases the difficulty of finding someone who really understands it!

29th May 2017 | 17:36

#298 (permalink)

Super VC-10

Joined: Jul 2007

Posts: 597

Likes: 0

From: Hadlow

Pilot drove cancer sufferer home.

British Airways boss refuses to resign as Heathrow endures third day of disruption

Last edited by Super VC-10; 29th May 2017 at 17:37. Reason: reason for link

29th May 2017 | 17:43

#299 (permalink)

WHBM

Joined: Oct 2002

Aviation Qualifications: PPL

Posts: 8,201

Likes: 347

From: London UK

Quote:

Originally Posted by Tight Accountant

Is the RBS event (which everyone is referring to), the one where they were fined £42m (thereabouts) by the Regulator?

Speaking of the Regulator, I've yet to hear any comment by the Secretary of State for Transport about this major failure. Although individual MPs are currently not in Parliament while the election goes on, the Ministers continue until the next government, and continue to draw their hefty salaries for the responsibility.

So where's the Rt Hon Mr Grayling then ?

29th May 2017 | 17:48

#300 (permalink)

Stoic

Joined: Jan 2006

Posts: 145

Likes: 0

From: England

And where's Willie?