PPRuNe Forums

PPRuNe Forums (https://www.pprune.org/)
-   Passengers & SLF (Self Loading Freight) (https://www.pprune.org/passengers-slf-self-loading-freight-61/)
-   -   BA delays at LHR - Computer issue (https://www.pprune.org/passengers-slf-self-loading-freight/595169-ba-delays-lhr-computer-issue.html)

MG23 29th May 2017 14:22


Originally Posted by Nialler (Post 9786325)
I would dismiss that rumour as the work of someone who has little clue about large-scale high-availability IT (such skills not being a prerequisite for life).

If BA had high-availability IT, we wouldn't be having this discussion.

That rumour seems dubious to me, but having worked with Indian outsourcing in a previous life, it's not that dubious.

BikerMark 29th May 2017 14:47

I think Alex Cruz is choosing his words carefully. There may well be local staff working on local systems. Those local staff are likely to be onshore Indian contractors. The permanent BA staff have been thinned out greatly by redundancies so TCS contractors are filling the gaps. They will have been onshore for no more than a few months due to visa restrictions.

I also agree the rumour is credible. BA lost the watchlist for a weekend due to incorrect procedures being followed by a contractor.

Many BA IT staff are choosing to take redundancy rather than gamble on a now somewhat flimsy career path. The IAG head of BA IT was once the head of cabin crew during the 2010 dispute. He's determined to get €91m of savings over a three-year period. Just as well; those savings are going to be needed now.

Self Loading Freight 29th May 2017 15:04

I think that rumour is credible too - absent any knowledge of how modern the BA data centre's infrastructure actually is. Many cascading faults in corporate infrastructure wouldn't happen if said infrastructure had no legacy systems; but much corporate infrastructure is heavily dependent on legacy systems. And to those who say it didn't happen that way because it couldn't: OK, so how could a modern, well-engineered, inherently reliable system fail so badly? Because it did.

Nialler 29th May 2017 15:06


Originally Posted by MG23 (Post 9786360)
If BA had high-availability IT, we wouldn't be having this discussion.

That rumour seems dubious to me, but having worked with Indian outsourcing in a previous life, it's not that dubious.

I've refused contracts where one particular outsourcer was involved. No names, obviously. The practice on the very large systems I've worked on has been to keep the crown jewels at home. Functions such as security, systems design and management remain in the home territory, while some application development and support can more easily be taken on by an appropriate outsourcer with the right skills. The problem is that the skills required for very high-end computing are relatively new to India. It's also a mindset. You've found or, worse, caused a problem on one of these machines? First step: push away from the keyboard; second step: ring your teammates and the boss. Be deliberate. Rushing at the problem almost always makes it worse. Declare a disaster if needed. That last part is the problem: everyone is afraid of the word. In decades I've seen that anything short of a 737 crashing into a data centre will not be treated as a disaster. Surely it can be fixed?

It isn't like that. That single message about a pointer error can proliferate rapidly and be compounded by errant efforts to airbrush it away.

Sorry for using a flight forum for going on about this, but I so enjoy reading what the flight jocks have to say that I can't help contributing from inside the climate rooms I haunt.

KelvinD 29th May 2017 15:13

While Senor Cruz is still going on about a power failure being the culprit, it struck me that everyone, including me, has been assuming this refers to the big electric coming through the wall (230V, 440V, 3kV or whatever). If any of these fails, one would expect a UPS to take over immediately (a "no break" supply).
What about the internal power supplies driving the servers etc, i.e. the bits that turn the incoming electricity into 5V, 12V, 24V and so on? If you let the smoke out of these, no amount of UPS backup is going to help.
Just a thought.

Nialler 29th May 2017 15:17


Originally Posted by Self Loading Freight (Post 9786407)
I think that rumour is credible too - absent any knowledge of how modern the BA data centre's infrastructure actually is. Many cascading faults in corporate infrastructure wouldn't happen if said infrastructure had no legacy systems; but much corporate infrastructure is heavily dependent on legacy systems. And to those who say it didn't happen that way because it couldn't: OK, so how could a modern, well-engineered, inherently reliable system fail so badly? Because it did.

My experience, based on decades as a consultant working directly in the fields of resilience, availability, disaster recovery and business continuity planning, is that the legacy systems are always the most robust.

Twiglet1 29th May 2017 15:21

I was at LHR on Saturday trying to use Staff Travel. Clearly with this uppermost in my mind, my experiences:
1. The BA staff at check-in were superb. When the back-up "system" came online they had to share PCs / work in twos and threes to get individual bookings ticketed and tagged. Accessing their systems was slow and time-consuming and familiarity an issue - overall 10 out of 10.

2. Early on they cancelled all Staff Travel to concentrate on fare-paying punters - can't argue with that. It didn't amount to much in practice - the order came from on high - we just hung around as the situation was ever changing.

3. Their manager came round giving directions and updates; he missed a couple of opportunities to say thanks, and I could see they were missed. That was down to pressure I'm sure, but the relationship with the manager is the number one factor in staff engagement.

4. Having checked in at STD -30 mins (after 2hrs 30mins in the queue) we finally got boarding passes and legged it to drop the bags off (bag belts u/s also; is that BA or BAA?).
Got airside - no gate on FFS, just a security man helping.

5. Got to gate 10B; no BA staff for some time. They came and gave a few updates. After a while the BBC News TV at the gate said all flights up to 1800 were cancelled. The gate staff hadn't heard this though (as comms was all down to phones). One particular BA lady kept us updated every 30 mins or so, great communication with as much as she could give - again 10/10. When we gave up she was being hassled by about 20 pax, so we just shouted well done and she thanked us.

6. Getting out was another issue. Everyone exited via gate 12 (hundreds of us) through only two small exit lanes. It could have gone wrong big time, but the old bill helped out. Quick dash through immigration and into the bag hall - BA staff said don't bother waiting, so we just went out.

Apart from the disappointment / long day, the only negative for me was some other BA Staff Traveller trying to "jump" the queue as his flight was going soon (just like the rest).

There is always one, trying to go away on school hols with the kids - well, when it goes wrong you'll only do that once.

And finally, at my work we use mainly Indians and they are better by a country mile.

Nialler 29th May 2017 15:23


Originally Posted by KelvinD (Post 9786415)
While Senor Cruz is still going on about a power failure being the culprit, it struck me that everyone, including me, has been assuming this refers to the big electric coming through the wall (230V, 440V, 3kV or whatever). If any of these fails, one would expect a UPS to take over immediately (a "no break" supply).
What about the internal power supplies driving the servers etc, i.e. the bits that turn the incoming electricity into 5V, 12V, 24V and so on? If you let the smoke out of these, no amount of UPS backup is going to help.
Just a thought.

No. The UPS is more than a bank of batteries. It's an expensive piece of kit which smooths the supply during brownouts, during spikes and during the absence of any power at all. The electrical input to a properly specced enterprise server should never fluctuate.
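For a sense of scale, here's a back-of-envelope runtime sketch - the figures are invented for illustration and are nobody's actual installation:

```python
# Rough UPS runtime estimate - purely illustrative figures, not any real data centre.
battery_kwh = 150.0          # usable energy in the battery string
it_load_kw = 250.0           # IT load carried by the UPS
cooling_kw = 100.0           # if the air-handling is on the UPS as well
inverter_efficiency = 0.95   # losses in the double-conversion path

total_draw_kw = (it_load_kw + cooling_kw) / inverter_efficiency
runtime_minutes = battery_kwh / total_draw_kw * 60
print(f"Roughly {runtime_minutes:.0f} minutes to get the generators on line.")
# => about 24 minutes with these made-up numbers
```

In other words, the batteries only have to bridge the gap until the standby generators pick up the load; conditioning the supply is what the UPS does all day.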

ILS27LEFT 29th May 2017 15:24

Cost cutting...indefinitely
 
From BBC News:
"Earlier this year, Mr Cruz told Skift magazine: "We're always going to be reducing costs... It's now injected into the DNA. If one particular day we don't come up with an idea to reduce our costs, then we're not doing our job."

This global IT mess is the result of the above corporate philosophy.
Lives have been ruined. Millions of pounds wasted.
Constant and indefinite cost cutting is corporate suicide. There is a limit. BA is showing the first signs of this suicide mission. It must be stopped.

Ian W 29th May 2017 15:52


Originally Posted by Nialler (Post 9786411)
I've refused contracts where one particular outsourcer was involved. No names, obviously. The practice on the very large systems I've worked on has been to keep the crown jewels at home. Functions such as security, systems design and management remain in the home territory, while some application development and support can more easily be taken on by an appropriate outsourcer with the right skills. The problem is that the skills required for very high-end computing are relatively new to India. It's also a mindset. You've found or, worse, caused a problem on one of these machines? First step: push away from the keyboard; second step: ring your teammates and the boss. Be deliberate. Rushing at the problem almost always makes it worse. Declare a disaster if needed. That last part is the problem: everyone is afraid of the word. In decades I've seen that anything short of a 737 crashing into a data centre will not be treated as a disaster. Surely it can be fixed?

It isn't like that. That single message about a pointer error can proliferate rapidly and be compounded by errant efforts to airbrush it away.

Sorry for using a flight forum for going on about this, but I so enjoy reading what the flight jocks have to say that I can't help contributing from inside the climate rooms I haunt.

This kind of crash after outsourcing and losing experienced staff has happened before. You would think that CEOs would learn from others' disasters, but apparently not. :ugh:


It was precisely the reason that the patched and kludged together RBS/Nat West banking system fell over...and the 'inexperienced' operatives in Hyderabad were the likely culprits in screwing up an upgrade backout. :D


RBS computer failure 'caused by inexperienced operative in India' - Telegraph
https://www.theregister.co.uk/2012/0...at_went_wrong/

This kind of thing should never, ever happen, but if you are unaware of the particular foibles of what is otherwise a fully fault-tolerant system, it can be surprisingly easy to break it when you have full SysAdmin privileges and get finger trouble while trying to stop the system going down.
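As an aside, the usual mitigation for that sort of finger trouble is a guard-rail that forces a pause before anything irreversible. A minimal sketch, with invented command names and an assumed two-person rule (not any particular vendor's tooling):

```python
# Sketch of a guard-rail against "finger trouble" with full admin rights.
# The command names and the two-person rule are invented for illustration.
from typing import Optional

DESTRUCTIVE = {"shutdown_site", "failover_datacentre", "restore_from_backup"}

def run_admin_command(command: str, approver: Optional[str] = None) -> str:
    if command in DESTRUCTIVE:
        if approver is None:
            # Force a second pair of eyes before anything irreversible.
            return f"REFUSED: '{command}' needs a named second approver."
        return f"EXECUTING '{command}' (approved by {approver}, action logged)."
    return f"EXECUTING '{command}'."

print(run_admin_command("failover_datacentre"))
print(run_admin_command("failover_datacentre", approver="duty manager"))
```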

Heathrow Harry 29th May 2017 15:55

"Constant and indefinite cost cutting is a corporate suicide. There is a limit."

Not so - every organisation should always be looking to cut costs - working practices change, technology changes.

BUT you have to do it while maintaining or improving the product - cost cutting as the sole driver is very, very bad business practice.

Tight Accountant 29th May 2017 15:56


Originally Posted by aox (Post 9786229)
"a friend told me".

I really dislike non-attributable stories. Sure, cock-ups occur in business all the time and I've seen plenty by Accountants and Non-Accountants alike.

yoganmahew 29th May 2017 15:58


Originally Posted by Nialler (Post 9786325)
I would dismiss that rumour as the work of someone who has little clue about large-scale high-availability IT (such skills not being a prerequisite for life). Systems are not patched/tested on production environments (for the record, a failback mirror site is most certainly a production system). There will be a chain of systems for testing, from a sandpit environment, through test, development, pre-production and user acceptance testing, then production itself. These types of fixes are usually completely dynamic, but those that require restarts require only that the operating system be restarted - not the actual hardware. There should be no power issues, and certainly none where remotely distinct sites are involved.


Finally, given that they're still on a background of TPF, the machines running TPF are typically z-Series enterprise servers from IBM, i.e. designed with internal redundancy and with continuous uptime as one of the core aspects of their architecture. Their power requirements have shrunk from a time when, yes, the airport lights might flicker as the beast was woken up, through to today's models, which are CMOS-based and run off little more than a kettle connection. The mean time between failures on these machines is measured in years. They do not fail in the type of circumstances described.


Thanks for posting it, though.

Hi Nialler. The rumour is not suggesting that the patch itself was faulty, just that the restart procedure was inadequately careful.

BA have no TPF, neither on their own site nor, if Amadeus are to be believed, in the underlying Amadeus architecture. This, I'm afraid, is all 'modern' stuff with hundreds of boxen each performing a trivial proportion of the overall workload.

If the fix is the SMB patch for WannaCry applied to the server, it could require an OS restart, not just an application restart (depending on the OS). Even if it didn't, hundreds of applications starting up will draw more power as they reload, rebuild caches etc.
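If a mass restart really was involved, the textbook way to avoid that inrush is to bring services back in small, staggered batches rather than all at once. A minimal sketch - the service names, batch sizes and delays are all invented:

```python
# Minimal staggered-restart sketch - invented service list and timings,
# just to illustrate avoiding a load spike when everything comes back at once.
import random
import time

services = [f"app-{n:03d}" for n in range(1, 11)]  # pretend fleet of services

def staggered_restart(names, batch_size=3, pause_s=5.0, jitter_s=2.0):
    """Restart in small batches with a pause, so caches rebuild gradually."""
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        for name in batch:
            print(f"restarting {name}")  # real code would call the orchestrator here
        time.sleep(pause_s + random.uniform(0, jitter_s))

staggered_restart(services)
```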

Anyway, the whole thing is so unclear, and this from a man who claims to be digital to the core, that you have to think it was something enormously f'd up.

rideforever 29th May 2017 15:59


Originally Posted by ILS27LEFT (Post 9786425)
Constant and indefinite cost cutting is corporate suicide. There is a limit. BA is showing the first signs of this suicide mission. It must be stopped.

Yes, our whole species is undergoing this new philosophy. And it is failing, but nobody notices ... what people notice are the promises of extreme savings and extreme profits.
At the same time the lack of challenge in our society is creating a new level of incompetence. Not only is there incompetence, but nobody really cares. Conscience is a long way away.
Does anyone have a goal except mortgage payments and Facebook?
Without a goal there is no reason to do any more than this.
Why did Victorian engineers build so well?
What was inside them that is not inside this generation?
Quite a lot, methinks.

Ian W 29th May 2017 15:59


Originally Posted by Nialler (Post 9786424)
No. The UPS is more than a bank of batteries. It's an expensive piece of kit which smooths the supply during brownouts, during spikes and during the absence of any power at all. The electrical input to a properly specced enterprise server should never fluctuate.

Many of the major systems I have dealt with run on batteries all the time. The choice is just which source keeps them trickle-charged: grid, standby grid, or standby generators.

Tight Accountant 29th May 2017 15:59


Originally Posted by Ian W (Post 9786440)
It was precisely the reason that the patched and kludged together RBS/Nat West banking system fell over...and the 'inexperienced' operatives in Hyderabad were the likely culprits in screwing up an upgrade backout. :D

RBS computer failure 'caused by inexperienced operative in India' - Telegraph
https://www.theregister.co.uk/2012/0...at_went_wrong/

Ian - I don't know your background but I understand that RBS has a whole host of legacy systems which need considerable TLC. It will be interesting to understand whether legacy systems fell over at BA; I suspect not.

Nialler 29th May 2017 16:01


Originally Posted by Ian W (Post 9786440)
This kind of crash after outsourcing and losing experienced staff has happened before. You would think that CEOs would learn from others' disasters, but apparently not. :ugh:


It was precisely the reason that the patched and kludged together RBS/Nat West banking system fell over...and the 'inexperienced' operatives in Hyderabad were the likely culprits in screwing up an upgrade backout. :D


RBS computer failure 'caused by inexperienced operative in India' - Telegraph
https://www.theregister.co.uk/2012/0...at_went_wrong/

This kind of thing should never, ever happen, but if you are unaware of the particular foibles of what is otherwise a fully fault-tolerant system, it can be surprisingly easy to break it when you have full SysAdmin privileges and get finger trouble while trying to stop the system going down.

I've seen the complete history which led to the problem. The problem was most certainly not one caused by outsourcing. The ad referenced in the article is for an experienced admin. That is way below the level where this problem occurred. I've spoken with the principals involved.

Nialler 29th May 2017 16:08


Originally Posted by Ian W (Post 9786448)
Many of the major systems I have dealt with run on batteries all the time. The choice is just which source keeps them trickle-charged: grid, standby grid, or standby generators.

Exactly. They're not just there for hard outages. In the past, with older mainframes, I have wanted to limit the UPS to twenty minutes. If the air-handling is gone too, a system will cook itself.
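Rough arithmetic on why twenty minutes of UPS is plenty once the air-handling has stopped - invented figures, and it only counts the air, so a real room with lots of metal in it warms more slowly:

```python
# Back-of-envelope only: how fast a hall warms with no cooling.
# Invented figures; real rooms have far more thermal mass than the air alone.
it_load_kw = 200.0         # heat dumped into the room once the CRACs stop
room_volume_m3 = 1000.0    # hall air volume
air_density = 1.2          # kg/m^3
air_heat_capacity = 1.005  # kJ/(kg*K)
allowed_rise_k = 15.0      # say 22C ambient up to a ~37C alarm threshold

heat_capacity_kj_per_k = room_volume_m3 * air_density * air_heat_capacity
seconds_to_limit = heat_capacity_kj_per_k * allowed_rise_k / it_load_kw
print(f"Air alone hits the limit in about {seconds_to_limit / 60:.1f} minutes.")
# => about 1.5 minutes with these numbers; the kit cooks long before a big
#    battery string runs flat.
```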

yoganmahew 29th May 2017 16:08


Originally Posted by Tight Accountant (Post 9786449)
Ian - I don't know your background but I understand that RBS has a whole host of legacy systems which need considerable TLC. It will be interesting to understand whether legacy systems fell over at BA; I suspect not.

Once it has been in long enough for the original staff to have moved on to other projects, it is legacy. If it's 5 years old, it's probably legacy. That the legacy can extend back many years beyond that just increases the difficulty of finding someone who really understands it!

Super VC-10 29th May 2017 17:36

Pilot drove cancer sufferer home.

British Airways boss refuses to resign as Heathrow endures third day of disruption

