BA delays at LHR - Computer issue

Old 27th May 2017, 20:57
  #81 (permalink)  
 
Join Date: Jul 2009
Location: Surrey, UK
Posts: 130
Likes: 0
Received 0 Likes on 0 Posts
Currently semi-stranded at PHL. Should have been on BA66 to LHR. No attempt made by BA to make contact by phone, e-mail or text, as apparently all those systems are down too. If that really is the case then IMHO their operating licence should be taken away, as there are no contingency plans that work. At least I appear to have been rebooked onto AA, but downgraded to Economy and no compensation offered, so I will have to claim myself if/when I get home. You'd have thought they could at least have been proactive with compensation.
HamishMcBush is offline  
Old 27th May 2017, 21:02
  #82 (permalink)  
 
Join Date: Jan 2016
Location: Going left then going right
Posts: 101
Likes: 0
Received 0 Likes on 0 Posts
...and BA400 off Heathrow 21:55 BST to Brussels, some three hours late.

Reminds me of the British Rail advert: "We're getting there".
nohold is offline  
Old 27th May 2017, 21:07
  #83 (permalink)  
 
Join Date: Feb 2001
Location: The Winchester
Posts: 6,550
Received 5 Likes on 5 Posts
Planefinder shows BA215 in the air, off Heathrow circa 21:40 BST and bound for Boston.
Possibly positioning empty (to at least have an aircraft in position at BOS for the return sector).
wiggy is online now  
Old 27th May 2017, 21:10
  #84 (permalink)  
 
Join Date: Feb 2001
Location: UK
Posts: 223
Likes: 0
Received 0 Likes on 0 Posts
Highly likely that the flights going out now are empty and BA is just trying to get aircraft and crew in the right place.
gordonroxburgh is offline  
Old 27th May 2017, 21:11
  #85 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by xyzzy
I did operational IT for a living, rising to CIO, before I went into academia. I got bored with being told that filesystems and databases wouldn't stand sudden stops (ACID properties, right?). I was expected to buy exotic database products from Larry Ellison, tended by smug contractors who had a million and one reasons Postgres just wouldn't do. So I made it a point of acceptance testing from development into production that the systems I was expected to run had to survive a sudden stop. Salesmen from Oracle and NetApp talk about journalling, so let's see it: we're going to flip the power at the time of our choice during your testing, and your product will survive it, and we'll do it again a few times for fun, or you can all go back to your offices and fix it. It's not the 1990s, and fsck isn't a thing any more. They'd whinge and whine that I should be doing an orderly shutdown, but I genuinely meant "I will go into the development lab and flip switches at random".

I flushed out any number of problems with this approach.
I can remember a 7-day, 24-hours-a-day test cycle where we ran the system under load, then went in and killed the primary and then the backup processes, confirming that the system kept running. We then crashed hardware and confirmed that the system kept running, that the error messages led the support engineer to the right fault, and that the documented recovery procedure worked.
Like you, I had to insist on that level of testing, including overload testing.
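For anyone curious what that kind of test looks like in practice, here is a minimal Python sketch of the idea, purely illustrative (the health-check URL and the process list are made up, not anything BA or anyone here actually ran): put the system under observation, hard-kill an instance at random, and fail the test if the service doesn't answer again within a recovery window.

Code:
import random
import subprocess
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"   # hypothetical health-check endpoint

def service_alive(timeout=2):
    """True if the service still answers its health check."""
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def crash_test(instance_pids, rounds=5, recovery_window=30):
    """Kill a random instance each round; the service must recover every time."""
    for i in range(rounds):
        victim = random.choice(instance_pids)
        # Simulate a sudden stop, not a clean shutdown
        subprocess.run(["kill", "-9", str(victim)], check=False)
        deadline = time.time() + recovery_window
        while time.time() < deadline:
            if service_alive():
                print(f"round {i}: killed pid {victim}, service recovered")
                break
            time.sleep(1)
        else:
            raise AssertionError(f"round {i}: service did not recover within {recovery_window}s")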

Originally Posted by anson harris
Perhaps the management should read this thread for some expert advice on how to run their IT systems - it seems that the world's supply of experts' opinions are here for the taking.
I don't know that it is the world's supply of opinions; there was a time when this level of understanding was common knowledge. Unfortunately, in the same way that manual flying skills are not valued by MBA management, the same applies to the architecture and design of the computer systems that (as has been shown) are essential to the company's operations.

There is NO excuse for a company the size of BA / IAG not to have mirrored redundant systems, ideally three, in widely separated locations. The systems should all be sized to support the entire operation, so that any one can take over standalone if necessary. Under normal operations they load-share, providing excellent transaction times.
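Purely as an illustration of that load-share-with-failover idea (the endpoints below are invented, and a real deployment would do this with DNS/global load balancing rather than client code), a minimal Python sketch might look like this:

Code:
import random
import urllib.request

# Three hypothetical sites, each sized to carry the entire operation alone.
SITES = [
    "https://dc1.example.com",
    "https://dc2.example.com",
    "https://dc3.example.com",
]

def call(path, timeout=3):
    """Load-share across all sites; if one is down, fall back to the survivors."""
    last_error = None
    for site in random.sample(SITES, k=len(SITES)):   # spread requests in normal operation
        try:
            with urllib.request.urlopen(site + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:                        # that site is unreachable: try the next
            last_error = exc
    raise ConnectionError(f"all sites failed: {last_error}")

# e.g. call("/api/booking/ABC123") keeps working as long as any one site survives.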

It beggars belief that any modern company would put all its IT eggs in one basket, dependent on a single power supply or a single set of switchover gear. Yet we have now had both BA and Delta exhibit the same lack of foresight. With the EU rules on compensation this will be a huge cost to BA. They could have had a reliable computer system for a lot less than this will cost them.
Ian W is offline  
Old 27th May 2017, 21:17
  #86 (permalink)  
 
Join Date: Mar 2007
Location: Another Planet.
Posts: 559
Likes: 0
Received 0 Likes on 0 Posts
Redundancy Capability.

Originally Posted by OldLurker
Murphy can bite you in the ankle any time. One company I worked with seemed to have done most things right, including two mains feeds coming into the building from two separate grids (three would have been better, but there wasn't a third available). What they didn't realise was that in one place in the street the contractors had run the two cables side by side, in a shallow conduit because of some obstruction underneath. Inevitably some fool with a backhoe began to dig a hole in just that place (he should have been on the other side of the street) and chopped both cables at once ... Thankfully the company did have not only a UPS that worked, but also a backup generator that started on demand!
Smooth-talking female NATS spokesperson seen recently on UK national media, bragging about how the new LCY tower control has three separate cables linking the airport to NATS Swanwick. Anyone like to start the betting on when this brilliant IT innovation will fall over at the critical moment?
BARKINGMAD is offline  
Old 27th May 2017, 21:24
  #87 (permalink)  
 
Join Date: Dec 2011
Location: .
Posts: 130
Likes: 0
Received 0 Likes on 0 Posts
Doesn't say much for BA's business continuity plan. What happens if their data centre burns down? Are BA ISO 27001 certified? I believe a tested BCP is a requirement. If so, why wasn't this picked up by an ISO audit?
t1grm is offline  
Old 27th May 2017, 21:28
  #88 (permalink)  
 
Join Date: Jan 2016
Location: Going left then going right
Posts: 101
Likes: 0
Received 0 Likes on 0 Posts
Dunno about empty ferry flights, BRU airport website shows...

21:00 BA400 London Heathrow Delayed 23:34
nohold is offline  
Old 27th May 2017, 21:35
  #89 (permalink)  
Thread Starter
 
Join Date: Apr 2010
Location: London
Posts: 7,072
Likes: 0
Received 0 Likes on 0 Posts
A year back we were travelling on Eurostar when some poor b****** threw themselves in front of a train = total stop.

Within 15 minutes St Pancras was full of Eurostar employees, all with yellow jackets on, all talking to the benighted travellers. Did they have all the answers? No. Were they helpful? Too damn true. They had tablets, they had numbers to call, and when you called them the staff were up to speed and were rebooking everyone.

We eventually left six hours late, but we had a definite slot and were able to dump the bags and go into central London. Good? Not perfect, but in terms of response a zillion miles ahead of any airline I've ever travelled on.
Heathrow Harry is offline  
Old 27th May 2017, 21:38
  #90 (permalink)  
 
Join Date: Feb 2001
Location: The Winchester
Posts: 6,550
Received 5 Likes on 5 Posts
Nohold

Dunno about empty ferry flights, BRU airport website shows...
I honestly have no idea what the status of the BRU flight is, but BA did announce (at 16:45 BST) that all passenger services ex LHR and LGW were cancelled, and I don't think there will be any attempt to resurrect normal service ex London until the morning. I do know that at least some long-haul services are most definitely departing empty tonight for positioning purposes; some have already left. I suspect similar might be done with some of the short-haul fleet.
wiggy is online now  
Old 27th May 2017, 21:39
  #91 (permalink)  
 
Join Date: May 2008
Location: Paris
Age: 60
Posts: 101
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by RevMan2
Any decent data centre has an array of batteries that kick in as soon as one of the main power supplies (you'll normally have 3) fails and keep the machines running until the whacking great diesel generator (kept at operating temperature) takes over. It'll have fuel for the next 48 hours.
And you'll have your core systems mirrored.
This is industry standard.
Err, no. If your climate management has also failed you *want* your system to go down before it cooks itself and turns its drives into pizza.

All of my clients have geographically dispersed systems. Great in the event of a physical disaster at one site; the first fallback will kick in seamlessly. If needed, the third is there also. The problem is not that jumbo flying into datacentre one. That's easy to deal with; DC2 will take over, right?

Jumbo jets don't have a habit of flying into datacentres. The real problem is logical errors: pointer corruption in sophisticated relational databases, that type of thing. Once they happen at one site they are faithfully replicated to the mirror sites, and everything becomes useless at a stroke. Sysplexing very large systems only protects against large physical failures; it does not protect against logical ones.
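To illustrate that point (a toy model, not a description of BA's or anyone's real replication setup): synchronous mirroring copies a corrupt write to the mirror instantly, which is why shops that worry about logical errors also keep a deliberately lagged replica or point-in-time backups.

Code:
from collections import deque

class MirroredStore:
    """Toy model: primary + synchronous mirror + a replica that applies writes late."""
    def __init__(self, lag=1):
        self.primary, self.mirror, self.lagged = {}, {}, {}
        self._queue = deque()
        self.lag = lag

    def write(self, key, value):
        self.primary[key] = value
        self.mirror[key] = value             # the mirror faithfully copies the write, good or bad
        self._queue.append((key, value))
        while len(self._queue) > self.lag:   # the lagged replica only sees older writes
            k, v = self._queue.popleft()
            self.lagged[k] = v

store = MirroredStore(lag=1)
store.write("rec:13A", "SMITH/J")   # good write
store.write("rec:13A", None)        # a "pointer error": logically corrupt write
print(store.mirror["rec:13A"])      # None    -> the mirror is just as broken as the primary
print(store.lagged["rec:13A"])      # SMITH/J -> the lagged copy still holds the good data, for now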
Nialler is offline  
Old 27th May 2017, 21:44
  #92 (permalink)  
 
Join Date: May 2008
Location: Paris
Age: 60
Posts: 101
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by t1grm
Doesn't say much for BA's business continuity plan. What happens if their data centre burns down? Are BA ISO 27001 certified? I believe a tested BCP is a requirement. If so, why wasn't this picked up by an ISO audit?
Nobody calls a disaster situation. They're afraid to. Twenty years in the business taught me that; when I moved into disaster recovery it was my first lesson.
Nialler is offline  
Old 27th May 2017, 22:01
  #93 (permalink)  
 
Join Date: Mar 2015
Location: XFW, Germany
Posts: 128
Likes: 0
Received 0 Likes on 0 Posts
Since the Delta incident was mentioned, along with all the talk about redundant power: James Hamilton put up a nice article the other day on what had happened (and, for the real pros, how to avoid it):
At Scale, Rare Events aren't Rare - Perspectives

(and even ultra-redundant Amazon AWS managed to take down its S3 service for over four hours last month)

The real disaster was the time to recover, plus the horrific communications.
PAXfips is offline  
Old 27th May 2017, 22:04
  #94 (permalink)  
 
Join Date: Aug 2007
Location: Tullamore
Posts: 27
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Heathrow Harry
CARR30 - you mean they are supposed to think of BA's customers? Every IT outfit I've ever dealt with does major upgrades over the weekend - when it doesn't get in the way of the real customers, who are the BA management. Long weekends are kept for BIG jobs.... It's like the railways - always close the lines over Christmas, Easter, weekends etc.
It's not the way airline IT works, or rather, not the way it should work. All work, barring emergency changes, should have been frozen for the weekend, as it's both busy and a holiday weekend. Airline IT avoids busy times for change as best practice.

Mind you, "BA IT" and "best practice" don't sit easily in the same sentence...
yoganmahew is offline  
Old 27th May 2017, 22:11
  #95 (permalink)  
 
Join Date: Aug 2002
Location: Europe
Posts: 4
Likes: 0
Received 0 Likes on 0 Posts
Extreme cost cutting?

From BBC:
"The GMB union has suggested the failure could have been avoided, had the airline not outsourced its IT work."
BA denied the claim, saying: "We would never compromise the integrity and security of our IT systems".

I do not trust large corporations anymore; common sense seems to have completely disappeared. Today is another example of the total failure of modern management.
I do not know whether BA or the union is telling the truth; however, I do know that BA has officially stated the cause as a power failure. If that is true then it is totally unacceptable: a power failure cannot cause a total, catastrophic IT crash like this one. That is why there are numerous power back-up solutions and BCP plans.
The real problem is the modern top-management theory of constant 'improvement' at all costs: constant reduction of costs, indefinitely. That leads to total failures (with enormous impact on a company's credibility and financial health) simply because the basics, like a back-up for a power outage, are missed, all in the name of beautiful and colourful PowerPoint slides and artificial stats created to earn astonishing bonuses, nearly always directly linked to cost reductions (as a percentage, in the best cases!). We end up with corporate greed as the main foundation behind critical decisions.

I would not be surprised if this incredible failure is the direct result of the extreme cost-cutting measures of recent months and years; modern management theories can seriously turn against an entire organisation if not correctly managed from the top.
It is quite simple: we cannot keep cutting costs indefinitely and then be surprised when we suffer this type of failure. It is a company choice.
A cost-cutting exercise can become incredibly expensive indeed.

I still hope that this was not a power outage. It cannot be.

Simply scary.
ILS27LEFT is offline  
Old 27th May 2017, 22:13
  #96 (permalink)  
 
Join Date: Aug 2007
Location: Tullamore
Posts: 27
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by MG23
Plenty of companies have huge IT infrastructure without these kinds of problems. Netflix, for example, has a policy of constant testing by randomly making its servers crash and ensuring that nothing bad happens when they do. As I understand it, the only thing they're 100% reliant on is Amazon staying up in at least one region of the world.
Netflix doesn't care if two people watch the same movie at once. Try sitting two people in 13A, or sending a fuzzy-logic APIS list. Airline IT is about perishable, real stuff. You can't just order some more, or turn around and say "oops, you know we said we'd deliver that by Tuesday, well, we meant Tuesday next month"... Unfortunately, senior IT management in airlines, often not being airline born and bred, seems unaware of this.
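A tiny illustration of the difference (hypothetical, not any airline's actual inventory code): seat assignment has to be an atomic check-and-set, where the second request is refused on the spot rather than "eventually" reconciled.

Code:
import threading

class SeatMap:
    """Toy seat inventory: exactly one passenger may hold a seat."""
    def __init__(self, seats):
        self._seats = {s: None for s in seats}   # seat -> passenger, or None if free
        self._lock = threading.Lock()

    def assign(self, seat, passenger):
        """Atomic check-and-set: succeed only if the seat is still free."""
        with self._lock:
            if self._seats.get(seat) is None:
                self._seats[seat] = passenger
                return True
            return False

seats = SeatMap(["13A", "13B"])
print(seats.assign("13A", "SMITH/J"))   # True  - first request wins
print(seats.assign("13A", "JONES/A"))   # False - must be rejected immediately, not sorted out later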
yoganmahew is offline  
Old 27th May 2017, 22:15
  #97 (permalink)  
 
Join Date: Aug 2007
Location: Tullamore
Posts: 27
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by lamer
More a lack of professionalism in the budgeting department.

Looks to me like their mainframe went down.
All tracking of Planes, Passengers, Baggage, Freight, Meals, Maintenance, Flight Planning and on and on and on will be down.

CIA, NSA, MIA, KGB, bla bla bla take a dim view of Flights leaving without prior notification of browsing history, credit card details and so on of each person on board.

Mainframes generally take many hours to get up and running again once the problem is identified and resolved. Many hundreds of subsystems need to be individually started in the right sequence and verified for proper operation.

You being unable to purchase a ticket is the last of anybody's worries.

Last outage I saw (other BIG player in Europe) cost more than €10 million; the previous outage was more than 14 years ago. Cost of a parallel backup system: €40 million just to set up.

You do the math ....
BA no longer have a mainframe.
When they did, the cycle time to restart was under 5 minutes.
Look up the TPF system; it defined fault-tolerant, antifragile transaction processing.
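On the quoted point about hundreds of subsystems needing to come up "in the right sequence": that is essentially a dependency-ordering problem. A hypothetical sketch (the subsystem names and dependencies are invented) using Python's standard topological sort:

Code:
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Invented example: each subsystem starts only after the ones it depends on are up.
DEPENDENCIES = {
    "network":      [],
    "storage":      [],
    "database":     ["storage"],
    "messaging":    ["network"],
    "reservations": ["database", "messaging"],
    "check_in":     ["reservations"],
    "baggage":      ["reservations", "messaging"],
}

def start_and_verify(subsystem):
    # A real restart would launch the subsystem and block on its health checks here.
    print(f"starting {subsystem} ...")

for subsystem in TopologicalSorter(DEPENDENCIES).static_order():
    start_and_verify(subsystem)   # e.g. storage and network first, baggage last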
yoganmahew is offline  
Old 27th May 2017, 22:17
  #98 (permalink)  
 
Join Date: Mar 2015
Location: XFW, Germany
Posts: 128
Likes: 0
Received 0 Likes on 0 Posts
Much kudos to the above posters doing real testing. Nowadays it seems to be "enough" to have the redundancy on glossy paper, and when the SHTF it's "it didn't work out - let's sue/blame!".

If someone comes up with "do not push this button NOW", I'll do exactly that. Resilient systems are hard, period. Will aviation IT ever learn from aviation flight ops (and its "crash reports"/post-mortems)?

Will we ever see public reports about what exactly happened? (Compare with AWS, who put out a very detailed report showing they had "missed" the human in the loop.)
PAXfips is offline  
Old 27th May 2017, 22:21
  #99 (permalink)  
 
Join Date: Aug 2007
Location: Tullamore
Posts: 27
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by anson harris
Perhaps the management should read this thread for some expert advice on how to run their IT systems - it seems that the world's supply of experts' opinions are here for the taking.

It's not that difficult... it's just not cheap.
yoganmahew is offline  
Old 27th May 2017, 22:35
  #100 (permalink)  
 
Join Date: May 2008
Location: Paris
Age: 60
Posts: 101
Likes: 0
Received 0 Likes on 0 Posts
To IPL a mainframe shouldn't take more than five to ten minutes. If you want to cold-start all systems and subsystems it may take a bit longer, but I'd expect my mainframe to IPL quicker than my smartphone.
Nialler is offline  

