PPRuNe Forums

PPRuNe Forums (https://www.pprune.org/)
-   Passengers & SLF (Self Loading Freight) (https://www.pprune.org/passengers-slf-self-loading-freight-61/)
-   -   BA delays at LHR - Computer issue (https://www.pprune.org/passengers-slf-self-loading-freight/595169-ba-delays-lhr-computer-issue.html)

HamishMcBush 27th May 2017 20:57

Currently semi-stranded at PHL. Should have been on BA66 to LHR. No attempt made by BA to make contact by phone, e-mail or text as apparently all those systems are down too. If that really is the case then IMHO their operating licence should be taken away as there are no contingency plans that work. At least I appear to have been rebooked onto AA, but downgraded to Economy and no compensation offered, so I will have to claim myself if/when I get home. You'd have thought they could at least have been proactive with compensation.

nohold 27th May 2017 21:02

...and BA400 off Heathrow 21:55 BST to Brussels, some three hours late.

Reminds me of the British Rail advert: "We're getting there."

wiggy 27th May 2017 21:07


Planefinder shows BA215 in the air, off Heathrow circa 21:40 BST, bound for Boston.
Possibly positioning empty (to at least have an aircraft in position in BOS for the return sector).

gordonroxburgh 27th May 2017 21:10

Highly likely that the flights going off now are empty and just trying to get aircraft / crew in the right place.

Ian W 27th May 2017 21:11


Originally Posted by xyzzy (Post 9784691)
I did operational IT for a living, rising to CIO, before I went into academia. I got bored with being told that filesystems and databases wouldn't stand sudden stops (ACID properties, right?). I was expected to buy exotic database products from Larry Ellison, tended by smug contractors who had a million and one reasons Postgres just wouldn't do. So I made it a point of acceptance testing from development into production that the systems I was expected to run had to survive a sudden stop. Salesmen from Oracle and NetApp talk about journalling, so let's see it: we're going to flip the power at the time of our choice during your testing, and your product will survive it, and we'll do it again a few times for fun, or you can all go back to your offices and fix it. It's not the 1990s, and fsck isn't a thing any more. They'd whinge and whine that I should be doing an orderly shutdown, but I genuinely meant "I will go into the development lab and flip switches at random".

I flushed out any number of problems with this approach.

I can remember a 7-day, 24-hours-a-day test cycle where we ran the system under load, went in and deleted primary and then backup processes, and confirmed that the system kept running. We then crashed hardware and confirmed that the system kept running, that the error messages led the support engineer to the right fault, and that the documented recovery worked.
Like you, I had to insist on that level of testing, including overload testing.
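
For anyone who hasn't seen this style of testing, here is a minimal, POSIX-flavoured sketch of the idea — all names and numbers are hypothetical, not anyone's real test harness: run a primary/backup pair, SIGKILL the primary mid-run, and assert that the backup notices the dead heartbeat and takes over.

```python
import multiprocessing
import os
import signal
import tempfile
import time

STALE_AFTER = 1.0  # seconds without a heartbeat before the backup promotes itself


def worker(role: str, heartbeat_path: str) -> None:
    """Primary writes heartbeats; backup watches them and takes over if they stop."""
    active = (role == "primary")
    while True:
        if active:
            tmp = heartbeat_path + ".tmp"
            with open(tmp, "w") as f:
                f.write(str(time.time()))
            os.replace(tmp, heartbeat_path)  # atomic: readers never see a partial write
        else:
            try:
                last = float(open(heartbeat_path).read())
            except (FileNotFoundError, ValueError):
                last = 0.0
            if time.time() - last > STALE_AFTER:
                active = True  # primary looks dead: promote ourselves
        time.sleep(0.2)


if __name__ == "__main__":
    hb = os.path.join(tempfile.mkdtemp(), "heartbeat")
    with open(hb, "w") as f:
        f.write(str(time.time()))  # seed a heartbeat so the backup starts passive
    primary = multiprocessing.Process(target=worker, args=("primary", hb))
    backup = multiprocessing.Process(target=worker, args=("backup", hb))
    primary.start()
    backup.start()
    time.sleep(1.0)                       # let the pair reach steady state
    os.kill(primary.pid, signal.SIGKILL)  # the "walk in and flip the switch" moment
    time.sleep(3 * STALE_AFTER)           # give the backup time to notice and promote
    assert time.time() - float(open(hb).read()) < STALE_AFTER, "backup never took over"
    print("backup promoted itself; the service survived the kill")
    backup.terminate()
```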

Originally Posted by anson harris (Post 9784707)
Perhaps the management should read this thread for some expert advice on how to run their IT systems - it seems that the world's supply of experts' opinions are here for the taking.
I don't know that it is the world's supply of opinions. There was a time when this level of understanding was common knowledge. Unfortunately, in the same way that manual flying skills are not valued by MBA management, the same applies to the architecture and design of the computer systems that (as has been shown) are essential to the company's operations.

There is NO excuse for a company the size of BA / IAG not to have mirrored redundant systems, ideally three, in widely separated locations. The systems should all be sized to support the entire operation, so any one can take over 'standalone' if necessary. Under normal operations they load-share, providing excellent transaction times.

It beggars belief that any modern company would put all its IT eggs in one basket dependent on one power supply or switchover gear. Yet now we have had both BA and Delta exhibit the same lack of foresight. With the EU rules on compensation this will be a huge cost to BA. They could have had a reliable computer system for a lot less than it will cost them.
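
For illustration only, a minimal sketch of that load-sharing arrangement, with hypothetical site URLs: requests rotate across all sites under normal operation, and if a site stops answering, the survivors absorb its share — down to any single site running standalone.

```python
import itertools
import urllib.request

SITES = [  # hypothetical, widely separated datacentres
    "https://dc-east.example.com",
    "https://dc-west.example.com",
    "https://dc-north.example.com",
]

_rr = itertools.count()  # round-robin counter for normal load sharing


def call_service(path: str, timeout: float = 2.0) -> bytes:
    """Spread requests across all sites; fail over to survivors if one dies."""
    start = next(_rr)
    last_error = None
    for i in range(len(SITES)):
        site = SITES[(start + i) % len(SITES)]  # rotate the starting site
        try:
            with urllib.request.urlopen(site + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:  # refused, timed out, unreachable...
            last_error = err    # this site is down; the next mirror absorbs its share
    raise RuntimeError("all sites unreachable") from last_error
```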

BARKINGMAD 27th May 2017 21:17

Redundancy Capability.
 

Originally Posted by OldLurker (Post 9784664)
Murphy can bite you in the ankle any time. One company I worked with seemed to have done most things right, including two mains feeds coming into the building from two separate grids (three would have been better, but there wasn't a third available). What they didn't realise was that in one place in the street the contractors had run the two cables side by side, in a shallow conduit because of some obstruction underneath. Inevitably some fool with a backhoe began to dig a hole in just that place (he should have been on the other side of the street) and chopped both cables at once ... Thankfully the company did have not only a UPS that worked, but also a backup generator that started on demand!

Smooth-talking female NATS spokesperson seen recently on UK national media bragging about how the new LCY tower control has 3 separate cables linking the airport to Swanwick NATS. Anyone like to start the betting as to when this brilliant IT innovation will fall over at the critical moment??

t1grm 27th May 2017 21:24

Doesn't say much for BA's business continuity plan. What happens if their data centre burns down? Are BA ISO2001 certified? I believe a tested BCP is a requirement. If so why wasn't this picked up by an ISO audit?

nohold 27th May 2017 21:28

Dunno about empty ferry flights, BRU airport website shows...

21:00 BA400 London Heathrow Delayed 23:34

Heathrow Harry 27th May 2017 21:35

A year back we were travelling on Eurostar when some poor b****** threw themselves in front of a train = total stop.

Within 15 minutes St Pancras was full of Eurostar employees - all with yellow jackets on, all talking to the benighted travellers. Did they have all the answers? No. Were they helpful? Too damn true. They had tablets, they had numbers to call - and when you called them, the staff were up to speed and were rebooking everyone.

We eventually left 6 hours late - but we had a definite slot and were able to dump the bags and go into central London. Good? Not perfect - but in terms of response, a zillion miles ahead of any airline I've ever travelled on.

wiggy 27th May 2017 21:38

Nohold


Dunno about empty ferry flights, BRU airport website shows...
I honestly have no idea what the status of the BRU but BA did announce (at 1645 BST) that all passenger services ex LHR and LGW were cancelled, I don't think there's going to be any attempt to resurrect normal service ex London until the AM. I do know that at least some Longhaul services are most definitely departing empty for positioning purposes tonight -some have already left. I suspect similar might be done with some of the Shorthaul fleet.

Nialler 27th May 2017 21:39


Originally Posted by RevMan2 (Post 9784656)
Any decent data centre has an array of batteries that kick in as soon as one of the main power supplies (you'll normally have 3) fails, and keep the machines running until the whacking great diesel generator (kept at operating temperature) takes over. It'll have fuel for the next 48 hours.
And you'll have your core systems mirrored.
This is industry standard.

Err, no. If your climate management has also failed you *want* your system to go down before it cooks itself and turns its drives into pizza.

All of my clients have geographically dispersed systems. Great in the event of a physical disaster at one site; the first fallback will kick in seamlessly. If needed, the third is there also. The problem is not that Jumbo flying into Datacentre One. That's easy to deal with. DC2 will take over, right?

Jumbo jets don't have a habit of flying into datacentres. The real problem is logical errors - pointer errors in sophisticated relational databases, that type of thing. Once they happen at one site they are faithfully replicated to the mirror sites. All becomes useless at a stroke. Sysplexing very large systems only protects against large physical failures; logical ones are not protected against.
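
A toy sketch of that distinction, with illustrative names only: synchronous mirroring faithfully copies the corrupting write to every live replica, so only an older point-in-time snapshot escapes the logical error.

```python
# toy "database": a primary, two synchronous mirrors, and one older snapshot
primary = {"rec42": {"next": "rec43"}}
mirrors = [{"rec42": {"next": "rec43"}}, {"rec42": {"next": "rec43"}}]
snapshot = {"rec42": {"next": "rec43"}}  # point-in-time backup, NOT a live mirror


def replicated_write(key, value):
    """Apply a write to the primary and, synchronously, to every mirror."""
    primary[key] = value
    for m in mirrors:
        m[key] = value  # mirrors copy the write faithfully - good or bad


# A logic bug corrupts a pointer, and replication spreads it instantly:
replicated_write("rec42", {"next": "rec42"})  # oops - record now points at itself

assert all(m["rec42"]["next"] == "rec42" for m in mirrors)  # every live copy is bad
assert snapshot["rec42"]["next"] == "rec43"                 # only the snapshot survives
print("mirrors corrupted together; recovery means restoring the snapshot")
```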

Nialler 27th May 2017 21:44


Originally Posted by t1grm (Post 9784751)
Doesn't say much for BA's business continuity plan. What happens if their data centre burns down? Are BA ISO2001 certified? I believe a tested BCP is a requirement. If so why wasn't this picked up by an ISO audit?

Nobody calls a disaster situation. They're afraid to. Twenty years in the business taught me that. When I moved into disaster recovery it was my first lesson.

PAXfips 27th May 2017 22:01

Since the Delta incident was mentioned, along with all the talk about redundant power: James Hamilton put up a nice article the other day about what happened (and, for the very pro, how to avoid it):
At Scale, Rare Events aren't Rare – Perspectives

(and even ultra-redundant Amazon AWS managed to trash its S3 service for over 4 hours last month ;) )

The real disaster was the time to recover PLUS the horrific communications.

yoganmahew 27th May 2017 22:04


Originally Posted by Heathrow Harry (Post 9784444)
CARR30 - you mean they are supposed to think of BA's customers? Every IT outfit I've ever dealt with does major upgrades over the weekend - when it doesn't get in the way of the real customers, who are the BA management. Long weekends are kept for BIG jobs.... It's like the railways - always close the lines over Christmas, Easter, weekends etc.

It's not the way airline IT works, or rather, not the way it should work. All work, barring emergency changes, should have been frozen for the weekend, as it's both busy and a holiday weekend. Airline IT avoids busy times for change as best practice.

Mind you, "BA IT" and "best practice" don't sit easily in the same sentence...
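
In practice the freeze amounts to something as simple as this sketch (dates hypothetical, not BA's actual change calendar): routine changes are refused inside declared freeze windows; only emergency changes are exempt.

```python
import datetime

# (start, end) inclusive - hypothetical freeze windows, e.g. a holiday weekend
FREEZE_WINDOWS = [
    (datetime.date(2017, 5, 26), datetime.date(2017, 5, 29)),
]


def change_allowed(when: datetime.date, emergency: bool = False) -> bool:
    """Emergency changes always go; routine changes wait out any freeze."""
    if emergency:
        return True
    return not any(start <= when <= end for start, end in FREEZE_WINDOWS)


assert not change_allowed(datetime.date(2017, 5, 27))              # frozen
assert change_allowed(datetime.date(2017, 5, 27), emergency=True)  # emergencies exempt
assert change_allowed(datetime.date(2017, 5, 31))                  # freeze lifted
```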

ILS27LEFT 27th May 2017 22:11

Extreme cost cutting?
 
From BBC:
"The GMB union has suggested the failure could have been avoided, had the airline not outsourced its IT work."
BA denied the claim, saying: "We would never compromise the integrity and security of our IT systems".

I do not trust large corporations any more; common sense seems to have completely disappeared. Today is another example of the total failure of modern management.
I do not know whether BA or the union is telling the truth, but I do know that BA officially stated the cause as a power failure. If that is true then it is totally unacceptable: a power failure cannot cause a total catastrophic IT crash like this one. That is why there are numerous power back-up solutions and BCP plans.
The real problem is the modern top-management theory of constant improvement at all costs, indefinitely ---> constant reduction of costs, indefinitely ---> total failures (with enormous impact on a company's credibility and financial health), simply because the basics are missed (like back-up for a power outage), all in the name of beautiful and colourful PowerPoint slides and artificial stats created to earn astonishing bonuses, nearly always directly linked to cost reductions (as a % in the best cases!). And so we end up with corporate greed as the main foundation behind critical decisions.

I would not be surprised if this incredible failure is the direct result of the extreme cost-cutting measures of recent months and years; modern management theories can seriously turn against an entire organisation if not correctly managed from the top.
It is quite simple.:ugh: We cannot keep cutting costs indefinitely and then act surprised when we suffer this type of failure; it is a company choice.
A cost-cutting exercise can become incredibly expensive indeed.

I still hope that this was not a power outage. It cannot be.

Simply scary.:mad::mad::mad:

yoganmahew 27th May 2017 22:13


Originally Posted by MG23 (Post 9784553)
Plenty of companies have huge IT infrastructure without these kind of problems. Netflix, for example, has a policy of constant testing by randomly making its servers crash and ensuring that nothing bad happens when they do. As I understand it, the only thing they're 100% reliant on is Amazon staying up in at least one region of the world.

Netflix do not care whether two people watch the same movie at once. Try sitting two people in 13A, or sending a fuzzy-logic APIS list. Airline IT is about perishable, real stuff. You can't just order some more, or turn around and say "oops, you know we said we'd deliver that by Tuesday - well, we meant Tuesday next month"... Unfortunately, senior IT management in airlines, often not being airline born and bred, seem unaware of this.

yoganmahew 27th May 2017 22:15


Originally Posted by lamer (Post 9784566)
More a lack of professionalism in the budgeting department.

Looks to me like their mainframe went down.
All tracking of Planes, Passengers, Baggage, Freight, Meals, Maintenance, Flight Planning and on and on and on will be down.

CIA, NSA, MIA, KGB, bla bla bla take a dim view of Flights leaving without prior notification of browsing history, credit card details and so on of each person on board.

Mainframes generally take many hours to get up and running again once the problem is identified and resolved. Many hundreds of subsystems need to be individually started in the right sequence and verified for proper operation.

You being unable to purchase a ticket is the last of anybody's worries.

Last outage I saw (other BIG player in Europe) cost more than €10 million; the previous outage was more than 14 years earlier. Cost of a parallel backup system: €40 million just to set up.

You do the math ....

BA no longer have a mainframe.
When they did, the cycle time to restart was under 5 minutes.
Look up the TPF system: it defined fault-tolerant, anti-fragile transaction processing.
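
On the quoted point about hundreds of subsystems needing to be started in the right sequence: that is a dependency-ordering problem. A minimal sketch, with hypothetical subsystem names and dependencies:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

DEPENDS_ON = {  # subsystem -> what must already be up before it starts
    "network": [],
    "storage": [],
    "database": ["storage", "network"],
    "messaging": ["network"],
    "check_in": ["database", "messaging"],
    "baggage": ["database", "messaging"],
}

# static_order() yields each subsystem only after all of its dependencies
startup_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print("bring-up sequence:", startup_order)
```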

PAXfips 27th May 2017 22:17

Much kudos to the above posters doing real testing. Nowadays it seems to be "enough" to have the redundancy on glossy paper, and when SHTF it "didn't work out - let's sue/blame!".

If someone says "do not push this button NOW", that is exactly what I will do. Resilient systems are hard - period. Will aviation IT ever learn from aviation flight ops (and its "crash reports"/post-mortems)?

Will we ever see a public report about what exactly happened? (Compare AWS, who put out a very detailed report showing they had "missed" the human in the loop.)

yoganmahew 27th May 2017 22:21


Originally Posted by anson harris (Post 9784707)
Perhaps the management should read this thread for some expert advice on how to run their IT systems - it seems that the world's supply of experts' opinions are here for the taking.

:)
It's not that difficult... it's just not cheap.

Nialler 27th May 2017 22:35

To IPL a mainframe shouldn't take more than five to ten minutes. If you want to cold start all systems and subsystems maybe a bit longer, but I'd expect to IPL my mainframe quicker than my smartphone.

KelvinD 27th May 2017 22:43

It seems BA have flights in the air again. The regular evening flight to Sao Paulo just flew overhead Hampshire.

Blondie2005 27th May 2017 22:47

Presumably they're running the Twitter account from someone's phone.

DaveReidUK 27th May 2017 22:52


Originally Posted by gordonroxburgh (Post 9784742)
Highly likely that the flights going off now are empty and just trying to get aircraft / crew in the right place.

And leave more stranded passengers than necessary at LHR? I don't think so.

BA have operated around 14 departures so far this evening, and counting, all with regular flight numbers. I'd be very surprised if they weren't all pretty full.

cooperplace 27th May 2017 22:52


Originally Posted by DingerX (Post 9784387)
The person who 'saved the airline a fortune in IT costs' has now no doubt been promoted, hired away to a different company and enjoyed a couple more pay raises. Besides, s/he can always claim to have been "following industry best practices".

and you could add "in full consultation with all stakeholders"

gordonroxburgh 27th May 2017 23:01


Originally Posted by DaveReidUK (Post 9784806)
And leave more stranded passengers than necessary at LHR? I don't think so.

BA have operated around 14 departures so far this evening, and counting, all with regular flight numbers. I'd be very surprised if they weren't all pretty full.

However, nobody can check in for those flights!

Airbubba 27th May 2017 23:04

Here's the current splash screen on the ba.com site:


Welcome to ba.com

Following the major IT system failure experienced earlier today, with regret we have had to cancel all flights leaving from Heathrow and Gatwick for the rest of today, Saturday, May 27.

We are working hard to get our customers who were due to fly today onto the next available flights over the course of the rest of the weekend. Those unable to fly will be offered a full refund.

The system outage has also affected our call centres and our website but we will update customers as soon as we are able to.

Most long-haul flights due to land in London tomorrow (Sunday, May 28) are expected to arrive as normal, and we are working to restore our services from tomorrow, although some delays and disruption may continue into Sunday.

We will continue to provide information on ba.com, Twitter and through airport communication channels.

We will be updating the situation via the media regularly throughout the day.

We are extremely sorry for the inconvenience this is causing our customers during this busy holiday period.

If you would like to continue please click on one of the following links, otherwise please visit us again later
The links all seem to take you back to a version of this screen with some mixed language prompts.

beamender99 27th May 2017 23:38

BBC now reporting
"BA aims for 'near normal schedule' at Gatwick and 'majority of services' from Heathrow on Sunday after IT failure"

Piltdown Man 28th May 2017 00:13

Either WW is criminally stupid or the people who designed his systems are, or possibly both. Worse, he believes the public are stupid enough to believe him. Organisations the size of BA have to have bullet proof IT systems; ones that remain running after multiple power failures, floods and DNS attacks etc. Because if what he says is true, BA are extremely vulnerable until additional power supplies are installed in this one location. While I can accept a single power failure may have initiated this catastrophe, the reason the entire shebang went tits up was because the system was designed by clowns and overseen by incompetent idiots, one of whom was the man himself. WW's problem is that he now knows he can't trust his own IT guys, he's made the only team who knew what they were doing redundant and his designers and support are in bloody India. This is one hell of a way to save a few quid. I wonder who will get a bonus this year?

PM

Tight Accountant 28th May 2017 00:38


Originally Posted by Ian W (Post 9784744)
...mirrored redundant systems.

Surely some contradiction? By definition you wouldn't mirror redundant systems!

Tight Accountant 28th May 2017 00:43


Originally Posted by Ian W (Post 9784744)
It beggars belief that any modern company would put all its IT eggs in one basket dependent on one power supply or switchover gear.

Indeed, this is what I find incomprehensible; you have a separate, back-up power source for your servers, independent of the National Grid. I am assuming the majority of BA's servers are located in Waterside somewhere. That said, the National Grid is relatively stable and power surges are few and far between. The truth is that one or more IT systems have probably fallen over.

Ian W 28th May 2017 00:56


Originally Posted by Tight Accountant (Post 9784849)
Surely some contradiction? By definition you wouldn't mirror redundant systems!

Redundant -- more systems than are needed to run the operational service

Mirrored -- Systems that have full (mirrored) copies of one or more other systems

I see no contradiction. Perhaps some elucidation is needed for those who do not deal with these issues daily.

Sunfish 28th May 2017 01:02

As a general rule, if IT is critical to the business, then you don't outsource it.

Dairyground 28th May 2017 01:25

From t1grm:

Doesn't say much for BA's business continuity plan. What happens if their data centre burns down? Are BA ISO2001 certified? I believe a tested BCP is a requirement. If so why wasn't this picked up by an ISO audit?
I assume t1grm means ISO 9001 "Quality management systems - Requirements", as there is no standard ISO 2001 in the ISO catalogue.

I cannot recall seeing any claim by BA to have an overall quality management system, let alone one conforming to the ISO 9000 series of standards. ISO 9001 does not say much about the details of a quality system, but rather concentrates on consistency of operation of well-defined processes. Judging by some of the comments on here and other boards about BA's interactions with its customers, it seems unlikely that it has an effective quality management system.

Airbubba 28th May 2017 01:43

The latest ba.com splash screen, possibly typed on a tablet or phone with a couple of typos:


Welcome to ba.com

Major IT system failure . [sic] latest information at 23.30 Saturday May 27

Following the major IT system failure experienced throughout Saturday, we are continuing to work hard to fully restore all of our global IT systems.

Flights on Saturday May 27

We are extremely sorry for the significant levels of disruption caused to customers and understand how frustrating their experiences have been.

Affected customers can claim a full refund or rebook to a future date for travel up until the end of November 2017. Customers are urged to keep any food, transport or accommodation receipts and can make a claim in due course through our Customer Relations teams. There are a significant number of bags at Heathrow which we will be reuniting with customers via couriers as soon as we can. This will be done free of charge.

Please don.t [sic] come to Heathrow to collect your delayed bags, as they are in the process of being sorted for onward distribution in secure airside locations.

Flights on Sunday May 28

Although some of the IT systems have returned, there will be some knock-on disruption to our schedules as aircraft and crews are out of position around the world.

We are repositioning some aircraft during the night to enable us to operate as much of our schedule as possible throughout Sunday.

At this stage we are aiming to operate a near normal schedule of flights from Gatwick and the majority of our Heathrow services.

Please do not come to the airports unless you have a confirmed booking for travel.

We recognise the uncertainty that some customers may be feeling and have therefore extended our flexible booking policy.

If you are due to fly to/from Heathrow or Gatwick on Sunday May 28 or Monday May 29 and no longer wish to travel, even if your flight is still operating, you can rebook to travel up to and including 10 June.

Sunfish 28th May 2017 01:46

Is it possible that BA will go under as a result of this mess? Why would anyone want to fly with them again? What other BA "systems" are rotten and waiting to fail as a result of under-investment / cost cutting?

PAXboy 28th May 2017 02:14

They won't go under, but neither will the people who orchestrated their systems over the last 20 years be fairly assessed. I first read the following over 25 years ago; this copy is from Wikipedia, and it is still relevant today, irrespective of whether the failure was physical (power) or logical.

The six phases of a big project is a cynical take on the outcome of large projects, with an unspoken assumption about their seemingly inherent tendency towards chaos. The list is reprinted in slightly different variations in any number of project management books as a cautionary tale.

One such example gives the phases as:
  • Enthusiasm,
  • Disillusionment,
  • Panic and hysteria,
  • Hunt for the guilty,
  • Punishment of the innocent, and
  • Reward for the uninvolved.

Freehills 28th May 2017 02:48


Originally Posted by Sunfish (Post 9784856)
As a general rule, if IT is critical to the business, then you don't outsource it.

Major airlines that don't outsource IT:
Delta (and they messed up)
Air New Zealand

And, frankly, it is a silly rule. I don't see many companies building their own PC operating systems, despite how critical those are. For a parallel, engines are also critical to the airline business, but since the break-up of United in the 1930s, no airline has designed and built its own engines.

MG23 28th May 2017 03:56


Originally Posted by yoganmahew (Post 9784780)
Netflix do not care whether two people watch the same movie at once.

My point is that they don't just let people develop software and hope it works when something crashes. They continually and deliberately kill random services in their system, so every piece of code has to handle failures correctly; if it doesn't, they find out very quickly.

That means that, when something crashes unexpectedly, the rest of the system keeps going. Because it's been developed from the ground up to expect that to happen.

Whereas in software which isn't tested that way, I've seen something as simple as a debug message in code that's only called in a failure case kill the system because it accessed an invalid variable. That failure case was never seen in testing, so the code was never executed until it happened for real. And then it made the failure much, much worse than it would otherwise have been.
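
A tiny, hypothetical Python illustration of that exact failure mode: the debug line sits in an error handler that testing never reached, and it references a variable that doesn't exist on that path, so the first real failure escalates into a crash of the handler itself.

```python
def fetch_record(key, cache):
    try:
        record = cache[key]
        return record
    except KeyError:
        # Never exercised in testing - and 'record' was never assigned on this
        # path, so this "helpful" debug line raises NameError instead of
        # handling the missing key gracefully.
        print(f"lookup failed for {key}, last record was {record}")
        return None


fetch_record("BA0176", {"BA0176": "JFK-LHR"})      # the happy path: works fine
try:
    fetch_record("BA0066", {"BA0176": "JFK-LHR"})  # first real failure...
except NameError as err:
    print("the error handler itself crashed:", err)  # ...and made things worse
```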

southern duel 28th May 2017 04:35

Interesting scenario, and one that BA or Heathrow management have not learned from after all the issues with snow and, previously, high winds. Aircraft waiting up to 4 hours after landing to offload?? Incredible. As a result of previous instances, and before I left LHR, I drafted a contingency OSI for offloading passengers onto taxiways, enabling pax to get into the terminal; the aircraft would then taxi to base to park, where there was always space available. This was available for all terminals, not just T5, with specific locations and specific processes to achieve it. During the snow debacle we offloaded 10 aircraft in 90 minutes that had been waiting for up to 4 hours because no stands were available; that is what led to the draft process being drawn up, with ATC and BA in agreement. Looks like it's now in the bin somewhere!! This is when experience counts, with ops guys who can think on their feet and do not have to rely on a black or white procedure!

pants on fire... 28th May 2017 05:11


Originally Posted by southern duel (Post 9784919)
This is when experience counts, with ops guys who can think on their feet and do not have to rely on a black or white procedure!

It's probably all outsourced to a centralised call centre in Hyderabad - with a beautifully scripted answer to everything.

