PPRuNe Forums

PPRuNe Forums (https://www.pprune.org/)
-   Passengers & SLF (Self Loading Freight) (https://www.pprune.org/passengers-slf-self-loading-freight-61/)
-   -   BA delays at LHR - Computer issue (https://www.pprune.org/passengers-slf-self-loading-freight/595169-ba-delays-lhr-computer-issue.html)

sherburn2LA 28th May 2017 05:25


Originally Posted by PAXboy (Post 9784878)
The six phases of a big project is a cynical take on the outcome of large projects:

In my 38 years' experience most IT mega-projects go in just two phases

1) Fire all the people who say it won't work

later

2) Fire all the people who said it would work

LTNman 28th May 2017 05:36

As part of my job I used to install UPSs (uninterruptible power supplies) that guaranteed power for a set time. Years later we would go back when they failed to operate: the batteries, on constant trickle charge, had a limited life, and no one had thought about changing them.
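The failure mode LTNman describes, batteries quietly ageing on float charge until the day they are needed, can be caught with even a crude age check. A minimal sketch in Python (the fleet data, site names, and four-year threshold are all invented for illustration; real UPSs report battery health over SNMP or serial, which would be the better data source):

```python
from datetime import date

# Sealed lead-acid UPS batteries degrade on float/trickle charge; a few
# years is a common design life. This sketch flags units whose batteries
# are past a replacement threshold. The fleet data below is made up.

BATTERY_DESIGN_LIFE_YEARS = 4  # assumed threshold, not a universal figure

ups_fleet = [
    {"site": "DC-East", "battery_installed": date(2013, 3, 1)},
    {"site": "DC-West", "battery_installed": date(2016, 9, 15)},
]

def batteries_due(fleet, today, life_years=BATTERY_DESIGN_LIFE_YEARS):
    """Return sites whose batteries have exceeded their design life."""
    due = []
    for ups in fleet:
        age_years = (today - ups["battery_installed"]).days / 365.25
        if age_years >= life_years:
            due.append(ups["site"])
    return due

print(batteries_due(ups_fleet, date(2017, 5, 28)))  # ['DC-East']
```

The point is not the code but the process: somebody has to own the install date and the threshold, or the UPS is just a box that fails a few years later than the mains does.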

Rwy in Sight 28th May 2017 05:36


Nialler

protects against large physical attacks. Logical ones are not prevented.
I thought it was possible to avoid logical attacks by having the code/software written by different teams, so that mistakes and logic bombs occur in only one system. I hope that makes sense.

Nialler 28th May 2017 05:43


Originally Posted by Rwy in Sight (Post 9784942)
I thought it was possible to avoid logical attacks by having the code/software written by different teams, so that mistakes and logic bombs occur in only one system. I hope that makes sense.

For something such as a database subsystem and its associated application layer, there will be many hundreds and potentially thousands of coders involved.

Sunfish 28th May 2017 06:06

freehills:


Quote:
Originally Posted by Sunfish View Post
As a general rule, if IT is critical to the business, then you don't outsource it.
Major airlines that don't outsource IT
Delta (and they messed up)
Air New Zealand

And, frankly, it is a silly rule. I don't see many companies building their own PC operating systems, despite how critical those are. For a parallel, engines are also critical to the airline business, but since the break-up of United in the 1930s, no airline has designed and built its own engines.
You misunderstand. There are multiple reputable and competitive sources for PCs, operating systems, engines (new and overhauled), tyres, etc. Of course you can outsource all that without losing control, and generate efficiencies in the process.

However, there is only one BA, and one set of BA IT systems that are intricately and inextricably bound up with BA's business strategy. The two are inseparable and complementary; you cannot change business strategy without modifying the business rules that are implemented in the software, period.

If you lose detailed control of the strategic IT in your business, you risk exactly what BA are now experiencing, because your contractor cannot know your strategic priorities in as much detail as you do, and those priorities change monthly.

For example, consider what software changes might be required to implement a marketing idea to increase baggage allowances.

ImageGear 28th May 2017 06:12


To IPL a mainframe shouldn't take more than five to ten minutes.
And this is totally unacceptable, as BA have found to their severe cost. Whoever took the decision to implement a system where recovery is to IPL, blowing the reservations system out of the water with no failover, needs to take the long walk.

To call the latest crop of "big servers", mainframes, is blatant misrepresentation. True fault tolerant mainframes no longer really exist.

30 years ago, I ran a demonstration in Paris for airline CIOs and IT directors of multi-host file sharing across duplicate mainframes, in which it was proved that on catastrophically failing one mainframe, not a single transaction was lost anywhere between the reservations terminal and the failing computer's memory module. The impact on the reservations clerk was a "wait" screen that lasted around 40 seconds. The solution sold, but not to BA.

There is absolutely no excuse, 30 years later, for any airline reservations department to accept anything less.

Some very big heads must roll. :=

Imagegear.

Nialler 28th May 2017 06:33


Originally Posted by ImageGear (Post 9784959)
And this is totally unacceptable, as BA have found to their severe cost. Whoever took the decision to implement a system where recovery is to IPL, blowing the reservations system out of the water with no failover, needs to take the long walk.

To call the latest crop of "big servers", mainframes, is blatant misrepresentation. True fault tolerant mainframes no longer really exist.

30 years ago, I ran a demonstration in Paris of multi-host file sharing across duplicate mainframes, in which it was proved that on catastrophically failing one mainframe, not a single transaction was lost anywhere between the reservations terminal and the failing computer's memory module. The impact on the reservations clerk was a "wait" screen that lasted around 40 seconds. The solution sold, but not to BA.

There is absolutely no excuse, 30 years later, for any airline reservations department to accept anything less.

Some very big heads must roll. :=

Imagegear.

You misunderstood me. My entire career has been spent working on mainframes. The type of failover capabilities you describe are not quite thirty years old, but you are correct that not simply sysplex but geographically dispersed sysplex is possible. Not just possible but standard: my clients all have datacentres at sites which are remote from each other and are cycled on a scheduled basis. OK, I'm an old IT head, but I'm sure that pilots will nod at the concept of double or triple redundancy. Their lives depend on it; in my case it is just my job.

My fear is always that a single system failure might not be contained when it is a logical or intrinsic programmer error, which, with the cold logic of object code, propagates through the redundant systems as well. It is as if the problem in your primary hydraulic system were not actually isolated, because the same fault that caused its failure exists on the fallback.
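Nialler's common-mode worry can be shown with a toy example: hardware redundancy buys nothing against a logic error, because every replica runs the same code and the same input trips the same bug everywhere. A minimal Python sketch (the message format and the buggy parser are invented for illustration):

```python
# Toy illustration of common-mode failure: identical software replicated
# across "redundant" nodes fails identically on the same bad input.

def buggy_parse(msg: str) -> int:
    # Latent bug: assumes the second field is always numeric.
    return int(msg.split(":")[1])

def run_on_replicas(msg, replicas=3):
    """Run the same code on N redundant nodes; count the survivors."""
    survivors = 0
    for _ in range(replicas):
        try:
            buggy_parse(msg)
            survivors += 1
        except ValueError:
            pass  # this replica crashed on exactly the same input
    return survivors

print(run_on_replicas("SEAT:42"))    # 3 -- all replicas fine
print(run_on_replicas("SEAT:n/a"))   # 0 -- one bad input takes out every replica
```

This is the failure class that Rwy in Sight's suggestion of independently written implementations (N-version programming) is aimed at; duplicated hardware addresses a different one.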

BetterByBoat 28th May 2017 06:57

Why is everyone being so quick to accept the official line? A power supply failure taking down multiple systems across multiple data centres? Possible, but it wouldn't be at the top of my list of likely explanations.

Heathrow Harry 28th May 2017 07:19

Exactly - and it should also be (relatively) easy to fix...

Some sort of botched software upgrade is my guess, but blaming the hardware is probably a first step in trying to save your job.

Tay Cough 28th May 2017 07:27


Some very big heads must roll.
This is BA. Of course they won't.

FlightCosting 28th May 2017 07:38


Originally Posted by Heathrow Harry (Post 9784992)
Exactly - and it should also be (relatively) easy to fix...

Some sort of botched software upgrade is my guess, but blaming the hardware is probably a first step in trying to save your job.

Time to go back to pen and paper. Back in the day (1970), in the brand new high-tech Terminal 1, the only computer we had was the Solari information board, and that broke down often. Paper pax manifests and handwritten load sheets.

xs-baggage 28th May 2017 07:45

I seem to remember that two or three years ago BA were putting Navitaire into LHR, LGW, and LCY as part of what I was told was a "backup strategy" (I presume they meant a fallback strategy, but there you go). My involvement with UK airline IT ended shortly afterwards - does anyone know if they ever did it?

RomeoTangoFoxtrotMike 28th May 2017 08:22


Originally Posted by sherburn2LA (Post 9784935)
In my (38 years) experience most IT mega projects go in just two phases

1) Fire all the people who say it won't work

later

2) Fire all the people who said it would work

Typo. You meant:-

2) *Promote* all the people who said it would work...

["*We* know it will work, it's the silly IT nerds who couldn't do it. Sack them, promote us for having the brilliant idea in the first place"].

fchan 28th May 2017 08:28

I worked 40 years in safety and reliability in transportation industries, latterly with ATC. I could tell many stories of power supply disasters in so-called very resilient systems. In the Far East I was doing a project in a road control centre. I asked them about power supply resilience and was assured they had all the batteries and generators needed to make it bulletproof. THE VERY NEXT DAY, on entering the centre, there were worried-looking managers and a long extension lead across the floor leading to the back of a server rack. "We had a power failure and had to do that to get it working again."

In the same country the main and backup power supply went through the same circuit breaker which failed taking down a complete train line.

In UK ATC the last serious power outage was 15-20 years ago. Since then NATS has vastly improved the power systems, so, whilst nothing is impossible, I'd be extremely surprised if power was the cause. If it did happen, the most likely cause would be maintainer error rather than a straight hardware or software failure in the power supplies. Many of NATS's recent, very occasional issues have been due to software upgrades not going to plan, but they are nearly always recoverable to the backup system or old software in minutes. The up-to-hours interruption to some traffic flow is only due to the time taken to rebuild the traffic flow to one that works safely; at an airport like Heathrow, working at >95% capacity, a small blip can't be recovered in minutes. NATS does its software upgrades only in quiet periods and would never choose a May Bank Holiday w/e to do one. Unlike an airline, whose planes in incidents like the current one are on the ground, and therefore safe, ATC has many planes in the sky and so has to be much more careful.

RomeoTangoFoxtrotMike 28th May 2017 08:33


Originally Posted by BetterByBoat (Post 9784978)
Why is everyone being so quick to accept the official line? A power supply failure taking down multiple systems across multiple data centers. Possible but it wouldn't be at the top of my list of likely explanations.

I can see how a "power failure" in a specific subsystem, the consequences of which were either not fully thought through or were ignored, and the response to which was mishandled, could rapidly spiral out of control. It's probably at this point that having outsourced IT half a world away really began to bite...

Gertrude the Wombat 28th May 2017 08:39


Originally Posted by LTNman (Post 9784941)
I used to install UPS's (uninterruptible power supplies) as part of my job that guaranteed power for a set time. Years later we would go back when they failed to operate as the batteries that were on a constant trickle charge had a limited life and that no one had thought about changing them.

I don't see what use a UPS is if it doesn't monitor itself. Surely it would have reported that it needed a new battery, even if the bureaucratic systems for maintaining it failed?

ExSp33db1rd 28th May 2017 08:43

A million years ago - circa the 1970s - BOAC sent us home from New York as passengers, unscheduled: pick up your tickets at the airport. On arrival at JFK, the check-in girl advised us that we couldn't travel as "the system was down" and she couldn't print our tickets. The Flt. Eng. handed her his pen. We flew home.

It's called progress.

coalencanth 28th May 2017 08:49

There's quite a gem, allegedly from the great Alex himself, that's been leaked on FlyerTalk, apparently from their internal network. What do they call it, Yammer?

Could have joined BA a few years back, thank god I didn't. Great company going down the pan.

oldart 28th May 2017 08:55


Originally Posted by FlightCosting (Post 9785003)
Time to go back to pen and paper. Back in the day (1970), in the brand new high-tech Terminal 1, the only computer we had was the Solari information board, and that broke down often. Paper pax manifests and handwritten load sheets.

Pen and paper with some Blu Tack would have given passengers some kind of information in the terminals, however that would have needed someone with common sense and not a keyboard stuck to their hands.

Rwy in Sight 28th May 2017 09:05


Great company going down the pan.
It seems they get the best in Europe to recover from severe schedule disruption.

Bobbsy 28th May 2017 09:07

I know that SLF are only tolerated here, and when I admit that, before retirement, I was in charge of engineering for a TV news agency, I suspect there will be muttering under a lot of breaths.

However, because we had to broadcast live a lot of the time, we put in an elaborate backup power system. All critical facilities (in our case, studios, control rooms, edit rooms, computers etc.) ran on a UPS that could keep the whole load going for 30+ minutes. In addition, there were two auto-start generators, either of which generated enough power to keep all essential services running. If both came on, we would power the whole building--offices and so on--but if one failed for one reason or another, there was automatic load shedding to turn off power to non-essentials and keep the important parts going.

On top of that--but very important--we tested the system once a month. Technical management took turns coming in at around midnight, throwing the main power supply off and letting the generators take the load for half an hour or so. The diesel engines running the generators were inspected the next morning.

On top of all that, even before we got to UPS units and generators, we had two separate power supplies into the building and multiple different routes to feed data (and in our case video) out of the building.

It may have been expensive but the accountants were eventually convinced that it was more economical than going off the air.

So it CAN be done.
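The automatic load shedding Bobbsy describes amounts to a simple priority scheme: keep the most essential loads that fit within whatever generating capacity survives, and drop the rest. A minimal sketch (the load names, kW figures, and capacities are invented for illustration):

```python
# Priority-based load shedding: with full generator capacity everything
# runs; if capacity drops, loads are kept highest-priority-first until
# demand no longer fits. All numbers below are made up.

loads = [  # (name, kW, priority: lower number = more essential)
    ("studios_and_control", 120, 0),
    ("edit_rooms", 60, 1),
    ("computers", 40, 1),
    ("offices_hvac", 150, 2),
]

def shed(loads, capacity_kw):
    """Keep the highest-priority loads that fit within capacity."""
    kept, used = [], 0
    for name, kw, _prio in sorted(loads, key=lambda l: l[2]):
        if used + kw <= capacity_kw:
            kept.append(name)
            used += kw
    return kept

print(shed(loads, 400))  # both generators on: everything stays powered
print(shed(loads, 250))  # one generator: the offices get shed
```

In a real installation this logic lives in the transfer switchgear rather than in software, but the decision rule is the same, and so is the monthly-test discipline that proves it actually works.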

portmanteau 28th May 2017 09:36

blame game
 
All due to outsourcing? Maybe it's closer to home... Waterside is due to be flattened for the new runway. Some premature tinkering going on in the basement?

wondrousbitofrough 28th May 2017 09:38


Originally Posted by oldart (Post 9785054)
Pen and paper with some Blu Tack would have given passengers some kind of information in the terminals,

They'll all be in somebody's cupboard at home:E

DaveReidUK 28th May 2017 09:47


Originally Posted by oldart (Post 9785054)
Pen and paper with some Blu Tack would have given passengers some kind of information in the terminals, however that would have needed someone with common sense and not a keyboard stuck to their hands.

Not all staff were paralysed by indecision.

https://ichef.bbci.co.uk/news/624/cp...i039731226.jpg

Sonny_Jim 28th May 2017 10:15

Would not be surprised in the slightest if the 'power supply problem' was actually a botched upgrade. It's much easier for the travelling public to understand 'the power cord fell out and it stopped working' than 'we were updating our database and Little Jonny Tables popped up'.

pax britanica 28th May 2017 10:26

I rather like today's media comments that BA have advised affected passengers that they will get a refund or rebooking. Well, of course they will, since they sold people tickets that turned out to be unusable and are in breach of contract with a couple of hundred thousand people. How arrogant can they get?

This is an extremely serious incident. If it were JAL or ANA, the CEO would be packing up his office this morning, but here in the land of 'accountability', not responsibility, the blame will fall on the lowest possible credible manager or technician.

It is unthinkable that a power problem at just one site could shut everything down; they must surely have a backup or mirror site elsewhere in the UK, or even in the US, that would provide some assurance of continuity. Aside from everything else, it just makes them look stupid as a company, and by name association makes us look stupid as a country.

Ian W 28th May 2017 10:28


Originally Posted by southern duel (Post 9784919)
Interesting scenario, and one that BA or Heathrow management have not learned from after all the issues with snow and, previously, high winds. Aircraft waiting after landing to offload for maybe up to 4 hours?? Incredible. As a result of previous instances, and before I left LHR, I drafted a contingency OSI for passenger offloading onto taxiways, enabling pax to get into the terminal; aircraft would then taxi to base to park, which was always full of available space. This was available for all terminals, not just T5, with specific locations and specific processes to achieve it. During the snow debacle we offloaded 10 aircraft in 90 minutes that had been waiting for up to 4 hours because of no available stands. This resulted in the draft process being drawn up, with ATC and BA agreeing. Looks like it's now in the bin somewhere!! This is when experience counts, with ops guys who can think on their feet and do not have to rely on a black or white procedure!

Interesting, I proposed a similar system to Delta after the ATL system lost it and all gates were full with arrivals being put into penalty boxes. My proposal was to use one or more of the gates close to their shuttle rail system on each concourse as offload gates where the pax and bags are deplaned then tow the aircraft to a stand for cleaning. I would think any airport should have some equivalent system.

Ian W 28th May 2017 10:39


Originally Posted by Nialler (Post 9784967)
<<SNIP>>

My fear always is that a single system failure might not be restricted or contained when it is a logical or intrinsic programmer error which with the cold logic of object code propagates through the redundant systems also. The problem in your primary hydraulic system is not actually isolated because the same problem which led to its failure exists on the fallback.

The solution to this is to dump all the input messages at a failover and restart from a checkpoint taken a few seconds before the crash. Normally, most problems are some kind of timing issue, and a restart from a previous checkpoint will not show the same problem. If it is a raw logic problem with one message, the user will have to re-input that message, and the second time it may not be malformed. If it is, the system does another restart, but the source of the error becomes apparent, and the input message, or rather the user, can be blocked. This approach worked well in systems developed for the FAA in the late 1960s, and that software was only replaced in 2015.
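A simplified sketch of the checkpoint-and-restart scheme Ian W outlines: roll state back to the last checkpoint after a crash, ask for the offending message again, and block it on a second strike. (The message format, the state, and the 'poison' message are invented for illustration; a real system would also journal and replay the queued input messages, which is omitted here.)

```python
import copy

# Checkpoint/rollback with two-strike input blocking. A deliberately
# crash-prone message stands in for a raw logic bug in the processor.

class CheckpointingSystem:
    def __init__(self):
        self.state = {"bookings": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.blocked = set()
        self.crash_counts = {}

    def process(self, msg):
        if msg in self.blocked:
            return "rejected"
        try:
            if msg == "POISON":  # stands in for an input that trips a bug
                raise RuntimeError("logic error")
            self.state["bookings"] += 1
            self.checkpoint = copy.deepcopy(self.state)  # periodic in reality
            return "ok"
        except RuntimeError:
            self.state = copy.deepcopy(self.checkpoint)  # roll back
            self.crash_counts[msg] = self.crash_counts.get(msg, 0) + 1
            if self.crash_counts[msg] >= 2:
                self.blocked.add(msg)  # second strike: block the input
            return "restarted"

sys_ = CheckpointingSystem()
print([sys_.process(m) for m in ["BOOK", "POISON", "POISON", "POISON", "BOOK"]])
# ['ok', 'restarted', 'restarted', 'rejected', 'ok']
print(sys_.state)  # {'bookings': 2} -- no committed work was lost
```

The key property is the last line: the poison message costs two restarts and then gets blocked, but committed state survives intact, which is exactly why a rollback-and-replay design contains a bad input where naive hot replication would not.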

mrben09 28th May 2017 10:47

As both SLF (derogatory as that title is) and a highly experienced IT leader (biased towards infrastructure), and someone who spent nearly 10 hours yesterday in T5, I feel I have something to contribute.

First off, let's not confuse DR with BCP, although both failed yesterday.

For example, while IT wherever were toiling over bringing systems back online, the CW/CE/First queues, which stretched right out of the terminal, were being "organised" by two women who were effectively herding cats. They were on a hiding to nothing, as people were joining any queue and then losing it when the staff came back round again 20 minutes later telling them to go and join the mega-queue at WT. Not enough staff, and definitely no sign of management at all. This got better during the afternoon, but still no sign of any senior staff. Even this morning they were trying to get us on a flight: my wife received a text to say it was cancelled, but nothing showed on their system. Where were the managers? Nowhere to be seen, as they "were in meetings". Maybe those meetings should have been held through the night so everyone could be briefed for 0430.
On a more serious note, we were told by staff that they couldn't find any megaphones to replace the non-working PA. I would suggest that these should be easy to find in case of a real emergency.

As for IT, outsourcing is not something I would advocate, but when it has crossed my path, I would never allow a system to go live without:
Rigorous functional testing of system
Rigorous DR Testing
Sign-off of all infrastructure designs by someone qualified to do so, countersigned by myself.

The outsourcer should not have unrestricted responsibility for the design of something thousands of miles away that isn't theirs. This also makes it easy to swap supplier should they prove to be sub-par, which they will.
I guarantee someone within BA has signed that design off as suitable, and that's where heads should roll initially. Then look at your "partner".

Also, all the previous posts regarding bean counters are a given as well. The scourge of IT!
On a personal note, I'm not actually buying the power excuse, but as we don't like to speculate within these halls I'll keep my opinion to myself. I will say, however, that all the systems affected were internet-facing.

Anyway, I've got all that off my chest, and am resigned to going back to work on Tuesday instead of enjoying a few cold ones on the Greek coastline!

Super VC-10 28th May 2017 11:09

https://twitter.com/WillBlackWriter/...87512616054786

Sonny_Jim 28th May 2017 11:17

Actually, a few viruses exist for the C64:
Com64/BHP | VirusInfo | Fandom powered by Wikia

enola-gay 28th May 2017 11:30


Originally Posted by Sonny_Jim (Post 9785128)
Would not be surprised in the slightest if the 'power supply problem' was actually a botched upgrade. It's very easy for the travelling public to understand 'The power cord fell out it stopped working', rather than 'We were updating our database and Little Jonny Tables popped up'.

===
Was Mr O'Bama flying BA from Edinburgh? Damned apostrophes, eh?

Self Loading Freight 28th May 2017 11:44

I'd hope nobody would be rolling out upgrades on one of the busiest weekends of the year.

For what it's worth, having seen and written about a few major IT omnishambles, I'd be surprised if this omelette wasn't made with Swiss cheese. And I suspect the motivation to find and fix the systemic error that allowed the holes to line up won't be sufficient to counter organisational inertia and CYA - but if it is, I'll regain that very scarce state of mind: respect for senior management.

crewmeal 28th May 2017 11:48

What exactly has Alex Cruz brought to the table since taking over as Chief Executive back in 2015?

tubby linton 28th May 2017 12:25


Originally Posted by crewmeal (Post 9785200)
What exactly has Alex Cruz brought to the table since taking over as Chief Executive back in 2015?


Marks and Spencer sandwiches?
I empathise with the frontline staff, and I hope that the passengers remember that it is not the fault of these staff that their journey has been delayed.
I would imagine that the prospect of huge EU261 claims is probably driving the lack of disclosure as to the exact reason for this debacle.

paperHanger 28th May 2017 12:31

Free the gates!
 
I'm somewhat surprised that they are leaving a/c on gates, blocking other flights... it would make more sense to close 5B/5C and start stacking them on taxiways C/D by now; blocking gates is not the best plan.

pax britanica 28th May 2017 12:34

Isn't it more a question of what Alex Cruz has taken off the table since he became CEO? No more meals, a cheapskate policy on water for pax, etc.

This fiasco is his fault: he is the CEO. He gets the plaudits, but he should also carry the can. He is a slash-and-burn manager, his philosophy of aggressive cost-cutting has obviously been taken a step too far here, and he has to go for the good of BA.

Anyone else agree?

Annewd 28th May 2017 12:38

DingerX is spot on, saying "The person who 'saved the airline a fortune in IT costs' has now no doubt been promoted, hired away to a different company and enjoyed a couple more pay raises". Happens right across public sector services too

ILS27LEFT 28th May 2017 12:44

I agree
 

Originally Posted by pax britanica (Post 9785232)
Isn't it more a question of what Alex Cruz has taken off the table since he became CEO? No more meals, a cheapskate policy on water for pax, etc.

This fiasco is his fault: he is the CEO. He gets the plaudits, but he should also carry the can. He is a slash-and-burn manager, his philosophy of aggressive cost-cutting has obviously been taken a step too far here, and he has to go for the good of BA.

Anyone else agree?

I totally agree. Aggressive cost-cutting is very likely behind this nonsense, and now the FLAs have to deal with the mess. Absurd.

arem 28th May 2017 12:53

Cruz must go - and I ain't talking about Penelope




Copyright © 2024 MH Sub I, LLC dba Internet Brands. All rights reserved. Use of this site indicates your consent to the Terms of Use.