Complete LH group meltdown


atakacs
15th Feb 2023, 10:47
"Deutsche Lufthansa AG (https://www.bloomberg.com/quote/LHA:GR) grounded all of its flights on Wednesday after damage to a set of Deutsche Telekom (https://www.bloomberg.com/quote/DTE:GR) broadband cables caused widespread IT problems" (link (https://www.bloomberg.com/news/articles/2023-02-15/lufthansa-says-it-system-issues-causing-widespread-cancellations) - behind paywall),

If true (i.e. as the root cause; the disruption itself is very real), it is inexcusable.

Less Hair
15th Feb 2023, 10:54
Pretty bad mess at FRA right now. Flights are being diverted elsewhere as the airport fills up with aircraft since nothing is departing. Mind you, this Friday there are big airport strikes in Germany anyway.

EDLB
15th Feb 2023, 12:01
Workers managed to drill into a big fiber trunk 5 m underground near a railroad. Repairs have been ongoing since Tuesday night.

The Nr Fairy
15th Feb 2023, 13:27
https://www.theregister.com/2023/02/15/lufthansa/ - non paywalled.

atakacs
15th Feb 2023, 14:13
Workers managed to drill into a big fiber trunk 5 m underground near a railroad. Repairs have been ongoing since Tuesday night.
Might well be, but this type of infra should most definitely be totally redundant. Yes, it costs money, but this incident will not be cheap either. Even the Frankfurt airport website (https://www.frankfurt-airport.com/en.html) is MIA...

MAN777
15th Feb 2023, 15:02
Single point of failure, oh dear!

hunterboy
15th Feb 2023, 17:30
Sounds very British Airways!

pax britanica
15th Feb 2023, 18:57
As a telecoms guy, it is almost beyond belief that all of LH's IT could be taken out by one cable cut, and on a rail right of way where it is well known there are always going to be telecom cables.

It's the airline industry, for God's sake; redundancy is built into every aspect of airliners, but it seems not into IT systems.

The CEO of the LH Group should be fired for this.

First_Principal
15th Feb 2023, 19:36
... The CEO of the LH Group should be fired for this

While I agree redundancy in such a system is surely a good thing, I really don't see why the CEO should automatically be fired as a result of this event.

A person in that position would not normally be expected to know the full down 'n dirty technical minutiae of their system, rather they would rely upon advice they receive from company and external experts. If it transpired they knew about weaknesses in the system, were advised to address these but did nothing, then perhaps they should consider moving on. Otherwise, assuming they're an effective CEO, why not allow them the opportunity to sort the issue out?

To put it another way: should a pilot be automatically 'fired' because an engineer did something that broke their 'plane?

FP.

Imagegear
15th Feb 2023, 20:18
The Piper will have to be paid.

No doubt a serious cost saving measure that turned around and bit them. Single point of failure cannot be tolerated in a failover strategy. Chief Information Officers and Financial Officers would not have adopted this strategy without the full knowledge of the CEO. There will be no hiding place.

IG

His dudeness
15th Feb 2023, 20:39
The Piper will have to be paid.

No doubt a serious cost saving measure that turned around and bit them. Single point of failure cannot be tolerated in a failover strategy. Chief Information Officers and Financial Officers would not have adopted this strategy without the full knowledge of the CEO. There will be no hiding place.

IG
My BS detector is right off the scale. Do you HONESTLY think Lufthansa should run their own internet? Parallel to the infrastructure already installed? It's not as if someone drilled into a cable at the LH HQ.

There are plenty of things I would fire the CEO for (e.g. closing down LH Flight Training or the downward spiral of T&C for LH Group employees), but this is not one of them.

What this really highlights is how the (EU-driven) shift to privatization of crucial infrastructure allows these weak points to develop.

Flyhighfirst
15th Feb 2023, 20:42
The Piper will have to be paid.

No doubt a serious cost saving measure that turned around and bit them. Single point of failure cannot be tolerated in a failover strategy. Chief Information Officers and Financial Officers would not have adopted this strategy without the full knowledge of the CEO. There will be no hiding place.

IG

Adopting a strategy without approval from the CEO may be one thing. A system in place for years before the CEO took the role? I would cut some slack. I know there is a propensity right now to think every CEO of a large corporation is the devil incarnate, but they can only know what they have been briefed on to enable them to make a decision. Even then they aren't briefed on every aspect of the organisation.

20driver
16th Feb 2023, 03:04
Quite a while ago a pile-driving crew was driving piles for a new garage at Newark.
They hit a concrete conduit that was not supposed to be there. It was a glancing blow, so they moved over and tried again.
This time they penetrated the conduit and hit the main power line to various parts of the airport, including the control tower and Customs in Terminal B.
For good measure, the backup power line was in the same conduit bank and it was hit as well!
Nothing moved for about six hours and all the inbound international arrivals had to wait for Customs to get power back.
Seems the control tower had a backup gen set, so they were OK.

EDLB
16th Feb 2023, 04:52
Idiot-proofing infrastructure is neither easy nor cheap, mainly due to the inventiveness of the idiots.

Denti
16th Feb 2023, 05:29
Don't big companies usually have a CIO (Chief Information Officer) who can be fired in cases like this? That said, Lufty probably just paid a lot of money to Telekom for a redundant fibre connection and Telekom simply put all four fibres in the same conduit below that railway line.

The bigger problem is probably that nobody knows what's down there until you dig into it, since records are only kept within one company. In Berlin they just built the foundation for a new high-rise building and afterwards discovered that they had nearly destroyed the tunnel of an active underground line, which now has to be repaired, stopping traffic in there for months. That is still better than nearly burying a whole commuter train in concrete.

the_stranger
16th Feb 2023, 07:35
I am (constantly) amazed by the calls for firing people without any further information on cause or processes.

Just cause should be applied everywhere, not just in the cockpit.

Somebody made a mistake; let them figure it out, learn from it, and then see whether somebody is to blame and to what degree.

Boeingdriver999
16th Feb 2023, 08:42
Replace CEO in all contexts with Captain. Reads a bit differently, right?

ATC Watcher
16th Feb 2023, 09:03
Exactly the same thing occurred 45 years ago in my old (major) control centre. A worker using a backhoe dug up and cut a bunch of cables just outside the centre, severing all telephone and radar lines (which were all underground at the time). Lessons were learned about having multiple access paths, etc., but one of the root causes still remains today: sub-contracting basic work to the lowest bidder. In our case the guy operating the backhoe was not one of the regular workers who had been briefed on the cable layout; the digging crew that morning had been hired on the day to do some basic earth removal, and they were not made aware of the exact position and sensitivity of the cables.
I would not be surprised if the same happened in Frankfurt yesterday with Deutsche Bahn (the railway company that contracted the work that cut the cable). How many sub-contractors down the line?
As to the responsibility of Lufthansa: the incident is of course not their direct fault, as it occurred quite far away from the airport, but relying on a single source of communications to function probably is.

B Fraser
16th Feb 2023, 09:31
What this really highlights is how the (EU-driven) shift to privatization of crucial infrastructure allows these weak points to develop.

I beg to differ. The most robust network I have been involved with was run by a totally private company. At one point it carried the BBC broadcasting distribution network, so it was a high-stakes venture. In 1994, a 737 crashed short of Coventry airport and went straight through the network at Willenhall. I was in the ops centre at the time and the screens lit up like a Christmas tree. The BBC transmissions didn't even flicker as the network was fully redundant. All other services, including 999, were similarly unaffected.

Separacy and diversity of networks are critical. In a development at a location such as an airport, the incremental costs are negligible compared to the cost of failure. You can plan as much as you like, but Murphy is always waiting in the wings. One memorable event involved a big telecoms company who excavated a duct under a rugby pitch using a horizontal borer. The machine was programmed to dig deeper than the other infrastructure, which was marked accurately on the diagrams. Nobody noticed that, to improve drainage, the height of the pitch had been raised. Ouch! Another event in Glasgow city centre involved a water main. The excavations did not touch the pipe; however, it had corroded to the point where the weight of the earth above it was holding it together. The dig reduced the weight and the whole pipe let go in spectacular fashion.

pax britanica
16th Feb 2023, 12:07
Most companies of LH's size work extremely hard to ensure their networks have adequate diversity.

This often means literally going out and walking the paths of critical cables to ensure they really are separated when they enter your properties, and not two cables in one duct. All kinds of checking for common points of failure are routinely undertaken by mission-critical enterprises.

Telecoms plant is often owned or used by more than one company, so buying from company A doesn't mean they use a different network from company B. In this case it is unthinkable that they serve such a critical facility with just one cable. The reason I said the CEO should get the chop is that he has created or tolerated an obviously slack culture and attitude towards resilience, when IT is critical to all major airline ops. He gets a big bonus when they make a lot of money, but he does little or none of the work to earn that himself. Of course LH shouldn't run their own telecoms network in this case, but they are a big enough company in Germany to get T-Systems and other suppliers to do exactly what they want.
This is a big screw-up, so somebody big has to be responsible.

Less Hair
16th Feb 2023, 12:34
We don't know what their setup is and what exactly went wrong. It only broke down the next morning, after the cables had been chopped, when more data were being transferred during peak hours. I know of a major TV station that went black after power grid surges, when its computer went crazy and started constantly switching back and forth between the fluctuating grid and the local backup generator and power system. Even a working backup can leave you in trouble.
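That back-and-forth switching is a classic flapping failure. The usual mitigation is hysteresis, i.e. a hold-down timer so the controller commits to a source for a while before reconsidering. A minimal Python sketch of the idea; the thresholds and the history format are invented for illustration and are not taken from any real changeover controller:

```python
HOLD_DOWN_S = 30       # once switched, stay on the chosen source at least this long (assumed value)
UNSTABLE_LIMIT = 3     # consecutive bad grid readings before abandoning the grid (assumed value)

def choose_source(grid_ok_history, on_generator, last_switch, now):
    """Return True to run on the generator, False to run on the grid.
    The hold-down timer stops the controller flapping between the two sources."""
    if now - last_switch < HOLD_DOWN_S:
        return on_generator                               # too soon to change again
    recent = grid_ok_history[-UNSTABLE_LIMIT:]
    if not on_generator and recent.count(False) == UNSTABLE_LIMIT:
        return True                                       # grid persistently bad -> generator
    if on_generator and len(recent) == UNSTABLE_LIMIT and all(recent):
        return False                                      # grid persistently good -> back to grid
    return on_generator

# Example: the grid is fluctuating, but we only just switched, so we stay put.
print(choose_source([True, False, True], on_generator=True, last_switch=100.0, now=110.0))  # True
```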

B Fraser
16th Feb 2023, 15:05
It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement for diversity and separacy may have been delivered on day 1, but even if the second provider had used DTAG and pointed out the requirement, over time this can become compromised. Airports are prone to lots of construction, so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG into the ducts carrying system "B", all the eggs are in one basket. Along comes another civils job where the digger operator is gung-ho and the holes in the Swiss cheese all line up.

Teasing out who is to blame is far from easy.

n5296s
16th Feb 2023, 15:57
The trouble is that no matter how hard you try, you don't really know how this physical infrastructure is configured. You buy from two separate suppliers etc. You can still have bad surprises.

Many years ago (1980s) the Internet as it then was had coast-to-coast redundancy, Boston-Seattle and Washington DC-LA (or something like that). Guess what... a train derailed somewhere in the midwest and cut both of them.

That said, this case does seem especially egregious.

Klauss
17th Feb 2023, 04:00
Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables and don't go for a secondary connection via their own satellite dish, or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole, global business.

Imagegear
17th Feb 2023, 05:29
Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables and don't go for a secondary connection via their own satellite dish, or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole, global business.

Under normal circumstances, the critical failover IT infrastructure should be located some way away from the main centre, with continuous transaction replication and a short periodic heartbeat to verify that the connection has not been lost.
If a link is lost for whatever reason, the heartbeat will stop, and so, with no further information, the backup site should take over transaction processing and recover any in-process transactions from the failed pipeline. In a critical system you cannot wait for a system to tell you it is broken.
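As a rough illustration of that heartbeat-and-takeover idea, a minimal Python sketch follows. Everything in it, the host name, port, timeout and the promotion/replay stubs, is hypothetical; a real airline failover system would be far more involved:

```python
import socket
import time

HEARTBEAT_INTERVAL = 2.0   # seconds between probes (assumed value)
MISSED_LIMIT = 3           # consecutive missed heartbeats before declaring the primary dead

def primary_alive(host, port, timeout=1.0):
    """Return True if the primary site answers a TCP heartbeat probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    # Stand-in for switching transaction processing to the backup site.
    print("Standby site taking over transaction processing")

def replay_unacknowledged_transactions():
    # Stand-in for recovering in-process transactions from the replication pipeline.
    print("Recovering in-process transactions from the replication journal")

def monitor_and_failover(primary=("primary.example.invalid", 7000)):
    """Don't wait for the primary to report that it is broken: probe it,
    and fail over after MISSED_LIMIT consecutive silent intervals."""
    missed = 0
    while True:
        if primary_alive(*primary):
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_LIMIT:
                promote_standby()
                replay_unacknowledged_transactions()
                return
        time.sleep(HEARTBEAT_INTERVAL)

if __name__ == "__main__":
    monitor_and_failover()
```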

I am aware of an organisation that ran three separate recovery data centres, on three different continents. (Think flood, fire, war, cyber attack, pestilence, etc.)

If I were a shareholder, I would hold responsible the CIO of LH, who has been negligent in accepting that single point of failure, and by association the CEO and the CFO. It matters not whether the problem was there before their appointment; they did not all suddenly appear from the HR organisation.

IG

Gne
17th Feb 2023, 08:22
Exactly the same thing occurred 45 years ago in my old (major) control centre. A worker using a backhoe dug up and cut a bunch of cables just outside the centre, severing all telephone and radar lines (which were all underground at the time).
Same during my career:

On my first posting, of course, all coord comms to the adjacent civil centre were lost due to backhoe activity just inside the main gate of the base.
Some years later, as a SATCO, I had the main comms/coord fibre to the outside world cut during night flying by a D9 clearing scrub 15 km off base, dragging the fibre-optic cable to the point where it snapped. He was devastated when we located him and told him what he'd done.
The next night he was clearing another patch of scrub 50 km away and dragged the backup comms/coord fibre-optic cable to the point where it snapped. Turns out, in both cases, the contractor had not buried the cable at the required depth, thinking, no doubt, that it wouldn't matter.
Change of role: five years later, as a civvie ATM systems manager, I thought it prudent to check redundancy for the coord link between two major centres. The glossy schematics clearly showed diverse paths between the two MDFs, one terrestrial and the other in space. I found that the connection between the MDF and the antennas that bounce the signals to the satellite was in the same off-site duct as the terrestrial link. On the drive back to the airport to head home and try to resolve the problem, the senior tech manager and I saw a backhoe working within 2 m of the duct containing the cables!!

Don't talk to me about redundancy and diverse paths!!

Gne

Asturias56
17th Feb 2023, 08:23
One issue can be that multiple redundancy sounds great, but it brings greater complexity, especially when you are upgrading software, for example. You have to ensure ALL the links are working properly ALL the time. Easier to say than to achieve.

Imagegear
17th Feb 2023, 08:34
Right up until the point when the last transaction to the main host goes unprocessed, transactions will still be received by the different sites (Multi-Host File Sharing, MHFS).

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (previously with a number of major airlines and demonstrating it to clients), but the principles are the same. (Or should be.)
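For what it's worth, a toy Python illustration of that recovery order: replay the journalled transactions first, and only then resume normal traffic. The journal and traffic queues below are stand-ins; the real MHFS mechanics will obviously differ:

```python
from collections import deque

# Stand-ins for the comms processing units' journal of unconfirmed transactions
# and the live traffic feed arriving from the sites.
unacknowledged_journal = deque(["TXN-101", "TXN-102"])   # sent to the failed host, never confirmed
incoming_traffic = deque(["TXN-103", "TXN-104"])

def take_over_processing(journal, live):
    """On loss of the main host's heartbeat: replay the journalled
    transactions first, then resume processing normal traffic."""
    processed = []
    while journal:                      # recovery pass comes first
        processed.append(("replayed", journal.popleft()))
    while live:                         # only then handle new work
        processed.append(("normal", live.popleft()))
    return processed

print(take_over_processing(unacknowledged_journal, incoming_traffic))
# [('replayed', 'TXN-101'), ('replayed', 'TXN-102'), ('normal', 'TXN-103'), ('normal', 'TXN-104')]
```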

IG

MichaelKPIT
17th Feb 2023, 13:57
Aaaaand now the same thing has happened at LH's terminal in JFK:
https://www.cnbc.com/2023/02/17/jfk-airport-terminal-to-remain-shut-friday-after-power-outage.html

nomilk
17th Feb 2023, 17:29
Wasn't it a fire in terminal 1?

MichaelKPIT
17th Feb 2023, 18:59
Yes it was - it's just that another single point of failure (electrical panel) brought down the whole operation. Agree it's much harder to avoid - just a similar occurrence.

SMT Member
18th Feb 2023, 07:43
Interesting comments here, especially so on a board that professes to cater to professional aviators.

Some of the comments made here would make you believe that there are people who fully endorse the Qatar/Emirates way of working when something goes wrong: fire everybody involved and sort out what actually happened later.

But in reality, those same people are (rightfully) up in arms when QR/EK sack a crew whenever there's been a bit of an incident.

Not a pretty sight.

Asturias56
18th Feb 2023, 07:51
Right up until the point when the last transaction to the main host goes unprocessed, transactions will still be received by the different sites (Multi-Host File Sharing, MHFS).

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (previously with a number of major airlines and demonstrating it to clients), but the principles are the same. (Or should be.)

IG


I know. The problem is often that it's not "will pick up", it's "should have picked up". Unless you rigorously test by failing a feed on a regular basis, the various systems tend to diverge; often someone makes a small and seemingly inconsequential change to fix or upgrade something else, and what should be Duplicated now becomes merely Similar. It's especially the case when the system management is outsourced for a while: you lose that ownership of the complete system and its commercial importance.
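One way to catch that kind of drift before it matters is to compare the two sites' configurations mechanically on a schedule, rather than trusting the paperwork. A rough Python sketch, assuming each site's configuration can be exported as a simple key/value set (the keys and values below are invented):

```python
import hashlib
import json

def fingerprint(config):
    """Stable hash of a site's configuration, so 'Duplicated' can actually be verified."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def check_sites_match(primary_cfg, backup_cfg):
    """Raise if the primary and backup site configurations have drifted apart."""
    if fingerprint(primary_cfg) != fingerprint(backup_cfg):
        drift = {k for k in primary_cfg.keys() | backup_cfg.keys()
                 if primary_cfg.get(k) != backup_cfg.get(k)}
        raise RuntimeError(f"Sites have diverged on: {sorted(drift)}")

# Example: a 'small and seemingly inconsequential change' applied to one side only.
primary = {"os_patch": "2023-01", "db_version": "12.4", "failover_timeout_s": 6}
backup  = {"os_patch": "2023-01", "db_version": "12.3", "failover_timeout_s": 6}

try:
    check_sites_match(primary, backup)
except RuntimeError as err:
    print(err)   # Sites have diverged on: ['db_version']
```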

Blackfriar
19th Feb 2023, 06:28
It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement for diversity and separacy may have been delivered on day 1, but even if the second provider had used DTAG and pointed out the requirement, over time this can become compromised. Airports are prone to lots of construction, so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG into the ducts carrying system "B", all the eggs are in one basket. Along comes another civils job where the digger operator is gung-ho and the holes in the Swiss cheese all line up.

Teasing out who is to blame is far from easy.

I've designed many resilient data networks and have always gone right down to the detailed routeing, using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it's difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway, so you have vertical separation. The level of risk accepted by the client is based on the systems being supported and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was railside and hundreds of fibres and thousands of copper lines had been cut, so all that time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical; if it is critical infrastructure, you really need three separate routes and triple-redundant power, devices and everything else.
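The shared-duct trap can at least be checked mechanically once each route is expressed as an ordered list of duct or structure IDs, which is essentially what walking the route or tracing it on mapping gives you. A simplified Python sketch (the IDs are made up):

```python
def shared_points_of_failure(route_a, route_b):
    """Return the duct/structure IDs that both routes pass through."""
    return set(route_a) & set(route_b)

# Two nominally diverse circuits that quietly converge in one railside duct.
route_a = ["exchange-north", "duct-101", "duct-207", "bridge-east", "site-MDF"]
route_b = ["exchange-south", "duct-318", "duct-207", "road-west", "site-MDF"]

# The building's own MDF is an unavoidable common point, so exclude it from the check.
spofs = shared_points_of_failure(route_a, route_b) - {"site-MDF"}
if spofs:
    print(f"Routes are not truly diverse; shared elements: {sorted(spofs)}")
```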

Less Hair
19th Feb 2023, 06:58
Weight-constrained aircraft have triple redundancy built in, but airlines don't.

vikingivesterled
19th Feb 2023, 10:37
Frankfurt is home to a large internet exchange (once one of only five in Europe) where many different telco companies' cables meet in a single building, exchanging traffic with each other. Many companies' systems are housed in the same building for short access to that critical exchange. As I remember, that building has two cable entry points into ducts under two different roads.
An airline's systems might be redundant, with standbys in different buildings, but the question is always when to switch to the alternative. In this case the main system was still up and running. Airline systems are also very interconnected, so will you switch to the backup system if ops control in other parts of Lufthansa is running fine but check-in at a particular airport is not?

Check-in systems are also often airport-supplied and standardised so the desks can be used by many different airlines. Then it wouldn't be LH's system that was down, but the airport's connection to LH's main system, which is most likely in a different place altogether. The connection to and through the internet exchange would be the airport's responsibility; not having a manual backup system would be the airline's responsibility. But these days airport flows are so system-dependent that simple baggage handling couldn't be done manually without the systems working, and I do believe some planes took off from Frankfurt without baggage.

Imagegear
19th Feb 2023, 10:39
Should airlines be regulated in terms of safety in the same way that aircraft manufacturers are? After all, if an airline's systems fail, it is possible for deaths to occur.

IG

Less Hair
19th Feb 2023, 11:19
FWIW, the Russian hacker group Killnet has claimed responsibility. But they seem to claim "a lot" just to be sure. Still, several airport websites in Germany in fact WERE attacked by Killnet during the same period, albeit with much more simplistic DDoS attacks.

Deaf
19th Feb 2023, 13:04
I've designed many resilient data networks and have always gone right down to the detailed routeing, using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it's difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway, so you have vertical separation. The level of risk accepted by the client is based on the systems being supported and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was railside and hundreds of fibres and thousands of copper lines had been cut, so all that time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical; if it is critical infrastructure, you really need three separate routes and triple-redundant power, devices and everything else.

The question of where the lines actually go is fairly difficult to answer.

We mapped copper last-mile plant in several countries where we produced the equipment, both the hardware to/from MDFs, pillars etc. and the IT equipment to keep track of the findings. These showed that about 70% of lines actually went where they were shown on the plans. Anecdotally that's common, and the biggest cause was the rushed repair of a backhoe job, where the quickest fix was rerouting through alternate paths which may or may not rejoin the original route.

No long-term problem if the plans are amended accordingly, but that paperwork bit, which often comes at the end of a lengthy night in the rain, tends not to happen.

tdracer
19th Feb 2023, 18:36
It's not just communication cables. A couple years ago, the local natural gas utility came in and replaced all the underground gas lines in the development as 'preventative maintenance'.
It became pretty apparent as soon as they started digging up the street that the existing gas lines were not where they thought they were - often off by 10 meters or more. Needless to say, the local cul-de-sac was quite the mess by the time they finished digging.
At least we didn't find out about it the hard way - e.g. when some utility dig severed a gas line that they thought was 10 meters away...

paulross
20th Feb 2023, 13:51
It's not just communication cables. A couple years ago, the local natural gas utility came in and replaced all the underground gas lines in the development as 'preventative maintenance'.
It became pretty apparent as soon as they started digging up the street that the existing gas lines were not where they thought they were - often off by 10 meters or more. Needless to say, the local cul-de-sac was quite the mess by the time they finished digging.
At least we didn't find out about it the hard way - e.g. when some utility dig severed a gas line that they thought was 10 meters away...

As an example, this is what can happen when you drive a pile through a gas line (north Siberia):

https://cimg6.ibsrv.net/gimg/pprune.org-vbulletin/948x620/gaspipelinehit_e99bc0997cdfa4de4d79cedd0748511c658b5392.jpeg