Complete LH group meltdown

16th Feb 2023, 12:34

#21 (permalink)

Less Hair

Join Date: Feb 2010

Posts: 1,075

Likes: 119

Received 66 Likes on 40 Posts

We don't know what their setup is and what exactly went wrong. It only broke down on the next morning after the cables had been chopped when more data were transferred during peak time hours. I know a major TV station that went black after power grid surges when its computer went crazy and started to constantly switch back and forth between the fluctuating grid and the local backup generator and power system. Even a working backup can leave you in trouble.
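
One common guard against that kind of back-and-forth switching is a hold-off timer: fail over at once, but only return to the primary source after it has stayed healthy for a while. The sketch below is purely illustrative. The class name, the `hold_s` parameter and the source labels are invented here, and nothing is known about how the TV station's system actually worked.

```python
from typing import Optional


class TransferSwitch:
    """Hypothetical power transfer logic with a hold-off timer.

    Drops to the generator immediately when the grid is bad, but returns
    to the grid only after it has been continuously healthy for hold_s
    seconds, so a fluctuating grid cannot cause rapid flapping.
    """

    def __init__(self, hold_s: float) -> None:
        self.hold_s = hold_s
        self.source = "grid"
        # Time at which the grid most recently became healthy, if it is healthy.
        self._healthy_since: Optional[float] = None

    def update(self, now: float, grid_ok: bool) -> str:
        """Feed one grid reading taken at time `now`; return the active source."""
        if not grid_ok:
            # Any bad reading resets the hold-off and forces the generator.
            self._healthy_since = None
            self.source = "generator"
        else:
            if self._healthy_since is None:
                self._healthy_since = now
            stable_for = now - self._healthy_since
            if self.source == "generator" and stable_for >= self.hold_s:
                self.source = "grid"
        return self.source
```

With `hold_s=60`, a grid that flickers every few seconds keeps the load on the generator until the grid has been clean for a full minute.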

16th Feb 2023, 15:05

#22 (permalink)

B Fraser

Tabs please !

Join Date: Jun 2004

Location: Biffins Bridge

Posts: 950

Likes: 170

Received 327 Likes on 195 Posts

It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement to have diversity and separacy may have been delivered on day 1; however, if the second provider had used DTAG and pointed out the requirement, over time this can become compromised. Airports are prone to lots of construction, so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG to the ducts carrying system "B", all the eggs are in one basket. Along comes another civils work where the digger operator is gung ho, and the holes in the Swiss cheese all line up.

Teasing out who is to blame is far from easy.

16th Feb 2023, 15:57

#23 (permalink)

n5296s

Join Date: Jun 2003

Location: LFMD

Posts: 749

Likes: 0

Received 7 Likes on 4 Posts

The trouble is that no matter how hard you try, you don't really know how this physical infrastructure is configured. You buy from two separate suppliers etc. You can still have bad surprises.

Many years ago (1980s) the Internet as it then was had coast-to-coast redundancy, Boston-Seattle and Washington DC-LA (or something like that). Guess what... a train derailed somewhere in the midwest and cut both of them.

That said, this case does seem especially egregious.

17th Feb 2023, 04:00

#24 (permalink)

Klauss

Join Date: Nov 2003

Location: Germany

Posts: 137

Likes: 0

Received 0 Likes on 0 Posts

What about wireless ?

Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables, and don't go for a secondary connection via their own satellite dish or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole global business.

17th Feb 2023, 05:29

#25 (permalink)

Imagegear

Join Date: Nov 2018

Location: back out to Grasse

Posts: 557

Likes: 1

Received 28 Likes on 12 Posts

Quote:

Originally Posted by Klauss

Under normal circumstances, the critical failover IT infrastructure should have been located some way away from the main centre, with continuous transaction replication and a short periodic heartbeat to verify that a connection has not been lost.
If a link is lost for whatever reason, the heartbeat will stop and so, with no further information, the backup site should take over transaction processing and recover any in-process transactions from the failed pipeline. In a critical system you cannot wait for a system to tell you it is broken.
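
The heartbeat idea can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not how any airline system is built: `HeartbeatMonitor`, `choose_active_site` and the timeout value are hypothetical names and numbers, and a real design would also have to handle replication and transaction recovery.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical watcher that records heartbeats from the primary site."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Call whenever a heartbeat arrives from the primary."""
        self._last_beat = self._clock()

    def primary_lost(self) -> bool:
        """True once no heartbeat has been seen for longer than the timeout."""
        return self._clock() - self._last_beat > self.timeout_s


def choose_active_site(monitor: HeartbeatMonitor) -> str:
    # The backup acts on silence alone; it does not wait for the primary
    # to report its own failure.
    return "backup" if monitor.primary_lost() else "primary"
```

The clock is injectable so the timeout logic can be exercised without waiting in real time. The point of the design is that the decision is driven by the absence of a signal, not by an error message from the failed side.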

I am aware of an organisation who ran three separate recovery data centres, on three different continents. (Think Flood, Fire, War, cyber attack, pestilence, etc.)

If I were a shareholder, I would hold responsible the CIO of LH, who has been negligent in accepting that single point of failure, and by association the CEO and the CFO. It matters not whether the problem was there before their appointment; they did not all suddenly appear from the HR organisation.

IG

17th Feb 2023, 08:22

#26 (permalink)

Gne

Join Date: Jan 2006

Location: With the Wizard

Posts: 188

Likes: 103

Received 54 Likes on 27 Posts

Quote:

Originally Posted by ATC Watcher

Exactly the same thing occurred 45 years ago in my old (major) control centre. A worker using a backhoe dug up and cut a bunch of cables just outside the centre, severing all telephone and radar lines (that were all underground at the time).

Same during my career:

On my first posting off course, all coord comms to the adjacent civil centre were lost due to backhoe activity just inside the main gate of the base.
Some years later, as a SATCO, I had the main comms/coord fibre to the outside world cut during night flying by a D9 clearing scrub 15 km off base, which dragged a fibre-optic cable to the point it snapped. The operator was devastated when we located him and told him what he'd done.
Next night he was clearing another patch of scrub 50 km away and dragged the backup comms/coord fibre-optic cable to the point it snapped. Turns out, in both cases, the contractor had not buried the cable at the required depth, thinking, no doubt, that it wouldn't matter.
Change of role: five years later, as a civvie ATM systems manager, I thought it prudent to check redundancy for the coord link between two major centres. The glossy schematics clearly showed diverse paths between the two MDFs, one terrestrial and the other in space. I found that the connection between the MDF and the antennas to bounce the signals to the satellite was in the same off-site duct as the terrestrial link. On the drive back to the airport to head home and try to resolve the problem, the senior tech manager and I saw a backhoe working within 2 m of the duct containing the cables!!

Don't talk to me about redundancy and diverse paths!!

Gne

17th Feb 2023, 08:23

#27 (permalink)

Asturias56

Join Date: Oct 2018

Location: Ferrara

Posts: 8,409

Likes: 433

Received 361 Likes on 210 Posts

One issue can be that multiple redundancy sounds great but it brings greater complexity - especially when you are upgrading software, for example. You have to ensure ALL the links are working properly ALL the time. Easier to say than to achieve.

17th Feb 2023, 08:34

#28 (permalink)

Imagegear

Join Date: Nov 2018

Location: back out to Grasse

Posts: 557

Likes: 1

Received 28 Likes on 12 Posts

Right up until the point when the last transaction to the main host is not processed, transactions will be received by the different sites. (Multi-Host File Sharing - MHFS)

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the Comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (With a number of major airlines and demonstrating to clients), but the principles are the same. (Or should be).

IG

17th Feb 2023, 13:57

#29 (permalink)

MichaelKPIT

Join Date: Jun 2013

Location: Pittsburgh, PA

Posts: 112

Likes: 0

Received 0 Likes on 0 Posts

Aaaaand now the same thing has happened at LH's terminal in JFK:
https://www.cnbc.com/2023/02/17/jfk-...er-outage.html

17th Feb 2023, 17:29

#30 (permalink)

nomilk

Join Date: May 2021

Location: An Island

Posts: 92

Likes: 72

Received 24 Likes on 11 Posts

Wasn't it a fire in terminal 1?

17th Feb 2023, 18:59

#31 (permalink)

MichaelKPIT

Join Date: Jun 2013

Location: Pittsburgh, PA

Posts: 112

Likes: 0

Received 0 Likes on 0 Posts

Yes it was - it's just that another single point of failure (electrical panel) brought down the whole operation. Agree it's much harder to avoid - just a similar occurrence.

18th Feb 2023, 07:43

#32 (permalink)

SMT Member

Join Date: May 2008

Location: Europe

Age: 45

Posts: 625

Likes: 1

Received 0 Likes on 0 Posts

Interesting comments here, especially so on a board that professes to cater to professional aviators.

Some of the comments made here would make you believe that there are people who fully endorse the Qatar/Emirates way of working when something goes wrong: fire everybody involved and sort out what actually happened later.

But in reality, those same people are (rightfully) up in arms when QR/EK sacks a crew whenever there’s been a bit of an incident.

Not a pretty sight.

18th Feb 2023, 07:51

#33 (permalink)

Asturias56

Join Date: Oct 2018

Location: Ferrara

Posts: 8,409

Likes: 433

Received 361 Likes on 210 Posts

Quote:

Originally Posted by Imagegear

I know the problem is often it's not "will pick up" - it's "should have picked up". Unless you rigorously test by failing a feed on a regular basis, the various systems tend to diverge - often someone makes a small and seemingly inconsequential change to fix/upgrade something else, and what should be Duplicated now becomes Similar. It's especially the case when the system management is outsourced for a while - you lose that ownership of the complete system and its commercial importance.
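
A cheap complement to real failover tests (not a substitute for them) is to compare the settings of the primary and the standby automatically, so that "Duplicated" turning into "Similar" gets flagged early. The sketch below assumes each system's settings can be exported as a flat dictionary; the function name and the example keys are made up for illustration.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a key present on one side only


def config_drift(primary: Dict[str, Any],
                 standby: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every key whose value differs, as (primary, standby) pairs."""
    drift = {}
    for key in sorted(primary.keys() | standby.keys()):
        p = primary.get(key, MISSING)
        s = standby.get(key, MISSING)
        if p != s:
            drift[key] = (p, s)
    return drift


# Hypothetical exports from the two systems.
primary_cfg = {"version": "4.2", "mtu": 1500, "route": "A"}
standby_cfg = {"version": "4.1", "mtu": 1500}

for key, (p, s) in config_drift(primary_cfg, standby_cfg).items():
    print(f"{key}: primary={p} standby={s}")
```

An empty result only shows that the exported settings match. It says nothing about whether the standby would actually pick up the load, which is why failing a feed on purpose still matters.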

19th Feb 2023, 06:28

#34 (permalink)

Blackfriar

Join Date: Nov 2013

Location: Somerset

Posts: 182

Likes: 15

Received 1 Like on 1 Post

Quote:

Originally Posted by B Fraser

I’ve designed many resilient data networks and have always gone right down to the detailed routeing using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it’s difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway so you have vertical separation. The level of risk accepted by the client is based on the systems that it is supporting and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was rail side and hundreds of fibres and thousands of copper lines had been cut, so all the time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical, so if it’s critical infrastructure, you really need three separate routes and triple redundant power, devices and everything else.
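
Once each route has been traced down to named physical segments (ducts, bridges, cabinets, buildings), the check for single points of failure is just a set intersection. The sketch below assumes that tracing has already been done by hand or from carrier records; the segment names are invented examples.

```python
from typing import Iterable, Set


def shared_segments(route_a: Iterable[str], route_b: Iterable[str]) -> Set[str]:
    """Return the physical segments that appear on both routes."""
    return set(route_a) & set(route_b)


# Hypothetical segment IDs for two supposedly diverse routes.
route_a = ["duct-17", "rail-bridge-3", "exchange-east"]
route_b = ["duct-42", "rail-bridge-3", "exchange-west"]

common = shared_segments(route_a, route_b)
if common:
    print("Not fully diverse, shared segments:", sorted(common))
else:
    print("No shared segments found")
```

The result is only as good as the route data fed into it. If the records no longer match what is in the ground, the check will pass while the risk remains.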

19th Feb 2023, 06:58

#35 (permalink)

Less Hair

Join Date: Feb 2010

Posts: 1,075

Likes: 119

Received 66 Likes on 40 Posts

Weight-conscious aircraft have triple redundancy built in, but airlines don't.

19th Feb 2023, 10:37

#36 (permalink)

vikingivesterled

Join Date: Aug 2007

Location: Ireland

Posts: 216

Likes: 0

Received 0 Likes on 0 Posts

Frankfurt is home to a large internet exchange (once 1 of only 5 in Europe) where many different telcos' cables meet in a single building, exchanging traffic with each other. Many companies' systems are housed in the same building for short access to that critical exchange. As I remember, that building has 2 cable entrance points to ducts under 2 different roads.
An airline's systems might be redundant with standbys in different buildings, but the question is always when do you switch to the alternative. In this case the main system was still up and running. Airline systems are also very interconnected, so will you switch to the backup system if ops control in other parts of Lufthansa is running fine but checkin at a particular airport is not?

Checkin systems are also often airport supplied and standardised so the desks can be used by many different airlines. Then it wouldn't be LH's system that was down but the airport's connection to LH's main system, which is most likely in a different place altogether. And if so, connection to and through the internet exchange would be the airport's responsibility. It would be the airline's responsibility not to have a manual backup system. But these days airport flows are so system dependent that, for example, baggage handling couldn't be done manually without systems working, and I do believe some planes took off from Frankfurt without baggage.

19th Feb 2023, 10:39

#37 (permalink)

Imagegear

Join Date: Nov 2018

Location: back out to Grasse

Posts: 557

Likes: 1

Received 28 Likes on 12 Posts

Should airlines be regulated in terms of safety in the same way that aircraft manufacturers are? After all, if an airline's systems fail, it is possible for deaths to occur.

IG

19th Feb 2023, 11:19

#38 (permalink)

Less Hair

Join Date: Feb 2010

Posts: 1,075

Likes: 119

Received 66 Likes on 40 Posts

FWIW Russian hacker group Killnet has claimed to be responsible. But they seem to claim "a lot" just to be sure. Still, several airport sites in Germany in fact WERE attacked by Killnet during the same time, though with much more simplistic DDoS attacks.

19th Feb 2023, 13:04

#39 (permalink)

Deaf

Join Date: Nov 2000

Location: Melbourne,Vic,Australia

Posts: 423

Likes: 0

Received 2 Likes on 2 Posts

Quote:

Originally Posted by Blackfriar

The question of where do the lines actually go is fairly difficult to determine.

We mapped copper last-mile stuff in several countries where we produced the equipment, both hardware to/from MDF, pillars etc. and IT equipment to keep track of the findings. These showed about 70% actually went where they were shown on the plans. Anecdotally that's common, and the biggest cause was rushed repair of a backhoe job, where the quickest way was rerouting through alternate routes which may or may not rejoin the original route.

No long-term problem if the plans are amended accordingly, but that paperwork bit, which is often at the end of a lengthy night in the rain, tends not to happen.

19th Feb 2023, 18:36

#40 (permalink)

tdracer

Join Date: Jul 2013

Location: Everett, WA

Age: 68

Posts: 4,408

Likes: 97

Received 180 Likes on 88 Posts

It's not just communication cables. A couple years ago, the local natural gas utility came in and replaced all the underground gas lines in the development as 'preventative maintenance'.
It became pretty apparent as soon as they started digging up the street that the existing gas lines were not where they thought they were - often off by 10 meters or more. Needless to say, the local cul-de-sac was quite the mess by the time they finished digging.
At least we didn't find out about it the hard way - e.g. when some utility dig severed a gas line that they thought was 10 meters away...
