
Complete LH group meltdown

Old 16th Feb 2023, 12:34
  #21 (permalink)  
 
Join Date: Feb 2010
Posts: 1,075
Received 66 Likes on 40 Posts
We don't know what their setup is and what exactly went wrong. It only broke down on the next morning after the cables had been chopped when more data were transferred during peak time hours. I know a major TV station that went black after power grid surges when its computer went crazy and started to constantly switch back and forth between the fluctuating grid and the local backup generator and power system. Even a working backup can leave you in trouble.
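The flapping described above (switching back and forth between a fluctuating grid and the generator) is the classic case for a hold-off timer. Below is a minimal sketch of that idea, not a description of how the TV station's or anyone else's switchgear actually works. The class name `TransferSwitch`, the 30-second threshold and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferSwitch:
    """Toy transfer-switch logic with a hold-off timer (all values hypothetical)."""
    min_stable_s: float = 30.0            # grid must stay healthy this long before switching back
    source: str = "grid"                  # currently selected source
    _healthy_since: Optional[float] = None

    def update(self, now: float, grid_ok: bool) -> str:
        if not grid_ok:
            # Any grid fault: go to (or stay on) the generator and reset the timer.
            self.source = "generator"
            self._healthy_since = None
        elif self.source == "generator":
            # Grid looks fine again: start or continue the stability timer.
            if self._healthy_since is None:
                self._healthy_since = now
            if now - self._healthy_since >= self.min_stable_s:
                self.source = "grid"
                self._healthy_since = None
        return self.source


def count_switches(samples, switch):
    """Count how often the selected source changes over (time, grid_ok) samples."""
    changes, last = 0, switch.source
    for now, grid_ok in samples:
        current = switch.update(now, grid_ok)
        if current != last:
            changes += 1
            last = current
    return changes
```

Feeding both a zero-delay switch and a 30-second one the same flickering grid signal shows the difference: the first follows every flicker, the second moves to the generator once and returns only after the grid has stayed healthy for the whole hold-off period.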
Less Hair is offline  
Old 16th Feb 2023, 15:05
  #22 (permalink)  
Tabs please !
 
Join Date: Jun 2004
Location: Biffins Bridge
Posts: 950
Received 327 Likes on 195 Posts
It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement to have diversity and separacy may have been delivered on day 1; however, if the second provider had used DTAG and pointed out the requirement, over time this can become compromised. Airports are prone to lots of construction, so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG to the ducts carrying system "B", all the eggs are in one basket. Along comes another civils work where the digger operator is gung ho, and the holes in the Swiss cheese all line up.

Teasing out who is to blame is far from easy.
B Fraser is online now  
Old 16th Feb 2023, 15:57
  #23 (permalink)  
 
Join Date: Jun 2003
Location: LFMD
Posts: 749
Likes: 0
Received 7 Likes on 4 Posts
The trouble is that no matter how hard you try, you don't really know how this physical infrastructure is configured. You buy from two separate suppliers etc. You can still have bad surprises.

Many years ago (1980s) the Internet as it then was had coast-to-coast redundancy, Boston-Seattle and Washington DC-LA (or something like that). Guess what... a train derailed somewhere in the midwest and cut both of them.

That said, this case does seem especially egregious.
n5296s is offline  
Old 17th Feb 2023, 04:00
  #24 (permalink)  
 
Join Date: Nov 2003
Location: Germany
Posts: 137
Likes: 0
Received 0 Likes on 0 Posts
What about wireless ?

Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables, and don't go for a secondary connection via their own satellite dish or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole global business.
Klauss is offline  
Old 17th Feb 2023, 05:29
  #25 (permalink)  
 
Join Date: Nov 2018
Location: back out to Grasse
Posts: 557
Received 28 Likes on 12 Posts
Originally Posted by Klauss
Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables, and don't go for a secondary connection via their own satellite dish or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole global business.
Under normal circumstances, the critical failover IT infrastructure should have been located some way away from the main centre, with continuous transaction replication and a short periodic heartbeat to verify that the connection has not been lost.
If a link is lost for whatever reason, the heartbeat will stop, and with no further information the backup site should take over transaction processing and recover any in-process transactions from the failed pipeline. In a critical system you cannot wait for a system to tell you it is broken.
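The heartbeat principle can be sketched in a few lines. This is only an illustration under stated assumptions, not any airline's actual failover logic: the class name `HeartbeatMonitor`, the 3-second timeout and the "primary"/"backup" labels are made up for the example, and recovery of in-process transactions is left out.

```python
import time


class HeartbeatMonitor:
    """Backup-side watchdog: silence from the primary is treated as failure."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # hypothetical limit on heartbeat silence
        self.clock = clock              # injectable clock so the logic can be tested
        self.last_beat = clock()
        self.active = "primary"

    def beat(self):
        """Call whenever a heartbeat arrives from the primary site."""
        self.last_beat = self.clock()

    def check(self):
        """Call periodically on the backup site; returns the site that should process."""
        silent_for = self.clock() - self.last_beat
        if self.active == "primary" and silent_for > self.timeout_s:
            # No news is bad news: take over without waiting for an error report.
            self.active = "backup"
        return self.active
```

Note the design choice that the monitor never switches back on its own once the backup has taken over. Returning to the primary is left as a deliberate, separate step, which avoids the two sites trading control back and forth over a marginal link.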

I am aware of an organisation who ran three separate recovery data centres, on three different continents. (Think Flood, Fire, War, cyber attack, pestilence, etc.)

If I were a shareholder, I would hold responsible the CIO of LH, who has been negligent in accepting that single point of failure, and by association the CEO and the CFO. It matters not whether the problem was there before their appointment; they did not all suddenly appear from the HR organisation.

IG
Imagegear is offline  
Old 17th Feb 2023, 08:22
  #26 (permalink)  
Gne
 
Join Date: Jan 2006
Location: With the Wizard
Posts: 188
Received 54 Likes on 27 Posts
Originally Posted by ATC Watcher
Exactly the same thing occurred 45 years ago in my old (major) control centre. A worker using a backhoe dug up and cut a bunch of cables just outside the centre, severing all telephone and radar lines (that were all underground at the time).
Same during my career:
  • On my first posting off course, all coord comms to the adjacent civil centre were lost due to backhoe activity just inside the main gate of the base.
  • Some years later, as a SATCO, I had the main comms/coord fibre to the outside world cut during night flying by a D9 clearing scrub 15 km off base; it dragged the fibre-optic cable until it snapped. The operator was devastated when we located him and told him what he'd done.
  • Next night he was clearing another patch of scrub 50 km away and dragged the backup comms/coord fibre-optic cable until it snapped. It turned out that, in both cases, the contractor had not buried the cable at the required depth, thinking, no doubt, that it wouldn't matter.
  • Change of role: five years later, as a civvie ATM systems manager, I thought it prudent to check redundancy for the coord link between two major centres. The glossy schematics clearly showed diverse paths between the two MDFs, one terrestrial and the other in space. I found that the connection between the MDF and the antennas that bounce the signals to the satellite was in the same off-site duct as the terrestrial link. On the drive back to the airport to head home and try to resolve the problem, the senior tech manager and I saw a backhoe working within 2 m of the duct containing the cables!!
Don't talk to me about redundancy and diverse paths!!
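The last case above (two "diverse" paths sharing one off-site duct) is the kind of thing a simple check on surveyed route data can flag, provided the data reflects what is actually in the ground. A minimal sketch follows, assuming each path is recorded as a list of physical segment IDs; the function name and the IDs such as `duct-7` are invented for illustration.

```python
def shared_segments(path_a, path_b, endpoints=()):
    """Return physical segments used by both paths, ignoring the unavoidable endpoints."""
    common = set(path_a) & set(path_b)
    return sorted(common - set(endpoints))


# Hypothetical surveyed routes between two MDFs (IDs are made up for the example).
terrestrial = ["MDF-A", "duct-7", "exchange-1", "MDF-B"]
satellite = ["MDF-A", "duct-7", "dish-A", "satellite", "dish-B", "MDF-B"]

print(shared_segments(terrestrial, satellite, endpoints={"MDF-A", "MDF-B"}))
```

Any non-empty result is a shared point of failure. The check is only as good as the survey behind it, which is the whole problem the stories above describe.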

Gne
Gne is offline  
Old 17th Feb 2023, 08:23
  #27 (permalink)  
 
Join Date: Oct 2018
Location: Ferrara
Posts: 8,409
Received 361 Likes on 210 Posts
One issue can be that multiple redundancy sounds great but it brings greater complexity, especially when you are upgrading software, for example. You have to ensure ALL the links are working properly ALL the time. Easier to say than to achieve.
Asturias56 is offline  
Old 17th Feb 2023, 08:34
  #28 (permalink)  
 
Join Date: Nov 2018
Location: back out to Grasse
Posts: 557
Received 28 Likes on 12 Posts
Right up until the point when the last transaction to the main host is not processed, transactions will be received by the different sites. (Multi-Host File Sharing - MHFS)

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the Comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (With a number of major airlines and demonstrating to clients), but the principles are the same. (Or should be).

IG
Imagegear is offline  
Old 17th Feb 2023, 13:57
  #29 (permalink)  
 
Join Date: Jun 2013
Location: Pittsburgh, PA
Posts: 112
Likes: 0
Received 0 Likes on 0 Posts
Aaaaand now the same thing has happened at LH's terminal in JFK:
https://www.cnbc.com/2023/02/17/jfk-...er-outage.html
MichaelKPIT is offline  
Old 17th Feb 2023, 17:29
  #30 (permalink)  
 
Join Date: May 2021
Location: An Island
Posts: 92
Received 24 Likes on 11 Posts
Wasn't it a fire in terminal 1?
nomilk is offline  
Old 17th Feb 2023, 18:59
  #31 (permalink)  
 
Join Date: Jun 2013
Location: Pittsburgh, PA
Posts: 112
Likes: 0
Received 0 Likes on 0 Posts
Yes it was - it's just that another single point of failure (electrical panel) brought down the whole operation. Agree it's much harder to avoid - just a similar occurrence.
MichaelKPIT is offline  
Old 18th Feb 2023, 07:43
  #32 (permalink)  
 
Join Date: May 2008
Location: Europe
Age: 45
Posts: 625
Received 0 Likes on 0 Posts
Interesting comments here, especially so on a board that professes to cater to professional aviators.

Some of the comments made here would make you believe that there are people who fully endorse the Qatar/Emirates way of working when something goes wrong: fire everybody involved and sort out what actually happened later.

But in reality, those same people are (rightfully) up in arms when QR/EK sacks a crew whenever there’s been a bit of an incident.

Not a pretty sight.
SMT Member is offline  
Old 18th Feb 2023, 07:51
  #33 (permalink)  
 
Join Date: Oct 2018
Location: Ferrara
Posts: 8,409
Received 361 Likes on 210 Posts
Originally Posted by Imagegear
Right up until the point when the last transaction to the main host is not processed, transactions will be received by the different sites. (Multi-Host File Sharing - MHFS)

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the Comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (With a number of major airlines and demonstrating to clients), but the principles are the same. (Or should be).

IG

I know the problem is often it's not "will pick up", it's "should have picked up". Unless you rigorously test by failing a feed on a regular basis, the various systems tend to diverge. Often someone makes a small and seemingly inconsequential change to fix or upgrade something else, and what should be Duplicated now becomes Similar. It's especially the case when the system management is outsourced for a while: you lose that ownership of the complete system and its commercial importance.
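One cheap complement to failing a feed on purpose (not a substitute for it) is to compare the configuration of the sites routinely, so that "Duplicated" turning into "Similar" gets noticed. A minimal sketch, assuming each site's settings can be exported as a flat dictionary; the keys and values shown are hypothetical.

```python
def config_drift(primary, standby, ignore=()):
    """Return {key: (primary_value, standby_value)} for every setting that differs."""
    keys = (set(primary) | set(standby)) - set(ignore)
    return {
        key: (primary.get(key), standby.get(key))
        for key in sorted(keys)
        if primary.get(key) != standby.get(key)
    }


# Hypothetical exported settings; hostname is expected to differ, so it is ignored.
primary = {"app_version": "4.2.1", "db_schema": 17, "hostname": "site-a"}
standby = {"app_version": "4.2.0", "db_schema": 17, "hostname": "site-b"}

print(config_drift(primary, standby, ignore={"hostname"}))
```

A missing key shows up as `None` on the side that lacks it, so a setting added to one site only is reported as well.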
Asturias56 is offline  
Old 19th Feb 2023, 06:28
  #34 (permalink)  
 
Join Date: Nov 2013
Location: Somerset
Posts: 182
Received 1 Like on 1 Post
Originally Posted by B Fraser
It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement to have diversity and separacy may have been delivered on day 1 however if the second provider had used DTAG and pointed out the requirement, over time this can become compromised. Airports are prone to lots of construction so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG to the ducts carrying system "B", all the eggs are in one basket. Along comes another civils work where the digger operator is gung ho and the holes in the swiss cheese all line up.

Teasing out who is to blame is far from easy.
I’ve designed many resilient data networks and have always gone right down to the detailed routeing using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it’s difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway so you have vertical separation. The level of risk accepted by the client is based on the systems that it is supporting and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was rail side and hundreds of fibres and thousands of copper lines had been cut, so all the time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical, so if it’s critical infrastructure, you really need three separate routes and triple redundant power, devices and everything else.
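The step from two routes to three can be put in rough numbers. The sketch below uses the textbook formula for parallel redundancy and assumes that route failures are independent, which is precisely the assumption a shared duct or bridge breaks. The 99% per-route figure is made up for illustration and is not a measured value.

```python
HOURS_PER_YEAR = 8760


def combined_availability(per_route, routes):
    """Availability of `routes` parallel routes, ASSUMING independent failures."""
    return 1 - (1 - per_route) ** routes


def expected_downtime_hours(availability):
    """Expected hours per year with no working route at all."""
    return (1 - availability) * HOURS_PER_YEAR


for n in (1, 2, 3):
    a = combined_availability(0.99, n)   # 0.99 is a hypothetical per-route figure
    print(n, "route(s):", round(expected_downtime_hours(a), 3), "hours/year down")
```

It also shows why a repair lasting weeks matters: for that whole period the two-route design behaves like the single-route case.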
Blackfriar is offline  
Old 19th Feb 2023, 06:58
  #35 (permalink)  
 
Join Date: Feb 2010
Posts: 1,075
Received 66 Likes on 40 Posts
Weight-conscious aircraft have triple redundancy built in, but airlines don't.
Less Hair is offline  
Old 19th Feb 2023, 10:37
  #36 (permalink)  
 
Join Date: Aug 2007
Location: Ireland
Posts: 216
Likes: 0
Received 0 Likes on 0 Posts
Frankfurt is home to a large internet exchange (once one of only five in Europe) where many different telcos' cables meet in a single building, exchanging traffic with each other. Many companies' systems are housed in the same building for short access to that critical exchange. As I remember, that building has two cable entrance points, to ducts under two different roads.
An airline's systems might be redundant, with standbys in different buildings, but the question is always when you switch to the alternative. In this case the main system was still up and running. Airline systems are also very interconnected, so will you switch to the backup system if ops control in other parts of Lufthansa is running fine but check-in at a particular airport is not?

Check-in systems are also often airport-supplied and standardised so the desks can be used by many different airlines. Then it wouldn't be LH's system that was down but the airport's connection to LH's main system, which is most likely in a different place altogether. And the connection to and through the internet exchange would be the airport's responsibility. Not having a manual backup system would be the airline's responsibility. But these days airport flows are so system-dependent that simple baggage handling couldn't be done manually without systems working, and I do believe some planes took off from Frankfurt without baggage.
vikingivesterled is offline  
Old 19th Feb 2023, 10:39
  #37 (permalink)  
 
Join Date: Nov 2018
Location: back out to Grasse
Posts: 557
Received 28 Likes on 12 Posts
Should airlines be regulated in terms of safety in the same way that aircraft manufacturers are? After all, if an airline's systems fail, it is possible for deaths to occur.

IG
Imagegear is offline  
Old 19th Feb 2023, 11:19
  #38 (permalink)  
 
Join Date: Feb 2010
Posts: 1,075
Received 66 Likes on 40 Posts
FWIW, Russian hacker group Killnet has claimed to be responsible. But they seem to claim "a lot" just to be sure. Still, several airport sites in Germany WERE in fact attacked by Killnet during the same period, though with much more simplistic DDoS attacks.
Less Hair is offline  
Old 19th Feb 2023, 13:04
  #39 (permalink)  
 
Join Date: Nov 2000
Location: Melbourne,Vic,Australia
Posts: 423
Likes: 0
Received 2 Likes on 2 Posts
Originally Posted by Blackfriar
I’ve designed many resilient data networks and have always gone right down to the detailed routeing using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it’s difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway so you have vertical separation. The level of risk accepted by the client is based on the systems that it is supporting and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was rail side and hundreds of fibres and thousands of copper lines had been cut, so all the time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical, so if it’s critical infrastructure, you really need three separate routes and triple redundant power, devices and everything else.
The question of where do the lines actually go is fairly difficult to determine.

We mapped copper last-mile stuff in several countries, where we produced the equipment: both hardware to/from MDFs, pillars etc. and IT equipment to keep track of the findings. These showed about 70% actually went where they were shown on the plans. Anecdotally that's common, and the biggest cause was rushed repair of a backhoe job, where the quickest way was rerouting through alternate routes which may or may not rejoin the original route.

No long-term problem if the plans are amended accordingly, but that paperwork bit, which is often at the end of a lengthy night in the rain, tends not to happen.
Deaf is offline  
Old 19th Feb 2023, 18:36
  #40 (permalink)  
 
Join Date: Jul 2013
Location: Everett, WA
Age: 68
Posts: 4,408
Received 180 Likes on 88 Posts
It's not just communication cables. A couple years ago, the local natural gas utility came in and replaced all the underground gas lines in the development as 'preventative maintenance'.
It became pretty apparent as soon as they started digging up the street that the existing gas lines were not where they thought they were - often off by 10 meters or more. Needless to say, the local cul-de-sac was quite the mess by the time they finished digging.
At least we didn't find out about it the hard way - e.g. when some utility dig severed a gas line that they thought was 10 meters away...
tdracer is offline  



Copyright © 2024 MH Sub I, LLC dba Internet Brands. All rights reserved. Use of this site indicates your consent to the Terms of Use.