PPRuNe Forums (https://www.pprune.org/)
-   Rumours & News (https://www.pprune.org/rumours-news-13/)
-   -   Complete LH group meltdown (https://www.pprune.org/rumours-news/651369-complete-lh-group-meltdown.html)

Less Hair 16th Feb 2023 12:34

We don't know what their setup is or what exactly went wrong. It only broke down the morning after the cables had been chopped, once more data were being transferred during peak hours. I know a major TV station that went black after power grid surges: its computer went crazy and kept switching back and forth between the fluctuating grid and the local backup generator and power system. Even a working backup can leave you in trouble.

B Fraser 16th Feb 2023 15:05

It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement for diversity and separacy may have been delivered on day 1; however, if the second provider rides on DTAG infrastructure, this can become compromised over time even when the requirement has been pointed out. Airports are prone to lots of construction, so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG into the ducts carrying system "B", all the eggs are in one basket. Along comes another civils job where the digger operator is gung-ho, and the holes in the swiss cheese all line up.

Teasing out who is to blame is far from easy.

n5296s 16th Feb 2023 15:57

The trouble is that no matter how hard you try, you don't really know how this physical infrastructure is configured. You buy from two separate suppliers etc. You can still have bad surprises.

Many years ago (1980s), the Internet as it then was had coast-to-coast redundancy: Boston-Seattle and Washington DC-LA (or something like that). Guess what... a train derailed somewhere in the Midwest and cut both of them.

That said, this case does seem especially egregious.

Klauss 17th Feb 2023 04:00

What about wireless?

Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables and don't go for a secondary connection via their own satellite dish, or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole global business.

Imagegear 17th Feb 2023 05:29


Originally Posted by Klauss (Post 11387151)
Yes, LH certainly has huge data needs. I am wondering why they rely on one or two cables and don't go for a secondary connection via their own satellite dish, or at least a terrestrial wireless hookup...
What happened is something that should not be possible: a single point of failure for the whole global business.

Under normal circumstances, the critical failover IT infrastructure should have been located some way away from the main centre, with continuous transaction replication and a short periodic heartbeat to verify that the connection has not been lost.
If a link is lost for whatever reason, the heartbeat stops, and so, with no further information, the backup site should take over transaction processing and recover any in-process transactions from the failed pipeline. In a critical system you cannot wait for the system to tell you it is broken.
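
A minimal sketch of that heartbeat-driven takeover (the hook names receive_heartbeat, recover_in_flight and take_over are hypothetical placeholders, not any airline's actual system):

    HEARTBEAT_INTERVAL = 2.0   # seconds between expected beats from the primary
    MISSED_LIMIT = 3           # tolerate brief network blips before declaring failure

    def monitor_primary(receive_heartbeat, recover_in_flight, take_over):
        """Standby site: watch the primary's heartbeat, take over when it goes silent."""
        missed = 0
        while True:
            beat = receive_heartbeat(timeout=HEARTBEAT_INTERVAL)  # None on timeout
            if beat is not None:
                missed = 0
                continue
            missed += 1
            if missed >= MISSED_LIMIT:
                # A dead primary sends nothing further, so the decision is made
                # on the absence of the signal: replay what was still in the
                # replication pipeline, then start processing at the backup.
                recover_in_flight()
                take_over()
                return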

I am aware of an organisation that ran three separate recovery data centres, on three different continents. (Think flood, fire, war, cyber attack, pestilence, etc.)

If I were a shareholder, I would hold the CIO of LH responsible for accepting that single point of failure, and by association the CEO and the CFO. It matters not whether the problem was there before their appointments; they did not all suddenly appear from the HR organisation.

IG

Gne 17th Feb 2023 08:22


Originally Posted by ATC Watcher (Post 11386729)
Exactly the same thing occurred 45 years ago in my old (major) control centre. A worker using a backhoe dug into and cut a bunch of cables just outside the centre, severing all telephone and radar lines (which were all underground at the time).

Same during my career:
  • On my first posting, of course, all coord comms to the adjacent civil centre were lost due to backhoe activity just inside the main gate of the base.
  • Some years later, as a SATCO, I had the main comms/coord fibre to the outside world cut during night flying by a D9 clearing scrub 15 km off base, dragging the fibre-optic cable to the point it snapped. He was devastated when we located him and told him what he'd done.
  • The next night he was clearing another patch of scrub 50 km away and dragged the backup comms/coord fibre-optic cable to the point it snapped. It turns out that, in both cases, the contractor had not buried the cable at the required depth, thinking, no doubt, that it wouldn't matter.
  • Change of role: five years later, as a civvie ATM systems manager, I thought it prudent to check redundancy on the coord link between two major centres. The glossy schematics clearly showed diverse paths between the two MDFs, one terrestrial and the other via satellite. I found that the connection between the MDF and the antennas that bounce the signals to the satellite ran in the same off-site duct as the terrestrial link. On the drive back to the airport to head home and try to resolve the problem, the senior tech manager and I saw a backhoe working within 2 m of the duct containing the cables!!
Don't talk to me about redundancy and diverse paths!!

Gne

Asturias56 17th Feb 2023 08:23

One issue can be that multiple redundancy sounds great, but it brings greater complexity - especially when you are upgrading software, for example. You have to ensure ALL the links are working properly ALL the time. Easier to say than to achieve.

Imagegear 17th Feb 2023 08:34

Right up until the point when the last transaction to the main host goes unprocessed, transactions will be received by the different sites (Multi-Host File Sharing - MHFS).

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (with a number of major airlines, and demonstrating to clients), but the principles are the same. (Or should be.)
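
A sketch of that takeover ordering, under the same assumptions (the unacknowledged() call on the comms units is a made-up placeholder, not a real product API):

    def take_over(comms_units, apply_transaction, open_for_traffic):
        """Backup host: drain the comms processors first, then resume normal service."""
        # 1. Recover transactions the comms units received but the failed
        #    primary never processed, in arrival order.
        for unit in comms_units:
            for txn in unit.unacknowledged():
                apply_transaction(txn)
        # 2. Only then accept fresh traffic at this site.
        open_for_traffic()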

IG

MichaelKPIT 17th Feb 2023 13:57

Aaaaand now the same thing has happened at LH's terminal in JFK:
https://www.cnbc.com/2023/02/17/jfk-...er-outage.html

nomilk 17th Feb 2023 17:29

Wasn't it a fire in terminal 1?

MichaelKPIT 17th Feb 2023 18:59

Yes it was - it's just that another single point of failure (electrical panel) brought down the whole operation. Agree it's much harder to avoid - just a similar occurrence.

SMT Member 18th Feb 2023 07:43

Interesting comments here, especially so on a board that professes to cater to professional aviators.

Some of the comments made here would make you believe that there are people who fully endorse the Qatar/Emirates way of working when something goes wrong: fire everybody involved and sort out what actually happened later.

But in reality, those same people are (rightfully) up in arms when QR/EK sacks a crew whenever there’s been a bit of an incident.

Not a pretty sight.

Asturias56 18th Feb 2023 07:51


Originally Posted by Imagegear (Post 11387232)
Right up until the point when the last transaction to the main host goes unprocessed, transactions will be received by the different sites (Multi-Host File Sharing - MHFS).

When the heartbeat of the main host disappears, the other hosts will first recover the last transactions from the comms processing units before resuming processing of normal traffic. I am no longer at the sharp end of this stuff (with a number of major airlines, and demonstrating to clients), but the principles are the same. (Or should be.)

IG


I know - the problem is often that it's not "will pick up", it's "should have picked up". Unless you rigorously test by failing a feed on a regular basis, the various systems tend to diverge - often someone makes a small and seemingly inconsequential change to fix/upgrade something else, and what should be Duplicated now becomes Similar. It's especially the case when the system management is outsourced for a while - you lose that ownership of the complete system and its commercial importance.
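
A hedged illustration of catching that "Duplicated becomes Similar" drift: periodically fingerprint the configuration on both sites and flag anything that differs (the paths in the usage comment are invented examples):

    import hashlib
    from pathlib import Path

    def config_fingerprint(root: Path) -> dict:
        """Hash every config file under root so two sites can be compared."""
        return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in sorted(root.rglob("*.conf"))}

    def report_drift(primary: Path, backup: Path) -> list:
        """Return config files that differ, or exist on only one site."""
        a, b = config_fingerprint(primary), config_fingerprint(backup)
        return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))

    # e.g. report_drift(Path("/mnt/site_a/etc"), Path("/mnt/site_b/etc"))

It only catches what you look for, of course; regularly failing a feed for real remains the stronger test.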

Blackfriar 19th Feb 2023 06:28


Originally Posted by B Fraser (Post 11386892)
It's entirely possible that LH have two providers, perhaps Deutsche Telekom AG (DTAG) and another. The requirement for diversity and separacy may have been delivered on day 1; however, if the second provider rides on DTAG infrastructure, this can become compromised over time even when the requirement has been pointed out. Airports are prone to lots of construction, so existing telco cables can be moved to avoid civils works. If system "A" is at risk due to a new roundabout and is migrated by DTAG into the ducts carrying system "B", all the eggs are in one basket. Along comes another civils job where the digger operator is gung-ho, and the holes in the swiss cheese all line up.

Teasing out who is to blame is far from easy.

I’ve designed many resilient data networks and have always gone right down to the detailed routeing using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it’s difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway so you have vertical separation. The level of risk accepted by the client is based on the systems that it is supporting and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was rail side and hundreds of fibres and thousands of copper lines had been cut, so all the time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical, so if it’s critical infrastructure, you really need three separate routes and triple redundant power, devices and everything else.
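
A rough sketch of that kind of separation check (purely illustrative, not the actual tooling described above): given two cable routes as lists of (lat, lon) points, flag anywhere the sampled points come closer than a chosen minimum.

    from math import radians, sin, cos, asin, sqrt

    def haversine_m(p, q):
        """Great-circle distance in metres between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371000 * asin(sqrt(a))

    def close_approaches(route_a, route_b, min_separation_m=50.0):
        """Point pairs from the two routes that are closer than min_separation_m."""
        return [(p, q, haversine_m(p, q))
                for p in route_a for q in route_b
                if haversine_m(p, q) < min_separation_m]

Comparing sampled points rather than whole segments is a simplification; densely sampled routes, or a proper GIS buffer query, would be needed in practice - and as noted above, only vertical separation may be achievable at some crossovers.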

Less Hair 19th Feb 2023 06:58

Weight-conscious aircraft have triple redundancy built in, but airlines don't.

vikingivesterled 19th Feb 2023 10:37

Frankfurt is home to a large internet exchange (once one of only five in Europe) where many different telcos' cables meet in a single building to exchange traffic with each other. Many companies' systems are housed in the same building for short access to that critical exchange. As I remember, that building has two cable entry points, in ducts under two different roads.
An airline's systems might be redundant, with standbys in different buildings, but the question is always when to switch to the alternative. In this case the main system was still up and running. Airline systems are also very interconnected, so will you switch to the backup if ops control in other parts of Lufthansa is running fine but check-in at a particular airport is not?
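
A toy illustration of that switchover question (the service names and the policy are invented for the example): fail over the whole operation only when a globally critical function is down, not when a single station's check-in is.

    # Hypothetical health map: service -> healthy?
    health = {
        "ops_control": True,
        "crew_rostering": True,
        "checkin_FRA": False,   # only check-in at one airport is down
        "checkin_MUC": True,
    }

    GLOBALLY_CRITICAL = {"ops_control", "crew_rostering"}

    def should_failover(health):
        # A full failover is disruptive, so trigger it only for critical services;
        # a single-station problem is handled locally instead.
        return any(not ok for svc, ok in health.items() if svc in GLOBALLY_CRITICAL)

    print(should_failover(health))   # False: main system stays live, FRA check-in degraded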

Check-in systems are also often airport-supplied and standardised so the desks can be used by many different airlines. Then it wouldn't be LH's system that was down but the airport's connection to LH's main system, which is most likely in a different place altogether. The connection to and through the internet exchange would be the airport's responsibility; not having a manual backup system would be the airline's. But these days airport flows are so system-dependent that simple baggage handling couldn't be done manually without the systems working, and I do believe some planes took off from Frankfurt without baggage.

Imagegear 19th Feb 2023 10:39

Should airlines be regulated in terms of safety in the same way that aircraft manufacturers are? After all, if an airline's systems fail, it is possible for deaths to occur.

IG

Less Hair 19th Feb 2023 11:19

FWIW, the Russian hacker group Killnet has claimed responsibility. But they seem to claim "a lot" just to be sure. Still, several airport websites in Germany in fact WERE attacked by Killnet during the same period, albeit with much more simplistic DDoS attacks.

Deaf 19th Feb 2023 13:04


Originally Posted by Blackfriar (Post 11388090)
I’ve designed many resilient data networks and have always gone right down to the detailed routeing using Google Earth mapping to ensure the two routes, whether from one carrier or two different ones, are actually separate and have no single points of failure. Sometimes it’s difficult and you get a crossover where one line is on the railway and one beside the road on a bridge over the railway so you have vertical separation. The level of risk accepted by the client is based on the systems that it is supporting and the cost of making it perfect. It took many months to get one diversely routed system correctly installed by Openreach, but it was finally done. A couple of years later a contractor went through one of the cables in which our fibres were located, but the client saw nothing as it seamlessly switched to the backup route. It took weeks to fix the problem as it was rail side and hundreds of fibres and thousands of copper lines had been cut, so all the time the client was at risk on the other route. Fortunately this was just an insurance company backing up data between two data centres and not critical, so if it’s critical infrastructure, you really need three separate routes and triple redundant power, devices and everything else.

The question of where the lines actually go is fairly difficult to answer.

We mapped copper last-mile plant in several countries, producing both the hardware to/from MDFs, pillars etc. and the IT equipment to keep track of the findings. The surveys showed that about 70% of lines actually went where they were shown on the plans. Anecdotally that's common, and the biggest cause was rushed repair after a backhoe job: the quickest fix was rerouting through alternate routes, which may or may not rejoin the original route.

No long-term problem if the plans are amended accordingly, but that paperwork bit, which often comes at the end of a lengthy night in the rain, tends not to happen.

tdracer 19th Feb 2023 18:36

It's not just communication cables. A couple years ago, the local natural gas utility came in and replaced all the underground gas lines in the development as 'preventative maintenance'.
It became pretty apparent as soon as they started digging up the street that the existing gas lines were not where they thought they were - often off by 10 meters or more. Needless to say, the local cul-de-sac was quite the mess by the time they finished digging.
At least we didn't find out about it the hard way - e.g. when some utility dig severed a gas line that they thought was 10 meters away...

