It is seldom the backup generators that fail if they are rigorously tested. It is normally some switch somewhere that no one seems to own. IT folk think they do good project management. Maybe they do for software implementations. For real engineering, hire a real engineer.
|
If you want your backups and protections to work when they are needed, you have to actually integrate their use into your standard operations. No amount of "really, really careful" testing is an adequate substitute. The factory site at which I worked, which at times was probably on the worldwide top-ten list for dollar value added across all industries, was so concerned about the single-point failure of losing utility power ... that they paid to have several miles of high-voltage connection made to a second point within the utility network. |
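To put the "use your backups as part of normal operations" point into software terms, here is a minimal sketch; the site names and the weekly rotation are purely assumptions of mine, not anything from the post above. The idea is simply that the live load alternates between two sites on a schedule, so the standby path is exercised routinely rather than only in a crisis.

```python
# Minimal sketch (site names and the weekly rotation are assumptions): alternate
# the live load between two sites so the standby is exercised in normal
# operations, not just during a crisis.
import datetime
import logging

logging.basicConfig(level=logging.INFO)

SITES = ["site-a", "site-b"]  # hypothetical primary/standby pair

def active_site(today: datetime.date) -> str:
    """Even ISO weeks run on site-a, odd weeks on site-b."""
    week = today.isocalendar()[1]
    return SITES[week % 2]

def route_traffic(today: datetime.date) -> str:
    """Log which site carries the live load this week and return it."""
    active = active_site(today)
    standby = next(s for s in SITES if s != active)
    logging.info("Live load on %s; %s is the warm standby this week", active, standby)
    return active

if __name__ == "__main__":
    route_traffic(datetime.date.today())
```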
Well, it's certainly a major wake-up call.
The proliferation of IT-based "solutions" in passenger air transport recently has been remarkable: fully web-based reservations, online check-in, boarding cards via hand-held devices, etc. These days all one sees before Security at airports is a spotty 16-year-old handling checked baggage - typically someone who wouldn't recognize a manual system if it swam up and bit him. When all is said and done, Delta's backup and disaster recovery procedures clearly fell far short. No real excuse for that. [Presumably anyone dying in the SE States that day got an extra day on earth. "It doesn't matter if you go to Heaven or Hell, you still have to go via Atlanta!"] |
Various observations come to mind:
1. Someone somewhere, perhaps, signed off on NOT installing a correct, suitable worst-case, thoroughly tested (often) backup system. Heads might roll, but don't hold your breath. They cannot pin "pilot error" on this one.
2. Someone somewhere did not do a thorough threat/risk assessment of "what happens if...?"
3. Someone somewhere was being overcomplacent: "It has never been a problem before, therefore it's OK."
4. When volcanic ash shuts down airspace and puts a/c and crews where you did not plan them to be, you use your computer systems to sort out the consequential poo-pile. Oops, the poo-pile is caused by your own computer system. Now where are that pencil and rubber, slide-rule and abacus? What do you mean there's no paper backup? Oops.

This saga could go on long enough for Hollywood to make an epic drama out of it, at least a TV box set. Then you could throw in some foreign espionage conspiracy and ruin the whole truth. Ground Crash Investigation could have a field day with this one. Human error puts a company on the edge. What will be interesting will be the investigation into root cause. I wonder if that will ever see the light of day to the public. Check out the dole queue for a clue. |
I never worked in airline reservations, but I did work many years in the broadcast industry. I was surprised at the number of redundant systems I found that assured system failure if either of the duplicate systems failed! I also worked as a millennium auditor in the same industry. We found several epoch-related risks that had nothing to do with Y2K.
After an excellent landing etc... |
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. This is how they treated their core competency: aircraft. The 'systems engineering' function (a top-down view of overall function) was mostly contract management and very little actual engineering. You can imagine what lack of attention was given to non-core functions (data centers, facilities, etc.) |
Not that it matters terribly, but here's the latest info:
http://finance.yahoo.com/news/delta-...194608131.html |
A wake-up call? Delta already had the wake-up call, ten years ago:
Comair's Christmas Disaster: Bound To Fail | CIO

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen! The CEO of Comair walked the plank! Wonder if that'll happen this time around! |
Originally Posted by Peter47
(Post 9467438)
Presumably VS still has its own computer system as its ops appear to be unaffected.
BA, mentioned in another post, is also hosted by a third party with a very resilient system indeed. I'm surprised that DL didn't move to third party hosting when they finally dropped the legacy in-house DELTAMATIC system. |
Amazon's computer services offered a much more redundant system, and they (Delta) didn't want to pay the money.
You can assume Amazon are pretty much switched on with systems. So Delta are running a 20-25-year-old system in which, if one hub goes down, so does the rest. All the senior IT execs are former IBM. That sums it up, of course. On another positive note, all the competition are doing really well out of this total Fook-up! |
From the Yahoo Link above:
"Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power," Delta COO Gil West said in a statement on Tuesday. "The universal power was stabilized and power was restored quickly." However, the trouble obviously didn't end there. A Delta spokesperson confirmed to Business Insider earlier today that the airline's backup systems failed to kick in." I have no doubt that the IT people would want to have a fault tolerant system but the beancounters will have said how often do things fail? What is the cost of 2 computer centers? We are not paying that they can stay in the same building.... and there will be a Delta beancounter with his abacus out saying now that they still won on the deal. |
Originally Posted by twochai
(Post 9468553)
A wake-up call? Delta already had the wake-up call, ten years ago:
Comair's Christmas Disaster: Bound To Fail | CIO

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen! The CEO of Comair walked the plank! Wonder if that'll happen this time around!

I have been told that, in order to reduce stock holding at the airport, Cincinnati had only a small supply of deicer and, when snow/ice/freezing weather was forecast, would call for supplies sufficient for the expected weather. In this case the tankers of deicer were on their way to the airport but were pulled over by law enforcement and told it was too dangerous for them to carry on driving due to the snow. So the airport was unable to deice aircraft, and operations were halted. Not only did the aircraft tires freeze to the ground, but the jetways also froze in position. Lots of holes in the cheese lined up. A really good learning exercise for the MBAs who run airports these days. |
The last company I worked for had mirrored data centres in Europe, Asia and America.
You could lose any two and everything would still work correctly at a "user level". |
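For what the "lose any two and still work" idea looks like in practice, here is a rough sketch; the region names and health-check URLs are invented for illustration and are not anything the poster's company actually ran. Requests are simply served from the first mirrored data centre that passes a health check, so any single surviving mirror keeps users working.

```python
# Rough sketch only: serve from the first mirrored data centre that answers a
# health check. All region names and URLs below are invented for illustration.
from urllib.request import urlopen
from urllib.error import URLError

REGIONS = {
    "europe":  "https://eu.example.invalid/health",
    "asia":    "https://ap.example.invalid/health",
    "america": "https://us.example.invalid/health",
}

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the region's health endpoint responds with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def pick_region() -> str:
    """Pick the first healthy mirror; any one survivor keeps users working."""
    for name, url in REGIONS.items():
        if healthy(url):
            return name
    raise RuntimeError("All mirrored data centres are unreachable")
```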
Originally Posted by Tech Guy
(Post 9469102)
The last company I worked for had mirrored data centres in Europe, Asia and America.
You could lose any two and everything would still work correctly at a "user level".

There is only one explanation, really: Delta beancounters felt the cost of a fault-tolerant system made it worth taking the risk of a total system failure. Yet the cost of the backup system running as a "hot spare" in a separate building would be peanuts compared to their cash and status losses now. There are still flights being cancelled today, and their computer systems are still not recovered, with lots of broken links and applications not back in sync. All those people with their "e-boarding passes" on their phones could be in trouble. This may run on for months, with people with bookings out months suddenly finding that the roll-back/roll-forward broke their bookings. They should take the $200-a-pax goodwill payments out of their beancounters' headcount budgets. Only then, with skin in the game, would they appreciate the risk analyses. |
Fact of the matter is that every backup system will introduce new failure modes.
It happens that everything stops because of inconsistency between the primary and secondary systems. Systems can become unavailable as they need to re-synchronize (a common one is where a drive in a RAID array fails and the server starts filling the hot spare). The best one I ever experienced was a UPS that failed: everything had power except the systems behind the UPS... |
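A toy illustration of how the safeguard itself becomes the stoppage; the node model and the logic are entirely assumed, not any real product. A failover controller refuses to promote a secondary that has fallen behind the primary, so an inconsistency between the two halts everything rather than risking stale or conflicting data.

```python
# Toy illustration (assumed logic, not any real product) of how the safeguard
# itself becomes the outage: the controller refuses to promote a secondary
# that has fallen behind the primary's replication position.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    alive: bool
    replication_position: int  # e.g. last applied transaction id

def choose_active(primary: Node, secondary: Node) -> Node:
    if primary.alive:
        return primary
    # Primary is down: only promote the secondary if it is fully caught up.
    if secondary.alive and secondary.replication_position >= primary.replication_position:
        return secondary
    # Inconsistency between primary and secondary: the "protection" now halts
    # everything rather than risk serving stale or conflicting data.
    raise RuntimeError("Failover blocked: secondary is out of sync with the primary")

if __name__ == "__main__":
    p = Node("primary", alive=False, replication_position=1042)
    s = Node("secondary", alive=True, replication_position=997)
    try:
        choose_active(p, s)
    except RuntimeError as exc:
        print(exc)  # exactly the failure mode described in the post above
```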
Any idea why DL flights are still being cancelled today (Tuesday)? Positioning? Some loss of data? |
Originally Posted by FakePilot
(Post 9468181)
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. Most engineering work I've observed is about "will the whole thing work?", versus software people's "when can I use my favorite tool?"
In this case, lots of schemes would have worked. As others have said, they just needed to actually exercise the one they picked. When it comes to software and information systems, if it hasn't been tested, it doesn't work. So we have the first round of testing: Not bad, up in only a few hours. Too bad it wasn't a test. Actually, a major cost in these systems is the testing. As each revision is made to the system, you need ways to routinely simulate and check daily activity without actually relying on that system. |
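As a hedged sketch of what "routinely simulate and check daily activity" might look like, here is one shape it could take; the endpoint, the sample bookings, and the expected results are all invented for illustration. A nightly drill replays a few synthetic lookups against the standby system and flags any mismatch, so the backup is checked every day without the live operation ever depending on it.

```python
# Hedged sketch: replay a few synthetic lookups against the standby system and
# compare with expected answers. Endpoint, record refs and results are invented.
import json
from urllib.request import Request, urlopen
from urllib.error import URLError

STANDBY = "https://standby.example.invalid/api/lookup"  # hypothetical endpoint

SAMPLE_BOOKINGS = ["ABC123", "XYZ789"]  # known synthetic test records

def replay_lookup(record: str) -> dict:
    """Send one synthetic lookup to the standby system and decode the reply."""
    req = Request(f"{STANDBY}?ref={record}", headers={"Accept": "application/json"})
    with urlopen(req, timeout=5) as resp:
        return json.load(resp)

def nightly_drill(expected: dict) -> bool:
    """Return True only if every synthetic booking comes back as expected."""
    ok = True
    for ref in SAMPLE_BOOKINGS:
        try:
            if replay_lookup(ref) != expected[ref]:
                print(f"{ref}: standby returned unexpected data")
                ok = False
        except (URLError, OSError) as exc:
            print(f"{ref}: standby unreachable ({exc})")
            ok = False
    return ok
```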
Coal Face
For the record, everyone I worked with or watched over the last few days met the challenges at hand with grace and patience. It was impressive to watch people pull together to keep the show running. I would take my hat off to them; but I'm not supposed to...
|
Quote: Any idea why DL flights are still being cancelled today (Tuesday)? Positioning? Some loss of data? |