
'System outage' grounds Delta flights

Rumours & News Reporting Points that may affect our jobs or lives as professional pilots. Also, items that may be of interest to professional pilots.

Old 9th Aug 2016, 11:22
  #41 (permalink)  
 
Join Date: Aug 2006
Location: Lemonia. Best Greek in the world
Posts: 1,759
Received 6 Likes on 3 Posts
It is seldom the back up generators that fail if they are rigorously tested. It is normally some switch somewhere, which no-one seems to own. IT folk think they do good Project management. Maybe they do for software implementations. For real Engineering, hire a real Engineer.
Ancient Observer is offline  
Old 9th Aug 2016, 12:18
  #42 (permalink)  
 
Join Date: Jul 2014
Location: England
Posts: 411
Received 2 Likes on 2 Posts
if you want your backups and protections to work when they are needed, you have to actually integrate usage of them into your standard operations. No amount of "really really careful" testing is an adequate substitute.
Yes!

the factory site at which I worked which at times was probably on the worldwide top ten list for dollar value added across all industries, was so concerned about the single point failure of losing utility power ... that they paid to have several miles of high-voltage connection made to a second point within the utility network.
One of my company's sites did something similar. But no clued person supervised the actual connection. Result: at a certain point close to the site, the two cables ran side by side within a foot or two. Yes, you guessed it: that was exactly where some guy with a backhoe, digging an unrelated hole, got the spot wrong and cut through both lines ...
OldLurker is offline  
Old 9th Aug 2016, 14:18
  #43 (permalink)  
 
Join Date: Jul 2005
Location: Canadian Shield
Posts: 538
Likes: 0
Received 0 Likes on 0 Posts
Well, it's certainly a major wake-up call.

The proliferation of IT-based "solutions" in passenger air-transport recently has been remarkable: fully web-based reservations; on-line check-in; boarding cards via hand-held devices etc etc etc.

These days all one sees before Security at airports is a spotty 16-year old handling checked baggage - typically someone who wouldn't recognize a Manual System if it swam up and bit him.

When all said and done, Delta's back-up and disaster recovery procedures clearly fell far short. No real excuse for that.

[Presumably anyone dying in SE States that day got an extra day on earth. "It doesn't matter if you go to Heaven or Hell, you still have to go via Atlanta!"]
er340790 is offline  
Old 9th Aug 2016, 14:26
  #44 (permalink)  
 
Join Date: Mar 2004
Location: Baltimore, MD
Posts: 273
Received 5 Likes on 1 Post
It is seldom the back up generators that fail if they are rigorously tested. It is normally some switch somewhere, which no-one seems to own. IT folk think they do good Project management. Maybe they do for software implementations. For real Engineering, hire a real Engineer.
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. Most engineering work I've observed is "will the whole thing work?" vs. software people "when can I use my favorite tool?"
FakePilot is offline  
Old 9th Aug 2016, 16:20
  #45 (permalink)  
 
Join Date: Jun 2000
Location: last time I looked I was still here.
Posts: 4,507
Likes: 0
Received 0 Likes on 0 Posts
Various observations come to mind:

1. Someone somewhere, perhaps, signed off on NOT installing a correct, suitable, worst-case, thoroughly and frequently tested backup system. Heads might roll, but don't hold your breath. They cannot pin 'pilot error' on this one.

2. Someone somewhere did not do a thorough threat/risk assessment of 'what happens if...?'

3. Someone somewhere was being over-complacent: "It has never been a problem before, therefore it's OK."

4. When volcanic ash shuts down airspace, and puts a/c & crews where you did not plan them to be, you used your computer systems to sort out the consequential poo-pile. Oops, the poo-pile is caused by your own computer system. Now where is that pencil & rubber, slide-rule and abacus? What do you mean there's no paper backup? Oops.

This saga could go on long enough for Hollywood to make an epic drama out of it, at least a TV box set. Then you could throw in some foreign espionage conspiracy and ruin the whole truth. Ground Crash Investigation could have a field day with this one. Human error puts a company on the edge.

What will be interesting will be the investigation as to root cause. I wonder if that will ever see the light of day to the public. Check out the dole queue for a clue.
RAT 5 is offline  
Old 9th Aug 2016, 16:26
  #46 (permalink)  
 
Join Date: Dec 2001
Location: Richmond Texas
Posts: 305
Likes: 0
Received 0 Likes on 0 Posts
Never worked in airline reservations but did work many years in the broadcast industry. I was surprised at the number of redundant systems I found that assured system failure if either of the duplicate systems failed! Also worked as a millennium auditor in the same industry. We found several epoch related risks that had nothing to do with Y2K.
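
A rough way to see why such "redundant" pairs are worse than a single unit: if the system dies when either unit dies, the two availabilities multiply. The sketch below is only an illustration; the 0.99 figure is an assumed number, not a measurement, and the units are assumed to fail independently.

```python
def series_availability(a: float, b: float) -> float:
    """Both units must be up: the 'redundant' pair that fails if either fails."""
    return a * b


def parallel_availability(a: float, b: float) -> float:
    """Either unit is enough: the system is down only when both are down."""
    return 1 - (1 - a) * (1 - b)


unit = 0.99  # assumed availability of one unit, for illustration only
print(f"single unit:   {unit:.4f}")
print(f"wired series:  {series_availability(unit, unit):.4f}")
print(f"true parallel: {parallel_availability(unit, unit):.4f}")
```

Wired in series, the pair is less available than one unit on its own; only the parallel arrangement buys anything.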

After an excellent landing etc...
Flash2001 is offline  
Old 9th Aug 2016, 16:37
  #47 (permalink)  
 
Join Date: Jan 2011
Location: Seattle
Posts: 725
Received 10 Likes on 3 Posts
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer.
And that's often a management decision. Back when I worked for Boeing, the big thing was to compartmentalize the hardware and software development tasks. Theoretically so each one could be assigned to a group with the appropriate expertise. But more often than not because each discipline had an entrenched fiefdom. Later on, it was to facilitate outsourcing each task to different subcontractors (spread the blame).


This is how they treated their core competency: aircraft. The 'systems engineering' function (a top-down view of overall function) was mostly contract management and very little actual engineering. You can imagine what lack of attention was given to non-core functions (data centers, facilities, etc.)
EEngr is offline  
Old 9th Aug 2016, 20:59
  #48 (permalink)  
 
Join Date: Feb 2004
Location: USA
Posts: 3,432
Likes: 0
Received 29 Likes on 18 Posts
Not that it matters terribly, but here's latest info:

http://finance.yahoo.com/news/delta-...194608131.html
bafanguy is offline  
Old 9th Aug 2016, 21:12
  #49 (permalink)  
 
Join Date: Jan 2008
Location: On the lake
Age: 82
Posts: 671
Received 0 Likes on 0 Posts
A wake up call? Delta already had the wake up call, ten years ago:

Comair's Christmas Disaster: Bound To Fail | CIO

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen!

The CEO of Comair walked the plank! Wonder if that'll happen this time around!

Last edited by twochai; 9th Aug 2016 at 21:50.
twochai is offline  
Old 9th Aug 2016, 21:50
  #50 (permalink)  
 
Join Date: Aug 2007
Location: West London, UK
Posts: 12
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Peter47
Presumably VS still has its own computer system as its ops appear to be unaffected.
As far as I'm aware VS is still hosted on the former EDS SHARES, now owned and operated by HP. CO also sat on SHARES (don't know if that's the case since the UA merger).

BA, mentioned in another post, is also hosted by a third party with a very resilient system indeed. I'm surprised that DL didn't move to third party hosting when they finally dropped the legacy in-house DELTAMATIC system.
xs-baggage is offline  
Old 9th Aug 2016, 23:22
  #51 (permalink)  
 
Join Date: Feb 2003
Location: PBI
Posts: 215
Likes: 0
Received 0 Likes on 0 Posts
Amazon computer services offered a much more redundant system and they (Delta) didn't want to pay the money.

You can assume Amazon are pretty much switched on with systems.

So Delta are running a 20-25 year old system that if one hub goes down so does the rest. All the senior IT execs are former IBM. That sums it up of course.

On another positive note all the competition are doing really well from this total Fook up!

Last edited by OldCessna; 9th Aug 2016 at 23:23. Reason: Typo
OldCessna is offline  
Old 10th Aug 2016, 10:49
  #52 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Likes: 0
Received 0 Likes on 0 Posts
From the Yahoo Link above:

"Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power," Delta COO Gil West said in a statement on Tuesday. "The universal power was stabilized and power was restored quickly."

However, the trouble obviously didn't end there. A Delta spokesperson confirmed to Business Insider earlier today that the airline's backup systems failed to kick in.
And here we have the fundamental fault in the design. The 'backup' system should be operating all the time as a part of the live system. To all intents and purposes you have a widely distributed system that usually operates very efficiently. When part of the system fails all that happens is that the remaining part of the system carries on operating slightly less efficiently. There is no impact at all on operations and no failover to worry about.
I have no doubt that the IT people would want to have a fault tolerant system, but the beancounters will have said: "How often do things fail? What is the cost of 2 computer centers? We are not paying that; they can stay in the same building." And there will be a Delta beancounter with his abacus out saying now that they still won on the deal.
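
The "backup is part of the live system" idea above can be sketched in a few lines. This is a toy model under stated assumptions, not anything Delta or any vendor actually runs: the `Site` class, the `dispatch` function and the site names are all hypothetical, and real systems also have to keep data in sync between sites, which is ignored here.

```python
class Site:
    """One data center that serves live traffic all the time (hypothetical model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.served = 0

    def handle(self, booking_id: int) -> str:
        self.served += 1
        return f"{self.name} handled booking {booking_id}"


def dispatch(sites: list[Site], booking_id: int) -> str:
    """Spread requests over every site that is currently up; no failover step."""
    live = [s for s in sites if s.up]
    if not live:
        raise RuntimeError("total outage: no site is up")
    return live[booking_id % len(live)].handle(booking_id)


sites = [Site("site-a"), Site("site-b"), Site("site-c")]
for booking_id in range(6):
    dispatch(sites, booking_id)

sites[0].up = False  # one site loses power
for booking_id in range(6, 12):
    dispatch(sites, booking_id)  # the survivors just take a bigger share

print({s.name: s.served for s in sites})
```

The point of the sketch is that losing a site changes only the load per remaining site. There is no standby that has to "kick in", so there is no untested switchover path to fail.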
Ian W is offline  
Old 10th Aug 2016, 11:08
  #53 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by twochai
A wake up call? Delta already had the wake up call, ten years ago:

Comair's Christmas Disaster: Bound To Fail | CIO

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen!

The CEO of Comair walked the plank! Wonder if that'll happen this time around!
There is a little more to this story.
I have been told that, in order to reduce stock holding at the airport, Cincinnati had only a small supply of deicer and, when snow/ice/freezing weather was forecast, would call for supplies sufficient for the expected weather. In this case the tankers of deicer were on their way to the airport but were pulled over by law enforcement and told it was too dangerous for them to carry on driving due to the snow. So the airport was unable to deice aircraft and operations were halted. Not only did the aircraft tires freeze to the ground, but also the jetways froze in position.

Lots of holes in the cheese lined up. A really good learning exercise for the MBAs who run airports these days.
Ian W is offline  
Old 10th Aug 2016, 11:18
  #54 (permalink)  
 
Join Date: Dec 2015
Location: Southampton
Posts: 126
Received 0 Likes on 0 Posts
The last company I worked for had mirrored data centres in Europe, Asia and America.
You could lose any 2 and everything would still work correctly at a "user level".
Tech Guy is online now  
Old 10th Aug 2016, 12:54
  #55 (permalink)  
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by Tech Guy
The last company I worked for had mirrored data centres in Europe, Asia and America.
You could lose any 2 and everything would still work correctly at a "user level".
Exactly.
There is only one explanation really: Delta beancounters felt the cost of a fault tolerant system made it worth taking the risk of a total system failure. Yet the cost of the backup system running as a 'hot spare' in a separate building would be peanuts compared to their cash and status losses now. There are still flights being cancelled today and their computer systems are still not recovered, with lots of broken links and applications not back in synch. All those people with their 'e-boarding passes' on their phones could be in trouble. This may run on for months, with people with bookings out months suddenly finding that the roll-back/roll-forward broke their bookings.

They should take the $200 a pax good will payments out of their beancounters' head count budgets. Only then with skin in the game would they appreciate the risk analyses.
Ian W is offline  
Old 10th Aug 2016, 13:11
  #56 (permalink)  
 
Join Date: Jan 2008
Location: Netherlands
Age: 46
Posts: 375
Likes: 0
Received 4 Likes on 4 Posts
Fact of the matter is that every backup system will introduce new failure modes.
It happens that everything stops because of inconsistency between primary and secondary systems. Systems can become unavailable as they need to re-synchronize (a common one is where a drive in a RAID array fails and the server starts filling the hot-spare). The best one I ever experienced is a UPS that failed: Everything had power, except the systems behind the UPS...
procede is online now  
Old 10th Aug 2016, 13:49
  #57 (permalink)  
 
Join Date: Nov 2007
Location: Texas
Posts: 1,924
Likes: 0
Received 1 Like on 1 Post
Any idea why are DL flights still being cancelled today (Tuesday)? Positioning? Some loss of data?
You have a crew scheduled to fly to Podunk and spend the night then fly the morning departure back. They never got there so there is no crew (or aircraft) for the morning flight.
MarkerInbound is offline  
Old 10th Aug 2016, 17:15
  #58 (permalink)  
 
Join Date: Feb 2015
Location: New Hampshire
Posts: 152
Likes: 0
Received 0 Likes on 0 Posts
Originally Posted by FakePilot
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. Most engineering work I've observed is "will the whole thing work?" vs. software people "when can I use my favorite tool?"
The "favorite tool" and/or "favorite language" syndrome is the sign of a junior programmer.

In this case, lots of schemes would have worked. As others have said, they just needed to actually exercise the one they picked. When it comes to software and information systems, if it hasn't been tested, it doesn't work.
So we have the first round of testing: Not bad, up in only a few hours. Too bad it wasn't a test.

Actually, a major cost in these systems is the testing. As each revision is made to the system, you need ways to routinely simulate and check daily activity without actually relying on that system.
.Scott is offline  
Old 10th Aug 2016, 19:14
  #59 (permalink)  
 
Join Date: Sep 2007
Location: New York
Posts: 225
Likes: 0
Received 0 Likes on 0 Posts
Coal Face

For the record, everyone I worked with or watched the last few days met the challenges at hand with grace and patience. It was impressive to watch people pull together to keep the show running. I would take my hat off to them; but I'm not supposed to...
neilki is offline  
Old 10th Aug 2016, 19:22
  #60 (permalink)  
 
Join Date: Apr 2001
Location: surfing, watching for sharks
Posts: 4,097
Received 63 Likes on 41 Posts
Quote:
Any idea why are DL flights still being cancelled today (Tuesday)? Positioning? Some loss of data?
As MI mentioned, after IROPS it takes a few hours to days to get the system back to normal ops. Bet the reserve complements are getting heavily used.
West Coast is offline  
