'System outage' grounds Delta flights

Reply Subscribe

Thread Tools

Search this Thread

9th Aug 2016, 11:22

#41 (permalink)

Ancient Observer

Join Date: Aug 2006

Location: Lemonia. Best Greek in the world

Posts: 1,759

Likes: 16

Received 6 Likes on 3 Posts

It is seldom the back up generators that fail if they are rigorously tested. It is normally some switch somewhere, which no-one seems to own. IT folk think they do good Project management. Maybe they do for software implementations. For real Engineering, hire a real Engineer.

9th Aug 2016, 12:18

#42 (permalink)

OldLurker

Join Date: Jul 2014

Location: England

Posts: 401

Likes: 1

Received 1 Like on 1 Post

Quote:

if you want your backups and protections to work when they are needed, you have to actually integrate usage of them into your standard operations. No amount of "really really careful" testing is an adequate substitute.

Yes!

Quote:

the factory site at which I worked which at times was probably on the worldwide top ten list for dollar value added across all industries, was so concerned about the single point failure of losing utility power ... that they paid to have several miles of high-voltage connection made to a second point within the utility network.

One of my company's sites did something similar. But no clued person supervised the actual connection. Result: at a certain point close to the site, the two cables ran side by side within a foot or two. Yes, you guessed it: that was exactly where some guy with a backhoe, digging an unrelated hole, got the spot wrong and cut through both lines ...

9th Aug 2016, 14:18

#43 (permalink)

er340790

Join Date: Jul 2005

Location: Canadian Shield

Posts: 538

Likes: 0

Received 0 Likes on 0 Posts

Well, it's certainly a major wake-up call.

The proliferation of IT-based "solutions" in passenger air-transport recently has been remarkable: fully web-based reservations; on-line check-in; boarding cards via hand-held devices etc etc etc.

These days all one sees before Security at airports is a spotty 16-year old handling checked baggage - typically someone who wouldn't recognize a Manual System if it swam up and bit him.

When all said and done, Delta's back-up and disaster recovery procedures clearly fell far short. No real excuse for that.

[Presumably anyone dying in SE States that day got an extra day on earth. "It doesn't matter if you go to Heaven or Hell, you still have to go via Atlanta!"]

9th Aug 2016, 14:26

#44 (permalink)

FakePilot

Join Date: Mar 2004

Location: Baltimore, MD

Posts: 273

Likes: 0

Received 5 Likes on 1 Post

Quote:

Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. Most engineering work I've observed is "will the whole thing work?" vs. software people "when can I use my favorite tool?"

9th Aug 2016, 16:20

#45 (permalink)

RAT 5

Join Date: Jun 2000

Location: last time I looked I was still here.

Posts: 4,507

Likes: 0

Received 0 Likes on 0 Posts

Various observations come to mind:

1. Someone somewhere, perhaps, signed off on NOT installing a correct, suitable worse case, thoroughly tested - often- backup system. Heads might roll, but don't hold your breath. They can not pin 'pilot error' on this one.

2. Someone somewhere did not do a thorough threat/risk assessment of 'what happens if...../'

3. Someone somewhere was being over complacent. "it has never been a problem before, therefore it's OK."

4. When volcanic ash shutdowns airspace, and puts a/c & crews where you did not plan them to be, you used your computer systems to sort out the consequential poo-pile. Oops, the poo-pile is caused by your own computer system. Now where is that pencil & rubber, slide-rule and abacus? What do you mean there's no paper back up? Oops.

This saga could go on long enough for Hollywood to make an epic drama out of it, at least a TV box set. Then you could throw in some foreign espionage conspiracy and ruin the whole truth. Ground Crash Investigation could have a field day with this one. Human error puts a company on the edge.

What will be interesting will be the investigation as to root cause. I wonder if that will ever see the light of day to the public. Check out the dole queue for a clue.

9th Aug 2016, 16:26

#46 (permalink)

Flash2001

Join Date: Dec 2001

Location: Richmond Texas

Posts: 305

Likes: 0

Received 0 Likes on 0 Posts

Never worked in airline reservations but did work many years in the broadcast industry. I was surprised at the number of redundant systems I found that assured system failure if either of the duplicate systems failed! Also worked as a millennium auditor in the same industry. We found several epoch related risks that had nothing to do with Y2K.

After an excellent landing etc...

9th Aug 2016, 16:37

#47 (permalink)

EEngr

Join Date: Jan 2011

Location: Seattle

Posts: 717

Likes: 0

Received 3 Likes on 2 Posts

Quote:

Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer.

And that's often a management decision. Back when I worked for Boeing, the big thing was to compartmentalize the hardware and software development tasks. Theoretically so each one could be assigned to a group with the appropriate expertise. But more often than not because each discipline had an entrenchedfiefdom. Later on, it was to facilitate outsourcing each task to different subcontractors (spread the blame).

This is how they treated their core competency: aircraft. The 'systems engineering' function (a top-down view of overall function) was mostly contract management and very little actual engineering. You can imagine what lack of attention was given to non-core functions (data centers, facilities, etc.)

9th Aug 2016, 20:59

#48 (permalink)

bafanguy

Join Date: Feb 2004

Location: USA

Posts: 3,381

Likes: 0

Received 17 Likes on 11 Posts

Not that it matters terribly, but here's latest info:

http://finance.yahoo.com/news/delta-...194608131.html

9th Aug 2016, 21:12

#49 (permalink)

twochai

Join Date: Jan 2008

Location: On the lake

Age: 82

Posts: 670

Likes: 4

Received 0 Likes on 0 Posts

A wake up call? Delta already had the wake up call, ten years ago:

Comair's Christmas Disaster: Bound To Fail | CIO

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen!

The CEO of Comair walked the plank! Wonder if that'll happen this time around!

Last edited by twochai; 9th Aug 2016 at 21:50.

9th Aug 2016, 21:50

#50 (permalink)

xs-baggage

Join Date: Aug 2007

Location: West London, UK

Posts: 12

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by Peter47

Presumably VS still has its own computer system as its ops appear to be unaffected.

As far as I'm aware VS is still hosted on the former EDS SHARES, now owned and operated by HP. CO also sat on SHARES (don't know if that's the case since the UA merger).

BA, mentioned in another post, is also hosted by a third party with a very resilient system indeed. I'm surprised that DL didn't move to third party hosting when they finally dropped the legacy in-house DELTAMATIC system.

9th Aug 2016, 23:22

#51 (permalink)

OldCessna

Join Date: Feb 2003

Location: PBI

Posts: 215

Likes: 0

Received 0 Likes on 0 Posts

Amazon computer services offered a much more redundant system and they (Delta) didn't want to pay the money.

You can assume Amazon are pretty much switched on with systems.

So Delta are running a 20-25 year old system that if one hub goes down so does the rest. All the senior IT execs are former IBM. That sums it up of course.

On another positive note all the competition are doing really well from this total Fook up!

Last edited by OldCessna; 9th Aug 2016 at 23:23. Reason: Typo

10th Aug 2016, 10:49

#52 (permalink)

Ian W

Join Date: Dec 2006

Location: Florida and wherever my laptop is

Posts: 1,350

Likes: 0

Received 0 Likes on 0 Posts

From the Yahoo Link above:

Quote:

"Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power," Delta COO Gil West said in a statement on Tuesday. "The universal power was stabilized and power was restored quickly."

However, the trouble obviously didn't end there. A Delta spokesperson confirmed to Business Insider earlier today that the airline's backup systems failed to kick in."

And here we have the fundamental fault in the design. The 'backup' system should be operating all the time as a part of the live system. To all intents and purposes you have a widely distributed system that usually operates very efficiently. When part of the system fails all that happens is that the remaining part of the system carries on operating slightly less efficiently. There is no impact at all on operations and no failover to worry about.
I have no doubt that the IT people would want to have a fault tolerant system but the beancounters will have said how often do things fail? What is the cost of 2 computer centers? We are not paying that they can stay in the same building.... and there will be a Delta beancounter with his abacus out saying now that they still won on the deal.

10th Aug 2016, 11:08

#53 (permalink)

Ian W

Join Date: Dec 2006

Location: Florida and wherever my laptop is

Posts: 1,350

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by twochai

There is a little more to this story.
I have been told that in order to reduce stock holding at the airport Cincinnati had only a small supply of deicer and when snow/ice/freezing weather was forecast would call for supplies sufficient for the expected weather. in this case the tankers of deicer were on their way to the airport but were pulled over by law enforcement and told it was too dangerous for them to carry on driving due to the snow. So the airport was unable to deice aircraft and operations were halted. Not only did the aircraft tires freeze to the ground, but also the jetways froze in position.

Lots of holes in the cheese lined up. A really good learning exercise for the MBAs who run airports these days.

10th Aug 2016, 11:18

#54 (permalink)

Tech Guy

Join Date: Dec 2015

Location: Southampton

Posts: 125

Likes: 1

Received 0 Likes on 0 Posts

The last company I worked for had mirrored data centres in Europe, Asia and America.
You could loose any 2 and everything would still work correctly at a "user level".

10th Aug 2016, 12:54

#55 (permalink)

Ian W

Join Date: Dec 2006

Location: Florida and wherever my laptop is

Posts: 1,350

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by Tech Guy

The last company I worked for had mirrored data centres in Europe, Asia and America.
You could loose any 2 and everything would still work correctly at a "user level".

Exactly.
There is only one explanation really, Delta beancounters felt the cost of a fault tolerant system made it worth taking the risk of a total system failure. Yet the cost of the backup system running as a 'hot spare' in a separate building would be peanuts compared to their cash and status losses now. There are still flights being cancelled today and their computer systems are still not recovered with lots of broken links and applications not back in synch. All those people with their 'e-boarding passes' on their phones could be in trouble. This may run on for months with people with bookings out months suddenly finding that the roll-back/roll-forward broke their bookings.

They should take the $200 a pax good will payments out of their beancounters' head count budgets. Only then with skin in the game would they appreciate the risk analyses.

10th Aug 2016, 13:11

#56 (permalink)

procede

Join Date: Jan 2008

Location: Netherlands

Age: 46

Posts: 343

Likes: 0

Received 0 Likes on 0 Posts

Fact of the matter is that every backup system will introduce new failure modes.
It happens that everything stops because of inconstancy between primary and secondary systems. Systems can become unavailable as they need to re-synchronize (a common one is where a drive in a RAID array fails and the server starts filling the hot-spare). The best one I ever experienced is a UPS that failed: Everything had power, except the systems behind the UPS...

10th Aug 2016, 13:49

#57 (permalink)

MarkerInbound

Join Date: Nov 2007

Location: Texas

Posts: 1,921

Likes: 0

Received 1 Like on 1 Post

Quote:

Any idea why are DL flights still being cancelled today (Tuesday)? Positioning? Some loss of data?

You have a crew scheduled to fly to Podunk and spend the night then fly the morning departure back. They never got there so there is no crew (or aircraft) for the morning flight.

10th Aug 2016, 17:15

#58 (permalink)

.Scott

Join Date: Feb 2015

Location: New Hampshire

Posts: 152

Likes: 0

Received 0 Likes on 0 Posts

Quote:

Originally Posted by FakePilot

The "favorite tool" and/or "favorite language" syndrome is the sign of a junior programmer.

In this case, lots of schemes would have worked. As others have said, they just needed to actually exercise the one they picked. When it comes to software and information systems, if it hasn't been tested, it doesn't work.
So we have the first round of testing: Not bad, up in only a few hours. Too bad it wasn't a test.

Actually, a major cost in these systems is the testing. As each revision is made to the system, you need ways to routinely simulate and check daily activity without actually relying on that system.

10th Aug 2016, 19:14

#59 (permalink)

neilki

Join Date: Sep 2007

Location: New York

Posts: 225

Likes: 0

Received 0 Likes on 0 Posts

Coal Face

For the record, everyone i worked with or watched the last few days met the challenges at hand with grace and patience. It was impressive to watch people pull together to keep the show running. I would take my hat off to them; but i'm not supposed to...