
View Full Version : 'System outage' grounds Delta flights


Dimitri Cherchenko
8th Aug 2016, 09:17
Delta airlines says all flights suspended "due to system outage nationwide"

'System outage' grounds Delta flights - BBC News (http://www.bbc.com/news/world-us-canada-37007908)

Twitter (https://twitter.com/DeltaNewsHub/status/762575331647688704)

Porky Speedpig
8th Aug 2016, 12:12
Must be a nightmare for all concerned and affected - any word as to the root cause?

bafanguy
8th Aug 2016, 12:15
This:

http://finance.yahoo.com/news/delta-says-atlanta-power-outage-111236208.html

log0008
8th Aug 2016, 12:20
8am and the world's busiest airport is very quiet! Not a good week for those traveling through major airports.

Porky Speedpig
8th Aug 2016, 12:24
Wow, that's one heck of a power outage. No doubt 2-3 back up power systems too so it will be interesting to see why one couldn't kick in.

crablab
8th Aug 2016, 12:58
Surely in the age of the cloud they should have multiple data centres?!

Alanwsg
8th Aug 2016, 13:12
Power cut crashes Delta's worldwide flight update systems ? The Register (http://www.theregister.co.uk/2016/08/08/computer_fault_takes_down_delta/)

OldLurker
8th Aug 2016, 16:12
From experience elsewhere (not being able to see behind the curtain at Delta) I'd hazard a guess that their IT guys have been beating on the management for years to get proper hot fall-back for what is nowadays a mission-critical system, but management (supported by bean-counters) have been stalling on the necessary investment.

OTOH, they may have all the fall-back in the world but they've never actually exercised it properly, so when it's called on for real, it doesn't work ...

Lonewolf_50
8th Aug 2016, 16:21
OTOH, they may have all the fall-back in the world but they've never actually exercised it properly, so when it's called on for real, it doesn't work ...

I wonder. A whole lot of folks ran "business continuity plans" and tests before the dreaded "Y2K" event (which wasn't all that it threatened to be) so perhaps management consider that event the IT guys "crying wolf."

So now they lost a bit of this week's flock/wool ...

Porky Speedpig
8th Aug 2016, 17:01
A veritable squadron of off schedule Delta birds on the Atlantic now most heading for JFK and ATL and arriving in quick succession. Likely to be a nightmare at CBP.

Derfred
8th Aug 2016, 17:17
the dreaded "Y2K" event (which wasn't all that it threatened to be)

Y2K wasn't all that it threatened to be because they identified it beforehand and fixed it (at huge cost in some areas). You obviously missed that bit.

PAXboy
8th Aug 2016, 17:24
Correct Derfred. Y2K was fixed in time but it suits everyone to say we cried wolf. Then it is easier not to give credit where it's due.

Correct OldLurker. I was in IT for 27 years and because it works 99.9% of the time, they think it will be acceptable when it fails. As they say in the airline industry, "If you think preventative maintenance is expensive, try a crash for size." For years I saw IT starved of investment and then, when it did go wrong, they gave us the money we'd been asking for.

esa-aardvark
8th Aug 2016, 18:00
Y2K, I made a lot of money out of this. I invested a smallish amount into a company working on Y2K solutions. Then I forgot about it for a couple of years; the investment had improved in value by about 8 times.

On the subject of power supplies, the last data centre which I managed had two ship's diesels, each capable of carrying the mainframe and ancillary equipment load. I wonder in the case of Delta if it was some equipment other than the computers which failed. I remember in NZ a few years ago the highly robust point-of-sale network went down when someone found and cut the only "single point of failure" cable.

c52
8th Aug 2016, 18:06
My near-invariable experience in IT was that there would be a total failure when the annual test of the uninterruptible power supply took place.

Ian W
8th Aug 2016, 18:28
I wonder how long it will be before Delta has a redundant system set up, say at MSP? If they had done that, the failure could have been transparent to the airline apart from the people directly involved at ATL. It seems that the airline beancounters would prefer to upset their customers and give dispatch a really hard problem to solve at vast expense to the airline (just think of the EU-mandated payments!) rather than have an efficient system that is fault tolerant. Perhaps, if IT had asked for the backup and it had been refused, the costs of the global ground stop and recovery could be put on the accountancy head-count budget? That might concentrate their minds.

Smott999
8th Aug 2016, 18:35
What's interesting is that Georgia Power is denying there was an outage at all. None of their customers lost power and all their equipment was running.
Delta called them in to look at a master switch of some sort which had failed. Hmm.

What is that called.....single point of, something.....what is it again?

If it is something that silly and bad, heads will roll.

vector4fun
8th Aug 2016, 18:40
We had a lightning induced power failure years ago. Also had a bank of batteries and diesel generators to take over that failed to work. Seems the folks that maintained the UPS system monthly failed to disconnect the dummy load used for testing.....

Smott999
8th Aug 2016, 18:52
I used to work at an intl bank, and every 6-9 months we had to do a full Disaster Recovery drill. Took it very seriously. If these guys had their whole global network sitting on one switch....yikes.

Lonewolf_50
8th Aug 2016, 18:55
Y2K wasn't all that it threatened to be because they identified it beforehand and fixed it (at huge cost in some areas). You obviously missed that bit.

No, I didn't miss anything as I was involved in three different BCPs for Y2K. Thanks for playing. I just didn't add all of the bloody detail to that post, so perhaps I overdid the brevity.


My point is that some in current management, who probably weren't in management then, may perceive through MBA eyes that Y2K was "crying wolf" since they didn't know what it took to mitigate it. As you are probably aware, management is "up and out" and medium-to-high turnover is common.

Joe_K
8th Aug 2016, 19:51
Ars Technica has this:

"According to the flight captain of JFK-SLC this morning, a routine scheduled switch to the backup generator this morning at 2:30am caused a fire that destroyed both the backup and the primary. "

If true: oops.

Data center disaster disrupts Delta Air Lines | Ars Technica (http://arstechnica.com/business/2016/08/data-center-disaster-disrupts-delta-airlines/)

Smott999
8th Aug 2016, 20:19
So no separate data center? Or backup system on a different power circuit?

ex-EGLL
8th Aug 2016, 20:29
We had a lightning induced power failure years ago. Also had a bank of batteries and diesel generators to take over that failed to work. Seems the folks that maintained the UPS system monthly failed to disconnect the dummy load used for testing.....

Or my personal favorite from ATC. We religiously switched to the standby generator on the first Sunday of the month to make sure all was good, it always was. One day we lost power, generator fired up and all was good...... for a couple of minutes, then it went very dark and very quiet.

Seems the monthly startup / shutdown checklists made no mention of fuel quantity !!!

FakePilot
8th Aug 2016, 20:33
When everything is fully redundant, Murphy refocuses his effort on the part that controls both.

G-CPTN
8th Aug 2016, 20:37
Or my personal favorite from ATC. We religiously switched to the standby generator on the first Sunday of the month to make sure all was good, it always was. One day we lost power, generator fired up and all was good...... for a couple of minutes, then it went very dark and very quiet.

Seems the monthly startup / shutdown checklists made no mention of fuel quantity !!!
I think that was what 'sunk' New Orleans during Katrina - the auxiliary generators that were backup for the pumps 'ran out of fuel'.

Ian W
8th Aug 2016, 20:59
So no separate data center? Or backup system on a different power circuit?

This is the fact that makes no sense. If you are running a global system and it must be up and running 24/7 then you must not have all your systems in one place. Not only should there be a separate backup system, it must be a geographically separate backup system, ideally in another state - Delta could have theirs at MSP. Both the ATL and MSP systems should be running in parallel, with their own local backup for fault tolerance. Both should be able to support all operations 24/7, but when both are up they could share the load and keep each other in synch. Banks do this all the time. If ATL had gone down under this scheme, only those people in the ATL center would have noticed. All users would have the same service from the remaining site.

As said before this sounds like a beancounter enforced lack of fault tolerance.
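To show how simple the principle is, here is a bare-bones sketch in Python. The site names and the RecordStore class are invented purely for illustration - it is the shape that matters: every write goes to both sites, and every read is answered by whichever site is still up.

import concurrent.futures

class RecordStore:
    """Stand-in for one site's database/API (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

SITES = [RecordStore("ATL"), RecordStore("MSP")]

def write_booking(key, value):
    # Apply the write to every live site; succeed if at least one took it.
    # (A real system would also queue the missed write for re-sync.)
    accepted = 0
    for site in SITES:
        try:
            site.write(key, value)
            accepted += 1
        except ConnectionError:
            pass
    if accepted == 0:
        raise RuntimeError("no site accepted the write")

def read_booking(key):
    # Ask both sites in parallel and take the first good answer.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(site.read, key) for site in SITES]
        for f in concurrent.futures.as_completed(futures):
            try:
                return f.result()
            except (ConnectionError, KeyError):
                continue
    raise RuntimeError("no site could answer")

# If ATL goes dark after the write, nobody notices:
write_booking("DL123/2016-08-08", {"pax": 180})
SITES[0].up = False
print(read_booking("DL123/2016-08-08"))

Neither site is a "backup"; both serve live traffic all the time, so a dead site is an efficiency problem rather than an outage.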

Peter47
8th Aug 2016, 21:12
Here are some questions for IT experts - and I am sure there are plenty of issues I haven't even thought about.

If DL's computers are inop could other airlines help out, for example KLM produce flight plans for DL flights departing AMS or would there be problems owing to different regulatory regimes, etc? I know that there are specific rules relating to despatchers who have to be licenced in the US.

Where you have codeshare partners - let's say you are a DL-ticketed passenger travelling on a KL flight - would your reservation appear in both systems, which would effectively provide a backup? I have to say that I have visions of pax arriving at airports having to prove that they have a reservation - useful to have printed off a confirmation.

Presumably VS still has its own computer system as its ops appear to be unaffected.

crablab
8th Aug 2016, 21:15
#1 Nah, all the Delta databases will have been in that datacenter (from the sounds of it) - you'd need all that information to produce anything useful and it would be a right pain to modify the KLM systems to handle the data etc. You might as well do it by hand.

#2 I believe all passenger reservations are stored on a central system to aid access by the TSA etc., like APIS. So I guess you might be able to get data off of that?

Smott999
8th Aug 2016, 21:32
Indeed Ian, it makes no sense if they were without a physically remote hot backup data center with fully redundant data, available to switch on promptly should the main site be lost.
I wonder if there are or will be regs about that kind of thing.

Speaking of regs, what about EC261, the automatic compensation for delayed/cancelled flights in the EU? I've used it myself a few hrs back.

Lot of folks stranded in LHR or Amsterdam might want to make use of it! I wonder how many Yanks know about it though...

Logohu
8th Aug 2016, 21:41
I wonder how many Yanks know about it though...

Probably not many, but you can be sure their attorneys will ;)

archae86
8th Aug 2016, 21:51
One lesson I think I learned in another industry: if you want your backups and protections to work when they are needed, you have to actually integrate usage of them into your standard operations. No amount of "really really careful" testing is an adequate substitute.

True story: the factory site at which I worked which at times was probably on the worldwide top ten list for dollar value added across all industries, was so concerned about the single point failure of losing utility power (yes, we had lots of stuff on UPS, but some big stuff of interest was not) that they paid to have several miles of high-voltage connection made to a second point within the utility network. There was a nifty switch on our premises which at need would transfer our load from the one string of power towers to the other.

Came the day we needed the backup connection to work--not because of a failure of the utility, but because a forklift operator on our own premises accidentally damaged a very late connection line by swinging a load up into it.

The post-mortem established that the nifty transfer switch had a battery which needed to be alive for the transfer to happen. And there was no maintenance plan for looking after the battery, which had probably been dead for some time by the day of our need. That one cost a very, very, large amount of money.

Yes, a suitable test would have caught that one, but I'll still hold out for the higher standard of usage. That way people take it seriously, and people notice and fix the troubles.
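In software terms, one cheap way to hold to that higher standard - sketched below in Python, with every name hypothetical - is to route a small slice of ordinary daily traffic over the standby path, so a dead standby shows up in tomorrow's statistics rather than on the day you need it.

import logging
import random

STANDBY_SHARE = 0.05  # 5% of normal traffic exercises the standby path every day

def send_via_primary(request):
    ...  # placeholder for the real primary path

def send_via_standby(request):
    ...  # placeholder for the real standby path

def send(request):
    if random.random() < STANDBY_SHARE:
        try:
            return send_via_standby(request)
        except Exception:
            # The standby just failed under genuine load - alarm today,
            # while the primary is still healthy, not on the day of need.
            logging.exception("standby path failed on routine traffic")
    return send_via_primary(request)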

lincman
8th Aug 2016, 23:39
I understand BA have a fully duplicated back-up in a secret location where it can't be sabotaged. Maybe DL ought to visit BA and learn something?

etudiant
8th Aug 2016, 23:45
I understand BA have a fully duplicated back-up in a secret location where it can't be sabotaged. Maybe DL ought to visit BA and learn something?

Hope the transfer switch battery has been checked regularly. ;)

Water pilot
9th Aug 2016, 01:03
If it makes anyone feel better, years ago I worked for Microsoft in Redmond, WA. Cost for our data centers was no object and we had the best backup power systems that money could buy at the time. Along comes a big storm in Redmond that took out the power and, you guessed it, the lights go out, my screen goes dark, and the fire doors slammed shut.

I think lightning hit the backup generators. The good news is that my house was near "campus" and I got power restored days before anybody else.

underfire
9th Aug 2016, 01:51
From my experience, including working in Bldg #2 at Redmond, these 'experts' in backup and redundancy are akin to the 'experts' in aviation, always 'formerly' employed in the business.
Everything has relied on boilerplate checklists, with single point failures at virtually every point, so on paper it looks fine, but in operation, it falls apart.
In Seattle, the Police Department had backup generators for the systems. The weekly tests of the systems went fine. When an earthquake happened, the backup generator systems went down 2 hours after startup. The weekly tests had been using up the fuel storage, and there was never a contract in place to keep the tanks topped off.

RobertS975
9th Aug 2016, 02:59
Most "yanks" don't consult attorneys regarding canceled flights. DL has offered a $200 travel voucher to anyone who was canceled or delayed greater than 3 hours.

underfire
9th Aug 2016, 03:06
go for the 15K FF miles!

ExXB
9th Aug 2016, 06:10
Most "yanks" don't consult attorneys regarding canceled flights. DL has offered a $200 travel voucher to anyone who was canceled or delayed greater than 3 hours.
Yet EC261 provides for €400/600 in CASH, not in vouchers. It applies to all DL flights departing from an EU airport.

As passengers can't waive their rights they would be entitled to this in addition to their voucher.

ajamieson
9th Aug 2016, 08:57
Although passengers can't waive their rights, airlines do not have to offer compensation where a passenger has agreed an alternative. Accepting the voucher could be argued as agreement to an alternative offer.

But yes, definitely worth lodging an EC261 claim, as 600 Euro is way more use than an MCO/voucher that probably comes with a string of restrictions.

Smott999
9th Aug 2016, 11:15
I once had to go EC261 on United as they stranded me in London.
It literally took 18 months and they denied everything, saying their cancellation was "force majeure" and I was entitled to nothing. Until the courts ruled and it was time to pay up; then they tried to bribe me to drop the case. I told them they were in violation of the law for contacting me directly instead of my attorney, and hung up.
There are actually firms in the EU that do nothing but prosecute EC261 claims. I think they took about 15% and I had to do nothing except email them my boarding pass and other info. Not bad, but United was just appalling about it.

sabbasolo
9th Aug 2016, 11:18
Any idea why DL flights are still being cancelled today (Tuesday)? Positioning? Some loss of data?

Ancient Observer
9th Aug 2016, 11:22
It is seldom the back up generators that fail if they are rigorously tested. It is normally some switch somewhere, which no-one seems to own. IT folk think they do good Project management. Maybe they do for software implementations. For real Engineering, hire a real Engineer.

OldLurker
9th Aug 2016, 12:18
if you want your backups and protections to work when they are needed, you have to actually integrate usage of them into your standard operations. No amount of "really really careful" testing is an adequate substitute.

Yes!

the factory site at which I worked which at times was probably on the worldwide top ten list for dollar value added across all industries, was so concerned about the single point failure of losing utility power ... that they paid to have several miles of high-voltage connection made to a second point within the utility network.

One of my company's sites did something similar. But no clued person supervised the actual connection. Result: at a certain point close to the site, the two cables ran side by side within a foot or two. Yes, you guessed it: that was exactly where some guy with a backhoe, digging an unrelated hole, got the spot wrong and cut through both lines ...

er340790
9th Aug 2016, 14:18
Well, it's certainly a major wake-up call.

The proliferation of IT-based "solutions" in passenger air-transport recently has been remarkable: fully web-based reservations; on-line check-in; boarding cards via hand-held devices etc etc etc.

These days all one sees before Security at airports is a spotty 16-year old handling checked baggage - typically someone who wouldn't recognize a Manual System if it swam up and bit him.

When all is said and done, Delta's back-up and disaster recovery procedures clearly fell far short. No real excuse for that. :=

[Presumably anyone dying in SE States that day got an extra day on earth. "It doesn't matter if you go to Heaven or Hell, you still have to go via Atlanta!"] :E

FakePilot
9th Aug 2016, 14:26
It is seldom the back up generators that fail if they are rigorously tested. It is normally some switch somewhere, which no-one seems to own. IT folk think they do good Project management. Maybe they do for software implementations. For real Engineering, hire a real Engineer.

Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. Most engineering work I've observed is "will the whole thing work?" vs. software people "when can I use my favorite tool?"

RAT 5
9th Aug 2016, 16:20
Various observations come to mind:

1. Someone somewhere, perhaps, signed off on NOT installing a correct, suitable-for-the-worst-case, thoroughly and regularly tested backup system. Heads might roll, but don't hold your breath. They cannot pin 'pilot error' on this one.

2. Someone somewhere did not do a thorough threat/risk assessment of 'What happens if...?'

3. Someone somewhere was being overcomplacent. "It has never been a problem before, therefore it's OK."

4. When volcanic ash shuts down airspace, and puts a/c & crews where you did not plan them to be, you use your computer systems to sort out the consequential poo-pile. Oops, the poo-pile is caused by your own computer system. Now where is that pencil & rubber, slide-rule and abacus? What do you mean there's no paper back up? Oops.

This saga could go on long enough for Hollywood to make an epic drama out of it, at least a TV box set. Then you could throw in some foreign espionage conspiracy and ruin the whole truth. Ground Crash Investigation could have a field day with this one. Human error puts a company on the edge.

What will be interesting is the investigation into the root cause. I wonder whether that will ever see the light of day publicly. Check out the dole queue for a clue.

Flash2001
9th Aug 2016, 16:26
Never worked in airline reservations but did work many years in the broadcast industry. I was surprised at the number of redundant systems I found that assured system failure if either of the duplicate systems failed! Also worked as a millennium auditor in the same industry. We found several epoch related risks that had nothing to do with Y2K.

After an excellent landing etc...

EEngr
9th Aug 2016, 16:37
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer.

And that's often a management decision. Back when I worked for Boeing, the big thing was to compartmentalize the hardware and software development tasks. Theoretically so each one could be assigned to a group with the appropriate expertise, but more often than not because each discipline had an entrenched fiefdom. Later on, it was to facilitate outsourcing each task to different subcontractors (spread the blame).


This is how they treated their core competency: aircraft. The 'systems engineering' function (a top-down view of overall function) was mostly contract management and very little actual engineering. You can imagine what lack of attention was given to non-core functions (data centers, facilities, etc.)

bafanguy
9th Aug 2016, 20:59
Not that it matters terribly, but here's latest info:

http://finance.yahoo.com/news/delta-just-revealed-caused-computers-194608131.html

twochai
9th Aug 2016, 21:12
A wake up call? Delta already had the wake up call, ten years ago:

Comair's Christmas Disaster: Bound To Fail | CIO (http://www.cio.com/article/2438920/risk-management/comair-s-christmas-disaster--bound-to-fail.html)

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen!

The CEO of Comair walked the plank! Wonder if that'll happen this time around!

xs-baggage
9th Aug 2016, 21:50
Presumably VS still has its own computer system as its ops appear to be unaffected.
As far as I'm aware VS is still hosted on the former EDS SHARES, now owned and operated by HP. CO also sat on SHARES (don't know if that's the case since the UA merger).

BA, mentioned in another post, is also hosted by a third party with a very resilient system indeed. I'm surprised that DL didn't move to third party hosting when they finally dropped the legacy in-house DELTAMATIC system.

OldCessna
9th Aug 2016, 23:22
Amazon computer services offered a much more redundant system and they (Delta) didn't want to pay the money.

You can assume Amazon are pretty much switched on with systems.

So Delta are running a 20-25-year-old system in which, if one hub goes down, so does the rest. All the senior IT execs are former IBM. That sums it up, of course.

On another positive note all the competition are doing really well from this total Fook up!

Ian W
10th Aug 2016, 10:49
From the Yahoo Link above:

"Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power," Delta COO Gil West said in a statement on Tuesday. "The universal power was stabilized and power was restored quickly."

However, the trouble obviously didn't end there. A Delta spokesperson confirmed to Business Insider earlier today that the airline's backup systems failed to kick in.

And here we have the fundamental fault in the design. The 'backup' system should be operating all the time as a part of the live system. To all intents and purposes you have a widely distributed system that usually operates very efficiently. When part of the system fails all that happens is that the remaining part of the system carries on operating slightly less efficiently. There is no impact at all on operations and no failover to worry about.
I have no doubt that the IT people would want to have a fault tolerant system, but the beancounters will have said, "How often do things fail? What is the cost of two computer centers? We are not paying that; they can stay in the same building...." And there will be a Delta beancounter with his abacus out, even now, saying that they still won on the deal.
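To put rough numbers on that argument - all of these figures are invented for illustration, none of them are Delta's actual costs - the back-of-envelope sum looks like this:

# All figures made up for illustration - the point is the comparison, not the numbers.
p_total_outage_per_year = 0.05        # say a 1-in-20 chance of a site-killing event each year
cost_of_outage = 150_000_000          # cancellations, EU261, goodwill vouchers, lost bookings
expected_annual_loss = p_total_outage_per_year * cost_of_outage    # 7,500,000 per year

cost_of_second_live_site = 5_000_000  # annual cost of running a geographically separate site
print(expected_annual_loss > cost_of_second_live_site)             # True: the 'saving' loses money

Even with generous assumptions in the beancounters' favour, the second live site tends to come out cheaper than the expected loss.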

Ian W
10th Aug 2016, 11:08
A wake up call? Delta already had the wake up call, ten years ago:

Comair's Christmas Disaster: Bound To Fail | CIO (http://www.cio.com/article/2438920/risk-management/comair-s-christmas-disaster--bound-to-fail.html)

Of course, Comair was only a subsidiary of Delta, not part of the main team - they were too smart to let such a thing happen!

The CEO of Comair walked the plank! Wonder if that'll happen this time around!

There is a little more to this story.
I have been told that, in order to reduce stock holding at the airport, Cincinnati kept only a small supply of deicer and, when snow/ice/freezing weather was forecast, would call for supplies sufficient for the expected weather. In this case the tankers of deicer were on their way to the airport but were pulled over by law enforcement and told it was too dangerous for them to carry on driving due to the snow. So the airport was unable to deice aircraft and operations were halted. Not only did the aircraft tires freeze to the ground, but the jetways also froze in position.

Lots of holes in the cheese lined up. A really good learning exercise for the MBAs who run airports these days.

Tech Guy
10th Aug 2016, 11:18
The last company I worked for had mirrored data centres in Europe, Asia and America.
You could lose any two and everything would still work correctly at a "user level".

Ian W
10th Aug 2016, 12:54
The last company I worked for had mirrored data centres in Europe, Asia and America.
You could lose any two and everything would still work correctly at a "user level".
Exactly.
There is only one explanation really, Delta beancounters felt the cost of a fault tolerant system made it worth taking the risk of a total system failure. Yet the cost of the backup system running as a 'hot spare' in a separate building would be peanuts compared to their cash and status losses now. There are still flights being cancelled today and their computer systems are still not recovered with lots of broken links and applications not back in synch. All those people with their 'e-boarding passes' on their phones could be in trouble. This may run on for months with people with bookings out months suddenly finding that the roll-back/roll-forward broke their bookings.

They should take the $200 a pax good will payments out of their beancounters' head count budgets. Only then with skin in the game would they appreciate the risk analyses.

procede
10th Aug 2016, 13:11
Fact of the matter is that every backup system will introduce new failure modes.
It happens that everything stops because of inconsistency between primary and secondary systems. Systems can become unavailable as they need to re-synchronize (a common one is where a drive in a RAID array fails and the server starts filling the hot spare). The best one I ever experienced was a UPS that failed: everything had power, except the systems behind the UPS...

MarkerInbound
10th Aug 2016, 13:49
Any idea why DL flights are still being cancelled today (Tuesday)? Positioning? Some loss of data?

You have a crew scheduled to fly to Podunk and spend the night then fly the morning departure back. They never got there so there is no crew (or aircraft) for the morning flight.

.Scott
10th Aug 2016, 17:15
Take away all tools from the Engineer except one (i.e. hammer). Now you have a programmer. Most engineering work I've observed is "will the whole thing work?" vs. software people "when can I use my favorite tool?"

The "favorite tool" and/or "favorite language" syndrome is the sign of a junior programmer.

In this case, lots of schemes would have worked. As others have said, they just needed to actually exercise the one they picked. When it comes to software and information systems, if it hasn't been tested, it doesn't work.
So we have the first round of testing: Not bad, up in only a few hours. Too bad it wasn't a test.

Actually, a major cost in these systems is the testing. As each revision is made to the system, you need ways to routinely simulate and check daily activity without actually relying on that system.
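A bare-bones version of that kind of routine check might look like the sketch below (Python, with every name hypothetical): replay yesterday's recorded transactions against the standby or staging copy and diff its answers against what production actually said at the time.

def replay_against_standby(recorded_transactions, standby_handler):
    # Feed recorded requests to the standby system and compare its answers
    # with the answers production gave when the transactions were live.
    mismatches = []
    for txn in recorded_transactions:
        answer = standby_handler(txn["request"])
        if answer != txn["production_response"]:
            mismatches.append((txn["request"], txn["production_response"], answer))
    return mismatches

# Toy example - an empty list means the standby agrees with production:
log = [{"request": "seatmap DL123", "production_response": "34 seats free"}]
print(replay_against_standby(log, lambda request: "34 seats free"))

Run that every night and a stale or broken standby is a ticket in the morning, not a surprise during an outage.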

neilki
10th Aug 2016, 19:14
For the record, everyone I worked with or watched over the last few days met the challenges at hand with grace and patience. It was impressive to watch people pull together to keep the show running. I would take my hat off to them, but I'm not supposed to...

West Coast
10th Aug 2016, 19:22
Quote:
Any idea why DL flights are still being cancelled today (Tuesday)? Positioning? Some loss of data?


As MI mentioned, after IROPS it takes a few hours to days to get the system back to normal ops. Bet the reserve complements are getting heavily used.

Ian W
10th Aug 2016, 19:27
Fact of the matter is that every backup system will introduce new failure modes.
It happens that everything stops because of inconsistency between primary and secondary systems. Systems can become unavailable as they need to re-synchronize (a common one is where a drive in a RAID array fails and the server starts filling the hot spare). The best one I ever experienced was a UPS that failed: everything had power, except the systems behind the UPS...

That is why you do not have 'backup systems'; you have a widely distributed fault-tolerant system. Yes, they are a real pain to test, as Scott says above, especially the regression testing after every change and fix. But Delta is probably wishing it had spent the money on a distributed system.

And Neilki, I was talking with your compatriots at around 4am this morning - they are doing a good job. I can only imagine the workload in ops and dispatch over the last few days. :eek:

G-CPTN
10th Aug 2016, 19:48
I can only imagine the workload in ops and dispatch over the last few days. :eek:
Pah! I was about to board a B747 from HKG to TPE when the check-in system went down.

It took the best part of two hours to process the passengers and issue boarding cards (I don't remember whether they were hand-written - it was about 30 years ago).

ph-sbe
11th Aug 2016, 00:05
From an I.T. management point of view, Delta are stupid.

Let's assume for a second that the whole outage was indeed caused by power circuits going down and backup power not kicking in. Fine, **** happens. Stuff fails.

What SHOULD have happened is that their entire system failed over to a backup SITE within 30 seconds. That is not impossible to do, and I just delivered a system (three months ago) that does just that. Ask yourself why Google, Facebook etc never go down. Redundancy, redundancy, redundancy and redundancy. And trust me, as an I.T. professional (MSc + JNCIE) I can guarantee you that this is not rocket science.

This time it's a power failure. Next time it's a criminal or terrorist act that takes out the entire DC.
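The mechanism itself is mundane. A crude sketch of the idea (Python; the hostnames and the repoint_traffic() function are placeholders I've made up - real systems do this with DNS, anycast or load balancers) is just a health-check loop with a time budget:

import time
import urllib.request

PRIMARY = "https://ops.primary-site.invalid/health"      # invented URLs
SECONDARY = "https://ops.secondary-site.invalid/"
CHECK_EVERY = 5        # seconds between health checks
FAILURES_TO_TRIP = 3   # 3 consecutive misses -> fail over, well inside 30 seconds

def healthy(url):
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

def repoint_traffic(target):
    print(f"routing all traffic to {target}")   # placeholder for the real DNS/LB switch

def watchdog():
    misses = 0
    while True:
        misses = 0 if healthy(PRIMARY) else misses + 1
        if misses >= FAILURES_TO_TRIP:
            repoint_traffic(SECONDARY)
            return
        time.sleep(CHECK_EVERY)

The hard part is not the loop, it is making sure the secondary site genuinely has the data and capacity to take the load when the switch happens.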

413X3
11th Aug 2016, 00:40
Clearly these airlines have done little investment in IT systems and their own staff, probably preferring to outsource everything and pay expensive consultants to come in once in a while. Splitting up your servers at various sites scattered around the country, or the world, is expensive, but necessary when you rely on systems to function as a company. Will airlines invest in their infrastructure rather than worrying about quarterly results and executive parachutes?

etudiant
11th Aug 2016, 02:44
Clearly these airlines have done little investment in IT systems and their own staff, probably preferring to outsource everything and pay expensive consultants to come in once in a while. Splitting up your servers at various sites scattered around the country, or the world, is expensive, but necessary when you rely on systems to function as a company. Will airlines invest in their infrastructure rather than worrying about quarterly results and executive parachutes?


In fairness, remember that all the major US carriers have until recently experienced real financial difficulty, with bankruptcy rife. That is a difficult environment in which to fund a major DP upgrade that does not provide immediate economic advantage, and so these carriers have let their legacy systems soldier on too long.
This incident will remind their managements to reassess that decision.

edmundronald
11th Aug 2016, 04:29
**** happens. Even Google and Apple occasionally have their sites go dark. Let's hope Boeing and Airbus have fewer critical failures.

This is probably a nightmare for the insurers.

PAXboy
11th Aug 2016, 23:59
I spent 27 years in IT, and much of that was spent putting the case to management as to why they had to spend the money if they wanted to achieve the aims that they said they did.

Dear Main Board Directors of DL:

You know that, when you buy a twin engine aircraft, the donkeys are BIG? Each has to have enough reserve power to be able to continue safely when a donkey conks at V1.

It's exactly like that. We need two mighty big IT Donkeys to get our passengers safely to their destination so that you can stay on the golf course.

:E

sherburn2LA
12th Aug 2016, 02:48
In my 37 years of IT (starting when it was DP) both sides of the pond I have never worked anywhere there was a shortage of donkeys.

twochai
12th Aug 2016, 15:22
I have never worked anywhere there was a shortage of donkeys.

One can only say "Amen" to that!

steamchicken
12th Aug 2016, 15:59
Sounds like a case for the Chaos Monkey, and its friends in the Simian Army (https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey).

(Netflix developed an application, called Chaos Monkey, that causes randomly selected servers and/or network links to fail deliberately, in order to test their failover arrangements, collect data on the recovery process' performance, and most importantly, to demonstrate that they work. And they're just selling TV.)
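In spirit it's only a few lines. A toy version, with terminate_instance and service_still_healthy standing in for whatever the real platform calls would be (both are made up here):

import random

FLEET = ["res-app-01", "res-app-02", "ops-db-01", "ops-db-02"]   # invented instance names

def terminate_instance(name):
    print(f"killing {name}")   # a real version would call the cloud/VM management API

def service_still_healthy():
    return True                # a real version would hit the service's health endpoint

def chaos_round():
    # Pick a random instance, kill it, and check the service survived.
    victim = random.choice(FLEET)
    terminate_instance(victim)
    assert service_still_healthy(), f"losing {victim} took the service down"

chaos_round()

The point is cultural as much as technical: if something like this runs every working day, nobody dares build a system that cannot survive losing a box.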

vector4fun
12th Aug 2016, 20:49
Someone somewhere, perhaps, signed off on NOT installing a correct, suitable-for-the-worst-case, thoroughly and regularly tested backup system. Heads might roll, but don't hold your breath. They cannot pin 'pilot error' on this one.

Someone somewhere did not do a thorough threat/risk assessment of 'What happens if...?'


Pilots, controllers, firemen, etc are taught to think "What happens if..." They are safety personnel, and need to always have a backup plan or even two.

But if you work in an office or cubicle, you get hammered for always asking "Yeah, but what if...." Then you're crapping on someone's plan and budget, and delaying the project. Those people do not think like we do, never will.

G-CPTN
12th Aug 2016, 22:00
We used to call it 'contingency'.

Ian W
12th Aug 2016, 22:10
From an I.T. management point of view, Delta are stupid.

Let's assume for a second that the whole outage was indeed caused by power circuits going down and backup power not kicking in. Fine, **** happens. Stuff fails.

What SHOULD have happened is that their entire system failed over to a backup SITE within 30 seconds. That is not impossible to do, and I just delivered a system (three months ago) that does just that. Ask yourself why Google, Facebook etc never go down. Redundancy, redundancy, redundancy and redundancy. And trust me, as an I.T. professional (MSc + JNCIE) I can guarantee you that this is not rocket science.

This time it's a power failure. Next time it's a criminal or terrorist act that takes out the entire DC.

Almost right - except I would make it a widely distributed redundant system, so there is no 'backup', just a system with two (or more) identical parts sharing data and transactions, with redundant copies of all the data. So it is very over-powered for what it is doing, but any site can fail and the users don't notice - not even a 30-second switch-over. A distributed system is as fault tolerant (or more so) as a main system with a standby, but there is no expensive system sat doing nothing and failover is instant and transparent to the users.

STN Ramp Rat
12th Aug 2016, 22:26
Clearly these airlines have done little investment in IT systems and their own staff, probably preferring to outsource everything and pay expensive consultants to come in once in a while. Splitting up your servers at various sites scattered around the country, or the world, is expensive, but necessary when you rely on systems to function as a company. Will airlines invest in their infrastructure rather than worrying about quarterly results and executive parachutes?

Quite the opposite, I believe: I understand Delta have not outsourced their IT, preferring to keep it all in house. Airlines are airlines, and their core competency is the airline game, not the IT game. Sometimes it's best to outsource a key function to a company whose core competency it is. Don't assume outsourcing is bad; after all, as SLF we outsource our travel arrangements every time we get on board an aircraft, and I can assure you that's a lot safer than me trying to fly myself everywhere.

twochai
13th Aug 2016, 21:28
Don't forget, Delta even 'in-sources' their jet fuel - they own the refinery!! I trust they do a better job refining crude oil than refining crude IT!

PAXboy
14th Aug 2016, 21:13
The key difference is that if the item (IT in this case) is 'in-sourced', then when it goes wrong the CEO can reach out, grab the person responsible (the IT Director) and grasp them warmly by the throat. They can even be out of their job by the end of the day.

With outsourcing you cannot do that. You can terminate the contract, but that will take months of negotiation and cost more money to change to another bunch.

When an employee knows that they cannot be made to carry the can, their attitude is different. It is what makes pilots different. As we know, CEOs often pass the can to others too. I have never liked outsourcing because it costs a lot of things that do not show up on the balance sheet.