PPRuNe Forums (https://www.pprune.org/)
-   Passengers & SLF (Self Loading Freight) (https://www.pprune.org/passengers-slf-self-loading-freight-61/)
-   -   BA delays at LHR - Computer issue (https://www.pprune.org/passengers-slf-self-loading-freight/595169-ba-delays-lhr-computer-issue.html)

The_Steed 6th Jun 2017 11:26

Even if you take the excuse at face value (that someone hit the big red button), it still doesn't explain why their systems didn't fail over to the backup Data Centre.

I find it unbelievable that BA would not have automated failover to a backup Data Centre since they are running 24x7x365 safety critical systems. The pertinent data should be replicated in real time, so apart from a brief interruption, there shouldn't even have been any indication to the users that anything had happened.

Putting that to one side, they should still have been able to get things back up and running again pretty quickly. I would have expected BA to have pretty modern kit, so I would have thought the biggest risk would be from data corruption rather than hardware issues. Then it's just a case of restoring from backup - which is easy because you test that process on a regular basis :)
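
For what it's worth, the automated failover The_Steed describes is usually driven by nothing more exotic than a health check that promotes the standby site after a few consecutive failures. A minimal sketch, with invented hostnames, port and thresholds:

```python
# A minimal sketch of health-check-driven failover between two sites.
# The hostnames, port and thresholds are invented for illustration only.
import socket
import time

PRIMARY = ("dc-a.example.internal", 5432)   # hypothetical primary endpoint
STANDBY = ("dc-b.example.internal", 5432)   # hypothetical replicated standby
FAILURES_BEFORE_FAILOVER = 3                # avoid failing over on a single blip

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor():
    active = PRIMARY
    consecutive_failures = 0
    while True:
        if is_reachable(*active):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if active is PRIMARY and consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                print("Primary unreachable, switching to standby")
                active = STANDBY      # a real setup would also repoint DNS or a VIP
                consecutive_failures = 0
        time.sleep(5)                 # poll interval

if __name__ == "__main__":
    monitor()
```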

Joe_K 6th Jun 2017 11:28


Originally Posted by Ian W (Post 9794296)
From that article:
The BA system was obviously not designed to be fault tolerant. Or the system had been put into a state where it was not fault tolerant by people not knowing what they were doing.

Here's a quote from a Register article, which leads to some obvious conclusions:

However, within the comments of the BA chief executive there is one telling statement:
Tens of millions of messages every day that are shared across 200 systems across the BA network and it actually affected all of those systems across the network.
Sorry for the text speak, but WTF? How does it require 200 systems to issue a boarding pass, check someone in and pass their security details on to the US – even if they aren't going there? Buried deep in The Register comments on the article is an allegedly former BA employee claiming that this is in fact the case, that all of these systems are required for BA to function. How did BA get to the point that there are 200 systems in the critical path?
Source: https://www.theregister.co.uk/2017/0...path_analysis/

PAXboy 6th Jun 2017 12:21

Restarting a data centre is just like starting an aircraft: there is a sequence that has been tested and proved correct. Any component/generator/system that depends on another item being running will be set to start after it. There is testing of links to other systems - just like checking 'full and free'.

It used to be that you started your car by setting manual controls and then going to the front of the car to swing a handle. Once it had 'caught', you jumped into the seat to adjust choke and mixture etc. Now the car does it all for you when you turn the key/push the button and it sequences everything in the right order.

Wrong order for anything and the flight crew have to go back to the top of the checklist before calling for push or moving under own power, or turning onto the active. So the question is: What state was BA's startup list in and when was it last read?
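
In software terms, that startup list is a dependency graph, and the 'correct sequence' is a topological sort of it. A minimal sketch of the idea, with invented service names and dependencies:

```python
# A minimal sketch of a dependency-ordered startup list, in the spirit of the
# checklist analogy above. Service names and dependencies are invented.
from graphlib import TopologicalSorter   # Python 3.9+

# Each service lists the services that must already be running before it starts.
dependencies = {
    "storage_array":    set(),
    "core_database":    {"storage_array"},
    "message_broker":   {"core_database"},
    "check_in":         {"core_database", "message_broker"},
    "departure_boards": {"message_broker"},
}

startup_order = list(TopologicalSorter(dependencies).static_order())
print("Start services in this order:", startup_order)
# A CycleError here would mean two systems each wait for the other to start:
# exactly the kind of deadlock a restart runbook has to catch in advance.
```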

bbrown1664 6th Jun 2017 12:32


Originally Posted by The_Steed (Post 9794333)
Then it's just a case of restoring from backup - which is easy because you test that process on a regular basis :)

Assuming they have the latest and greatest tape drives (which is unlikely), you still need to get the tapes back from the off-site location (assuming the local copy is corrupt or unavailable), which can take a couple of hours, and then you start the restore.
DR restores are normally done in a specific sequence depending on the priority of the system to the business.
The latest tape drives can transfer up to 1TB per hour. They probably have several in a library or two, meaning you could theoretically transfer 4-8TB per hour off the tapes, assuming the rest of the infrastructure can take it.
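
A rough back-of-envelope on those throughput figures (the 40TB database size below is invented purely for illustration):

```python
# Back-of-envelope restore time at the throughput quoted above: roughly 1TB per
# hour per drive, with 4-8 drives in a library. The 40TB database size is invented.
def restore_hours(data_tb, drives, tb_per_hour_per_drive=1.0):
    """Hours to stream data_tb off tape, assuming the drives run in parallel
    and the rest of the infrastructure can absorb the throughput."""
    return data_tb / (drives * tb_per_hour_per_drive)

for drives in (4, 8):
    print(f"{drives} drives: {restore_hours(40, drives):.0f} hours for a 40TB database")
# Add a couple of hours up front for recalling the tapes from the off-site store.
```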

Most DBAs and system admins are not good at archiving data, though, so the main databases will be massive and will need restoring in full before you can bring the database back online. Until that happens, the other systems may as well be in another universe, as they cannot do anything useful.

The customer I currently work with has its systems split into several priority categories. The highest priority should be up and running within 15 minutes, which can only happen in an ideal world where you don't have corruption. The 2nd category is 4 hours, the 3rd 24 hours and the 4th 7 days. Many of the systems in the 2nd and 3rd categories won't work fully without the 4th-category systems running, so whilst they can be used for reference, the business cannot operate fully until everything is back online. Blame the developers for that one, as well as the business managers for wanting to cut costs.

DType 6th Jun 2017 13:20

Dumb suggestion:-
Maybe data centre A had a failure, and operation had been passed seamlessly to DC B, when someone did a NO NO at DC B.
A bit like being AOK to survive with one engine out, when the second engine shuts down!
Although I would have expected BA to blame "exceptional circumstances" had that actually happened.

Ian W 6th Jun 2017 15:38


Originally Posted by PAXboy (Post 9794378)
Restarting a data centre is just like starting an aircraft: there is a sequence that has been tested and proved correct. Any component/generator/system that depends on another item being running will be set to start after it. There is testing of links to other systems - just like checking 'full and free'.

It used to be that you started your car by setting manual controls and then going to the front of the car to swing a handle. Once it had 'caught', you jumped into the seat to adjust choke and mixture etc. Now the car does it all for you when you turn the key/push the button and it sequences everything in the right order.

Wrong order for anything and the flight crew have to go back to the top of the checklist before calling for push or moving under own power, or turning onto the active. So the question is: What state was BA's startup list in and when was it last read?

But there are geographically separate data centres backing each other up - or so we are told. That is obviously not the case. It would appear that what they have really been operating is a closely coupled distributed system which provides (provided) no redundancy or fault tolerance. Someone seems to have implemented something that turned an otherwise redundant system into a single monolith, where failing any part of the monolith results in a total system crash. This was indeed what was reported, with phone and display board failures.

This was not a power supply fault, although that exposed it - it was a gross system architecture design failure. I can't imagine that it was originally set up like that; it is more likely that someone removed the redundancy from the system in some way, possibly through ignorance of how the fault tolerance operated.
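
One common way to stop 200 interdependent systems behaving as a single monolith is to put timeouts and circuit breakers between them, so a dead dependency degrades service rather than stalling everything behind it. A minimal sketch of the idea, not a claim about BA's actual architecture; the example calls are invented:

```python
# Minimal circuit-breaker sketch: a caller stops hammering a dead dependency and
# degrades gracefully instead of hanging. Purely illustrative of how loose
# coupling limits the blast radius of one failed system; not BA's architecture.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None        # None means the circuit is closed (healthy)

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()            # fail fast rather than wait on a dead system
            self.opened_at = None            # half-open: try the real call again
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time() # open the circuit
            return fallback()

def fetch_seat_map():
    raise ConnectionError("downstream system unavailable")   # simulated outage

breaker = CircuitBreaker(failure_threshold=3)
for _ in range(5):
    print(breaker.call(fetch_seat_map, fallback=lambda: "fall back to manual seat allocation"))
```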

bbrown1664 6th Jun 2017 16:04


Originally Posted by Ian W (Post 9794590)
I can't imagine that it was originally set up like that; it is more likely that someone removed the redundancy from the system in some way, possibly through ignorance of how the fault tolerance operated.

Unfortunately I can. In my experience, the original developers didn't design any resilience in. Then an infrastructure architect got hold of it and designed in all of the resilience and failover architecture. Then the bill payer saw the price and ordered it to be redesigned to meet a budget of 10-50% of the infrastructure architect's design. Finally, a solution goes in that doesn't meet the original requirements but meets a price, on the basis that the "what-if" scenarios are so rare they can be discounted - until they actually happen...

Been there, seen it, done it and got the t-shirt.

In addition, the one thing you cannot protect against fully in this situation is the master system writing corruption to the disk. The storage devices then replicate it to the DR location, because they assume the master knew what it was doing, and you end up with corruption on both sites.
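
This is exactly why replication alone is not a backup: the mirror faithfully copies whatever the master writes, corruption included, which is why most shops pair it with point-in-time snapshots or a deliberately lagged replica. A toy illustration (the booking records and snapshot schedule are invented):

```python
# Toy illustration of why replication copies corruption to the DR site, while a
# point-in-time snapshot (or a lagged replica) still holds clean data.
# The booking records and snapshot schedule are invented.
import copy

primary = {"booking:123": "LHR-JFK 09:40", "booking:456": "LHR-EDI 12:15"}
snapshots = []

def take_snapshot():
    snapshots.append(copy.deepcopy(primary))    # e.g. hourly, retained for 24 hours

def replicate_to_dr():
    return copy.deepcopy(primary)               # DR mirrors whatever the primary holds

take_snapshot()
primary["booking:123"] = "\x00\x00CORRUPT\x00"  # the master writes garbage to disk...
dr_copy = replicate_to_dr()                     # ...and the DR site dutifully copies it

print("DR copy is corrupt too:", dr_copy["booking:123"])
print("Last snapshot is clean: ", snapshots[-1]["booking:123"])
```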

pax britanica 6th Jun 2017 17:29

Kill switches are not a very good idea unless they are designed to shed electrical loads gradually, which is quite possible, in the same way airliners have different busses. A straightforward off switch is liable to do more damage than the reasons for using it: if there is a small fire, why kill the smoke extraction system; if there's a big fire, it's too late to turn off the power anyway.

If someone feels the necessity for such a device, it has to be behind a guard to prevent accidental use, or behind glass in a break-glass emergency case. In any event it is questionable why one person is allowed to work alone on something of this scale in today's world, especially if they are a contractor - oh, there is one reason, it's cheaper. Wonder which policy BA adopted.

bbrown1664 6th Jun 2017 18:37

Sensible idea and the reason you have the kill switches in the data rooms. You don't want to be holding a wet hose that starts spraying the 3-phase supply.

As for the smoke extraction systems, they are on a different circuit normally and the fans for those are well away from the data halls so smoke can still be extracted from a de-energised room.

vikingivesterled 6th Jun 2017 20:50


Originally Posted by Heidhurtin (Post 9794297)
Whereas I have spent the last 2 years disconnecting these "big red switches" whenever I find one in any of my DC's (usually in the smaller installations). There are gas suppression systems to take care of any fire situation and automatic disconnection of electrical supplies to cater for any electrical fault (down to individual circuit level).
There is simply no need for a master cut off switch to kill the whole hall. It's not as if there are big mechanical nasties whirling around and looking to cause injury to the unwary (think workshop or factory) which certainly need some form of emergency manual intervention capability.

Gas suppression may take care of a fire, but what if somebody is electrocuted in the DC? The police will come and demand all power be turned off, then stick police tape around the place and investigate it for a couple of days. It doesn't help that many older raised floors are made of metal. Were you planning to drag the poor soul outside before you called the emergency services?
And what about fires above or below, where the building is not only used to house computers? Certain fires demand water, and the fire brigade won't touch the place if the power is still on. A whole warehouse in Norway recently burned for days, partly because the fire brigade wouldn't set foot on the roof since it was covered in solar panels and they couldn't be sure there were no live currents.

Banana4321 6th Jun 2017 21:40


Originally Posted by The_Steed (Post 9794333)
I find it unbelievable that BA would not have automated failover to a backup Data Centre since they are running 24x7x365 safety critical systems.

No they are not. Well not in the datacentres anyway. On the planes...maybe

MurphyWasRight 6th Jun 2017 23:39

I smell a bit of a rat around the "physical damage" from uncontrolled power on.

While it is painfully true that individual pieces of equipment may fail on power-up if they were marginal, the idea that switching off, counting to 20 and switching on again would cause widespread physical damage just does not pass the sniff test.

If it did, that would indicate very poorly designed, marginal and fragile internal power distribution.
The worst case should just be some tripped breakers if the startup surge was greater than capacity.

One sees that occasionally on utility service outages, where a line transformer will pop its fuse when the service is restored and the whole block's refrigerators try to start at once.

This does not cause a damaging surge and is typically resolved with a new fuse.
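
MurphyWasRight's point is essentially an inrush-budget sum: bring the load back in staggered waves so the combined switch-on surge never exceeds the feed capacity, and the worst you get is a wave that has to wait. A toy calculation with made-up figures:

```python
# Toy inrush-budget check: bring racks back in staggered waves so the combined
# switch-on surge stays under the feed's capacity. All figures are made up.
RACK_INRUSH_AMPS = 120      # brief surge per rack at switch-on (illustrative)
RACK_STEADY_AMPS = 30       # steady-state draw per rack once running (illustrative)
FEED_CAPACITY_AMPS = 800
RACKS = 20

def peak_current(racks_already_on, racks_switched_now):
    """Worst-case current while racks_switched_now racks are still in their inrush."""
    return racks_already_on * RACK_STEADY_AMPS + racks_switched_now * RACK_INRUSH_AMPS

print("All 20 racks at once:", peak_current(0, RACKS), "A")   # 2400 A: breakers trip

# Shrink the wave size until the worst wave (everything else already running)
# fits within the feed capacity.
wave = RACKS
while peak_current(RACKS - wave, wave) > FEED_CAPACITY_AMPS:
    wave -= 1
print("Largest safe wave size:", wave, "racks")               # here: 2 racks at a time
```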

Heidhurtin 7th Jun 2017 05:43


Originally Posted by vikingivesterled (Post 9794842)
Gas suppression may take care of a fire, but what if somebody is electrocuted in the DC? The police will come and demand all power be turned off, then stick police tape around the place and investigate it for a couple of days. It doesn't help that many older raised floors are made of metal. Were you planning to drag the poor soul outside before you called the emergency services?
And what about fires above or below, where the building is not only used to house computers? Certain fires demand water, and the fire brigade won't touch the place if the power is still on. A whole warehouse in Norway recently burned for days, partly because the fire brigade wouldn't set foot on the roof since it was covered in solar panels and they couldn't be sure there were no live currents.

Posting from a mobile so forgive the poor formatting. Regarding the issues of electrocution - in a properly designed area the supply to the component causing the shock would be disconnected before anything fatal happened, assuming everything is correctly bonded etc. In any case the "big red button" cannot be used to protect against electrical fault or shock for obvious reasons - by the time you get to it you're already too late.

If the fire brigade, or police or any other emergency service, required the installation to be powered down in an emergency this could be done directly from the UPS and other switchgear (accepting the need to invoke disaster recovery and role swap to the backup) within 2-3 minutes - less than the likely response time of any fire service. I'll grant there could be a problem with solar panels which can't be switched off, but we don't have these in a data hall. How would the fire service deal with UPS battery strings which can reach several hundred volts?
I restate the point - there's no need for a "big red button".

artee 7th Jun 2017 05:49

Airlines spend half the norm on IT? - The Economist
 
"The first lesson from such painful experiences is to refrain from pruning investment in IT too far, as some airlines may have in their desperate efforts to fend off budget competitors. “Legacy carriers like BA saw spending on this as an overhead,” says Henry Harteveldt of Atmosphere Research, a consultancy. “But it should be seen as a cost of doing business.” In 2015 airlines spent 2.7% of their revenues on IT, half the norm across all industries and a lower share even than hotels."

Scary if true.
http://www.economist.com/news/busine...ys-botches-its

OldLurker 7th Jun 2017 08:23


Originally Posted by artee (Post 9795080)
“Legacy carriers like BA saw spending on [IT] as an overhead,” says Henry Harteveldt of Atmosphere Research, a consultancy.

To be fair, that was the attitude in many companies, not only airlines, that were run by an older generation of top managers who didn't really understand the pervasive necessity of IT in the modern world. Most of those companies have learned their lesson now.

“But it should be seen as a cost of doing business.”
That's what an overhead is: a cost of doing business. The difficulty often was to get the dinosaurs in management to see IT as producing benefit, not cost only. That wasn't helped by turf wars within certain companies.

Example from a large company (not air industry), anonymised to protect the guilty: two relatively small units, because of what they did, each legitimately spent as much on IT as the whole of the rest of the company. Both fought tooth and nail against integrating their IT with the rest of the company, against counting their IT spend in the company's total IT spend and, crucially, against counting even small parts of their profits or successes as 'benefits of IT'.

"In 2015 airlines spent 2.7% of their revenues on IT, half the norm across all industries and a lower share even than hotels."
Crude percentages like that are meaningless. There's no real 'norm' across all industries – I think he must mean the average across all industries, which is meaningless too. How much does Google spend on IT? I don't know, but obviously the percentage must be high because that's most of what Google does. OTOH, as some people here may have noticed, airlines have to spend a lot of money on certain highly complex and very, very expensive equipment (clue: it's not IT*). The fact that their spend on IT is less, as a percentage of their total spend, than in other industries is hardly a surprise.

* Although, of course, modern aircraft contain a lot of embedded IT. Is the purchase and maintenance cost of that kit counted as 'spending on IT'? Turf wars over again?

pax britanica 7th Jun 2017 11:11

Artee
While the Economist is not a reliable source about many things, let's assume for once it is right about IT spending. If that is a low percentage, it is probably because airlines have such massive big-budget items - airframe depreciation and fuel - that push the IT percentage down compared to, say, a bank, which spends relatively more in percentage terms on IT because it doesn't have to buy things like A380s or spend a fortune on fuel.

The big problem, though, is where IT is just regarded as a cost centre and not a contributor - a vital one - to revenues and market development. Most established big companies have got into the rather bizarre and dangerous mindset that revenue will keep coming in because it has done within living memory, and therefore what needs to be done is to control, i.e. cut, costs. BA seem to have taken this disease to heart, since so many contributors on here make it clear that IT is critical not just for airline ops (important enough in itself if efficiency there leads to, say, fuel savings) but in managing the pricing and marketing campaigns that directly influence revenues and margins.

PAXboy 7th Jun 2017 11:16

Turf wars? Oh yes! When I worked with a multi-national (stock-market quoted, you would know their name) there was an old manager in the Property department who had always managed 'the telephones'. When they became too big, too complex and too multi-sited for him to control, the IT department took them over. He still had to be in the meetings and, because he controlled access to all the (14 or something) UK locations, he had to be appeased. Waste of time and effort.

Oh, and the old 'DP' manager who was given the new comms responsibility didn't understand the new systems either. :uhoh: And he had an addiction problem, as well as being mean-spirited and narrow of vision. But then, that's not so remarkable in British [so-called] management. :rolleyes:

artee 7th Jun 2017 12:29


Originally Posted by pax britanica (Post 9795347)
Artee
While the Economist is not a reliable source about many things, let's assume for once it is right about IT spending...

You're certainly right about the relative spend in capex (and opex) intensive industries. I think the critical point is the one you make about IT being a cost centre - at one time enlightened companies saw their IT as being a competitive differentiator.
That certainly doesn't seem to be the case at BA (sadly).

Henty1 7th Jun 2017 12:45

Coming clean
 
Surely there must be some definite inside information by now from the many staff that have had the heave-ho.

Caribbean Boy 7th Jun 2017 13:11


Originally Posted by OldLurker (Post 9795204)
Crude percentages like that are meaningless. There's no real 'norm' across all industries – I think he must mean the average across all industries, which is meaningless too. How much does Google spend on IT? I don't know, but obviously the percentage must be high because that's most of what Google does. OTOH, as some people here may have noticed, airlines have to spend a lot of money on certain highly complex and very, very expensive equipment (clue: it's not IT*). The fact that their spend on IT is less, as a percentage of their total spend, than in other industries is hardly a surprise.

* Although, of course, modern aircraft contain a lot of embedded IT. Is the purchase and maintenance cost of that kit counted as 'spending on IT'? Turf wars over again?

Read the quote: "In 2015 airlines spent 2.7% of their revenues on IT, half the norm across all industries and a lower share even than hotels."

So, taking an average across all industries smooths out all manner of variations and gives a useful indication of the amount of IT spend. And what is striking is that the airline industry, which relies on IT as much as most other industries, treats IT as a burden on the company (a cost centre) instead of being a key business enabler. You really do get what you pay for.

