
View Full Version : All London airspace closed


Pittsextra
12th Dec 2014, 14:24
*All London airspace closed after computer failure: Eurocontrol

Southside Hangers
12th Dec 2014, 14:26
Doesn't look like it's ALL closed?


Still aircraft coming in from the East?

Pittsextra
12th Dec 2014, 14:33
https://www.public.nm.eurocontrol.int/PUBPORTAL/gateway/spec/index.html

for updates I guess

Southside Hangers
12th Dec 2014, 14:37
Aircraft still making approaches, albeit not many now.


Think parking will be fun...

Straighten Up
12th Dec 2014, 14:38
From where I sit I can see flight(s) still heading into Heathrow. What's the contingency for this sort of situation for aircraft already approaching the London holds? Where would they divert to?

Some people are in for a busy afternoon/evening.

HEATHROW DIRECTOR
12th Dec 2014, 14:40
Yes, they're still packing them in at Heathrow so the airspace is not closed.

readywhenreaching
12th Dec 2014, 14:41
not very comprehensive:
Swanwick technical failure | NATS (http://www.nats.aero/news/swanwick-technical-failure/)

Ampage
12th Dec 2014, 14:43
BBC have also posted a direct link to Eurocontrol's website... on the front page of BBC News.

I'd advise you all to refrain from visiting it unless necessary. We don't want their site to fall over due to excess hits - which it probably will now.

Pittsextra
12th Dec 2014, 14:55
*London airspace is open; traffic volumes restricted: NATS

AnglianAV8R
12th Dec 2014, 14:56
Lovely lady on Sky has just described "a lot of planes coming in, circling over Heathrow, just waiting for more information"

flightradar24 on the screen

whitelighter
12th Dec 2014, 15:01
By 'London' do they mean the TMA or the FIR?

ZOOKER
12th Dec 2014, 15:03
A good case against EASA's plan to reduce the number of ATCCs then.

Straighten Up
12th Dec 2014, 15:04
Listening to Shannon on LiveATC, it appears the system reset was successful and London is starting to accept traffic again with restrictions.

Kelly Hopper
12th Dec 2014, 15:05
Sitting in Luton. Odds on a nightstop? :{

AnglianAV8R
12th Dec 2014, 15:07
Looks like London City may be clogged. Seen a couple of flights divert to Stansted. Logistical nightmare.

bnt
12th Dec 2014, 15:08
They're crediting Plane Finder on the screen. Looking at FlightRadar24, things are starting to move again, some flights departing Heathrow now.

phiggsbroadband
12th Dec 2014, 15:12
16:10 Local... After doing some orbits in the North Sea area, there seems to be a steady stream of descending aircraft above the Thames Estuary.
Also several departures from the London airfields.
c/o Flightradar24.

ZOOKER
12th Dec 2014, 15:12
Just over a year since the last major problem, which was caused by the telephone system on Saturday 07/12/2013.
But all those lovely 'journos' and media types will focus on these 2 days, rather than all the others, on each of which about 7000 aeros have moved about flawlessly.

readywhenreaching
12th Dec 2014, 15:18
a bit of good news:

Following a technical failure at Swanwick, the system has been restored and we are in the process of returning to normal operations.

We apologise for any delays and the inconvenience this may have caused.

Further information will be released as it becomes available.
NATS (http://www.nats.aero/news/swanwick-technical-failure-update-1615/)

AnglianAV8R
12th Dec 2014, 15:19
We're on the move. A BA 747 has just departed LHR for JFK

EGLD
12th Dec 2014, 15:21
Clearance at Heathrow reporting that they are only allowed a "trickle" of departures

Someone was just told they are number 55 in the queue :uhoh:

anotherthing
12th Dec 2014, 15:22
Good work, once again, by the shop floor engineers to get things up and running so quickly. Often overlooked, they have probably had the worst of it when it comes to manpower reductions.

Glad I'm not in today... the next few hours will be interesting with, I'd imagine, quite a few out-of-position flights to get back into normal rotation :ok:

KelvinD
12th Dec 2014, 15:22
Yep. And there's squadrons of 'em gathering around the Essex area, not to mention the BA B777 following the B747.
Now Radio 5 will have to find something else to prattle on about and Simon Calder can go back to his armchair!

wasthatit
12th Dec 2014, 15:35
From BBC:

Oxford: Experiencing "some delays", mainly to services arriving from overseas.

:p

Sir George Cayley
12th Dec 2014, 16:12
CTRL+ALT+Delete should do it.;)

Thank heaven it was a simple power blip and not a cyber attack. Phew!

terrain safe
12th Dec 2014, 16:25
Loved this on the BBC website:

'Disgruntled passengers'

Posted at 17:18
Josh Rasbash, a software engineer in the aviation business, has been on a delayed flight from Edinburgh to Brussels. He said: "I've been stuck on the plane for an hour and a half. Most passengers are disgruntled.
"You have to be extremely careful with managing a flight. You can't just let them land wherever. It needs to be carefully organised and managed so we don't hit each other in the air.
"It's Ryanair so I'm not expecting much.":E

LeftBlank
12th Dec 2014, 17:54
Just got back in time for weekend off despite horrendous CTOT issued earlier by Brussels.
Thanks to all the UK controllers working hard to restore normality.
:D

hits80
12th Dec 2014, 17:59
What ATC computer system does Swanwick control centre use?

2Planks
12th Dec 2014, 18:03
Having listened to the BBC I was expecting carnage - sure, there are lots of delays of less than 2 hours, but no cancellations at VS, very few at BA (and generally on high-frequency European routes), and as for FR it looks like a drop in the ocean compared with the Italian ATC strike. Serves me right for listening :ugh:

jumbobelle
12th Dec 2014, 18:11
Windows 8 has a lot to answer for

Hotel Tango
12th Dec 2014, 18:28
Southside Hangers (btw that should be hangars), just for your info, when airspace is declared "closed" that doesn't mean that all aircraft in said airspace will mysteriously disappear. You will of course see the traffic already co-ordinated/committed continue to operate. When the airspace is closed it basically means that the coordination of further traffic into that airspace will not be accepted until the restriction is lifted.

glendalegoon
12th Dec 2014, 18:43
Have there been any rumors of a hack attack? I wouldn't believe them, but you all are on the other side of the pond from me.

As you all know, we had an outage at Chicago Center a few months ago due to someone starting a fire in the comms/computer room. It took weeks to fix fully; the workaround helped things move, slowly, and there were no collisions.

Piltdown Man
12th Dec 2014, 18:52
what atc computer system does swanwick control center use?


I believe it's a few of those three letter jobbies; all running in parallel to prevent complete system crashes. It's certainly not a fruit based device.

glendalegoon
12th Dec 2014, 18:57
I just read in the following article that the centre has been plagued with problems, cost overruns, delays in commissioning and complaints by controllers

interesting:London Hit by Air Traffic Control Computer Failure - ABC News (http://abcnews.go.com/International/wireStory/london-airspace-closed-due-computer-failure-27555726?singlePage=true)

4Greens
12th Dec 2014, 19:37
How will the system cope with an extra runway at Heathrow or Gatwick ?

EEngr
12th Dec 2014, 19:52
the system has been restored and we are in the process of returning to normal operations.

Please! No! Anything but that!
;)

Ian W
12th Dec 2014, 19:58
How will the system cope with an extra runway at Heathrow or Gatwick ?

Wrong question.

How will Heathrow or Gatwick, with their nice shiny new runway(s), explain that nobody really thought about increasing the system capacity - so the new runway, the reason all those hotels/listed buildings were demolished, can only handle a few aircraft an hour, and those from movements that could easily have operated from the existing runways?

Yes - it could happen.

4Greens
12th Dec 2014, 21:16
Ian, it was the right question. The ATC system cannot cope with a new runway.

eastern wiseguy
12th Dec 2014, 21:17
Zooker

But all those lovely 'journos' and media types will focus on these 2 days, rather than all the others, on each of which about 7000 aeros have moved about flawlessly.

Which is what is SUPPOSED to happen. A bit like having 40 years' flawless service... one midair and they never shut up about it.

The system seems to have a bit of a weak link, and that link seems to be in the computing. What platitudes or excuses will be offered this time?

Will NATS offer compensation for the ATC induced delays?

Good luck to everyone sorting that mess out.

TWT
12th Dec 2014, 21:25
windows 8 has a lot to answer for

Not W8, but 'JOVIAL' apparently

UK flights chaos: Air traffic control computers using software from the 1960s - Telegraph (http://www.telegraph.co.uk/news/aviation/11291495/UK-flights-chaos-Air-traffic-control-computers-using-software-from-the-1960s.html)

zonoma
12th Dec 2014, 21:34
Will NATS offer compensation for the ATC induced delays?
If I'm not mistaken, they will be automatically fined for the delays.

As for how the system will cope with a new runway in London/no forward planning, I suggest a visit to the ATC Issues forum to read the extensive thread about future London airspace projects and the design to cope with almost DOUBLE the average traffic of today :mad:

Solar
13th Dec 2014, 02:02
Zonoma
If, as you suspect, NATS do get fined, do the airlines that may have compensated passengers see any of the fine money, or does it disappear into the government coffers? I'm sure that the airlines will be out considerable monies for other reasons, like increased fuel usage, as well.

FlightCosting
13th Dec 2014, 02:35
I am sure that NATS could fix the problem with the injection of a couple of billion dollars and a new centre. If those pesky airlines would pay more for terminal nav fees that would help.

fatmanmedia
13th Dec 2014, 04:31
I hate to scare you all, but the operating system used by NATS is the same one used by the banks for dealing with accounts. The banks are using hardware that is even older than NATS's and a lot less reliable; one bank has a number of mainframes that are coming up to their 50th birthday.

Kelly Hopper
13th Dec 2014, 06:03
FC.
We all paid a couple of billion more to finance the NEW centre at Swanwick. The centre that failed yesterday!

Super VC-10
13th Dec 2014, 07:05
Any truth in the rumour that someone downloaded the Candy Crush Saga app and it crashed the computer? :uhoh:

Right Engine
13th Dec 2014, 08:10
My understanding is that NATS was a state-owned operation until just over a decade ago. Swanwick was built, after a lot of bickering between Bechtel and Lockheed Martin, under PFI by Lockheed Martin. The software surely would have been designed when it was built in 2000?

There are government select committee minutes out there.

The software was NOT designed in the '60s.

ATC Watcher
13th Dec 2014, 10:32
Read the Telegraph article. Not sure that the "old" software is responsible for this particular problem yesterday (old is generally compatible with reliable), but even if it was, the important question should be: why did the back-up system(s) fail?
Whether the problem was electric as initially reported by some, and then the UPS failed, or it was the main system, and then the back-up failed.

Question for those in the know here: do you have one or two main system backups in London? Or do your procedures stipulate clearing and closing airspace when you are on a back-up system?

Southside Hangers
13th Dec 2014, 10:40
Hotel Tango, thank you for your pedantry. It may be, of course, that I chose that spelling on purpose.


Strange that you chose not to include Heathrow Director in your little homily re airspace being closed :)

ironbutt57
13th Dec 2014, 11:26
Most often "old" software is less complex and more stable than the latest and greatest..

Jim59
13th Dec 2014, 11:43
fatmanmedia
I hate to scare you all, but the operating system used by nats is the same used by the banks for dealing with accounts, the banks are using hardware that is even older than nats and a lot less reliable, one bank has a number of main frames that are coming up to their 50th birthday.

Fats

What rubbish.

Which operating system is it?

How about naming the bank?

What make of mainframe? Do you really believe that there would be any maintenance available for machines that old?

offa
13th Dec 2014, 12:50
Banks aren't moving at 8 miles a minute on minimum fuel :=

offa
13th Dec 2014, 12:56
"Most software implemented in JOVIAL is mission critical, and maintenance is getting more difficult. In December 2014 it was reported that software deriving from JOVIAL code produced in the 1960s was involved in a major failure of the United Kingdom's air traffic control infrastructure, and that the agency which uses it was having to train its IT staff in JOVIAL in order to maintain this software, which is not scheduled for replacement until 2016"

150commuter
13th Dec 2014, 14:18
Offa
That's just what someone recently added to the Wikipedia entry for JOVIAL. It has no particular authority, and I doubt whether the contributor has any more of an inside track on the problem than many people here.

Downwind Lander
13th Dec 2014, 15:19
What is needed is a serious level IT expert to be ready to comment on technical explanations as they come through. Any offers?

If there is a criminal behind this, will it be covered up? Where are the hardware backups and software backups?

EEngr
13th Dec 2014, 15:33
Banks aren't moving at 8 miles a minute on minimum fuel :=

This is a good point. And it's something I've run into trying to get some IT people from a Finance Department background to understand avionics and engineering s/w.

In the banking and business industry, the computer record is the object being managed. It defines reality from the point of view of the business people. In engineering, and more so in real time (avionics) systems, the software attempts to measure or model some process occurring in the real world. And it has the potential for being wrong. So systems have to be built around the idea that they can and will fail. And even if your subsystem is still up and running, the one upstream can go brain dead and start feeding yours garbage.
:8

dazdaz1
13th Dec 2014, 15:42
"In the banking and business industry, the computer record is the object being managed"

Those were the good old days. COBOL 74 - a sadly missed language. Top-down design ruled.

Downwind Lander
13th Dec 2014, 16:39
BBC News Channel. Saturday. 1700 GMT.

Richard Deakin, CEO, NATS, says that the delinquent line of code responsible has been discovered amongst 4 million lines of code. When asked why this line had not caused havoc before, he gave no clear answer.

When asked if he would resign, he muttered stuff about upgrading resilience.

He said that systems were back and running in 45 minutes, which is what they train for. So, that's OK, is it?

Complacency rules, OK.

Chronus
13th Dec 2014, 18:03
Pilots are often blamed when finger trouble with automation goes wrong.

When this sort of thing happens - which not only causes abject misery to all pax and crews, but also ends up with someone picking up the tab for a few million quid, and compromises flight safety - all that is said is "Computer says NO".

Gonzo
13th Dec 2014, 18:13
So theoretically, what kind of contingency should be provided by ATCCs? A fully redundant and 'hot' contingency facility which can take over at a moment's notice, with fully trained staff just waiting for a failure, with each of the hundreds of systems built on completely different software platforms (to mitigate against those code errors) to the main ATCC?

Second question, who pays for it?

Maybe someone should tell the EC?

aerobelly
13th Dec 2014, 18:15
What is needed is a serious level IT expert to be ready to comment on technical explanations as they come through. Any offers?

This article in The Register seems to have expert comments from within NATS. Anonymously of course.

REVEALED: Titsup flight plan mainframe borks UK air traffic control - The Register (http://www.theregister.co.uk/2014/12/12/iregi_confirms_it_was_dodgy_flight_server_that_took_down_uk_air_traffic_control/)

DaveReidUK
13th Dec 2014, 18:20
a responsible delinquent line of code has been discovered in amongst 4 million lines of code

So we can rest assured that the next time the Swanwick system goes t*ts-up, it will have been down to one of the remaining 3,999,999 lines ...

EEngr
13th Dec 2014, 19:44
From The Register:

"Invariably someone puts a flight plan wrong and it borks* the system," one source told El Reg on condition of anonymity.

That should never happen. And (hopefully) the bits that fly the planes are built to a higher standard. One would expect a well-written application to skip over the bad data, raise an alarm, log it, but keep going. Something like dividing by zero (Sunk by Windows NT (http://archive.wired.com/science/discoveries/news/1998/07/13987), one of my favorite SNAFU examples of poor system implementation) should never lock up an operating system. In a well-written app, it shouldn't even slow down processing of the remaining good data.

Part of the problem is cultural. The people who do mainframes have historically guarded their domain from systems engineers and real time software experts. For people who did things like payroll systems, it was acceptable to print the crash report, go through the data and re-punch the defective card. And then run it again. That's just not going to cut it in real time.

*And quit picking on poor Judge Bork.
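The skip-the-bad-record pattern described above can be sketched in a few lines. This is purely a hypothetical illustration (invented record format and names, nothing to do with NATS's actual code): each record is parsed inside its own try block, so one malformed flight plan is logged and quarantined rather than taking the whole batch down.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("fpl")

def parse_flight_plan(raw):
    """Parse a 'callsign,origin,dest,level' record; raises ValueError on bad input."""
    callsign, origin, dest, level = raw.split(",")
    return {"callsign": callsign, "origin": origin,
            "dest": dest, "level": int(level)}

def process_batch(raw_records):
    """Skip over bad data, raise an alarm (log it), but keep going."""
    accepted, quarantined = [], []
    for raw in raw_records:
        try:
            accepted.append(parse_flight_plan(raw))
        except (ValueError, TypeError) as exc:
            log.warning("rejected record %r: %s", raw, exc)
            quarantined.append(raw)   # kept aside for human inspection
    return accepted, quarantined

accepted, quarantined = process_batch(
    ["BAW117,EGLL,KJFK,350", "garbage-record", "SHT4M,EGLL,EGPH,abc"])
```

Quarantining a record is of course only half the job - someone still has to decide what the controller is told about the gap.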

crewmeal
13th Dec 2014, 19:53
So why on earth should airlines have to fork out compensation claims when it was nothing to do with them? Perhaps they should sue NATS for all the extra night stops crews will 'enjoy'

Passengers will be able to sue airlines over three-hour delays | Money | The Guardian (http://www.theguardian.com/money/2011/mar/05/passengers-sue-airlines-delays)

slip and turn
13th Dec 2014, 20:00
Richard Deakin, CEO, NATS, ... muttered stuff about upgrading resilience.

Upgrade? Yes, my first thought.
Resilience? Yes, my second thought. Resilience to previously unanticipated airspace incursions, perhaps?

Lots of those making the headlines recently.

Flight planning server fell over, did it? What sort of significant traffic provides no flight plans?

Flight_Plan=Null would need some error handling in the code, throughout the entire 4 million lines, not just the bit some bright spark upgraded in isolation yesterday :hmm:

Or maybe they hadn't been upgrading resilience, but it's done now :ok:

Just a couple of thoughts ...

eglnyt
13th Dec 2014, 20:05
That should never happen. And (hopefully) the bits that fly the planes are built to a higher standard.

They are actually built to the same standard but it's a standard that allows different degrees of rigour depending upon the criticality of the component being produced.

One would expect a well written application to skip over the bad data, raise an alarm, log it, but keep going.

Exactly how does that work in a near real time system when the data actually represents a real life event? That bit of data is bad, so I'll just skip over it, send an e-mail to someone and keep going. Do you tell the controller? If so, how long do you think it will be before he decides "I'm not sure I can trust this, I'd better put some traffic restrictions in"? If not, who becomes accountable when something bad happens because the information the controller is seeing doesn't actually reflect real life?

slip and turn
13th Dec 2014, 20:13
Exactly how does that work in a near real time system when the data actually represents a real life event? That bit of data is bad so I'll just skip over it , send an e-mail to someone and keep going. Do you tell the controller? If so how long do you think it will be before he decides I'm not sure I can trust this I'd better put some traffic restrictions in? If not who becomes accountable when something bad happens because the information the controller is seeing doesn't actually reflect real life?

It's called error handling, and it is an absolutely critical part of any computer program. If a line of code receives unanticipated data (which may not be 'bad' per se), that unforeseen use case needs to have been foreseen by whoever put together the program spec, whoever agreed the program spec, whoever designed the logic that was intended to handle it flawlessly in the code, whoever checked it, whoever tested it, and whoever signed off on the project or module or upgrade - but one, or all six, or sixty, of whom we all now know was mistaken. And there's the rub :ooh:

So now we are told that a single line of code stopped the machine, what actually was it in the real time real life world that was unforeseen? That would be the real story.

If I was anotherthing or Gonzo or Zooker or eglnyt et al, I'd have asked that one at the office by now :}
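As a toy illustration of that point (hypothetical names, nothing to do with the real system): the same single line of code, once without and once with the unforeseen case handled, shows how one unguarded line dies on a null flight plan while the guarded version lets the whole sweep complete.

```python
def route_length(flight_plan):
    # One line with no error handling: blows up on an unforeseen None
    return len(flight_plan["route"])

def route_length_guarded(flight_plan):
    # The same line, with the unforeseen case handled explicitly
    if flight_plan is None or "route" not in flight_plan:
        return None               # caller decides what "unknown" means
    return len(flight_plan["route"])

plans = [{"route": ["DVR", "KONAN", "KOK"]}, None]   # second plan is the surprise

failed = False
try:
    _ = [route_length(p) for p in plans]   # unguarded sweep dies on the None
except TypeError:
    failed = True

guarded = [route_length_guarded(p) for p in plans]   # sweep completes; bad plan flagged
```

The guard doesn't make the unknown data safe - it just turns a system stoppage into a flagged record that someone, spec or controller, still has to deal with.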

FlightlessParrot
13th Dec 2014, 20:37
In the report on the BBC website on this event, it was said that the ATC systems were operating at 98%, or 99%, of capacity. That is surely most of the problem. If stuff happens--and stuff happens, it's an axiom--it's much easier to cope with if you have spare capacity.

And why don't we have spare capacity, in all sorts of systems? Cost reduction, of course. So if, and only if, the head of NATS is responsible for running the system at maximum capacity, he should resign. But I expect it's the paymasters who really are responsible, and the fact that it's a public-private partnership, one of the advantages of which is private sector financial discipline. That is, running everything at maximum capacity all the time. Why have such big engines? Run them at take off power all the time. Why plan for engine failure? Plan for success.

SamYeager
13th Dec 2014, 20:45
BBC News Channel. Saturday. 1700 GMT.

Richard Deakin, CEO, NATS, says that a responsible delinquent line of code has been discovered in amongst 4 million lines of code.


IIRC Richard Deakin also said that the code had been corrected. Given the short space of time since the original fault I have my doubts about how well the "correction" has been tested. I just hope that there won't be a further problem down the line in a few months as a result of the "correction".

Alsacienne
13th Dec 2014, 21:12
Is there a simulation exercise to explore 'What if there was a complete situation failure' and what actions could be taken to keep the situation stable rather than immediately go into lockdown?

And if there isn't one, or a regular rethinking of this situation, surely there should be ... or are we just so confident (or that blinkered) that such a situation cannot be envisaged as a potential reality?

2Planks
13th Dec 2014, 22:36
The NATS system was down for 36 minutes or so - for the rest of the minutes in the year everything was fine. The reason the recovery, which led to the misery, expense and frustration, took so long is largely that the whole aviation sector in the London area is running at 98% (I'm not sure the BBC post about NATS running at 98% is strictly accurate) and it cannot cope with any interference. That interference may be a runway closure due to a BA returning with an engine cowl, a security scare in a terminal, or a drop in temperature of a degree that wasn't forecast leading to snow instead of sleet or thick fog resulting in LVPs.

It is unacceptable that NATS had a failure with such severe consequences, but as previous posters have pointed out, nothing is infallible, and the UK's continued dallying over airport expansion is one of the real culprits in this incident. The Government and the opposition have all been bumping their gums over this issue whilst they collectively dance round their handbags over aviation infrastructure improvements in the SE. And so it will continue........

The Privateer
14th Dec 2014, 00:16
http://www.telegraph.co.uk/news/worldnews/europe/sweden/11292095/Foreign-military-aircraft-nearly-collides-with-passenger-plane-over-Sweden.html

Hmm same day... What a coincidence.

wasthatit
14th Dec 2014, 07:59
It looks to me like the training in unusual circumstances and emergencies (TRUCE) paid off and this major outage was handled calmly, smoothly and professionally :D.

Quote:
That should never happen. And (hopefully) the bits that fly the planes are built to a higher standard.
They are actually built to the same standard but it's a standard that allows different degrees of rigour depending upon the criticality of the component being produced.

This is an interesting point. Although the software standards for avionics and air traffic control have the same origin there are some subtle but important differences.

The most significant in my mind is the prolific use of so-called Commercial Off The Shelf (COTS) software, which includes things like general purpose operating systems (Windows, Linux etc.) and device drivers. We can also refer to much of this as Software of an Unknown Pedigree (SOUP).

Historically not much COTS/SOUP has been used in aircraft avionic systems, but it is widely used in air traffic control systems. This is because a COTS/SOUP licence is usually less than 10% of the cost of a special OS licence, and ATC systems are generally thought to be less critical.

I don't think we have got to a point where we really understand the risks posed by SOUP. The analysis is done blindfolded, so it's a bit like playing the lottery (or, if I want to be dramatic, Russian roulette). The worry is that more SOUP is finding its way into aircraft systems, and AFAIK the analysis techniques are the same as for the ground ones.

blueskythinking
14th Dec 2014, 08:38
2Planks has hit the nail on the head. How many flights will airlines cancel on an unexpectedly foggy morning, or if a runway is blocked, or if a foreign ATC unit strikes? Let's get this into some perspective, shall we. Whilst unfortunate and deeply embarrassing for NATS, it is just a minor blip in a system which operates well 99.999% of the time. The problem with this site now is it is populated by the Facebook/Twitter generation, who seem to have little ability to think critically or look beyond a two-line headline! And our politicians pander to this by issuing ridiculous statements to the press. And I do wonder why NATS is not a little more robust in its defence of the situation.

OFSO
14th Dec 2014, 08:42
Is there a simulation exercise to explore 'What if there was a complete situation failure'

As one who has sat through more spacecraft launch/early orbit phase simulations than I can remember, may I point out that the anomalous scenario which actually happens is never the one you spent hours and hours rehearsing for.

Furthermore, having a back-up or simulation facility means synchronising it with the in-use facility so that both are running on identical hardware and software. Given human nature, this is extremely difficult, more so with the software than the hardware. Providing systems so that the backup facility is continuously updated from the in-use facility can trigger yet more problems.

Personally, given the Government dithering mentioned here and the lack of resources, I think NATS do an excellent job most of the time.

Norman.D.Landing
14th Dec 2014, 09:30
I heard a radio presenter, Petrie Hoskins on LBC this morning, stating her incredulity that NATS software is from the 1960s, and that therefore her mobile phone has more computing power than the system that runs our air traffic service. :mad:

The radio is now in the garden and I have a large hole in the window. :}

4Greens
14th Dec 2014, 09:38
I say again and again the extra runway will overload ATC even more than it is already.

slip and turn
14th Dec 2014, 09:51
Personally, given the Government dithering mentioned here and the lack of resources, I think NATS do an excellent job most of the time.

Lack of resources? At NATS? That's an interesting one.

What was it someone said earlier that the "new" centre at Swanwick cost?

A couple of billion ?

And how much is salted away in the taxpayer funded NATS pension fund ?

And who bought a 21% stake in NATS in September 2013 from a bundle of airlines who bailed out? Aviation people? Nope, pensions people - specialists in pensions funded by another great UK gravy train/feed trough, the great British higher education system (sic)! And now they (USS Sherwood - with the starship-sounding name that you can use to deflect your own curiosity at this point and look no further, if you wish) seem to own half of the Airline Group's 42% share of NATS!

I think I read somewhere, before Swanwick was operational, that the NATS pension scheme held funds of over £3bn. It's probably nearer double that now.

Anyone who knows here who might tell us?

Control of a heap of money like that, sitting sweetly doing nothing except endlessly attracting more and more taxpayer contributions, makes the everyday business of finding resources to push tin look a bit dull in comparison.

SLF-Flyer
14th Dec 2014, 10:06
Back in the early 70s, the telegram service was automated. To tell the computer that the end of message had been reached, four Ns (NNNN) were sent.

This was OK until someone sent four commas (,,,,) in part of a telegram: in the telegraph code a comma shares the same character code as N, the only difference being the shift code sent prior to the ,,,,.
To overcome the problem, all operators connected to the system were instructed to put the shift code in between the ,,,, which would not show up on the telegram and stopped the message ending prematurely.

It is often the simple things that catch you out.
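The telegram story above is a classic in-band signalling bug: the terminator is ordinary data, so a payload can accidentally contain it. A toy sketch (hypothetical, with '\x00' standing in for the invisible shift code the operators inserted) shows both the truncation and the escaping fix:

```python
TERMINATOR = "NNNN"

def receive(stream):
    """Naive receiver: the message is everything before the first NNNN."""
    end = stream.find(TERMINATOR)
    return stream if end == -1 else stream[:end]

def escape(message):
    """Sender-side fix: break up any in-band NNNN so it cannot match.

    The '\x00' characters stand in for the invisible shift code."""
    return message.replace(TERMINATOR, "N\x00N\x00N\x00N")

body = "PRICES UP NNNN PERCENT"   # payload that happens to contain NNNN
wire = body + TERMINATOR

truncated = receive(wire)                   # naive: message cut short
safe = receive(escape(body) + TERMINATOR)   # escaped: full message survives
```

The receiver would strip the invisible shift codes before printing, so the recovered telegram reads exactly as sent.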

Del Prado
14th Dec 2014, 10:10
Airlines recognise the need for asset replacement in RP2 but expect NERL to sweat the assets harder given the present financial constraints

- Taken from the RP2 consultation working group.

Costs are falling year on year. Are the delays still below the agreed targets?


Ps. 2planks, spot on.

DaveReidUK
14th Dec 2014, 10:12
the UK's continued dallying over airport expansion is one of the real culprits in this incident

I thought that by now I had heard all the arguments on both sides of the airport expansion debate.

But "let's build a new runway in the southeast in case Swanwick has another system failure" is a new one.

eglnyt
14th Dec 2014, 10:35
Airport capacity is certainly part of the equation. The disruption ran far beyond the initial failure period because there isn't any slack at Heathrow to accommodate the delayed flights and there weren't any stands for arriving aircraft because of the delay to departures.

ExXB
14th Dec 2014, 13:56
In (yet another) BBC report (http://www.bbc.co.uk/news/uk-30467692) they have this quote:

A spokesman for Nats said it typically invested £140m per annum, and would be spending an additional £575m over the next five years on its systems.

er, then that would be £115m per year, a £25m annual reduction. (And no, I don't understand in context what they mean by 'additional'.)

But what can you say for an organisation owned by both governments and airlines? A recipe for disaster.

slip and turn
14th Dec 2014, 14:37
But what can you say for an organisation owned by both governments and airlines, a recipe for disaster.

And what, pray, might be said for an air traffic control organisation 49% owned by the UK government, 21% by a pension specialist that looks after university staff pensions but has a name out of Star Trek, then Monarch Airlines' pension fund, and three - or is it five* - other airlines with about 5% each?
* I thought Virgin, Thomas Cook, Deutsche Lufthansa AG and TUI sold their bits to USS, but their names are still apparently on the list (http://www.nats.aero/about-us/what-we-do/our-ownership-2014/)? Maybe they kept a per cent or two for old times' sake.

Gonzo
14th Dec 2014, 19:50
What was it someone said earlier that the "new" centre at Swanwick cost?

A couple of billion ?

Are we just taking 'what someone said earlier' as fact?

And how much is salted away in the taxpayer funded NATS pension fund ?

Salted? In what way? And where do you get the impression that it's taxpayer funded? It isn't, by the way.

But hey, who lets facts get in the way of a good argument?

mercurydancer
14th Dec 2014, 20:30
The fact of life is that computer systems fall over. They do that. It is unrealistic to assume that computers will work faultlessly all of the time, every time.

So Swanwick went into emergency measures. It means that the emergency procedures were applied and that they worked, in that every aircraft was diverted where necessary and all landed safely. I trust that if there had been any fuel shortage calls from any aircraft then PPRuNe would have had many posters making it clear how many aircraft were involved and how close they were to danger. No reports so far. So Swanwick did what they needed to do to ensure safety, and they did that. This could not have been fixed by a simple Ctrl-Alt-Del.

As a frequent SLF I have two modes - if I am in the air, I want my ass on the ground. Safely. Where that landing is, is optional. Diversions are part of life. They happen. I don't get myself wound up over them. If I am in an airport, then no sweat; as long as the bar is open I am happy. Delays happen.

I have often found that approaching the airline staff and stating that I won't be hassling them for the next flight out, if I really don't need to be on that flight, works wonders. It's one less passenger they know won't throw a hissy fit. They seldom forget and usually help out with essentials like food and a comfy bench to sleep on.

So, in all this rather normal activity, what has gone wrong? Richard Deakin. He should be sacked forthwith for telling lies. He stated that the problem would not occur again. That is not possible. It will occur again, possibly not in the same way, but as sure as :mad: something else will. I don't trust any CEO who will blatantly lie.

Secondly, one line of code? If I recall my most basic programming, lines of code refer to each other. So one poor line of code is being blamed? That is a simplistic and patronising comment from him. In fact such a silly comment worries me more than anything. Although I do not hold to this line of thought, some might - if there are millions of lines of code and one goes wrong, then there are millions more to worry about! Panic! An absolutely stupid comment. If his computer systems are not reliable then he should really not be making excuses. He should either admit it and resign, or use the millions to get a more reliable system.

Deakin, IMHO should have said something like "A computer system failed. My staff worked very efficiently and without delay to ensure flight safety. They did this admirably and although there were significant problems, no passenger was put in danger. NATS takes flight safety as the ultimate priority and despite the difficulties faced that was achieved."

FlyGooseFly!
14th Dec 2014, 20:42
Unless I've missed it - has the actual problem been identified? Other than "a technical failure", "a system crash", etc., caused by additional workstations being turned on during a period of high traffic..... and the mysterious faulty line of code.

I, for one, am utterly dismayed that one line of code can shut down the entire system. I feel that information readily available to some is being deliberately kept close for "security reasons" - read: butt protection.

I had an "in" to a fairly high level of NATS a few years ago. The chap I knew was very sincere and totally committed, but nice guy as he was, he wouldn't have been my first choice of manager for a large wiring installation - yet this was part of his remit ....... after the previous guy had apparently disappeared in almost Nick Leeson fashion, leaving behind, surprise surprise, a similar disaster and a rumoured complete recable to resolve it.

Standing by for the flak.

DaveReidUK
14th Dec 2014, 21:26
So one poor line of code is being blamed? That is a simplistic and patronising comment from him. In fact such a silly comment worries me more than anything.

It's entirely feasible that it's correct.

gordonroxburgh
14th Dec 2014, 21:37
Quote:
So one poor line of code is being blamed? That is a simplistic and patronising comment from him. In fact such a silly comment worries me more than anything.
It's entirely feasible that it's correct.

A simple logic operator (<, >, <=, >=) the wrong way round in an obscure, less-used subroutine could easily have gone undetected.
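For what it's worth, a reversed operator really can hide for years when it only bites at a boundary value that routine traffic never reaches. A minimal sketch - entirely hypothetical, the function name and the limit are invented and this is not the NATS code:

```python
# Hypothetical sketch of a reversed-comparison bug hiding in a boundary
# check. MAX_WORKSTATIONS and the function are invented for illustration.

MAX_WORKSTATIONS = 193

def workstation_count_ok(count):
    """Accept a workstation count that is within the configured limit."""
    # Intended: reject only counts strictly above the maximum.
    # Bug: '>=' written where '>' was meant, so the legal maximum itself
    # is rejected - but every routine, lower value still passes.
    if count >= MAX_WORKSTATIONS:   # should be: count > MAX_WORKSTATIONS
        return False
    return True

print(workstation_count_ok(120))   # True  - everyday values mask the bug
print(workstation_count_ok(193))   # False - latent fault at the boundary
```

Every test that exercises everyday values passes; only the first time the system is driven to the exact limit does the fault surface.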

mercurydancer
14th Dec 2014, 21:43
It would be interesting to see what the code linked to. Computer systems don't rely on one line of code in isolation but on what each line relates to. In my business it's called a root cause, and I think that term is very widely understood. How that root cause could be understood so accurately and so soon is a mystery to me. In complex systems a simple root cause is very hard to find.

In the main I repeat my previous post. The emergency measures worked well. It's good that they did.

118.70
14th Dec 2014, 21:45
He conceded that some of their systems were "fairly elderly", adding: "The system we had a problem with last night has code written in the early '90s."
Nats is investing a "huge amount" in new technology, Deakin said, with £575 million set to be spent over the next five years to move towards more resilient, internet-based systems.


from Strip Air Traffic Boss Richard Deakin Of Xmas Bonus, Says MP (http://www.huffingtonpost.co.uk/2014/12/14/xmas-bonus-air-traffic_n_6322108.html)

If I were running mission-critical code, would I really want to base it on the internet?

slip and turn
14th Dec 2014, 21:49
What was it someone said earlier that the "new" centre at Swanwick cost?

A couple of billion ?
Are we just taking 'what someone said earlier' as fact?

Nope, we don't have to if you know better.

And how much is salted away in the taxpayer funded NATS pension fund? Salted? In what way?

As in stashed, for the index-linked use of existing and eventual retired members ... wouldn't that be a fair surmise?

And where do you get the impression that it's taxpayer funded? It isn't, by the way.

Oh? In what way is it not taxpayer funded? Oh, you mean the changes in the last few years? But they didn't wind up the taxpayer-funded gilt-edged bit, which is regularly salted and examined for missing gilt and patched up nicely for the older stagers, now did they?

But hey, who lets facts get in the way of a good argument?

Well, you are welcome to give us some - after all, NATS is a public-private partnership, so there's no reason for us to be kept in ignorance by you or anyone else, now is there?

So, dear fellow, is the NATS pension fund worth £6BN now, more or less? The expected annual growth in such a number would be rather more than the feeble numbers we've been knocking around for suggested investment levels in NATS operations. And if the annual growth wasn't all that, then who would be patching up the hole? Do tell - there's a good chap! We're all in it together, you know ;)

reynoldsno1
14th Dec 2014, 21:50
There is a small niche collection of software companies in the US that exist wholly to maintain the programs that control many of the current missile systems that were originally installed in the 60's & 70's. I don't think an 'upgrade' is possible - it would require a whole new system to be installed.

carlrsymington
14th Dec 2014, 21:52
Single line of code... Very possible.
I have 12 years experience as a performance test engineer and the combination of factors times volume is what I try to replicate when testing.
It looks like one combination of many millions didn't work.
I'm sorry someone didn't find it in testing, but did anyone die?
It looks like a great result, except I wonder what a comparison group looks like...
How often do they fail... there is your answer.
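To put numbers on "one combination of many millions": even a handful of small configuration dimensions multiply into a space far too large to test exhaustively. All of the figures below are invented for illustration; none come from NATS.

```python
# Illustration only: four small, invented configuration dimensions already
# produce hundreds of thousands of distinct system states to test.

workstation_counts = range(100, 200)        # 100 hypothetical values
sector_configs = range(50)                  # 50 hypothetical sector configurations
traffic_bands = range(40)                   # 40 hypothetical load bands
modes = ("civil", "military", "combined")   # 3 hypothetical operating modes

combinations = (len(workstation_counts) * len(sector_configs)
                * len(traffic_bands) * len(modes))
print(combinations)   # 600000 distinct states from four modest dimensions
```

A real FDP system has far more than four dimensions, which is why performance testers sample combinations rather than enumerate them, and why one pathological combination can survive to production.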

mercurydancer
14th Dec 2014, 22:01
Essentially that is the issue. Computer systems as well as any other systems fail. That is the way things are. A safe system was instituted as soon as the problem was identified. That is good management. Flight safety was not compromised.

NATS CEO? Not worth his salary.

2Planks
14th Dec 2014, 22:29
The continued media frenzy over this issue is getting tedious as is every call from hack journalists/politicians demanding the CEO loses his bonus/ resigns/ is sacked. Things go wrong - that's life - what is important is the fall back procedures and safety. I stand to be corrected but as far as I can see no passenger or crew member was put at risk.


And remember the strike in Italy on Friday - more flights were cancelled or delayed because of that issue (looking at the various airlines websites) but not a murmur in the Press on that snag.


As I have said before, this incident was unfortunate/unacceptable, but if every avoidable aircraft incident necessitated a CEO resigning, the revolving doors would be rotating so fast the crosswind at LHR would be out of limits!!!


He who lives in a greenhouse should not throw stones - just saying

DC10RealMan
14th Dec 2014, 22:36
Slip and Turn.

I am glad to know that you take such an interest in my pension fund and it is even more satisfying that the pension that I paid 15% of my salary into for thirty years is so healthy.
I don't think that NATS pensioners, myself included, need to defend the terms and conditions of their pension scheme, into which they have paid many hundreds of thousands of pounds over many decades to guarantee a decent retirement.
My suspicion is that you are a journalist looking for a story along the lines of "Index linked Civil Servants hit hard working families" accompanied by a diatribe which will probably include a line involving a hero pilot, a school, and screaming children and a photograph showing a marshaller.

Del Prado
15th Dec 2014, 00:30
Slip and Turn, you have proven time and again on ATC threads that you don't know what you're talking about. I recall in the past you trying to argue separation standards with a Heathrow tower controller.
A simple use of Google (other search engines are available) would find the cost of the Swanwick centre was around £640 million and not a "couple of billion".

I am sorry to say you come across as an agitator with a general interest in aviation but no professional background in the industry so why are you so vociferous in your attacks when you appear to speak from a position of ignorance?

Del Prado
15th Dec 2014, 00:41
And to the best of my recollection National Air Traffic Services, the pre-privatisation government agency as was, was consistently profit-making and a net contributor to the public purse, so your allegations of a "taxpayer funded pension pot" are very wide of the mark.
That's in stark contrast, BTW, to all the other ANSPs in Europe. The French, the Italian, even the German ANSP have pension funds that are paid for by their governments, which hardly makes for a level playing field when comparing NATS with its peers.

Gonzo
15th Dec 2014, 04:13
Correct, Del Prado.

s&t,

Any shortfall in the pension, and there has been such over the past few years, has been met by increased employer contributions and changes in staff T&Cs.

Knock yourself out!
Annual Report & Accounts | NATS (http://www.nats.aero/about-us/financial-performance/annual-reports/)

118.70
15th Dec 2014, 07:11
CAA announce independent enquiry :

The UK Civil Aviation Authority (CAA) and NATS have agreed to the establishment of an independent inquiry following the disruption caused by the failure in air traffic management systems on the afternoon of Friday 12th December 2014.

The CAA will, in consultation with NATS, appoint an independent chair of the panel which will consist of NATS technical experts, a board member from the CAA and independent experts on information technology, air traffic management and operational resilience. The full terms of reference will be published following consultation with interested parties including airlines and consumer groups but it is expected that the review will cover, as a minimum:

1. The root causes of the incident on Friday
2. NATS’ handling of that incident to minimise disruption without compromising safety
3. Whether the lessons identified in the review of the disruption in December 2013 have been fully embedded and were effective in this most recent incident
4. A review of the levels of resilience and service that should be expected across the air traffic network taking into account relevant international benchmarks
5. Further measures to avoid technology or process failures in this critical national infrastructure and reduce the impact of any unavoidable disruption

For more information, please contact the CAA Press Office on [email protected], or 020 7453 6030 (out of hours 07789745636).

Independent inquiry into air traffic control failure announced | CAA Newsroom | About the CAA (http://www.caa.co.uk/application.aspx?catid=14&pagetype=65&appid=7&newstype=n&mode=detail&nid=2411)

WHBM
15th Dec 2014, 08:39
What is needed is a serious level IT expert to be ready to comment on technical explanations as they come through. Any offers?
Not quite up to such a grandiose introduction, but nevertheless ......

Everyone seems to blame it on "the computer", but there is no really understandable technical description being offered, so it is difficult to comment. And "it's old" is certainly not a technical description - what is needed is an explanation of why the system worked satisfactorily for so long before encountering an issue, and what caused the issue to manifest itself now.

But that's tech. As I understand it there was an outage for an hour or so. Aviation is of course well used to hour-long holdups for a wide variety of reasons. What REALLY needs to be investigated is why it then took so long for normality to be restored - there were still significant BA cancellations the following day.

This is something which increasingly afflicts not only aviation but also other transport modes like rail or road, the length of time taken to recover the service from an incident going ever upwards. It seems that NATS have been on a substantial staff reduction exercise; it is moments like this when you find out that those staff were actually doing something. Likewise for the airlines, the inability by some (not all) to have the resilience to come back from the various situations is one for them, not something to be just stuck on the ATC provider. The ability to blame it on "knock-on effect" is a glorious excuse for slowness and inertia rather than trying really hard to get things back straight again quickly. And that's nothing to do with computers.

Most notable of all is all the calls in the press for "investment" in replacement computers. Goodness me, the IT salesmen (:)) must be smacking their lips at this early Christmas present, and some little placings by their PR teams with the media contacts whilst this is Hot News doubtless works wonders as well. Time and again the high-level know-nothings get themselves talked into spending money on new kit rather than dealing with the operational procedures and management which are the real issue. It's just like airport security. 10 security stations provided of which only 3 are staffed even at peak times, and a 30-minute queue. After many complaints, what's the solution ? More security stations, of course.

118.70
15th Dec 2014, 10:25
Doesn't "The Register" article imply that there was a combination of circumstances that prevented the usual responses to the failure of the flight data processing system (which holds the database of filed flight plans) from getting it linked back to the central flight server (which holds the radar data) within a critically short period? It sounds as though failures of the flight data system are by no means uncommon, but NATS is well rehearsed in getting it back on the road. Unfortunately a separate problem with the link resulted in busting the deadline, at which point the radar system complained it was only holding stale data and forced procedures with lower capacity to start.

I wonder which system has the "delinquent" line of code ?
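If that reading is right, the stale-data deadline amounts to a watchdog on the flight-plan feed. A toy sketch of the idea - the class name and the 120-second deadline are invented, not NATS values:

```python
import time

# Toy watchdog (all names and the deadline are invented): if the flight
# data feed misses its refresh deadline, the radar side must treat its
# flight-plan data as stale and fall back to lower-capacity procedures.

class FlightDataWatchdog:
    def __init__(self, deadline_s=120.0):
        self.deadline_s = deadline_s
        self.last_update = time.monotonic()

    def feed_received(self):
        """Call whenever the flight data processor delivers an update."""
        self.last_update = time.monotonic()

    def is_stale(self, now=None):
        """True once the deadline has passed without a fresh update."""
        if now is None:
            now = time.monotonic()
        return (now - self.last_update) > self.deadline_s

wd = FlightDataWatchdog(deadline_s=120.0)
print(wd.is_stale())                          # False: just initialised
print(wd.is_stale(now=wd.last_update + 300))  # True: deadline busted
```

The point of such a deadline is safety, not convenience: better to degrade capacity than to control traffic against flight plans that may no longer be true.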

118.70
15th Dec 2014, 17:54
A previous failure of the link between the National Airspace System and the National Flight Data Processing System in 2008 seems to have been reported in Computer Weekly :

Failure of Swanwick comms link leads to flight delays (http://www.computerweekly.com/news/2240086978/Failure-of-Swanwick-comms-link-leads-to-flight-delays)

zonoma
15th Dec 2014, 19:16
In any exceptional event in the UK which affects flow control, the knock-on effect normally does take days to recover from. When Heathrow loses both runways for 15 minutes and is then single-runway for a further hour, the delays can be felt up to 3 days afterwards. Exactly this occurred following the emergency return of the BA flight to Oslo after it lost its engine cowlings on departure. It wouldn't have made any difference if Heathrow had had more runways in operation at the time; purely for SAFETY, landings and departures were stopped. That is exactly what happened at Swanwick: measures were taken to provide ultimate safety, and then, as the system recovered, the traffic was gradually increased again.

Who is at fault? Anyone who has an open mind will see exactly what the main UK ATC union has to say on the subject here on the Prospect website (http://www.prospect.org.uk/news/story/2014/December/15/Government-has-lost-NATS-brief-01585)

zonoma
15th Dec 2014, 19:39
Also meant to mention that with the Belgians being on strike today, 600 flights have been cancelled.

Not one thread seen yet.

Just sayin........

slip and turn
15th Dec 2014, 22:22
Slip and Turn, you have proven time and again on ATC threads that you don't know what you're talking about.

Well you might forgive me if I say I have reason to summarise slightly differently ;)

I recall in the past you trying to argue separation standards with a Heathrow tower controller.

Could be true - does that mean that a mere mortal managed to engage with the next best thing to a God? Wow. I remember arguing that the vertical separation between departing London City traffic and descending Heathrow inbound crossing traffic in the area of Canary Wharf should be increased to allow a much greater margin for a proven pilot-error (level bust) hotspot. Simple as that, but of course it was already perfectly well reasoned, like everything else at NATS. I believe the levels were adjusted eventually (3000' minimum increased to 4000' for LHR inbounds joining westerly final from the north, descending and crossing abeam the western end of City?), but I have to say I haven't checked closely, so you might have a point.
s&t,

Any shortfall in the pension, and there has been such over the past few years, has been met by increased employer contributions and changes in staff T&Cs.

So NATS, being the employer, shouldered the shortfalls? And NATS is 49% publicly owned? A bit like Lloyds Banking Group and RBS? Or nothing like Lloyds Bank and RBS? But suffice to say that money used for plugging multi-million pound holes in gilt-edged pension commitments, first made when NATS was part of the Civil Service, means a shortage of available funds for ongoing capital investment in operations, does it not? Can we agree on that? And the size of those pension fund shortfalls each year was? The accounts and reports for NATS and its various guises defy clarity of any sort at first glance, which one can only assume is deliberate. Can you decode for us and summarise, or do I have to Google more obscure documents like this one? (https://www.caa.co.uk/docs/5/ergdocs/20100526PensionsGAD.pdf) And maybe Del Prado's 2012 re-sounding of a substantial eight-figure warning bell first rung 13 years ago in this thread: http://www.pprune.org/atc-issues/501526-nats-2001-pension-holiday-reminder.html which never got heeded?

Not exactly the best transparency for a publicly owned entity, is it? Bits here and there and no-one really volunteering the big picture ... ?

A paragraph written 4½ years ago in that Government Actuary's Dept. document said this:

The NATS scheme's benefits are more generous than those provided by typical UK private sector DB schemes. Approximate calculations suggest that, if the NATS scheme's benefits were to be more typical, the employer's standard contribution rate could be around 25% of pay, compared to the actual rate of 37% of pay. The purpose of this calculation is solely to illustrate the broad effect of the level of the NATS scheme's benefits on NERL's projected contributions. We have not been asked to comment on the reasonableness of the level of the scheme's benefits. We recognise that the NATS scheme's benefits reflect the scheme's public sector origins and protections put in place at privatisation.

The executive summary in that report also has a chart indicating that NATS En Route Plc's (NERL's) projected pension contributions - £ million in constant 2008-09 prices (whatever that caveat really means) - were as high as £90M. We are 4½ years further on. How did they turn out, please? Or is the expenditure on pensions insignificant compared to capital expenditure on operations, and therefore a red herring in this thread?
And is the NERL pension expenditure the lion's share of NATS' overall pension commitments, or is there more buried elsewhere in the various books?

ZOOKER
15th Dec 2014, 23:22
Slip and WHBM,
Last Friday, the folks wearing headsets and their immediate operational managers gave it their best shot.
Something happened that wasn't supposed to happen, and as a result, no-one died.
That's what it's all about.

Del Prado
16th Dec 2014, 04:09
Folks, this whole stupid thread has been a trolling expedition, with a troublemaker knowing how to tweak sensibilities here.

Don't feed the troll! I don't think I have read such stupidity on this board before. It is designed to cause a hysterical reaction. Ignore.

Anyone considering engaging with Slip and Turn can I suggest you read this (http://www.pprune.org/questions/379603-runway-behind-you-way-save-time.html) thread first?

anotherthing
16th Dec 2014, 11:27
S&T
Far be it from me to feed a troll, but:

You claim to be able to read accounts.
You claim to understand pensions.

Yet you fail to understand that HMG's 49% holding does not mean that we receive money from the taxpayer to bolster the pension or for anything else, even for investment in new equipment.

NATS pays for the pension contributions through employee payments and from company gross earnings.

Any investment in future equipment etc is either paid for directly from earnings, or from business loans.

Instead of 'propping up NATS', HMG saddled NATS with a large loan, which meant that HM Treasury pocketed over £600M at PPP, but NATS has to service the debt.

As for knowledge of equipment etc... it was the modern, new equipment that caused this failure. The oldest technology, found in TC, was completely unaffected.

I'm sure you won't understand how this could be if we had to reduce flow, but I've fed you enough. Someone else might like to explain to you why we needed the restrictions even if TC was able to operate normally... I'm certain you won't know, despite your protestations about your wealth of knowledge.

Downwind Lander
16th Dec 2014, 12:17
Select committee to hold Deakin's feet to the fire:

Committee to question NATS and CAA over failure in air traffic management system - News from Parliament - UK Parliament (http://www.parliament.uk/business/committees/committees-a-z/commons-select/transport-committee/news/nats-evidence-session/)

It might be on the UK "Parliament Channel", Freeview 131.
http://www.bbc.co.uk/parliament/programmes/schedules/2014/12/19

2Planks
16th Dec 2014, 13:29
zonoma - thanks for the link - that's possibly the most pragmatic statement I have ever seen from a Trade Union.

Ancient Observer
16th Dec 2014, 13:41
NATS does not currently benefit from Govt support for its pension plan.
Most of the actual pensioners in the pre-privatisation phase were put in the CAA's part of the plan.

The real cost was in the privatisation process, when HMG stuffed the CAA and the NATS pension funds full of taxpayers money. That was because prior to the privatisation, the liabilities had been HMGs.

However, if either the NATS or the CAA's pension plans hit problems, they will rush to HMG for more money. It's one of those "Is the Pope a Catholic" sort of questions.

Ian W
16th Dec 2014, 14:46
It's called error handling, and it is an absolutely critical part of any computer program. If a line of code receives unanticipated data (which may not be 'bad' per se), that unforeseen use case needs to have been foreseen by whoever put together the program spec, whoever agreed the program spec, whoever designed the logic that was intended to handle it flawlessly in the code, whoever checked it, whoever tested it, and whoever signed off on the project or module or upgrade - but one, or all six, or sixty, of whom we all now know were mistaken. And there's the rub :ooh:

So now we are told that a single line of code stopped the machine: what actually was it, in the real-time, real-life world, that was unforeseen? That would be the real story.

If I was anotherthing or Gonzo or Zooker or eglnyt et al, I'd have asked that one at the office by now :}

The NAS Host software, written in 1969-1971 (yes, in JOVIAL and BAL, Basic Assembly Language), is actually extremely reliable. However, it was made to run on a set of six IBM 360s known (in the trade) as the IBM 9020D. The UK CAA did not purchase the 9020E, which was another team of six IBM 360s that did radar data processing.

So the 9020D had three input/output processing IBM 360s and three compute-element IBM 360s - all running at an impressive 300,000 integer instructions a second.

Now, the architecture is what made the system reliable. The system was a multiprocessor, multi-programming system, and any program that was pre-empted could be picked up by another processor. The system repeatedly recorded checkpoint recovery data, from once a second out to a few minutes. So if an error was found by the computer (what would give a BSOD on a PC), the IBM 360 involved would stop all the other processors and give them the checkpoint data, and all the processors would rerun precisely the same program and data. If only one of the processors got the error, then the error must be hardware in that processor, and it put itself offline. If all the processors got the error, then the error must be software, and the 9020 did a core dump (a large hexadecimal printout), threw away all its input messages, then restarted (startover) from a clean checkpoint, say 3 minutes before. As software faults in a real-time system are normally timing/pre-emption related, or caused by a broken input message, the system would normally startover successfully. Controllers would receive a message "STARTOVER at time - please re-input any messages" (or words to that effect). If Gork put in the broken message again, then it could cause the startover again. However, the data systems specialist would be looking at the last messages in, identify Gork's message, and somewhat testily suggest that he not re-enter it next time.

OK, so now the system is rehosted as a virtual machine inside a nice shiny new machine. A lot of the automated recovery that was built in may not work quite that way (I don't know how it is now implemented), so I rather think that it may take more manual intervention if the Host software has a glitch.
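The rerun-and-compare fault isolation described above can be sketched in a few lines. This is a toy model of the decision logic only, not the 9020D's actual code:

```python
# Toy model of the 9020D-style fault isolation described above: rerun the
# same checkpoint on every processor and read the pattern of failures.

def diagnose(rerun_ok):
    """rerun_ok[i] is True if processor i reran the checkpoint cleanly."""
    failures = rerun_ok.count(False)
    if failures == 0:
        return "transient"   # the error did not reproduce at all
    if failures == len(rerun_ok):
        return "software"    # every CPU hit it: dump core, startover
    return "hardware"        # only some CPUs hit it: take them offline

print(diagnose([True, True, True]))     # transient
print(diagnose([False, False, False]))  # software
print(diagnose([True, False, True]))    # hardware
```

The elegance of the design is that the hardware/software distinction falls out of a single deterministic rerun, with no extra instrumentation.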

Heathrow Harry
17th Dec 2014, 15:50
http://imageshack.com/a/img537/853/rClDAh.gif

Downwind Lander
17th Dec 2014, 16:13
Transcript of Monday's Transport Select Committee meeting with McLoughlin.

http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/transport-committee/nats-failure-in-air-traffic-management-systems/oral/16712.pdf

Video of interviews with Deakin, Rolfe and Haines.
London air traffic control failure examined - News from Parliament - UK Parliament (http://www.parliament.uk/business/committees/committees-a-z/commons-select/transport-committee/news/nats-evidence-session/)

Report to be out by the beginning of April.

118.70
18th Dec 2014, 10:58
Draft terms of reference for the inquiry have appeared at

Terms of reference | Regulatory Policy | About the CAA (http://www.caa.co.uk/default.aspx?catid=2942&pagetype=90&pageid=16657)

It mentions

including the measures that had been put in place to prepare for routine changes to systems that had occurred on the 11 December 2014 date for the regulated changes to aeronautical information (the AIRAC date) and for the move of additional workstations to support the military task that was re-locating from Prestwick.

and


The preparation and testing of premeditated operational/engineering changes to systems and procedures planned to take place on or about regular AIRAC dates or in association with particular infrastructure changes.

which is the first time I have seen system changes linked as a potential contributory cause.

Ian W
18th Dec 2014, 11:53
The Flight Data Processing software uses an 'Adaptation Controlled Environment System' - that describes the airspace and a huge number of other parameters used by the FDPS as it is running. It is feasible that an error in the Adaptation update could lead to a program crash, but these changes are usually very carefully controlled. NATS has had a lot of practice doing them since the seventies.

stanprice
18th Dec 2014, 14:44
As the Software Engineer responsible for acceptance of the 9020 software by NATS in 1974, I am concerned if it is still in use. Constant modification in dead computer languages, and re-hosting over 30 years, are not conducive to reliability. Part of the problem was, and maybe still is, NATS management's inability to successfully plan the next generation of systems whilst implementing the previous generation.

pax britanica
18th Dec 2014, 16:16
I think this whole incident shows up the media's over-reaction to anything to do with the aviation industry.
Of course it was a serious problem, and of course there will be management failings, because that's life - management effectively means making do, not being perfect, because perfect is always unrealistically expensive.

Since the incident occurred I have heard, every single day, of train services in the London area disrupted by signal failures between x and y. Signalling is the ATC of the rail industry; it is an essential safety feature and it is complicated - and a lot of it is also quite old.

If you add up all the signal failures on the national and urban rail networks this year, I would bet that they caused more inconvenience and more delay to many more people than the other week's problems.
Is there a call for an inquiry? Are the heads of the relevant service providers summoned to Westminster? No, they were not - but they probably should have been, because they caused at least as much chaos, just over a longer time span.

Downwind Lander
18th Dec 2014, 16:16
stanprice says: "As the Software Engineer responsible for acceptance of the 9020 software by NATS in 1974..."

Who says there isn't a God? This is what I asked for in #56, page 3.

It might be a good idea, Stanprice, if you were to approach the Transport Select Committee (#117. page 6) with a view to sending in evidence, or possibly attending a session.

What other little surprises could this software have in store?

WHBM
18th Dec 2014, 16:40
We're still going down the wrong route.

This failure, a unique one, lasted less than an hour before it was overcome. Someone must have understood it and overcome it in that time. A 60-minute hangup is standard stuff in commercial aviation (on BA at a Heathrow gate you might still be waiting for engineering to make their way across after this time). The real issue was the lack of resilience in infrastructure and operations, and the poor information dissemination, which let the whole thing just drag on and on. As I have pointed out previously, BA was still making wholesale cancellations, 14 short-haul departures from Heathrow alone, the following day, after its fleet had stood all night. This just shows an operation planned with insufficient flexibility to absorb disruption from any source.

DaveReidUK
18th Dec 2014, 17:02
It might be a good idea, Stanprice, if you were to approach the Transport Select Committee (#117. page 6) with a view to sending in evidence, or possibly attending a session.

In the Transport Committee hearings on Monday it was stated that the independent [sic] joint CAA/NATS inquiry would take evidence from any interested parties.

The real issue was the lack of resilience in infrastructure and operations, and poor information dissemination, which let the whole thing just drag on and on. As I have pointed out previously, BA was still doing gross cancellations, 14 short haul departures from Heathrow alone, the following day, after the fleet had stood all night. This just shows an operation planned with insufficient flexibility to cope with any disruption from any source.

That's a little hard on BA, though maybe you didn't intend it to sound like that.

Any network carrier that suffers a prolonged period of greatly reduced capacity at its principal shorthaul hub is going to struggle to get its operation back on schedule by the end of the day.

Ian W
18th Dec 2014, 17:21
We're still going down the wrong route.

This failure, a unique one, lasted less than an hour before it was overcome. Someone must have understood it and overcome it in that time. A 60 minute hangup is standard stuff in commercial aviation (on BA at a Heathrow gate you might still be waiting for engineering to make their way across after this time). The real issue was the lack of resilience in infrastructure and operations, and poor information dissemination, which let the whole thing just drag on and on. As I have pointed out previously, BA was still doing gross cancellations, 14 short haul departures from Heathrow alone, the following day, after the fleet had stood all night. This just shows an operation planned with insufficient flexibility to cope with any disruption from any source.

Lack of flexibility isn't the cause; it's running a system at 105%, which gives no leeway to recover from even the smallest delays. System engineers would normally design overcapacity into a system so that normal loading is no more than about 30%, which allows peak loads in excess of that to be absorbed and the system to recover. For various reasons, ATM systems worldwide tend to be loaded to the level where the smallest event causes a cascade of disruption across the network. It is a tribute to the work of dispatchers, controllers and flow controllers that the system recovers as well as it does; theoretically it should take a lot longer.
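The headroom argument above can be sketched with textbook queueing arithmetic. This is a purely illustrative M/M/1 model (nothing to do with any real NATS or ATM system): the mean time a job spends in the system is 1/(mu − lambda), which explodes as utilisation approaches 100%.

```python
# Illustrative M/M/1 queue: mean time in system T = 1/(mu - lambda).
# A sketch of the headroom argument only, not a model of any real ATC system.
def mean_time_in_system(service_rate: float, arrival_rate: float) -> float:
    """Mean time a job spends in the system (waiting + service)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrivals meet or exceed capacity")
    return 1.0 / (service_rate - arrival_rate)

mu = 1.0  # normalise: one unit of work served per unit time
for rho in (0.3, 0.7, 0.9, 0.99):
    t = mean_time_in_system(mu, rho * mu)
    print(f"utilisation {rho:>4.0%}: mean time in system = {t:5.1f}x service time")
# At 30% load, delay barely exceeds the service time; at 99% it is ~100x,
# so any transient outage takes far longer to clear.
```

At 30% utilisation the mean time in system is about 1.4x the service time; at 90% it is 10x and at 99% it is 100x, which is the cascade effect described above.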

Gonzo
18th Dec 2014, 18:45
I stand to be corrected but there are many on this thread who appear to know what the cause of the problem was, and/or the piece of software involved.

Not sure that's been released yet, has it?

zonoma
18th Dec 2014, 19:28
I think it is very obvious, Gonzo, that those wishing to stir up trouble over the failure have very little knowledge of what they are talking about. If any of them spent the time reading or watching the factual evidence, they would come across as better informed.

For those who don't feel it necessary to watch the hour-long parliament meeting linked above: it was stated in that meeting that the software that caused the issue was developed in the 1990s. I'm sorry if that fact gets in the way of your scaremongering.

DaveReidUK
18th Dec 2014, 19:43
Not sure that's been released yet, has it?

The investigation report will be out "by March", according to Deakin.

Ian W
18th Dec 2014, 21:06
There is a term that used to be in vogue called 'software corrosion'. Loosely, this is the effect of multiple patches on software leading to unexpected errors, which of course lead to more patches. There was concern in the late 1980s that the NAS Host software - the Flight Data Processing software - could suffer from this. A consultancy company was called in, which did a thorough audit of the FDPS and NAS Host software. It was found that the approach taken to software maintenance by the Support and Development Organization (as it was then known) had actually improved the software design, and there was no evidence of overlapping patches causing issues. Indeed, the statistics show it is normally interfacing systems that cause issues, not the Host software itself.
Jovial may be a rather antediluvian language, but it is extremely powerful and in many ways better than C or C++, and for the processing done in NAS Host it is probably unmatched in suitability. The Host software has now bedded in over forty years and is an extremely reliable set of programs. Any software house that thinks it will be simple to replace should not apply for the job, as they do not understand it.

1985
18th Dec 2014, 21:16
NAS didn't fail. If it had, then Scottish and TC would also have broken. It was part of the NERC kit in LAC that failed, which is part of the 1990s software.

Ian W
18th Dec 2014, 23:43
NAS didn't fail. If it had, then Scottish and TC would also have broken. It was part of the NERC kit in LAC that failed, which is part of the 1990s software.

I would guess it was the interface to NAS Host from that NERC (New En-Route Centre) kit that was the problem, sending some unexpected broken message, one of a set that cannot be rejected or referred back.

But let's wait and see. Although it would be interesting to be there working it out, I would expect that all the system engineers already know precisely what happened and who wrote the code responsible. :D

vancouv
19th Dec 2014, 08:05
I always enjoy reading PPRuNe and though I am a mere PPL I always get a chuckle out of some of the posts by people who obviously have no understanding of how things work but feel the need to post anyway. And now we have a thread about my great love aviation and my lifetime career subject of software development - boy have I had some laughs.

I don't know anything about NATS software or the specific incident but I have worked with big computer systems for 40 years. Of course they'll go wrong - it is absolutely impossible to test for all scenarios and have every single combination of events covered so there is never an outage. A well written system will handle these occurrences with the minimum impact to anyone or anything around them.

It appears NATS had a 45 minute outage, during which time no plane plummeted onto a school/hospital (delete as appropriate). I would say this is a success. The fact that so many people were late for their holidays is not related to this in any real sense - it is, as many people have posted, related to the fact that there is no capacity to absorb the inevitable delays as a result of this outage. Could be a software error, could be a MAYDAY, could be an ash cloud.

I don't know anything about the head of NATS either, but hauling him in front of MPs to explain a 45 minute software outage is one of the most ridiculous things I've heard for a long time. If London had been littered with the carcasses of 747s then maybe he would have something to explain.

Too many people these days install Windows on a laptop and think they know as much about software development as professional programmers - they don't. :ugh:

GAPSTER
19th Dec 2014, 08:28
Well said. The system fell down; the system kept things safe.

anotherthing
19th Dec 2014, 08:43
Ian W, your guess is wrong...

WHBM - lack of information? The airlines have applauded NATS for the flow of information... maybe that didn't filter down to the shop floor.

Friday was a slow news day, which is why the media had a field day. There was more disruption yesterday due to bad weather, but that doesn't get a mention....

Jonty
19th Dec 2014, 09:52
There could be a job for you, Stan, want to move to Swanwick?
Sounds like they could do with the help!

Heathrow Harry
19th Dec 2014, 13:55
National Rail have had 5 major signalling issues out of Paddington in 3 months.................. and it doesn't get a thousandth of the coverage :{

118.70
19th Dec 2014, 14:02
Viewing the Transport Select C'tee meeting, I was amazed at Graham Stringer's longevity in active membership and his ability to recall the "Computer Weekly" criticism of the Swanwick development and their call for an independent inquiry.

Links to that 1997 evidence session at

House of Commons - Environment, Transport and Regional Affairs - Minutes of Evidence (http://www.publications.parliament.uk/pa/cm199798/cmselect/cmenvtra/360-e/36001.htm)

The Computer Weekly submission on the early-warning tests for project failure was good:

Also, having given the matter further thought I believe the following early warning signs of a disaster are particularly pertinent in this case:

— A failure to meet revised deadlines shortly after assurances that all is going well.

— Late changes to the system to accommodate end-users who were not consulted adequately at the beginning.

— A strong resistance to an independent audit.

— A desire to push ahead or even rush to complete the acceptance tests before all the bugs are removed.

— Lack of goodwill among end-users (air traffic controllers are alleged to have walked out recently during a trial of the training and development unit system).

— The original requirements and the technology being superseded because of the repeated delays. The Swanwick centre's requirements are already nearly seven years old.
House of Commons - Environment, Transport and Regional Affairs - Minutes of Evidence (http://www.publications.parliament.uk/pa/cm199798/cmselect/cmenvtra/360-e/36077.htm)

phiggsbroadband
19th Dec 2014, 14:49
Would the knock on effects have lasted so long if LHR were allowed to have 'Night-Time Ops' in emergency situations?

Perhaps the other infrastructure of London would not be available... Taxis, Rail Transport, Hotels etc.

DaveReidUK
19th Dec 2014, 15:24
Would the knock on effects have lasted so long if LHR were allowed to have 'Night-Time Ops' in emergency situations?

LHR is allowed to handle emergency situations at night. This wasn't one.

STN Ramp Rat
19th Dec 2014, 15:24
Would the knock on effects have lasted so long if LHR were allowed to have 'Night-Time Ops' in emergency situations?


I believe the night time curfew was removed on this occasion, I am not sure how many flights operated during the curfew.


I totally agree it was a slow news day and the politicians felt compelled to react for fear of being seen to be doing nothing. A total overreaction. The only thing I think NATS did wrong was to announce that there would be a huge outage before it was really clear how long it was actually going to be.

stanprice
19th Dec 2014, 16:25
Three major points.
1) ATC system failure rates are specified in terms of one failure every few hundred, if not few thousand, years. Obviously unrealistic, but this does not mean the recent outage can be trivialised, as some posters seem to suggest.

2) Ian W states "Jovial may be a rather antediluvian language but it is extremely powerful in many ways better than C or C++, and for the processing done in NAS Host it is probably unmatched in suitability. The Host software has now bedded in over forty years and is an extremely reliable set of programs."
Whether C or C++ is better than JOVIAL is debatable. Incidentally, the JOVIAL that NAS is written in was not a mainstream version. What makes a software language antediluvian (strange word) depends on a number of factors, not just its technical features, including the availability of a pool of programmers competent in it. As for the reliability of NAS, I repeat: 30 (actually 40) years of modification and rehosting is not conducive to reliability. Having been instrumental in identifying the need for a UK NAS Support and Development Organisation, and having then played a major role in setting it up and in its early work, I believe I have the knowledge to express concern about NAS's life expectancy.
3) I do not know if the recent Swanwick outage was caused by NAS software, some of which predates its introduction to the UK by over a decade. What I do know is that if Swanwick is still dependent on it, this represents a failure of several generations of NATS management to engineer its total replacement.
I will of course, if appropriate, be putting these views with supporting documentation to the relevant bodies.
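Point 1 above can be put into rough numbers. A back-of-envelope sketch (all figures illustrative, not drawn from any NATS specification) of how a stated mean time between failures translates into an availability fraction:

```python
# Back-of-envelope availability arithmetic; all figures are illustrative.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

def availability(mtbf_years: float, outage_minutes: float) -> float:
    """Uptime fraction, given one outage of the stated length per MTBF period."""
    period_minutes = mtbf_years * MINUTES_PER_YEAR
    return (period_minutes - outage_minutes) / period_minutes

# One 45-minute outage in ten years of service:
print(f"{availability(10, 45):.7f}")   # ~0.9999914, i.e. roughly "five nines"
# A spec of one failure per several hundred years is a far harder target:
print(f"{availability(500, 45):.9f}")
```

Even a single 45-minute outage in a decade leaves availability around five nines, which shows how demanding a "one failure per hundreds of years" specification really is.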

118.70
22nd Dec 2014, 11:29
Re-reading the report on the 2013 outage

http://www.nats.aero/wp-content/uploads/2014/08/ATC%20Disruption%207%20Dec%2013%20-%20Report.pdf

http://www.nats.aero/wp-content/uploads/2014/08/ATC%20Disruption%207%20Dec%2013%20-%20Report%20Appendices.pdf

I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.

Should I have confidence in the 2014 inquiry ? Any news of the chairman yet ?

20. The cause of the TMCS failure was corrupted computer disks on three separate servers, which could not be recovered quickly using standard practices that have been effective in the past.

1. The failure occurred in the Voice Communication System (VCS) which provides controllers with integrated voice communications for radio, telephone and intercom in one system. VCS has three main elements:

- A digital telephone exchange system (known as a 'voice switch') which provides all the channels for controller-to-controller and controller-to-aircraft communication;
- Operator touch-screen panels at every workstation which enable controllers to access all the communication channels associated with their task and to amend individual workstation configuration, for example when combining airspace sectors ('band-boxing') for night time operations;
- A Technical Monitoring and Control System (TMCS) which is a computer system for monitoring VCS and managing system changes – essentially a 'control computer' connected to all the other system components but with no connections to the 'outside world'.

3. It was the TMCS system which failed on the 7th December 2013. TMCS is fully duplicated using a Master and Hot Standby (i.e. ready to take over immediately) arrangement. Both the TMCS servers failed during an overnight installation of data changes ('adaptations') while Swanwick AC was in night-time operational mode with just 5 band-boxed sectors controlling all upper airspace above England and Wales.

53. The failure of the Technical Monitoring and Control System (TMCS) servers (main, standby and back-up) occurred during an update of the Voice Communication System (VCS) as part of a series of overnight activities on some 20 systems at the Centre.

54. Subsequent investigations revealed that the failure occurred during re-start procedures following installation of planned changes. The re-start failed due to corruption of 19 start-up files in the TMCS servers which control the VCS system. The fault corruption was replicated to the second standby server and subsequently a back-up spare.

55. The start-up files were corrupted at some point during November 2013, and were lying dormant until the next requirement for a re-start. Investigation by the manufacturer (Frequentis) discovered corruption in network system files, most likely due to an intermittent network connection fault. The TMCS system hardware has since been entirely replaced and the precise reason for the corruption may never be established.

56. The investigation into the subsequent sequence of events is summarised below. A summary of the findings of TRC's independent technical systems expert is at Appendix D which broadly concur with NATS' investigations.

5.4.1 Could the failure have been anticipated?

57. The TRC investigation looked at the history of related problems with TMCS. System logs revealed that difficulties with previous re-starts in April and October 2013 had given engineers cause for concern. For example, in April 2013 there was a similar incident involving TMCS which on that occasion prevented controllers from combining sectors (band-boxing), a scenario which has no impact on capacity provided there are adequate numbers of controllers to continue to operate the unbandboxed sectors. Since then there had been a series of problems which were successfully resolved each time.

58. NATS had already ordered (in November 2013) an enhancement to TMCS from Frequentis to be available during 2014. In the interim, the engineering judgement was that – as these problems had not impacted the ATC service to customers – the residual risk was tolerable in the short term.

59. Given the previous experience with TMCS, the TRC's experts considered that NATS' engineering team could have been more prepared for resolving re-start problems. In particular, re-start problems had been experienced in October 2013 and other faults found before and after 7 December 2013, all of which with hindsight could have merited deeper investigation and response by NATS. However, the experts concluded that "this particular failure was not realistically predictable". But they considered that it would be appropriate for NATS to review the level to which the residual risk of such problem conditions could be considered tolerable / acceptable. The key judgement, however, is that none of the residual risks result in an unsafe system or operation.

60. Engineering procedures for TMCS were immediately changed post event. A planned enhancement to the VCS and TMCS systems has also been deployed which allows band-boxing/splitting without the TMCS. These two changes provide far greater resilience to failure in the future.

D1. Summary of Technical Findings in the TRC Report to the NATS Board – March 2014

Is TMCS fit for purpose?
It is old, fragile and slow because of limited memory and slow machines. It is due for replacement. The current upgrade should increase resilience markedly.

Other systems with similar vulnerability?
Flight Plan Suite Automation System has similar architecture. There may be other systems with dissimilar architecture but comparable vulnerability. NATS Engineering is working on resilience generally.

Are system failures properly reported?
Yes. There is a good culture of following up on failures. Analyses have been detailed and frank.

Resilience measures appropriate?
Principally resilience relates to protection against hardware failure: replication of CPUs, disks, networks, etc. Less attention appears to be given to the risk of software failure or file corruption, which are harder to protect against and recover from. However, many systems are old and have been running satisfactorily for many years. The risk is lower but evidently there.

Downwind Lander
22nd Dec 2014, 14:21
118.70 opines: "I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated/ explained".

It would be interesting to find a stats/probability expert to calculate the odds. This reminds me of Richard Feynman concluding that the Challenger disaster was a statistical certainty.

Heathrow Harry
22nd Dec 2014, 14:35
There is no such thing.

Stats depend on how connected the issues were.

For example: "The re-start failed due to corruption of 19 start-up files in the TMCS servers which control the VCS system. The fault corruption was replicated to the second standby server and subsequently a back-up spare."

Essentially the same problem was copied to several of the servers, in which case the back-up duplication was useless.

eglnyt
22nd Dec 2014, 18:12
I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.

How could they be independent? If one is main, one is standby and the other is a backup ready for immediate use they must have the same configuration on them. The problem with modelling the real world is that when it changes your configuration also has to change on all instances.

Evey_Hammond
23rd Dec 2014, 07:26
It's somewhat unsurprising to find out that one single line of code, previously identified as a potential future problem back in the 90's, could cause this outage so many years later...

vancouv
23rd Dec 2014, 08:31
Computer systems are way more complex than the simplistic 'one line of code caused the problem' theories we've heard on here. Maybe the final point of failure was one line of code, but that same line of code could have worked perfectly for millions of combinations of events that led to it being executed. It only takes one flawed route to cause a problem.

When analysing failures in computer systems it is usually not too difficult to find where it failed; the trick is finding why it failed: how did we get to this point, with this data, on this particular occasion, when everything has worked fine for the last umpteen executions?

Think of the holes lining up that people often talk about when analysing aircraft accidents.
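The 'one flawed route' point can be shown with a deliberately trivial, entirely hypothetical toy (all names and numbers invented, nothing to do with the actual NATS code): a single line that is correct for every input seen in years of operation, then fails on one rare combination.

```python
# Hypothetical toy: one line harbours a latent divide-by-zero that only a
# rarely-used configuration can reach. Names and numbers are invented.
SECTOR_CAPACITY = {"day": 32, "night": 5}

def workload_per_sector(flights: int, watch: str) -> int:
    capacity = SECTOR_CAPACITY[watch]
    # The "one line of code": fine for every daytime run ever executed,
    # but divides by zero when capacity == 5 (the night-time value).
    return flights // (capacity - 5)

print(workload_per_sector(270, "day"))   # 10 -- works, as it always has
# workload_per_sector(270, "night")      # ZeroDivisionError: the rare path
```

The faulty line is executed successfully on every "day" run; only the rare "night" input lines the holes up, which is exactly why finding the why is harder than finding the where.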

Heathrow Harry
24th Dec 2014, 10:39
Vancouv is correct.

There are many things which can cause accidents - and God knows a lot of them finish up on here.

To say that an ATC software problem affects FBW aircraft is like saying it's raining outside and that might affect approaches to LHR - true but pointless.

FlyGooseFly!
26th Dec 2014, 16:48
Some very interesting and informative posts here since the last time I looked in - particularly from Vancouv - who seems to have much relevant experience.


In addition to my previous offering, I'd like to say that I'm not overly worried about something being "old" per se - especially as I'm not that young any more myself, yet still work reasonably well. The problems with computer systems usually occur when some guy has "fiddled" with it! Updates, mods and improvements often turn out to be nothing of the kind - whereas the original version may very well have carried on doing what it was designed to do.


Obviously, advances in technology have enforced some changes, with older languages and versions not being able to run on modern platforms - though introducing these and debugging them must be the stuff of nightmares!
Not to mention having one's start-up files fragged after the last use and sitting there ready to corrupt.

NigelOnDraft
26th Dec 2014, 18:26
Which leads me to: how did the config files get shared/duplicated whilst corrupt? The appendix said it was via RAID, i.e. the discs were kept sync'd across the servers(?). That seems a fair deal for a live / hot-spare setup, but, as was seen, poor protection against a corruption issue, which RAID duly passed along (as it should).

Not an expert in this area, so I might not have quite understood...

eglnyt
26th Dec 2014, 19:16
The appendix said it was via RAID

No it doesn't, It says that RAID is not used in TMCS. The disks are mirrored which is quite different.

NigelOnDraft
26th Dec 2014, 21:03
No it doesn't, It says that RAID is not used in TMCS. The disks are mirrored which is quite different

Really?

Lessons Learned from Dec 7
- Failure of Technical Monitoring & Control System (TMCS) – part of Communications System (VCS) for Area Control at Swanwick
- VCS allows direct access comms between sectors, airports & adjacent centres & is automatically configured for the sector configuration.
- File corruption occurred on the primary server which then transferred to the hot standby as they were linked via RAID*.
- Server was replaced but the software fault then transferred to the spare

and also: Wiki RAID (http://en.wikipedia.org/wiki/RAID): "RAID 1 consists of mirroring"

eglnyt
26th Dec 2014, 21:43
Well Appendix D Page 29 would suggest otherwise.

ZOOKER
27th Dec 2014, 17:41
Presumably after today's debacle on the ECML, those responsible will face the same grilling from Ellman, McCartney and Stringer et al, that Deakin and Rolfe had to endure?

EEngr
27th Dec 2014, 20:00
There seems to be some misunderstanding of the term RAID* as it applies to the TMCS servers. RAID technology is a set of firmware and configurations that makes a group of hardware disks appear to a host as a single logical disk drive. This can provide redundancy in the event of a single (or multiple) hardware failures and allow the host system to continue running, although in some cases at a reduced level of performance. The key here is that a RAID array typically serves a single host system. So even in the event of a series of failures sufficient to incapacitate the entire RAID array (highly improbable), only that one host fails.

From the description of the events, it appears that the logical disk failure affected several redundant host systems. This leads me to believe that, in addition to a RAID array, these systems were using Network Attached Storage** (NAS). Several implementations of this may be referred to as a Storage Area Network, where one server 'shares out' its disk system to other systems. Each system would look at the files on this shared (mounted) drive as if they were local to that system. However, data (bad data in this case) written by one system would become available to all.

It is also possible (not clear from the report) that the "disk mirroring" function may have been implemented at an application level. That is: The TMCS server applications would receive a copy of a data stream and each would write a local copy to its disk. This would be the most robust system, as the applications would be able to spot "bad data" and refuse to save a local copy. Typical NAS systems don't have this capability, as the operating systems have no concept of what is good or bad, Bytes are bytes. And from the description of the failure, it sounds like the latter is what was implemented.

NAS systems are a bad deal for redundancy. They make a system administrator's job easier: write one file and everybody automatically gets a copy. But that is a bad deal if that one copy becomes corrupted.

*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell Netware.

:8
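The third option described above, application-level mirroring where each replica validates data before persisting it, can be sketched as follows. This is a minimal illustration under invented names, not how TMCS or any Frequentis product actually works: each replica recomputes a checksum and refuses to write a record that fails it, so corruption is not blindly propagated the way a block-level mirror would propagate it.

```python
# Sketch of application-level replication with validation (names invented):
# each replica checks a record's checksum before writing its local copy.
import hashlib
import json

def make_record(payload: dict) -> dict:
    """Serialise a payload and attach a checksum for end-to-end validation."""
    body = json.dumps(payload, sort_keys=True)
    return {"body": body, "sha256": hashlib.sha256(body.encode()).hexdigest()}

class Replica:
    """Stands in for one server with its own local storage."""
    def __init__(self, name: str):
        self.name = name
        self.store = []  # stands in for this replica's local disk

    def apply(self, record: dict) -> bool:
        # Validate before writing; a raw disk mirror has no equivalent check.
        digest = hashlib.sha256(record["body"].encode()).hexdigest()
        if digest != record["sha256"]:
            return False  # corrupt record refused; this replica stays clean
        self.store.append(record["body"])
        return True

replicas = [Replica("main"), Replica("hot-standby"), Replica("spare")]
good = make_record({"sectors": 5, "mode": "band-boxed"})
bad = dict(good, body=good["body"].replace("5", "9"))  # simulated corruption

print([r.apply(good) for r in replicas])  # [True, True, True]
print([r.apply(bad) for r in replicas])   # [False, False, False]
```

The design choice is that validation happens at the layer that understands the data, so a corrupt write fails on every replica independently instead of being faithfully mirrored to all of them.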

Standard Toaster
27th Dec 2014, 21:02
NAS systems are a bad deal for redundancy. They make a system administrator's job easier. Write one file an everybody automatically gets a copy. But this is a bad deal if that one copy becomes corrupted.

*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell Netware.


NAS (probably SAN in this case) systems are perfect for redundancy, but redundancy != backup, and the fact people (read ADMINS) keep confusing redundancy, syncing and backup is beyond me.

If one HDD becomes corrupted, the other obviously will too, because that's what most of the RAID modes are for: mirroring one disk onto another.

Disaster recovery plans are mandatory, and with that comes a good backup plan, and I'm sure in this case they had one.

Of course I know in real life some lazy ass admins somehow think that RAID is a good "backup", and when the shtf they're sol.

118.70
28th Dec 2014, 11:29
Disaster recovery plans are mandatory, and with that comes a good backup plan, and I'm sure in this case they had one.

The 2013 incident seems to have started with backup Plan A, which was then ditched for Plan B, which eventually included being sent an electronic backup copy from April by Frequentis in Vienna. It isn't clear whether this was the "good backup plan" that they had in place.

For the 2014 incident, I see the Observer today repeats this version of the story :

Finally, at the end of the year, there was chaos at Nats (http://www.theguardian.com/world/2014/dec/15/uk-air-traffic-chaos-nats), the public-private partnership that runs Britain’s air traffic control systems. The whole of London’s airspace was closed for more than an hour on 12 December, with disruption continuing for several subsequent days. Aircraft were stuck in holding patterns over Heathrow, or diverted to other airports, with hundreds of flights cancelled. Vince Cable said Nats had been “penny wise and pound foolish” and was running “ancient computer systems, which then crash”. It eventually emerged that a single line of computer code more than 25 years old was responsible for the shutdown.

Fear of flying: the spectre that haunts modern life | World news | The Guardian (http://www.theguardian.com/world/2014/dec/28/fear-of-flying-phobia-we-cant-overcome)

Any word from the CAA on the 2014 inquiry chairman yet ?

Mike-Bracknell
28th Dec 2014, 11:43
When aircraft crash, all the pilots on here get quite uppity over amateurs speculating over the cause.

Having heeded their warning and avoided speculation in those cases, and having a deep background in IT, I can now see first-hand the effects of the ill-informed speculating about which they have scant knowledge.

:ok:

Heathrow Harry
28th Dec 2014, 11:43
between Christmas & New Year??

Not a hope ................ the top brass won't be back until around 5th January..............

EEngr
28th Dec 2014, 19:37
NAS (probably SAN in this case) systems are perfect for redundancy,

Nope. NAS is intended to provide a single logical drive as the storage for multiple hosts. If redundancy is a requirement, what you need is each host with its own storage separate from the others, in its own cabinet, with its own power supply. Think of air data computers. Having multiple ADCs all fed from a single static/pitot system is the logical equivalent of using NAS (one logical disk drive) to feed multiple servers.

the fact people (read ADMINS) keep confusing redundancy,

They don't confuse it. It's just laziness. One physical copy of data is easier to maintain than having to update multiple systems, or update one master and push copies to the slaves.

Ian W
28th Dec 2014, 19:42
Just to confuse things further, the Flight Data Processing system is based on NAS Host. That is the National Airspace System - Host Computer. So it is easy to misread what has been reported.

Mike-Bracknell
28th Dec 2014, 19:47
Thank you, IanW. I suspected that was the case; now if the plonkers above you could stop wittering about Network Attached Storage (and getting most of it wrong - this means you, EEngr), then we can get back on topic.

EEngr
29th Dec 2014, 16:46
NAS Host. That is the National Airspace System - Host Computer.

Sorry about the acronym namespace collision. Henceforth, I shall stick with the term "Network Storage". However, I stand by my claim that there appears to be some confusion in the NATS report appendix (almost another collision there).

There are two mentions of RAID. One states that RAID was not used (but mirrored disks) with no further explanation. The other states that:

File corruption occurred on the primary server which then transferred to the hot standby as they were linked via RAID*.

But file sharing between servers is NOT a feature of RAID. It is one of Network Storage. So this leads me to believe that the information given in the NATS incident report may not be consistent. Sorry to be pedantic, but it is this level of detail that will indicate whether the NATS systems are properly designed or not.

Mike-Bracknell
29th Dec 2014, 21:05
It is a badly-written report, therefore, and you will not empirically glean any reliable information about the inner workings of NATS from it.

FYI, anyone referring to 'NAS' in any form related to storage would not be let near the quoting process for such a mission-critical service. 'NAS' stops being useful above the home-user/small-business arena, hence my points above. SAN, on the other hand, is more likely, but even so, given the vintage, a distributed system based on shared fabric is not very likely.

118.70
12th Jan 2015, 20:01
the top brass won't be back until around 5th January

Have they been back long enough yet to find an independent Inquiry Chairman we can have confidence in?

ImageGear
13th Jan 2015, 14:54
I am a late arrival to this discussion, but a little clarity should to be brought to the table.

NAS is virtually irrelevant at the core of mission critical systems. What we probably should be discussing is real time, Multi-Host File Sharing, mass storage replication, SAN Disks, and high speed cache, to achieve a resilient transaction based platform.

A true multi-host live environment, with a directly connected hot-standby system "in the same room" sharing files, full on- and off-site backup, and release-managed test and development environments, would go some way to delivering the level of resiliency needed for this application.

Current thinking across many systems integrators and suppliers suggests that clients are more concerned with the cost of everything and the value of nothing, which means that comprehensive hardware and software fault recovery, error checking and correction, reporting, and resiliency slip quietly out the window.

It is therefore completely unacceptable that start-up files are exercised for the first time on a live platform. These are usually scripts run to pre-configure systems prior to release of a new version from dev to test, or from test to live. In my opinion this is a "find an alternative career" error.

Perhaps a little of our Oriental friends' technology ethos could do with being imported, for example: "No single hardware or software system failure is acceptable WITHIN THE SERVICE LIFE OF THE PRODUCT". (What manufacturer or integrator can or would offer that to NATS today, and would they pay the price?) How many full system-down events are contractually acceptable to NATS?

I don't know and very few do.
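The point above about start-up files being exercised for the first time on live can be sketched as a simple release gate: run the script in a test environment first and refuse to promote it if it fails there. The function names and process below are illustrative assumptions, not NATS' actual release tooling.

```python
import subprocess

def validate_in_test(script, test_env):
    """Run the start-up script against a test environment, not live."""
    result = subprocess.run(
        ["/bin/sh", script],
        env=test_env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def promote_to_live(script, test_env):
    """Gate: a script that has never run cleanly in test never reaches live."""
    if not validate_in_test(script, test_env):
        raise RuntimeError(f"{script} failed in test; not promoting to live")
    # ... only now copy the script into the live release ...
```

Trivial as the gate looks, it encodes the policy being argued for: the first execution of any pre-configuration script happens somewhere a failure is cheap.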

118.70
16th Jan 2015, 18:23
Independent Enquiry Terms of Reference have been published at

Enquiry Terms of Reference | Regulatory Policy | About the CAA (http://www.caa.co.uk/default.aspx?catid=2942&pagetype=90&pageid=16657)

....Overall the Enquiry will address:


The root causes of the incident on 12 December 2014 affecting the Area Control Operations Room, including the measures that had been put in place to prepare for routine changes to systems that occurred on the 11 December 2014 and for support to the military task that was re-locating onto the AC system.
NATS’ handling of the incident to minimise disruption without compromising safety, including the measures to suppress and re-generate traffic and associated communications with airlines, airports and other stakeholders.
Whether the lessons identified in the review of the disruption in December 2013 have been fully embedded and were effective during this incident.
Levels of future resilience and service delivery that should be expected across the en route air traffic network taking into account relevant aviation benchmarks and costs.
Further measures to avoid or reduce the impact of technology or process failures in the future (either by NATS or within the wider industry).
Recommendations on how NATS can improve its response to any future service disruption caused by a system failure.

Scope

In order to fulfil its objectives the scope of the Enquiry will focus on:


NATS’ ability to maintain a safe operation during periods of operational contingency caused by failures of its systems and how this is balanced against the disruption to normal operations.
The functioning of the NERL operation and the interdependencies of the systems that support it including communication, surveillance and flight data and their failure modes, contingencies and operational workarounds.
The preparation and testing of planned changes to systems and procedures linked to regular Aeronautical Information Publication updates or in association with other infrastructure changes.
The effectiveness of NATS’ incident communications process triggered during the event both in terms of NATS’ customers (principally airlines and airports), other ATM agencies including the ATM Network Manager, the regulator, and the government.
The linkage to previous operational failures, their handling and the lessons that have been learned from them.
How NATS’ investment and efficiency plans have previously, and will in future, contribute to operational resilience and the speed of restoring normal working. In particular would an earlier than currently planned introduction of new technology improve resilience and be operationally feasible.
The effectiveness of the CAA oversight arrangements that are in place and under consideration for normal operations, changes to operations and incident/contingency arrangements.

Accountability

The Enquiry is jointly sponsored by and will report to the two chairs of CAA and NATS.
Enquiry Panel Members

The Enquiry panel will consist of the following members:


Sir Robert Walmsley KCB (Chair)
Sir Timothy Anderson KCB DSO
Clayton Brendish CBE
Prof. John McDermid OBE
Mike Toms
Joe Sultana (Director Network Management, Eurocontrol)
Mark Swan (Group Director Safety and Airspace Regulation, CAA)
Martin Rolfe (Managing Director Operations, NATS).

John McDermid seems a useful choice, coming from the High Integrity Systems Engineering Group at the University of York:

Professor John A. McDermid, HISE Research Group, Department of Computer Science, The University of York (http://www-users.cs.york.ac.uk/~jam/)

Is Mike Toms the ex Planning Manager from BAA ?

118.70
1st Feb 2015, 08:51
The interim report should have been completed by now:

Enquiry Process

The Enquiry will be conducted on the following basis:

The Enquiry will produce a written report that will be made public.
The Enquiry will start on 13th January 2015 and is expected to deliver its report no later than 14th May 2015.
The Enquiry will provide an interim report by 31st January 2015 focused on the NATS internal investigation of the 12th December 2014 incident

GAPSTER
6th Feb 2015, 06:17
It's out. Published on the NATS Intranet. Not aware if it's more widely available yet.

off watch
6th Feb 2015, 11:50
It's here on the CAA website :

http://www.caa.co.uk/docs/2942/v3%200%20Interim%20Report%20-%20NATS%20System%20Failure%2012%20December%202014.pdf

118.70
6th Feb 2015, 22:00
I wonder where the erroneous number 151 came from!

Or was that an original valid maximum system capacity that was not changed when other system modifications to expand were carried out ?

And living with the high frequency of pressing the wrong button seems peculiar.
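Going by the interim report and press summaries, the defect was a validity check hard-coded against an old maximum while the rest of the system had been sized for a larger one. The sketch below is purely illustrative (constants and names are assumptions, not the actual SFS code), but it shows how such a limit can lie dormant for years and then fail both identical channels on the same input.

```python
DESIGNED_MAX_ATOMIC_FUNCTIONS = 193   # what the system was sized for
LATENT_CHECK_LIMIT = 151              # stale constant left in one check

def load_configuration(atomic_functions):
    functions = list(atomic_functions)
    if len(functions) > LATENT_CHECK_LIMIT:
        # Both channels run identical software, so both raise
        # the same exception on the same configuration.
        raise RuntimeError("atomic function count exceeds maximum")
    return functions

# Works for two decades while the count stays at or below 151...
load_configuration(range(150))

# ...then a routine reconfiguration pushes it past the stale limit
# and every channel fails the same way:
try:
    load_configuration(range(153))
except RuntimeError as exc:
    print(exc)  # prints: atomic function count exceeds maximum
```

That pattern would also answer the question above: 151 may well have been a valid maximum once, left unchanged in one check when the system was later expanded.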

118.70
9th Feb 2015, 16:15
The Register has

UK air traffic mega cockup: BOTH server channels failed - report | The Register (http://www.theregister.co.uk/2015/02/09/nats_air_traffic_issues_due_to_server_channels_failing_report/)

UK air traffic mega cockup: BOTH server channels failed - report
'First time ever in server's history' says independent panel

The IT cockup at the National Air Traffic Services (NATS) that grounded hundreds of flights in December occurred because both of its System Flight Server (SFS) channels went down, an independent report has revealed.

"The disruption on 12 December 2014 arose because – for the first time in the history of the SFS – both channels failed at the same time," said the NATS System Failure 12 December 2014 – Interim Report.

The cockup resulted in 120 flights being cancelled and 500 flights being delayed for 45 minutes, and affected 10,000 passengers in total.......

118.70
13th Feb 2015, 08:20
And in "Computing" :

Twenty-year-old 'latent defect' in software caused December air-traffic control shutdown - 12 Feb 2015 - Computing News (http://www.computing.co.uk/ctg/news/2394862/twenty-year-old-latent-defect-in-software-caused-december-air-traffic-control-shutdown)

Lon More
19th Feb 2015, 17:15
Although not relevant to this failure, I would have thought a triplicated, not duplicated, system would have been in place.

Wham Bam
20th Feb 2015, 23:05
I think a quadrupled system would be better. Can't be too safe....
I think every system should have 10 back ups.
They all would have failed but at least they were there.
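Sarcasm aside, there is a serious point here: identical software channels fail together on the same input, so a deterministic defect defeats duplication, triplication, or ten-fold redundancy alike. A hypothetical sketch (not the SFS architecture, and the limit figure is illustrative):

```python
def channel(process, data):
    """Run one redundant channel; report success or failure."""
    try:
        return ("ok", process(data))
    except Exception as exc:
        return ("failed", str(exc))

def redundant_run(process, data, n_channels):
    """All channels run the same code on the same input."""
    return [channel(process, data) for _ in range(n_channels)]

def buggy_process(count):
    if count > 151:          # the shared latent defect
        raise ValueError("limit exceeded")
    return count

# Two, three or ten channels: on the triggering input the outcome
# is identical across every one of them.
for n in (2, 3, 10):
    results = redundant_run(buggy_process, 153, n)
    assert all(status == "failed" for status, _ in results)
```

Redundancy of this kind protects against independent hardware faults; guarding against common software defects needs something different, such as diverse implementations or fallback modes.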

118.70
11th May 2015, 07:19
The Enquiry will start on 13th January 2015 and is expected to deliver its report no later than 14th May 2015.

The report should be with us soon!

118.70
19th May 2015, 20:30
I see that the Telegraph reports that the departure of Richard Deakin from NATS is completely unrelated to the Walmsley Inquiry report :

Air traffic control boss stands down - Telegraph (http://www.telegraph.co.uk/finance/newsbysector/transport/11613901/Air-traffic-control-boss-stands-down.html)

Mr Deakin came under pressure to resign before Christmas when a computer malfunction on a single day led to the cancellation and delay of hundreds of flights.

The findings of an independent inquiry into the debacle were given to the chairmen of both NATS and the Civil Aviation Authority last Wednesday.

A spokesman for the air traffic service insisted that the resignation of Mr Deakin, whose £1m pay package last year also drew ire, was not linked to the inquiry.
The final report has not yet been made public.

Talkdownman
19th May 2015, 21:54
More time for his aircraft spotting...

GAPSTER
20th May 2015, 15:17
But not at Birmingham or Gatwick presumably

Talkdownman
20th May 2015, 15:30
Maybe that's the underlying problem… he's copped it in more ways than one...

118.70
23rd May 2015, 07:36
The Times apologises :

Corrections and clarifications : May 23, 2015

We wrongly reported (Business, May 19) that Richard Deakin, who is standing down as head of National Air Traffic Services, had been dismissed. We apologise for the error.
Corrections and clarifications: May 23, 2015 | The Times (http://www.thetimes.co.uk/tto/life/courtsocial/article4448493.ece)

118.70
23rd May 2015, 07:47
Final report from Inquiry on CAA website at

http://www.caa.co.uk/docs/2942/Independent%20Enquiry%20Final%20Report%202.0.pdf

zonoma
27th May 2015, 22:34
It's been 4 days now and still no comments.

Have the facts got in the way of some corporate bashing?

118.70
27th May 2015, 22:34
NATS ignored previous recommendations - IT cock-up report | The Register (http://www.theregister.co.uk/2015/05/26/nats_ignored_previous_recommendations_it_cockup_report/)

From The Register "biting the hand that feeds IT" :

The National Air Traffic Services failed to implement recommendations to mitigate IT risks, according to an independent report into the mega systems failure in December which left thousands of passengers stranded in Blighty......

zonoma
28th May 2015, 12:48
That isn't quite what was stated now, was it? What the report said was:
ES31. A previous NATS’ investigation into a serious communications system failure that occurred on 7 December 2013 identified a number of lessons and prompted associated recommendations by NATS and the CAA most of which were reported as closed off and in place ahead of this most recent incident. However, amongst these recommendations were three of particular note in the context of the 12 December 2014 failure.

The first was to review with stakeholders the industry’s ability to respond to service failures and identify required changes to NATS’ crisis management capabilities, resilience of systems, procedures and service continuity plans.

The second, made by the CAA, encouraged NATS to make best use of all means by which a crisis can be handled from an operational standpoint, including exploring the more effective use of and interactions with the Eurocontrol Network Manager (NM). Despite being assessed by NATS as complete before 12 December, it is evident that neither of these recommendations had been addressed fully.

Finally, a review of the wider industry crisis response and resilience arrangements was recommended. Invitations to participate in the crisis response exercise were extended by NATS to major stakeholders in May 2013 and the event was anticipated to take place in February / March 2015, although that date has now been postponed until after this Enquiry reports.
It states that NATS had addressed all the recommendations but that two points hadn't been addressed fully, so hardly "ignoring", just not doing what they should have, which they have accepted.

I also got the feeling that the report compliments NATS' handling of the situation in several areas, including maintaining safety throughout and the speed with which the fault was identified. A review of the flow measures used will be undertaken to see if they can be improved in future system failures, but there wasn't any condemning criticism of how NATS handled this particular failure, and the panel even agreed that ruling out ANY future failure in such a complex system is impossible.

EastofKoksy
29th May 2015, 05:45
Zonoma

You need to remember that good news is no news!

118.70
29th May 2015, 06:56
The Telegraph mentions the "debacle" in its report on the NATS interim dividend :

....In December, a computer malfunction caused chaos in UK airspace and led to the delay and cancellation of hundreds of flights.

Richard Deakin, the chief executive of the air traffic controller, came under pressure to resign over the debacle and his £1m pay package also came in for criticism. Nats announced earlier this month that Mr Deakin had resigned after five years in the role, and a spokesman insisted his exit was not related to the incident before Christmas.

An independent report into the debacle was published last week and found that two recommendations made following a communications systems failure in 2013 had not been “addressed fully” prior to the December incident “despite being assessed by Nats as complete”. Air traffic controller to pay £54.4m dividend - Telegraph (http://www.telegraph.co.uk/finance/newsbysector/transport/11635653/Air-traffic-controller-to-pay-54.4m-dividend.html)

118.70
31st May 2015, 07:14
Report into air traffic control failure shows we need a better approach to programming (http://theconversation.com/report-into-air-traffic-control-failure-shows-we-need-a-better-approach-to-programming-42496)