All London airspace closed
Join Date: Mar 2008 | Location: London | Age: 69 | Posts: 148
Re-reading the report on the 2013 outage
http://www.nats.aero/wp-content/uplo...-%20Report.pdf
http://www.nats.aero/wp-content/uplo...Appendices.pdf
I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.
Should I have confidence in the 2014 inquiry? Any news of the chairman yet?
20. The cause of the TMCS failure was corrupted computer disks on three separate servers, which
could not be recovered quickly using standard practices that have been effective in the past.
1. The failure occurred in the Voice Communication System (VCS) which provides controllers
with integrated voice communications for radio, telephone and intercom in one system. VCS has
three main elements:
• A digital telephone exchange system (known as a ‘voice switch’) which provides all the channels for controller-to-controller and controller-to-aircraft communication;
• Operator touch-screen panels at every workstation which enable controllers to access all the communication channels associated with their task and to amend individual workstation configuration, for example when combining airspace sectors (‘band-boxing’) for night time operations;
• A Technical Monitoring and Control System (TMCS) which is a computer system for monitoring VCS and managing system changes – essentially a ‘control computer’ connected to all the other system components but with no connections to the ‘outside world’.
3. It was the TMCS system which failed on the 7th December 2013. TMCS is fully duplicated
using a Master and Hot Standby (i.e. ready to take over immediately) arrangement. Both the
TMCS servers failed during an overnight installation of data changes (‘adaptations’) while
Swanwick AC was in night-time operational mode with just 5 band-boxed sectors controlling all
upper airspace above England and Wales.
53. The failure of the Technical Monitoring and Control System (TMCS) servers (main, standby
and back-up) occurred during an update of the Voice Communication System (VCS) as part of a
series of overnight activities on some 20 systems at the Centre.
54. Subsequent investigations revealed that the failure occurred during re-start procedures
following installation of planned changes. The re-start failed due to corruption of 19 start-up files
in the TMCS servers which control the VCS system. The fault corruption was replicated to the
second standby server and subsequently a back-up spare.
55. The start-up files were corrupted at some point during November 2013, and were lying
dormant until the next requirement for a re-start. Investigation by the manufacturer (Frequentis)
discovered corruption in network system files, most likely due to an intermittent network
connection fault. The TMCS system hardware has since been entirely replaced and the precise
reason for the corruption may never be established.
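Paragraph 55 describes corruption that sat dormant in start-up files for weeks and only surfaced at the next re-start. One common guard against exactly this failure mode is to record cryptographic hashes of critical files in a known-good state and verify them routinely, so latent corruption is found long before a re-start depends on those files. A minimal sketch (the file names and manifest layout are my own illustration, not anything from the NATS report):

```python
import hashlib
import json
from pathlib import Path

def record_hashes(files, manifest="startup_manifest.json"):
    """Store a SHA-256 hash of each start-up file while it is known-good."""
    digests = {f: hashlib.sha256(Path(f).read_bytes()).hexdigest()
               for f in files}
    Path(manifest).write_text(json.dumps(digests))

def verify_hashes(manifest="startup_manifest.json"):
    """Return the files whose current contents no longer match the manifest."""
    digests = json.loads(Path(manifest).read_text())
    return [f for f, h in digests.items()
            if hashlib.sha256(Path(f).read_bytes()).hexdigest() != h]
```

Run `verify_hashes()` on a schedule (or as a pre-condition of any re-start) and a file corrupted in November would be flagged in November, not discovered in December.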
56. The investigation into the subsequent sequence of events is summarised below. A summary
of the findings of TRC’s independent technical systems expert is at Appendix D; these broadly concur with NATS’ investigations.
5.4.1 Could the failure have been anticipated?
57. The TRC investigation looked at the history of related problems with TMCS. System logs
revealed that difficulties with previous re-starts in April and October 2013 had given engineers
cause for concern. For example, in April 2013 there was a similar incident involving TMCS which
on that occasion prevented controllers from combining sectors (band-boxing), a scenario which
has no impact on capacity provided there are adequate numbers of controllers to continue to
operate the unbandboxed sectors. Since then there had been a series of problems which were
successfully resolved each time.
58. NATS had already ordered (in November 2013) an enhancement to TMCS from Frequentis to
be available during 2014. In the interim, the engineering judgement was that – as these problems
had not impacted the ATC service to customers – the residual risk was tolerable in the short term.
59. Given the previous experience with TMCS, the TRC’s experts considered that NATS’
engineering team could have been more prepared for resolving re-start problems. In particular, re-start problems had been experienced in October 2013 and other faults found before and after 7
December 2013, all of which with hindsight could have merited deeper investigation and response
by NATS. However, the experts concluded that “this particular failure was not realistically
predictable”. But they considered that it would be appropriate for NATS to review the level to
which the residual risk of such problem conditions could be considered tolerable / acceptable. The
key judgement, however, is that none of the residual risks result in an unsafe system or operation.
60. Engineering procedures for TMCS were immediately changed post event. A planned
enhancement to the VCS and TMCS systems has also been deployed which allows band-boxing/splitting without the TMCS. These two changes provide far greater resilience to failure in
the future.
D1. Summary of Technical Findings in the TRC Report to the NATS Board
– March 2014
Is TMCS fit for purpose? It is old, fragile and slow because of limited memory and slow machines. It is due for replacement. The current upgrade should increase resilience markedly.
Other systems with similar vulnerability? The Flight Plan Suite Automation System has a similar architecture. There may be other systems with dissimilar architecture but comparable vulnerability. NATS Engineering is working on resilience generally.
Are system failures properly reported? Yes. There is a good culture of following up on failures. Analyses have been detailed and frank.
Resilience measures appropriate? Principally, resilience relates to protection against hardware failure: replication of CPUs, disks, networks, etc. Less attention appears to be given to the risk of software failure or file corruption, which are harder to protect against and recover from. However, many systems are old and have been running satisfactorily for many years. The risk is lower but evidently there.
Join Date: Apr 2014 | Location: London | Posts: 148
118.70 opines: "I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated/explained".
It would be interesting to find a stats/probability expert to calculate the odds. This reminds me of Richard Feynman concluding that the Challenger disaster was a statistical certainty.
Join Date: Apr 2010 | Location: London | Posts: 7,072
There is no such thing: the stats depend on how connected the issues were.
For example: "The re-start failed due to corruption of 19 start-up files in the TMCS servers which control the VCS system. The fault corruption was replicated to the second standby server and subsequently a back-up spare."
Essentially the same problem was copied to several of the servers, in which case the back-up duplication was useless.
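The point about connectedness can be made with back-of-the-envelope arithmetic. The per-server probability below is a made-up illustrative number, not anything from the report; the point is the ratio between the two scenarios:

```python
# Illustrative only: p is an assumed per-server probability that one
# server's start-up files are corrupt on a given night.
p = 1e-4

# If the master, standby and spare failed independently, losing all three
# at once would require three coincident events:
p_all_three_independent = p ** 3

# But the corruption was replicated from one server to the others, so a
# single common-cause event takes out all three:
p_all_three_common_cause = p

# Replication makes the triple failure many orders of magnitude likelier.
assert p_all_three_common_cause > 1_000_000 * p_all_three_independent
```

This is why "what are the odds?" has no answer until you know whether the failures were coupled; with replication they were, so the redundancy bought essentially nothing against this fault.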
Join Date: Oct 2004 | Location: Southern England | Posts: 483
I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.
Join Date: Feb 2012 | Location: England | Posts: 67
It's somewhat unsurprising to find out that one single line of code, previously identified as a potential future problem back in the 90's, could cause this outage so many years later...
Join Date: May 2002 | Location: uk | Posts: 314
Computer systems are way more complex than the simplistic 'one line of code caused the problem' theories we've heard on here. Maybe the final point of failure was one line of code, but that same line of code could have worked perfectly for millions of combinations of events that led to it being executed. It only takes one flawed route to cause a problem.
When analysing failures in computer systems it is usually not too difficult to find where it's failed - the trick is finding why it failed, how did we get to this point with this data on this particular occasion when everything has worked fine for the last umpteen executions.
Think of the holes lining up that people often talk about when analysing aircraft accidents.
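The "one flawed route" idea is easy to show concretely. In this hypothetical sketch (the function and its inputs are invented for illustration, not taken from the NATS system), a single line behaves perfectly for every input it has ever seen, until one rare combination of events finally exercises the bad path:

```python
def combine_sectors(controllers, open_sectors):
    """Hypothetical: allocate controllers across open sectors."""
    # This one line is fine whenever open_sectors > 0 -- which it always
    # was, until the night a re-configuration briefly left it at zero.
    return controllers // open_sectors

# Countless past executions take the healthy route:
assert combine_sectors(10, 5) == 2
assert combine_sectors(7, 3) == 2

# The single flawed route finally gets exercised:
failed = False
try:
    combine_sectors(10, 0)
except ZeroDivisionError:
    failed = True
assert failed
```

The line itself never changed; only the combination of data reaching it did, which is why finding *where* a system failed is so much easier than finding *why* it failed on that occasion.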
Join Date: Apr 2010 | Location: London | Posts: 7,072
Vancouv is correct
there are many things which can cause accidents - and God knows a lot of them finish up on here
To say that an ATC software problem affects FBW aircraft is like saying it's raining outside and that might affect approaches to LHR - true but pointless
Join Date: Jul 2006 | Location: Rickmansworth | Age: 74 | Posts: 41
Some very interesting and informative posts here since the last time I looked in - particularly from Vancouv - who seems to have much relevant experience.
In addition to my previous offering, I'd like to say that I'm not overly worried about something being "old" per se - especially as I'm not that young any more myself, yet still work reasonably well. The problems with computer systems usually occur when some guy has "fiddled" with it! Updates, mods and improvements often turn out to be nothing of the kind - whereas the original version may very well have carried on doing what it was designed to do.
Obviously, advances in technology have enforced some changes with older languages and versions not being able to run on modern platforms - though introducing these and debugging must be the stuff of nightmares!
Not to mention having one's start-up files fragged after the last use and sitting there ready to corrupt.
Join Date: Jan 2001 | Location: UK | Posts: 2,044
Which leads me to: how did the config files get shared/duplicated whilst corrupt?
Not an expert in this area, so might not have quite understood...
Join Date: Jan 2001 | Location: UK | Posts: 2,044
No it doesn't. It says that RAID is not used in TMCS. The disks are mirrored, which is quite different.
Lessons Learned from Dec 7
• Failure of Technical Monitoring & Control System (TMCS) – part of
Communications System (VCS) for Area Control at Swanwick
• VCS allows direct access comms between sectors, airports & adjacent
centres & is automatically configured for the sector configuration.
• File corruption occurred on the primary server which then transferred to the
hot standby as they were linked via RAID*.
• Server was replaced but the software fault then transferred to the spare
RAID 1 consists of mirroring
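The distinction matters because mirroring (RAID 1 or otherwise) protects against a disk dying, not against a bad write: the mirror faithfully duplicates whatever bytes it is given. A toy sketch of the difference between a mirror and a point-in-time backup (the class and file names are invented for illustration):

```python
class MirroredPair:
    """Toy RAID 1-style mirror: every write goes to both disks, so the
    pair survives a disk failure but not a corrupt write."""
    def __init__(self):
        self.disk_a = {}
        self.disk_b = {}

    def write(self, name, data):
        # Mirroring duplicates bytes, good or bad; it cannot tell them apart.
        self.disk_a[name] = data
        self.disk_b[name] = data

mirror = MirroredPair()
mirror.write("startup.cfg", b"known-good config")
snapshot = dict(mirror.disk_a)               # point-in-time backup, by contrast
mirror.write("startup.cfg", b"\x00corrupt")  # the faulty write
# Both mirrored copies now hold the corruption; only the backup survives.
assert mirror.disk_a == mirror.disk_b
assert snapshot["startup.cfg"] == b"known-good config"
```

Which is the "RAID is not a backup" point in one picture: the mirror gives you two identical copies of the corruption, the snapshot gives you a way back.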
Join Date: Jan 2008 | Location: The foot of Mt. Belzoni. | Posts: 2,001
Presumably after today's debacle on the ECML, those responsible will face the same grilling from Ellman, McCartney and Stringer et al, that Deakin and Rolfe had to endure?
Terminology
There seems to be some misunderstanding of the term RAID* as it applies to the TMCS servers. RAID technology is a set of firmware and configurations that makes a group of hardware disks appear to a host as a single logical disk drive. This can provide redundancy in the event of single (or multiple) hardware failures and allow the host system to continue running, although in some cases at a reduced level of performance. The key here is that a RAID array typically serves a single host system. So even in the event of a series of failures sufficient to incapacitate the entire RAID array (highly improbable), only that one host fails.
From the description of the events, it appears that the logical disk failure affected several redundant host systems. This leads me to believe that, in addition to a RAID array, these systems were using Network Attached Storage** (NAS). Several implementations of this may be referred to as a Storage Area Network, where one server 'shares out' its disk system to other systems. Each system would look at the files on this shared (mounted) drive as if they were local to that system. However, data (bad data in this case) written by one system would become available to all.
It is also possible (not clear from the report) that the "disk mirroring" function may have been implemented at an application level. That is: the TMCS server applications would each receive a copy of a data stream and write a local copy to disk. This would be the most robust system, as the applications would be able to spot "bad data" and refuse to save a local copy. Typical NAS systems don't have this capability, as the operating systems have no concept of what is good or bad: bytes are bytes. And from the description of the failure, it sounds like shared storage, not application-level copying, is what was implemented.
NAS systems are a bad deal for redundancy. They make a system administrator's job easier: write one file and everybody automatically gets a copy. But this is a bad deal if that one copy becomes corrupted.
*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell Netware.
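The application-level alternative EEngr describes can be sketched in a few lines. This is a hypothetical illustration of the principle, not the actual TMCS design: the sender pairs each payload with a digest, and every replica validates before saving, so a corrupt copy is refused rather than silently propagated:

```python
import hashlib

def send_with_digest(data):
    """Sender pairs the payload with its SHA-256 digest."""
    return data, hashlib.sha256(data).hexdigest()

def replica_store(store, payload, digest):
    """Application-level guard: validate before saving a local copy, rather
    than blindly mirroring bytes the way a shared file system would."""
    if hashlib.sha256(payload).hexdigest() != digest:
        return False  # refuse the copy; keep the previous known-good file
    store["startup.cfg"] = payload
    return True

payload, digest = send_with_digest(b"good config")
standby = {}
assert replica_store(standby, payload, digest)             # clean copy accepted
assert not replica_store(standby, b"g00d config", digest)  # corrupt copy refused
assert standby["startup.cfg"] == b"good config"            # known-good retained
```

With shared storage there is no such checkpoint: whatever lands on the one logical disk is what every host sees.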
Join Date: Jul 2013 | Location: Alternative Universe | Posts: 19
Originally Posted by EEngr
NAS systems are a bad deal for redundancy. They make a system administrator's job easier: write one file and everybody automatically gets a copy. But this is a bad deal if that one copy becomes corrupted.
*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell Netware.
If one HDD becomes corrupted, the other obviously will too, because that's what most of the RAID modes are for: mirroring one disk onto another.
Disaster recovery plans are mandatory, and with that comes a good backup plan, and I'm sure in this case they had one.
Of course I know in real life some lazy ass admins somehow think that RAID is a good "backup", and when the shtf they're sol.
Join Date: Mar 2008 | Location: London | Age: 69 | Posts: 148
Disaster recovery plans are mandatory, and with that comes a good backup plan, and I'm sure in this case they had one.
For the 2014 incident, I see the Observer today repeats this version of the story:
Finally, at the end of the year, there was chaos at Nats, the public-private partnership that runs Britain’s air traffic control systems. The whole of London’s airspace was closed for more than an hour on 12 December, with disruption continuing for several subsequent days. Aircraft were stuck in holding patterns over Heathrow, or diverted to other airports, with hundreds of flights cancelled. Vince Cable said Nats had been “penny wise and pound foolish” and was running “ancient computer systems, which then crash”. It eventually emerged that a single line of computer code more than 25 years old was responsible for the shutdown.
Any word from the CAA on the 2014 inquiry chairman yet?
Join Date: Jan 2008 | Location: Bracknell, Berks, UK | Age: 52 | Posts: 1,133
When aircraft crash, all the pilots on here get quite uppity over amateurs speculating over the cause.
Having heeded their warning and avoided speculation in those cases, and having a deep background in IT, I can now see first-hand the effects of the ill-informed speculating about matters of which they have scant knowledge.
NAS (probably SAN in this case) systems are perfect for redundancy,
the fact people (read ADMINS) keep confusing redundancy,
Join Date: Dec 2006 | Location: Florida and wherever my laptop is | Posts: 1,350
Just to confuse things further, the Flight Data Processing system is based on NAS Host. That is the National Airspace System - Host Computer. So it is easy to misread what has been reported.