PPRuNe Forums - View Single Post - All London airspace closed
Old 22nd Dec 2014, 11:29
#141
118.70
 
Join Date: Mar 2008
Location: London
Age: 69
Posts: 148
Re-reading the report on the 2013 outage

http://www.nats.aero/wp-content/uplo...-%20Report.pdf

http://www.nats.aero/wp-content/uplo...Appendices.pdf

I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.

Should I have confidence in the 2014 inquiry? Any news of the chairman yet?

20. The cause of the TMCS failure was corrupted computer disks on three separate servers, which
could not be recovered quickly using standard practices that have been effective in the past.

1. The failure occurred in the Voice Communication System (VCS) which provides controllers
with integrated voice communications for radio, telephone and intercom in one system. VCS has
three main elements:
• A digital telephone exchange system (known as a ‘voice switch’) which provides all the
channels for controller-to-controller and controller-to-aircraft communication;
• Operator touch-screen panels at every workstation which enable controllers to access all
the communication channels associated with their task and to amend individual workstation
configuration, for example when combining airspace sectors (‘band-boxing’) for night time
operations;
• A Technical Monitoring and Control System (TMCS) which is a computer system for
monitoring VCS and managing system changes – essentially a ‘control computer’ connected to
all the other system components but with no connections to the ‘outside world’.

3. It was the TMCS system which failed on the 7th December 2013. TMCS is fully duplicated
using a Master and Hot Standby (i.e. ready to take over immediately) arrangement. Both the
TMCS servers failed during an overnight installation of data changes (‘adaptations’) while
Swanwick AC was in night-time operational mode with just 5 band-boxed sectors controlling all
upper airspace above England and Wales.


53. The failure of the Technical Monitoring and Control System (TMCS) servers (main, standby
and back-up) occurred during an update of the Voice Communication System (VCS) as part of a
series of overnight activities on some 20 systems at the Centre.
54. Subsequent investigations revealed that the failure occurred during re-start procedures
following installation of planned changes. The re-start failed due to corruption of 19 start-up files
in the TMCS servers which control the VCS system. The fault corruption was replicated to the
second standby server and subsequently a back-up spare.
55. The start-up files were corrupted at some point during November 2013, and were lying
dormant until the next requirement for a re-start. Investigation by the manufacturer (Frequentis)
discovered corruption in network system files, most likely due to an intermittent network
connection fault. The TMCS system hardware has since been entirely replaced and the precise
reason for the corruption may never be established.
56. The investigation into the subsequent sequence of events is summarised below. A summary
of the findings of TRC’s independent technical systems expert is at Appendix D which broadly
concur with NATS’ investigations.
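
For illustration only (this is not from the report, and says nothing about how TMCS or any Frequentis product actually works): paragraph 55 describes start-up files that were corrupted weeks earlier and only discovered at the next re-start, by which time the corruption had been copied to the standby and back-up servers. A minimal sketch of one generic safeguard, assuming a hypothetical directory of start-up files and a digest manifest recorded while the files were known to be good, might look like this. All function and file names are invented for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(directory: Path) -> dict[str, str]:
    """Record a digest for every file while the set is known to be good."""
    return {
        p.name: sha256_of(p)
        for p in sorted(directory.iterdir())
        if p.is_file()
    }


def find_corrupted(directory: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or no longer match the recorded digest."""
    bad = []
    for name, expected in manifest.items():
        path = directory / name
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(name)
    return bad
```

Run periodically, and again before anything is copied to a standby machine, a check of this kind would flag a changed file at the time it changed rather than at the next re-start. It only detects corruption; it does nothing to explain or prevent it.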

5.4.1 Could the failure have been anticipated?

57. The TRC investigation looked at the history of related problems with TMCS. System logs
revealed that difficulties with previous re-starts in April and October 2013 had given engineers
cause for concern. For example, in April 2013 there was a similar incident involving TMCS which
on that occasion prevented controllers from combining sectors (band-boxing), a scenario which
has no impact on capacity provided there are adequate numbers of controllers to continue to
operate the unbandboxed sectors. Since then there had been a series of problems which were
successfully resolved each time.
58. NATS had already ordered (in November 2013) an enhancement to TMCS from Frequentis to
be available during 2014. In the interim, the engineering judgement was that – as these problems
had not impacted the ATC service to customers – the residual risk was tolerable in the short term.
59. Given the previous experience with TMCS, the TRC’s experts considered that NATS’
engineering team could have been more prepared for resolving re-start problems. In particular, re-start
problems had been experienced in October 2013 and other faults found before and after 7
December 2013, all of which with hindsight could have merited deeper investigation and response
by NATS. However, the experts concluded that “this particular failure was not realistically
predictable”. But they considered that it would be appropriate for NATS to review the level to
which the residual risk of such problem conditions could be considered tolerable / acceptable. The
key judgement, however, is that none of the residual risks result in an unsafe system or operation.
60. Engineering procedures for TMCS were immediately changed post event. A planned
enhancement to the VCS and TMCS systems has also been deployed which allows bandboxing/splitting
without the TMCS. These two changes provide far greater resilience to failure in
the future.


D1. Summary of Technical Findings in the TRC Report to the NATS Board – March 2014

Is TMCS fit for purpose?
It is old, fragile and slow because of limited memory and slow machines.
It is due for replacement. The current upgrade should increase resilience
markedly.

Other systems with similar vulnerability?
Flight Plan Suite Automation System has similar architecture. There may
be other systems with dissimilar architecture but comparable vulnerability.
NATS Engineering is working on resilience generally.

Are system failures properly reported?
Yes. There is a good culture of following up on failures. Analyses have
been detailed and frank.

Resilience measures appropriate?
Principally resilience relates to protection against hardware failure:
replication of CPUs, disks, networks, etc. Less attention appears to be
given to the risk of software failure or file corruption, which are harder to
protect against and recover from. However, many systems are old and
have been running satisfactorily for many years. The risk is lower but
evidently there.