All London airspace closed
Join Date: Mar 2008 | Location: London | Age: 69 | Posts: 148
Re-reading the report on the 2013 outage
http://www.nats.aero/wp-content/uplo...-%20Report.pdf
http://www.nats.aero/wp-content/uplo...Appendices.pdf
I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.
Should I have confidence in the 2014 inquiry? Any news of the chairman yet?
20. The cause of the TMCS failure was corrupted computer disks on three separate servers, which
could not be recovered quickly using standard practices that have been effective in the past.
1. The failure occurred in the Voice Communication System (VCS) which provides controllers
with integrated voice communications for radio, telephone and intercom in one system. VCS has
three main elements:
• A digital telephone exchange system (known as a ‘voice switch’) which provides all the channels for controller-to-controller and controller-to-aircraft communication;
• Operator touch-screen panels at every workstation which enable controllers to access all the communication channels associated with their task and to amend individual workstation configuration, for example when combining airspace sectors (‘band-boxing’) for night time operations;
• A Technical Monitoring and Control System (TMCS) which is a computer system for monitoring VCS and managing system changes – essentially a ‘control computer’ connected to all the other system components but with no connections to the ‘outside world’.
3. It was the TMCS system which failed on the 7th December 2013. TMCS is fully duplicated
using a Master and Hot Standby (i.e. ready to take over immediately) arrangement. Both the
TMCS servers failed during an overnight installation of data changes (‘adaptations’) while
Swanwick AC was in night-time operational mode with just 5 band-boxed sectors controlling all
upper airspace above England and Wales.
53. The failure of the Technical Monitoring and Control System (TMCS) servers (main, standby
and back-up) occurred during an update of the Voice Communication System (VCS) as part of a
series of overnight activities on some 20 systems at the Centre.
54. Subsequent investigations revealed that the failure occurred during re-start procedures
following installation of planned changes. The re-start failed due to corruption of 19 start-up files
in the TMCS servers which control the VCS system. The fault corruption was replicated to the
second standby server and subsequently a back-up spare.
55. The start-up files were corrupted at some point during November 2013, and were lying
dormant until the next requirement for a re-start. Investigation by the manufacturer (Frequentis)
discovered corruption in network system files, most likely due to an intermittent network
connection fault. The TMCS system hardware has since been entirely replaced and the precise
reason for the corruption may never be established.
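Paragraph 55 describes corruption that sat dormant in start-up files for weeks and only surfaced at the next re-start. One common guard against exactly this failure mode is to record cryptographic hashes of critical files in a known-good state and verify them routinely, so latent corruption is found long before a re-start depends on those files. A minimal sketch (the file names and manifest layout are my own illustration, not anything from the NATS report):

```python
import hashlib
import json
from pathlib import Path

def record_hashes(files, manifest="startup_manifest.json"):
    """Store a SHA-256 hash of each start-up file while it is known-good."""
    digests = {f: hashlib.sha256(Path(f).read_bytes()).hexdigest()
               for f in files}
    Path(manifest).write_text(json.dumps(digests))

def verify_hashes(manifest="startup_manifest.json"):
    """Return the files whose current contents no longer match the manifest."""
    digests = json.loads(Path(manifest).read_text())
    return [f for f, h in digests.items()
            if hashlib.sha256(Path(f).read_bytes()).hexdigest() != h]
```

Run `verify_hashes()` on a schedule (or as a pre-condition of any re-start) and a file corrupted in November would be flagged in November, not discovered in December.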
56. The investigation into the subsequent sequence of events is summarised below. A summary
of the findings of TRC’s independent technical systems expert is at Appendix D; these broadly concur with NATS’ investigations.
5.4.1 Could the failure have been anticipated?
57. The TRC investigation looked at the history of related problems with TMCS. System logs
revealed that difficulties with previous re-starts in April and October 2013 had given engineers
cause for concern. For example, in April 2013 there was a similar incident involving TMCS which
on that occasion prevented controllers from combining sectors (band-boxing), a scenario which
has no impact on capacity provided there are adequate numbers of controllers to continue to
operate the unbandboxed sectors. Since then there had been a series of problems which were
successfully resolved each time.
58. NATS had already ordered (in November 2013) an enhancement to TMCS from Frequentis to
be available during 2014. In the interim, the engineering judgement was that – as these problems
had not impacted the ATC service to customers – the residual risk was tolerable in the short term.
59. Given the previous experience with TMCS, the TRC’s experts considered that NATS’
engineering team could have been more prepared for resolving re-start problems. In particular, re-start problems had been experienced in October 2013 and other faults found before and after 7
December 2013, all of which with hindsight could have merited deeper investigation and response
by NATS. However, the experts concluded that “this particular failure was not realistically
predictable”. But they considered that it would be appropriate for NATS to review the level to
which the residual risk of such problem conditions could be considered tolerable / acceptable. The
key judgement, however, is that none of the residual risks result in an unsafe system or operation.
60. Engineering procedures for TMCS were immediately changed post event. A planned
enhancement to the VCS and TMCS systems has also been deployed which allows band-boxing/splitting without the TMCS. These two changes provide far greater resilience to failure in
the future.
D1. Summary of Technical Findings in the TRC Report to the NATS Board
– March 2014
Is TMCS fit for purpose? It is old, fragile and slow because of limited memory and slow machines. It is due for replacement. The current upgrade should increase resilience markedly.
Other systems with similar vulnerability? The Flight Plan Suite Automation System has a similar architecture. There may be other systems with dissimilar architecture but comparable vulnerability. NATS Engineering is working on resilience generally.
Are system failures properly reported? Yes. There is a good culture of following up on failures. Analyses have been detailed and frank.
Resilience measures appropriate? Principally, resilience relates to protection against hardware failure: replication of CPUs, disks, networks, etc. Less attention appears to be given to the risk of software failure or file corruption, which are harder to protect against and recover from. However, many systems are old and have been running satisfactorily for many years. The risk is lower but evidently there.
Join Date: Apr 2014 | Location: London | Posts: 148
118.70 opines: "I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated/explained".
It would be interesting to find a stats/probability expert to calculate the odds. This reminds me of Richard Feynman concluding that the Challenger disaster was a statistical certainty.
Join Date: Apr 2010 | Location: London | Posts: 7,072
There is no such thing: the stats depend on how connected the issues were.
For example: "The re-start failed due to corruption of 19 start-up files in the TMCS servers which control the VCS system. The fault corruption was replicated to the second standby server and subsequently a back-up spare."
Essentially the same problem was copied to several of the servers, in which case the back-up duplication was useless.
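The point about connectedness can be made with back-of-the-envelope arithmetic. The per-server probability below is a made-up illustrative number, not anything from the report; the point is the ratio between the two scenarios:

```python
# Illustrative only: p is an assumed per-server probability that one
# server's start-up files are corrupt on a given night.
p = 1e-4

# If the master, standby and spare failed independently, losing all three
# at once would require three coincident events:
p_all_three_independent = p ** 3

# But the corruption was replicated from one server to the others, so a
# single common-cause event takes out all three:
p_all_three_common_cause = p

# Replication makes the triple failure many orders of magnitude likelier.
assert p_all_three_common_cause > 1_000_000 * p_all_three_independent
```

This is why "what are the odds?" has no answer until you know whether the failures were coupled; with replication they were, so the redundancy bought essentially nothing against this fault.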
Join Date: Oct 2004 | Location: Southern England | Posts: 483
I wasn't convinced that the root cause of the simultaneous corruption of files in three independent computers was adequately investigated / explained.
Join Date: Feb 2012 | Location: England | Posts: 67
It's somewhat unsurprising to find out that one single line of code, previously identified as a potential future problem back in the 90's, could cause this outage so many years later...
Join Date: May 2002 | Location: uk | Posts: 314
Computer systems are way more complex than the simplistic 'one line of code caused the problem' theories we've heard on here. Maybe the final point of failure was one line of code, but that same line of code could have worked perfectly for millions of combinations of events that led to it being executed. It only takes one flawed route to cause a problem.
When analysing failures in computer systems it is usually not too difficult to find where it's failed - the trick is finding why it failed, how did we get to this point with this data on this particular occasion when everything has worked fine for the last umpteen executions.
Think of the holes lining up that people often talk about when analysing aircraft accidents.
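The "one flawed route" idea is easy to show concretely. In this hypothetical sketch (the function and its inputs are invented for illustration, not taken from the NATS system), a single line behaves perfectly for every input it has ever seen, until one rare combination of events finally exercises the bad path:

```python
def combine_sectors(controllers, open_sectors):
    """Hypothetical: allocate controllers across open sectors."""
    # This one line is fine whenever open_sectors > 0 -- which it always
    # was, until the night a re-configuration briefly left it at zero.
    return controllers // open_sectors

# Countless past executions take the healthy route:
assert combine_sectors(10, 5) == 2
assert combine_sectors(7, 3) == 2

# The single flawed route finally gets exercised:
failed = False
try:
    combine_sectors(10, 0)
except ZeroDivisionError:
    failed = True
assert failed
```

The line itself never changed; only the combination of data reaching it did, which is why finding *where* a system failed is so much easier than finding *why* it failed on that occasion.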
Join Date: Apr 2010 | Location: London | Posts: 7,072
Vancouv is correct
there are many things which can cause accidents - and God knows a lot of them finish up on here
To say that an ATC software problem affects FBW aircraft is like saying it's raining outside and that might affect approaches to LHR - true but pointless
Join Date: Jul 2006 | Location: Rickmansworth | Age: 74 | Posts: 41
Some very interesting and informative posts here since the last time I looked in - particularly from Vancouv - who seems to have much relevant experience.
In addition to my previous offering, I'd like to say that I'm not overly worried about something being "old" per se - especially as I'm not that young any more myself, yet still work reasonably well. The problems with computer systems usually occur when some guy has "fiddled" with it! Updates, mods and improvements often turn out to be nothing of the kind - whereas the original version may very well have carried on doing what it was designed to do.
Obviously, advances in technology have enforced some changes with older languages and versions not being able to run on modern platforms - though introducing these and debugging must be the stuff of nightmares!
Not to mention having one's start-up files fragged after the last use and sitting there ready to corrupt.
Join Date: Jan 2001 | Location: UK | Posts: 2,044
Which leads me to: how did the config files get shared/duplicated whilst corrupt?
Not an expert in this area, so might not have quite understood...
Join Date: Jan 2001 | Location: UK | Posts: 2,044
No it doesn't. It says that RAID is not used in TMCS. The disks are mirrored, which is quite different.
Lessons Learned from Dec 7
• Failure of Technical Monitoring & Control System (TMCS) – part of
Communications System (VCS) for Area Control at Swanwick
• VCS allows direct access comms between sectors, airports & adjacent
centres & is automatically configured for the sector configuration.
• File corruption occurred on the primary server which then transferred to the
hot standby as they were linked via RAID*.
• Server was replaced but the software fault then transferred to the spare
RAID 1 consists of mirroring
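The distinction matters because mirroring (RAID 1 or otherwise) protects against a disk dying, not against a bad write: the mirror faithfully duplicates whatever bytes it is given. A toy sketch of the difference between a mirror and a point-in-time backup (the class and file names are invented for illustration):

```python
class MirroredPair:
    """Toy RAID 1-style mirror: every write goes to both disks, so the
    pair survives a disk failure but not a corrupt write."""
    def __init__(self):
        self.disk_a = {}
        self.disk_b = {}

    def write(self, name, data):
        # Mirroring duplicates bytes, good or bad; it cannot tell them apart.
        self.disk_a[name] = data
        self.disk_b[name] = data

mirror = MirroredPair()
mirror.write("startup.cfg", b"known-good config")
snapshot = dict(mirror.disk_a)               # point-in-time backup, by contrast
mirror.write("startup.cfg", b"\x00corrupt")  # the faulty write
# Both mirrored copies now hold the corruption; only the backup survives.
assert mirror.disk_a == mirror.disk_b
assert snapshot["startup.cfg"] == b"known-good config"
```

Which is the "RAID is not a backup" point in one picture: the mirror gives you two identical copies of the corruption, the snapshot gives you a way back.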
Join Date: Jan 2008 | Location: The foot of Mt. Belzoni. | Posts: 2,001
Presumably after today's debacle on the ECML, those responsible will face the same grilling from Ellman, McCartney and Stringer et al, that Deakin and Rolfe had to endure?
Terminology
There seems to be some misunderstanding of the term RAID* as it applies to the TMCS servers. RAID technology is a set of firmware and configurations that makes a group of hardware disks appear to a host as a single logical disk drive. This can provide redundancy in the event of single (or multiple) hardware failures and allow the host system to continue running, although in some cases at a reduced level of performance. The key here is that a RAID array typically serves a single host system. So even in the event of a series of failures sufficient to incapacitate the entire RAID array (highly improbable), only that one host fails.
From the description of the events, it appears that the logical disk failure affected several redundant host systems. This leads me to believe that, in addition to a RAID array, these systems were using Network Attached Storage** (NAS). Several implementations of this may be referred to as a Storage Area Network, where one server 'shares out' its disk system to other systems. Each system would look at the files on this shared (mounted) drive as if they were local to that system. However, data (bad data in this case) written by one system would become available to all.
It is also possible (not clear from the report) that the "disk mirroring" function may have been implemented at an application level. That is: the TMCS server applications would each receive a copy of a data stream and write a local copy to disk. This would be the most robust system, as the applications would be able to spot "bad data" and refuse to save a local copy. Typical NAS systems don't have this capability, as the operating systems have no concept of what is good or bad: bytes are bytes. And from the description of the failure, it sounds like shared storage, not application-level copying, is what was implemented.
NAS systems are a bad deal for redundancy. They make a system administrator's job easier: write one file and everybody automatically gets a copy. But this is a bad deal if that one copy becomes corrupted.
*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell Netware.
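The application-level alternative EEngr describes can be sketched in a few lines. This is a hypothetical illustration of the principle, not the actual TMCS design: the sender pairs each payload with a digest, and every replica validates before saving, so a corrupt copy is refused rather than silently propagated:

```python
import hashlib

def send_with_digest(data):
    """Sender pairs the payload with its SHA-256 digest."""
    return data, hashlib.sha256(data).hexdigest()

def replica_store(store, payload, digest):
    """Application-level guard: validate before saving a local copy, rather
    than blindly mirroring bytes the way a shared file system would."""
    if hashlib.sha256(payload).hexdigest() != digest:
        return False  # refuse the copy; keep the previous known-good file
    store["startup.cfg"] = payload
    return True

payload, digest = send_with_digest(b"good config")
standby = {}
assert replica_store(standby, payload, digest)             # clean copy accepted
assert not replica_store(standby, b"g00d config", digest)  # corrupt copy refused
assert standby["startup.cfg"] == b"good config"            # known-good retained
```

With shared storage there is no such checkpoint: whatever lands on the one logical disk is what every host sees.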
Join Date: Jul 2013 | Location: Alternative Universe | Posts: 19
Originally Posted by EEngr
NAS systems are a bad deal for redundancy. They make a system administrator's job easier: write one file and everybody automatically gets a copy. But this is a bad deal if that one copy becomes corrupted.
*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell Netware.
If one HDD becomes corrupted, the other obviously will too, because that's what most of the RAID modes are for: mirroring one disk onto another.
Disaster recovery plans are mandatory, and with that comes a good backup plan, and I'm sure in this case they had one.
Of course I know in real life some lazy ass admins somehow think that RAID is a good "backup", and when the shtf they're sol.
Join Date: Mar 2008 | Location: London | Age: 69 | Posts: 148
Disaster recovery plans are mandatory, and with that comes a good backup plan, and I'm sure in this case they had one.
For the 2014 incident, I see the Observer today repeats this version of the story:
Finally, at the end of the year, there was chaos at Nats, the public-private partnership that runs Britain’s air traffic control systems. The whole of London’s airspace was closed for more than an hour on 12 December, with disruption continuing for several subsequent days. Aircraft were stuck in holding patterns over Heathrow, or diverted to other airports, with hundreds of flights cancelled. Vince Cable said Nats had been “penny wise and pound foolish” and was running “ancient computer systems, which then crash”. It eventually emerged that a single line of computer code more than 25 years old was responsible for the shutdown.
Any word from the CAA on the 2014 inquiry chairman yet?
Join Date: Jan 2008 | Location: Bracknell, Berks, UK | Age: 52 | Posts: 1,133
When aircraft crash, all the pilots on here get quite uppity over amateurs speculating over the cause.
Having heeded their warning and avoided speculation in those cases, and having a deep background in IT, I can now see first-hand the effects of the ill-informed speculating about matters of which they have scant knowledge.
NAS (probably SAN in this case) systems are perfect for redundancy,
the fact people (read ADMINS) keep confusing redundancy,
Join Date: Dec 2006 | Location: Florida and wherever my laptop is | Posts: 1,350
Just to confuse things further, the Flight Data Processing system is based on NAS Host. That is the National Airspace System - Host Computer. So it is easy to misread what has been reported.