Problems at Swanwick

ATC Issues: A place where pilots may enter the 'lion's den' that is Air Traffic Control in complete safety and find out the answers to all those obscure topics which you always wanted to know the answer to but were afraid to ask.

20th May 2002, 04:58  #61

Join Date: Jul 2001
Location: Fort Worth ARTCC ZFW
Posts: 1,155

Iron City;

Actually VSCS takes about 35 minutes to reboot now... If that were to happen, though, we would switch over to VTABS and continue working until the reboot was complete.

For those who don't know, VTABS (VSCS Training and Backup Switch) was installed because we didn't want the live system hooked to the training communications switch, and because we wanted something that would keep working if we took a power bump or hit a software problem with VSCS. VTABS has a battery backup that is independent of the critical power UPS and system, just in case. The system has been VERY reliable, however...
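
In very rough terms, the idea is the sketch below: comms stay on the primary switch and fall back to the backup while the primary is down or rebooting. Everything here is invented for illustration (names, the health check, the decision logic); it is not how VSCS/VTABS is actually implemented.

[code]
# Illustrative only: a toy failover decision. None of this is the real
# VSCS/VTABS implementation; names and behaviour are invented for the example.

PRIMARY = "VSCS"   # main voice switch; a full reboot takes on the order of 35 minutes
BACKUP = "VTABS"   # training/backup switch with its own battery backup

def route_comms(primary_healthy: bool) -> str:
    """Keep voice comms on the primary switch; fall back while it is rebooting."""
    return PRIMARY if primary_healthy else BACKUP

if __name__ == "__main__":
    # Simulate a reboot window: healthy, down for two checks, healthy again.
    for minute, healthy in enumerate([True, False, False, True]):
        print(f"t+{minute}: voice comms via {route_comms(healthy)}")
[/code]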

regards
Scott Voigt is offline  
20th May 2002, 16:47  #62

Join Date: May 2002
Location: UK
Posts: 4

To all fine people out there,

From info I have received today, the Swanwick sector split problem was caused by a network glitch (love that CXSS) – hence they are still running on the new software drop and it is working.

Would love to know how much this has cost – can anyone give me an approximation – but I repeat my earlier comment:
This would not have been a problem ---NO COST/NO DELAY--- if 'the systems specialist engineers' had been around at the time of the DD&C.

Unfortunately this costs a few extra peanuts, and the current tight-arse management penny-pinching policy resulted in Friday's situation – nothing else.

The really good news is that management have re-organised these engineers for no other sensible reason than to impress their new owners TAG, and in the hope that some will quit to reduce the redundancy count – but hey – it'll never happen again, right!!!!
Tanglefoot is offline  
20th May 2002, 17:17  #63
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Tanglefoot et al:

Yes, the 'trouble' at Swanwick was indeed a network problem, due to the interface between CXSS and a COTS product. This problem was first identified in June last year and the engineers have been trying to 'fix' it, but it has proved a vastly more complex issue than was first thought.

It is incorrect to assert that, had specialist engineers been around, it would have been sorted immediately. It took some searching through the low-level code by the Lockheed Martin analyst to find out what had gone wrong.

As there is an engineering investigation ongoing which will produce a post-mortem soon, I will not speculate here about solutions.
BDiONU is offline  
20th May 2002, 17:20  #64
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Forgot a p.s.

The delays were HUGELY exacerbated by a failure of the CFMU in Brussels.
BDiONU is offline  
20th May 2002, 19:01  #65

Join Date: Jan 2002
Location: NERC
Posts: 46

Firstly, apologies to ATC; Friday must have been an extremely stressful day.

That aside I agree with Take3Call5. The problem affected only a single workstation and once identified was VERY simple to correct.

I would like to say more but won't at this stage, as I believe this would be counterproductive and unfair to some people who give their absolute best.
NERC Dweller is offline  
20th May 2002, 20:43  #66

Join Date: May 2002
Location: UK
Posts: 4

Take3Call5

My sources have indeed been busy.
Point 1 is that CXSS is a COTS product – delivered as such by LM.
Point 2: the problem is indeed complex, but the symptoms are apparently easy to deduce and, as NERC Dweller states, it is VERY easy to correct. I stand by my earlier comment. With the correct skill sets available at the time, this would have been identified and recovered immediately (I believe the term is worked around – not fixed). Call me old fashioned, but we should not be costing our airline customers millions of pounds whilst using the operational system as an LM test harness.
Point 3: if this is a complex 'fix' it will undoubtedly cost huge amounts of money to implement. If the problem is VERY easy to deduce and recover from, WHY fix it at all! (After all, how many times a year do you have to reboot your PC?)

The results of the investigation will make interesting reading, but I have my doubts whether NATS can admit its mistakes to itself, let alone the world, on this one.
Tanglefoot is offline  
20th May 2002, 21:16  #67

Join Date: May 1999
Location: Vancouver, BC.
Posts: 748

Take3Call5

True they were, but by that time the die had been cast for a bad day, with the first wave of traffic well behind schedule; when we lost CFMU it became 'very' bad. In relative terms, for my airline, the Swanwick failure probably caused 20 cancellations; with TACT falling over we ended up with 44!
no sig is offline  
21st May 2002, 09:34  #68

Join Date: May 1999
Location: Vancouver, BC.
Posts: 748

NERC Dweller, you wrote

' The problem affected only a single workstation and once identified was VERY simple to correct. '

I hate to think what would happen if we had a 'serious' problem with the system!
no sig is offline  
21st May 2002, 10:37  #69

Join Date: Jan 2001
Location: UK
Posts: 14

I'm probably being very naive here, but aren't the Swanwick workstations powered by what are effectively PCs, which store their software on a hard disc drive?

Why not, when doing these software changes, put in a new HDD and keep the old one safe, so that if the new software doesn't work you simply put the original HDD back in and reboot?

The cost of this would be less than the cost of sending a first-class letter to everyone in the company!

Cue the engineers to tell me why my perfect world won't work
Stan By is offline  
21st May 2002, 13:14  #70

Join Date: Aug 2001
Location: Swanwick
Posts: 13

Stan By -- I think it's been mentioned here already, but perhaps not loudly enough: The problem last Friday was not with the new software release. It was an existing problem, always thought to be inoffensive until it chose to manifest itself in a new and exciting way.

When we install a new release, we in fact do exactly as you say: new and old co-exist on disk, and we can switch back to the old one "easily". (Actually, switching back involves the same impact on operations as the original switch.)
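
For anyone curious, the "two drops on disk, flip back if you have to" idea can be pictured with the small sketch below. The directory layout and the symlink flip are invented for the example; this is not the actual NERC installation mechanism.

[code]
# Illustrative sketch only: two software drops kept side by side on disk with a
# 'current' pointer. Names and mechanism are invented; not the real NERC design.
import tempfile
from pathlib import Path

def install(releases: Path, name: str) -> None:
    """Lay a release down alongside any existing ones (old drops are kept)."""
    d = releases / name
    d.mkdir(parents=True, exist_ok=True)
    (d / "VERSION").write_text(name + "\n")

def activate(releases: Path, name: str) -> None:
    """Point the 'current' link at the chosen drop; the other stays on disk."""
    link = releases / "current"
    if link.is_symlink():
        link.unlink()
    link.symlink_to(releases / name)

if __name__ == "__main__":
    releases = Path(tempfile.mkdtemp()) / "releases"
    install(releases, "drop_old")      # known-good software
    install(releases, "drop_new")      # the new release
    activate(releases, "drop_new")     # bring the new drop into use
    print("running:", (releases / "current" / "VERSION").read_text().strip())
    activate(releases, "drop_old")     # 'switch back' if the new drop misbehaves
    print("running:", (releases / "current" / "VERSION").read_text().strip())
[/code]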

But on Friday, it wouldn't have helped.

--Dinosaur
Dinosaur is offline  
21st May 2002, 15:55  #71
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Tanglefoot:

1. The CXSS used by the NERC system is a hybrid. The COTS product I'm referring to (which obviously I cannot name) is central to the use of sessions.

2. It was repairable by simply restarting the single affected workstation. However, that doesn't mean a host of highly skilled technicians has to be on hand to solve any problems that may arise following a DD&C (or any start-up, and it was the start-up where the 'error' came in). There are several solutions which have been aired, all aimed at identifying when this particular fault has been introduced. Knowing it's there means a simple re-start of the affected workstation, which is hardly rocket science.
I don't understand your reference to the NERC system being an LM test harness. It is a fully working operational system shifting traffic every hour of every day. Or, more correctly, the Operational Staff are using it to shift traffic.

3. I concur.

Why would NATS not publish (in the parts of the company that need to know) the results of its own investigation? There are many lessons to be learnt here, and I frankly doubt, whatever management's other failings, that they wouldn't want those lessons spread widely.
BDiONU is offline  
21st May 2002, 18:46  #72

Join Date: May 2002
Location: UK
Posts: 4

Take3Call5

1. Agree with your definition of the 'offending COTS' product.

2. Agree that this problem was 'fixed' by restarting a single workstation. My point is that, with your highly skilled technicians (a term I think your engineers will liken to an ATCO being called an operator) on hand, this problem would have been deduced before it became a problem at split-out time. This time it was a network problem. Next time it could be something completely different. Doesn't it make sense to have your system experts present at the only time you stop/start the system – the time it is most likely to fail or run into problems?

As for my last jibe: NATS has a pretty bad record of admitting bad news to itself. After all, how many of us were misled into relocating to Hampshire early on promises that it would be operational in 96/98/2000/2001, etc.?
Tanglefoot is offline  
21st May 2002, 19:06  #73
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Yes Tanglefoot, NATS Senior staff were pretty poor (Well, OK, Gloves Off they were CR*P) at giving honest and truthful information about the move to Swanwick. Goodness only knows why, it was VERY obvious that the system wasn't ready!!!!!
I still take issue with the thought that having our technicians/engineers/analysts/system experts (amongst whom I'm counted) on hand would actually have meant the problem being picked up and resolved PDQ. Also, we have started and re-started the system quite literally hundreds of times without any serious problems. This was, not to trivialise things, just a 'Gotcha!'
BDiONU is offline  
21st May 2002, 19:19  #74

Join Date: May 2002
Location: Down South
Posts: 70

The day experts you refer to on a DD&C night are there for the duration of the said activity alone. Correct me if I'm wrong, but this problem was only found once they tried to split out the sectors - not until way after the DD&C team had gone off to dreamland after a job well done and all the boxes ticked. I'm willing to be proved otherwise, as I wasn't there - I only heard the gory details the other night from a colleague. I think you underestimate the skill and ability of the "technicians", as they have been called.
All Systems Go is offline  
22nd May 2002, 06:07  #75
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

All Systems Go:

You are correct, the bug wasn't found until the day shift started splitting. However the low level code data logs do show the session problems from when that workstation was booted up at about 0120Z.

One thing to note: this problem was neither caused by nor affected by the DD&C. It is a problem which sometimes appears on a workstation re-start; it could happen at any time there is a re-start.

Not sure if you thought I was implying that the NATS Engineering staff's skills and abilities (the word technicians is obviously an emotive one!) were anything other than excellent. I apologise if I came across that way. This problem was not identified until a couple of CXSS experts sifted through the low-level code data logs and spotted anomalous activity. Not something which would generally smack you between the eyes, and not something monitored in system control.
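
For anyone wondering what "sifting through the logs" amounts to in practice, a very rough sketch of that kind of check is below. The log format, keywords and workstation ids are all invented for the example; the real CXSS logs and tooling look nothing like this.

[code]
# Illustrative sketch: sweep start-up logs for anomalous session activity and
# flag workstations that need a simple re-start. Format/keywords are invented.
import re

SESSION_ERROR = re.compile(r"session.*(refused|stale|mismatch)", re.IGNORECASE)

def suspect_workstations(log_lines):
    """Return workstation ids whose start-up logs show anomalous session activity."""
    flagged = set()
    for line in log_lines:
        # assumed layout: "<time>Z WS<nn> <message>"
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and SESSION_ERROR.search(parts[2]):
            flagged.add(parts[1])
    return sorted(flagged)

if __name__ == "__main__":
    sample = [
        "0120Z WS14 session handle stale after restart",
        "0121Z WS14 session id mismatch on reconnect",
        "0125Z WS07 start-up complete, sessions nominal",
    ]
    print("re-start recommended for:", suspect_workstations(sample))
[/code]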
BDiONU is offline  
22nd May 2002, 08:52  #76

Join Date: May 2002
Location: Down South
Posts: 70

Take3:

I myself am a technician who calls himself an engineer - it's great. The CXSS experts do an amazing job, one which I still can't understand how they cope with. In fact it has to be said everyone does an incredible job when our management doesn't seem interested in motivating staff or letting us have some small comforts on an otherwise cold day.

As for our chums the ATCOs and ATSAs, well, where would we be without them? Seriously, they always cope admirably with things that would quite considerably upset a normal user. Well done to us all.

Let's hope good ole' AIX and its bigger brother CXSS get along from now on and stop having all these barnies!!!
All Systems Go is offline  
23rd May 2002, 16:34  #77
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

All Systems Go:

Yes, here's hoping CXSS and AIX prove merry bedfellows once again, although I would Touch them.
BDiONU is offline  
