Problems at Swanwick

ATC Issues: A place where pilots may enter the 'lion's den' that is Air Traffic Control in complete safety and find out the answers to all those obscure topics which you always wanted to know the answer to but were afraid to ask.

20th May 2002, 04:58  #61

Join Date: Jul 2001
Location: Fort Worth ARTCC ZFW
Posts: 1,155

Iron City;

Actually VSCS takes about 35 minutes to reboot now... If that were to happen, though, we would switch over to VTABS and continue working until the reboot was complete.

For those who don't know, VTABS (VSCS Training and Backup Switch) was installed because we didn't want the live system hooked to the training communications switch, and because we wanted something that would keep working if we took a power bump or hit a software problem with VSCS. VTABS has a battery backup that is independent of the critical power UPS and system, just in case. The system has been VERY reliable, however...
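
In very rough terms, the idea is the sketch below: comms stay on the primary switch and fall back to the backup while the primary is down or rebooting. Everything here is invented for illustration (names, the health check, the decision logic); it is not how VSCS/VTABS is actually implemented.

[code]
# Illustrative only: a toy failover decision. None of this is the real
# VSCS/VTABS implementation; names and behaviour are invented for the example.

PRIMARY = "VSCS"   # main voice switch; a full reboot takes on the order of 35 minutes
BACKUP = "VTABS"   # training/backup switch with its own battery backup

def route_comms(primary_healthy: bool) -> str:
    """Keep voice comms on the primary switch; fall back while it is rebooting."""
    return PRIMARY if primary_healthy else BACKUP

if __name__ == "__main__":
    # Simulate a reboot window: healthy, down for two checks, healthy again.
    for minute, healthy in enumerate([True, False, False, True]):
        print(f"t+{minute}: voice comms via {route_comms(healthy)}")
[/code]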

regards
Scott Voigt is offline  
20th May 2002, 16:47  #62

Join Date: May 2002
Location: UK
Posts: 4

To all fine people out there,

From info I have received today, the Swanwick sector split problem was caused by a network glitch (love that CXSS) – hence they are still running on the new software drop and it is working.

Would love to know how much this has cost – can anyone give me an approximation – but I repeat my earlier comment:
This would not have been a problem ---NO COST/NO DELAY--- if 'the systems specialist engineers' had been around at the time of the DD&C.

Unfortunately this costs a few extra peanuts, and the current tight-arse management penny-pinching policy resulted in Friday's situation – nothing else.

The really good news is that management have re-organised these engineers for no other sensible reason than to impress their new owners TAG, and in the hope that some will quit to reduce the redundancy count – but hey – it'll never happen again, right!!!!
Tanglefoot is offline  
20th May 2002, 17:17  #63
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Tanglefoot et al:

Yes, the 'trouble' at Swanwick was indeed a network problem, due to the interface between CXSS and a COTS product. This problem was first identified in June last year and the engineers have been trying to 'fix' it, but it has proved a vastly more complex issue than was first thought.

It is incorrect to assert that, had specialist engineers been around, it would have been sorted immediately. It took some searching through the low-level code by the Lockheed Martin analyst to find out what had gone wrong.

As there is an engineering investigation ongoing which will produce a post-mortem soon, I will not speculate here about solutions.
BDiONU is offline  
20th May 2002, 17:20  #64
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Forgot a p.s.

The delays were HUGELY exacerbated by a failure of the CFMU in Brussels.
BDiONU is offline  
20th May 2002, 19:01  #65

Join Date: Jan 2002
Location: NERC
Posts: 46

Firstly, apologies to ATC; Friday must have been an extremely stressful day.

That aside I agree with Take3Call5. The problem affected only a single workstation and once identified was VERY simple to correct.

I would like to say more but won't at this stage, as I believe this would be counterproductive and unfair to some people who give their absolute best.
NERC Dweller is offline  
20th May 2002, 20:43  #66

Join Date: May 2002
Location: UK
Posts: 4

Take3Call5

My sources have indeed been busy.
Point 1 is that CXSS is a COTS product – delivered as such by LM.
Point 2: the problem is indeed complex, but the symptoms are apparently easy to deduce and, as NERC Dweller states, it is VERY easy to correct. I stand by my earlier comment. With the correct skill sets available at the time, this would have been identified and recovered immediately (I believe the term is worked around – not fixed). Call me old fashioned, but we should not be costing our airline customers millions of pounds whilst using the operational system as an LM test harness.
Point 3: if this is a complex 'fix' it will undoubtedly cost huge amounts of money to implement. If the problem is VERY easy to deduce and recover from, WHY fix it at all! (After all, how many times a year do you have to reboot your PC?)

The results of the investigation will make interesting reading, but I have my doubts whether NATS can admit its mistakes to itself, let alone the world, on this one.
Tanglefoot is offline  
20th May 2002, 21:16  #67

Join Date: May 1999
Location: Vancouver, BC.
Posts: 748

Take3Call5

True they were, but by that time the die had been cast for a bad day, with the first wave of traffic well behind schedule; when we lost CFMU it became 'very' bad. In relative terms, for my airline, the Swanwick failure probably caused 20 cancellations; with TACT falling over we ended up with 44!
no sig is offline  
21st May 2002, 09:34  #68

Join Date: May 1999
Location: Vancouver, BC.
Posts: 748

NERC Dweller, you wrote

' The problem affected only a single workstation and once identified was VERY simple to correct. '

I hate to think what would happen if we had a 'serious' problem with the system!
no sig is offline  
21st May 2002, 10:37  #69

Join Date: Jan 2001
Location: UK
Posts: 14

I'm probably being very naive here, but aren't the Swanwick workstations powered by what are effectively PCs, which store their software on a hard disc drive?

Why not, when doing these software changes, put in a new HDD and keep the old one safe, so that if the new software doesn't work you simply put the original HDD back in and reboot?

The cost of this would be less than the cost of sending a first-class letter to everyone in the company!

Cue the engineers to tell me why my perfect world won't work
Stan By is offline  
21st May 2002, 13:14  #70

Join Date: Aug 2001
Location: Swanwick
Posts: 13

Stan By -- I think it's been mentioned here already, but perhaps not loudly enough: The problem last Friday was not with the new software release. It was an existing problem, always thought to be inoffensive until it chose to manifest itself in a new and exciting way.

When we install a new release, we in fact do exactly as you say: new and old co-exist on disk, and we can switch back to the old one "easily". (Actually, switching back involves the same impact on operations as the original switch.)
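
For anyone curious, the "two drops on disk, flip back if you have to" idea can be pictured with the small sketch below. The directory layout and the symlink flip are invented for the example; this is not the actual NERC installation mechanism.

[code]
# Illustrative sketch only: two software drops kept side by side on disk with a
# 'current' pointer. Names and mechanism are invented; not the real NERC design.
import tempfile
from pathlib import Path

def install(releases: Path, name: str) -> None:
    """Lay a release down alongside any existing ones (old drops are kept)."""
    d = releases / name
    d.mkdir(parents=True, exist_ok=True)
    (d / "VERSION").write_text(name + "\n")

def activate(releases: Path, name: str) -> None:
    """Point the 'current' link at the chosen drop; the other stays on disk."""
    link = releases / "current"
    if link.is_symlink():
        link.unlink()
    link.symlink_to(releases / name)

if __name__ == "__main__":
    releases = Path(tempfile.mkdtemp()) / "releases"
    install(releases, "drop_old")      # known-good software
    install(releases, "drop_new")      # the new release
    activate(releases, "drop_new")     # bring the new drop into use
    print("running:", (releases / "current" / "VERSION").read_text().strip())
    activate(releases, "drop_old")     # 'switch back' if the new drop misbehaves
    print("running:", (releases / "current" / "VERSION").read_text().strip())
[/code]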

But on Friday, it wouldn't have helped.

--Dinosaur
Dinosaur is offline  
21st May 2002, 15:55  #71
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Tanglefoot:

1. The CXSS used by the NERC system is a hybrid. The COTS product I'm referring to (which obviously I cannot name) is central to the use of sessions.

2. It was repairable by simply restarting the single affected workstation. However, that doesn't mean a host of highly skilled technicians has to be on hand to solve any problems that may arise following a DD&C (or any start-up, and it was the start-up where the 'error' came in). There are several solutions which have been aired, all aimed at identifying when this particular fault has been introduced. Knowing it's there means a simple re-start of the affected workstation, which is hardly rocket science.
I don't understand your reference to the NERC system being an LM test harness. It is a fully working operational system shifting traffic every hour of every day. Or, more correctly, the Operational Staff are using it to shift traffic.

3. I concur.

Why would NATS not publish (in the parts of the company that need to know) the results of its own investigation? There are many lessons to be learnt here, and I frankly doubt, whatever management's other failings, that they wouldn't want those lessons spread widely.
BDiONU is offline  
21st May 2002, 18:46  #72

Join Date: May 2002
Location: UK
Posts: 4

Take3Call5

1. Agree with your definition of the 'offending COTS' product.

2. Agree that this problem was 'fixed' by restarting a single workstation. My point is that, with your highly skilled technicians (a term I think your engineers will liken to an ATCO being called an operator) on hand, this problem would have been deduced before it became a problem at split-out time. This time it was a network problem. Next time it could be something completely different. Doesn't it make sense to have your system experts present at the only time you stop/start the system – the time it is most likely to fail or run into problems?

As for my last jibe: NATS has a pretty bad record of admitting bad news to itself. After all, how many of us were misled into relocating to Hampshire early on promises that it would be operational in 96/98/2000/2001, etc.?
Tanglefoot is offline  
21st May 2002, 19:06  #73
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

Yes Tanglefoot, NATS Senior staff were pretty poor (Well, OK, Gloves Off they were CR*P) at giving honest and truthful information about the move to Swanwick. Goodness only knows why, it was VERY obvious that the system wasn't ready!!!!!
I still take issue with the thought that having our technicians/engineers/analysts/system experts (amongst whom I'm counted) on hand would actually have meant the problem being picked up and resolved PDQ. Also, we have started and re-started the system quite literally hundreds of times without any serious problems. This was, not to trivialise things, just a 'Gotcha!'
BDiONU is offline  
21st May 2002, 19:19  #74

Join Date: May 2002
Location: Down South
Posts: 70

The day experts you refer to on a DD&C night are there for the duration of the said activity alone. Correct me if I'm wrong, but this problem was only found once they tried to split out the sectors - not until way after the DD&C team had gone off to dreamland after a job well done and all the boxes ticked. I'm willing to be proved otherwise, as I wasn't there - I only heard the gory details the other night from a colleague. I think you underestimate the skill and ability of the "technicians", as they have been called.
All Systems Go is offline  
22nd May 2002, 06:07  #75
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

All Systems Go:

You are correct, the bug wasn't found until the day shift started splitting. However the low level code data logs do show the session problems from when that workstation was booted up at about 0120Z.

One thing to note: this problem was neither caused by nor affected by the DD&C. It is a problem which sometimes appears on a workstation re-start; it could happen at any time there is a re-start.

Not sure if you thought I was implying that the NATS Engineering staff's skills and abilities (the word technicians is obviously an emotive one!) were anything other than excellent. I apologise if I came across that way. This problem was not identified until a couple of CXSS experts sifted through the low-level code data logs and spotted anomalous activity. Not something which would generally smack you between the eyes, and not something monitored in system control.
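
For anyone wondering what "sifting through the logs" amounts to in practice, a very rough sketch of that kind of check is below. The log format, keywords and workstation ids are all invented for the example; the real CXSS logs and tooling look nothing like this.

[code]
# Illustrative sketch: sweep start-up logs for anomalous session activity and
# flag workstations that need a simple re-start. Format/keywords are invented.
import re

SESSION_ERROR = re.compile(r"session.*(refused|stale|mismatch)", re.IGNORECASE)

def suspect_workstations(log_lines):
    """Return workstation ids whose start-up logs show anomalous session activity."""
    flagged = set()
    for line in log_lines:
        # assumed layout: "<time>Z WS<nn> <message>"
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and SESSION_ERROR.search(parts[2]):
            flagged.add(parts[1])
    return sorted(flagged)

if __name__ == "__main__":
    sample = [
        "0120Z WS14 session handle stale after restart",
        "0121Z WS14 session id mismatch on reconnect",
        "0125Z WS07 start-up complete, sessions nominal",
    ]
    print("re-start recommended for:", suspect_workstations(sample))
[/code]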
BDiONU is offline  
22nd May 2002, 08:52  #76

Join Date: May 2002
Location: Down South
Posts: 70

Take3:

I myself am a technician who calls himself an engineer - it's great. The CXSS experts do an amazing job, one which I still can't understand how they cope with. In fact it has to be said everyone does an incredible job when our management doesn't seem interested in motivating staff or letting us have some small comforts on an otherwise cold day.

As for our chums the ATCOs and ATSAs, well, where would we be without them? Seriously, they always cope admirably with things that would quite considerably upset a normal user. Well done to us all.

Let's hope good ole' AIX and its bigger brother CXSS get along from now on and stop having all these barnies!!!
All Systems Go is offline  
23rd May 2002, 16:34  #77
Beady Eye

Join Date: Feb 2001
Location: UK
Posts: 1,495

All Systems Go:

Yes, here's hoping CXSS and AIX prove merry bedfellows once again, although I would Touch them.
BDiONU is offline  
