PPRuNe Forums - Problems at Swanwick

PPRuNe Forums (https://www.pprune.org/)

- ATC Issues (https://www.pprune.org/atc-issues-18/)

- - Problems at Swanwick (https://www.pprune.org/atc-issues/53571-problems-swanwick.html)

crowman

17th May 2002 23:24

Bexil 160, your suppositions are totally correct! Also it is almost certain that the CFMU TACT computer went down due to the rapidity and complexity of the flow restrictions placed by LACC due to their inability to split sectors.

I SAY NO MORE AS MY COVER IS EXPOSED FOR MANY TO SEE!! :p :p :p

Scott Voigt

18th May 2002 01:49

BEX;

Sorry to hear of another BAD day at NERC... You are making me just want to go to work and hug DSR and our old HOST <G>. Next time you come to visit, I should be able to get you in the building...

Take care

Scott Voigt

18th May 2002 01:56

Take3call5;

I think that I would have to take exception to your comment about it not being the suits' fault...

If the requirements statement was correctly written and then the contractor was kept to a correctly written contract, then you wouldn't have these issues.

We had some serious problems with our iteration of the "new" and "improved" NAS, or ISSS as it was known. We finally were able to convince the suits that this was NOT going to work in a manner that was better or even as good as the old system. It was scrapped and then we took what elements did work and to save money put it together into something that would at least work and replace an old and failing system. Now we are working at slowly replacing all the other items that need to be updated. We know that the NAS software is VERY COMPLEX and must work all the time in real time. We are not in a hurry for the sake of safety and the customers....

regards

Brookmans Park

18th May 2002 02:59

problems at swanwick

having just read the posts re the European cup in Scotland
could there be any connection with the extra traffic which this generated and the subsequent problems at SwAnNwIkK ??:o :confused: :mad:

BDiONU

18th May 2002 07:17

Dan Ryan:
No. Traffic levels were not the issue. Won't find out 'till I go into work on Monday what really happened but it appears from comments in this thread that there was a problem with one workstation. This caused the watch management to delay 'splitting out' the Ops room to accommodate daytime traffic levels until they could be sure that the problem wasn't going to be replicated on other workstations. At least that's my surmise.

Scott:
I fully agree with your comment about requirements statements etc. However, you underestimate the complexity of the system at LACC. It is not very unusual for a 'fix' on one part of the system to introduce regression or an unwanted 'feature' on another part. It can be impossible to know this until a 'problem' is found and the engineers can examine the data (personally I would have it re-written using an object oriented language so that regression cannot be introduced in other areas).

Sounds like the upgrade that was put on just didn't 'take' properly on one workstation (although they all appeared to work (albeit without flight data) when checked by me and my team) so that, for safety, they fully checked out all the rest before committing to daytime traffic levels. IMO the only decision which they could make under the circumstances.

In my opinion I am sure that EVERYONE is aware of the problems caused to other units, to the airlines and to all the other services connected with flying when there's a problem like this. Just wish there was an easy solution to 'fixing' things, other than relying on everyone else's professionalism to 'carry on regardless' :mad:

Check 6

18th May 2002 07:32

A couple of questions:

Do these "glitches" at Swanwick affect London Mil?

Did London Mil move also?

Cheers,

BDiONU

18th May 2002 07:42

Check 6:

London Mil are still at West Drayton, hopefully will move with LTCC (circa 2005??). These problems will not have affected them, except where they had traffic to join controlled airspace. Quite probably they handled some commercial flights who were willing to fly outside airways.

N.B. There are still military controllers (LJAO) working with (and at) the LACC at Swanwick, just as there were at LATCC.

POMPI

18th May 2002 09:17

Never did like CXSS - too complex.

chippy63

18th May 2002 10:25

SLF is self-loading freight, ie folks like me.

HEATHROW DIRECTOR

18th May 2002 10:26

Yes, all three Lon Mil controllers are still at West Drayton!!

Sonic Cruiser

18th May 2002 11:26

EGLL must have been interesting yesterday morning, where did all the inbounds (particularly BA T4 Long Haul) park if very few of the outbounds moved?

GMP and GMC must have been busy positions to be operating yesterday. Were inbound holds lengthy as well or were restrictions put on the number of Flights allowed in to Heathrow??

I read that there will be backlogs right across the weekend as airlines try to get the aircraft back in the right place.

Scott Voigt

18th May 2002 21:23

Hi Take3Call5;

Actually I probably know the system that you have fairly well since all that you have had and now have were off shoots of what we have or decided to not do...

I completely understand the issues with doing something to the software and then affecting something else in the system. That is why we do a LOT of testing on all of our patches and then test them at all 20 facilities when we install them here. Guess what, even with doing that it doesn't always work. We had a failure just last month on a new patch due to those issues, but we do the install on the midnight shift and bring it back up before the traffic starts getting busy, so if the system flops right away, there isn't a lot of impact when we reload the old system and bring it back online...
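The install-then-fall-back routine described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `install_with_fallback`, `load_release` and `health_check` are hypothetical names, not anything taken from the FAA or NATS systems, and a real check would be far more involved than a single yes/no answer.

```python
from typing import Callable, List


def install_with_fallback(
    load_release: Callable[[str], None],
    health_check: Callable[[], bool],
    old_release: str,
    new_release: str,
) -> str:
    """Load new_release during the quiet period; reload old_release if the check fails.

    Returns the name of the release left running.
    """
    load_release(new_release)
    if health_check():
        return new_release
    # The new patch flopped: reload the known-good system before traffic builds.
    load_release(old_release)
    return old_release


# Stand-in functions for illustration only (hypothetical).
loaded: List[str] = []
running = install_with_fallback(
    load_release=loaded.append,
    health_check=lambda: False,  # pretend the new patch fails right away
    old_release="patch-A",
    new_release="patch-B",
)
print(running, loaded)
```

The point of doing this on the midnight shift is that the reload in the failure branch costs little while traffic is light.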

As to the complexities of any sort of NAS system replacement, we completely understand that too and that is why we are now going with the thought of replacing small parts of the NAS one at a time and then turning them off one at a time. Do this until we get to the radar and data processing and then replace those. Don't try to do a big bang. There is too much at risk to do it that way, as well as a training nightmare for the work force. We don't let pilots get into a new aircraft with just a few days of training over a period of a couple of months. They go through a LOT of training and are taken off the line as it were to immerse in training. Obviously with our staffing in most of the busy parts of the world, we just can't do this. So do it the smart way. Go in baby steps and get the whole thing done over a course of years so that you have minor training issues that are easy to deal with and there is very little if any disruption to the users.

regards

Jay Foe

18th May 2002 22:06

Regarding the problems of the last few days, I think today's 'Matt' Cartoon on the front of the Telegraph was quite amusing, (fingers crossed this works):( :( :( :(

http://www.telegraph.co.uk/core/Matt...Matt.telegraph

Wahey it worked!!!!!!!! I can do modern technology. Now where's PAR 2000............

DCS99

19th May 2002 11:37

I work on mainframe airline Res and DCS systems, most recently for a certain carrier which had a large cross on its tail, so I can imagine with knowing dread, the kind of situation that happened last week.

I've written and tested stuff as well as it can be, loaded it and it's gone wrong. OK, we follow the fallback plan, clean up any mess, re-test and try again. It happens to everyone at some point.

The systems are damn complex but we work equally damn hard to make sure we've thought of everything before going live and we do take it personally when others say things like "outsource IT!", or "don't these programmers/engineers know what they're doing?".

We want to deliver quality all the time, because we know the business and the terrible effects of even the smallest cock-up, but sometimes it's like trying to add another storey on a building between 2 existing floors. It ain't easy, but that's the existing architecture we're working with!

Back to Friday's snagettes:
The worst kind of problem is when a software change has been loaded and it doesn't go wrong till some hours later. At that stage, the fallback option might not be on the cards. It's fall forward but the morning shift may not know exactly what happened the night before, the logs have crashed or whatever.

To try and prevent these situations, you need:

1 Decent test systems with real live system data

2 Investment in Automated Volume testing tools (programmers dislike repetitive testing and anything that automates it is a great benefit).

3 For big changes, get the right people in on the night.

4 Check out as much as you can during the quiet hours at night

5 Pay them decent compensation. They should stay behind till the morning shift comes in and handover is complete.
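Point 2 in the list above (automated volume testing) can be illustrated with a very small replay harness. This is a sketch under assumptions, not a description of any tool used at ATC or in airline systems: `replay`, `parse_flight_level` and the recorded inputs are all made up for the example.

```python
from typing import Callable, Iterable, List, Tuple


def replay(handler: Callable[[str], object],
           recorded_inputs: Iterable[str]) -> List[Tuple[int, str]]:
    """Feed every recorded input to handler and collect (index, error) for failures."""
    failures = []
    for index, item in enumerate(recorded_inputs):
        try:
            handler(item)
        except Exception as exc:  # a volume test wants every failure, not just the first
            failures.append((index, repr(exc)))
    return failures


def parse_flight_level(text: str) -> int:
    """Hypothetical handler under test: turn 'FL350' into 350."""
    if not text.startswith("FL"):
        raise ValueError(f"bad flight level: {text!r}")
    return int(text[2:])


# Hypothetical recorded data: every third entry is malformed.
recorded = ["FL350", "FL120", "350"] * 1000
failures = replay(parse_flight_level, recorded)
print(len(failures), "failures out of", len(recorded))
```

Running the same recorded data against the old and the new build and comparing the failure lists is one cheap way to spot a regression before going live.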

I don't know the set-up at ATC other than through second hand sources, so flame me if I'm jumping to conclusions, but it seems like not all of these points were actioned for the change which went wrong on Friday.

I also fear that Point 5 - Paying Overtime - was something the management wanted to avoid - or am I speaking out of turn there?

no sig

19th May 2002 14:34

Would anyone like to hazard a guess at how many more of these 'glitches' we are going to have to cope with?

If we are running at risk that this will occur again in some form then tell us (the airlines), we'll do what we can to assist as we did with the change over. But this failure cost my outfit in excess of £400K probably more, resulted in the cancellation of 44 flights and all the misery that goes with it.

Really folks, we need to do better.

BEXIL160

19th May 2002 15:27

Three things will affect UK ATC for the foreseeable (sp?) future..

1) Serious lack of validated Controllers and Assistants at Swanwick

2) NAS (at West Drayton) could easily FLOP again, or the link to it could be lost (not usually too serious at LATCC, but even startovers can ruin Swanwick's whole day)

3) Unknown (and known) faults within the highly complex Swanwick software

NONE of the above are likely to be fixed in the short term, and the staffing situation is a LONG TERM issue. It takes YEARS to train and validate ATCOs, meanwhile more are retiring / leaving/ on long/short term sick, than are being replaced.

Once again, I am very sorry to be the bearer of bad news, but I'd rather be "open and honest" with you than the claptrap that comes out of One Kemble Street. This ain't spin, it's the truth.

Sadly, yours
BEX

no sig

19th May 2002 16:50

BEXIL
Thank you as ever. I suppose what's hard to swallow is the thought that after three events, albeit apparently unrelated, we are likely to face the mayhem of Friday again. I know it's a complex system, however, the fact that we have had three failures really does make one question the integrity of the software/system/ and the management of same.

chiglet

19th May 2002 19:44

Three quotes spring to mind
1 "Our Skies Are NOT For Sale":p
2 "The Buck Stops Here":cool:
3 "Action This Day"
Mr Blur and our "esteemed" CE, Mr Eveready have [obviously] not studied Modern History, or read the 'papers, or listened to the troops, but then again, what else is Chuffin' new? :mad: :mad:
we aim to please,it keeps the cleaners happy

2 six 4

19th May 2002 21:56

Went to the pub on Friday night .:p A friend of mine who is a secretary was concerned. Was it my computer that failed and caused all those delays ? Well sort of ... I said. After consoling me with a pint she told me they had been discussing it in the office when it came on the news.

Why don't you do what we do when the damned machine stops ? What's that I naively ask ? Just type CTRL + ALT + DELETE it works every time !!!

DOH We pay ££ millions for sophisticated software companies to design this complex beast and my mate tells me the answer down the pub :D :D

Where's that CTRL button and I'll tell Cheese and Ham .......

Iron City

20th May 2002 00:45

Hope your Swanwick is not the same as the voice switching and control system in the States (primary ARTCC voice com switch): when they control-alt-delete that, it takes a couple of hours to do a complete cold reboot. Cans and string in the meantime, and a couple of BIG megaphones.

Scott Voigt

20th May 2002 04:58

Iron City;

Actually VSCS takes about 35 minutes to reboot now... If that were to happen though we would switch over to VTABS and would continue working until the reboot was complete.

For those who don't know, VTABS (VSCS Training and Backup Switch) was installed since we didn't feel that we wanted the live system hooked to the training communications switch, and because we wanted to have something that would work in the event that we took a power bump or a software problem with VSCS. VTABS has a battery backup that is independent from the critical power UPS and system just in case. The system has been VERY reliable however...

regards

Tanglefoot

20th May 2002 16:47

To all fine people out there,

From info I have received today, the Swanwick sector split problem was caused by a network glitch (love that CXSS) – hence they are still running on the new software drop and it is working.

Would love to know how much this has cost – can anyone give me an approximation – but I repeat my earlier comment:
This would not have been a problem ---NO COST/NO DELAY--- if ‘the systems specialist engineers’ had been around at the time of the DDC.

Unfortunately this costs a few extra peanuts and the current tight-arse management penny-pinching policy resulted in Friday’s situation – nothing else.

The really good news is that management have re-organised these engineers for no other sensible reason than to impress their new owners TAG and in the hope that some will quit to reduce the redundancy count – but heh – it’ll never happen again right!!!!

BDiONU

20th May 2002 17:17

Tanglefoot et al:

Yes the 'trouble' at Swanwick was indeed a network problem, due to the interface between CXSS and a COTS product. This problem was first identified in June last year and the engineers had been trying to 'fix' it but it has proved to be a vastly more complex issue than was first hoped for.

It is incorrect to assert that had specialist engineers been around it would have been sorted immediately. It took some searching by the Lockheed Martin Analyst in the low level codes to find out what had gone wrong.

As there is an engineering investigation ongoing which will produce a post-mortem result soon I will not speculate on here about solutions.

BDiONU

20th May 2002 17:20

Forgot a p.s.

The delays were HUGELY exacerbated by a failure of the CFMU in Brussels.

NERC Dweller

20th May 2002 19:01

Firstly apologies to ATC, Friday must have been an extremely stressful day.

That aside I agree with Take3Call5. The problem affected only a single workstation and once identified was VERY simple to correct.

I would like to say more but won't at this stage as I believe this to be counter productive and unfair to some people that give their absolute best.

Tanglefoot

20th May 2002 20:43

Take3Call5

My sources have indeed been busy.
Point 1 is that CXSS is a COTS product – delivered as such by LM.
Point 2: the problem is indeed complex but the symptoms are apparently easy to deduce and, as NERC Dweller states, it is VERY easy to correct. I stand by my earlier comment. With the correct skill sets available at the time, this would have been identified and recovered immediately (I believe the term is worked around – not fixed). Call me old fashioned but we should not be costing our airline customers millions of pounds whilst using the operational system as a LM test harness.
Point 3: if this is a complex ‘fix’ it will undoubtedly cost huge amounts of money to implement a fix. If the problem is VERY easy to deduce and recover from, WHY fix it at all! (after all, how many times a year do you have to reboot your PC?).

The results of the investigation will make interesting reading but I have my doubts if NATS can admit its mistakes to itself let alone the world on this one.
:confused:

no sig

20th May 2002 21:16

Take3Call5

True they were, but by that time the die had been cast for a bad day with the first wave of traffic well behind schedule; when we lost CFMU it became 'very' bad. In relative terms, to my airline, the Swanwick failure probably caused 20 cancellations; with TACT falling over we ended up with 44!

no sig

21st May 2002 09:34

Nerc Dweller, you wrote

' The problem affected only a single workstation and once identified was VERY simple to correct. '

I hate to think what would happen if we had a 'serious' problem with the system!

Stan By

21st May 2002 10:37

I'm probably being very naive here, but aren't the Swanwick workstations powered by what are effectively PCs, which store their software on a hard disc drive?

Why not when doing these software changes put in a new HDD, and keep the old one safe, so if the new software doesn't work simply put in the original HDD and reboot?

The cost of this would be less than the cost of sending a first class letter to everyone in the company!

Cue the engineers to tell me why my perfect world won't work;)

Dinosaur

21st May 2002 13:14

Stan By -- I think it's been mentioned here already, but perhaps not loudly enough: The problem last Friday was not with the new software release. It was an existing problem, always thought to be inoffensive until it chose to manifest itself in a new and exciting way.

When we install a new release, we in fact do exactly as you say: New and old co-exist on disk, and we can switch back to the old one "easily". (Actually, switching back involves the same impact to operations as the original switch).
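As a rough picture of the "new and old co-exist on disk" arrangement, here is a minimal sketch. It is an assumption-laden illustration: the `Workstation` class, the slot names and the release labels are hypothetical and say nothing about how the Swanwick workstations actually store or select releases. The one detail taken from the post is that switching back costs the same restart as switching forward.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Workstation:
    """Hypothetical workstation holding two releases side by side on disk."""
    releases: Dict[str, str]  # slot name -> release label
    active: str               # slot currently in use
    restarts: int = 0

    def switch_to(self, slot: str) -> None:
        """Make another slot active; every switch, forward or back, costs a restart."""
        if slot not in self.releases:
            raise KeyError(f"no release in slot {slot!r}")
        if slot == self.active:
            return
        self.active = slot
        self.restarts += 1


ws = Workstation(releases={"old": "R1", "new": "R2"}, active="old")
ws.switch_to("new")  # the original switch to the new release
ws.switch_to("old")  # fallback: same operational impact as the first switch
print(ws.active, ws.restarts)
```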

But on Friday, it wouldn't have helped.

--Dinosaur

BDiONU

21st May 2002 15:55

Tanglefoot:

1. The CXSS used by NERC system is a hybrid. The COTS product I'm referring to (cannot name it obviously) is central to the use of sessions.

2. It was repairable by simply restarting the affected single workstation. However that doesn't mean that a host of highly skilled technicians had to be on hand to solve any problems that may have arisen following a DD&C (or any start up, and it was the start up which was where the 'error' came in). There are several solutions which have been aired, all to identify that this particular fault has been introduced. Knowing it's there means a simple re-start of the affected workstation, which is hardly rocket science.
I don't understand your reference to the NERC system being an LM test harness. It is a fully working operational system shifting traffic every hour of every day. Or more correctly the Operational Staff are using it to shift traffic.

3. I concur.

Why would NATS not publish (in the parts of the company that need to know) the results of its own investigation? There are many lessons to be learnt here and I frankly doubt, whatever management's other failings, that they wouldn't want those lessons spread widely.

Tanglefoot

21st May 2002 18:46

Take3Call5

1. Agree with your definition of the ‘offending COTS’ product.

2. Agree that this problem was ‘fixed’ by restarting a single workstation. My point is that with your highly skilled technicians (a term I think your engineers will liken to an ATCO being called an operator), this problem would have been deduced before it became a problem at splitout time. This time it was a network problem. Next time it could be something completely different. Doesn’t it make sense to have your system experts present at the only time you stop/start the system – the time it is most likely to fail/run into problems?

As for my last jibe. NATS has a pretty bad record at admitting bad news to itself, after all, how many of us were misled into relocating to Hampshire early on promises of it will be operational 96/98/2000/2001 etc.
;)

BDiONU

21st May 2002 19:06

Yes Tanglefoot, NATS Senior staff were pretty poor (Well, OK, Gloves Off they were CR*P) at giving honest and truthful information about the move to Swanwick. Goodness only knows why, it was VERY obvious that the system wasn't ready!!!!!
I still take issue with the thought that having our technicians/engineers/analysts/system experts (amongst which I'm counted) on hand would actually have picked up the problem and resolved it PDQ. Also we have started and re-started the system quite literally hundreds of times without any serious problems. This was, not to trivialise things, just a 'Gotcha!'

All Systems Go

21st May 2002 19:19

The day experts you refer to on a DD&C night are there for the duration of the said activity alone. Correct me if I'm wrong, but this problem was only found once they tried to split out the sectors - not until way after the DD&C team had gone off to dream land after a job well done and all the boxes ticked. I'm willing to be proved otherwise as I wasn't there - I only heard the gory details the other night from a colleague. I think you under-estimate the skill and ability of the "technicians" as they have been called.

BDiONU

22nd May 2002 06:07

All Systems Go:

You are correct, the bug wasn't found until the day shift started splitting. However the low level code data logs do show the session problems from when that workstation was booted up at about 0120Z.

One thing to note, this problem was not caused, nor affected by the DD&C. It is a problem which sometimes appears on a workstation re-start, it could happen at any time when there's a re-start.

Not sure if you thought I was inferring that the NATS Engineering staff (the word technicians is obviously an emotive one!) skills and abilities were anything other than excellent. I apologise if I came across that way. This problem was not identified until a couple of CXSS experts sifted through the low level code data logs and spotted anomalous activity. Not something which would generally smack you between the eyes and not something monitored in system control.
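One of the ideas aired in this thread is to detect the fault after a restart instead of waiting for the sector split. A minimal sketch of that kind of check is below. Everything in it is an assumption made for illustration: the log line layout (`time workstation level message`), the workstation ids and the function name `flag_session_anomalies` are invented, and the real CXSS low level logs will look nothing like this.

```python
from typing import Dict, Iterable


def flag_session_anomalies(log_lines: Iterable[str]) -> Dict[str, str]:
    """Return {workstation: time of first session error} from hypothetical log lines.

    Assumed line layout: "<time> <workstation> <level> <free text message>".
    """
    flagged: Dict[str, str] = {}
    for line in log_lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue  # skip lines that do not match the assumed layout
        time, workstation, level, message = parts
        if level == "ERROR" and "session" in message.lower():
            flagged.setdefault(workstation, time)  # keep the earliest occurrence
    return flagged


# Made-up sample data, loosely modelled on the 0120Z restart mentioned above.
sample = [
    "0120Z WS07 INFO boot complete",
    "0120Z WS07 ERROR session setup failed",
    "0121Z WS08 INFO boot complete",
    "0340Z WS07 ERROR session setup failed",
]
flagged = flag_session_anomalies(sample)
print(flagged)
```

A workstation that shows up in the result would simply be restarted before being relied on for the morning split.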

All Systems Go

22nd May 2002 08:52

Take3:

I myself am a technician who calls oneself an engineer - it's great. The CXSS experts do an amazing job, one which I can't yet understand how they cope with. In fact it has to be said everyone does an incredible job when our management doesn't seem interested in motivating staff or letting us have some small comforts in an otherwise cold day.

As for our chums the ATCOs and ATSAs, well. Where would we be without them? Seriously, they always cope admirably with things that would upset quite considerably a normal user. Well done to us all.

Let's hope good ole' AIX and its bigger brother CXSS get along from now on and stop having all these barnies!!!

BDiONU

23rd May 2002 16:34

All Systems Go:

Yes, here's hoping CXSS and AIX prove merry bedfellows once again, although I would Touch them.

All times are GMT.
