
View Full Version : Problems at Swanwick


Red Four
17th May 2002, 08:04
Problems at Swanwick this morning again, after the shutdown last night. Long delays expected. Richard Everitt apologises on Radio 4.
Oh well, at least it was delayed from last night!

More details please.

HEATHROW DIRECTOR
17th May 2002, 08:43
Oh well, another busy afternoon looms. Saves us getting bored.. and keeps us out of that damned hot sun!

BEXIL160
17th May 2002, 08:47
May be more "interesting" than you think, HD. My Wx forecast says lots of Cbs later... have fun ;)

Best rgds
BEX

Standard Noise
17th May 2002, 09:10
Up to 4 hour delays into some airports in the south - LGW/LHR.
I believe LCY is closed to inbounds at the moment, can anyone confirm?
BRAL ops on the ball for a change, filing lower levels and avoiding some slots.

Have fun in the SE this afternoon chaps!

Feel sorry for the SLF stuck in the terminals with only overpriced tea/coffee and stale sandwiches for company!:p

Asda
17th May 2002, 09:37
With the operation stuck in the TDU and daytime staff numbers in, it could be that for the first time Swanwick is actually overstaffed!:eek:

BEXIL160
17th May 2002, 09:44
The BBC have picked up on the fact that the DD&C performed last night should have happened on Wednesday.

They are also saying that it was delayed until last night due to the number of extra flights connected with the European Controllers Cup..... (I wish!), no, sorry, I mean the European Cup Final.

No mention of the real reason for the delay until Thursday night, which was not enough staff at Swanwick. As it was several sectors were closed on Wednesday night.

Until VERY late in the day, Swanwick management were INSISTING that the DD&C go ahead on Wednesday.

As to today's problem. I wonder when they discovered it wasn't possible to "split" the sectors? Not a nice feeling when the traffic is relentlessly coming off the Ocean and you're getting buried.

Getting Swanwick back up will be a challenge. Will a complete shutdown (of Swanwick) be required before rebooting the whole thing, a bit like DD&C in reverse? If so when? A tough decision.

All this makes me very sad.

Rgds BEX

Self Loading Freight
17th May 2002, 09:53
Well, something's up and running again -- loads of delays have just melted away completely. I was watching an ETD on a flight from EDI -- nominally 20:15, it was at 00:16 for most of the morning. Then it went back to 22:something, and now it's back on schedule.

How good are those rollback procedures?

R

NICK HEFF
17th May 2002, 10:01
Calling all controllers, I got a rubbish roster this afternoon: SOU to DUB, DUB to SOU, SOU to EDI, EDI to SOU, then SOU to IOM. Any help in actually getting any of the above completed before midnight would be great. Yeah, and the wx looks rubbish too!!

sony backhander
17th May 2002, 10:04
Well, at least it will all be sorted by the summer(?)
Not noticed anyone mention the mysterious workstation freezes that happened a while back - as far as I know systems never found out what happened, they just cured themselves.....
You're right Bex, it IS sad (especially when we all expected this)

Not Long Now
17th May 2002, 10:07
N H, suggest you pack plenty of reading material and extra sarnies. Speak to you this afternoon, evening and probably night...

ZIP250
17th May 2002, 10:40
Does anybody out there know if the operation is SWIMMING or fully electronic but in the TDU?

If the former then I suspect that the only way back, because of NAS link problems, is for the airspace to become very empty before an attempted reboot.

Z

BEXIL160
17th May 2002, 11:03
No, I think they came out of the TDU after the DD&C, went electronic and THEN when it came time to split for the morning, they couldn't OPEN SECTORS.

BEX

P.S. The BBC is reporting the system now fixed, but LONG delays are inevitable

ZIP250
17th May 2002, 11:13
Thanks Bex, that should mean a rapid return to normal once they can split again. Unfortunately the backlog will take a while to clear.

If any SLF are reading this thread, please accept the Swanwick ATCOs' and ATSAs' apologies for any inconvenience, which is not of their making. NATS management were warned that this sort of thing could happen and will continue to happen for the foreseeable future.

Z

sony backhander
17th May 2002, 12:08
ok, what's an "SLF"?

BEXIL160
17th May 2002, 12:11
SLF=Self Loading Freight..... err.. Passengers

BEX

BDiONU
17th May 2002, 12:44
I was in for the DD&C last night and it all went OK, no problems at all! Bandboxed back in Ops at 0130Z and checked out all the Workstations, no problems. Left the control staff to it, we were off home by 2Z. Don't know why the day watch appeared to be unable to split out, possible bug in the software drop?

Iron City
17th May 2002, 13:03
"possible bug in the software drop" Poosible. Wouldn't be the first time and won't be the last, either.

Flight Plan Fixer
17th May 2002, 13:14
The NAS AIRAC update went perfectly, no probs with the NASSFS link either, NAS shutdown between 0009 and 0039 and the beast arose from its slumbers first time.

Possible instability with Swanwick software drop? Did a rogue flight plan do some damage within the workstation token ring?

Condolences to the morning watch (Air Traffic AND Tels) and passengers involved.

1261
17th May 2002, 13:18
Token ring, now there's an expression I haven't heard for a long time!

What's it used for at Swanwick?

EarlyGo
17th May 2002, 15:28
Absolute pain of a morning shift: loads of flight plans coming out but few getting airborne, London City fills up, lots of level capping and bizarre reroutes to try & avoid NERC sectors, then Pease radar has a wobble, saturation of SSRs due to the vast number of a/c still on the ground at EGLL, NERC preparing to go to SwIMM mode - there was talk of going to SwIMM at 3pm local.

As usual, operational staff at the coal face kept the show running, but we're all getting a bit fed up with this now. Pax, airlines & NATS' own staff are being let down big style every time these problems occur. I dread to think what reputation NATS is getting with the flying public.

BDiONU
17th May 2002, 15:28
1261

All the LANs at LACC use the IBM token ring system.


Iron City:
You're quite right, neither the first nor the last time. No matter how exhaustive your testing, with a system as complex and large as LACC there are BOUND to be bugs which only surface when you drop onto the operational system and let the 'customers' loose on it.
We do our best but we are human and not perfect.

OLNEY 1 BRAVO
17th May 2002, 15:46
Just looking at the easyJet Website, it talks about:

"Following a significant failure in the UK's air traffic control system this morning, there has been a subsequent failure in the Brussels Air Traffic Slot Co-ordination Management System. "

I presume they mean CFMU - any details please?:(

techofish
17th May 2002, 16:00
The problem was due to one ATC workstation computer.

When it started up after the DDC, it had problems communicating with other computers in the system.
When it came time to split out the sectors from this workstation, it couldn't copy its flight data to the new workstation. So when ATC tried to use the new workstation, it received its first flight data update and went "I don't know about this flight, I'd better restart".

Only the sectors on the one problem workstation were affected, but nobody could tell what the problem was from the Monitor System. So they didn't split the other sectors.

This is a problem which has been around for a while, but has been deemed by management to be not urgent, and isn't due to be fixed for another year.
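
If it helps to picture it, here's a toy sketch of that failure mode (class and field names are entirely made up for illustration - this is not the real NERC code): the bulk copy to the new workstation silently fails, so the first incremental update it receives refers to a flight it has never heard of, and it restarts.

```python
# Toy illustration only - hypothetical names, nothing like the real software.

class Workstation:
    def __init__(self, name):
        self.name = name
        self.flight_data = {}          # callsign -> flight record

    def bulk_copy_from(self, parent, copy_ok=True):
        """On sector split the new workstation should get a full copy of the
        parent's flight data. If the copy silently fails, nothing complains
        here - the problem only shows up when the first update arrives."""
        if copy_ok:
            self.flight_data = dict(parent.flight_data)

    def apply_update(self, callsign, update):
        """Incremental flight data update from the rest of the system."""
        if callsign not in self.flight_data:
            # "I don't know about this flight, I'd better restart"
            return f"{self.name}: unknown flight {callsign} -> restarting"
        self.flight_data[callsign].update(update)
        return f"{self.name}: {callsign} updated"


parent = Workstation("BANDBOXED")
parent.flight_data = {"BAW123": {"level": 350}}

split_out = Workstation("NEW-SECTOR")
split_out.bulk_copy_from(parent, copy_ok=False)   # the copy never arrives

print(split_out.apply_update("BAW123", {"level": 370}))
# NEW-SECTOR: unknown flight BAW123 -> restarting
```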

Steep Approach
17th May 2002, 16:58
I blame September 11th! :D

1261
17th May 2002, 17:05
Management made these problems; let management fix them!

I hope they don't fix it. More days like today (i.e. having time to read The Herald cover-to-cover) would be much appreciated. Thanks, Richard!

BDiONU
17th May 2002, 17:26
1261:

It's not management who made the problem, it's just a consequence of a very complex system. No chance of 'them' fixing it; that's down to the engineers and programmers at the coal face who take considerable pride in their work and take it personally when things 'fail' to work as they envisaged.

Tanglefoot
17th May 2002, 17:31
Looks like this system problem could have been avoided totally if the day support engineers had been in - but that costs money - at least £1000 total and heh, they don't even work for Swanwick anymore because NATS management don't need 'experts' for COTS systems - and yes, the bit that broke is COTS.

TAG be warned. :eek:

1261
17th May 2002, 18:19
Take3Call5,

I think that Tanglefoot has hit the nail on the head. I know that it's the guys at the coalface who will always end up digging us out of whatever hole we're in, be it an engineering or ATC one!

I was referring to management metaphorically "fixing it" by allocating the resources required (both in terms of cash and manpower) to do the job properly with a new (and unproven, at least operationally) system.

The first step on the road to recovery is, after all, admitting that you're sick :)

nodelay
17th May 2002, 18:40
www.timesonline.co.uk/article/0,,1-299075,00.html

BEXIL160
17th May 2002, 19:01
WHOA THERE....

Let's see if I've got this straight. Today's chaos was caused by a KNOWN problem that can occur on a SINGLE workstation, which our beloved management KNEW ABOUT but deemed "not urgent".

Is this TRUE?

BEX :mad:

M.Mouse
17th May 2002, 19:36
What frightens me is that this is the third time that I can recall in recent months where total chaos has been caused by systems failures in ATC.

The cost in money terms to all airlines but in particular to my own is astronomical.

That is before we look at the loss of goodwill from passengers.

Where will it end?

No criticism of the ATCOs caught up in it, you must be as frustrated as we pilots are.

ContactLondon
17th May 2002, 19:41
Once again the system fails and the ops staff (atsa, atco and atce) are left to clear up the mess!

Funny thing though, for the first time that I have seen, the TV news said that the SWANWICK computer had a problem, whereas usually they blame the one at WEST DRAYTON.

Change in attitude from the management? They are usually so proud of their very expensive system in Hampshire, that I am surprised they didn't blame West Drayton again.

Or is that just the same computer?

Now I'm confused.

eyeinthesky
17th May 2002, 19:55
It's all semantics to cover up for a system which seems to be teetering on the edge of stability.

Everitt is correct in that this is the first failure attributable solely to the NERC system which has brought the place to a halt during the day. What he skates over, however, is the fact that the first NAS failure was corrected relatively quickly, but the NERC system had to remain off for the REST OF THE DAY because of the perceived difficulty in getting it back on line again. That in my book is a failure of the NERC system, as it failed to do the job for which it was designed: move air traffic more efficiently than we did at LATCC. Had we still been at LATCC on that day we would have been up and running again in about an hour.

You might also not be aware that at about 1230 local today the NAS link failed for about 2-3 minutes and we all prepared to go manual. This was shortly after we had heard that the TACT computer at FMP Brussels had also failed. Delays peaked at about 178,000 minutes total, I think. That's 123.6 days!!

All of this points to a system which is inherently unstable, and cannot cope with inputs which it doesn't like. The computer upon which I am typing this has protection in it which stops me doing things which will make it crash. It cost just under £1000. Why can't a multi-million pound system do the same???:confused:

Buster the Bear
17th May 2002, 20:58
RAILTRACK

Scottie Dog
17th May 2002, 21:23
Firstly thanks for all the work that ATC and airline personnel do during these difficult times, it does however lead me to ask one question.

As a corporate travel agent, I have spent today fielding calls (until I finished at 2030bst) from travellers stranded at Heathrow, and other European airports, asking what is happening to their flight. When I check Flifo in the BA and BD systems there are no updates, not even to say the flight was cancelled, and I then have to start searching through the Net for airport screens (which mostly only show arrivals) to try and find out what is going on.

In this day of supposed technical automation I always feel at a loss as to why, when the data is most required, it is not available. The public service voices always seem to be saying that travellers should contact their airlines, however it is then normally impossible to get through on the phone lines, and their agents - who SLF expect to know all - are stuck as well.

Oh well, back to the drawing board and another week that will probably see us sorely tried by more disruption (only joking) somewhere in the world that will cause havoc - whether it be air, rail or any other means of transport.

Have a good weekend all and keep on smiling!

Scottie Dog

160to4DME
17th May 2002, 21:30
In this day of supposed technical automation I always feel at a loss as to why, when the data is most required, it is not available.

I have a feeling the Swanwick ATCOs who turned up for duty this morning were thinking exactly the same thing!!!! :(

Scottie Dog
17th May 2002, 21:34
Didn't quite appreciate how the wording might have looked, but a very true and poignant comment.

Trust it will soon be resolved - well maybe this century!

Scottie Dog

sennadog
17th May 2002, 21:37
And I thought that I was having a bad day!

Hope you all manage to sort this one out and just to let you know that one of the SLF appreciates what you are doing/going through.

Thanks guys - have a beer on me.:D

Bewli-Begto
17th May 2002, 21:45
How many times can this happen again until something really nasty happens? We're just about to enter our busiest period of the year - even after September 11th this summer will be AS busy as last year, if not busier! We need to have more confidence in the system and know that it will not let us down - we just can't do that at present. I expect all the boys and girls did an excellent job today (like they ALWAYS do in extreme circumstances) with picking up the pieces and, knowing what they're like, no doubt injecting a bit of humour into the situation! Well done all!!! Whose Watch's turn is it to cop it next???

AF1
17th May 2002, 22:19
Spare a thought for the unlucky Paddys that have to deal with the resultant flak every time there is a problem with the London computer. This is the third time in as many months that there has been an explosion of traffic through the Dublin and Shannon FIRs.

Mind you, it's not just when the computers refuse to work at Swanwick that these problems occur. It's a daily occurrence: every morning with the eastbound NAT, London imposes gallons of reroutes on the Irish so that they get what they want coming into English airspace.

In effect, Ireland has taken over the complexity of the airspace so that London don't have to deal with it. How much longer can NATS struggle on before the whole thing sits down?

crowman
17th May 2002, 23:24
Bexil 160, your suppositions are totally correct! Also it is almost certain that the CFMU TACT computer went down due to the rapidity and complexity of the flow restrictions placed by LACC due to their inability to split sectors.

I SAY NO MORE AS MY COVER IS EXPOSED FOR MANY TO SEE!! :p :p :p

Scott Voigt
18th May 2002, 01:49
BEX;

Sorry to hear of another BAD day at NERC... You are making me just want to go to work and hug DSR and our old HOST <G>. Next time you come to visit, I should be able to get you in the building...

Take care

Scott Voigt
18th May 2002, 01:56
Take3call5;

I think that I would have to take exception to your comment about it not being the suits fault...

If the requirements statement was correctly written and then the contractor was kept to a correctly written contract, then you wouldn't have these issues.

We had some serious problems with our iteration of the "new" and "improved" NAS, or ISSS as it was known. We finally were able to convince the suits that this was NOT going to work in a manner that was better than, or even as good as, the old system. It was scrapped, and then, to save money, we took what elements did work and put them together into something that would at least work and replace an old and failing system. Now we are working at slowly replacing all the other items that need to be updated. We know that the NAS software is VERY COMPLEX and must work all the time in real time. We are not in a hurry, for the sake of safety and the customers....

regards

Brookmans Park
18th May 2002, 02:59
Having just read the posts re the European Cup in Scotland, could there be any connection between the extra traffic which this generated and the subsequent problems at SwAnNwIkK??:o :confused: :mad:

BDiONU
18th May 2002, 07:17
Dan Ryan:
No. Traffic levels were not the issue. Won't find out 'til I go into work on Monday what really happened, but it appears from comments in this thread that there was a problem with one workstation. This caused the watch management to delay 'splitting out' the Ops room to accommodate daytime traffic levels until they could be sure that the problem wasn't going to be replicated on other workstations. At least that's my surmise.

Scott:
I fully agree with your comment about requirements statements etc. However, you underestimate the complexity of the system at LACC. It is not unusual for a 'fix' on one part of the system to introduce regression or an unwanted 'feature' on another part. It can be impossible to know this until a 'problem' is found and the engineers can examine the data (personally I would have it re-written using an object-oriented language so that regression cannot be introduced in other areas).

Sounds like the upgrade that was put on just didn't 'take' properly on one workstation (although they all appeared to work, albeit without flight data, when checked by me and my team), so, for safety, they fully checked out all the rest before committing to daytime traffic levels. IMO the only decision which they could make under the circumstances.

I am sure that EVERYONE is aware of the problems caused to other units, to the airlines and to all the other services connected with flying when there's a problem like this. Just wish there was an easy solution to 'fixing' things, other than relying on everyone else's professionalism to 'carry on regardless' :mad:

Check 6
18th May 2002, 07:32
A couple of questions:

Do these "glitches" at Swanwick affect London Mil?

Did London Mil move also?

Cheers,

BDiONU
18th May 2002, 07:42
Check 6:

London Mil are still at West Drayton, and hopefully will move with LTCC (circa 2005??). These problems will not have affected them, except where they had traffic to join controlled airspace. Quite probably they handled some commercial flights which were willing to fly outside airways.

N.B. There are still military controllers (LJAO) working with (and at) the LTCC at Swanwick, just as there were at LATCC.

POMPI
18th May 2002, 09:17
Never did like CXSS - too complex.

chippy63
18th May 2002, 10:25
SLF is self-loading freight, ie folks like me.

HEATHROW DIRECTOR
18th May 2002, 10:26
Yes, all three Lon Mil controllers are still at West Drayton!!

Sonic Cruiser
18th May 2002, 11:26
EGLL must have been interesting yesterday morning, where did all the inbounds (particularly BA T4 Long Haul) park if very few of the outbounds moved?

GMP and GMC must have been busy positions to be operating yesterday. Were inbound holds lengthy as well, or were restrictions put on the number of flights allowed into Heathrow??

I read that there will be backlogs right across the weekend as airlines try to get the aircraft back in the right place.

Scott Voigt
18th May 2002, 21:23
Hi Take3Call5;

Actually I probably know the system that you have fairly well, since all that you have had and now have were offshoots of what we have or decided not to do...

I completely understand the issues with doing something to the software and then affecting something else in the system. That is why we do a LOT of testing on all of our patches and then test them at all 20 facilities when we install them here. Guess what, even with doing that it doesn't always work. We had a failure just last month on a new patch due to those issues, but we do the install on the midnight shift and bring it back up before the traffic starts getting busy, so if the system flops right away, there isn't a lot of impact when we reload the old system and bring it back online...

As to the complexities of any sort of NAS system replacement, we completely understand that too, and that is why we are now going with the thought of replacing small parts of the NAS one at a time and then turning the old ones off one at a time. Do this until we get to the radar and data processing and then replace those. Don't try to do a big bang. There is too much at risk to do it that way, as well as a training nightmare for the workforce. We don't let pilots get into a new aircraft with just a few days of training spread over a couple of months. They go through a LOT of training and are taken off the line, as it were, to immerse in training. Obviously with our staffing in most of the busy parts of the world, we just can't do this. So do it the smart way. Go in baby steps and get the whole thing done over a course of years, so that you have minor training issues that are easy to deal with and there is very little if any disruption to the users.

regards

Jay Foe
18th May 2002, 22:06
Regarding the problems of the last few days, I think today's 'Matt' cartoon on the front of the Telegraph was quite amusing (fingers crossed this works):( :( :( :(

http://www.telegraph.co.uk/core/Matt/pMattTemplate.jhtml?pTitle=Matt.telegraph

Wahey it worked!!!!!!!! I can do modern technology. Now where's PAR 2000............

DCS99
19th May 2002, 11:37
I work on mainframe airline Res and DCS systems, most recently for a certain carrier which had a large cross on its tail, so I can imagine, with knowing dread, the kind of situation that happened last week.

I've written and tested stuff as well as it can be, loaded it and it's gone wrong. OK, we follow the fallback plan, clean up any mess, re-test and try again. It happens to everyone at some point.

The systems are damn complex but we work equally damn hard to make sure we've thought of everything before going live and we do take it personally when others say things like "outsource IT!", or "don't these programmers/engineers know what they're doing?".

We want to deliver quality all the time, because we know the business and the terrible effects of even the smallest cock-up, but sometimes it's like trying to add another storey on a building between 2 existing floors. It ain't easy, but that's the existing architecture we're working with!

Back to Friday's snagettes:
The worst kind of problem is when a software change has been loaded and it doesn't go wrong till some hours later. At that stage, the fallback option might not be on the cards. It's fall forward, but the morning shift may not know exactly what happened the night before, the logs have crashed, or whatever.

To try and prevent these situations, you need:

1. Decent test systems with real live system data.

2. Investment in Automated Volume testing tools (programmers dislike repetitive testing, and anything that automates it is a great benefit) - a rough sketch of what I mean is at the end of this post.

3. For big changes, get the right people in on the night.

4. Check out as much as you can during the quiet hours at night.

5. Pay them decent compensation. They should stay behind till the morning shift comes in and handover is complete.

I don't know the set-up at ATC other than through second hand sources, so flame me if I'm jumping to conclusions, but it seems like not all of these points were actioned for the change which went wrong on Friday.

I also fear that Point 5 - Paying Overtime - was something the management wanted to avoid - or am I speaking out of turn there?
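
On point 2, here's the sort of thing I mean, sketched in Python with entirely invented names (nothing to do with the real NAS or NERC software): pump a big batch of synthetic flight plans through whatever interface the system under test exposes and count the failures, so nobody has to re-key test data by hand.

```python
# Minimal sketch of an automated volume test harness. All names are
# hypothetical; system_under_test stands in for whatever the real interface is.
import random

def generate_flight_plans(n, seed=42):
    """Produce a large, repeatable batch of synthetic flight plans."""
    rng = random.Random(seed)
    return [
        {"callsign": f"TST{i:04d}",
         "level": rng.choice(range(100, 410, 10)),
         "route": rng.sample(["LAM", "BNN", "OCK", "BIG", "DVR"], 3)}
        for i in range(n)
    ]

def volume_test(system_under_test, n_plans=10_000):
    """Pump every plan through the system and collect failures,
    instead of relying on someone re-keying test data by hand."""
    failures = []
    for plan in generate_flight_plans(n_plans):
        try:
            result = system_under_test(plan)
            if not result.get("accepted", False):
                failures.append((plan["callsign"], "rejected"))
        except Exception as exc:        # a crash is a test failure, not a stop
            failures.append((plan["callsign"], repr(exc)))
    return failures

# Example: a stand-in "system" that chokes on roughly one plan in a thousand.
def flaky_fdp(plan):
    if plan["callsign"].endswith("999"):
        raise RuntimeError("session lost")
    return {"accepted": True}

print(f"{len(volume_test(flaky_fdp))} failures out of 10000 plans")
```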

no sig
19th May 2002, 14:34
Would anyone like to hazard a guess at how many more of these 'glitches' we are going to have to cope with?

If we are running at risk that this will occur again in some form then tell us (the airlines); we'll do what we can to assist, as we did with the change over. But this failure cost my outfit in excess of £400K, probably more, resulted in the cancellation of 44 flights and all the misery that goes with it.

Really folks, we need to do better.

BEXIL160
19th May 2002, 15:27
Three things will affect UK ATC for the foreseeable future..

1) Serious lack of validated Controllers and Assistants at Swanwick

2) NAS (at West Drayton) could easily FLOP again, or the link to it could be lost (not usually too serious at LATCC, but even startovers can ruin Swanwick's whole day)

3) Unknown (and known) faults within the highly complex Swanwick software


NONE of the above are likely to be fixed in the short term, and the staffing situation is a LONG TERM issue. It takes YEARS to train and validate ATCOs; meanwhile more are retiring / leaving / on long or short term sick than are being replaced.

Once again, I am very sorry to be the bearer of bad news, but I'd rather be "open and honest" with you than give you the claptrap that comes out of One Kemble Street. This ain't spin, it's the truth.

Sadly, yours
BEX

no sig
19th May 2002, 16:50
BEXIL
Thank you as ever. I suppose what's hard to swallow is the thought that after three events, albeit apparently unrelated, we are likely to face the mayhem of Friday again. I know it's a complex system; however, the fact that we have had three failures really does make one question the integrity of the software/system and the management of same.

chiglet
19th May 2002, 19:44
Three quotes spring to mind
1 "Our Skies Are NOT For Sale":p
2 "The Buck Stops Here":cool:
3 "Action This Day"
Mr Blur and our "esteemed" CE, Mr Eveready, have [obviously] not studied Modern History, or read the 'papers, or listened to the troops, but then again, what else is Chuffin' new? :mad: :mad:
we aim to please, it keeps the cleaners happy

2 six 4
19th May 2002, 21:56
Went to the pub on Friday night. :p A friend of mine who is a secretary was concerned. Was it my computer that failed and caused all those delays? Well, sort of... I said. After consoling me with a pint she told me they had been discussing it in the office when it came on the news.

Why don't you do what we do when the damned machine stops? What's that, I naively ask? Just type CTRL + ALT + DELETE, it works every time!!!


DOH. We pay ££ millions for sophisticated software companies to design this complex beast and my mate tells me the answer down the pub :D :D

Where's that CTRL button and I'll tell Cheese and Ham .......

Iron City
20th May 2002, 00:45
Hope your Swanwick is not the same as the voice switching and control system in the States (the primary ARTCC voice comm switch): when they control-alt-delete that, it takes a couple of hours to do a complete cold reboot. Cans and string in the meantime, and a couple of BIG megaphones.

Scott Voigt
20th May 2002, 04:58
Iron City;

Actually VSCS takes about 35 minutes to reboot now... If that were to happen though we would switch over to VTABS and would continue working until the reboot was complete.

For those who don't know, VTABS (VSCS Training and Backup Switch) was installed since we didn't feel that we wanted the live system hooked to the training communications switch, and also because we wanted something that would work in the event that we took a power bump or a software problem with VSCS. VTABS has a battery backup that is independent from the critical power UPS and system, just in case. The system has been VERY reliable however...

regards

Tanglefoot
20th May 2002, 16:47
To all fine people out there,

From info I have received today, the Swanwick sector split problem was caused by a network glitch (love that CXSS) – hence they are still running on the new software drop and it is working.

Would love to know how much this has cost – can anyone give me an approximation – but I repeat my earlier comment:
This would not have been a problem ---NO COST/NO DELAY--- if 'the systems specialist engineers' had been around at the time of the DDC.

Unfortunately this costs a few extra peanuts, and the current tight-arse management penny-pinching policy resulted in Friday's situation – nothing else.

The really good news is that management have re-organised these engineers for no other sensible reason than to impress their new owners TAG, and in the hope that some will quit to reduce the redundancy count – but hey – it'll never happen again, right!!!!

BDiONU
20th May 2002, 17:17
Tanglefoot et al:

Yes, the 'trouble' at Swanwick was indeed a network problem, due to the interface between CXSS and a COTS product. This problem was first identified in June last year and the engineers have been trying to 'fix' it, but it has proved to be a vastly more complex issue than was first hoped.

It is incorrect to assert that had specialist engineers been around it would have been sorted immediately. It took some searching by the Lockheed Martin analyst through the low level code to find out what had gone wrong.

As there is an engineering investigation ongoing which will produce a post-mortem result soon I will not speculate on here about solutions.

BDiONU
20th May 2002, 17:20
Forgot a p.s.

The delays were HUGELY exacerbated by a failure of the CFMU in Brussels.

NERC Dweller
20th May 2002, 19:01
Firstly apologies to ATC, Friday must have been an extremely stressful day.

That aside I agree with Take3Call5. The problem affected only a single workstation and once identified was VERY simple to correct.

I would like to say more but won't at this stage as I believe this to be counter productive and unfair to some people that give their absolute best.

Tanglefoot
20th May 2002, 20:43
Take3Call5

My sources have indeed been busy.
Point 1 is that CXSS is a COTS product – delivered as such by LM.
Point 2: the problem is indeed complex, but the symptoms are apparently easy to deduce and, as NERC Dweller states, it is VERY easy to correct. I stand by my earlier comment. With the correct skill sets available at the time, this would have been identified and recovered immediately (I believe the term is worked around – not fixed). Call me old fashioned, but we should not be costing our airline customers millions of pounds whilst using the operational system as an LM test harness.
Point 3: if this is a complex 'fix' it will undoubtedly cost huge amounts of money to implement. If the problem is VERY easy to deduce and recover from, WHY fix it at all! (After all, how many times a year do you have to reboot your PC?)

The results of the investigation will make interesting reading, but I have my doubts whether NATS can admit its mistakes to itself, let alone to the world, on this one.
:confused:

no sig
20th May 2002, 21:16
Take3Call5

True they were, but by that time the die had been cast for a bad day, with the first wave of traffic well behind schedule; when we lost CFMU it became 'very' bad. In relative terms, for my airline, the Swanwick failure probably caused 20 cancellations; with TACT falling over we ended up with 44!

no sig
21st May 2002, 09:34
Nerc Dweller, you wrote

' The problem affected only a single workstation and once identified was VERY simple to correct. '

I hate to think what would happen if we had a 'serious' problem with the system!

Stan By
21st May 2002, 10:37
I'm probably being very naive here, but aren't the Swanwick workstations powered by what are effectively PCs, which store their software on a hard disc drive?

Why not when doing these software changes put in a new HDD, and keep the old one safe, so if the new software doesn't work simply put in the original HDD and reboot?

The cost of this would be less than the cost of sending a first class letter to everyone in the company!

Cue the engineers to tell me why my perfect world won't work;)

Dinosaur
21st May 2002, 13:14
Stan By -- I think it's been mentioned here already, but perhaps not loudly enough: The problem last Friday was not with the new software release. It was an existing problem, always thought to be inoffensive until it chose to manifest itself in a new and exciting way.

When we install a new release, we in fact do exactly as you say: New and old co-exist on disk, and we can switch back to the old one "easily". (Actually, switching back involves the same impact to operations as the original switch).

But on Friday, it wouldn't have helped.
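
Purely to illustrate the idea (invented paths and names, not how ours is actually laid out): both releases live side by side on disk and a single pointer decides which one a workstation boots, so "switching back" is just flipping the pointer plus the same restart impact as the original switch.

```python
# Toy sketch of keeping old and new software releases side by side and
# switching between them with one pointer. Paths and names are invented.
import os

RELEASES = "/opt/atc/releases"        # e.g. .../drop_41, .../drop_42
ACTIVE_LINK = "/opt/atc/active"       # symlink read at workstation boot

def activate(release_name):
    """Point the 'active' link at the chosen release; the change only takes
    effect on a restart, which is why rolling back costs the same
    operational impact as rolling forward."""
    target = os.path.join(RELEASES, release_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"no such release installed: {target}")
    tmp = ACTIVE_LINK + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, ACTIVE_LINK)      # atomic swap of the pointer

# activate("drop_42")   # install night: switch to the new drop
# activate("drop_41")   # fallback: the old release is still on disk
```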

--Dinosaur

BDiONU
21st May 2002, 15:55
Tanglefoot:

1. The CXSS used by the NERC system is a hybrid. The COTS product I'm referring to (I cannot name it, obviously) is central to the use of sessions.

2. It was repairable by simply restarting the affected single workstation. However, that doesn't mean that a host of highly skilled technicians had to be on hand to solve any problems that may have arisen following a DD&C (or any start up - and it was the start up where the 'error' came in). There are several solutions which have been aired, all aimed at identifying that this particular fault has been introduced. Knowing it's there means a simple re-start of the affected workstation, which is hardly rocket science.
I don't understand your reference to the NERC system being an LM test harness. It is a fully working operational system shifting traffic every hour of every day. Or, more correctly, the Operational Staff are using it to shift traffic.

3. I concur.

Why would NATS not publish (in the parts of the company that need to know) the results of its own investigation? There are many lessons to be learnt here and I frankly doubt, whatever management's other failings, that they wouldn't want those lessons spread widely.

Tanglefoot
21st May 2002, 18:46
Take3Call5

1. Agree with your definition of the 'offending COTS' product.

2. Agree that this problem was 'fixed' by restarting a single workstation. My point is that with your highly skilled technicians (a term I think your engineers will liken to an ATCO being called an operator) present, this problem would have been deduced before it became a problem at splitout time. This time it was a network problem. Next time it could be something completely different. Doesn't it make sense to have your system experts present at the only time you stop/start the system – the time it is most likely to fail/run into problems?

As for my last jibe: NATS has a pretty bad record at admitting bad news to itself; after all, how many of us were misled into relocating to Hampshire early on promises that it would be operational in 96/98/2000/2001 etc.
;)

BDiONU
21st May 2002, 19:06
Yes Tanglefoot, NATS senior staff were pretty poor (well, OK, gloves off: they were CR*P) at giving honest and truthful information about the move to Swanwick. Goodness only knows why; it was VERY obvious that the system wasn't ready!!!!!
I still take issue with the thought that having our technicians/engineers/analysts/system experts (amongst whom I'm counted) on hand would actually have picked up the problem and resolved it PDQ. Also, we have started and re-started the system quite literally hundreds of times without any serious problems. This was, not to trivialise things, just a 'Gotcha!'

All Systems Go
21st May 2002, 19:19
The day experts you refer to on a DD&C night are there for the duration of that activity alone. Correct me if I'm wrong, but this problem was only found once they tried to split out the sectors - not until way after the DD&C team had gone off to dream land after a job well done and all the boxes ticked. I'm willing to be proved otherwise as I wasn't there - I only heard the gory details the other night from a colleague. I think you under-estimate the skill and ability of the "technicians", as they have been called.

BDiONU
22nd May 2002, 06:07
All Systems Go:

You are correct, the bug wasn't found until the day shift started splitting. However, the low level code data logs do show the session problems from when that workstation was booted up at about 0120Z.

One thing to note: this problem was not caused, nor affected, by the DD&C. It is a problem which sometimes appears on a workstation re-start; it could happen at any time there's a re-start.

Not sure if you thought I was implying that the NATS Engineering staff's skills and abilities (the word technicians is obviously an emotive one!) were anything other than excellent. I apologise if I came across that way. This problem was not identified until a couple of CXSS experts sifted through the low level code data logs and spotted anomalous activity. Not something which would generally smack you between the eyes, and not something monitored in system control.
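
Purely by way of illustration, and with a completely invented log format (the real low level data looks nothing like this), the sort of automated check that's been aired would boil down to scanning the start-up logs for session anomalies, so a bad boot gets flagged before split-out rather than waiting for someone to sift the data by hand:

```python
# Hypothetical sketch: flag workstations whose start-up logs show session
# anomalies, before sectors are split out. The log format here is invented.
import re

SESSION_ERR = re.compile(r"session\s+(open|sync)\s+(failed|timeout)", re.I)

def suspect_workstations(log_lines):
    """Return the set of workstation ids whose start-up shows session errors."""
    suspects = set()
    for line in log_lines:
        # expected (invented) format: "01:20:14 WS-17 session sync failed rc=5"
        if SESSION_ERR.search(line):
            m = re.search(r"\bWS-\d+\b", line)
            if m:
                suspects.add(m.group())
    return suspects

sample = [
    "01:19:58 WS-16 session open ok",
    "01:20:14 WS-17 session sync failed rc=5",
    "01:20:15 WS-17 retrying",
]
print(suspect_workstations(sample))   # -> {'WS-17'}
```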

All Systems Go
22nd May 2002, 08:52
Take3:

I myself am a technician who calls himself an engineer - it's great. The CXSS experts do an amazing job, one which I can't yet understand how they cope with. In fact, it has to be said, everyone does an incredible job, even when our management doesn't seem interested in motivating staff or letting us have some small comforts in an otherwise cold day.

As for our chums the ATCOs and ATSAs, well. Where would we be without them? Seriously, they always cope admirably with things that would quite considerably upset a normal user. Well done to us all.

Let's hope good ole' AIX and its bigger brother CXSS get along from now on and stop having all these barnies!!!

BDiONU
23rd May 2002, 16:34
All Systems Go:

Yes, here's hoping CXSS and AIX prove merry bedfellows once again, although I would Touch them.