Problems at Swanwick
Join Date: Feb 2001
Location: Fareham U.K.
Posts: 30
Bexil 160, your suppositions are totally correct! It is also almost certain that the CFMU TACT computer went down because of the rapidity and complexity of the flow restrictions placed by LACC, owing to their inability to split sectors.
I SAY NO MORE AS MY COVER IS EXPOSED FOR MANY TO SEE!!
Join Date: Jul 2001
Location: Fort Worth ARTCC ZFW
Posts: 1,155
BEX;
Sorry to hear of another BAD day at NERC... You are making me just want to go to work and hug DSR and our old HOST <G>. Next time you come to visit, I should be able to get you in the building...
Take care
Join Date: Jul 2001
Location: Fort Worth ARTCC ZFW
Posts: 1,155
Take3call5;
I think I would have to take exception to your comment about it not being the suits' fault...
If the requirements statement was correctly written, and the contractor was then held to a correctly written contract, you wouldn't have these issues.
We had some serious problems with our iteration of the "new" and "improved" NAS, or ISSS as it was known. We finally were able to convince the suits that this was NOT going to work in a manner that was better than, or even as good as, the old system. It was scrapped, and then we took what elements did work and, to save money, put them together into something that would at least work and replace an old and failing system. Now we are working at slowly replacing all the other items that need to be updated. We know that the NAS software is VERY COMPLEX and must work all the time, in real time. We are not in a hurry, for the sake of safety and the customers...
regards
Problems at Swanwick
Having just read the posts re the European Cup in Scotland, could there be any connection between the extra traffic this generated and the subsequent problems at Swanwick? :o
Beady Eye
Join Date: Feb 2001
Location: UK
Posts: 1,495
Dan Ryan:
No. Traffic levels were not the issue. I won't find out until I go into work on Monday what really happened, but it appears from comments in this thread that there was a problem with one workstation. This caused the watch management to delay 'splitting out' the Ops room to accommodate daytime traffic levels until they could be sure that the problem wasn't going to be replicated on other workstations. At least that's my surmise.
Scott:
I fully agree with your comment about requirements statements, etc. However, you underestimate the complexity of the system at LACC. It is not at all unusual for a 'fix' on one part of the system to introduce regression or an unwanted 'feature' in another part. It can be impossible to know this until a 'problem' is found and the engineers can examine the data (personally, I would have it rewritten in an object-oriented language so that regression cannot be introduced in other areas).
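Purely as an illustration of that point (the names and structure here are hypothetical, nothing like the actual LACC code), a minimal sketch of how a stable interface limits how far a 'fix' can reach:

from abc import ABC, abstractmethod

class FlightDataSource(ABC):
    # Stable contract: callers depend only on this interface.
    @abstractmethod
    def next_strip(self) -> dict:
        ...

class LegacyFeed(FlightDataSource):
    def next_strip(self) -> dict:
        # A bug fix inside this class can only change behaviour seen
        # through next_strip(); it cannot silently alter other modules.
        return {"callsign": "BAW123", "level": 350}

def display(source: FlightDataSource) -> None:
    strip = source.next_strip()
    print(strip["callsign"], strip["level"])

display(LegacyFeed())

Whether that fully prevents regression is debatable, but it does narrow down where a change can bite.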
Sounds like the upgrade that was put on just didn't 'take' properly on one workstation (although they all appeared to work, albeit without flight data, when checked by me and my team), so that, for safety, they fully checked out all the rest before committing to daytime traffic levels. IMO the only decision they could make under the circumstances.
I am sure that EVERYONE is aware of the problems caused to other units, to the airlines and to all the other services connected with flying when there's a problem like this. I just wish there was an easy solution to 'fixing' things, other than relying on everyone else's professionalism to 'carry on regardless'.
Beady Eye
Join Date: Feb 2001
Location: UK
Posts: 1,495
Check 6:
London Mil are still at West Drayton, and hopefully will move with LTCC (circa 2005??). These problems will not have affected them, except where they had traffic joining controlled airspace. Quite probably they handled some commercial flights that were willing to fly outside airways.
N.B. There are still military controllers (LJAO) working with (and at) the LACC at Swanwick, just as there were at LATCC.
Join Date: Jun 2001
Location: London
Posts: 52
EGLL must have been interesting yesterday morning: where did all the inbounds (particularly BA T4 long haul) park if very few of the outbounds moved?
GMP and GMC must have been busy positions to be operating yesterday. Were inbound holds lengthy as well, or were restrictions put on the number of flights allowed into Heathrow?
I read that there will be backlogs right across the weekend as airlines try to get the aircraft back in the right place.
Join Date: Jul 2001
Location: Fort Worth ARTCC ZFW
Posts: 1,155
Hi Take3Call5;
Actually, I probably know the system that you have fairly well, since all that you have had, and now have, were offshoots of what we have, or decided not to do...
I completely understand the issue of doing something to the software and thereby affecting something else in the system. That is why we do a LOT of testing on all of our patches, and then test them at all 20 facilities when we install them here. Guess what: even with doing that, it doesn't always work. We had a failure just last month on a new patch due to those issues, but we do the install on the midnight shift and bring it back up before the traffic starts getting busy, so if the system flops right away there isn't a lot of impact when we reload the old system and bring it back online...
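For illustration only (the paths, commands and health check here are hypothetical, not the actual FAA procedure), that install-then-fall-back pattern looks roughly like this:

import shutil
import subprocess

ACTIVE = "/opt/nas/active"          # hypothetical paths
CANDIDATE = "/opt/nas/candidate"
BACKUP = "/opt/nas/previous"

def smoke_test() -> bool:
    # Quick health check run against the freshly loaded build.
    result = subprocess.run([f"{ACTIVE}/healthcheck"], timeout=60)
    return result.returncode == 0

def midnight_install() -> None:
    shutil.copytree(ACTIVE, BACKUP, dirs_exist_ok=True)     # keep the old load
    shutil.copytree(CANDIDATE, ACTIVE, dirs_exist_ok=True)  # bring up the patch
    if not smoke_test():
        # The system flopped right away: reload the old build
        # before the morning traffic starts.
        shutil.copytree(BACKUP, ACTIVE, dirs_exist_ok=True)

The point is that the quiet midnight window is what makes the fallback cheap; the same failure at midday would be a very different story.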
As to the complexities of any sort of NAS system replacement, we completely understand that too, and that is why we are now going with the idea of replacing small parts of the NAS one at a time, and then turning them off one at a time. Do this until we get to the radar and data processing, and then replace those. Don't try to do a big bang; there is too much at risk to do it that way, as well as a training nightmare for the workforce. We don't let pilots get into a new aircraft with just a few days of training spread over a couple of months. They go through a LOT of training and are taken off the line, as it were, to immerse themselves in it. Obviously, with our staffing in most of the busy parts of the world, we just can't do this. So do it the smart way: go in baby steps and get the whole thing done over a course of years, so that you have minor training issues that are easy to deal with and there is very little, if any, disruption to the users.
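That 'baby steps' approach is what software people would call an incremental (or 'strangler') migration: route one function at a time to the replacement while everything else stays on the old system. A minimal sketch, with entirely hypothetical component names:

# Which subsystems have been cut over to the replacement so far.
MIGRATED = {"flight_plan_print"}        # hypothetical component names

def old_system(request: str) -> str:
    return f"legacy handled {request}"

def new_system(request: str) -> str:
    return f"replacement handled {request}"

def dispatch(component: str, request: str) -> str:
    # Send each request to old or new code, one component at a time.
    handler = new_system if component in MIGRATED else old_system
    return handler(request)

print(dispatch("flight_plan_print", "BAW123"))   # replacement
print(dispatch("radar_processing", "BAW123"))    # still legacy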
regards
Join Date: Jul 2001
Location: Middle of Nowhere
Posts: 33
Regarding the problems of the last few days, I think today's 'Matt' cartoon on the front of the Telegraph was quite amusing (fingers crossed this works):
http://www.telegraph.co.uk/core/Matt...Matt.telegraph
Wahey it worked!!!!!!!! I can do modern technology. Now where's PAR 2000............
Join Date: Sep 2000
Location: Danger - Deep Excavation
Posts: 341
I work on mainframe airline Res and DCS systems, most recently for a certain carrier which had a large cross on its tail, so I can imagine, with knowing dread, the kind of situation that happened last week.
I've written and tested stuff as well as it can be, loaded it and it's gone wrong. OK, we follow the fallback plan, clean up any mess, re-test and try again. It happens to everyone at some point.
The systems are damn complex but we work equally damn hard to make sure we've thought of everything before going live and we do take it personally when others say things like "outsource IT!", or "don't these programmers/engineers know what they're doing?".
We want to deliver quality all the time, because we know the business and the terrible effects of even the smallest cock-up, but sometimes it's like trying to add another storey to a building between two existing floors. It ain't easy, but that's the existing architecture we're working with!
Back to Friday's snagettes:
The worst kind of problem is when a software change has been loaded and it doesn't go wrong until some hours later. At that stage, the fallback option might not be on the cards. It's fall-forward, but the morning shift may not know exactly what happened the night before, the logs have crashed, or whatever.
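To make that fall-back/fall-forward trade-off concrete (the four-hour window and the probe are invented for the example, not anyone's real procedure), the decision essentially hinges on how long the change survived before failing:

import time

ROLLBACK_WINDOW = 4 * 3600        # hypothetical: fallback viable for 4 hours

def decide(fault_time: float, loaded_at: float) -> str:
    # Pick a recovery strategy based on how long the change survived.
    if fault_time - loaded_at <= ROLLBACK_WINDOW:
        return "fall back to the old load"
    return "too late to fall back: fall forward and fix live"

loaded_at = time.time()
print(decide(loaded_at + 600, loaded_at))        # fails after 10 minutes
print(decide(loaded_at + 6 * 3600, loaded_at))   # fails after 6 hours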
To try and prevent these situations, you need:
1. Decent test systems with real live system data.
2. Investment in automated volume-testing tools (programmers dislike repetitive testing, and anything that automates it is a great benefit; see the sketch after this list).
3. For big changes, get the right people in on the night.
4. Check out as much as you can during the quiet hours at night.
5. Pay them decent compensation. They should stay behind until the morning shift comes in and handover is complete.
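As a rough illustration of point 2 (the harness and message format here are invented, not any real ATC tool), an automated volume test just replays a large recorded sample and checks that nothing fails and the system keeps up:

import time

def process(message: dict) -> bool:
    # Stand-in for the system under test; returns True on success.
    return message.get("callsign") is not None

def volume_test(recorded_messages: list, max_seconds: float) -> bool:
    # Replay a recorded traffic sample; flag failures or slowness.
    start = time.monotonic()
    failures = sum(1 for m in recorded_messages if not process(m))
    elapsed = time.monotonic() - start
    return failures == 0 and elapsed <= max_seconds

sample = [{"callsign": f"TEST{i}"} for i in range(10_000)]
print("PASS" if volume_test(sample, max_seconds=5.0) else "FAIL")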
I don't know the set-up at ATC other than through second-hand sources, so flame me if I'm jumping to conclusions, but it seems like not all of these points were actioned for the change which went wrong on Friday.
I also fear that point 5, paying overtime, was something the management wanted to avoid. Or am I speaking out of turn there?
Join Date: May 1999
Location: Vancouver, BC.
Posts: 748
Would anyone like to hazard a guess at how many more of these 'glitches' we are going to have to cope with?
If we are running a risk that this will occur again in some form, then tell us (the airlines); we'll do what we can to assist, as we did with the changeover. But this failure cost my outfit in excess of £400K, probably more, and resulted in the cancellation of 44 flights, with all the misery that goes with that.
Really folks, we need to do better.
Join Date: Jan 2001
Location: I sell sea shells by the sea shore
Posts: 856
Three things will affect UK ATC for the foreseeable future:
1) Serious lack of validated Controllers and Assistants at Swanwick
2) NAS (at West Drayton) could easily FLOP again, or the link to it could be lost (not usually too serious at LATCC, but even startovers can ruin Swanwick's whole day)
3) Unknown (and known) faults within the highly complex Swanwick software
NONE of the above is likely to be fixed in the short term, and the staffing situation is a LONG-TERM issue. It takes YEARS to train and validate ATCOs; meanwhile, more are retiring, leaving, or on long/short-term sick than are being replaced.
Once again, I am very sorry to be the bearer of bad news, but I'd rather be "open and honest" with you than feed you the claptrap that comes out of One Kemble Street. This ain't spin, it's the truth.
Sadly, yours
BEX
Join Date: May 1999
Location: Vancouver, BC.
Posts: 748
BEXIL
Thank you as ever. I suppose what's hard to swallow is the thought that after three events, albeit apparently unrelated ones, we are likely to face the mayhem of Friday again. I know it's a complex system; however, the fact that we have had three failures really does make one question the integrity of the software, the system, and the management of same.
Join Date: Apr 2001
Location: Near Stalyvegas
Age: 78
Posts: 2,022
Three quotes spring to mind:
1 "Our Skies Are NOT For Sale"
2 "The Buck Stops Here"
3 "Action This Day"
Mr Blur and our "esteemed" CE, Mr Eveready have [obviously] not studied Modern History, or read the 'papers, or listened to the troops, but then again, what else is Chuffin' new?
we aim to please, it keeps the cleaners happy
1 "Our Skies Are NOT For Sale"
2 "The Buck Stops Here"
3 "Action This Day"
Mr Blur and our "esteemed" CE, Mr Eveready have [obviously] not studied Modern History, or read the 'papers, or listened to the troops, but then again, what else is Chuffin' new?
we aim to please,it keeps the cleaners happy
Last edited by chiglet; 19th May 2002 at 20:49.
Join Date: Sep 1998
Location: UK
Posts: 272
Went to the pub on Friday night. A friend of mine who is a secretary was concerned: was it my computer that failed and caused all those delays? Well, sort of... I said. After consoling me with a pint, she told me they had been discussing it in the office when it came on the news.
Why don't you do what we do when the damned machine stops? What's that? I naively ask. Just type CTRL + ALT + DELETE, it works every time!!!
DOH! We pay £millions for sophisticated software companies to design this complex beast, and my mate tells me the answer down the pub.
Where's that CTRL button and I'll tell Cheese and Ham .......
Join Date: Jan 2002
Location: USA
Posts: 394
Hope your Swanwick is not the same as the Voice Switching and Control System in the States (the primary ARTCC voice comm switch): when they control-alt-delete that, it takes a couple of hours to do a complete cold reboot. Cans and string in the meantime, and a couple of BIG megaphones.