PPRuNe Forums - View Single Post - All London airspace closed
16th Dec 2014, 14:46
#114
Ian W
 
Join Date: Dec 2006
Location: Florida and wherever my laptop is
Posts: 1,350
Originally Posted by slip and turn
It's called error handling, and it is an absolutely critical part of any computer program. If a line of code receives unanticipated data (which may not be 'bad' per se), that unforeseen use case needs to have been foreseen by whoever put together the program spec, whoever agreed the program spec, whoever designed the logic that was intended to handle it flawlessly in the code, whoever checked it, whoever tested it, and whoever signed off on the project or module or upgrade, yet we all now know that one, or all six or sixty, of them were mistaken. And there's the rub.

So now we are told that a single line of code stopped the machine. What actually was it, in the real-time, real-life world, that was unforeseen? That would be the real story.

If I was anotherthing or Gonzo or Zooker or eglnyt et al, I'd have asked that one at the office by now.
The NAS Host software, written in 1969-1971 (yes, in Jovial and BAL, basic assembler language), is actually extremely reliable. However, it was made to run on a set of six IBM 360s known (in the trade) as the IBM 9020D. The UK CAA did not purchase the 9020E, which was another team of six IBM 360s that did radar data processing.

So the 9020D had three input/output processing IBM 360s and three compute element IBM 360s, all running at an impressive 300,000 integer instructions a second.

Now the architecture is what made the system reliable. The system was a multiprocessor, multi-programming system, and any program that was pre-empted could be picked up by another processor. The system repeatedly recorded checkpoint recovery data from once a second out to a few minutes.

So if an error was found by the computer (what would give a BSOD on a PC), the IBM 360 involved would stop all the other processors and give them the checkpoint data, and all the processors would rerun precisely the same program and data. If only one of the processors got the error, then the error must be hardware in that processor, and it put itself offline. If all the processors got the error, then the error must be software, and the 9020 did a core dump (a large hexadecimal printout), threw away all its input messages, then restarted (startover) from a clean checkpoint, say 3 minutes before.

As software faults in a real-time system are normally timing/pre-emption related or caused by a broken input message, the system would normally startover successfully. Controllers would receive a message 'STARTOVER at time - please re-input any messages' (or words to that effect). If Gork put in the broken message again, it could cause the startover again. However, the data systems specialist would be looking at the last messages in, would identify Gork's message, and would somewhat testily suggest that he not re-enter the message next time.
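For anyone who finds it easier to read as code, here is a rough Python sketch of the rerun-and-compare decision described above. It is only an illustration and not how the 9020 software was written (that was Jovial and BAL). The names `Processor`, `recover`, `flight_data_program`, the "CE1" labels and the `faulty_hardware` flag are all invented for the example, and the "transient" branch (nobody reproduces the error on the rerun) is my own assumption, since that case is not covered above. The core dump is left out.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str                      # hypothetical label, e.g. "CE1"
    faulty_hardware: bool = False  # simulation-only flag standing in for a real hardware fault
    online: bool = True

    def rerun(self, program, checkpoint, messages):
        """Rerun the program on a copy of the checkpoint; True means no error."""
        if self.faulty_hardware:
            return False
        try:
            program(dict(checkpoint), list(messages))
            return True
        except Exception:
            return False


def recover(processors, program, recent_checkpoint, clean_checkpoint, messages):
    """Classify a detected error; return (kind, checkpoint_to_resume_from, messages_kept)."""
    online = [p for p in processors if p.online]
    failed = [p for p in online if not p.rerun(program, recent_checkpoint, messages)]

    if not failed:
        # Assumption: this case is not described in the post; treat it as transient.
        return "transient", recent_checkpoint, messages
    if len(failed) < len(online):
        # Only some processors hit the error: blame their hardware, take them offline.
        for p in failed:
            p.online = False
        return "hardware", recent_checkpoint, messages
    # Every processor hit the error: treat it as software, discard the input
    # messages and start over from the older clean checkpoint.
    return "software", clean_checkpoint, []


def flight_data_program(state, messages):
    """Hypothetical program that fails on a malformed input message."""
    for message in messages:
        if message == "BROKEN":
            raise ValueError("unparseable input message")


processors = [Processor("CE1"), Processor("CE2"), Processor("CE3")]
kind, checkpoint, kept = recover(
    processors, flight_data_program,
    recent_checkpoint={"age_s": 1}, clean_checkpoint={"age_s": 180},
    messages=["FPL1", "BROKEN"],
)
print(kind, checkpoint, kept)  # software {'age_s': 180} []
```

In the run shown, all three processors trip over the same broken message, so the sketch calls it a software fault, drops the message queue and falls back to the 3-minute-old checkpoint. If the broken message is typed in again, the same thing happens again, which is the Gork problem.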

OK, so now the system is rehosted as a virtual machine inside a nice shiny new machine. A lot of the automated recovery that was built in may not work quite that way (I don't know how that is now implemented), so I rather think that it may take more manual intervention if the Host software has a glitch.