PPRuNe Forums - View Single Post - NTSB and Rudders
Old 31st Aug 2002, 18:48
#96
arcniz
Pehr Hallberg says:
A computer doing some useful job is obviously some hardware with some specialized software together performing the job. I have seen in the telecom industry, for instance, several examples where the redundancy is on the hardware side but where you run the same software in both sets of hardware. This protects against hardware failure, but if the software fails, both computers fail in the same way.
I disagree. A couple of points here:

a) To achieve any serious degree of reliability and fail-soft capability with computer controls, one needs to work them in groups of three. With only two, any discrepancy leads to an argument and possible deadlock; with three, they can take a vote, cut out the oddfellow, and ring some bells to escalate alarm about the new condition of degraded redundancy. If three is good, 'many' can sometimes be better.
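A minimal sketch of that triple-channel vote, in Python for illustration only (the function name, tolerance parameter, and return convention are my own invention, not anything from a real flight-control system):

```python
def vote(a, b, c, tolerance=0.0):
    """Majority-vote three redundant channel outputs.

    Returns (output, failed_channel): failed_channel is None when all
    three agree, the index (0/1/2) of the outvoted 'oddfellow' when
    exactly one disagrees, or 'deadlock' when no two channels agree --
    the two-channel argument the post describes, which three channels
    normally avoid.
    """
    agree = lambda x, y: abs(x - y) <= tolerance
    if agree(a, b) and agree(b, c):
        return a, None            # full agreement: healthy redundancy
    if agree(a, b):
        return a, 2               # channel 2 outvoted; ring the bells
    if agree(a, c):
        return a, 1
    if agree(b, c):
        return b, 0
    return None, 'deadlock'       # no majority exists at all
```

A flagged channel would be cut out of subsequent votes and the crew (or operator) alerted that redundancy is now degraded.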

b) Much can be done - in design - to prevent the problem you point to, where 'everybody is wrong'. Sometimes the best answer is to 'fail' the system momentarily and recycle it to a new self-aware configuration. Also helpful is judicious allocation of 'gold bars' of authority.

c) As we all know, a lot of things can go wrong in hardware and software. Reliability in both is most commonly accomplished by detection of abnormal performance and subsequent reconfiguration to a (usually) more conservative operating strategy. This works just as well with 'software' as with 'hardware', yet it tends more often to be omitted or done lightly in software because the threshold cost for s/w changes is putatively lower.
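The detect-and-degrade pattern in (c) can be sketched as a simple mode ladder. The mode names and the one-way step-down policy here are assumptions of mine for illustration (loosely echoing the normal/alternate/direct law terminology of some fly-by-wire aircraft), not any particular system's design:

```python
# Increasingly conservative operating strategies, most capable first.
MODES = ["normal", "alternate", "direct"]

class Controller:
    """Fail-soft controller: each detected fault steps the system down
    one level of capability instead of failing it outright."""

    def __init__(self):
        self.mode = 0                     # start in the full-capability mode

    def report_fault(self):
        """Abnormal performance detected: reconfigure to the next more
        conservative mode. 'direct' is the floor; further faults there
        leave the mode unchanged."""
        if self.mode < len(MODES) - 1:
            self.mode += 1
        return MODES[self.mode]
```

The point of the ladder is exactly the one made above: detection plus reconfiguration works the same way whether the faulty element is hardware or software, provided someone bothered to design it in.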

You seem to feel that hardware fault detection is much more reliable than software fault detection, but I disagree. If similar design methods are used for software fault detection and reconfiguration, the results can be comparable to those of the best hardware, i.e., nearly perfect.

We have all seen computers fail to perform correctly, but anecdotal evidence is not the same as truth.

If systems intended to be ultrareliable are observed to fail, one must conclude that they were not designed and tested with sufficient care. This is the binary equivalent of 'pilot error'.

Your telecom anecdote highlights a common failure in 'duty of care'.
Just as it would be irresponsible for an airline to put an unqualified and untested pilot in charge of a passenger transport, it is a management mistake to deploy inadequately designed and / or inappropriate technology in any critical application.

The only way to achieve true reliability is..... Very Carefully.