PPRuNe Forums - View Single Post - AF 447 Search to resume (part2)
Old 23rd May 2011, 10:34
  #2159
john_tullamarine
Moderator
 
Ref this post, my limited knowledge of cutting code suggested that the sentiments were a bit optimistic. A specialist colleague provided the following commentary as an observation and invited me to post it for information.

"The contributor who goes by the name of syseng68k has a rosy, and, I fear, misleading idea of the quality of software and software development. It is not error-free, even in critical systems. There is probably no substantial system (of the order of tens of thousands of lines of code or larger, the size of all flight control systems in commercial aircraft nowadays) which has ever been built that was or is error-free. For the state of the art, see the last paragraph.

I have worked in the reliability and safety of software-based systems for a quarter century, am a member of my country's standardisation committee on functional safety of computer-based systems, have an international reputation in the field, and indeed shall be giving a keynote talk at a major international conference on the subject later in the year.

A famous study by Robyn Lutz of Iowa State University twenty years ago on the source of about 200 mission-critical software failures for NASA showed that over 98% were due to the requirements not covering adequately the actual situation the system encountered. (syseng68k prefers not to call these "bugs", but most people working in ultrareliable software do.) A more recent study by Lutz and Mikulski from 2004, published in one of the premier journals for software engineering, may be found at http://www.cs.iastate.edu/%7Erlutz/publications/tse04.pdf

There is an enormous variety of problems with computer-based systems in aerospace contexts; the number of incidents is clearly increasing, and I fear it will continue to do so. A check of Peter Neumann's "Illustrative Risks", Section 1.6, Commercial Aviation, gives some sense of this:
http://www.csl.sri.com/users/neumann/illustrative.html#9 The Risks Digest archive is also available on-line at http://catless.ncl.ac.uk/risks

Kevin Driscoll of Honeywell, along with Brendan Hall, Hakan Sivencrona (of Chalmers University of Technology) and Phil Zumsteg, published a paper recounting a so-called "Byzantine anomaly" in the FBW control system of a major airliner that almost caused its airworthiness certificate to be withdrawn:
http://www.cs.indiana.edu/classes/b649-sjoh/reading/Driscoll-Hall-Sivencrona-Xumsteg-03.pdf
Driscoll gave a keynote talk at SAFECOMP 2010, "Murphy Was an Optimist", on such kinds of failures, which are ongoing.
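For readers unfamiliar with the term, a toy sketch may help (hypothetical values; nothing here comes from the Driscoll paper or any real FBW system). A Byzantine fault is one in which a failed component gives different answers to different observers, which can defeat naive per-voter majority voting:

```python
# Hypothetical sketch of a Byzantine fault: a faulty channel reports
# different values to different voters, so two *correct* voters that
# each take a simple majority end up acting on different data.
from collections import Counter

def majority(values):
    """Return the value most often reported to one voter."""
    return Counter(values).most_common(1)[0][0]

# Three redundant channels report airspeed to two independent voters.
# A and B are healthy but differ within normal sensor tolerance;
# C is Byzantine-faulty and tells each voter a different story.
to_voter1 = [250, 252, 250]   # A, B, C -- C sides with A here
to_voter2 = [250, 252, 252]   # A, B, C -- ...and with B here

v1, v2 = majority(to_voter1), majority(to_voter2)
print(v1, v2)   # the voters disagree despite both voting correctly
```

This is why tolerating such faults requires extra replicas and message-exchange rounds rather than simple voting, and why the failure mode is so hard to anticipate in requirements.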

Estimates by colleagues who know put the general coding-error rate in safety-critical software at about one error per 1,000 LOC (lines of executable source code). Obviously, this is a very general statement; how often behavioural anomalies are triggered by coding errors depends on the operational profile of the software. For example, the bug in the configuration-control software on the Boeing 777 was present throughout the lifetime of the aircraft but only manifested after a decade of line service: http://www.atsb.gov.au/publications/investigation_reports/2005/aair/aair200503722.aspx Recent anomalies in the Airbus A330, also triggered by ADIRU issues affecting the primary flight controls, took a similar length of time to manifest: http://www.atsb.gov.au/media/1363394/ao2008070_ifr_2.pdf
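A back-of-envelope sketch makes the latency point concrete. The defect-density figure is the one quoted above; the geometric "flights until trigger" model is a simplification for illustration, not something from the studies cited:

```python
# Back-of-envelope sketch using the ~1 defect per 1,000 LOC figure quoted
# above; the geometric triggering model is an illustrative assumption.

def expected_latent_defects(loc, defects_per_kloc=1.0):
    """Expected residual coding errors for a codebase of `loc` lines."""
    return loc * defects_per_kloc / 1_000

def mean_flights_until_trigger(p_per_flight):
    """Mean exposures before a defect's rare triggering condition occurs
    (geometric distribution: 1/p)."""
    return 1.0 / p_per_flight

print(expected_latent_defects(60_000))    # ~60 latent defects in 60,000 LOC
print(mean_flights_until_trigger(1e-4))   # ~10,000 flights before the trigger
```

A defect whose triggering condition arises once in ten thousand flights can easily sit latent for a decade of line service, exactly as in the 777 and A330 cases above.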

There are some companies, such as Altran Praxis (formerly Praxis High Integrity Systems), who have instrumented their error rates and demonstrably achieve much lower rates, of the order of one error per 25,000 LOC. Altran Praxis recently undertook the Tokeneer project, funded by the NSA, whose objective was to develop error-free software: http://www.adacore.com/home/products/sparkpro/tokeneer/
Despite apparent initial success, five errors were eventually found:
http://www-users.cs.york.ac.uk/~jim/woodcock-hoare-75th-final.pdf (note, this is a paper for a celebration of the 75th birthday of a UK pioneer in reliable software technology, Turing Award winner Sir Tony Hoare, co-inventor 40 years ago of the original method for mathematically proving that sequential programs do what they are supposed to do).

Summary: five errors is as good as it gets in software with many tens of thousands of LOC. That is of the order of the size of the first civil-aircraft flight-control systems of a quarter-century ago (Airbus A320, 1988, about 60,000 LOC). Since then, however, the size of the systems has increased by two orders of magnitude, and of course they are also very highly distributed, which brings many subtle additional sources of potential error into play (e.g., Byzantine failures)."
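The implication of that summary can be checked with quick arithmetic (the rates are the ones quoted in the commentary; the extrapolation to modern system sizes is illustrative, not a measured result):

```python
# Rough arithmetic from the figures quoted in the commentary; the
# extrapolation to modern system sizes is illustrative only.

a320_loc = 60_000              # A320 FBW, 1988, per the commentary
modern_loc = a320_loc * 100    # "two orders of magnitude" larger

best_rate = 1 / 25_000         # best demonstrated rate (Altran Praxis)
typical_rate = 1 / 1_000       # typical safety-critical rate

print(round(modern_loc * best_rate))     # expected errors at the best rate
print(round(modern_loc * typical_rate))  # expected errors at the typical rate
```

Even at the best demonstrated rate, a system a hundred times the size of the A320's implies some 240 residual errors, and thousands at the typical rate, which is the quantitative core of the commentary's caution.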