PPRuNe Forums - View Single Post - All London airspace closed
Old 27th Dec 2014, 20:00
  #154
EEngr
 
Join Date: Jan 2011
Location: Seattle
Posts: 717
Terminology

There seems to be some misunderstanding of the term RAID* as it applies to the TMCS servers. RAID technology is a set of firmware (or software) and configuration that makes a group of physical disks appear to a host as a single logical disk drive. This can provide redundancy in the event of one or more hardware failures and allow the host system to continue running, although in some cases at a reduced level of performance. The key here is that a RAID array typically serves a single host system. So even if a series of failures were sufficient to incapacitate the entire RAID array (highly improbable), only that one host would fail.
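To make the idea concrete, here is a toy sketch of a mirrored (RAID 1 style) array in Python. It is purely illustrative: the class name MirroredArray and its methods are made up for this post and have nothing to do with how the TMCS servers or any real RAID controller are built.

```python
class MirroredArray:
    """Toy RAID 1: several member disks presented as one logical disk."""

    def __init__(self, disk_count=2):
        # Each member disk is modelled as a dict of block number -> bytes.
        self.disks = [dict() for _ in range(disk_count)]
        self.failed = set()

    def write(self, block, data):
        # Every healthy member receives an identical copy of the block.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail(self, index):
        # Simulate a hardware failure of one member disk.
        self.failed.add(index)

    def read(self, block):
        # Serve the block from the first member that is still healthy.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]
        raise IOError("all member disks have failed")


array = MirroredArray()
array.write(0, b"flight plan")
array.fail(0)                 # lose one disk
survived = array.read(0)      # the host still gets its data
array.write(1, b"garbage")    # bad data is mirrored just as faithfully
```

The host only ever sees one logical disk. The mirror protects against a dead drive, but it has no opinion about whether the bytes it is asked to store make sense.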

From the description of the events, it appears that the logical disk failure affected several redundant host systems. This leads me to believe that, in addition to a RAID array, these systems were using Network Attached Storage** (NAS), where one server 'shares out' its disk system to other systems. Such setups are sometimes loosely referred to as a Storage Area Network, although strictly speaking a SAN shares raw block devices rather than files. Each system would look at the files on this shared (mounted) drive as if they were local to that system. However, data (bad data in this case) written by one system would become available to all.

It is also possible (not clear from the report) that the "disk mirroring" function may have been implemented at an application level. That is: the TMCS server applications would each receive a copy of a data stream and each would write a local copy to its own disk. This would be the most robust system, as the applications would be able to spot "bad data" and refuse to save a local copy. Typical NAS systems don't have this capability, as the operating systems have no concept of what is good or bad. Bytes are bytes. And from the description of the failure, it sounds like the NAS approach, rather than application-level mirroring, is what was implemented.
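The difference between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical: the record fields, the is_valid rule and the class names are invented for illustration, and I have no knowledge of what checks the real TMCS applications perform.

```python
def is_valid(record):
    # Hypothetical sanity rule, standing in for whatever the real
    # application would know about its own data.
    return isinstance(record.get("callsign"), str) and 0 <= record.get("level", -1) <= 600


class AppMirroredServer:
    """Each server keeps its own local copy and checks data before saving."""

    def __init__(self):
        self.local_copy = []

    def receive(self, record):
        if not is_valid(record):
            return False          # refuse to store bad data
        self.local_copy.append(record)
        return True


class SharedStore:
    """Stand-in for a NAS share: it stores whatever it is handed."""

    def __init__(self):
        self.files = []

    def write(self, record):
        self.files.append(record)  # bytes are bytes


good = {"callsign": "BAW123", "level": 350}
bad = {"callsign": None, "level": 99999}

# Application-level mirroring: every server gets the stream and filters it.
servers = [AppMirroredServer(), AppMirroredServer()]
results = [[server.receive(record) for record in (good, bad)] for server in servers]

# Shared storage: one write, and every host that mounts the share sees it.
shared = SharedStore()
for record in (good, bad):
    shared.write(record)
```

With application-level mirroring each server ends up holding only the good record. With the shared store, the bad record sits alongside the good one for every host that mounts the share.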

NAS systems are a bad deal for redundancy. They make a system administrator's job easier: write one file and everybody automatically gets a copy. But that is exactly the problem if that one copy becomes corrupted.

*Redundant Array of Independent (or Inexpensive) Disks
**Some examples are NFS (Network File System), SMBFS (Server Message Block File System), and Novell NetWare.
