Tuesday, 4 November 2008

Why does data storage fail?

Although storage administrators tend to treat "hardware failures" and "disk failures" as synonymous, a recent study by researchers at the University of Illinois at Urbana-Champaign and NetApp Inc. suggests that most failures in backup data storage systems, as well as in primary storage arrays, are not caused by disk failures.

In a paper titled "Are disks the dominant contributor for storage failures?", Weihang Jiang and his co-authors Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky found that up to 80% of storage system failures aren't in the disks at all. The authors surveyed approximately 39,000 storage systems with 155,000 enclosures containing roughly 1.8 million disks over a period of 44 months. The paper was presented in February at the USENIX Conference on File and Storage Technologies (FAST 08). According to Jiang and his co-authors, only between 20% and 50% of storage system failures were due to the disk drives themselves. The rest came from other causes, most notably interconnection problems.

Of course, everyone who works on hardware knows that connections are a notorious source of mysterious troubles. Interconnect problems become more likely as we cram more and more physically smaller drives into a single enclosure. In these ultra-crowded enclosures, the probability of an interconnect failure probably rises faster than the probability of a drive failure, and the crowding definitely makes it harder to get at the connections to check them.

As the paper notes: "There are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements." This helps explain a couple of well-known but puzzling phenomena associated with storage system failures. The most obvious is that users report in-service disk failure rates that are between two and four times the rates calculated by manufacturers.
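A rough back-of-the-envelope check shows how the paper's numbers line up with that gap. The sketch below assumes, purely for illustration, that every storage system failure gets blamed on a disk; the function name `apparent_multiplier` is made up for this example and does not come from the paper. Under that assumption, if disks really cause only a given share of failures, the failure rate users perceive is the true disk rate divided by that share.

```python
def apparent_multiplier(disk_share: float) -> float:
    """How many times higher the perceived disk failure rate is than the
    true one, ASSUMING every storage system failure is blamed on a disk.

    disk_share is the fraction of failures actually caused by disks.
    """
    if not 0 < disk_share <= 1:
        raise ValueError("disk_share must be in (0, 1]")
    return 1.0 / disk_share


# The 20% and 50% disk shares are the bounds quoted from the paper.
for share in (0.20, 0.50):
    factor = apparent_multiplier(share)
    print(f"disk share {share:.0%}: perceived rate is {factor:.1f}x the true disk rate")
```

With disk shares of 20% and 50%, this worst-case assumption gives multipliers of 5x and 2x. In practice not every non-disk failure ends in a drive being pulled, so the real inflation would be somewhat lower, which is in the same range as the two-to-four-times gap that users report.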

It also helps explain the phenomenon of Trouble Not Found (TNF): returned drives that check out fine back at the factory. Some manufacturers put the share of TNF drives as high as 50% of all those returned as failed. While there are many other reasons a good drive might be pulled as failed (such as problems in the protocol stack), replacing a drive is likely to fix a flaky connector, at least temporarily.

Source: Based on TechTarget report
