Collecting, Analysing, and Exploiting Failure Data from Real, Large Systems

Posted in Conferences, Companies, Project Management, Testing on November 29, 2006


Google Tech Talks, October 27, 2006

ABSTRACT

Component failure in large-scale IT installations is becoming an ever larger problem as the number of processors, memory chips, and disks in a single cluster approaches a million. Yet virtually no data on failures in real systems is publicly available, forcing researchers to base their work on anecdotes and back-of-the-envelope calculations.

In this talk, we will present results from our analysis of failure data from 26 large-scale production systems at three different organizations, including two high-performance computing sites and one large internet service provider. Our results indicate that several commonly made assumptions about failures may not accurately reflect field experience. For example, in the case of disk failures, we find that failure rates in the field can be an order of magnitude higher than one might predict from a disk's datasheet mean-time-to-failure (MTTF), and that significant wear-out effects set in much earlier than commonly assumed.

We will also talk briefly about our efforts to create a Usenix-supported failure data repository that will provide researchers with a wide variety of real failure data.
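To see why the "order of magnitude" gap matters, it helps to convert a datasheet MTTF into the annual failure rate it implies. The sketch below does this under the usual constant-hazard (exponential lifetime) assumption; the 1,000,000-hour MTTF is a typical datasheet value chosen for illustration, not a figure from the talk.

```python
# Sketch: annual failure rate (AFR) implied by a datasheet MTTF,
# assuming a constant hazard rate (exponential lifetime model).

HOURS_PER_YEAR = 8760

def datasheet_afr(mttf_hours: float) -> float:
    """Approximate annualized failure rate implied by an MTTF in hours."""
    return HOURS_PER_YEAR / mttf_hours

# 1,000,000 hours is a typical datasheet MTTF (illustrative assumption).
afr = datasheet_afr(1_000_000)
print(f"Datasheet-implied AFR: {afr:.2%}")   # roughly 0.88% per year
print(f"10x that rate:         {10 * afr:.2%}")
```

A field rate ten times the datasheet-implied 0.88% would mean replacing several percent of a cluster's disks every year, which is why the gap between datasheet figures and field experience is significant at the scales the talk discusses.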

Watch Video

Tags: Techtalks, Google, Practices, Q&A, Conferences, Companies