Learning Defect Predictors from Static Code Attributes: Lessons from the Trenches

Posted in Conferences, Companies, Science on December 10, 2008

For six years, I have worked on learning quality predictors from NASA data. Based on that experience, this talk offers the following lessons from the trenches:

  1. Real-world data collection is more like ambulance chasing than bus driving. The old DoD model of rigorous process control just breaks down in the modern era of distributed software development. Rather than lament the lack of formal process, we should adapt our learning methods to handle the idiosyncrasies of our data.
  2. Accuracy, correlation and precision are not accurate or precise, and may not correlate with any decision-making process. This is especially true for data sets where only a small percentage of the data contains the target concept.
  3. Static code attributes are a wide and shallow well: easy to get to the bottom, very hard to get much further. Our learners may have learned all they can learn from these attributes.
  4. The only way up is sideways. My data miners have struck a performance ceiling and the only way up is to change the performance target.
  5. The performance ceiling is very close, and we can exploit that. Rather than large-scale automatic methods, it may be more productive to explore human-in-the-loop interactive learning strategies.
  6. We can talk, but will they listen? Many times, I have found a clear signal in a software engineering data set. Clearly, our learners are good enough to assist managers in the difficult task of managing software projects. Sadly, all too often, some management edict is applied that effectively ends that project (e.g. collection of that data source is terminated). I offer some speculations on this peculiar effect.
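
Lesson 2 can be made concrete with a little arithmetic. The sketch below is only an illustration: the module counts and the two predictors are hypothetical numbers chosen for this example, not results from the talk or from any NASA data set. It computes accuracy, precision, recall, and false-alarm rate from a confusion matrix for a data set in which only 5% of modules are defective.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, false_alarm) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. a predictor that never says "defective").
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, precision, recall, false_alarm


# Hypothetical data set: 1000 modules, 50 of them defective (5%).

# Predictor A always answers "not defective".
always_no = metrics(tp=0, fp=0, fn=50, tn=950)

# Predictor B finds 40 of the 50 defective modules but raises 190 false alarms.
detector = metrics(tp=40, fp=190, fn=10, tn=760)

for name, m in (("always-no", always_no), ("detector", detector)):
    acc, prec, rec, pf = m
    print(f"{name:10s} accuracy={acc:.2f} precision={prec:.2f} "
          f"recall={rec:.2f} false_alarm={pf:.2f}")
```

Under these assumed counts, the predictor that never flags anything scores higher on accuracy (0.95 versus 0.80) while finding no defects at all, and the predictor that finds 80% of the defects has a precision of only about 0.17. With so few defective modules, accuracy and precision say little about which predictor would be more useful.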


  • "Implications of Ceiling Effects in Defect Predictors" by T. Menzies and B. Turhan and A. Bener and G. Gay and B. Cukic and Y. Jiang. Proceedings of PROMISE 2008 Workshop (ICSE) 2008. Available from http://menzies.us/pdf/08ceiling.pdf.
  • "Learning Better IVV Practices" by T. Menzies and M. Benson and K. Costello and C. Moats and M. Northey and J. Richardson. Innovations in Systems and Software Engineering March 2008. Available from http://menzies.us/pdf/07ivv.pdf.
  • "Data Mining Static Code Attributes to Learn Defect Predictors" by Tim Menzies and Jeremy Greenwald and Art Frank. IEEE Transactions on Software Engineering January 2007. Available from http://menzies.us/pdf/06learnPredict.pdf.
  • "Problems with Precision" by Tim Menzies and Alex Dekhtyar and Justin Distefano and Jeremy Greenwald. IEEE Transactions on Software Engineering September 2007. Available from http://menzies.us/pdf/07precision.pdf.
  • "Finding the Right Data for Software Cost Modeling" by Zhihao Chen and Tim Menzies and Dan Port and Barry Boehm. IEEE Software Nov 2005. Available from http://menzies.us/pdf/05chen.pd

Speaker: Tim Menzies
Dr. Tim Menzies (tim@menzies.us) has been working on advanced modeling and AI since 1986. He received his PhD from the University of New South Wales, Sydney, Australia and is the author of over 170 refereed papers.

A former research chair for NASA, Dr. Menzies is now an associate professor at West Virginia University's Lane Department of Computer Science and Electrical Engineering.

Google Tech Talks
October 29, 2008
