Where Did This Code Come From? Discovering the Provenance of Program Binaries

Posted in Science, Companies, Conferences on May 11, 2012



Google Tech Talk (more info below)
April 22, 2011

Presented by Nathan Rosenblum, UW-Madison

ABSTRACT

Where did this binary come from? How was it compiled? What language did the programmer choose? Who wrote this code? These questions rarely occur to most computer users, but for analysts working in forensics, reverse engineering, and software theft, they are of paramount importance. The provenance of a program binary (the specific process through which an idea is transformed into executable code) can provide valuable insight, yet it is in the very domains where such information would be most useful that it is least likely to be available. At the University of Wisconsin, we have investigated techniques to recover these provenance details from program binaries, filling in the gaps in the production process. Provenance recovery occupies the intersection of program analysis, security, and statistical machine learning research; in this talk, I will describe probabilistic models of provenance in the context of compiler toolchain identification and both closed- and open-world solutions to the difficult task of program authorship attribution: picking out stylistic characteristics of executable code that reveal the identity of the programmer. Our work integrates a range of machine learning techniques, from support vector machines to conditional random fields to metric learning and large-margin clustering. I will discuss how we leverage large-scale computing resources to solve scaling problems in model training and inference, and how our work on provenance recovery creates opportunities for research into the social structures of the underground malware economy.

Nathan Rosenblum is a doctoral candidate in the Computer Sciences department at the University of Wisconsin-Madison, under the supervision of Barton Miller. His research interests include systems, security, program analysis and machine learning, particularly when these areas collide. Nathan's current work focuses on discovering characteristics of programmer style in executable machine code. He sometimes remembers fondly the world outside of his office.

Tags: Google Tech Talks, software theft, software security, Machine Learning, Google, GoogleTechTalks