Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale at NIPS 2011

Invited Talk: Machine Learning and Hadoop by Josh Wills

Abstract: We'll review common use cases for machine learning and advanced analytics found in our customer base at Cloudera and ways in which Apache Hadoop supports these use cases. We'll then discuss upcoming developments for Apache Hadoop that will enable new classes of applications to be supported by the system.

Invited Talk: Big Machine Learning Made Easy by Miguel Araujo

Miguel Araujo holds B.S. and M.S. degrees in computer science from Universidad Antonio de Nebrija and San Diego State University. He is a machine learning addict and an active open-source hacker who enjoys coding in Python. Miguel is a contributor to open-source projects such as django-rules, django-crispy-forms, and requests-oauth.

Abstract: While machine learning has made its way into certain industrial applications, there are many important real-world domains, especially domains with large-scale data, that remain unexplored. There are a number of reasons for this, and they occur at all places in the technology stack.

Invited Talk: A Common GPU n-Dimensional Array for Python and C by Arnaud Bergeron

Abstract: Currently there are multiple incompatible array/matrix/n-dimensional base object implementations for GPUs. This hinders the sharing of GPU code and causes duplicate development work. This paper proposes and presents a first version of a common GPU n-dimensional array (tensor) named GpuNdArray that works with both CUDA and OpenCL. It will be usable from Python, C, and possibly other languages.

Invited Talk: Bootstrapping Big Data by Ariel Kleiner

Abstract: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators. BLB shares the generic applicability and statistical efficiency of the bootstrap and is furthermore well suited for application to very large datasets using modern distributed computing architectures, as it uses only small subsets of the observed data at any point during its execution. We provide both empirical and theoretical results which demonstrate the efficacy of BLB.
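The weighting trick at the heart of BLB can be sketched in a few lines of NumPy: each "little bootstrap" draws multinomial counts that sum to n over a subset of only b = n^0.6 points, so a full-size bootstrap sample is simulated without ever materializing n points. The function name, defaults, and the choice of the sample mean as the estimator are illustrative, not the authors' implementation:

```python
import numpy as np

def blb_mean_stderr(data, gamma=0.6, num_subsets=5, num_resamples=50, seed=0):
    """Bag of Little Bootstraps sketch: estimate the standard error of
    the sample mean.  Each subset has only b = n**gamma points, but
    every resample is weighted to behave like a size-n bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)
    subset_estimates = []
    for _ in range(num_subsets):
        subset = rng.choice(data, size=b, replace=False)
        stats = []
        for _ in range(num_resamples):
            # Multinomial counts sum to n, so we never materialize n points.
            counts = rng.multinomial(n, np.ones(b) / b)
            stats.append(np.dot(counts, subset) / n)   # weighted mean
        subset_estimates.append(np.std(stats))
    return float(np.mean(subset_estimates))

data = np.random.default_rng(1).normal(size=10_000)
print(blb_mean_stderr(data))   # close to 1/sqrt(10000) = 0.01
```

Because each subset touches only b distinct points, a distributed worker needs just a small fraction of the data, which is exactly the property the abstract highlights.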

Invited Talk: NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision by Yann LeCun (with Clement Farabet)

Yann LeCun is Silver Professor of Computer Science and Neural Science at the Courant Institute of Mathematical Sciences and the Center for Neural Science of New York University. His current interests include machine learning, computer perception and vision, mobile robotics, and computational neuroscience.

In 2010, Clement Farabet started the PhD program at Universite Paris-Est, with Professors Michel Couprie and Laurent Najman, in parallel with his research work at Yale and NYU. His research interests include intelligent hardware, embedded super-computers, computer vision, machine learning, embedded robotics, and more broadly artificial intelligence.

Abstract: We present a scalable hardware architecture to implement general-purpose systems based on convolutional networks. We will first review some of the latest advances in convolutional networks, their applications and the theory behind them, then present our dataflow processor, a highly-optimized architecture for large vector transforms, which represent 99% of the computations in convolutional networks. It was designed with the goal of providing a high-throughput engine for highly-redundant operations, while consuming little power and remaining completely runtime reprogrammable. We present performance comparisons between software versions of our system executing on CPU and GPU machines, and show that our FPGA implementation can outperform these standard computing platforms.

Invited Talk: Parallelizing Training of the Kinect Body Parts Labeling Algorithm by Derek Murray

Abstract: We present the parallelized implementation of decision forest training as used in Kinect to train the body parts classification system. We describe the practical details of dealing with large training sets and deep trees, and describe how to parallelize over multiple dimensions of the problem.
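As a rough illustration of the unit of work being parallelized, here is a toy best-split search by weighted Gini impurity, fanned out over features with a thread pool. The function and data are hypothetical stand-ins, not the Kinect training pipeline, which also parallelizes over candidate thresholds, training images, and tree nodes:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(X, y, j):
    """Find the best threshold on feature j by weighted Gini impurity.
    Each call is independent, so calls can run in parallel."""
    best = (np.inf, None)
    for t in np.unique(X[:, j])[:-1]:          # drop the max: no empty side
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        gini = sum(len(s) * (1 - sum(np.mean(s == c) ** 2 for c in (0, 1)))
                   for s in (left, right)) / len(y)
        best = min(best, (gini, t))
    return (*best, j)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = (X[:, 1] > 0.5).astype(int)                # only feature 1 matters

with ThreadPoolExecutor() as pool:
    results = pool.map(lambda j: best_split_for_feature(X, y, j), range(3))
gini, threshold, feature = min(results)
print(feature, round(threshold, 2))            # feature 1, threshold near 0.5
```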

Invited Talk: Towards Human Behavior Understanding from Pervasive Data: Opportunities and Challenges Ahead by Nuria Oliver

Nuria Oliver is currently the Scientific Director for the Multimedia, HCI and Data Mining & User Modeling Research Areas in Telefonica Research (Barcelona, Spain). Her research interests include mobile computing, multimedia data analysis, search and retrieval, smart environments, context awareness, statistical machine learning and data mining, artificial intelligence, health monitoring, social network analysis, computational social sciences, and human computer interaction. She is currently drawing on these disciplines to build human-centric intelligent systems.

Abstract: We live in an increasingly digitized world where our -- physical and digital -- interactions leave digital footprints. It is through the analysis of these digital footprints that we can learn and model some of the many facets that characterize people, including their tastes, personalities, social network interactions, and mobility and communication patterns. In my talk, I will present a summary of our research efforts on transforming these massive amounts of user behavioral data into meaningful insights, where machine learning and data mining techniques play a central role. The projects that I will describe cover a broad set of areas, including smart cities and urban computing, psychographics, socioeconomic status prediction and disease propagation. For each of the projects, I will highlight the main results and point at technical challenges still to be solved from a data analysis perspective.

Invited Talk: High-Performance Computing Needs Machine Learning...and Vice Versa by Nicolas Pinto

Abstract: Large-scale parallelism is a common feature of many neuro-inspired algorithms. In this short paper, we present a practical tutorial on ways that metaprogramming techniques -- dynamically generating specialized code at runtime and compiling it just-in-time -- can be used to greatly accelerate a large data-parallel algorithm. We use filter-bank convolution, a key component of many neural networks for vision, as a case study to illustrate these techniques. We present an overview of several key themes in template metaprogramming, and culminate in a full example of GPU auto-tuning in which an instrumented GPU kernel template is built and the space of all possible instantiations of this kernel is automatically grid-searched to find the best implementation on various hardware/software platforms. We show that this method can, in concert with traditional hand-tuning techniques, achieve significant speed-ups, particularly when a kernel will be run on a variety of hardware platforms.
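The paper targets GPU kernel templates, but the metaprogram-then-autotune loop can be mimicked in pure Python: generate specialized source code at runtime (here, a dot product with a parameterized unroll factor), compile each variant, and grid-search the template parameter for the fastest instantiation on the current machine. All names here are illustrative:

```python
import timeit

def make_dot(unroll):
    """Metaprogramming step: emit specialized source for a dot product
    with the inner loop unrolled `unroll` times, then compile it."""
    lines = ["def dot(x, y):",
             "    acc = 0.0",
             f"    n = {unroll} * (len(x) // {unroll})",
             f"    for i in range(0, n, {unroll}):"]
    lines += [f"        acc += x[i + {k}] * y[i + {k}]" for k in range(unroll)]
    lines += ["    for i in range(n, len(x)):   # remainder loop",
              "        acc += x[i] * y[i]",
              "    return acc"]
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["dot"]

# Auto-tuning step: grid-search the template parameter and keep the
# fastest instantiation (the best unroll factor varies by platform).
x = y = list(range(1000))
candidates = {u: make_dot(u) for u in (1, 2, 4, 8)}
best = min(candidates,
           key=lambda u: timeit.timeit(lambda: candidates[u](x, y), number=50))
print("best unroll:", best, "dot:", candidates[best](x, y))
```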

Invited Talk: Fast Cross-Validation via Sequential Analysis by Tammo Kruger

Abstract: With the increasing size of today's data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.
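The elimination loop can be sketched in a few lines, with a simple rank-based cut standing in for the paper's non-parametric sequential tests (the function names, the elimination rule, and the toy loss below are all illustrative):

```python
def race_parameters(candidates, loss_on_subset, subset_sizes, keep_fraction=0.5):
    """Evaluate all remaining candidate configurations on linearly growing
    data subsets and drop the underperforming half after each round.
    `loss_on_subset(c, m)` returns the validation loss of candidate c
    trained on the first m examples."""
    remaining = list(candidates)
    for m in subset_sizes:
        # Rank the remaining configurations on the first m examples ...
        ranked = sorted(remaining, key=lambda c: loss_on_subset(c, m))
        # ... and eliminate the worst performers early, as in the paper.
        remaining = ranked[:max(1, int(len(ranked) * keep_fraction))]
        if len(remaining) == 1:
            break
    return remaining[0]

# Toy check: the loss is minimized at parameter value 3, with a noise
# term that shrinks as the subset grows.
best = race_parameters(
    candidates=[0, 1, 2, 3, 4],
    loss_on_subset=lambda c, m: abs(c - 3) + 1.0 / m,
    subset_sizes=[100, 200, 300],
)
print(best)  # → 3
```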

Invited Talk: Machine Learning's Role in the Search for Fundamental Particles by Daniel Whiteson

Daniel Whiteson is an Associate Professor in the Department of Physics & Astronomy at UC Irvine. His research area is experimental particle physics, using data from the world's most powerful colliders to answer questions about the fundamental nature of matter and interactions at the smallest scales. He has a long-standing interest in machine learning and has collaborated with machine learning researchers to apply new ideas to the problems of particle physics.

Abstract: High-energy physicists try to decompose matter into its most fundamental pieces by colliding particles at extreme energies. But to extract clues about the structure of matter from these collisions is not a trivial task, due to the incomplete data we can gather regarding the collisions, the subtlety of the signals we seek and the large rate and dimensionality of the data. These challenges are not unique to high energy physics, and there is the potential for great progress in collaboration between high energy physicists and machine learning experts. I will describe the nature of the physics problem, the challenges we face in analyzing the data, the previous successes and failures of some ML techniques, and the open challenges.

Invited Talk: Randomized Smoothing for (Parallel) Stochastic Optimization by John Duchi

John Duchi is currently a PhD candidate in computer science at Berkeley, where he started in the fall of 2008. He works in the Statistical Artificial Intelligence Lab (SAIL) under the joint supervision of Mike Jordan and Martin Wainwright. John is currently supported by an NDSEG fellowship; starting next year, he will be supported by a Facebook Fellowship.

Abstract: By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for non-smooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to statistical machine learning problems, providing experimental results demonstrating the effectiveness of our algorithms.
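The core device can be shown in a few lines: replace a non-smooth objective f by its Gaussian smoothing f_mu(x) = E[f(x + mu*Z)], whose gradient can be estimated by averaging subgradients at randomly perturbed points. This toy omits the acceleration and the parallel/decentralized machinery the abstract refers to, and all names and constants are illustrative:

```python
import numpy as np

def smoothed_grad(subgrad, x, mu=0.1, num_samples=200, rng=None):
    """Monte Carlo gradient of the Gaussian smoothing
    f_mu(x) = E[f(x + mu * Z)], Z ~ N(0, I): average subgradients of f
    at perturbed points.  (Seeded here, so the estimate is deterministic.)"""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal((num_samples, x.size))
    return np.mean([subgrad(x + mu * zi) for zi in z], axis=0)

# Drive the non-smooth f(x) = ||x||_1 toward its minimum with plain
# (unaccelerated) gradient steps on the smoothed surrogate.
subgrad = np.sign                      # a valid subgradient of the l1 norm
x = np.array([2.0, -1.5])
for _ in range(100):
    x = x - 0.05 * smoothed_grad(subgrad, x)
print(x)                               # both coordinates driven near 0
```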

Invited Talk: Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent by Rainer Gemulla

Rainer Gemulla graduated from the Technische Universität Dresden in Germany in the area of database sampling. He is currently working as a senior researcher at the Max-Planck-Institut für Informatik in Saarbrücken, Germany.

Abstract: We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. Based on a novel "stratified" variant of SGD, we obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations and has good scalability properties.
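The per-entry updates inside DSGD are ordinary SGD steps on the squared reconstruction error. A serial sketch is below (illustrative hyperparameters, not the distributed algorithm), with a comment marking the stratification property that makes distribution possible:

```python
import numpy as np

def sgd_factorize(entries, num_rows, num_cols, rank=2, lr=0.05,
                  reg=0.01, epochs=200, seed=0):
    """SGD matrix factorization: minimize the squared error of
    V ~= W @ H over the observed (i, j, v) entries.  DSGD's insight is
    that entries from disjoint row/column blocks ("strata") touch
    disjoint parts of W and H, so such blocks can be processed in
    parallel; here we simply run the updates serially."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((num_rows, rank))
    H = 0.1 * rng.standard_normal((rank, num_cols))
    for _ in range(epochs):
        for i, j, v in entries:
            w_i = W[i].copy()
            err = v - w_i @ H[:, j]
            W[i] += lr * (err * H[:, j] - reg * w_i)
            H[:, j] += lr * (err * w_i - reg * H[:, j])
    return W, H

# Recover a small rank-1 matrix from its observed entries.
V = np.outer([1.0, 2.0, 3.0], [1.0, 0.5, 2.0])
entries = [(i, j, V[i, j]) for i in range(3) for j in range(3)]
W, H = sgd_factorize(entries, 3, 3)
print(np.abs(W @ H - V).max())        # small reconstruction error
```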

Tutorial: GraphLab 2.0 by Yucheng Low

Yucheng Low is a third-year Ph.D. student in the Machine Learning Department at Carnegie Mellon University, advised by Carlos Guestrin. His current work is on abstractions for large-scale parallel/distributed machine learning.

Invited Talk: GraphLab 2: The Challenges of Large-Scale Computation on Natural Graphs by Carlos Guestrin

Carlos Guestrin is an Assistant Professor at Carnegie Mellon's Computer Science and Machine Learning Departments. Carlos has conducted research in (1) learning and control in large-scale structured environments; (2) distributed multiagent coordination; (3) robust, efficient and resource-aware algorithms for sensor networks; (4) sensor placement and tasking; (5) query specific probabilistic modeling.

Abstract: Two years ago we introduced GraphLab to address the critical need for a high-level abstraction for large-scale graph structured computation in machine learning. Since then, we have implemented the abstraction on multicore and cloud systems, evaluated its performance on a wide range of applications, developed new ML algorithms, and fostered a growing community of users. Along the way, we have identified new challenges to the abstraction, our implementation, and the important task of fostering a community around a research project. However, one of the most interesting and important challenges we have encountered is large-scale distributed computation on natural power law graphs. To address the unique challenges posed by natural graphs, we introduce GraphLab 2, a fundamental redesign of the GraphLab abstraction which provides a much richer computational framework. In this talk, we will describe the GraphLab 2 abstraction in the context of recent progress in graph computation frameworks (e.g., Pregel/Giraph). We will review some of the special challenges associated with distributed computation on large natural graphs and demonstrate how GraphLab 2 addresses these challenges. Finally, we will conclude with some preliminary results from GraphLab 2 as well as a live demo. This talk represents joint work with Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Alex Smola, and Joseph Hellerstein.
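GraphLab 2's redesign factors each vertex update into gather, apply, and scatter phases, so the work of a single high-degree vertex in a power-law graph can itself be split across machines. Below is a toy, serial PageRank written in that three-phase style; it illustrates the decomposition, not GraphLab's API, and it assumes every vertex has at least one out-edge:

```python
def pagerank_gas(edges, num_vertices, damping=0.85, iters=30):
    """PageRank in gather-apply-scatter form (serial toy version)."""
    out_deg = [0] * num_vertices
    in_nbrs = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        out_deg[src] += 1
        in_nbrs[dst].append(src)
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iters):
        new_rank = []
        for v in range(num_vertices):
            # Gather: an associative sum over in-neighbors, which is why
            # even one huge vertex can be processed in parallel pieces.
            total = sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            # Apply: update the vertex value from the gathered result.
            new_rank.append((1 - damping) / num_vertices + damping * total)
        rank = new_rank   # Scatter would push new values to neighbors.
    return rank

edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
ranks = pagerank_gas(edges, 3)
print([round(r, 3) for r in ranks])
```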

Invited Talk: The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo by Matt Hoffman

Matt Hoffman is a postdoc working with Prof. Andrew Gelman at Columbia University. He did his Ph.D. in Computer Science at Princeton University, working in the Sound Lab with Prof. Perry Cook and Prof. David Blei. Matt's research focuses on developing efficient Bayesian inference algorithms and on Bayesian modeling of audio, audio feature extraction, music information retrieval, and the application of music information retrieval and modeling techniques to musical synthesis.

Abstract: Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlations that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than popular methods such as random walk Metropolis or Gibbs sampling. However, HMC's performance is highly sensitive to two user-specified parameters: a step size $\epsilon$ and a desired number of steps $L$. In particular, if $L$ is too small then the algorithm exhibits undesirable random walk behavior, while if $L$ is too large the algorithm wastes computation. We present the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps $L$. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. NUTS is able to achieve similar performance to a well-tuned standard HMC method, without requiring user intervention or costly tuning runs. NUTS can thus be used in applications such as BUGS-style automatic inference engines that require efficient "turnkey" sampling algorithms.
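For context, here is a minimal leapfrog HMC step showing the two parameters the abstract describes. NUTS itself (the recursive trajectory-building) is not reproduced here, and all names and settings are illustrative:

```python
import numpy as np

def hmc_step(x, logp_grad, logp, eps=0.2, L=10, rng=None):
    """One HMC step with a leapfrog integrator.  `eps` and `L` are
    exactly the two knobs the abstract discusses; NUTS removes `L`."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(x.shape)               # fresh momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * eps * logp_grad(x_new)          # half step for momentum
    for _ in range(L - 1):
        x_new += eps * p_new                       # full step for position
        p_new += eps * logp_grad(x_new)            # full step for momentum
    x_new += eps * p_new
    p_new += 0.5 * eps * logp_grad(x_new)          # final half step
    # Metropolis accept/reject on the change in total energy.
    log_accept = (logp(x_new) - 0.5 * p_new @ p_new) - (logp(x) - 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < log_accept else x

# Sample a 2-D standard normal: logp(x) = -x@x/2, grad logp(x) = -x.
rng = np.random.default_rng(0)
x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x, lambda x: -x, lambda x: -0.5 * x @ x, rng=rng)
    samples.append(x.copy())
samples = np.asarray(samples)[500:]                # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))   # near [0, 0] and [1, 1]
```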

Invited Talk: Block Splitting for Large-Scale Distributed Learning by Neal Parikh

Neal Parikh is a Ph.D. Candidate in the Department of Computer Science at Stanford University.

Abstract: Machine learning and statistics with very large datasets is now a topic of widespread interest, both in academia and industry. Many such tasks can be posed as convex optimization problems, so algorithms for distributed convex optimization serve as a powerful, general-purpose mechanism for training a wide class of models on datasets too large to process on a single machine. In previous work, it has been shown how to solve such problems in such a way that each machine only looks at either a subset of training examples or a subset of features. In this paper, we extend these algorithms by showing how to split problems by both examples and features simultaneously, which is necessary to deal with datasets that are very large in both dimensions. We present some experiments with these algorithms run on Amazon's Elastic Compute Cloud.

Invited Talk: Hazy: Making Data-driven Statistical Applications Easier to Build and Maintain by Chris Ré

Christopher (Chris) Ré is currently an assistant professor in the Department of Computer Sciences at the University of Wisconsin-Madison. The goal of his work is to enable users and developers to build applications that more deeply understand data. In many applications, machines can only understand the meaning of data statistically, e.g., user-generated text or data from sensors.

Abstract: The main question driving my group's research is: how does one deploy statistical data-analysis tools to enhance data-driven systems? Our goal is to find the abstractions that one needs to deploy and maintain such systems. In this talk, I describe my group's attack on this question by building a diverse set of statistics-based, data-driven applications: a system whose goal is to read the Web and answer complex questions, a muon detector in collaboration with a neutrino telescope called IceCube, and social-science applications involving rich content (OCR and speech data). Even in this diverse set, my group has found common abstractions that we are exploiting to build and to maintain systems. Of particular relevance to this workshop is that I have heard of applications in each of these domains referred to as "big data." Nevertheless, in our experience in each of these tasks, after appropriate preprocessing, the relevant data can be stored in a few terabytes -- small enough to fit entirely in RAM or on a handful of disks. As a result, it is unclear to me that scale is the most pressing concern for academics. I argue that dealing with data at TB scale is still challenging, useful, and fun, and I will describe some of our work in this direction. This is joint work with Benjamin Recht, Stephen J. Wright, and the Hazy Team.

Invited Talk: Real-Time Data Sketches by Alex Smola

Alex Smola is a Principal Researcher at Yahoo. His current research focus is on nonparametric methods for estimation, in particular kernel methods and exponential families. This includes support vector machines, Gaussian processes, and conditional random fields.

Abstract: I will describe a set of algorithms for extending streaming and sketching algorithms to real-time analytics. These algorithms capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the Count-Min sketch as its basis and exploits the fact that the sketching operation is linear. It provides real-time statistics of arbitrary events, e.g., streams of queries as a function of time. In particular, we use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. The service runs in real time and scales perfectly in terms of throughput and accuracy, using distributed hashing. The latter also provides performance guarantees in the case of machine failure. Queries can be answered in constant time regardless of the amount of data to be processed. The same distribution techniques can also be used for heavy-hitter detection in a distributed, scalable fashion.
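A compact sketch of the two properties the abstract leans on: the Count-Min sketch's over-estimating point queries, and its linearity, which lets sketches built over different time slices simply be summed. The hash scheme below is a simple stand-in for proper pairwise-independent hashing, and the class is illustrative, not the talk's implementation:

```python
import numpy as np

class CountMinSketch:
    """Count-Min sketch: `depth` hash rows of `width` counters each.
    An update touches one counter per row; a point query takes the
    minimum over rows, so it can only over-estimate a count."""
    def __init__(self, width=256, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((depth, width), dtype=np.int64)
        # Random odd multipliers: a toy stand-in for pairwise-independent hashes.
        self.salts = rng.integers(1, 2**31, size=depth) | 1
        self.width = width
        self.rows = np.arange(depth)

    def _cols(self, item):
        h = hash(item) % (2**31)
        return (self.salts * h) % self.width

    def update(self, item, count=1):
        self.table[self.rows, self._cols(item)] += count

    def query(self, item):
        return int(self.table[self.rows, self._cols(item)].min())

    def __add__(self, other):
        merged = CountMinSketch(self.width, len(self.salts))
        merged.salts = self.salts        # merging assumes identical hashes
        merged.table = self.table + other.table
        return merged

morning, evening = CountMinSketch(), CountMinSketch()
for _ in range(40):
    morning.update("nips")
evening.update("nips", count=2)
day = morning + evening                  # linearity: just add the tables
print(day.query("nips"))  # → 42
```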

Invited Talk: Spark: In-Memory Cluster Computing for Iterative and Interactive Applications by Matei Zaharia

Matei Zaharia is a fifth year graduate student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in cloud computing, operating systems and networking. He is also a committer on Apache Hadoop. He is funded by a Google PhD fellowship. Before joining Berkeley, Matei got his undergraduate degree at the University of Waterloo in Canada.

Abstract: MapReduce and its variants have been highly successful in supporting large-scale data-intensive cluster applications. However, these systems are inefficient for applications that share data among multiple computation stages, including many machine learning algorithms, because they are based on an acyclic data flow model. We present Spark, a new cluster computing framework that extends the data flow model with a set of in-memory storage abstractions to efficiently support these applications. Spark outperforms Hadoop by up to 30x in iterative machine learning algorithms while retaining MapReduce's scalability and fault tolerance. In addition, Spark makes programming jobs easy by integrating into the Scala programming language. Finally, Spark's ability to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big data. We have modified the Scala interpreter to make it possible to use Spark interactively as a highly responsive data analytics tool.

At Berkeley, we have used Spark to implement several large-scale machine learning applications, including a Twitter spam classifier and a real-time automobile traffic estimation system based on expectation maximization. We will present lessons learned from these applications and optimizations we added to Spark as a result.
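The access pattern that motivates Spark can be seen in miniature below: an iterative logistic-regression loop scans the same training set on every pass, so parsing ("loading") it once and keeping it in memory pays off across all iterations, whereas an acyclic MapReduce-style job would repay the I/O each time. This is plain Python illustrating the pattern, not Spark's actual RDD API; the data and constants are illustrative:

```python
import numpy as np

def parse(line):
    *features, label = map(float, line.split(","))
    return np.array(features), label

# Simulated "file on disk": 200 comma-separated lines of x, 2x, label.
raw_lines = [f"{x:.3f},{2 * x:.3f},{1.0 if x > 0 else 0.0}"
             for x in np.linspace(-1, 1, 200)]

cached = [parse(line) for line in raw_lines]    # parse ONCE, keep in memory

w = np.zeros(2)
for _ in range(100):                            # 100 passes, no re-parsing
    grad = np.zeros(2)
    for features, label in cached:
        pred = 1.0 / (1.0 + np.exp(-w @ features))
        grad += (pred - label) * features
    w -= 0.5 * grad / len(cached)
print(w)                                        # weights separating x > 0
```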
