Discovery of frequent patterns means finding positive conjunctions that are true for a given fraction of the observations. This basic idea can be instantiated in many ways:

- finding frequent sets from 0/1 data (association mining)
- finding frequent episodes in sequences
- finding frequent subgraphs in graphs, etc.

Efficient algorithms exist (the levelwise approach), but their theoretical analysis is not trivial and leads to connections with hypergraph transversals and related topics. The second part asks how the patterns can be used:

- they are sometimes interesting in themselves
- they can be used to approximate the joint distribution (maximum entropy approaches)
- information from several patterns can be combined
- patterns can be ordered
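
The levelwise idea can be sketched in a few lines. Below is a minimal, illustrative Apriori-style miner for frequent sets from 0/1 data (a toy sketch, not Mannila's implementation; the transactions and support threshold in the usage example are invented):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise search for frequent itemsets: a k-set can only be
    frequent if every one of its (k-1)-subsets is frequent
    (anti-monotonicity of support).  Transactions are sets of items."""
    n = len(transactions)
    # Level 1: frequent single items.
    freq = {}
    for item in {i for t in transactions for i in t}:
        count = sum(1 for t in transactions if item in t)
        if count / n >= min_support:
            freq[frozenset([item])] = count
    result = dict(freq)
    k = 2
    while freq:
        # Candidate generation: join frequent (k-1)-sets, prune any
        # candidate with an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(freq, 2):
            u = a | b
            if len(u) == k and all(frozenset(s) in freq
                                   for s in combinations(u, k - 1)):
                candidates.add(u)
        freq = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count / n >= min_support:
                freq[cand] = count
        result.update(freq)
        k += 1
    return result
```

For example, `apriori([{'a','b','c'}, {'a','b'}, {'a','c'}, {'b','c'}, {'a','b','c'}], 0.6)` keeps all singletons and pairs (support 3/5 or more) but prunes `{'a','b','c'}`, which occurs only twice.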

*Author:
Heikki Mannila,
Helsinki University Of Technology*

Want more on these topics?

Browse the archive of posts filed under Science

"Patterns in sets of points: an overview": "We illustrate the importance of optimization principles in the search for interesting patterns, in particular for patterns in sets of points embedded in a metric space. This talk will be a journey along the types of patterns in point sets that can efficiently be searched for, and general principles will be outlined. We provide examples from dimensionality reduction, classification, clustering, and others. The emphasis will be on patterns that can be expressed in terms of linear functions of the data."

*Author:
Tijl De Bie,
KU Leuven*

Lecture slides:

- The Analysis of Patterns
- The Value of Patterns
- Patterns Help Us in Many Ways…
- Benefits of Detecting Patterns
- Patterns and Intelligence
- The Instinct for Patterns
- Patterns and Randomness
- Patterns and Randomness
- Visualizing Patterns
- Finding Patterns
- Computational Pattern Finding
- The Analysis of Patterns
- The Analysis of Patterns
- TITLE
- Another Gold Rush
- 2001
- The Analysis of Patterns
- The Analysis of Patterns
- The Analysis of Patterns
- The Analysis of Patterns
- The Analysis of Patterns
- Searching for Patterns
- Statistics
- What Are Patterns?
- Gregory Chaitin: "Patterns, Randomness and Information"
- Gregory Chaitin
- Patterns in Sets of Points (Vectors)
- Tijl De Bie: Patterns in Sets of Points
- Patterns in Sequences
- Suffix Tree and Hidden Markov techniques for pattern analysis
- Dan Gusfield Trees, Arrays, Networks and Optimization for Finding Patterns in Biological Sequences
- Raffaele Giancarlo Patterns and Compression
- Conceptual Foundations
- Kernel Methods
- Bernhard Schoelkopf Kernel Methods
- Patterns in Sets
- Heikki Mannila: Finding frequent patterns
- When can we trust the patterns we found?
- John Shawe-Taylor Statistical Aspects of Pattern Analysis
- Nicolò Cesa-Bianchi On-line linear learning algorithms
- Grammatical Inference
- Colin de la Higuera: "Grammatical Inference, a Tutorial"
- Patterns in Graphs
- TITLE

*Author:
Nello Cristianini,
University Of Bristol*

Lecture slides:

- Monotony and Surprise: Algorithmic and Combinatorial Foundations of Pattern Discovery
- http://www.cc.gatech.edu/~axa/papers A) Speciali
- http://www.cc.gatech.edu/~axa/papers B) Introductory Material
- Acknowledgements Gill Bejerano
- Form = Function
- TITLE
- Bioinformatics the Road Ahead
- At a joint EU-US panel meeting
- Which Information Anyway
- King Phillip Came Over For Green Soup
- The “Chinese” Taxonomy
- Summary
- Defining “Class”
- Class by Intension
- Statistical Classification
- Statistical Classification
- Statistical Classification
- Inferring Grammars
- Regular, Anomalous, Entropy, Negentropy
- Random, Regular, Compressible
- Random, Regular, Compressible
- Form = Function
- Summary
- Privileging Syntactic Regularities in Strings
- Unavoidable Regularities
- Avoidable Regularities
- Periods cannot coexist too long
- Avoidable Regularities
- Squares or Tandem Repeats
- Detecting Squares
- Tandem Repeats, Repeated Episodes
- Pattern Discovery in WAKA
- Discovering instances of poetic allusion from anthologies of classical Japanese poems
- Cheating by Schoolteachers (the longest substring common to k of n strings)
- Summary
- General Form of Pattern Discovery
- Data Compression by Textual Substitution
- Consumer Prediction (Data Mining), Intrusion Detection (Security), Protein Classification (Bio-Informatics)
- Of Exactitude in Science

*Author:
Alberto Apostolico,
University Of Padova*

Suffix tree construction, with a mention of the new linear-time array constructions. Topics:

- using suffix trees for finding motifs with gaps (some new observations; 0.5-1 hours)
- finding cis-regulatory motifs by comparative genomics (1 hour)
- Hidden Markov techniques for haplotyping
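
As a toy illustration of the gapped-motif idea (a naive stand-in only: a real implementation would use the suffix-tree machinery the lectures cover, and the sequence and parameters below are invented):

```python
from collections import Counter

def frequent_gapped_motifs(seq, k, gap, min_count):
    """Count motifs of the form  <k-mer> .. <gap positions> .. <k-mer>
    and keep those occurring at least min_count times.  This naive scan
    is quadratic in the worst case; suffix trees make the exact-match
    part linear-time."""
    counts = Counter()
    span = 2 * k + gap
    for i in range(len(seq) - span + 1):
        left = seq[i:i + k]
        right = seq[i + k + gap:i + span]
        counts[(left, right)] += 1
    return {m: c for m, c in counts.items() if c >= min_count}
```

On the made-up sequence `"ACGTTACGTTACGTT"` with `k=2, gap=1`, the gapped motif `("AC", "TT")` occurs three times and is the only one surviving `min_count=3`.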

*Author:
Esko Ukkonen,
University Of Helsinki*

Abstract: The lectures will introduce the role of statistics in pattern analysis with a discussion of the difference between pattern significance and pattern stability. We will go on to discuss composite hypothesis testing and the Bonferroni correction. Concentration inequalities will be introduced and used to assess the statistical reliability of empirical estimates. We move to consider uniform convergence in order to analyse pattern stability. Rademacher complexity will be discussed as a theoretical tool for the bounding of uniform convergence.
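
The two statistical tools named above are easy to state concretely. The sketch below (illustrative only, not lecture material) applies the Bonferroni correction for simultaneous hypothesis testing and the Hoeffding concentration inequality for bounding deviations of empirical means of [0, 1]-valued observations:

```python
import math

def hoeffding_bound(n, epsilon):
    """Hoeffding: for n i.i.d. observations bounded in [0, 1],
    P(|empirical mean - true mean| >= epsilon) <= 2 exp(-2 n epsilon^2)."""
    return 2 * math.exp(-2 * n * epsilon ** 2)

def bonferroni_reject(p_values, alpha=0.05):
    """Composite testing: with m simultaneous hypotheses, compare each
    p-value against alpha / m to keep the family-wise error at alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]
```

With n = 1000 samples, deviations of 0.05 have probability at most 2e^-5 (about 1.3%); and of the p-values [0.001, 0.02, 0.04], only the first survives a Bonferroni-corrected threshold of 0.05/3.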

*Author:
John Shawe-Taylor,
University Of London*

We focus on large graphs whose nodes have attributes, such as a social network where the nodes are labelled with each person's job title. In such a setting, we want to find subgraphs that match a user query pattern. For example, a 'star' query would be, "find a CEO who has strong interactions with a Manager, a Lawyer, and an Accountant, or another structure as close to that as possible". Similarly, a 'loop' query could help spot a money-laundering ring. Traditional SQL-based methods, as well as more recent graph indexing methods, return no answer when an exact match does not exist. Our method finds exact as well as near matches, and it presents them to the user in our proposed 'goodness' order. For example, our method tolerates indirect paths between, say, the 'CEO' and the 'Accountant' of the above sample query when direct paths do not exist. Its second feature is scalability. In general, if the query has nq nodes and the data graph has n nodes, the problem naively requires time O(n^nq), which is prohibitive. Our G-Ray ("Graph X-Ray") method finds high-quality subgraphs in time linear in the size of the data graph. Experimental results on the DBLP author-publication graph (with 356K nodes and 1.9M edges) illustrate both the effectiveness and scalability of our approach. The results agree with our intuition, and the speed is excellent: it takes 4 seconds on average for a 4-node query on the DBLP graph.
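
The 'goodness' ordering can be illustrated with a toy stand-in (this is not the G-Ray algorithm itself, just a BFS-based scorer that, like G-Ray, tolerates indirect paths by summing hop distances from each candidate center to the nearest node of each requested leaf label; the graph and labels are invented):

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def star_query(adj, labels, center_label, leaf_labels):
    """Rank candidate centers by total hop distance to the nearest node
    carrying each requested leaf label (smaller total = better 'goodness').
    An exact star match scores len(leaf_labels); indirect paths cost more
    but are still returned instead of an empty answer."""
    ranked = []
    for c, lab in labels.items():
        if lab != center_label:
            continue
        dist = bfs_dist(adj, c)
        total = 0
        for leaf in leaf_labels:
            ds = [dist[v] for v, l in labels.items() if l == leaf and v in dist]
            if not ds:           # some required label is unreachable
                total = None
                break
            total += min(ds)
        if total is not None:
            ranked.append((total, c))
    return sorted(ranked)
```

On a small made-up graph with two CEOs, the CEO adjacent to a Manager and a Lawyer (with the Accountant two hops away via the Lawyer) ranks ahead of the CEO who can only reach all three labels through long detours.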

*Author:
Hanghang Tong,
CMU*

Information, complexity, patterns, randomness and compression, and how these ideas can be traced back through Hermann Weyl to Leibniz in 1686, connecting them with Gödel and Turing and with the question of how mathematics compares and contrasts with physics and with biology.
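
A hands-on version of the patterns-vs-randomness theme: a string is "patterned" exactly to the extent that it can be compressed. The sketch below uses zlib as a crude, computable stand-in for Kolmogorov complexity (which is uncomputable):

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size over original size.  Regular (patterned) data
    compresses far below 1.0; algorithmically random data does not
    compress at all (zlib even adds a little overhead)."""
    return len(zlib.compress(data, 9)) / len(data)
```

A highly regular string such as `b"ab" * 5000` shrinks to a tiny fraction of its size, while `os.urandom(10000)` stays essentially incompressible.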

*Author:
Gregory Chaitin,
University Of Auckland*

An ideal segmentation algorithm could be applied equally to the problem of isolating organs in a medical volume or to editing a digital photograph without modifying the algorithm, changing parameters, or sacrificing segmentation quality. However, a general-purpose, multiway segmentation of objects in an image/volume remains a challenging problem. In this talk, I will describe a recently developed approach to this problem that inputs a few training points from a user (e.g., from mouse clicks) and produces a segmentation by computing the probabilities that a random walker leaving unlabeled pixels/voxels will first strike the training set. By exact mathematical equivalence with a problem from potential theory, these probabilities may be computed analytically and deterministically. The algorithm is developed on an arbitrary, weighted, graph/mesh in order to maximize the broadness of application. I will illustrate the use of this approach with examples from several segmentation problems (without modifying the algorithm or the single free parameter), compare this algorithm to other approaches and discuss the theoretical properties that describe its behavior.
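
A minimal sketch of the underlying computation, using the potential-theory equivalence described above: the first-strike probabilities form a harmonic function that is fixed at the seeds and equals the average of its neighbours everywhere else. On a toy unweighted graph this can be solved by plain Gauss-Seidel iteration (illustrative only; the actual algorithm solves a sparse linear system on a weighted image graph):

```python
def random_walker_probs(adj, seeds, iters=2000):
    """Probability that a random walk started at each node first strikes
    a foreground seed (value 1.0) before a background seed (value 0.0).
    The solution is the harmonic function with seed values clamped:
    sweep 'replace by the average of your neighbours' to convergence."""
    p = {u: seeds.get(u, 0.5) for u in adj}
    for _ in range(iters):
        for u in adj:
            if u not in seeds:
                p[u] = sum(p[v] for v in adj[u]) / len(adj[u])
    return p
```

On a 5-node path with seeds at the two ends, the probabilities interpolate linearly (0.75, 0.5, 0.25 at the interior nodes), matching the known closed form for a 1-D random walk.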

*Author:
Leo Grady,
Siemens Corporate Research*

This is an ultra-short interview built around the question that Gottfried Wilhelm Leibniz already posed to himself, "What is a pattern?", and around the research Gregory Chaitin has done on this subject.

*Interviewer:
Nello Cristianini,
University Of Bristol
Interviewee:
Gregory Chaitin,
University Of Auckland*

The increasing pervasiveness of location-acquisition technologies (GPS, GSM networks, etc.) is leading to the collection of large spatio-temporal datasets and to the opportunity of discovering usable knowledge about movement behaviour, which fosters novel applications and services. In this paper, we move in this direction and develop an extension of the sequential pattern mining paradigm that analyzes the trajectories of moving objects. We introduce trajectory patterns as concise descriptions of frequent behaviours, in terms of both space (i.e., the regions of space visited during movements) and time (i.e., the duration of movements). In this setting, we provide a general formal statement of the novel mining problem and then study several different instantiations of different complexity. The various approaches are then empirically evaluated over real data and synthetic benchmarks, comparing their strengths and weaknesses.
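
A much-simplified flavour of the idea (not the paper's algorithm): treat each trajectory as a sequence of (region, timestamp) visits and keep the region-to-region transitions that are frequent across trajectories, annotated with a typical travel time. The trajectories below are invented:

```python
from collections import defaultdict
from statistics import median

def frequent_transitions(trajectories, min_support):
    """Trajectories are lists of (region, timestamp) visits.  Count each
    region-to-region transition; keep those supported by at least
    min_support distinct trajectories, annotated with the median travel
    time - a crude one-step analogue of spatio-temporal patterns."""
    times = defaultdict(list)    # (r1, r2) -> observed travel times
    support = defaultdict(set)   # (r1, r2) -> trajectory ids
    for tid, traj in enumerate(trajectories):
        for (r1, t1), (r2, t2) in zip(traj, traj[1:]):
            times[(r1, r2)].append(t2 - t1)
            support[(r1, r2)].add(tid)
    return {pair: (len(support[pair]), median(times[pair]))
            for pair in times if len(support[pair]) >= min_support}
```

For three toy trajectories over regions A, B, C, only the A-to-B transition is supported twice, with a median travel time of 9.5 time units.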

*Author:
Mirco Nanni,
Istituto Di Scienza E Tecnologie Dell'informazione "Alessandro Faedo"*

These lectures will provide an introduction to the theory of pattern classification methods. They will focus on relationships between the minimax performance of a learning system and its complexity. There will be four lectures. The first will review the formulation of the pattern classification problem and several popular pattern classification methods, and present general risk bounds in terms of Rademacher averages, a measure of the complexity of a class of functions. The second lecture will consider pattern classification in a minimax setting, and show that, in this setting, the Vapnik-Chervonenkis dimension is the key measure of complexity. The third lecture will focus on the theme of computational complexity. It will present the elegant relationship between the complexity of a class, as measured by its VC-dimension, and the computational complexity of functions from the class. This lecture will also review general results on the computational complexity of the pattern classification problem and its tight relationship with that of an associated empirical risk optimization problem. The fourth lecture will consider large margin classification methods, such as AdaBoost, support vector machines, and neural networks, viewing them as convex relaxations of intractable empirical minimization problems. It will review several statistical properties of these large margin methods, in particular a characterization of the convex optimization problems that lead to accurate classifiers, and relationships between these methods and probability models.
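
For a finite function class, the Rademacher average mentioned for the first lecture can even be computed exactly by enumerating all sign vectors. The sketch below is a toy illustration (its inputs, one tuple of sample outputs per function, are invented):

```python
from itertools import product

def empirical_rademacher(values):
    """Exact empirical Rademacher average of a finite function class.
    Each element of `values` is the tuple (f(x_1), ..., f(x_n)) of one
    function's outputs on the sample; enumerate all 2^n sign vectors:
        R = E_sigma[ max_f (1/n) sum_i sigma_i f(x_i) ]."""
    n = len(values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, f)) / n
                     for f in values)
    return total / 2 ** n
```

Two sanity checks: a single (constant) function has Rademacher average 0, while the pair {+1, -1} of constant functions on two points, which can chase the signs, has average 0.5.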

*Author:
Peter L. Bartlett,
Berkeley, University Of California*

In the "structural" paradigm for visual pattern recognition, or what some call "strong" pattern recognition, one is not satisfied with simply assigning a class label to an input object; instead, we aim at finding exactly which parts of the template object correspond to which parts of the scene. This is a much harder problem in principle, because it is inherently combinatorial in the number of parts (features) involved, both in the template object and in the scene. This talk describes a summary of our research efforts in setting this up as a mathematical optimization problem and solving it efficiently by exploiting geometric constraints. The key insight involves encoding geometric constraints as conditional independence assumptions in a probabilistic graphical model. Due to some geometric facts, it is possible to show that such models are very well behaved: they allow for exact probabilistic inference in polynomial time. The result is a unified framework for structural visual pattern recognition that is able to handle in a principled way a variety of problems, including point pattern matching in its many instances: invariant to translations, isometries, scalings, affine or projective transformations. Attributed graph matching problems, such as matching road networks, can also be solved within such a framework. Limitations and future directions will be discussed.

*Author:
Tibério Caetano,
National ICT Australia*

Spectral representations of graphs, Pattern spaces from graph spectra, Spectral approaches to matching, Heat kernel methods Probabilistic and spectral methods for graph matching and clustering. Applications in computer vision.

*Author:
Edwin Hancock,
University Of York*

In this paper we discuss the problem of feature selection for supervised learning from the standpoint of statistical machine learning. We inquire what subset of features will lead to the best classification accuracy. It is clear that if the statistical model is known, or if there are an unlimited number of training samples, any additional feature can only improve the accuracy. However, we explicitly show that when the training set is finite, using all the features may be suboptimal, even if all the features are independent and carry information on the label. We analyze one setting analytically and show how feature selection can increase accuracy. We also find the optimal number of features as a function of the training set size for a few specific examples. This perspective on feature selection is different from the common approach that focuses on the probability that a specific algorithm will pick a completely irrelevant or redundant feature.

*Author:
Amir Navot,
Hebrew University Of Jerusalem*

This course covers feature selection fundamentals and applications. The students will first be reminded of the basics of machine learning algorithms and the problem of overfitting avoidance. In the wrapper setting, feature selection will be introduced as a special case of the model selection problem. Methods to derive principled feature selection algorithms will be reviewed, as well as heuristic methods that work well in practice. One class will be devoted to feature construction techniques. Finally, a lecture will be devoted to the connections between feature selection and causal discovery. The class will be accompanied by several lab sessions. The course will be attractive to students who like playing with data and want to learn practical data analysis techniques. The instructor has ten years of experience consulting for startup companies in the US in pattern recognition and machine learning. Datasets from a variety of application domains will be made available: handwriting recognition, medical diagnosis, drug discovery, text classification, ecology, marketing.

*Author: Isabelle Guyon, Clopinet*

We introduce a new framework for feature grouping based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, which we apply to the problem of finding text in natural scenes. We demonstrate an implementation of our factor graph-based algorithm for finding text on a Nokia camera phone, which is intended for eventual use in a camera phone system that finds and reads text (such as street signs) in natural environments for blind users.
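
The higher-than-pairwise point is easy to make concrete: a factor graph is just a list of (scope, potential-table) pairs, and nothing stops a scope from containing three or more variables. A brute-force marginal computation (a toy sketch, unrelated to the paper's inference code; the variables and tables in the example are invented) looks like:

```python
from itertools import product

def marginal(variables, factors, target):
    """Brute-force marginal in a factor graph.  `variables` maps names
    to value domains; `factors` is a list of (scope, table) pairs where
    scope is a tuple of variable names and table maps value-tuples to
    nonnegative potentials.  Scopes of order three or higher are
    allowed - the defining convenience of factor graphs."""
    names = list(variables)
    dist = {}
    for assign in product(*(variables[v] for v in names)):
        a = dict(zip(names, assign))
        w = 1.0
        for scope, table in factors:
            w *= table[tuple(a[v] for v in scope)]
        dist[a[target]] = dist.get(a[target], 0.0) + w
    z = sum(dist.values())
    return {val: w / z for val, w in dist.items()}
```

With one third-order "agreement" factor over three binary variables plus a unary factor favouring x = 1 with weight 3, the marginal P(x = 1) works out to 15/20 = 0.75.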

*Author: Huiying Shen, Smith-Kettlewell Eye Research Institute (SKERI)*

In some applications, such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, which may severely impair the quality of data analysis, for example by causing significant biases and misleading results in hypothesis tests, correlation analysis and regressions. The very limited previous studies on cleaning disguised missing data use outlier mining and distribution anomaly detection. They rely heavily on domain background knowledge in specific applications and may not work well in cases where the disguise values are inliers. To tackle the problem of cleaning disguised missing data, in this paper we first model the distribution of disguised missing data and propose the embedded unbiased sample heuristic. Then, we develop an effective and efficient method to identify the frequently used disguise values, which capture the major body of the disguised missing data. Our method does not require any domain background knowledge to find the suspicious disguise values. We report an empirical evaluation using real data sets, which shows that our method is effective: the frequently used disguise values found by our method match the values identified by domain experts nicely. Our method is also efficient and scalable for processing large data sets.
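
A crude frequency heuristic in this spirit (illustrative only, not the paper's embedded-unbiased-sample method; the column and ratio below are invented): flag attribute values that are disproportionately frequent, such as a default date of birth everyone leaves untouched:

```python
from collections import Counter

def suspicious_disguise_values(column, ratio=4.0):
    """Flag values whose frequency exceeds `ratio` times the median
    frequency of the attribute's values.  A value like '1900-01-01'
    chosen as a form default will dominate the column even though each
    occurrence looks like valid data."""
    counts = Counter(column)
    if len(counts) < 2:
        return []
    freqs = sorted(counts.values())
    median_freq = freqs[len(freqs) // 2]
    return [v for v, c in counts.items() if c >= ratio * median_freq]
```

On a made-up birth-date column where '1900-01-01' appears 40 times against a median frequency of 3, only the default date is flagged. Note this simple heuristic would still miss disguise values that happen to be inliers with ordinary frequency, which is exactly the harder case the paper addresses.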

*Author: Jian Pei, Simon Fraser University*

Commercial datasets are often large, relational, and dynamic. They contain many records of people, places, things, events and their interactions over time. Such datasets are rarely structured appropriately for knowledge discovery, and they often contain variables whose meanings change across different subsets of the data. We describe how these challenges were addressed in a collaborative analysis project undertaken by the University of Massachusetts Amherst and the National Association of Securities Dealers (NASD). We describe several methods for data preprocessing that we applied to transform a large, dynamic, and relational dataset describing nearly the entirety of the U.S. securities industry, and we show how these methods made the dataset suitable for learning statistical relational models. To better utilize social structure, we first applied known consolidation and link formation techniques to associate individuals with branch office locations. In addition, we developed an innovative technique to infer professional associations by exploiting dynamic employment histories. Finally, we applied normalization techniques to create a suitable class label that adjusts for spatial, temporal, and other heterogeneity within the data. We show how these pre-processing techniques combine to provide the necessary foundation for learning high-performing statistical models of fraudulent activity.

*Author: Andrew Fast, University Of Massachusetts Amherst*

We revisit the problem of representing a high-dimensional data set by a distance-preserving projection onto a two-dimensional plane. This problem is solved by well-known techniques, such as multidimensional scaling. There, the data is projected onto a flat plane and the Euclidean metric is used for distance calculation. In real topographic maps, however, travel distance (or time) is not determined by (Euclidean) distance alone, but also influenced by map features such as mountains or lakes. We investigate how to utilize landscape features for a distance-preserving projection. A first approach with rectangular cylindrical mountains in the MDS landscape is presented.
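
For reference, plain flat-plane MDS itself fits in a few lines: minimize the stress between map distances and target dissimilarities by gradient descent. The sketch below (toy data, not the paper's landscape variant) recovers an equilateral triangle from its distance matrix:

```python
import math
import random

def stress(pos, delta):
    """Raw stress: squared mismatch between map distances and targets."""
    s = 0.0
    for i in range(len(delta)):
        for j in range(i + 1, len(delta)):
            d = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            s += (d - delta[i][j]) ** 2
    return s

def mds_2d(delta, steps=5000, lr=0.01, seed=0):
    """Distance-preserving projection onto a flat plane: gradient
    descent on the stress, starting from a random layout."""
    rng = random.Random(seed)
    n = len(delta)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9
                g = 2 * (d - delta[i][j]) / d   # gradient factor per pair
                for k, diff in ((0, dx), (1, dy)):
                    pos[i][k] -= lr * g * diff
                    pos[j][k] += lr * g * diff
    return pos
```

The landscape idea of the paper would replace the Euclidean `math.hypot` call by a travel distance that accounts for map features such as mountains.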

*Author: Frank Klawonn, University Of Applied Sciences Braunschweig/Wolfenbüttel*

Questions in cognitive neuroscience are often framed in terms of correspondences between known types: How is brain state X related to cognitive state Y? What are the correlations or mappings between particular structures and functions? Such framings are well suited for confirmatory testing of coarse-grained hypotheses. They are not necessarily informative, however, for the purpose of exploring finer physical and functional structure. To the contrary, physical states are typically aggregated over anatomical regions of interest, while tasks are designed to optimize one or a few functional contrasts of interest rather than to cover a fuller behavioral or cognitive range.

*Author: Kenneth Whang, National Science Foundation*

Lecture slides:

- Feature Selection and Causality Inference
- Purpose
- Road Map
- Feature Selection
- Uncovering Dependencies
- Predictions and Actions
- Individual Feature Irrelevance pt 1
- Individual Feature Relevance pt 2
- Multivariate Cases
- Is multivariate FS always best?
- In practice…
- Definition of “relevance”
- Is X2 “relevant”?
- Are X1 and X2 “relevant”?
- Adding a variable…
- X1 || Y | X2
- Really?
- Same independence relations Different causal relations
- Is X1 “relevant”?
- Non-causal features may be predictive yet not “relevant”
- Causal Features
- Experiments
- Univariate Filter: AUC
- Causal Feature Selection
- Causal features are “robust” under change of distribution
- Conclusion
- http://clopinet.com/fextract-book

*Author: Isabelle Guyon, Clopinet*

Non-linear dimensionality reduction of noisy data is a challenging problem encountered in a variety of data analysis applications. Recent results in the literature show that spectral decomposition, as used for example by the Laplacian Eigenmaps algorithm, provides a powerful tool for non-linear dimensionality reduction and manifold learning. In this paper, we discuss a significant shortcoming of these approaches, which we refer to as the repeated eigendirections problem. We propose a novel approach that combines successive one-dimensional spectral embeddings with a data advection scheme that allows us to address this problem. The proposed method does not depend on a non-linear optimization scheme; hence, it is not prone to local minima. Experiments with artificial and real data illustrate the advantages of the proposed method over existing approaches. We also demonstrate that the approach is capable of correctly learning manifolds corrupted by significant amounts of noise.

*Author: Samuel Gerber, University Of Utah*

Lecture slides:

- Recent Progress on Learning with Graph Representations
- Outline
- Motivation
- Problem
- Measuring similarity of graphs
- Viewed from the perspective of learning
- Learning with graphs (circa 2000)
- Why is structural learning difficult
- Structural Variations
- Contributions
- Spectral Methods
- Graph (structural) representations of shape
- Delaunay Graph
- MOVI Sequence
- Shock graphs
- Graph characteristics
- Pairwise clustering
- Embeddings
- Generative model
- Spectral Generative Model
- Algebraic graph theory (PAMI 2005)
- ….joint work with Richard Wilson
- Spectral Representation
- Properties of the Laplacian
- Eigenvalue spectrum
- Eigenvalues are invariant to permutations of the Laplacian.
- Why
- Symmetric polynomials
- Power symmetric polynomials
- Symmetric polynomials on spectral matrix
- Spectral Feature Vector
- …extend to weighted attributed graphs.
- Complex Representation
- Spectral analysis
- Pattern Spaces
- Manifold learning methods
- Separation under structural error
- Variation under structural error (MDS)
- CMU Sequence
- MOVI Sequence
- YORK Sequence
- Visualisation (LLP+Laplacian Polynomials)
- Cospectrality problem for trees
- Cospectral trees
- Overcome using quantum random walk
- The positive support of a matrix
- Cospectral Trees
- Strongly regular graphs
- Generative Tree Union Model
- ..work with Andrea Torsello
- Ingredients
- Illustration
- Cluster structure
- Model
- Union as tree distribution
- Generative Model
- Max-likelihood parameters
- Description length
- Expectation on observation density
- Tree Union
- Simplified Description Cost
- Description Length Gain
- Unattributed
- Future

*Author: Edwin Hancock, University Of York*

For large-scale classification problems, the training samples can be clustered beforehand as a downsampling pre-process, and then only the obtained clusters are used for training. Motivated by this assumption, we propose a classification algorithm, the Support Cluster Machine (SCM), within the learning framework introduced by Vapnik. For the SCM, a compatible kernel is adopted such that a similarity measure can be handled not only between clusters in the training phase but also between a cluster and a vector in the testing phase. We also prove that the SCM is a general extension of the SVM with the RBF kernel. The experimental results confirm that the SCM is very effective for large-scale classification problems due to significantly reduced computational costs for both training and testing and comparable classification accuracies. As a by-product, it provides a promising approach to dealing with privacy-preserving data mining problems.

*Author: Xiangyang Xue, Fudan University*

Most practical image segmentation algorithms optimize some mathematical similarity criterion derived from several low-level image features. One possible way of combining different types of features, e.g. color and texture features on different scales and/or different orientations, is to simply stack all the individual measurements into one high-dimensional feature vector. Due to the nature of such stacked vectors, however, only very few components (e.g. those which are defined on a suitable scale) will carry information that is relevant for the actual segmentation task. We present a novel approach to combining segmentation and feature selection that is capable of overcoming this relevance determination problem. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the partitioning algorithm used. On the technical side, we present an efficient optimization algorithm with a guaranteed local convergence property. All free model parameters of this method are selected by a resampling-based stability analysis. Experiments for both toy examples and real-world images demonstrate that the built-in feature selection mechanism leads to stable and meaningful partitions of the images.

*Author: Volker Roth, ETH Zurich*

Regularized Kernel Discriminant Analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. The performance of RKDA depends on the selection of kernels. In this paper, we consider the problem of learning an optimal kernel over a convex set of kernels. We show that the kernel learning problem can be formulated as a semidefinite program (SDP) in the binary-class case. We further extend the SDP formulation to the multi-class case. It is based on a key result established in this paper, that is, the multi-class kernel learning problem can be decomposed into a set of binary-class kernel learning problems. In addition, we propose an approximation scheme to reduce the computational complexity of the multi-class SDP formulation. The performance of RKDA also depends on the value of the regularization parameter. We show that this value can be learned automatically in the framework. Experimental results on benchmark data sets demonstrate the efficacy of the proposed SDP formulations.

*Author: Shuiwang Ji, The Biodesign Institute, Arizona State University*

Dimensionality reduction is a commonly used step in machine learning, especially when dealing with a high-dimensional space of features. The original feature space is mapped onto a new, reduced-dimensionality space, and the examples to be used by machine learning algorithms are represented in that new space. The mapping is usually performed either by selecting a subset of the original features and/or by constructing new features. This presentation deals with the first approach, feature subset selection. We provide a brief overview of the feature subset selection techniques that are commonly used in machine learning and give a more detailed description of feature subset selection as used in machine learning on text data. The performance of some methods used in document categorization is illustrated by an experimental comparison on real-world data collected from the Web.

*Author: Dunja Mladenić, Jožef Stefan Institute*

Many feature selection algorithms are limited in that they attempt to identify relevant feature subsets by examining the features individually. This paper introduces a technique for determining feature relevance using the average information gain achieved during the construction of decision tree ensembles. The technique introduces a node complexity measure and a statistical method for updating the feature sampling distribution based upon confidence intervals to control the rate of convergence. Experiments demonstrate the potential of this method for feature selection and subspace identification.
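
The information-gain quantity at the core of the method is standard and compact. The sketch below shows the measure itself, not the paper's ensemble-based sampling scheme or node complexity measure:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(feature, labels):
    """IG = H(Y) - sum_v P(X=v) H(Y | X=v) for a discrete feature X:
    the reduction in label entropy achieved by splitting on X."""
    n = len(labels)
    by_value = {}
    for x, y in zip(feature, labels):
        by_value.setdefault(x, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder
```

A feature that predicts the label perfectly gains a full bit on a balanced binary task, while an independent feature gains nothing, which is why averaging this quantity over ensemble splits gives a usable relevance score.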

*Author: Jeremy D. Rogers, University Of Southampton*

The sparse grid method is a special discretization technique which makes it possible to cope with the curse of dimensionality to some extent. It is based on a hierarchical basis and a sparse tensor product decomposition. Sparse grids have been successfully used to solve partial differential equations in the past and, more recently, have been shown to be competitive for learning problems as well. The lecture will provide a general introduction to the major properties of sparse grids and present the sparse grid combination technique for classification and regression.
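
For orientation, the classical combination technique the lecture refers to can be stated compactly (using the standard level-multi-index convention, where f_l denotes the solution on the anisotropic grid of level l; in two dimensions it reduces to combining two consecutive diagonals of grids):

```latex
f_n^{c}(x) \;=\; \sum_{q=0}^{d-1} (-1)^{q} \binom{d-1}{q}
\sum_{|\mathbf{l}|_{1} = n - q} f_{\mathbf{l}}(x),
\qquad\text{e.g. for } d = 2:\quad
f_n^{c} \;=\; \sum_{l_1 + l_2 = n} f_{l_1, l_2}
\;-\; \sum_{l_1 + l_2 = n - 1} f_{l_1, l_2}.
```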

*Author: Jochen Garcke, Australian National University (ANU)*

Distance metric learning and nonlinear dimensionality reduction are two interesting and active topics in recent years. However, the connection between them is not thoroughly studied yet. In this paper, a transductive framework of distance metric learning is proposed and its close connection with many nonlinear spectral dimensionality reduction methods is elaborated. Furthermore, we prove a representer theorem for our framework, linking it with function estimation in an RKHS, and making it possible to generalize to unseen test samples. In our framework, it suffices to solve a sparse eigenvalue problem, thus datasets with 10^5 samples can be handled. Finally, experimental results on synthetic data, several UCI databases and the MNIST handwritten digit database are shown.

*Author: Fuxin Li, Chinese Academy Of Sciences, Institute Of Automation*

Lecture slides:

- Relational Latent Class Models
- Overview
- Relational Problems are About Networks
- Relational Problems Might Involve Multiple Classes of
- John Donne, 1572 – 1631
- Overview: Learning with Relations (incomplete)
- Statistical Relational Learning
- This work
- II. Before Relational Learning
- IID Learning: The Matrix
- Towards Relational Learning:Time Series Models
- Towards Relational Learning: Hierarchical Bayesian Modeling
- Learning with Related Tasks
- A Hierarchical Bayesian Model
- Parametric HB is too Stiff!
- A Mixture Model
- III Relational Modeling and Learning
- Learning with Relational Data
- Entity Relationship Model
- Representing Ground Facts
- Directed Acyclic Probabilistic Entity Relationship (DAPER) Model
- DAPER and Ground Networks
- Structural Learning in Relational Modeling
- IV Infinite Hidden Relational Modeling
- Hierarchical Bayes and Relational Learning
- Relationship Prediction with Strong Attributes
- Relationship Prediction with Weak (or no) User Attributes
- Nonparametric Relational Bayes: Infinite Hidden Relational Model
- IHRM with Parameters
- The Recipe
- Ground Network With an Image Structure
- Ground Network With an Image Structure and Latent Variables: The IHRM
- Work on Latent Class Relational Learning
- The Generative Model (IHRM)
- The Generative Model (MMSB)
- The Generative Model (DERL)
- The Generative Model (Mixed Membership DERL)
- The Generative Model (Sinkkonen et al.)
- V Making it all work
- Inference in the IHRM
- Experiment 1: Experimental Analysis on Movie Recommendation
- MovieLens Attributes
- Experimental Analysis on Movie Recommendation
- Movie cluster analysis Gibbs sampling with CRP
- Gibbs sampling with CRP - 2
- User Attributes and User Clusters
- Difference to mean distribution
- User Clusters versus Movie Clusters
- Experiment 2: Gene Interaction and Gene Function
- IHRM Model
- Cluster Structure
- Relevance of Attributes and Relationships
- Ongoing Work: Integrate Ontology into IHRM - 1
- Ongoing Work: Integrate Ontology into IHRM - 2
- Experiment 3: Clinical Decision Support
- IHRM Model for Clinical Decision Support
- Procedure Prediction: Given First Procedure
- Experiment 4: Context-Dependent Statistical Trust Learning
- Infinite Hidden Relational Trust Model
- eBay Data
- Predictive Performance
- Conclusion

*Author: Volker Tresp, Siemens*

Lecture slides:

- Efficient Computation of Recursive Principal Component Analysis for Structured Input
- Outline
- What are structured domains and why are they important?
- Examples of Structured Data
- Vectorial Data: Principal Component Analysis
- More Complex Objects
- Principal Component Analysis of Sequences and Trees ?
- The Strategy
- Sequences
- Step 1: Sufficient Conditions
- Step 2: Extended State Space
- Step 3: Reduce
- Step 4: Compose
- Recursive PCA for Trees
- Graphs
- The linear system for graphs
- Computational Problems
- Some Basic Observations and Their Exploitation
- Three Techniques
- Minimal State Space
- Minimal DAG
- QR Decomposition
- Datasets for Experiments
- Experiments Results
- Summary
- Impact on a Regression Task: some preliminary results

*Author: Alessandro Sperduti, Dipartimento Di Matematica Pura Ed Applicata, Università Degli Studi Di Padova*

We design an on-line algorithm for Principal Component Analysis. The instances are projected into a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus an additional term that grows linearly in the dimension of the subspace but logarithmically in the dimension of the instances.

*Author: Manfred K. Warmuth, Department Of Computer Science, University Of California*

Latent structure models involve real, potentially observable variables and latent, unobservable variables. Depending on the nature of these variables, whether they be discrete or continuous, the framework includes various particular types of model, such as factor analysis, latent class analysis, latent trait analysis, latent profile models, mixtures of factor analysers, state-space models and others. The simplest scenario, of a single discrete latent variable, includes finite mixture models, hidden Markov chain models and hidden Markov random field models. The talk will give an overview of the application of maximum likelihood and Bayesian approaches to the estimation of parameters within these models, emphasising especially the fact that computational complexity varies greatly among the different scenarios. In the case of a single discrete latent variable, the issue of assessing its cardinality will be discussed, in the context of questions such as the appropriate number of mixture components to be included in a mixture model, or, in the interests of parsimony, the minimum plausible cardinality of such a latent variable. Techniques such as the EM algorithm, Markov chain Monte Carlo methods and variational approximations will be featured in the talk.

*Author: Mike Titterington, University Of Glasgow*

Methods for analysis of principal components in discrete data have existed for some time under various names such as grade of membership modelling, probabilistic latent semantic indexing, genotype inference with admixture, non-negative matrix factorization, latent Dirichlet allocation, multinomial PCA, and Gamma-Poisson models. Statistical methodologies for developing algorithms are equally as varied, although this talk will focus on the Bayesian framework. The most widely published application is genotype inference, but text analysis is now increasingly seeing use because the algorithms cope with very large sparse matrices. This talk will present the general model (a discrete version of both PCA and ICA), alternative representations, and several algorithms (mean field and Gibbs).
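One of the algorithms mentioned, Gibbs sampling, can be sketched for latent Dirichlet allocation in a few dozen lines. This is a tiny collapsed Gibbs sampler with illustrative variable names, a toy-scale sketch rather than a production implementation.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, iters=50, alpha=0.1, beta=0.01):
    # Collapsed Gibbs sampling for latent Dirichlet allocation.
    # docs: list of lists of word ids in [0, vocab_size).
    z = [[random.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]                # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                                 # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(topic t) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                wts = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                       / (nk[t] + vocab_size * beta) for t in range(n_topics)]
                r = random.random() * sum(wts)
                for t, wt in enumerate(wts):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk, nkw
```

The count arrays returned are the sufficient statistics from which topic-word and document-topic distributions are estimated.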

*Author: Wray Buntine, Helsinki Institute Of Information Technology*

Say you want to do K-Nearest Neighbour classification. Besides selecting K, you also have to choose a distance function, in order to define "nearest". I'll talk about a novel method for *learning* -- from the data itself -- a distance measure to be used in KNN classification. The learning algorithm, Neighbourhood Components Analysis (NCA), directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and very fast classification in high dimensions. Of course, the resulting classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. If time permits, I'll also talk about newer work on learning the same kind of distance metric for use inside a Gaussian Kernel SVM classifier.
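The stochastic leave-one-out score that NCA maximizes can be written down compactly: point i picks neighbour j with probability proportional to exp(-||Ax_i - Ax_j||^2), and the objective is the expected fraction of same-class picks. The sketch below only evaluates this objective for a fixed transform A; gradient ascent on A, the subject of the talk, is omitted.

```python
import math

def nca_objective(X, y, A):
    # Expected leave-one-out KNN accuracy under stochastic neighbour
    # selection in the space z = A x.
    Z = [[sum(a * xi for a, xi in zip(row, x)) for row in A] for x in X]
    total = 0.0
    for i in range(len(X)):
        d = [sum((p - q) ** 2 for p, q in zip(Z[i], Z[j]))
             for j in range(len(X))]
        w = [0.0 if j == i else math.exp(-d[j]) for j in range(len(X))]
        s = sum(w)
        total += sum(wj for j, wj in enumerate(w) if y[j] == y[i]) / s
    return total / len(X)
```

On data where the second coordinate is class-irrelevant noise, a transform that projects it away scores higher than the identity, which is exactly the signal the NCA gradient follows.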

*Author: Sam Roweis, Department Of Computer Science, University Of Toronto*

Data segmentation is usually thought of as a chicken-and-egg problem: in order to estimate a mixture of models one needs to first segment the data, and in order to segment the data one needs to know the model parameters. Therefore, data segmentation is usually solved in two stages:

1. Data clustering.
2. Model fitting.

Other iterative methods, e.g. the Expectation-Maximization (EM) algorithm, alternate between these two stages. This talk will show that for a wide class of segmentation problems with multi-linear structure (including clustering subspaces of unknown and varying dimensions), the chicken-and-egg dilemma can be tackled as follows:

1. Fit a set of polynomials to all data points, without clustering the data.
2. Obtain the model parameters for each group from the derivatives of these polynomials.

Applications of GPCA to image/video/motion segmentation, face clustering, and identification of hybrid dynamical systems will also be presented.
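The two-step recipe above can be sketched in the simplest case: points drawn from two lines through the origin in the plane. One quadratic polynomial is fit to all points at once (no clustering), and the two slopes fall out algebraically from its factorization. This is a toy illustration of the GPCA idea, not the general algorithm.

```python
import math

def gpca_two_lines(points):
    # Fit q(x, y) = a*x^2 + b*x*y + y^2 to ALL points by least squares,
    # then recover the two slopes: q(x, m*x) = x^2 * (m^2 + b*m + a)
    # vanishes exactly when m is a root of m^2 + b*m + a = 0.
    s40 = sum(x ** 4 for x, y in points)
    s31 = sum(x ** 3 * y for x, y in points)
    s22 = sum(x ** 2 * y ** 2 for x, y in points)
    s13 = sum(x * y ** 3 for x, y in points)
    # Normal equations: a*s40 + b*s31 = -s22 and a*s31 + b*s22 = -s13
    det = s40 * s22 - s31 * s31
    a = (-s22 * s22 + s31 * s13) / det
    b = (-s40 * s13 + s31 * s22) / det
    disc = math.sqrt(b * b - 4 * a)
    return sorted([(-b - disc) / 2, (-b + disc) / 2])
```

For points on the lines y = 2x and y = -x the fitted polynomial factors as (y - 2x)(y + x), so the recovered slopes are -1 and 2.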

*Author: Rene Vidal, Johns Hopkins University*

Say you want to do K-Nearest Neighbour classification. Besides selecting K, you also have to choose a distance function, in order to define "nearest". I'll talk about a method for learning, from the data itself, a distance measure to be used in KNN classification. The learning algorithm, Neighbourhood Components Analysis (NCA), directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. Of course, the resulting classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. I will also discuss a variant of the method which is a generalization of Fisher's discriminant and defines a convex optimization problem by trying to collapse all examples in the same class to a single point and trying to push examples in other classes infinitely far away. By approximating the metric with a low-rank matrix, these learning algorithms can also be used to obtain a low-dimensional linear embedding of the original input features that can be used for data visualization and very fast classification in high dimensions.

*Author: Sam Roweis, Department Of Computer Science, University Of Toronto*

The course provides an introduction to independent component analysis and source separation. We start from simple statistical principles; examine connections to information theory and to sparse coding; we give an overview of available algorithmics; we also show how several key ideas of ICA are illuminated by information geometry.

*Author: Christophe Andrieu, University Of Bristol*

The course provides an introduction to independent component analysis and source separation. We start from simple statistical principles; examine connections to information theory and to sparse coding; we give an overview of available algorithmics; we also show how several key ideas of ICA are illuminated by information geometry.

*Author: Jean François Cardoso, Enst Paris*

In independent component analysis (ICA), the purpose is to linearly decompose a multidimensional data vector into components that are as statistically independent as possible. For nongaussian random vectors, this decomposition is not equivalent to decorrelation as is done by principal component analysis, but something considerably more sophisticated. ICA allows one to separate nongaussian source signals from their linear mixtures 'blindly', i.e. using no other information than the nongaussianity of the source signals. ICA can also be used to extract features from image and sound signals according to the principle of redundancy reduction that has its origins in the neurosciences. In my talks I will review the basic theory and theoretical background of ICA together with some recent theoretical developments.

*Author: Aapo Hyvärinen, Helsinki Institute For Information Technology*

We propose a thresholded ensemble model for ordinal regression problems. The model consists of a weighted ensemble of confidence functions and an ordered vector of thresholds. Using such a model, we could theoretically and algorithmically reduce ordinal regression problems to binary classification problems in the area of ensemble learning. Based on the reduction, we derive novel large-margin bounds of common error functions, such as the classification error and the absolute error. In addition, we also design two novel boosting approaches for constructing thresholded ensembles. Both our approaches have comparable performance to the state-of-the-art algorithms, but enjoy the benefit of faster training. Experimental results on benchmark datasets demonstrate the usefulness of our boosting approaches.
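The prediction rule of such a thresholded model is simple to state: the ensemble produces a real-valued score, and the ordered thresholds carve the real line into ranks. A minimal sketch of the rule and the absolute error it induces (the training of the ensemble and thresholds is the paper's contribution and is not shown):

```python
def ordinal_predict(score, thresholds):
    # Rank = 1 + number of thresholds the score exceeds.
    # thresholds must be sorted in increasing order.
    return 1 + sum(score > t for t in thresholds)

def absolute_error(scores, labels, thresholds):
    # Mean absolute difference between predicted and true ranks.
    return sum(abs(ordinal_predict(s, thresholds) - y)
               for s, y in zip(scores, labels)) / len(scores)
```

With thresholds (0, 1, 2), scores below 0 map to rank 1 and scores above 2 to rank 4.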

*Author: Hsuan Tien Lin, Learning Systems Group, Caltech*

In this paper we explore the use of Tree Augmented Naive Bayes (TAN) in regression problems where some of the independent variables are continuous and some others are discrete. The proposed solution is based on the approximation of the joint distribution by a Mixture of Truncated Exponentials (MTE). The construction of the TAN structure requires the use of the conditional mutual information, which cannot be analytically obtained for MTEs. In order to solve this problem, we introduce an unbiased estimator of the conditional mutual information, based on Monte Carlo estimation. We test the performance of the proposed model in a real life context, related to higher education management, where regression problems with discrete and continuous variables are common. This work has been supported by the Spanish Ministry of Education and Science, project TIN2004-06204-C03-01 and by Junta de Andalucía, project P05-TIC-00276.

*Coauthor: Antonio Salmerón, University Of Almería*

Multiplicative update rules have proven useful in many areas of machine learning. Simple to implement, guaranteed to converge, they account in part for the widespread popularity of algorithms such as nonnegative matrix factorization and Expectation-Maximization. In this paper, we show how to derive multiplicative updates for problems in L1-regularized linear and logistic regression. For L1-regularized linear regression, the updates are derived by reformulating the required optimization as a problem in nonnegative quadratic programming (NQP). The dual of this problem, itself an instance of NQP, can also be solved using multiplicative updates; moreover, the observed duality gap can be used to bound the error of intermediate solutions. For L1-regularized logistic regression, we derive similar updates using an iteratively reweighted least squares approach. We present illustrative experimental results and describe efficient implementations for large-scale problems of interest (e.g., with tens of thousands of examples and over one million features).
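A minimal sketch of a multiplicative update for NQP, the building block the paper reduces L1-regularized regression to: minimize 0.5 v'Av + b'v subject to v >= 0, splitting A into its positive and negative parts. This is a small dense sketch in the style of such updates, not the authors' large-scale implementation.

```python
import math

def nqp_multiplicative(A, b, iters=200):
    # Each component is rescaled by a nonnegative factor built from the
    # positive part (Ap v)_i and negative part (An v)_i of the gradient,
    # so feasibility v >= 0 is preserved automatically.
    n = len(b)
    Ap = [[max(a, 0.0) for a in row] for row in A]   # positive part of A
    An = [[max(-a, 0.0) for a in row] for row in A]  # negative part of A
    v = [1.0] * n
    for _ in range(iters):
        apv = [sum(Ap[i][j] * v[j] for j in range(n)) + 1e-12
               for i in range(n)]
        anv = [sum(An[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [v[i] * (-b[i] + math.sqrt(b[i] * b[i] + 4 * apv[i] * anv[i]))
             / (2 * apv[i]) for i in range(n)]
    return v
```

On a separable problem the update drives a variable with positive linear coefficient to the boundary v = 0 and the others to their unconstrained minimizers.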

*Coauthor: Lawrence Saul, University Of California, San Diego*

Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. Similarly, in reinforcement learning (RL) models, exploitation-exploration factor and other meta-parameters control learning dynamics and performance. We found that many different measures of animal learning and performance can be reproduced by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with 2 different genetic strains of mice, exposing them to different stressors. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. Exploration-exploitation factors had similar characteristic dynamics for the two simulated experiments, and there were statistically significant differences between different genetic strains and stress conditions.

*Author: Gedi Lukšys, École Polytechnique Fédérale De Lausanne*

Many problems in machine learning use a probabilistic description. Examples are pattern recognition methods and graphical models. As a consequence of this uniform description, one can apply generic approximation methods such as mean field theory and sampling methods. Another important class of machine learning problems are the reinforcement learning problems, aka optimal control problems. Here, also a probabilistic description is used, but up to now efficient mean field approximations have not been obtained. In this presentation, I consider linear-quadratic control of an arbitrary dynamical system and show, that for this class of stochastic control problems the non-linear Hamilton-Jacobi-Bellman equation can be transformed into a linear equation. The transformation is similar to the transformation used to relate the Schrödinger equation to the Hamilton-Jacobi formalism. The computation can be performed efficiently by means of a forward diffusion process that can be computed by stochastic integration or that can be described by a path integral. For this path integral it is expected that a variational mean field approximation could be derived.

*Author: Bert Kappen, Radboud University Nijmegen*

Although Bayesian methods for Reinforcement Learning can be traced back to the 1960s (Howard's work in Operations Research), Bayesian methods have only been used sporadically in modern Reinforcement Learning. This is in part because non-Bayesian approaches tend to be much simpler to work with. However, recent advances have shown that Bayesian approaches do not need to be as complex as initially thought and offer several theoretical advantages. For instance, by keeping track of full distributions (instead of point estimates) over the unknowns, Bayesian approaches permit a more comprehensive quantification of the uncertainty regarding the transition probabilities, the rewards, the value function parameters and the policy parameters. Such distributional information can be used to optimize (in a principled way) the classic exploration/exploitation tradeoff, which can speed up the learning process. Similarly, active learning for reinforcement learning can be naturally optimized. The estimation of performance gradients with respect to value function and/or policy parameters can also be done more accurately while using less data. Bayesian approaches also facilitate the encoding of prior knowledge and the explicit formulation of domain assumptions.

The primary goal of this tutorial is to raise the awareness of the research community with regard to Bayesian methods, their properties and potential benefits for the advancement of Reinforcement Learning. An introduction to Bayesian learning will be given, followed by a historical account of Bayesian Reinforcement Learning and a description of existing Bayesian methods for Reinforcement Learning. The properties and benefits of Bayesian techniques for Reinforcement Learning will be discussed, analyzed and illustrated with case studies.

*Author: Pascal Poupart, University Of Waterloo*

Reinforcement learning is about learning good control policies given only weak performance feedback: occasional scalar rewards that might be delayed from the events that led to good performance. Reinforcement learning inherently deals with feedback systems rather than (data, class) samples, providing a more flexible control-like framework than many standard machine learning algorithms. These lectures will summarise reinforcement learning along 3 axes:

- Learning with or without knowledge of the system dynamics.
- Using state values as an intermediate solution, or learning a policy directly.
- Learning with or without fully observable system states.

*Author: Douglas Aberdeen, National Ict Australia*

The tutorial will introduce Reinforcement Learning, that is, learning what actions to take, and when to take them, so as to optimize long-term performance. This may involve sacrificing immediate reward to obtain greater reward in the long-term or just to obtain more information about the environment. The first part of the tutorial will cover the basics, such as Markov decision processes, dynamic programming, temporal-difference learning, Monte Carlo methods, eligibility traces, the role of function approximation. In the second part we cover some recent developments, namely policy gradient and second order methods, such as LSPI and the modified Bellman residual minimization algorithm.
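Temporal-difference learning, one of the basics covered in the first part, can be sketched as tabular Q-learning on a toy chain MDP (the environment and hyperparameters below are illustrative choices, not from the tutorial):

```python
import random

def q_learning_chain(n_states=5, episodes=5000, alpha=0.1, gamma=0.9, eps=0.2):
    # Tabular Q-learning on a chain MDP: states 0..n_states-1, actions
    # 0 = left, 1 = right; reward 1 on reaching the rightmost state.
    rng = random.Random(0)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < eps:            # epsilon-greedy exploration
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy moves right from every state, sacrificing nothing now but collecting the delayed reward at the end of the chain.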

*Author: Csaba Szepesvari, Department Of Computing Science, University Of Alberta*
