Florian T. Pokorny, Assistant Professor, School of Computer Science and Communication, KTH Royal Institute of Technology
KTH Machine Learning Reading Group
This is a bi-weekly reading group covering state-of-the-art methods in machine learning.
Our meetings take place at KTH main campus, Teknikringen 14.
Bayesian Sparsity and Low-rank modelling in Signal Processing
I start by giving an overview of how sparse and low-rank modelling occurs in signal processing and discuss some of the common methods used for these problems, encompassing both convex-optimization-based and greedy methods. Thereafter I will discuss Bayesian methods for sparse regression in the form of the Relevance Vector Machine (RVM). I will finish by describing some work performed at the Signal Processing lab on extending the RVM to measurements with outliers and to low-rank problems.
Romberg, Justin. "Imaging via compressive sampling" [introduction to compressive sampling and recovery via convex programming] IEEE Signal Processing Magazine 25.2 (2008): 14-20
Tipping, Michael E. "Sparse Bayesian learning and the relevance vector machine" The journal of machine learning research, 1 (2001): 211-244
Sundin, Martin, Saikat Chatterjee, and Magnus Jansson. "Combined modeling of sparse and dense noise for improvement of Relevance Vector Machine" preprint arXiv:1501.02579 (2015)
Sundin, Martin, et al. "Bayesian Learning for Low-Rank matrix reconstruction" preprint arXiv:1501.05740 (2015)
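To make the RVM part of the talk concrete, here is a minimal sketch of sparse Bayesian regression via evidence maximization in the style of Tipping's RVM. The hyperparameter update rules are the standard ones, but the implementation, initialization, and toy settings are our own, not code from the papers above:

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=200, alpha_max=1e8):
    # Sparse Bayesian linear regression in the style of Tipping's RVM:
    # each weight gets its own precision hyperparameter alpha_i, and
    # irrelevant weights are pruned as alpha_i grows large.
    N, M = Phi.shape
    alpha = np.ones(M)
    sigma2 = 0.1 * np.var(t) + 1e-6          # initial noise variance (a guess)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + Phi.T @ Phi / sigma2)
        mu = Sigma @ Phi.T @ t / sigma2      # posterior mean of the weights
        gamma = 1.0 - alpha * np.diag(Sigma) # "well-determinedness" of each weight
        alpha = np.minimum(gamma / (mu ** 2 + 1e-12), alpha_max)
        sigma2 = np.sum((t - Phi @ mu) ** 2) / max(N - gamma.sum(), 1e-6)
    return mu, alpha
```

On a toy problem with a few active coefficients, the precisions of the irrelevant weights grow large and the corresponding posterior means collapse towards zero, which is the sparsity mechanism discussed in the talk.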
The following paper will be discussed:
"Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning"
A. Criminisi, J. Shotton, and E. Konukoglu, Found. and Trends in Computer Graphics and Vision, 2011.
This review presents a unified, efficient model of random decision forests which can be applied to a number of machine learning, computer vision, and medical image analysis tasks. Our model extends existing forest-based techniques as it unifies classification, regression, density estimation, manifold learning, semi-supervised learning, and active learning under the same decision forest framework. This gives us the opportunity to write and optimize the core implementation only once, with application to many diverse tasks.
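As a small illustration of the core ingredient shared by all the forest variants in the review, the sketch below scores randomly sampled axis-aligned split candidates by information gain. This is a toy version of randomized node optimization; the function names and settings are ours, not the authors':

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_random_split(X, y, n_trials=30, rng=None):
    # Randomized node optimization: sample (feature, threshold) candidates
    # and keep the one with the highest information gain.
    rng = rng or np.random.default_rng(0)
    best_f, best_thr, best_gain = None, None, -1.0
    for _ in range(n_trials):
        f = int(rng.integers(X.shape[1]))
        thr = float(rng.choice(X[:, f]))
        left, right = y[X[:, f] < thr], y[X[:, f] >= thr]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y) - (len(left) * entropy(left)
                             + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_f, best_thr, best_gain = f, thr, gain
    return best_f, best_thr, best_gain
```

Swapping what is stored at the leaves (class histograms, regression targets, density estimates) is what turns this one training procedure into the different tasks of the unified framework.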
Survey on Transfer Learning
"A Survey on Transfer Learning" S. J. Pan, Q. Yang, IEEE Trans. on Knowledge and Data Eng.
A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.
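One of the simplest instances of the covariate-shift setting mentioned in the survey can be handled by importance weighting: re-weight each training point by p_test(x)/p_train(x) before fitting. A toy sketch with known densities (the distributions, model, and parameters here are our own illustration, not an example from the survey):

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def weighted_linear_fit(x, y, w):
    # Weighted least squares for y ~ a*x + b.
    A = np.stack([x, np.ones_like(x)], axis=1)
    Aw = A * w[:, None]
    return np.linalg.solve(A.T @ Aw, Aw.T @ y)

# Training inputs ~ N(0,1), test inputs ~ N(1,0.25): covariate shift.
x_tr = rng.normal(0.0, 1.0, 2000)
x_te = rng.normal(1.0, 0.5, 2000)
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=x_tr.size)

# Importance weights p_test(x)/p_train(x), assuming both densities known.
w = gauss_pdf(x_tr, 1.0, 0.5) / gauss_pdf(x_tr, 0.0, 1.0)
coef_w = weighted_linear_fit(x_tr, y_tr, w)
coef_u = weighted_linear_fit(x_tr, y_tr, np.ones_like(x_tr))
```

The weighted fit concentrates on the region where the test inputs actually live, so it predicts better there than the unweighted fit, at the cost of higher estimator variance.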
Flexible Transfer Learning under Support and Model Shift
We will discuss the paper "Flexible Transfer Learning under Support and Model Shift"
X. Wang, J. Schneider, NIPS 2014
Transfer learning algorithms are used when one has sufficient training data for one supervised learning task (the source/training domain) but only very limited training data for a second task (the target/test domain) that is similar but not identical to the first. Previous work on transfer learning has focused on relatively restricted settings, where specific parts of the model are considered to be carried over between tasks. Recent work on covariate shift focuses on matching the marginal distributions on observations X across domains. Similarly, work on target/conditional shift focuses on matching marginal distributions on labels Y and adjusting conditional distributions P(X|Y), such that P(X) can be matched across domains. However, covariate shift assumes that the support of test P(X) is contained in the support of training P(X), i.e., the training set is richer than the test set. Target/conditional shift makes a similar assumption for P(Y). Moreover, not much work on transfer learning has considered the case when a few labels in the test domain are available, and little work has been done for the case when all marginal and conditional distributions are allowed to change while the changes remain smooth. In this paper, we consider a general case where both the support and the model change across domains. We transform both X and Y by a location-scale shift to achieve transfer between tasks. Since we allow more flexible transformations, the proposed method yields better results on both synthetic data and real-world data.
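The location-scale transformations the paper refers to are affine maps. A tiny sketch that matches the first two moments of a source sample to a target sample (our own toy illustration; the paper learns its transform jointly with a Gaussian process model rather than from raw moments):

```python
import numpy as np

rng = np.random.default_rng(2)
src = rng.normal(5.0, 2.0, 1000)   # source-domain values
tgt = rng.normal(-1.0, 0.5, 1000)  # target-domain values

# Location-scale map x -> a*x + b aligning source moments with the target.
a = tgt.std() / src.std()
b = tgt.mean() - a * src.mean()
src_shifted = a * src + b
```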
Recent Work on Convolutional Neural Networks by our Computer Vision Group
We will discuss three recent preprints:
"CNN Features off-the-shelf: an Astounding Baseline for Recognition", ArXiV 2014
A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson
"Persistent Evidence of Local Image Properties in Generic ConvNets", ArXiv 2014
A. S. Razavian, H. Azizpour, A. Maki, J. Sullivan, C. H. Ek, S. Carlsson
"From Generic to Specific Deep Representations for Visual Recognition", ArXiv 2014
H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, S. Carlsson
Divergence measures and message passing
We will discuss the paper Divergence measures and message passing by Thomas Minka.
This paper presents a unifying view of message passing algorithms, as methods to approximate a complex Bayesian network by a simpler network with minimum information divergence. In this view, the difference between mean-field methods and belief propagation is not the amount of structure they model, but only the measure of loss they minimize ('exclusive' versus 'inclusive' Kullback-Leibler divergence). In each case, message-passing arises by minimizing a localized version of the divergence, local to each factor. By examining these divergence measures, we can intuit the types of solution they prefer (symmetry-breaking, for example) and their suitability for different tasks. Furthermore, by considering a wider variety of divergence measures (such as alpha-divergences), we can achieve different complexity and performance goals.
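The mode-seeking versus mass-covering behaviour Minka describes can be seen numerically by fitting a single Gaussian to a bimodal target under each divergence. The brute-force grid search below is purely our own illustration, not the paper's message-passing machinery:

```python
import numpy as np

x = np.linspace(-6, 6, 601)
dx = x[1] - x[0]

def normal(mu, sig):
    p = np.exp(-0.5 * ((x - mu) / sig) ** 2)
    return p / (p.sum() * dx)

# Bimodal target: equal mixture of N(-2, 0.5) and N(2, 0.5).
p = 0.5 * normal(-2, 0.5) + 0.5 * normal(2, 0.5)

def kl(a, b):
    return float(np.sum(a * np.log((a + 1e-300) / (b + 1e-300))) * dx)

def fit(divergence):
    # Grid search over single-Gaussian approximations q.
    best_d, best_params = np.inf, None
    for mu in np.linspace(-3, 3, 61):
        for sig in np.linspace(0.2, 3.0, 57):
            d = divergence(normal(mu, sig))
            if d < best_d:
                best_d, best_params = d, (mu, sig)
    return best_params

mu_ex, sig_ex = fit(lambda q: kl(q, p))  # exclusive KL(q||p): mode-seeking
mu_in, sig_in = fit(lambda q: kl(p, q))  # inclusive KL(p||q): mass-covering
```

The exclusive fit locks onto one of the two modes, while the inclusive fit straddles both with a much broader Gaussian, exactly the symmetry-breaking contrast the paper discusses.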
What You Saw is Not What You Get: Domain Adaptation Using Asymmetric Kernel Transforms
We will be discussing the paper What You Saw is Not What You Get: Domain Adaptation Using Asymmetric Kernel Transforms by Brian Kulis, Kate Saenko and Trevor Darrell.
In real-world applications, “what you saw” during training is often not “what you get” during deployment: the distribution and even the type and dimensionality of features can change from one dataset to the next. In this paper, we address the problem of visual domain adaptation for transferring object models from one dataset or visual domain to another. We introduce ARC-t, a flexible model for supervised learning of non-linear transformations between domains. Our method is based on a novel theoretical result demonstrating that such transformations can be learned in kernel space. Unlike existing work, our model is not restricted to symmetric transformations, nor to features of the same type and dimensionality, making it applicable to a significantly wider set of adaptation scenarios than previous methods. Furthermore, the method can be applied to categories that were not available during training. We demonstrate the ability of our method to adapt object recognition models under a variety of situations, such as differing imaging conditions, feature types and codebooks.
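The key point that the learned transformation may be asymmetric and map between feature spaces of different dimensionality can be illustrated in the simplest linear, non-kernelized case (which is not the paper's ARC-t method) by fitting a rectangular map with ridge regression on paired examples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Paired examples: source features in R^4, target features in R^2.
Xs = rng.normal(size=(200, 4))
A_true = rng.normal(size=(2, 4))           # ground-truth asymmetric map
Xt = Xs @ A_true.T + 0.01 * rng.normal(size=(200, 2))

# Ridge-regression estimate of the (non-square) transform W: R^4 -> R^2.
lam = 1e-3
W = np.linalg.solve(Xs.T @ Xs + lam * np.eye(4), Xs.T @ Xt).T
```

Because W is rectangular and direction-specific, nothing forces the two domains to share a feature type or dimensionality, which is the flexibility ARC-t achieves in kernel space.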
Multimodal deep learning
Multimodal deep learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee and Andrew Y. Ng, ICML 2011
Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time.
To illustrate how these concepts can be applied to robotics, I also picked the following paper, which I will present briefly: Multimodal integration learning of object manipulation behaviors using deep neural networks
Kuniaki Noda, Hiroaki Arie, Yuki Suga, Tetsuya Ogata, IROS 2013
If interested in more details, there is also a more recent journal version of this paper (RAS 2014).
On machine learning creating economic value
Industry provides us with multiple examples of how machine learning creates economic value. There are many ways to classify these examples: by the algorithms being used, by the size of the data being analysed, or by the field of application. Here we will present one such way of classifying them: by the origin of the training data with respect to an organization, i.e., whether it is internal to the organization, external to it, or a combination of both. To build this perspective, we will provide examples from companies we have consulted for and from our own startup. The goal of the talk is simply to present a perspective that helped us make sense of the industry.
Kernelized Bayesian Transfer Learning
We will discuss the following paper:
Kernelized Bayesian Transfer Learning
M. Goenen and A. A. Margolin, AAAI 2014
Transfer learning considers related but distinct tasks defined on heterogeneous domains and tries to transfer knowledge between these tasks to improve generalization performance. It is particularly useful when we do not have a sufficient amount of labeled training data in some tasks, which may be very costly, laborious, or even infeasible to obtain. Instead, learning the tasks jointly enables us to effectively increase the amount of labeled training data. In this paper, we formulate a kernelized Bayesian transfer learning framework that is a principled combination of kernel-based dimensionality reduction models with task-specific projection matrices to find a shared subspace and a coupled classification model for all of the tasks in this subspace. Our two main contributions are: (i) two novel probabilistic models for binary and multiclass classification, and (ii) very efficient variational approximation procedures for these models. We illustrate the generalization performance of our algorithms on two different applications. In computer vision experiments, our method outperforms the state-of-the-art algorithms on nine out of 12 benchmark supervised domain adaptation experiments defined on two object recognition data sets. In cancer biology experiments, we use our algorithm to predict mutation status of important cancer genes from gene expression profiles using two distinct cancer populations, namely, patient-derived primary tumor data and in-vitro-derived cancer cell line data. We show that we can increase our generalization performance on primary tumors using cell lines as an auxiliary data source.
Joint Classification of Actions and Object State Changes
We will discuss the following paper:
Joint classification of actions and object state changes with a latent variable discriminative model
E. Vafeias and S. Ramamoorthy, ICRA 2014
We present a technique to classify human actions that involve object manipulation. Our focus is to accurately distinguish between actions that are related in that the object’s state changes define the essential differences. Our algorithm uses a latent variable conditional random field that allows for the modelling of spatio-temporal relationships between the human motion and the corresponding object state changes. Our approach involves a factored representation that better allows for the description of causal effects in the way human action causes object state changes. The utility of incorporating such structure in our model is that it enables more accurate classification of activities that could enable robots to reason about interaction, and to learn using a high level vocabulary that captures phenomena of interest. We present experiments involving the recognition of human actions, where we show that our factored representation achieves superior performance in comparison to alternate flat representations.
A Discussion on Robot Crowdsourcing
We will first look at two recent examples of robotic learning [1,2] based on crowdsourcing, and then discuss whether this is the way forward for robotic learning.
[1] Alexander Sorokin, Dmitry Berenson, Siddhartha S. Srinivasa, and Martial Hebert.
People Helping Robots Helping People: Crowdsourcing for Grasping Novel Objects.
[2] Ashesh Jain, Debarghya Das, and Ashutosh Saxena.
PlanIt: A Crowdsourcing Approach for Learning to Plan Paths from Large Scale Preference Feedback.
arXiv:1406.2616 [cs.RO], 2014.
Sharp Performance Bounds for Graph Clustering via Convex Optimization
This talk will be based on the following paper:
"Sharp Performance Bounds for Graph Clustering via Convex Optimization"
Ramya Korlakai Vinayak, Samet Oymak, Babak Hassibi, ICASSP 2014
The problem of finding clusters in a graph arises in several applications such as social networks, data mining and computer networks. A typical convex-optimization approach that is often adopted is to identify a sparse plus low-rank decomposition of the adjacency matrix of the graph, with the (dense) low-rank component representing the clusters. In this paper, we sharply characterize the conditions for successfully identifying clusters using this approach. In particular, we introduce the "effective density" of a cluster that measures its significance, and we find explicit upper and lower bounds on the minimum effective density that demarcates regions of success or failure of this technique. Our conditions are in terms of (a) the size of the clusters, (b) the denseness of the graph, and (c) the regularization parameter of the convex program. We also present extensive simulations that corroborate our theoretical findings.
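To see why a low-rank component reveals clusters, consider a planted two-cluster graph: the expected adjacency matrix is essentially rank two, and already the second eigenvector of the observed adjacency matrix separates the clusters. The sketch below uses a plain eigendecomposition as a cheap stand-in for the convex program analysed in the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
labels = np.array([0] * 30 + [1] * 30)

# Planted partition: dense within clusters (p=0.9), sparse across (p=0.02).
P = np.where(labels[:, None] == labels[None, :], 0.9, 0.02)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# The cluster structure lives in the top (low-rank) part of the spectrum;
# the sign pattern of the second eigenvector splits the two clusters.
vals, vecs = np.linalg.eigh(A)
pred = (vecs[:, -2] > 0).astype(int)
```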
3D Convolutional Neural Networks for Human Action Recognition
In this talk we will study the paper
"3D Convolutional Neural Networks for Human Action Recognition", Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 2013.
We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
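The operation the paper changes relative to a standard CNN is the convolution itself: a 3D kernel slides over time as well as space, so each output value mixes information from several adjacent frames. A naive valid-mode version for clarity (a sketch, not an efficient implementation):

```python
import numpy as np

def conv3d_valid(vol, kern):
    # vol: (frames, height, width); kern: (t, h, w).
    # Each output voxel sums over t adjacent frames, which is how motion
    # information enters the learned features.
    T, H, W = vol.shape
    t, h, w = kern.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(vol[i:i+t, j:j+h, k:k+w] * kern)
    return out
```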
Stochastic Behavior Trees for Estimating and Optimizing the Performance of Reactive Plan Executions
This work presents a mathematical framework for performance estimation and optimization of plans encoded and executed by Behavior Trees (BTs). BTs are a recent alternative to Finite State Machines (FSMs) for doing modular task switching in robot control architectures. By encoding the switching logic in a tree structure, instead of distributing it in the states of an FSM, modularity and reusability are improved. In this paper, we compute performance measures, such as success/failure probabilities and execution times, for any BT. With this intention, we first introduce Stochastic Behavior Trees (SBTs), where we assume that the probabilistic performance measures of the basic action controllers are given. We then show how Discrete Time Markov Chains (DTMCs) can be used to aggregate these measures from one level of the tree to the next. The recursive structure of the tree then enables us to propagate such estimates step by step from the leaves (basic action controllers) to the root (complete task execution). Once these parameters are estimated, we show how to construct an equivalent BT that optimizes user-defined performance preferences. Finally, we verify our analytical results using massive Monte Carlo simulations, and provide an illustrative example of the results for a complex robotic task.
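As a flavour of how success probabilities propagate from leaves to root, here is the simplest special case, assuming independent children with known success probabilities. The paper's DTMC machinery handles the general case, including execution times; this is only our toy reduction:

```python
def sequence_success(ps):
    # A Sequence node succeeds only if every child succeeds.
    p = 1.0
    for pi in ps:
        p *= pi
    return p

def fallback_success(ps):
    # A Fallback (Selector) node succeeds if at least one child succeeds.
    q = 1.0
    for pi in ps:
        q *= (1.0 - pi)
    return 1.0 - q

# Propagate leaf estimates to the root of a small tree:
# root = Sequence(Fallback(a, b), c)
root_p = sequence_success([fallback_success([0.6, 0.5]), 0.9])
```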
Today we will also have a short talk by Aleksandra Franc:
On Topological Complexity
We will define topological complexity of a configuration space and look at a few simple examples.
"Topological Complexity of Motion Planning", M. Farber, Discrete Comput. Geom. 29 (2003), 211-221.
"Instabilities of Robot Motion", M. Farber, Topology and its Applications 140 (2004), 245-266.
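For reference, the definition can be stated compactly (our paraphrase of Farber's definition in the first paper):

```latex
% Topological complexity of a path-connected space X (Farber, 2003):
% TC(X) is the smallest k such that X \times X admits an open cover
% U_1, \dots, U_k, each carrying a continuous section of the path fibration
\pi : PX \to X \times X, \qquad \pi(\gamma) = (\gamma(0), \gamma(1)),
\qquad s_i : U_i \to PX, \quad \pi \circ s_i = \mathrm{id}_{U_i}.
```

Intuitively, each section s_i is a continuous motion-planning rule valid on part of the space of (start, goal) pairs; for example, no single such rule covers all of the circle, which is why TC(S^1) = 2 in this convention.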
Recovering Block Sparse Signals using Block Sparse Bayesian Learning
I will present two approaches for recovering block sparse signals using Bayesian learning, from the first paper listed below. The first approach requires prior information about the block partition; the second is more general, as the block partition does not need to be known. Their method builds on the second paper listed below, in which multiple measurement vector (MMV) signals are recovered by converting them to a single measurement vector (SMV) problem and applying an updated version of sparse Bayesian learning (SBL), namely the block sparse Bayesian learning (bSBL) framework. Based on this framework, they introduce the temporal SBL (T-SBL) algorithm, which I will also talk about.
 "Recovery of block sparse signals using the framework of block sparse Bayesian learning",
Z. Zhang and B. D. Rao, ICASSP 2012
 "Sparse Signal Recovery with Temporally Correlated Source Vectors Using Sparse Bayesian Learning",
Z. Zhang, B. D. Rao, IEEE Journal of Selec. Top. in Sig. Proc., 2011
A Discussion On Gaussian Processes
We will discuss Multi-Output Gaussian Processes and some details on Gaussian Processes one might easily overlook when first encountering them.
To review some of the basics of Gaussian Processes in preparation, the paper "Gaussian Processes for Regression", C.K.I. Williams, C.E. Rasmussen, NIPS 1996, provides a good introduction.
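For those wanting a concrete anchor before the discussion, GP regression prediction fits in a few lines. This is a standard textbook sketch in the style of Williams and Rasmussen, with hyperparameters fixed by hand rather than learned:

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return sf ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x, y, xs, noise=1e-2, ell=1.0):
    # Standard GP regression posterior at test inputs xs.
    K = rbf(x, x, ell) + noise * np.eye(len(x))
    Ks = rbf(x, xs, ell)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(xs, xs, ell) - v.T @ v)
    return mean, var
```

One detail that is easy to overlook on first encounter: the posterior variance reverts to the prior variance away from the data, so the model "knows" where it is extrapolating.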
Learning Low Dimensional Predictive Representations
This talk will be based on the paper "Learning Low Dimensional Predictive Representations"
M. Rosencrantz, G. Gordon, S. Thrun, ICML 2004
Predictive state representations (PSRs) have recently been proposed as an alternative to partially observable Markov decision processes (POMDPs) for representing the state of a dynamical system (Littman et al., 2001). We present a learning algorithm that learns a PSR from observational data. Our algorithm produces a variant of PSRs called transformed predictive state representations (TPSRs). We provide an efficient principal-components-based algorithm for learning a TPSR, and show that TPSRs can perform well in comparison to Hidden Markov Models learned with Baum-Welch in a real world robot tracking task for low dimensional representations and long prediction horizons.
Bag of Words Descriptors and 3D Data
We will give an introduction to bag of words features and will focus on:
the main idea of bag of words, implementation details, shortcomings, spatial pyramid matching, and sum-pooling versus max-pooling. The talk will be based on these papers:
"Locality-constrained Linear Coding for Image Classiﬁcation" J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, CVPR 2010
"SURFing the point clouds: Selective 3D spatial pyramids for category-level object recognition," C. Redondo-Cabrera, R. J. Lopez-Sastre, J. Acevedo-Rodriguez, and S. Maldonado-Bascon, CVPR 2012
"What has my classifier learned? Visualizing the classification rules of bag-of-feature model by support region detection" L. Liu and L. Wang, CVPR 2012
Introduction to Deep Learning
I will present some basics of deep learning (including general Boltzmann machines, the Restricted Boltzmann Machine, the Deep Belief Network, and basic inference in deep learning) and one recent advance, the Multi-Prediction Deep Boltzmann Machine. The talk will be mainly based on two papers:
"Deep Boltzmann Machines, R. Salakhutdinov and G. Hinton", AISTATS 09
"Multi-Prediction Deep Boltzmann Machines, I. J. Goodfellow, M. Mirza, A. Courville, Y. Bengio", NIPS 2013
And there is a recent popular science article, if anyone is interested in the story.
Shared Factorised Bayesian Gaussian Process Latent Variable Models
In this talk, I will present a line of research that we are currently working on. I will begin the talk by explaining and motivating factorised latent variable models for multi-view learning. To make sure that no one is left behind, the talk will provide a basic explanation of Gaussian Processes before delving into sparse variational learning, which allows for Bayesian learning. I will then present how these techniques can be adapted to a multi-view learning scenario, leading to a model we refer to as "Manifold Relevance Determination".
Gaussian Processes for Data-Efficient Learning in Robotics and Control
The discussion will be based on the paper with the above title by Marc Peter Deisenroth, Dieter Fox and Carl Edward Rasmussen, PAMI 2013.
Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning allows one to reduce the amount of engineering knowledge that is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
The Block Diagonal Infinite Hidden Markov Model
The discussion will be based on the above paper by Thomas Stepleton, Zoubin Ghahramani, Geoffrey Gordon, Tai Sing Lee, AISTATS 2009. More details can be found in the PhD thesis of T. Stepleton.
The Infinite Hidden Markov Model (IHMM) extends hidden Markov models to have a countably infinite number of hidden states (Beal et al., 2002; Teh et al., 2006). We present a generalization of this framework that introduces nearly block-diagonal structure in the transitions between the hidden states, where blocks correspond to "sub-behaviors" exhibited by data sequences. In identifying such structure, the model classifies, or partitions, sequence data according to these sub-behaviors in an unsupervised way. We present an application of this model to artificial data, a video gesture classification task, and a musical theme labeling task, and show that components of the model can also be applied to graph segmentation.
Learning Hierarchical Features for Scene Labeling
The discussion will be based on the above paper by C. Farabet, C. Couprie, L. Najman, Y. LeCun, PAMI 2013.
Scene labeling consists of labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation that captures texture, shape, and contextual information. We report results using multiple postprocessing methods to produce the final labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of components that best explain the scene; these components are arbitrary, for example, they can be taken from a segmentation tree or from any family of oversegmentations. The system yields record accuracies on the SIFT Flow dataset (33 classes) and the Barcelona dataset (170 classes) and near-record accuracy on Stanford background dataset (eight classes), while being an order of magnitude faster than competing approaches, producing a 320×240 image labeling in less than a second, including feature extraction.
LTL-based decentralized supervisory control of multi-robot tasks modelled as Petri nets
The discussion will be based on the above paper by Bruno Lacerda.
We present a decentralized methodology to control multi-robot systems, where each robot behaviour is modelled as a Petri net (PN) and a set of coordination rules between the robots is given as linear temporal logic (LTL) formulas describing safety properties for the system. The LTL formulas are used to define the events and changes in state that must be communicated between robots and to augment the individual PN model of each robot so that it can handle the incoming communications. These augmented PNs are then used, in conjunction with the LTL formulas, to build PN realizations of local supervisors, based on discrete event system theory, that enforce the LTL specifications by construction. The methodology is illustrated through a simulated application example.
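A Petri net in this setting is just places, tokens, and transitions, and a minimal firing rule is enough to see what the supervisors act on. This is our own toy encoding, unrelated to the paper's robot models:

```python
def fire(marking, pre, post):
    # A transition is enabled if each input place holds enough tokens;
    # firing consumes the input tokens and produces the output tokens.
    if any(marking.get(p, 0) < n for p, n in pre.items()):
        return None                      # not enabled
    m = dict(marking)
    for p, n in pre.items():
        m[p] = m.get(p, 0) - n
    for p, n in post.items():
        m[p] = m.get(p, 0) + n
    return m

# A robot moves from 'idle' to 'busy' when a 'go' event token is present.
m1 = fire({"idle": 1, "go": 1}, {"idle": 1, "go": 1}, {"busy": 1})
```

The supervisors in the paper work by adding places and arcs to such nets so that any marking violating an LTL safety formula is simply never reachable.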
A Tutorial on Convex Optimization
In recent years, convex optimization has become a computational tool of central importance in engineering, thanks to its ability to solve very large, practical engineering problems reliably and efficiently. The goal of this tutorial is to give an overview of the basic concepts of convex sets, functions and convex optimization problems, so that the reader can more readily recognize and formulate engineering problems using modern convex optimization. This tutorial coincides with the publication of the new book on convex optimization, by Boyd and Vandenberghe, who have made available a large amount of free course material and links to freely available code. These can be downloaded and used immediately by the audience both for self-study and to solve real problems.
The talk will be based on the paper "A Tutorial on Convex Optimization" by Haitham Hindi (abstract above) and will show how convex optimization can be used in Machine Learning.
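As a small taste of the machinery the tutorial covers, here is projected gradient descent on a canonical convex problem, nonnegative least squares. This minimal sketch is our own; the tutorial itself works with general convex programs and duality:

```python
import numpy as np

def nnls_pgd(A, b, steps=2000):
    # minimize ||Ax - b||^2  subject to x >= 0  (a convex QP).
    # Step size 1/L with L the Lipschitz constant of the gradient
    # (largest eigenvalue of A^T A); projection onto the feasible
    # set is just elementwise clipping at zero.
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = np.maximum(x - (A.T @ (A @ x - b)) / L, 0.0)
    return x
```

Because the objective is convex and the feasible set is convex, this simple iteration converges to a global optimum, which is precisely the reliability property the tutorial emphasizes.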
A Generalized Kernel Approach to Structured Output Learning
"We study the problem of structured output learning from a regression perspective. We first provide a general formulation of the kernel dependency estimation (KDE) approach to this problem using operator-valued kernels. Our formulation overcomes the two main limitations of the original KDE approach, namely the decoupling between outputs in the image space and the inability to use a joint feature space. We then propose a covariance-based operator-valued kernel that allows us to take into account the structure of the kernel feature space. This kernel operates on the output space and only encodes the interactions between the outputs without any reference to the input space. To address this issue, we introduce a variant of our KDE method based on the conditional covariance operator that in addition to the correlation between the outputs takes into account the effects of the input variables. Finally, we evaluate the performance of our KDE approach on three structured output problems, and compare it to the state-of-the-art kernel-based structured output regression methods."
"Generalized Kernel Approach to Structured Output Learning", ICML 2013, H. Kadri, M. Ghavamzadeh, P. Preux
We will discuss the paper "ISABoost: A weak classifier inner structure adjusting based AdaBoost algorithm—ISABoost based application in scene categorization"
The AdaBoost algorithm fuses weak classifiers into a strong classifier by adaptively determining the fusion weights of the weak classifiers. In this paper, an enhanced AdaBoost algorithm is proposed that adjusts the inner structure of the weak classifiers (ISABoost). In traditional AdaBoost algorithms, the weak classifiers are not changed once they are trained. In ISABoost, the inner structures of the weak classifiers are adjusted before their fusion weights are determined. ISABoost inherits the advantages of AdaBoost in fusing weak classifiers into a strong classifier, while giving each weak classifier a second chance to be made stronger. The adjusted weak classifiers contribute more to making correct classifications on the hardest samples. To show the effectiveness of the proposed ISABoost algorithm, its applications in scene categorization are evaluated. Comparisons of the ISABoost and AdaBoost algorithms on three widely used scene datasets show the effectiveness of ISABoost.
I would like to: 1. take a short review of the Adaboost algorithm family and its basis, 2. look into the ISABoost proposed in the paper, 3. discuss multi-class classification problems in terms of the structures of classifiers, 4. and maybe discuss multi-class classification in general.
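To set up point 1, the classical AdaBoost round computes each weak classifier's fusion weight from its weighted error and then re-weights the samples. This is the standard textbook formulation for labels in {-1, +1}, not the ISABoost modification:

```python
import numpy as np

def adaboost_round(y_true, y_pred, w):
    # Weighted error of the weak classifier on the current distribution.
    err = w[y_pred != y_true].sum() / w.sum()
    # Fusion weight: large alpha when the weighted error is small.
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    # Re-weight: misclassified samples gain weight, correct ones lose it.
    w = w * np.exp(-alpha * y_true * y_pred)
    return alpha, w / w.sum()
```

After each round the misclassified samples carry half the total weight, which is what makes subsequent weak learners focus on the hardest samples, the same samples ISABoost targets by adjusting the classifiers' inner structure.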
Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
We will mostly study the paper "Support Vector Machine Learning for Interdependent and Structured Output Spaces", I. Tsochantaridis et al., ICML 2004
For those who want more, the main journal article is a similar but longer version, with more details on the proofs and the modelling. A tutorial focused on applications in computer vision may also be of interest; see pages 300-350 for the structured SVM.
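For orientation before the meeting, the margin-rescaled formulation from the Tsochantaridis et al. paper can be written compactly (my paraphrase; here Ψ is the joint input-output feature map and Δ the loss between labels):

```latex
\min_{w,\ \xi \ge 0} \ \frac{1}{2}\lVert w \rVert^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
\langle w,\ \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \ \ge\ \Delta(y_i, y) - \xi_i
\qquad \forall i,\ \forall y \in \mathcal{Y} \setminus \{y_i\}.
```

The number of constraints is exponential in the output size; the cutting plane algorithm discussed in the paper repeatedly adds only the most violated constraint per example.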
An Introduction to a Maximum Entropy Approach
Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference called the maximum-entropy estimate. It is the least biased estimate possible given the available information; i.e., it is maximally noncommittal with regard to missing information.
We will focus on the paper "A Maximum Entropy Approach to Natural Language Processing" by A.L. Berger, S.A. Della Pietra, and V.J. Della Pietra. The following paper provides further background information: "Information Theory and Statistical Mechanics" by E.T. Jaynes.
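As background for the discussion, the maximum-entropy principle in the conditional setting of Berger et al. can be summarized in a few lines (standard material, reproduced here for convenience): among all conditional distributions matching the empirical feature expectations, choose the one with maximal entropy; a Lagrangian argument shows the solution is log-linear.

```latex
\text{maximize } H(p) = -\sum_{x,y} \tilde p(x)\, p(y \mid x)\,\log p(y \mid x)
\quad \text{subject to} \quad
\mathbb{E}_{p}[f_i] = \mathbb{E}_{\tilde p}[f_i], \quad i = 1,\dots,n.
```

Introducing one multiplier $\lambda_i$ per constraint gives the exponential (log-linear) model

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\!\Big(\sum_i \lambda_i f_i(x,y)\Big),
\qquad
Z_\lambda(x) = \sum_{y} \exp\!\Big(\sum_i \lambda_i f_i(x,y)\Big).
```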
Efficient Object Category Recognition using Classemes
The topic of my talk will be the paper "Efficient Object Category Recognition using Classemes"
by L. Torresani, M. Szummer, and A. Fitzgibbon, ECCV 2010.
Abstract: "We introduce a new descriptor for images which allows the construction of efficient and compact classifiers with good accuracy on object category recognition. The descriptor is the output of a large number of weakly trained object category classifiers on the image. The trained categories are selected from an ontology of visual concepts, but the intention is not to encode an explicit decomposition of the scene. Rather, we accept that existing object category classifiers often encode not the category per se but ancillary image characteristics; and that these ancillary characteristics can combine to represent visual classes unrelated to the constituent categories’ semantic meanings. The advantage of this descriptor is that it allows object-category queries to be made against image databases using efficient classifiers (efficient at test time) such as linear support vector machines, and allows these queries to be for novel categories. Even when the representation is reduced to 200 bytes per image, classification accuracy on object category recognition is comparable with the state of the art (36% versus 42%), but at orders of magnitude lower computational cost."
In connection to this, I will briefly explain the basics of hashing methods for large scale image classification and retrieval. For those who want to know more about this, I recommend that you take a look at the NIPS workshop BigVision 2012, in particular the talk by Chang. My presentation will be based on the one by Torresani from the same workshop.
Using machine learning and energy consumption data to map human activities in households
One of the biggest challenges for our society in the coming decades is figuring out how to make the transition towards a sustainable energy system. Since energy efficiency measures pay for themselves through the savings they generate, they are among the best routes towards sustainability. However, due to informational asymmetry, only a fraction of the profitable energy efficiency investments are actually made: organizations and consumers are not aware of the profitable investments they could make and focus their attention elsewhere. In collaboration with Prof. Timo Koski, we are exploring how machine learning techniques can be applied to energy consumption data to identify energy efficiency potentials in an automated fashion. Using a hidden Markov model, we aim to attribute changes in energy consumption to specific devices. This type of analysis is referred to as non-intrusive load monitoring and offers the potential to use advanced analytics to tackle one of our greatest technical challenges. Today's speaker is the initiator of this project.
References: G.W. Hart, "Nonintrusive appliance load monitoring", Proc. IEEE, 1992, and J. Zico Kolter and Tommi Jaakkola, "Approximate inference in additive factorial HMMs with application to energy disaggregation", AISTATS 2012.
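As a toy illustration of the idea (far simpler than the additive factorial HMM of Kolter and Jaakkola: a single two-state appliance with Gaussian emissions, all names my own), Viterbi decoding attributes each power reading to an on/off state:

```python
import numpy as np

def viterbi(obs, states_power, trans, init, sigma=10.0):
    """Most likely on/off state sequence for one appliance given power readings.

    obs          : observed power readings (watts)
    states_power : nominal power draw of each hidden state
    trans        : transition probabilities, trans[i, j] = P(state j | state i)
    init         : initial state distribution
    """
    obs = np.asarray(obs, float)
    states_power = np.asarray(states_power, float)
    n_states = len(states_power)
    # Gaussian log-likelihood of each reading under each state's nominal draw
    log_emit = -0.5 * ((obs[:, None] - states_power[None, :]) / sigma) ** 2
    log_trans = np.log(trans)
    delta = np.log(init) + log_emit[0]          # best log-score ending in each state
    back = np.zeros((len(obs), n_states), dtype=int)
    for t in range(1, len(obs)):
        scores = delta[:, None] + log_trans     # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    # backtrack the best path from the final state
    path = [int(delta.argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

On a reading sequence like 5 W, 3 W, 498 W, 502 W, 4 W with nominal draws of 0 W (off) and 500 W (on), this recovers the switch-on/switch-off events; the disaggregation problem treated in the references is harder because many appliances contribute to one aggregate signal.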
Large Scale Classification
We will discuss the large scale classification problem - why it is interesting and what the challenges are in solving it - as well as the advantages and disadvantages of popular solutions. The talk will be fairly general and serve as an overview of possible approaches, mostly focusing on frequentist, non-probabilistic methods and laying the groundwork for future talks in that direction. I will most likely cover the necessity of domain knowledge (features), representation learning, and some scalable classifiers such as decision trees, boosting and large-margin approaches (particularly support vector machines), together with their extensions, advantages and limitations.
References: Leon Bottou, NYU Big Data Course, Wikipedia.
Markov Chain Monte Carlo Methods
In Bayesian inference one often has to resort to sampling methods, for example when exploring a posterior that cannot be determined analytically. We will review some of the most common sampling strategies, such as rejection sampling and importance sampling, leading up to Markov chain Monte Carlo methods such as the Metropolis algorithm and Gibbs sampling. We will give an introduction to these, comparing their strengths and weaknesses. We will discuss the Metropolis-Hastings algorithm, why and how it works, and ways of making it less parameter dependent. If there is time we might even take a look at CMA-ES (Covariance Matrix Adaptation Evolution Strategy), an evolutionary algorithm for difficult non-linear, non-convex optimization problems in continuous domains. The following papers/tutorials will be the focus: "Introduction to Monte Carlo Methods" by David J.C. MacKay and "Understanding the Metropolis-Hastings Algorithm" by Siddhartha Chib and Edward Greenberg. Optional: "Optimal Proposal Distributions and Adaptive MCMC" by Jeffrey S. Rosenthal and "Probabilistic inference using Markov chain Monte Carlo methods" by Radford M. Neal. For CMA-ES, see Wikipedia or this link.
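As a warm-up for the discussion, a random-walk Metropolis sampler fits in a dozen lines (a minimal sketch with a symmetric Gaussian proposal, so the Hastings correction cancels; names are my own). It needs only the unnormalised log-density of the target, which is exactly the situation with intractable posteriors:

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, rng=None):
    """Random-walk Metropolis sampler for a 1-D target.

    log_target : unnormalised log-density of the target distribution
    step       : standard deviation of the Gaussian proposal (the tuning parameter)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # accept with probability min(1, p(proposal) / p(x)); only the
        # ratio of (unnormalised) densities is needed
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples[i] = x                      # on rejection the chain stays put
    return samples
```

The `step` parameter is precisely the kind of tuning knob that the adaptive-MCMC reading (Rosenthal) tries to remove: too small and the chain explores slowly, too large and almost every proposal is rejected.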
Hierarchical Dirichlet Processes
I will present the paper "Hierarchical Dirichlet Processes" in this session of the ML reading group. This mixture model is highly modular, has many nice properties compared to other mixture models such as Latent Dirichlet Allocation, and provides a good basis for modelling different types of data. The model and its extensions have drawn a lot of attention in recent years. For those of you who are not familiar with it, the Dirichlet Process section from the Encyclopedia is a good start. Yee Whye Teh's video lecture also provides a nice introduction.
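For reference, the two-level construction at the heart of the paper can be stated in two lines (my summary): a global random measure $G_0$ is drawn once from a Dirichlet process, and each group $j$ draws its own measure from a DP centred on $G_0$, so all groups share the same countable set of atoms with group-specific weights.

```latex
G_0 \mid \gamma, H \ \sim\ \mathrm{DP}(\gamma, H),
\qquad
G_j \mid \alpha_0, G_0 \ \sim\ \mathrm{DP}(\alpha_0, G_0) \quad \text{for each group } j,
```

with observations generated as $\theta_{ji} \sim G_j$ and $x_{ji} \sim F(\theta_{ji})$. The atom-sharing across groups is what lets the model infer a common set of mixture components, with the number of components unbounded a priori.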
Variational Learning of Inducing Variables in Sparse GPs
We will be discussing the paper "Variational Learning of Inducing Variables in Sparse Gaussian Processes" by Michalis K. Titsias, 2009. For those not well versed in GPs, I suggest the tutorial by MacKay or Rasmussen and Williams' book on Gaussian processes. Bishop's book provides another useful resource. For the variational approach and KL divergence, browsing "Tutorial on variational approximation methods" by Tommi S. Jaakkola might be helpful. There is also a good tutorial in Bishop's book and a briefer one in MacKay's book (available online).
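As a pointer for the discussion, the central quantity in the paper is (to the best of my reading) the following variational lower bound on the log marginal likelihood of a GP regression model with $n$ data points and $m$ inducing inputs:

```latex
\log p(\mathbf{y}) \ \ge\ F_V
= \log \mathcal{N}\!\big(\mathbf{y} \,\big|\, \mathbf{0},\ \sigma^2 I + Q_{nn}\big)
\ -\ \frac{1}{2\sigma^2}\operatorname{tr}\!\big(K_{nn} - Q_{nn}\big),
\qquad
Q_{nn} = K_{nm}\, K_{mm}^{-1}\, K_{mn}.
```

The key move is that the inducing inputs become variational parameters rather than model parameters, and the trace term penalises poor placement of them, which distinguishes this approach from earlier sparse GP approximations.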
Initial meeting and Bayesian Non-parametrics
Modern Bayesian non-parametric methods provide a powerful technique for the modelling of various types of data. We will discuss the review "Bayesian nonparametrics and the probabilistic approach to modelling" by Zoubin Ghahramani, Phil. Trans. R. Soc. 2011. If you would like to have a peek at some exciting non-parametric methods, also have a look at "The Indian Buffet Process: An Introduction and Review" by Thomas L. Griffiths and Zoubin Ghahramani, Journal of Machine Learning Research, 2011.
This reading group is currently organized by Martin Hjelm and myself. If you have ideas for papers we should discuss, or any other questions, please let me know.
You can find the official 2014 course website here: DD3012 Machine Learning Reading Group. You can obtain 6 ECTS credits for this course, if you actively participate in at least 24 sessions and give at least two talks.