Hedvig Kjellström (née Sidenbladh)


The general theme of my research is automatic models of perception and production of human non-verbal communicative behavior and activity, in other words, designing machines that can interpret human non-verbal communication and/or produce human-like non-verbal behavior. As outlined below, these ideas are applied in Social Robotics, Performing Arts, and Healthcare.

Further down, all students (current and alumni) and projects (ongoing and finished) are listed in chronological order.

Social Robotics

NAO robots at KTH

Samuel Murray
Taras Kucherenko
Judith Bütepage (co-supervisor)
Kalin Stefanov (co-supervisor)

Jonas Beskow
Joakim Gustafson
Danica Kragic
Iolanda Leite
Jana Tumova
Michael Black (Max Planck Institute for Intelligent Systems, Germany)

Logic reasoning and deep learning
Robot interaction skills
Latent motion models

Performing Arts

Robocygne (Åsa Unander-Scharin 2010)

Vincent Trichon

Anders Friberg
Carl Unander-Scharin
Åsa Unander-Scharin

Robotic Choir
Conducting Motion


Healthcare

Image from techemergence.com

Sofia Broomé
Marcus Klasson
Olga Mikheeva

Jonas Beskow
Joakim Gustafson
Tino Weinkauf
Paul Ackermann (Karolinska Institute, Sweden)
Pia Haubro Andersen (Swedish Agricultural University, Sweden)
Miia Kivipelto (Karolinska Institute, Sweden)
Cheng Zhang (Disney Research, USA)

VR EquestrianML


Post Docs
Yanxia Zhang (2016, now at University of Delft, Netherlands)
David Geronimo (2013-2014, now at Catchoom Technologies, Spain)

Research Engineers
Vincent Trichon
John Pertoft (2016)
Akshaya Thippur (2012)

PhD Students
Samuel Murray
Sofia Broomé
Olga Mikheeva
Marcus Klasson
Taras Kucherenko
Judith Bütepage (co-supervisor)
Kalin Stefanov (co-supervisor)
Püren Güler (co-supervisor, PhD 2017, now at)
Cheng Zhang (PhD 2016, now at Disney Research, USA)
Alessandro Pieropan (PhD 2015, now at Univrses, Sweden)
Javier Romero (co-supervisor, PhD 2011, now at Body Labs, USA)

MSc Students
Olga Mikheeva (MSc 2017)
Vincent Trichon (MSc 2017)
Josefin Ahnlund (MSc 2016)
John Pertoft (MSc 2016)
Kelly Karipidou (MSc 2015)
Adam Knutsson (MSc 2015)
Patrik Berggren (MSc 2014)
Tong Shen (MSc 2014)
Mihai Damaschin (MSc 2013)
Sina Nakhostin (MSc 2012, guest from Örebro University)
Akshaya Thippur (MSc 2012)
Saad Ullah Akram (MSc 2012)
Sara Mansouri (MSc 2011, guest from Chalmers University of Technology)
Cheng Zhang (MSc 2011)
Nataliya Shapovalova (MSc 2010, guest from University of Bourgogne, France)
Simon Bos (MSc 2008, guest from University of Rennes, France)
Anette Larsson (MSc 2007)
Josef Grahn (MSc 2005)
Matthieu Bray (MSc 2001, guest from University Blaise Pascal, France)

BSc Students
Gustav Andersson (BSc 2016)
Mona Lindgren (BSc 2016)
Jonas Olofsson (BSc 2016)
Anders Sivertsson (BSc 2016)
Björn Hegstam (BSc 2011)
Joakim Hugmark (BSc 2011)
Oliver Schneider (BSc 2010, guest from University of Karlsruhe, Germany)

Principled integration of logic reasoning and deep learning (WASP, 2017-present)

Machine Learning methods based on Deep Neural Networks (DNN) have been tremendously successful in the last few years. The success is especially prominent in domains where large volumes of training data can be acquired, such as Computer Vision and Speech Recognition. However, in domains where the goal is to infer complex causal chains, the needed amount of training data grows rapidly. An example is human-robot collaboration, where the robot might want to infer the goal of a sequence of actions performed by a human, in order to plan its own actions.
On the other hand, humans are able to learn complex models from very few examples. We argue that a key difference between human learning and DNN learning is that humans employ logical reasoning, along with knowledge about intuitive physics and intuitive psychology, in their learning. Models that combine DNNs with logic reasoning and probabilistic models in a principled manner would enable learning of much more complex models from much less training data - a major breakthrough in Deep Learning!

Joint work with Samuel Murray and Jana Tumova.

EquestrianML: Machine Learning methods for recognition of the pain expressions of horses (VR, 2017-present)

Recognition of pain in horses and other animals is important, because pain is a manifestation of disease and decreases animal welfare. Pain diagnostics for humans typically includes self-evaluation and localization of the pain with the help of standardized forms, and labeling of the pain by a clinical expert using pain scales. However, animals cannot verbalize their pain as humans can, and the use of standardized pain scales is challenged by the fact that animals such as horses and cattle, being prey animals, display subtle and less obvious pain behavior - it is simply beneficial for a prey animal to appear healthy, in order to lower the interest from predators.
The aim of this project is to develop a method for automatic recognition of pain in horses. The method employs an RGB-D sensor mounted in the ceiling of the stable, and detects and recognizes behavioral patterns related to pain in an automated manner, when the horse perceives itself as being alone. The Machine Learning-based pain classification system is trained with examples of behavioral traits labeled with pain level and pain characteristics. This automated, user-independent system for recognition of pain behavior in horses will be the first of its kind in the world. A successful system might change how we monitor and care for our animals.

Joint work with Pia Haubro Andersen and Sofia Broomé.

Project home page

ML4HC: Machine Learning for HealthCare (Promobilia, SeRC, 2016-present)

In this project, we develop Machine Learning methods for automatic decision support to medical doctors in their work to diagnose different types of injuries and illnesses, determine the underlying cause of the condition, and propose suitable treatment.

For more information, see the papers (Zhang et al., 2017), (Zhang et al., 2016a), (Qu et al., 2016), (Zhang et al., 2016b), (Zhang et al., 2016c), and (Zhang et al., 2016d) in the publication list.

Joint work with Paul Ackermann, Marcus Klasson, Tino Weinkauf, and Cheng Zhang.

EACare: Embodied Agent to support elderly mental wellbeing (SSF, 2016-present)

The main goal of this multidisciplinary project is to develop a robot head with communicative skills capable of interacting with elderly people at their convenience.
The robot will be able to analyze their mental and psychological status via powerful audiovisual sensing, assessing their mental abilities in order to identify subjects at high risk, or possibly in the first stages, of depressive or dementing disorders, who would benefit from proper medical consultation. The framework will also provide tools for dementia-preventive training.
We propose an ambitious, innovative and integrated approach that combines:
  • an easy-to-use, patient-friendly and caregiver-friendly embodied agent to interact with patients via both verbal and non-verbal communication, including dialogue especially designed by psychiatrists to probe, measure, test, and monitor the patient's mental state,
  • pioneering technological solutions at both the algorithmic and system level for multimodal information processing to achieve audiovisual human-computer interaction including action recognition, aural and facial expressions, and speech understanding,
  • systematic evaluation of the developed systems in both large and focused social groups, using feedback from each evaluation to guide further scientific research and design.

Joint work with Jonas Beskow, Joakim Gustafson, Miia Kivipelto, Taras Kucherenko, Iolanda Leite, and Olga Mikheeva.

Project home page

Data-driven modelling of interaction skills for social robots (KTH, 2016-present)

This project aims to investigate fundamentals of situated and collaborative multi-party interaction, and to collect the data and knowledge required to build social robots that are able to handle collaborative attention and co-present interaction. In the project we will employ state-of-the-art motion and gaze tracking on a large scale as the basis for modelling and implementing critical non-verbal behaviours such as joint attention, mutual gaze, and backchannels in situated human-robot collaborative interaction, in a fluent, adaptive and context-sensitive way.

Joint work with Simon Alexanderson, Jonas Beskow, Taras Kucherenko, Kalin Stefanov, and Yanxia Zhang.

Latent models of human motion (KTH, Max Planck Society, 2016-present)

This project concerns representation learning to accomplish different kinds of effective representation of human (body, face, hand) motion, for the purpose of mapping, prediction, inference and interpretation.

For more information, see the paper (Bütepage et al., 2017) in the publication list.

Joint work with Michael Black, Judith Bütepage, Taras Kucherenko, and Olga Mikheeva.

Robotic Choir - expressive body language in different embodiments (KTH, 2016-present)

During ancient times, the choir (χορος, khoros) had a major function in the classical Greek theatrical plays - commenting on and interacting with the main characters in the drama. We aim to create a robotic choir, invited to take part in a full-scale operatic performance in the Croatian National Theatre Ivan Zajc in Rijeka, Croatia, in January 2017 - thereby grounding our technological research in an ancient theatrical and operatic tradition. In our re-interpretation, the choir will consist of a swarm of small flying drones that have perceptual capabilities and thereby will be able to interact with human singers, reacting to their behavior both as individual agents, and as a swarm.

Joint work with Vincent Trichon, Carl Unander-Scharin, and Åsa Unander-Scharin.

SocSMCs: Socialising SensoriMotor Contingencies (EU H2020, 2015-present)

As robots become more omnipresent in our society, we are facing the challenge of making them more socially competent. However, in order to safely and meaningfully cooperate with humans, robots must be able to interact in ways that humans find intuitive and understandable. Addressing this challenge, we propose a novel approach for understanding and modelling social behaviour and implementing social coupling in robots. Our approach presents a radical departure from the classical view of social cognition as mind-reading, mentalising or maintaining internal representations of other agents. This project is based on the view that even complex modes of social interaction are grounded in basic sensorimotor interaction patterns. SensoriMotor Contingencies (SMCs) are known to be highly relevant in cognition. Our key hypothesis is that learning and mastery of action-effect contingencies are also critical to enable effective coupling of agents in social contexts. We use "socSMCs" as a shorthand for such socially relevant action-effect contingencies. We will investigate socSMCs in human-human and human-robot social interaction scenarios.

The main objectives of the project are to elaborate and investigate the concept of socSMCs in terms of information-theoretic and neurocomputational models, to deploy them in the control of humanoid robots for social entrainment with humans, to elucidate the mechanisms for sustaining and exercising socSMCs in the human brain, to study their breakdown in patients with autism spectrum disorders, and to benchmark the socSMCs approach in several demonstrator scenarios.

Our long-term vision is to realize a new socially competent robot technology grounded in novel insights into mechanisms of functional and dysfunctional social behavior, and to test novel aspects and strategies for human-robot interaction and cooperation that can be applied in a multitude of assistive roles relying on highly compact computational solutions.

For more information, see the papers (Bütepage et al., 2017) and (Bütepage et al., 2016) in the publication list.

Joint work with Mårten Björkman, Judith Bütepage, and Danica Kragic.

Project home page

Analyzing the motion of musical conductors (KTH, 2014-present)

Classical music sound production is structured by an underlying manuscript, the sheet music, that specifies in some detail what will happen in the music. However, the sheet music specifies only up to a certain degree how the music sounds when performed by an orchestra; there is room for considerable variation in terms of timbre, texture, balance between instrument groups, tempo, local accents, and dynamics. In larger ensembles, such as symphony orchestras, the interpretation of the sheet music is done by the conductor. We propose to learn a simplified generative model of the entire music production process from data: the conductor's articulated body motion in combination with the produced orchestra sound. This model can be exploited for two applications: the first is "home conducting" systems, i.e., conductor-sensitive music synthesizers; the second is tools for analyzing conductor-orchestra communication, where latent states in the conducting process are inferred from recordings of conducting motion and orchestral sound.
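
The idea of inferring latent states from an observed motion sequence can be sketched with a toy hidden Markov model and Viterbi decoding. The states ("downbeat"/"upbeat"), the discretized motion features, and all probabilities below are illustrative placeholders, not the model from (Karipidou et al., 2017):

```python
import math

# Toy HMM: infer latent "conducting states" from a discretized motion
# feature sequence. All states, features and probabilities are made up.
STATES = ["downbeat", "upbeat"]
START = {"downbeat": 0.6, "upbeat": 0.4}
TRANS = {"downbeat": {"downbeat": 0.3, "upbeat": 0.7},
         "upbeat":   {"downbeat": 0.7, "upbeat": 0.3}}
# Emission: probability of a motion feature ("down", "up", "hold") per state.
EMIT = {"downbeat": {"down": 0.7, "up": 0.1, "hold": 0.2},
        "upbeat":   {"down": 0.1, "up": 0.7, "hold": 0.2}}

def viterbi(observations):
    """Most likely latent state sequence, computed in log space."""
    trellis = [{s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), None)
                for s in STATES}]
    for obs in observations[1:]:
        prev = trellis[-1]
        layer = {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p][0] + math.log(TRANS[p][s]))
            score = prev[best][0] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            layer[s] = (score, best)
        trellis.append(layer)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: trellis[-1][s][0])
    path = [state]
    for layer in reversed(trellis[1:]):
        state = layer[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["down", "up", "down", "up"]))
```

In the real setting the observations would be continuous motion-capture and audio features, and the model parameters would be learned from recordings rather than hand-set.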

For more information, see the paper (Karipidou et al., 2017) in the publication list.

Joint work with Simon Alexanderson and Anders Friberg.

Dataset and videos from (Karipidou et al., 2017)

FOVIAL: FOrensic VIdeo AnaLysis - finding out what really happened (VR, EU Marie Curie, 2013-2016)

In parallel to the massive increase of text data available on the Internet, there has been a corresponding increase in the amount of available surveillance video. There are good and bad aspects of this. One undeniably positive aspect is that it is possible to gather evidence from surveillance video when investigating crimes or the causes of accidents: forensic video analysis. Forensic video investigations are today carried out manually. This involves a huge and very tedious effort, e.g., 60 000 hours of video in the Breivik investigation. The amount of surveillance data is also constantly growing, which means that in future investigations it will no longer be possible to go through all the evidence manually. The solution is to automate parts of the process.

In this project we propose to learn an event model from surveillance data that can be used to characterize all events in a new set of surveillance video recorded from a camera network. Our model will also represent the causal dependencies and correlations between events. Using this model, or explanation, of the data from the network, a semi-automated forensic video analysis tool with a human in the loop will be designed: the user chooses a certain event, e.g., a certain individual getting off a train, and the system returns all earlier observations of this individual, all other instances of people getting off trains, or all events that may have caused or are correlated with the given "person getting off train" event.

There are two activities in this project: development of methods for image retrieval, and development of core topic model methodologies for video classification.
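
The human-in-the-loop query idea can be illustrated with a minimal sketch over a log of detected events. The event format, event types, and time window below are all hypothetical, not the actual FOVIAL system:

```python
from datetime import datetime, timedelta

# Hypothetical log of events detected in a camera network.
events = [
    {"time": datetime(2016, 5, 1, 8, 0),  "type": "train_arrives",       "camera": 1},
    {"time": datetime(2016, 5, 1, 8, 1),  "type": "person_exits_train",  "camera": 1},
    {"time": datetime(2016, 5, 1, 8, 2),  "type": "person_exits_station", "camera": 3},
    {"time": datetime(2016, 5, 1, 9, 30), "type": "person_exits_train",  "camera": 2},
]

def instances_of(log, event_type):
    """All instances of a chosen event type."""
    return [e for e in log if e["type"] == event_type]

def nearby_events(log, query, window_minutes=5):
    """Events within a time window of the query event - a crude stand-in
    for the learned causal/correlation structure described above."""
    window = timedelta(minutes=window_minutes)
    return [e for e in log if e is not query and abs(e["time"] - query["time"]) <= window]

exits = instances_of(events, "person_exits_train")
print(len(exits))
print([e["type"] for e in nearby_events(events, exits[0])])
```

In the envisioned tool, the fixed time window would be replaced by dependencies learned from the data, and the events themselves would be produced by the video classification methods rather than entered by hand.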

For more information, see the papers (Geronimo and Kjellström, 2016), (Zhang et al., 2016c), (Eriksson and Kjellström, 2016), (Geronimo and Kjellström, 2014), (Zhang et al., 2013a), and (Zhang et al., 2013c) in the publication list.

Joint work with Cheng Zhang and David Geronimo.

RoboHow (EU FP7, 2013-2016)

Robohow aims at enabling robots to competently perform everyday human-scale manipulation activities, both in human working and living environments. In order to achieve this goal, Robohow pursues a knowledge-enabled and plan-based approach to robot programming and control. The vision of the project is that of a cognitive robot that autonomously performs complex everyday manipulation tasks and extends its repertoire of such tasks by acquiring new skills using web-enabled and experience-based learning, as well as by observing humans.

For more information, see the papers (Caccamo et al., 2016), (Pieropan et al., 2016), (Güler et al., 2015), and (Pieropan et al., 2015) in the publication list.

Joint work with Püren Güler, Danica Kragic, Karl Pauwels, and Alessandro Pieropan.

Project home page
Video from (Pieropan et al., 2015)

Audio-visual object-action recognition (KTH, 2013-2014)

In TOMSY and HumanAct (see below), we have developed a method for segmenting out plausible object hypotheses from a video of a human involved in an object manipulation activity. The found object hypotheses are used to reason about objects in terms of their functionality in the activity. This representation is intended for a learning from demonstration system, to enable robots to understand and represent activities where objects are grasped, moved, and processed in different ways. In this project, audio, and speech extracted from audio, will be added as a second modality. We have earlier studied the possibility to augment a statistical model for multimodal inference and learning with acoustic information in the context of robot manipulation. Similar methodologies are applicable to ambient sounds in general.

For more information, see the paper (Pieropan et al., 2014b) in the publication list.

Joint work with Karl Pauwels, Alessandro Pieropan, and Giampiero Salvi.

Dataset from (Pieropan et al., 2014b)
Video from (Pieropan et al., 2014b)

TOMSY: TOpology based Motion SYnthesis for dexterous manipulation (EU FP7, 2011-2014)

The aim of TOMSY is to enable a generational leap in the techniques and scalability of motion synthesis algorithms. We propose to do this by learning and exploiting appropriate topological representations and testing them on challenging domains of flexible, multi-object manipulation and close contact robot control and computer animation. Traditional motion planning algorithms have struggled to cope with both the dimensionality of the state and action space and generalisability of solutions in such domains. This proposal builds on existing geometric notions of topological metrics and uses data driven methods to discover multi-scale mappings that capture key invariances - blending between symbolic, discrete and continuous latent space representations. We will develop methods for sensing, planning and control using such representations.

For more information, see papers (Pieropan et al., 2014a), (Pieropan et al., 2014b), (Pieropan and Kjellström, 2014), (Sun et al., 2014), (Romero et al., 2013a), (Zhang et al., 2013a), (Zhang et al., 2013b), (Romero et al., 2013b), (Pieropan et al., 2013), (Hjelm et al., 2013), (Zhang et al., 2013c), (Thippur et al., 2013b), and (Pokorny et al., 2012) in the publication list.

Joint work with Carl Henrik Ek, Martin Hjelm, Danica Kragic, Karl Pauwels, Alessandro Pieropan, Florian Pokorny, Subramanian Ramamoorthy, Eduardo Ros, Marc Toussaint, Sethu Vijayakumar, and Cheng Zhang.

Project home page
Video from (Pieropan et al., 2014b)
Video from (Romero et al., 2013a)
Video from (Pieropan et al., 2013)

Recognition of Swedish sign language (KTH, Post och Telestyrelsen, 2011-2014)

Automatic recognition of sign language is a research area pertaining to several different areas in computer science, such as computer vision and language technology. Sign language technology research has attracted a lot of attention recently, with a potential to dramatically improve accessibility in society much in the same way as speech technology has in recent years.
The goal of this project is a method for automatic visual recognition of isolated Swedish sign language signs. The method uses data from video and/or 3D sensor input (e.g., Microsoft Kinect), and is a key component in a computer game intended for training signing as support. The game, aimed at children with language disabilities, is developed within the project Tivoli at KTH.

For more information, see the report (Akram et al., 2012) in the publication list.

Joint work with Jonas Beskow and Kalin Stefanov.

Tivoli project homepage

Gesture-based violin synthesis (KTH, 2011-2012)

There are many commercial applications of synthesized music from acoustic instruments, e.g., generation of orchestral sound from sheet music. Whereas the sound generation process of some types of instruments, like the piano, is fairly well known, the sound of a violin has proven extremely difficult to synthesize. The reason is that the underlying process is highly complex: the art of violin playing involves extremely fast and precise motion, with timing on the order of milliseconds.
We believe that ideas from Machine Learning can be employed to build better violin sound synthesizers. The task of this project is to use learning methods to create a generative model of violin sound from sheet music, using an intermediate representation of the kinematic system (violin and bow) generating the sound. To train the generative model, a database with motion capture of bowing will be used, containing a large set of bowing examples, performed by 6 professional violinists.

For more information, see the paper (Thippur et al., 2013a) in the publication list.

Joint work with Anders Askenfelt and Akshaya Thippur.

Example of motion capture data
Parameters extracted from this motion capture data

HumanAct: Visual and multi-modal learning of Human Activity and interaction with the surrounding scene (VR, EIT ICT Labs, 2010-2013)

The overwhelming majority of human activities are interactive in the sense that they relate to the world around the human (in Computer Vision called the "scene"). Despite this, visual analyses of human activity very rarely take scene context into account. The objective in this project is modeling of human activity with object and scene context.
The methods developed within the project will be applied to the task of Learning from Demonstration, where a (household) robot learns how to perform a task (e.g. preparing a dish) by watching a human perform the same task.

For more information, see papers (Zhang et al., 2013a), (Zhang et al., 2013b), (Pieropan et al., 2013), (Zhang et al., 2013c), (Kjellström et al., 2011), and (Kjellström et al., 2010) in the publication list.

Joint work with Michael Black, Alessandro Pieropan, and Cheng Zhang.

Video from (Pieropan et al., 2013)
Videos from (Kjellström et al., 2010)

PACO-PLUS: Perception, Action and COgnition through learning of object-action complexes (EU FP6, 2007-2010)

The EU project PACO-PLUS brings together an interdisciplinary research team to design and build cognitive robots capable of developing perceptual, behavioural and cognitive categories that can be used, communicated and shared with other humans and artificial agents. In my part of the project, we are interested in programming by demonstration applications, in which a robot learns how to perform a task by watching a human do the same task. This involves learning about the scene, objects in the scene, and actions performed on those objects. It also involves learning grammatical structures of actions and objects involved in a task.

For more information, see papers (Kjellström et al., 2011), (Sanmohan et al., 2011), (Romero et al., 2010), and (Kjellström et al., 2008b) in the publication list.

Joint work with Tamim Asfour, Jan-Olof Eklundh, Danica Kragic, Volker Krüger, and Javier Romero.

Project home page
Video from (Romero et al., 2010)

ARTUR: A multi-modal ARticulation TUtoR (VR, 2004-2006)

The intended outcome of the project is a system that simplifies articulation training for hearing-impaired or second-language learners by providing both acoustic and articulatory feedback. This involves two research problems, spanning speech technology, computer vision, and human-computer interaction. Firstly, methods to reconstruct the motion of the face, lips and vocal tract from speech and video of the face have to be developed. Secondly, presentation strategies have to be developed and tested. The feedback to the user will consist of the reconstructed face, lips and vocal tract, along with explanations of what was correct and what should be changed in the articulation.

For more information, see papers (Kjellström and Engwall, 2009) and (Engwall et al., 2006b) in the publication list.

Joint work with Olle Bälter and Olov Engwall.

Project home page

Detection of humans in images (FOI, 2003-2006)

The scope of this work is methods for detection of humans in video. This encompasses human motion detection for automatic visual surveillance, and fast detection of pedestrians from IR imagery for car safety applications, in collaboration with industry.

For more information, see the paper (Sidenbladh, 2004) in the publication list.

Joint work with Jörgen Karlholm.

Monte Carlo methods for information fusion (FOI, 2002-2005)

In this work, particle filter methods for tracking a changing and unknown number of objects were developed. One application is tracking multiple vehicles or units in terrain, observed by aerial vehicles, ground sensor networks, and humans on the ground.
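
The flavor of these methods can be conveyed by a minimal bootstrap particle filter for a single 1D target; the project's multi-object terrain trackers were of course far richer, and all models and parameters below are made up for the example:

```python
import math
import random

random.seed(0)

def particle_filter(observations, n_particles=500, motion_std=1.0, obs_std=1.0):
    """Bootstrap particle filter with a 1D random-walk motion model."""
    # Initialize particles around the first observation.
    particles = [random.gauss(observations[0], obs_std) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: propagate each particle through the motion model.
        particles = [p + random.gauss(0.0, motion_std) for p in particles]
        # Weight: Gaussian likelihood of the observation given each particle.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Resample: draw a new particle set in proportion to the weights.
        particles = random.choices(particles, weights=weights, k=n_particles)
        # Point estimate: the posterior mean.
        estimates.append(sum(particles) / n_particles)
    return estimates

# Noisy observations of a target moving from position 0 toward 9.
obs = [t + random.gauss(0.0, 0.5) for t in range(10)]
est = particle_filter(obs)
print(round(est[-1], 1))
```

Tracking an unknown number of objects additionally requires handling births and deaths of targets and data association between observations and targets, which is where the methods developed in the project go beyond this sketch.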

For more information, see papers (Sidenbladh, 2003) and (Sidenbladh and Wirkander, 2003) in the publication list.

We also developed methods for comparison of situation pictures of different granularity or at different points in time. One application is to obtain a measure of system reliability: if the situation pictures obtained from two independent systems (e.g., trackers) differ greatly, one or both systems are wrong or have too little information. This can be used to give a human operator of the system information about the reliability of the situation pictures.

For more information, see papers (Sidenbladh et al., 2005) and (Sidenbladh et al., 2004) in the publication list.

The methods were part of a larger data fusion system, intended to fuse data presented to operators of a military command and control system. Here, fusing means combining data statistically to reduce the amount of data presented to the operator, while at the same time enhancing the quality of the presented data.

Joint work with Johan Schubert, Pontus Svenson, and Sven-Lennart Wirkander.

Videos from (Sidenbladh, 2003)

3D reconstruction of human motion (SSF, 1997-2001)

The subject of my PhD work was Bayesian methods (e.g., particle filtering) for tracking and reconstruction of human motion in 3D from image sequences. Possible applications are video-based human motion capture and recognition of human actions and gestures. This is, of course, a highly underdetermined problem: the depth information that was lost in the projection of the human onto the camera plane has to be inferred from other data. Therefore, models of how humans normally move and appear were learned from training data (images and motion capture data), and used to infer 3D motion from the 2D image sequence.

For more information, see the PhD thesis or papers (Sidenbladh and Black, 2003), (Sidenbladh et al., 2002), and (Sidenbladh et al., 2000b) in the publication list.

Joint work with Michael Black, Fernando De la Torre, Jan-Olof Eklundh, and David Fleet.

Image data and example code from the thesis
Video tracking examples from the thesis