
DT2140 Multimodal interfaces

Suggested projects

This list is non-exhaustive and will be extended. You may choose any other project topic that interests you, as long as you have had it approved by one of the teachers of the course.

Supervisors: OE=Olov Engwall, AF=Anders Friberg, RB=Roberto Bresin, JM=Jonas Moll, AL=Anders Lundstrom.

Tangible interfaces, Augmented Reality etc.

Visual input and gestures

Sound and Music

Spoken interaction

Gestures as a control in immersive gaming (OE):
Description: Build an application using Kinect that allows you to navigate through Google Street View using gestures, e.g. turn left, turn right, walk ahead, walk backwards, zoom in, zoom out. The aim is to use this application for people who are inside an immersive gaming dome (http://mediagrid.org/summit/media/2011_Boston_Summit/GeoDome_Portal.jpg). Test your application by recording one participant navigating through Google Street View in the immersive gaming dome. Requirements: Some programming experience (C++ or C#), high fun factor, good teamwork skills.
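
A minimal sketch of the kind of mapping such an application needs, from already-tracked skeleton joints to navigation commands. The joint format, thresholds and command names below are illustrative assumptions, not the Kinect SDK's actual data structures.

# Minimal sketch: map already-tracked Kinect joint positions to navigation
# commands for a Street View client. Joint coordinates are assumed to be
# (x, y, z) tuples in meters, with x to the user's right and z pointing away
# from the sensor -- an assumption, not the real Kinect SDK data structure.

def classify_gesture(left_hand, right_hand, left_shoulder, right_shoulder):
    """Return a coarse gesture label from one frame of joint positions."""
    # Hand raised clearly above the shoulder -> turn in that direction.
    if right_hand[1] > right_shoulder[1] + 0.20:
        return "turn_right"
    if left_hand[1] > left_shoulder[1] + 0.20:
        return "turn_left"
    # Both hands pushed well in front of the shoulders -> walk ahead.
    if (right_shoulder[2] - right_hand[2] > 0.40 and
            left_shoulder[2] - left_hand[2] > 0.40):
        return "walk_ahead"
    return "idle"

# Hypothetical mapping from gestures to Street View actions; in practice these
# would trigger key presses or JavaScript API calls in the viewer.
COMMANDS = {
    "turn_left": "rotate heading -15 degrees",
    "turn_right": "rotate heading +15 degrees",
    "walk_ahead": "move to next panorama link",
    "idle": "do nothing",
}

if __name__ == "__main__":
    frame = {
        "left_hand": (-0.3, 1.0, 1.8), "right_hand": (0.3, 1.6, 1.8),
        "left_shoulder": (-0.2, 1.3, 2.0), "right_shoulder": (0.2, 1.3, 2.0),
    }
    gesture = classify_gesture(**frame)
    print(gesture, "->", COMMANDS[gesture])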

Real-time facial animation (OE): Description: In this project, you will use existing video-based facial tracking software, Face-API (Seeing Machines), to animate a 3D head in real time. The project involves streaming tracking information into a game engine and employing animation and retargeting techniques to drive the 3D head.

You may optionally choose to build a head-mounted camera (i.e. the head-worn rig for the camera) to capture the facial images.

Keywords: Real-time facial animation, telepresence, face tracking.
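
To illustrate the streaming step, here is a hedged sketch of sending per-frame tracking parameters to a game engine over UDP as JSON. The parameter names, port number and placeholder values are assumptions for illustration; the real project would read these values from the Face-API tracker and parse them on the engine side.

# Minimal sketch: stream per-frame head-pose / expression parameters to a game
# engine over UDP as JSON. Parameter names, port and update rate are assumed.
import json
import socket
import time

ENGINE_ADDR = ("127.0.0.1", 5005)   # assumed listener inside the game engine

def send_frame(sock, head_rotation, jaw_open, lip_stretch):
    frame = {
        "head_rotation": head_rotation,   # (pitch, yaw, roll) in degrees
        "jaw_open": jaw_open,             # normalised 0..1
        "lip_stretch": lip_stretch,       # normalised 0..1
        "timestamp": time.time(),
    }
    sock.sendto(json.dumps(frame).encode("utf-8"), ENGINE_ADDR)

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Placeholder loop; replace with values read from the face tracker.
    for i in range(10):
        send_frame(sock, head_rotation=(0.0, float(i), 0.0),
                   jaw_open=0.1 * i, lip_stretch=0.5)
        time.sleep(1 / 30)               # roughly 30 frames per second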

Large virtual camera (OE): Background: By having a one-to-one correspondence between a real object and a 3D model of the same object, and also knowing how they are spatially related, you can project the virtual object onto the real one. This makes it possible to alter the real object's surface representation and to add a dynamic dimension to it (cf. Rorschach's face in the film “Watchmen”).

Project description: This project aims to use motion capture technology to track a physical projection screen in space and let a virtual representation of the screen be projected back onto it in real time. The virtual screen, in its turn, acts as a virtual camera in a digital scene. Put together, the tracked screen acts as a large camera viewfinder into a virtual space.

Keywords: Motion capture, mixed reality, projection, virtual camera
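
As a hedged sketch of the core transform, the snippet below turns a tracked screen pose into a view matrix for the virtual camera, so the rendered image follows the physical screen. The pose format (position vector plus a single yaw angle) is a simplifying assumption; a real mocap system delivers a full rotation, e.g. a quaternion.

# Minimal sketch: camera view matrix = inverse of the tracked screen's world
# transform (rotation about y, then translation to the tracked position).
import numpy as np

def pose_to_view_matrix(position, yaw_degrees):
    yaw = np.radians(yaw_degrees)
    rotation = np.array([
        [ np.cos(yaw), 0.0, np.sin(yaw), 0.0],
        [ 0.0,         1.0, 0.0,         0.0],
        [-np.sin(yaw), 0.0, np.cos(yaw), 0.0],
        [ 0.0,         0.0, 0.0,         1.0],
    ])
    translation = np.eye(4)
    translation[:3, 3] = position
    world = translation @ rotation
    return np.linalg.inv(world)          # view matrix for the renderer

if __name__ == "__main__":
    view = pose_to_view_matrix(position=[0.5, 1.2, 2.0], yaw_degrees=30.0)
    print(np.round(view, 3))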

Games with the purpose to acquire labels for multimodal data (OE):
Background: Development of speech applications (multimodal or voice-only) requires very large amounts of labeled data (indicating the correspondence between the acoustic signal and the text content or meaning). In recent years there have been attempts to acquire these labels at low cost and with high efficiency through so-called human computation, using games with a purpose.

Task: To build a prototype game with a purpose designed to collect labels for audio and video recordings of spoken dialogues, and to test the quality of the labels. The goal is to keep the overall game design general enough that it can be reused for other tasks with relatively small effort.
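
One common quality mechanism in games with a purpose is output agreement: a label is only accepted once several independent players have proposed it. The sketch below shows that idea with a hypothetical storage layout and threshold; these details are assumptions, not part of the project brief.

# Minimal sketch of an output-agreement check for collected labels.
from collections import Counter, defaultdict

AGREEMENT_THRESHOLD = 3   # assumed number of matching answers required

labels = defaultdict(Counter)   # segment id -> Counter of proposed labels

def submit_label(segment_id, player_label):
    """Record one player's label and return the accepted label, if any."""
    labels[segment_id][player_label.strip().lower()] += 1
    label, count = labels[segment_id].most_common(1)[0]
    return label if count >= AGREEMENT_THRESHOLD else None

if __name__ == "__main__":
    for answer in ["Greeting", "greeting", "question", "greeting "]:
        accepted = submit_label("dialogue_01_seg_004", answer)
    print("accepted label:", accepted)   # 'greeting' after three agreements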

Head rotation in 3-party dialogue (OE): Do people always rotate their heads towards the person they are directing speech at when standing at a short distance? If not, when do they?
Record a 5-minute dialogue of 3 people standing at a distance of 1 meter from each other, and another one at 2 meters, and scientifically study their head rotation behavior and whether it changes between the two distances.
Requirements: High scientific accuracy, no programming experience, good teamwork skills, high level of curiosity.
Difficulty Factor: Easy, Time Factor: High, Scientific Value: High, Fun Factor: Medium, Contribution: building natural robot behavior.

Mutual gaze in 3-party dialogues (OE): How much eye contact do people have when talking? How much do people look at each other when standing close to or further away from each other?
Record a 5-minute dialogue of 3 people standing at a distance of 1 meter from each other, and another one at 2 meters, and scientifically study their mutual gaze, and whether its amount and timing change over the different distances.
Requirements: High scientific accuracy, no programming experience, good teamwork skills, high level of curiosity.

Audiovisual speech perception test with an augmented reality talking face (OE): "Is a 2D or a 3D representation best?"
Background: Seeing animations of tongue movements can help listeners understand sentences with degraded sound. However, it is not clear how the animations are interpreted by the listeners. Is it good to have many details, providing much information, or is a simpler display better? The project is a small follow-up study to investigate whether a three-dimensional, "realistic" display or a simplified display with a two-dimensional tongue gives the best recognition results. The test set-up is available.
Task: Run a perception experiment with 10-20 subjects. The experiment is set up with two groups, so that one group sees one set of sentences with the 3D display and another set with the 2D display, and the conditions are reversed for the other group. The subjects should indicate the words that they hear, and the experimenter then counts the number of correct words in the different conditions.
Analyze the results: For which representation are the recognition scores the best?
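
For the analysis step, a hedged sketch: since every subject contributes a score in both display conditions (counterbalanced sentence sets), the two conditions could be compared with a paired t-test. The score values below are made-up placeholders; replace them with the counts of correctly reported words from the experiment.

# Minimal sketch: compare per-subject word recognition scores for the 3D and
# 2D displays with a paired t-test (each subject saw both conditions).
from scipy import stats

scores_3d = [14, 17, 12, 15, 16, 13, 18, 15, 14, 16]   # placeholder data
scores_2d = [11, 13, 12, 10, 14, 12, 13, 11, 12, 13]   # placeholder data

t_value, p_value = stats.ttest_rel(scores_3d, scores_2d)
print(f"mean 3D = {sum(scores_3d) / len(scores_3d):.1f}, "
      f"mean 2D = {sum(scores_2d) / len(scores_2d):.1f}")
print(f"t = {t_value:.2f}, p = {p_value:.3f}")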

Multimodal speech perception test with an augmented reality talking face (OE): "How can you read tongue movements?"
Background: It is well known that lip reading supports speech perception. In some recent studies, we have investigated our ability to use information from seeing tongue movements. The project would be a small follow-up study; the test set-up and the stimuli are available.
Task: Run a perception study with 10-20 subjects and investigate where the subjects are looking using an eye tracker.
Analyze the results: Is there any relationship between where the subjects focus their attention (on the face, the lips, the tongue) and the word recognition results?
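
One hedged way to quantify that relationship: correlate each subject's proportion of fixation time on a region of interest (e.g. the tongue) with their word recognition score. The values below are placeholders, and the choice of a Pearson correlation is an illustrative assumption.

# Minimal sketch: relate gaze behaviour to recognition results.
from scipy import stats

tongue_fixation_share = [0.10, 0.35, 0.22, 0.41, 0.15, 0.30, 0.27, 0.18]  # placeholders
recognition_score     = [9,    15,   12,   16,   10,   14,   13,   11]    # placeholders

r, p = stats.pearsonr(tongue_fixation_share, recognition_score)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")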

Use a freely available API to build a multimodal (or speech-only) interface (OE): Software: Microsoft ASR & TTS, available in English Windows Vista & 7; WAMI toolkit for JavaScript; Nuance Cafe for VoiceXML; CSLU toolkit, as an extension of Lab 3.
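
Purely to illustrate the recognise-then-act loop such an interface needs, here is a hedged sketch using the third-party Python package speech_recognition (not one of the toolkits listed above); swap in the toolkit of your choice for the actual project.

# Minimal sketch of a speech-driven interface loop using the third-party
# "speech_recognition" package (pip install SpeechRecognition; the microphone
# input additionally requires PyAudio).
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_once():
    """Capture one utterance from the microphone and return it as text."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)   # web-based recogniser
    except sr.UnknownValueError:
        return ""                                   # nothing recognised

if __name__ == "__main__":
    text = listen_once()
    if "hello" in text.lower():
        print("System: Hello! How can I help you?")
    else:
        print(f"System: You said: {text!r}")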

Build an app or evaluate the performance of speech recognition on iPhone or Android mobile phones (OE) (Siri, Dragon Dictation, etc.)

Speech synthesis evaluation (OE): For example, how good is the "Read out loud" feature in Adobe Reader? Test this functionality of the reader. Let listeners write down what they hear; what is the word accuracy? Let listeners rate the effort and pleasantness; is the "Read out loud" function any good?
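
A hedged sketch of how the word accuracy could be computed: align each listener's transcript with the reference text word by word using edit distance, and report accuracy = 1 - errors / reference length. The example sentences are placeholders.

# Minimal sketch: word accuracy via a standard dynamic-programming word-level
# edit distance (substitutions, deletions, insertions all count as errors).
def word_accuracy(reference, transcript):
    ref, hyp = reference.lower().split(), transcript.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    heard = "the quick brown box jumps over lazy dog"
    print(f"word accuracy: {word_accuracy(ref, heard):.2f}")  # 0.78 (2 errors / 9 words)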

Haptic interfaces


Course responsible: Olov Engwall, engwall@kth.se, 790 75 65