KTH, School of Computer Science and Communication (CSC)

DT2140 Multimodal interfaces

Laboratory exercise 3: Multimodal speech interfaces

Task: To build a simple speech-driven interface, with a talking head, that controls the object manipulation in Processing (lab exercise 2) and the sound manipulation in pd (lab exercise 1).

The objective of the exercise is to give the students basic hands-on experience with speech synthesis and speech recognition components, as well as to illustrate the benefit of using alternative modalities and the importance of efficient confirmation and correction management.

Software & equipment: The CSLU toolkit, headphones and a microphone. Please bring your own headset/microphone if you have one.

Note: If you wish, you can download the CSLU toolkit to your own computer and do (much of) the task on your own (to do it entirely, including the communication with the other programs, you also need the Tcl package udp, Processing and pd on your computer). If you do, you should save your solution as a .rad file and present it at the scheduled exercise session.

Background:

The human-computer interface that you create should be appropriate for e.g. a cell phone (a small screen, limited cursor control, an inefficient keyboard for letters, but a digit keypad):
- The main interaction (input and output) is via voice.
- You should use an animated agent that speaks with either text-to-speech synthesis or time aligned recorded prompts of your own voice (see Tutorial 13 for the latter).
- The screen presents the animated agent and allowed replies from the user for each question.
- The digit keypad may be used in repair subdialogues (e.g. to type the amount to resize), but should not be the main interaction input.
- The cursor may be used in repair subdialogues to make a user selection graphically.

The agent needs to check that the user's answers have been correctly recognized and to correct misrecognitions. It is, however, inefficient (and hence not allowed in this exercise) to do the check
- at the end and restart from the beginning if the user is not satisfied, or
- explicitly after each user answer, as this will make the dialogue slow and frustrating.
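A common alternative, and a pattern you may use here, is implicit confirmation: embed the recognized value in the next prompt, so the user only needs to object when something is wrong. A minimal Python sketch of the idea (the function name and wording are hypothetical, for illustration only; in RAD this is simply how you phrase the next prompt):

```python
def implicit_confirm(heard, next_question):
    """Fold the last recognition result into the next prompt, instead of
    asking an explicit 'Did you say ...?' after every answer."""
    return f"{heard.capitalize()}. {next_question}"

# implicit_confirm("a red circle", "Do you want to resize it?")
# -> "A red circle. Do you want to resize it?"
```

If the user objects at this point, only the misrecognized value has to be repaired, not the whole dialogue.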

Preparations:

Before the exercise session, you should
- look at Tutorials 1-3, 5-6, 11-12 and 19, in order to get to know the Rapid Application Developer in the toolkit.
- team up with a fellow student to do the lab with.
- (in the lab pairs) draw a flowchart, in a program of your choice (or on paper, scanning the sketch), of the components and the transitions you need for the dialogue interaction in the exercise (see below). The sketch should be submitted in Bilda before the lab, and you should be able to present and discuss the flowchart with the lab assistant at the beginning of the lab session.
- decide when and how confirmation requests and repairs should be done in the dialogue.

Instructions:

The exercise consists of two parts: a compulsory part, in which you build a simple and very constrained dialogue where the interaction is completely system-driven, and an optional part, in which you create a freer, more user-controlled interaction if you have the time. Download the following scripts and save them in V:/.rad/:

- The socket program, which enables communication with Processing.
- The command to run the socket program.
- The Processing interpreter, which reads and executes the commands sent by the CSLU toolkit (unzip the file to V:/.rad).
- The sound patch in PD, which creates a sound dependent on the 2D position of the cursor.
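The programs above talk to each other over UDP: the toolkit sends plain-text messages of the form "VARNAME value" to port 6000 on localhost, as used in the ACTION hint further down. As a rough Python sketch of that message format (the function names here are mine, not part of the lab files):

```python
import socket

def format_command(name, value):
    """Build the plain-text 'VARNAME value' message the interpreter expects."""
    return f"{name} {value}"

def send_command(name, value, host="127.0.0.1", port=6000):
    """Send the message over UDP, like the sendudp helper used in RAD."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_command(name, value).encode(), (host, port))

# e.g. send_command("shape", "circle") would tell Processing to draw a circle
```

In the lab itself you do not write this code; it only shows what travels over the wire between RAD, Processing and pd.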

Start Processing, load the files in V:/.rad/UDP and play.

Start PD and load the sound patch.

Start the RAD (Rapid Application Development) from the CSLU Toolkit folder on the Start menu.

Use the RAD components to create a dialogue where a computer-animated character asks the user the following questions (possible answers in brackets):

SHAPE: The shape of the object: [rectangle], [triangle], [circle], [ellipse].
COLOUR: The colour of the object: [red], [blue], [green], [cyan], [magenta], [yellow], [black], [white].
RESIZE: Resize the object: [smaller], [larger], [no].
SIZE_AMOUNT: The resize amount: [2], [3], [4], [5].

SOUND: Decide if a sound should be associated with the object: [yes], [no].
MOVE: Move the object on the canvas (and hence change the sound): [up], [down], [left], [right], [stop].

You should insert intermediate confirmation/correction checks at appropriate states in the dialogue (e.g., include a correction option in the next stage) and take appropriate action to correct errors, i.e. the user should NOT have to repeat answers that have been correctly recognized. Repair sub-dialogues may use an alternative method to request the information (see the dialogue components below).

Build the dialogue incrementally, i.e., start with a small part of the dialogue, build and test it, then add new components, and finally add the repair sub-dialogues.

At the end of the exercise session, show your solution to the lab assistant and discuss your choice of dialogue flow and the performance of your interface.

Dialogue components:

You should use the following states in your dialogue, but you may use others as well, and you decide how each component is used. You may also make small changes to the states in order to adapt them to your dialogue scheme, as long as the complexity of the task and the use of multimodality are not reduced.

* WELCOME: The agent greets the user.
* SHOW_SHAPE: Shows a list of the possible shapes: [rectangle], [triangle], [circle], [ellipse].
* SHAPE: Asks for and recognizes the shape the user wants, and sends the value to Processing (see below).
* SHOW_COLOUR: Shows a list of the possible colours: [red], [blue], [green], [cyan], [magenta], [yellow], [black], [white].
* COLOUR: Asks for and recognizes the colour the user wants, and sends the value to Processing (see below).
* RESIZE: Asks (and recognizes) if the user wants to make the object [smaller] or [larger], or not ([no]).
* SIZE_AMOUNT: Asks the user how much larger or smaller the object should be (note that this stage should not be reached if the user answers "no" at RESIZE). Sends the value to Processing (see below).
* SOUND: Asks the user if the sound should be turned on. Sends [on] to Processing if the user answers yes (see below).
* MOVE: Lets the user move the object [up], [down], [left] or [right], until (s)he says [stop]. Sends the value to Processing (see below).
Think of how you create the loop; you do not want a full prompt, e.g., "How do you want to move the object?", for each move.
* RESTART: Asks the user if the program should restart with a new object. If not, end the dialogue.
* END: The agent closes the dialogue.
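The MOVE loop described above can be sketched in Python as follows (recognize and send are hypothetical stand-ins for the RAD recognition state and the sendudp action; in the lab the loop is wired graphically in RAD):

```python
def move_loop(recognize, send):
    """Repeat MOVE until the user says 'stop', prompting in full only once.

    recognize(prompt) returns one recognized word per turn;
    send(name, value) forwards a command to Processing.
    """
    moves = {"up", "down", "left", "right"}
    prompt = "How do you want to move the object?"
    while True:
        word = recognize(prompt)
        prompt = None  # subsequent turns: just listen, no full re-prompt
        if word == "stop":
            break
        if word in moves:
            send("move", word)
```

The point of the sketch is the prompt handling: the full question is asked once, after which the state only listens for the next direction or "stop".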

Useful components and hints:
* Use the ACTION component (the gear wheel) each time you need to send data to Processing. Double-click on the gear wheel and "Load" exec.tcl from where you saved it (or simply type exec sendudp 127.0.0.1 6000 "VARNAME $varvalue" in the window). VARNAME is the name of the variable, and you should change it to the correct variable name, i.e. one of:
shape
scale
colour
sound
move
$varvalue is one of the possible values for each variable, as listed above.

Note that for scale, you need to combine the answers at RESIZE and SIZE_AMOUNT, so that you send a single scale value, which should be equal to $SIZE_AMOUNT if $RESIZE is "larger" and 1/$SIZE_AMOUNT if $RESIZE is "smaller". That is, $varvalue for scale may be 0.2, 0.25, 0.33, 0.5, 2, 3, 4 or 5.
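Written out explicitly, the rule above looks like this (a hedged Python transcription for clarity; in RAD you would compute it in a Tcl ACTION state, and the "no"-branch is an assumption since SIZE_AMOUNT should never be reached in that case):

```python
def scale_value(resize, amount):
    """Combine the RESIZE and SIZE_AMOUNT answers into one scale factor:
    the amount itself for 'larger', its reciprocal for 'smaller'."""
    if resize == "larger":
        return float(amount)
    if resize == "smaller":
        return round(1.0 / amount, 2)  # e.g. 3 -> 0.33
    return 1.0  # 'no' at RESIZE: leave the size unchanged (assumed fallback)
```

So, for example, "smaller" with amount 4 yields 0.25, and "larger" with amount 4 yields 4.0.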

* The recognition output variable from a GENERIC component (e.g. SHAPE) is referred to using the component name + (recog), e.g. $SHAPE(recog)
* Tcl/Tk is a little peculiar with variable references: you do not put $ before the variable name when you initialize it, but you do when you refer to it, e.g. set SHAPE "circle" to assign, but $SHAPE == "circle" in a conditional test.
* To check whether a field (e.g. MOVE) has a certain value (e.g. "stop"), use if {$MOVE(recog) == "stop"}.
* Remember to remove the previous SHOW_* frame when presenting a new one, to avoid confusion (this is done with Right-Click>Remove Media).

Optional add-ons:

If you have the time:
Change the dialogue so that it is more user-driven, i.e., the agent asks "What do you want to do?" and depending on the user's answer (e.g. "Make a red circle", "Change the sound", "Make it twice as big."), different actions are taken.

Create multimodal repair components:
* SIZE_AMOUNT_WITH_KEYS: If the speech recognition fails for the SIZE_AMOUNT component, the user should be asked to key in the amount instead. Use the RESPONSE component with the following Properties: Buttons, Button List: 1 2 3 4 5
* SELECT_SHAPE/SELECT_COLOUR: Create an image map (see Tutorial 5) with the different shapes or colours, so that the user may click on the appropriate choice.

Requirements:
- The preparation flowchart should be presented to the lab assistant.
- The solution should make use of multimodality as appropriate (speech, text, keypad input etc) to solve the task efficiently.
- The solution should be able to handle a successful dialogue and should make confirmation requests and corrections in a user-friendly manner.


Course responsible: Olov Engwall, engwall@kth.se, 790 75 65