Natural Speech Interfaces for Environment and Device Control

The number of electronic devices that control our environment is ever increasing. While this trend brings greater flexibility and control over our environment, configuring each individual device to achieve the desired environmental state becomes ever more tedious and often burdensome. For example, in the home, to prepare the environmental state for cooking, one might want to turn the radio on and tune it to the news, turn the volume of the speakers up since the kitchen will be noisy, and make the kitchen lights bright. Then, to prepare the environmental state for eating dinner, one might dim the lights in the dining room, turn on the mp3 player to play a dinner music playlist, turn the volume of the speakers down, and close the blinds. Controlling all of these devices (the lights, the radio, the speaker volume, the mp3 player, and the blinds) to achieve a desired environmental state is quite tedious.

As with any interface, an interface for controlling the environmental state in the home should match the user's mental model instead of that of the underlying devices. That is, the user should only need to specify the {\em name} of the environmental state rather than the {\em configuration} of each individual device needed to achieve the desired state. In the home example, the user should be able to indicate “set cooking mode” or “apply dinner mode” rather than specifying the configuration of the radio, speakers and lights to achieve the cooking or dinner environmental state.

The challenge in designing an interface that matches the user's mental model lies not only in discovering the names for the environmental states (e.g., “set cooking mode”), but also in discovering the configurations of devices necessary to achieve each named environmental state. In other words, we are given only the set of devices that can be controlled and how to control them; we do not know the names of the desired environmental states, the configuration of devices that achieves each environmental state, or the mapping from environment names to device configurations. We call this issue the name-configuration mapping problem. To provide an ideal natural interface, users should be able to customize the environment names and the corresponding configurations of devices. Therefore, the name-configuration mapping problem cannot be solved a priori but must be solved separately for each set of users and each domain.
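As a minimal sketch of what a solution to the name-configuration mapping problem has to store, the Python fragment below keeps, for each user, a table from environment names to device configurations. The class name, the user name, the device identifiers, and the setting strings are all hypothetical and chosen only to mirror the home example; the sketch does not describe how Illuminac represents its mappings.

```python
from typing import Dict

# Hypothetical representation: a configuration maps a device id to a setting.
DeviceConfig = Dict[str, str]


class NameConfigMap:
    """Per-user table from environment names to device configurations."""

    def __init__(self) -> None:
        # user -> environment name -> device configuration
        self._maps: Dict[str, Dict[str, DeviceConfig]] = {}

    def define(self, user: str, name: str, config: DeviceConfig) -> None:
        """Let a user create or redefine a named environmental state."""
        self._maps.setdefault(user, {})[name] = dict(config)

    def lookup(self, user: str, name: str) -> DeviceConfig:
        """Return the configuration for a name; raises KeyError if unknown."""
        return self._maps[user][name]


mapping = NameConfigMap()
mapping.define("alice", "cooking mode", {
    "radio": "on, tuned to news",
    "speakers": "volume up",
    "kitchen_lights": "bright",
})
mapping.define("alice", "dinner mode", {
    "dining_lights": "dim",
    "mp3_player": "dinner playlist",
    "speakers": "volume down",
    "blinds": "closed",
})
```

Because both the names and the configurations are entered per user, nothing in the table can be fixed ahead of deployment, which is the sense in which the problem cannot be solved a priori.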

In addition to matching the user's mental model, we want the interface to be “calm.” That is, the interface in a ubicomp setting such as environment control should be almost invisible except during direct (focal) interaction, as advocated by Weiser and Brown [1]. In this context, a speech-based interface seems like a good option. With distributed microphone technology, the physical interface all but disappears, yet it jumps fluidly to the foreground when the system responds to spoken input. Furthermore, speech is often considered the most natural form of human expression and has the potential to address certain accessibility concerns.

Such natural speech interfaces and, more broadly, natural language-based interfaces only exacerbate the name-configuration mapping problem, as they add a level of uncertainty to the system, arising either from imprecision in human input or from uncertainty in capturing that input. With speech interfaces, users have to recall their commands, and what they recall may be a slight variant of the original. For example, one may say “set mode for cooking” or “please change the environment for cooking” instead of “set cooking mode.” In addition, the automatic speech recognizer will inevitably make recognition errors, such as recognizing “set cooking mode” as “set cook in low.”
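To make the effect of this uncertainty concrete, the sketch below matches a possibly misrecognized or rephrased utterance against the known environment names by character-level similarity, using `difflib.SequenceMatcher` from the Python standard library. The function name `best_match`, the list of names, and the 0.5 threshold are assumptions made for illustration; this is a generic string-similarity baseline, not the method used by Illuminac.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical set of environment names known to the system.
KNOWN_NAMES = ["set cooking mode", "apply dinner mode"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(utterance: str, names: List[str],
               threshold: float = 0.5) -> Optional[str]:
    """Return the most similar known name, or None if none is close enough.

    The threshold is an illustrative guess; a deployed system would
    have to tune it or replace this heuristic with a learned model.
    """
    if not names:
        return None
    scored = [(similarity(utterance, name), name) for name in names]
    score, name = max(scored)
    return name if score >= threshold else None


# A recognition error and a recalled variant from the examples above.
for heard in ["set cook in low", "set mode for cooking"]:
    print(heard, "->", best_match(heard, KNOWN_NAMES))
```

A fixed threshold trades missed commands against false activations, and longer paraphrases can score poorly under a purely character-based measure, which illustrates why spoken input makes the mapping harder than a lookup by exact name.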

We believe this theme occurs in many ubicomp environments. For example, in the workspace, the name-configuration mapping problem arises in bridging the “semantic gap” between a configuration command like “presentation” and the control of the lights, projector, and sound devices. In a workspace with an array of individually controllable lights, rather than a bank of lights all controlled by one switch, mapping a configuration command like “Joe's lights on” to turning on the lights over Joe's desk presents the same challenge. In the context of mobile control, the challenge also arises in mapping a request like “high priority calls only” to the specific subset of the user's contacts who are important enough to trigger a ring response.

We have designed and deployed a natural speech interface in an open-plan workspace. The system runs live and controls 79 devices (individually controllable lights) for 25 users, who have both their own and shared environment names and configurations. We call this system Illuminac.

[1] M. Weiser and J. S. Brown. The coming age of calm technology. In Beyond calculation: the next fifty years, pages 75–85. Copernicus, New York, NY, USA, 1997.