Research at MIT's Media Lab
Teaching Computers to Recognize and Express Emotion
Research in computer recognition and expression of emotion is in its infancy. Two of the current research efforts at the MIT Media Lab focus on recognition of facial expression and synthesis of vocal affect. These are not, of course, the only ways to recognize affective states; posture, gesture, and physiological signs such as an increased breathing rate, for example, also provide valuable cues.
Computers, like people, can use cognitive reasoning -- a form of common sense (see chapter 9) -- to understand a person's goals and to predict his or her affective state when those goals are disrupted. For example, HAL may predict: "Because I killed Dave's colleagues and won't let him back on the ship, Dave will be upset." If prediction and observation agree, the computer is likely to strengthen its belief in that line of reasoning. If they disagree, it will register the mismatch as interesting (perhaps even puzzling): "Most people would be enraged by all this, but Dave doesn't look very upset. He is great at concealing his emotions. Or maybe he knows something important that I don't?"
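To make this predict-and-compare loop concrete, here is a toy Python sketch; the function, the emotion labels, and the fixed adjustment step are illustrative assumptions, not part of any actual system:

```python
# Toy sketch of cognitive affect reasoning (illustrative assumptions only):
# predict an emotion from a model of the person's goals, compare it with
# what is observed, and adjust belief in that line of reasoning.

def update_belief(belief, predicted, observed, step=0.1):
    """Return (new_belief, note). Agreement strengthens the belief;
    disagreement weakens it and flags a puzzling mismatch."""
    if predicted == observed:
        return min(1.0, belief + step), None
    note = f"Expected {predicted}, observed {observed}: worth investigating."
    return max(0.0, belief - step), note

# HAL predicts "upset," but Dave appears calm: belief drops, and the
# note records why the observation is puzzling.
belief, note = update_belief(belief=0.9, predicted="upset", observed="calm")
```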
One way to recognize an expression is to record facial movements during a short video sequence, digitize the sequence, then apply the tools of pattern recognition. Recognition from a moving sequence is generally more accurate than recognition from a still image. If, for example, a person's "neutral" expression is a pout, only deviations from the pout (captured by video as movement) will be significant for recognizing affect.
Using this method requires a video camera, a digitizer, and a computer running video-analysis and pattern-recognition algorithms. Pattern recognition can use a variety of techniques -- such as analyzing individual muscle-actuation parameters or (more coarsely) characterizing an overall facial-movement pattern. In a test involving eight people, recognition rates were as high as 98 percent for four emotions (see figure 13.2). Studies are under way to determine how the recognition rate changes as the number of subjects grows. As yet, this method of recognition doesn't work in real time; it takes a few seconds to recognize each expression. However, advances in hardware and pattern recognition should make recognition essentially instantaneous in the near future -- at least for familiar expressions.
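As a rough illustration of the coarser approach -- characterizing an overall facial-movement pattern -- the following Python sketch reduces each digitized clip to a motion-energy feature and classifies it against per-emotion templates. It is a minimal sketch under assumed names and a nearest-template classifier, not the Media Lab's actual algorithm:

```python
import numpy as np

# Minimal sketch of video-based expression recognition. Frame-to-frame
# differences capture movement, i.e., deviations from the person's
# neutral expression; each clip becomes one motion-energy vector.

def motion_feature(frames):
    """frames: grayscale clip as an array of shape (T, H, W)."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.mean(axis=0).ravel()

def train(clips, labels):
    """Average the features of each emotion's training clips into a template."""
    grouped = {}
    for clip, label in zip(clips, labels):
        grouped.setdefault(label, []).append(motion_feature(clip))
    return {label: np.mean(feats, axis=0) for label, feats in grouped.items()}

def classify(templates, clip):
    """Assign the emotion whose template is nearest to the clip's feature."""
    f = motion_feature(clip)
    return min(templates, key=lambda label: np.linalg.norm(templates[label] - f))
```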
Although facial expressions are among the most visible signs of underlying emotional states, they are also easy to control in order to hide emotion. Having a good poker face that reveals none of your emotions is valuable not only for playing cards but also in the cutthroat worlds of business and law. The social display rules of emotion specific to our culture are impressed upon us all as we grow up. I have seen a student who was undergoing great personal pain resist crying, his eyes twitching unnaturally as he held back his tears; he had been taught at an early age never to show emotion in public. Nonetheless, the healthy human body seems unable to suppress emotion entirely. He might not cry, but his eyes may twitch. She might not sound nervous, but she may, literally, have cold feet.
Emotional expression is clearly not limited to facial movement. Vocal intonation is the other most common way to communicate strong feelings. Several features of speech are modulated by emotion; we divide them into categories such as voice quality (e.g., breathy or resonant, depending on the individual vocal tract and breathing), utterance timing (e.g., speaking faster in fear, slower in disgust), and utterance pitch contour (e.g., a greater frequency range and more abrupt changes for anger, a smaller range and downward contours for sadness), as illustrated in figure 13.3. As these features vary, the emotional expression of the voice changes. The research problem of precisely how to vary these features to synthesize realistic intonation remains unsolved.
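To illustrate, the sketch below parameterizes these features per emotion in Python; the numeric values are purely hypothetical, chosen only to mirror the qualitative directions above (faster for fear, slower for disgust, a wider and more abrupt pitch range for anger, a narrower falling range for sadness):

```python
# Hypothetical prosody settings per emotion; rate and range are
# multipliers relative to neutral speech, contour a qualitative label.
EMOTION_PROSODY = {
    "anger":   {"rate": 1.2, "range": 1.8, "contour": "abrupt"},
    "fear":    {"rate": 1.4, "range": 1.3, "contour": "rising"},
    "sadness": {"rate": 0.8, "range": 0.6, "contour": "falling"},
    "disgust": {"rate": 0.7, "range": 0.9, "contour": "falling"},
}

def apply_affect(pitch_contour, emotion):
    """Scale a neutral pitch contour (a list of Hz values) toward an emotion
    by widening or narrowing its excursions around the mean pitch."""
    params = EMOTION_PROSODY[emotion]
    mean = sum(pitch_contour) / len(pitch_contour)
    return [mean + (p - mean) * params["range"] for p in pitch_contour]
```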
The inverse problem -- intonation analysis, or recognizing how something is said -- is also quite difficult. Research to date has limited the speaker to a small number of sentences, and the results still depend closely on the particular words spoken. A method of precisely separating what is said from how it is said has not yet been developed (see chapter 6).
You will note that no one method -- whether recognition of facial expression or of voice intonation -- is likely to produce reliable recognition of emotion on its own. In this sense, affect recognition is similar to other recognition problems, such as speech recognition and lipreading. A personalized combination that takes into account both perceptual cues (say, from vision and audition) and cognitive cues (such as HAL's reasoning about how Dave would respond) is most likely to succeed. These cues will undoubtedly work best when considered in context: is it a poker game, where bluffing is the norm, or a marriage proposal, where sincerity is expected?
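Here is a minimal Python sketch of how such a personalized, context-sensitive combination might work -- assuming each channel reports a probability distribution over emotions, with hypothetical weights and context labels:

```python
# Context determines how much each cue channel is trusted: in a poker
# game faces are controlled, so the face is weighted lightly.
CONTEXT_WEIGHTS = {
    "poker":    {"face": 0.1, "voice": 0.4, "cognitive": 0.5},
    "proposal": {"face": 0.4, "voice": 0.4, "cognitive": 0.2},
}

def fuse(channel_scores, context):
    """channel_scores: {channel: {emotion: probability}}. Returns a
    normalized, context-weighted distribution over emotions."""
    weights = CONTEXT_WEIGHTS[context]
    emotions = next(iter(channel_scores.values())).keys()
    combined = {e: sum(weights[c] * channel_scores[c][e] for c in channel_scores)
                for e in emotions}
    total = sum(combined.values())
    return {e: s / total for e, s in combined.items()}

# Dave keeps a straight face, but cognitive cues say he should be upset.
scores = {
    "face":      {"upset": 0.2, "calm": 0.8},
    "voice":     {"upset": 0.5, "calm": 0.5},
    "cognitive": {"upset": 0.9, "calm": 0.1},
}
fuse(scores, "poker")  # weighs cognition and voice over the controlled face
```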