Paper by U-M researchers selected as a Best Paper in IEEE Transactions on Affective Computing

September 23, 2022

The paper, on automatic speech emotion recognition, is one of five featured in the journal's Best Papers collection.

A paper authored by CSE researchers has been selected as one of five Best Papers featured in IEEE Transactions on Affective Computing (T-AFFC). In “Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG),” the researchers propose new methods for learning generalized representations of emotional speech that improve recognition performance across datasets.

As work in human-machine interaction grows, so does the desire to decode users' emotions. Many emotion recognition methods are inspired by those used in automatic speech recognition. However, speech emotion datasets are comparatively small, so the resulting emotion models often work well only in the domains in which they were trained and perform poorly when applied to more varied data.

“When our environment changes, the acoustics of our emotion expressions change as well. However, as humans, we know that this does not mean that in a new environment we lose the ability to recognize the emotion expressions of others. Yet, this is often the case for our classifiers,” said Prof. Emily Mower Provost, one of the authors of the paper. “Our goal is to create methods that find commonality between emotions expressed in different environments and to use these commonalities to recognize emotion in a more generalizable manner.”

In the featured paper, the authors introduce Adversarial Discriminative Domain Generalization (ADDoG), a new method for finding a more generalized intermediate representation of speech emotion across datasets. Using a “meet in the middle” approach, ADDoG is adversarially trained to iteratively move the representations of different datasets closer to one another. A follow-on method, Multiclass ADDoG (MADDoG), extends this approach to incorporate many datasets at once, building even more robust and generalized representations.

According to Mower Provost, “ADDoG was inspired by domain generalization techniques and a common failing: instability and mode collapse. We created a new technique that iteratively moved representations learned for each dataset closer to one another, improving cross-dataset generalization and significantly improving cross-dataset performance.”

The team plans to conduct further experiments exploring factors of speech variation beyond dataset, such as gender, phoneme, subject, and recording device. They also intend to explore the trade-off between building systems specialized for particular domains and building generalized representations, and to use the techniques they’ve developed to help estimate mood in natural environments.

“Our results show consistent convergence and significant improvements over existing state-of-the-art techniques. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings,” said Mower Provost.

Authors on the paper include former CSE doctoral student John Gideon; Melvin McInnis, Thomas B. and Nancy Upjohn Woodworth Professor of Bipolar Disorder and Depression; and Emily Mower Provost, associate professor and CSE Associate Chair for Graduate Affairs.