Multimodal Emotion Recognition: A Comparative Study
Last modified: 2024-05-15
Abstract
Recent advancements in machine learning have highlighted the importance of integrating different data sources to improve classification performance. By utilizing multiple data representations, a richer understanding of subjects or objects can be achieved. For instance, in the field of emotion recognition, systems that combine multiple sources and/or modalities of information (e.g., voice, text, facial expression, body posture) perform better than those relying solely on a single modality. The challenge lies in fusing distinct types of data, such as image, text, audio, or video, that are not naturally aligned.
Traditional classification algorithms, initially designed for uni-modal datasets, struggle with the complexities presented by multi-modal scenarios. This complexity is exacerbated by the need to align heterogeneous data sources, manage increased dimensionality, and create complementary, non-redundant representations. To tackle these issues, two principal families of approaches have emerged. The first is \texttt{agnostic} to the specific model and focuses on when the fusion occurs and on its nature, i.e., early (feature-level), late (decision-level), or hybrid. The second family of approaches is \texttt{model-dependent} and involves sophisticated techniques such as kernel methods, graphical models, and deep neural networks. These strategies aim to exploit the full potential of multi-modal data, thereby significantly elevating the capabilities of classification models.
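To make the distinction between the fusion schemes above concrete, the following is a minimal sketch of early (feature-level) versus late (decision-level) fusion. It assumes synthetic feature matrices and off-the-shelf scikit-learn logistic-regression classifiers as stand-ins, not the models compared in this study.
\begin{verbatim}
# Minimal sketch (assumption: not this study's pipeline): early vs. late
# fusion of two modalities, using synthetic features and simple
# logistic-regression classifiers as placeholders for real models.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 40))    # hypothetical audio features
X_text = rng.normal(size=(200, 300))    # hypothetical text embeddings
y = rng.integers(0, 4, size=200)        # hypothetical 4 emotion classes

# Early (feature-level) fusion: concatenate modality features,
# train a single model on the joint representation.
early_model = LogisticRegression(max_iter=1000)
early_model.fit(np.hstack([X_audio, X_text]), y)

# Late (decision-level) fusion: train one model per modality and combine
# their decisions, here by averaging predicted class probabilities.
audio_model = LogisticRegression(max_iter=1000).fit(X_audio, y)
text_model = LogisticRegression(max_iter=1000).fit(X_text, y)
late_probs = (audio_model.predict_proba(X_audio) +
              text_model.predict_proba(X_text)) / 2
late_preds = audio_model.classes_[late_probs.argmax(axis=1)]
\end{verbatim}
A hybrid scheme would combine both ideas, for example by feeding the concatenated features together with the per-modality decisions into a final classifier.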
Different techniques exist within each family of approaches. In this work, we highlight the main methods tackling this problem and present a comparative study on several multi-modal datasets. Additionally, we outline prospective directions for future research in this evolving field.
Keywords
Multimodality, Classification, Deep Learning, Emotion Recognition