As human-machine dialogue systems gradually become part of daily life, dialogue technology with high emotional intelligence and a human-like touch has become increasingly desirable. Understanding human intents while providing emotional support and feedback is a hot topic in current research. Traditionally, emotion recognition and intent recognition in dialogue systems have been performed separately. However, using these technologies in isolation has clear limitations: emotion recognition alone may fail to determine user intents accurately, making it difficult for the interaction system to meet user needs, while intent recognition alone may overlook the user's emotional state and fail to provide emotional value. Therefore, research on joint emotion and intent recognition has significant theoretical and practical value.
Multimodal Emotion and Intent Joint Understanding (MEIJU) aims to decode the semantic information expressed in multimodal dialogues while inferring emotions and intents, providing users with a more humanized human-machine interaction experience. Unlike the traditional tasks of emotion recognition and intent recognition, the MEIJU task faces unique challenges. Firstly, multimodal dialogues often involve diverse data forms such as speech, text, and images. Effectively integrating and modeling these different types of information to fully understand user needs and emotional states is a challenge that must be overcome. Secondly, the interaction between emotions and intents is complex: existing studies indicate that the emotions a speaker expresses convey specific intents, and that listeners respond to them in an empathetic manner. Capturing and modeling this complex relationship between emotions and intents is an urgent problem to be addressed. In summary, leveraging the intricate relationship between emotions and intents to achieve effective joint understanding is an important research topic.
Recent challenges have targeted multimodal emotion recognition to develop robust solutions across various scenarios. The MuSe (Multimodal Sentiment Analysis Challenge) 2023 addressed three contemporary problems: Mimicked Emotions, Cross-Cultural Humour, and Personalisation. The EmotiW (Emotion Recognition in the Wild Challenge) 2023 focused on predicting engagement levels and perceived emotional health states in group dialogues within real-world scenarios. The MER (Multimodal Emotion Recognition) 2023 challenge explored effective methods through three tracks: multi-label learning, modality robustness, and semi-supervised learning. For intent recognition, large-scale datasets such as MIntRec and the Behance Intent Discovery dataset have been created, allowing researchers to investigate effective intent recognition methods. Despite the extensive research on emotion and intent recognition, there remains a significant gap in work focusing on the joint recognition of multimodal emotion and intent. Additionally, there is a scarcity of large open datasets annotated with both emotion and intent categories.
To enhance user experience in human-machine interaction and accelerate the development of multimodal emotion and intent recognition tasks, we are launching the ICASSP 2025 Multimodal Emotion and Intent Recognition Challenge.
We have designed two tracks to address the major challenges faced in real life. We warmly welcome researchers from academia and industry to participate and jointly explore reliable solutions for these challenging scenarios.
Track 1: Low-Resource Multimodal Emotion and Intent Recognition. Collecting a large number of samples with both emotion and intent labels is very challenging, which poses a significant barrier to the development of multimodal emotion and intent joint recognition. To address this issue, researchers typically rely on methods such as data augmentation to train robust models with limited data. Additionally, a large amount of unlabeled data can be leveraged through semi-supervised or unsupervised learning to improve model performance. Therefore, we propose the Low-Resource Multimodal Emotion and Intent Recognition track to encourage participants to improve model performance through data augmentation or the design of semi-supervised methods.
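As a minimal illustration of the semi-supervised direction, the sketch below assumes a hypothetical two-head PyTorch model that predicts emotion and intent from fused multimodal features (it is not the official baseline) and adds confidence-thresholded pseudo-labels from the unlabeled pool to the supervised loss:

    import torch
    import torch.nn.functional as F

    def train_step(model, labeled_batch, unlabeled_batch, optimizer, threshold=0.9):
        """One pseudo-labeling step: supervised loss on labeled data plus a
        cross-entropy term on confidently pseudo-labeled unlabeled data."""
        feats, emo_y, int_y = labeled_batch          # fused multimodal features + labels
        u_feats = unlabeled_batch                    # unlabeled multimodal features

        emo_logits, int_logits = model(feats)        # two heads: emotion and intent
        sup_loss = F.cross_entropy(emo_logits, emo_y) + F.cross_entropy(int_logits, int_y)

        # Generate pseudo-labels without tracking gradients.
        with torch.no_grad():
            u_emo, u_int = model(u_feats)
            emo_conf, emo_pseudo = F.softmax(u_emo, dim=-1).max(dim=-1)
            int_conf, int_pseudo = F.softmax(u_int, dim=-1).max(dim=-1)
        emo_mask, int_mask = emo_conf > threshold, int_conf > threshold

        # Re-run the forward pass with gradients and train only on confident samples.
        u_emo, u_int = model(u_feats)
        unsup_loss = feats.new_zeros(())
        if emo_mask.any():
            unsup_loss = unsup_loss + F.cross_entropy(u_emo[emo_mask], emo_pseudo[emo_mask])
        if int_mask.any():
            unsup_loss = unsup_loss + F.cross_entropy(u_int[int_mask], int_pseudo[int_mask])

        loss = sup_loss + unsup_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()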
Track 2: Imbalanced Emotion and Intent Recognition. In real life, the probabilities of people expressing different emotions or intents are not equal. For example, in peaceful everyday situations, people are more inclined to express neutral or positive emotions (such as happiness or joy), while negative emotions (such as anger or disgust) are expressed far less often. Category imbalance is therefore a common and challenging scenario in practice. We set up a track focusing on the category imbalance problem, encouraging participants to propose novel algorithms and model structures that improve recognition performance when categories are imbalanced.
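As one simple starting point for this track (a sketch rather than the official baseline), the cross-entropy loss can be weighted by inverse class frequency so that minority emotions and intents contribute more to the gradient; the class counts below are placeholders:

    import torch
    import torch.nn as nn

    def inverse_frequency_weights(class_counts):
        """Weight each class by the inverse of its frequency
        (all weights equal 1.0 when the classes are perfectly balanced)."""
        counts = torch.tensor(class_counts, dtype=torch.float)
        return counts.sum() / (len(counts) * counts)

    # Hypothetical counts for the 7 emotion classes in an imbalanced training set.
    emo_counts = [5200, 900, 1100, 400, 800, 300, 7300]
    emo_criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(emo_counts))
    # An analogous weighted criterion can be built for the 8 intent classes.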
For different tracks, we have chosen to use different evaluation metrics:
Track 1: We choose the weighted F-score, a metric widely used in emotion recognition, as the evaluation metric for both emotion and intent recognition.
Track 2: Due to the category imbalance issue in Track 2, we use the micro F-score instead of the weighted F-score as the evaluation metric. The micro F-score treats all samples from all categories as equally important and is thus not affected by category imbalance, which makes it particularly useful for evaluating the imbalanced setting and better at reflecting overall performance. Considering that our task involves joint recognition of emotions and intents, we additionally design a new evaluation metric for all tracks, the Joint Recognition Balance Metric (JRBM), to balance the performance of the two tasks:
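\[
\mathrm{JRBM} = \frac{2 \times M_{\mathrm{emo}} \times M_{\mathrm{int}}}{M_{\mathrm{emo}} + M_{\mathrm{int}}}
\]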
where M_emo and M_int denote the emotion and intent recognition scores under the metric of the corresponding track (weighted F-score for Track 1, micro F-score for Track 2). The JRBM is used both to select the best model and to determine the final ranking. In addition, we will report the performance of emotion recognition and intent recognition separately in the final results.
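For concreteness, this scoring can be sketched with scikit-learn as follows (an illustrative snippet, not the official evaluation script):

    from sklearn.metrics import f1_score

    def jrbm(emo_true, emo_pred, int_true, int_pred, average="weighted"):
        """Harmonic mean of the emotion and intent F-scores.
        Use average="weighted" for Track 1 and average="micro" for Track 2."""
        m_emo = f1_score(emo_true, emo_pred, average=average)
        m_int = f1_score(int_true, int_pred, average=average)
        if m_emo + m_int == 0:
            return 0.0
        return 2 * m_emo * m_int / (m_emo + m_int)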
To support research on multimodal emotion and intent joint understanding, we previously introduced the MC-EIU dataset [1]. This dataset consists of 4,970 conversational video clips from 3 English and 4 Chinese TV series spanning genres such as comedy, drama, and family, offering dialogue scenarios closely related to the real world [2]. It comprises annotations for 7 emotions and 9 intents, covers textual, acoustic, and visual modalities, and contains a total of 45,009 English utterances and 11,003 Mandarin utterances. To illustrate the annotation quality of the MC-EIU dataset, we present Fleiss's kappa coefficients for MC-EIU alongside several related datasets in Table 1. The table shows that the κ of MC-EIU is high and is either higher than or comparable to that of the other datasets, providing compelling evidence that the annotations of MC-EIU are reliable and accurate. For more detailed information about the MC-EIU dataset, please refer to [1].
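For reference, Fleiss's kappa measures chance-corrected agreement among multiple annotators,

\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\]

where \(\bar{P}\) is the mean observed agreement across items and \(\bar{P}_e\) is the agreement expected by chance.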
Track 1: For Track 1, we select a subset of labeled samples from the MC-EIU dataset to construct a category-balanced subset, which is used for training and for validating model performance. Specifically, we retain all the data from the category with the fewest samples in the MC-EIU dataset and randomly select an equal amount of data from the categories with more samples; together, these form the balanced subset used for this track. Additionally, we provide a large amount of unlabeled data for participants to use, aiming to encourage more effective unsupervised or semi-supervised learning strategies. We divide the labeled data into training, validation, and test sets in the ratio of 7:1:2, using stratified sampling to ensure that each set is representative of the overall category distribution. As a result, the categories in each set are relatively well balanced, and the number of samples per category does not vary greatly. The unlabeled data are included in the training set.
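A minimal sketch of such a stratified 7:1:2 split, assuming scikit-learn and stratification on the joint emotion-intent label (the official partitions are distributed with the challenge data), could look as follows:

    from sklearn.model_selection import train_test_split

    def split_7_1_2(sample_ids, emo_labels, int_labels, seed=2025):
        """Stratified 70/10/20 split on the joint (emotion, intent) label."""
        joint = [f"{e}|{i}" for e, i in zip(emo_labels, int_labels)]
        train_ids, rest_ids, _, rest_joint = train_test_split(
            sample_ids, joint, test_size=0.3, stratify=joint, random_state=seed)
        val_ids, test_ids = train_test_split(
            rest_ids, test_size=2 / 3, stratify=rest_joint, random_state=seed)
        return train_ids, val_ids, test_ids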
Track 2: In Track 2, we emphasize the issue of category imbalance in real-life scenarios and therefore use the MC-EIU dataset directly for this track. By working with this imbalanced dataset, we hope participants will propose effective methods to enhance models' attention to and recognition of minority categories. As in Track 1, we partition the dataset into training, validation, and test sets in the ratio of 7:1:2. The dataset for each track includes 7 emotion labels (happy, surprise, sad, disgust, anger, fear, and neutral) and 8 intent labels (questioning, agreeing, acknowledging, encouraging, consoling, suggesting, wishing, and neutral), and covers both English and Mandarin; each language constitutes a sub-track. Participants may take on one sub-track or multiple sub-tracks simultaneously.

All participants in the Challenge are required to download the End User License Agreement (EULA), fill it out, and send it to our email address (after the Challenge is released, we will designate an explicit email address to receive applications from participants) to access the data. For team entries, one application suffices. We will review each application and contact participants promptly. The EULA requires participants to use this dataset for academic research only and not to edit the samples or upload them to the Internet.
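For reference, the label sets above can be mapped to integer indices as in the following sketch (the ordering is illustrative and may not match the official baseline):

    EMOTIONS = ["happy", "surprise", "sad", "disgust", "anger", "fear", "neutral"]
    INTENTS = ["questioning", "agreeing", "acknowledging", "encouraging",
               "consoling", "suggesting", "wishing", "neutral"]

    EMO2ID = {name: idx for idx, name in enumerate(EMOTIONS)}
    INT2ID = {name: idx for idx, name in enumerate(INTENTS)}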
[1] Rui Liu, Haolin Zuo, Zheng Lian, Xiaofen Xing, Björn W. Schuller, and Haizhou Li. Emotion and intent joint understanding in multimodal conversation: A benchmarking dataset. arXiv preprint arXiv:2407.02751, 2024.
[2] Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, and Haizhou Li. M3ED: Multi-modal multi-scene multi-label emotional dialogue database. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5699–5710, 2022.
[3] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 2008.
[4] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, 2019.
[5] Anuradha Welivita, Yubo Xie, and Pearl Pu. Fine-grained emotion and intent learning in movie dialogues. arXiv preprint arXiv:2012.13624, 2020.
The rankings of our challenge are based on the CodaLab leaderboard, so entrants need to register on CodaLab using the email address provided in the EULA (or the email address from which the EULA is sent) in order to upload their results and view the rankings. We will provide the CodaLab link when the challenge is released.
User License Agreement Link: https://ai-s2-lab.github.io/MEIJU2025-website/static/MC-EIU_Agreement.zip
Baseline code: https://github.com/AI-S2-Lab/MEIJU2025-baseline
Contact email: liurui_imu@163.com, zuohaolin_0613@163.com
Testing dataset link:
According to the requirements of the ICASSP 2025 Grand Challenge, the challenge organizers will invite the top 5 submissions to submit 2-page papers and present them at the ICASSP 2025 conference (accepted papers will appear in the ICASSP proceedings; the review process is coordinated by the challenge organizers and the SPGC chairs). All 2-page proceedings papers must be covered by an ICASSP registration and presented in person at the conference. Teams that present their work at ICASSP in person are also invited to submit a full paper about their work to OJ-SP.

A challenge special session will be held during the ICASSP 2025 conference in Hyderabad, India, during 6-11 April 2025. This session will include an overview by the challenge organizers (including the announcement of winners), followed by paper presentations (oral or poster) from the top-5 participants, and then a panel or open discussion.

In addition, authors may post their preprints on arXiv.org; this does not count as prior publication. If the copyright was transferred to IEEE before posting, include the following statement: “© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.”
Due to the official requirement that each challenge can submit up to five papers for ICASSP oral/poster presentation, the paper submission rules for our challenge are set as follows:
August 26th, 2024: Registration opens, and release of the training set, validation set, and code.
October 20th, 2024: Release of test set.
November 9th, 2024: Deadline for submitting results for both tasks.
November 23rd, 2024: Results announced.
December 9th, 2024: ICASSP 2025 grand challenge 2-page paper deadline (top 5 teams only).
December 30th, 2024: ICASSP 2025 grand challenge 2-page paper acceptance notification.
January 13th, 2025: ICASSP 2025 grand challenge 2-page camera-ready deadline.
Inner Mongolia University
South China University of Technology
Institute of Automation, Chinese Academy of Sciences (CASIA)
The Chinese University of Hong Kong
Imperial College London
Inner Mongolia University
We are pleased to announce that we will be creating a WeChat group to facilitate communication among participants of MEIJU'25. The QR code for the WeChat group is as follows: