ICASSP 2025 Grand Challenge – MEIJU - The 1st Multimodal Emotion and Intent Joint Understanding Challenge

Call for Participation

As human-machine dialogue systems gradually become part of daily life, high emotional intelligence and humanized dialogue technology have become more desirable. Understanding human intents while providing emotional support and feedback is now a hot topic in current research. Traditionally, emotion recognition and intent recognition in dialogue systems are performed separately. However, using these technologies in isolation has certain limitations. Emotion recognition may fail to accurately determine user intents, making it difficult for interaction systems to meet user needs effectively; meanwhile, intent recognition might overlook the user's emotional state, failing to provide emotional value. Therefore, research on joint emotion and intent recognition has significant theoretical and practical value.

Multimodal Emotion and Intent Joint Understanding (MEIJU) aims to decode the semantic information expressed in the multimodal dialogues while inferring the emotions and intents, providing users with a more humanized human-machine interaction experience. Unlike traditional tasks of emotion recognition and intent recognition, the MEIJU task faces unique challenges. Firstly, multimodal dialogues often involve various data forms such as speech, text, and images. Effectively integrating and modeling these different types of information to fully understand user needs and emotional states is a challenge that must be overcome. Secondly, the interaction between emotions and intents is complex. Existing studies indicate that the emotions expressed by speakers convey specific intents, which are responded to in an empathetic manner. Capturing and modeling the complex relationships between emotions and intents in the model is an urgent problem to be addressed. In summary, leveraging the intricate relationship between emotions and intents to achieve effective joint understanding is an important research topic.

Recent challenges have targeted multimodal emotion recognition to develop robust solutions across various scenarios. The MuSe (Multimodal Sentiment Analysis Challenge) 2023 addressed three contemporary problems: Mimicked Emotions, Cross-Cultural Humour, and Personalisation. The EmotiW (Emotion Recognition in the Wild Challenge) 2023 focused on predicting engagement levels and perceived emotional health states in group dialogues within real-world scenarios. The MER (Multimodal Emotion Recognition) 2023 challenge explored effective methods through three tracks: multi-label learning, modality robustness, and semi-supervised learning. For intent recognition, large-scale datasets such as MIntRec and the Behance Intent Discovery dataset have been created, allowing researchers to investigate effective intent recognition methods. Despite the extensive research on emotion and intent recognition, there remains a significant gap in work focusing on the joint recognition of multimodal emotion and intent. Additionally, there is a scarcity of large open datasets annotated with both emotion and intent categories.

To enhance user experience in human-machine interaction and accelerate the development of multimodal emotion and intent recognition tasks, we are launching the ICASSP 2025 Multimodal Emotion and Intent Recognition Challenge.

We have designed two tracks to address the major challenges faced in real life. We warmly welcome researchers from academia and industry to participate and jointly explore reliable solutions for these challenging scenarios.

Track Setting and Evaluation

Track Setting

Track 1: Low-Resource Multimodal Emotion and Intent Recognition. Collecting a large number of samples with emotion and intent labels is very challenging, which poses a significant barrier to the development of multimodal emotion and intent joint recognition. To address this issue, researchers typically focus on methods such as data augmentation to train robust models with limited data. Additionally, a large amount of unlabeled data can be introduced to improve model performance based on semi-supervised or unsupervised learning methods. Therefore, we propose the Low-Resource Multimodal Emotion and Intent Recognition Challenge to encourage participants to improve model performance through data augmentation or the design of semi-supervised methods.

Track 2: Imbalanced Emotion and Intent Recognition. In real life, the probabilities of people expressing different emotions or intents are not equal. For example, in peaceful everyday life, people are more inclined to express neutral or positive emotions (such as happiness or joy), while the probability of expressing negative emotions (such as anger or disgust) is lower. Therefore, the issue of category imbalance becomes a common and challenging scenario in real life. We set up a track focusing on the Category Imbalance problem, encouraging participants to propose interesting algorithms and model structures to improve the recognition performance of systems in situations with imbalanced categories.

Evaluation Metrics

For different tracks, we have chosen to use different evaluation metrics:

Track 1: We choose the weighted F-score, a metric widely used in emotion recognition, as the evaluation metric of emotion and intent recognition.

Track 2: Due to the category imbalance issue in Track 2, we choose to use the micro F-score instead of the weighted as the evaluation metric. The micro F-score treats all samples from all categories as equally important, thereby not being affected by category imbalance. This is particularly useful for evaluating category imbalance issues and can better reflect overall performance. Considering that our task involves joint recognition of emotions and intents, we additionally designed new evaluation metrics for all tracks, Joint Recognition Balance Metrics (JRBM), to balance the performance of both tasks:

where M is the metric of corresponding tracks, weighted F-score or micro F-score. The JRBM is used to save the best model and as the final ranking. In addition, we will report the performances of emotion recognition and intent recognition separately in the final results.

Dataset

To support research on multimodal emotion and intent joint understanding, we previously introduced the MC-EIU dataset [1]. This dataset consists of 4,970 conversational video clips from 3 English and 4 Chinese TV series, including genres such as comedy, drama, family, etc., offering dialogue scenarios closely related to the real world [2]. It comprises annotations for 7 emotions and 9 intents, encompassing textual, acoustic, and visual modalities, in a total of 45,009 English utterances and 11,003 Mandarin utterances. To illustrate the annotation quality of the MC-EIU dataset, we present Fleiss's Kappa coefficients for MC-EIU alongside several related datasets in Table 1. The table illustrates that the κ of MC-EIU is at a high level and is either higher or comparable to that of the other datasets. This provides compelling evidence that the annotations of MC-EIU are reliable and accurate. For more detailed information about the MC-EIU dataset, please refer to the [1].

Track 1: For Track 1, we select several labeled data points from the MC-EIU dataset to construct the category-balanced subset, which will be used for training and validation of model performance. Specifically, we retained all the data from the category with the fewest samples in the MC-EIU dataset and randomly selected an equal amount of data from the categories with a larger number of samples. These combined data form the balanced subset used for this track. Additionally, we provide a large amount of unlabeled data for participants to use, aiming to explore more effective unsupervised or semi-supervised learning strategies. We divide the labeled data into training, validation, and test sets in the ratio of 7:1:2. We use a stratified sampling method to ensure that each set is representative of the overall category distribution. The categories in each set are relatively well-balanced, and the number of samples in different categories does not vary greatly. In addition, the unlabeled data will be included in the training set.

Track 2: In Track 2, we emphasize the issue of category imbalance in real-life scenarios. Therefore, we use the MC-EIU dataset as the dataset for this track. By utilizing this imbalanced dataset, we hope participants can propose effective methods to enhance the attention and recognition capabilities of models toward minority categories. Similarly, we partition the dataset into training, validation, and test sets in the ratio of 7:1:2. The dataset for each track includes 7 emotion labels (happy, surprise, sad, disgust, anger, fear, and neutral) and 8 intent labels (questioning, agreeing, acknowledging, encouraging, consoling, suggesting, wishing, and neutral), and contains both English and Mandarin languages. Each language dataset represents a sub-track. Participants can choose to challenge one of the sub-tracks or multiple sub-tracks simultaneously. All participants in the Challenge are required to download the End User License Agreement (EULA), fill it out, and send it to our email address (after the Challenge is released, we will designate an explicit email address to receive applications from participants) to access the data. In the case of team entries, one application will suffice. We will review each participant's application and contact them promptly. The EULA requires participants to use this dataset for academic research only and not to edit or upload samples to the Internet.

Table 1: Comparison of Fleiss’s Kappa coefficients between MC-EIU and other datasets.

[1] Rui Liu, Haolin Zuo, Zheng Lian, Xiaofen Xing, Bj ̈orn W Schuller, and Haizhou Li. Emotion and intent joint understanding in multimodal conversation: A benchmarking dataset. arXiv preprint arXiv:2407.02751, 2024.
[2] Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, and Haizhou Li. M3ed: Multi-modal multi-scene multi-label emotional dialogue database. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5699–5710, 2022.
[3] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42:335–359, 2008.
[4] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, 2019.
[5] Anuradha Welivita, Yubo Xie, and Pearl Pu. Fine-grained emotion and intent learning in movie dialogues. arXiv preprint arXiv:2012.13624, 2020. 6

Rules

Use of Dataset: The datasets prepared for each track are intended for use only in their respective tracks. Furthermore, during the competition, the validation set provided is strictly for evaluating the performance of models and cannot be used for training purposes.
Result Submission: The annotations for all the test sets we release are invisible to the participants. Participants need to use their trained models to make predictions on the test set and submit the prediction results to CodaLab.
Frequency of Submission: For each track, participants can submit their results no more than 3 times a day.
Attribution of Intellectual Property Rights: All intellectual property (IP) rights remain with the participants. Code shared or submitted during the challenge does not transfer IP rights to the organizers. Participants are encouraged to include a suitable license if they make it publicly available.
Fair Competition Guarantee: To maintain a fair and equitable environment for all participants, we will ensure that no participating team will have any conflict of interest with any of the challenge organizers. Any team involved in a conflict of interest will be disqualified immediately.
Organizer's Interpretation: The organizer reserves the right of final interpretation. In cases of special circumstances, the organizer will coordinate the interpretation.

Paper Submission and Results Announcement

According to the requirements of the ICASSP 2025 Grand Challenge, the challenge organizers will invite the top 5 submissions to submit 2-page papers and present at the ICASSP 2025 conference (accepted papers will be in the ICASSP proceedings, the review process is coordinated by the challenge organizers and the SPGC chairs). All 2-page proceedings papers should be covered by an ICASSP registration and should be presented in person at the conference. The teams that present their work at ICASSP in person are also invited to submit a full paper about their work to OJ-SP. A challenge special session will be held during the ICASSP 2025 conference in Hyderabad, India during 6-11 April 2025. This session will include an overview by the challenge organizers (including the announcement of winners), followed by the paper presentations (oral or poster) of the top-5 participants, followed by a panel or open discussion. In addition, authors may post their preprints on arXiv.org. This does not count as prior publication. If the copyright was transferred to IEEE before posting, include the following statement: “© 20XX IEEE. Personal use of this material is permitted.

ICASSP 2025 Submission Rules for the MEIJU Challenge

Due to the official requirement that each challenge can submit up to five papers for ICASSP oral/poster presentation, the paper submission rules for our challenge are set as follows:

1.We have four Tracks: Track 1 – English, Track 1 – Mandarin, Track 2 – English, and Track 2 – Mandarin. The Top 1 team in each track will be recommended to ICASSP 2025.
2.If the Top 1 teams in multiple tracks are from the same group, we will compare the improvement rates of the first-place team relative to the second-place team. The team with the lower improvement rate will allow the second-place team of that track to be recommended to ICASSP 2025. For example, in this scenario where Teams A, B, and C are involved: A scores 0.35 in Track 1 – English and 0.46 in Track 1 – Mandarin, ranking first in both. B scores 0.30 in Track 1 – English and ranks second, while C scores 0.41 in Track 1 – Mandarin and also ranks second. The improvement rate of A over B is (0.35 – 0.30) / 0.30 = 0.167, and over C is (0.46 – 0.41) / 0.41 = 0.122. A lower improvement rate indicates closer competition between teams, and thus, the second-place team in the track with the lower improvement rate (in this case, C in Track 1 – Mandarin) will be recommended to ICASSP 2025.
3.Without conflicting with the above rules, if a team manages to secure Top 1 in both Track 1 and Track 2, they will be entitled to an additional paper submission, allowing them to submit up to two papers. For example, if Team A secures Top 1 in both Track 1 – English and Track 2 – English, and Teams B and C are the second-place teams in Track 1 – English and Track 2 – English, respectively, after comparison, the track with the lower improvement rate (Track 2 – English) will have C’s paper recommended. Additionally, A's performance in Track 2 – English will also be recommended. Notably, if a team ranks first in three or more tracks, they will still only receive a maximum of two submission slots and must decide which papers to submit to ICASSP 2025.
4.If two teams both satisfy Rule 3, the average improvement rate between the two teams will be compared. If the average improvement rates are equal, the average scores will be compared until a ranking is established. The team ranked first will receive the additional submission slot as per Rule 3.
5.If no teams meet the criteria outlined in Rules 3 and 4, the remaining submission slots will first be offered to the top-performing teams from Rule 2 that relinquished their slots, followed by the second-place team with the smallest gap from the first place.

ICASSP 2025 MEIJU Challenge Results Announcement

We have announced the rankings of this challenge. Teams that are eligible for paper recommendation have a yellow background.

Track 1 - English

No.	Team	Institution	Score	Email
1	wcqy	University of the Chinese Academy of Science	0.4641	2476226021@qq.com
2	MSXF_Audio	Mashang Consumer Finance Co., Ltd.	0.4609	bin.yang02@msxf.com
3	Jasonerrr	Tencent Beijing	0.4488	yxd19991220@163.com
4	GMLAB_SZU	Shenzhen University	0.4337	huhe@gml.ac.cn
5	xiao li	Nanjing University of Science and Technology	0.4177	gc009879@163.com

Track 1 - Mandarin

No.	Team	Institution	Score	Email
1	lxe	Nanjing University of Science and Technology	0.5532	865069924@qq.com
2	Jasonerrr	Tencent Beijing	0.5478	yxd19991220@163.com
3	CIP	Institute of Automation，Chinese Academy of Sciences	0.5468	zhuangwenwen2023@ia.ac.cn
4	a1454889119	Hunan University Of Technology and Business	0.4810	b1454889119@gmail.com
5	MSXF_Audio	Mashang Consumer Finance Co., Ltd.	0.0599	bin.yang02@msxf.com

Track 2 - English

No.	Team	Institution	Score	Email
1	SEU_AIPLab	Southeast University	0.6230	cheng.lu@seu.edu.cn
2	MSXF_Audio	Mashang Consumer Finance Co., Ltd.	0.6015	bin.yang02@msxf.com
3	yihongzhu	Lenovo Research	0.5892	hzyi2000@163.com
4	Soceremite	University of Science and Technology of China	0.5812	yadongliu@mail.ustc.edu.cn
5	Haorr	Northeastern University	0.5780	1756889441@qq.com

Track 2 - Mandarin

No.	Team	Institution	Score	Email
1	SEU_AIPLab	Southeast University	0.7370	cheng.lu@seu.edu.cn
2	honghongWang	Beijing Yuanjian Technology Co., Ltd	0.7356	wanghonghong@fosafer.com
3	CIP	Institute of Automation，Chinese Academy of Sciences	0.7054	zhuangwenwen2023@ia.ac.cn
4	jellyfish	Inspur Information Co., Ltd	0.6826	binqiang2wang@qq.com
5	lxe	Nanjing University of Science and Technology	0.6721	865069924@qq.com