Evaluating and Improving Explanation Coherence for Multimodal Emotion Recognition

Abstract. Multimodal Large Language Models (MLLMs) achieve strong performance in Multimodal Emotion Recognition (MER), yet accurate predictions alone do not guarantee genuine emotional understanding. Explainable MER (EMER) extends MER by requiring natural language explanations, but such explanations are often redundant or inconsistent with predicted emotions, revealing deficiencies in emotional reasoning coherence. A key challenge is that coherence remains poorly defined and difficult to optimize systematically. In this work, we propose the Fine-Grained Coherence Evaluator (FG-CE), which operationalizes explanation coherence as a measurable and optimizable signal. FG-CE decomposes explanations into sentence-level units and identifies their functional role, modality grounding, and emotional consistency, producing interpretable coherence scores. Building upon FG-CE, we introduce Coherence-Aware EMER (CA-EMER), a unified optimization framework that integrates FG-CE into reinforcement learning and evaluator-guided rejection sampling, enabling coherence-aware self-improvement without additional human annotations. Experiments demonstrate that CA-EMER improves emotional explanation coherence while preserving emotion recognition performance and generalization.

Contents

Framework Overview
Demos

Framework Overview

Figure 1. Overview of CA-EMER and FG-CE. (a) Major training stages: cold-start SFT, two-phase GRPO, FG-CE–guided rejection sampling, and a final GRPO stage identical to Phase 1, which is omitted for simplicity in the diagram. (b) Computation of the FG-CE score.
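To make panel (b) more concrete, the following is a minimal sketch of how per-sentence labels could be turned into a single coherence score. It is not the FG-CE formula. The label vocabulary, the `SentenceLabel` type, and the aggregation rule (the fraction of sentences that are both modality-grounded and emotionally consistent) are all assumptions made for illustration. The labels themselves are assumed to come from an evaluator that has already split the explanation into sentences.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceLabel:
    """Hypothetical per-sentence annotation; field values are illustrative."""
    role: str         # functional role, e.g. "perception" or "reasoning" (assumed labels)
    modality: str     # modality grounding, e.g. "visual", "audio", or "none" (assumed labels)
    consistent: bool  # True if the sentence agrees with the predicted emotion


def coherence_score(labels: List[SentenceLabel]) -> float:
    """Assumed aggregation: share of sentences that are grounded and consistent."""
    if not labels:
        return 0.0
    good = sum(1 for lab in labels if lab.consistent and lab.modality != "none")
    return good / len(labels)


example = [
    SentenceLabel("perception", "visual", True),
    SentenceLabel("perception", "audio", True),
    SentenceLabel("reasoning", "none", True),     # ungrounded, so not counted
    SentenceLabel("reasoning", "visual", False),  # inconsistent, so not counted
]
score = coherence_score(example)
```

Under this assumed rule, a sentence that is redundant or contradicts the predicted emotion lowers the score, which matches the kinds of incoherence the abstract describes.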

We propose CA-EMER, a training framework that explicitly optimizes explanation coherence in MLLMs for EMER while maintaining reliable emotion recognition. CA-EMER consists of four stages: (1) supervised fine-tuning (SFT) on human-supervised cold-start data, through which the model acquires basic emotion recognition ability and learns to generate fluent explanations, establishing a foundation for coherence optimization; (2) FG-CE–guided Reinforcement Learning (RL), which follows an easy-to-hard strategy, first enforcing format and prediction correctness and then progressively aligning perception, reasoning, and emotion prediction via a coherence reward; (3) FG-CE–guided Rejection Sampling Fine-Tuning (RFT), which filters and reconstructs high-coherence explanations to enable self-improvement in coherent emotional reasoning; and (4) a final RL stage, which reuses the GRPO objective of the initial format-and-accuracy phase in (2) to restore and refine recognition performance. Overall, CA-EMER treats explanation coherence as an explicit optimization objective rather than an implicit byproduct of training. The pipeline is illustrated in Figure 1.
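The filtering half of stage (3) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation. The `Candidate` type, the `select_coherent` function, the `coherence_fn` callable standing in for the evaluator, the `threshold` value, and the requirement that the prediction match a reference label are all hypothetical. The reconstruction step mentioned above is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """One sampled model output (hypothetical structure)."""
    predicted_emotion: str
    explanation: str


def select_coherent(
    candidates: List[Candidate],
    reference_emotion: str,
    coherence_fn: Callable[[Candidate], float],  # stand-in for the evaluator
    threshold: float = 0.8,                      # assumed cut-off, not from the source
) -> Optional[Candidate]:
    """Keep the most coherent candidate that predicts the reference emotion.

    Returns None when no candidate passes both checks, so the sample is
    simply dropped from the fine-tuning set.
    """
    correct = [c for c in candidates if c.predicted_emotion == reference_emotion]
    scored = [(coherence_fn(c), c) for c in correct]
    kept = [(s, c) for s, c in scored if s >= threshold]
    if not kept:
        return None
    return max(kept, key=lambda pair: pair[0])[1]


# Toy usage with a dummy scorer that rewards longer explanations.
samples = [
    Candidate("joy", "a"),
    Candidate("sad", "bb"),
    Candidate("sad", "cccc"),
]
toy_score = lambda c: len(c.explanation) / 4
best = select_coherent(samples, "sad", toy_score, threshold=0.8)
```

Selected candidates would then serve as fine-tuning targets, which is how a loop of this shape can improve coherence without new human-written explanations.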

Demos

We show outputs from HumanOmni-7B trained under our proposed CA-EMER framework, alongside outputs from the official ERV-7B checkpoint.

Demo 01

Ground Truth Emotion: disappointment

CA-EMER-HumanOmni-7B

Prediction: disappointment

Explanation:
In the video, the female character's facial expression appears somewhat distressed, with a slightly downturned mouth and a gaze that seems to be focusing on her thoughts. Her hands are placed on her head, which may indicate that she is feeling headachy or experiencing some kind of pressure. In the audio, the character sighs at the end of the speech, expressing feelings of helplessness and frustration.