Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

ABSTRACT With the rapid development of social networks, more and more people express their emotions and opinions via online videos. However, most of the current research on multimodal sentiment analysis cannot do well with effective emotional fusion in multimodal data. To deal with the problem, we propose a multi-tensor fusion network with cross-modal modeling for multimodal sentiment analysis. In this study, the multimodal feature extraction with cross-modal modeling is utilized to obtain the relationship of emotional information between multiple modalities. Moreover, the multi-tensor fusion network is used to model the interaction of multiple pairs of bimodal and realize the emotional prediction of multimodal features. The proposed approach performs well in regression and different dimensions of classification experiments on the two public datasets CMU-MOSI and CMU-MOSEI.


Introduction
With the popularity of social media, a short video has gradually become a mainstream form to convey personal feelings and opinions. These videos contain abundant image, audio, and text features, thus vividly conveying user sentiments. The information between the various modalities can be used to solve the problem of the scarcity of short text features through mutual auxiliary reference, and more accurate sentiment prediction results can be obtained. For example, the language content of the speaker may be ambiguous, but more precise sentiments can be acquired by combining facial expressions and voice tone, which can reduce the ambiguity caused by a single modality. Compared with unimodal emotion analysis, multimodal sentiment analysis can better perceive human affections through the combination of intonation, gesture, and micro-expression Liu and Cheng (2018) Patel, Hong, and Zhao (2016).
To model and integrate the multimodal features efficiently, most researches focus on some multimodal fusion approaches, which can be divided into fusion approaches based on concatenation features and fusion approaches on non-concatenation features. The multimodal features are spliced end to end to obtain the fusion result in concatenation-based approaches, but the simple multimodal concatenated features may cause the loss of the dynamic correlation information in sentiment analysis Kumar and Vepa (2020). Nonconcatenation-based approaches can combine dynamic multimodal features and perform decision fusion in sentiment classification Chaturvedi et al. (2019). Some researchers Zadeh et al. (2017) proposed a tensor fusion network that can capture multimodal feature associations in the form of tensors and store modal dynamic information to achieve feature fusion between different modalities. However, traditional tensor fusion networks have the drawback of insufficient feature extraction and poor modal interaction capabilities Sahay et al. (2018).
In order to solve the problem, we propose a multi-tensor fusion network with cross-modal modeling (MTFN-CMM), which can predict more reasonable emotional intensity by enhancing the multi-modal feature fusion by obtaining more dynamic information of cross-modality. Initially, the textual, audio, and visual features are extracted from each utterance in the video separately, and then the context representation of multimodal features can be extracted with cross-modal modeling. Every two modalities are matched as a pair of cross-attention values, and three bimodal cross-attention values are selected to obtain the fine-grained bimodal relationship features. Further, the tensor fusion network with cross-modal modeling makes full use of the dynamic information of intra-modal relational and inter-modal interaction and obtains more accurate emotion prediction. The main contributions of the proposed approach are as follows: • We proposed a multi-tensor fusion network with cross-modal modeling for multimodal sentiment analysis. Cross-modal modeling is used to extract the interaction relationship between the modalities, and then the dynamic information of cross-modality is fully utilized with a multitensor fusion network. The proposed multimodal features fusion approach can retain more cross-correlation information of multimodal and make the emotion prediction more accurate. • Different from other fusion approaches Zadeh et al. (Zadeh, et al., 2018a), Poria et al. (2017), Li et al. (2021) and Xue et al. (2020), the proposed approach focuses on the fusion of the intra-modal and cross-modal features together rather than the concatenation of three unimodal features or the fusion of multiple bimodal features. To the best of our knowledge, the work is the first to represent non-concatenation multi-level crossmodal fusion with multi-tensor fusion network.
• We evaluate the proposed approach with a series of regression and classification experiments on the two public datasets CMU-MOSI and CMU-MOSEI. The experiment results show that our MTFN-HA approach outperforms other baseline approaches for multi-modal sentiment analysis on a series of regression and classification tasks.
The remainder of the paper is organized as follows: Section 2 is a brief introduction of the related work. Section 3 describes multi-tensor fusion network with cross-modal modeling for multimodal sentiment analysis in detail. Section 4 provides our experiment results in comparison to recent representative methods and discussion. Finally, we conclude this paper in Section 5.

Related Work
With the spread of social media and short videos, multimodal sentiment analysis has an important impact on some fields, such as human communication comprehension Zadeh et al. (Zadeh, et al., 2018b), financial and political forecasting Xing, Cambria, and Welsch (2018), Ebrahimi, Yazdavar, and Sheth (2017), community detection Young et al. (2018), dialog systems Majumder et al. (2018), Zadeh et al. (2016), and so on. Unlike unimodal sentiment analysis, multimodal sentiment analysis needs to better perceive human emotions through a variety of ways such as intonation, gestures, and micro-expressions. Because the fusion of multimodal features makes multimodal sentiment analysis more complicated, it is necessary to comprehensively consider the intramodal and intermodal dynamics in sentiment analysis Poria et al. (2016).
Most of the previous works focused on multimodal feature fusion for multimodal sentiment analysis. Concatenation-based feature fusion approaches are simply to concatenate multimodal features at the input section. Some researchers concatenate the multimodal features as an input for the concerned learning model. Zhou et al. (2016) proposed the Long-Short Term Hybrid Memory and Multi-head Attention mechanism to capture the most important semantic features. Kumar et al. Kumar and Vepa (2020) introduced a gated mechanism for attention in the binary classification of emotion tasks to reach higher accuracy. Multi-Head Attention Zadeh et al. (Zadeh, et al., 2018b) and Context-aware Interactive Attention Chauhan et al. (2019) are also applied to improve the sentiment intensity prediction for emotion classification. Besides, Li et al. (2021) introduced a hash feature network structure to concatenate and fuse the multimodal features, and the predicted sentiment label is inferred by calculating the hash value similarity. Although the concatenation-based feature fusion approaches can obtain the relevant features between different modalities with low complexity to a certain extent, it will lead to the loss of the high-dimensional multi-modal relationship information.
Some works on non-concatenated features fusion occur in recent years. Zadeh et al. (Zadeh, et al., 2018a) and Poria et al. (2017) propose different attention mechanisms separately to fuse the instantaneous features of multiple modalities in the perspective of time sequence. Chaturvedi et al. (2019) used deep learning to extract features from each modality and then projected them to a common effective space to speed up the classification performance. Tsai et al. (2019) introduced the multi-modal transformer to reduce the long-range dependencies between elements across modalities and capture the correlated cross-modal signals. In addition, tensor fusion is also a typical non-concatenated features fusion approach, which can quickly build a spatial relationship model from different source features. Zadeh et al. (2017) regarded the problem of multimodal sentiment analysis as modeling intramodal and intermodal dynamics and introduced a novel tensor fusion network that can combine multimodal dynamics information. Motivated by the work of Zadeh et al. (2017), Sahay et al. (2018) presented relational tensor network architecture to model the intermodal interactions for the sequence of segments in a video. The relational tensor network is regarded as a generalization of tensor fusion with multiple Bi-LSTM for multimodalities and an n-fold Cartesian product from modality embedding. These approaches can also fuse different modal features and can retain as much multimodal feature relationship information as possible, but it is easy to cause high-dimensional feature sparsity problems in the cross-modal fusion process.
As current multimodal sentiment analysis approaches are insufficiently prepared for feature extraction and modalities interaction to a certain extent, they might cause the loss of the critical emotional features and decrease the effectiveness of multimodal fusion due to noise interference in some modal features. Our work is related to a variant of the tensor method to enhance the modal feature fusion, which proposed the multi-tensor fusion network with cross-modal modeling to acquire more intramodal relational information and intermodal interaction and affects the inference effect of the final network model for multimodal sentiment analysis.

Multi-tensor Fusion Network with Cross-modal Modeling
The multi-tensor fusion network with cross-modal modeling (MTFN-CMM) for multimodal sentiment analysis consists of three main components: extracting context-independent unimodal features, multimodal feature extraction with cross-modal modeling, and multi-tensor fusion network for the prediction of multimodal emotional intensity. To enhance the cross-modal feature fusion, the Bi-LSTM network and cross-attention mechanism are separately used to capture more intramodal relational information and intermodal interaction, and a multi-level tensor fusion network is utilized to enhance the ability to acquire cross-modal features, improve the inference accuracy of the final result. We describe these components in detail as follows.

Extracting Context-Independent Unimodal Features
The unimodal features (the textual, audio, and visual features) are extracted from each utterance in the video separately. Initially, we use the FACET to extract the user's facial features in the video, the key feature of the face is extracted as a structured feature vector, which can be used to represent the basic facial features of affection of each frame and highlight the affected facial features. For audio, we use the COVAREP to extract low-and medium-level acoustic features. This tool is usually able to extract rich voice features including 12 Mel-Frequency Cepstral Coefficients (MFCCs), voice segment features, peak slope parameters, and maximum dispersion coefficients. For text, we directly use the Glove Pennington, Socher, and Manning (2014) to quickly obtain the word vector feature. Through the feature extraction work of these open-source tools, the unstructured video data can be transformed into structured vector features. Because the extracted feature dimensions are usually different and the timing between multiple modalities is not aligned, it is necessary to perform modal preprocessing timing alignment at the word level by way of manual annotation, and the alignment result can be input into the network framework in the next section. Figure 1 presents the process of multimodal feature extraction with crossmodal modeling. Firstly, we capture the long-distance context representation of the intra-modality related information effectively by Bi-LSTM network Plank, Søgaard, and Goldberg (2016), and then every two modalities can be embedded into a single value matrix of attention mechanism to enhance the salient features in the bimodal interaction and weaken the irrelevant features.

Multimodal Feature Extraction with Cross-modal Modeling
In order to capture the long-distance context representation of intra-modality effectively, we put the extracted features vector of each modality into the Bi-LSTM network to obtain the output state of the last layer and the characteristic state of the hidden layer. Taking the video modal feature sequence ðS 1 ; S 2 ; . . . ; S n Þ as an example, the context representation result output ðO l 2 R n�d Þ can be defined as follows. O l ; H l ¼ Bi À LSTMðS 1 ; S 2 ; . . . ; S n Þ; l 2 fA; V; Tg (1) Figure 1. The process of multi-modal feature extraction with cross-modal modeling.

e2000688-190
Where n is the length of feature sequence, d is the feature dimension, subscript l is the modality of the video, A, V and T are audio, visual and text modalities, respectively, and H l is the hidden layer state output of Bi-LSTM.
Further, we apply the cross-attention mechanism for bimodal embedding and fusion to capture the interaction characteristics of these pairs of modalities. Each bimodal feature is fused by the output features of one modal and a hidden state of another modal through a cross-attention mechanism. Taking visual and text modalities as an example, Cross VT can be obtained by fusing the LSTM output features expressed by visual features and the LSTM hidden state expressed by text features through the cross-attention mechanism to capture the interactive information between two peaks. We capture the interaction characteristics of these two modalities by calculating Cross 0 VT and Cross 0 TV cross-attention value, the calculation method is described as follows.
Where H V and H T are unimodal contexts from text and visual modalities, respectively, representing hidden layer state characteristics.
ffi ffi ffi ffi ffi d k p is the normalized coefficients; O V and O T are unimodal contexts representing output features from text and visual modalities, respectively; Cross VT represents cross-attention mechanism value matrix of visual modal features embedded with text context information. Cross TV represents the value matrix of the cross-attention mechanism for the text modal features embedded in the visual context information. Similarly, we can obtain six groups of cross-attention value matrices (Cross TA ; Cross TV ; Cross AT ; Cross AV ; Cross VT ,and Cross VA ), which can be generated by different bimodal combinations through the cross-attention mechanism. To compress the sparse features and make the bimodal spatial features converge intensively, the six groups of cross-attention value matrices are regarded as an input to a fully connected network block with "Dropout+Linear+ ReLU" respectively. Figure 2 demonstrates an overview of the multi-tensor fusion network with three-peak modalities. In the network framework model, we use multi-tensors to capture the interaction characteristics in three modalities and store the high-dimensional semantic connections across modalities, so that the final regression prediction can get better prediction values.

Multi-tensor Fusion Network with Three-peak Modalities
In order to better model the three-peak modal relationship, we divide six groups of cross-attention value matrices into two groups. To make the dynamic relationship modeling of the three models in tensor space, the cross-modal features of each group need to include three modalities of text, video, and audio at the same time. So, we choose (Cross AV ; Cross TV ; Cross VA ) as a group, and (Cross TA ; Cross VT ; Cross AT ) as the other group, instead of randomly dividing them into two groups.
Moreover, a multi-tensor fusion network can be obtained by combining the feature vectors of different modalities multiple times. The specific tensor is defined as follows. Where tensor fusion can be used for bimodal or multi-modal modes, � represents the outer product between vectors. The multi-tensor fusion network is simple to integrate, and the shared feature subspace is often semantically invariant Xue et al. (2020). A series of tensor fusion of bimodal or multimodal modes enables the network model to transfer feature knowledge from one model to another, and fully capture the dynamic relationship information between the modalities. Before multi-level tensor fusion, the "1" vector should be stitched onto the feature vectors of each modality, so that each modality can be modeled correctly in the end. Tensor bi is used to capture the bimodal effect of two modalities, and Tensor tr is used to capture the three-peak effect of three modalities.
Regarding the two obtained tensors as a new view, the two compressed tensors can be merged again to achieve hierarchical fusion.
After multi-level tensor fusion, a tensor with spatial relations is acquired and we can obtain a multimodal fusion tensor network with high-dimensional relations. The prediction layer uses function Sigmoid for normalization so as to facilitate the control of the output range. The regression prediction method is defined as follows.
where I is the prediction result of affect intensity, W s is the weight and Z is the total tensor. FC is a fully connected neural network used with the weight W s conditioned on the total tensor Z, which contains two layers of dropout and two layers of activation unit ReLu, and be connected to the last prediction layer.
On the basis of the cross-modal modeling, the multi-tensor fusion network with three-peak modalities can dynamically realize multi-modal semantic combination during model training, and make full use of the multi-layer characteristics of deep neural networks.

The Datasets
To evaluate the effectiveness of our proposed approach, we compared MTFN-CMM with other baseline models on CMU- MOSI Zadeh et al. (2016) and CMU- MOSEI Zadeh et al. (2018c), two public benchmark multimodal sentiment analysis datasets. The CMU-MOSI dataset contains YouTube videos from 93 different speakers, which covers 2199 opinions. There are 23.2 opinion clips per video on average. The average length of each opinion clip is about 4.2 seconds and the opinions are expressed totally in 26,295 words. Each opinion fragment correspondingly has an emotional label. Emotional intensity labels range from [−3, 3]. The number of discourse fragments used in the training set, test set, and verification set accounted for 229, 229, and 686, respectively. CMU-MOSEI is an emotion analysis dataset, which has 3,229 videos and 22,676 quotes from more than 1,000 online YouTube speakers. The number of utterances is about ten times as much as that of CMU-MOSI. The training, validation, and test sets consist of 16,216, 1,835, and 4,625 vocalizations, respectively.

Parameter Setting and Evaluation Metrics
We extracted the unimodal features separately on the multimodal data sets. 1) Text feature extraction. We converted the text lines obtained from the SDK into the pre-trained Glove word embedding. Both annotated dataset word embeddings are 300-dimensional vectors. 2) Audio feature extraction. Covarep Degottex et al. (2014) Table 1 shows the hyperparameters in our network framework for the multimodal emotional intensity prediction task.
The regression and different dimensions of classification experiments are performed on the two public data sets CMU-MOSI and CMU-MOSEI. We chose the mean absolute error (MAE), Pearson correlation coefficient (r) as the model evaluation indexes of our regression experiment, and the accuracy A 2 value and F1 score as the model evaluation indexes of two-class classification experiment. Moreover, the accuracy A 5 of five-class sentiment classification and the accuracy A 7 of seven-class sentiment classification are regarded as the model evaluation indexes. If the average absolute error MAE is smaller, the correlation coefficient r is larger in the regression experiments, and F1 and the accuracy A i in different dimensions of classification experiments is higher, the performance of the model represented by the multimodal sentiment analysis is better.

Comparison with Other Baseline Models
We compare the proposed approach on CMU-MOSI and CMU-MOSI datasets with the following baseline model:  Table 2.
We observe that the proposed MTFN-CMM yields better performance against other baseline models for the regression and different dimensions of classification experiments on the two benchmark datasets. The experiment results in bold indicate the corresponding model obtained the best performance on the concerned indicator. For intensity prediction, our proposed MTFN-CMM obtains lesser mean absolute error MAE with high Pearson correlation scores r. In comparison to context-aware interactive attention (CIA), we yield approximately 0.02 and 0.09 points improvement in MAE with higher r on two benchmark datasets, respectively. In addition, our proposed approach achieves approximately 1.5 and 3 percentage higher F1 scores with slightly higher accuracy on two benchmark datasets for two-class sentiment classification. This may be attributed to the use of the cross-modal modeling to strengthen the early bimodal feature extraction work, which enables the multi-tensor fusion network to establish more intra-modal and inter-modal connections. Compared with tensor fusion network (TFN) and relational tensor network (RTN) approaches, our proposed MTFN-CMM has certain advantages in regression and different dimensions of classification experiments. Especially, we observe approximately 3-5 percentage improvement in accuracy values on five-class and seven-class classification experiments. It indicates that MTFN-CMM effectively exploits this interaction for a three-peak modal relationship with multi-tensor fusion and our model can learn more cross-modal interaction among the multi-modalities than TFN and RTN models. We perform a statistical significance test (t-test) on the experimental results, and observe that performance improvement in the proposed MTFN-CMM over other compared baseline models is significant with 95% confidence (i.e., p-value < 0.05).

Coupling Experiments with Multi-tensor Fusion
To further study the effectiveness of cross-modal fusion with multi-tensor network, we perform the coupling effects of multi-tensor fusion network (MTFN) on cross-modal modeling (CMM) with different attention mechanisms, such as self-attention mechanism (SM), multi-head attention mechanism (MM), and cross-attention mechanism (CM). The experimental results are observed in Table 3.
Through the experiments of different attention mechanisms in cross-modal modeling, it can be found that coupling multi-tensor fusion network with cross-modal modeling can basically obtain lesser mean absolute error MAE on two datasets CMU-MOSI and CMU-MOSEI. The multi-head attention mechanism achieving a better coupling experimental effect than the selfattention mechanism. However, compared with MTFN-CMM (SM) and MTFN-CMM (MM), our proposed MTFN-CMM can perform better in regression and classification experiments. This may be because that the cross-modal modeling with a cross-attention mechanism can capture more interactive information within and between the modalities, which is conducive to the multi-tensor fusion of the three-peak cross-modality. Besides, crossmodal modeling with a cross-attention mechanism focuses on the contextual information within cross-modality and facilitates hierarchical multi-tensor network fusion in different modalities. It is speculated that the multi-tensor fusion network may be better able to capture dynamic relationship information with bimodal or even multi-peak interaction from cross-modal modeling with the cross-attention mechanism.

Visualization Analysis of Cross-modal Modeling
In order to further explore the influence of cross-modal modeling feature extraction in MTFN-CMM, we visually analyze the degree of cross-modal bimodal interaction of videos with audio modal noise interference in the process of multi-modal feature fusion. As shown in Figure 3, we randomly intercept and select 10 video segments from adjacent contexts in the video, and display the weight matrix of the crossattention mechanism in the cross-modal modeling process with the help of the heatmaps, where the cross-modal weight matrix is used to measure the bimodal relationship information. Figure 3a-c show the audio-text modal, visual-text modal, and visual-audio modal cross-attention weight matrices of 10 adjacent segments, respectively. By observing the color depth of the crossattention value of the seventh segment, we found that lighter colors appeared in both Figure 3 a and c, and darker colors appeared in Figure 3b. The results show that audio modalities capture opposite emotional features from text and visual modalities. At the same time, video and text capture highly similar emotional features. By comprehensively analyzing the cross-modal modeling feature extraction of the seventh segment information, it can be inferred that the audio modal in the multi-modal feature may have noise interference. Therefore, especially when there are a lot of noise features in the real video, cross-modal modeling feature extraction is very effective in improving the performance of multi-tensor fusion network feature fusion.

Conclusion
We proposed a multi-tensor fusion network with the cross-modal modeling for multimodal sentiment analysis in this study, which can capture intramodal dynamics and inter-modal interactions and can be used for multimodal affective intensity prediction effectively. The performance of the proposed method is superior to the existing advanced methods and shows significant performance improvement on CMU-MOSI and CMU-MOSI datasets. The proposed approach only tries to retain coarse-grained modal fusion, and we plan to do some work on fine-grained modal fusion in the future, so as to further reduce the number of required parameters of the model and improve the accuracy of inference and prediction of emotional intensity Xi, Lu, and Yan (2020), Mittal et al. (2020) and Fs et al. (2020).