Abstract: To address the low accuracy of single-modal emotion recognition, a speech-text bimodal emotion recognition algorithm based on a Bi-LSTM-CNN model is proposed. The algorithm combines a Bi-LSTM (bi-directional long short-term memory) network operating on word embeddings with a CNN (convolutional neural network) to form a Bi-LSTM-CNN model for text feature extraction; the extracted text features are then fused with acoustic features, and the fused representation serves as the input to a joint CNN model for speech emotion classification. Test results on the IEMOCAP multimodal emotion dataset show that the emotion recognition accuracy reaches 69.51%, at least 6 percentage points higher than the single text-modality model.
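The pipeline described in the abstract can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' actual configuration: all layer sizes, the vocabulary size, the 384-dimensional acoustic feature vector (an openSMILE-style utterance descriptor), the 4-class label set, and the use of a small MLP in place of the paper's joint CNN are illustrative choices.

```python
import torch
import torch.nn as nn

class BiLSTMCNNText(nn.Module):
    """Text branch sketch: word embeddings -> Bi-LSTM -> 1-D CNN -> pooled text feature.
    All dimensions are illustrative assumptions, not values from the paper."""
    def __init__(self, vocab_size=5000, embed_dim=128, hidden=64, n_filters=100, feat_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, n_filters, kernel_size=3, padding=1)
        self.proj = nn.Linear(n_filters, feat_dim)

    def forward(self, token_ids):
        x = self.embed(token_ids)                      # (B, T, embed_dim)
        h, _ = self.bilstm(x)                          # (B, T, 2*hidden)
        c = torch.relu(self.conv(h.transpose(1, 2)))   # (B, n_filters, T)
        pooled = c.max(dim=2).values                   # global max pooling over time
        return self.proj(pooled)                       # (B, feat_dim)

class BimodalEmotionNet(nn.Module):
    """Concatenate text features with precomputed acoustic features and classify.
    An MLP head stands in here for the joint CNN described in the abstract."""
    def __init__(self, acoustic_dim=384, n_classes=4):
        super().__init__()
        self.text = BiLSTMCNNText()
        self.classifier = nn.Sequential(
            nn.Linear(128 + acoustic_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, token_ids, acoustic_feats):
        fused = torch.cat([self.text(token_ids), acoustic_feats], dim=1)
        return self.classifier(fused)

tokens = torch.randint(0, 5000, (2, 20))   # batch of 2 utterances, 20 token ids each
acoustic = torch.randn(2, 384)             # assumed openSMILE-style utterance features
logits = BimodalEmotionNet()(tokens, acoustic)
print(logits.shape)                        # (batch, n_classes)
```

The key design point mirrored from the abstract is early feature-level fusion: the text branch produces a fixed-length vector that is concatenated with the acoustic feature vector before any emotion decision is made.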

Key words: speech emotion recognition, convolutional neural network (CNN), long short-term memory (LSTM), feature fusion