Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition

Hai-tao Xu, Jie Zhang, Li-rong Dai

Crying is the main way for babies to communicate with the outside world. Analyzing cries enables not only the identification of the needs and thoughts that babies want to express, but also the prediction of potential diseases. In general, it is much more difficult to recognize special needs and emotions from infant cries than from adult speech, because infant cries do not contain any linguistic information and their emotional expression is not as rich as that of adults. In this work, we focus on the time-frequency characteristics of infant crying signals and propose a vision transformer (ViT) approach for infant cry recognition (ICR) based on differential time-frequency log-Mel spectrogram features. We first calculate the deltas of the log-Mel spectrogram of infant crying sounds over time frames and over frequency bins, respectively. The log-Mels and deltas are then combined into a 3-D feature representation and fed into the ViT model for cry classification. Experimental results on the CRIED database show the superiority of the proposed system over the comparison methods, and that the combination of log-Mels, the time-frame delta and the frequency-bin delta achieves the best performance. The proposed method is further validated on a self-recorded dataset.
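The feature construction described in the abstract can be illustrated with a minimal sketch. It assumes the log-Mel spectrogram is already available as a 2-D array of shape (Mel bins, time frames) and approximates each delta with NumPy's finite-difference gradient. The abstract does not specify the exact delta formula, so this choice, the function name `stack_logmel_with_deltas`, and the channel ordering are assumptions for illustration and not necessarily what the authors used.

```python
import numpy as np


def stack_logmel_with_deltas(log_mel: np.ndarray) -> np.ndarray:
    """Stack a log-Mel spectrogram with its time and frequency deltas.

    log_mel: array of shape (n_mels, n_frames).
    Returns an array of shape (3, n_mels, n_frames) with channels
    [log-Mel, time-frame delta, frequency-bin delta].

    Assumption: deltas are approximated by np.gradient (central
    differences inside, one-sided differences at the edges).
    """
    if log_mel.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_mels, n_frames)")
    delta_time = np.gradient(log_mel, axis=1)  # difference across time frames
    delta_freq = np.gradient(log_mel, axis=0)  # difference across Mel bins
    return np.stack([log_mel, delta_time, delta_freq], axis=0)


if __name__ == "__main__":
    # Hypothetical input: 64 Mel bins, 100 frames of random values.
    rng = np.random.default_rng(0)
    features = stack_logmel_with_deltas(rng.standard_normal((64, 100)))
    print(features.shape)
```

The resulting three-channel array has the same layout as a channels-first image, which is what allows it to be passed to an image model such as a ViT after any resizing the model requires.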

doi: 10.21437/Interspeech.2022-18

Cite as: Xu, H.-t., Zhang, J., Dai, L.-r. (2022) Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition. Proc. Interspeech 2022, 1963-1967, doi: 10.21437/Interspeech.2022-18

@inproceedings{xu22_interspeech,
  author={Hai-tao Xu and Jie Zhang and Li-rong Dai},
  title={{Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={1963--1967},
  doi={10.21437/Interspeech.2022-18}
}