Neuromorphic Multi-modal Speech Enhancement Framework using Shift Invariant Deep Learning Architectures (Conference Paper)

abstract

  • Background noise, interference from other speakers, and reverberation impair the target speech in a real-time indoor environment. An audio-visual speech enhancement system can exploit the acoustic and visual modalities simultaneously to localize the target speaker and apply source separation and beamforming algorithms to enhance the target speech. However, such a strategy may not be fully effective when the target speaker and the interferers move within the field of view of the sensors. A joint voice activity detection and target localization framework based on deep learning is an effective strategy against background interference and reverberation, but in dynamic environments even such a framework must account for variations in the amplitude and direction-of-arrival of the incoming signal in order to compensate for the background interference and reverberation. The framework needs to infer speech presence or absence accurately and estimate the direction-of-arrival of the target speech from observations of short duration. Deep learning algorithms that incorporate invariance to transformations such as rotation, shift, and amplitude offer attractive solutions for simultaneously estimating speech presence and its associated direction-of-arrival. The subsequent speech enhancement module can benefit substantially from such a neuromorphic multi-modal joint voice activity detection and localization framework: deep learning aided beam steering may be performed dynamically towards the estimated target source direction, and an adaptive de-reverberation algorithm may be applied to suppress the room reverberation.
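
The abstract does not specify an implementation, but the shift-invariant joint estimation it describes can be illustrated with a small convolutional model whose weight sharing and global pooling make its outputs largely insensitive to where the target energy appears in the input. The PyTorch sketch below is illustrative only: the feature type (multi-channel short-time spectral features), the four-microphone input, the 36 angular bins, and the layer sizes are assumptions, not the authors' architecture.

    # Illustrative sketch, not the authors' implementation: a convolutional network
    # whose shared weights and global pooling give approximate shift invariance,
    # producing a joint voice-activity and direction-of-arrival (DOA) estimate
    # from a short multi-channel observation.
    import torch
    import torch.nn as nn

    class JointVadDoaNet(nn.Module):
        """Joint speech-presence and DOA estimation from multi-channel spectral features."""

        def __init__(self, in_channels: int = 4, num_doa_classes: int = 36):
            super().__init__()
            # Convolutional trunk: weight sharing across time/frequency yields
            # invariance to where the target energy appears in the input.
            self.trunk = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),  # global pooling removes positional dependence
            )
            self.vad_head = nn.Linear(64, 1)                 # speech presence logit
            self.doa_head = nn.Linear(64, num_doa_classes)   # logits over angular bins

        def forward(self, x: torch.Tensor):
            # x: (batch, mics, freq_bins, time_frames), e.g. log-magnitude or phase features
            h = self.trunk(x).flatten(1)
            return torch.sigmoid(self.vad_head(h)), self.doa_head(h)

    if __name__ == "__main__":
        net = JointVadDoaNet(in_channels=4, num_doa_classes=36)
        # Short observation: 4-mic array, 257 frequency bins, 20 STFT frames (~0.2 s)
        features = torch.randn(1, 4, 257, 20)
        speech_prob, doa_logits = net(features)
        print(speech_prob.item(), doa_logits.argmax(dim=-1).item())

The two output heads share one trunk, which is what lets a single short observation yield both the speech presence probability and the angular bin whose logit is largest.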
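
Once the framework has estimated a direction-of-arrival, the abstract suggests steering the enhancement module's beamformer toward it. The sketch below shows one conventional way such steering could work, a frequency-domain delay-and-sum beamformer for an assumed uniform linear array; the microphone spacing, sampling rate, and the function name steer_and_sum are assumptions for illustration, and the paper's deep learning aided steering and adaptive de-reverberation are not represented here.

    # Minimal frequency-domain delay-and-sum sketch for an assumed uniform linear
    # array; it only illustrates how an estimated DOA could steer the beamformer
    # of the subsequent enhancement module.
    import numpy as np

    def steer_and_sum(stft: np.ndarray, doa_deg: float, mic_spacing: float = 0.05,
                      fs: int = 16000, c: float = 343.0) -> np.ndarray:
        """Delay-and-sum beamforming toward doa_deg (broadside = 90 degrees).

        stft: complex array of shape (mics, freq_bins, frames).
        Returns the enhanced single-channel STFT of shape (freq_bins, frames).
        """
        n_mics, n_bins, _ = stft.shape
        freqs = np.arange(n_bins) * fs / (2 * (n_bins - 1))  # Hz per bin
        # Per-microphone propagation delays for a plane wave arriving from doa_deg
        delays = np.arange(n_mics) * mic_spacing * np.cos(np.deg2rad(doa_deg)) / c
        # Steering weights re-align the target across microphones before averaging
        weights = np.exp(1j * 2 * np.pi * freqs[None, :] * delays[:, None]) / n_mics
        return np.sum(weights[:, :, None] * stft, axis=0)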

publication date

  • 2023-01-01