Neuromorphic Multi-modal Speech Enhancement Framework using Shift Invariant Deep Learning Architectures
Conference Paper
Abstract
Background noise, interference from other speakers, and reverberation impair the target speech in a real-time indoor environment. An audio-visual speech enhancement system can simultaneously exploit the acoustic and visual modalities to localize the target speaker and apply source separation and beamforming algorithms to enhance the target speech. However, such an audio-visual speech enhancement strategy may not be fully effective owing to motion of the target speaker and of the interferers within the field of view of the sensors. A joint voice activity detection and target localization framework based on deep learning is an effective strategy to combat background interference and reverberation. In dynamic environments, however, even such a framework must account for variations in the amplitude and directions-of-arrival of the incoming signal in order to compensate for the background interference and reverberation. The framework must accurately infer speech presence or absence, and estimate the direction-of-arrival of the target speech, from samples of short observation duration. Deep learning algorithms that incorporate transformation invariance, such as invariance to rotation, shift, and amplitude, offer attractive solutions for the simultaneous estimation of speech presence and its associated direction-of-arrival. The subsequent speech enhancement module can benefit greatly from such a neuromorphic multi-modal joint voice activity detection and localization framework: deep-learning-aided beam steering may be performed dynamically towards the estimated target source direction, and an adaptive de-reverberation algorithm may be applied to suppress the room reverberation.
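
To make the joint voice activity detection and localization idea concrete, the following is a minimal sketch, not the authors' implementation, of a shift-invariant convolutional network that jointly predicts speech presence and a discretised direction-of-arrival from short multi-channel observations. The input shape, layer sizes, number of microphones, and the choice of 36 azimuth sectors are illustrative assumptions; the abstract does not specify the actual architecture or features.

```python
# Hypothetical sketch: shift-invariant joint VAD + DOA network (not the paper's model).
# Assumed input: (batch, n_mics, freq_bins, time_frames) feature maps, e.g.
# multi-channel log-spectrograms; DOA is discretised into n_doa_classes sectors.
import torch
import torch.nn as nn

class JointVadDoaNet(nn.Module):
    def __init__(self, n_mics: int = 4, n_doa_classes: int = 36):
        super().__init__()
        # Convolution and pooling provide (approximate) shift invariance along
        # time and frequency; global average pooling removes the remaining
        # dependence on the absolute position of the pattern in the feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(n_mics, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),   # -> (batch, 64, 1, 1)
            nn.Flatten(),              # -> (batch, 64)
        )
        self.vad_head = nn.Linear(64, 1)              # speech present / absent
        self.doa_head = nn.Linear(64, n_doa_classes)  # azimuth sector logits

    def forward(self, x):
        h = self.backbone(x)
        return torch.sigmoid(self.vad_head(h)), self.doa_head(h)

# Example forward pass on one short 100-frame observation from a 4-mic array.
net = JointVadDoaNet()
features = torch.randn(1, 4, 64, 100)
p_speech, doa_logits = net(features)
doa_estimate_deg = 10.0 * doa_logits.argmax(dim=-1).item()  # 36 sectors of 10 degrees
```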
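The beam steering step can be illustrated with a simple delay-and-sum beamformer pointed at the estimated direction-of-arrival; this is only an assumed stand-in, since the abstract does not name the beamforming or de-reverberation algorithms actually used. The uniform linear array geometry, microphone spacing, and speed of sound below are assumptions, and the adaptive de-reverberation stage is omitted.

```python
# Hypothetical sketch: steer a delay-and-sum beamformer towards an estimated DOA.
# Assumes a uniform linear array with spacing d, far-field plane waves, c = 343 m/s,
# and that mic 0 is reached first; the sign convention follows x_i(t) = s(t - tau_i).
import numpy as np

def delay_and_sum(x, fs, doa_deg, d=0.05, c=343.0):
    """Steer an M-channel signal x of shape (M, N) towards azimuth doa_deg."""
    m, n = x.shape
    freqs = np.fft.rfftfreq(n, 1.0 / fs)                          # (n_freq,)
    # Per-microphone propagation delay for a plane wave arriving from doa_deg.
    delays = np.arange(m) * d * np.cos(np.deg2rad(doa_deg)) / c   # (M,)
    X = np.fft.rfft(x, axis=1)                                    # (M, n_freq)
    # Phase-align each channel to the steering direction, then average.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((X * steering).mean(axis=0), n=n)

# Example: steer a 4-mic recording towards a DOA estimate of 40 degrees.
fs = 16000
mics = np.random.randn(4, fs)   # 1 s of 4-channel audio (placeholder data)
enhanced = delay_and_sum(mics, fs, doa_deg=40.0)
```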