Abstract
The ability to perform automatic recognition of human gender is important for a
number of systems that process or exploit human-source information. The outcome
of an Automatic Gender Recognition (AGR) system can be used for improving
intelligibility of man-machine interactions, annotating video files or reducing the
search space in subject recognition or surveillance systems. In previous studies,
AGR systems were typically based on only one modality (audio or vision),
and their robustness in real-world scenarios was seldom considered. However, in
many typical applications, both audio and visual signals are available. Ideally,
an AGR system should be able to exploit both modalities to improve the overall
robustness. In this work, we develop a multi-modal AGR system based on audio
and visual cues and evaluate it thoroughly in realistic scenarios. First,
in the framework of two uni-modal AGR systems, we analyze the robustness of different
audio features (pitch frequency, formant and cepstral representations) and
visual features (eigenfaces, fisherfaces) under varying conditions. Then, we build
an integrated audio-visual system by fusing information from each modality at the
classifier level. Additionally, we evaluate how the quality of the training data
affects the performance of the system. We conducted the AGR studies on the
BANCA database. In the framework of the uni-modal AGR systems, we show that:
(a) the audio-based system is more robust than the vision-based system and its
resilience to noisy conditions is increased by modelling only voiced speech frames;
(b) for audio, the cepstral features are superior to the pitch frequency and
formant features, and for vision, fisherfaces outperform eigenfaces;
(c) for the cepstral features, modelling of higher spectral details and the use of both
static and delta coefficients make the system robust towards noisy conditions. The
integration of audio and visual cues yields a robust system that preserves the performance
of the best modality in clean conditions and helps improve performance
in noisy conditions. Finally, multi-conditional training (clean+noisy data) helps
improve the performance of the visual features and, consequently, the recognition
rate of the audio-visual AGR system.
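
To illustrate what fusing the two modalities at the classifier level can look like, the following is a minimal sketch only. The abstract does not state which fusion rule the system uses, so the weighted sum of per-sample scores, the function names (fuse_scores, decide), the audio_weight parameter and the sign convention (positive scores favour "male") are all assumptions made for illustration.

```python
import numpy as np


def fuse_scores(audio_llr, visual_llr, audio_weight=0.5):
    """Combine per-sample scores from two classifiers by a weighted sum.

    Assumption: each input holds log-likelihood ratios produced by a
    uni-modal classifier, with positive values favouring "male".
    The weighted-sum rule is illustrative, not the paper's stated method.
    """
    audio_llr = np.asarray(audio_llr, dtype=float)
    visual_llr = np.asarray(visual_llr, dtype=float)
    if audio_llr.shape != visual_llr.shape:
        raise ValueError("audio and visual scores must have the same shape")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    # The visual stream receives the remaining weight.
    return audio_weight * audio_llr + (1.0 - audio_weight) * visual_llr


def decide(fused_scores, threshold=0.0):
    """Map fused scores to labels using a hypothetical decision threshold."""
    return np.where(np.asarray(fused_scores) > threshold, "male", "female")


# Made-up scores for two samples, for demonstration only.
fused = fuse_scores([2.0, -1.0], [-1.0, -3.0], audio_weight=0.5)
labels = decide(fused)
```

In a sketch like this, audio_weight would be tuned on held-out data, for example giving the audio stream more weight when the visual input is degraded.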