Motion history images for online speaker/signer diarization
Gebre, B. G., Wittenburg, P., Heskes, T., & Drude, S.
Motion history images for online speaker/signer diarization. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
(pp. 1537-1541). Piscataway, NJ: IEEE.
We present a solution to the problem of online speaker/signer diarization - the task of determining "who spoke/signed when?". Our solution is based on the idea that gestural activity (hands and body movement) is highly correlated with uttering activity. This correlation is necessarily true for sign languages and mostly true for spoken languages. The novel part of our solution is the use of motion history images (MHI) as a likelihood measure for probabilistically detecting uttering activities. MHI is an efficient representation of where and how motion occurred for a fixed period of time. We conducted experiments on 4.9 hours of a publicly available dataset (the AMI meeting data) and 1.4 hours of sign language dataset (Kata Kolok data). The best performance obtained is 15.70% for sign language and 31.90% for spoken language (measurements are in DER). These results show that our solution is applicable in real-world applications like video conferences.