Common Voice: A Massively-Multilingual Speech Corpus

Published in LREC 2020

The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. It is designed for Automatic Speech Recognition but is also useful in other domains (e.g. language identification). The project employs crowdsourcing for both data collection and validation. The release includes 29 languages (38 were collecting data as of November 2019), with more than 50,000 participants and 2,500 hours of audio. We present speech recognition experiments using Mozilla’s DeepSpeech toolkit; by applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages.
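
For readers unfamiliar with the metric, the sketch below shows one common way to compute a Character Error Rate and to summarize per-language improvements as a mean ± standard deviation. It is an illustration only, not the paper's evaluation code: the function names are hypothetical, expressing CER as a percentage of reference characters is an assumption, the choice of sample (rather than population) standard deviation is an assumption, and the numbers in the usage lines are made up.

```python
from statistics import mean, stdev


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate as a percentage of reference characters (assumed convention)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def summarize_improvement(baseline, transfer):
    """Mean and sample standard deviation of per-language CER reductions."""
    deltas = [b - t for b, t in zip(baseline, transfer)]
    return mean(deltas), stdev(deltas)


# Made-up CER values for three hypothetical languages, not results from the paper.
avg, spread = summarize_improvement([50.0, 40.0, 30.0], [45.0, 30.0, 27.0])
print(f"average CER improvement: {avg:.2f} ± {spread:.2f}")
```

A positive difference here means the transfer-learning model has a lower error rate than the baseline, which matches how the word "improvement" is used above.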

Recommended citation: Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber. (2020). "Common Voice: A Massively-Multilingual Speech Corpus." LREC 2020.