Most research in speech recognition has relied on the availability of labelled data. End-to-end automatic speech recognition (ASR) systems in particular benefit from large quantities of labelled training data: such networks need thousands of hours of transcribed audio to reach acceptable performance, which is not available for the majority of the world's spoken languages.
Self-supervised learning is a subset of unsupervised learning in which, instead of learning from external labels, the model generates its training targets intrinsically from the data itself, by exploiting relations between parts of an object or between different views of it. This paradigm has been very successful in natural language processing and is an active area of research in computer vision. In the context of speech, learning solely from labelled examples also does not resemble the learning process of infants.
Recently, Facebook released wav2vec 2.0, a framework for self-supervised learning of speech representations directly from raw audio. The approach first encodes speech audio via convolutional layers and then masks spans of the resulting latent speech representations, similar to the masked language modelling used in BERT. The latent representations are passed through a Transformer network to build contextualized representations. The model also jointly learns a discrete vocabulary of latent speech representations, which is used to train it with a contrastive loss. After pre-training on unlabelled speech, the model is fine-tuned on labelled data with a Connectionist Temporal Classification (CTC) loss.
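To make the masking step concrete, here is a minimal sketch of span masking over latent frames. The default values (masking probability 0.065, span length 10) follow the wav2vec 2.0 paper; the function name and the NumPy implementation are my own illustration, not Facebook's code.

```python
import numpy as np

def mask_latent_spans(seq_len, mask_prob=0.065, mask_span=10, rng=None):
    """Pick random starting frames with probability mask_prob and mask
    a contiguous span of mask_span frames from each start, mimicking
    wav2vec 2.0's span masking of latent speech representations."""
    rng = rng or np.random.default_rng(0)
    mask = np.zeros(seq_len, dtype=bool)
    starts = rng.random(seq_len) < mask_prob      # candidate span starts
    for i in np.flatnonzero(starts):
        mask[i:i + mask_span] = True              # spans may overlap
    return mask

mask = mask_latent_spans(500)
print(mask.sum(), "of", mask.size, "frames masked")
```

During pre-training, the model must identify the true quantized representation for each masked frame among distractors, which is what the contrastive loss optimizes.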
In the paper, the authors demonstrate state-of-the-art results using just 10 minutes of labelled data after pre-training on 53,000 hours of unlabelled English data. In a follow-up paper on cross-lingual representation learning for speech recognition, the authors show that pre-training a single model on raw speech waveforms from multiple languages, then fine-tuning it on a low-resource language, outperforms monolingual training.
My team and I started pre-training with ~4,000 hours of Hindi audio data, and to test cross-lingual transfer capabilities, we fine-tuned on 40 hours of Gujarati data. There was no Gujarati data in pre-training. When evaluated on the test set, the Word Error Rate (WER) comes out to be 22.5. When we evaluated Google's speech-to-text API on the same test set, the WER was 26.86.
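For readers unfamiliar with the metric: WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal self-contained implementation (libraries like jiwer do the same thing, with more features):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference, computed via
    word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

A WER of 22.5 therefore means roughly one word error for every four to five reference words.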
This shows great promise, considering these results were obtained without a language model during decoding and without any Gujarati data during pre-training; addressing either should only lower the WER further. Since fine-tuning on Gujarati reuses the latent speech representations learnt while pre-training on Hindi, the phonetic similarity between the pre-training and fine-tuning languages also plays a very important role. In a country like India, with its numerous languages, this could be a big leap in the field of speech recognition.
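Decoding without a language model means the CTC output is read off greedily: take the most likely token per frame, collapse repeats, and drop blanks. A sketch of that collapse step (the vocabulary and frame IDs below are made-up illustrations):

```python
def ctc_greedy_decode(frame_ids, blank=0, id2char=None):
    """Greedy CTC decoding with no language model: collapse
    consecutive repeated tokens, then remove blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    if id2char:
        return "".join(id2char[i] for i in out)
    return out

# Hypothetical per-frame argmax IDs; 0 is the CTC blank.
vocab = {1: "a", 2: "b", 3: "c"}
print(ctc_greedy_decode([3, 3, 0, 1, 1, 0, 0, 2], id2char=vocab))  # "cab"
```

Beam-search decoding with a Gujarati language model would rescore these frame-level hypotheses and is the standard way to squeeze out further WER gains.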