New Delhi: Microsoft India announced the availability of the Microsoft Indian language Speech Corpus, which offers speech training and test data for Telugu, Tamil and Gujarati. By releasing the largest publicly available Indian language speech dataset, Microsoft aims to help researchers and academia build Indian language speech recognition for all applications where speech is used.
The Indian language Speech Corpus is provided through the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research intended to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.
Today, digital text, speech and linguistic resources are scarce for many vernacular languages across the world, even though such data is imperative for building large machine learning models. Moreover, the differences in enunciation, accent, diction, and slang across various regions in India are very subtle.
As a result of these complexities, the development of accurate digital tools in Indian languages has been slow. Microsoft is working to address this lack of data and catalyze the development of machine-learning-based models that can help in building systems for low-resource languages.
This, in turn, is meant to support the ecosystem of researchers, academia and tech companies working on Indian language models and to help them meet the needs of Indian users sooner.