Thank you for exploring Project Vaani. Launched in 2022 by IISc/ARTPARK and Google, Project Vaani aims to create an open-source, multi-modal dataset that truly represents India's linguistic diversity. The goal is to collect over 775,000 images and 150,000 hours of speech and text data from 1 million people across all 773 districts, capturing diversity in languages, dialects, and demographics. The geo-centric approach, rather than a language-centric one, makes it possible to capture dialects and languages spoken in remote areas, though it also makes the project extremely operationally intensive and challenging.
Of this, Phase 1, covering 80 districts, has been open-sourced so far.
Dataset: https://huggingface.co/datasets/ARTPARK-IISc/Vaani