Josh Talks - Dataset Catalogue - Website

2	Data Type	Languages	Description	Samples
3	Natural Conversational Voice Dataset	Hindi and Indian English	10,885+ hours of Hindi and 10,392+ hours of Indian English conversational voice data. Natural two-person conversations recorded in each language. Each sample captures genuine interactions covering a range of topics and emotions typical of everyday conversations. 1,000+ hours are transcribed for each language.	Hindi Audio	Hindi Sample Transcript
4	Natural Conversational Voice Dataset	Hindi and Indian English		Indian English Audio	English Sample Transcript
5	Natural Conversational Voice Dataset	Telugu, Malayalam, Bengali, Punjabi, Tamil, Marathi	10,000+ hours of Telugu, Malayalam, Bengali, Punjabi, Tamil and Marathi. Natural two-person conversations recorded in each language. Each sample captures genuine interactions covering a range of topics and emotions typical of everyday conversations. Transcription not readily available.	Telugu Audio	Telugu Sample Transcript
6				Malayalam Audio	Malayalam Sample Transcript
7				Bengali Audio	Bengali Sample Transcript
8				Punjabi Audio	Punjabi Sample Transcript
9	Natural Conversational Voice Dataset - Low Resource Languages	Bodo, Maithili, Bhojpuri and Haryanvi	10,000+ hours for each language. Natural two-person conversations recorded in each language. Each sample captures genuine interactions covering a range of topics and emotions typical of everyday conversations. Transcription not readily available.	Bodo Audio	Bodo Sample Transcript
10				Maithili Audio	Maithili Sample Transcript
11				Bhojpuri Audio	Bhojpuri Sample Transcript
12				Haryanvi Audio	Haryanvi Sample Transcript
13	Read Speech Hindi	Hindi	This dataset comprises 1.2 million recordings totaling 10,374 hours of Hindi content, based on 124,000 unique paragraphs. With contributions from over 40,000 speakers across India, each 20-30 second recording captures linguistic diversity. (also available in English, Tamil, Telugu, Marathi, Bangla)	Audio	Transcript
14	Natural conversations - 4 person	Hindi, Indian English	550+ hours: Natural, four-person conversational audio recordings where participants were prompted to discuss various topics. The conversations include a mix of Hindi and Indian English. (also available in Tamil, Telugu, Marathi, Bangla)	Audio	Transcript
15	High Mother Tongue Influence, High Emotion Conversations	Hindi	1,069 hours: Indian English. Speakers with a medium or high degree of mother tongue influence were selected. Speakers were given an imaginary situation and asked to be overly emotional. (also available in English, Tamil, Telugu, Marathi, Bangla)	Audio
16	Real World Conversation Indian English (uncontrolled environment)	Indian English	20,542 hours: Natural conversations between 2 people learning to speak English fluently, with 1,000 hours transcribed. The conversations took place in uncontrolled environments, including adverse conditions, and may contain background noise, as is typical of real-world settings. Transcription not readily available.	Audio
17	Unique Environment High Emotion Utterances	Hindi	1,099 hours: Conversational data in Indian English and Hindi. Each recording has a high degree of emotion. Users were given prompts describing situations, such as getting a job promotion, and instructed to be highly emotional. Each utterance is a short conversation ranging from 30 seconds to 3 minutes long. A total of 10 emotions are covered across all conversations. The users were instructed to record in specific environments, such as a busy road, plus 8 other environments. (also available in English, Tamil, Telugu, Marathi, Bangla)	Audio (Environment: washroom with echo, Emotion: excited)
18	Code Switching Conversation Hindi - English	Hindi-English	1,000+ hours: 1,259 unique speakers in unscripted two-person conversations. Each recording includes conversational audio where speakers fluidly switch between Hindi and English, capturing the natural bilingual speech patterns common in multilingual communities. (also available in other languages paired with English)	Audio
19	Josh Talks YouTube Videos Speech Dataset	Hindi	1,534 hours of studio-recorded spontaneous speech covering 8 languages. This audio data is part of the library of Josh Talks content shared on YouTube. (also available in English, Tamil, Telugu, Marathi, Bangla, Malayalam)	Audio	Transcript
20	Voice Assistant Prompts with Phonetic Transcriptions	Hindi	This dataset comprises 100,000 utterances recorded by 200 unique speakers from 68 districts across India, each speaking 500 phonetically rich sentences. (also available in English, Tamil, Telugu, Marathi, Bangla)	Audio	Phonetic Transcript