Open Tamil Data Catalogue

1	Title	Title_Tamil	Identifier	Group Creator	Individual Creator	Publisher	Description	Language	Date	License	Available Online	Sources	Application 1	Application 2	Type of Resource	Format	Size	Physical Description	Collection methodology	Additional URLs	Notes

2	Tamil Common Voice Speech Corpus	தமிழ் பொதுக் குரல் பேச்சு உரைத் தொகுப்பு	OTDC0001	Contributor:Community		Mozilla Foundation	The Common Voice dataset is an open and publicly available resource that can be used to train a wide variety of speech-enabled applications.	tam	2020-12-11	CC0 Public Domain	Yes	sentences are mostly from publications	speech recognition	speech synthesis	speech-text corpus	audio/mpeg\|text/tab-separated-values	14 validated hours\|648 MB	born digital	crowdsourced using the Mozilla Common Voice platform
3	Wikimedia Corpus	விக்கிப்பீடியா தமிழ் உரைத்தொகுப்பு	OTDC0002	Contributor:Community	Contributor:R. Ashokan		Tamil text corpus compiled from Tamil Wikipedia and related projects by AshokR.	tam	2019-02-19	CC BY-SA 3.0	Yes	Wikimedia projects	text analysis	speech synthesis	text corpus	text/plain	5.9 million words\|148 MB	born digital
4	EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)	ஆங்கிலம்-தமிழ் இணை உரைத் தொகுப்பு	OTDC0003		Creator:Loganathan Ramasamy\|Creator:Ondřej Bojar\|Creator:Zdeněk Žabokrtský	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)	EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. Standard processing was applied to the raw web data before it was released as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains.	tam\|eng	2014-10-31	CC BY-SA 3.0	Yes	websites	machine translation	natural language processing	parallel corpus	text/plain	169871 sentences\|23.71 MB	born digital	web crawling\|nlp processing
5	AUKBC Tamil Part-of-Speech Corpus	அண்ணாமலைப் பல்கலைக்கழக தமிழ் சொல் வகை உரைத் தொகுப்பு	OTDC0004	Creator:Computational Linguistic Research Group (CLRG)\|Creator:AU-KBC Research Centre		AU-KBC Research Centre	Ponniyin Selvan text annotated according to the Bureau of Indian Standards Tagset standard.	tam	2016	AUKBC-TamilPOSCorpus License	No	Ponniyin Selvan novel	natural language processing	language modeling	part-of-speech corpus	text/plain	50876 sentences\|414483 words\|515283 tokens	reformatted digital
6	Tamil Dependency Treebank v0.1	தமிழ்மொழி சார்புநிலை மரவங்கி	OTDC0005		Creator:Loganathan Ramasamy\|Creator:Zdeněk Žabokrtský	Institute of Formal and Applied Linguistics, Charles University in Prague	TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank.	tam	2011	CC BY-SA 3.0	Yes	news articles	syntactic analysis	language modeling	treebank	application/xml\|text/tab-separated-values	600 sentences	born digital	Original text collected from dinamani.com news articles. Manually annotated with the assistance of tools.
7	Tamil Wikipedia Articles	தமிழ் விக்கிப்பீடியா கட்டுரைகள்	OTDC0006	Contributor:Community	Creator:Gaurav Arora		127K cleaned Wikipedia articles, with train and test sets for benchmarking language models.	tam	2019-12-24	CC BY-SA 4.0	Yes	Wikimedia projects	text analysis	machine learning	text corpus	text/plain	127000 articles	born digital
8	Project Madurai Tamil works	மதுரைத் திட்ட தமிழ் நூல்கள்	OTDC0007	Contributor:Community		Project Madurai		tam	2021-02-25	Public Domain or with permission from creators	Yes	publications	text analysis	natural language processing	text corpus	text/html\|application/pdf\|text/plain	777 works	reformatted digital		https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/project_madurai
9	IndoWordNet	இந்திய சொல் வலை	OTDC0008			Center for Indian Language Technology (CFILT)	IndoWordNet is a linked lexical knowledge base of wordnets of 18 scheduled languages of India	tam\|mal\|kan\|tel\|hin\|mul	2014	GNU	Yes		text analysis	natural language processing	wordnet	text/plain	36024 synsets	born digital			The license is not clearly stated; however, aligned projects have released their dataset as GNU.
10	TamilWordNet	தமிழ் சொல் வலை	OTDC0009	Creator:Computational Linguistic Research Group (CLRG)\|Creator:AU-KBC Research Centre		AU-KBC Research Centre	Tamil WordNet contains information about nouns, verbs, adjectives and adverbs and is organized around the notion of a synset.	tam	2014	GNU	Yes		text analysis	natural language processing	wordnet	application/sql	50000 words	born digital			The words are stored in the database in the transliteration format.
11	TamilNews-7M	தமிழ் செய்திகள் தொகுப்பு	OTDC0010		Creator:Mu Selvakumar		Collection of Tamil news articles for language modelling tasks.	tam	2019-03-07	CC BY-SA 4.0	Yes	news articles	language modeling	natural language processing	text corpus	text/plain	2 GB	born digital			Continuous list of words and phrases. Articles are not separated individually.
12	Tamil News Classification Dataset (Tamilmurasu)	தமிழ் செய்திப் பகுப்பாய்வு தரவுத் தொகுப்பு	OTDC0011		Creator:J. Vijayabhaskar		The data was acquired by scraping the publicly available articles published on Tamilmurasu.org, a well-known newspaper in Tamil Nadu, India. The dataset contains articles from 06/01/2011 to 06/01/2020, with categories.	tam	2020-01-09	CC0 Public Domain	Yes	news articles	text classification	topic modeling	text corpus	text/csv	127000 articles	born digital			The dataset is released under a public domain license; however, the original data is not public domain.
13	Noolaham Oral Histories	நூலக வாய்மொழி வரலாறுகள்	OTDC0012				More than 500 oral history recordings of people from many fields, released to public domain.	tam	2021	CC BY-SA 4.0	Yes	oral histories	speech recognition	natural language processing	speech data corpus	audio/mpeg	500 oral histories\|~750 hours	born digital			The audio has not been transcribed.
14	Tamil Hand Written Characters	தமிழ் கையெழுத்து தனி எழுத்துக்கள்	OTDC0013		Creator:Seeni		A collection of more than 75,000 Tamil handwritten character images.	tam	2018-05-23	CC0 Public Domain	Yes		machine learning	optical character recognition	handwritten images dataset	image/tiff\|image/png	~75000 images\|183.52 MB	reformatted digital
15	The EMILLE/CIIL Corpus	எமிலி உரைத் தொகுப்பு	OTDC0014			European Language Resources Association	The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages.	tam\|mul	2009-03-06	Non-Commercial Use	No		text analysis	natural language processing	text corpus	text/plain	10.1 million words	born digital
16	Linguistic Data Consortium for Indian Languages Tamil Text Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் உரைத் தொகுப்பு	OTDC0015			Linguistic Data Consortium for Indian Languages (LDC-IL)	Monolingual Written Tamil Text Corpora compiled by Central Institute of Indian Languages.	tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No	publications\|websites\|news articles	text analysis	natural language processing	text corpus	application/xml	10933484 words	reformatted digital
17	Linguistic Data Consortium for Indian Languages Annotated Tamil Text Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் சொல் வகை குறியீட்டுடான உரைத் தொகுப்பு	OTDC0016			Linguistic Data Consortium for Indian Languages (LDC-IL)	Tamil Text Corpora annotated according to Bureau Of Indian Standard (BIS) POS tagset standards.	tam	2013	LDC-IL License (Non Commercial, Not Sharable)	No	publications\|websites\|news articles	language modeling	natural language processing	part-of-speech corpus		1376857 words	reformatted digital
18	Linguistic Data Consortium for Indian Languages Tamil Speech Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் தமிழ் பேச்சுத் தொகுப்பு	OTDC0017			Linguistic Data Consortium for Indian Languages (LDC-IL)		tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No		speech recognition	speech synthesis	speech data corpus		143 hours	reformatted digital
19	Linguistic Data Consortium for Indian Languages Annotated Tamil Speech Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் குறியீட்டுடான தமிழ் பேச்சு தொகுப்பு	OTDC0018			Linguistic Data Consortium for Indian Languages (LDC-IL)	This Tamil speech recognition database was collected in Tamil Nadu and contains the voices of 450 different native speakers, selected according to age distribution (16-20, 21-50, 51+), gender, dialectal region and environment (home, office and public place). Each speaker recorded a news text in a noisy environment using a recorder with an inbuilt microphone. The recordings are in stereo, and the extracted channels are also included in the specific files. The database includes audio files, text files and NIST files, saved as .ZIP files. All the speech data are transcribed and labeled at the sentence level.	tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No		speech recognition	speech synthesis	annotated speech-text corpus	audio/wav\|text/plain\|text/nist	58 hours	reformatted digital			Contact ramamoorthy@ciil.org for more info!
20	Linguistic Data Consortium for Indian Languages Tamil Pronunciation Dictionaries	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் தமிழ் உச்சரிப்பு ஒலி அகராதி	OTDC0019			Linguistic Data Consortium for Indian Languages (LDC-IL)		tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No		speech recognition	speech synthesis	speech data corpus		48 hours	born digital
21	Thirukkural Dataset	திருக்குறள் தரவுத்தொகுப்பு	OTDC0020		Creator:RVK		Thirukkural couplets and explanations as a dataset.	tam	2018-08-10	CC0 Public Domain	Yes		text analysis	text classification	text corpus	text/plain	651.07 KB	born digital
22	Dinamalar Tamil News Dataset	தினமலர் தமிழ் செய்தித்தாள் தரவுத் தொகுப்பு	OTDC0021		Creator:J. Vijayabhaskar		The data was acquired by scraping the publicly available articles published on Dinamalar.com, a well-known newspaper in Tamil Nadu, India. The dataset contains articles from 2009 to 2019.	tam	2020-01-05	CC0 Public Domain	Yes		text classification	topic modeling	text corpus	text/csv	120k articles\|5.15 GB	born digital
23	Tamil News Dataset	தமிழ் செய்தி தரவுத்தொகுப்பு	OTDC0022		Creator:Gaurav		The dataset consists of 6500 news articles and their categories, collected from news websites.	tam	2019-12-25	CC BY-SA 4.0	Yes		language modeling	natural language processing	text corpus	text/plain	6500 articles\|3 MB	born digital
24	Tamil Characters (Vowels)	தமிழ் எழுத்துக்கள் (உயிரெழுத்துக்கள்) வகையாக்கம்	OTDC0023				The dataset consists of 3000 images of Tamil characters from அ to ஓ. Each letter has approximately 300 images, each of a different shape.	tam\|eng	2019		Yes		character recognition	machine learning	handwritten images dataset		3000 images	born digital
25	NER Annotated Corpus	என்.இ.ஆர் (NER) சிறுகுறிப்புத் தொகுப்பு	OTDC0024	Contributor:Information Retrieval Society of India	Creator:Sobha Lalitha Devi	AU-KBC Research Centre		tam\|mal\|ben\|eng\|hin\|mul	2013-12-06	AUKBC-TamilPOSCorpus License	Yes		entity extraction	natural language processing	annotated text corpus			reformatted digital
26	Sentiment Lexicons for 81 Languages: Sentiment Polarity Lexicons (Positive vs. Negative)	உணர்வு முனைமை சொற்களஞ்சியம் (81 மொழிகள்)	OTDC0025		Creator:Rachael Tatman		This dataset contains both positive and negative sentiment lexicons for 81 languages. The sentiment lexicons in this dataset were generated via graph propagation based on a knowledge graph--a graphical representation of real-world entities and the links between them. The general intuition is that words which are closely linked on a knowledge graph probably have similar sentiment polarities. For this project, sentiments were generated based on English sentiment lexicons.	tam\|mul	2017-09-13	GNU General Public License	Yes		natural language processing	machine learning	annotated text corpus	text/csv	1.96 MB	born digital
27	Crowdsourced high-quality Tamil multi-speaker speech data set	உயர்தர தமிழ் பேச்சு தரவு தொகுப்பு	OTDC0026			European Language Resources Association (ELRA)	This data set contains transcribed high-quality audio of Tamil sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file.	tam\|mul	2020	CC BY-SA 4.0	Yes		text analysis	natural language processing	speech data corpus	audio/wav\|text/tab-separated-values	~4000 sentences	born digital
28	MASSIVE Dataset		OTDC0027	Creator:Amazon.com, Inc.			MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.	multilingual	2022	CC BY 4.0	Yes	SLURP: A Spoken Language Understanding Resource Package	natural language understanding	natural language processing	annotated text corpus	application/json	~19,500 unique phrases in 51 languages.	born digital
29	Microsoft Speech Corpus		OTDC0028	Creator:Microsoft			This dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages.	multilingual	2023-06-20	"This dataset shall not be used for commercial purposes"	Yes		speech synthesis	speech recognition		audio/wav	12.3 GB	born digital