Title	Title_Tamil	Identifier	Group Creator	Individual Creator	Publisher	Description	Language	Date	License	Available Online	Sources	Application 1	Application 2	Type of Resource	Format	Size	Physical Description	Collection methodology	Additional URLs	Notes
Tamil Common Voice Speech Corpus	தமிழ் பொதுக் குரல் பேச்சு உரைத் தொகுப்பு	OTDC0001	Contributor:Community		Mozilla Foundation	The Common Voice dataset is an open and publicly available resource that can be used to train a wide variety of speech-enabled applications.	tam	2020-12-11	CC0 Public Domain	Yes	sentences are mostly from publications	speech recognition	speech synthesis	speech-text corpus	audio/mpeg|text/tab-separated-values	14 validated hours|648 MB	born digital	crowdsourced using Mozilla Common Voice platform
Wikimedia Corpus	விக்கிப்பீடியா தமிழ் உரைத்தொகுப்பு	OTDC0002	Contributor:Community	Contributor:R. Ashokan		Tamil text corpus compiled from Tamil Wikipedia and related projects by AshokR.	tam	2019-02-19	CC BY-SA 3.0	Yes	Wikimedia projects	text analysis	speech synthesis	text corpus	text/plain	5.9 million words|148 MB	born digital
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)	ஆங்கிலம்-தமிழ் இணை உரைத் தொகுப்பு	OTDC0003		Creator:Loganathan Ramasamy|Creator:Ondřej Bojar|Creator:Zdeněk Žabokrtský	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)	EnTam is a sentence-aligned English-Tamil bilingual corpus collected from publicly available websites for NLP research involving Tamil. Standard processing was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema, and news domains.	tam|eng	2014-10-31	CC BY-SA 3.0	Yes	websites	machine translation	natural language processing	parallel corpus	text/plain	169871 sentences|23.71 MB	born digital	web crawling|NLP processing
AUKBC Tamil Part-of-Speech Corpus	அண்ணாமலைப் பல்கலைக்கழக தமிழ் சொல் வகை உரைத் தொகுப்பு	OTDC0004	Creator:Computational Linguistic Research Group (CLRG)|Creator:AU-KBC Research Centre		AU-KBC Research Centre	Ponniyin Selvan text annotated according to the Bureau of Indian Standards Tagset standard.	tam	2016	AUKBC-TamilPOSCorpus License	No	Ponniyin Selvan novel	natural language processing	language modeling	part-of-speech corpus	text/plain	50876 sentences|414483 words|515283 tokens	reformatted digital
Tamil Dependency Treebank v0.1	தமிழ்மொழி சார்புநிலை மரவங்கி	OTDC0005		Creator:Loganathan Ramasamy|Creator:Zdeněk Žabokrtský	Institute of Formal and Applied Linguistics, Charles University in Prague	TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank.	tam	2011	CC BY-SA 3.0	Yes	news articles	syntactic analysis	language modeling	treebank	application/xml|text/tab-separated-values	600 sentences	born digital	Original text collected from dinamani.com news articles. Manually annotated with the assistance of tools.
Tamil Wikipedia Articles	தமிழ் விக்கிப்பீடியா கட்டுரைகள்	OTDC0006	Contributor:Community	Creator:Gaurav Arora		127K cleaned Wikipedia articles, with a train and test set for benchmarking language models.	tam	2019-12-24	CC BY-SA 4.0	Yes	Wikimedia projects	text analysis	machine learning	text corpus	text/plain	127000 articles	born digital
Project Madurai Tamil works	மதுரைத் திட்ட தமிழ் நூல்கள்	OTDC0007	Contributor:Community		Project Madurai		tam	2021-02-25	Public Domain or with permission from creators	Yes	publications	text analysis	natural language processing	text corpus	text/html|application/pdf|text/plain	777 works	reformatted digital		https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/project_madurai
IndoWordNet	இந்திய சொல் வலை	OTDC0008			Center for Indian Language Technology (CFILT)	IndoWordNet is a linked lexical knowledge base of wordnets of 18 scheduled languages of India.	tam|mal|kan|tel|hin|mul	2014	GNU	Yes		text analysis	natural language processing	wordnet	text/plain	36024 synsets	born digital			The license is not clearly stated; however, aligned projects have released their datasets under a GNU license.
TamilWordNet	தமிழ் சொல் வலை	OTDC0009	Creator:Computational Linguistic Research Group (CLRG)|Creator:AU-KBC Research Centre		AU-KBC Research Centre	Tamil WordNet contains information about nouns, verbs, adjectives and adverbs and is organized around the notion of a synset.	tam	2014	GNU	Yes		text analysis	natural language processing	wordnet	application/sql	50000 words	born digital			The words are stored in the database in transliterated format.
TamilNews-7M	தமிழ் செய்திகள் தொகுப்பு	OTDC0010		Creator:Mu Selvakumar		Collection of Tamil news articles for language modeling tasks.	tam	2019-03-07	CC BY-SA 4.0	Yes	news articles	language modeling	natural language processing	text corpus	text/plain	2 GB	born digital			Continuous list of words and phrases. Articles are not separated individually.
Tamil News Classification Dataset (Tamilmurasu)	தமிழ் செய்திப் பகுப்பாய்வு தரவுத் தொகுப்பு	OTDC0011		Creator:J. Vijayabhaskar		The data was acquired by scraping the publicly available articles published on Tamilmurasu.org, a well-known newspaper in Tamil Nadu, India. The dataset contains articles from 06/01/2011 to 06/01/2020, with categories.	tam	2020-01-09	CC0 Public Domain	Yes	news articles	text classification	topic modeling	text corpus	text/csv	127000 articles	born digital			The dataset is released under a public domain license; however, the original data is not in the public domain.
Noolaham Oral Histories	நூலக வாய்மொழி வரலாறுகள்	OTDC0012				More than 500 oral history recordings of people from many fields, released to the public domain.	tam	2021	CC BY-SA 4.0	Yes	oral histories	speech recognition	natural language processing	speech data corpus	audio/mpeg	500 oral histories|~750 hours	born digital			The audio has not been transcribed.
Tamil Hand Written Characters	தமிழ் கையெழுத்து தனி எழுத்துக்கள்	OTDC0013		Creator:Seeni		A collection of more than 75,000 Tamil handwritten character images.	tam	2018-05-23	CC0 Public Domain	Yes		machine learning	optical character recognition	handwritten images dataset	image/tiff|image/png	~75000 images|183.52 MB	reformatted digital
The EMILLE/CIIL Corpus	எமிலி உரைத் தொகுப்பு	OTDC0014			European Language Resources Association	The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages.	tam|mul	2009-03-06	Non-Commercial Use	No		text analysis	natural language processing	text corpus	text/plain	10.1 million words	born digital
Linguistic Data Consortium for Indian Languages Tamil Text Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் உரைத் தொகுப்பு	OTDC0015			Linguistic Data Consortium for Indian Languages (LDC-IL)	Monolingual written Tamil text corpora compiled by the Central Institute of Indian Languages.	tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No	publications|websites|news articles	text analysis	natural language processing	text corpus	application/xml	10933484 words	reformatted digital
Linguistic Data Consortium for Indian Languages Annotated Tamil Text Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் சொல் வகை குறியீட்டுடான உரைத் தொகுப்பு	OTDC0016			Linguistic Data Consortium for Indian Languages (LDC-IL)	Tamil text corpora annotated according to the Bureau of Indian Standards (BIS) POS tagset.	tam	2013	LDC-IL License (Non Commercial, Not Sharable)	No	publications|websites|news articles	language modeling	natural language processing	part-of-speech corpus		1376857 words	reformatted digital
Linguistic Data Consortium for Indian Languages Tamil Speech Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் தமிழ் பேச்சுத் தொகுப்பு	OTDC0017			Linguistic Data Consortium for Indian Languages (LDC-IL)		tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No		speech recognition	speech synthesis	speech data corpus		143 hours	reformatted digital
Linguistic Data Consortium for Indian Languages Annotated Tamil Speech Corpora	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் குறியீட்டுடான தமிழ் பேச்சு தொகுப்பு	OTDC0018			Linguistic Data Consortium for Indian Languages (LDC-IL)	This Tamil speech recognition database was collected in Tamil Nadu and contains the voices of 450 different native speakers who were selected according to age distribution (16-20, 21-50, 51+), gender, dialectal region, and environment (home, office, and public place). Each speaker recorded a news text in a noisy environment using a recorder with a built-in microphone. The recordings are in stereo, and the extracted channels are also included in the specific files. It includes audio files, text files, and NIST files, saved as .ZIP files. All the speech data are transcribed and labeled at the sentence level.	tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No		speech recognition	speech synthesis	annotated speech-text corpus	audio/wav|text/plain|text/nist	58 hours	reformatted digital			Contact ramamoorthy@ciil.org for more information.
Linguistic Data Consortium for Indian Languages Tamil Pronunciation Dictionaries	இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் தமிழ் உச்சரிப்பு ஒலி அகராதி	OTDC0019			Linguistic Data Consortium for Indian Languages (LDC-IL)		tam	2014	LDC-IL License (Non Commercial, Not Sharable)	No		speech recognition	speech synthesis	speech data corpus		48 hours	born digital
Thirukkural Dataset	திருக்குறள் தரவுத்தொகுப்பு	OTDC0020		Creator:RVK		Thirukkural couplets and explanations as a dataset.	tam	2018-08-10	CC0 Public Domain	Yes		text analysis	text classification	text corpus	text/plain	651.07 KB	born digital
Dinamalar Tamil News Dataset	தினமலர் தமிழ் செய்தித்தாள் தரவுத் தொகுப்பு	OTDC0021		Creator:J. Vijayabhaskar		The data was acquired by scraping the publicly available articles published on Dinamalar.com, a well-known newspaper in Tamil Nadu, India. The dataset contains articles from 2009 to 2019.	tam	2020-01-05	CC0 Public Domain	Yes		text classification	topic modeling	text corpus	text/csv	120k articles|5.15 GB	born digital
Tamil News Dataset	தமிழ் செய்தி தரவுத்தொகுப்பு	OTDC0022		Creator:Gaurav		The data set consists of 6500 news articles and their categories collected from news websites.	tam	2019-12-25	CC BY-SA 4.0	Yes		language modeling	natural language processing	text corpus	text/plain	6500 articles|3 MB	born digital
Tamil Characters (Vowels)	தமிழ் எழுத்துக்கள் (உயிரெழுத்துக்கள்) வகையாக்கம்	OTDC0023				The dataset consists of 3000 images of Tamil characters from அ to ஓ. Each letter has approximately 300 images, each with a different shape.	tam|eng	2019		Yes		character recognition	machine learning	handwritten images dataset		3000 images	born digital
NER Annotated Corpus	என்.இ.ஆர் (NER) சிறுகுறிப்புத் தொகுப்பு	OTDC0024	Contributor:Information Retrieval Society of India	Creator:Sobha Lalitha Devi	AU-KBC Research Centre		tam|mal|ben|eng|hin|mul	2013-12-06	AUKBC-TamilPOSCorpus License	Yes		entity extraction	natural language processing	annotated text corpus			reformatted digital
Sentiment Lexicons for 81 Languages: Sentiment Polarity Lexicons (Positive vs. Negative)	உணர்வு முனைமை சொற்களஞ்சியம் (81 மொழிகள்)	OTDC0025		Creator:Rachael Tatman		This dataset contains both positive and negative sentiment lexicons for 81 languages. The sentiment lexicons in this dataset were generated via graph propagation based on a knowledge graph, a graphical representation of real-world entities and the links between them. The general intuition is that words which are closely linked on a knowledge graph probably have similar sentiment polarities. For this project, sentiments were generated based on English sentiment lexicons.	tam|mul	2017-09-13	GNU General Public License	Yes		natural language processing	machine learning	annotated text corpus	text/csv	1.96 MB	born digital
Crowdsourced high-quality Tamil multi-speaker speech data set	உயர்தர தமிழ் பேச்சு தரவு தொகுப்பு	OTDC0026			European Language Resources Association (ELRA)	This data set contains transcribed high-quality audio of Tamil sentences recorded by volunteers. The data set consists of wave files and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file.	tam|mul	2020	CC BY-SA 4.0	Yes		text analysis	natural language processing	speech data corpus	audio/wav|text/tab-separated-values	~4000 sentences	born digital
MASSIVE Dataset		OTDC0027	Creator:Amazon.com, Inc.			MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.	multilingual	2022	CC BY 4.0	Yes	SLURP: A Spoken Language Understanding Resource Package	natural language understanding	natural language processing	annotated text corpus	application/json	~19,500 unique phrases in 51 languages.	born digital
Microsoft Speech Corpus		OTDC0028	Creator:Microsoft			This dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages.	multilingual	2023-06-20	"this dataset shall not be used for commercial purposes"	Yes		speech synthesis	speech recognition		audio/wav	12.3 GB	born digital