A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | AA | AB | AC | AD | AE | AF | AG | AH | AI | AJ | AK | AL | AM | AN | AO | AP | AQ | AR | AS | AT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Title | Title_Tamil | Identifier | Group Creator | Individual Creator | Publisher | Description | Language | Date | License | Available Online | Sources | Application 1 | Application 2 | Type of Resource | Format | Size | Physical Description | Collection methodology | Additional URLs | Notes | ||||||||||||||||||||||||||
2 | Tamil Common Voice Speech Corpus | தமிழ் பொதுக் குரல் பேச்சு உரைத் தொகுப்பு | OTDC0001 | Contributor:Community | Mozilla Foundation | The Common Voice dataset is an open and publicly available resource that can be used to train a wide variety of speech-enabled applications. | tam | 2020-12-11 | CC0 Public Domain | Yes | sentences are mostly from publications | speech recognition | speech synthesis | speech-text corpus | audio/mpeg|text/tab-separated-values | 14 validated hours|648 MB | born digital | crowdsourced using the Mozilla Common Voice platform | |||||||||||||||||||||||||||||
3 | Wikimedia Corpus | விக்கிப்பீடியா தமிழ் உரைத்தொகுப்பு | OTDC0002 | Contributor:Community | Contributor:R. Ashokan | Tamil text corpus compiled from Tamil Wikipedia and related projects by AshokR. | tam | 2019-02-19 | CC BY-SA 3.0 | Yes | Wikimedia projects | text analysis | speech synthesis | text corpus | text/plain | 5.9 million words|148 MB | born digital | ||||||||||||||||||||||||||||||
4 | EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) | ஆங்கிலம்-தமிழ் இணை உரைத் தொகுப்பு | OTDC0003 | Creator:Loganathan Ramasamy|Creator:Ondřej Bojar|Creator:Zdeněk Žabokrtský | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) | EnTam is a sentence-aligned English-Tamil bilingual corpus from some of the publicly available websites that we have collected for NLP research involving Tamil. The standard set of processing has been applied on the raw web data before the data became available as a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus includes texts from the Bible, cinema and news domains. | tam|eng | 2014-10-31 | CC BY-SA 3.0 | Yes | websites | machine translation | natural language processing | parallel corpus | text/plain | 169871 sentences|23.71 MB | born digital | web crawling|nlp processing | |||||||||||||||||||||||||||||
5 | AUKBC Tamil Part-of-Speech Corpus | அண்ணாமலைப் பல்கலைக்கழக தமிழ் சொல் வகை உரைத் தொகுப்பு | OTDC0004 | Creator:Computational Linguistic Research Group (CLRG)|Creator:AU-KBC Research Centre | AU-KBC Research Centre | Ponniyin Selvan text annotated according to the Bureau of Indian Standards Tagset standard. | tam | 2016 | AUKBC-TamilPOSCorpus License | No | Ponniyin Selvan novel | natural language processing | language modeling | part-of-speech corpus | text/plain | 50876 sentences|414483 words|515283 tokens | reformatted digital | ||||||||||||||||||||||||||||||
6 | Tamil Dependency Treebank v0.1 | தமிழ்மொழி சார்புநிலை மரவங்கி | OTDC0005 | Creator:Loganathan Ramasamy|Creator:Zdeněk Žabokrtský | Institute of Formal and Applied Linguistics, Charles University in Prague | TamilTB.v0.1 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of Prague Dependency Treebank | tam | 2011 | CC BY-SA 3.0 | Yes | news articles | syntactic analysis | language modeling | treebank | application/xml|text/tab-separated-values | 600 sentences | born digital | Original text collected from dinamani.com news articles. Manually annotated with the assistance of tools. | |||||||||||||||||||||||||||||
7 | Tamil Wikipedia Articles | தமிழ் விக்கிப்பீடியா கட்டுரைகள் | OTDC0006 | Contributor:Community | Creator:Gaurav Arora | 127K cleaned Wikipedia articles, with a train and test set for benchmarking a language model | tam | 2019-12-24 | CC BY-SA 4.0 | Yes | Wikimedia projects | text analysis | machine learning | text corpus | text/plain | 127000 articles | born digital | ||||||||||||||||||||||||||||||
8 | Project Madurai Tamil works | மதுரைத் திட்ட தமிழ் நூல்கள் | OTDC0007 | Contributor:Community | Project Madurai | tam | 2021-02-25 | Public Domain or with permission from creators | Yes | publications | text analysis | natural language processing | text corpus | text/html|application/pdf|text/plain | 777 works | reformatted digital | https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/project_madurai | ||||||||||||||||||||||||||||||
9 | IndoWordNet | இந்திய சொல் வலை | OTDC0008 | Center for Indian Language Technology (CFILT) | IndoWordNet is a linked lexical knowledge base of wordnets of 18 scheduled languages of India | tam|mal|kan|tel|hin|mul | 2014 | GNU | Yes | text analysis | natural language processing | wordnet | text/plain | 36024 synsets | born digital | The license is not clearly stated; however, aligned projects have released their dataset as GNU. | |||||||||||||||||||||||||||||||
10 | TamilWordNet | தமிழ் சொல் வலை | OTDC0009 | Creator:Computational Linguistic Research Group (CLRG)|Creator:AU-KBC Research Centre | AU-KBC Research Centre | Tamil WordNet contains information about nouns, verbs, adjectives and adverbs and is organized around the notion of a synset. | tam | 2014 | GNU | Yes | text analysis | natural language processing | wordnet | application/sql | 50000 words | born digital | The words are stored in the database in the transliteration format. | ||||||||||||||||||||||||||||||
11 | TamilNews-7M | தமிழ் செய்திகள் தொகுப்பு | OTDC0010 | Creator:Mu Selvakumar | Collection of Tamil news articles for language modelling tasks. | tam | 2019-03-07 | CC BY-SA 4.0 | Yes | news articles | language modeling | natural language processing | text corpus | text/plain | 2GB | born digital | Continuous list of words and phrases. Articles are not separated individually. | ||||||||||||||||||||||||||||||
12 | Tamil News Classification Dataset (Tamilmurasu) | தமிழ் செய்திப் பகுப்பாய்வு தரவுத் தொகுப்பு | OTDC0011 | Creator:J. Vijayabhaskar | The data is acquired by scraping the publicly available articles published on Tamilmurasu.org, which is a well-known newspaper in Tamil Nadu, India. The dataset contains articles from 06/01/2011 - 06/01/2020 with categories. | tam | 2020-01-09 | CC0 Public Domain | Yes | news articles | text classification | topic modeling | text corpus | text/csv | 127000 articles | born digital | The dataset is released under a public domain license; however, the original data is not public domain. | ||||||||||||||||||||||||||||||
13 | Noolaham Oral Histories | நூலக வாய்மொழி வரலாறுகள் | OTDC0012 | More than 500 oral history recordings of people from many fields, released to public domain. | tam | 2021 | CC BY-SA 4.0 | Yes | oral histories | speech recognition | natural language processing | speech data corpus | audio/mpeg | 500 oral histories|~750 hours | born digital | The audio has not been transcribed. | |||||||||||||||||||||||||||||||
14 | Tamil Hand Written Characters | தமிழ் கையெழுத்து தனி எழுத்துக்கள் | OTDC0013 | Creator:Seeni | A collection of more than 75 000 Tamil handwritten character images. | tam | 2018-05-23 | CC0 Public Domain | Yes | machine learning | optical character recognition | handwritten images dataset | image/tiff|image/png | ~75000 images|183.52 MB | reformatted digital | ||||||||||||||||||||||||||||||||
15 | The EMILLE/CIIL Corpus | எமிலி உரைத் தொகுப்பு | OTDC0014 | European Language Resources Association | The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages. | tam|mul | 2009-03-06 | Non-Commercial Use | No | text analysis | natural language processing | text corpus | text/plain | 10.1 million words | born digital | ||||||||||||||||||||||||||||||||
16 | Linguistic Data Consortium for Indian Languages Tamil Text Corpora | இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் உரைத் தொகுப்பு | OTDC0015 | Linguistic Data Consortium for Indian Languages (LDC-IL) | Monolingual Written Tamil Text Corpora compiled by Central Institute of Indian Languages. | tam | 2014 | LDC-IL License (Non Commercial, Not Sharable) | No | publications|websites|news articles | text analysis | natural language processing | text corpus | application/xml | 10933484 words | reformatted digital | |||||||||||||||||||||||||||||||
17 | Linguistic Data Consortium for Indian Languages Annotated Tamil Text Corpora | இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் சொல் வகை குறியீட்டுடான உரைத் தொகுப்பு | OTDC0016 | Linguistic Data Consortium for Indian Languages (LDC-IL) | Tamil Text Corpora annotated according to the Bureau of Indian Standards (BIS) POS tagset standard. | tam | 2013 | LDC-IL License (Non Commercial, Not Sharable) | No | publications|websites|news articles | language modeling | natural language processing | part-of-speech corpus | 1376857 words | reformatted digital | ||||||||||||||||||||||||||||||||
18 | Linguistic Data Consortium for Indian Languages Tamil Speech Corpora | இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் தமிழ் பேச்சுத் தொகுப்பு | OTDC0017 | Linguistic Data Consortium for Indian Languages (LDC-IL) | tam | 2014 | LDC-IL License (Non Commercial, Not Sharable) | No | speech recognition | speech synthesis | speech data corpus | 143 hours | reformatted digital | ||||||||||||||||||||||||||||||||||
19 | Linguistic Data Consortium for Indian Languages Annotated Tamil Speech Corpora | இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் குறியீட்டுடான தமிழ் பேச்சு தொகுப்பு | OTDC0018 | Linguistic Data Consortium for Indian Languages (LDC-IL) | This Tamil Speech Recognition database was collected in Tamil Nadu and contains the voices of 450 different native speakers who were selected according to age distribution (16-20, 21-50, 51+), gender, dialectal region and environment (home, office and public place). Each speaker recorded a news text in a noisy environment through a recorder with an inbuilt microphone. The recordings are in stereo, and the extracted channels are also included in the specific files. It includes audio files, text files and NIST files, which were saved as .ZIP files. All the speech data are transcribed and labeled at the sentence level. | tam | 2014 | LDC-IL License (Non Commercial, Not Sharable) | No | speech recognition | speech synthesis | annotated speech-text corpus | audio/wav|text/plain|text/nist | 58 hours | reformatted digital | Contact ramamoorthy@ciil.org for more information. | |||||||||||||||||||||||||||||||
20 | Linguistic Data Consortium for Indian Languages Tamil Pronunciation Dictionaries | இந்திய மொழிகளுக்கான மொழியியல் தரவு கூட்டமைப்பின் தமிழ் உச்சரிப்பு ஒலி அகராதி | OTDC0019 | Linguistic Data Consortium for Indian Languages (LDC-IL) | tam | 2014 | LDC-IL License (Non Commercial, Not Sharable) | No | speech recognition | speech synthesis | speech data corpus | 48 hours | born digital | ||||||||||||||||||||||||||||||||||
21 | Thirukkural Dataset | திருக்குறள் தரவுத்தொகுப்பு | OTDC0020 | Creator:RVK | Thirukkural couplets and explanations as a dataset. | tam | 2018-08-10 | CC0 Public Domain | Yes | text analysis | text classification | text corpus | text/plain | 651.07 KB | born digital | ||||||||||||||||||||||||||||||||
22 | Dinamalar Tamil News Dataset | தினமலர் தமிழ் செய்தித்தாள் தரவுத் தொகுப்பு | OTDC0021 | Creator:J. Vijayabhaskar | The data is acquired by scraping the publicly available articles published on Dinamalar.com, which is a well-known newspaper in Tamil Nadu, India. The dataset contains articles from 2009 - 2019. | tam | 2020-01-05 | CC0 Public Domain | Yes | text classification | topic modeling | text corpus | text/csv | 120k articles|5.15 GB | born digital | ||||||||||||||||||||||||||||||||
23 | Tamil News Dataset | தமிழ் செய்தி தரவுத்தொகுப்பு | OTDC0022 | Creator:Gaurav | The dataset consists of 6500 news articles and their categories collected from news websites. | tam | 2019-12-25 | CC BY-SA 4.0 | Yes | language modeling | natural language processing | text corpus | text/plain | 6500 articles|3 MB | born digital | ||||||||||||||||||||||||||||||||
24 | Tamil Characters (Vowels) | தமிழ் எழுத்துக்கள் (உயிரெழுத்துக்கள்) வகையாக்கம் | OTDC0023 | The dataset consists of 3000 images of Tamil characters from அ to ஓ. Each letter has approximately 300 images, each of a different shape. | tam|eng | 2019 | Yes | character recognition | machine learning | handwritten images dataset | 3000 images | born digital | |||||||||||||||||||||||||||||||||||
25 | NER Annotated Corpus | என்.இ.ஆர் (NER) சிறுகுறிப்புத் தொகுப்பு | OTDC0024 | Contributor:Information Retrieval Society of India | Creator:Sobha Lalitha Devi | AU-KBC Research Centre | tam|mal|ben|eng|hin|mul | 2013-12-06 | AUKBC-TamilPOSCorpus License | Yes | entity extraction | natural language processing | annotated text corpus | reformatted digital | |||||||||||||||||||||||||||||||||
26 | Sentiment Lexicons for 81 Languages: Sentiment Polarity Lexicons (Positive vs. Negative) | உணர்வு முனைமை சொற்களஞ்சியம் (81 மொழிகள்) | OTDC0025 | Creator:Rachael Tatman | This dataset contains both positive and negative sentiment lexicons for 81 languages. The sentiment lexicons in this dataset were generated via graph propagation based on a knowledge graph--a graphical representation of real-world entities and the links between them. The general intuition is that words which are closely linked on a knowledge graph probably have similar sentiment polarities. For this project, sentiments were generated based on English sentiment lexicons. | tam|mul | 2017-09-13 | GNU General Public License | Yes | natural language processing | machine learning | annotated text corpus | text/csv | 1.96 MB | born digital | ||||||||||||||||||||||||||||||||
27 | Crowdsourced high-quality Tamil multi-speaker speech data set | உயர்தர தமிழ் பேச்சு தரவு தொகுப்பு | OTDC0026 | European Language Resources Association (ELRA) | This data set contains transcribed high-quality audio of Tamil sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. | tam|mul | 2020 | CC BY-SA 4.0 | Yes | text analysis | natural language processing | speech data corpus | audio/wav|text/tab-separated-values | ~4000 sentences | born digital | ||||||||||||||||||||||||||||||||
28 | MASSIVE Dataset | OTDC0027 | Creator:Amazon.com, Inc. | MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions. | multilingual | 2022 | CC BY 4.0 | Yes | SLURP: A Spoken Language Understanding Resource Package | natural language understanding | natural language processing | annotated text corpus | application/json | ~19,500 unique phrases in 51 languages. | born digital | ||||||||||||||||||||||||||||||||
29 | Microsoft Speech Corpus | OTDC0028 | Creator:Microsoft | This dataset contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. | multilingual | 2023-06-20 | "this dataset shall not be used for commercial purposes" | Yes | speech synthesis | speech recognition | audio/wav | 12.3 GB | born digital | ||||||||||||||||||||||||||||||||||