Spoken corpora in CLARIN
1
Corpus | Language | Data type | Time span | Genre/domain | Size | Annotation | Availability | Licence | Additional comments | VLO
2
Mbochi speech corpus | Achinese, French | Recordings and transcripts | | | 4.5 hours; 5000 sentences | | Download | ELRA | ELRA Catalogue (broken link in VLO) | Yes
3
Arabic Speech Corpus | Arabic | Recordings and transcripts | | | | | Download | CC NC-SA 3.0 | Oxford Text Archive | Yes
4
Spoken corpus of Karel Makoň | Czech | Recordings and transcripts | 1960s-1990s | Christian mysticism | 1000 hours | | Download | CC BY-SA 3.0 | LINDAT | Yes
5
Czech Senior COMPANION Expressive Speech Corpus | Czech | Recordings and transcripts | | Expressive speech | 6508 utterances | | Concordancer and download | CC BY-NC-SA 3.0 | LINDAT | Yes
6
ORAL2008: Balanced corpus of informal spoken Czech | Czech | Transcripts | | Informal conversations | 1 million words | | Concordancer and download | CC BY-NC-SA 3.0 | LINDAT | Yes
7
ORAL2013: balanced corpus of informal spoken Czech (transcriptions) | Czech | Transcripts | | Informal conversations | 2.8 million words | | Concordancer and download | CC BY-NC-SA 4.0 | LINDAT | Yes
8
ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio) | Czech | Recordings and transcripts | | Informal conversations | 2.8 million words | Recordings and transcripts anonymised | Concordancer and download | Academic Licence Agreement for Czech National Corpus Data | LINDAT | Yes
9
ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions) | Czech | Transcripts | | Informal conversations | 1 million words | Orthographic and phonetic transcription; MSD-tagged, lemmatised | Concordancer and download | CC BY-NC-SA 4.0 | LINDAT | Yes
10
ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio) | Czech | Recordings and transcripts | | Informal conversations | 1 million words | Orthographic and phonetic transcription; MSD-tagged, lemmatised | Concordancer and download | Academic Licence Agreement for Czech National Corpus Data | LINDAT | Yes
11
DIALEKT v1: dialectal corpus with multi-tier transcription | Czech | Recordings and transcripts | | Traditional dialectological material, usually unprepared monologue-type speech | 100,000 words | Orthographic and dialectological transcription; MSD-tagged, lemmatised | Concordancer; download available upon request | Academic Licence Agreement for Czech National Corpus Data | http://kontext.korpus.cz | No | Added by michal.kren@ff.cuni.cz
12
Faroese Danish Corpus Hamburg 0.2.dan (FADAC-0.2.dan Hamburg) | Danish (Faroese), Faroese | Recordings | | Informal interviews | | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
13
IFA Spoken Language Corpus | Dutch | Recordings and transcripts | | | 50,000 words (41 minutes/speaker) | Hand-segmented speech | Download | | LINDAT (download from dedicated webpage) | Yes
14
AUTONOMATA Spoken Names Corpus | Dutch | | | Reading aloud of first names, surnames, street names, etc. | 5,000 words | Phonetic transcription (syllable boundaries, stress markers) | Download | | LINDAT (broken link to landing page) | Yes | https://ivdnt.org/downloads/taalmaterialen/tstc-autonomata-namencorpus
15
AUTONOMATA-POI-corpus | Dutch | | | Reading aloud points-of-interest | 800 POIs | | | | | ?? | https://ivdnt.org/taalmaterialen/102-taalmaterialen/1959-tstc-autonomata-poi-corpus
16
CGN 2.0 | Dutch | Recordings and transcripts | | | 1,000 hours, 10 million words | PoS-tagged, lemmatised, partial phonetic transcription, partial syntactic annotation | Concordancer + Download of transcripts with annotations + Download audio | CLARIN RES | OpenSONAR; Taalmaterialen website at www.ivdnt.org | Yes | https://ivdnt.org/taalmaterialen/102-taalmaterialen/1965-tstc-cgn-annotaties; https://ivdnt.org/downloads/taalmaterialen/tstc-corpus-gesproken-nederlands
17
IFA speech corpus | Dutch | | | | | | | | LINDAT (broken link to landing page) | Yes
18
JASMIN Speech Corpus | Dutch | Recordings and transcripts | | Children, non-natives, senior people; human-machine interaction, read speech | 115 hours of speech | PoS-tagged, lemmatised, phonetic transcription | | | LINDAT (broken link to landing page) | Yes | https://ivdnt.org/downloads/taalmaterialen/tstc-jasmin-spraakcorpus
19
TRAINS Spoken Dialog Corpus | English | Recordings | | Task-oriented dialogs | 6.5 hours, 55,000 words | | | | | Yes
20
Boston University Radio Speech Corpus | English | Recordings and transcripts | | Radio news | 7 hours | PoS-tagged, phonetic alignment, prosodic markers | Download | CLARIN RES | UPenn | Yes
21
Buckeye Corpus of Conversational Speech | English | Recordings and transcripts | | Interview | | phonetic labels | | CLARIN RES | Ortolang | Yes
22
English TTS speech corpus of air traffic (pilot) messages - Czech accent | English | Recordings | | Air traffic control | 1692 utterances | | Download | CC BY-NC-SA 4.0 | LINDAT | Yes
23
English TTS speech corpus of air traffic (pilot) messages - Serbian accent | English | Recordings | | Air traffic control | 3000 utterances | | Concordancer and download | CC BY-NC-SA 3.0 | LINDAT | Yes
24
English TTS speech corpus of air traffic (pilot) messages - German accent | English | Recordings | | Air traffic control | 1685 utterances | pitch marking | Download | CC BY-NC-SA 4.0 | LINDAT | Yes
25
English TTS speech corpus of air traffic (pilot) messages - Taiwanese accent | English | Recordings | | Air traffic control | 1000 utterances | pitch marking | Download | CC BY-NC-SA 3.0 | LINDAT | Yes
26
The Spoken Wikipedia Corpora | English, German, Dutch | Transcripts | | Read Wikipedia articles | | Text segmentation, normalization, time-alignment | Download | CC BY-SA 4.0 | CLARIN-D; uni-hamburg | Yes
27
Corpus of Spoken Estonian | Estonian | Transcripts | | Various | 1 million words | Unspecified tagging | | | LINDAT | Yes
28
Persian Speech Corpus | Farsi, Persian | Recordings | | | 2.5 hours of speech | | Download | ELRA | ELRA Catalogue | Yes
29
Ph@ttSessionz Adolescents Speech Corpus | German | Recordings | | Adolescent speech | | | Concordancer and download | | BAS CLARIN Repository | Yes
30
Hamburg Modern Times Corpus | German | Recordings and transcripts | | adult L2 acquisition; task-oriented communication; film retelling | 186 minutes | manual annotation of phonetic phenomena, accentuation/stress marking | | HZSK-ACA (academic, non-commercial only) | CLARIN-D; uni-hamburg | Yes
31
EXMARaLDA Demo Corpus 1.0 | German, English, French, Spanish, Turkish, Polish, Vietnamese, Swedish, Norwegian, Italian, Russian, Afrikaans, Portuguese | Recordings | | Demo corpus - a demonstration of the EXMARaLDA system | 132 minutes | suprasegmental information, accentuation/stress marking | Download | HZSK-PUB (public, non-commercial only) | CLARIN-D; uni-hamburg | Yes
32
Hamburg Adult Bilingual LAnguage (HABLA) | German, French, Italian | Recordings | | Interviews | 4748 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
33
ALCEBLA | German, Spanish | Recordings and transcripts | | Speech tasks performed by bilingual children | 4293 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
34
35
Nepali Spoken Corpus | Nepali | | | Various | 31 hours 26 minutes | | Download | ELRA | ELRA Catalogue | Yes
36
Nganasan Spoken Language Corpus (NSLC) | Nganasan, Russian | Recordings and transcripts | | Various | 1962 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
37
Hamburg Corpus of Polish in Germany (HamCoPoliG) | Polish | Recordings | | Spontaneous speech and reading tasks | 2264 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
38
Spoken Portuguese Corpus | Portuguese | Recordings and transcripts | 1970-2001 | Various domains and different countries (Portugal, Brazil, African countries with Portuguese as official language) | 86 recordings, 153,588 tokens | Plain text (PoS-tagged) and aligned transcriptions | Download | ELRA | ELRA Catalogue | Yes
39
Consecutive and Simultaneous Interpreting (CoSi) | Portuguese, English | Recordings | | Lectures in Portuguese with simultaneous interpretation in English | 345 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
40
Spoken corpus Gos 1.0 | Slovenian | Transcripts | | radio and TV shows, school lessons, private conversations, business meetings | 1 million words (120 hours) | Two types of transcription: standardized spelling and pronunciation-based spelling | Concordancer and download | CC BY-NC-SA 4.0 | CLARIN.SI + dedicated web concordancer | Yes
41
Spoken corpus Gos VideoLectures 2.0 (audio) | Slovenian | Recordings | | Public academic speech | 9.8 hours of speech | | Download | CC BY-NC-ND 4.0 | CLARIN.SI | Yes
42
Spoken corpus Gos VideoLectures 2.0 (transcription) | Slovenian | Transcripts | | Public academic speech | 79420 words | PoS-tagged, lemmatised, two types of transcription: standardized spelling and pronunciation-based spelling | Concordancer and download | CC BY 4.0 | CLARIN.SI | Yes
43
Spanish TTS Speech Corpus | Spanish | Recordings and transcripts | | | 1,787 sentences | Syllable, stress and acoustic events marking. Utterance-level segmentation, word-level annotation | Download | ELRA | ELRA Catalogue | Yes
44
COLA - Corpus Oral de Lenguaje Adolescente | Spanish | Recordings and transcripts | | Spontaneous conversations among teenagers | 750,000 tokens | | Concordancer and download | CLARIN RES | CLARINO | No
45
Hamburg Corpus of Argentinean Spanish (HaCASpa) | Spanish | Recordings | | Spontaneous speech and reading tasks | 1119 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
46
Catalan in a bilingual context (PhonCAT) | Spanish (Catalan) | Recordings | | Read, elicited and spontaneous speech data | 8619 minutes | | | HZSK-RES (restricted, non-commercial only) | CLARIN-D; uni-hamburg | Yes
47
Gothenburg Dialogue Corpus | Swedish | Transcripts | | | 1 470 000 tokens | MSD-tagged, lemmatised | Concordancer | Account needed | SWE-CLARIN | No
48
Interaction and Variation in Pluricentric Languages (IVIP) | Swedish (Sweden and Finland) | Recordings (audio and video) and transcripts | | Spontaneous conversations in the service domain | 380 000 tokens | orthographic transcription, MSD-tagged, lemmatised, named-entity-tagged | Concordancer | Account needed | SWE-CLARIN | No
49
IVIP demo - Interaction and Variation in Pluricentric Languages | Swedish (Sweden and Finland) | Recordings (audio and video) and transcripts | | Spontaneous conversations in the service domain | 573 tokens | orthographic transcription, MSD-tagged, lemmatised, named-entity-tagged | Concordancer and download of annotated transcripts | CC BY | SWE-CLARIN | No
50
The BigBrother Corpus | Norwegian | Recordings, transcripts | | Big Brother 2001 | 440 300 tokens | orthographic transcription, MSD-tagged, lemmatised | Concordancer | Account needed | CLARINO | Yes
51
Corpus of American Nordic Speech (CANS) | Norwegian, Swedish | Recordings, transcripts | | Interviews, conversations. Norwegian and Swedish dialects in America. | 251 000 tokens | orthographic transcription, MSD-tagged, lemmatised and connected to phonetic transcription | Concordancer | Account needed | CLARINO | Yes
52
Corpus of Doctor-Patient Conversations from Ahus | Norwegian | Transcripts | | Doctor-patient conversations | 958 830 tokens | orthographic transcription, MSD-tagged, lemmatised | Concordancer | Account needed | CLARINO | Yes
53
LIA | Norwegian | Recordings, transcripts | | Interviews, conversations. Norwegian dialects | 1.5 million tokens | orthographic transcription, MSD-tagged, lemmatised and connected to phonetic transcription | Concordancer | Account needed | CLARINO | Soon ...
54
Nordic Dialect Corpus | Norwegian, Swedish, Danish, Faroese, Övdalian | Recordings, transcripts | | Interviews, conversations. Norwegian dialects | 3.1 million tokens | orthographic transcription, MSD-tagged, lemmatised. Norwegian and Övdalian orthographic transcription are connected to phonetic transcription | Concordancer | Account needed | CLARINO | Yes
55
NoTa-Oslo | Norwegian | Recordings, transcripts | | Interviews, conversations. Oslo sociolects. | 1 million tokens | orthographic transcription, MSD-tagged, lemmatised | Concordancer | Account needed | CLARINO | Yes
56
The Ruija Corpus | Kven, Finnish | Recordings, transcripts | | | 430 000 words; 76 hours 18 minutes | | Concordancer | Account needed | NOT a CLARINO corpus. REMOVE! | No
57
Talko | Norwegian | Recordings, transcripts | | | | | Concordancer | Account needed | NOT a CLARINO corpus. REMOVE! | No
58
TAUS | Norwegian | Recordings, transcripts | | Informal interviews. Oslo sociolects | 270 000 tokens | orthographic transcription, MSD-tagged, lemmatised. Some transcriptions are also phonetically transcribed | Concordancer | Account needed | CLARINO | Yes
59
Deutsche Mundarten: Zwirner-Korpus | German (some Frisian and Dutch) | Audio recordings, transcripts | | German dialects: Interviews, elicited speech | 4 million words; 1076 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes | from here to row #100: added by thomas.schmidt@ids-mannheim.de
60
Deutsche Mundarten: ehemalige deutsche Ostgebiete | German | Audio recordings, transcripts | | German dialects: Interviews, elicited speech | 838 000 words; 461 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
61
Deutsche Mundarten: DDR | German (some Sorbian) | Audio recordings, transcripts | | German dialects: Interviews, elicited speech | 212 000 words; 385 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
62
Deutsche Mundarten: Südwestdeutschland und Vorarlberg | German | Audio recordings | | German dialects: Interviews, elicited speech | 72 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
63
Deutsche Mundarten: Schwarzwald | German | Audio recordings | | German dialects: Interviews, elicited speech | 37 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
64
Deutsche Mundarten: Kreis Böblingen | German | Audio recordings | | German dialects: Interviews, elicited speech | 43 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
65
Deutsche Umgangssprachen: Pfeffer-Korpus | German | Audio recordings, transcripts | | German regional varieties: Interviews | 646 000 words; 80 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
66
Deutsche Standardsprache: König-Korpus | German | Audio recordings, transcripts | | German standard language: Interviews, elicited speech | 50 000 words; 6 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes | Excerpt - complete corpus (around 120 hours) currently in the process of curation
67
Deutsche Hochlautung | German | Audio recordings, transcripts | | German standard language: Broadcast data | 10 000 words; 2 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
68
Binnen- und auslandsdeutsche Mundarten: Varia | German | Audio recordings | | German dialects / extraterritorial varieties: Interviews, elicited speech | 70 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
69
Australiendeutsch | German | Audio recordings, transcripts | | German extraterritorial varieties: Interviews | 330 000 words, 65 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
70
Russlanddeutsche Dialekte | German | Audio recordings, transcripts | | German extraterritorial varieties: Interviews | 100 000 words; 10 hours | literal and orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
71
Emigrantendeutsch in Israel | German | Audio recordings, video recordings, transcripts | | German extraterritorial varieties: Interviews | 232 000 words; 285 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
72
Emigrantendeutsch in Israel: Wiener in Jerusalem | German | Audio recordings, transcripts | | German extraterritorial varieties: Interviews | 225 000 words; 51 hours | orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
73
Zweite Generation deutschsprachiger Migranten in Israel | German | Audio recordings, video recordings | | German extraterritorial varieties: Interviews | 125 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
74
Slawische Mundarten im Ruhrgebiet | German, Polish, Russian, Slovenian, Ukrainian | Audio recordings | | German varieties, multilingualism | 7 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
75
Forschungs- und Lehrkorpus gesprochenes Deutsch | German | Audio recordings, video recordings, transcripts | | Authentic interaction data from various domains | 2.3 million words, 230 hours | literal and orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes | ongoing compilation
76
Gesprochene Wissenschaftssprache Kontrastiv | German, English, Polish, Bulgarian | Audio recordings, video recordings, transcripts | | Academic interaction | 760 000 words, 92 hours | literal and orthographic transcription, lemma, POS, time alignment, annotation of discourse phenomena and language mixing | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes | ongoing curation, also available at https://gewiss.uni-leipzig.de/index.php?id=home
77
Grundstrukturen: Freiburger Korpus | German | Audio recordings, transcripts | | Authentic interaction data from various domains | 600 000 words, 70 hours | orthographic transcription, intonation, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
78
Dialogstrukturen | German | Audio recordings, transcripts | | Authentic interaction data from various domains | 140 000 words, 15 hours | orthographic transcription, intonation, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
79
Berliner Wendekorpus | German | Audio recordings, transcripts | | Narrative interviews about German reunification | 260 000 words, 28 hours | literal and orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
80
Biographische und Reiseerzählungen | German | Audio recordings, transcripts | | Narrative/biographic interviews | 50 000 words, 6 hours | orthographic transcription | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
81
Belgische TV-Debatten | German | Audio recordings, video recordings | | Broadcast TV debates | 10 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
82
Elizitierte Konfliktgespräche | German | Audio recordings, transcripts | | Elicited conflict interaction | 160 000 words, 12 hours | orthographic transcription | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
83
Jugendkommunikation | German | Audio recordings | | Adolescent interaction | 5 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
84
Mehrsprachige Kinder im Vorschulalter | German | Audio recordings, transcripts | | Elicitation tasks with pre-school children | 17 000 words, 13 hours | literal and orthographic transcription, lemma, POS, time alignment | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
85
Kindersprache: Saarbrücker Korpus | German | Audio recordings | | Child speech | 5 hours | | via Database of Spoken German (DGD): Browsing / Query / Download | Account needed | AGD @ IDS Mannheim | Yes
86
Deutsch Heute | German | Audio recordings, transcripts | | German regional varieties: Interviews, elicited data | > 1000 hours | orthographic transcription, lemma, POS, time alignment | soon | | AGD @ IDS Mannheim | No | ongoing curation, first release by end of 2018
87
Kiezdeutsch-Korpus | German
88
Berlin Maptask Corpus | German
89
Texas German Dialect Corpus | German | Audio recordings, transcripts | | German extra-territorial varieties
90
Last Minute Corpus | German
91
Spoken BNC 2014 | English
92
Santa Barbara Corpus of Spoken American English | English
93
Griffith Corpus of Spoken Australian English | English
94
Aston Corpus of West Midlands English | English
95
Vienna Oxford International Corpus of English | English
96
Enquêtes Sociolinguistiques à Orléans | French
97
Corpus de LAngue Parlée en Interaction | French
98
Corpus de Français Parlé Parisien des années 2000 | French
99
Corpus de Français Parlé à Bruxelles | French
100
Phonologie du Français Contemporain | French