| Corpus | URL | Language | Size | Text types | Discipline | Period | Encoding | Non-linguistic annotation | Linguistic annotation | Licence | On-line | Download | Repository | VLO | Publication | Publication URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Czech Sociological Review | https://hdl.handle.net/11372/LRT-2703 | Czech | 3 million words | research papers, essays | sociology | 1993-2016 | TSV | ? | ? | MIT | no | yes | LINDAT | Yes | | |
| UH's German E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016102807 | German | 560,000 tokens | MA & PhD theses | ? | 1999-2016 | Korp | ? | ? | CC BY | yes | no | FIN-CLARIN | Yes | | |
| UH's English E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016102401 | English | 200 million tokens | MA & PhD theses | ? | 1999-2016 | Korp | ? | ? | CC BY | yes | no | FIN-CLARIN | Yes | | |
| ACL Anthology Reference Corpus | https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Faclarc_2 | English | 75 million tokens | research papers | computational linguistics | 1979-2015 | XML | author & text metadata | word, lemma, PoS | CC BY SA | yes (not sure if the same version) | yes | LDC | Yes | Bird et al. 2008 | https://acl-arc.comp.nus.edu.sg/lrec08.pdf |
| English Scientific Text Corpus | http://hdl.handle.net/11858/00-246C-0000-0023-8CF9-6 | English | 35 million tokens | journal articles | computer science, computational linguistics, bioinformatics, digital construction, microelectronics, linguistics, biology, mechanical engineering, electrical engineering | 1970/80 & 2000s | CQP | author & text metadata, doc structure | word, lemma, PoS | restricted | yes | no | CLARIN-D (Saarbrücken) | Yes | Degaetano-Ortlieb et al. 2013 | https://nbn-resolving.org/urn:nbn:de:bsz:mh39-49336 |
| Academic Corpus | https://www.wgtn.ac.nz/lals/resources/academicwordlist/information/corpus | English | 3.5 million words | journal articles, book chapters, course workbooks, laboratory manuals, course notes | arts, commerce, law, biology & subdisciplines | ? | ? | ? | ? | ? | no | no | Dedicated webpage | No | | |
| GENIA corpus | http://www.geniaproject.org/genia-corpus | English | 437,000 words | journal paper abstracts | biomedicine | ? | various, e.g. PTB | text metadata, doc structure | PoS, Syn, Term, Event, Relation, Coref | free but unspecified | no | yes | PORTULAN | Yes | Jian et al. 2008 | http://drops.dagstuhl.de/opus/volltexte/2008/1522/ |
| Reading Academic Text corpus | http://www.reading.ac.uk/internal/appling/corpus.htm | English | ? | PhD theses | agriculture, psychology, food science, technology, meteorology, history | ? | ASCII & HTML | ? | ? | restricted | no | no | Dedicated webpage | No | | |
| Academic Corpus PUCV-2006 | http://repositorio.ucv.cl/handle/10.4151/6855?show=full | Spanish | 59 million words | dictionaries, didactic guidelines, disciplinary texts, lectures, regulations, reports, research articles, tests, textbooks | psychology, social work, construction engineering, industrial chemistry | ? | ? | ? | PoS | ? | no | no | ? | No | Parodi 2010 | http://www.scielo.br/scielo.php?pid=S1518-76322010000300006&script=sci_arttext#back |
| UH's Spanish E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016102809 | Spanish | 2.3 million tokens | MA & PhD theses | ? | 1999-2016 | Korp | ? | ? | CC BY | yes | no | FIN-CLARIN | Yes | | |
| Corpus of Estonian scientific texts | http://hdl.handle.net/11297/1-00-0000-0000-0000-0002-4 | Estonian | 5 million words | PhD theses, scientific articles | ? | ? | TEI P5 | ? | ? | CLARIN_ACA-NC | no | yes (not working) | CELR | Yes | | |
| UH's Finnish E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016090601 | Finnish | 12.5 million tokens | MA & PhD theses | ? | 1999-2016 | Korp | ? | PoS, lemma | CC BY | yes | no | FIN-CLARIN | Yes | | |
| Chambers-Le Baron Corpus of Research Articles | http://hdl.handle.net/20.500.12024/2527 | French | 1 million words | research papers | media/culture, literature, linguistics and language learning, social anthropology, law, economics, sociology and social sciences, philosophy, history, and communication | 1998-2006 | TXT | ? | ? | OTA | no | yes | Oxford Text Archive | Yes | | |
| UH's French E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016102806 | French | 580,000 tokens | MA & PhD theses | ? | 1999-2016 | Korp | ? | ? | CC BY | yes | no | FIN-CLARIN | Yes | | |
| Corpus of academic Lithuanian | http://coralit.lt/en/node/18 | Lithuanian | 9 million words | textbooks, monographs, journal articles | humanities, social sciences, physical sciences, biomedical sciences, technological sciences & subdisciplines | 1999-2009 | TEI 5 | ? | ? | ? | yes | no | Dedicated webpage | No | Usonienė and Linkevičienė (2009) | http://mokslozurnalai.lmaleidykla.lt/publ/0235-716X/2009/3-4/133-143.pdf |
| UH's Russian E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016102808 | Russian | 1.1 million words | MA & PhD theses | ? | 1999-2016 | Korp | ? | ? | CC BY | yes | no | FIN-CLARIN | Yes | | |
| UH's Swedish E-thesis corpus | http://urn.fi/urn:nbn:fi:lb-2016102810 | Swedish | 105 million tokens | MA & PhD theses | ? | 1999-2016 | Korp | ? | ? | CC BY | yes | no | FIN-CLARIN | Yes | | |
| Academic texts - humanities | http://hdl.handle.net/10794/49 | Swedish | 14.5 million tokens | ? | humanities | 1997-2012 | Korp (XML, TXT) | ? | ? | CC BY | yes | yes | SWE-CLARIN | Yes | | |
| Academic texts - social science | http://hdl.handle.net/10794/50 | Swedish | 10.8 million tokens | ? | social science | 1997-2012 | Korp (XML, TXT) | ? | sentence | CC BY | yes | yes | SWE-CLARIN | Yes | | |
| Czech and English abstracts of ÚFAL papers | https://hdl.handle.net/11234/1-1731 | Czech, English | 2 million words | research paper abstracts | formal and applied linguistics | ? | TSV | ? | doc alignment | CC BY | no | yes | LINDAT | Yes | | |
| MuchMore Springer Bilingual Corpus | http://muchmore.dfki.de/resources1.htm | English, German | 1 million tokens | journal paper abstracts | medical & subdisciplines | ? | MuchMore XML | doc structure | PoS, Morph, Chunks, SemClass, SemRel | free but unspecified | no | yes | Dedicated webpage | No | | |
| Spanish-English Research Article Corpus | https://books.google.si/books?id=NZbWCgAAQBAJ&pg=PA178&lpg=PA178&dq=serac+corpus&source=bl&ots=A7F-vUMJsr&sig=ACfU3U1b8W_r944Bs8OviL9xauHtUoeqVg&hl=sl&sa=X&ved=2ahUKEwiRuq_5nczmAhXT5KYKHWUtBlcQ6AEwAHoECAUQAQ#v=onepage&q=serac%20corpus&f=false | Spanish, English | 5.7 million words | journal articles | ? | 2000-2010 | ? | ? | ? | ? | no | no | / | No | | |
| Scientext corpus | https://scientext.hypotheses.org/corpus | French, English | 20 million words | scientific texts, argumentative essays | humanities, experimental sciences, applied or technical sciences & subdisciplines | ? | ? | ? | ? | CC BY | yes | no | Dedicated webpage | No | | |
| Corpus of Academic Slovene KAS 1.0 | http://hdl.handle.net/11356/1244 | Slovenian | 1.7 billion tokens | BA & MA & PhD theses | humanities, social sciences, natural sciences | 2000-2018 | TEI | ? | MSD, lemma, bilingual and monolingual term candidates | CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0 | Yes | Yes | CLARIN.SI | Yes | Erjavec et al. (forthcoming) | |
| OROSSIMO Corpus | http://hdl.grnet.gr/11500/ATHENA-0000-0000-2410-5 | Greek | 2.5 million tokens | "academic discourse text" | social sciences, computer science, economics, linguistics, photography, law, engineering, history, astronomy, earth sciences and geology, medicine and health, biology | ? | XML (XCES) | ? | Term candidates, "mixed structural annotation" | CC-BY | No | Yes | CLARIN:EL | No | | |
| Modern Greek Dialects: scientific papers | http://hdl.grnet.gr/11500/KEG-0000-0000-2502-4 | Greek | 113,000 words | scientific texts | linguistics, dialects | ? | plain | ? | ? (presumably none) | CC-BY-SA | No | Yes | CLARIN:EL | No | | |
| tekmirion | http://hdl.grnet.gr/11500/IONION-0000-0000-2512-2 | English, Greek, French | 2,750 (of what?) | journal articles | library science, museology | ? | plain | ? | ? (presumably none) | CC-BY | No | No | CLARIN:EL | No | | |
| The Language of Literature and the Language of Translation (collected scientific papers) | http://hdl.grnet.gr/11500/KEG-0000-0000-24F2-6 | Greek | 48,300 words | journal articles | linguistics, translation | ? | plain | ? | ? (presumably none) | CC-BY-SA | No | Yes | CLARIN:EL | No | | |
| KIAP - Cultural Identity in Academic Prose | http://hdl.handle.net/11372/LRT-373 | English, French, Norwegian | ? | scientific papers | ? | ? | ? | ? | PoS | ? | No (broken link) | No | LINDAT | No | | |