L2 resources (incl corpora from the list at Louvain & the TEITOK foreign language learner corpora)

Included in the report | CLARIN corpus YES/NO | CLARIN comment | red = not CLARIN; blue = observer; no colour = CLARIN | Corpus name | Link | Language of the corpus | Native language(s) | Content | Size | Period | Annotation | Availability | License | Where? | Selected publication(s) | Contact info | Comments | Tools used for annotation | CLARIN country
No | Available to browse and download TLT | Australia | UWI: The University of the West Indies learner corpus | https://talkbank.org/access/SLABank/French/UWI.html | French | English, Jamaican Creole | spoken | CHAT (MacWhinney 2000) | Browse CLAN; Download mp3 with transcripts and annotation | Peters (2005). Development of a longitudinal oral interlanguage corpus of Jamaican learners of French. Caribbean Journal of Education 27(2): 75-96. | Dr. Hugues PETERS | 9 learners; 25 one-on-one interviews (2-3 / student); NOT INCLUDED IN THE REPOSITORY: 42 other interviews; one with a 10th informant.
No | Seemingly unavailable | Belarus | BELLCE: The Belarussian Learner Corpus of English | https://sites.google.com/site/bellearnercorpus/ | English | Russian, Belarussian | written | 58 argumentative essays | unannotated | nothing online, contact A. Rakhuba | Anastasia Rakhuba Anastasia.rakhuba@gmail.com
No | Seemingly unavailable - is it somehow available through UCL? In a presentation they said they wanted a good easy-to-use interface | Belgium | Leerdercorpus Nederlands (incl. LCN als Vreemde Taal & LCN) | https://uclouvain.be/fr/instituts-recherche/ilc/valibel/corpora.html | Dutch | French; German, Polish, Hungarian, Indonesian, English, others | written? | 3468 texts; 774658 words | project from 2003- (LCN texts 1998-2004; LSNaVT 1999-2007) | CENTAL (UCL?) | Liesbeth Degand, Louvain, Belgium
No | Seemingly unavailable | Belgium | CYLIL: The Corpus of Young Learner Interlanguage | English | Dutch, French, Italian, Greek | spoken | 500 000 words | project started 1990 | no error tagging | Vrije Universiteit Brussel | A. Housen. A corpus-based study of the L2-acquisition of the English verb system. In: Computer learner corpora, second language acquisition, foreign language... (eds. Granger, Hung, Petch-Tyson) | alex.housen@vub.ac.be | some longitudinal data; interviews
Yes | Yes (complete ICLE or only parts?) | LINDAT - but is it really available there? | Belgium | ICLE: The International Corpus of Learner English | https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html | English | Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Tswana, Turkish | written | Sylviane Granger, Louvain | argumentative essays; v1 and v2 on cd-rom, v3 under way
unavailable except to project members
Belgium | The LONGDALE project: Longitudinal database of Learner English | https://uclouvain.be/en/research-institutes/ilc/cecl/longdale.html | English | various | written, spoken | growing | project January 2008 | "The Longdale data are not publicly available at this stage and are analyzed in priority by the teams who collected them." | Louvain | Meunier, Fanny 2016. Introduction to the LONGDALE Project. In Castello, Erik / Ackerley, Katherine / Coccetta, Francesca (eds) Studies in Learner Corpus Linguistics. Research and Applications for Foreign Language Teaching and Assessment. Berlin: Peter Lang, 123-126. | Prof. Fanny Meunier | longitudinal, minimum 3 years; a database, not a corpus - variety of material; first focus on written material but hope to include spoken
No? | seems to be available to download | Belgium | FRIDA: French Interlanguage Database | https://uclouvain.be/en/research-institutes/ilc/cecl/frida.html; download? via https://sites.uclouvain.be/cecl/projects/Frida/gateway.htm | French | English, Dutch, various | written | raw text version; error tagged version | UCL | Sylviane Granger, Louvain??? (no contact listed on the UCL FRIDA pages)
No | Belgium | LCF: The Learner Corpus of French | French | Dutch | written | K.U.Leuven Campus Kortrijk, UGent and Lessius
No? | the link available in a presentation requires login, thus hard to know if it works | Belgium | Aprescrilov ("Aprender a Escribir en Lovaina") I / II | ilt.kuleuven.be/aprescrilov (MIGHT WORK BUT REQUIRES LOGIN) | Spanish | Dutch; (Belgian) French | written | ca 2700 texts | Collected from 2004-2010 | Markin tool annotation - positive and negative error annotation | Hard to tell. Link given in presentation but leads one to a login page for the university. | https://www.arts.kuleuven.be/ling/dag_van_het_onderzoek/posters_en_demos/buyse_apescrilov; K. Buyse, L. Fernandez Perera, K. Verveckken. 2016. The Aprescrilov corpus, or broadening of the horizon of Spanish language learning in Flanders. In: Spanish Learner Corpus Research: Current trends and future perspectives, edited by Margarita Alonso-Ramos [Studies in Corpus Linguistics 78], pp. 143–168.
No | Belgium | LINDSEI: The Louvain International Database of Spoken English Interlanguage | English | various | spoken | Louvain
No | Brazil | The Br-ICLE corpus | English | Brazilian Portuguese | written | Tony Berber Sardinha, Sao Paulo & Stella O. Tagnin, Sao Paulo
No | Brazil | MLC: The Multilingual Learner Corpus | Multilingual: English, German, Italian, Spanish | Brazilian Portuguese | written
No | Canada | The Quebec learner corpus | English | Canadian French | written
No | Canada | The "Dire Autrement" corpus | French | English mainly | written
No | Canada | RPD: The University of Toronto Romance Phonetics Database | Multilingual: English, French, Italian, Portuguese, Spanish, Romanian | various | spoken
No | China | CAWE: The Chinese Academic Written English Corpus | English | Chinese | written
No | China | CLEC: The Chinese Learner Corpus of English | English | Chinese | written
No | China | ESCCL: The English Speech Corpus of Chinese Learners | English | Chinese | spoken
No | China | TECCL: The Ten-Thousand English Compositions of Chinese Learners | English | Chinese | written
No | Croatia | CroLTeC | Croatian | 36: Afrikaans, Arabic, Bulgarian, Catalan, Czech, Danish, German, English, Estonian, Persian, Finnish, French, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Lari, Mandinka, Dutch, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Albanian, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Chinese, Malay | written essays | 6,213 essays; 1,054,287 tokens | 2014- | Learner modifications (deletions, transpositions, insertions); MSD tags; Learner Errors | Freely available online (downloadable by request) | not decided yet | http://teitok.iltec.pt/croltec/ | Mikelić Preradović, N., Berać, M., Boras, D. (2015). Learner Corpus of Croatian as a Second and Foreign Language. In Multidisciplinary Approaches to Multilingualism. Frankfurt am Main, Germany: Peter Lang, pp. 107-126. | nmikelic@ffzg.hr | in preparation | TEITOK | (Croatia)
YES | Yes | Listed in VLO / LINDAT. Is the corpus called CzeSL or AKCES? Somewhat unclear in VLO | Czech Republic | CzeSL – Czech as a Second Language (part of AKCES) | http://utkl.ff.cuni.cz/learncorp/ | Czech | 54: ar, az, ba, be, bg, cs, da, de, el, en, es, fa, fi, fr, he, hi, hr, hu, hy, id, it, ja, ka, kg, kk, ko, ky, la, lv, mk, mn, mo, ms, nl, no, pl, pt, ro, ru, sh, sk, sl, sq, sr, sv, tg, th, tl, tr, uk, uz, vi, xal, zh | essays | 8,617 texts; 111K sentences; 958K words; 1,148K tokens | 2009–2013 | Anonymization, learner and teacher modifications, MSD tags, automatically identified learner errors (type and target hypothesis); for 645 texts: manually identified learner errors and parsed target hypothesis | searchable and downloadable, various releases | CC-BY-SA 3.0 | http://utkl.ff.cuni.cz/learncorp/ | (1) Rosen, A., Hana, J., Štindlová, B., and Feldman, A. (2014). Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation – Special Issue: Resources for language learning, 48(1):65–92. http://utkl.ff.cuni.cz/~rosen/public/2011-czesl-lrej_prefinal.pdf (2) Rosen, A. (2016). Building and using corpora of non-native Czech. In Brejová, B., editor, Proceedings of the 16th ITAT: Slovenskočeský NLP workshop (SloNLP 2016), volume 1649 of CEUR Workshop Proceedings, pages 80–87, Bratislava, Slovakia. Comenius University in Bratislava. http://ceur-ws.org/Vol-1649/80.pdf | alexandr.rosen@ff.cuni.cz | Czech Republic
Unavailable? | Dubai & UK | The BUID Arab Learner Corpus (BALC) | English | Arabic | written | Mick Randall, Dubai; Nicholas Groom, Birmingham | M. Randall & N. Groom. The BUiD Arab Learner Corpus: a resource for studying the acquisition of L2 English spelling. 7th international conference on Practical Applications In Language and Computers. The Department of English Language and Applied Linguistics, Łódź University, Poland. April 2009; M. Randall & N. Groom. Introducing BALC: the BUiD Arab Learner Corpus. Workshop, 15th TESOL Arabia Conference, Dubai, March 2009
Section 2 | not in VLO | Estonia | EIC: The Estonian Interlanguage Corpus of Tallinn University | http://evkk.tlu.ee/ | Estonian | Russian, Finnish, English, German, Latvian, Lithuanian, Ukrainian, Belorussian | written | Pille Eslon (peslon@tlu.ee) | Estonia
Section 2 | not in VLO | Estonia | The Tartu Learner Corpus of Spanish as a L3+ | Spanish | Estonian | written | c. 885 000 tokens | Mari Kruse, Tartu mari.kruse@ut.ee
Yes | included in TOPLING below, do not count | Finland | CEFLING: Linguistic Basis of the Common European Framework for L2 English and L2 Finnish | http://metashare.csc.fi/repository/browse/cefling-project-corpus/57c4098c37c411e2b67c005056be118e51bf01e3dd194ba29014dca05cc8e258/ | Finnish, English | various | written essays | Maisa Martin, Jyväskylä (Ari Huhta)
Yes | in the VLO, FIN-CLARIN | Finland | TOPLING: Paths in Second Language Acquisition | http://urn.fi/urn:nbn:fi:lb-20140730168 | Finnish, English, Swedish | various | written essays | Umbrella entry, subcorpora are available: see below | Maisa Martin, Jyväskylä (Ari Huhta) | includes data from the previous CEFLING project
Section 2?FinlandSve2Ju
Finland(Ari Huhta)
Section 2? | Finland | THE BAT MAT corpus: EFL academic writing | https://www.utu.fi/fi/sivustot/afinla2013/ohjelma/Sivut/lindgren.aspx | English | Swedish, Finnish (Finland-Swedish according to link, of course some are then probably bilingual. Numbers?) | written | 90 BA/MA dissertations | University of Turku | Signe-Anita Lindgren, Turku
listed below | Finland | LAS2: The Advanced Finnish Learner Corpus | Finnish | Russian, Czech, Swedish, Estonian, Lithuanian, Komi, English, Hungarian, German, Icelandic, Japanese | written | Kirsti Siitonen, University of Turku, Finland & Ilmari Ivaska, University of Turku, Finland
listed below | Finland | YKI: The Finnish National Foreign Language Certificate Corpus | Finnish | English, Finnish, French, German, Italian, Sami, Spanish, Swedish, Russian | written, spoken | Ari Maijanen, Centre for Applied Language Studies, University of Jyväskylä, Finland & Tiina Lammervo, Centre for Applied Language Studies, University of Jyväskylä, Finland
yes | YES | Finland | ICLFI: The International Corpus of Learner Finnish | Finnish | various | written | 1,009,719 tokens | 2007 - 2015 | Anonymization, lemmatization, grammatical annotation, proficiency assessment (CEFR) | CLARIN RES (Restricted) | CLARIN RES (Restricted) End-User License +PLAN +NC +INF +LOC +PRIV +DEP 1.0 | Sisko Brunni, University of Oulu; Jarmo Jantunen, University of Jyväskylä | (1) Brunni, Sisko; Jantunen, Jarmo; Lehto, Liisa-Maria; Airaksinen, Valtteri 2015: How to Annotate Morphologically Rich Language? Problems and Solutions. In Ann-Kristin Helland Gujord, Susan Nacey, Silje Ragnhildstveit LCR 2013 Conference Proceedings. Norway. (2) Jantunen, Jarmo Harri, Kansainvälinen oppijansuomen korpus (ICLFI): typologia, taustamuuttujat ja annotointi. – Lähivõrdlusi. Lähivertailuja 21, Eesti Rakenduslingvistika Ühing, 86–105. | sisko.brunni@oulu.fi | More: http://metashare.csc.fi/repository/browse/international-corpus-of-learner-finnish-iclfi/632aa08efccc11e18b49005056be118e2d2ce40d2a224c85b4602c9d1337b876/ | Connexor fi-fdg (and manual) | Finland
section 1 - added | yes | (via ortolang) | France | The Anglish corpus | English | French | spoken | 60 speakers | 2008 | Orthographic Transcription: https://hdl.handle.net/11403/sldr000733/v2 | https://hdl.handle.net/11403/sldr000731 | Anne Tortel, LPL, Aix-Marseille Université | praat | France
which part? | France | The Scientext English Learner Corpus | English | French | written | Grenoble
which part? | France | VESPA: The Varieties of English for Specific Purposes dAtabase learner corpus | English | various | written | Louvain
which part? | France | The Chy-FLE (Cypriot Learner Corpus of French) | French | Greek, Cypriot Greek | written | Poitiers (cooperation with Cyprus)
which part? | France | The COREIL corpus | French; English | ? | spoken | Paris-Diderot
which part? | France | LTC: The MeLLANGE Learner Translator Corpus | multilingual | various | written
which part? | France | PAROLE: The corpus PARallèle Oral en Langue Etrangère | Multilingual: English, French, Italian, mainly L2 but some L1 | various | spoken
not in VLO | Germany | Aachen corpus of academic writing (ACAW) | http://www.anglistik.rwth-aachen.de/go/id/gdgf | English | German | Elma Kerz, Aachen, Germany
not in VLO | Germany | CALE: The Corpus of Academic Learner English | English | German | written | Marcus Callies, Bremen
not in VLO | Germany | The Eastern European English Learner Corpus | English | Russian, Ukrainian, Polish, Slovak | spoken | Elena Salakhian
Yes | Yes | VLO CLARIN-UK | Germany / CLARIN-UK | GLBCC: The Giessen-Long Beach Chaplin Corpus | http://ota.ox.ac.uk/desc/2506 | English | German (mainly), Spanish, Korean, Japanese (the corpus also includes native English speakers: American, British, Australian) | spoken | 1999-2005 | Download zip files from the Oxford Text Archive | Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. | Andreas Jucker, Giessen & Sara Smith, Giessen; AVAILABLE THROUGH OXFORD TEXT ARCHIVE
Yes | Yes | VLO European Language Resource Association | Germany | ISLE speech corpus (Interactive spoken language education) | http://catalog.elra.info/en-us/repository/browse/isle-speech-corpus/16ab6156a9da11e7a093ac9e1701ca025183819723c049a49ecb32d1517a9e05/; http://linghub.lider-project.eu/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0083 | English | German, Italian | spoken | 20 minutes x 46 speakers = 11484 utterances (nearly 18 hours) | Hamburg
not listed in VLO | Germany | FALKO: The Fehlerannotiertes Lernerkorpus (‘error annotated learner corpus’) | https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko | German | various, L1 German | written | Target hypothesis, error annotation | Can be searched online at https://korpling.german.hu-berlin.de/falko-suche/ | Reznicek, Marc; Lüdeling, Anke; Krummes, Cedric; Schwantuschke, Franziska; Walter, Maik; Schmidt, Karin; Hirschmann, Hagen; Andreas, Torsten (2012): Das Falko-Handbuch. Korpusaufbau und Annotationen Version 2.01
Yes | Yes | Listed in VLO - CLARIN centres | Germany | LeaP: The Learning the Prosody of a Foreign Language - a phonological corpus of learner English and learner German | https://sourceforge.net/projects/leapcorpus/ | German; English | various (some native speakers) | spoken | 12 hours | 2001-2003 | Manual annotation with ESPS/waves+ and Praat. 8 tiers but not in all files. Phonological annotation (see manual); transcriptions in SAMPA | Download from https://sourceforge.net/projects/leapcorpus/ | Ulrike Gut, Augsburg; University of Bielefeld
Not listed in VLO | Germany | The LeKo (Lernerkorpus) corpus | https://korpling.german.hu-berlin.de/cqpwi/login.php | German | Anke Lüdeling, Humboldt-Universität Berlin, Germany | There is another LEKO project, but that one is Italian L2 (A. Abel). This LEKO corpus, which is listed at UCL (contact anke.luedeling@rz.hu-berlin.de), is not listed in Berlin under their corpora.
Yes | VLO - Tübingen Language Resources (CLARIN-D) | Germany | CREG - Corpus of Reading Comprehension Exercises in German | https://www.uni-tuebingen.de/en/research/core-research/collaborative-research-centers/sfb-833/section-a-context/a4-meurers/software-resources-and-corpora.html | German | English | written answers to reading comprehension questions | ~23K answers to 1059 questions to 112 reading texts; ~290K tokens | 2009- | Manual: entire corpus: transcription, teacher assessment of correctness of answer meaning; subcorpus: focus annotation | available by request | https://www.uni-tuebingen.de/en/research/core-research/collaborative-research-centers/sfb-833/section-a-context/a4-meurers/software-resources-and-corpora.html | (1) Niels Ott, Ramon Ziai, and Detmar Meurers (2012): "Creation and Analysis of a Reading Comprehension Exercise Corpus: Towards Evaluating Meaning in Context". Thomas Schmidt and Kai Wörner (Eds.), Multilingual Corpora and Multilingual Corpus Analysis. Benjamins. http://niels.drni.de/n3files/read-me/ott-ziai-meurers-12.pdf (2) Detmar Meurers, Niels Ott, Ramon Ziai (2010): "Compiling a Task-Based Corpus for the Analysis of Learner Language in Context". Pre-Proceedings of Linguistic Evidence 2010. (3) Ramon Ziai and Detmar Meurers (2014): "Focus Annotation in Reading Comprehension Data". In Proceedings of the 8th Linguistic Annotation Workshop (LAW VIII), Dublin, Ireland.
not in VLO | Germany / Czech Republic / Austria? | MERLIN: Multilingual Platform for the European Reference Levels - exploring interlanguage in context | Czech, German, Italian | Arabic, Chinese, Czech, English, French, German, Hungarian, Italian, Polish, Portuguese, Russian, Slovak, Spanish, Turkish, Other | written texts from examinations | 2,286 texts; 314,515 tokens | 2012-2014 | Manual: detailed CEFR rating, transcription (anonymization, task-related annotation), target hypotheses, error annotation; Automatic: POS, lemma, dependency parses | searchable and downloadable | CC-BY | http://www.merlin-platform.eu | (1) Abel, Andrea; Wisniewski, Katrin; Nicolas, Lionel; Boyd, Adriane; Hana, Jirka; Meurers, Detmar (2014): A Trilingual Learner Corpus illustrating European Reference Levels. In: Ricognizioni – Rivista di Lingue, Letterature e Culture Moderne 2 (1), 111-126. (2) Katrin Wisniewski. Die Validität der Skalen des Gemeinsamen europäischen Referenzrahmens für Sprachen. Eine empirische Untersuchung der Flüssigkeits- und Wortschatzskalen des GeRS am Beispiel des Italienischen und des Deutschen. Language Testing and Evaluation vol. 33, Frankfurt am Main 2014. (3) Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová and Chiara Vettori. The MERLIN corpus: Learner Language and the CEFR. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), Reykjavik, May 26-31, 2014. (4) Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, Jirka Hana. MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data. ICT for Language Learning 2013, Conference Proceedings, Libreriauniversitaria.it Edizioni, Florence, Italy, November 14-15, 2013. | info@merlin-platform.eu | XMLmind, Exmaralda/Falko Excel add-ins, MMAX2, UIMA | Germany, Italy, Czech Republic
Not listed in VLO | Germany / Lithuania | The AleSKO corpus | http://ling.uni-konstanz.de/pages/home/zinsmeister/alesko.html | German | Chinese | written | 13587 tokens (and 34155 L1 data) | syntactic, referential and discourse | Konstanz / Vilnius
not listed in VLO | Germany / USA | CHALC: The Cologne-Hanover Advanced Learner Corpus | English | German | written | c. 210 000 (according to UCL's site) | Ute Römer, Georgia State University, Atlanta | e.g. Römer, U. (2007). Learner language and the norms in native corpora and EFL teaching materials: A case study of English conditionals. In: S. Volk-Birke & J. Lippert (eds.). Anglistentag 2006 Halle. Proceedings. Trier: Wissenschaftlicher Verlag Trier. 355-363. | term papers, essays
not listed in VLO | UK | CLEG13: The Corpus of Learner German | http://korpling.german.hu-berlin.de/public/CLEG13/CLEG13_documentation.pdf | German | English | written | 2003-2007 | age, sex, mother's and father's L1, language at home, years of German at school, number and length of stay in Germany, other FL | Ursula Maden-Weinberger, Dept of European Languages and Cultures, Lancaster, UK
not listed in VLO | Greece | YOLECORE: The Young Learner Corpus of English | English | Greek | spoken and written | 170 school hours; 1.5 million tokens?? | http://www.enl.uoa.gr/fileadmin/enl.uoa.gr/uploads/conference/Matthaioudaki_Abstract_4th_PGS_Conference.pdf; https://ikee.lib.auth.gr/record/295293/ | Project director: Marina Mattheoudakis, Aristotle University of Thessaloniki, Greece
no | Hong Kong | CUCASE: The City University Corpus of Academic Spoken English | English | Chinese, L1 English | spoken
no | Hong Kong | The Hong Kong University of Science and Technology Learner Corpus | English | Chinese (mainly Cantonese) | written
no | Hong Kong | The Learner Corpus of English for Business Communication | English | Chinese | written
no | Hong Kong | The Learner Corpus of Essays and Reports | English | Chinese | written
no | Hong Kong | TSLC: The TELEC Secondary Learner Corpus | English | Chinese | written
not listed in VLO | Hungary | JPU: The Janus Pannonius University Corpus | http://joeandco.blogspot.fi/ | English | Hungarian | written | 300000 (221 Hungarian student essays and research papers) | 1992-1999 | https://lextutor.ca/conc/eng/ | József Horváth, University of Pécs, Hungary | essays and research papers (according to UCL list)
Israel | The Israeli Learner Corpus of Written English | English | Hebrew | written
not in VLO | Italy | The KOLIPSI corpus: South Tyrolean pupils and the second language (+ KOLIPSI II continuation) | German | Italian | written | under development (UCL list) | Andrea Abel & Aivars Glaznieks | http://www.eurac.edu/de/research/Publications/Pages/publicationdetails.aspx?pubId=0100156&type=Q | Two written language production tasks of a standardized test (email/letter): A2-C1 level. (UCL list)
not in VLO | Italy | CAIL2: Corpus di Apprendenti di Italiano L2 | Italian | various | written | 237 000 (UCL list) | Log in: https://www.unistrapg.it/cqpweb/ | Stefania Spina, Università per Stranieri di Perugia
not in VLO | Italy | Corpus parlato di italiano L2 | http://elearning.unistrapg.it/osservatorio/Corpora.html | Italian | English, German, Japanese | spoken | http://elearning.unistrapg.it/osservatorio/corpus/frames-cqp.html | Stefania Spina, Università per Stranieri di Perugia
not in VLO | Italy | LIPS: The Lexicon of Spoken Italian by Foreigners | http://www.parlaritaliano.it/index.php/en/data/653-corpus-lips | Italian | various | spoken | 1500 exams (500 learners); c. 700 000 tokens, c. 100 hours | 1993-2006 | transcribed, lemmatised, POS-tagged with Schmid's TreeTagger; manual correction of lemmatisation and POS tagging. Special tag for multi-word lexical units. Foreign words from L1 or other L2 tagged. Interlanguage data - special tag. | Might be downloadable from http://www.parlaritaliano.it/index.php/en/data/653-corpus-lips, but information is only in Italian. | University of Siena, Italy | http://www.lancaster.ac.uk/fass/events/laelpgconference/papers/v04/2-Francesca%20Gallina.pdf | Francesca Gallina | proficiency tests of the Certification of Italian as a Foreign Language (CILS) of the University for Foreigners of Siena
not in VLO | Italy | MISTiC (Multiple Italian Student TranslatIon Corpus) | Italian | English, French (ORIGINAL LANGUAGE OF THE TEXT, BUT THE TRANSLATORS ARE ITALIAN L1 SPEAKERS) | written | 125000+50000 tokens | not available (UCL list) | S. Castagnoli 2016. Investigating trainee translators' contrastive pragmalinguistic competence: a corpus-based analysis of interclausal linkage in learner translations. In: The Interpreter and Translator Trainer. | Sara Castagnoli, University of Bologna, Italy
not in VLO | Italy | VALICO: Varietà di Apprendimento della Lingua Italiana: Corpus Online | http://www.bmanuel.org/projects/br-HOME.html | Italian | various | written | c. 570 000 (UCL list) | 2003- | planned: POS, error tagging | Freely searchable online: http://www.corpora.unito.it/valico/valico.php | Manuel Barbera: manuel.barbera@bmanuel.org
not in VLO | Italy | LOCCLI: Longitudinal Corpus of Chinese Learners of Italian | Italian | Chinese | written | Log in: https://www.unistrapg.it/cqpweb/
not in VLO | Italy | COLI: Corpus of Chinese Learners of Italian | Italian | Chinese | written, spoken | 82 300 (UCL list) | Log in: https://www.unistrapg.it/cqpweb/
not in VLO | Italy | The Padova Learner Debate Corpus (PLDC) | English | non-native e.g. Italian | CMC | c. 87 000 | under development | Header: biographic information about the student + text information; no error annotation or POS tagging since deemed unnecessary. | F. Dalziel & F. Helm. 2008. Exploring modality in a learner corpus of online writing. In: Carol Taylor Torsello et al (eds) Corpora for university language teachers. Peter Lang | Fiona Dalziel & Francesca Helm, University of Padua, Italy | Student work from blended learning classes. Genres: various. Longitudinal. 5 online debates
Japan | CCEAUS: The Corpus of English Essays written by Asian University Students | English | various | written | Shin Ishikawa, Kobe, Japan
Japan | ICCL: The International Corpus of Crosslinguistic Interlanguage | English | various | written
Japan | ICNALE: The International Corpus Network of Asian Learners of English | English | Chinese, Indonesian, Japanese, Korean, Malay, etc. | written, spoken
Japan | JEFELL: The Japanese English Foreign Language Learner Corpus | English | Japanese | written
Japan | NICT JLE: The Japanese Learner English Corpus | English | Japanese | spoken
Japan | SILS: SILS learner corpus of English | English | various, mainly Japanese | written
Japan, France, Switzerland | IPFC: The "Interphonologie du Français Contemporain" corpus | French | Cypriot Greek, Dutch, English (Canada), German, Japanese, Norwegian, Spanish | spoken
Korea | NICKLE: The Neungyule Interlanguage Corpus of Korean Learners of English | English | Korean | written, spoken
Korea | SKELC: The Seoul National University of Korean-speaking English learner corpus | English | Korean | written
Korea | The Yonsei English Learner Corpus (YELC) | English | Korean | written
not in VLO | Latvia | Esam | http://www.esamkorpuss.lv/ | Latvian, Lithuanian | Latvian, Lithuanian | written essays | ~52K tokens: LV->LT ~42K, LT->LV ~7K | 2007-2014 | POS, lemma, sentence type, errors | publicly available online | www.esamkorpuss.lv | http://ucrel.lancs.ac.uk/cl2015/doc/CL2015-AbstractBook.pdf | inga.s.znotina@gmail.com | error tagging not finished; more material to be added later | TEITOK | Latvia
not in VLO | Lithuania | LLC | Lithuanian | Arabic, Azeri, Bulgarian, Czech, Chinese, English, Finnish, Georgian, German, French, Italian, Japanese, Korean, Latvian, Polish, Russian, Slovak, Spanish, Turkish, Ukrainian | mainly descriptive essays, but also emails, postcards, menus | growing (46 000 tokens now) | 2017-2021 | Planned: normalization, anonymization, error annotation | will be available online when ready | not decided yet | http://teitok.iltec.pt/llc/ | not available yet | jurate.ruzaite@vdu.lt | in preparation; not listed in the UCL list | TEITOK | Lithuania
Luxembourg | The deL1L2IM corpus | German | Russian-Belorussian bilinguals | written | Sviatlana Höhn, Luxembourg
Malaysia | LCEA: The Learner Corpus of Engineering Abstracts | English | Malaysian | written
Malaysia | MACLE: The Malaysian Corpus of Learner English | English | Malay | written
Malaysia | MCSAW: The Malaysian Corpus of Students argumentative writing | English | Malay, Chinese Indian | written
Yes | not in VLO, not in CLARINO list, but in Språkbanken which is a link from CLARINO's site. Should be available through http://clarino.uib.no/korpuskel/ | Norway | ASK | https://www.nb.no/sprakbanken/show?serial=oai:clarino.uib.no:nob-ask&lang= | Norwegian | Albanian, Bosnian-Croatian-Serbian, Dutch, English, German, Polish, Russian, Somali, Spanish, Vietnamese (plus Norwegian Bokmål and Norwegian Nynorsk as controls) | written student test essays | 1736 learner texts (618,293 tokens); 200 texts in the control corpus of Norwegian as L1 (72,000 tokens) | 2003-2006 | POS, grammatical and error annotation | Searchable online through Corpuscle, downloadable upon agreement | CLARIN-RES-PRIV (BY-ID-NORED-PERM-PLAN) | clarino.uib.no/ask/ | Selected publications: (1) Tenfjord, Kari. (2004). ASK – A Computer Learner Corpus. In Peter Juel Henrichsen (ed.) CALL for the Nordic Languages. Tools and Methods for Computer Assisted Language Learning. København: Copenhagen Business School 2004. ISBN 87-593-1176-2. pp. 147-158. (2) Tenfjord, Kari; Hagen, Jon Erik; Johansen, Hilde. (2006). The hows and whys of coding categories in a learner corpus (or How and why an error-tagged learner corpus is not ipso facto one big comparative fallacy). In Rivista di Psicolinguistica Applicata (RiPLA) 2006;VI(3):93–108. | Ann-Kristin.Gujord@uib.no, siljera@hvl.no | This info is for the main corpus. In addition, some subcorpora are accessible. | Norway
not in VLO, not in CLARINO. CLARINO has no knowledge about this corpus (P.C. Koenraad de Smet) | Norway | The EVA corpus of Norwegian School English | http://clu.uni.no/icame/ij21/eva-corp.pdf | English | Norwegian (+ British L1 control group) | spoken | 35 000 words + control corpus | Angela Hasselgren, Bergen | E. Hasselgren. 1997. The EVA corpus of Norwegian School English. ICAME journal 21: 124-125
not in VLO (PELCRA is included) | Poland | PLEC: The PELCRA Learner English Corpus | http://pelcra.pl/plec/ | English | Polish | written, spoken | 3 million words | POS tagged (Stanford POS tagger) | Lists can be downloaded. + Search engine for open online searches: http://pelcra.pl/PLEC/search.jsp; can be downloaded if you contact them | University of Łódź | http://pelcra.pl/new/spoken_corpora_50 | essays, letters, MA dissertations etc
not in VLO (but ICLE is included under LINDAT in VLO) | Poland | PICLE: Polish component of ICLE | http://ifa.amu.edu.pl/~kprzemek/PICLE_search.php (last update 2006) | English | Polish | written | c 330 000 words | mostly open (one concordancer behind login): http://ifa.amu.edu.pl/~kprzemek/PICLE_search.php | http://ifa.amu.edu.pl/~kprzemek/PICLE_search.php | Przemyslaw Kaszubski | Links in Louvain list do not work; listed here: http://clip.ipipan.waw.pl/LRT
not in VLO | Poland | FLEC: The Foreign Language Examinations Corpus | http://www.anakuma.pl/flec | Multilingual | Polish | written | in progress. Plan: phonology, morphosyntax, semantics, error, transfer | seemingly unavailable
not in VLO | Portugal | COPLE2 | Portuguese | 15: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian and Swedish. | written essays and oral productions | Growing, currently: 1058 texts; 171,461 tokens | 2014- | Student modifications; Teacher corrections; Learner Errors | Freely available online (downloadable by request) | http://alfclul.clul.ul.pt/teitok/learnercorpus/ | (1) Mendes, A., Antunes, S., Janssen, M. and Gonçalves, A. (2016). The COPLE2 Corpus: a Learner Corpus for Portuguese. In Proceedings of LREC 2016, Portorož, Slovenia. (2) del Río, I., Antunes, S., Mendes, A. and Janssen, M. (2016). Towards error annotation in a learner corpus of Portuguese. In Proceedings of the 5th NLP4CALL and 1st NLP4LA workshop in Sixth Swedish Language Technology Conference (SLTC). Umeå University, Sweden, 17-18 November.
not in VLO | Portugal | PEAPL2 | Portuguese | 38: Arabic, Baoulé, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Farsi, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Italian, Japanese, Korean, Latvian, Lithuanian, Polish, Portuguese, Romanian, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tagalog, Tetum, Thai, Turkish, Ukrainian | written essays | Growing, currently: 628 texts; 163,219 tokens | 2008- | Planned: POS, lemma, normalization | publicly available online: http://teitok.iltec.pt/peapl2/index.php?action=cqp&act=advanced | Martins, Cristina (2013): "O Corpus de Produções Escritas de Aprendentes de PL2 (CELGA). Caracterização e desenvolvimento de uma infra-estrutura de investigação." In Português Língua Não Materna: investigação e ensino, ed. Rosa Bizarro, Maria Alfredo Moreira e Cristina Flores, 70-79. ISBN: 9789727579280. Lisboa: Lidel.
Serbia | PLC: The Persian Learner Corpus | Persian (Farsi) | various | written
Serbia | SFLC: The Salam Farsi Learner Corpus | Persian (Farsi) | Serbian | written
Singapore | NUS corpus of Learner English | English | Several Asian languages | written
not in VLO | Slovenia | The PIKUST pilot learner corpus | Slovene | 18 different languages: Bosnian, Bulgarian, Chinese, Croatian, English, French, German, Hungarian, Italian, Macedonian, Polish, Romanian, Russian, Serbian, Slovakian, Spanish, Thai and Ukrainian. | written | 35000 words: 128 texts (119 learners) | 2001, 2005-2006 | Learner ID, text ID, title, date of creation, source of text, Slovene proficiency, L1, age, gender, Slovene-speaking ancestors, education, profession, parents' L1, other FL, years of learning Slovene, stays in Slovenia; error annotation (see article for specification) | Not clear | https://kuscholarworks.ku.edu/bitstream/handle/1808/5274/8Stritar.pdf;jsessionid=E149736FE0007998495DDCAA6FBBC87D?sequence=1 | Mojca Stritar | 92% from a Slovene language exam, 8% from courses
South Africa | The Tswana Learner English Corpus (TLEC) | English | Tswana | written
South Africa | RUDaF: Rhodes University Deutsch als Fremdsprache | German | English, Afrikaans, isiXhosa, XiTsonga | written