Resources

Corpora

Corpor@uclouvain	Some of the corpora compiled by members of our research institute are distributed on the Corpor@uclouvain catalogue. This catalogue contains learner corpora and corpora of various other types.
Learner corpora around the world	The Centre for English Corpus Linguistics maintains a list of learner corpora with relevant metadata and information about their availability for research purposes.
L2 learner corpora resource family	The CLARIN infrastructure provides access to 74 L2 learner corpora.

The core metadata schema for learner corpora (LC-meta)

The Core Metadata Schema for Learner Corpora contains a list of metadata fields that can be used to describe learner corpus data.

One of the earliest efforts to address the need for metadata standardisation is Granger & Paquot (2017). This initiative was revived in 2022 in the form of a collaborative project between the Centre for English Corpus Linguistics (UCLouvain, Belgium), the Institute for Applied Linguistics at Eurac Research (Bolzano, Italy) and CLARIN ERIC. The versioning history of the schema is as follows:

Core Metadata Schema for Learner Corpora, version 2 (LC-meta)

Paquot, M., König, A., Stemle, E. W., & Frey, J. (2024). « Core Metadata Schema for Learner Corpora (version 2) », https://doi.org/10.14428/DVN/AAUEM2, Open Data @ UCLouvain, UNF:6:D46/69S0DuhuxwMnT7rn9A== [fileUNF]

The second version of the schema is described in Paquot, M., König, A., Stemle, E. W., & Frey, J.-C. (forthcoming 2024). The Core Metadata Schema for Learner Corpora (LC-meta): Collaborative efforts to advance data discoverability, metadata quality and study comparability in L2 research. International Journal of Learner Corpus Research 10(2).

Version 1

Paquot, M., König, A., Stemle, E. & Frey, J.-C. (2023). Core Metadata Schema for Learner Corpora, https://doi.org/10.14428/DVN/4CDX3P, Open Data @ UCLouvain, V1, UNF:6:WhLZTg+knFe2FjjgxGg3Uw== [fileUNF]

Draft version

Granger, S. & Paquot, M. (2017). Towards standardization of metadata for L2 corpora. Invited talk at the CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, University of Gothenburg, Sweden.

Other CKL2CORPORA resources and tools

Some of the resources and tools developed by members of our research institute can be used to compile, annotate and analyze learner corpora:

A database of English dependencies with measures of frequency, association, range and keyness	The database of English dependencies with measures of frequency, association, range and keyness includes dependencies extracted from the Louvain Corpus of Research Articles (LOCRA). Each of the 982,906 lines of the TSV (tab-separated values) file gives frequency values from the LOCRA corpus (as target corpus) and the ENCOW16 corpus (as reference corpus), as well as various measures of association, range and keyness described later.
Academic Keyword List	The Academic Keyword List contains 930 academic words that can be used to explore the lexical sophistication of L2 English learner language.
CEFRLex	The CEFRLex project proposes several lexical resources graded according to the Common European Framework of Reference for language skills (CEFR).
FABRA	FABRA was first developed as a readability toolkit based on the aggregation of a large number of readability predictor variables targeting French. In practice, the tool computes a large number of complexity measures typically used in L2 research.
fsca	fsca is an open-source R package for the extraction of syntactic units from dependency-parsed French texts.
Guide pratique de constitution de corpus	A set of guidelines (written in French) to help our students collect and document written and spoken corpora.
ICLE500	The ICLE500 dataset contains 500 argumentative essays from the International Corpus of Learner English (Granger et al., 2020) together with basic metadata (see ICLE website for more info) and CEFR levels. The procedure to map the texts to the CEFR is carefully described in a technical report released with the dataset (Kanistra & Kollias, 2024).
ICLE1300	ICLE1300 provides basic text metadata and proficiency information in the form of comparative judgement (CJ) scores for 1300 argumentative texts from the International Corpus of Learner English (ICLE, Granger et al., 2020).
Recto-Verso	Software that automatically applies the 1990 French spelling rectifications to a text.
Resyf	A French lexical resource with synonyms graded according to their level of difficulty.
TreeTagger	A web interface that facilitates the use of TreeTagger, the tagger developed at the Institute for Computational Linguistics at the University of Stuttgart.
UCLouvain Error Editor (UCLEEv2)	Software meant to facilitate the insertion of error tags and corrections into learner texts, as well as their subsequent processing.

Publications

CECL papers	The CECL Papers aim to make available to the academic community a series of articles, books and technical papers related to activities (conferences, corpus collection, corpus annotation, etc.) led by the CECL. Several of these publications focus on L2 research (e.g. The Louvain Error Tagging Manual).
Learner Corpus Bibliography	The Learner Corpus Bibliography (LCB) is a collection of c. 2000 references related to Learner Corpus Research. The LCB was created and maintained by the CECL for many years. In 2013, the CECL agreed to share the LCB with the Learner Corpus Association, which currently maintains it in the form of a Zotero-based collection available to all its members.
The International Journal of Learner Corpus Research	The International Journal of Learner Corpus Research (IJLCR) is a forum for researchers who collect, annotate, and analyse computer learner corpora and/or use them to investigate topics in Second Language Acquisition and linguistic theory in general, inform foreign language teaching, develop learner-corpus-informed tools (e.g. courseware, proficiency tests, dictionaries and grammars) or conduct natural language processing tasks (e.g. annotation, automatic spell- and grammar-checking, L1 identification).

Other tools, resources, and webpages we recommend

CIABATTA	CIABATTA stands for “Corpus In A Box: Automated Tools, Tutorials, & Advising.” It is a corpus-building toolkit, a collection of how-to resources delivered primarily through the CIABATTA GitHub wiki. CIABATTA provides templates for corpus building—examples, design patterns, best practices, and step-by-step processes—that provide a starting point for developing new corpora.
Tools for Corpus Linguistics	A comprehensive list of almost 300 tools used in corpus compilation and analysis.
Questions Ethiques et Cadre Juridique	This tool aims to support you in dealing with the ethical and legal framework that applies to corpora, throughout your linguistics research project. In its current version, the content focuses primarily on the protection of personal data.
