
Resources

ilc | Louvain-la-Neuve, Mons

Corpora

Corpor@uclouvain: Some of the corpora compiled by members of our research institute are distributed on the Corpor@uclouvain catalogue. This catalogue contains learner corpora and corpora of various other types.
Learner corpora around the world: The Centre for English Corpus Linguistics maintains a list of learner corpora with relevant metadata and information about their availability for research purposes.
L2 learner corpora resource family: The CLARIN infrastructure provides access to 74 L2 learner corpora.

 

The core metadata schema for learner corpora (LC-meta)

The Core Metadata Schema for Learner Corpora contains a list of metadata fields that can be used to describe learner corpus data. 

One of the earliest efforts to address the need for metadata standardisation is Granger & Paquot (2017). This initiative was revived in 2022 in the form of a collaborative project between the Centre for English Corpus Linguistics (UCLouvain, Belgium), the Institute for Applied Linguistics at Eurac Research (Bolzano, Italy) and CLARIN ERIC. The following table provides the versioning history of the schema:

Version 2 (LC-meta): Paquot, M., König, A., Stemle, E. W., & Frey, J. (2024). « Core Metadata Schema for Learner Corpora (version 2) », https://doi.org/10.14428/DVN/AAUEM2, Open Data @ UCLouvain, UNF:6:D46/69S0DuhuxwMnT7rn9A== [fileUNF]

The second version of the schema is described in Paquot, M., König, A., Stemle, E. W., & Frey, J.-C. (forthcoming 2024). The Core Metadata Schema for Learner Corpora (LC-meta): Collaborative efforts to advance data discoverability, metadata quality and study comparability in L2 research. International Journal of Learner Corpus Research 10(2).

Version 1: Paquot, M., König, A., Stemle, E. & Frey, J.-C. (2023). Core Metadata Schema for Learner Corpora, https://doi.org/10.14428/DVN/4CDX3P, Open Data @ UCLouvain, V1, UNF:6:WhLZTg+knFe2FjjgxGg3Uw== [fileUNF]

Draft version: Granger, S. & Paquot, M. (2017). Towards standardization of metadata for L2 corpora. Invited talk at the CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, University of Gothenburg, Sweden.

Other resources and tools 

Some of the resources and tools developed by members of our research institute can be used to compile, annotate and analyze learner corpora:

A database of English dependencies with measures of frequency, association, range and keyness: This database includes dependencies extracted from the Louvain Corpus of Research Articles (LOCRA). Each of the 982,906 lines of the TSV (tab-separated values) file gives frequency values from the LOCRA corpus (as target corpus) and the ENCOW16 corpus (as reference corpus), as well as various measures of association, range and keyness.
Academic Keyword List: The Academic Keyword List contains 930 academic words that can be used to explore the lexical sophistication of L2 English learner language.
CEFRLex: The CEFRLex project offers several lexical resources graded according to the Common European Framework of Reference for language skills (CEFR).
FABRA: FABRA was first developed as a readability toolkit for French, based on the aggregation of a large number of readability predictor variables. In practice, the tool computes a large number of complexity measures typically used in L2 research.
fsca: fsca is an open-source R package for the extraction of syntactic units from dependency-parsed French texts.
Guide pratique de constitution de corpus: A set of guidelines (written in French) to help our students collect and document written and spoken corpora.
ICLE500: The ICLE500 dataset contains 500 argumentative essays from the International Corpus of Learner English (Granger et al., 2020) together with basic metadata (see the ICLE website for more information) and CEFR levels. The procedure used to map the texts to the CEFR is described in a technical report released with the dataset (Kanistra & Kollias, 2024).
ICLE1300: ICLE1300 provides basic text metadata and proficiency information in the form of comparative judgement (CJ) scores for 1300 argumentative texts from the International Corpus of Learner English (ICLE, Granger et al., 2020).
Recto-Verso: Software that automatically applies the 1990 French spelling corrections to a text.
Resyf: A French lexical resource with synonyms graded according to their level of difficulty.
TreeTagger: A web interface that facilitates the use of the TreeTagger tagger, developed at the Institute for Computational Linguistics at the University of Stuttgart.
UCLouvain Error Editor (UCLEEv2): Software meant to facilitate the insertion of error tags and corrections into learner texts, as well as their subsequent processing.

 

Publications

CECL papers: The CECL Papers aim to make available to the academic community a series of articles, books and technical papers related to activities (conferences, corpus collection, corpus annotation, etc.) led by the CECL. Several of these publications focus on L2 research (e.g. The Louvain Error Tagging Manual).
Learner Corpus Bibliography: The Learner Corpus Bibliography (LCB) is a collection of c. 2000 references related to Learner Corpus Research. The LCB was created and maintained by the CECL for many years. In 2013, the CECL agreed to share the LCB with the Learner Corpus Association, which currently maintains it in the form of a Zotero-based collection available to all its members.
The International Journal of Learner Corpus Research: The International Journal of Learner Corpus Research (IJLCR) is a forum for researchers who collect, annotate, and analyse computer learner corpora and/or use them to investigate topics in Second Language Acquisition and linguistic theory in general, inform foreign language teaching, develop learner-corpus-informed tools (e.g. courseware, proficiency tests, dictionaries and grammars) or conduct natural language processing tasks (e.g. annotation, automatic spell- and grammar-checking, L1 identification).