More individuals, more problems: towards a reliable, semi-automatic data annotation procedure for large-scale diachronic corpora

Image credit: MacBERTh team

Abstract

Linguistic variation and change have long been studied as properties of community-level language, i.e. the “shared system of communicative conventions succinctly referred to as ‘the language’ of that community” (Anthonissen 2019: 126). In (historical) (socio-)linguistics, individual speakers are believed to align with their communities in an “orderly heterogeneous” fashion (Weinreich et al. 1968), meaning that, if all contextual factors are accounted for, linguistic variation at the level of the individual can be considered to be “reduced below the level of linguistic significance” (Labov 2012: 265). Consequently, the role of the individual was long not a popular topic in empirical research on language change (Raumolin-Brunberg & Nurmi 2011: 251), but this has certainly started to change over the last decade (e.g. Nevalainen et al. 2011; Schmid & Mantlik 2015; Petré & Van de Velde 2018; MacKenzie 2019; Fonteyn & Nini 2020; Anthonissen 2021; Bizzoni et al. 2022; Fonteyn & Petré 2022).

A considerable challenge for such empirical research into the relation between individual-level language use and population-level change, however, is that data must be collected for a sufficiently large number of individual language users across time, and that a sufficiently large number of data points must be included for each of these individuals. With the creation of large-scale diachronic corpora with author-level metadata, such as the Royal Society Corpus (RSOC; Fischer 2020) or the Early Modern Multiloquent Authors corpus (EMMA; Petré et al. 2019), it has become possible to collect large numbers of data points per individual, even for relatively low-frequency phenomena. But as the number of required data points grows, it becomes increasingly difficult for historical corpus linguists to process and annotate the collected data manually.

In this talk, Lauren will present a semi-automated data annotation procedure that relies on the MacBERTh model (Manjavacas & Fonteyn 2021, 2022), a BERT-based model pretrained on historical English. Using manually annotated data, MacBERTh was fine-tuned as a classifier trained to predict annotation labels for unseen data. She will also present some preliminary results on the model’s performance in annotating lexical material (disambiguating the various senses of the words mass and weight in RSOC) and grammatical constructions (distinguishing different types of ing-forms in EMMA).
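The abstract does not spell out the fine-tuning setup, but the description (a pretrained BERT-style model fine-tuned on manually annotated examples to predict labels for unseen attestations) matches a standard sequence-classification recipe. Below is a minimal sketch of what such a setup might look like with the Hugging Face transformers library; the checkpoint identifier, the toy sense labels for mass, and all hyperparameters are illustrative assumptions, not the actual pipeline presented in the talk.

```python
# Hedged sketch: fine-tuning a historical-English BERT model as a
# sense/construction classifier. The checkpoint name, the two-way toy
# label set, and the hyperparameters are assumptions for illustration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "emanjavacas/MacBERTh"  # assumed Hugging Face Hub identifier

# Toy manually annotated contexts of "mass", labelled by sense
# (hypothetical labels; a real annotation scheme would be richer).
examples = {
    "text": [
        "the mass of the body is proportional to its weight",
        "the priest said mass before the assembled congregation",
    ],
    "label": [0, 1],  # 0 = 'quantity of matter', 1 = 'religious service'
}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Tokenize the annotated contexts into model inputs.
dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="macberth-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()

# The fine-tuned classifier can then propose annotation labels for unseen
# corpus data, which human annotators only need to verify and correct.
```

The same recipe would carry over to the grammatical task by swapping in ing-form contexts and construction-type labels; only the annotated training data changes, not the architecture.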

Date
Jun 8, 2022 10:00 AM — 2:00 PM
Location
Saarbrücken, Germany
MacBERTh and GysBERT
Language Models for Historical English and Dutch