Software demonstration: meaning-based querying of historical corpora with MacBERTh

Abstract

Thanks to sustained efforts by the corpus-linguistic community, a large number of historical texts have been digitized (and sometimes even syntactically parsed and POS-tagged), which has enabled the automatic retrieval of words, phrases, and sentence structures by means of formal queries. The next step in corpus querying, then, would be to move from formal querying to semantic querying, which has proven a difficult challenge in the past. In recent years, however, it has become evident that ongoing progress in distributional semantic models, from type embeddings derived from algorithms like word2vec (Mikolov et al. 2013) to contextualized token embeddings produced by models such as BERT (Devlin et al. 2019), can help capture the denotations and connotations of linguistic items. Contextualized embeddings in particular perform very well in (semi-)automatic sense disambiguation and exemplar-based retrieval (see e.g. Fonteyn 2021 for an application to a linguistic case study).

This demonstration will focus on MacBERTh, a BERT-based model pre-trained on Early Modern and Late Modern English (3.9B tokenized words, time span: 1450-1950; Manjavacas & Fonteyn 2021, 2022). We will demonstrate how MacBERTh may help researchers (i) access and (ii) analyse the semantic information encoded in linguistic corpus data in a (semi-)automatic way. By creating embeddings with MacBERTh, researchers will also be able to map out the distances between senses of linguistic items (in different texts and at different time stages), formalized as distances between their vector representations. In this demonstration, we walk participants through two case studies (one lexical, one grammatical) that show how MacBERTh can support historical corpus analysis, from data collection through pre-processing to (error) analysis. During the demonstration, participants are guided through the open-source code notebooks (in Jupyter) and the two case studies (one of which was conducted with the help of our research assistants Nina Haket & David Natarajan). Participants are also encouraged to discuss possible research questions that MacBERTh (or its Dutch sibling GysBERT) may help address.
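To make the idea of representational distance concrete, the following is a minimal sketch of exemplar-based ranking with cosine distance. It is not taken from the demonstration notebooks: the function names, the labels, and the three-dimensional toy vectors are all hypothetical stand-ins. In practice, each vector would be a contextualized token embedding obtained from MacBERTh for one corpus attestation of the word under study.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_exemplar(exemplar, candidates):
    """Sort (label, vector) pairs from nearest to farthest from the exemplar."""
    return sorted(candidates, key=lambda item: cosine_distance(exemplar, item[1]))


# Hypothetical toy vectors standing in for token embeddings of one word
# attested in three different periods (assumption: real vectors would come
# from the model and have several hundred dimensions).
toy_tokens = [
    ("1900: sense B", [0.1, 0.9, 0.3]),
    ("1650: sense A", [0.9, 0.1, 0.0]),
    ("1820: sense A", [0.8, 0.2, 0.1]),
]

# Hypothetical exemplar: an attestation the researcher has labelled as sense A.
exemplar = [1.0, 0.0, 0.0]

ranked = rank_by_exemplar(exemplar, toy_tokens)
for label, vector in ranked:
    print(label, round(cosine_distance(exemplar, vector), 3))
```

Attestations that share a sense with the exemplar should end up near the top of the ranking, whereas a large distance suggests a different sense or usage. Cosine distance is only one possible choice of metric here.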

Date
Jul 27, 2022 7:00 AM — Jul 29, 2022 10:00 PM
Location
Anglia Ruskin University, Cambridge, United Kingdom.
MacBERTh and GysBERT
Language Models for Historical English and Dutch