Meaning-based querying of historical corpora with MacBERTh

Abstract

This demonstration focuses on MacBERTh, a BERT-based model pre-trained on Early Modern and Late Modern English (3.9B tokenized words; time span: 1450-1950; Manjavacas & Fonteyn 2021, 2022). We will demonstrate how MacBERTh can help researchers (i) access and (ii) analyse the semantic information encoded in linguistic corpus data in a (semi-)automatic way. Because the contextualized embeddings MacBERTh produces can be used to measure semantic ‘closeness’ between word, phrase, or sentence tokens, they can be leveraged to enable automatic word sense disambiguation. In this demonstration, we walk participants through case studies that show how MacBERTh can be fine-tuned for historical word sense disambiguation, and we introduce an example case study from humanities research (prepared with the help of our research assistants Nina Haket & David Natarajan) in which automated word sense disambiguation is relevant.
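
To make the idea of semantic ‘closeness’ concrete, the following minimal sketch assigns a token to the sense whose labelled examples it is closest to, using cosine similarity between embedding vectors. It is an illustration only, not the fine-tuning approach covered in the demonstration: it uses a simple nearest-centroid rule, and the three-dimensional vectors and the sense labels `sense_a` and `sense_b` are invented placeholders standing in for the contextualized embeddings a model such as MacBERTh would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def disambiguate(token_vec, labelled):
    """Return the sense whose centroid is most similar to token_vec."""
    centroids = {sense: centroid(vecs) for sense, vecs in labelled.items()}
    return max(centroids, key=lambda sense: cosine(token_vec, centroids[sense]))


# Placeholder vectors (NOT real model output): a few hand-labelled
# example tokens per sense, each represented by a toy embedding.
labelled = {
    "sense_a": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "sense_b": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}

# Toy embedding of an unlabelled token to be disambiguated.
query = [0.85, 0.15, 0.05]
predicted = disambiguate(query, labelled)
print(predicted)
```

In an actual study the toy vectors would be replaced by embeddings extracted from the model for each occurrence of the target word in its sentence context; the similarity and assignment steps would stay the same.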

Date
Jun 2, 2022 11:30 AM — Jun 1, 2022 1:00 PM
Location
University of Luxembourg (virtual + on site)
MacBERTh and GysBERT
Language Models for Historical English and Dutch