MacBERTh and GysBERT

Language Models for Historical English and Dutch

PDI-SSH 2020

How to cite:

MacBERTh (English): When using the English model (MacBERTh), please cite the following paper (BibTeX can be found using the ‘cite’ button in ‘Project Publications’):
- Manjavacas, Enrique & Lauren Fonteyn. 2022. Adapting vs. Pre-training Language Models for Historical Languages. Journal of Data Mining & Digital Humanities jdmdh:9152. https://doi.org/10.46298/jdmdh.9152
GysBERT (Dutch): When using the Dutch model (GysBERT), please cite the following paper (BibTeX can be found using the ‘cite’ button in ‘Project Publications’):
- Manjavacas, Enrique & Lauren Fonteyn. 2022. Non-Parametric Word Sense Disambiguation for Historical Languages. Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (NLP4DH), 123-134. Association for Computational Linguistics. https://aclanthology.org/2022.nlp4dh-1.16

MacBERTh and GysBERT are language models (more specifically, BERT models) pre-trained on historical textual material (date range: 1450-1950 for English, 1500-1950 for Dutch).

Researchers who interpret and analyse historical textual material are well aware that languages change over time, and that the way in which concepts and discourses of class, gender, norms and prestige function differs across time periods. It is therefore important that textual and linguistic material from the past is not interpreted from a present-day point of view, which is why NLP models pre-trained on present-day language data are less than ideal candidates for the job. That’s where our historical models can help.

MacBERTh, a model pre-trained on historical English (1450-1950), and GysBERT, a model pre-trained on historical Dutch (1500-1950), have both been published on the Hugging Face model hub.
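
For readers who want to try the models, the following is a minimal sketch of loading one of them with the `transformers` library and turning sentences into contextual vectors. The model identifiers `emanjavacas/MacBERTh` and `emanjavacas/GysBERT` are assumptions (this page does not spell them out), and the helper names `resolve_model_id` and `embed_sentences` are hypothetical. Mean pooling is just one simple way to get a sentence vector, not necessarily what the project team uses.

```python
# Assumed Hugging Face identifiers; check the model hub pages before relying on them.
MODEL_IDS = {
    "english": "emanjavacas/MacBERTh",
    "dutch": "emanjavacas/GysBERT",
}


def resolve_model_id(language: str) -> str:
    """Map a language name to the assumed Hugging Face model identifier."""
    key = language.strip().lower()
    if key not in MODEL_IDS:
        raise ValueError(f"No historical model listed for {language!r}")
    return MODEL_IDS[key]


def embed_sentences(sentences, language="english"):
    """Return one mean-pooled contextual vector per sentence (hypothetical helper)."""
    # Imported lazily so resolve_model_id works without these packages installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = resolve_model_id(language)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, dim)

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Example call (downloads the model on first use):
#   vectors = embed_sentences(["Something wicked this way comes."])
```

Passing `language="dutch"` would select the assumed GysBERT identifier instead.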

Past & Upcoming Talks

Software demonstration: meaning-based querying of historical corpora with MacBERTh

Because of great efforts by the corpus-linguistic community, a large number of historical texts have been digitized (and sometimes even …

Jul 27, 2022 7:00 AM — Jul 29, 2022 10:00 PM Anglia Ruskin University, Cambridge, United Kingdom.

Language Model Pre-Training for Historical English: Approaches and Evaluation

Machine-based exploration of culturally relevant datasets (e.g. newspapers, periodicals, correspondence or annals) often involves …

Jul 27, 2022 5:00 AM — 6:30 AM Tokyo (Online conference only)

More individuals, more problems: towards a reliable, semi-automatic data annotation procedure for large-scale diachronic corpora

Linguistic variation and change have long been studied as properties of community-level language, i.e. the “shared system of …

Jun 8, 2022 10:00 AM — 2:00 PM Saarbrücken, Germany

Meaning-based querying of historical corpora with MacBERTh

This demonstration will focus on MacBERTh, a BERT-based model pre-trained on Early Modern and Late Modern English (3.9B (tokenized) …

Jun 2, 2022 11:30 AM — Jun 1, 2022 1:00 PM University of Luxembourg (virtual + on site)

Something wicked this way comes: Analyzing language variation and change with MacBERTh

This presentation demonstrates how MacBERTh may help researchers to (i) access and (ii) analyse the functional-semantic information encoded in linguistic corpus data in a (semi-)automatic way. More specifically, after surveying the performance of the model on general downstream NLP tasks, we home in on a specific case study in morphosyntactic variation in English ing-forms.

May 10, 2022 2:15 PM — 4:00 PM Freie Universität Berlin, Germany

Lauren Fonteyn & Enrique Manjavacas

Project Publications

Adapting vs. Pre-training Language Models for Historical Languages

Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.

Enrique Manjavacas, Lauren Fonteyn

Adjusting scope: a computational approach to case-driven research on semantic change

The aim of this paper is to set out a ‘hands-off’ computational procedure to study specific cases of semantic change. The case we address is the development of the phrasal expression ‘to death’ from a literal, resultative phrase (e.g. ‘he was beaten to death’) into an intensifier (e.g. ‘We were just pleased to death to see her’). The methodology we outline may help tackle some common challenges in the use of vector representations to study similar cases of semantic change, and pinpoint (remaining) challenges.

Lauren Fonteyn, Enrique Manjavacas

MacBERTh: Development and Evaluation of a Historically Pre-trained Language Model for English (1450-1950)

This paper presents “MacBERTh”, a transformer-based language model pre-trained on historical English, and exhaustively assesses its benefits on a large set of relevant downstream tasks. Our experiments highlight that, despite some differences across target time periods, pre-training on historical language from scratch outperforms models pre-trained on present-day language and later adapted to historical language.

Enrique Manjavacas, Lauren Fonteyn

Non-Parametric Word Sense Disambiguation for Historical Languages

In this paper, we follow recent developments in non-parametric learning and show how LLMs can be efficiently fine-tuned to achieve strong few-shot performance on WSD for historical languages. We test our hypothesis using (i) a large, general evaluation set taken from large lexical databases, and (ii) a small real-world scenario involving an ad-hoc WSD task. Moreover, this paper marks the release of GysBERT, an LLM for historical Dutch (1500-1950).

Enrique Manjavacas, Lauren Fonteyn
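
To give a rough feel for what a non-parametric approach to WSD can look like, here is a minimal nearest-neighbour sketch: a query vector receives the majority sense of its closest labelled examples. It is an illustration under stated assumptions, not the method of the paper, which also involves fine-tuning. The toy 2-dimensional vectors stand in for contextual embeddings, the two sense labels borrow the ‘to death’ case mentioned on this page, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_sense(query, examples, k=3):
    """Return the majority sense among the k labelled vectors closest to query."""
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy vectors standing in for contextual embeddings of labelled usages.
EXAMPLES = [
    ([0.9, 0.1], "literal"),
    ([0.8, 0.2], "literal"),
    ([0.1, 0.9], "intensifier"),
    ([0.2, 0.8], "intensifier"),
]

# Classify a new usage by looking at its three nearest labelled neighbours.
predicted = knn_sense([0.85, 0.15], EXAMPLES, k=3)
```

Because nothing is trained here, adding a new sense only requires adding labelled examples, which is what makes this family of methods attractive in few-shot settings.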

Related Publications

Have you worked with MacBERTh and/or GysBERT? Get in touch with the project team to list your publication here.

The following teams have used and evaluated MacBERTh:

- Iiro Rastas, Yann Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, Filip Ginter. 2022. Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model. Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, 68-77.
- S. Menini, T. Paccosi, S. Tonelli, M. Van Erp, I. Leemans, P. Lisena, R. Troncy, W. Tullett, A. Hürriyetoglu, G. Dijkstra, F. Gordijn, E. Jürgens, J. Koopman, A. Ouwerkerk, S. Steen, I. Novalija, J. Brank, D. Mladenic, A. Zidar. 2022. A Multilingual Benchmark to Capture Olfactory Situations over Time. Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, 1-10.
- Massri, M. Besher, Inna Novalija, Dunja Mladenić, Janez Brank, Sara Graça da Silva, Natasza Marrouch, Carla Murteira, Ali Hürriyetoğlu, and Beno Šircelj. 2022. Harvesting Context and Mining Emotions Related to Olfactory Cultural Heritage. Multimodal Technologies and Interaction 6(7): 57.