MacBERTh: Development and Evaluation of a Historically Pre-trained Language Model for English (1450-1950)

Abstract

The new pre-train-then-fine-tune paradigm in Natural Language Processing (NLP) has made important performance gains accessible to a wider audience. Once a large language model has been pre-trained, deploying it requires comparatively little infrastructure and offers robust performance on many NLP tasks. The Digital Humanities (DH) community has been an early adopter of this paradigm. Yet, a large part of this community is concerned with the application of NLP algorithms to historical texts, for which large models pre-trained on contemporary text may not provide optimal results. In this paper, we present “MacBERTh”—a transformer-based language model pre-trained on historical English—and exhaustively assess its benefits on a large set of relevant downstream tasks. Our experiments highlight that, despite some differences across target time periods, a model pre-trained on historical language from scratch outperforms models pre-trained on present-day language and later adapted to historical language. [Note: The updated and extended version of this paper is “Adapting vs Pre-training Language Models for Historical Languages”.]

Publication
In Proceedings of the International Workshop on Natural Language Processing for Digital Humanities (NLP4DH)

Evaluation code is available through the project’s repository. “MacBERTh” itself is published under the identifier emanjavacas/MacBERTh and can be loaded with the transformers library (Wolf et al. 2019).
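
As a minimal sketch of how the released model can be used, the snippet below loads MacBERTh with the transformers library under the identifier given above, assuming the standard BERT-style masked-language-modelling setup; the example sentence and variable names are illustrative only.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the historically pre-trained tokenizer and model by their published identifier.
tokenizer = AutoTokenizer.from_pretrained("emanjavacas/MacBERTh")
model = AutoModelForMaskedLM.from_pretrained("emanjavacas/MacBERTh")

# Illustrative masked-token prediction on an early modern English sentence (hypothetical example).
inputs = tokenizer("Thou shalt not [MASK] thy neighbour.", return_tensors="pt")
outputs = model(**inputs)

# Find the position of the [MASK] token and take the highest-scoring vocabulary item.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```

The same identifier can be passed to AutoModel instead of AutoModelForMaskedLM when only contextual embeddings are needed for downstream fine-tuning.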