MacBERTh and GysBERT are language models (more specifically, BERT models) pre-trained on historical textual material (date range: 1450-1950).
Researchers who interpret and analyse historical textual material are well aware that languages change over time, and that the way in which concepts and discourses of class, gender, norms and prestige function differs between time periods. It is therefore important that textual/linguistic material from the past is not interpreted from a present-day point of view, which is why NLP models pre-trained on present-day language data are less than ideal candidates for the job. That’s where our historical models can help.
At present, the model pre-trained on historical English (1450-1950), MacBERTh, has been published in the Hugging Face model repository. The release of the Dutch historical model, GysBERT, is planned for 2022.
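As a rough illustration of how the published English model could be queried, here is a minimal sketch that uses the fill-mask pipeline from the transformers library. The helper names (mask_word, top_predictions), the example sentence and the model identifier in the usage comment are assumptions made for illustration and are not stated in this text; please check the model page for the exact identifier.

```python
from typing import List, Tuple


def mask_word(sentence: str, target: str, mask_token: str = "[MASK]") -> str:
    """Replace the first whitespace-delimited occurrence of `target` with the mask token.

    "[MASK]" is the usual BERT mask token; this is an assumption about the
    tokenizer, so adjust it if the model's tokenizer reports a different one.
    """
    words = sentence.split()
    if target not in words:
        raise ValueError(f"{target!r} not found in sentence")
    words[words.index(target)] = mask_token
    return " ".join(words)


def top_predictions(masked_sentence: str, model_id: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for the masked position with their scores."""
    # Imported lazily so that mask_word can be used without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(r["token_str"], r["score"]) for r in fill(masked_sentence, top_k=k)]


# Hypothetical usage (downloads the model on first call):
# MODEL_ID = "emanjavacas/MacBERTh"  # ASSUMPTION: identifier not given in this text
# masked = mask_word("he was a gay and merry fellow", "gay")
# for token, score in top_predictions(masked, MODEL_ID):
#     print(f"{token}\t{score:.3f}")
```

Comparing the fillers suggested by a historical model with those of a present-day model for the same masked sentence is one simple way to see how the two differ in what they expect from a historical context.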
How to cite:
We have written up a paper describing the English model and its evaluation, which will be published soon. We will add the citation details as soon as they are known.
Because of great efforts by the corpus-linguistic community, a large number of historical texts have been digitized (and sometimes even …