The first attempts at digitizing humanities research date back to the 1940s. For his dissertation on Thomas Aquinas' concept of presence, the theologian Roberto Busa had to compile a concordance of Aquinas' extensive work and, with the support of IBM, used punch cards to do so. The institutionalization of the field began in the 1960s, and around the turn of the millennium the term “digital humanities” became established. For a long time, history lagged behind in this process, because the discipline works mainly with handwritten and printed sources that are stored in archives or libraries and, for a long time, could only be consulted there. Only in recent years, with the digitization of some archives and the mass digitization of books and, above all, historical journals, has it become possible to apply natural language processing (NLP) methods to historical texts and thus to tackle the challenge that digitization itself creates: selecting and processing the texts relevant to a research question from millions of potentially relevant ones.
However, despite digitization, the application of NLP methods continues to face challenges. Modern OCR models, for example, do not cope well with the widespread use of blackletter typefaces and the complex column layouts of historical newspapers, so the digital text versions are often flawed. We in the Department of History discovered this ourselves when we wanted to test topic models on digitized issues of the Kölnische Zeitung for a course. While searching for a solution, I came across the BNTrAinee program with the help of the TRA Individuals and Societies, and my collaboration with Moritz Wolter began.
As part of a project group, we worked with history and computer science students to create a training dataset for recognizing the layout of the Kölnische Zeitung and trained an initial convolutional neural network. With financial support from the TRA Modelling, we were able to continue the project and expand the training dataset to include additional newspapers and advertisements. The annotated training dataset now comprises 801 pages with over three million words and is publicly available via GitLab. This provides researchers with the largest German-language dataset of newspapers set in blackletter type, and there are plans to expand it further and make it usable for training transformer architectures. In addition, with the help of the Bonn supercomputer Marvin and JUWELS at Forschungszentrum Jülich, we have trained a complete pipeline of convolutional networks and long short-term memory (LSTM) cells that recognizes the layout elements and text on a newspaper page and makes them available digitally in XML format. We were able to publish the results of our work in the Journal of Data-centric Machine Learning Research in mid-2025.
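To make the two-stage idea more concrete, here is a minimal sketch in Python/PyTorch: a small convolutional network assigns layout classes to page pixels, an LSTM-based recognizer transcribes a cropped text line, and the results are serialized into a simple PAGE-like XML tree. All module names, tensor shapes, class labels, and the XML layout are illustrative assumptions for this sketch, not the architecture or format of the published pipeline.

```python
import torch
import torch.nn as nn
import xml.etree.ElementTree as ET


class LayoutSegmenter(nn.Module):
    """Tiny fully convolutional net: pixel-wise layout classes
    (e.g. background, column, headline, advertisement)."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, page):          # page: (B, 1, H, W)
        return self.net(page)         # logits: (B, num_classes, H, W)


class LineRecognizer(nn.Module):
    """Convolutional feature extractor followed by a bidirectional LSTM,
    producing per-timestep character logits (CTC-style)."""

    def __init__(self, vocab_size: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),     # halve the height, keep the width
        )
        self.lstm = nn.LSTM(32 * 16, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, vocab_size)

    def forward(self, line):               # line crop: (B, 1, 32, W)
        f = self.features(line)            # (B, 32, 16, W)
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one step per pixel column
        out, _ = self.lstm(seq)
        return self.head(out)              # (B, W, vocab_size)


def page_to_xml(regions):
    """Serialize recognized regions into a simple PAGE-like XML string."""
    root = ET.Element("Page")
    for region_type, text in regions:
        region = ET.SubElement(root, "TextRegion", type=region_type)
        ET.SubElement(region, "Unicode").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = torch.rand(1, 1, 256, 192)      # dummy newspaper page
    line = torch.rand(1, 1, 32, 200)       # dummy text-line crop
    print(LayoutSegmenter()(page).shape)   # torch.Size([1, 4, 256, 192])
    print(LineRecognizer()(line).shape)    # torch.Size([1, 200, 100])
    print(page_to_xml([("column", "Beispieltext in Frakturschrift")]))
```

In a real pipeline, the segmenter's output would be post-processed into line regions that are cropped and fed to the recognizer, and the decoded text would be written back into the XML together with the region coordinates; the sketch only shows the interfaces between these stages.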