12 September 2025

AI application for historical newspapers

TRAs support BNTrAinee project in Digital Humanities

Economic historian PD Dr. Felix Selgert reports on his collaboration with computer scientist Dr. Moritz Wolter

IMG_3547.jpg © Felix Selgert

The first attempts at digitizing humanities research date back to the 1940s. For his dissertation on Thomas Aquinas' concept of presence, theologian Robert Busa had to compile a concordance of Aquinas' extensive work. With the support of IBM, he used punch cards for this purpose. The institutionalization of the field began in the 1960s, and around the turn of the millennium, the term “digital humanities” became established. For a long time, history lagged behind in this process, because the field works mainly with handwritten and printed sources that are stored in archives or libraries and, until recently, could only be viewed there. Only in recent years, with the digitization of some archives and the mass digitization of books and, above all, historical journals, has it become possible to apply natural language processing (NLP) methods to historical texts. These methods help address a challenge that digitization itself creates: how to select and process the texts relevant to a research question from millions of potentially relevant ones.

However, even with digitization, the application of NLP methods continues to face challenges. Modern OCR models, for example, do not cope well with the widespread use of blackletter typefaces and the complex column layout of historical newspapers, so the digital text versions are often flawed. We at the Department of History discovered this ourselves when we wanted to test topic models on digitized versions of the Kölnische Zeitung newspaper for a course. While searching for a solution, I came across the BNTrAinee program with the help of TRA Individuals and Societies, and my collaboration with Moritz Wolter began.

As part of a project group, we worked with history and computer science students to create a training dataset for recognizing the layout of the Kölnische Zeitung newspaper and trained an initial convolutional neural network. With financial support from TRA Modelling, we were able to continue the project and expand the training dataset to include additional newspapers and advertisements. The annotated training dataset now comprises 801 pages with over three million words and is publicly available via GitLab. This provides researchers with the largest German-language newspaper dataset set in blackletter typeface. There are plans to expand the dataset further and make it usable for training transformer architectures. In addition, with the help of the supercomputers Marvin in Bonn and JUWELS at Forschungszentrum Jülich, we have trained a complete pipeline of convolutional networks and long short-term memory (LSTM) cells that recognizes layout elements and text on a newspaper page and makes them available digitally in XML format. We were able to publish the results of our work in the Journal of Data-centric Machine Learning Research in mid-2025.

BNTrAinee - Bonn Transdisciplinary Training in Artificial Intelligence Behavior
 
As part of the project, the existing AI expertise in computer science is structurally linked with the various disciplines that use AI, so that needs-based teaching and learning opportunities can be developed jointly. These opportunities are provided via a learning platform.

Research infrastructure at the University of Excellence in Bonn

Libraries and digital services are among the central services offered at the University of Bonn and are open to all researchers. They include the services provided by the University and State Library (ULB), the High-Performance Computing and Analytics Lab (HPC/A Lab), and the Bonn Center for Digital Humanities (BCDH).

PD Dr. Felix Selgert

Department of Constitutional, Social, and Economic History

University of Bonn

fselfert@uni-bonn.de

Dr. Moritz Wolter

High Performance Computing & Analytics Lab (HPC/A)

University of Bonn

moritz.wolter@uni-bonn.de
