What has happened since you won first place in the university’s competition celebrating the birthday of Marvin, the HPC supercomputer?
Nicolas Kluge Correa: This year has been full of milestones. At the start of the year, we received funding from the TRA Sustainable Futures, then in March we won expert support and additional compute time in the Marvin competition, and in the summer we published a paper on Tucano. As a Brazilian, I’ve seen how language models like ChatGPT work far better in English than in many other languages—simply because there’s so much more high-quality English training data. With Tucano, I wanted to lay the foundation for a strong Portuguese language model. It was our pilot project.
What are your next steps?
Under the name Polyglot, we want to expand on what we achieved with Tucano and make it available for many more languages. In March 2026, we’re planning a four-day workshop in collaboration with the university’s IT and computer science departments. Researchers at the university will have the opportunity to learn how to build large language models for so-called “low-resource languages.” We’ve also started working on models for Bengali and Hindi – naturally influenced by Shiza’s and Aniket’s backgrounds.
How far along are you with these new models?
Shiza Fatimah: We’ve compiled our training corpora and are almost ready to begin training. We’ve also added more data to Tucano. For the initial training, we had 200 billion tokens—that’s a lot for Portuguese. Since then, we’ve gathered another 100 billion tokens, so about a third of our current dataset was added in just one year. For Bengali, we have 20 billion tokens, and for Hindi, 80 billion. That might sound surprising given how many people speak these languages, but that’s the reality of the data situation.
But if more people speak a language, shouldn't there be more online text available?
Aniket Sen: You’d think so, but that’s not the case. Brazilians actively use the internet and produce a lot of content in Portuguese, giving us a solid foundation to build on. But for other languages, especially in post-colonial contexts like India, the situation is different. Many Bengali and Hindi texts are mixed with English, and high-quality material suitable for AI training—especially in academia—is often written directly in English. That makes it incredibly hard to gather clean, usable data. This is our biggest challenge: underrepresented languages simply don’t have a large enough digital footprint online.
How did you select the Portuguese training data for Tucano? Where does it come from?
Aniket Sen: Most of it comes from Common Crawl, a nonprofit that systematically scrapes web content and makes the data publicly available. It’s a huge, open dataset. But that’s just the start. We then have to clean the data: filter it, remove duplicates, and exclude problematic content.
Nicolas Kluge Correa: We specifically looked for educational content and used metadata to exclude extremist or toxic websites from the start. Even though we strive to remove toxicity from our datasets, it is almost impossible to train these models in a way that ensures they will never exhibit potentially problematic biases. We try to alleviate some of these issues with post-training and alignment, but ultimately, making language models that are fair and harmless, regardless of the context they are in, is a huge open problem for the field.
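For illustration, here is a minimal Python sketch of the kind of cleaning steps described above: a metadata blocklist, a simple quality filter, and exact deduplication via content hashes. The domain list, word threshold, and record format are assumptions made for this example, not the actual Tucano pipeline.

```python
# Illustrative sketch only (not the Tucano pipeline): filter web-crawled
# records by metadata and length, then drop exact duplicates by hash.
import hashlib

BLOCKED_DOMAINS = {"toxic-example-site.com"}  # hypothetical metadata blocklist
MIN_WORDS = 50                                # hypothetical quality threshold


def clean_corpus(records):
    """Yield cleaned records from an iterable of dicts with 'url' and 'text' keys."""
    seen_hashes = set()
    for record in records:
        url, text = record["url"], record["text"]
        # 1. Metadata filter: skip documents from blocklisted domains.
        if any(domain in url for domain in BLOCKED_DOMAINS):
            continue
        # 2. Quality filter: skip very short documents.
        if len(text.split()) < MIN_WORDS:
            continue
        # 3. Exact deduplication via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield record


if __name__ == "__main__":
    sample = [
        {"url": "https://pt.wikipedia.org/wiki/Tucano", "text": "O tucano é uma ave colorida. " * 20},
        {"url": "https://pt.wikipedia.org/wiki/Tucano", "text": "O tucano é uma ave colorida. " * 20},  # duplicate
    ]
    print(len(list(clean_corpus(sample))))  # prints 1
```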
What makes Tucano different from commercial models like ChatGPT?
Nicolas Kluge Correa: Tucano specializes in the Portuguese language, and the full training process – from pre-training to post-training – was conducted solely in this language. This makes it a very lightweight and specialized model. It is ideal for low-resource settings, such as when you need the AI to run locally on your phone. But its most significant difference is that it is genuinely open. Anyone can reproduce what we did – there is no secret sauce we are hiding.
Sophia Falk: And we definitely want to keep it that way. We’re researchers driven by curiosity and a desire to share knowledge. Turning it into a commercial product would go against our values.
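To illustrate what this openness means in practice, here is a minimal sketch of loading a small, openly released checkpoint locally with the Hugging Face transformers library. The model identifier is an assumption used for illustration; the actual released checkpoints are listed on the project’s model hub page.

```python
# Minimal sketch: running a small open language model locally with the
# Hugging Face transformers library. The model identifier below is an
# assumption for illustration; check the project's model hub page for the
# checkpoints that were actually released.
from transformers import pipeline

generator = pipeline("text-generation", model="TucanoBR/Tucano-160m")
result = generator("A capital do Brasil é", max_new_tokens=20)
print(result[0]["generated_text"])
```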
What’s behind the names of your language models?
Nicolas Kluge Correa: Polyglot means multilingual, which fits our goal of developing strong models in many languages. Tucano, our Portuguese model, is named after the colorful toucan, a native Brazilian bird. We chose the name because giving animal names to LLMs has become something of a trend. We’ll give the other models catchy names, too. Maybe someone at the University of Bonn would like to help us build a German-specific model? They’re welcome to help choose the name! So if you're reading this and want to join us—get in touch. We’d love that.
How does interdisciplinary collaboration work in your team?
Nicolas Kluge Correa: We’re not just colleagues—we’re friends. We met through the university’s International Club. Our different academic backgrounds have turned out to be a real advantage.
Sophia Falk: I focus on the sustainability of AI, so we’ve documented the energy use and CO₂ emissions of our training processes and made a big effort to minimize our ecological footprint. That’s why we started with small-scale experiments on smaller models. The idea is to learn and make mistakes on a small scale – when the energy footprint is not that significant – and once we have a clear picture of what to do, we scale up and use everything Marvin can offer. This approach has helped us reduce the energy and carbon footprint of our work.
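One common way to document the energy use and CO₂ emissions of a training run is the open-source codecarbon package. The sketch below shows how an experiment could be wrapped with such a tracker; it is illustrative only and not necessarily the team’s exact measurement setup.

```python
# Illustrative sketch: estimating the energy use and CO2 emissions of a
# training run with the open-source codecarbon package (not necessarily the
# measurement setup used for Tucano).
from codecarbon import EmissionsTracker


def run_small_experiment():
    # Placeholder for a small-scale training experiment.
    total = 0
    for step in range(1_000_000):
        total += step
    return total


tracker = EmissionsTracker(project_name="polyglot-pilot")  # writes emissions.csv
tracker.start()
try:
    run_small_experiment()
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```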
Aniket Sen: I brought in experience in high-performance computing. Nicolas had the idea for Tucano, and I immediately thought of leveraging Marvin for Polyglot. Without the university’s supercomputer, this project wouldn’t have been possible. We’re extremely grateful—because not every university has such infrastructure.
It must’ve been great to win extra compute time and support from the High Performance Computing (HPC) experts. How were you supported?
Shiza Fatimah: The best part was that we got to define exactly where we needed support—from optimizing our code to tailored one-on-one training sessions. The HRZ/HPC team’s courses on working with high-performance computing clusters were really helpful. And Jan Steiner from the HPC team was simply fantastic. He ran a two-day training with us where we learned so much – knowledge that we immediately put to use in improving our work and experiments. He also gave us a tour of Marvin, which was a highlight. And what makes us really happy: Jan will join us next year as an expert for our workshop on language model development.
What are your hopes for the future of Project Polyglot?
Nicolas Kluge Correa: With long-term funding, we could dedicate even more time to Polyglot. In March this year, we applied for funding from the Deutsche Forschungsgemeinschaft, and we hope to get a positive response. That would be huge for us.
Interview by Evelyn Stolberg