06. March 2026

Researchers develop a ChatGPT for Portuguese

+++ RESEARCH TICKER UNI BONN: neural text generation +++

Large language models such as ChatGPT perform significantly worse in Portuguese than in English, even though both languages are spoken worldwide. This gap has now been closed with "GigaVerbo". The team led by Dr. Nicholas Kluge Corrêa from the Center for Science and Thought at the University of Bonn is now presenting the project in the journal "Patterns". The researchers were among the first to use the new "Marvin" supercomputer at the University of Bonn. Nicholas Kluge Corrêa and his colleague Aniket Sen are both members of the Transdisciplinary Research Area "Sustainable Futures" at the University of Bonn.

Team Tucano (from left): Dr. Nicholas Kluge Corrêa, Dr. Aniket Sen, Shiza Fatimah, and Sophia Falk took first place in the “most interesting results” competition, which was held as part of the event celebrating the first anniversary of the Marvin supercomputer at the University of Bonn. The event “Marvin's 1st Anniversary: 365 Days of Supercomputing” was jointly organized on March 25, 2025, by the HPC Team of the University Computing Center, the HPC/A Lab, and TRA Modelling. © Photo: Barbara Frommann/University of Bonn

WHAT IS IT ALL ABOUT?
GigaVerbo is the name of the dataset developed by the researchers. The project "Tucano: Advancing Neural Text Generation for Portuguese" aims to bridge the resource gap in Portuguese natural language processing (NLP) by providing high-quality datasets and cutting-edge language models designed specifically for Portuguese. By developing and releasing the GigaVerbo corpus, which comprises 200 billion deduplicated tokens, together with the Tucano family of models, the team wants to foster progress in neural text generation in an open and reproducible manner and to promote equitable access.

HOW DID THEY PROCEED?
The researchers collected several Portuguese corpora from different sources to ensure high linguistic diversity and quality. These corpora were then deduplicated and filtered to form the GigaVerbo dataset. Using this dataset, they trained several decoder models on the Marvin supercomputer, followed by rigorous evaluation and optimization cycles.
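
The deduplication and filtering step can be illustrated with a minimal Python sketch. It is not the Tucano team's actual pipeline: the normalization rule, the `min_words` threshold, the function names, and the sample sentences are all assumptions chosen for illustration, and the sketch only removes exact duplicates after whitespace and case normalization.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return " ".join(text.split()).lower()


def dedup_and_filter(docs, min_words=5):
    """Yield documents that are long enough and not exact duplicates.

    min_words is a hypothetical quality threshold, not a value from the paper.
    """
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # drop very short fragments
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates (after normalization)
        seen.add(digest)
        yield doc


# Illustrative sample corpus (invented for this sketch).
corpus = [
    "O tucano é uma ave nativa das florestas tropicais.",
    "O  tucano é uma ave nativa das florestas tropicais.",  # same text, extra space
    "Olá!",  # too short to keep
    "Bonn é uma cidade às margens do rio Reno.",
]
cleaned = list(dedup_and_filter(corpus))
```

In this sketch, `cleaned` keeps only the first and last sample sentences. Real web-scale corpora typically also need near-duplicate detection and learned quality filters, which this example does not attempt.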

WHAT GAP DOES THE PROJECT FILL?
The project addresses two major gaps. The first is the scarcity of comprehensive open-source resources for Portuguese, a language often overshadowed by resource-rich languages like English. The second is the lack of open-source LLM development, which impedes the scientific reproducibility of these models.

HOW DID YOU USE THE MARVIN SUPERCOMPUTER?
The Marvin cluster was crucial for training the Tucano models. We leveraged its powerful computing capabilities to process the large GigaVerbo dataset efficiently, train the Tucano series, and conduct extensive evaluations using multiple benchmarks.

WHAT IS THE NEXT STEP?
The researchers are currently working to scale up their developments in Portuguese by improving their dataset and training larger models. They are also developing resources for other low-resource languages, such as Bengali and Hindi, all thanks to Marvin and the University of Bonn.

WHO WAS INVOLVED IN THE PROJECT?
Nicholas Kluge Corrêa (Center for Science and Thought), Aniket Sen (High Performance Computing and Analytics Lab and Helmholtz Institute for Radiation and Nuclear Physics), Sophia Falk (Institute for Science and Ethics), and Shiza Fatimah (Institute for Computer Science).

WHAT IS THE SOURCE?
Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah: Tucano: Advancing Neural Text Generation for Portuguese, Patterns, DOI: 10.1016/j.patter.2025.101325

WHERE CAN I FIND OUT MORE?
Dr. Nicholas Kluge Corrêa, Transdisciplinary Research Area “Sustainable Futures”, Institute of Philosophy, Center for Science and Thought, Tel. +49 (0)228/73-54017, Email: kluge@uni-bonn.de, Internet: https://nkluge-correa.github.io/Tucano/

New: Tucano 2

The Polyglot project, funded by TRA Sustainable Futures (University of Bonn), develops open, efficient language models for underserved languages. With a budget of just €10,000, the team has released Tucano 2, a family of Portuguese language models (0.5–3.7B parameters) that outperform much larger multilingual systems, alongside LilMoo (Hindi) and LilTii (Bengali). In total, the project has published 28 models, more than 20 large-scale curated datasets, custom evaluation suites, and complete training recipes, all under permissive licenses. From March 10–13, 2026, the team hosts the Polyglot Workshop at the University of Bonn, teaching participants worldwide to build language technologies for their own communities. More: https://huggingface.co/Polygl0t ; https://huggingface.co/blog/Polygl0t/tucano2 ; https://huggingface.co/blog/Polygl0t/liltii