NorBERTo: A New LLM for Portuguese, Optimized for Real-World Scenarios

The landscape of Large Language Models (LLMs) continues to evolve rapidly, with increasing focus on language-specific models optimized for efficiency. In this context, NorBERTo emerges as a new encoder-only model based on the ModernBERT architecture, specifically trained for Brazilian Portuguese. Its introduction marks a significant step forward for Natural Language Processing (NLP) in this language, providing a powerful and versatile tool for developers and businesses.

NorBERTo stands out for its long-context support and the implementation of efficient attention mechanisms, crucial features for tackling complex NLP tasks. The model was trained on Aurora-PT, a newly curated corpus that represents, to date, the largest publicly available monolingual Portuguese resource. This corpus, comprising a substantial 331 billion GPT-2 tokens, was assembled from a variety of web sources and existing multilingual datasets, ensuring broad and diverse linguistic coverage.

Technical Details and Benchmark Performance

The ModernBERT architecture underlying NorBERTo was chosen for its processing capabilities and efficiency. Long-context support allows the model to process inputs that require considering large portions of text at once, overcoming the context-window limitations of previous Portuguese encoders. Efficient attention mechanisms, in turn, reduce the computational load during inference, making NorBERTo an attractive choice for resource-constrained environments.
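To illustrate why local attention reduces inference cost, the sketch below counts attention-score entries for full versus sliding-window self-attention. The window size of 128 is an assumption for illustration only, not NorBERTo's actual configuration.

```python
# Contrast full self-attention with a local (sliding-window) variant of
# the kind efficient long-context encoders use. Window size is assumed.

def full_attention_pairs(n):
    """Every token attends to every token: O(n^2) score entries."""
    return n * n

def windowed_attention_pairs(n, window=128):
    """Each token attends only to tokens within +/- window: O(n * w)."""
    return sum(min(n, i + window + 1) - max(0, i - window) for i in range(n))

n = 8192  # a long-context sequence length
print(full_attention_pairs(n))      # grows quadratically with n
print(windowed_attention_pairs(n))  # grows roughly linearly with n
```

The quadratic-versus-linear gap is what makes long inputs tractable on modest hardware.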

To evaluate its capabilities, NorBERTo underwent a rigorous benchmarking process on semantic similarity, textual entailment, and classification tasks, using standardized datasets such as ASSIN 2 and PLUE. The results are notable: the NorBERTo-large version achieved the best performance among the encoder models evaluated on the PLUE dataset, reaching an F1 of 0.9191 on MRPC and an accuracy of 0.7689 on RTE. On the ASSIN 2 dataset, NorBERTo-large recorded the highest entailment F1 (~0.904) among all encoders considered, although models like Albertina-900M and BERTimbau-large still hold an advantage in some metrics.
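For readers comparing these numbers against their own fine-tuned models, the sketch below shows how the two reported metrics, F1 for MRPC-style paraphrase detection and accuracy for RTE-style entailment, are conventionally computed. The label lists are hypothetical placeholders, not NorBERTo predictions.

```python
# Compute the positive-class F1 and plain accuracy used in the
# MRPC and RTE results above. Labels here are illustrative only.

def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class, as used for paraphrase detection."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def accuracy(y_true, y_pred):
    """Plain accuracy, as used for entailment classification."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical gold labels and predictions, for illustration only.
gold  = [1, 0, 1, 1, 0, 1]
preds = [1, 0, 1, 0, 0, 1]
print(round(binary_f1(gold, preds), 4))  # 0.8571
print(round(accuracy(gold, preds), 4))   # 0.8333
```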

Implications for On-Premise Deployment and Data Sovereignty

NorBERTo's design as a mid-sized encoder, "straight-forward to fine-tune" and "efficient to serve," makes it particularly suitable for "realistic deployment scenarios." This emphasis on efficiency and ease of integration is of great interest to organizations considering self-hosted or on-premise solutions for their LLM workloads. The ability to perform inference efficiently on local hardware can translate into a more favorable Total Cost of Ownership (TCO) compared to cloud services in the long term, in addition to offering greater data control.
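The TCO argument can be made concrete with a back-of-the-envelope comparison. Every figure below is an illustrative assumption, not a measured cost; the point is the crossover, not the absolute numbers.

```python
# Back-of-the-envelope TCO: self-hosted vs. cloud inference.
# All figures are illustrative assumptions, not measured costs.

def onprem_tco(hardware_cost, monthly_opex, months):
    """Upfront hardware plus ongoing power/maintenance."""
    return hardware_cost + monthly_opex * months

def cloud_tco(monthly_fee, months):
    """Recurring managed-inference fee."""
    return monthly_fee * months

HARDWARE = 8000.0  # assumed GPU server for a mid-sized encoder
OPEX     = 150.0   # assumed monthly power + maintenance
CLOUD    = 600.0   # assumed monthly managed-endpoint fee

for months in (12, 24, 36):
    print(months, onprem_tco(HARDWARE, OPEX, months), cloud_tco(CLOUD, months))
```

Under these assumed figures, cloud is cheaper in the first year, but on-premise wins by the second year and beyond, which is the long-term dynamic the article alludes to.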

For companies operating in regulated sectors or handling sensitive data, the ability to keep models and data within their own infrastructure is crucial for ensuring data sovereignty and regulatory compliance, for example with the GDPR. A model like NorBERTo, optimized for local deployment, allows for the construction of robust and performant NLP systems without compromising security or privacy. Its suitability as a "backbone" for Retrieval-Augmented Generation (RAG) systems in Portuguese is an additional advantage, enabling companies to integrate proprietary knowledge in a secure and controlled way.
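The retrieval step of such a RAG pipeline can be sketched as follows. The toy 3-dimensional vectors and file names below are hypothetical placeholders standing in for the real sentence embeddings an encoder like NorBERTo would produce.

```python
# Minimal sketch of the retrieval step in a RAG pipeline, where an
# encoder model supplies the embeddings. Vectors here are toy values.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, top_k=1):
    """Rank corpus passages by cosine similarity to the query embedding."""
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Hypothetical embeddings for two internal documents and one query.
corpus = {"politica_de_ferias.txt": [0.9, 0.1, 0.0],
          "relatorio_fiscal.txt":   [0.1, 0.8, 0.3]}
query = [0.85, 0.15, 0.05]
print(retrieve(query, corpus))  # → ['politica_de_ferias.txt']
```

The retrieved passage would then be passed, together with the user query, to a downstream generator, keeping proprietary documents entirely inside the company's infrastructure.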

Future Prospects for Portuguese NLP

The introduction of NorBERTo and the Aurora-PT corpus represents a significant advancement for the Portuguese NLP community. The availability of such a vast and high-quality corpus is a valuable resource for research and the development of new models and applications. NorBERTo, with its efficiency features and long-context support, positions itself as an ideal candidate for a wide range of applications, from text understanding and semantic search to sentiment analysis and classification.

For businesses and DevOps teams evaluating deployment options for Large Language Models, NorBERTo offers a concrete example of how models can be optimized for efficient use in on-premise environments. This approach can not only reduce operational costs but also strengthen control over one's AI infrastructure. AI-RADAR continues to monitor the development of solutions that balance performance and deployment requirements, providing analytical frameworks to evaluate the trade-offs between cloud and on-premise, as discussed in our dedicated section on on-premise LLM deployments.