Kreuzberg v4.3.0: Open-Source Framework for Document Data Extraction

Kreuzberg, an open-source framework written in Rust for extracting data from documents, has released version 4.3.0. This new version introduces performance improvements and adds PaddleOCR as an optional backend, thanks to a native Rust integration.

Comparative Benchmarks

New benchmarks have been published comparing Kreuzberg with other data extraction tools, including Apache Tika, Docling, Unstructured, PDFPlumber, PyMuPDF4LLM, MarkItDown, and MinerU. The tests run in a standardized Linux environment and measure throughput, extraction duration, memory consumption, CPU usage, latency, and extraction quality. The published results show Kreuzberg achieving significantly higher throughput on formats such as PDF, DOCX, PPTX, and HTML.
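To make the measured quantities concrete, here is a minimal sketch of how throughput, duration, latency, and memory could be collected for any extraction tool. It is not the harness used for the published benchmarks: the `benchmark` function and the `fake_extract` stand-in are hypothetical names introduced for illustration, and no Kreuzberg API is called.

```python
import time
import tracemalloc
from typing import Callable, Iterable


def benchmark(extract: Callable[[bytes], str], documents: Iterable[bytes]) -> dict:
    """Run `extract` over every document and collect simple metrics.

    `extract` is a hypothetical callable wrapping whichever tool is under
    test; it is an assumption of this sketch, not a real library function.
    """
    docs = list(documents)
    latencies = []

    # tracemalloc only sees allocations made through Python's allocator,
    # so memory used inside native (for example Rust) code is not counted.
    tracemalloc.start()
    started = time.perf_counter()
    for doc in docs:
        t0 = time.perf_counter()
        extract(doc)
        latencies.append(time.perf_counter() - t0)
    duration = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "documents": len(docs),
        "duration_s": duration,
        "throughput_docs_per_s": len(docs) / duration if duration > 0 else float("inf"),
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "peak_python_memory_bytes": peak_bytes,
    }


def fake_extract(data: bytes) -> str:
    """Placeholder extractor: decodes bytes instead of parsing a document."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(benchmark(fake_extract, [b"first document", b"second document"]))
```

A real comparison would also need process-level memory and CPU measurements and a fixed document corpus, which this sketch leaves out.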

What's new in version 4.3.0

Version 4.3.0 introduces support for PaddleOCR via a native Rust integration, with automatic model downloading and caching. Six languages are currently supported: English, Chinese, Japanese, Korean, German, and French. This integration makes it easier to build pipelines that require high-quality OCR for Asian languages.

The release also adds structured document data extraction, expands format support, and removes LibreOffice as a dependency: legacy formats such as .doc and .ppt are now extracted natively. Having fewer external dependencies simplifies deployment and reduces installation size, especially in containerized environments.

For those evaluating on-premise deployments, there are trade-offs to consider. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.
