Kreuzberg, an open-source framework written in Rust for extracting data from documents, has released version 4.3.0. This new version introduces performance improvements and adds PaddleOCR as an optional backend, thanks to a native Rust integration.
Comparative Benchmarks
New benchmarks have been published comparing Kreuzberg with other data extraction tools, including Apache Tika, Docling, Unstructured, PDFPlumber, PyMuPDF4LLM, MarkItDown, and MinerU. The tests run in a standardized Linux environment and measure throughput, extraction duration, memory consumption, CPU usage, latency, and extraction quality. According to the results, Kreuzberg achieves significantly higher throughput on formats such as PDF, DOCX, PPTX, and HTML.
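To make the metrics above concrete, here is a minimal sketch of how throughput, per-document latency, and peak memory could be recorded for any extractor. It is not the published benchmark harness: `benchmark`, `BenchResult`, and `fake_extract` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so it would undercount memory used by native code such as a Rust core.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BenchResult:
    documents: int
    total_seconds: float
    mean_latency_seconds: float
    throughput_docs_per_second: float
    peak_traced_bytes: int


def benchmark(extract: Callable[[bytes], str], documents: Sequence[bytes]) -> BenchResult:
    """Time a (hypothetical) extract() callable over in-memory documents."""
    if not documents:
        raise ValueError("need at least one document")
    latencies = []
    tracemalloc.start()
    try:
        for doc in documents:
            start = time.perf_counter()
            extract(doc)
            latencies.append(time.perf_counter() - start)
        # Peak size of Python-level allocations seen while tracing was on.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    total = sum(latencies)
    return BenchResult(
        documents=len(documents),
        total_seconds=total,
        mean_latency_seconds=total / len(documents),
        throughput_docs_per_second=len(documents) / total if total > 0 else float("inf"),
        peak_traced_bytes=peak,
    )


def fake_extract(data: bytes) -> str:
    # Stand-in for a real extractor; it only decodes the bytes.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(benchmark(fake_extract, [b"first", b"second", b"third"]))
```

A real comparison would swap `fake_extract` for each tool's own entry point and run every tool on the same files and hardware. Measuring CPU usage and extraction quality needs additional tooling that this sketch does not cover.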
What's New in Version 4.3.0
Version 4.3.0 introduces support for PaddleOCR via a native Rust integration, with automatic model downloading and caching. Six languages are currently supported: English, Chinese, Japanese, Korean, German, and French. This integration makes it easier to build pipelines that require high-quality OCR for Asian languages. The release also adds structured document data extraction and expands format support. LibreOffice has been removed as a dependency and replaced with native extraction for legacy formats such as .doc and .ppt. Dropping this external dependency simplifies deployment and reduces installation size, especially in containerized environments.
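The "automatic model downloading and caching" pattern mentioned above can be illustrated with a generic download-once sketch. This is an assumption about how such a cache commonly works, not Kreuzberg's actual implementation: `cached_model_path`, the `fetch` callable, and the cache layout are all hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def cached_model_path(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for the model at `url`, fetching only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the entry by a hash of the URL so any URL maps to a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / f"{key}.bin"
    if target.exists():
        return target  # cache hit: no network access needed
    data = fetch(url)
    # Write to a temporary file, then rename it into place, so an interrupted
    # write never leaves a truncated file at the final path.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. In practice it would wrap an HTTP client, and a production cache would likely also verify a checksum of the downloaded file.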
For those evaluating on-premise deployments, there are trade-offs to consider. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.