Kreuzberg, an open-source (MIT) document intelligence framework written in Rust, has released version 4.5.
It supports 12 programming languages, including Python, TypeScript/Node.js, PHP, Ruby, Java, C#, Go, Elixir, R, C, and WASM. Kreuzberg extracts text, structure, and metadata from more than 88 file formats, performs OCR, and generates embeddings. It is built for AI pipelines and document processing at scale.

What's new in v4.5

The headline change is that Kreuzberg now understands document structure (layout and tables), not just text, thanks to its integration of Docling's RT-DETR v2 (Docling Heron) model.

Kreuzberg runs this model inside a Rust-native pipeline. Layout extraction matches Docling's quality while running about 2.8x faster, with lower memory overhead and no Python dependencies.

Benchmarks on 171 PDF documents (academic papers, government and legal documents, invoices, OCR scans) show:

  • Structure F1: Kreuzberg 42.1% vs Docling 41.7%
  • Text F1: Kreuzberg 88.9% vs Docling 86.7%
  • Average processing time: Kreuzberg 1,032 ms/doc vs Docling 2,894 ms/doc
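
The 2.8x figure follows directly from the two average processing times. A minimal sketch of that arithmetic, using only the numbers reported above (the function names are illustrative, not part of Kreuzberg):

```python
# Average per-document times taken from the benchmark figures above.
KREUZBERG_MS_PER_DOC = 1032
DOCLING_MS_PER_DOC = 2894


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def docs_per_second(ms_per_doc: float) -> float:
    """Convert an average latency into single-stream throughput."""
    return 1000.0 / ms_per_doc


ratio = speedup(DOCLING_MS_PER_DOC, KREUZBERG_MS_PER_DOC)
print(f"speedup: {ratio:.1f}x")  # prints "speedup: 2.8x"
```

The throughput helper assumes documents are processed one at a time; real throughput depends on batching and hardware, which the benchmark summary does not specify.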

The speedup comes from Rust's native memory management, character-level text extraction with pdfium, ONNX Runtime inference, and Rayon parallelism across pages.
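
Page-level parallelism works because each page can be processed independently and the results reassembled in order. The sketch below shows only the shape of that idea in Python with a thread pool. It is not Kreuzberg's Rust/Rayon code, and `process_page` is a hypothetical stand-in for the real per-page work:

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_number: int) -> str:
    # Hypothetical stand-in for per-page work (text extraction, layout inference).
    return f"page {page_number} processed"


def process_document(page_count: int, workers: int = 4) -> list[str]:
    # Pages are independent, so they can be mapped over a worker pool.
    # Executor.map yields results in input order, so page order is preserved.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_page, range(1, page_count + 1)))


results = process_document(3)
```

Python threads do not give true CPU parallelism for pure-Python work, so this illustrates the fan-out pattern rather than the performance characteristics.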

Kreuzberg extracts text directly from the PDF's native text layer using pdfium, preserving exact character positions, font metadata (bold, italic, size), and Unicode encoding. Layout detection then classifies and organizes this text according to the document's visual structure. For pages without a native text layer, Kreuzberg automatically falls back to Tesseract OCR.
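
The native-text-first, OCR-fallback decision can be sketched per page as follows. This is a simplified illustration under stated assumptions: `Page`, `extract_page_text`, and `fake_ocr` are hypothetical names, not Kreuzberg's API, and the real pipeline may use a more nuanced test than "is the text layer empty":

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    number: int
    native_text: Optional[str]  # None or blank when the PDF has no text layer
    image: bytes = b""          # rendered page pixels, used only for OCR


def extract_page_text(page: Page, run_ocr: Callable[[bytes], str]) -> tuple[str, str]:
    """Return (text, source), preferring the native text layer over OCR."""
    if page.native_text and page.native_text.strip():
        return page.native_text, "native"
    return run_ocr(page.image), "ocr"


def fake_ocr(image: bytes) -> str:
    # Placeholder for an OCR engine call (assumption: any callable taking image bytes).
    return "text recognized from pixels"


pages = [Page(1, "Hello from the text layer"), Page(2, None, b"...")]
sources = [extract_page_text(p, fake_ocr)[1] for p in pages]
```

Passing the OCR engine in as a callable keeps the fallback decision separate from the engine itself, which makes the logic easy to test without an OCR installation.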

Text from PDFs with broken font CMap tables is now repaired automatically using per-character gap analysis at the page level.
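
The announcement does not describe the gap-analysis algorithm. One plausible reading is that word boundaries are inferred from the horizontal distance between adjacent characters instead of being trusted from the font's mapping. The sketch below illustrates that general technique only. The `Glyph` type, the `rebuild_line` function, and the 0.3 threshold are assumptions for illustration, not Kreuzberg's implementation:

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    char: str
    x0: float  # left edge of the character box
    x1: float  # right edge of the character box


def rebuild_line(glyphs: list[Glyph], ratio: float = 0.3) -> str:
    """Insert a space wherever the gap between neighbouring glyphs is wide
    relative to the median glyph width on the line (assumed heuristic)."""
    if not glyphs:
        return ""
    typical_width = median(g.x1 - g.x0 for g in glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x0 - prev.x1
        if gap > ratio * typical_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


line = [
    Glyph("h", 0.0, 5.0),
    Glyph("i", 5.2, 7.2),
    Glyph("y", 10.5, 15.5),  # wide gap before this glyph
    Glyph("o", 15.7, 20.7),
]
print(rebuild_line(line))  # prints "hi yo"
```

Scaling the threshold by a typical glyph width makes the heuristic independent of font size, though a production version would likely need to handle kerning, justified text, and mixed fonts.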