A Reddit user has made available an archive of approximately one million files related to the Epstein case, converted to text format via Optical Character Recognition (OCR).

Dataset Details

The files are organized into 12 ZIP archives totaling under 2 GB. The initiative follows the positive feedback received on a previous release of 20,000 files. The author used Tesseract OCR for the conversion and plans to improve accuracy with DeepSeek-OCR-2 in the future.

Access and Usage

The files are available for download. The author invites interested users to test an eDiscovery AI platform built on these files, offering early access via direct message. Datasets of this kind can be useful for developing large language models (LLMs) for domain-specific analysis, particularly when run in on-premise environments to ensure data sovereignty. For those evaluating on-premise deployments, there are trade-offs to consider, as discussed in AI-RADAR's analytical frameworks on /llm-onpremise.
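As a minimal sketch of how such a release might be processed once downloaded, the ZIP archives can be streamed with Python's standard library. Note that the directory name, file extensions, and internal layout here are assumptions for illustration, not details confirmed by the release:

```python
import zipfile
from pathlib import Path

def iter_ocr_texts(archive_dir):
    """Yield (filename, text) pairs from every ZIP archive in a directory.

    Assumes each archive contains UTF-8 .txt files produced by OCR;
    the layout is hypothetical, not taken from the release notes.
    """
    for zip_path in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                if name.endswith(".txt"):
                    # errors="replace" guards against stray bytes in OCR output
                    text = zf.read(name).decode("utf-8", errors="replace")
                    yield name, text

# Hypothetical usage: count documents containing a search term
# hits = sum(1 for _, t in iter_ocr_texts("epstein_archive/")
#            if "deposition" in t.lower())
```

Reading the archives lazily like this avoids extracting all files to disk, which matters when iterating roughly a million documents.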