## AI for Data Extraction from Scientific PDFs

A new artificial intelligence system can efficiently extract data from complex scientific PDF documents. It uses predefined schemas and controlled vocabularies to guide extraction, transforming documents into structured, analysis-ready records. The design addresses three challenges: optical character recognition (OCR) errors, long-document fragmentation, and the need for auditability. The architecture comprises document ingestion via resume-aware hashing, partitioning into caption-aware page-level chunks, and asynchronous processing with concurrency controls.

## Improving Accuracy with Schemas and Auditability

Predefined schemas significantly improve extraction fidelity for critical variables such as assay classification, outcome definitions, and follow-up duration. The system also maintains a complete trace of where each value originated, so results can be verified and audited. This approach promises to make biomedical evidence synthesis, a fundamental process for scientific research and the development of new therapies, more efficient and reliable.
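
The pipeline described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the names `ASSAY_TYPES`, `Record`, `content_hash`, `extract_assay`, `process_page`, and `run` are hypothetical, the vocabulary terms are invented for illustration, and the extraction step is a simple string match standing in for a model call. OCR correction and caption handling are omitted. The sketch shows only the general shape: hash the document so a rerun can skip it, process page-level chunks concurrently under a limit, accept only values from a controlled vocabulary, and keep the source document and page with every value.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical controlled vocabulary for a single schema field.
ASSAY_TYPES = {"ELISA", "PCR", "Western blot"}


@dataclass
class Record:
    doc_hash: str  # provenance: which document the value came from
    page: int      # provenance: which page-level chunk it came from
    assay: str     # value restricted to ASSAY_TYPES


def content_hash(data: bytes) -> str:
    """Hash the file contents so a processed document can be skipped on resume."""
    return hashlib.sha256(data).hexdigest()


def extract_assay(text: str) -> Optional[str]:
    """Stand-in for a model call: return the first vocabulary term found, if any."""
    lowered = text.lower()
    for term in sorted(ASSAY_TYPES):
        if term.lower() in lowered:
            return term
    return None


async def process_page(doc_hash: str, page: int, text: str,
                       limit: asyncio.Semaphore) -> Optional[Record]:
    async with limit:  # at most max_concurrency pages are in flight at once
        await asyncio.sleep(0)  # placeholder for an asynchronous model request
        assay = extract_assay(text)
    return Record(doc_hash, page, assay) if assay else None


async def run(pages: List[str], data: bytes, done: Set[str],
              max_concurrency: int = 2) -> List[Record]:
    doc_hash = content_hash(data)
    if doc_hash in done:  # already processed in an earlier run
        return []
    limit = asyncio.Semaphore(max_concurrency)
    tasks = [process_page(doc_hash, i, text, limit)
             for i, text in enumerate(pages, start=1)]
    results = await asyncio.gather(*tasks)  # results keep page order
    done.add(doc_hash)
    return [r for r in results if r is not None]


if __name__ == "__main__":
    demo_pages = ["Samples were measured by ELISA.", "Methods overview."]
    for record in asyncio.run(run(demo_pages, b"example-pdf-bytes", set())):
        print(record)
```

Because every `Record` carries the document hash and page number, a reviewer can trace any extracted value back to the chunk it came from, which is the kind of audit trail the article describes.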