Scalable and Secure AI Inference in Healthcare

Efficient and scalable deployment of machine learning models is a prerequisite for modern production systems, particularly in regulated domains such as healthcare and pharmaceuticals. These settings demand a balance among three objectives: minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and maintaining strict adherence to data privacy standards.

This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes. We measured median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions.
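For concreteness, the FastAPI baseline can be approximated by a single-endpoint service that wraps the model in a Hugging Face pipeline. The following is a minimal sketch rather than the exact benchmarked implementation; the checkpoint name, endpoint path, and request schema are illustrative assumptions.

```python
# Minimal FastAPI baseline (illustrative sketch, not the paper's exact code):
# wraps a DistilBERT sentiment pipeline behind a single REST endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loaded once at process start; this is the stock Hugging Face
# DistilBERT checkpoint fine-tuned on SST-2 (an assumption here).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # pipeline() returns a list of {"label": ..., "score": ...} dicts;
    # a single input string yields a single-element list.
    result = classifier(req.text)[0]
    return {"label": result["label"], "score": float(result["score"])}
```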

Our results indicate a distinct trade-off. FastAPI provides lower overhead for single-request workloads, with a p50 latency of 22 ms, while Triton achieves superior scalability through dynamic batching, delivering 780 requests per second on a single NVIDIA T4 GPU, nearly double the throughput of the FastAPI baseline. Furthermore, we evaluate a hybrid architecture that uses FastAPI as a secure gateway for de-identifying protected health information (PHI) and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.
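The dynamic batching behind Triton's throughput advantage is configured declaratively in the model's config.pbtxt rather than in application code. Below is a minimal sketch assuming an ONNX export of the DistilBERT model; the model name, tensor shapes, batch size, and queue delay are illustrative, not the benchmarked settings.

```protobuf
# config.pbtxt (illustrative): enables Triton dynamic batching for an
# assumed ONNX export of DistilBERT. All values here are examples.
name: "distilbert_sentiment"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  { name: "input_ids",      data_type: TYPE_INT64, dims: [ -1 ] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [ -1 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ 2 ] }
]

# Requests arriving within the queue window are grouped into one batch,
# trading a small per-request delay for higher GPU utilization.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Batching concurrent requests amortizes kernel-launch and memory-transfer overhead across the batch, which is what allows a single GPU to sustain substantially higher throughput than a one-request-at-a-time baseline.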
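The hybrid pattern can likewise be sketched as a FastAPI gateway that scrubs PHI before forwarding tokenized input to Triton via its Python HTTP client. The de-identification step below is a deliberately naive regex placeholder, and the model and tensor names simply match the config sketch above; none of this is the paper's implementation.

```python
# Hybrid gateway sketch (illustrative): FastAPI de-identifies PHI, then
# forwards tokenized input to a Triton backend for inference.
import re
import numpy as np
import tritonclient.http as triton_http
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
# "triton:8000" is an assumed in-cluster service address.
client = triton_http.InferenceServerClient(url="triton:8000")

class PredictRequest(BaseModel):
    text: str

def deidentify(text: str) -> str:
    # Placeholder PHI scrubber: masks digit runs (MRNs, phone numbers).
    # A production gateway would use a vetted de-identification pipeline.
    return re.sub(r"\d+", "[REDACTED]", text)

@app.post("/predict")
def predict(req: PredictRequest):
    clean = deidentify(req.text)
    enc = tokenizer(clean, return_tensors="np")
    inputs = []
    for name in ("input_ids", "attention_mask"):
        tensor = triton_http.InferInput(name, list(enc[name].shape), "INT64")
        tensor.set_data_from_numpy(enc[name].astype(np.int64))
        inputs.append(tensor)
    result = client.infer("distilbert_sentiment", inputs)
    # SST-2 convention: index 0 = negative, index 1 = positive.
    logits = result.as_numpy("logits")[0]
    label = "POSITIVE" if logits[1] > logits[0] else "NEGATIVE"
    return {"label": label}
```

Keeping de-identification in the gateway means raw PHI never reaches the inference tier, which simplifies the compliance boundary around the GPU-serving infrastructure.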
