Tokenization and Foundation Models for Structured EHRs

Foundation models trained on longitudinal sequences of timestamped clinical events from electronic health records (EHRs) show great potential for learning adaptable patient representations. Tokenization, the process of converting these timelines into discrete model inputs, determines what information is preserved, how efficiently it is encoded, and which relationships the model must learn.
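As a minimal sketch of what such a conversion can look like, the snippet below turns a timestamped event timeline into a token sequence, inserting coarse time-gap tokens between events. This is one illustrative time-encoding choice, not the study's implementation; the event codes, gap thresholds, and token names are invented for the example.

```python
from datetime import datetime

# Illustrative patient timeline: (timestamp, clinical event code).
# Codes and timestamps are made up for demonstration.
timeline = [
    (datetime(2021, 1, 1, 8), "DX_ASTHMA"),
    (datetime(2021, 1, 1, 9), "MED_ALBUTEROL"),
    (datetime(2021, 3, 15, 10), "LAB_CBC"),
]

def tokenize_with_time_tokens(timeline):
    """Insert a coarse time-gap token between consecutive events.

    This represents one family of time encodings (explicit gap tokens);
    an alternative folds elapsed time into positional embeddings instead
    of spending sequence length on extra tokens.
    """
    tokens = []
    prev_ts = None
    for ts, code in timeline:
        if prev_ts is not None:
            gap_days = (ts - prev_ts).days
            if gap_days >= 30:
                tokens.append("GAP_1MO+")
            elif gap_days >= 1:
                tokens.append("GAP_1D+")
            # gaps under one day get no token in this toy scheme
        tokens.append(code)
        prev_ts = ts
    return tokens

print(tokenize_with_time_tokens(timeline))
# → ['DX_ASTHMA', 'MED_ALBUTEROL', 'GAP_1MO+', 'LAB_CBC']
```

Note the trade-off this makes visible: explicit gap tokens preserve temporal information in the vocabulary but lengthen the sequence, which is exactly the kind of efficiency consideration the study quantifies.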

A recent study analyzed the impact of tokenization choices on the performance and computational efficiency of a transformer pre-trained on pediatric EHR data. Several tokenization strategies were evaluated, varying event encoding, time encoding, and workflow annotation.

Key Findings

Joint event encoding and positional time encoding outperformed their alternatives in 73 of 74 and 71 of 74 clinical prediction tasks, respectively, while requiring 39.5% and 9.6% fewer floating-point operations during pre-training. The effectiveness of joint encoding was attributed to the efficiency of local binding: combining code-attribute pairs into single tokens, rather than splitting them into separate tokens whose association the model must learn.
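The contrast between the two event-encoding strategies can be sketched as follows. This is a toy illustration, not the study's code; the event codes, attribute bins, and separator character are assumptions made for the example.

```python
# Each clinical event is a (code, attribute) pair, e.g. a lab code and
# a binned value, or a medication code and a dose bin.

def joint_encode(events):
    """Joint encoding: each (code, attribute) pair becomes ONE token,
    binding code and attribute locally in the vocabulary."""
    return [f"{code}|{attr}" for code, attr in events]

def split_encode(events):
    """Split encoding: code and attribute become SEPARATE tokens, so the
    model must learn their association through attention."""
    tokens = []
    for code, attr in events:
        tokens.extend([code, attr])
    return tokens

events = [("LAB_GLUCOSE", "HIGH"), ("MED_INSULIN", "DOSE_BIN_3")]
print(joint_encode(events))  # → ['LAB_GLUCOSE|HIGH', 'MED_INSULIN|DOSE_BIN_3']
print(split_encode(events))  # → ['LAB_GLUCOSE', 'HIGH', 'MED_INSULIN', 'DOSE_BIN_3']
```

Joint encoding halves the sequence length for these pairs (which is consistent with the reported FLOP savings) at the cost of a larger vocabulary, since every observed code-attribute combination needs its own token.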

External evaluation on an adult intensive care unit cohort demonstrated that this advantage generalizes despite substantial vocabulary mismatch, while temporal and workflow effects remain institution-specific. These results highlight tokenization as a key factor in improving both the performance and efficiency of foundation models for EHRs.