Clinical entity extraction: a new approach to reduce noise

Precise extraction of clinical entities from medical notes and reports is critical. Encoder models, particularly BERT, fine-tuned for Named Entity Recognition (NER) have proven effective at this task. However, achieving high precision remains a challenge.

A new study presents a Noise Removal (NR) model that significantly improves the precision of BERT-based NER models. The NR model analyzes the probability sequences generated by the NER model and classifies each prediction as "weak" or "strong".

Overcoming the limitations of probability thresholds

The simplest way to filter predictions is to apply a probability threshold. However, because of how the softmax function behaves, Transformer architectures tend to assign high confidence scores even to uncertain predictions, so a threshold lets many of them through. The proposed NR model overcomes this limitation by adopting a supervised modeling strategy.
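To make the threshold problem concrete, here is a minimal Python sketch. It is an illustration only, not code from the study: the logit values, the `filter_by_threshold` helper, and the 0.9 cutoff are all assumptions chosen for the example. It shows that the top softmax probability depends heavily on the scale of the logits, so a large value does not by itself mean the prediction is reliable.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_by_threshold(predictions, threshold=0.9):
    """Hypothetical baseline: keep predictions whose top probability passes a cutoff."""
    return [p for p in predictions if p["prob"] >= threshold]

# Two logit vectors with the same ranking of classes.
# The second is the first multiplied by 5 (illustrative values).
logits_small = [2.0, 1.0, 0.0]
logits_scaled = [10.0, 5.0, 0.0]

top_small = max(softmax(logits_small))
top_scaled = max(softmax(logits_scaled))

# The scaled logits yield a far higher "confidence" for the same winning class.
predictions = [
    {"text": "entity A", "prob": top_small},
    {"text": "entity B", "prob": top_scaled},
]
kept = filter_by_threshold(predictions, threshold=0.9)
print([p["text"] for p in kept])
```

Both vectors rank the classes identically, yet only the scaled one clears the cutoff. A threshold therefore measures how peaked the distribution is, which is not the same as how trustworthy the prediction is.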

The NR model leverages advanced features such as the Probability Density Map (PDM), which captures the Semantic-Pull effect observed in Transformer embeddings. With these features, the model classifies predictions more accurately, reducing false positives by 50% to 90% across various clinical NER models.
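The supervised idea can be sketched in a few lines of Python. This is not the study's NR model and it does not implement the PDM: the three features (mean and minimum probability inside the predicted span, mean probability in the surrounding tokens), the tiny logistic regression, and the toy probability sequences are all assumptions made up for illustration. The sketch only shows the general pattern of training a classifier on the NER model's probability sequence instead of thresholding a single score.

```python
import math

def span_features(probs, start, end, window=2):
    """Hypothetical features from per-token entity probabilities.

    probs: entity-class probability for each token in the sentence.
    start, end: predicted entity span, end exclusive.
    """
    span = probs[start:end]
    context = probs[max(0, start - window):start] + probs[end:end + window]
    mean_span = sum(span) / len(span)
    min_span = min(span)
    mean_context = sum(context) / len(context) if context else 0.0
    return [mean_span, min_span, mean_context]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent for logistic regression."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def is_strong(w, b, x):
    """True if the classifier labels the prediction as strong."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5

# Toy, hand-made examples (not real data): (probs, start, end, label),
# where label 1 = strong prediction and 0 = weak prediction.
toy = [
    ([0.01, 0.02, 0.97, 0.98, 0.03, 0.01], 2, 4, 1),
    ([0.02, 0.99, 0.96, 0.02], 1, 3, 1),
    ([0.03, 0.01, 0.95, 0.02, 0.01], 2, 3, 1),
    ([0.30, 0.45, 0.92, 0.40, 0.35], 2, 3, 0),
    ([0.25, 0.91, 0.55, 0.30], 1, 3, 0),
    ([0.40, 0.93, 0.94, 0.45, 0.50], 1, 3, 0),
]

X = [span_features(p, s, e) for p, s, e, _ in toy]
y = [label for _, _, _, label in toy]
w, b = train_logreg(X, y)

labels = ["strong" if is_strong(w, b, x) else "weak" for x in X]
print(labels)
```

In the toy data, every weak example still has a span probability above 0.9, so a threshold would keep it. The classifier can separate the two groups because it also sees the shape of the probabilities around the span, which is the kind of signal a supervised model can exploit.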