Healthcare data is difficult to use for training machine learning models because stringent privacy regulations restrict how patient records can be shared.

MultiGraSCCo: A new multilingual benchmark

To address these difficulties, researchers have created MultiGraSCCo, a multilingual benchmark for data anonymization. The benchmark uses machine translation to generate synthetic data in ten languages while preserving the original annotations of personal information.
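One common way to carry span annotations through machine translation is to mask each annotated span with a placeholder, translate the masked text, and then re-insert a localized surrogate value and recompute the character offsets. The sketch below illustrates that general idea only; it is not confirmed to be the procedure used for MultiGraSCCo. The names `Span`, `mask_spans`, `unmask_spans` and `fake_translate`, the labels `NAME` and `DATE`, and the example sentence are all invented for illustration, and `fake_translate` is a lookup table standing in for a real translation system.

```python
import re
from dataclasses import dataclass

# Placeholder format is an assumption made for this sketch.
PLACEHOLDER = re.compile(r"\[\[PII(\d+)\]\]")


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label such as "NAME" or "DATE"


def mask_spans(text, spans):
    """Replace each annotated span with a numbered placeholder."""
    ordered = sorted(spans, key=lambda s: s.start)
    parts, cursor = [], 0
    for i, span in enumerate(ordered):
        parts.append(text[cursor:span.start])
        parts.append(f"[[PII{i}]]")
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts), [s.label for s in ordered]


def unmask_spans(translated, labels, surrogates):
    """Insert surrogate values and recompute offsets in the translated text.

    Placeholders are resolved by their number, so the translation is free
    to reorder them.
    """
    parts, spans, cursor, length = [], [], 0, 0
    for match in PLACEHOLDER.finditer(translated):
        before = translated[cursor:match.start()]
        index = int(match.group(1))
        value = surrogates[index]
        start = length + len(before)
        spans.append(Span(start, start + len(value), labels[index]))
        parts += [before, value]
        length = start + len(value)
        cursor = match.end()
    parts.append(translated[cursor:])
    return "".join(parts), spans


def fake_translate(masked):
    """Stand-in for a real machine translation call (hypothetical)."""
    table = {
        "Herr [[PII0]] wurde am [[PII1]] entlassen.":
            "Mr. [[PII0]] was discharged on [[PII1]].",
    }
    return table[masked]


source = "Herr Müller wurde am 3. Mai entlassen."
source_spans = [Span(5, 11, "NAME"), Span(21, 27, "DATE")]

masked, labels = mask_spans(source, source_spans)
translated = fake_translate(masked)
target, target_spans = unmask_spans(translated, labels, ["Miller", "May 3"])

print(target)
for span in target_spans:
    print(span.label, repr(target[span.start:span.end]))
```

Masking keeps the translation system from altering or dropping the sensitive values, and re-inserting surrogates afterwards is the step where a value can be adapted to the target language.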

Benchmark details

The benchmark includes over 2,500 annotations of personal information, adapted to the cultural and linguistic context of each language. Medical professionals validated the quality of the translations, which supports the accuracy and usefulness of the data.

Applications and benefits

MultiGraSCCo can be used to:

  • Train annotators.
  • Validate annotations across institutions.
  • Improve the performance of automatic personal information detection systems.
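As an illustration of how a detection system might be scored against benchmark annotations, the sketch below computes exact-match span-level precision, recall and F1. It is a generic evaluation recipe, not the benchmark's official scoring script. The function name `span_prf`, the `(start, end, label)` tuple format and the example spans are assumptions made for this example.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: three gold spans, three predictions, two of them correct.
gold_spans = [(4, 10, "NAME"), (29, 34, "DATE"), (40, 52, "LOCATION")]
predicted_spans = [(4, 10, "NAME"), (29, 34, "DATE"), (60, 65, "NAME")]

precision, recall, f1 = span_prf(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same function can compare two annotators' span sets for one document, treating one set as the reference, which gives a simple agreement figure when checking annotations across institutions.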

The benchmark and its related guidelines are available to researchers, supporting the development of solutions for sharing healthcare data securely and in compliance with privacy regulations.