A framework for de-identification of free-text data in electronic medical records enabling secondary use
Louis Mercorelli A B, Harrison Nguyen C D, Nicole Gartell E, Martyn Brookes C, Jonathan Morris F and Charmaine S Tam C D *
A Sydney Informatics Hub, University of Sydney, NSW, Australia.
B Clinical Informatics Unit, Northern Sydney Local Health District, NSW, Australia.
C Performance and Analytics, Northern Sydney Local Health District, NSW, Australia.
D Faculty of Medicine and Health, University of Sydney, Office 543, Level 5, School of Computer Science (J12), NSW 2006, Australia.
E Health Information Services, Northern Sydney Local Health District, NSW, Australia.
F Clinical Excellence Commission, NSW, Australia.
Australian Health Review 46(3) 289-293 https://doi.org/10.1071/AH21361
Submitted: 23 November 2021 Accepted: 18 March 2022 Published: 12 May 2022
© 2022 The Author(s) (or their employer(s)). Published by CSIRO Publishing on behalf of AHHA.
Abstract
Clinical free-text data represent a vast, untapped source of rich information. If made more accessible for research, they would supplement the information captured in structured fields. Data need to be de-identified before being reused for research. However, a lack of transparency in existing de-identification software tools makes it difficult for data custodians to assess the potential risks associated with releasing de-identified clinical free-text data. This case study describes the development of a framework for releasing de-identified clinical free-text data in two local health districts in NSW, Australia. A sample of clinical documents (n = 14 768 965), including progress notes, nursing and medical assessments and discharge summaries, was used for development. An algorithm was designed to identify and mask patient names without damaging data utility. For each note, the algorithm output (i) the note length before and after de-identification, (ii) the number of patient names and (iii) the number of common words. These outputs were used to iteratively refine the algorithm's performance. This was followed by manual review of a random subset of records by a health information manager. Notes that were not correctly de-identified were fixed, and performance was reassessed until resolution. All notes in this sample were suitably de-identified using this method. Developing a transparent method for de-identifying clinical free-text data enables informed decision-making by data custodians and the safe re-use of clinical free-text data for research and public benefit.
Keywords: algorithm, de‐identification, documentation, e-health, EMR, governance, health services management, information management.
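The per-note outputs described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function name `deidentify`, the `[NAME]` mask, the tiny `COMMON_WORDS` set and the rule that a name token matching a common word is counted but left unmasked are all assumptions made for illustration, since the abstract does not state how the common-word count was used.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a common-word list; the real list is not given here.
COMMON_WORDS = {"may", "will", "mark", "rose"}
MASK = "[NAME]"  # assumed mask token


@dataclass
class NoteReport:
    text: str              # de-identified note
    length_before: int     # characters before masking
    length_after: int      # characters after masking
    names_masked: int      # name tokens replaced by MASK
    common_word_hits: int  # name tokens left in place because they are common words


def deidentify(note: str, patient_names: list[str]) -> NoteReport:
    """Mask known patient-name tokens and report simple per-note counts."""
    names = {n.lower() for n in patient_names}
    masked = 0
    common = 0

    def replace(match: re.Match) -> str:
        nonlocal masked, common
        token = match.group(0)
        key = token.lower()
        if key not in names:
            return token       # not a patient name: keep as written
        if key in COMMON_WORDS:
            common += 1        # assumption: keep common words to preserve utility
            return token
        masked += 1
        return MASK

    # Visit every alphabetic token (apostrophes allowed) in the note.
    text = re.sub(r"[A-Za-z']+", replace, note)
    return NoteReport(text, len(note), len(text), masked, common)


report = deidentify("Mrs Jane May reviewed. Jane may go home.", ["Jane", "May"])
print(report)
```

Under these assumptions, notes with a high `common_word_hits` count or an unexpected change in length are natural candidates for the manual review step, because a name that doubles as an ordinary word may have been left unmasked.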