Arran Schlosberg (1, 2). (1) Sydney Medical School, The University of Sydney, NSW 2006, Australia. (2) Informatics Committee, The Royal College of Pathologists of Australasia, NSW 2010, Australia.
Sir,
I write with respect to the Technical Note “Generating unique identifiers (IDs) from patient identification data using security models,”[1] the authors of which propose a method to “create a unique one-way encrypted ID per patient that can be used for data sharing.” In summary, their method concatenates a patient's date of birth, sex, and surname and uses either the MD5 or SHA-1 cryptographic hash of this value as the record ID.
The authors conclude that this “can be used to share patient electronic medical records between practitioners without revealing patients' identifiable data.” Here, I demonstrate that this is not the case, and I recommend that the method not be used in circumstances in which the privacy of the underlying patient data is required.
The authors state that “the difficulty of coming up with any message having a given MD is on the order of 2^128 operations;” however, even in the absence of known weaknesses in the MD5 algorithm,[2] this assumes an unbounded input space. The proposed methodology is strictly limited by the number of feasible birth dates, names, and sexes: excluding leap days and assuming only binary sexes, the input space for a 100-year period is only 73,000 values per surname (365 days × 100 years × 2 sexes).
It is thus possible to perform a brute-force, precomputed attack utilizing common surnames. I calculated the proposed IDs for two sexes, birth dates spanning the entire century 1917–2016 inclusive, and the ten most common surnames in the 2000 USA census,[3] producing what is known as a rainbow table. This approach reduces the search space to < 2^23 values. Performed on my personal laptop, the computation took a mere 8.8 s to compromise the IDs of over 13 million people (based on census counts) for both MD5 and SHA-1.
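The precomputation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the timings reported in this letter: the input format (`YYYYMMDD` followed by a one-letter sex code and the upper-case surname), the function names, and the two example surnames are all assumptions made for the sketch. Unlike the 73,000 figure above, the sketch includes leap days, giving 73,050 candidate inputs per surname.

```python
import hashlib
from datetime import date, timedelta


def candidate_inputs(surnames, first_year, last_year, sexes=("M", "F")):
    """Yield every feasible plaintext: date of birth + sex + surname.

    The concatenation format is an assumption; the attack works the same
    way for any fixed format once it is known.
    """
    day = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    while day <= end:
        for sex in sexes:
            for surname in surnames:
                yield day.strftime("%Y%m%d") + sex + surname.upper()
        day += timedelta(days=1)


def build_lookup(surnames, first_year, last_year, algorithm="md5"):
    """Map each hashed ID back to the plaintext that produced it."""
    table = {}
    for plaintext in candidate_inputs(surnames, first_year, last_year):
        digest = hashlib.new(algorithm, plaintext.encode("utf-8")).hexdigest()
        table[digest] = plaintext
    return table


# Illustrative surnames only; a real attack would use a census list.
table = build_lookup(["Smith", "Johnson"], 1917, 2016, algorithm="md5")

# Reversing an "anonymous" ID is then a single dictionary lookup.
leaked_id = hashlib.md5(b"19800229FSMITH").hexdigest()
recovered = table.get(leaked_id)
```

Passing `algorithm="sha1"` builds the equivalent table for SHA-1; the stronger hash does not help because the input space, not the hash, is the weak point.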
The results of my calculations are available for download at https://goo.gl/xqwphs and constitute a reverse-lookup database that fully compromises the security of the proposed method.
It is trivial to modify the input format for the precomputed IDs and to extend the rainbow table to cover more surnames; in any case, under Kerckhoffs' principle (French original;[4] English elucidation[5]), secrecy of the input format would not contribute to security. Because each ID is computed independently of the others, this brute-force process is embarrassingly parallel:[6] computation can be shared across any number of devices (without modifying code), which decreases the time to compromise. A number of other weaknesses exist in the proposed methodology, but in the interest of brevity I limit myself to detailing the most severe one.
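The independence between IDs can be illustrated with a short Python sketch in which the work is partitioned by surname and each partition is hashed without reference to any other. As before, the input format, function names, and surnames are assumptions for illustration only. A thread pool is used here so that the example runs anywhere; for CPU-bound hashing one would more plausibly pass a `concurrent.futures.ProcessPoolExecutor`, or distribute the surnames across separate machines, with the per-surname function unchanged.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def ids_for_surname(surname, first_year=1917, last_year=2016):
    """Hash every (date of birth, sex) pair for one surname.

    Needs no shared state, so any worker can process any surname.
    """
    table = {}
    day = date(first_year, 1, 1)
    while day.year <= last_year:
        for sex in ("M", "F"):
            plaintext = f"{day:%Y%m%d}{sex}{surname}"  # assumed format
            digest = hashlib.sha1(plaintext.encode("utf-8")).hexdigest()
            table[digest] = plaintext
        day += timedelta(days=1)
    return table


def build_in_parallel(surnames, executor):
    """Farm out one surname per task and merge the partial tables."""
    merged = {}
    for partial in executor.map(ids_for_surname, surnames):
        merged.update(partial)
    return merged


with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_table = build_in_parallel(["SMITH", "JOHNSON"], pool)
```

Merging is a plain dictionary update because no two partitions can produce the same plaintext.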