Frode Eika Sandnes1,2. 1. Department of Computer Science, Faculty of Technology, Art and Design, Oslo Metropolitan University, Oslo, Norway. 2. School of Economics and Information Technology, Kristiania University College, Oslo, Norway.
Abstract
BACKGROUND: To ensure the privacy of participants is an ethical and legal obligation for researchers. Yet, achieving anonymity can be technically difficult. When observing participants over time one needs mechanisms to link the data from the different sessions. Also, it is often necessary to expand the sample of participants during a project. OBJECTIVES: To help researchers simplify the administration of such studies the CANDIDATE tool is proposed. This tool allows simple, unique, and anonymous participant IDs to be generated on the fly. METHOD: Simulations were used to validate the uniqueness of the IDs as well as their anonymity. RESULTS: The tool can successfully generate IDs with a low collision rate while maintaining high anonymity. A practical compromise between integrity and anonymity was achieved when the ID space is about ten times the number of participants. IMPLICATIONS: The tool holds potential for making it easier to collect more comprehensive empirical evidence over time that in turn will provide a more solid basis for drawing reliable conclusions based on research data. An open-source implementation of the tool that runs locally in a web-browser is made available.
The privacy and safety of participants is of utmost importance in research that involves people. The World Medical Association's Declaration of Helsinki–Ethical Principles for Medical Research Involving Human Subjects states that "Every precaution must be taken to protect the privacy of research subjects and the confidentiality of their personal information." Privacy is also regulated by legislation such as the General Data Protection Regulation (GDPR), which applies in the European Union. When researchers need to store personal information such as names, national ID numbers, or phone numbers, they need to document adequate mechanisms for secure storage of the data and routines for deleting the data at the end of a project. Often researchers must file formal applications for permission to store personal data. If the personal data include sensitive information, such as information about health or reduced functioning, the regulations and procedures are even stricter. Obviously, health-related research often involves sensitive information, but studies in other fields such as computer science could also involve vulnerable cohorts such as individuals with dyslexia [1], low vision [2, 3], or other disabilities. The issue of participant privacy in research studies is thus highly relevant in many disciplines.

If a study can be conducted in a single session per participant, it is usually straightforward to maintain anonymity, as one does not need to know the identity of the participant. For example, if a within-groups experiment where one observes how participants respond to two variations of an input technique can be conducted in a single session, there is no need to know who these participants are [4]. Such anonymous data usually do not raise privacy issues. The need to identify participants arises when one needs to consult participants several times.
For example, if a within-groups experiment takes too long to be completed in a single session, one may have to split the various conditions of the experiment into several separately scheduled sessions. In a pre-post experimental design, one may have a first session where participants are probed following some intervention, such as using some technology, and then a second session, scheduled later, where the participants are probed again to observe the effects of the intervention [5]. Alternatively, to observe how participants learn to use a technology over time, one needs to observe the participants at regular intervals [6, 7]. Longitudinal studies are used in several disciplines, including physical health [8, 9], mental health [10], and human-computer interaction [11-13]. The challenges associated with data linking for research have received much attention, and it has been pointed out that erroneous record linkage may result in biased results [14]. Most of the research into record linkage has been of a technical nature. However, there have also been efforts to explore participants' perceptions of volunteering for studies that require linking of data [15]. Findings showed that not all participants trust the anonymization mechanisms. Decisions to participate often rest on a balance between the sensitivity of the issues to be studied and the potential benefits the results may provide. When analysing multi-session data, it is necessary to link the data of one person in one session to the corresponding data in other sessions in order to perform paired tests, repeated-measures analyses, or similar analyses. Connecting data is straightforward with a linking table where each participant is assigned a running number. By labelling the data with the running number one avoids revealing the identity of that person in the data. One must assume that the linking table is kept confidential.
However, if the linking table is leaked, the privacy of participants is compromised. The goal is therefore to avoid such linking tables altogether.

Linking tables are still commonly used but often require the researchers to solicit formal approvals. Acquiring such formal permissions can be time-consuming, bureaucratic, and difficult, requiring experience with and knowledge of application writing. Students who want to conduct a multi-session experiment during a course running over a semester may be prevented from doing so by a lack of time and the competence needed to obtain the necessary permissions. Students therefore miss out on the valuable learning experience of conducting such experiments in practice. Moreover, it is a matter of concern if a researcher settles for a single-session experiment when a multi-session experiment would be a more suitable choice, simply to avoid the administrative and bureaucratic burden of obtaining formal permission to store linking tables. Even worse, a researcher may choose to ignore the regulations and store personal information. A key motivation of this work is thus to simplify, or even eliminate the need for, formal data storage approval procedures.

Many anonymous linking procedures have been proposed during the last decades. They broadly fall into two categories, namely self-generated codes and Bloom-filter approaches. Self-generated codes [16] rely on brief questionnaires that the participants complete during each session. The responses to a set of personal questions are used to construct a unique ID for each participant. One drawback of self-generated codes is that they require effort from the participants, diverting valuable attention away from the actual session activity. Moreover, participants may find the questions invasive. Self-generated codes have also been found to exhibit high error rates.

In contrast, Bloom-filter approaches [17] are automatic and therefore used in large-scale studies, especially studies involving register data.
Typically, the bigrams making up the participant's name are fed into a series of hash functions that are used to construct a bit vector. The use of bigrams means that the method is robust to input errors such as misspellings. One drawback of Bloom-filter approaches is the very long and cryptic IDs, typically 1,000 bits (or around 256 printable characters). Bloom filters have also been criticised for being vulnerable to attack [18].

Recent approaches have attempted to generate short IDs that are perceived as less threatening by participants. For instance, the HIDE procedure [19] uses truncated hashes and a stochastic search for a universal salt (an encoding parameter) that results in a set of short and unique IDs. A salt is a piece of data (a string) that is concatenated with a value before it is hashed. However, all the participants need to be known in advance. To overcome this limitation the BRIDGE procedure was proposed [20], which also uses truncated hashes. But instead of searching for a salt that results in unique codes for a fixed set of names, the procedure detects collisions and asks the researcher to resolve them manually using word challenges, i.e., the researcher or participant must confirm whether or not they recognize a given word.

Many of the published works on linking codes use the term "anonymous". Clearly, such schemes may be anonymous to an arbitrary onlooker, and this understanding of "anonymous" will also be used herein. However, the Oxford dictionary defines anonymity as a "lack of outstanding, individual, or unusual features". Under such a definition none of the published linking procedures can be considered fully anonymous, as enough features are stored that someone, usually the participant, would be able to identify their own data (using their name, a self-generated code, etc.).
The GDPR principle of the "right to be deleted" can be used to illustrate this: if it is possible to satisfy a participant's request to have their data deleted, the data are not truly anonymous. Although linking procedures may not be able to provide anonymity in a strict sense, they may assist experimenters in designing robust data-handling plans that reduce the risk of accidentally leaking personal information.

The goal of the CANDIDATE tool proposed herein is to overcome the challenges of previous linking procedures by providing privacy through short IDs with an automatic procedure that does not rely on effort from the researcher or the participants. The main emphasis in this study is on relying on participants' names as the basis for generating linking IDs. Soliciting additional information of a more private nature (such as the birth date) requires effort and time, may be perceived as intrusive, and may lead participants to withdraw from a study.

This paper is organized as follows. The next section reviews related work. Next, the procedure used by the CANDIDATE tool is presented, followed by an evaluation and discussion of its integrity and anonymity. Finally, concluding remarks are presented.
Related work
Self-generated codes form one major anonymization category, and many such procedures have been proposed [21-27]. A self-generated code is created from the responses to a simple questionnaire the participants complete during each session. For example, a six-character ID could be generated from the first letter of the mother's name, the number of older brothers, the month of birth, and the first letter of the middle name [28]. A study with 745 participants revealed that self-generated codes were successful in linking 75.2% of the records, while 22.1% remained unmatched and 2.7% were matched incorrectly. Successful linking of just 3 out of 4 participants may render this approach too unreliable. Similar results have been found in other studies [29], which concluded that self-generated codes are ineffective for longitudinal studies.

The alternative to self-generated codes is automatically generated codes. Several simple approaches have been proposed, including attempts relying on the Soundex algorithm [30, 31]. The Soundex algorithm converts words (names) into a more coarse-grained phonetic representation. In this sense the Soundex algorithm is a one-way (irreversible) function. However, in practice Soundex does not provide anonymity, as it is possible to match names according to their Soundex representations. It has been suggested that the anonymity of the Soundex encoding can be increased by obfuscating the IDs with additional dummy records [32]. The Soundex algorithm makes the procedure tolerant to input errors such as certain spelling mistakes [33-35].

Another simple approach is to encode participants using control numbers [36]. Such control numbers could be generated from pieces of demographic information such as the first letter of the surname, the date of birth, gender, the sum of the ASCII values of the characters of the name, etc. The availability of demographic information may be limited in some contexts.
More importantly, one should carefully question the actual anonymity of such schemes.

To improve the anonymity of participants, several approaches rely on hash codes, either alone or in conjunction with other methods [37-39]. A hash function is a one-way function that provides few clues about its input, and it is therefore usually not possible to derive the input to the hash function from its output (the digest). However, simple hashing schemes are vulnerable to phonebook attacks [38], where one can confirm that a participant on a list of contenders was part of a study based on matching hash values. One way to overcome phonebook attacks is to add a secret salt (a simple text string) to the name before hashing [40]. However, if this salt is leaked to adversaries the identities again become vulnerable to phonebook attacks. Another approach is to reduce the information content, for example by taking just the first two characters of the first name and the family name plus the date of birth as input to the hash function [41]. The consequence is that a phonebook attack will lead to multiple hits per ID, and the adversary can therefore not conclude with certainty that a particular ID belongs to a particular person. A problem with this simple approach is that there is a probability of collision also among the names in the list of participants. The principle of obfuscating data such that multiple individuals share the same characteristics is known as k-anonymity [42].

During the last decade, the most intense research effort has been directed at methods that rely on Bloom filters. Bloom-filter approaches facilitate partial matches and are therefore robust to errors in the data [17]. However, critical voices have raised concerns over the anonymity of Bloom filters, as they have been demonstrated to be vulnerable to cryptanalysis attacks [43-45]. Possible countermeasures include the use of salts with the hash functions [46].
Besides the implementations provided by the German Record Linkage Center (https://www.record-linkage.de/), there appear to be few Bloom-filter record-linking implementations available to researchers. Bloom filters also produce long IDs. Such long IDs may appear cryptic to participants, and IDs comprising several hundred characters may be hard to work with during manual data handling, as humans are typically able to hold only around 5 items/digits at a time in short-term memory [47].
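The phonebook attack and the salting countermeasure discussed above can be sketched in a few lines of Python. The names, the salt string, and the use of SHA-256 are purely illustrative; any cryptographic hash exhibits the same behaviour:

```python
import hashlib

def linking_id(name: str, salt: str = "") -> str:
    """Derive a linking ID by hashing the (optionally salted) name."""
    return hashlib.sha256((salt + name).encode("utf-8")).hexdigest()

# Phonebook attack: an adversary with a list of candidate names recomputes
# the unsalted hashes and matches them against a leaked ID.
phonebook = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]
leaked_id = linking_id("Alan Turing")            # unsalted: vulnerable
matches = [n for n in phonebook if linking_id(n) == leaked_id]
print(matches)  # ['Alan Turing'] -- the attack succeeds

# With a secret salt the recomputed (unsalted) hashes no longer match.
salted_id = linking_id("Alan Turing", salt="s3cret-salt")
matches_salted = [n for n in phonebook if linking_id(n) == salted_id]
print(matches_salted)  # [] -- the attack fails without knowing the salt
```

As the prose notes, the protection holds only for as long as the salt remains secret.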
The CANDIDATE anonymization tool
Participant representations
It is assumed that the input to the procedure is the participant's full name (given name, middle name, and family name) transcribed in characters from the Latin alphabet, and that each name is unique. In most practical situations with small sample sizes this is usually the case, although an inspection of any phonebook will reveal that many individuals share the same name. Note that the procedure does not depend on the input being names; the input could comprise digits or combinations of digits and Latin characters. For example, national ID numbers could be used, but it is assumed that national ID numbers are perceived as more private and sensitive than names, and participants may feel uncomfortable sharing their national ID numbers with researchers. Less sensitive alternatives could be the participants' phone numbers or e-mail addresses. A researcher needs to assess which feature is the most beneficial source for the ID encodings in a given study. Note that the input-error tolerance mechanisms do not apply if representations other than names are used. The current implementation of the tool does not support other scripts such as Chinese, Cyrillic, or Arabic characters.

Algorithm 1 shows the pseudo-code of the CANDIDATE procedure, and Fig 1A–1D shows the corresponding flowcharts. New participants are added with the Add procedure, which takes the name of the participant, the anticipated maximum number of participants L, and the size of the coding space N as inputs. The size of the coding space needs to at least match the number of anticipated participants. The list of anonymous participant IDs is maintained in an ID-list. Lookup is used to find the ID of a given participant. Encode is used by Add and Lookup to transform names into IDs using one of the available hash functions.
Fig 1
Flowcharts of the CANDIDATE procedure: (a) Add, (b) Encode, (c) Lookup, and (d) Hash.
Constants:
    hash-type_default := 0
    hash-type_offset := 10
    salt := [salt_1, salt_2, …, salt_n], n := number of salts
    L := maximum number of anticipated participants
    coding-factor := 10 (typically)

State variable:
    ID-list := {}

Add(name, L)
    N := coding-factor × L
    ID_original := Encode(name, N, hash-type_default)
    IF ID_original in ID-list
        hash-type_free := Find-free-slot(name, L, ID-list)
        ID_alternative := Encode(name, N, hash-type_free)
        validation-code := Encode(name, N, hash-type_free + hash-type_offset)
        ATTACH (hash-type_free, validation-code) TO ID_original
        ID_original := ID_alternative
    ADD ID_original TO ID-list

Lookup(name, L)
    N := coding-factor × L
    ID_original := Encode(name, N, hash-type_default)
    ID := ID_original
    FOR EACH (hash-type, validation-code) ATTACHED TO ID_original
        ID_contender := Encode(name, N, hash-type)
        ID-contender-validation := Encode(name, N, hash-type + hash-type_offset)
        IF (ID-contender-validation == validation-code)
            ID := ID_contender
    RETURN ID

Encode(name, N, hash-type)
    name_sorted := Sort(name)
    name_phonetic := Soundex(name_sorted)
    digest := Hash(name_phonetic, hash-type)
    ID := digest MOD N
    RETURN ID

Hash(name, hash-type)
    SWITCH (hash-type)
        0: digest := djb2(name)
        1: digest := CRC-32(name)
        2: digest := CRC-32(reverse(name))
        3: digest := djb2(reverse(name))
        4: digest := djb2(shift(name, 1))
        5: digest := djb2(shift(name, 2))
        6: digest := djb2(shift(name, 3))
        7: digest := djb2(shift(name, 4))
        8: digest := djb2(shift(name, 5))
        ≥ 9 AND < n + 9: digest := djb2(name + salt[hash-type − 9])
        ≥ n + 9: UNRECOVERABLE ERROR — cannot compute hash
    RETURN digest

Algorithm 1. Pseudo-code of the CANDIDATE linking algorithm (implementation available online).
Encoding participants
Add attaches new participants and resolves collisions. The algorithm assumes that the input contains no unwanted characters such as hyphens and apostrophes; such characters should be eliminated in the user interface. First, two optional steps can be applied if a name representation is used, namely sorting and phonetic coding. Name sorting involves sorting the name parts (given, middle, and family names) into alphabetical order to make the procedure robust to variations in name ordering. Next, the sorted name parts are converted into a phonetic representation using Soundex. For example, "Christian" would be coded as C6235, i.e., the first letter (C), 6 for the r-sound, 2 for the c/g/j/k/q/s/x/z-sounds, 3 for the t/d-sounds, and 5 for the m/n-sounds. Note that the full-length encoding is used, which differs from the original Soundex algorithm, which only includes the first four characters (C623). This step makes CANDIDATE tolerant to certain input errors such as spelling mistakes or transcription errors. In short, Soundex removes all vowels and doubled consonants and reassigns the remaining consonants to a coarse-grained set of classes of similar-sounding letters (m and n, d and t, etc.). Note that the Soundex step should only be used with name representations transcribed in Latin characters. Since Soundex is well documented (see for instance [19, 30–35]), with many available implementations, it is not described in detail herein. Note that the phonetic coding and sanitation steps can be switched off in the CANDIDATE implementation.

The phonetic representation is then hashed using one of several hashing functions.
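As a sketch, the full-length Soundex variant described above might be implemented as follows. This is a simplified illustration that ignores some edge cases of the classic algorithm (such as name prefixes); the function name is hypothetical:

```python
# Consonant classes of the Soundex algorithm; vowels and h/w/y get no code.
SOUNDEX_MAP = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4",
    **dict.fromkeys("mn", "5"), "r": "6",
}

def soundex_full(word: str) -> str:
    """Full-length Soundex: classic Soundex without the four-character
    truncation, so longer names keep more phonetic information."""
    word = word.lower()
    digits = []
    prev = SOUNDEX_MAP.get(word[0], "")
    for c in word[1:]:
        code = SOUNDEX_MAP.get(c, "")
        if c in "hw":           # h and w do not break a run of equal codes
            continue
        if code and code != prev:
            digits.append(code)
        prev = code
    return word[0].upper() + "".join(digits)

print(soundex_full("Christian"))  # C6235, as in the example above
print(soundex_full("Cristian"))   # also C6235: robust to this misspelling
```

The second call illustrates the error tolerance: a common misspelling collapses to the same phonetic code and hence, later, to the same ID.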
Hash returns the corresponding digest for a name using one of the available hash functions constructed from the djb2 and CRC-32 algorithms, namely: a djb2 hash, a CRC-32 code, the CRC-32 of the reversed name, the djb2 hash of the reversed name, djb2 hashes of the name shifted by 1–5 characters, and djb2 hashes of the name with salts taken from the most frequent words in English. Dan J. Bernstein's djb2 algorithm [48] builds a hash by processing the input from left to right: the hash is accumulated by adding the ASCII value of each character to the previous result multiplied by 33. CRC-32 (cyclic redundancy check) [49] is more complex; it involves performing repeated XOR operations over a table of 256 precomputed 32-bit constants for each character of the input. Fig 2 shows examples of how the string "Christian" is coded with the different hash functions.
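The two base hash functions are easy to implement or readily available; a minimal Python sketch is given below. The shift operation is assumed here to mean a character rotation, which is one plausible reading of the pseudo-code:

```python
import zlib

def djb2(s: str) -> int:
    """Bernstein's djb2: start at 5381, then h = h*33 + ord(c) per
    character (kept to 32 bits to match typical implementations)."""
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF
    return h

def crc32(s: str) -> int:
    """CRC-32 via the standard library's table-driven implementation."""
    return zlib.crc32(s.encode("utf-8"))

def shift(s: str, k: int) -> str:
    """Rotate the characters of s left by k positions (assumed semantics)."""
    k %= len(s)
    return s[k:] + s[:k]

name = "Christian"
variants = [djb2(name), crc32(name), crc32(name[::-1]),
            djb2(name[::-1]), djb2(shift(name, 1))]
```

Each variant yields an independent-looking digest for the same input, which is what allows the collision-handling step to retry with a different function.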
Fig 2
Examples of how different hash functions code the name “Christian”.
This step ensures anonymity in that it should not be possible to identify who has participated in a study. The outputs of any linking procedure are linking IDs, which must be assumed to be public and observable by adversaries. A procedure therefore needs to be designed such that it is infeasible to derive the identity of a participant from the linking ID. In the CANDIDATE procedure the anonymity of participants is ensured by applying a hash function to the name and truncating the digest (the output of the hash function). The truncation of the digest obfuscates the identity of a person, as several different individuals will map to the same ID, and an attacker can therefore not determine with certainty through a phonebook attack who a particular participant is. The more the digest is truncated, the higher the degree of anonymity achieved. CANDIDATE truncates simply by taking the digest modulo N (the remainder when the digest is divided by N).
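The name-sorting and truncation steps can be sketched together as follows. The phonetic step is omitted, SHA-256 stands in for the tool's djb2/CRC-32 functions, and the function name is hypothetical:

```python
import hashlib

def encode(name: str, N: int) -> int:
    """Illustrative Encode pipeline: sort the name parts into a canonical
    form, hash it, and truncate the digest into the coding space [0, N)."""
    parts = sorted(name.replace(",", " ").split())
    canonical = " ".join(p.lower() for p in parts)
    digest = int.from_bytes(
        hashlib.sha256(canonical.encode("utf-8")).digest(), "big")
    return digest % N  # truncation: many different names share each short ID

# The same person receives the same ID regardless of name ordering.
a = encode("David M Rodman", 100)
b = encode("Rodman, David M", 100)
print(a == b)  # True
```

Because the ID keeps only log2(N) bits of the digest, reversing it to a single name is impossible by construction; many names map to every ID.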
Handling collisions
Each participant should be associated with a unique ID (integrity). However, there is a probability that a hash function will result in collisions. Table 1 illustrates the probability of collision with three different hash functions when coding 100 different names (based on a simulation with 10,000 iterations). Clearly, the probability of collision is related to the size of the coding space as larger coding spaces yield lower probabilities of collision. If the coding space has the same size as the number of items, there will be a collision for more than every third item.
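These collision rates can be approximated with a short simulation under a uniform-hash assumption (function and parameter names are illustrative); it estimates the expected fraction of items whose truncated ID collides with an earlier item:

```python
import random

def collision_fraction(num_items: int, space: int, iterations: int = 2000,
                       seed: int = 42) -> float:
    """Estimate the expected fraction of items whose truncated ID collides
    with an earlier item's ID, assuming uniformly distributed hashes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        ids = [rng.randrange(space) for _ in range(num_items)]
        total += 1 - len(set(ids)) / num_items
    return total / iterations

print(round(collision_fraction(100, 1000), 3))  # roughly 0.048
```

With 100 items and a coding space of 1,000 the estimate lands near 4.8%, and with a space of 100 near 37%, in line with the simulated figures reported for the real hash functions.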
Table 1
Collision probability with djb2, CRC-32 and double hashes (half djb2/half CRC-32) for 100 randomly selected names with different coding space sizes.
Based on simulation with 10,000 iterations.
coding space      djb2        CRC-32      djb2+CRC-32
100               37.2238%    37.2224%    37.2198%
1,000             4.8574%     4.8584%     4.8911%
10,000            0.4966%     0.5086%     0.5084%
100,000           0.0531%     0.0496%     0.0597%
1,000,000         0.0042%     0.0056%     0.0064%
10,000,000        0.0002%     0.0004%     0.0004%
100,000,000       0.0000%     0.0000%     0.0000%
Moreover, the larger the truncated part, the higher the probability of collision. There is thus a trade-off between the degree of anonymity and the probability of collision. Collisions with names that are not part of the study are desirable, as they strengthen anonymity, while the IDs of all participants in the sample must remain uniquely distinguishable. The CANDIDATE collision handling is based on two assumptions, namely that a) the list of existing IDs is known, and b) the researchers know whether they are adding a new participant or looking up an existing participant. When adding a new participant, we first compute ID_original using the default hash function and check the list of existing IDs to see if ID_original is already used. If the ID does not exist, there is no collision, and the ID can be added and used as is. If ID_original already exists, we need to calculate an alternative ID, termed ID_alternative. This is done by applying a different hash function. We then need to check that the alternative ID is not occupied as well; it may therefore be necessary to try several different hash functions in order to find an unused ID. For this purpose, Find-free-slot searches through an array of hash functions (djb2, CRC-32, djb2-reverse, …) until one results in a unique match. Once we have found a hash function hash-type_free that results in an unused ID, we associate ID_original with the hash function hash-type_free used to obtain ID_alternative. ID_alternative is added to the list of IDs and used for linking the participant. In addition, another hash function, hash-type_free + hash-type_offset, is used to calculate a validation code for the given name, and this validation code is also associated with ID_original. As multiple collisions may occur for an ID, the pairs of hash functions and validation codes are kept attached to that ID.
The hash function used to calculate the validation code must be different from the hash functions used to calculate ID_original and ID_alternative. The example in Fig 3 illustrates how collisions are handled using multiple hash functions.
Fig 3
Example of encoding and collision handling: (a) Encoding the first participant with hash function 0. (b) Encoding the second participant with hash function 0. (c) Further seven participants encoded without any collisions. (d) The eighth participant gives a collision. (e) A free slot is found by instead using hash function 1 (CRC-32). The hash function used (hash function 1) is attached to the collision entry together with a check code obtained using hash function 11 (salted). (f) A collision also occurs for participant nine. (g) A free slot is obtained using hash function 1 and attaching the hash function used and the validation code to the colliding item. (h) the tenth participant is encoded without collision.
ID lookup
To look up the ID of an existing participant, an ID is computed using the default hash function. This ID is returned if there are no collisions associated with it. If there are collisions associated with this ID, the list of alternatives is assessed by computing the alternative ID and validation code for each entry in the list of hash functions. The alternative ID with a matching validation code is then returned as the participant's ID.

With this scheme additional information is only stored for colliding items, which in most cases represent a small fraction of the IDs. The additional information maintains anonymity, as an attacker only learns which alternative hash functions are used, and not which IDs they map to. The validation code does reduce the anonymity of the item, but a phonebook attack will fail if the truncations are sufficiently large. Also, collisions are unlikely to occur before a large portion of the participants have been added, making it harder for an adversary to reverse-engineer the resulting ID from the set of IDs.

Table 2 shows an example distribution of hash-function utilization when coding 100 participants with coding spaces of 1,000, 10,000, and 100,000, respectively. Smaller coding spaces require more hash functions than larger coding spaces: six hash functions were used to code 100 participants with a coding space of 1,000, while only two hash functions were needed with a coding space of 100,000. In all cases, most of the coding is performed using the default hash function (hash-type 0). There is a theoretical chance that more hash functions are needed than are provided. Given such a condition, the algorithm detects that it is unable to find a free slot for the given participant. However, Table 2 suggests that such conditions are unlikely in practice. It is also possible to extend the set of hash functions by adding shift and salt variations of the CRC-32 function.
Table 2
Distribution of hash functions used by CANDIDATE for encoding 100 participants with coding spaces of 1000, 10,000 and 100,000.
                                       Frequency (%)
hash-type   Name             N = 1,000    N = 10,000    N = 100,000
0           djb2             95.0915%     99.4947%      99.9503%
1           CRC-32           4.5674%      0.5011%       0.0497%
2           djb2(reverse)    0.3164%      0.0041%       —
3           djb2(shift-1)    0.0228%      0.0001%       —
4           djb2(shift-2)    0.0018%      —             —
5           djb2(shift-3)    0.0001%     —             —
Example
An example of the CANDIDATE algorithm is provided in Fig 3, where 10 arbitrary names are encoded into a space of 50 codes. In this example the phonetic step is omitted for simplicity. First the name "Rodman, David M" is encoded with ID = 16. Clearly, there are no collisions, as this is the first item. The same holds for the subsequent six items. However, the eighth item, "Wetterau, John R.", results in ID = 40, which collides with the ID for "Mortensen, James K.". An alternative hash function (hash-type = 1, CRC-32) results in the unique ID = 26, and the hash-type (1) and the validation code 17 are attached to item no. 7. A similar situation occurs when adding the ninth item, "Couper, Mick P.", which results in ID = 18, which is already assigned to "Woodward, Mark". Therefore "Couper, Mick P." is coded with a different hash function (hash-type 1), which gives the unique ID 30, and the hash-type (1) and the validation code 47 are attached to the third item, that of "Woodward, Mark".

Clearly, all 10 items were successfully coded with two recoverable collisions. It is straightforward to look up the IDs of participants without collisions. To find the ID of an item with a collision, say "Woodward, Mark", we first compute the ID with the default hash function and find 18. Since 18 is associated with a collision, we need to check the validation codes. Here, "Woodward, Mark" yields a validation code of 45, which does not match the stored entry of 47. We can therefore assume that the valid ID is 18. If we instead looked up the ID of "Couper, Mick P.", which also results in an ID of 18, we would find that its validation code matches the stored value of 47, and we would know that the valid ID is obtained by applying hash-type 1, yielding the intended ID of 30.

Fig 4 shows an implementation of the CANDIDATE tool that runs in the web browser. The example shows a test user, "Test A. User", added into a space of 100 codes. The tool allocated ID = 11.
A JSON object with the coding parameters for the sample is returned and needs to be stored. Subsequent lookups of the name with the given parameters will return ID = 11. The form input is also checked for invalid characters, and the researcher is warned about the anonymity limitation of a given study configuration. For example, if a study is to include 20 participants, the population from which these participants are recruited needs to comprise at least 500 individuals to achieve a minimum anonymity (k-anonymity = 5). This total sample population estimate is given by k-anonymity × L.
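The encoding step behind this example can be sketched in JavaScript (the language of the browser-based tool). This is an illustrative sketch only: the salt table, the function names, and the exact fallback order are assumptions, not the tool’s actual implementation.

```javascript
// Sketch of salted djb2 hashing with fallback hash types (illustrative;
// the salt values and names here are assumptions, not the tool's own).
function djb2(str, salt = "") {
  let hash = 5381;
  for (const ch of salt + str) {
    hash = (hash * 33 + ch.charCodeAt(0)) >>> 0; // keep within 32 bits
  }
  return hash;
}

// Map a (normalized) name into a coding space of N slots.
// hashType 0 is the default; hashType 1 reverses the name, as in Fig 3;
// higher types fall back to per-type salts.
function candidateId(name, N, hashType = 0) {
  const salts = ["", "s1", "s2", "s3"]; // assumed salt table
  if (hashType === 0) return djb2(name) % N;
  if (hashType === 1) return djb2([...name].reverse().join("")) % N;
  return djb2(name, salts[hashType % salts.length]) % N;
}
```

The hash type that produced a collision-free slot is what gets stored alongside the validation code, so later lookups can reproduce the same ID.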
Fig 4
A screenshot of the CANDIDATE tool implementation.
Evaluation
The performance of the CANDIDATE tool was evaluated using simulations, which allowed the tool to be exposed to many different scenarios. These simulations addressed two issues, namely the ability to successfully and uniquely link participants (integrity), and the ability to preserve participants’ anonymity. To test the tool, a list of 103,472 researchers’ names was taken from the dataset of a bibliometric study [50] adopted from [51] (see the GitHub repository). This list contains family names, in most cases first names, and in some cases initials. Each simulation was based on drawing a random sample of names from this master list and gradually adding these participants using the CANDIDATE tool. Two sets of sample sizes were used: 10 to 100 participants in steps of 10, representing small studies, which are common in computer science [52], and 100 to 1,000 participants in steps of 100, representing medium to large studies, which are more common within the health sciences. The simulations were repeated 10,000 times for each sample size and coding space configuration.
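One trial of this simulation can be sketched as follows. For illustration, each fallback hash variant is modelled as an independent uniform draw over the coding space, which approximates the deterministic hash family; the function names and the cap of K variants are assumptions.

```javascript
// Sketch of one integrity trial: add n participants to a coding space of
// N slots; each participant may try up to K hash variants (modelled here
// as independent uniform draws) before a collision counts as unresolvable.
function trial(n, N, K = 10) {
  const taken = new Set();
  let unresolved = 0;
  for (let p = 0; p < n; p++) {
    let placed = false;
    for (let attempt = 0; attempt < K && !placed; attempt++) {
      const slot = Math.floor(Math.random() * N);
      if (!taken.has(slot)) { taken.add(slot); placed = true; }
    }
    if (!placed) unresolved++;
  }
  return unresolved;
}

// Repeat many trials and report the share with at least one unresolved collision.
function unresolvedRate(trials, n, N, K = 10) {
  let failures = 0;
  for (let t = 0; t < trials; t++) if (trial(n, N, K) > 0) failures++;
  return failures / trials;
}
```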
Integrity
Figs 5–8 list the results. The results in Figs 5 and 7 are representative of small studies with 10 to 100 participants encoded onto coding spaces with 100, 1,000 and 10,000 entries. With a coding space of 100 the IDs are in the range 00 to 99 (two digits), with a space of 1,000 the IDs are in the range 000 to 999 (three digits), and with 10,000 the IDs are in the range 0000 to 9999 (four digits).
Fig 5
Encoding success rates for small samples (N ≤ 100).
Fig 8
Collision rates for larger samples (100 ≤ N ≤ 1000).
Fig 7
Collision rates for small samples (N ≤ 100).
The results show that there was a non-zero probability of raw collisions for all the configurations, ranging from 0.5% to 100%. However, the CANDIDATE tool successfully handled these collisions in most cases. With a coding space of 10,000 (four-digit IDs) all cases were handled successfully (up to 100 participants). With a coding space of 1,000 (three-digit IDs) there was less than a 0.21% chance of the tool not being able to resolve the collisions, and with 10 and 20 participants all cases were handled successfully. Coding up to 100 participants using a space of 100 entries (two-digit IDs) is more challenging: coding 10 participants into a space of 100 entries gives a 0.1% chance of unrecoverable collisions, 20 participants can be coded with a 0.91% chance of unresolvable collisions, while with 30 participants there is nearly a 3% chance of a collision that cannot be automatically resolved.
Figs 6 and 8 list simulation results for medium to large studies with 100 to 1,000 participants mapped onto coding spaces of 10,000 (four-digit IDs) and 100,000 entries (five-digit IDs). With the 100,000 space there were no cases that resulted in unresolved collisions, while with 10,000 entries up to 1,000 participants were coded with less than a 0.26% chance of unresolved collisions. Up to 200 participants were coded without any unresolved collisions.
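The raw collision rates quoted above follow from the birthday problem: the chance that at least one of n uniformly hashed names lands on an occupied slot. A quick sketch (before any collision resolution is applied; the function name is illustrative):

```javascript
// Probability that at least one raw hash collision occurs when n names
// are hashed uniformly into a space of N slots (birthday problem).
function collisionProbability(n, N) {
  let pNoCollision = 1;
  for (let i = 0; i < n; i++) pNoCollision *= (N - i) / N;
  return 1 - pNoCollision;
}
```

For 10 participants in a space of 10,000 this gives roughly 0.45%, consistent with the lower end of the reported range.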
Fig 6
Encoding success rates for larger samples (100 ≤ N ≤ 1000).
Note that the y-axis starts at 99.6% to show the small variations.
Anonymity
To assess the anonymity of the CANDIDATE tool, one successful encoding was selected from each of the conditions shown in Figs 5 and 7. Each encoding was subjected to a phonebook attack using the full list of 103,472 names. The number of hits per ID was recorded, as well as the number of names that did not result in a valid hit and the amount of coding space without hits. It was assumed that an attacker had access to the list of valid IDs and the coding tables. The results are shown in Figs 9–11.
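The attack procedure can be sketched as follows, here simplified to a single hash function; the function names are assumptions rather than the evaluation code’s actual structure.

```javascript
// Sketch of the phonebook attack: hash every name in the phonebook, count
// how many names land on each participant ID, and report k-anonymity
// (fewest hits on any used ID) plus the fraction of names that can be
// rejected because they map to unused IDs.
function phonebookAttack(phonebook, usedIds, hash, N) {
  const used = new Set(usedIds);
  const hits = new Map([...used].map(id => [id, 0]));
  let rejectable = 0;
  for (const name of phonebook) {
    const id = hash(name) % N;
    if (used.has(id)) hits.set(id, hits.get(id) + 1);
    else rejectable++;
  }
  return {
    kAnonymity: Math.min(...hits.values()),
    rejectRate: rejectable / phonebook.length,
  };
}
```

A high minimum hit count means every participant ID hides among many phonebook names; a high reject rate means an attacker can rule out most of the population.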
Fig 9
Log-log plot of mean anonymity with coding spaces of 100 (two-digit IDs), 1,000 (three-digit IDs), 10,000 slots (four-digit IDs), and 100,000 (five-digit IDs) with a phonebook of 103,472 names.
Error bars indicate the minimum and maximum anonymity.
Fig 10
Percentage of unused ID slots with small sample sizes.
This indicates the portion of phonebook entries that can be discarded as non-participants during an attack.
Fig 11
Percentage of unused ID slots with large sample sizes.
This indicates the portion of phonebook entries that can be discarded as non-participants during an attack.
Fig 9 shows a log-log plot of the k-anonymity with coding spaces of 100, 1,000, 10,000, and 100,000 slots, respectively. With a coding space of 100 items (two-digit IDs) the minimum number of hits per slot was 818 (mean = 1,035). This means that no items can be uniquely identified through a phonebook attack, and the k-anonymity is much higher than the recommended minimum of 5. With a coding space of 1,000 items (three-digit IDs) the minimum number of hits per slot was 71 (mean = 103), which is also high. With a coding space of 10,000 (four-digit IDs) the smallest number of hits per ID was 1, which means that the anonymity of certain individuals cannot be guaranteed. However, for most individuals this configuration provides sufficient anonymity, as the mean number of hits per ID was 10.35, which is above the recommended limit of 5.
When considering the ratio of phonebook items that can be confirmed as not being part of the study because they result in unused IDs, the results show that this ratio matches the ratio of participants to coding space size. That is, when 10 participants are coded with a coding space of 100, 90% of the phonebook entries can be rejected. If 100 participants are coded with a coding space of 100 (two-digit IDs), none of the phonebook entries can be rejected. With coding spaces of 1,000 and 10,000 the rejection rates ranged from 90% to 99% and 99.0% to 99.9%, respectively. Although most of the phonebook entries could be rejected, the entries that were not rejected made up a larger set of candidates than the number of participants. The phonebook attack reached all the items in the coding space for all conditions.
With 100,000 slots (five-digit IDs) there was a mean of 1.03 hits per slot, which cannot be considered anonymous. Clearly, the k-anonymity is related to the size of the coding space relative to the total population, not to the number of participants. Fig 11 shows that with a five-digit coding space only between 0.2% and 2% of the names in the phonebook resulted in valid IDs; the remaining names could therefore be classified as not being part of the sample. The results also showed that with this configuration more than a third of the slots in the coding space remained unused.
Note that the anonymity results depend on the size of the phonebook. With smaller phonebooks (small populations) the anonymity will be lower, and larger phonebooks (large populations) will result in higher anonymity. The size of the phonebook used herein is comparable to the population of a small country such as the U.S. Virgin Islands (104,425 in 2020).
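The observation that over a third of a 100,000-slot space remains unused is what uniform hashing predicts; a one-line sketch (the function name is illustrative):

```javascript
// Expected fraction of unused slots when P names hash uniformly into N
// slots: each slot is missed by one name with probability (1 - 1/N), so
// unused ≈ (1 - 1/N)^P ≈ e^(-P/N).
const unusedFraction = (P, N) => Math.exp(-P / N);
```

For the 103,472-name phonebook and 100,000 slots this gives about 0.355, matching the observation that more than a third of the slots remain unused.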
Discussion
The results confirm the trade-off between the ability to recover from collisions and the degree of anonymity of the resulting IDs. Increasing the ID length reduces the probability of unresolved collisions, while at the same time weakening anonymity. Short IDs yield high anonymity but a higher probability of encoding errors due to unresolved collisions.
The results suggest that a reasonable compromise is achieved when the coding space is ten times the number of participants, that is, with L participants the coding space should be N = 10 × L. For example, when coding 100 participants using three-digit IDs (a space of 1,000 slots) there is a 99.79% chance that there will be no unresolved collisions. In plain terms, this means that if one conducted 500 different research projects, one would expect a single project with an unresolved collision. In practical terms, the effective chance of experiencing unresolved collisions is low.
What are the consequences of unresolved collisions? Unresolved collisions result in erroneous IDs, and the corresponding observations of the erroneously identified participant would be incorrectly linked. However, the researcher may detect the error as a duplicate data entry for an ID and may be able to resolve the issue manually.
Imagine one ends up with one erroneous ID when coding 100 participants in a between-groups experiment comprising two distinct groups of participants, and that this participant is mapped to the incorrect group. When analysing the data, the researcher will notice that one participant is associated with two sets of observations and that the number of unique IDs is less than the number of participants. One option is to discard this set of data; two of the 100 observations will therefore be removed. Either one set of data for each of the two groups is discarded, or two sets of data for one group are discarded.
However, statistical procedures should be sufficiently robust to handle such imbalance without affecting the overall conclusions. Exceptions may of course occur if the results rest on the borderline of statistical significance. In fact, statistical procedures are likely to handle several incorrectly mapped items. In conclusion, we argue that the benefits of preserving participants’ privacy, perceptually simple and short ID codes, and simple administration of research studies outweigh the small risks associated with unresolved collisions.
The anonymity provided by CANDIDATE would be weaker with large datasets, such as large biobank studies, which can include data about several million individuals [53]. This is because more individuals would have a unique ID that would allow them to be uniquely identified. Other anonymity mechanisms should therefore be used with such large-scale studies. However, if a study involves a smaller subset of such a dataset, CANDIDATE may be used. This is because the CANDIDATE coding would be specific to that subset, while a portion of the other individuals in the superset would yield false positives; an adversary can therefore not be certain whether a match is true or false.
It must also be noted that the proposed procedure does not relieve researchers of the responsibility of reflecting over the ethics, privacy, and implications of a research study. Even though an anonymous procedure is used, it does not mean that the stored data are anonymous. For instance, pieces of demographic information that in isolation do not reveal the identity of a participant may reveal the identity when combined. Moreover, qualitative data may contain information that reveals someone’s identity.
The content of the stored data also needs to be carefully considered during the design of studies and experiments.
An important point is that the participants’ identities need to be known by the researcher at some point: when the participant is invited to participate, or when a participant is returning for a subsequent session. Even email correspondence with participants in the researcher’s email account could be considered a name list, and email accounts have been found to be vulnerable to security breaches. Clearly, if the list of participants is written down and stored, the linking-table problem persists. If an attacker gets hold of such a list of names, the attacker knows with certainty who participated in a study and can subsequently find the IDs of the participants and thus which data belong to which person. The researcher should therefore avoid keeping a list of participants, email correspondence, and similar items. For small studies it is feasible for a researcher to memorize who the participants are, but this is not practical with larger studies. Another approach is to recruit participants in person (in some physical location such as a street, shopping centre, workplace, school, or hospital) and then make a (scheduled or unscheduled) appointment to turn up for one or more sessions. The researcher only needs to record the ID (and possibly the time). The participant is responsible for turning up for a first, or subsequent, session and then produces their name so that the ID can be found. Clearly, there is a probability that some participants will forget, but that may be a justified compromise to achieve anonymity.
The CANDIDATE tool does not facilitate linking data from one study to data in other studies, as the IDs are only valid for the set of participants within a given sample. This may be viewed as a limitation of the approach and an inconvenience to researchers.
However, one may also argue that this is in fact a beneficial feature, as CANDIDATE supports the ethical principle that data should only be collected for a specific and well-justified purpose and used for that purpose only; the reuse of sensitive data in new contexts is associated with several problematic issues related to trust and privacy [54, 55]. The privacy of participants, and guarantees that the terms of the consent given are adhered to, trump convenience for researchers.
Although several mechanisms are employed to reduce the impact of input errors, there are several types of input errors that cannot be handled automatically. For instance, the tool will be unable to find a match if a participant’s middle name is inconsistently included or omitted on subsequent occasions. Researchers therefore need to be as accurate as possible when inputting participant information. Also, the input error tolerance mechanisms assume a Latin character representation. Other languages such as Chinese, Russian, or Arabic may require other error tolerance mechanisms.
This study focused on coding participants by name, and with this scheme the researcher needs to ensure that all the names are unique. With large datasets there is a probability that some participants share the same name. With larger sample sizes researchers may achieve uniqueness by concatenating additional information, such as date of birth (day of the month, month, year, or combinations thereof), to the name. On the downside, incorporating additional information will increase the complexity of administering the study.
Not explored in this study is the coding of non-name representations such as genetic or biometric information. To use such coding types, the representation of an individual must be consistent and identical on every instance. If the instances of one individual vary, CANDIDATE will not be able to generate consistent codes.
Biometric matching is often performed using a set of approximate parameters where matches are determined using some distance functions, such as degree of matching in fingerprint recognition [56]. Two biometric measurements of the same individual will rarely be identical and therefore cannot be coded using CANDIDATE.
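The phonetic error-tolerance step discussed above can be sketched with American Soundex; this is an illustrative stand-in, and the tool’s actual phonetic normalization may differ in its details.

```javascript
// American Soundex: keep the first letter, encode remaining consonants as
// digits, collapse adjacent duplicates (H and W are transparent), drop
// vowels, and pad/truncate to four characters.
function soundex(name) {
  const s = name.toUpperCase().replace(/[^A-Z]/g, "");
  if (!s) return "";
  const code = c =>
    "BFPV".includes(c) ? "1" :
    "CGJKQSXZ".includes(c) ? "2" :
    "DT".includes(c) ? "3" :
    c === "L" ? "4" :
    "MN".includes(c) ? "5" :
    c === "R" ? "6" : "";
  let result = s[0];
  let prev = code(s[0]);
  for (let i = 1; i < s.length; i++) {
    const c = s[i];
    if (c === "H" || c === "W") continue; // H and W do not separate codes
    const d = code(c);
    if (d !== "" && d !== prev) result += d;
    prev = d; // vowels (empty code) reset the duplicate check
  }
  return (result + "000").slice(0, 4);
}
```

Names that differ only in minor spelling variants (“Robert” vs. “Rupert”) normalize to the same code before hashing, which is what gives a phonetic step its tolerance to small input errors.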
Conclusions
The CANDIDATE tool for flexible and anonymous linking of participants was presented. Evaluations show that, with appropriate parameters, the tool can successfully assign unique and anonymous IDs to participants with a very low probability of unresolvable ID collisions. Experiments showed that if the space of IDs is about ten times the number of anticipated participants, one achieves a good balance of integrity and anonymity. In the very unlikely situation that incorrect IDs are generated due to collisions, the robustness of statistical testing should ensure that the overall conclusions are not affected (false positives or false negatives) if one employs a hypothesis testing paradigm or similar. CANDIDATE holds potential for simplifying the administration of multi-session studies. More researchers may be encouraged to follow participants over time to collect solid empirical data that allow reliable conclusions to be drawn. An implementation of the CANDIDATE procedure that can be used locally in a web-browser (https://www.cs.oslomet.no/~frodes/CANDIDATE/) has been made available to the research communities, as well as the simulation code for researchers who want to further develop the procedure (https://github.com/frode-sandnes/CANDIDATE/).
27 Jul 2021
PONE-D-21-20957
CANDIDATE: A tool for generating anonymous participant-linking IDs for multi-session studies
PLOS ONE
Dear Dr. Sandnes,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Sep 10 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org.
When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Anandakumar Haldorai, PhD
Academic Editor
PLOS ONE
Journal Requirements: When submitting your revision, we need you to address these additional requirements.
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.
The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf
2. Thank you for stating the following financial disclosure: “This study was not funded. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” At this time, please address the following queries:
a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.
b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”
c) If any authors received a salary from any of your funders, please state which authors and which funders.
d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”
Please include your amended statements within your cover letter; we will change the online submission form on your behalf.
Additional Editor Comments: Please carefully address the issues raised in the comments, up front in your revised paper. Your revised paper will be sent to the same reviewers, as well as possibly new reviewers, for evaluation. Make sure the Abstract briefly describes the paper as it is used in abstracting and citation services. Keep the Abstract to about 200 words. Do not use any references in the Abstract. Spell out each acronym the first time it is used in the body of the paper.
Spell out acronyms in the Abstract only if used there. Include a list of six to ten key words after the Abstract. You may ignore any suggestion by reviewers of including self-references if not applicable. Include a paragraph at the end of the Introduction describing the organization of the paper. Make sure that the Conclusion briefly summarizes the results of the paper; it should not repeat phrases from the Introduction. Keep the Conclusion to about 300 words. Do not use any references or acronyms in the Conclusion. Make sure all figures and tables are referred to in the body of the paper. Properly follow PLOS ONE reference style in both the reference and citation sections. It is recommended to use a professional native English-speaking editor. Papers with less than excellent English will not be published even if technically perfect.
Reviewers' comments:
Reviewer's Responses to Questions — Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Partly
Reviewer #2: Yes
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
5. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters.)
Reviewer #1: See attachment.
Reviewer #2: The manuscript by Sandnes describes the CANDIDATE tool, a computational method for anonymizing participant identities in research studies. Due to real and perceived privacy concerns, as well as regulations such as the GDPR, there remains a need for simple, robust tools for maintaining the privacy of individuals in research studies, while enabling researchers to potentially link information from the same individual across longitudinal studies or multiple studies. Although I work in a different field and thus am not familiar with most of the existing tools, I believe the manuscript did a good job of explaining the challenges and existing tools in the field.
In particular, the computational modeling of different study sizes and scenarios is necessary to understand the interplay between study size, anonymity, collision rate (the same ID assigned to multiple individuals), vulnerability to “phonebook attacks” to decipher identities, and the ID space (length of the ID code). The main strength of the manuscript is the clear description of the tool and the easy-to-understand examples. In particular, I appreciated the recommendation that the ID pool be approximately 10x the size of the participant list (e.g., encoding 100 participants with a 3-digit number). A weakness of the manuscript is that the body or discussion could have included some information about extensions of the tool (for example, whether a tool like this could be adapted to work with numerical or genetic identity information instead of names). Also, depending on space, some of the tables would have been easier to interpret as graphs of curves. Overall, the CANDIDATE tool appears simple to use and useful, and I recommend publication of the manuscript.
Specific issues: I think the manuscript could benefit from putting the work into a broader context. A graph or two in place of some tables would make it easier to quickly interpret the study.
a. The discussion of the need for privacy is excellent; however, the current study is limited to the encoding of individuals with Roman-alphabet names encoded by 26 characters. It is possible that a deeper discussion of the Soundex algorithm could rectify this; however, it is unclear whether the Soundex algorithm can deal with Chinese or Russian names, or whether these would need to be romanized. Alternatively, are there other tools that can sanitize name information or make it more robust, that could be used in place of the Soundex tool?
b. It would be interesting to learn if the tool could be enhanced or even used with participant IDs that were coded with a combination of numbers and letters, or a combination of names and genetic or biometric information, for example. Alternatively, can or should the CANDIDATE tool be adapted to use an alphanumeric ID space, instead of a strictly numeric ID space as in the examples? While this may be speculative and beyond the scope of the study, there is a clear future need for robust identification of individuals whilst preserving their privacy.
c. The modeling appeared robust for groups of 10-1000 individuals, but it is unclear whether this tool could be used on larger scales, such as encoding information for tens of millions of individuals, or a large nation such as the UK. If this is not possible, it could be useful to provide context on whether CANDIDATE could be useful in subsets of those large biobank studies.
d. I am not familiar with the details of different hash functions. The manuscript did a good job of explaining the goal of the hash, to provide one-way encoding, but it would be nice to have an explanation of different hash functions, whether most are equivalent, and what the tradeoffs are of using multiple hashes vs. single hashes.
Finally, while the language is clear, the manuscript could benefit from copyediting for language and spelling.
6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Mario Lorenz
Reviewer #2: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
Submitted filename: Review.pdf
14 Aug 2021
RESPONSE TO REVIEWERS’ COMMENTS MEMO: PONE-D-21-20957
REVIEWER #1 (pdf): 1. I tried the code from GitHub, using the htm file. There seems to be a bug when entering the same input multiple times: I can produce at least two different IDs. Further, if I add symbols, numbers or letters to an existing string of solely Latin letters I receive the same ID; only by adding another Latin character does the ID change. Although this is not an evaluation criterion for the paper, it makes me wonder if these are just bugs or conceptual errors?
[RESPONSE]: I am very grateful that the reviewer set aside time to try the tool. What the reviewer observed is indeed correct and it is intentional. To encode input with digits and other symbols the user needs to uncheck the “phonetic coding” checkbox in the form (checked by default).
When using phonetic coding (for error tolerance) the input is sanitized. The tool is not capable of handling identical names: if the same name is added twice the tool will return two codes, but the user will only get one of these codes when looking up the name. Since the names are not stored it is not possible for the tool to know if a name is repeated or not. A comment was added to the tool to make the user aware of both the name uniqueness requirement and the purpose of the phonetic coding. A comment about the possibility of disabling the phonetic coding step was also added to the Encoding participants section.
REVIEWER #1 (pdf): 2. Non-Latin letters are not treated as letters. A combination of non-Latin letters and Latin letters leads to a reset of the application. I tried with Arabic letters.
[RESPONSE]: Yes, this observation is correct. The current implementation of the tool does not support non-Latin characters. Names in other languages (such as Arabic) need to be transcribed using Latin characters. The section on CANDIDATE has been adjusted to explain this limitation, and a paragraph was added to the Discussion section describing possible support for other languages. Information about this limitation was also added to the tool.
REVIEWER #1 (pdf): 3. Further, from a UX point of view I wouldn’t let the user choose the ratio, as it is an error source. Better to automatically set it to a reasonable value based on the max participants input.
[RESPONSE]: This is a valid point. I decided to follow the advice of the reviewer and have disabled the option to change the ratio (I set it to 10 to reflect the recommendations resulting from the experimentation documented in the manuscript). Advanced users who need a different ratio may manually alter this in the source.
REVIEWER #1 (pdf): 4. “The privacy and safety of participants is of utmost importance in research that involves people.
Privacy is also regulated by legislation such as the General Data Protection Regulations (GDPR) which applies in the European Union." → For research, the more important and global codex for experiments is the Declaration of Helsinki. Pls refer to it.

[RESPONSE]: This is an excellent suggestion. A reference to the Helsinki Declaration was added in the sentence before the mention of the GDPR.

REVIEWER #1 (pdf): 5. Pls explain what a salt is on its first occurrence in the introduction.

[RESPONSE]: An explanation of salt was added as suggested.

REVIEWER #1 (pdf): 6. The first 2 paragraphs in related work are motivational and should be integrated in the introduction.

[RESPONSE]: The two paragraphs were moved from the Related work section and integrated in the Introduction section.

REVIEWER #1 (pdf): 7. "Experimentation showed that a suitable compromise between integrity and anonymity is achieved when N is ten times the anticipated number of participants" → This is anticipation of the Evaluation results. Pls consider removing or explicitly referring to the evaluation section.

[RESPONSE]: I agree with the reviewer. This sentence was removed.

REVIEWER #1 (pdf): 8. It is misleading that the author mostly speaks of the name as an input variable, although any string could be used. When read superficially one could be misled into thinking that the tool is not suitable for larger studies, where it is likely to have multiple participants with the same name. I would strongly advise to scatter the information that any unique string generated from easily accessible information can be used as an input. That said: Pls add this as a possible limitation, as it is up to the experimenters to ensure such unique input strings.

[RESPONSE]: The reviewer is right. The following paragraph outlining this limitation was therefore added to the Discussion section: "This study focused on coding participants by name and with this scheme the experimenter needs to ensure that all the names are unique.
With large datasets there is a probability that some participants share the same name. With larger sample sizes experimenters may achieve uniqueness by concatenating additional information such as date of birth (day of the month, month, year, or combinations thereof) to the name. On the downside, incorporating additional information will increase the complexity of administering the study."

REVIEWER #1 (pdf): 9. Line 191: 'L' and 'ID-List' are not previously defined.

[RESPONSE]: Both L and ID-List have now been defined in the pseudo-code.

REVIEWER #1 (pdf): 10. 191 hash-typefree = Find-free-slot(name, L, ID-List) → Isn't this bound to often return a value >9? Then the switch in the hash function hardly makes any sense. This leads me to the point that I do not understand why there are 10 different digest-generating functions in the hash function.

[RESPONSE]: Find-free-slot will usually return a value less than 9 (in most practical cases), but it can return a value of 9 or more. The pseudo-code has been adjusted to make the intended meaning clearer. The last switch statement in the hash function will trigger if hash-code is equal to 9 or greater, that is, 10, 11, 12, etc. (the ≥ symbol is now used). The returned hash is then the djb2-hash with a salt added. The salt is taken from an array of salts with the index given by the hash-code (a declaration of the salt array was added). The different types of digests are therefore simply bound by the number of salts in the salt array. The reason why the hash function can return many different types of digests (10+) is to be able to handle collisions. That is, if two different inputs result in the same digest with one hash function, they will (usually) result in two different (non-colliding) digests with two different hash functions. The different digests thus facilitate the computation of alternative IDs. Then, which hash function to use is stored with the ID-original item.
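The collision-handling scheme outlined in this response (a family of salted hash variants tried in order until a free slot is found, with the variant index stored alongside the ID) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the salt values, the function names, and the number of variants are assumptions for the example.

```python
# Sketch of collision handling with multiple salted hash variants.
# The salt strings below are illustrative, not those used by CANDIDATE.
SALTS = ["salt-a", "salt-b", "salt-c", "salt-d", "salt-e"]

def djb2(s: str) -> int:
    """Bernstein's djb2 string hash, kept to 32 bits."""
    h = 5381
    for ch in s:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
    return h

def encode(name: str, n_slots: int, hash_type: int) -> int:
    """Digest variant `hash_type`, truncated to an ID space of size n_slots."""
    salted = name if hash_type == 0 else name + SALTS[hash_type - 1]
    return djb2(salted) % n_slots

def find_free_slot(name: str, used_ids: set, n_slots: int):
    """Try hash variants in order until one yields an unused ID."""
    for hash_type in range(len(SALTS) + 1):
        candidate = encode(name, n_slots, hash_type)
        if candidate not in used_ids:
            return hash_type, candidate
    # Mirrors the exception added to the pseudo-code: encoding failed,
    # and the experimenter is told the participant could not be encoded.
    raise RuntimeError("unable to encode participant")
```

If a name's default slot (variant 0) is already taken, the search moves on to the salted variants; the returned `hash_type` would then be stored with the ID so that the same alternative digest can be recomputed at lookup time.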
A new figure was added to help illustrate the use of different hash functions to handle collisions.

REVIEWER #1 (pdf): 11. I think the Add function (line 186) could be simplified by inverting the check in the if-statement.

186 Add(name, N)
187 IDoriginal = Encode(name, N, hash-typedefault)
188 IF IDoriginal in ID-list
189 hash-typefree = Find-free-slot(name, L, ID-List)
192 IDoriginal = Encode(name, N, hash-typefree)
193 validation-code = Encode(name, N, hash-typefree + hash-typeoffset)
194 ATTACH (hash-typefree, validation-code) TO IDoriginal
195 Add IDoriginal to ID-list

[RESPONSE]: This is a good suggestion. The Add function was simplified.

REVIEWER #1 (pdf): 12. Line 194: shouldn't it be IDalternative?

[RESPONSE]: No, the statement is correct. This is because different names that result in colliding IDs (ID-original) will first "land on this slot". The fact that there are already one or more validation-codes assigned to this ID-original means that a different ID (ID-alternative) needs to be created and assigned. During lookup we need to inspect the validation-codes to identify the matching one so that the correct hash can be applied to find the ID-alternative. This is explained in the passage starting with "When adding a new participant, we first compute…".

REVIEWER #1 (pdf): 13. In the Add and Lookup functions '=' is used; the other functions use ':='. Pls unify.

[RESPONSE]: Thank you for spotting this inconsistency. All assignments have been unified to ":=".

REVIEWER #1 (pdf): 14. "199 FOR EACH (hash-type, validation-code) ATTACHED TO IDoriginal": parameters in FOR EACH not previously defined.

[RESPONSE]: The hash-type and validation-code in the FOR EACH are declared in the lines above (in the same IF block).

REVIEWER #1 (pdf): 15. It is a bad programming habit to have more than one return statement. Pls revise the code of the Lookup function.
I find it further strange that this function returns a newly generated code in case the FOR EACH runs without hitting the THEN condition of the IF statement. Pls explain.

[RESPONSE]: The code has been revised with just one return statement.

REVIEWER #1 (pdf): 16. I understand why the Sanitize method removes these characters; however, maybe an experimenter relies on these removed characters in order to create unique input strings. In the input text field I would check for the removed characters and make them invalid input.

[RESPONSE]: This is an excellent suggestion. An input check was added to the tool that now only allows variations on the Latin alphabet, numbers (in case of phone numbers), @ and dot (in case of e-mails), and hyphen (as used in some connected names). The manuscript was also updated to reflect this change.

REVIEWER #1 (pdf): 17. Code of the Soundex function is missing.

[RESPONSE]: I did not include a detailed description of Soundex to save space, as it is quite a well-known algorithm. To make this clearer to the reader I added the statement "Since Soundex is well-documented (see for instance [11, 30-35]), with many available implementations, it is not described in detail herein.". However, if the reviewer insists, I would of course be happy to expand the text with a detailed description of Soundex.

REVIEWER #1 (pdf): 18. "245 (without the four-character length restrictions)" → Out-of-nowhere statement. What is this 4-character restriction?

[RESPONSE]: I agree that this sentence appears very cryptic when viewed out of context. It was therefore replaced with an example and an explanation, namely: "For example, "Christian" would be coded as C6235, i.e., the first letter (C), 6 for the r-sound, 2 for the c/g/j/k/q/s/x/z-sounds, 3 for the t/d-sounds and 5 for the m/n-sounds. Note that the full-length encoding is used, which differs from the original Soundex algorithm which only returns the first four characters (C623)."

REVIEWER #1 (pdf): 19.
Pseudo code of CONFIDENCE procedure is missing.

[RESPONSE]: This is a typo. It was corrected to CANDIDATE.

REVIEWER #1 (pdf): 20. References to the hash coding algorithms are missing (CRC32, etc.).

[RESPONSE]: References to the detailed descriptions of the two algorithms were added to the manuscript.

REVIEWER #1 (pdf): 21. The "The CANDIDATE anonymisation tool" section needs more structuring. Pls add meaningful subheadings.

[RESPONSE]: Several subheadings were introduced to help guide the reader.

REVIEWER #1 (pdf): 22. General comment: Pls add flow charts or sufficient UML diagrams for all algorithms described in the section "The CANDIDATE anonymisation tool". It would allow for drastically shortened text, less repetition and far easier understanding.

[RESPONSE]: A diagram illustrating the essence of the algorithm, namely the collision handling using multiple hash functions, was added to the revised manuscript (as this is the core of the approach). A diagram illustrating the generation of the different hash functions was also added. An attempt was made to make these understandable for a wider readership, as knowledge and experience are required to read and interpret UML diagrams.

REVIEWER #1 (pdf): 23. A general comment: If the input string can be anything, then functions like Sanitize and Soundex are a bit meaningless, as they are intended to deal with names. They also seem to only work on Latin-letter input. Also, in the further description of CANDIDATE everything seems to be directed at handling names as input strings, although they are not unique and would be an unfavorable choice.

[RESPONSE]: It is correct that Sanitize and Soundex only apply to name representations. From an information-theoretic perspective there are better choices than just names, as correctly pointed out by the reviewer, but from a practical experimenter's perspective names are more acceptable, as some participants are uncomfortable disclosing private information about themselves (including dates of birth).
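Since Soundex comes up in several comments here (17, 18 and 23), a sketch may help. The full-length Soundex variant described in the response to comment 18 can be illustrated as below. This is a simplified sketch: strict Soundex has additional rules for 'h' and 'w' between consonants and truncates the result to four characters, and the function name is an assumption for the example.

```python
# Simplified full-length Soundex sketch (illustrative only; strict Soundex
# treats h/w separators specially and keeps just four characters).
CODES = {}
for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in group:
        CODES[letter] = digit

def full_soundex(name: str) -> str:
    name = name.lower()
    result = name[0].upper()          # the first letter is kept verbatim
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        code = CODES.get(ch, "")      # vowels and h/w/y map to no code
        if code and code != prev:     # collapse adjacent identical codes
            result += code
        prev = code
    return result                     # full length, no four-char truncation
```

With this sketch, "Christian" encodes to C6235 (C kept, then r→6, s→2, t→3, n→5), matching the example in the response above, whereas classic Soundex would stop at C623.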
If participation in a study is perceived as too "intrusive", participants may withdraw. Therefore, the emphasis is on names. Several statements have been revised and added in the revised manuscript to make this point clearer.

REVIEWER #1 (pdf): 24. Lines 274-290: In the pseudo code of the Add function there is no loop, so I do not comprehend where "we search through an array of hash functions" should occur.

[RESPONSE]: The "loop" is inside Find-free-slot(..) in the Add function. To make this clearer, "we search through …" was replaced with "Find-free-slot searches through …".

REVIEWER #1 (pdf): 25. Lines 296-311 mostly provide information already given.

[RESPONSE]: The passages have been revised to avoid redundant information.

REVIEWER #1 (pdf): 26. General comment: Pls add the name list used, to increase replicability. It should be a supplement.

[RESPONSE]: Done! The full list of names used for the experiments has been uploaded to the project GitHub page (link in the manuscript).

REVIEWER #1 (pdf): 27. Tables 4/5/6/7: Pls type out all numbers.

[RESPONSE]: As suggested by Reviewer #2 the tables were replaced by charts to simplify interpretation and comprehension. All the repeated numbers are included in the charts (i.e., the repetition marks in the tables are no longer an issue).

REVIEWER #1 (pdf): 28. Table 8: why is the 'unused' column not present for N=10,000 and in Table 7?

[RESPONSE]: Unused was not listed because the entire coding space was used (100%), hence it did not seem relevant to list it. To make this clear it is now explicitly stated in the text. Note that the content of Table 7 has been replaced by charts.

REVIEWER #1 (pdf): 29. Tables 7/8: How can you explain that the min/mean values are the same in all conditions except for N=100?

[RESPONSE]: This is because the min and mean k-anonymity depend on the size of the coding space in relation to the total population (phonebook) and not on the number of participants.
A statement was added to make this explicit in the text.

REVIEWER #1 (pdf): 30. General question regarding the evaluation: From what I understood from the explanation of the CANDIDATE algorithm, anonymity and encoding success greatly depend on the truncated hash value. In this case, this parameter is not given in the Evaluation section, and it is not varied to evaluate its influence.

[RESPONSE]: Yes, the anonymity and encoding success depend on the level of truncation. The level of truncation (coding space) is denoted by N, and this parameter is listed in the results.

REVIEWER #1 (pdf): 31. "With 10,000 slots (four-digit IDs) the smallest number of items per ID was 1, which indicates no anonymity. However, the mean number of hits per ID is 10.34 which is above the limit of acceptable anonymity" → This is a bit of an easy argumentation. Over the course of the manuscript it is stressed multiple times how important anonymity is, and here the author implies that it is acceptable not to reach absolute anonymity.

[RESPONSE]: I agree that this presentation was unfortunate. I have reordered the presentation of the results so that the mean (which is acceptable) is presented first, followed by the min (which is not), leading to a clearer indication that, overall, this configuration cannot ensure anonymity for all participants.

REVIEWER #1 (pdf): 32. Further, after reading the Evaluation section, there seems to me to be an inherent conflict between 'encoding success' and 'anonymity'. This greatly impacts usability: as a user I shouldn't have to know/understand the details of the CANDIDATE algorithm in order to be able to choose the correct coding space.

[RESPONSE]: Yes, there is indeed a trade-off between encoding success and anonymity, and the results show that a suitable compromise is found with a coding space 10 times the number of participants. This ratio is also now fixed to 10 in the revised tool.
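The min/mean anonymity figures discussed in these responses can be estimated with a short simulation: map every name in a larger reference population (the "phonebook") into the ID space and count how many names land on each occupied ID. A minimal sketch follows, in which a plain djb2-style hash stands in for the tool's full hash pipeline and the synthetic name list is an assumption for the example.

```python
from collections import Counter

def djb2(s: str) -> int:
    """Bernstein's djb2 string hash, kept to 32 bits."""
    h = 5381
    for ch in s:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
    return h

def anonymity_stats(population, n_slots):
    """Min and mean number of population members sharing each occupied ID."""
    hits = Counter(djb2(name) % n_slots for name in population)
    counts = hits.values()
    return min(counts), sum(counts) / len(counts)

# Synthetic 10,000-name "phonebook" mapped into a 1,000-slot ID space.
population = [f"person-{i}" for i in range(10_000)]
lo, mean = anonymity_stats(population, 1_000)
```

A mean around 10 indicates that, on average, each ID is ambiguous among roughly ten people in the wider population, while a minimum of 1 would flag IDs that point to a single person, mirroring the min/mean discussion of Tables 7 and 8 above.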
The anonymity is also related to the total population of the group (country, region, institution, etc.). A warning message with a simple anonymity estimate was added to the tool to make the user more aware of the anonymity for a given study. A comment on this was also added at the end of the section outlining the CANDIDATE algorithm.

REVIEWER #1 (pdf): 33. The biggest weakness IMHO is not mentioned: the phonebook attack scenario implies that the attacker got hold of the anonymous data and tries to identify the individuals. However, if an attacker was able to obtain the research data, why shouldn't s/he also be able to get hold of the study organization data where the participants' names are listed? So the whole method relies on the security of the name list (i.e., the input data). The link-table approach, of course, harbors the same problem of keeping the link-table secure. However, even if an attacker were able to obtain the participant name list and the anonymous data, s/he could not link them.

[RESPONSE]: The reviewer is indeed right, and this is an important point. If an attacker can somehow be certain that an individual was part of the experiment (by other means), then the attacker can also find the ID of that participant. A paragraph was added to the Discussion section to explicitly elaborate on this point, with some suggestions on how to manage this in practice.

REVIEWER #1 (pdf): 34. A further limitation is that the whole evaluation was conducted only with names in Latin letters, whilst in lines 166-175 it is explicitly stated that any kind of input string could be used. Unfortunately, one can therefore only consider the CANDIDATE tool validated for names provided in the Latin alphabet, of course considering the boundaries/limitations found here.

[RESPONSE]: Yes, this is a very correct observation indeed.
Although the names in the test suite were from all over the world (including Chinese and Arabic names), they were transcribed using Latin characters (as author names often are in international publications). Soundex only works with the Latin alphabet, and Soundex should not be used with other representations such as phone numbers. The text has therefore been revised to reflect this limitation in the introduction to CANDIDATE pointed out here, in the description of Soundex, as well as in the Discussion. The tool has also been updated to clearly indicate the character-coding limitation.

REVIEWER #1 (pdf): 35. A further practical problem, though not a limitation of CANDIDATE itself, is that the experimenter must pay great attention to possibly erroneous input strings when generating the IDs.

[RESPONSE]: I fully agree with the reviewer. There is always a risk that input errors caused by the user may lead to problems. The algorithm performs two steps to reduce the chance of errors in the input strings. First, the Soundex algorithm ensures that the tool can handle several types of spelling mistakes, as it performs a type of approximate string matching. Second, the name parts (first, middle, last) are sorted in alphabetical order so that it does not matter in which order the names are input. Both mechanisms are already briefly explained in the manuscript. A cautionary note about the need for careful input was added to the last sentence of the Discussion section of the revised manuscript.

REVIEWER #1 (pdf): 36. "Evaluations show that the tool successfully assigns unique and anonymous IDs to participants" → Being picky, this is not true, as assigning unique and anonymous IDs depends on the correct set of parameters (as you write in the following sentence).

[RESPONSE]: Good point.
The sentence was moderated to "the tool with appropriate parameters can successfully assign unique and anonymous IDs to participants".

Reviewer #2 comments:

REVIEWER #2: The manuscript by Sandnes describes the CANDIDATE tool, a computational method for anonymizing participant identities in research studies. Due to real and perceived privacy concerns, as well as regulations such as the GDPR, there remains a need for simple, robust tools for maintaining the privacy of individuals in research studies, while enabling researchers to potentially link information from the same individual across longitudinal studies or multiple studies.

[RESPONSE]: Thank you. This is a very accurate summary of the manuscript.

REVIEWER #2: Although I work in a different field and thus am not familiar with most of the existing tools, I believe the manuscript did a good job of explaining the challenges and existing tools in the field. In particular, the computational modeling of different study sizes and scenarios is necessary to understand the interplay between study size, anonymity, collision rate (same ID assigned to multiple individuals), vulnerability to "phonebook attacks" to decipher identities, and the ID space (length of ID code). The main strength of the manuscript is the clear description of the tool and the easy-to-understand examples. In particular, I appreciated the recommendation that the ID pool be approximately 10x the size of the participant list (e.g., encoding 100 participants with a 3-digit number). A weakness of the manuscript is that the body or discussion could have included some information about extensions of the tool (for example, whether a tool like this could be adapted to work with numerical, or genetic, identity information instead of names). Also, depending on space, some of the tables would have been easier to interpret as graphs of curves.
Overall, the CANDIDATE tool appears simple to use and useful, and I recommend publication of the manuscript.

[RESPONSE]: Thank you very much for these encouraging comments. A description of the opportunities and limitations of using the tool with generic identity information was added to the discussion (see response to the specific comment below). The results tables were replaced with charts for simplified interpretation.

Specific issues:

REVIEWER #2: I think the manuscript could benefit from putting the work into a broader context. A graph or two in place of some tables would make it easier to quickly interpret the study.

[RESPONSE]: The discussion was extended to place the work in a broader context in terms of the possibility of using genetic codes/biometric information instead of names. The results tables were replaced with charts for simplified interpretation.

REVIEWER #2: a. The discussion of the need for privacy is excellent; however, the current study is limited to the encoding of individuals with Roman-alphabet names encoded by 26 characters. It is possible that a deeper discussion of the Soundex algorithm could rectify this; however, it is unclear whether the Soundex algorithm can deal with Chinese or Russian names, or whether these would need to be romanized. Alternatively, are there other tools that can sanitize name information or make it more robust, that could be used in place of the Soundex tool?

[RESPONSE]: This point was also raised by the other reviewer. The text has been revised to clarify this point (in the section describing the CANDIDATE procedure). In short, the names need to be Romanised to use Soundex (which was designed for English). In principle CANDIDATE may be used with other scripts, but then without the error tolerance, or language-specific error tolerance mechanisms must be tailor-made for the language. The current implementation, however, does not support non-Latin characters. The tool was also updated with information about this.

REVIEWER #2: b.
It would be interesting to learn whether the tool could be enhanced or even used with participant IDs coded with a combination of numbers and letters, or a combination of names and genetic or biometric information, for example. Alternatively, can or should the CANDIDATE tool be adapted to use an alphanumeric ID space, instead of a strictly numeric ID space as in the examples? While this may be speculative and beyond the scope of the study, there is a clear future need for robust identification of individuals whilst preserving their privacy.

[RESPONSE]: This is a very interesting and relevant question. In short, if the representation is consistent on different occasions (no variations), it will work with CANDIDATE. If the information varies, as with certain biometric information (e.g., fingerprint matching), it will not be possible to use CANDIDATE unless some step is used to reduce or "quantize" the information to a consistent representation. A paragraph addressing this point was added to the end of the Discussion section.

REVIEWER #2: c. The modeling appeared robust for groups of 10-1000 individuals, but it is unclear whether this tool could be used at larger scales, such as encoding information for tens of millions of individuals, or a large nation such as the UK. If this is not possible, it could be useful to provide context on whether CANDIDATE could be useful in subsets of those large biobank studies.

[RESPONSE]: This is a highly relevant question. The anonymity of CANDIDATE comes from the ambiguities that arise from the participants being a subset of a larger population. If one codes an entire population (such as a large biobank dataset with tens of millions of individuals), the anonymity of the participants will be reduced. However, CANDIDATE is appropriate for studies involving a subset of such large lists, as the coding space will be specific to the list and lead to false positives for other individuals in the superset.
A paragraph has been added in the Discussion section to address this point.

REVIEWER #2: d. I am not familiar with the details of different hash functions. The manuscript did a good job of explaining the goal of the hash, to provide one-way encoding, but it would be nice to have an explanation of different hash functions, whether most are equivalent, and what the tradeoffs are of using multiple hashes vs. single hashes.

[RESPONSE]: It should not be necessary for the reader to be familiar with the details of the hashing algorithms. However, references to sources detailing each of the two basic hash algorithms (djb2 and CRC-32) were added, as well as a brief explanation of how the two algorithms work. An example was also added in the revised manuscript (see the new Table 1) to illustrate how the different variations of the hash algorithms modify an input string at the different phases.

REVIEWER #2: Finally, while the language is clear, the manuscript could benefit from copyediting for language and spelling.

[RESPONSE]: The revised manuscript has been carefully reviewed for language issues.

Submitted filename: response memo R1V07.docx

3 Sep 2021

PONE-D-21-20957R1

CANDIDATE: A tool for generating anonymous participant-linking IDs in multi-session studies

PLOS ONE

Dear Dr. Sandnes,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 18 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org.
When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Anandakumar Haldorai, PhD
Academic Editor
PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct.
If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Recommended for minor revision.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

**********

4.
Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data, e.g. participant privacy or use of data from a third party, those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: See attached file.

Reviewer #2: The manuscript is greatly improved. The graphs help make the data more accessible, and the later sections of the paper do a really good job of explaining the limitations and appropriate use-cases for the tool.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?).
If published, this will include your full peer review and any attached files.

If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Mario Lorenz
Reviewer #2: No

Submitted filename: Review.docx

7 Oct 2021

Response to reviewers' comments on revision 2

Reviewer #1

[Reviewer #1]: 1. Line 241: ≥ 9: digest := djb2(name + salt[8 - hash-type]) → Isn't this bound to run into a runtime error, as hash-type will be at least 9, resulting in a negative index? However, even if inverted to "hash-type - 8" it could run into a runtime error, as the value of hash-type is not constrained but the salt array has a pre-defined length. I know I am being picky here, but as this is basically the core function of your whole approach, the possibilities of runtime errors should be completely omitted.
Therefore, a predefined salt array with a fixed set of entries cannot be the solution. I would suggest generating the salt generically during runtime.

[RESPONSE]: The reviewer is right that there is a risk of a runtime error. The simulations show that this is highly unlikely (see Table 2). With N = 1000 the hash-type value is never larger than 5, and with larger N's it is lower. To generate a salt generically during runtime, as suggested by the reviewer, may also not be a feasible solution. From a theoretical perspective one may not be able to deterministically find a salt that results in a free slot and at the same time provides sufficient obfuscation. In any case, the algorithm will detect if such a situation occurs and report to the experimenter that it was unable to encode a given participant. To make this situation more transparent to the reader, I have added this as an exception in the pseudo-code, and also added a comment regarding this in the text (just before Table 2).

2. REVIEWER #1 (pdf): 14. "199 FOR EACH (hash-type, validation-code) ATTACHED TO IDoriginal": parameters in FOR EACH not previously defined.

[RESPONSE]: The hash-type and validation-code in the FOR EACH are declared in the lines above (in the same IF block).

→ As far as I see it, the IF statement where hash-type and validation-code are defined belongs to the Add(…) function. The FOR EACH is part of Lookup(…), so the variables defined in Add(…) should be unknown in Lookup(…).

[RESPONSE]: Actually, the variables defined above were intended as global/state variables of the algorithm. I have adjusted the pseudo-code with labels so that it is clearer which are internal state variables of the algorithm, which are constants, and which are variable parameters.

3. REVIEWER #1 (pdf): 16. I understand why the Sanitize method removes these characters; however, maybe an experimenter relies on these removed characters in order to create unique input strings.
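As a sketch of the indexing issue discussed in point 1 above (the salt values and the exact hash-type-to-index offset below are illustrative assumptions, not the tool's actual code), the salted djb2 family with an explicit bounds check that raises an exception instead of producing a silent negative index:

```python
def djb2(s: str) -> int:
    """Classic djb2 string hash (Dan Bernstein), truncated to 32 bits."""
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

# Illustrative salt table only; the actual tool uses its own predefined salts.
SALTS = ["k3", "q7", "z1", "m9"]

def salted_digest(name: str, hash_type: int) -> int:
    """Salted djb2 variant selected by hash_type (assumed here to start at 9).
    The index is bounds-checked so that an exhausted salt table raises a
    clear exception rather than causing the negative-index error discussed."""
    index = hash_type - 9  # 8 - hash_type would go negative, per the reviewer
    if not 0 <= index < len(SALTS):
        raise ValueError(f"hash_type {hash_type}: no salt available, "
                         f"cannot encode participant")
    return djb2(name + SALTS[index])
```

In Python a negative index would silently wrap to the end of the salt table rather than fail, so the explicit range check is what surfaces the "unable to encode" condition to the experimenter.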
In the input text field I would check for the removed characters and make them invalid input.

[RESPONSE]: This is an excellent suggestion. An input check was added to the tool that now only allows variations on the Latin alphabet, numbers (in case of phone numbers), @ and dot (in case of e-mails), and hyphen (as used in some connected names). The manuscript was also updated to reflect this change.

→ Unfortunately, I was unable to detect this change in the manuscript. Could you please give me direction?

[RESPONSE]: This is perhaps a misunderstanding on my part. The change was made to the browser implementation (updated on GitHub), not the manuscript. In the currently revised manuscript I have removed Sanitize from the algorithm and written that the algorithm assumes sanitized names, and that sanitized names can be ensured using input checks in the user interface.

4. When looking at the hash function I was wondering if there is a reason why CRC32 was not used for the letter-shifted input names but only djb2?

[RESPONSE]: No, there was no reason for this. Testing showed that there were enough hash functions as is. If one needs more hash functions, all the pre-processing cases that are applied to djb2 (shifts and adding the salt) can also be applied to CRC32 with the desired effect. I have added a comment about this in the text (just before Table 2).

5. I appreciate the added Fig 1 and Fig 2. However, I think Fig 2 is currently not really self-contained and is not really understandable. I think it would be good to also use a concrete example, as in Fig 1, for explaining it.

[RESPONSE]: I have replaced Fig 2 with a new illustration (Fig 3) that shows the steps involved in the encoding (Fig 3 a, b, c, d, e, f, g and h). This figure builds on the example in Table 3, which is now obsolete. I have therefore removed Tables 3 and 4.

6. REVIEWER #1 (pdf): 22.
General Comment: Pls add flow charts or sufficient UML diagrams for all algorithms described in section “The CANDIDATE anonymisation tool“. It would allow for drastically shorten text, less repetition and far easier understanding.[RESPONSE]: A diagram illustrating the essence of the algorithm, namely the collision handling using multiple hash functions was added to the revised manuscript (as this is the core of the approach). A diagram illustrating the generation of the different hash functions was also added. An attempt was made to make it understandable for a wider readership as knowledge and experience is required to read and interpret UML diagrams.� Here I disagree. One who can read AND understand the given pseudocode is highly likely able to comprehend flow charts or sufficient UML diagrams. Further, it will be a big help for colleagues who would like to build on your work. It took me properly 5-time to really think through the algorithm with just the pseudocode and the textual description then it would have been with proper diagrams. By the same time I am also less confident to not have missed a glitch or an error.[RESPONSE]: I have added flow charts (see new Fig. 1 (a, b c and d).Reviewer #2[Reviewer #2]: The manuscript is greatly improved. The graphs help make the data more accessible, and the later sections of the paper do a really good job of explaining the limitations and appropriate use-cases for the tool.[RESPONSE]: Thank you very much for these encouraging comments.General changesI have added a paragraph in the introduction where the term “anonymous” is briefly discussed and reflected upon which I think is relevant for this study.Submitted filename: response memo R3.docxClick here for additional data file.13 Oct 2021PONE-D-21-20957R2CANDIDATE: A tool for generating anonymous participant-linking IDs in multi-session studiesPLOS ONEDear Dr. Sandnes,Thank you for submitting your manuscript to PLOS ONE. 
After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 27 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

- If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io.
Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Anandakumar Haldorai, PhD
Academic Editor
PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

The figures need to be improved. The quality of the figures may be improved with a proper image editor.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

**********

3.
Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes
Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes
Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: Thank you very much for the revision. Just one minor editing issue: Fig 1a and 1c need minor polishing so that the text is not intersected by lines and all arrows are straight.

Reviewer #2: The revised manuscript is relatively easy to follow, the explicit examples in Figures 1 and 3 are good, and the added discussion of anonymity is helpful. I have only two minor comments.
One cosmetic issue is that the flow chart would look better and be easier to follow if the text inside the different flow diagrams in Fig 1a and 1c was resized to make it more readable. Second, for the online version of the algorithm, it would be helpful to add a hyperlink to the GitHub page (especially if the code is expanded or modified in the future).

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Mario Lorenz
Reviewer #2: No

28 Oct 2021

Response memo

EDITOR: The figures need to be improved.
The quality of the figures may be improved with a proper image editor.

RESPONSE: The issues raised by the reviewers have been fixed (see below).

Reviewer #1: Thank you very much for the revision. Just one minor editing issue: Fig 1a and 1c need minor polishing so that the text is not intersected by lines and all arrows are straight.

RESPONSE: Fig 1a and 1c have been adjusted to avoid collisions between text and lines. All arrows have been straightened.

Reviewer #2: The revised manuscript is relatively easy to follow, the explicit examples in Figures 1 and 3 are good, and the added discussion of anonymity is helpful. I have only two minor comments. One cosmetic issue is that the flow chart would look better and be easier to follow if the text inside the different flow diagrams in Fig 1a and 1c was resized to make it more readable. Second, for the online version of the algorithm, it would be helpful to add a hyperlink to the GitHub page (especially if the code is expanded or modified in the future).

RESPONSE: The problems in Fig 1a and 1c were fixed by making the boxes larger to ensure a consistent text size. A link to the GitHub page has been added in the algorithm caption.

Submitted filename: responsememoR4.docx

15 Nov 2021

CANDIDATE: A tool for generating anonymous participant-linking IDs in multi-session studies
PONE-D-21-20957R3

Dear Dr. Sandnes,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Anandakumar Haldorai, PhD
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Recommended.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes
Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

**********

4.
Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes
Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes
Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)
Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review?
For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Mario Lorenz
Reviewer #2: No

2 Dec 2021

PONE-D-21-20957R3
CANDIDATE: A tool for generating anonymous participant-linking IDs in multi-session studies

Dear Dr. Sandnes:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Anandakumar Haldorai
Academic Editor
PLOS ONE