OBJECTIVE: Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among existing privacy models, ε-differential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. All existing solutions that ensure ε-differential privacy handle the problem of disclosing relational and set-valued data in a privacy-preserving manner separately. In this paper, we propose an algorithm that considers both relational and set-valued data in differentially private disclosure of healthcare data. METHODS: The proposed approach makes a simple yet fundamental switch in differentially private algorithm design: instead of listing all possible records (ie, a contingency table) for noise addition, records are generalized before noise addition. The algorithm first generalizes the raw data in a probabilistic way, and then adds noise to guarantee ε-differential privacy. RESULTS: We showed that the disclosed data could be used effectively to build a decision tree induction classifier. Experimental results demonstrated that the proposed algorithm is scalable and performs better than existing solutions for classification analysis. LIMITATION: The resulting utility may degrade when the output domain size is very large, making it potentially inappropriate to generate synthetic data for large health databases. CONCLUSIONS: Unlike existing techniques, the proposed algorithm allows the disclosure of health data containing both relational and set-valued data in a differentially private manner, and can retain essential information for discriminative analysis.
OBJECTIVE: Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among existing privacy models, ε-differential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. All existing solutions that ensure ε-differential privacy handle the problem of disclosing relational and set-valued data in a privacy-preserving manner separately. In this paper, we propose an algorithm that considers both relational and set-valued data in differentially private disclosure of healthcare data. METHODS: The proposed approach makes a simple yet fundamental switch in differentially private algorithm design: instead of listing all possible records (ie, a contingency table) for noise addition, records are generalized before noise addition. The algorithm first generalizes the raw data in a probabilistic way, and then adds noise to guarantee ε-differential privacy. RESULTS: We showed that the disclosed data could be used effectively to build a decision tree induction classifier. Experimental results demonstrated that the proposed algorithm is scalable and performs better than existing solutions for classification analysis. LIMITATION: The resulting utility may degrade when the output domain size is very large, making it potentially inappropriate to generate synthetic data for large health databases. CONCLUSIONS: Unlike existing techniques, the proposed algorithm allows the disclosure of health data containing both relational and set-valued data in a differentially private manner, and can retain essential information for discriminative analysis.
Authors: Mohammed Saeed; Mauricio Villarroel; Andrew T Reisner; Gari Clifford; Li-Wei Lehman; George Moody; Thomas Heldt; Tin H Kyaw; Benjamin Moody; Roger G Mark Journal: Crit Care Med Date: 2011-05 Impact factor: 7.598
Authors: Lucila Ohno-Machado; Zia Agha; Douglas S Bell; Lisa Dahm; Michele E Day; Jason N Doctor; Davera Gabriel; Maninder K Kahlon; Katherine K Kim; Michael Hogarth; Michael E Matheny; Daniella Meeker; Jonathan R Nebeker Journal: J Am Med Inform Assoc Date: 2014-04-29 Impact factor: 4.497