Zhanglong Ji, Xiaoqian Jiang, Shuang Wang, Li Xiong, Lucila Ohno-Machado.
Abstract
BACKGROUND: Privacy protection is an important issue in medical informatics, and differential privacy is a state-of-the-art framework for data privacy research. Differential privacy offers provable privacy guarantees against attackers who have auxiliary information, and it can be applied to data mining models (for example, logistic regression). However, differentially private methods sometimes introduce too much noise, which makes their outputs less useful. Given the public data available in medical research (e.g., from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to decrease the amount of noise that is introduced.
Year: 2014 PMID: 25079786 PMCID: PMC4101668 DOI: 10.1186/1755-8794-7-S1-S14
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Summary of data sets used in our experiments
| Data set | Description | # of attributes | # of records | Class distribution (negative/positive) |
|---|---|---|---|---|
| 1 | German breast cancer | 9 | 686 | 43.6% / 56.4% |
| 2 | Hospital discharge | 17 | 8,668 | 4.4% / 95.6% |
| 3 | SEER breast cancer | 37 | 55,000 | 21.0% / 79.0% |
Attribute description for each data set. Numerical attributes are indicated with "*", non-binary categorical attributes were converted into binary representations through dummy coding, and classification labels are shown in the last row.
| Data set 1 | Data set 2 | Data set 3 |
|---|---|---|
| Age* | | |
| Pos: Alive, Neg: Died | Pos: Not a potential follow-up error, Neg: A potential follow-up error | Pos: Alive, Neg: Died |
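The dummy coding mentioned in the attribute description can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper `dummy_code`, the example attribute `grades`, and the choice to drop the first level as a reference category are all assumptions made for this sketch.

```python
def dummy_code(values, drop_first=True):
    """Convert one categorical column into 0/1 indicator columns.

    Assumption: the first level (in sorted order) serves as the reference
    category and is dropped; the source does not say which convention was used.
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    # One row per record, one 0/1 entry per retained level.
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical three-level attribute, for illustration only.
grades = ["I", "II", "III", "II"]
columns, encoded = dummy_code(grades)
print(columns)
for row in encoded:
    print(row)
```

With `drop_first=True`, a record in the reference category is encoded as all zeros, so a k-level attribute yields k-1 binary columns.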
Figure 1. Effect of different parameters on model discrimination. Note that in 1(e), only our model is affected by the number of iterations.
Figure 2. Comparison of three methods given different external settings: the number of private data sets, the fraction of public data, and the privacy budget. Red and yellow rectangles indicate that our algorithm outperforms the other methods, while green and blue rectangles indicate the opposite.
Figure 3. Boxplot comparisons of models using three different data sets. We use the default parameter values stated at the beginning of this section. For each method, the five lines from bottom to top are the 2.5%, 25%, 50%, 75% and 97.5% quantiles of the AUCs. The p-values of pairwise t-tests on the AUCs are also shown.
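The five boxplot lines described in the Figure 3 caption can be computed from a list of AUC values roughly as below. This is a hedged sketch: the function names `quantile` and `boxplot_lines` are hypothetical, the linear-interpolation rule is an assumption (the source does not say which quantile definition was used), and the sample values are made up for illustration rather than taken from the paper.

```python
def quantile(sorted_xs, q):
    """Quantile by linear interpolation between neighbouring order statistics.

    Assumption: this interpolation rule is one common convention; the
    source does not specify the definition behind the plotted lines.
    """
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac


def boxplot_lines(aucs):
    """Return the 2.5%, 25%, 50%, 75% and 97.5% quantiles, bottom to top."""
    xs = sorted(aucs)
    return [quantile(xs, q) for q in (0.025, 0.25, 0.5, 0.75, 0.975)]


# Made-up AUC values for illustration; not results from the paper.
example_aucs = [0.61, 0.64, 0.66, 0.67, 0.69, 0.70, 0.72, 0.73, 0.75]
print(boxplot_lines(example_aucs))
```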