Lida Kuang1, Samruda Pobbathi1, Yuri Mansury2, Matthew A Shapiro2, Vijay K Gurbani1.
Abstract
The systematic monitoring of private communications through the use of information technology pervades the digital age. One result is the potential availability of vast amounts of data tracking the characteristics of mobile network users. Such data is becoming increasingly accessible for commercial use, which raises questions about the degree to which personal information can be protected. Existing regulations may require the removal of personally identifiable information (PII) from datasets before they can be processed, but research now suggests that powerful machine learning classification methods are capable of targeting individuals for personalized marketing purposes, even in the absence of PII. This study aims to demonstrate how machine learning methods can be deployed to extract demographic characteristics. Specifically, we investigate whether key demographics (gender and age) of mobile users can be accurately identified by third parties using deep learning techniques based solely on observations of the user's interactions within the network. Using an anonymized dataset from a Latin American country, we show the relative ease with which PII in terms of the age and gender demographics can be inferred; specifically, our neural network model estimates gender with an accuracy of 67%, outperforming decision tree, random forest, and gradient boosting models by a significant margin. Neural networks achieve an even higher accuracy of 78% in predicting subscriber age (within ± 5 years). These results suggest the need for a more robust regulatory framework governing the collection of personal data to safeguard users from predatory practices motivated by fraudulent intentions, prejudices, or consumer manipulation. We discuss in particular how advances in machine learning have chiseled away at a number of General Data Protection Regulation (GDPR) articles designed to protect consumers from the imminent threat of privacy violations.
Year: 2022 PMID: 35862447 PMCID: PMC9302812 DOI: 10.1371/journal.pone.0271714
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Summary of literature surveyed.
| Year | Author(s) | Demographics predicted | Artifacts used for modeling |
|---|---|---|---|
| 1991 | James Hartley | Gender | Handwriting |
| 2007 | Hu et al. | Age, gender | Browsing behaviour |
| 2012 | Ying et al. | Gender, and marital status | Behavioral- and environmental features |
| 2014 | Qin et al. | Age, gender | Applications usage |
| 2015 | Seneviratne et al. | Gender | Installed applications |
| 2015 | Wang et al. | Gender | Application response time |
| 2016 | Culotta et al. | Age, gender, ethnicity, education, income, parental status, political preference | Twitter feed |
| 2016 | Malmi et al. | Age, race, income | Installed applications |
| 2017 | Felbo et al. | Age, gender | CDR records, OSS, BSS |
| 2017 | Akter et al. | Age, gender | GPS location |
| 2017 | Avar Pentel | Age, gender | Keystroke analysis and mouse movement |
| 2018 | Qin et al. | Age, gender | Applications usage |
| 2018 | Sangaralingam et al. | Age, gender | Installed applications |
| 2018 | Wood-Doughty et al. | Gender, ethnicity | Twitter feed |
| 2019 | Rafique et al. | Age, gender | Images |
| 2019 | Fang et al. | Age, gender | Images |
| 2019 | Hamme et al. | Age, gender | Gait analysis |
| 2019 | Krismayer et al. | Age, gender, nationality | Music |
| 2019 | Al-Zubai et al. | Age, gender | CDR records, OSS, BSS |
| 2021 | Shafiloo et al. | Age, gender | Movie interests |
Fig 1. Summary of our approach to modeling.
Response variable correlations.
(For gender, Male = 1, Female = 0).
| Dimension | Corr. with age | Corr. with gender |
|---|---|---|
| Payment method | -0.08 | 0.02 |
| Number of devices | -0.06 | 0.04 |
| Number of services | -0.23 | 0.00 |
| Device type | 0.03 | -0.02 |
| Device capability | -0.10 | -0.02 |
| Device age | -0.02 | -0.03 |
| Customer service | 0.05 | 0.03 |
| Need customer service | 0.04 | 0.02 |
| Recharge way | 0.01 | 0.02 |
| Main recharge way | 0.00 | 0.02 |
| Recharge value | 0.00 | 0.06 |
| Recharges | -0.06 | 0.01 |
| Recharges Average | 0.11 | 0.07 |
| Social network (MB) | -0.18 | 0.05 |
| Social network (number of sessions) | -0.19 | 0.03 |
| Instagram (MB) | -0.07 | 0.00 |
| Instagram (number of sessions) | -0.17 | 0.02 |
| Whatsapp (MB) | -0.15 | 0.04 |
| Whatsapp (number of sessions) | -0.17 | 0.03 |
| Bank (MB) | -0.01 | 0.03 |
| Bank (number of sessions) | -0.15 | 0.03 |
| Uber (MB) | -0.02 | 0.03 |
| Uber (number of sessions) | -0.15 | 0.03 |
| Internet (MB) | -0.13 | 0.07 |
| Internet (number of sessions) | -0.13 | 0.03 |
| Received calls | 0.06 | 0.07 |
| Rec’d. call duration | 0.05 | 0.00 |
| Made calls | -0.01 | 0.01 |
| Made calls duration | 0.00 | -0.15 |
Fig 2. Correlation between response variable and atomic/derived dimensions.
Fig 3. Age distribution in the dataset.
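
The coefficients tabulated above can be reproduced in spirit with a plain Pearson correlation; with gender coded as Male = 1, Female = 0, this is the point-biserial case. The sketch below is illustrative only: the source does not state which correlation estimator was used, and the variable names and toy values are hypothetical, not drawn from the paper's dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (assumption: NOT taken from the paper's data).
made_calls_duration = [12.0, 30.0, 8.0, 25.0, 15.0, 40.0]
gender = [1, 0, 1, 0, 1, 0]  # Male = 1, Female = 0, as in the table note

r_gender = pearson(made_calls_duration, gender)
print(f"correlation with gender: {r_gender:.2f}")
```

A negative coefficient under this coding means the feature tends to be larger for users coded 0 (female), which is how a row such as "Made calls duration" (-0.15 with gender) would be read.
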
Correlation of surviving 749 features with the response variables.
| Response variable | Positive correlation (Top 5) | Negative correlation (Top 5) | Avg. | Std. Dev |
|---|---|---|---|---|
| Gender | 0.244, 0.242, 0.241, 0.238, 0.235 | -0.231, -0.229, -0.229, -0.228, -0.227 | 0.122 | 0.054 |
| Age | 0.253, 0.224, 0.222, 0.220, 0.213 | -0.301, -0.295, -0.283, -0.282, -0.282 | -0.066 | 0.101 |
Fig 4. CDF for correlation of 749 attributes with response variables.
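
An empirical CDF of correlation coefficients, as in the figure above, gives the fraction of features whose coefficient is at or below each value. The following is a minimal sketch of that computation. It uses only the ten age coefficients listed in the table (top 5 positive and top 5 negative), not the full set of 749, so the resulting curve is illustrative rather than a reproduction of the figure; the function name is hypothetical.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return (value, fraction of observations <= value) for each distinct value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


# The ten age coefficients shown in the table (a small subset of the 749).
age_correlations = [
    0.253, 0.224, 0.222, 0.220, 0.213,
    -0.301, -0.295, -0.283, -0.282, -0.282,
]

cdf = empirical_cdf(age_correlations)
for value, fraction in cdf:
    print(f"{value:+.3f}  {fraction:.1f}")
```

Tied values (here, the two coefficients of -0.282) collapse into one point, so the CDF stays a proper step function.
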
Correlation of age with features used for modeling.
| Feature | Dimension Type | Corr. Coefficient |
|---|---|---|
| Recharge Value | Atomic | 0.06 |
| Min Recharge Value | Derived | 0.13 |
| Calls made during P2 | Derived | 0.06 |
| Percentage of days social network over-the-top applications used during P2 | Derived | -0.29 |
| Percentage of social network sessions used during P1 | Derived | -0.07 |
| Percent of received calls duration during P2 | Derived | 0.19 |
| Percent of received calls duration during P4 | Derived | -0.16 |
| Percent of days where over-the-top calls were placed during P2 and P3 | Derived | 0.18 |
| Percent of Internet megabytes used during the weekdays | Derived | 0.11 |
| Percentage of social network sessions used on weekends and holidays | Derived | -0.26 |
| Whatsapp traffic (in MB) generated on weekends and holidays | Derived | -0.15 |
Fig 5. A feed-forward neural network.
Fig 6. A simplified graph for the age prediction neural network.
Neural network results on age prediction (accuracy).
| Tolerance | 7 Hidden Layers | 5 Hidden Layers | 3 Hidden Layers |
|---|---|---|---|
| ± 2 years | 0.66 | 0.69 | 0.69 |
| ± 3 years | 0.69 | 0.71 | 0.71 |
| ± 4 years | 0.73 | 0.75 | 0.74 |
| ± 5 years | 0.76 | 0.78 | 0.76 |
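
The rows of the table above suggest that age accuracy is scored with a tolerance: a prediction counts as correct when it falls within ± k years of the true age. The sketch below assumes that reading, which this extract does not spell out, and uses hypothetical ages rather than the paper's data.

```python
def accuracy_within(actual_ages, predicted_ages, tolerance_years):
    """Fraction of predictions within +/- tolerance_years of the actual age."""
    if len(actual_ages) != len(predicted_ages) or not actual_ages:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1
        for actual, predicted in zip(actual_ages, predicted_ages)
        if abs(actual - predicted) <= tolerance_years
    )
    return hits / len(actual_ages)


# Hypothetical ages (assumption: NOT taken from the paper's dataset).
actual = [23, 35, 41, 52, 60, 29, 47, 33]
predicted = [25, 31, 41, 58, 57, 30, 42, 40]

by_tolerance = {k: accuracy_within(actual, predicted, k) for k in (2, 3, 4, 5)}
print(by_tolerance)  # → {2: 0.375, 3: 0.5, 4: 0.625, 5: 0.75}
```

Under this definition accuracy can only stay the same or rise as the tolerance widens, which matches the monotone pattern in each column of the table.
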
Fig 7. A simplified graph for the gender prediction neural network.
Gender prediction confusion matrices (F: Female, M: Male).
| Actual (rows) / Predicted (columns) | F | M |
|---|---|---|
| F | 628 | 422 |
| M | 430 | 678 |

| Actual (rows) / Predicted (columns) | F | M |
|---|---|---|
| F | 693 | 364 |
| M | 343 | 758 |

| Actual (rows) / Predicted (columns) | F | M |
|---|---|---|
| F | 655 | 395 |
| M | 349 | 759 |
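
Overall accuracy follows directly from each confusion matrix as the diagonal (correct predictions) divided by the total count. The sketch below works this out for the three matrices above, using their cell values as given. This extract does not label which model produced which matrix, so no model names are attached; the helper names are hypothetical.

```python
def accuracy(matrix):
    """Overall accuracy of a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes, in F, M order.
    """
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Per-class recall (F, M): correct predictions divided by the row total."""
    return tuple(matrix[i][i] / sum(matrix[i]) for i in range(2))


# Cell values copied from the three confusion matrices in the text.
matrices = [
    [[628, 422], [430, 678]],
    [[693, 364], [343, 758]],
    [[655, 395], [349, 759]],
]

accuracies = [round(accuracy(m), 3) for m in matrices]
print(accuracies)  # → [0.605, 0.672, 0.655]
```

Each matrix sums to 2,158 test cases. The second matrix comes out at about 67%, consistent with the neural network gender accuracy quoted in the abstract, though that attribution is an inference rather than something this extract states.
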