Luke Sloan, Curtis Jessop, Tarek Al Baghal, Matthew Williams.
Abstract
Linked survey and Twitter data present an unprecedented opportunity for social scientific analysis, but the ethical implications for such work are complex, requiring a deeper understanding of the nature and composition of Twitter data to fully appreciate the risks of disclosure and harm to participants. In this article, we draw on our experience of three recent linked data studies, briefly discussing the background research on data linkage and the complications around ensuring informed consent. Particular attention is paid to the vast array of data available from Twitter and in what manner it might be disclosive. In light of this, the issues of maintaining security, minimizing risk, archiving, and reuse are applied to linked Twitter and survey data. In conclusion, we reflect on how our ability to collect and work with Twitter data has outpaced our technical understandings of how the data are constituted and observe that understanding one's data is an essential prerequisite for ensuring best ethical practice.
Keywords: Twitter; archiving; consent; disclosive; ethics; linked data; reuse; surveys
Year: 2019 PMID: 31220995 PMCID: PMC7049949 DOI: 10.1177/1556264619853447
Source DB: PubMed Journal: J Empir Res Hum Res Ethics ISSN: 1556-2646 Impact factor: 1.742
Analysis of Disclosure Risk by Twitter JSON Attributes.
| Relating to | Attribute | Description | Nature of risk | Risk of identifying an individual |
|---|---|---|---|---|
| Tweet | text | The actual text of the tweet | If not a retweet, then unique content and directly identifiable | High |
| Tweet | retweet_count | The number of times a tweet has been retweeted | Changeable and dynamic, unlikely to be unique | Low* |
| Tweet | favorite_count | The approximate number of times a tweet has been liked by other users | Changeable and dynamic, unlikely to be unique | Low* |
| Tweet | favorited | Indicates whether a user has favorited the tweet | Binary categorical variable, common practice to “favorite” a tweet | Negligible |
| Tweet | truncated | Whether a tweet text has been truncated (greater than 140 characters) | Binary categorical variable, truncation common with new 280 character tweet limit | Negligible |
| Tweet | id_str | The numeric (string) version of the unique identifier for this tweet | Unique content, directly identifiable—often deposited to allow other researchers to “rehydrate” Twitter data sets | High |
| Tweet | in_reply_to_screen_name | If the tweet is a reply to another tweet, this is the name of the original tweet’s author | Evidence of Twitter correspondence with another unique user, may or may not represent someone in their network, often used for responding to public individuals (e.g., politicians) but could also be used to respond to users who are closely connected | Variable |
| Tweet | source | The utility used to post the tweet (e.g., Tweets posted from the Twitter website have a source of “web”) | Unlikely to pose a risk as alternative Twitter posting tools are in widespread use | Negligible |
| Tweet | retweeted | Indicates whether the tweet has been retweeted by the user | Binary categorical variable, common practice to retweet | Negligible |
| Tweet | created_at | Creation date and time of the tweet to the second (in UTC) (e.g., Tuesday November 23 12:46:54 +0000 2018) | On average there are 6,000 tweets created every second ( | Low |
| Tweet | in_reply_to_status_id_str | If the tweet is a reply to another tweet, this is the ID of the original tweet | Represents part of a conversation that the user is taking part in; could be used to identify an individual if the number of responses to the original tweet is small | Variable |
| Tweet | in_reply_to_user_id_str | If the tweet is a reply to another tweet, this is the ID of the original tweet’s author | Evidence of Twitter correspondence with another unique user, may or may not represent someone in their network, often used for responding to public individuals (e.g., politicians) but could also be used to respond to users who are closely connected | Variable |
| Tweet | lang | The language of the tweet text (machine-detected) | Machine detection will allocate to one language or mark as “undetected,” will only identify a single language, might well not be the same as language of interface, can change with every tweet (dynamic) | Negligible* |
| Tweet | expanded_url | Full (expanded) version of a URL included in the tweet | Depends where the URL points to, often to generic content (e.g., BBC News story) but could be to personal website or blog | Variable |
| Tweet | url | Wrapped URL corresponding to the value directly embedded into the raw tweet text | Depends where the URL points to, often to generic content (e.g., BBC News story) but could be to personal website or blog | Variable |
| User | listed_count | The number of public lists that the user is a member of | Unlikely to be unique | Low* |
| User | verified | Whether account has been verified (account of “public interest”) | Binary categorical variable, not unusual and could include actors, musicians, journalists, politicians, organizations, and so on. | Negligible |
| User | location | The location defined by the user | May or may not represent where the user lives or works, but potentially could place user in a low-level spatial unit | Variable |
| User | user_id_str | The numeric (string) version of the unique identifier for this user | Unique identifier, directly identifies the user | High |
| User | description | User-defined description of their account, often used as a “bio” | Regardless of what the user writes, this is likely to be unique to the individual | High |
| User | geo_enabled | User has enabled the possibility of geotagging their tweets | Simply enables geotagging, does not enforce it. Binary categorical variable—research suggests that 41.6% of users have this setting enabled ( | Negligible |
| User | user_created_at | Creation date and time of the user account to the second (in UTC) (e.g., Tuesday November 23 12:46:54 +0000 2018) | Potentially unique to the individual due to high level of temporal granularity, note that offset (“+0000”) can be used to determine time zone (but see later comment on GDPR) | High |
| User | statuses_count | The number of tweets and retweets posted by the user | Changeable and dynamic, unlikely to be unique | Low* |
| User | followers_count | The number of followers the user account currently has | Changeable and dynamic, unlikely to be unique | Low* |
| User | favourites_count | The number of tweets the user has favorited since the account was created | Changeable and dynamic, unlikely to be unique | Low* |
| User | protected | Whether account is protected (tweets only visible to followers) | Binary categorical variable, not unusual practice | Negligible |
| User | user_url | A URL given by the user, normally a link to a personal/organizational website | Not necessarily unique, but will be in some cases, not unusual for users to direct to personal websites | High |
| User | name | The self-defined name of the user | Not necessarily the name of a person, but often is | High |
| User | time_zone | The time zone of the user | If present will place the user in a large-scale geography, but from May 23 has been returned as “null” (private field) due to EU privacy laws | NA |
| User | user_lang | The user’s choice of interface language | Twitter is available in 47 languages (at time of writing), may well not be the same as the language in which tweets are written, can change but most likely to be static | Negligible |
| User | utc_offset | The difference in hours and minutes between user time zone and UTC | If present will place the user in a large-scale geography, but from May 23 has been returned as “null” (private field) due to EU privacy laws | NA |
| User | friends_count | The number of accounts this user is following | Changeable and dynamic, unlikely to be unique | Negligible* |
| User | screen_name | The screen name (aka handle) of a user | Screen name can change (dynamic) but is always unique, an individual identifier | High |
| Geo | country_code | Two letter code of the country a tweet was issued from, or is about | May be derived from an exact point coordinate (lat/long), or from a place selected by a user such as a city. In the latter case, this may be the country of the place the user is tweeting from, or a place that they are tweeting about. Either way, on its own this represents a high-level geography | Negligible |
| Geo | country | Name of the country a tweet was issued from or is about | May be derived from an exact point coordinate (lat/long), or from a place selected by a user such as a city. In the latter case, this may be the country of the place the user is tweeting from, or a place that they are tweeting about. Either way, on its own this represents a high-level geography | Negligible |
| Geo | place_type | The nature of the location the tweet was issued in, or is about, such as a city or POI | Classification of place identified by user (either selected or derived from point coordinates) is generic and unlikely to be problematic | Negligible |
| Geo | full_name | Full name (string) of place, for example, “San Francisco, CA” | Could lead to low-level spatial data if point coordinates, or user selection, results in identifying a city or town | Variable |
| Geo | place_name | Short name (string) of place, for example, “San Francisco” | Could lead to low-level spatial data if point coordinates, or user selection, results in identifying a city or town | Variable |
| Geo | place_id | Unique ID (string) of place | Could lead to low-level spatial data if point coordinates, or user selection, results in identifying a city or town | Variable |
| Geo | place_lat | Center point of the location the tweet was issued in, or is about (latitude) | Gives a latitude value at the centroid of the location (e.g., center of Manchester), may or may not be where the user was when tweet was posted, unlikely to be of use without corresponding longitude value | Low |
| Geo | place_lon | Center point of the location the tweet was issued in, or is about (longitude) | Gives a longitude value at the centroid of the location (e.g., center of Manchester), may or may not be where the user was when tweet was posted, unlikely to be of use without corresponding latitude value | Low |
| Geo | Lat | Latitude of tweet location | Precise latitude of where user was when they tweeted, potentially could be at home or work, alternatively may be commuting. Either way has considerable potential to locate individuals in low-level geographies, but this is significantly reduced without longitude value | Medium* |
| Geo | Lon | Longitude of tweet location | Precise longitude of where user was when they tweeted, potentially could be at home or work, alternatively may be commuting. Either way has considerable potential to locate individuals in low-level geographies, but this is significantly reduced without latitude value | Medium* |
Note. JSON = JavaScript Object Notation; GDPR = General Data Protection Regulation; POI = point of interest; EU = European Union.
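
As an illustration of how the table above might be applied in practice, the following Python sketch filters a tweet record so that only attributes rated at or below a chosen risk level are retained. It is a minimal sketch under stated assumptions, not a method described by the authors: the record is assumed to be a flat dictionary whose keys match the attribute names in the table, the function name `reduce_record` is hypothetical, only a subset of attributes is listed, and the asterisks attached to some ratings (not explained in this excerpt) are ignored.

```python
# Risk ratings copied from the table above (subset only; asterisks dropped).
# Assumption: the tweet has already been flattened so that keys match the
# attribute names used in the table (e.g. "user_id_str", "screen_name").
RISK_BY_ATTRIBUTE = {
    "text": "High",
    "id_str": "High",
    "retweet_count": "Low",
    "favorite_count": "Low",
    "created_at": "Low",
    "lang": "Negligible",
    "source": "Negligible",
    "user_id_str": "High",
    "screen_name": "High",
    "name": "High",
    "description": "High",
    "user_created_at": "High",
    "user_url": "High",
    "location": "Variable",
    "followers_count": "Low",
    "verified": "Negligible",
    "country_code": "Negligible",
    "full_name": "Variable",
}


def reduce_record(record, allowed_levels=frozenset({"Negligible", "Low"})):
    """Keep only attributes whose tabulated risk is in allowed_levels.

    Attributes that do not appear in RISK_BY_ATTRIBUTE are dropped, which
    is the conservative choice when their risk has not been assessed.
    """
    return {
        key: value
        for key, value in record.items()
        if RISK_BY_ATTRIBUTE.get(key) in allowed_levels
    }


if __name__ == "__main__":
    example = {
        "text": "hello",
        "retweet_count": 3,
        "lang": "en",
        "screen_name": "x",
        "location": "Cardiff",
        "unknown_field": 1,
    }
    print(reduce_record(example))
```

Filtering of this kind reduces what is held; it should not be read as anonymization, since attributes rated individually low may still be identifying in combination or once linked to survey responses.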
Principles for Maintaining Security (Linked Twitter and Survey Data).
| 1. Systematic processing | As much as possible, data should be managed in a systematic and considered manner. Based on the processes used for linking survey and administrative records ( |
| 2. Data reduction | To conduct analysis for any given research question, it is likely that not all of the available survey and Twitter data need to be linked together. As such, only the survey and Twitter data necessary for analysis should be made available for linkage. |
| 3. Controlled access | Throughout the data management process, access to identifiable data should be limited to those who need it to minimize the risks of disclosure. The linked data should be held securely, so that access is granted only to those who need it, and those people with access should be documented and have appropriate training for working with identifiable data. |
| 4. Data deletion | Data should only be held for as long as is necessary for analysis to be conducted. Once the project is complete, as with other forms of personal data, data should be securely deleted and archived if necessary. |
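
The data reduction principle above can be sketched in Python as a linkage step that carries forward only the fields named for a given analysis. This is a hedged illustration rather than the procedure used in the studies: the function `link_records`, the linking key `study_id`, and the example field names are all hypothetical, and both inputs are assumed to be lists of dictionaries that already share a pseudonymous key.

```python
def link_records(survey_rows, twitter_rows, survey_fields, twitter_fields,
                 key="study_id"):
    """Join survey and Twitter rows on `key`, keeping only the named fields.

    Any survey or Twitter field not explicitly requested is left out of the
    linked output, so the linked file holds no more than the analysis needs.
    """
    # Index Twitter rows by the linking key; one participant may have many tweets.
    twitter_by_key = {}
    for row in twitter_rows:
        twitter_by_key.setdefault(row[key], []).append(row)

    linked = []
    for survey_row in survey_rows:
        # Participants with no Twitter rows produce no linked output.
        for tweet_row in twitter_by_key.get(survey_row[key], []):
            out = {key: survey_row[key]}
            out.update({f: survey_row[f] for f in survey_fields})
            out.update({f: tweet_row[f] for f in twitter_fields})
            linked.append(out)
    return linked


if __name__ == "__main__":
    survey = [
        {"study_id": 1, "age_band": "25-34", "postcode": "CF10"},
        {"study_id": 2, "age_band": "35-44", "postcode": "M1"},
    ]
    tweets = [
        {"study_id": 1, "lang": "en", "retweet_count": 2, "text": "hi"},
        {"study_id": 1, "lang": "cy", "retweet_count": 0, "text": "x"},
    ]
    for row in link_records(survey, tweets, ["age_band"], ["lang", "retweet_count"]):
        print(row)
```

Requiring the caller to list fields explicitly, rather than listing fields to exclude, means that a newly added survey or Twitter variable is not linked by default.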
Figure 1. Data flow diagram for linking survey and Twitter data.
Summary of Considerations for the Ethical Linkage of Survey and Twitter Data.
| 1. Consent | Large-scale social media data collection disrupts “traditional” approaches to collecting informed consent, as the increased distance between researcher and research participant, and the scale of the number of participants, can make that interaction impractical, while the “public” nature of the data and agreement to platform “Terms of Service” calls into question the necessity of further consent. |
| 2. Disclosure | Due to its “searchable” nature, unlike traditional quantitative and/or qualitative data, Twitter data often cannot be anonymized without a substantial loss in utility. As a result, anonymization should not, in most instances, be relied upon as a means of maintaining data security. This is potentially problematic for Twitter data collection in general, but is particularly challenging where it is linked to survey data which the participant has not chosen to put into the public domain. |
| 3. Security | Given the complexity of linking survey and Twitter data, the resulting difficulty in achieving truly “informed” consent, and the inability to consistently rely on anonymization of data to protect participants from risk of harm, increased emphasis should be placed on maintaining the security of data throughout the research process. |
| 4. Archiving | The archiving or sharing of linked survey and Twitter data for further analysis carries the same problems as the initial processing of the data. As such, this may be done ethically provided that informed consent has been obtained and the data are archived or shared in a systematic and controlled manner. |
Note. API = application programming interface.