Louise Bezuidenhout, Robert Quick, Hugh Shanahan.
Abstract
Data science skills are rapidly becoming a necessity in modern science. In response to this need, institutions and organizations around the world are developing research data science curricula to teach the programming and computational skills that are needed to build and maintain data infrastructures and maximize the use of available data. To date, however, few of these courses have included an explicit ethics component, and developing such components can be challenging. This paper describes a novel approach to teaching data ethics on short courses developed for the CODATA-RDA Schools for Research Data Science. The ethics content of these schools is centred on the concept of open and responsible (data) science citizenship that draws on virtue ethics to promote ethics of practice. Despite having little formal teaching time, this concept of citizenship is made central to the course by distributing ethics content across technical modules. Ethics instruction consists of a wide range of techniques, including stand-alone lectures, group discussions and mini-exercises linked to technical modules. This multi-level approach enables students to develop an understanding both of "responsible and open (data) science citizenship", and of how such responsibilities are implemented in daily research practices within their home environment. This approach successfully locates ethics within daily data science practice, and allows students to see how small actions build into larger ethical concerns. This emphasises that ethics are not something "removed from daily research" or the remit of data generators/end users, but rather are a vital concern for all data scientists.
Keywords: CODATA; Data ethics; Data science; Open Science; RDA
Year: 2020 PMID: 32067185 PMCID: PMC7417416 DOI: 10.1007/s11948-020-00197-2
Source DB: PubMed Journal: Sci Eng Ethics ISSN: 1353-3452 Impact factor: 3.525
Fig. 1 Schematic diagram of the SRDS curriculum, highlighting the centrality of open and responsible science citizenship
Course breakdown for teaching open and responsible (data) science citizenship
| Subject | Topics covered | Number of hours |
|---|---|---|
| Research data management | Data management, data management plans, FAIR data, repositories | 5 h + 4 h practical |
| Open authorship | Reproducible reporting, DOIs, data licensing, ORCIDs | 4.5 h |
| Responsible conduct of research (RCR) and Open Science | Introduction to ethics, RCR, overview of open science, contextualizing openness and responsibility | 3.5 h |
| Technical data skills | Ethics exercises linked to technical content (see Fig. 2) | Variable |
FAIR data refers to the movement to develop standards to make data Findable, Accessible, Interoperable and Reusable. DOI refers to Digital Object Identifier.
Fig. 2 Structure of the SRDS demonstrating the distribution of ethics prompts
Schematic representation of distribution of content
| Element of science citizenship | Student engagement activities |
|---|---|
| Ethics and RCR | Lecture 1, Lecture 2 |
| Openness and Open Science | Author carpentry, RDM |
| Data provenance | RDM, SQL EP |
| Data tools and infrastructure | Lecture 2, GitHub EP, Security EP, Recommender EP, Neural Networks EP, Computational Infrastructures EP |
| Data practices | Shell EP, R EP, Data Visualization EP |
| Research activity | Tools and instruments | Key ethical considerations | Key challenges in your research environment | How can I get assistance? |
|---|---|---|---|---|
| Create | R, GitHub (versioning), data management plans, research data management, FAIR data checklists, ethical approval, EOSC, FOSTER, OpenAIRE | Is my research ethical and responsible? Have I considered all types of data that I am producing? Have I thought about how to implement FAIR and openness? | | |
| Document | GitHub, research data management, FAIR data checklists, OMERO | Is my metadata sufficient to allow for scrutiny and re-use? | | |
| Use | Analysis, GitHub, R, specific analysis tools | Am I using open analysis tools? Am I complying with the ethical requirements for secondary data use? | ||
| Store | Storage and backup options, Dropbox | Are my data properly curated and annotated for re-use? What are the implications of 3rd party, commercial storage? | | |
| Share | Licensing (Creative Commons), Re3data, DOIs, EOSC, FOSTER, OpenAIRE | Am I using open, sustainable and responsible pathways to sharing? Could my data be misused for negative purposes? How will I control for this? | | |
| Preserve | Archiving | Are my data guaranteed long-term preservation? | | |
| Research activity | Tools and Instruments | Key ethical considerations | Key challenges in your research environment | How can I get assistance? |
|---|---|---|---|---|
| Research misconduct | RCR guidelines, Open Data guidelines, institutional policies | Fabrication, falsification and plagiarism; research causing harms; lack of attribution/respect of licensing | | |
| Conflicts of interest and commitment | RCR guidelines | Biases in research caused by undisclosed conflicts | ||
| Collaborative research | Journal guidelines, memoranda of understanding, licensing | Poor attribution of credit; scooping, theft, loss of control of data | | |
| Authorship and publications | AuthorAid, ORCID | Attribution of credit; Open Access | | |
| Peer review | Publons | Theft of ideas; uncollegial behaviour (bullying, unfair review, etc.) | | |
| Mentorship and trainee relationships | | Appropriation of student’s research; failure in duty to teach; failure in duty to care; teaching in an appropriate fashion | | |
| Module | Question | Response |
|---|---|---|
| GitHub | Content on GitHub can only be made private with a subscription fee. Does the idea of having unpublished work freely open and accessible to anyone bother you? Yes/no | It’s ok to be concerned; thinking carefully about where and how you share is responsible. But it is important to recognize that all content online is “published” in terms of legal and ethical standards: “published” = made public. GitHub and other sharing sites offer the ability to attach legal licenses that require attribution, e.g., Creative Commons. Some sharing sites offer the opportunity to add disclaimers for downstream use of data. Registering outputs with DOIs provides a unique identifier for citing your work. Not all data/projects are appropriate for GitHub, e.g., data with ethical requirements. Be sure you’ve thought through how your work can be used downstream |
| R | R is an example of Free and Open Source Software. It is a community-originated product, and users do not have access to technical support in the same way they would as license holders of proprietary software. Users thus rely on community forums and peer support for assistance when they run into problems. These community forums rely entirely on the time of volunteer users. As a future R user, how important do you think it is to dedicate time to helping other users on community forums? 1. Not really, there are enough people helping 2. I’d like to, but I don’t have the time 3. I’d like to, but I don’t feel that I have the expertise 4. I will try from time to time 5. I will contribute regularly | Contributing to communities online: access to Open Science resources is both a right and a responsibility. Open Science movements are only as strong as their members. Being open is a kind of gift economy: we receive gifts/opportunities but must be willing to give back without expectation of reward. Engaging with a community can lead to unexpected benefits, e.g., learning, collaborations, visibility/prestige, friendships. Following community activities, even if you’re not ready to contribute, can be very useful as you will: get used to how the community operates; identify leaders to follow; learn from discussions; become part of a community that links you across the globe |
| SQL | SQL enables communication with databases, which makes it a powerful tool in research. Many of the databases/datasets that you will be using will be open. This means that they are available for re-use, but it also means that you have a responsibility regarding how you re-use them. Please select all actions you should take: 1. Nothing, if the data are open they are free to be re-used 2. Nothing before I use the data, but I will give credit to the data producer after I use it 3. I must check the metadata for any information about the ethical commitments made by the data producer 4. I must contact the original data producer to tell them about my research 5. I must alert my department so that they can register IP 6. I must check whether the data are licensed under a Creative Commons license 7. I must check the methods by which the original data were produced to ensure that they were responsibly produced 8. I must check that the data have been reused in other published papers 9. I must email the database curator my data management plan | Always: check the metadata for any information about the ethical commitments made by the data producer; check whether the data are licensed under a Creative Commons license; check the methods by which the original data were produced to ensure that they were responsibly produced. Good, but not necessary: contacting the original data producers to tell them about your research; checking that the data have been reused in other published papers; emailing the database curator your data management plan. Never: doing nothing because the data are open and free to be re-used; doing nothing before use and only crediting the data producer afterwards; alerting your department so that they can register IP. Using open data/bases is a privilege and a responsibility. You can show your respect for the data you use by being as open and transparent as possible. However, before using any data, open or not, you must always check for the ethical commitments attached to the data (check the metadata, but if in doubt email the original producer); check licensing, since even Open Data may have restrictions on use (check the CC website for descriptions of the different licenses); and check the methods by which the data were produced: was it responsible research practice? Is the research robust and reproducible? |
| Data visualisation | Exercise adapted from O’Brien, 2017 | Data visualizations are used to communicate information about important social issues to large audiences. Ethical problems in data visualizations can be intentional or unintentional. A visualization may use deceptive techniques that have the potential to alter the audience’s understanding of the information being presented. Common deceptive data visualization techniques include message exaggeration/understatement and message reversal (i.e., flipping or inverting the axis of a chart). Data visualizations carry the same ethical importance as other forms of communication. Like journalists, technical communicators must follow a set code of ethics. According to the Society for Technical Communication (STC), “as technical communicators, we observe the following ethical principles in our professional activities”, listing legality, honesty, confidentiality, quality, fairness, and professionalism as the main ethical categories for technical communicators |
| Information security | Q1: Is Open Source software likely to be more or less secure than proprietary software? Q2: You need to encrypt sensitive data for an international research group. The government in the country where you live mandates a certain encryption technology and is widely suspected of leaving “backdoors” (i.e., ways to access the data without knowing the encryption key). How do you respond? | A1: Generally more secure, if there are active contributors reviewing the code and resolving security issues. However, it is also easier for attackers to understand the code and look for security holes. A2: There are several considerations. In some cases you may have no choice but to follow the law; in others it may make sense for the data to be stored in a different country, since multiple countries are involved in the collaboration. Even if you trust your government, encryption backdoors can also be exploited by attackers |
| Recommender systems | Minneapolis, 2012. Target is a large retail firm in the USA that uses data analytics and recommender systems to tailor coupons to its customers. In Minneapolis in 2012 a customer approached the manager: “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.” The selection of coupons that the daughter received was based on the data collected from her store loyalty card. Changes in purchasing behaviour were linked to certain likely outcomes, leading to the receipt of pregnancy-related coupons. Select all the statements you agree with: 1. I think that targeted marketing is unproblematic 2. If the daughter accepted a store loyalty card, she should have read the terms and conditions and accepted the consequences 3. I think that there should have been some filters built in to avoid targeting under-18s 4. I think that it is not right that Target employees are placed in situations where they cannot explain their company’s marketing decisions 5. I think that it is not right that Target has the ability to impose on the privacy of the daughter 6. I think that directed marketing can only be used for certain medical conditions (like pregnancy) but not others (like sexually transmitted diseases) 7. I think that directed marketing can cause harm, as it should not be assumed that all women respond to their pregnancy in the same way | Recommender systems should not be thought of as neutral. The choice of datasets to utilize, the links made and the responses prompted all reflect specific cultural values and assumptions. While these values are not necessarily unethical in their own right, it is important to recognize that they can have unintended consequences. It is important for developers of recommender systems to be aware of the values that they introduce into the system, and of their implications for the broader society |
| Artificial neural networks | Neural networks (and other machine learning techniques) can be trained to identify different types of “events” from data. One application that is currently very topical is identifying terrorists (from web browser history, travel, purchases, who they communicate with, etc.). The use of such applications by the state raises concerns about human rights and excessive surveillance. Other concerns, of course, relate to the mis-identification of individuals, which can be anticipated a small percentage of the time. Suppose such a technique identifies an individual as having a high probability of committing a terrorist act in the near future. Maybe 95%, and that’s an honest number, not the prosecutor’s fallacy. But he/she has not broken any law (yet), and the neural net just says “this is typical terrorist behaviour”: it does not give any reason. This, of course, leads to a number of different possibilities. For instance, some governments would curtail the individual’s liberty, even though they are innocent. Others might choose not to intervene, even though people may die. If you take two patterns which are identical except for the religion, or ethnicity, of the individual, your neural net will probably give very different answers. What answer best reflects your position: 1. Distinguishing individuals by religion or ethnicity perpetuates stereotypes, and these attributes should not be used to distinguish individuals 2. Using all resources available is justified to save lives, thus the use of neural networks is justified 3. It is unacceptable to impose on individual freedom through pre-emptive interventions based on neural networks 4. Neural networks should only be used by a government on its own citizens; being part of the online environment is not an invitation for foreign powers to use information about non-citizen individuals | The use of machine learning techniques for civil governance is very controversial, and opinions are very divided as to whether they are just. A problematic aspect of these techniques is that the general public has little likelihood of understanding how these systems are set up. The users are also unlikely to share details, for fear that the systems could be hacked, gamed or appropriated. Many decisions made by users of these systems go unchallenged. Data scientists are in a good position to monitor the development and deployment of such systems, due to their ability to engage with the technical aspects of these systems |
| Research computational infrastructures | The Association for Computing Machinery (ACM) publishes a code of ethics. Exercise: read the code, paying special attention to section 3.7, which deals with computational infrastructures | Recognize and take special care of systems that become integrated into the infrastructure of society. Even the simplest computer systems have the potential to impact all aspects of society when integrated with everyday activities such as commerce, travel, government, healthcare, and education. When organizations and groups develop systems that become an important part of the infrastructure of society, their leaders have an added responsibility to be good stewards of these systems. Part of that stewardship requires establishing policies for fair system access, including for those who may have been excluded. That stewardship also requires that computing professionals monitor the level of integration of their systems into the infrastructure of society. As the level of adoption changes, the ethical responsibilities of the organization or group are likely to change as well. Continual monitoring of how society is using a system will allow the organization or group to remain consistent with their ethical obligations outlined in the Code. When appropriate standards of care do not exist, computing professionals have a duty to ensure they are developed |
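The SQL module above asks students to check metadata and licensing before re-using open data. That habit can be practised directly in code; the following is a minimal sketch in which the `datasets` table, its columns, and the records are invented for illustration:

```python
import sqlite3

# Hypothetical open-data catalogue: table name, columns and rows are
# made up for this example, not drawn from any real repository.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (name TEXT, license TEXT, ethics_notes TEXT)")
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    [("survey_2019", "CC-BY-4.0", "consent covers secondary analysis"),
     ("clinic_logs", "CC-BY-NC-4.0", "no re-identification attempts")])

# Before re-use: inspect the license and any ethical commitments
# recorded alongside each dataset.
for name, license_, notes in conn.execute(
        "SELECT name, license, ethics_notes FROM datasets ORDER BY name"):
    print(f"{name}: licensed {license_}; ethics: {notes}")
```

The point of the sketch is that license and ethics information can live in queryable metadata, so checking it is one `SELECT` away rather than an afterthought.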
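The data visualisation module above names message exaggeration via a manipulated axis as a common deceptive technique. A small arithmetic sketch (the values 50 and 52 and the axis minima are invented for illustration) shows how truncating the y-axis inflates an apparent difference:

```python
# Invented example data: Group B is only 4% larger than Group A.
a, b = 50.0, 52.0

def drawn_ratio(ymin: float) -> float:
    """Apparent bar-height ratio of B to A when the y-axis starts at ymin."""
    return (b - ymin) / (a - ymin)

honest = drawn_ratio(0.0)      # axis starts at zero: bars differ by 4%
truncated = drawn_ratio(49.5)  # axis cut just below the data

print(f"honest axis: B appears {honest:.2f}x as tall as A")
print(f"truncated axis: B appears {truncated:.2f}x as tall as A")
```

With the axis starting at zero the bars differ by 4%; truncated to 49.5, the same data makes B appear five times as tall as A, which is exactly the exaggeration effect the module warns about.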
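The neural-network module above is careful to stipulate that its 95% is "an honest number, not the prosecutor's fallacy", i.e. a genuine posterior rather than a detection rate. A short Bayes' rule sketch, with entirely invented numbers (the base rate, sensitivity and false-positive rate are assumptions for illustration), shows why that distinction matters:

```python
# Bayes' rule with invented numbers: all three inputs below are
# assumptions for illustration, not real surveillance statistics.
base_rate = 1e-5        # prior: 1 person in 100,000
sensitivity = 0.95      # P(flagged | terrorist)
false_positive = 0.001  # P(flagged | not terrorist)

# Total probability of being flagged, then the posterior.
p_flagged = sensitivity * base_rate + false_positive * (1 - base_rate)
posterior = sensitivity * base_rate / p_flagged  # P(terrorist | flagged)

print(f"P(terrorist | flagged) = {posterior:.2%}")
```

Under these assumptions a classifier that catches 95% of true cases still yields a posterior under 1%, because the base rate is so low; quoting the 95% sensitivity as if it were the posterior is the prosecutor's fallacy the exercise rules out.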