Generation of a cleaned dataset listing Avon Longitudinal Study of Parents And Children peer-reviewed publications to 2015.

Oliver Butters, Amran Ismail, Sue Thompson, Rebecca Wilson.

Abstract

Birth cohort studies generate huge amounts of data, and as a consequence are a source of many peer reviewed publications. We have taken the list of publications from the Avon Longitudinal Study of Parents and Children UK birth cohort, filtered, de-duplicated and cleaned it to generate a bibliographic research data set. This dataset could be used for accurate reporting and monitoring of the impact of the study as well as bibliometric research.

Keywords:  ALSPAC; Bibliography; Birth cohort

Year: 2018; PMID: 30828646; PMCID: PMC6392145; DOI: 10.12688/wellcomeopenres.14986.1

Source DB: PubMed; Journal: Wellcome Open Res; ISSN: 2398-502X


Introduction

Birth cohort studies in the U.K. generate and distribute huge amounts of longitudinal data for medical, social and economic research. Researchers generally apply for data, which are released once the relevant governance conditions have been met [1]. These studies often keep track of the publications that have arisen from the data they have released, both for project monitoring purposes and to report back to the funder(s). The size of these publication lists is sometimes used as a crude metric of the research output or impact of the study.
Most modern academic journals assign a unique persistent identifier to new publications. This identifier may be unique and resolvable by the journal, but may be meaningless outside the journal’s ecosystem. The Digital Object Identifier (DOI) is the de facto persistent identifier used as an independent external reference to publications, posters, data, software etc. DOI resolving services exist to refer users (human and machine) to the journal web page for a given DOI; CrossRef holds over 100 million such records. These resolving services also host a wealth of metadata themselves. The DOI data model outlines the format of DOI data.
In addition to CrossRef, there exist other resolving and metadata services that are domain-specific. These may hold more in-depth metadata about their domain than the general DOI data model can offer. In this work we also make use of the persistent identifiers that the National Center for Biotechnology Information (NCBI) PubMed generates (PubMed IDs, PMIDs), and the metadata its resolving service provides [2]. This offers extra metadata over and above that available from CrossRef, but only for medically focused publications, i.e. a subset of all publications arising from birth cohort studies.
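
To make the two identifier types concrete, the following is a minimal sketch of how a raw DOI or PMID string might be normalised and turned into a resolver URL. The function names and the DOI regular expression are illustrative assumptions (the pattern is a common heuristic, not part of the DOI specification); the sketch performs no network requests.

```python
import re

# Heuristic pattern for a bare DOI (an assumption, not a formal definition).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that commonly appear in front of a DOI in hand-curated lists.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw):
    """Return a bare, lower-cased DOI, or None if the string does not look like one."""
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # DOIs are case-insensitive, so lower-casing gives a stable comparison key.
    return doi.lower() if DOI_PATTERN.match(doi) else None


def doi_url(doi):
    """Build the doi.org resolver URL for a bare DOI."""
    return "https://doi.org/" + doi


def pmid_url(pmid):
    """Build the PubMed web page URL for a PMID."""
    return "https://pubmed.ncbi.nlm.nih.gov/" + str(pmid) + "/"


if __name__ == "__main__":
    doi = normalise_doi("doi:10.12688/wellcomeopenres.14986.1")
    print(doi_url(doi))
    print(pmid_url(30828646))
```

Lower-casing is a design choice for matching identifiers across sources; the original capitalisation can be kept separately for display.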
In this paper we describe how we created a cleaned, de-duplicated list of peer-reviewed publications arising from the Avon Longitudinal Study of Parents and Children (ALSPAC). ALSPAC began in 1990 (see the cohort profiles for an overview [3, 4]), and its publications fall within the biomedical research domain. ALSPAC reports over 1800 publications as of August 2018 [5]. The study website contains details of all the available data through a fully searchable data dictionary and variable search tool.

Data cleaning

The ALSPAC master list of publications at the time this project started (2014) consisted of a large table in a Microsoft Word document. This table was imported into a spreadsheet containing a reference to each publication, a DOI and a PMID. Given the amount of time that has passed since the original master list was parsed, we have merged this list with the list of publications on the ALSPAC website as at 12/9/18. A small number of publications in the original Microsoft Word document are not present on the website; we include these here for completeness.
Each publication was audited manually to ensure it was peer-reviewed, i.e. that the journal had a defined peer-review process and/or that it appeared in the Ulrichsweb Global Serials Directory with a "refereed" status. Non-peer-reviewed articles were removed from the publications list. Examples of non-peer-reviewed publications included theses, book chapters, published abstracts, opinion articles, comments on other articles, working papers and technical reports.
The DOI and PMID for each entry were also audited to validate the identifier and ensure it corresponded to the correct article. A common error was the truncation of a PMID; because PMIDs are purely numeric, the truncated value was itself a valid PMID, albeit one referring to the wrong publication. If a DOI or PMID was missing from a publication, wherever possible it was sourced from the journal or PubMed directly.
The DOI and PMID fields from the publications spreadsheet were used to import the publications list into a bibliographic library in Zotero. Zotero uses NCBI PubMed to resolve PMIDs and CrossRef to resolve DOIs. We then further cleaned the list of publications using Zotero’s native de-duplication feature.
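
Because a truncated PMID is still a syntactically valid PMID, a format check cannot catch it; the identifier has to be compared against the article it is supposed to point to. The audit described above was manual. The following is only a hedged sketch of how such a check might be partly automated, by comparing the title in the master list with the title returned for the PMID. The records, the `resolved_titles` lookup (standing in for metadata fetched from a resolving service) and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised titles."""
    return SequenceMatcher(None, normalise_title(a), normalise_title(b)).ratio()


def flag_mismatched_pmids(records, resolved_titles, threshold=0.8):
    """Return the PMIDs whose resolved title does not match the listed title."""
    flagged = []
    for rec in records:
        resolved = resolved_titles.get(rec["pmid"])
        if resolved is None or title_similarity(rec["title"], resolved) < threshold:
            flagged.append(rec["pmid"])
    return flagged


# Hypothetical master-list entries; the second PMID has lost its last digit.
records = [
    {"pmid": "12345678", "title": "Hypothetical cohort study of childhood diet"},
    {"pmid": "1234567", "title": "Hypothetical cohort study of maternal health"},
]

# Hypothetical titles as they might be returned by a resolving service.
resolved_titles = {
    "12345678": "Hypothetical cohort study of childhood diet.",
    "1234567": "Placeholder article on an unrelated topic",
}

if __name__ == "__main__":
    print(flag_mismatched_pmids(records, resolved_titles))
```

Flagged entries would still need a manual look, since a low similarity score can also come from a legitimately reworded title.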
Duplicates often arose in the bibliography when a publication was accepted in one year and then appeared online the next, or when it was listed with a DOI in one case and a PMID in another. Another common source of duplicates was having both the pre-print and the final published paper listed as separate items. In this case we disregarded the pre-print. Given that publications are not necessarily reported to ALSPAC on acceptance by a journal, and some journals have a long turnaround in publication time, we chose a cut-off of the end of 2015 for this dataset. Because the publication year of some entries was misclassified, we considered all publications listed up to the end of 2016 (as defined by the list on the ALSPAC website), and then disregarded any that had a publication date after the end of 2015. These criteria left us with 1300 peer-reviewed publications claimed by ALSPAC to the end of 2015. Table 1 shows a summary of the data.
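
The de-duplication and cut-off steps can be illustrated with a short sketch. This is not Zotero's de-duplication algorithm, which was the tool actually used; it is a simplified stand-in, assuming each record is a dictionary with `title`, `year`, `doi` and `pmid` keys, and all records shown are hypothetical. An entry is treated as a duplicate if it shares a DOI, a PMID or a normalised title with an earlier entry, which covers the case of one listing carrying only a DOI and another only a PMID.

```python
def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def dedupe_and_cut(records, cutoff_year=2015):
    """Drop records published after cutoff_year, then drop duplicates.

    The first occurrence of each publication is kept.
    """
    seen = set()
    kept = []
    for rec in records:
        if rec["year"] > cutoff_year:
            continue
        keys = {("title", normalise_title(rec["title"]))}
        if rec.get("doi"):
            keys.add(("doi", rec["doi"].lower()))
        if rec.get("pmid"):
            keys.add(("pmid", str(rec["pmid"])))
        is_duplicate = bool(keys & seen)
        # Remember every key, so later entries can match on any identifier.
        seen |= keys
        if not is_duplicate:
            kept.append(rec)
    return kept


# Hypothetical records: A appears twice (once by DOI, once by PMID),
# B falls after the cut-off, C is unique.
records = [
    {"title": "Hypothetical Study A", "year": 2014, "doi": "10.1234/EXAMPLE.A", "pmid": None},
    {"title": "Hypothetical study A.", "year": 2015, "doi": None, "pmid": "11111111"},
    {"title": "Hypothetical study B", "year": 2016, "doi": "10.1234/example.b", "pmid": "22222222"},
    {"title": "Hypothetical study C", "year": 2015, "doi": "10.1234/example.c", "pmid": "33333333"},
]

if __name__ == "__main__":
    for rec in dedupe_and_cut(records):
        print(rec["year"], rec["title"])
```

Keeping the first occurrence is a simplification: it does not merge identifiers from the discarded copy, and it would not by itself prefer a final published paper over a pre-print with a different title.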
Table 1. Data coverage. Percentages rounded down in each case.

Date range               1989–2015
Publication count        1300
DOIs (%)                 97
PMIDs (%)                95
Publication title (%)    100
Year published (%)       100
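
The coverage figures in Table 1 are percentages rounded down. As a minimal sketch of that calculation, assuming each record is a dictionary in which a missing field is absent, empty or `None` (the records below are hypothetical, not taken from the dataset):

```python
def coverage_percent(records, field):
    """Percentage of records with a non-empty value for field, rounded down."""
    present = sum(1 for rec in records if rec.get(field))
    # Integer floor division rounds down without floating-point error.
    return 100 * present // len(records)


# Hypothetical records for illustration only.
records = [
    {"title": "Hypothetical study A", "doi": "10.1234/example.a", "pmid": "11111111"},
    {"title": "Hypothetical study B", "doi": "10.1234/example.b", "pmid": None},
    {"title": "Hypothetical study C", "doi": "", "pmid": None},
]

if __name__ == "__main__":
    for field in ("doi", "pmid", "title"):
        print(field, coverage_percent(records, field))
```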

Data description

To make this list of publications available to others in as useful a way as possible, we exported it from our Zotero library in two formats: BibTeX, which can be imported into any reference manager, and comma-separated values (CSV), which can be imported into analysis tools for bibliometric analysis. These formats are described in Table 2 and Table 3, respectively. Zotero v5.0.56 was used to export the data.
Table 2. A data description of the BibTeX ALSPAC peer reviewed publications list to 2015.

Variable        Description
citation key    A unique identifier
title           Article title
author          Name(s) of author(s)
abstract        Article abstract
journal         Journal title
volume          Journal volume
number          Journal issue
pages           Article page numbers in the journal
year            Year published
month           Month published
keywords        Article keywords
issn            International Standard Serial Number
doi             Digital Object Identifier
pmid            PubMed identifier
pmcid           PubMed Central identifier
Table 3. A data description of the CSV file of ALSPAC peer reviewed publications list to 2015.

Variable             Description
Year                 Year published
Author               Name(s) of author(s)
Title                Article title
Publication title    Journal title
ISSN                 International Standard Serial Number
DOI                  Digital Object Identifier
Abstract Note        Article abstract
Date                 Date article published
Pages                Article page numbers in the journal
Issue                Journal issue
Volume               Journal volume
Extra                PubMed and/or PubMed Central ID
Manual tags          Article keywords
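
As an illustration of how the CSV export might be used for bibliometric analysis, the sketch below reads a CSV with a subset of the columns in Table 3, extracts identifiers from the Extra column and counts publications per year. The sample rows are hypothetical, and the `PMID: ...` and `PMCID: ...` layout of the Extra field is an assumption about the export that should be checked against the actual file.

```python
import csv
import io
import re
from collections import Counter

# Assumed layout of the Extra column, e.g. "PMID: 11111111 PMCID: PMC1111111".
PMID_RE = re.compile(r"PMID:\s*(\d+)")
PMCID_RE = re.compile(r"PMCID:\s*(PMC\d+)")


def parse_extra(extra):
    """Return (pmid, pmcid) found in an Extra value; each may be None."""
    pmid = PMID_RE.search(extra)
    pmcid = PMCID_RE.search(extra)
    return (pmid.group(1) if pmid else None, pmcid.group(1) if pmcid else None)


def load_publications(handle):
    """Read CSV rows and attach parsed 'pmid' and 'pmcid' keys to each row."""
    rows = []
    for row in csv.DictReader(handle):
        row["pmid"], row["pmcid"] = parse_extra(row.get("Extra", ""))
        rows.append(row)
    return rows


# Hypothetical sample using a subset of the column names from Table 3.
SAMPLE_CSV = (
    '"Year","Title","DOI","Extra"\n'
    '"2014","Hypothetical study A","10.1234/example.a","PMID: 11111111 PMCID: PMC1111111"\n'
    '"2014","Hypothetical study B","10.1234/example.b","PMID: 22222222"\n'
    '"2015","Hypothetical study C","",""\n'
)

publications = load_publications(io.StringIO(SAMPLE_CSV))
per_year = Counter(row["Year"] for row in publications)

if __name__ == "__main__":
    for year in sorted(per_year):
        print(year, per_year[year])
```

To run this against the real export, `io.StringIO(SAMPLE_CSV)` would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`.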

Data availability

The cleaned BibTeX and CSV data described here are available at Zenodo. DOI: https://doi.org/10.5281/zenodo.2276785 [6]. Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0). All of the metadata presented here is publicly available in its raw form: the list of publications is available from the ALSPAC website and the individual publications’ metadata from their respective publishers. PubMed and CrossRef have additional terms and conditions [1, 2] on their aggregated metadata, but these are permissive and allow fair use.

Reviewer reports

Thank you for the opportunity of reviewing this data note. I think that this represents really exciting work and a huge effort in documenting studies using the data and cleaning these. There are some aspects that I think could be better described to help support other similar exercises in the future:
I think the rationale around collating this information could be strengthened a little. In particular, I think the rationale should better make the case that understanding the scientific impact of these cohort studies is key in supporting the continuation of this study and funding future studies.
The source data are drawn from records held by the ALSPAC team, which has been keeping track of publications. These formed a ‘master list’ of publications which was then extensively cleaned and refined to form the dataset. However, the processes used to keep track of publications need to be better described: how are studies identified, and to what extent do the researchers feel that these records represent the full range of peer-reviewed studies published using ALSPAC data?
The authors described that these are publications arising from ALSPAC data. Were any criteria imposed on what this ‘usage’ should look like? For example, would a commentary that makes reference to the ALSPAC data (possibly alongside other studies) be included as a publication; would secondary analyses of studies using ALSPAC data be included (e.g. using an effect size from a study using ALSPAC data as part of a meta-analysis)? While the PMIDs and DOIs were cleaned, were the studies checked for their usage of these data? This seems important to clarify. While this data note describes a dataset of ALSPAC publications, I’m not clear if this is exclusively a dataset of primary studies using ALSPAC data in novel analyses, or if it also includes other publication types. If the dataset does include other publication types, does this have implications for the way in which the dataset should be used by future researchers?
As a minor suggestion, it may be interesting to have an addition to Table 1 that includes a breakdown of publications by year.
The suggestions made above are mainly for clarification to help understand the parameters of the dataset. I would like to again emphasise that this data note does represent a huge task undertaken and has resulted in a very worthwhile output.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

The manuscript reports on creating a complete bibliography of publications associated with the ALSPAC longitudinal study. Collecting such data is not trivial given the duplication via preprints, PMIDs, and variable metadata associated with articles. This work is important for understanding the impacts of the study, as well as potential future meta-analyses.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
