Oliver W Butters, Rebecca C Wilson, Hugh Garner, Thomas W Y Burton.
Abstract
Cohort studies collect, generate and distribute data over long periods of time, often over the lifecourse of their participants. It is common for these studies to host a list of publications (which can number many thousands) on their website to demonstrate the impact of the study and to facilitate the search of existing research to which the study data has contributed. The ability to search and explore these publication lists varies greatly between studies. We believe a lack of rich search and exploration functionality for study publications is a barrier to entry for new or prospective users of a study's data, since it may be difficult to find and evaluate previous work in a given area. These lists of publications are also typically manually curated, resulting in a lack of rich metadata, which makes bibliometric analysis difficult. We present here a software pipeline that aggregates metadata from a variety of third-party providers to power a web-based search and exploration tool for lists of publications. Alongside core publication metadata (e.g. author lists and keywords), we include geocoding of first authors and citation counts in our pipeline. This allows a characterisation of a study as a whole based on common locations of authors, frequency of keywords, citation profile etc. This enriched publication metadata can be useful for generating study impact metrics and web-based graphics for public dissemination. In addition, the pipeline produces a research data set for bibliometric analysis or social studies of science. We use a previously published list of publications from a cohort study as an exemplar input data set to show the output and utility of the pipeline here.
Keywords: ALSPAC; Bibliography; Bibliometrics; Longitudinal birth cohort
Year: 2020 PMID: 34026049 PMCID: PMC8108552 DOI: 10.12688/f1000research.25484.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Final metadata object.
Tabular representation of the Python dictionary used to store the metadata. Items in the Secondary column are nested under the Primary items where present. Metadata source key: Zr = Zotero raw, S = Scopus, D = doi.org, P = PubMed, Ze = 'Extra' field from Zotero, W = Wikidata, De = Derived. The metadata sources are used in the order they are displayed in the table (left to right); once a value has been found, the subsequent sources are not queried.
| Primary | Secondary | Source |
|---|---|---|
| IDs | DOI | Zr |
| | PMID | Ze |
| | Scopus | S |
| | Hash | De |
| | Zotero | Zr |
| Authors | Author list | P/D |
| | First author | Author list/Ze |
| | Affiliation | D/P/S/Ze |
| Location | Canonical institute | De |
| | Town | W |
| | Country | W |
| | Longitude | W |
| | Latitude | W |
| Date | | P/D/S/Ze/Zr |
| Title | | P/D/S/Zr |
| Abstract | | P |
| Citations | Scopus citation count | S |
| Keywords | MeSH | P |
| | Other | P |
| Journal | Name | D/P |
| | Volume | D/P |
| | Issue | D/P |
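The source-precedence rule in the table caption (try each source in order, stop at the first one that returns a value) can be sketched as below. This is only an illustration under stated assumptions: `resolve_field`, `make_fetcher` and the stub fetchers are hypothetical names, not taken from the pipeline, and the stubs stand in for real API lookups.

```python
from typing import Callable, Dict, List, Optional

# Source order for the "Title" field, mirroring "P/D/S/Zr" in the table.
TITLE_SOURCES: List[str] = ["P", "D", "S", "Zr"]


def resolve_field(
    sources: List[str],
    fetchers: Dict[str, Callable[[], Optional[str]]],
) -> Optional[str]:
    """Return the first non-empty value, querying sources in the given order."""
    for key in sources:
        value = fetchers[key]()
        if value:
            # Stop here: later sources are never queried.
            return value
    return None


# Hypothetical stub fetchers that record which sources were queried.
calls: List[str] = []


def make_fetcher(key: str, value: Optional[str]) -> Callable[[], Optional[str]]:
    def fetch() -> Optional[str]:
        calls.append(key)
        return value
    return fetch


fetchers = {
    "P": make_fetcher("P", None),                # PubMed has no value here
    "D": make_fetcher("D", "An example title"),  # doi.org supplies it
    "S": make_fetcher("S", "A different title"),
    "Zr": make_fetcher("Zr", "Raw title"),
}

title = resolve_field(TITLE_SOURCES, fetchers)
```

In this toy run only the `P` and `D` fetchers are called, so the Scopus and Zotero stubs are never consulted.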
Figure 1. Overview of the PUMA pipeline.
The left column shows the sources of data accessed via their APIs, the central column shows the stages of the pipeline, and the right column shows the input and output of the pipeline.
Figure 2. Number of publications per year in ALSPAC.
Figure 3. Citation count profile of ALSPAC publications as of 15/1/2021.
The x-axis is truncated at 200 citations because only a small number of publications lie above this value, and they are sparsely spread.
Figure 4. Choropleth map of first author countries in ALSPAC.
Source metadata coverage.
| Item | Value |
|---|---|
| Date range | 1989–2015 |
| Publication count | 1300 |
| DOIs | 1260 |
| PMIDs | 1240 |
| At least one of DOI or PMID | 1293 |
Counts of completeness of the augmented metadata fields.
The values are taken from the coverage report web page generated by the pipeline. A screenshot of this page is available in the output data (see data availability).
| Field | Count |
|---|---|
| Publication count | 1300 |
| First author name | 1288 |
| Raw first author institute | 1279 |
| Derived institute | 1271 |
| Derived geolocation | 1268 |
| Year published | 1300 |
| Publication title | 1300 |
| Abstract | 1207 |
| Scopus citations record | 1284 |
| Keywords (MeSH) | 1205 |
| Journal Name | 1293 |
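Completeness counts of this kind amount to tallying, for each field, how many records hold a non-empty value. The following is a minimal sketch assuming flat records with hypothetical keys (`title`, `abstract`, `citations`); the pipeline's actual metadata object is nested and its coverage report may be computed differently.

```python
from typing import Any, Dict, List


def count_completeness(
    records: List[Dict[str, Any]], fields: List[str]
) -> Dict[str, int]:
    """Count, per field, the records that carry a non-empty value."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            # None, empty strings and empty containers count as missing;
            # a numeric 0 (e.g. zero citations) counts as present.
            if record.get(field) not in (None, "", [], {}):
                counts[field] += 1
    return counts


# Toy records with hypothetical keys, for illustration only.
records = [
    {"title": "Paper A", "abstract": "Some text", "citations": 12},
    {"title": "Paper B", "abstract": "", "citations": 0},
    {"title": "Paper C", "citations": None},
]

coverage = count_completeness(records, ["title", "abstract", "citations"])
```

With these three toy records, every record has a title, one has an abstract, and two have a citation count (including the record with zero citations).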
Study level citation statistics from Scopus as of 15/1/2021.
The values are taken from the metrics web page generated by the pipeline. A screenshot of this page is available in the output data (see data availability).
| Statistic | Value |
|---|---|
| Number of publications | 1300 |
| Number with citation data | 1284 |
| Total citation count | 97,537 |
| h-index | 141 |
| c100-index | 226 |
| Mean citations per publication | 76 |
| Median citation count | 39 |
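The statistics in this table can be derived from a plain list of per-publication citation counts. The sketch below uses the standard h-index definition (the largest h such that h publications have at least h citations each) and assumes that the c100-index is the number of publications with at least 100 citations; that reading, and the function names, are assumptions rather than the pipeline's code. The reported mean is consistent with dividing the total by the number of publications with citation data (97,537 / 1284 ≈ 76).

```python
from statistics import mean, median
from typing import List


def h_index(citations: List[int]) -> int:
    """Largest h such that h publications have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def c_index(citations: List[int], threshold: int = 100) -> int:
    """Number of publications with at least `threshold` citations (assumed definition)."""
    return sum(1 for count in citations if count >= threshold)


# Toy citation counts, not taken from the study data.
citations = [250, 120, 100, 40, 3, 0]

summary = {
    "total": sum(citations),
    "h_index": h_index(citations),
    "c100_index": c_index(citations, threshold=100),
    "mean": mean(citations),
    "median": median(citations),
}
```

For the toy list, four publications have at least four citations each, so the h-index is 4, and three publications reach the 100-citation threshold.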
Frequency of top ten lemmatized words used in keywords, titles and abstract text from the ALSPAC publications.
The full list of words as output by the pipeline is available in the output data (see data availability). The numbers in parentheses are the count.
| Keywords | Title | Abstract |
|---|---|---|
| study (1513) | study (357) | child (2517) |
| child (1284) | child (291) | age (2034) |
| human (1257) | cohort (259) | association (1905) |
| female (1050) | childhood (220) | associated (1696) |
| male (859) | association (220) | study (1675) |
| factor (720) | birth (146) | year (1553) |
| infant (568) | age (129) | risk (1142) |
| longitudinal (562) | risk (128) | maternal (1120) |
| pregnancy (470) | maternal (122) | ci (915) |
| adolescent (470) | associated (117) | cohort (904) |
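A word-frequency table like this one can be produced by tokenising each text, mapping tokens to lemmas, and counting. The sketch below is an assumption-laden illustration: the `TOY_LEMMAS` lookup and `STOPWORDS` set are hypothetical stand-ins for a real lemmatizer and stop-word list from an NLP library, and whether the pipeline removes stop words in this way is not stated in the text.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical toy lemma lookup; a real lemmatizer would replace this.
TOY_LEMMAS = {"children": "child", "studies": "study", "associations": "association"}
# Hypothetical minimal stop-word list.
STOPWORDS = {"the", "of", "and", "in", "a", "with"}


def top_words(texts: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent lemmas across the given texts."""
    counter: Counter = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOPWORDS:
                continue
            counter[TOY_LEMMAS.get(token, token)] += 1
    return counter.most_common(n)


# Toy titles, not taken from the ALSPAC publication list.
titles = [
    "A study of children in the cohort",
    "Studies of child health",
    "Associations with child growth",
]

top = top_words(titles, n=2)
```

Here "children" and "child" collapse to one lemma, as do "studies" and "study", so the two most frequent lemmas are "child" (3) and "study" (2).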