Yasunori Park1, Rachael A West1,2, Pranujan Pathmendra1, Bertrand Favier3, Thomas Stoeger4,5,6, Amanda Capes-Davis1,7, Guillaume Cabanac8, Cyril Labbé9, Jennifer A Byrne10,11.
Abstract
Nucleotide sequence reagents underpin molecular techniques that have been applied across hundreds of thousands of publications. We have previously reported wrongly identified nucleotide sequence reagents in human research publications and described a semi-automated screening tool Seek & Blastn to fact-check their claimed status. We applied Seek & Blastn to screen >11,700 publications across five literature corpora, including all original publications in Gene from 2007 to 2018 and all original open-access publications in Oncology Reports from 2014 to 2018. After manually checking Seek & Blastn outputs for >3,400 human research articles, we identified 712 articles across 78 journals that described at least one wrongly identified nucleotide sequence. Verifying the claimed identities of >13,700 sequences highlighted 1,535 wrongly identified sequences, most of which were claimed targeting reagents for the analysis of 365 human protein-coding genes and 120 non-coding RNAs. The 712 problematic articles have received >17,000 citations, including citations by human clinical trials. Given our estimate that approximately one-quarter of problematic articles may misinform the future development of human therapies, urgent measures are required to address unreliable gene research articles.
Year: 2022 PMID: 35022248 PMCID: PMC8807875 DOI: 10.26508/lsa.202101203
Source DB: PubMed Journal: Life Sci Alliance ISSN: 2575-1077
Figure 1. Diagram describing the five literature corpora screened by Seek & Blastn (S&B).
For each corpus (top row), the diagram shows the numbers of articles that were (i) screened by S&B (white), (ii) flagged by S&B with sequences manually verified (grey), and (iii) found to be problematic by describing at least one wrongly identified nucleotide sequence (dark grey). Total numbers of problematic articles and wrongly identified sequences are indicated below the diagram, corrected for duplicate articles between the corpora.
Descriptions of the targeted corpora screened by Seek & Blastn with manual verification of nucleotide sequence reagent identities.
| | Single gene knockdown (SGK) | | | | Cisplatin + Gemcitabine (C + G) | |
|---|---|---|---|---|---|---|
| | Corpus | Problematic | Corpus | Problematic | Corpus | Problematic |
| Number of articles (% of corpus) | 174 (100%) | 75 (43%) | 50 (100%) | 31 (62%) | 100 (100%) | 51 (51%) |
| Number of journals | 83 | 42 | 35 | 25 | 48 | 31 |
| Publication year median (range) | 2015 (2006–2019) | 2015 (2010–2019) | 2017 (2009–2019) | 2017 (2009–2019) | 2017 (2008–2019) | 2017 (2009–2019) |
| Journal impact factor at publication year median (range) | 2.204 (0.098–8.459) | 1.778 (0.098–5.712) | 3.34 (0.700–9.050) | 3.23 (0.700–8.278) | 3.571 (1.099–10.391) | 3.041 (1.099–8.579) |
| Number of sequences/article median (range) | 6 (0–24) | 6 (1–24) | 11 (4–46) | 10 (4–46) | 11 (2–71) | 12 (2–70) |
| Number of incorrect sequences/article median (range) | ND | 1 (1–8) | ND | 1 (1–5) | ND | 2 (1–8) |
| Articles from China proportion (%) | 159/174 (91%) | 73/75 (97%) | 44/50 (88%) | 31/31 (100%) | 90/100 (90%) | 50/51 (98%) |
| Articles from China affiliated with hospitals proportion (%) | 147/159 (92%) | 68/73 (93%) | 40/44 (91%) | 28/31 (90%) | 82/90 (91%) | 48/50 (96%) |
| Articles from all other countries affiliated with hospitals proportion (%) | 6/15 (40%) | 0/2 (0%) | 0/6 (0%) | 0/3 (0%) | 1/10 (10%) | 0/1 (0%) |
| Articles with post-publication notices | 20/174 (12%) | 13/75 (17%) | 1/50 (2%) | 1/31 (3%) | 1/100 (1%) | 1/51 (2%) |
SGK articles were published until June 2019.
Post-publication notices include retractions, expressions of concern and corrections.
Cancer types studied in the Single Gene Knockdown (SGK) corpus, where each cancer type corresponds to a single article.
| Gene | Previously reported SGK articles | New SGK articles |
|---|---|---|
|
|
| Breast, Breast, Colorectal, Gastric, Liver, Lung, Pancreatic |
|
| N/A | |
|
| ||
|
| Brain, | |
|
|
| Breast, Gastric, Leukemia, Lung, |
|
|
| |
|
|
| |
|
| ||
|
| Brain, | Laryngeal, |
|
|
| |
|
| Bladder, | |
|
| Bladder, Lung | Brain, Brain, Breast, Breast, Liver, |
|
| Brain, | |
|
| Brain, Colorectal, Gastric, Thyroid | |
|
| Brain | |
|
| ||
|
| Brain, Brain, Gallbladder, Laryngeal, Leukemia, Lung, Lung, Oral, Oral, |
Cancer types shown in bold correspond to problematic articles with wrongly identified nucleotide sequence(s).
Underlined cancer types correspond to articles that have been retracted or assigned an expression of concern.
Wrongly identified nucleotide sequences summarized according to experimental technique and identity error type.
| Corpus | Technique | “Non-targeting” yet targeting proportion (%) | “Targeting” yet non-targeting proportion (%) | Targeting wrong gene/sequence proportion (%) | Total per corpus proportion (%) |
|---|---|---|---|---|---|
| SGK (n = 115 reagents in n = 75 articles) | PCR | 0/45 (0) | | | 52/115 (45) |
| | Gene knockdown | | | 12/57 (21) | |
| | Other | 0/45 (0) | 0/14 (0) | 0/57 (0) | 0/115 (0) |
| | Total (Error type) | 44/44 (100) | 14/14 (100) | 57/57 (100) | 115/115 (100) |
| | PCR | 0/2 (0) | | | |
| | Gene knockdown | | 1/9 (11) | 5/38 (13) | 8/49 (16) |
| | Other | 0/2 (0) | 0/9 (0) | 0/38 (0) | 0/49 (0) |
| | Total (Error type) | 2/2 (100) | 9/9 (100) | 38/38 (100) | 49/49 (100) |
| C + G (n = 109 reagents in n = 51 articles) | PCR | 0/6 (0) | | | |
| | Gene knockdown | | 1/24 (4) | 5/79 (6) | 10/109 (9) |
| | Other | 2/6 (33) | 0/24 (0) | 1/79 (1) | 3/109 (3) |
| | Total (Error type) | 6/6 (100) | 24/24 (100) | 79/79 (100) | 109/109 (100) |
| | PCR | 0/9 (0) | | | |
| | Gene knockdown | | 7/42 (17) | 15/233 (6) | 31/284 (11) |
| | Other | 0/9 (0) | 0/42 (0) | 0/233 (0) | 0/284 (0) |
| | Total (Error type) | 9/9 (100) | 42/42 (100) | 233/233 (100) | 284/284 (100) |
| | PCR | 0/36 (0) | | | |
| | Gene knockdown | | 37/335 (11) | 54/630 (8) | 121/995 (12) |
| | Other | 0/36 (0) | 2/335 (1) | 3/630 (1) | 5/995 (1) |
| | Total (Error type) | 30/30 (100) | 335/335 (100) | 630/630 (100) | 995/995 (100) |
| Total (n = 1,535 reagents in n = 712 articles) | PCR | 0/89 (0) | | | |
| | Gene knockdown | | 50/416 (12) | 89/1,030 (9) | 226/1,535 (15) |
| | Other | 2/89 (2) | 2/416 (1) | 4/1,030 (1) | 8/1,535 (1) |
| | Total (Error type) | 89/89 (100) | 416/416 (100) | 1,030/1,030 (100) | 1,535/1,535 (100) |
PCR = Human gene or genomic targeting primers for PCR, RT–PCR or methylation-specific PCR.
Gene knockdown = siRNA or shRNA.
Other = Claimed Ribozyme, TALEN, mimic sequences, and other oligonucleotide sequences.
Bold text indicates the most frequent error types per corpus.
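Each corpus total in the table above is split across three technique categories (PCR, gene knockdown, other), so a single row count can be cross-checked by subtraction. The sketch below does this for the overall totals, assuming the three categories partition the 1,535 wrongly identified reagents with no overlap. The PCR count it produces is derived under that assumption and is not a figure quoted from the table; the variable and function names are illustrative.

```python
# Counts copied from the "Total" block of the table above.
total_reagents = 1535
knockdown = 226   # "Gene knockdown" row, total column
other = 8         # "Other" row, total column

# Assumption: PCR, gene knockdown and other are mutually exclusive and exhaustive.
pcr = total_reagents - knockdown - other


def pct(count, total):
    """Percentage rounded to the nearest whole number, as in the table."""
    return round(100 * count / total)


shares = {
    "PCR": pct(pcr, total_reagents),
    "Gene knockdown": pct(knockdown, total_reagents),
    "Other": pct(other, total_reagents),
}
print(pcr, shares)
```

The rounded shares for gene knockdown and other agree with the percentages printed in the table (15 and 1), which supports the partition assumption. The rounded shares need not sum to exactly 100.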
Figure 2. Percentages of sequence identity error types in each corpus.
Percentages of wrongly identified nucleotide sequence reagents that correspond to the three identity error types (y-axis) in each corpus (x-axis). Percentages corresponding to each error type are indicated, rounded to the nearest single digit. The numbers of incorrect sequences in each corpus are shown below the x-axis.
Figure 3. Percentages of wrongly identified nucleotide sequences that were either unique or repeated within each corpus.
Percentages of wrongly identified sequences that were identified at least twice in any single corpus (black) are shown above each image, rounded to the nearest single digit. All other wrongly identified sequences were unique in the indicated corpus (grey). Numbers of wrongly identified sequences identified in each corpus are shown below each image.
Summary of features of Gene and Oncology Reports journals and problematic articles.
| Feature | Gene | Oncology Reports |
|---|---|---|
| Publication years screened by Seek & Blastn | 2007–2018 | 2014–2018 |
| Journal impact factor (range during years screened) | 2.082–2.871 | 2.301–3.041 |
| Flagged/screened articles proportion (%) | 742/7,399 (10%) | 1,709/3,778 (45%) |
| Problematic/flagged articles proportion (%) | 128/742 (17%) | 436/1,709 (26%) |
| Incorrect sequences/problematic article median (range) | 2 (1–36) | 2 (1–15) |
| Problematic articles from China proportion (%) | 69/128 (54%) | 393/436 (90%) |
| Problematic articles from all other countries proportion (%) | 59/128 (46%) | 43/436 (10%) |
| Problematic articles from China affiliated with hospitals proportion (%) | 54/69 (78%) | 342/393 (87%) |
| Problematic articles from all other countries affiliated with hospitals proportion (%) | 5/59 (9%) | 5/43 (12%) |
| Retracted or corrected problematic articles proportion (%) | 2/128 (2%) | 2/436 (0.5%) |
Figure S1. Problematic Gene articles per year, according to country of origin and institutional affiliation type.
Total numbers of problematic articles for each country/group of countries are shown in the upper left corner of each panel. (A) Numbers of problematic Gene articles (y-axes) per publication year (x-axes) according to country of origin, shown above each graph. Countries are shown in alphabetical order, from top left. (B) Numbers of problematic Gene articles (y-axis) per publication year (x-axis) from China (right panel) or all other countries (left panel). Articles affiliated with hospitals or other institution types are shown in blue or grey, respectively. Numbers of problematic articles per year are shown below each stacked bar graph.
Figure 4. Percentages of problematic Gene and Oncology Reports articles according to hospital affiliation status and country of origin.
Percentages of problematic Gene and Oncology Reports articles according to hospital affiliation status (y-axis) from either China or all other countries (x-axis). The journal and relevant date ranges of problematic articles are shown above each panel. Problematic articles affiliated with hospitals are shown in blue, and those not affiliated with hospitals are shown in grey. Percentages shown have been rounded to the nearest single digit. Numbers of problematic articles from China or all other countries are indicated below the x-axis. In each panel, a significantly higher proportion of problematic articles from China than from other countries was affiliated with hospitals (Fisher’s exact test, P < 0.001).
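The comparison reported in this caption can be re-derived from the hospital affiliation counts in the summary table of Gene and Oncology Reports features (54/69 versus 5/59 for Gene; 342/393 versus 5/43 for Oncology Reports). The following is a minimal sketch, assuming a two-sided test computed directly from the hypergeometric distribution; the source does not state which software or sidedness was used, and the function name `fisher_exact_two_sided` is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: China, all other countries. Columns: hospital, not hospital.
p_gene = fisher_exact_two_sided(54, 69 - 54, 5, 59 - 5)
p_oncology_reports = fisher_exact_two_sided(342, 393 - 342, 5, 43 - 5)
print(p_gene < 0.001, p_oncology_reports < 0.001)
```

Using exact integer binomial coefficients avoids floating-point overflow for tables of this size, at the cost of speed on much larger tables.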
Figure S2. Problematic Oncology Reports articles per year, according to country of origin and institutional affiliation type.
Total numbers of problematic articles for each country/group of countries are shown in the upper left corner of each panel. (A) Numbers of problematic Oncology Reports articles (y-axes) per publication year (x-axes) according to country of origin, shown above each graph. Countries are shown in alphabetical order, from top left. (B) Numbers of problematic Oncology Reports articles (y-axis) per publication year (x-axis) from China (right panel) or all other countries (left panel). Articles affiliated with hospitals or other institution types are shown in blue or grey, respectively. Numbers of problematic articles per year are shown below each stacked bar graph.
Figure 5. Percentages and numbers of problematic Gene and Oncology Reports articles per year.
Percentages of all Gene or Oncology Reports articles that were found to be problematic (y-axis) per publication year (x-axis). The journal and relevant publication year ranges are shown above each panel. Problematic articles from China or all other countries are shown in orange or grey, respectively. Percentages shown are rounded to one decimal place. Total numbers of problematic articles per year are shown below each graph.
Figure 6. Summary of (RT-)PCR primer pairings that involved at least one wrongly identified primer.
For n = 851 primer pairs that were claimed to target particular genes/sequences (gene X) (left panel), one or both primers were predicted to be incorrect (right panel), either by targeting unrelated genes or sequences (gene Y or gene Z), or by having no predicted human target (no target). Numbers of primer pairs and affected articles are indicated below each incorrect primer pair category. Some problematic articles described more than one (category of) incorrect primer pairing. Left- or right-hand primers are not intended to indicate forward or reverse primer orientations.
Figure 7. Numbers of past research articles that have studied human protein-coding genes in problematic articles.
(A, B) Numbers (log base 10) of problematic articles (y-axis) versus past research articles (x-axis) for (A) primary protein-coding genes in problematic articles and (B) claimed protein-coding gene targets of wrongly identified reagents. Vertical dashed lines indicate the median number of research articles for protein-coding genes, with the associated interquartile range shown in grey. Subsets of protein-coding genes are highlighted in each panel.
Figure S3. Human protein-coding genes in problematic articles appear frequently in PubMed.
(A, B) Numbers (log base 10) of PubMed articles (y-axis) for primary protein-coding genes in problematic articles (A) (green) or claimed protein-coding gene targets of wrongly identified reagents (B) (green), versus all other human protein-coding genes (A, B) (grey). Horizontal lines within box plots indicate median values, with box plots showing percentiles according to letter proportions, that is, ±25% percentile, ±(25 + 25/2)% percentile, ±(25 + 25/2 + 25/4)% percentile.
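The percentile scheme described in this caption widens each successive box by half of the previous step on either side of the median. A minimal sketch of that sequence follows, assuming "±25% percentile" means the 50th percentile plus or minus 25 percentile points; the function name `letter_value_levels` is illustrative and is not taken from the source.

```python
def letter_value_levels(depth):
    """Return (lower, upper) percentile pairs for successive nested boxes."""
    levels = []
    half_width = 0.0
    step = 25.0
    for _ in range(depth):
        half_width += step            # 25, then 25 + 25/2, then 25 + 25/2 + 25/4, ...
        levels.append((50.0 - half_width, 50.0 + half_width))
        step /= 2
    return levels


print(letter_value_levels(3))
# [(25.0, 75.0), (12.5, 87.5), (6.25, 93.75)]
```

The first pair corresponds to the interquartile range of a conventional box plot; each later pair covers half of the remaining tail on each side.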
Figure 8. Clinical trial citations and approximate potential to translate (APT) for problematic articles.
(A) Percentages of problematic articles that have at least one clinical citation according to the NIH Open Citation Collection (y-axis), according to publication corpus (x-axis). Error bars indicate 95% confidence intervals of bootstrapped estimates of percentages. Numbers of problematic articles with at least one clinical citation are shown below the x-axis for each corpus. (B) Average APT for problematic articles (y-axis) according to publication corpus (x-axis). Error bars indicate bootstrapped 95% confidence intervals. Numbers of problematic articles for which the APT was computed by iCite (88) are shown below the x-axis for each corpus.
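The error bars in this figure are bootstrapped 95% confidence intervals. The caption does not say which bootstrap variant was used, so the sketch below assumes a simple percentile bootstrap over articles. The input data are hypothetical (30 of 100 articles with a clinical citation) and are not taken from the study; `bootstrap_ci_percent` is an illustrative name.

```python
import random


def bootstrap_ci_percent(flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the percentage of True values in flags."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(flags)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(flags, k=n)   # sample articles with replacement
        estimates.append(100.0 * sum(resample) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical corpus: True marks an article with at least one clinical citation.
flags = [True] * 30 + [False] * 70
low, high = bootstrap_ci_percent(flags)
print(f"30.0% (95% CI {low:.1f} to {high:.1f})")
```

The same resampling loop applies to panel B if the per-article flag is replaced by the per-article APT value and the percentage by a mean.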