Mandhri Abeysooriya, Megan Soria, Mary Sravya Kasu, Mark Ziemann.
Abstract
Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a report in 2016 highlighting the extent of the problem. To assess this, we performed a scan of supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. Improved scanning software that we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. This is because gene names are converted not just to dates and floating-point numbers, but also to the internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited for use with large genomic data.
Year: 2021 PMID: 34329294 PMCID: PMC8357140 DOI: 10.1371/journal.pcbi.1008984
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779
Gene name conversion behaviour of spreadsheet software.
| Software | Microsoft Excel | Google Sheets | LibreOffice | Gnumeric |
|---|---|---|---|---|
| Text file open | Yes | Yes | No | No |
| Pasting data | Yes | Yes | No | No |
| Typing | Yes | Yes | No | No |
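The abstract names three forms that a converted gene name can take: a date, a floating-point number, and a five-digit internal date number. A minimal sketch of how such cells might be flagged is given below. It is not the authors' scanning software; the regular expressions, the function name `classify_cell`, and the example values are assumptions chosen for illustration.

```python
import re
from typing import Optional

MONTHS = r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"

# Assumed patterns, one per conversion form named in the abstract.
# Date-like text such as "2-Sep" or "Mar-1".
DATE_RE = re.compile(
    rf"^(\d{{1,2}}[-/]{MONTHS}|{MONTHS}[-/]\d{{1,2}})$", re.IGNORECASE
)
# Scientific notation such as "2.31E+13".
FLOAT_RE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)
# Five-digit number, the form of a spreadsheet's internal date value.
SERIAL_RE = re.compile(r"^\d{5}$")


def classify_cell(value: str) -> Optional[str]:
    """Return the suspected conversion form of a cell, or None if it looks fine."""
    text = value.strip()
    if DATE_RE.match(text):
        return "date"
    if FLOAT_RE.match(text):
        return "float"
    if SERIAL_RE.match(text):
        # Caveat: a five-digit value can also be a legitimate numeric
        # identifier, so this hit needs context before it counts as an error.
        return "serial_date"
    return None


column = ["TP53", "2-Sep", "2.31E+13", "44076", "BRCA1"]
flagged = {cell: classify_cell(cell) for cell in column if classify_cell(cell)}
```

Here `flagged` holds only the three suspicious cells. A real screen would also need to decide whether a column is a gene list at all, which this sketch does not attempt.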
Gene names vulnerable to date conversion across Eukarya.
| Group | Taxa | Genes | Genes affected | Taxa affected |
|---|---|---|---|---|
| Vertebrates | 310 | 5,263,175 | 1,325 | 76 |
| Metazoa | 59 | 525,867 | 17 | 3 |
| Plants | 60 | 244,101 | 35 | 4 |
| Fungi | 59 | 788,221 | 140 | 12 |
| Protists | 39 | 163,026 | 27 | 9 |
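Whether a gene name is vulnerable to date conversion can be roughly approximated by checking for a month-like prefix followed by a small number. The sketch below is only such a heuristic: the pattern, the function name `is_date_like`, and the restriction to day-of-month readings (1 to 31) are assumptions, and it does not reproduce the parsing rules of any particular spreadsheet program or locale.

```python
import re

# Hypothetical heuristic: a month name or abbreviation, an optional
# hyphen, then one or two digits that could be read as a day of the month.
VULNERABLE_RE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|"
    r"SEP|SEPT|OCT|NOV|DEC)-?(\d{1,2})$",
    re.IGNORECASE,
)


def is_date_like(symbol: str) -> bool:
    """Return True if a gene symbol resembles a month-plus-day date."""
    match = VULNERABLE_RE.match(symbol.strip())
    if match is None:
        return False
    # Assumption: only values 1-31 are treated as plausible days.
    return 1 <= int(match.group(2)) <= 31


symbols = ["SEPT2", "MARCH1", "DEC1", "TP53", "BRCA1"]
vulnerable = [s for s in symbols if is_date_like(s)]
```

With the example list above, `vulnerable` contains `SEPT2`, `MARCH1`, and `DEC1`. Counts produced this way would not necessarily match the table, which reflects the authors' own screen.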
Results of a screen for gene name errors in PMC.
| | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Total |
|---|---|---|---|---|---|---|---|---|
| Publications screened | 19976 | 21204 | 22261 | 23976 | 24986 | 26046 | 27690 | 166139 |
| Excel files screened | 2948 | 4318 | 4472 | 4355 | 4824 | 5481 | 6443 | 32841 |
| Excel files with gene lists | 2286 | 3037 | 3331 | 3021 | 3566 | 3342 | 4496 | 23670 |
| Publications with Excel gene lists | 936 | 1491 | 1579 | 1412 | 1653 | 1823 | 2223 | 11117 |
| Publications with suspected gene name errors | 284 | 490 | 477 | 443 | 475 | 594 | 707 | 3470 |
| False positive Excel files | 8 | 0 | 7 | 5 | 15 | 4 | 11 | 50 |
| False positive publications | 2 | 0 | 6 | 3 | 11 | 3 | 9 | 34 |
| Affected Excel files | 429 | 701 | 653 | 648 | 703 | 914 | 1038 | 5086 |
| Affected publications | 282 | 490 | 471 | 440 | 464 | 591 | 698 | 3436 |
| Proportion of publications affected (%) | 30.1% | 32.9% | 29.8% | 31.2% | 28.1% | 32.4% | 31.4% | 30.9% |
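The rows of this table are related by simple arithmetic, which the following sketch re-derives. The values are copied from the table; the variable names and the helper `pct` are illustrative, and the relation "affected = suspected minus false positives" is inferred from the numbers rather than stated in the text.

```python
# Per-year counts copied from the table (2014-2020).
years = [2014, 2015, 2016, 2017, 2018, 2019, 2020]
with_gene_lists = [936, 1491, 1579, 1412, 1653, 1823, 2223]
suspected = [284, 490, 477, 443, 475, 594, 707]
false_positive = [2, 0, 6, 3, 11, 3, 9]
affected = [282, 490, 471, 440, 464, 591, 698]


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as displayed in the table."""
    return round(100 * part / whole, 1)


# Inferred relation: affected publications are the suspected ones
# minus those judged to be false positives.
derived_affected = [s - f for s, f in zip(suspected, false_positive)]

per_year = {y: pct(a, n) for y, a, n in zip(years, affected, with_gene_lists)}
overall = pct(sum(affected), sum(with_gene_lists))
```

Under these inputs, `derived_affected` matches the "Affected publications" row, and `overall` reproduces the 30.9% (3,436/11,117) quoted in the abstract.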
Fig 1. Prevalence of gene name errors in the period 2014–2020.
(A) Publications with supplementary Excel gene lists. (B) Publications affected by gene name errors. (C) Proportion of affected publications.
Gene name errors stratified by organism under study.
| Species | Publications with Excel gene lists | Affected publications | Proportion of publications affected |
|---|---|---|---|
| | 1577 | 609 | 38.6% |
| | 7936 | 2419 | 30.5% |
| | 124 | 31 | 25.0% |
| | 607 | 142 | 23.4% |
| | 443 | 93 | 21.0% |
| | 327 | 68 | 20.8% |
| | 251 | 48 | 19.1% |
| | 511 | 76 | 14.9% |
| | 1827 | 172 | 9.4% |
| | 10 | 0 | 0.0% |
Prevalence of gene name errors across journals.
Only journals with ≥50 articles with supplementary Excel gene lists are shown.
| Journal name as it appears in PMC | Number of articles with Excel gene lists | Number of affected articles | Proportion of articles affected (%) |
|---|---|---|---|
| | 920 | 345 | 37.5% |
| | 946 | 244 | 25.8% |
| | 767 | 227 | 29.6% |
| | 660 | 166 | 25.2% |
| | 448 | 134 | 29.9% |
| | 326 | 107 | 32.8% |
| | 313 | 94 | 30.0% |
| | 243 | 89 | 36.6% |
| | 155 | 73 | 47.1% |
| | 158 | 71 | 44.9% |
| | 193 | 66 | 34.2% |
| | 118 | 52 | 44.1% |
| | 140 | 48 | 34.3% |
| | 137 | 44 | 32.1% |
| | 137 | 39 | 28.5% |
| | 74 | 39 | 52.7% |
| | 109 | 38 | 34.9% |
| | 120 | 36 | 30.0% |
| | 117 | 31 | 26.5% |
| | 85 | 31 | 36.5% |
| | 73 | 29 | 39.7% |
| | 105 | 28 | 26.7% |
| | 80 | 27 | 33.8% |
| | 74 | 27 | 36.5% |
| | 66 | 26 | 39.4% |
| | 56 | 26 | 46.4% |
| | 51 | 26 | 51.0% |
| | 64 | 25 | 39.1% |
| | 97 | 24 | 24.7% |
| | 53 | 22 | 41.5% |
| | 58 | 20 | 34.5% |
| | 56 | 20 | 35.7% |
| | 77 | 19 | 24.7% |
| | 74 | 15 | 20.3% |
| | 53 | 15 | 28.3% |
| | 52 | 6 | 11.5% |
| | 75 | 5 | 6.7% |
Fig 2. A scatterplot of journal impact factor (JIF) and proportion of articles with supplementary Excel gene lists affected by gene name errors.
Fig 3. Gene name errors in supplementary files for three dominant journals in the period 2014–2020.