Romy Sauvayre
Abstract
Google Scholar (GS) is a free tool that may be used by researchers to analyze citations; find appropriate literature; or evaluate the quality of an author or a contender for tenure, promotion, a faculty position, funding, or research grants. GS has become a major bibliographic and citation database. For assessing the literature, databases, such as PubMed, PsycINFO, Scopus, and Web of Science, can be used in place of GS because they are more reliable. The aim of this study was to examine the accuracy of citation data collected from GS and provide a comprehensive description of the errors and miscounts identified. For this purpose, 281 documents that cited 2 specific works were retrieved via Publish or Perish software (PoP) and were examined. The 2 cited works studied the false-positive issue inherent in the analysis of neuroimaging data. The results revealed an unprecedented error rate, with 279 of 281 (99.3%) examined references containing at least one error. Nonacademic documents tended to contain more errors than academic publications (U=5117.0; P<.001). This viewpoint article, based on a case study examining GS data accuracy, shows that GS data not only fail to be accurate but also potentially expose researchers, who would use these data without verification, to substantial biases in their analyses and results. Further work must be conducted to assess the consequences of using GS data extracted by PoP.
©Romy Sauvayre. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 27.05.2022.
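The U statistic reported in the abstract is presumably a Mann-Whitney U, which compares two independent groups by counting pairwise "wins". The sketch below shows how such a statistic can be computed by hand. The function name and the two small samples are hypothetical illustrations, not the study's data, and the abstract does not say which of the two U values it reports.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) by pairwise comparison; ties count as 0.5."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two U values always sum to the number of pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical per-document error counts (NOT taken from the study).
nonacademic = [4, 5, 3, 6]
academic = [2, 3, 1, 2, 4]

u_nonacademic, u_academic = mann_whitney_u(nonacademic, academic)
print(u_nonacademic, u_academic)
```

A large U for one group relative to the number of pairs suggests that its values tend to be higher. A P value would additionally require the exact or normal-approximation distribution of U, which this sketch does not cover.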
Keywords: academic publication; citation analysis; database reliability; false positives; reference accuracy; research evaluation; scientometrics
Year: 2022 PMID: 35622395 PMCID: PMC9187964 DOI: 10.2196/28354
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 7.076
Figure 1. Pareto diagram. Sum of errors detected (N=755) as a function of document type. Light blue indicates academic publications, gray indicates nonacademic documents, and dark blue indicates cumulative sum curve of errors detected.
Typology of Google Scholar errors. Typology and proportion of errors identified as a function of the number of valid references examined and as a function of the total number of errors detected.
| Error type | Errors identified as a function of the total number of errors detected, n (%) | Errors identified as a function of the number of valid references examined, n (%) |
| Data collection | 42 (5.6) | 33 (11.7) |
| Academic publication collection | 77 (10.2) | 77 (27.5) |
| Citation | 81 (10.7) | 81 (29.9) |
| Author | 61 (8.1) | 53 (19.4) |
| Title | 60 (7.9) | 57 (20.8) |
| Publication year | 31 (4.1) | 31 (11.3) |
| Publication | 155 (20.5) | 133 (47.5) |
| Publisher | 248 (32.8) | 244 (86.8) |
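The percentages in the first numeric column can be cross-checked against the total of 755 errors given in the Figure 1 caption. The following minimal sketch copies the counts from the table and recomputes each share. The variable and function names are illustrative only.

```python
# Error counts per type, copied from the first numeric column of the table.
error_counts = {
    "Data collection": 42,
    "Academic publication collection": 77,
    "Citation": 81,
    "Author": 61,
    "Title": 60,
    "Publication year": 31,
    "Publication": 155,
    "Publisher": 248,
}

total_errors = sum(error_counts.values())


def share_of_total(count, total):
    """Percentage of all detected errors that fall under one error type."""
    return 100.0 * count / total


shares = {name: share_of_total(n, total_errors) for name, n in error_counts.items()}

print(f"Total errors: {total_errors}")
for name, pct in shares.items():
    print(f"{name}: {error_counts[name]} ({pct:.1f}%)")
```

The counts sum to 755, and each recomputed share matches the tabulated percentage to one decimal place. This indicates that the first numeric column is expressed relative to the total number of errors detected.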
Accurate and inaccurate content identified in the “Publication” column retrieved from Google Scholar via Publish or Perish (N=280).
| Type of error | Inaccurate publication, n (%) | Accurate publication, n (%) | Utility^a |
| Journal name (n=108) | 56 (51.9) | 52 (48.1) | (+) |
| Magazine name (n=2) | 1 (50.0) | 1 (50.0) | (−) |
| Book title (n=13) | 13 (100.0) | 0 (0.0) | (−) |
| Edited book title (n=29) | 21 (72.4) | 8 (27.6) | (+) |
| Conference proceeding title (n=5) | 5 (100.0) | 0 (0.0) | (+) |
| Thesis title (n=2) | 2 (100.0) | 0 (0.0) | (−) |
| Publisher name (n=2) | 2 (100.0) | 0 (0.0) | (−) |
| Domain name (n=20) | 18 (90.0) | 2 (10.0) | (−) |
| Preprint database name (n=4) | 1 (25.0) | 3 (75.0) | (−) |
| Other (n=5) | 5 (100.0) | 0 (0.0) | (−) |
| Missing value (not provided) (n=90) | 9 (10.0) | 81 (90.0) | (−) |
| Total (n=280) | 133 (47.5) | 147 (52.5) | N/A^b |
^a The usable publication content for studies using academic publications is denoted by “+.” The errors were not easy to categorize because of nonacademic documents. For instance, when the document type is a blog post or an unpublished draft, a journal name is not expected in the “Publication” column and thus is counted as an inaccuracy. Nevertheless, this type of document had already been counted as a data collection error. Therefore, each document type was specifically analyzed to avoid falsely increasing the error count. However, the categorization was easier for other references, such as when the journal editor name was provided instead of the journal name. In addition, an examination of spelling and typographical errors, including capitalization errors, was conducted.
^b N/A: not applicable.
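Each row of the “Publication” table splits n entries into inaccurate and accurate counts, so the row percentages and the totals can be recomputed from the counts alone. The sketch below does this with the counts copied from the table. The data structure and names are assumptions made for illustration.

```python
# (inaccurate, accurate, n) per content type, copied from the table counts.
publication_rows = {
    "Journal name": (56, 52, 108),
    "Magazine name": (1, 1, 2),
    "Book title": (13, 0, 13),
    "Edited book title": (21, 8, 29),
    "Conference proceeding title": (5, 0, 5),
    "Thesis title": (2, 0, 2),
    "Publisher name": (2, 0, 2),
    "Domain name": (18, 2, 20),
    "Preprint database name": (1, 3, 4),
    "Other": (5, 0, 5),
    "Missing value (not provided)": (9, 81, 90),
}

# Every row should satisfy inaccurate + accurate == n.
rows_consistent = all(bad + good == n for bad, good, n in publication_rows.values())

# Row-wise percentage of inaccurate entries.
inaccurate_pct = {
    name: 100.0 * bad / n for name, (bad, good, n) in publication_rows.items()
}

total_inaccurate = sum(bad for bad, _, _ in publication_rows.values())
total_accurate = sum(good for _, good, _ in publication_rows.values())
total_n = sum(n for _, _, n in publication_rows.values())

print(rows_consistent, total_inaccurate, total_accurate, total_n)
for name, pct in inaccurate_pct.items():
    print(f"{name}: {pct:.1f}% inaccurate")
```

Under these counts the totals come to 133 inaccurate and 147 accurate entries out of 280, in line with the table's total row.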
Accurate and inaccurate content in the “Publisher” column retrieved from Google Scholar via Publish or Perish (N=281).
| Type of error | Inaccurate publication, n (%) | Accurate publication, n (%) | Utility^a |
| Book and conference proceeding editor (n=35) | 13 (37.1) | 22 (62.9) | (+) |
| Journal editor (n=51) | 51 (100.0) | 0 (0.0) | (−) |
| Journal name (n=1) | 1 (100.0) | 0 (0.0) | (−) |
| Digital library name (n=2) | 2 (100.0) | 0 (0.0) | (−) |
| Domain name (n=167) | 167 (100.0) | 0 (0.0) | (−) |
| Not provided (n=25) | 10 (40.0) | 15 (60.0) | (−) |
| Total (n=281) | 244 (86.8) | 37 (13.2) | N/A^b |
^a The usable publication content for studies using academic publication data is denoted by “+.”
^b N/A: not applicable.