Detection of Potential Problematic Cytb Gene Sequences of Fishes in GenBank.

Xiaobing Li, Xuejuan Shen, Xiao Chen, Dan Xiang, Robert W Murphy, Yongyi Shen.

Abstract

Fishes are, by far, the most diverse group of vertebrates. Their classification relies heavily on morphology. In practice, the correct morphological identification of species often depends on personal experience because many species vary in body shape, color and other external characters. Thus, the identification of a species may be prone to error. Owing to the rapid development of molecular biology, the number of fish sequences deposited in GenBank has grown explosively. These published data likely contain errors owing to invalid or incorrectly identified species. The erroneous data can lead to downstream problems, so it is critical that such errors be identified and corrected. A strategy based on DNA barcoding can detect potentially erroneous data, especially when intraspecific K2P variation exceeds interspecific K2P divergence. Analyses of the most used DNA marker for fishes (mitochondrial Cytb) show that intraspecific differences of fishes are generally less than 1%, while interspecific differences are generally higher than 10%. Based on this criterion, our analyses identify 1,303 potentially problematic Cytb sequences of fishes in GenBank and point to taxonomic problems, errors in identification, genetic introgression and other concerns. Care must be taken to avoid perpetuating errors when using these available data.

Keywords:  Cytb; DNA barcoding; GenBank; fish; sequence error

Year:  2018        PMID: 29467794      PMCID: PMC5808227          DOI: 10.3389/fgene.2018.00030

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


Introduction

The identification of fishes generally relies on morphology and distribution. In practice, however, problems arise from the great diversity of fishes, the small body sizes of many species, poor preservation of individual specimens and other issues. Further, accuracy in the morphological identification of species depends on personal experience. For many species, abiotic factors such as environmental perturbations can affect body shape, skin color and other external characters (Wilkens and Strecker, 2003). These factors inevitably lead to controversy and misidentification.

DNA barcoding uses a short gene segment to identify species (Hebert et al., 2003a, 2004). Generally, the mitochondrial COI gene is the marker of choice because differences in sequences between species have been well characterized (Hebert et al., 2003b). This method has been applied to the classification of fishes to facilitate the rapid and accurate identification of species and the discovery of cryptic species (Fields et al., 2015; Bhattacharya et al., 2016). In DNA barcoding, a short standardized sequence can distinguish individuals of a species because genetic variation between species usually exceeds that within species (Hebert et al., 2003a; Hajibabaei et al., 2007). In such cases, any gene segment can serve to identify species. Potential errors and taxonomic conundrums can be identified when interspecific genetic variation does not exceed that within species.

Because of advances in sequencing technologies, the number of DNA sequences of fishes in GenBank has increased explosively. For example, the database now holds more than 60,000 fish sequences of mitochondrial cytochrome b (Cytb) alone, and this number keeps growing. Many sequences have been submitted by labs lacking taxonomic expertise. Further, sampling error, contamination, hybridization, introgression, and nuclear pseudogenes can also lead to problems and errors. Consequently, any large database likely contains errors, and the perpetuation of erroneous data can lead to downstream problems. Thus, it is critical to identify and correct such errors.

The large gap between intra- and interspecific differences in Cytb is stable. Consequently, the gene has been used widely in systematics and molecular ecology, including the identification of species of chickens, praomyin rodents and gadid fishes, among many others (Kartavtsev, 2011; Nicolas et al., 2012; Yacoub et al., 2015; Fernandes et al., 2017). Many studies on fishes have used Cytb sequences for molecular phylogenetics and population analyses. Therefore, we use Cytb to test whether DNA barcoding can identify potentially erroneous sequences of fishes. This approach has the potential to be used universally to improve the quality of publicly available data.

Materials and methods

To obtain the maximum number of sequences, we downloaded all 65,326 Cytb records for fishes from NCBI. These sequences were uploaded by many labs; many were incomplete Cytb genes, differed in length and covered different parts of the gene. Therefore, we employed the following trimming steps to standardize the sequences before calculating sequence divergences: (1) flanking regions of Cytb were deleted; (2) sequences were aligned using MAFFT (Katoh and Toh, 2010); (3) to obtain the maximum number of homologous sequences, we balanced maximum alignment length against taxonomic coverage to attain the final trimmed dataset for downstream analyses. The trimmed dataset consisted of 35,130 fragments of 918 bp. When we set the complete Cytb of Carassius auratus (GU135519.1) as the standard, the available fragments ranged from 75 to 998 bp. DAMBE (Xia and Xie, 2001) was employed to test for nucleotide substitution saturation. Iss was significantly smaller than Iss.c (P = 0), indicating that nucleotide substitutions were not saturated (Xia et al., 2003). Pairwise divergences (Kimura 2-parameter, K2P) of these sequences were calculated using MEGA 6 (Tamura et al., 2013). Intraspecific distances greater than 1% and interspecific distances less than 10% were then identified as potentially problematic. Neighbor-joining trees with 1,000 bootstrap replications were constructed using MEGA 6 (Tamura et al., 2013) to visualize similarity and sequence divergence. Sequences with intraspecific K2P divergences greater than interspecific differences were retained for further evaluation.
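For readers unfamiliar with the K2P distance, the following is a minimal Python sketch of the standard Kimura 2-parameter formula, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. It is not the MEGA 6 implementation used in the study; the function name `k2p_distance` and the choice to skip sites with gaps or ambiguous bases (pairwise deletion) are assumptions made for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Assumption (not from the paper): sites where either sequence has a gap
    or an ambiguous base are skipped (pairwise deletion).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")
    if transitions == 0 and transversions == 0:
        return 0.0

    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return math.inf  # distance undefined: sequences too divergent
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

A distance returned by this function is a proportion, so the 1% and 10% thresholds used in the text correspond to values of 0.01 and 0.10.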

Results and discussion

The compiled dataset of Cytb sequences of fishes from GenBank exhibited a great diversity of lengths. A clear tradeoff existed between maximizing the length of the alignments and taxonomic coverage (Shen et al., 2013). Usable fragment lengths ranged from 55 to 972 bp. Our final dataset consisted of 35,130 fragments of 918 bp. We regarded GenBank accession number GU135519.1 for Cytb as the standard for all comparisons. The index of substitution saturation (Iss) is significantly less than the critical value Iss.c (P = 0) (Table 1). This result suggests that the nucleotide substitutions are not saturated. The distribution of genetic distances has been shown to vary greatly (Johns and Avise, 1998). Notwithstanding, our intraspecific differences generally fall below 1%, while interspecific differences usually exceed 10% (Figure 1). The gap suggests that Cytb can efficiently distinguish different species of fishes. Some notable exceptions exist: sequences with shallow interspecific divergence (<10%), deep intraspecific divergence (>1%), or interspecific differences that are much less than intraspecific differences constitute potential errors. Based on this criterion, we identify 1,303 potentially problematic Cytb gene sequences (Table S1).
Table 1

Test of substitution saturation of Cytb sequences of fishes.

NumOTU  Iss    Iss.cSym  T       DF   P      Iss.cAsym  T       DF   P
4       0.298  0.817     29.382  917  0.000  0.785      27.583  917  0.000
8       0.296  0.784     24.705  917  0.000  0.677      19.302  917  0.000
16      0.294  0.766     22.614  917  0.000  0.565      12.988  917  0.000
32      0.299  0.742     20.696  917  0.000  0.431      6.190   917  0.000
Figure 1

Intra- and interspecific pairwise divergence (Kimura 2-parameter) of Cytb in fishes. (A) Intraspecific divergence. (B) Interspecific divergence.
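The 1% and 10% thresholds can be applied to a table of pairwise distances as in the sketch below. This is an illustration only, not the authors' pipeline: the function name `flag_pairs`, the data layout (a dictionary of accession pairs), and the toy accessions and distances are all hypothetical.

```python
# Thresholds taken from the text: intraspecific K2P distances are generally
# below 1%, interspecific K2P distances generally above 10%.
INTRA_MAX = 0.01
INTER_MIN = 0.10


def flag_pairs(species_of, distances, intra_max=INTRA_MAX, inter_min=INTER_MIN):
    """Return pairs whose distance falls outside the expected range.

    species_of: dict mapping accession -> species name
    distances:  dict mapping (accession_a, accession_b) -> K2P distance
    """
    flagged = []
    for (acc_a, acc_b), dist in distances.items():
        same_species = species_of[acc_a] == species_of[acc_b]
        if same_species and dist > intra_max:
            flagged.append((acc_a, acc_b, "deep intraspecific", dist))
        elif not same_species and dist < inter_min:
            flagged.append((acc_a, acc_b, "shallow interspecific", dist))
    return flagged


# Toy data (hypothetical accessions and distances, for illustration only).
species_of = {"s1": "Species A", "s2": "Species A",
              "s3": "Species B", "s4": "Species B"}
distances = {
    ("s1", "s2"): 0.004, ("s1", "s3"): 0.150, ("s1", "s4"): 0.160,
    ("s2", "s3"): 0.006, ("s2", "s4"): 0.155, ("s3", "s4"): 0.180,
}
flagged = flag_pairs(species_of, distances)
for row in flagged:
    print(row)
```

In this toy set, the pair s2/s3 is flagged as shallow interspecific and the pair s3/s4 as deep intraspecific. A flag marks a pair for inspection; it does not say which member of the pair is in error.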

Shallow interspecific divergence may have several causes. (1) Species of recent origin should have very shallow interspecific divergence. For example, the K2P divergence between Comephorus dybowskii and C. baicalensis is only 0.4–1.0%, and that between Etheostoma kanawhae and E. osburni is a mere 0.4–0.7%. These species appear to have recent origins (Syu et al., 1994; Sun et al., 2007; Geiger et al., 2016). (2) MtDNA introgression can lead to shallow interspecific differences. For example, Melanotaenia misoolensis (KC133624.1) is very similar to M. flavipinnis (0.2–0.3%), and M. boesemani (KC133618.1) shows shallow interspecific divergence with M. ajamaruensis (0.3–0.4%). Gene introgression via hybridization occurs in rainbowfishes (Unmack et al., 2013). The low genetic divergence between Chasmistes brevirostris and Deltistes luxatus (0.8–1%) is also due to introgressive hybridization (Dowling et al., 2016). Introgressive hybridization has also been found in suckers, darters, barbs and other groups (Near et al., 2011; Unmack et al., 2014; Bernal et al., 2017; Schmidt et al., 2017). Introgression therefore accounts for unexpectedly shallow interspecific divergence in many fishes. Nuclear sequences would help to distinguish recent origin from mtDNA introgression. (3) Errors in species identification and conspecificity of the species can also lead to low values of divergence. For example, Etheostoma spectabile (FJ381067.1, FJ381066.1, FJ381061.1, and FJ381057.1), E. bison (KF377137.1), E. burri (FJ381080.1 and AY374262.1), and E. lawrencei (KF377157.1 and KF377156.1) show shallow interspecific divergence with E. caeruleum (0.3–1.1%). This result suggests either conspecificity or species misidentification. K2P distances among Etheostoma sitikuense, E. percnurum, and E. marmorpinnum range from 0.2 to 1.1%. The low levels of interspecific divergence indicate either recent divergence or perhaps a taxon-specific slowing of the molecular clock. Although no specific level of divergence can identify species, low interspecific divergence points to a need for further investigation.

Larger than expected intraspecific differences also exist. For example, two sequences of Paramisgurnus dabryanus (KM186183.1, KF771003.1) differ from conspecifics by 18.1–19.6%, one of Paracobitis malapterura (LC167412.1) by 22.0–22.4%, two of Etheostoma coosae (HQ128114.1, AY374266.1) by 10.9–12.2%, two of Rhodeus ocellatus (KT004415.1, AF051876.1) by 20.0–20.6%, and two of Schizothorax waltoni (KT833090.1, KT833089.1) by 19.2–20.7%. These cases indicate that, for at least half of the sequences, the specimen was incorrectly identified to species, DNA contamination occurred in the laboratory, or an erroneous sequence was submitted to GenBank.

Species having wide ranges of intraspecific differences are most likely composites of multiple cryptic species. For example, Etheostoma nigripinne has complex relationships, and its intraspecific divergences range from 0.0 to 14.5%. Similarly, intraspecific divergences of E. rufilineatum range from 0.1 to 12.6%. Many currently recognized species contain a few cryptic species (Köhler et al., 2005; Palandacic et al., 2017; Phuong et al., 2017). Further taxonomic study is necessary for species with wide ranges of intraspecific differences.

Cases where interspecific differences are much less than intraspecific differences likely stem from problems such as species misidentifications, database errors when submitting sequences to GenBank, laboratory mix-ups, laboratory contamination, and other issues. For example, one sequence of Etheostoma oophylaxe (JX547432.1) has shallow interspecific divergence with E. nigripinne (0.1–4.1%), but deep intraspecific divergence (13.8–14.5%) (Figure 2A). One sequence of E. artesiae (HQ128075.1) has relatively low interspecific divergence with E. swaini (5.4–7.6%), but deep intraspecific divergence (10.3–10.4%). Three sequences of E. crossopterum (JX547246.1, JX547256.1, and JX547253.1) have shallow interspecific divergence with E. nigripinne (0–0.3%) but exhibit deep intraspecific divergence (15.6–16.8%). Four sequences of Acheilognathus signifer (KF410810.1, KF410811.1, EF483930.1, and JQ714034.1) have low interspecific divergence with Tanakia koreensis (0.2–5.2%), yet deep intraspecific divergence (15.4–16.1%; Figure 2B). Further investigation into the discordance is desirable.
Figure 2

Two examples of potential errors for Cytb sequences in fishes.
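The pattern illustrated in Figure 2, a sequence that sits closer to another species than to its own, can be screened for with a nearest-neighbor comparison. The sketch below is one possible way to do this and is not taken from the paper: the function name `discordant_sequences`, the data layout, and the toy accessions and distances are hypothetical.

```python
import math


def discordant_sequences(species_of, distances):
    """Find sequences whose nearest heterospecific is closer than their
    nearest conspecific.

    species_of: dict mapping accession -> species name
    distances:  dict mapping (accession_a, accession_b) -> K2P distance
    Returns tuples of (accession, species of nearest heterospecific,
    nearest conspecific distance, nearest heterospecific distance).
    """
    nearest_same = {}   # accession -> (distance, partner) within species
    nearest_other = {}  # accession -> (distance, partner) across species
    for (acc_a, acc_b), dist in distances.items():
        same_species = species_of[acc_a] == species_of[acc_b]
        table = nearest_same if same_species else nearest_other
        for focal, partner in ((acc_a, acc_b), (acc_b, acc_a)):
            if dist < table.get(focal, (math.inf, None))[0]:
                table[focal] = (dist, partner)

    discordant = []
    for acc in species_of:
        if acc in nearest_same and acc in nearest_other:
            same_dist, _ = nearest_same[acc]
            other_dist, other_acc = nearest_other[acc]
            if other_dist < same_dist:
                discordant.append(
                    (acc, species_of[other_acc], same_dist, other_dist)
                )
    return discordant


# Toy data (hypothetical): a3 is labeled Species A but resembles Species B.
species_of = {"a1": "Species A", "a2": "Species A", "a3": "Species A",
              "b1": "Species B", "b2": "Species B"}
distances = {
    ("a1", "a2"): 0.005, ("a1", "a3"): 0.140, ("a2", "a3"): 0.142,
    ("a1", "b1"): 0.141, ("a1", "b2"): 0.143, ("a2", "b1"): 0.139,
    ("a2", "b2"): 0.140, ("a3", "b1"): 0.002, ("a3", "b2"): 0.003,
    ("b1", "b2"): 0.001,
}
result = discordant_sequences(species_of, distances)
print(result)
```

Only a3 is reported here. Singletons (species with a single sequence) have no conspecific distance and are skipped by this sketch, so they would need separate handling.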

Other factors can lead to unexpected values of genetic divergence. (1) Great geographic distances can result in genetic divergence, especially in widely distributed species. (2) Recent origins of species can result in high levels of genetic similarity. (3) Taxonomic change can result in errors. For example, the names Rutilus lemmingii and Chondrostoma lemmingii differ but refer to the same species, as do Epinephelus lanceolatus and Promicrops lanceolatus. Therefore, we suggest that GenBank (NCBI) provide a mechanism for updating changes in taxonomic classification. (4) Morphologically different species may have essentially identical genes. For example, many species of darters (Etheostoma) differ morphologically but differ only slightly genetically. Similarly, Glossolepis incisus, G. pseudoincisus, and G. dorityi are all essentially identical genetically (Unmack et al., 2013).

Note that, without standard sequences for each species, we cannot determine which of two sequences with an atypical divergence value is correct and which is wrong. Further investigation of species with atypical genetic divergence values (Table S1) can improve the accuracy of the fish mitochondrial database and prompt further study.

DNA barcoding can complement morphological classifications and provide an alternative approach to assessing species diversity. The approach is now widely used to identify species of fishes (Ward et al., 2005; Smith et al., 2008; Ardura et al., 2010; Filonzi et al., 2010). Classifications form the basis of evolutionary research, and incorrect taxonomies can negatively affect all other biological investigations. Fishes comprise nearly half of all vertebrate species, and, thus, an accurate classification is essential. Species identification errors in GenBank can mislead subsequent research. We detect potentially problematic data for one gene only, Cytb, in fishes. The approach should also be useful for other mitochondrial genes and other taxa. DNA barcoding can identify species of fishes, species complexes, and sister species, and can reveal potentially problematic sequences.
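Because synonymous names can inflate apparent interspecific comparisons, one practical precaution is to map names to a single label before grouping sequences by species. The sketch below illustrates the idea with the two synonym pairs mentioned in the text. The table `SYNONYMS`, the helper names, and the accessions are hypothetical, and which name of each pair is treated as the canonical label is an arbitrary choice here, not a taxonomic statement.

```python
from collections import defaultdict

# Synonym pairs taken from the text. The direction of each mapping is an
# arbitrary choice for illustration, not a statement about valid names.
SYNONYMS = {
    "Chondrostoma lemmingii": "Rutilus lemmingii",
    "Promicrops lanceolatus": "Epinephelus lanceolatus",
}


def canonical_name(name):
    """Normalize whitespace and replace a known synonym with its label."""
    cleaned = " ".join(name.split())
    return SYNONYMS.get(cleaned, cleaned)


def group_by_species(records):
    """Group (accession, species name) records under canonical names."""
    groups = defaultdict(list)
    for accession, name in records:
        groups[canonical_name(name)].append(accession)
    return dict(groups)


# Hypothetical accessions, for illustration only.
records = [
    ("acc1", "Rutilus lemmingii"),
    ("acc2", "Chondrostoma lemmingii"),
    ("acc3", "Epinephelus lanceolatus"),
    ("acc4", "Promicrops  lanceolatus"),
]
groups = group_by_species(records)
print(groups)
```

With this grouping, acc1 and acc2 are compared as conspecifics rather than being reported as a shallow interspecific pair.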

Author contributions

XL carried out the data analysis and drafted the manuscript; XS, XC, and DX carried out data analysis; YS designed and coordinated the study, and helped draft the manuscript; RM revised the manuscript. All authors gave final approval for publication.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References (27 in total)

1.  Biological identifications through DNA barcodes.

Authors:  Paul D N Hebert; Alina Cywinska; Shelley L Ball; Jeremy R deWaard
Journal:  Proc Biol Sci       Date:  2003-02-07       Impact factor: 5.349

2.  DNA barcoding to fishes: current status and future directions.

Authors:  Manojit Bhattacharya; Ashish Ranjan Sharma; Bidhan Chandra Patra; Garima Sharma; Eun-Min Seo; Ju-Suk Nam; Chiranjib Chakraborty; Sang-Soo Lee
Journal:  Mitochondrial DNA A DNA Mapp Seq Anal       Date:  2015-06-09       Impact factor: 1.514

3.  Phylogeny and biogeography of rainbowfishes (Melanotaeniidae) from Australia and New Guinea.

Authors:  Peter J Unmack; Gerald R Allen; Jerald B Johnson
Journal:  Mol Phylogenet Evol       Date:  2013-01-08       Impact factor: 4.286

4.  Introgression and selection shaped the evolutionary history of sympatric sister-species of coral reef fishes (genus: Haemulon).

Authors:  Moisés A Bernal; Michelle R Gaither; W Brian Simison; Luiz A Rocha
Journal:  Mol Ecol       Date:  2016-12-25       Impact factor: 6.185

5.  DNA barcoding Australia's fish species.

Authors:  Robert D Ward; Tyler S Zemlak; Bronwyn H Innes; Peter R Last; Paul D N Hebert
Journal:  Philos Trans R Soc Lond B Biol Sci       Date:  2005-10-29       Impact factor: 6.237

6.  Analysis of tandem DNA repeats of cottoid fish in Lake Baikal by direct consensus sequencing.

Authors:  M E Pavlova; S I Belikov
Journal:  Mol Mar Biol Biotechnol       Date:  1994-12

7.  A comparative summary of genetic distances in the vertebrates from the mitochondrial cytochrome b gene.

Authors:  G C Johns; J C Avise
Journal:  Mol Biol Evol       Date:  1998-11       Impact factor: 16.240

8.  Influence of introgression and geological processes on phylogenetic relationships of Western North American mountain suckers (Pantosteus, Catostomidae).

Authors:  Peter J Unmack; Thomas E Dowling; Nina J Laitinen; Carol L Secor; Richard L Mayden; Dennis K Shiozawa; Gerald R Smith
Journal:  PLoS One       Date:  2014-03-11       Impact factor: 3.240

9.  Contrasting morphology with molecular data: an approach to revision of species complexes based on the example of European Phoxinus (Cyprinidae).

Authors:  Anja Palandačić; Alexander Naseka; David Ramler; Harald Ahnelt
Journal:  BMC Evol Biol       Date:  2017-08-09       Impact factor: 3.260

10.  Introgressive Hybridization and the Evolution of Lake-Adapted Catostomid Fishes.

Authors:  Thomas E Dowling; Douglas F Markle; Greg J Tranah; Evan W Carson; David W Wagman; Bernard P May
Journal:  PLoS One       Date:  2016-03-09       Impact factor: 3.240
