Bachir Balech, Anna Sandionigi, Marinella Marzano, Graziano Pesole, Monica Santamaria.
Abstract
Nucleotide sequence reference collections, or databases, are fundamental components of DNA barcoding and metabarcoding data analysis pipelines. In such analyses, accurate taxonomic assignment is crucial and relies directly on the availability of a comprehensive, curated reference sequence collection and its taxonomy information. The current wide use of the mitochondrial cytochrome oxidase subunit-I (COXI) as a standard DNA barcode marker in metazoan biodiversity studies highlights the need to clarify what relevant information is available from different data sources and how it can be integrated. Addressing data integration adequately requires attention to several aspects, starting with DNA sequence curation, followed by taxonomy alignment to a solid reference backbone and metadata harmonization according to universal standards. Here, we present MetaCOXI, an integrated collection of curated metazoan COXI DNA sequences with their associated harmonized taxonomy and metadata. This collection was built from the two most extensive available data resources, namely the European Nucleotide Archive (ENA) and the Barcode of Life Data System (BOLD). The current release contains more than 5.6 million entries (39.1% unique to BOLD, 3.6% unique to ENA, and 57.2% shared between both), their taxonomic classification based on the NCBI reference taxonomy, and the main available metadata relevant to environmental DNA studies, such as geographical coordinates, sampling country and host species. MetaCOXI is available in standard universal formats ('fasta' for sequences and 'tsv' for taxonomy and metadata), which can be easily incorporated into standard or specific DNA barcoding and/or metabarcoding data analysis pipelines. Database URL: https://github.com/bachob5/MetaCOXI
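As an illustration of how 'fasta' and 'tsv' files of this kind might be combined in a pipeline, the following minimal Python sketch joins sequences to their taxonomy rows by identifier. The column names (`seq_id`, `phylum`, `country`) and the toy records are assumptions made for the example; the actual MetaCOXI file layout should be checked in the repository before adapting it.

```python
import csv
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    ident, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                yield ident, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            ident, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)


def read_taxonomy(handle, id_column="seq_id"):
    """Map identifier -> row dict from a tab-separated table with a header line.

    The id_column name is hypothetical, not taken from MetaCOXI.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row for row in reader}


# Toy in-memory data standing in for the real files (hypothetical content).
fasta_text = ">seq1\nACGT\nACGT\n>seq2\nTTGA\n"
tsv_text = "seq_id\tphylum\tcountry\nseq1\tArthropoda\tItaly\nseq2\tChordata\tNA\n"

taxonomy = read_taxonomy(io.StringIO(tsv_text))
records = [
    {"id": ident, "length": len(seq), **taxonomy.get(ident, {})}
    for ident, seq in read_fasta(io.StringIO(fasta_text))
]
```

Keying the taxonomy table by identifier keeps the join a single dictionary lookup per sequence; for a collection of several million entries, a streaming or database-backed join may be preferable to holding everything in memory.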
Year: 2022 PMID: 35134858 PMCID: PMC9216479 DOI: 10.1093/database/baab084
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
Figure 1. Schematic representation of the bioinformatics workflow used to build MetaCOXI. The whole process was implemented and executed in a Linux environment using bash shell commands in combination with custom Python scripts to parse the intermediate results. ENA raw sequences were filtered according to their length. The hmmsearch application was used to search the translated DNA sequences against the COXI HMM Pfam reference profile. All matches satisfying the TC (trusted cutoff) threshold were selected. The sequences were further validated using blastn to exclude bacterial, plant and archaeal sequences. The taxonomic classification of all entries was aligned to the NCBI reference taxonomy. The final COXI sequences and their associated harmonized taxonomy paths and metadata are provided in ‘fasta’ and ‘tsv’ formats, respectively. Yellow indicates filtering parameters, green denotes processing steps, and dashed red arrows show the results returned from reference databases.
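The first step of the workflow, length filtering, can be sketched in a few lines of Python. This is not the authors' script: the function name is hypothetical, and the default bounds are borrowed from the 100 to 3020 bp range reported for the final collection, which may differ from the thresholds actually applied to the raw ENA sequences. The hmmsearch and blastn steps are not reproduced here.

```python
def filter_by_length(records, min_len=100, max_len=3020):
    """Keep (identifier, sequence) pairs whose length lies in [min_len, max_len].

    The default bounds are assumptions for illustration only.
    """
    return [(ident, seq) for ident, seq in records if min_len <= len(seq) <= max_len]


# Toy records: one too short, one typical barcode length, one too long.
toy_records = [("a", "A" * 50), ("b", "A" * 658), ("c", "A" * 4000)]
kept = filter_by_length(toy_records)
kept_ids = [ident for ident, _ in kept]
```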
Number of entries in MetaCOXI by source database: BOLD, ENA or both
| MetaCOXI | BOLD unique | ENA unique | BOLD and ENA |
|---|---|---|---|
| 5 608 848 | 2 195 176 (39.13%) | 201 719 (3.59%) | 3 211 953 (57.26%) |
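The percentages in this table can be recomputed from the counts. The short Python check below does so; it suggests that the tabulated values are truncated to two decimals rather than rounded (for example, the shared fraction evaluates to about 57.266%), which is an observation from the arithmetic and not something stated in the source.

```python
import math

# Counts copied from the table above.
counts = {
    "BOLD unique": 2_195_176,
    "ENA unique": 201_719,
    "BOLD and ENA": 3_211_953,
}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

# Truncate to two decimals, expressed in hundredths of a percent to avoid
# floating-point surprises when comparing.
truncated_hundredths = {name: math.floor(s * 100) for name, s in shares.items()}
```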
Figure 2. Length distribution of MetaCOXI sequences, ranging from 100 to 3020 bp. The most frequent sequence length is 658 bp, represented by 1 573 982 sequences.
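A length distribution of this kind can be tallied with a counter keyed by sequence length. The sketch below uses four toy sequences, not MetaCOXI data, and the function name is hypothetical.

```python
from collections import Counter


def length_distribution(sequences):
    """Count how many sequences occur at each length."""
    return Counter(len(seq) for seq in sequences)


# Toy sequences chosen to mimic the reported range and modal length.
toy_sequences = ["A" * 658, "C" * 658, "G" * 100, "T" * 3020]
distribution = length_distribution(toy_sequences)

# The most frequent length and the number of sequences that share it.
modal_length, modal_count = distribution.most_common(1)[0]
shortest, longest = min(distribution), max(distribution)
```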
Taxonomic composition of MetaCOXI at the six main taxonomic levels within the kingdom Metazoa, with the corresponding number of sequences. Percentages are relative to the totals present in the whole collection. Phyla with more than three classes are highlighted in red.
| Phylum | Seq N° | Class N° | Class % | Order N° | Order % | Family N° | Family % | Genus N° | Genus % | Species N° | Species % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arthropoda | 4 782 392 | 19 | 19 | 130 | 21.07 | 2306 | 45.013 | 36 559 | 69.969 | 658 938 | 88.601 |
| Chordata | 482 868 | 18 | 18 | 171 | 27.71 | 1051 | 20.515 | 7895 | 15.110 | 39 910 | 5.366 |
| Mollusca | 170 563 | 8 | 8 | 80 | 12.97 | 587 | 11.458 | 3471 | 6.643 | 19 049 | 2.561 |
| Platyhelminthes | 28 693 | 6 | 6 | 39 | 6.32 | 191 | 3.728 | 620 | 1.187 | 3228 | 0.434 |
| Cnidaria | 17 787 | 6 | 6 | 24 | 3.89 | 229 | 4.470 | 749 | 1.433 | 3919 | 0.527 |
| Echinodermata | 27 995 | 5 | 5 | 42 | 6.81 | 143 | 2.791 | 616 | 1.179 | 2330 | 0.313 |
| Porifera | 4150 | 4 | 4 | 29 | 4.70 | 100 | 1.952 | 330 | 0.632 | 1465 | 0.197 |
| Acanthocephala | 1944 | 4 | 4 | 7 | 1.13 | 18 | 0.351 | 54 | 0.103 | 162 | 0.022 |
| Brachiopoda | 167 | 3 | 3 | 5 | 0.81 | 15 | 0.293 | 26 | 0.050 | 77 | 0.010 |
| Bryozoa | 2224 | 3 | 3 | 4 | 0.65 | 61 | 1.191 | 110 | 0.211 | 282 | 0.038 |
| Rotifera | 8971 | 3 | 3 | 6 | 0.97 | 24 | 0.468 | 67 | 0.128 | 995 | 0.134 |
| Nemertea | 3849 | 3 | 3 | 4 | 0.65 | 32 | 0.625 | 88 | 0.168 | 703 | 0.095 |
| Annelida | 49 906 | 3 | 3 | 30 | 4.86 | 134 | 2.616 | 909 | 1.740 | 8224 | 1.106 |
| Hemichordata | 131 | 2 | 2 | 2 | 0.32 | 5 | 0.098 | 8 | 0.015 | 30 | 0.004 |
| Ctenophora | 200 | 2 | 2 | 5 | 0.81 | 5 | 0.098 | 6 | 0.011 | 26 | 0.003 |
| Nematoda | 20 979 | 2 | 2 | 19 | 3.08 | 148 | 2.889 | 499 | 0.955 | 3247 | 0.437 |
| Tardigrada | 1705 | 2 | 2 | 5 | 0.81 | 8 | 0.156 | 42 | 0.080 | 319 | 0.043 |
| Kinorhyncha | 390 | 2 | 2 | 0 | 0.00 | 8 | 0.156 | 8 | 0.015 | 36 | 0.005 |
| Priapulida | 17 | 1 | 1 | 1 | 0.16 | 2 | 0.039 | 4 | 0.008 | 4 | 0.001 |
| Onychophora | 1235 | 1 | 1 | 1 | 0.16 | 2 | 0.039 | 33 | 0.063 | 199 | 0.027 |
| Chaetognatha | 1098 | 1 | 1 | 3 | 0.49 | 6 | 0.117 | 15 | 0.029 | 97 | 0.013 |
| Nematomorpha | 328 | 1 | 1 | 2 | 0.32 | 3 | 0.059 | 8 | 0.015 | 87 | 0.012 |
| Entoprocta | 33 | 0 | 0 | 0 | 0.00 | 3 | 0.059 | 7 | 0.013 | 23 | 0.003 |
| Gastrotricha | 451 | 0 | 0 | 2 | 0.32 | 11 | 0.215 | 33 | 0.063 | 124 | 0.017 |
| Sipuncula | 4 | 0 | 0 | 0 | 0.00 | 0 | 0.000 | 0 | 0.000 | 0 | 0.000 |
| Placozoa | 11 | 0 | 0 | 0 | 0.00 | 1 | 0.020 | 1 | 0.002 | 6 | 0.001 |
| Dicyemida | 102 | 0 | 0 | 0 | 0.00 | 1 | 0.020 | 2 | 0.004 | 16 | 0.002 |
| Gnathostomulida | 14 | 0 | 0 | 2 | 0.32 | 6 | 0.117 | 8 | 0.015 | 10 | 0.001 |
| Xenacoelomorpha | 261 | 0 | 0 | 2 | 0.32 | 20 | 0.390 | 48 | 0.092 | 175 | 0.024 |
| Cycliophora | 267 | 0 | 0 | 0 | 0.00 | 0 | 0.000 | 1 | 0.002 | 5 | 0.001 |
| Phoronida | 106 | 0 | 0 | 0 | 0.00 | 0 | 0.000 | 2 | 0.004 | 23 | 0.003 |
| NA | 7 | 1 | 1 | 1 | 0.16 | 1 | 0.020 | 1 | 0.002 | 7 | 0.001 |