Haitham Marakeby, Eman Badr, Hanaa Torkey, Yuhyun Song, Scotland Leman, Caroline L Monteil, Lenwood S Heath, Boris A Vinatzer.
Abstract
A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today's speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research.
Year: 2014 PMID: 24586551 PMCID: PMC3931686 DOI: 10.1371/journal.pone.0089142
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of genome similarity-based code assignment.
(A) The genome of one organism is chosen as first genome (G1), added to the genome database, and “0” is assigned to all positions in the code (only five positions are shown here for simplicity while codes with 20 positions were used in the examples in Tables 2 to 5). A second genome (G2) is then added to the database and compared to G1. A code is assigned to the organism with genome G2 based on the genome similarity to G1 measured as percentage of average nucleotide identity (ANI). (B) The genome of a third organism (G3) is compared to G1 and G2. Since G3 is more similar to G1 than G2, G3 is assigned its code based on its ANI with G1. (C) Every new genome that is added to the database will be compared to all genomes already in the database and codes will always be assigned based on the ANI with the most similar genome. (D) Since every organism in the database was assigned a code based on genome similarity with the most similar organism already in the database at the time of its addition, all codes reflect the similarity of organisms with each other (as long as their genomes aligned) and thus are an approximation of their phylogenetic relationships (represented by the tree in the figure).
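The procedure in panels A to C can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule for choosing the new number (the next unused integer at the first position whose threshold the best ANI does not reach, followed by zeros) is an assumption inferred from the caption, the names `assign_code` and `format_code` are hypothetical, and the ANI values are invented for illustration. Only five positions are used, as in the figure.

```python
from typing import Dict, List

# Five positions only, as in Figure 1; the percent-ANI thresholds for A-E
# follow the threshold table of the article.
POSITIONS = [("A", 60.0), ("B", 70.0), ("C", 80.0), ("D", 85.0), ("E", 90.0)]


def assign_code(ani_to_db: Dict[str, float], db: Dict[str, List[int]]) -> List[int]:
    """Return a code for a new genome.

    ani_to_db maps each genome already in the database to its ANI (percent)
    with the new genome; db maps genome names to their codes.
    """
    n = len(POSITIONS)
    if not db:
        return [0] * n  # first genome: "0" at every position

    # The most similar genome already in the database decides the code.
    best = max(ani_to_db, key=ani_to_db.get)
    best_ani, best_code = ani_to_db[best], db[best]

    # Count the leading positions whose threshold the best ANI reaches.
    shared = 0
    for _, threshold in POSITIONS:
        if best_ani < threshold:
            break
        shared += 1
    if shared == n:
        return list(best_code)  # indistinguishable at this resolution

    # ASSUMPTION: use the next unused number at the first unshared position
    # among codes with the same prefix, then fill the remainder with zeros.
    prefix = best_code[:shared]
    used = [code[shared] for code in db.values() if code[:shared] == prefix]
    return prefix + [max(used) + 1] + [0] * (n - shared - 1)


def format_code(code: List[int]) -> str:
    """Render a numeric code with its position labels, e.g. 0A0B0C1D0E."""
    return "".join(f"{value}{label}" for value, (label, _) in zip(code, POSITIONS))


db: Dict[str, List[int]] = {}
db["G1"] = assign_code({}, db)
db["G2"] = assign_code({"G1": 82.0}, db)              # hypothetical ANI
db["G3"] = assign_code({"G1": 87.0, "G2": 81.0}, db)  # hypothetical ANIs
for name, code in db.items():
    print(name, format_code(code))
```

With these made-up values, G2 diverges from G1 at position D, while G3 takes its code from G1 (its closest match) and diverges only at position E.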
Provisional codes assigned to a selection of γ proteobacteria and a small number of non-γ proteobacteria.
| Order or family | Species and strain name | Code |
| Non-gamma | | |
| | | 0A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 0A0B0C0D0E0F0G0H0K0L1M0P0Q0R |
| | | |
| | | 1A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 1A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| Pasteurellales | | |
| | | 2A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 2A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 2A0B2C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 2A0B3C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 2A0B4C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 2A0B5C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 2A0B6C0D0E0F0G0H0K0L0M0P0Q0R |
| | | |
| | | 6A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 6A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 6A1B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B1C0D0E0F0G0H0K0L0M0P1Q0R |
| | | 12A0B1C1D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B1C1D0E0F0G1H0K0L0M0P0Q0R |
| | | 12A0B2C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B3C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B4C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B5C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B6C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 12A0B6C0D0E0F0G0H0K0L0M0P0Q1R |
| | | |
| | | 13A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| Vibrionales | | |
| | | 20A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 20A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 20A1B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 20A1B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 20A1B2C0D0E0F0G0H0K0L0M0P0Q0R |
| | | |
| | | 22A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B1C1D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B2C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B2C1D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B3C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B4C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 22A0B5C0D0E0F0G0H0K0L0M0P0Q0R |
| | | |
| | | 29A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B1C1D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B2C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B3C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B4C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B5C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 29A0B6C0D0E0F0G0H0K0L0M0P0Q0R |
| Xanthomonadales | | |
| | | 31A0B0C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 31A0B1C0D0E0F0G0H0K0L0M0P0Q0R |
| | | 31A0B1C1D0E0F0G0H0K0L0M0P0Q0R |
| | | 31A0B2C0D0E0F0G0H0K0L0M0P0Q0R |
Code positions from A (60% ANI) to R (99.95% ANI) are shown. See Table S1 in File S1 for codes that were assigned to additional taxa, for ANIb values, and for the percentage of fragments that aligned with the genomes used for code assignment.
Examples of provisional codes assigned to Foot and Mouth Disease Viruses.
| Country of isolation | Accession # | Code |
| UK | | |
| | DQ404158 | 0C 0E 0F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | DQ404159 | 0C 0E 0F 0G 0H 0I 0J 0K 1L 0M 0R 0X |
| | DQ404160 | 0C 0E 0F 0G 0H 0I 0J 0K 1L 1M 0R 0X |
| | DQ404161 | 0C 0E 0F 0G 0H 0I 0J 1K 0L 0M 0R 0X |
| | DQ404162 | 0C 0E 0F 0G 0H 1I 0J 0K 0L 0M 0R 0X |
| | DQ404163 | 0C 0E 0F 0G 0H 2I 0J 0K 0L 0M 0R 0X |
| | DQ404164 | 0C 0E 0F 0G 0H 3I 0J 0K 0L 0M 0R 0X |
| | DQ404165 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 0M 0R 0X |
| | DQ404166 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 0M 0R 1X |
| | DQ404167 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 0M 1R 0X |
| | DQ404168 | 0C 0E 0F 0G 0H 3I 1J 0K 2L 0M 0R 0X |
| | DQ404169 | 0C 0E 0F 0G 0H 3I 1J 0K 3L 0M 0R 0X |
| | DQ404170 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 1M 0R 0X |
| | DQ404171 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 2M 0R 0X |
| | DQ404172 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 0R 0X |
| | DQ404173 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 0R 1X |
| | DQ404174 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 1R 0X |
| | DQ404175 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 0R 2X |
| | DQ404176 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 2R 0X |
| | DQ404177 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 2R 1X |
| | DQ404178 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 2R 2X |
| | DQ404179 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 2R 3X |
| | DQ404180 | 0C 0E 0F 0G 0H 3I 1J 0K 0L 3M 3R 0X |
| India | | |
| | HQ832576 | 0C 1E 0F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832577 | 0C 1E 1F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832578 | 0C 1E 2F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832579 | 0C 1E 2F 0G 1H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832580 | 0C 1E 2F 0G 2H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832581 | 0C 1E 2F 0G 3H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832582 | 0C 1E 2F 1G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832583 | 0C 1E 2F 0G 4H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832584 | 0C 1E 3F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832585 | 0C 1E 4F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832586 | 0C 1E 5F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832587 | 0C 1E 6F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832588 | 0C 1E 7F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832589 | 0C 1E 8F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832590 | 0C 1E 9F 0G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832591 | 0C 1E 9F 1G 0H 0I 0J 0K 0L 0M 0R 0X |
| | HQ832592 | 0C 1E 9F 2G 0H 0I 0J 0K 0L 0M 0R 0X |
Code positions ranging from C (80% ANI) to X (99.9999% ANI) are shown. See Table S5 in File S1 for codes, ANIb values, and percentage of fragments that aligned with the genomes used for code assignment.
Thresholds of Average Nucleotide Identity (ANI) used for assignment of provisional codes in Tables 2 through 5.
| Position label | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X |
| ANI % | 60 | 70 | 80 | 85 | 90 | 95 | 98 | 99 | 99.5 | 99.6 | 99.7 | 99.8 | 99.9 | 99.91 | 99.92 | 99.93 | 99.94 | 99.95 | 99.96 | 99.97 | 99.98 | 99.99 | 99.999 | 99.9999 |
The 95% ANI threshold (position F) approximately corresponds to 70% DNA-DNA hybridization (DDH) [15].
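The threshold table can be read as a lookup from an ANI value to the deepest code position whose threshold that value reaches. The following Python sketch shows one way to express this; the function name `deepest_position` is hypothetical and the use of a binary search is an illustrative choice, not something described in the article.

```python
from bisect import bisect_right
from typing import Optional

# Position labels and ANI thresholds (percent) copied from the table above.
LABELS = "ABCDEFGHIJKLMNOPQRSTUVWX"
ANI_THRESHOLDS = [60, 70, 80, 85, 90, 95, 98, 99, 99.5, 99.6, 99.7, 99.8,
                  99.9, 99.91, 99.92, 99.93, 99.94, 99.95, 99.96, 99.97,
                  99.98, 99.99, 99.999, 99.9999]


def deepest_position(ani: float) -> Optional[str]:
    """Return the label of the last position whose threshold is <= ani.

    Returns None when ani is below the first threshold (60%).
    """
    index = bisect_right(ANI_THRESHOLDS, ani) - 1
    return LABELS[index] if index >= 0 else None


print(deepest_position(97.9))   # between the F (95%) and G (98%) thresholds
print(deepest_position(99.93))  # exactly at the P threshold
print(deepest_position(59.9))   # below the A threshold
```

Under this reading, two genomes with 97.9% ANI would be expected to share code positions A through F and to differ from position G onward.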
Provisional codes assigned to Bacillus anthracis strains.
| Strain | Code |
| A0174 | 0V0W0X |
| A0193 | 0V1W0X |
| Western North America USA6153 | 0V2W0X |
| Tsiankovskii I | 0V3W0X |
| A0389 | 1V0W0X |
| Ames | 1V1W0X |
| Ames Ancestor | 1V1W1X |
| A0248 | 1V1W1X |
| Australia 94 | 1V2W0X |
| Sterne | 1V3W0X |
| A0442 | 2V0W0X |
| Kruger B | 2V1W0X |
| A0465 | 3V0W0X |
| CNEVA 9066 | 3V1W0X |
| A0488 | 4V0W0X |
| CDC 684 | 4V1W0X |
| Vollum | 4V2W0X |
| A1055 | 5V0W0X |
| A2012 | 6V0W0X |
| H9401 | 7V0W0X |
Code positions from V (99.99% ANI) to X (99.9999% ANI) are shown. See Table S2 in File S1 for ANIb values and for the percentage of fragments that aligned with the genomes used for code assignment.
Examples of provisional mitochondrial codes assigned to members of the phylum Chordata.
| Class/order/family, Species | Common name | Code |
| Amphibia/Anura/Ranidae | | |
| | Dark-spotted frog | 1A1B76C0D0E0F0G0H |
| Mammalia/Rodentia/Muridae | | |
| | House mouse | 1A0B28C0D0E0F0G0H |
| | Brown rat | 1A0B28C1D0E0F0G0H |
| Mammalia/Primates/Hominidae | | |
| | Gorilla | 1A0B18C0D0E0F0G0H |
| | Human | 1A0B18C0D1E0F0G0H |
| | Bonobo | 1A0B18C0D1E1F0G0H |
| | Common chimpanzee | 1A0B18C0D1E1F1G0H |
| | Sumatran orangutan | 1A0B18C0D2E0F0G0H |
| | Bornean orangutan | 1A0B18C0D2E1F0G0H |
| Mammalia/Primates/Hylobatidae | | |
| | Lar gibbon | 1A0B18C1D0E0F0G0H |
Code positions from A (60% ANI) to H (99% ANI) are shown. See Table S3 in File S1 for codes, ANIb values, and percentage of fragments that aligned with the genomes used for code assignment for 466 mitochondria.
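Because codes are assigned position by position, two codes can be compared by finding the last position up to which they agree. The Python sketch below parses codes from the table above and reports that position; `parse_code` and `last_shared_position` are hypothetical helper names. Reading a shared prefix as an indication of similarity is only approximate, since each code was assigned relative to the most similar genome present in the database at the time.

```python
import re
from typing import List, Optional, Tuple


def parse_code(code: str) -> List[Tuple[str, int]]:
    """Split a code such as '1A0B18C' into [('A', 1), ('B', 0), ('C', 18)]."""
    return [(label, int(number))
            for number, label in re.findall(r"(\d+)([A-X])", code)]


def last_shared_position(code_a: str, code_b: str) -> Optional[str]:
    """Return the label of the last position up to which both codes agree."""
    last = None
    for (label_a, num_a), (label_b, num_b) in zip(parse_code(code_a),
                                                  parse_code(code_b)):
        if label_a != label_b or num_a != num_b:
            break
        last = label_a
    return last


# Codes copied from the mitochondrial table above.
human = "1A0B18C0D1E0F0G0H"
bonobo = "1A0B18C0D1E1F0G0H"
house_mouse = "1A0B28C0D0E0F0G0H"

print(last_shared_position(human, bonobo))       # agree through E (90% ANI)
print(last_shared_position(human, house_mouse))  # agree through B (70% ANI)
```

The same functions work on the space-separated virus codes, because the regular expression ignores whitespace between position entries.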
Figure 2. Applications of genome similarity-based codes in Science and Society.
Each user who wanted to obtain a code for an organism would submit a genome sequence to a platform associated with a specific application. Each application platform could submit genomes to a central code database for unique code assignment. Codes would then be returned to the application platform, in which codes could be stored instead of entire genome sequences. Each platform would also store application-specific metadata associated with each code while the central code database would mainly store genomes and associated codes. Genome submissions are symbolized by blue arrows; code assignments are symbolized by red arrows.