
Similarity-based codes sequentially assigned to ebolavirus genomes are informative of species membership, associated outbreaks, and transmission chains.

Alexandra J Weisberg1, Haitham A Elmarakeby2, Lenwood S Heath2, Boris A Vinatzer1.   

Abstract

Background.  Developing a universal standardized microbial typing and nomenclature system that provides phylogenetic and epidemiological information in real time has never been as urgent in public health as it is today. We previously proposed to use genome similarity as the basis for immediate and precise typing and naming of individual organisms or viruses. In this study, we tested the validity of the proposed system and applied it to the epidemiology of infectious diseases using Ebola virus disease (EVD) outbreaks as the example. Methods.  One hundred twenty-eight publicly available ebolavirus genomes were compared with each other, and average nucleotide identity (ANI) was calculated. The ANI was then used to assign unique codes, hereafter referred to as Life Identification Numbers (LINs), to every viral isolate, whereby each LIN consisted of a series of positions reflecting increasing genome similarity. Congruence of LINs with phylogenetic and epidemiological relationships was then determined. Results.  Assigned LINs correlate with phylogeny at the species and infraspecies level and can even identify some individual transmission chains during the 2014-2015 EVD epidemic in West Africa. Conclusions.  Life Identification Numbers can provide a fast, automated, standardized, and scalable approach to precisely identify and name viral isolates upon genome sequence submission, facilitating unambiguous communication during disease epidemics among clinicians, epidemiologists, and governments.


Keywords:  average nucleotide identity; classification; ebolavirus; epidemiology; phylogeny

Year:  2015        PMID: 26034773      PMCID: PMC4438903          DOI: 10.1093/ofid/ofv024

Source DB:  PubMed          Journal:  Open Forum Infect Dis        ISSN: 2328-8957            Impact factor:   3.835


Although naming of viral species is regulated by well-established nomenclature rules described in the International Code of Virus Classification and Nomenclature [1], there are no general rules for classification below the species level, leaving it up to specialist groups to develop family-specific rules to name strains, variants, and isolates. In the case of the Filoviridae, which includes the genus Ebolavirus, a taxonomic revision was published in 2010 [2]. Based on this revision, the genus Ebolavirus contains 5 species: Bundibugyo ebolavirus, Reston ebolavirus, Sudan ebolavirus, Taï Forest ebolavirus, and Zaire ebolavirus, whereby a viral isolate is considered to be a member of a species if it is less than 30% different (based on its full-length genomic sequence) from the type virus of the species and more than 30% different from the type virus of the type species (ie, Zaire ebolavirus). Each species has 1 member virus, which is also the type virus: Bundibugyo virus (BDBV), Reston virus (RESTV), Sudan virus (SUDV), Taï Forest virus (TAFV), and Ebola virus (EBOV), respectively. Each type virus has a type variant represented by a specific isolate [3, 4]. The isolate name is composed of the following: virus name, isolation host-suffix, country of sampling, year of sampling, genetic variant designation, and isolate designation [4]. Because such complex nomenclature rules are specific to every viral family and because taxonomic revisions are frequent, it can be very challenging for nonexperts to correctly classify and name a newly discovered virus that causes an emerging disease outbreak. This situation can lead to confusion about the identity of the pathogen, which can in turn delay an effective international response to contain such an outbreak before it turns into an epidemic.
Fortunately, because next-generation sequencing [5] is so affordable today, the opportunity exists to develop new classification and naming systems that can overcome the above limitations and that can provide approaches to specifically type and name any individual isolate, strain, or organism as soon as its genome sequence becomes available. We previously described such an approach and demonstrated its suitability for providing codes that largely correlate with phylogenetic and epidemiological relationships using the genomes of various bacteria, animals, humans, and foot-and-mouth disease virus (FMDV) [6]. In short, the genome similarity-based codes (ie, Life Identification Numbers [LINs]) that we propose to assign to every genome-sequenced organism and virus consist of a series of positions, each of which reflects a different threshold of percentage of DNA identity, expressed as average nucleotide identity (ANI) [7]. The more similar the genomes of 2 organisms (or viral isolates) are, the more similar their LINs will be. Organisms with very different genomes will have LINs that are different at their left-most position, organisms with intermediate similarity will have identical symbols (any number or letter can be used) up to an intermediate position in their LINs, and almost identical organisms will have LINs that are identical almost to the right-most position. Only isolates with 100% DNA identity will have exactly the same LIN (Figure 1).
Figure 1.

Basic principle of Life Identification Number (LIN) assignment. Each LIN is composed of a series of positions corresponding to average nucleotide identity values increasing from left to right of the LIN. The actual symbol at each position can be any number or letter and does not reflect the degree of similarity between genomes. The information content in a LIN is its similarity to other LINs. For example, the first genome added to the database is assigned “0” in all positions (Example 1). A genome with relatively low genome similarity compared with Example 1 (74% for Example 2) will have a LIN that is the same up to the LIN position B corresponding to the 70% threshold (because 74 is higher than 70) but different at position C corresponding to the 80% threshold (because 74 is lower than 80). A genome more similar to Example 1 (99.4% for Example 3) has a LIN that is identical to Example 1 up to a position further to the right (position L), and almost identical genomes have LINs identical to each other nearly up to the right-most position (Examples 1 and 5). Identical genomes have identical LINs (Examples 3 and 4).

We propose to assign LINs sequentially as genomes become available, whereby the LIN of the organism with the most similar genome that already has an assigned LIN will be used as the basis for assigning the new LIN [6]. The only time a LIN of an isolate or organism should be adjusted is when a higher quality genome sequence becomes available. This approach will provide stability of LINs and minimize confusion inevitably associated with taxonomic revisions. One inherent limitation of identifiers assigned based on overall genome similarity is as follows: in the case of organisms or viruses with very recent common ancestors (in which a single mutation may be the only indication of a new transmission chain during an outbreak), the ability of LINs to reflect phylogenetic and epidemiological relationships is limited [6].
However, we still need to establish exactly what that limit is. In this study, we determined to what depth LINs can be informative of phylogenetic relationships among members of the genus Ebolavirus. Forty-seven publicly available ebolavirus genomes from previous outbreaks and 81 genomes of ebolavirus isolates from the 2014–2015 Ebola virus disease (EVD) epidemic [8, 9] were compared, provisional LINs were assigned, and these assigned LINs were compared with whole genome phylogenies. The results reveal that LINs reflect evolutionary relationships from the species level all the way down to single transmission chains identified by phylogenetic reconstruction. However, they do not reflect every node revealed by phylogeny, which needs to be considered when interpreting LINs in an epidemiological context. We finally propose that LINs should be assigned to every genome in GenBank (and other public databases) upon genome sequence submission, to provide the healthcare and research community with a system for fast and precise identification of, and clear communication about, viruses and other pathogens.
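The sequential assignment rule described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `assign_lin`, the use of integers as symbols, the rule of taking the next unused integer at the first differing position, and the truncation to six positions are all assumptions made for illustration. Of the thresholds used, 60%, 70%, 80%, 90%, and 95% (positions A, B, C, E, and F) are quoted in this article; the 85% value for position D is assumed.

```python
from typing import List, Sequence

# Percent-ANI thresholds for positions A..F only (a truncated, illustrative set).
# A, B, C, E, and F are quoted in this article; D = 85 is an assumption.
THRESHOLDS: List[float] = [60.0, 70.0, 80.0, 85.0, 90.0, 95.0]


def assign_lin(
    ani: float,
    parent: Sequence[int],
    existing: Sequence[Sequence[int]],
    thresholds: Sequence[float] = THRESHOLDS,
) -> List[int]:
    """Derive a new LIN from the LIN of the most similar genome (`parent`).

    `existing` holds every LIN assigned so far, including `parent`.
    """
    # Count the leading positions whose threshold is met by the ANI.
    shared = 0
    for threshold in thresholds:
        if ani < threshold:
            break
        shared += 1

    # Every threshold met: in this truncated sketch the parent LIN is reused.
    if shared == len(thresholds):
        return list(parent)

    prefix = list(parent[:shared])
    # Symbols already taken at the first differing position under this prefix.
    used = [lin[shared] for lin in existing if list(lin[:shared]) == prefix]
    new_symbol = max(used, default=-1) + 1
    # All positions to the right of the new symbol restart at 0.
    return prefix + [new_symbol] + [0] * (len(thresholds) - shared - 1)


# First four genomes of Table 1, using the ANI values listed there.
first = [0, 0, 0, 0, 0, 0]
lins = [first]
second = assign_lin(69.44778, first, lins)
lins.append(second)
third = assign_lin(70.22714, second, lins)
lins.append(third)
fourth = assign_lin(94.95444, third, lins)
lins.append(fourth)
```

With these inputs the sketch reproduces the first six positions of the LINs listed for the first four genomes in Table 1. Because it stops at position F, any ANI of 95% or higher returns the parent LIN unchanged, which would not be the case with the full set of positions.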

METHODS

One hundred twenty-eight ebolavirus genomes were downloaded from the National Center for Biotechnology Information (NCBI) nucleotide database on September 3, 2014, and pairwise ANI was calculated using our previously described custom pipeline [6] based on jSpecies [10]. Calculations of ANI were performed in the order of the dates associated with each ebolavirus genome sequence in GenBank (which is either the date the genome sequence was submitted or the date it was last updated). In this process, every genome was only compared to those genomes that would have been available on those dates (Table 1). When 2 or more genomes had the same associated date, accession numbers were used to establish the order of comparison. Life Identification Numbers were then assigned to each genome sequentially based on the ANI with the most similar genome to which a LIN had already been assigned (ie, all genomes in rows above the genome in question in Table 1) using ANI thresholds shown in Figure 1.
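The ordering and comparison scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: `sequential_best_matches` and `toy_ani` are hypothetical names, the ANI calculation itself is replaced by a lookup table, ties on date are broken by plain lexicographic order of accession numbers (the text does not say how accession numbers were compared), and the 69.0 value is an invented placeholder.

```python
from datetime import datetime
from typing import Callable, List, Optional, Tuple

Record = Tuple[str, str]  # (accession, GenBank date as MM/DD/YY)
Match = Tuple[str, Optional[str], Optional[float]]


def sequential_best_matches(
    records: List[Record], ani: Callable[[str, str], float]
) -> List[Match]:
    """For each genome, find the most similar genome among those seen earlier."""
    # Sort by date, then by accession (lexicographic tie-break is an assumption).
    ordered = sorted(
        records, key=lambda r: (datetime.strptime(r[1], "%m/%d/%y"), r[0])
    )
    results: List[Match] = []
    seen: List[str] = []
    for accession, _date in ordered:
        if seen:
            best = max(seen, key=lambda other: ani(accession, other))
            results.append((accession, best, ani(accession, best)))
        else:
            # The earliest genome has nothing to be compared with.
            results.append((accession, None, None))
        seen.append(accession)
    return results


# Toy ANI lookup. Two values come from Table 1; 69.0 is an invented placeholder
# chosen only to be lower than 70.22714, as the Results text implies.
TOY_ANI = {
    frozenset(("AY354458.1", "AF522874.1")): 69.44778,
    frozenset(("AY729654.1", "AY354458.1")): 70.22714,
    frozenset(("AY729654.1", "AF522874.1")): 69.0,
}


def toy_ani(a: str, b: str) -> float:
    return TOY_ANI[frozenset((a, b))]


records = [
    ("AY729654.1", "10/07/05"),
    ("AF522874.1", "09/04/02"),
    ("AY354458.1", "02/06/04"),
]
matches = sequential_best_matches(records, toy_ani)
```

Passing the ANI function as a parameter keeps the ordering logic separate from the similarity calculation, so a real ANI tool could be substituted without changing the loop.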
Table 1.

ANI Values and LINs Assigned to Each EBOV Genome (a)

Shortened GenBank Definition | Date | Order | Most Similar Genome | ANI (%) | LIN
AF522874.1 Reston_strain_Pennsylvania_1989-1990 | 09/04/02 | 1 | AF522874.1 | 100 | 0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
AY354458.1 Zaire_strain_Zaire_1995 | 02/06/04 | 2 | AF522874.1 | 69.44778 | 0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
AY729654.1 Sudan_strain_Gulu_2000 | 10/07/05 | 3 | AY354458.1 | 70.22714 | 0.1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
EU338380.1 Sudan_EBOV-S-2004 | 01/23/08 | 4 | AY729654.1 | 94.95444 | 0.1.1.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
AB050936.1 Reston_ebolavirus | 06/25/08 | 5 | AF522874.1 | 98.74722 | 0.0.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
FJ217161.1 Bundibugyo_ebolavirus | 11/21/08 | 6 | AY354458.1 | 72.39182 | 0.1.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
FJ217162.1 Tai_Forest_Cote_d'Ivoire_1994 | 11/21/08 | 7 | FJ217161.1 | 75.35462 | 0.1.3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
FJ968794.1 Sudan_UGA_strain_Boniface_2007 | 05/29/09 | 8 | EU338380.1 | 99.63444 | 0.1.1.0.0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0
FJ621583.1 Reston_Reston_strain_Reston_2008-A | 07/14/09 | 9 | AF522874.1 | 97.69167 | 0.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
FJ621584.1 Reston_Reston_strain_Reston_2008-C | 07/14/09 | 10 | AF522874.1 | 96.98167 | 0.0.0.0.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
FJ621585.1 Reston_strain_Reston_2008-E | 07/14/09 | 11 | AB050936.1 | 98.68222 | 0.0.0.0.0.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
HQ613402.1 Zaire_034-KS_2008 | 11/07/11 | 12 | AY354458.1 | 97.84833 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
HQ613403.1 Zaire_DRC_2007_M-M | 11/07/11 | 13 | HQ613402.1 | 99.93889 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.0.0.0.0.0
JN638998.1 Sudan_2011_Nakisamata | 08/06/12 | 14 | AY729654.1 | 99.39556 | 0.1.1.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
JQ352763.1 Zaire_1995_strain_Kikwit | 01/23/13 | 15 | AY354458.1 | 99.98333 | 0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0
KC242784.1 Zaire_EBOV_COD-2007-9_Luebo | 03/08/13 | 16 | HQ613403.1 | 99.98333 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.0.0.1.0.0
KC242785.1 Zaire_EBOV_COD-2007-0_Luebo | 03/08/13 | 17 | HQ613403.1 | 99.97222 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.0.1.0.0.0
KC242786.1 Zaire_EBOV_COD-2007-1_Luebo | 03/08/13 | 18 | HQ613403.1 | 99.98889 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.0.0.2.0.0
KC242787.1 Zaire_EBOV_COD-2007-23_Luebo | 03/08/13 | 19 | HQ613403.1 | 99.98889 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.0.0.3.0.0
KC242788.1 Zaire_EBOV_COD-2007-43_Luebo | 03/08/13 | 20 | KC242787.1 | 99.95611 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.1.0.0.0.0.0
KC242789.1 Zaire_EBOV_COD-2007-4_Luebo | 03/08/13 | 21 | KC242786.1 | 99.98889 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.0.0.4.0.0
KC242790.1 Zaire_EBOV_COD-2007-5_Luebo | 03/08/13 | 22 | HQ613403.1 | 99.96667 | 0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.1.0.0.1.0.0.0.0
KC242791.1 Zaire_EBOV_COD-1977-Bonduni | 03/08/13 | 23 | AY354458.1 | 98.86611 | 0.1.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC242792.1 Zaire_EBOV_GAB-1994-Gabon | 03/08/13 | 24 | AY354458.1 | 99.065 | 0.1.0.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC242796.1 Zaire_EBOV_COD-1995-13625_Kikwit | 03/08/13 | 25 | AY354458.1 | 99.945 | 0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0.0.0
KC242797.1 Zaire_EBOV_GAB-1996-1Oba | 03/08/13 | 26 | KC242792.1 | 99.81833 | 0.1.0.0.0.0.0.0.1.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0
KC242798.1 Zaire_EBOV_GAB-1996-1Ikot | 03/08/13 | 27 | KC242792.1 | 99.65722 | 0.1.0.0.0.0.0.0.1.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0
KC242799.1 Zaire_EBOV_COD-1995-13709_Kikwit | 03/08/13 | 28 | KC242796.1 | 99.99444 | 0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0.1.0
KC242800.1 Zaire_EBOV_GAB-1996-Ilembe | 03/08/13 | 29 | KC242786.1 | 97.88 | 0.1.0.0.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC242801.1 Zaire_EBOV_COD-1976-deRoover | 03/08/13 | 30 | KC242791.1 | 99.98889 | 0.1.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0
JX477165.1 Reston_RESTV-PHL-2009-09A_Farm_A | 03/08/13 | 31 | FJ621583.1 | 99.91667 | 0.0.0.0.0.0.1.0.0.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0
JX477166.1 Reston_RESTV-PHL-USA-1996-Ferlite,Philippines-Alice,TX | 03/08/13 | 32 | AB050936.1 | 99.88389 | 0.0.0.0.0.0.0.1.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0
KC242783.2 Sudan_SUDV_SSD-1979-Maleo | 03/14/13 | 33 | FJ968794.1 | 99.77556 | 0.1.1.0.0.1.0.0.0.0.1.1.0.0.0.0.0.0.0.0.0.0.0.0
KC589025.1 Sudan_EboSud-639_2012 | 06/24/13 | 34 | AY729654.1 | 99.19889 | 0.1.1.0.0.0.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC545389.1 Sudan_EboSud-602_2012 | 06/24/13 | 35 | AY729654.1 | 99.29278 | 0.1.1.0.0.0.0.0.3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC545390.1 Sudan_EboSud-603_2012 | 06/24/13 | 36 | KC545389.1 | 99.99444 | 0.1.1.0.0.0.0.0.3.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0
KC545391.1 Sudan_EboSud-609_2012 | 06/24/13 | 37 | KC545389.1 | 99.98889 | 0.1.1.0.0.0.0.0.3.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0
KC545392.1 Sudan_EboSud-682_2012 | 06/24/13 | 38 | KC545389.1 | 99.99444 | 0.1.1.0.0.0.0.0.3.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0
KC545393.1 Bundibugyo_EboBund-112_2012 | 06/24/13 | 39 | FJ217161.1 | 98.71333 | 0.1.2.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC545394.1 Bundibugyo_EboBund-120_2012 | 06/24/13 | 40 | KC545393.1 | 99.97222 | 0.1.2.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0
KC545395.1 Bundibugyo_EboBund-122_2012 | 06/24/13 | 41 | KC545393.1 | 99.97778 | 0.1.2.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.2.0.0.0
KC545396.1 Bundibugyo_EboBund-14_2012 | 06/24/13 | 42 | KC545393.1 | 99.96111 | 0.1.2.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0
KM034550.1 Zaire_SLE-2014-Makona-EM095 | 08/08/14 | 43 | KC242786.1 | 97.11944 | 0.1.0.0.0.0.3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KM034551.1 Zaire_SLE-2014-Makona-EM096 | 08/08/14 | 44 | KM034550.1 | 99.85222 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0
KM034552.1 Zaire_SLE-2014-Makona-EM098 | 08/08/14 | 45 | KM034551.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM034553.1 Zaire_SLE-2014-Makona-G3670.1 | 08/08/14 | 46 | KM034552.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.2.0.0
KM034554.1 Zaire_SLE-2014-Makona-G3676.1 | 08/08/14 | 47 | KM034552.1 | 99.97722 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.1.0.0.0
KM034556.1 Zaire_SLE-2014-Makona-G3677.1 | 08/08/14 | 48 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM034558.1 Zaire_SLE-2014-Makona-G3679.1 | 08/08/14 | 49 | KM034551.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0
KM034559.1 Zaire_SLE-2014-Makona-G3680.1 | 08/08/14 | 50 | KM034554.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.1.0.1.0
KM034560.1 Zaire_SLE-2014-Makona-G3682.1 | 08/08/14 | 51 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM034561.1 Zaire_SLE-2014-Makona-G3683.1 | 08/08/14 | 52 | KM034554.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.1.0.2.0
KM034562.1 Zaire_SLE-2014-Makona-G3686.1 | 08/08/14 | 53 | KM034554.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.1.1.0.0
KM034563.1 Zaire_SLE-2014-Makona-G3687.1 | 08/08/14 | 54 | KM034550.1 | 99.11222 | 0.1.0.0.0.0.3.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KM233035.1 Zaire_SLE-2014-Makona-EM104 | 08/08/14 | 55 | KM034552.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.3.0.0
KM233036.1 Zaire_SLE-2014-Makona-EM106 | 08/08/14 | 56 | KM034552.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233037.1 Zaire_SLE-2014-Makona-EM110 | 08/08/14 | 57 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233038.1 Zaire_SLE-2014-Makona-EM111 | 08/08/14 | 58 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233039.1 Zaire_SLE-2014-Makona-EM112 | 08/08/14 | 59 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233040.1 Zaire_SLE-2014-Makona-EM113 | 08/08/14 | 60 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233041.1 Zaire_SLE-2014-Makona-EM115 | 08/08/14 | 61 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233042.1 Zaire_SLE-2014-Makona-EM119 | 08/08/14 | 62 | KM233036.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.4.0.0
KM233043.1 Zaire_SLE-2014-Makona-EM120 | 08/08/14 | 63 | KM034552.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.2.0
KM233044.1 Zaire_SLE-2014-Makona-EM121 | 08/08/14 | 64 | KM034552.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.3.0
KM233045.1 Zaire_SLE-2014-Makona-EM124.1 | 08/08/14 | 65 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233049.1 Zaire_SLE-2014-Makona-G3707 | 08/08/14 | 66 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233050.1 Zaire_SLE-2014-Makona-G3713.2 | 08/08/14 | 67 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.4.0
KM233053.1 Zaire_SLE-2014-Makona-G3724 | 08/08/14 | 68 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.5.0
KM233054.1 Zaire_SLE-2014-Makona-G3729 | 08/08/14 | 69 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233055.1 Zaire_SLE-2014-Makona-G3734.1 | 08/08/14 | 70 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233056.1 Zaire_SLE-2014-Makona-G3735.1 | 08/08/14 | 71 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233058.1 Zaire_SLE-2014-Makona-G3750.1 | 08/08/14 | 72 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.6.0
KM233061.1 Zaire_SLE-2014-Makona-G3752 | 08/08/14 | 73 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.7.0
KM233062.1 Zaire_SLE-2014-Makona-G3758 | 08/08/14 | 74 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233063.1 Zaire_SLE-2014-Makona-G3764 | 08/08/14 | 75 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233064.1 Zaire_SLE-2014-Makona-G3765.2 | 08/08/14 | 76 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233065.1 Zaire_SLE-2014-Makona-G3769.1 | 08/08/14 | 77 | KM034552.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.5.0.0
KM233069.1 Zaire_SLE-2014-Makona-G3770.1 | 08/08/14 | 78 | KM233042.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.4.0.0
KM233071.1 Zaire_SLE-2014-Makona-G3771 | 08/08/14 | 79 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233072.1 Zaire_SLE-2014-Makona-G3782 | 08/08/14 | 80 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233073.1 Zaire_SLE-2014-Makona-G3786 | 08/08/14 | 81 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233074.1 Zaire_SLE-2014-Makona-G3787 | 08/08/14 | 82 | KM034552.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.6.0.0
KM233075.1 Zaire_SLE-2014-Makona-G3788 | 08/08/14 | 83 | KM034556.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.8.0
KM233076.1 Zaire_SLE-2014-Makona-G3789.1 | 08/08/14 | 84 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233077.1 Zaire_SLE-2014-Makona-G3795 | 08/08/14 | 85 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233078.1 Zaire_SLE-2014-Makona-G3796 | 08/08/14 | 86 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233079.1 Zaire_SLE-2014-Makona-G3798 | 08/08/14 | 87 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233080.1 Zaire_SLE-2014-Makona-G3799 | 08/08/14 | 88 | KM034552.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.9.0
KM233081.1 Zaire_SLE-2014-Makona-G3800 | 08/08/14 | 89 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233082.1 Zaire_SLE-2014-Makona-G3805.1 | 08/08/14 | 90 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233084.1 Zaire_SLE-2014-Makona-G3807 | 08/08/14 | 91 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233085.1 Zaire_SLE-2014-Makona-G3808 | 08/08/14 | 92 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233086.1 Zaire_SLE-2014-Makona-G3809 | 08/08/14 | 93 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.10.0
KM233087.1 Zaire_SLE-2014-Makona-G3810.1 | 08/08/14 | 94 | KM034552.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.7.0.0
KM233089.1 Zaire_SLE-2014-Makona-G3814 | 08/08/14 | 95 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233090.1 Zaire_SLE-2014-Makona-G3816 | 08/08/14 | 96 | KM233086.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.10.0
KM233091.1 Zaire_SLE-2014-Makona-G3817 | 08/08/14 | 97 | KM233036.1 | 99.98333 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.8.0.0
KM233092.1 Zaire_SLE-2014-Makona-G3818 | 08/08/14 | 98 | KM233036.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.9.0.0
KM233093.1 Zaire_SLE-2014-Makona-G3819 | 08/08/14 | 99 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.11.0
KM233094.1 Zaire_SLE-2014-Makona-G3820 | 08/08/14 | 100 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233095.1 Zaire_SLE-2014-Makona-G3821 | 08/08/14 | 101 | KM233086.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.10.0
KM233096.1 Zaire_SLE-2014-Makona-G3822 | 08/08/14 | 102 | KM233086.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.10.0
KM233097.1 Zaire_SLE-2014-Makona-G3823 | 08/08/14 | 103 | KM034552.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.12.0
KM233098.1 Zaire_SLE-2014-Makona-G3825.1 | 08/08/14 | 104 | KM233061.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.7.0
KM233100.1 Zaire_SLE-2014-Makona-G3826 | 08/08/14 | 105 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.13.0
KM233101.1 Zaire_SLE-2014-Makona-G3827 | 08/08/14 | 106 | KM233100.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.13.0
KM233102.1 Zaire_SLE-2014-Makona-G3829 | 08/08/14 | 107 | KM233061.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.14.0
KM233103.1 Zaire_SLE-2014-Makona-G3831 | 08/08/14 | 108 | KM233074.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.6.0.0
KM233104.1 Zaire_SLE-2014-Makona-G3834 | 08/08/14 | 109 | KM233061.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.7.0
KM233105.1 Zaire_SLE-2014-Makona-G3838 | 08/08/14 | 110 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233106.1 Zaire_SLE-2014-Makona-G3840 | 08/08/14 | 111 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233107.1 Zaire_SLE-2014-Makona-G3841 | 08/08/14 | 112 | KM034552.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.0.0
KM233108.1 Zaire_SLE-2014-Makona-G3845 | 08/08/14 | 113 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233109.1 Zaire_SLE-2014-Makona-G3846 | 08/08/14 | 114 | KM233036.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.10.0.0
KM233110.1 Zaire_SLE-2014-Makona-G3848 | 08/08/14 | 115 | KM233061.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.7.0
KM233111.1 Zaire_SLE-2014-Makona-G3850 | 08/08/14 | 116 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KM233112.1 Zaire_SLE-2014-Makona-G3851 | 08/08/14 | 117 | KM233061.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.7.0
KM233113.1 Zaire_SLE-2014-Makona-G3856.1 | 08/08/14 | 118 | KM233036.1 | 99.99444 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.15.0
KM233115.1 Zaire_SLE-2014-Makona-G3857 | 08/08/14 | 119 | KM233061.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.7.0
KM233116.1 Zaire_SLE-2014-Makona-NM042.1 | 08/08/14 | 120 | KM233036.1 | 100 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.0.1.1.0
KJ660346.2 Zaire_GIN-2014-Kissidougou-C15 | 08/26/14 | 121 | KM034554.1 | 99.98889 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.1.2.0.0
KJ660347.2 Zaire_GIN-2014-Gueckedou-C07 | 08/26/14 | 122 | KJ660346.2 | 99.98333 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.1.3.0.0
KJ660348.2 Zaire_GIN-2014-Gueckedou-C05 | 08/26/14 | 123 | KJ660347.2 | 99.97778 | 0.1.0.0.0.0.3.0.0.0.0.0.1.0.0.0.0.0.0.0.2.0.0.0
NC_002549.1 Zaire_COD-1976_Yambuku-Mayinga | 08/27/14 | 124 | KC242791.1 | 99.98889 | 0.1.0.0.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.0
NC_004161.1 Reston_USA-1989-Philippines89-Pennsylvania | 08/27/14 | 125 | AF522874.1 | 100 | 0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0
KC242793.1 Zaire_EBOV_GAB-1996-1Eko | 09/12/14 | 126 | KC242797.1 | 100 | 0.1.0.0.0.0.0.0.1.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0
KC242794.1 Zaire_EBOV_GAB-1996-2Nza | 09/12/14 | 127 | KC242798.1 | 100 | 0.1.0.0.0.0.0.0.1.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0
KC242795.1 Zaire_EBOV_GAB-1996-1Mbie | 09/12/14 | 128 | KC242797.1 | 100 | 0.1.0.0.0.0.0.0.1.0.0.0.1.0.0.0.0.0.0.0.0.0.0.0

Abbreviations: ANI, average nucleotide identity; EBOV, Ebola virus; LINs, Life Identification Numbers; NCBI, National Center for Biotechnology Information.

a Each genome was analyzed in the order it was submitted to NCBI GenBank (or the date it was most recently revised). For each genome the following is shown in the table: date submitted to NCBI (or date the sequence was last revised), the order in which the genome was analyzed, the most similar genome out of all the genomes submitted to GenBank on an earlier date, the corresponding ANI, and the assigned LIN using the ANI cutoffs described in Figure 1 (instead of labeling each position with a subscript LIN, positions are simply separated by a period).

For phylogenetic reconstruction, the genomes were aligned using MAFFT [11, 12]. A maximum-likelihood (ML) tree was constructed using RAxML [13] with 20 ML search replicates and 1200 nonparametric bootstrap replicates under the GTRGAMMA model. Nonparametric bootstrap branch support values were mapped onto the best log-likelihood ML tree, and clades with less than 50% bootstrap support were collapsed into polytomies using TreeCollapseCL4 [14].

RESULTS

Sequential Calculation of Average Nucleotide Identity and Identification of Most Similar Genomes

All results of sequential ANI calculation are listed in Table 1. For clarity, examples of individual results are described here in the order in which they were obtained. The ebolavirus genome sequence in our dataset with the earliest associated date (September 2002) had accession number AF522874.1. It was not compared with any other genome, because all other ebolavirus genome sequences were either submitted to GenBank or updated in GenBank after September 2002. The genome with accession number AY354458.1 was the second genome in our dataset (because its associated date, February 2004, was the second earliest submission or update date). Thus, this genome was compared only with genome AF522874.1, and the ANI between the 2 genomes was determined to be 69.44778%. The third genome submitted to GenBank, ie, the genome with accession number AY729654.1 (submitted in October 2005), was then compared to the first 2 genomes. Genome AY729654.1 was found to be more similar to genome AY354458.1 than to genome AF522874.1. Therefore, Table 1 lists genome AY729654.1 as most similar to genome AY354458.1 with the corresponding ANI of 70.22714%. This type of comparison was repeated 125 times, with the 128th ebolavirus genome, genome KC242795.1, being compared with the first 127 ebolavirus genomes, whereby genome KC242797.1 was identified to be the most similar genome to genome KC242795.1 with an ANI value of 100%, indicating identical sequences.

Assignment of Life Identification Numbers and Their Correlation With Membership in Ebolavirus Species and Their Association With Separate Ebola Virus Disease Outbreaks

Life Identification Numbers were assigned based on the sequentially calculated ANI values and the sequentially identified most similar genomes reported in Table 1. The same thresholds as those in reference [6] were used at each LIN position. The LIN assigned to the first genome has zeroes at all positions, but any other number or letter could have been used. Zero is simply the default symbol we use to start the LIN assignment; the information provided by LINs is not represented by the symbols themselves but by the presence of identical symbols at the same position in different isolates. For example, the LIN assigned to the fourth isolate in Table 1 is identical to that of the third isolate up to position E, because the 2 isolates have genomes that are 94.95444% identical to each other, which is higher than 90% (the threshold corresponding to position E; also see Figure 1). However, they are different at position F because 94.95444% is lower than 95% (threshold corresponding to position F; also see Figure 1). Table 1 and Figure 2 show that all ebolavirus isolates share the symbol 0 in position A (0A) because their genomes are all over 60% identical to each other. Isolates of RESTV are the only isolates identified by the symbol 0 in position B. Thus, Life Identification Number 0A0B is sufficient to distinguish RESTV isolates from all other ebolavirus isolates. Members of the other 4 species of Ebolavirus are instead uniquely identified by a specific symbol at LIN position C: all EBOV isolates can be uniquely identified as 0A1B0C; all SUDV isolates can be uniquely identified as 0A1B1C; all BDBV isolates can be uniquely identified as 0A1B2C; and all TAFV isolates can be uniquely identified as 0A1B3C. Therefore, the first 3 LIN positions are sufficient to identify ebolavirus isolates as members of each species of Ebolavirus (as classified by Kuhn et al [2]).
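The species-level reading of LINs described above amounts to a prefix lookup, which can be sketched in Python as follows. The names `parse_lin`, `SPECIES_PREFIXES`, and `species_from_lin` are hypothetical, and the prefixes are simply those stated in the preceding paragraph, so the lookup applies only to the provisional LINs of this study.

```python
from typing import Dict, Optional, Tuple

# Prefixes taken from the text: 0A0B for RESTV, 0A1B plus a symbol at C otherwise.
SPECIES_PREFIXES: Dict[Tuple[int, ...], str] = {
    (0, 0): "RESTV",
    (0, 1, 0): "EBOV",
    (0, 1, 1): "SUDV",
    (0, 1, 2): "BDBV",
    (0, 1, 3): "TAFV",
}


def parse_lin(text: str) -> Tuple[int, ...]:
    """Turn a period-separated LIN, as printed in Table 1, into a tuple."""
    return tuple(int(symbol) for symbol in text.split("."))


def species_from_lin(lin: Tuple[int, ...]) -> Optional[str]:
    """Return the virus abbreviation whose prefix matches, or None."""
    for prefix, name in SPECIES_PREFIXES.items():
        if lin[: len(prefix)] == prefix:
            return name
    return None


# LINs copied from Table 1 (AF522874.1, KM034550.1, FJ217162.1).
reston = parse_lin("0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0")
makona = parse_lin("0.1.0.0.0.0.3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0")
tai_forest = parse_lin("0.1.3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0")
labels = [species_from_lin(lin) for lin in (reston, makona, tai_forest)]
```

A LIN that matches none of the listed prefixes returns `None`, which is how a genome from a hypothetical new species would surface in this sketch.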
Figure 2.

Life Identification Numbers (LINs) assigned to ebolavirus isolates are informative of species membership and the outbreak during which they were isolated. A maximum likelihood tree was constructed and midpoint rooting was applied. Nonparametric bootstrap support values as a percentage of 1200 bootstrap replicates are shown above branches. Clades are labeled using virus abbreviations described in reference [2]. For Ebola virus (EBOV), the geographic location of outbreaks is also listed. Clades corresponding to individual species and to individual outbreaks are displayed as collapsed branches. Only those LIN positions that distinguish species and outbreaks from each other are shown. Abbreviations: BDBV, Bundibugyo virus; RESTV, Reston virus; SUDV, Sudan virus.

Although different outbreaks are caused by different variants of EBOV, Figure 2 shows that the isolates from these outbreaks share the same LIN up to position F: 0A1B0C0D0E0F. Isolates from the first outbreak of EBOV in Zaire in 1976 are then uniquely identified by LIN positions G and H (0G1H), and the isolates from the Luebo outbreak in 2007 by position G (1G). The isolates of the Gabon outbreak in 1996 share the same LIN with the isolates from the Kikwit outbreak up to position H (0A1B0C0D0E0F0G0H). This is consistent with a recent common ancestor of the viruses that caused these 2 outbreaks, which is confirmed by phylogenetic reconstruction (see the tree shown in Figure 2 and previous publications [8, 9]). Consequently, the isolates of these 2 latter outbreaks are only differentiated at the position with the next higher similarity threshold, ie, position I, at which the isolates from the Kikwit outbreak have a 0 (0I) and the isolates from the Gabon outbreak have a 1 (1I).
A single isolate from Gabon from an outbreak in 1996 was previously shown to represent a separate genetic lineage compared with the other EBOV isolates from the earlier Gabon outbreak in 1994 [8], and this is clearly identified with a 2 in position G (2G). Finally, isolates from the 2014 EBOV epidemic are different from all other EBOV isolates at position G. These isolates share a 3 at that position (3G) and then zeroes at all following positions up to position T. Only isolates G3687.1 and EM095 are exceptions, having a different symbol at positions I and M, respectively, possibly because of sequencing errors due to low median genome coverage: 20× for G3687.1 and 16× for EM095. Therefore, the provisional LINs assigned here immediately reveal that the 2014 EVD epidemic in West Africa is caused by a variant that is distinct from all other EBOV variants that have caused outbreaks of EVD in the Democratic Republic of Congo, Uganda, or Gabon in the past (because of the distinctive LIN at position G: 3G). In addition, the LINs immediately reveal that all isolates from the current EVD outbreak are extremely similar to each other (because of the conserved LIN at position L: 0L). It is important to note that the LINs assigned here are provisional and are based only on comparisons among ebolavirus genomes. We do not propose to actually use these provisional LINs. Instead, we propose that permanent LINs should be assigned in the future by NCBI or an independent LIN database.
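The way LIN prefixes are read in the paragraphs above can be illustrated with a short sketch. It assumes that a LIN is written as a run of number-letter pairs (for example, 0A1B0C), as in the text; the function names `parse_lin` and `shared_prefix` are hypothetical, and the example LINs are truncated to the positions quoted above rather than being complete LINs.

```python
import re

def parse_lin(lin):
    """Split a LIN such as '0A1B0C' into [('A', 0), ('B', 1), ('C', 0)]."""
    pairs = re.findall(r"(\d+)([A-Z])", lin)
    # Reject strings that are not made up entirely of number-letter pairs.
    if "".join(num + pos for num, pos in pairs) != lin:
        raise ValueError(f"not a well-formed LIN: {lin!r}")
    return [(pos, int(num)) for num, pos in pairs]

def shared_prefix(lin_a, lin_b):
    """Return the leading part of two LINs that is identical in both."""
    shared = []
    for (pos_a, val_a), (pos_b, val_b) in zip(parse_lin(lin_a), parse_lin(lin_b)):
        if pos_a != pos_b or val_a != val_b:
            break
        shared.append(f"{val_a}{pos_a}")
    return "".join(shared)

# Truncated LINs taken from the positions quoted in the text (illustrative only).
kikwit = "0A1B0C0D0E0F0G0H0I"
gabon = "0A1B0C0D0E0F0G0H1I"
west_africa_2014 = "0A1B0C0D0E0F3G"

print(shared_prefix(kikwit, gabon))             # identical through position H
print(shared_prefix(kikwit, west_africa_2014))  # identical through position F
```

A longer shared prefix means that 2 isolates remain grouped together at higher similarity thresholds, which is why the Kikwit and Gabon isolates separate only at position I, whereas the 2014 isolates separate from both at position G.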

Correlation Between Life Identification Numbers and Phylogenetic and Epidemiological Relationships of Isolates From the Ebola Virus Disease Epidemic in West Africa

The 2014–2015 EVD epidemic probably started with a single zoonotic event in Guinea and spread to Sierra Leone and Liberia [9]. Gire et al [9] concluded from genome sequences of 78 isolates from Sierra Leone that the epidemic in Sierra Leone started with the introduction of 2 separate viral lineages from Guinea. In fact, among the first patients in Sierra Leone, 2 divergent groups of EBOV, called SL1 and SL2, could be distinguished, and the most recent ancestor of these 2 groups was inferred to have existed before the Sierra Leone outbreak started. Life Identification Number position U correlates with these 2 groups: all but 2 SL1 isolates are identified by LIN 1U, and all SL2 isolates are identified by LIN 0U (Table 1, Figure 3, and Supplementary Figure 1). The 2 SL1 isolates that do not have LIN 1U are G3687.1 and EM095, mentioned previously, because they have a much lower ANI compared with all other EBOV isolates from the 2014 epidemic and consequently different symbols at LIN positions I and M, respectively (possibly because of sequencing errors due to low coverage). Two of the 3 Guinea isolates share LIN 1U with the SL1 group, suggesting that these 2 isolates and SL1 have a recent common ancestor. The third Guinea isolate is the only sequenced isolate with LIN 2U.
Figure 3.

Maximum likelihood phylogeny of ebolavirus genomes from the 2014 Ebola virus disease outbreak in West Africa with identifying Life Identification Number (LIN) positions mapped to taxa. Branch labels are nonparametric bootstrap support values as a percentage of 1200 bootstrap replicates. Branches with less than 50% bootstrap support were collapsed into polytomies. Branches are not to scale. All 2014 ebolavirus genomes shared all LIN positions up to position T (indicated in gray below the tree and by gray brackets) except for isolates G3687.1 and EM095, which were different at positions I and M, respectively. Group SL1 isolates are green, SL2 isolates are red, SL3 isolates are purple, and Guinean isolates are blue. All isolates are of variant “Makona” [15], and labels are based on isolate names in reference [9]. The labels “Kissidougou” and “Gueckedou” are the locations of isolation in Guinea based on reference [8].

Gire et al [9] further identified a third viral group in Sierra Leone named SL3, which is a subgroup of SL2, with the only difference between SL3 and SL2 isolates being a single mutation in position 10,218. SL3 corresponds to the large clade in the upper portion of the phylogenetic tree in Figure 3. Because only a single mutation distinguishes SL2 from SL3, it is not surprising that SL3 and SL2 do not correlate with a conserved difference at any LIN position. Because LINs are assigned based on percentage of overall DNA identity between genomes, the mutations that distinguish isolates within SL2 and SL3 have a larger effect on LINs than the single mutation distinguishing SL2 from SL3. In addition, within SL2 and SL3, there are 7 small clades that have bootstrap values higher than 60 and that probably correspond to small individual transmission chains (clades labeled I–VII in Figure 3). Five of these clades (II, III, IV, V, and VII) contain isolates that have exactly the same LIN in all positions. 
In clade I, isolate G3829 has a LIN ending in 14W0X, whereas the other 6 isolates in the same clade end with LIN 7W0X. In this case, isolate G3829 mutated to the point that it became too different to be grouped with the other isolates at LIN position W, and the phylogenetic signal that groups isolate G3829 with the rest of the clade was lost. Finally, clade VI is the only clade in which isolates have the same LIN (ending in 1V0W0X) as isolates outside of the clade. This result is simply due to the fact that draft genomes of slightly different length were used in tree construction, which influenced the location of these isolates in the tree. In fact, these isolates are 100% identical to each other in the part of the draft genome sequence that they share and that was considered in LIN assignment.

DISCUSSION

In today's interconnected world, emerging infectious diseases can spread globally within weeks, requiring international coordination in disease control. Therefore, it is important that communication about pathogens is not hindered by confusion about their identity because different researchers in different countries use different names for the same pathogen, or the same name for different pathogens. In this study, we have shown that unique identifiers, such as LINs, can be assigned to individual ebolavirus isolates based on a simple measure of ANI to address this problem. The assigned LINs are not only informative of the species and the outbreak with which they are associated, but they even reflect some individual transmission chains. Average nucleotide identity was originally developed to replace the experimental measurement of DNA-DNA hybridization (DDH) to determine whether 2 bacteria belong to the same species [7, 16, 17]. A 95% ANI was determined to correspond to approximately 70% DDH [18], which had been chosen years earlier as the minimum similarity between 2 bacteria to belong to the same species [19]. In viral taxonomy, a percentage of pairwise sequence identity has been used to demarcate taxa within viral families using computational approaches such as PASC [20] or DEmARC [21], both of which have been applied to the family Filoviridae [22, 23]. Going 1 step further, we previously proposed that ANI could be used beyond assigning viruses to traditional taxa, such as species or genera [6]. We showed that ANI could be used to assign unique genome similarity-based codes to individual viral isolates, whereby we found that codes (which we now call LINs) assigned to individual isolates of foot-and-mouth disease virus (FMDV) collected during the 2001 United Kingdom FMDV outbreak largely reflected epidemiological relationships. This result was obtained by comparing assigned LINs with the results of an earlier molecular epidemiological investigation of that same outbreak [24]. 
In this study, we showed that ANI can be used to provide LINs for individual ebolavirus isolates and that (1) assigned LINs are informative of phylogenetic relationships at the species level, (2) LINs clearly correlate with separate EVD outbreaks, and (3) in many cases, LINs even reflect phylogenetic clades that correspond to likely individual transmission chains during the epidemic in West Africa in 2014. Thus, LINs are informative of deep phylogenetic relationships among ebolavirus isolates. However, comparing LINs with very deep phylogenetic relationships is challenging for isolates associated with recent epidemics. First, neither LINs nor phylogeny may represent true evolutionary and epidemiological relationships, because not enough mutations were accumulated to create strongly supported phylogenetic clades or clearly distinguishable LIN classes. Second, some clades in the maximum likelihood and Bayesian trees provided in reference [9] for the current EVD epidemic have low statistical support and, not surprisingly, do not correspond to conserved LIN positions. However, even a single-nucleotide polymorphism (SNP) can potentially distinguish transmission chains from each other. In fact, a single SNP at position 10,218 was used to divide Sierra Leone viruses into groups SL2 and SL3 [9]. This raises the question of how important it would be for isolate identifiers to reflect the distinction of 2 groups based on a single SNP. In the case of SL2 and SL3, viruses were sometimes isolated from the same patients [9]. Therefore, they do not represent separate transmission chains. Finally, we can assume that during additional transmission events, these viruses will accumulate several more mutations and thus become different enough at the whole genome level to be identified by different LINs. 
However, thresholds used at the different LIN positions could be further refined to improve correlation with phylogenetic relationships. For example, the isolate with accession number KM233102 was assigned LIN 14W, whereas the other isolates belonging to the same clade were assigned LIN 7W. KM233102 was assigned this LIN because its ANI compared with the other isolates was calculated to be 99.9944%, which is less than the 99.999% threshold corresponding to LIN position W. Therefore, if an additional LIN position with a 99.994% threshold were to be introduced to the left of position W, KM233102 would still be in a group with the other isolates at that new position and would separate from them only at the next position, with threshold 99.999%. However, we believe that LIN position thresholds should be optimized to best reflect phylogeny for assignment of permanent LINs only after LINs have been assigned to many more viruses for many more disease outbreaks and compared with phylogeny each time. Finally, we emphasize that we do not propose to replace current viral taxonomy with LINs. Current viral taxonomy is very useful and highly informative in many aspects, and it is tailored to the evolutionary mechanisms at the base of viral diversity found in different viral families. Therefore, LINs should complement but not replace current viral taxonomy.
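The KM233102 example can be checked with a few lines of arithmetic. The following is a minimal sketch, not the implementation used for LIN assignment: it assumes that an isolate stays grouped with its closest relatives at a LIN position whenever its ANI to them meets or exceeds that position's threshold. The function name `grouping_by_position` and the label "new" for the proposed additional position are hypothetical; only the 99.999% threshold for position W, the proposed 99.994% threshold, and the 99.9944% ANI come from the text.

```python
def grouping_by_position(ani_percent, thresholds):
    """Map each position label to True if the ANI meets that position's threshold.

    thresholds is a list of (label, threshold_percent) pairs ordered from
    left (less stringent) to right (more stringent).
    """
    return {label: ani_percent >= threshold for label, threshold in thresholds}

ani_km233102 = 99.9944  # ANI of KM233102 to the other clade members, from the text

# Current scheme: only the threshold for position W is stated in the text.
current = [("W", 99.999)]

# Proposed refinement: one extra position to the left of W (hypothetical label).
refined = [("new", 99.994), ("W", 99.999)]

print(grouping_by_position(ani_km233102, current))
print(grouping_by_position(ani_km233102, refined))
```

Under these assumptions, the isolate falls out of its clade at position W in the current scheme, whereas in the refined scheme it remains grouped at the new position and separates only at W, which is the behavior described above.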

CONCLUSIONS

In summary, we have shown here that, in the case of ebolavirus, LINs in their current implementation are highly informative of similarity and relationships from the species level all the way to some individual transmission chains. Taken together with our previous results [6], these new results suggest that after further optimization, LINs have the potential to provide unique and stable identifiers for individual viral and bacterial isolates and eukaryotic organisms that reflect deep phylogenetic relationships. As such, we propose that LINs should be assigned in the future to every newly sequenced genome, either by NCBI or by a database specifically developed for LIN assignment, to provide the healthcare and research community with a typing scheme of unprecedented precision and with identifiers for unambiguous communication about pathogen isolates and other organisms as soon as genome sequences become available.

Supplementary Material

Supplementary material is available online at Open Forum Infectious Diseases (http://OpenForumInfectiousDiseases.oxfordjournals.org/).

1.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Authors:  Kazutaka Katoh; Kazuharu Misawa; Kei-ichi Kuma; Takashi Miyata
Journal:  Nucleic Acids Res       Date:  2002-07-15       Impact factor: 16.971

2.  Partitioning the genetic diversity of a virus family: approach and evaluation through a case study of picornaviruses.

Authors:  Chris Lauber; Alexander E Gorbalenya
Journal:  J Virol       Date:  2012-01-25       Impact factor: 5.103

3.  Towards a genome-based taxonomy for prokaryotes.

Authors:  Konstantinos T Konstantinidis; James M Tiedje
Journal:  J Bacteriol       Date:  2005-09       Impact factor: 3.490

4.  DNA-DNA hybridization values and their relationship to whole-genome sequence similarities.

Authors:  Johan Goris; Konstantinos T Konstantinidis; Joel A Klappenbach; Tom Coenye; Peter Vandamme; James M Tiedje
Journal:  Int J Syst Evol Microbiol       Date:  2007-01       Impact factor: 2.747

5.  MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors:  Kazutaka Katoh; Daron M Standley
Journal:  Mol Biol Evol       Date:  2013-01-16       Impact factor: 16.240

6.  Virus nomenclature below the species level: a standardized nomenclature for natural variants of viruses assigned to the family Filoviridae.

Authors:  Jens H Kuhn; Yiming Bao; Sina Bavari; Stephan Becker; Steven Bradfute; J Rodney Brister; Alexander A Bukreyev; Kartik Chandran; Robert A Davey; Olga Dolnik; John M Dye; Sven Enterlein; Lisa E Hensley; Anna N Honko; Peter B Jahrling; Karl M Johnson; Gary Kobinger; Eric M Leroy; Mark S Lever; Elke Mühlberger; Sergey V Netesov; Gene G Olinger; Gustavo Palacios; Jean L Patterson; Janusz T Paweska; Louise Pitt; Sheli R Radoshitzky; Erica Ollmann Saphire; Sophie J Smither; Robert Swanepoel; Jonathan S Towner; Guido van der Groen; Viktor E Volchkov; Victoria Wahl-Jensen; Travis K Warren; Manfred Weidmann; Stuart T Nichol
Journal:  Arch Virol       Date:  2012-09-23       Impact factor: 2.574

7.  Nomenclature- and database-compatible names for the two Ebola virus variants that emerged in Guinea and the Democratic Republic of the Congo in 2014.

Authors:  Jens H Kuhn; Kristian G Andersen; Sylvain Baize; Yīmíng Bào; Sina Bavari; Nicolas Berthet; Olga Blinkova; J Rodney Brister; Anna N Clawson; Joseph Fair; Martin Gabriel; Robert F Garry; Stephen K Gire; Augustine Goba; Jean-Paul Gonzalez; Stephan Günther; Christian T Happi; Peter B Jahrling; Jimmy Kapetshi; Gary Kobinger; Jeffrey R Kugelman; Eric M Leroy; Gael Darren Maganga; Placide K Mbala; Lina M Moses; Jean-Jacques Muyembe-Tamfum; Magassouba N'Faly; Stuart T Nichol; Sunday A Omilabu; Gustavo Palacios; Daniel J Park; Janusz T Paweska; Sheli R Radoshitzky; Cynthia A Rossi; Pardis C Sabeti; John S Schieffelin; Randal J Schoepp; Rachel Sealfon; Robert Swanepoel; Jonathan S Towner; Jiro Wada; Nadia Wauquier; Nathan L Yozwiak; Pierre Formenty
Journal:  Viruses       Date:  2014-11-24       Impact factor: 5.048

8.  Filovirus RefSeq entries: evaluation and selection of filovirus type variants, type sequences, and names.

Authors:  Jens H Kuhn; Kristian G Andersen; Yīmíng Bào; Sina Bavari; Stephan Becker; Richard S Bennett; Nicholas H Bergman; Olga Blinkova; Steven Bradfute; J Rodney Brister; Alexander Bukreyev; Kartik Chandran; Alexander A Chepurnov; Robert A Davey; Ralf G Dietzgen; Norman A Doggett; Olga Dolnik; John M Dye; Sven Enterlein; Paul W Fenimore; Pierre Formenty; Alexander N Freiberg; Robert F Garry; Nicole L Garza; Stephen K Gire; Jean-Paul Gonzalez; Anthony Griffiths; Christian T Happi; Lisa E Hensley; Andrew S Herbert; Michael C Hevey; Thomas Hoenen; Anna N Honko; Georgy M Ignatyev; Peter B Jahrling; Joshua C Johnson; Karl M Johnson; Jason Kindrachuk; Hans-Dieter Klenk; Gary Kobinger; Tadeusz J Kochel; Matthew G Lackemeyer; Daniel F Lackner; Eric M Leroy; Mark S Lever; Elke Mühlberger; Sergey V Netesov; Gene G Olinger; Sunday A Omilabu; Gustavo Palacios; Rekha G Panchal; Daniel J Park; Jean L Patterson; Janusz T Paweska; Clarence J Peters; James Pettitt; Louise Pitt; Sheli R Radoshitzky; Elena I Ryabchikova; Erica Ollmann Saphire; Pardis C Sabeti; Rachel Sealfon; Aleksandr M Shestopalov; Sophie J Smither; Nancy J Sullivan; Robert Swanepoel; Ayato Takada; Jonathan S Towner; Guido van der Groen; Viktor E Volchkov; Valentina A Volchkova; Victoria Wahl-Jensen; Travis K Warren; Kelly L Warfield; Manfred Weidmann; Stuart T Nichol
Journal:  Viruses       Date:  2014-09-26       Impact factor: 5.048

9.  Genetics-based classification of filoviruses calls for expanded sampling of genomic sequences.

Authors:  Chris Lauber; Alexander E Gorbalenya
Journal:  Viruses       Date:  2012-08-31       Impact factor: 5.048

10.  A system to automatically classify and name any individual genome-sequenced organism independently of current biological classification and nomenclature.

Authors:  Haitham Marakeby; Eman Badr; Hanaa Torkey; Yuhyun Song; Scotland Leman; Caroline L Monteil; Lenwood S Heath; Boris A Vinatzer
Journal:  PLoS One       Date:  2014-02-21       Impact factor: 3.240
