Genome signatures, self-organizing maps and higher order phylogenies: a parametric analysis.

Derek Gatherer

Abstract

Genome signatures are data vectors derived from the compositional statistics of DNA. The self-organizing map (SOM) is a neural network method for the conceptualisation of relationships within complex data, such as genome signatures. The various parameters of the SOM training phase are investigated for their effect on the accuracy of the resulting output map. It is concluded that larger SOMs, as well as taking longer to train, are less sensitive in phylogenetic classification of unknown DNA sequences. However, where a classification can be made, a larger SOM is more accurate. Increasing the number of iterations in the training phase of the SOM only slightly increases accuracy, without improving sensitivity. The optimal length of the DNA sequence k-mer from which the genome signature should be derived is 4 or 5, but shorter values are almost as effective. Overall, these results indicate that small, rapidly trained SOMs are generally as good as larger, longer-trained ones for the analysis of genome signatures. These results may also be more generally applicable to the use of SOMs for other complex data sets, such as microarray data.

Keywords:  Genome Signature; Herpesvirus; Jack-Knife Method; Metagenomics; Microarray; Phylogeny; Self-Organizing Map; Viruses

Year:  2007        PMID: 19468314      PMCID: PMC2684143     

Source DB:  PubMed          Journal:  Evol Bioinform Online        ISSN: 1176-9343            Impact factor:   1.625


Introduction

Molecular evolutionary methodology revolves around the production of sequence alignments and trees. However, as evolutionary distance increases between two homologous molecules, their similarity may decay to the point where they are no longer alignable. Construction of a phylogenetic tree under such circumstances becomes impossible. One method that has been suggested for the study of distant evolutionary relationships is that of genomic signatures or genome signatures† (Karlin and Ladunga, 1994; Karlin and Burge, 1995; Karlin and Mrazek, 1996). At least one reviewer has come to the conclusion that it is the preferred method in cases where evolutionary distance, recombination, horizontal transmission or variable mutation rates may confound traditional alignment-based techniques (Brocchieri, 2001). The first derivation of genome signatures predates the invention of DNA sequencing. Biochemical studies revealed that the frequencies of nearest-neighbour dinucleotide pairs in DNA were generally consistent within genomes, and often different between genomes. These characteristic nearest neighbour patterns were termed general schemes (Russell et al. 1976; Russell and Subak-Sharpe, 1977), and constitute, in modern terminology, a subset of genome signatures, those of length k = 2. As long DNA sequences began to be isolated and computers entered the biological laboratory, it became a simple matter to produce nearest-neighbour frequency tables. Indeed, for any DNA sequence of length N, it is theoretically possible to derive frequency tables for all k-mers ranging from 1 to N, within that sequence. The frequency table at k = 1 corresponds to the raw nucleotide content on one strand. On the assumption that DNA is double stranded under most circumstances in most species, the complementary bases are also scored. This reduces the raw count of the four bases to a single value, between zero and one, representing the GC content of that DNA sequence. 
Correspondingly, at k = 2, the raw count of 16 dinucleotide frequencies can be reduced to a vector containing 10 values if the count for each dimer on the top strand is added to the count for its complement on the other strand. There are 10 values, not 8, in this vector since GC, CG, AT and TA are self-complementary. This process is called symmetrization (Karlin and Ladunga, 1994). The symmetrized values in the vector are then usually corrected for the frequencies of their component monomers, as follows:
$$\rho^{*}_{XY} = \frac{f_{XY}}{f_{X}\, f_{Y}}$$
where $f_{XY}$ is the symmetrized frequency of dinucleotide XY, and $f_{X}$ and $f_{Y}$ are the symmetrized frequencies of bases X and Y, respectively. The whole vector is referred to as the genome signature at k = 2 or, particularly in the extensive literature of the Karlin group, simply as $\rho^{*}_{XY}$. For all values of k, the nomenclature GS-k is here adopted. The vector thus becomes an array of the ratios of observed frequencies of k-mers to their expected frequencies given an underlying zero-order Markov chain model of a DNA sequence. Even though symmetrization will reduce the size of the vector for large values of k, it is apparent that it will still grow in size on the order of $4^{k}$ for an alphabet of size 4. In practice, most investigators have confined themselves to the study of genome signatures of k = 2, in other words to $\rho^{*}_{XY}$, symmetrized dinucleotide frequencies corresponding to general schemes, although in recent years the availability of faster computers has undoubtedly contributed to the increasing use of genome signatures up to k = 10 (Deschavanne et al. 1999; Edwards et al. 2002; Abe et al. 2003a; Sandberg et al. 2003; Campanaro et al. 2005; Dufraigne et al. 2005; Wang et al. 2005; Paz et al. 2006).
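The symmetrized, monomer-corrected dinucleotide signature described above can be illustrated with a short Python sketch. It is not the code used in the paper: the function names are hypothetical, and it assumes that symmetrization is done by counting both the sequence and its reverse complement, and that positions containing ambiguous bases (anything other than A, C, G, T) are skipped.

```python
from collections import Counter
from itertools import product

BASES = frozenset("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (other symbols unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]


def genome_signature_k2(seq):
    """Return the 10-value symmetrized dinucleotide odds-ratio vector (GS-2).

    Keys are the alphabetically smaller member of each dimer/reverse-complement
    pair; values are observed / expected frequencies under a zero-order model.
    """
    seq = seq.upper()
    mono, di = Counter(), Counter()
    # Symmetrization (assumed here): score the top strand and its complement.
    for strand in (seq, reverse_complement(seq)):
        for i, base in enumerate(strand):
            if base not in BASES:
                continue
            mono[base] += 1
            nxt = strand[i + 1:i + 2]
            if nxt in BASES:  # an empty string is not in the frozenset
                di[base + nxt] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")
    signature = {}
    for x, y in product("ACGT", repeat=2):
        dimer = x + y
        key = min(dimer, reverse_complement(dimer))
        if key in signature:
            continue  # the complementary dimer has the same value
        f_xy = di[dimer] / n_di
        f_x, f_y = mono[x] / n_mono, mono[y] / n_mono
        signature[key] = f_xy / (f_x * f_y) if f_x and f_y else float("nan")
    return signature
```

Calling `genome_signature_k2` on a fragment returns a dictionary of 10 ratios; values above 1 indicate dinucleotides that are over-represented relative to the base composition, and values below 1 indicate under-representation.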
The length of DNA required to generate a genome signature has conventionally been taken to be around 50 kb, and for this value it has been observed that the Hamming or Euclidean distances between signatures derived from contigs within species are generally considerably smaller than the corresponding average values between species (Karlin and Ladunga, 1994; Karlin and Burge, 1995; Karlin et al. 1997; Abe et al. 2002; Teeling et al. 2004), even when the same-species contigs are on different chromosomes (Gentles and Karlin, 2001). However, recent work has established that genome signatures within species may be stable over lengths as short as 10 kb (Deschavanne et al. 1999; Karlin, 2001; Abe et al. 2002) or less (Sandberg et al. 2001; Jernigan and Baran, 2002; Abe et al. 2003a; Sandberg et al. 2003; McHardy et al. 2007). This has led to their practical application in the detection of pathogenicity islands (pIs) in pathogenic bacteria. These are sequences originating in horizontal transmission from one bacterium to another, converting a previously innocuous strain into a pathogenic one. Their foreign origin is often reflected in a genome signature closer to that of their species of origin than to that of their current host genome (Karlin, 1998; Karlin, 2001; Dufraigne et al. 2005).
Phylogenetic conclusions drawn from comparison of genome signatures have sometimes been controversial. For instance, Karlin et al. (1997) found that cyanobacteria do not form a coherent evolutionary group, and that Methanococcus jannaschii is closer to eukaryotes than to other proteobacteria, and Campbell et al. (1999) suggested that archaea do not form a coherent clade. Karlin (1998) posited a wide variety of further revisions of the prokaryotic phylogeny based on genome signature results, as well as a novel origin for mitochondria (Karlin et al. 1999). Edwards et al. (2002) used genome signatures as part of a revision of the phylogeny of birds.
Nevertheless, few authors have felt confident enough to draw phylogenetic trees based on genome signature comparisons. Coenye and Vandamme (2004) have shown that dinucleotide content is only a reliable indicator of relatedness for closely related organisms. To visualize genome signature relationships between species, a variety of other representational schemes have been used including histograms (Karlin and Mrázek, 1997), partial ordering graphs (Karlin et al. 1997), chaos games (Deschavanne et al. 1999; Edwards et al. 2002; Wang et al. 2005), and self-organizing maps (Abe et al. 2003b). This paper uses self-organizing maps (SOMs) as a tool to explore genome signature variability at phylogenetic levels from superkingdom down to genus. The SOM is a neural network method which spreads multi-dimensional data onto a two-dimensional surface (Kohonen, 1997). Its endpoint is therefore similar to multi-dimensional scaling or principal components analysis, and like these other techniques has been extensively used in biology, principally for the analysis of micro-array data but also to a lesser extent for sequence analysis (Arrigo et al. 1991; Giuliano et al. 1993; Andrade et al. 1997; Tamayo et al. 1999; Kanaya et al. 2001; Wang et al. 2001; Abe et al. 2002; Covell et al. 2003; Ressom et al. 2003; Xiao et al. 2003; Mahony et al. 2004; Oja et al. 2005; Abe et al. 2006; Samsonova et al. 2006). The resulting “flat” representation may be a strong aid to intuitive understanding of the structure of complex multidimensional datasets. The SOM is not a clustering technique per se, but the surface may be divided up into zones that are then treated as clusters. Alternatively, cluster boundaries on the surface may be defined more objectively using additional algorithms (Ultsch, 1993). The SOM is also not hierarchical (unlike UPGMA but like K-means clustering, two other commonly used techniques for the analysis of microarrays). 
This absence of hierarchy means that it is particularly suited to situations where the natural hierarchy of species relationships, reflecting evolutionary descent, may have been violated, e.g. by horizontal gene transfer. In this paper, the two main parameters of the SOM, namely its size and the number of iterations used in its construction, are investigated for their effects on its classificatory accuracy. These parameters must be chosen at the beginning of each run of SOM building, and there is little guidance in the SOM literature as to their optimal values. As well as the parameters of the SOM, the value of k used in the genome signature is similarly examined. High-k genome signatures are extremely long vectors that may present considerable memory problems even on modern computers. Likewise, lengthy iterations in training the SOM, especially if it is a large one, may consume considerable time.
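To make the roles of SOM size and iteration count concrete, the following is a minimal pure-Python sketch of SOM training. It is an illustration under stated assumptions, not the implementation used in the paper: the Gaussian neighbourhood, the linear decay of learning rate and radius, the random initialization, and the reading of one "iteration" as one pass over all training vectors are all choices made here for the example, and the function names are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, iterations, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(v[j] for v in data) for j in range(dim)]
    hi = [max(v[j] for v in data) for j in range(dim)]
    # Random initial weights inside the range of the data (assumption).
    weights = [[[rng.uniform(lo[j], hi[j]) for j in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for it in range(iterations):
        frac = it / iterations
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
        order = list(data)
        rng.shuffle(order)
        for x in order:
            br, bc = best_matching_unit(weights, x)
            # Every node is updated, so the cost per vector grows with map size.
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for j in range(dim):
                        w[j] += lr * h * (x[j] - w[j])
    return weights
```

In this sketch the work per training vector is proportional to rows × cols × vector length, and the total work scales linearly with the number of iterations, which is why large maps, long training runs and high-k signatures each add to the running time.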

Methods

Genome sequences

Complete genome sequences were downloaded from NCBI Taxonomy Browser (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). A Perl script was written to divide complete genome sequences into consecutive strings of 10 or 100 kb, as required. Trailing ends, and genomes shorter than the required string length, were discarded. The resulting FASTA-formatted datasets were then processed to calculate their genome signatures. Table 1 lists the genomes used as the main data set for the paper, that of viruses of the family Herpesviridae. The analyses shown in Figures 3 to 7 use this set. A larger set of genomes with the widest possible phylogenetic range, including all three superkingdoms of cellular life as well as viruses, is given in Table 2. These are used for the “all-life” and superkingdom-level SOMs in Figure 1. Table 3 lists those viral genomes used for the SOM across a wide set of viral genomes, displayed in Figure 2.
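The fragmentation step was carried out with a Perl script that is not reproduced in the text. As a hedged illustration only, the same logic (consecutive, non-overlapping fragments of a fixed length, with trailing ends and too-short genomes discarded) might be sketched in Python as follows. The function names and the fragment header format are hypothetical.

```python
def split_into_fragments(sequence, fragment_length):
    """Return consecutive non-overlapping fragments; the trailing end is dropped.

    A sequence shorter than fragment_length yields an empty list, so such
    genomes are discarded automatically.
    """
    n_full = len(sequence) // fragment_length
    return [sequence[i * fragment_length:(i + 1) * fragment_length]
            for i in range(n_full)]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fragment_records(lines, fragment_length):
    """Yield (header, fragment) pairs for every full-length fragment."""
    for header, seq in read_fasta(lines):
        for i, frag in enumerate(split_into_fragments(seq, fragment_length)):
            # Header suffix is an illustrative naming scheme, not the paper's.
            yield f"{header}|fragment_{i + 1}", frag
```

For the analyses described here, `fragment_length` would be 10000 or 100000, matching the 10 kb and 100 kb strings mentioned above.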
Table 1.

Herpesvirus genome sequences used for the analyses shown in Figures 3 to 7. The nomenclature follows the International Committee on Taxonomy of Viruses (Fauquet et al. 2005).

| Name | Accession | Sub-family | Genus |
| --- | --- | --- | --- |
| Psittacid herpesvirus 1 | NC_005264 | Alpha | Iltovirus |
| Gallid herpesvirus 2 | NC_002229 | Alpha | Mardivirus |
| Gallid herpesvirus 3 | NC_002577 | Alpha | Mardivirus |
| Meleagrid herpesvirus 1 | NC_002641 | Alpha | Mardivirus |
| Cercopithecine herpesvirus 1 | NC_004812 | Alpha | Simplexvirus |
| Human herpesvirus 1 | NC_001806 | Alpha | Simplexvirus |
| Human herpesvirus 2 | NC_001798 | Alpha | Simplexvirus |
| Bovine herpesvirus 1 | NC_001847 | Alpha | Varicellovirus |
| Bovine herpesvirus 5 | NC_005261 | Alpha | Varicellovirus |
| Cercopithecine herpesvirus 7 | NC_002686 | Alpha | Varicellovirus |
| Equid herpesvirus 1 | NC_001491 | Alpha | Varicellovirus |
| Equid herpesvirus 4 | NC_001844 | Alpha | Varicellovirus |
| Human herpesvirus 3 | NC_001348 | Alpha | Varicellovirus |
| Suid herpesvirus 1 | NC_006151 | Alpha | Varicellovirus |
| Cercopithecine herpesvirus 8 | NC_006150 | Beta | Cytomegalovirus |
| Chimpanzee cytomegalovirus | NC_003521 | Beta | Cytomegalovirus |
| Human herpesvirus 5 (AD169) | NC_001347 | Beta | Cytomegalovirus |
| Human herpesvirus 5 (Merlin) | NC_006273 | Beta | Cytomegalovirus |
| Murid herpesvirus 1 | NC_004065 | Beta | Muromegalovirus |
| Murid herpesvirus 2 | NC_002512 | Beta | Muromegalovirus |
| Human herpesvirus 6 | NC_001664 | Beta | Roseolovirus |
| Human herpesvirus 6B | NC_000898 | Beta | Roseolovirus |
| Human herpesvirus 7 | NC_001716 | Beta | Roseolovirus |
| Tupaia herpesvirus | NC_002794 | Beta | Tupaiavirus |
| Callitrichine herpesvirus 3 | NC_004367 | Gamma | Lymphocryptovirus |
| Cercopithecine herpesvirus 15 | NC_006146 | Gamma | Lymphocryptovirus |
| Human herpesvirus 4 | NC_001345 | Gamma | Lymphocryptovirus |
| Alcelaphine herpesvirus 1 | NC_002531 | Gamma | Rhadinovirus |
| Ateline herpesvirus 3 | NC_001987 | Gamma | Rhadinovirus |
| Bovine herpesvirus 4 | NC_002665 | Gamma | Rhadinovirus |
| Cercopithecine herpesvirus 17 | NC_003401 | Gamma | Rhadinovirus |
| Equid herpesvirus 2 | NC_001650 | Gamma | Rhadinovirus |
| Human herpesvirus 8 | NC_003409 | Gamma | Rhadinovirus |
| Murid herpesvirus 4 | NC_001826 | Gamma | Rhadinovirus |
| Saimiriine herpesvirus 2 | NC_001350 | Gamma | Rhadinovirus |
| Ictalurid herpesvirus 1 | NC_001493 | unassigned | Ictalurivirus |
| Ostreid herpesvirus 1 | NC_005881 | unassigned | unassigned |
Figure 3.

Dominance maps for GS-2 of 10 kb fragments of herpesviruses applied to a 50 × 50 SOM over 500 iterations. The SOM is colored first according to genus membership and then according to family membership (reduced scale inset).

Figure 7.

Jack-knifing experiments to determine effects of SOM size (top left), GS number (top right) and number of iterations (lower left). The “undecided” column indicates the percentage of sequences in the test set that could not be assigned to a sub-family or genus. The “correct” column indicates the percentage of assignable sequences that were correctly assigned. Optimal values are highlighted in yellow.

Table 2.

Genomes used for the analysis shown in Figure 1. In total there are 79 eukaryotic, 156 eubacterial, 30 archaeal and 122 viral genomes with more than 100 kb of sequence.

NameSuperkingdomAccession
Aeropyrum pernix K1archaeaNC_000854
Archaeoglobus fulgidus DSM 4304archaeaNC_000917
cf. Archaea SAR-1archaeaNS_000019
Ferroplasma acidarmanus Type IarchaeaNS_000030
Ferroplasma sp. Type IIarchaeaNS_000029
Haloarcula marismortui ATCC43049 chromosome IarchaeaNC_006396
Haloarcula marismortui ATCC43049 chromosome IIarchaeaNC_006397
Halobacterium sp. NRC-1archaeaNC_002607
Halobacterium sp. NRC-1 plasmid pNRC100archaeaNC_001869
Methanocaldococcus jannaschii DSM2661archaeaNC_000909
Methanococcus maripaludis S2archaeaNC_005791
Methanopyrus kandleri AV19archaeaNC_003551
Methanosarcina acetivorans C2AarchaeaNC_003552
Methanosarcina barkeri str. fusaro chromosome 1archaeaNC_007355
Methanosarcina mazei Go1archaeaNC_003901
Methanothermobacter thermautotrophicus str. DeltaHarchaeaNC_000916
Nanoarchaeum equitans Kin4-MarchaeaNC_005213
Natronomonas pharaonis DSM2160archaeaNC_007426
Picrophilus torridus DSM9790archaeaNC_005877
Pyrobaculum aerophilum str. IM2archaeaNC_003364
Pyrococcus abyssi GE5archaeaNC_000868
Pyrococcus furiosus DSM3638archaeaNC_003413
Pyrococcus horikoshii OT3archaeaNC_000961
Sulfolobus acidocaldarius DSM639archaeaNC_007181
Sulfolobus solfataricus P2archaeaNC_002754
Sulfolobus tokodaii str. 7archaeaNC_003106
Thermococcus kodakaraensis KOD1archaeaNC_006624
Thermoplasma acidophilum DSM1728archaeaNC_002578
Thermoplasma volcanium GSS1archaeaNC_002689
Thermoplasmatales archaeon GplarchaeaNS_000033
Agrobacterium tumefaciens str. C58eubacteriaNC_003062
Anabaena variabilis ATCC 29413eubacteriaNC_007413
Aquifex aeolicus VF5eubacteriaNC_000918
Azoarcus sp. EbN1eubacteriaNC_006513
Bacillus cereus ATCC 10987eubacteriaNC_003909
Bacillus cereus E33LeubacteriaNC_006274
Bacillus subtilis sub sp. subtilis str. 168eubacteriaNC_000964
Bacteroides fragilis NCTC9343eubacteriaNC_003228
Bacteroides fragilis YCH46eubacteriaNC_006347
Bartonella henselae str. Houston-1eubacteriaNC_005956
Bartonella quintana str. ToulouseeubacteriaNC_005955
BBUR Borrelia burgdorferi B31eubacteriaNC_001318
Bifidobacterium longum NCC2705eubacteriaNC_004307
Bordetella parapertussis 12822eubacteriaNC_002928
Bordetella pertussis TohamaIeubacteriaNC_002929
Bradyrhizobium japonicum USDA110eubacteriaNC_004463
Brucella abortus biovar 1 str. 9-941 chromosome IeubacteriaNC_006932
Brucella abortus biovar 1 str. 9-941 chromosome IIeubacteriaNC_006933
Brucella suis 1330 chromosome IeubacteriaNC_004310
Buchnera aphidicola str. APS (Acyrthosiphonpisum)eubacteriaNC_002528
Buchnera aphidicola str. Sg (Schizaphisgraminum)eubacteriaNC_004061
Burkholderia mallei ATCC23344 chromosome 1eubacteriaNC_006348
Burkholderia mallei ATCC23344 chromosome 2eubacteriaNC_006349
Burkholderia pseudomallei 1710b chromosome IeubacteriaNC_007434
Burkholderia pseudomallei 1710b chromosome IIeubacteriaNC_007435
Burkholderia pseudomallei K96243 chromosome 1eubacteriaNC_006350
Burkholderia sp. 383 chromosome 1eubacteriaNC_007510
Burkholderia sp. 383 chromosome 2eubacteriaNC_007511
Burkholderia sp. 383 chromosome 3eubacteriaNC_007509
Candidatus Blochmannia pennsylvanicus str. BPENeubacteriaNC_007292
Candidatus Pelagibacter ubique HTCC1062eubacteriaNC_007205
Carboxydothermus hydrogenoformans Z-2901eubacteriaNC_007503
Caulobacter crescentus CB15eubacteriaNC_002696
Chlamydia trachomatis A/HAR-13eubacteriaNC_007429
Chlamydia trachomatis D/UW-3/CXeubacteriaNC_000117
Chlamydophila caviae GPICeubacteriaNC_003361
Chlamydophila pneumoniae AR39eubacteriaNC_002179
Chlamydophila pneumoniae CWL029eubacteriaNC_000922
Chlamydophila pneumoniae J138eubacteriaNC_002491
Chlorobium chlorochromatii CaDeubacteriaNC_007514
Clostridium acetobutylicum ATCC824eubacteriaNC_003030
Clostridium tetani E88eubacteriaNC_004557
Colwellia psychrerythraea 34HeubacteriaNC_003910
Corynebacterium glutamicum ATCC13032eubacteriaNC_003450
Corynebacterium jeikeium K411eubacteriaNC_007164
Coxiella burnetii RSA493eubacteriaNC_002971
Dechloromonas aromatica RCBeubacteriaNC_007298
Dehalococcoides sp. CBDB1eubacteriaNC_007356
Deinococcus radiodurans R1 chromosome 1eubacteriaNC_001263
Deinococcus radiodurans R1 chromosome 2eubacteriaNC_001264
Desulfovibrio vulgaris sub sp. vulgaris str. HildenborougheubacteriaNC_002937
Desulfovibriode sulfuricans G20eubacteriaNC_007519
Ehrlichia canis str. JakeeubacteriaNC_007354
Erwinia carotovora sub sp. atrosepticaSCRI1043eubacteriaNC_004547
Escherichia coli CFT073eubacteriaNC_004431
Escherichia coli K12eubacteriaNC_000913
Escherichia coli O157:H7EDL933eubacteriaNC_002655
Francisella tularensis sub sp. tularensis Schu4eubacteriaNC_006570
Geobacter metallireducens GS-15eubacteriaNC_007517
Haemophilus ducreyi 35000HPeubacteriaNC_002940
Haemophilus influenzae 86-028NPeubacteriaNC_007146
Haemophilus influenzae RdKW20eubacteriaNC_000907
Helicobacter pylori 26695eubacteriaNC_000915
Helicobacter pylori J99eubacteriaNC_000921
Leifsoniaxyli sub sp. xyli str. CTCB07eubacteriaNC_006087
Leptospira interrogans serovar Copenhageni chromosome IeubacteriaNC_005823
Leptospira interrogans serovar Copenhageni chromosome IIeubacteriaNC_005824
Leptospira interrogans serovar Lai str. 56601 chromosome IeubacteriaNC_004342
Mannheimia succiniciproducens MBEL55EeubacteriaNC_006300
Mesoplasma florum L1eubacteriaNC_006055
Mesorhizobium loti MAFF303099eubacteriaNC_002678
Methylococcus capsulatus str. BatheubacteriaNC_002977
Mycobacterium avium sub sp. paratuberculosis K-10eubacteriaNC_002944
Mycobacterium bovis AF2122/97eubacteriaNC_002945
Mycobacterium leprae TNeubacteriaNC_002677
Mycobacterium tuberculosis H37RveubacteriaNC_000962
Mycoplasma genitalium G-37eubacteriaNC_000908
Mycoplasma hyopneumoniae 7448eubacteriaNC_007332
Mycoplasma hyopneumoniae JeubacteriaNC_007295
Mycoplasma synoviae 53eubacteriaNC_007294
Neisseria gonorrhoeae FA1090eubacteriaNC_002946
Neisseria meningitidis MC58eubacteriaNC_003112
Neisseria meningitidis Z2491eubacteriaNC_003116
Nitrobacter winogradskyi Nb-255eubacteriaNC_007406
Nitrosococcus oceani ATCC 19707eubacteriaNC_007484
Nitrosomonas europaea ATCC 19718eubacteriaNC_004757
Nocardia farcinicaI FM10152eubacteriaNC_006361
Oceanobacillus iheyensis HTE831eubacteriaNC_004193
Parachlamydia sp. UWE25eubacteriaNC_005861
Pasteurella multocida sub sp. multocida str. Pm70eubacteriaNC_002663
Pelobacter carbinolicus DSM2380eubacteriaNC_007498
Pelodictyon luteolum DSM273eubacteriaNC_007512
Photobacterium profundum SS9 chromosome 1eubacteriaNC_006370
Photobacterium profundum SS9 chromosome 2eubacteriaNC_006371
Photorhabdus luminescens sub sp. laumondii TTO1eubacteriaNC_005126
Prochlorococcus marinus str. NATL2AeubacteriaNC_007335
Prochlorococcus marinus sub sp. pastoris str. CCMP1986eubacteriaNC_005072
Propionibacterium acnes KPA171202eubacteriaNC_006085
Pseudoalteromonas haloplanktis TAC125 chromosome IeubacteriaNC_007481
Pseudoalteromonas haloplanktis TAC125 chromosome IIeubacteriaNC_007482
Psuedomonas fluorescens Pf-5eubacteriaNC_004129
Psuedomonas fluorescens PfO-1eubacteriaNC_007492
Psuedomonas putida KT2440eubacteriaNC_002947
Psuedomonas syringae pv. phaseolicola 1448AeubacteriaNC_005773
Psuedomonas syringae pv. syringae B728aeubacteriaNC_007005
Psuedomonas syringae pv. tomato str. DC3000eubacteriaNC_004578
Psychrobacter arcticus 273-4eubacteriaNC_007204
Ralstonia eutropha JMP134 chromosome 1eubacteriaNC_007347
Ralstonia eutropha JMP134 chromosome 2eubacteriaNC_007348
Ralstonia solanacearum GMI1000eubacteriaNC_003295
Rhodobacter sphaeroides 2.1 chromosome 1eubacteriaNC_007493
Rhodobacter sphaeroides 2.1 chromosome 2eubacteriaNC_007494
Rickettsia conorii str. Malish 7eubacteriaNC_003103
Rickettsia felis URRWXCal2eubacteriaNC_007109
Rickettsia prowazekii str. MadridEeubacteriaNC_000963
Rickettsia typhi str. WilmingtoneubacteriaNC_006142
Salmonella enterica serovar Choleraesuis str. SC-B67eubacteriaNC_006905
Salmonella enterica serovar Typhi str. CT18eubacteriaNC_003198
Shewanella oneidensis MR-1eubacteriaNC_004347
Shigella flexneri 2a str. 2457TeubacteriaNC_004741
Shigella flexneri 2a str. 301eubacteriaNC_004337
Shigella sonnei Ss046eubacteriaNC_007384
Sinorhizobium meliloti 1021eubacteriaNC_003047
Staphylococcus aureus sub sp. Aureus Mu50eubacteriaNC_002758
Staphylococcus haemolyticus JCSC143eubacteriaNC_007168
Staphylococcus saprophyticus sub sp. saprophyticuseubacteriaNC_007350
Streptococcus agalactiae A909eubacteriaNC_007432
Streptococcus pyogenes MGAS10394eubacteriaNC_006086
Streptococcus pyogenes MGAS315eubacteriaNC_004070
Streptococcus pyogenes MGAS500eubacteriaNC_007297
Streptococcus pyogenes MGAS6180eubacteriaNC_007296
Streptococcus pyogenes SSI-1eubacteriaNC_004606
Streptococcus thermophilus CNRZ1066eubacteriaNC_006449
Streptococcus thermophilus LMG18311eubacteriaNC_006448
Streptomyces avermitilis MA-4680eubacteriaNC_003155
Streptomyces coelicolor A3(2)eubacteriaNC_003888
Synechococcus sp. CC9605eubacteriaNC_007516
Synechococcus sp. CC9902eubacteriaNC_007513
Thermobifida fusca YXeubacteriaNC_007333
Thermus thermophilus HB8eubacteriaNC_006461
Thiobacillus denitrificans ATCC2525eubacteriaNC_007404
Thiomicrospira crunogena XCL-2eubacteriaNC_007520
Tropheryma whipplei str. TwisteubacteriaNC_004572
Vibrio cholerae O1 biovar eltor str. N16961 chromosome IeubacteriaNC_002505
Vibrio vulnificus CMCP6 chromosome IeubacteriaNC_004459
Vibrio vulnificus CMCP6 chromosome IIeubacteriaNC_004460
Wolbachia endosymbiont strain TRS of BrugiamalayieubacteriaNC_006833
Wolinella succinogenes DSM1740eubacteriaNC_005090
Xanthomonas axonopodis pv. citri str. 306eubacteriaNC_003919
Xanthomonas campestris pv. campestris str. 8004eubacteriaNC_007086
Xanthomonas campestris pv. campestris str. ATCC33913eubacteriaNC_003902
Xanthomonas campestris pv. vesicatoria str. 85-10eubacteriaNC_007508
Xanthomonas oryzae pv. oryzae KACC10331eubacteriaNC_006834
Xylella fastidiosa 9a5ceubacteriaNC_002488
Xylella fastidiosa Temecula 1eubacteriaNC_004556
Yersinia pseudotuberculosis IP32953eubacteriaNC_006155
Bos taurus genome 12eukaryoteNC_007310
Bos taurus genome 13eukaryoteNC_007311
Bos taurus genome 14eukaryoteNC_007312
Bos taurus genome 15eukaryoteNC_007313
Bos taurus genome 16eukaryoteNC_007314
Bos taurus genome 17eukaryoteNC_007315
Bos taurus genome 18eukaryoteNC_007316
Bos taurus genome 19eukaryoteNC_007317
Bos taurus genome 20eukaryoteNC_007318
Bos taurus genome 21eukaryoteNC_007319
Bos taurus genome 22eukaryoteNC_007320
Bos taurus genome 23eukaryoteNC_007324
Bos taurus genome 24eukaryoteNC_007325
Bos taurus genome 25eukaryoteNC_007326
Bos taurus genome 26eukaryoteNC_007327
Bos taurus genome 27eukaryoteNC_007328
Bos taurus genome 28eukaryoteNC_007329
Bos taurus genome 29eukaryoteNC_007330
Bos taurus genome XeukaryoteNC_007331
Candida albicans genomic DNA, genome 7eukaryoteNC_007436
Cryptococcus neoformans genome 1eukaryoteNC_006670
Cryptococcus neoformans genome 10eukaryoteNC_006679
Cryptococcus neoformans genome 11eukaryoteNC_006680
Cryptococcus neoformans genome 12eukaryoteNC_006681
Cryptococcus neoformans genome 13eukaryoteNC_006682
Cryptococcus neoformans genome 14eukaryoteNC_006683
Cryptococcus neoformans genome 2eukaryoteNC_006684
Cryptococcus neoformans genome 3eukaryoteNC_006685
Cryptococcus neoformans genome 4eukaryoteNC_006686
Cryptococcus neoformans genome 5eukaryoteNC_006687
Cryptococcus neoformans genome 6eukaryoteNC_006691
Cryptococcus neoformans genome 7eukaryoteNC_006692
Cryptococcus neoformans genome 8eukaryoteNC_006693
Cryptococcus neoformans genome 9eukaryoteNC_006694
Cryptosporidium parvum genome 1eukaryoteNC_006980
Cryptosporidium parvum genome 2eukaryoteNC_006981
Cryptosporidium parvum genome 3eukaryoteNC_006982
Cryptosporidium parvum genome 4eukaryoteNC_006983
Cryptosporidium parvum genome 5eukaryoteNC_006984
Cryptosporidium parvum genome 6eukaryoteNC_006985
Cryptosporidium parvum genome 7eukaryoteNC_006986
Cryptosporidium parvum genome 8eukaryoteNC_006987
Drosophila melanogaster genome 2LeukaryoteNT_033779
Drosophila melanogaster genome 2ReukaryoteNT_033778
Drosophila melanogaster genome 3LeukaryoteNT_037436
Drosophila melanogaster genome 3ReukaryoteNT_033777
Drosophila melanogaster genome 4eukaryoteNC_004353
Drosophila melanogaster genome XeukaryoteNC_004354
Leishmania major strain Friedlin genome 27eukaryoteNC_007268
Leishmania major strain Friedlin genome 29eukaryoteNC_007270
Leishmania major strain Friedlin genome 4eukaryoteNC_007245
Saccharomyces cerevisiae genome IeukaryoteNC_001133
Saccharomyces cerevisiae genome IIeukaryoteNC_001134
Saccharomyces cerevisiae genome IIIeukaryoteNC_001135
Saccharomyces cerevisiae genome IVeukaryoteNC_001136
Saccharomyces cerevisiae genome IXeukaryoteNC_001141
Saccharomyces cerevisiae genome VeukaryoteNC_001137
Saccharomyces cerevisiae genome VIeukaryoteNC_001138
Saccharomyces cerevisiae genome VIIeukaryoteNC_001139
Saccharomyces cerevisiae genome VIIIeukaryoteNC_001140
Saccharomyces cerevisiae genome XeukaryoteNC_001142
Saccharomyces cerevisiae genome XIeukaryoteNC_001143
Saccharomyces cerevisiae genome XIIeukaryoteNC_001144
Saccharomyces cerevisiae genome XIIIeukaryoteNC_001145
Saccharomyces cerevisiae genome XIVeukaryoteNC_001146
Saccharomyces cerevisiae genome XVeukaryoteNC_001147
Saccharomyces cerevisiae genome XVIeukaryoteNC_001148
Trypanosoma brucei TREU927 genome 1eukaryoteNC_007334
Trypanosoma brucei TREU927 genome 10eukaryoteNC_007283
Trypanosoma brucei TREU927 genome 11 scaffold 1eukaryoteNT_165288
Trypanosoma brucei TREU927 genome 2eukaryoteNC_005063
Trypanosoma brucei TREU927 genome 3eukaryoteNC_007276
Trypanosoma brucei TREU927 genome 4eukaryoteNC_007277
Trypanosoma brucei TREU927 genome 5eukaryoteNC_007278
Trypanosoma brucei TREU927 genome 6eukaryoteNC_007279
Trypanosoma brucei TREU927 genome 7eukaryoteNC_007280
Trypanosoma brucei TREU927 genome 8eukaryoteNC_007281
Trypanosoma brucei TREU927 genome 9eukaryoteNC_007282
Trypansomabrucei TREU927 genome 11 scaffold 2eukaryoteNT_165287
Acanthamoeba polyphaga mimivirusvirusNC_006450
Adoxophyes honmai nucleopolyhedrovirusvirusNC_004690
Aeromonas phage 31virusNC_007022
African swine fever virusvirusNC_001659
Agrotis segetum granulovirusvirusNC_005839
Alcelaphine herpesvirus 1virusNC_002531
Ambystoma tigrinum virusvirusNC_005832
Amsacta moorei entomopoxvirusvirusNC_002520
Ateline herpesvirus 3virusNC_001987
Autographa californica nucleopolyhedrovirusvirusNC_001623
bacteriophage 44 RR2.8tvirusNC_005135
bacteriophage Aeh1virusNC_005260
bacteriophage G1virusNC_007066
bacteriophage KVP40virusNC_005083
bacteriophage RM378virusNC_004735
bacteriophage SPBc2virusNC_001884
bacteriophage S-PM2 virionvirusNC_006820
bacteriophage T5 virionvirusNC_005859
Bombyx mori nucleopolyhedrovirusvirusNC_001962
Bovine herpesvirus 1virusNC_001847
Bovine herpesvirus 4virusNC_002665
Bovine herpesvirus 5virusNC_005261
Bovine papular stomatitis virusvirusNC_005337
Callitrichine herpesvirus 3virusNC_004367
CamelpoxvirusvirusNC_003391
CanarypoxvirusvirusNC_005309
Cercopithecine herpesvirus 1virusNC_004812
Cercopithecine herpesvirus 15virusNC_006146
Cercopithecine herpesvirus 17virusNC_003401
Cercopithecine herpesvirus 2virusNC_006560
Cercopithecine herpesvirus 7virusNC_002686
Cercopithecine herpesvirus 8virusNC_006150
Chimpanzee cytomegalovirusvirusNC_003521
Choristoneura fumiferana defective nucleopolyhedrovirusvirusNC_005137
Choristoneura fumiferana MNPVvirusNC_004778
Chrysodeixis chalcites nucleopolyhedrovirusvirusNC_007151
Cowpox virusvirusNC_003663
Cryptophlebia leucotreta granulovirusvirusNC_005068
Culex nigripalpus baculovirusvirusNC_003084
Cyanophage P-SSM2virusNC_006883
Cyanophage P-SSM4virusNC_006884
Cydia pomonella granulovirusvirusNC_002816
Ectocarpus siliculosus virusvirusNC_002687
Ectromelia virusvirusNC_004105
Emiliania huxleyi virus 86virusNC_007346
Enterobacteria phage RB43virusNC_007023
Enterobacteria phage RB49virusNC_005066
Enterobacteria phage RB69virusNC_004928
Enterobacteria phage T4virusNC_000866
Epiphyas postvittana nucleopolyhedrovirusvirusNC_003083
Equid herpesvirus 1virusNC_001491
Equid herpesvirus 2virusNC_001650
Equid herpesvirus 4virusNC_001844
Fowlpox virusvirusNC_002188
Frogvirus 3virusNC_005946
Gallid herpesvirus 1virusNC_006623
Gallid herpesvirus 2virusNC_002229
Gallid herpesvirus 3virusNC_002577
Goatpox virusvirusNC_004003
Helicoverpa armigera nuclearpolyhedrosisvirusvirusNC_003094
Helicoverpa zea single nucleocapsid nucleopolyhedrovirusvirusNC_003349
Heliocoverpa armigera nucleopolyhedrovirus G4virusNC_002654
Heliothis zea virus 1virusNC_004156
Human herpesvirus 1virusNC_001806
Human herpesvirus 2virusNC_001798
Human herpesvirus 3 (strain Dumas)virusNC_001348
Human herpesvirus 4virusNC_001345
Human herpesvirus 5 (laboratory strain AD169)virusNC_001347
Human herpesvirus 5(wildtype strain Merlin)virusNC_006273
Human herpesvirus 6virusNC_001664
Human herpesvirus 6BvirusNC_000898
Human herpesvirus 7virusNC_001716
Human herpesvirus 8, genomevirusNC_003409
Ictalurid herpesvirus 1virusNC_001493
Infectious spleen and kidney necrosis virusvirusNC_003494
Invertebrate iridescent virus 6virusNC_003038
Lactobacillus plantarum bacteriophage LP65virionvirusNC_006565
Lumpy skin disease virusvirusNC_003027
Lymantria dispar nucleopolyhedrovirusvirusNC_001973
Lymphocystis disease virus 1virusNC_001824
Lymphocystis disease virus-isolate ChinavirusNC_005902
Macaca fuscata rhadinovirusvirusNC_007016
Mamestra configurata NPV-AvirusNC_003529
Mamestra configurata nucleopolyhedrovirus BvirusNC_004117
Melanoplus sanguinipes entomopoxvirusvirusNC_001993
Meleagrid herpesvirus 1virusNC_002641
Molluscum contagiosum virusvirusNC_001731
Monkeypox virusvirusNC_003310
Muledeerpox virusvirusNC_006966
Murid herpesvirus 1virusNC_004065
Murid herpesvirus 2virusNC_002512
Murid herpesvirus 4virusNC_001826
Mycobacteriophage Bxz1 virionvirusNC_004687
Mycobacteriophage Omega virionvirusNC_004688
Myxoma virusvirusNC_001132
Orf virusvirusNC_005336
Orgyia pseudotsugata multicapsid nucleopolyhedrovirusvirusNC_001875
Ostreid herpesvirus 1virusNC_005881
Paramecium bursaria Chlorellavirus 1virusNC_000852
Phthorimaea operculella granulovirusvirusNC_004062
Plutella xylostella granulovirusvirusNC_002593
Psittacid herpesvirus 1virusNC_005264
Pseudomonas phage phiKZvirusNC_004629
Rabbit fibroma virusvirusNC_001266
Rachiplusia ou multiple nucleopolyhedrovirusvirusNC_004323
Saimiriine herpesvirus 2virusNC_001350
Sheeppox virusvirusNC_004002
Shrimp whitespot syndrome virusvirusNC_003225
Singapore grouper iridovirusvirusNC_006549
Spodoptera exigua nucleopolyhedrovirusvirusNC_002169
Spodoptera litura nucleopolyhedrovirusvirusNC_003102
Staphylococcus phage K virionvirusNC_005880
Staphylococcus phage TwortvirusNC_007021
Suid herpesvirus 1virusNC_006151
Swinepox virusvirusNC_003389
Trichoplusia ni SNPV virusvirusNC_007383
Tupaia herpesvirusvirusNC_002794
Vaccinia virusvirusNC_001559
Variola virusvirusNC_001611
Xestia c-nigrum granulovirusvirusNC_002331
Yaba monkey tumor virusvirusNC_005179
Yaba-like disease virusvirusNC_002642
Figure 1.

Dominance maps for GS-2 applied to a 50 × 50 SOM over 100 iterations. The eubacterial and viral SOMs are shown at a larger scale owing to their greater detail. Dominance areas are color coded.

Table 3.

Viral genomes used for the SOM covering a wide range of viruses, shown in Figure 2. A total of 579 viral genomes have at least 10 kb of sequence. This is approximately 35% of all fully sequenced viral genomes available at the time of the analysis.

NameAccession
Bovine adenovirus 2AC_000001
Bovine adenovirus 3AC_000002
Canine adenovirus type 1AC_000003
Duck adenovirus 1AC_000004
Human adenovirus type 12AC_000005
Human adenovirus type 17AC_000006
Human adenovirus type 2AC_000007
Human adenovirus type 5AC_000008
Porcine adenovirus 5AC_000009
Simian adenovirus 21AC_000010
Simian adenovirus 25AC_000011
Murine adenovirus 1AC_000012
Fowl adenovirus 9AC_000013
Fowl adenovirus 1AC_000014
Human adenovirus type 11AC_000015
Turkey adenovirus 3AC_000016
Human adenovirus type 1AC_000017
Human adenovirus type 7AC_000018
Human adenovirus type 35AC_000019
Canine adenovirus type 2AC_000020
Paramecium bursaria Chlorella virus 1NC_000852
Viral hemorrhagic septicemia virusNC_000855
Enterobacteria phage T4NC_000866
Alteromonas phage PM2NC_000867
Streptococcus thermophilus bacteriophage Sfi19NC_000871
Streptococcus thermophilus bacteriophage Sfi21NC_000872
Lactobacillus bacteriophage phi adhNC_000896
Human herpesvirus 6BNC_000898
Fowl adenovirus DNC_000899
Bacteriophage VT2-SaNC_000902
Snakehead rhabdovirusNC_000903
Bacteriophage 933WNC_000924
Enterobacteria phage MuNC_000929
Acyrthosiphon pisum bacteriophage APSE-1NC_000935
Murine adenovirus ANC_000942
Murray Valley encephalitis virusNC_000943
Myxoma virusNC_001132
Rabbit fibroma virusNC_001266
Bacteriophage phi YeO3-12NC_001271
Enterobacteria phage 186NC_001317
Mycobacterium phage L5NC_001335
Sulfolobus spindle-shaped virus 1NC_001338
Human herpesvirus 4NC_001345
Human herpesvirus 5 (laboratory strain AD169)NC_001347
Human herpesvirus 3 (strain Dumas)NC_001348
Saimiriine herpesvirus 2NC_001350
Simian foamy virusNC_001364
Human adenovirus CNC_001405
Bacteriophage lambdaNC_001416
Enterobacteria phage PRD1NC_001421
Bacillus phage PZANC_001423
Japanese encephalitis virusNC_001437
Acholeplasma phage L2NC_001447
Venezuelan equine encephalitis virusNC_001449
Avian infectious bronchitis virusNC_001451
Human adenovirus FNC_001454
Human adenovirus ANC_001460
Bovine viral diarrhea virus 1NC_001461
Dengue virus type 2NC_001474
Dengue virus type 3NC_001475
Dengue virus type 1NC_001477
Equid herpesvirus 1NC_001491
Cryphonectria hypovirus 1NC_001492
Ictalurid herpesvirus 1NC_001493
Measles virusNC_001498
O’nyong-nyong virusNC_001512
Rabies virusNC_001542
Ross River virusNC_001544
Sindbis virusNC_001547
Sendai virusNC_001552
Vaccinia virusNC_001559
Vesicular stomatitis Indiana virusNC_001560
West Nile virusNC_001563
Cell fusing agent virusNC_001564
Beet yellows virusNC_001598
Enterobacteria phage T7NC_001604
Lake Victoria marburg virusNC_001608
Bacteriophage P4NC_001609
Variola virusNC_001611
Sonchus yellow net virusNC_001615
Autographa californica nucleopolyhedrovirusNC_001623
Rice tungro spherical virusNC_001632
Equid herpesvirus 2NC_001650
Infectious hematopoietic necrosis virusNC_001652
African swine fever virusNC_001659
Citrus tristeza virusNC_001661
Human herpesvirus 6NC_001664
Tick-borne encephalitis virusNC_001672
Haemophilus phage HP1NC_001697
Lactococcus phage c2NC_001706
Human herpesvirus 7NC_001716
Fowl adenovirus ANC_001720
Human immunodeficiency virus 2NC_001722
Snakehead retrovirusNC_001724
Molluscum contagiosum virusNC_001731
Canine adenovirusNC_001734
Human foamy virusNC_001736
Human respiratory syncytial virusNC_001781
Papaya ringspot virusNC_001785
Barmah Forest virusNC_001786
Human spuma retrovirusNC_001795
Human parainfluenza virus 3NC_001796
Human herpesvirus 2NC_001798
Respiratory syncytial virusNC_001803
Human herpesvirus 1NC_001806
Louping ill virusNC_001809
Duck adenovirus ANC_001813
Lymphocystis disease virus 1NC_001824
Streptococcus phage Cp-1NC_001825
Murid herpesvirus 4NC_001826
Bovine foamy virusNC_001831
Bacteriophage sk1NC_001835
Little cherry virus 1NC_001836
Sweet potato feathery mottle virusNC_001841
Equid herpesvirus 4NC_001844
Murine hepatitis virus strain A59NC_001846
Bovine herpesvirus 1NC_001847
Walleye dermal sarcoma virusNC_001867
Simian-Human immunodeficiency virusNC_001870
Feline foamy virusNC_001871
Rhopalosiphum padi virusNC_001874
Orgyia pseudotsugata nucleopolyhedrovirusNC_001875
Bovine adenovirus BNC_001876
Bacteriophage SPBc2NC_001884
Enterobacteria phage P2NC_001895
Mycobacteriophage D29NC_001900
Bacteriophage N15NC_001901
Methanobacterium phage psiM2NC_001902
Hendra virusNC_001906
Bacteriophage bIL170NC_001909
Canine distemper virusNC_001921
Igbo Ora virusNC_001924
Mycoplasma arthritidis bacteriophage MAV1NC_001942
Hemorrhagic enteritis virusNC_001958
Porcine reproductive and respiratory syndrome virusNC_001961
Bombyx mori nucleopolyhedrovirusNC_001962
Lymantria dispar nucleopolyhedrovirusNC_001973
Bacteriophage phi-C31NC_001978
Ateline herpesvirus 3NC_001987
Bovine respiratory syncytial virusNC_001989
Melanoplus sanguinipes entomopox virusNC_001993
Yellow fever virusNC_002031
Bovine viral diarrhea virus genotype 2NC_002032
Human adenovirus DNC_002067
Streptococcus thermophilus bacteriophage DT1NC_002072
Bovine parainfluenza virus 3NC_002161
Enterobacteria phage HK022NC_002166
Bacteriophage HK97NC_002167
Spodoptera exigua nucleopolyhedrovirusNC_002169
Streptococcus thermophilus bacteriophage 7201NC_002185
Fowlpox virusNC_002188
Tupaia paramyxovirusNC_002199
Mumps virusNC_002200
Equine foamy virusNC_002201
Streptococcus thermophilus bacteriophage Sfi11NC_002214
Gallid herpesvirus 2NC_002229
Northern cereal mosaic virusNC_002251
Transmissible gastroenteritis virusNC_002306
Staphylococcus aureus bacteriophage PVLNC_002321
Xestia c-nigrum granulovirusNC_002331
Enterobacteria phage P22NC_002371
Pseudomonas phage D3NC_002484
Staphylococcus aureus prophage phiPV83NC_002486
Frog adenovirusNC_002501
Murid herpesvirus 2NC_002512
Ovine adenovirus ANC_002513
Mycoplasma virus P1NC_002515
Roseophage SIO1NC_002519
Amsacta moorei entomopox virusNC_002520
Bovine ephemeral fever virusNC_002526
Alcelaphine herpesvirus 1NC_002531
Equine arteritis virusNC_002532
Lactate dehydrogenase-elevating virusNC_002534
Zaire ebola virusNC_002549
Gallid herpesvirus 3NC_002577
Plutella xylostella granulovirusNC_002593
Newcastle disease virusNC_002617
Methanothermobacter wolfeii prophage psiM100NC_002628
Dengue virus type 4NC_002640
Meleagrid herpesvirus 1NC_002641
Yaba-like disease virusNC_002642
Human coronavirus 229ENC_002645
Bacillus phage GA-1NC_002649
Helicoverpa armigera nucleopolyhedrovirus G4NC_002654
Mycobacteriophage Bxb1NC_002656
Classical swine fever virusNC_002657
Staphylococcus aureus temperate phage phi SLTNC_002661
Bovine herpesvirus 4NC_002665
Bacteriophage bIL285NC_002666
Bacteriophage bIL286NC_002667
Bacteriophage bIL309NC_002668
Bacteriophage bIL310NC_002669
Bacteriophage bIL311NC_002670
Bacteriophage bIL312NC_002671
Bovine adenovirus DNC_002685
Cercopithecine herpesvirus 7NC_002686
Ectocarpus siliculosus virusNC_002687
Porcine adenovirus CNC_002702
Bacteriophage Tuc2009NC_002703
Nipah virusNC_002728
Bacteriophage HK620NC_002730
Lactococcus lactis bacteriophage TP901-1NC_002747
Tupaia herpesvirusNC_002794
Lactococcus phage BK5-TNC_002796
Spring viremia of carp virusNC_002803
Cydia pomonella granulovirusNC_002816
Taura syndrome virusNC_003005
Lumpy skin disease virusNC_003027
Invertebrate iridescent virus 6NC_003038
Avian paramyxovirus 6NC_003043
Bovine coronavirusNC_003045
Streptococcus pneumoniae bacteriophage MM1NC_003050
Epiphyas postvittana nucleopolyhedrovirusNC_003083
Culex nigripalpus baculovirusNC_003084
Bacteriophage Mx8NC_003085
Simian hemorrhagic fever virusNC_003092
Helicoverpa armigera nuclear polyhedrosis virusNC_003094
Spodoptera litura nucleopolyhedrovirusNC_003102
Temperate phage PhiNIH1.1NC_003157
Sulfolobus islandicus filamentous virusNC_003214
Semliki forest virusNC_003215
Bacteriophage A118NC_003216
Shrimp white spot syndrome virusNC_003225
Australian bat lyssavirusNC_003243
Human adenovirus ENC_003266
Bacteriophage phiCTXNC_003278
Bacteriophage phiETANC_003288
Bacteriophage PSANC_003291
Bacteriophage T3NC_003298
Bacteriophage phiE125NC_003309
Monkeypox virusNC_003310
Bacteriophage K139NC_003313
Haemophilus phage HP2NC_003315
Sinorhizobium meliloti phage PBC5NC_003324
Halovirus HF2NC_003345
Helicoverpa zea nucleopolyhedrovirusNC_003349
Bacteriophage P27NC_003356
Mycobacteriophage TM4NC_003387
Swinepox virusNC_003389
Cyanophage P60NC_003390
Camelpox virusNC_003391
Cercopithecine herpesvirus 17NC_003401
Human herpesvirus 8NC_003409
Mayaro virusNC_003417
Sleeping disease virusNC_003433
Porcine epidemic diarrhea virusNC_003436
Human parainfluenza virus 2NC_003443
Shigella flexneri bacteriophage VNC_003444
Human parainfluenza virus 1 strain Washington/1964NC_003461
Infectious spleen and kidney necrosis virusNC_003494
Chimpanzee cytomegalovirusNC_003521
Bacteriophage phi3626NC_003524
Stx2 converting bacteriophage INC_003525
Mamestra configurata NPV-ANC_003529
Cryphonectria hypovirusNC_003534
Dasheen mosaic virusNC_003537
Lettuce mosaic virusNC_003605
Maize chlorotic dwarf virusNC_003626
Modoc virusNC_003635
Cowpox virusNC_003663
Rio Bravo virusNC_003675
Apoi virusNC_003676
Pestivirus Reindeer-1NC_003677
Pestivirus Giraffe-1NC_003678
Border disease virus 1NC_003679
Powassan virusNC_003687
Langat virusNC_003690
Rice yellow stunt virusNC_003746
Acyrthosiphon pisum virusNC_003780
Sweet potato mild mottle virusNC_003797
Eastern equine encephalitis virusNC_003899
Aura virusNC_003900
Vibriophage VpV262NC_003907
Western equine encephalomyelitis virusNC_003908
Salmon pancreas disease virusNC_003930
Tamana bat virusNC_003996
Human adenovirus BNC_004001
Sheeppox virusNC_004002
Goatpox virusNC_004003
Leek yellow stripe virusNC_004011
Ovine adenovirus 7NC_004037
Phthorimaea operculella granulovirusNC_004062
Murid herpesvirus 1NC_004065
Lactococcus lactis bacteriophage ul36NC_004066
Tioman virusNC_004074
VirusPhiCh1NC_004084
Sulfolobus islandicus rod-shaped virus 2NC_004086
Sulfolobus islandicus rod-shaped virus 1NC_004087
Ectromelia virusNC_004105
Lactobacillus casei bacteriophage A2NC_004112
Mamestra configurata nucleopolyhedrovirus BNC_004117
Montana myotis leukoencephalitis virusNC_004119
Human metapneumovirusNC_004148
Heliothis zea virus 1NC_004156
Dugbe virus segment LNC_004159
Reston Ebola virusNC_004161
Chikungunya virusNC_004162
Bacteriophage B103NC_004165
Bacteriophage SPP1NC_004166
Bacteriophage phi-105NC_004167
Bacteriophage r1tNC_004302
Streptococcus thermophilus bacteriophage O1205NC_004303
Bacteriophage phig1eNC_004305
Salmonella typhimurium phage ST64BNC_004313
Rachiplusia ou multiple nucleopolyhedrovirusNC_004323
Burkholderia cepacia phage Bcep781NC_004333
Salmonella typhimurium bacteriophage ST64TNC_004348
Alkhurma virusNC_004355
Callitrichine herpesvirus 3NC_004367
Treeshrew adenovirusNC_004453
Vibrio harveyi bacteriophage VHMLNC_004456
Bacteriophage IN93NC_004462
Pseudomonas aeruginosa phage PaP3NC_004466
Streptococcus pyogenes phage 315.1NC_004584
Streptococcus pyogenes phage 315.2NC_004585
Streptococcus pyogenes phage 315.3NC_004586
Streptococcus pyogenes phage 315.4NC_004587
Streptococcus pyogenes phage 315.5NC_004588
Streptococcus pyogenes phage 315.6NC_004589
Staphylococcus aureus phage phi11NC_004615
Staphylococcus aureus phage phi12NC_004616
Staphylococcus aureus phage phi13NC_004617
Pseudomonas phage phiKZNC_004629
Bacteriophage phi-BT1NC_004664
Pseudomonas phage gh-1NC_004665
Grapevine leaf roll-associated virus 3NC_004667
Staphylococcus phage 44AHJDNC_004678
Staphylococcus aureus phage phiP68NC_004679
Mycobacteriophage Che8NC_004680
Mycobacteriophage CJW1NC_004681
Mycobacteriophage Bxz2NC_004682
Mycobacteriophage Che9cNC_004683
Mycobacteriophage RosebushNC_004684
Mycobacteriophage CorndogNC_004685
Mycobacteriophage Che9dNC_004686
Mycobacteriophage Bxz1NC_004687
Mycobacteriophage OmegaNC_004688
Mycobacteriophage BarnyardNC_004689
Adoxophyes honmai nucleopolyhedrovirusNC_004690
SARS coronavirusNC_004718
Grapevine rootstock stem lesion associated virusNC_004724
Bacteriophage RM378NC_004735
Staphylococcus phage phiN315NC_004740
Bacteriophage L-413CNC_004745
Lactococcus phage P335NC_004746
Enterobacteria phage epsilon15NC_004775
Yersinia pestis phage phiA1122NC_004777
Choristoneura fumiferana MNPVNC_004778
Cercopithecine herpesvirus 1NC_004812
Phage phi4795NC_004813
Streptococcus phage C1NC_004814
Bacteriophage phBC6A51NC_004820
Bacteriophage phBC6A52NC_004821
Bacteriophage Aaphi23NC_004827
Deformed wing virusNC_004830
Enterobacteria phage SP6NC_004831
Xanthomonas oryzae bacteriophage Xp10NC_004902
Stx1 converting bacteriophageNC_004913
Stx2 converting bacteriophage IINC_004914
Halovirus HF1NC_004927
Enterobacteria phage RB69NC_004928
Streptococcus mitis phage SM1NC_004996
Papaya leaf-distortion mosaic potyvirusNC_005028
Onion yellow dwarf virusNC_005029
Goose paramyxovirus SF02NC_005036
Adoxophyes orana granulovirusNC_005038
Yokose virusNC_005039
Bacteriophage phiKMVNC_005045
Bacteriophage WPhiNC_005056
Omsk hemorrhagic fever virusNC_005062
Kamiti River virusNC_005064
Little cherry virus 2NC_005065
Enterobacteria phage RB49NC_005066
Cryptophlebia leucotreta granulovirusNC_005068
Bacteriophage PY54NC_005069
Bacteriophage KVP40NC_005083
Fer-de-lance virusNC_005084
Burkholderia cepacia phage BcepNazgulNC_005091
Hirame rhabdovirusNC_005093
Bacteriophage 44RR2.8tNC_005135
Choristoneura fumiferana nucleopolyhedrovirusNC_005137
Human coronavirus OC43NC_005147
Bacteriophage D3112NC_005178
Yaba monkey tumor virusNC_005179
Bacillus thuringiensis bacteriophage Bam35cNC_005258
Mycobacteriophage PG1NC_005259
Bacteriophage Aeh1NC_005260
Bovine herpesvirus 5NC_005261
Burkholderia cepacia phage Bcep22NC_005262
Burkholderia cenocepacia phage Bcep1NC_005263
Psittacid herpesvirus 1NC_005264
Sulfolobus spindle-shaped virus 2NC_005265
Bacteriophage Felix01NC_005282
Dolphin morbillivirusNC_005283
Bacteriophage phi1026bNC_005284
Bacteriophage EJ-1NC_005294
Crimean-Congo hemorrhagic fever virus segment LNC_005301
Canarypox virusNC_005309
Orf virusNC_005336
Bovine papular stomatitis virusNC_005337
Mossman virusNC_005339
Bacteriophage PSP3NC_005340
Burkholderia cepacia phage Bcep43NC_005342
Enterobacteria phage Sf6NC_005344
Bacteriophage VWBNC_005345
Lactobacillus johnsonii prophage Lj928NC_005354
Lactobacillus johnsonii prophage Lj965NC_005355
Bacteriophage 77NC_005356
Bordetella phage BPP-1NC_005357
Sulfolobus spindle-shaped virus Ragged HillsNC_005360
Sulfolobus spindle-shaped virus Kamchatka-1NC_005361
Bordetella phage BMP-1NC_005808
Bordetella phage BIP-1NC_005809
Bacteriophage phiLC3NC_005822
Acidianus filamentous virus 1NC_005830
Human coronavirus NL63NC_005831
Ambystoma tigrinum virusNC_005832
Enterobacteria phage T1NC_005833
Agrotis segetum granulovirusNC_005839
Salmonella typhimurium bacteriophage ST104NC_005841
Enterobacteria phage P1NC_005856
Bacteriophage phiKO2NC_005857
Bacteriophage T5NC_005859
Porcine adenovirus ANC_005869
Pyrobaculum spherical virusNC_005872
Kakugo virusNC_005876
Vibriophage VP2NC_005879
Staphylococcus phage KNC_005880
Ostreid herpesvirus 1NC_005881
Burkholderia cenocepacia phage BcepMuNC_005882
Pseudomonas aeruginosa bacteriophage PaP2NC_005884
Actinoplanes phage phiAsp2NC_005885
Burkholderia cenocepacia phage BcepB1ANC_005886
Burkholderia cepacia complex phage BcepC6BNC_005887
Vibriophage VP5NC_005891
Sulfolobus turreted icosahedral virusNC_005892
Bacteriophage phiAT3NC_005893
Lymphocystis disease virus-isolate ChinaNC_005902
Neodiprion sertifer nucleopolyhedrovirusNC_005905
Neodiprion lecontei NPVNC_005906
Frog virus 3NC_005946
Bacteriophage phiMFV1NC_005964
Maize fine streak virusNC_005974
Maize mosaic virusNC_005975
Simian adenovirus ANC_006144
Cercopithecine herpesvirus 15NC_006146
Cercopithecine herpesvirus 8NC_006150
Suid herpesvirus 1NC_006151
Watermelon mosaic virusNC_006262
Sulfolobus tengchongensis spindle-shaped virus STSV1NC_006268
Human herpesvirus 5 (wildtype strain Merlin)NC_006273
Rinderpest virusNC_006296
Bovine adenovirus ANC_006324
Bacteriophage 11bNC_006356
Peste-des-petits-ruminants virusNC_006383
Simian parainfluenza virus 41NC_006428
Mokola virusNC_006429
Simian parainfluenza virus 5NC_006430
Sudan ebola virusNC_006432
Acanthamoeba polyphaga mimivirusNC_006450
Varroa destructor virus 1NC_006494
Bacteriophage B3NC_006548
Singapore grouper iridovirusNC_006549
Usutu virusNC_006551
Pseudomonas aeruginosa phage F116NC_006552
Thermoproteus tenax spherical virus 1NC_006556
Bacillus clarkii bacteriophage BCJA1cNC_006557
Getah virusNC_006558
Cercopithecine herpesvirus 2NC_006560
Lactobacillus plantarum bacteriophage LP65NC_006565
Human coronavirus HKU1NC_006577
Pneumonia virus of mice J3666NC_006579
Gallid herpesvirus 1NC_006623
Cotesia congregata virus segment Circle 1NC_006633
Cotesia congregata virus segment Circle 2NC_006634
Cotesia congregata virus segment Circle 3NC_006635
Cotesia congregata virus segment Circle 4NC_006636
Cotesia congregata virus segment Circle 5NC_006637
Cotesia congregata virus segment Circle 6NC_006638
Cotesia congregata virus segment Circle 7NC_006639
Cotesia congregata virus segment Circle 9NC_006641
Cotesia congregata virus segment Circle 10NC_006642
Cotesia congregata virus segment Circle 11NC_006643
Cotesia congregata virus segment Circle 12NC_006644
Cotesia congregata virus segment Circle 13NC_006645
Cotesia congregata virus segment Circle 14NC_006646
Cotesia congregata virus segment Circle 17NC_006648
Cotesia congregata virus segment Circle 18NC_006649
Cotesia congregata virus segment Circle 19NC_006650
Cotesia congregata virus segment Circle 20NC_006651
Cotesia congregata virus segment Circle 22NC_006653
Cotesia congregata virus segment Circle 23NC_006654
Cotesia congregata virus segment Circle 25NC_006655
Cotesia congregata virus segment Circle 26NC_006656
Cotesia congregata virus segment Circle 30NC_006657
Cotesia congregata virus segment Circle 31NC_006658
Cotesia congregata virus segment Circle 32NC_006659
Cotesia congregata virus segment Circle 33NC_006660
Cotesia congregata virus segment Circle 35NC_006661
Cotesia congregata virus segment Circle 36NC_006662
Bacteriophage S-PM2NC_006820
Murine hepatitis virus strain JHMNC_006852
Simian adenovirus 1NC_006879
Cyanophage P-SSP7NC_006882
Cyanophage P-SSM2NC_006883
Cyanophage P-SSM4NC_006884
Lactobacillus plantarum bacteriophage phiJL-1NC_006936
Bacteriophage phiJL001NC_006938
Bacteriophage KS7NC_006940
Taro vein chlorosis virusNC_006942
Mint virus 1NC_006944
Bacillus thuringiensis phage GIL16cNC_006945
Karshi virusNC_006947
Salmonella typhimurium bacteriophage ES18NC_006949
Listonella pelagia phage phiHSICNC_006953
Muledeerpox virusNC_006966
Vaccinia virusNC_006998
Macaca fuscata rhadinovirusNC_007016
Streptococcus thermophilus bacteriophage 2972NC_007019
Tupaia rhabdovirusNC_007020
Staphylococcus phage TwortNC_007021
Aeromonas phage 31NC_007022
Enterobacteria phage RB43NC_007023
Xanthomonas campestris pv. pelargonii phage Xp15NC_007024
Feline coronavirusNC_007025
Microplitis demolitor bracovirus segment GNC_007034
Microplitis demolitor bracovirus segment HNC_007035
Microplitis demolitor bracovirus segment JNC_007036
Microplitis demolitor bracovirus segment KNC_007037
Microplitis demolitor bracovirus segment MNC_007038
Microplitis demolitor bracovirus segment NNC_007039
Microplitis demolitor bracovirus segment LNC_007040
Microplitis demolitor bracovirus segment INC_007041
Microplitis demolitor bracovirus segment ONC_007044
Bacteriophage PT1028NC_007045
Bacteriophage 66NC_007046
Bacteriophage 187NC_007047
Bacteriophage 69NC_007048
Bacteriophage 53NC_007049
Bacteriophage 85NC_007050
Bacteriophage 2638ANC_007051
Bacteriophage 42eNC_007052
Bacteriophage 3ANC_007053
Bacteriophage 47NC_007054
Bacteriophage 37NC_007055
Bacteriophage EWNC_007056
Bacteriophage 96NC_007057
Bacteriophage ROSANC_007058
Bacteriophage 71NC_007059
Bacteriophage 55NC_007060
Bacteriophage 29NC_007061
Bacteriophage 52ANC_007062
Bacteriophage 88NC_007063
Bacteriophage 92NC_007064
Bacteriophage X2NC_007065
Bacteriophage G1NC_007066
Phytophthora endornavirus 1NC_007069
Burkholderia pseudomallei phage phi52237NC_007145
Vibriophage VP4NC_007149
Chrysodeixis chalcites nucleopolyhedrovirusNC_007151
Bacteriophage SH1NC_007217
Bacteriophage JK06NC_007291
Emiliania huxleyi virus 86NC_007346
Trichoplusia ni SNPV virusNC_007383
Acidianus two-tailed virusNC_007409
Shallot yellow stripe virusNC_007433
Breda virusNC_007447
Grapevine leaf roll-associated virus 2NC_007448
Enterobacteria phage L17NC_007449
Enterobacteria phage PR3NC_007450
Enterobacteria phage PR4NC_007451
Enterobacteria phage PR5NC_007452
Enterobacteria phage PR772NC_007453
J-virusNC_007454
Coliphage K1FNC_007456
Bacillus anthracis phage CherryNC_007457
Bacillus anthracis phage GammaNC_007458
Burkholderia cepacia phage Bcep176NC_007497
Bacteriophage Lc-NuNC_007501
Figure 2.

Dominance map for GS-2 of 10 kb fragments of viruses applied to a 50 × 50 SOM over 1000 iterations. The category “Bacteriophages” refers to unclassified phages. Most phages are members of the family Caudovirales. The text added to the dominance map shows the general divisions of the Poxviridae and Caudovirales which form more than one well defined dominance area.

Calculation of genome signatures

A Perl script was written to derive raw k-mer counts on FASTA-formatted databases of input sequences, using the SeqWords.pm module from BioPerl (http://www.bioperl.org/Pdoc-mirror/bioperl-live/Bio/Tools/SeqWords.html). The raw k-mer frequencies were then symmetrized, as follows: where $f_{\nu}$ and $f_{\nu\text{-comp}}$ are the raw frequencies of a k-mer ν and its complement ν-comp. Symmetrization means that a sequence and its complement will generate the same answer. The symmetrized frequencies are then corrected for the 1-mer content. For instance, for a 2-mer XY, where X and Y can each represent any nucleotide base {A, C, T, G}, the corrected value is the ratio $f^{s}_{XY} / (f^{s}_{X} f^{s}_{Y})$, where $f^{s}_{XY}$ is the symmetrized frequency for dimer XY and $f^{s}_{X}$ and $f^{s}_{Y}$ are the symmetrized frequencies of its component 1-mers. For a 3-mer XYZ, the correction would be for the 1-mers X, Y and Z, and so on. The genome signature vector for length k is thus composed of a series of ratios of observed to expected values of its component k-mers, where the expected values are determined by a zero-order Markov chain (Bernoulli series) model. Genome signatures are therefore not distorted by gross base compositional differences between genomes, which would otherwise be the dominant factor.
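The calculation described above can be sketched in a few lines. The author's scripts were written in Perl and are not reproduced in the text, so the following is only an illustrative Python stand-in. Two points are assumptions rather than details taken from the paper: the symmetrized frequency is taken to be the mean of a k-mer's frequency and that of its reverse complement, and k-mers containing characters other than A, C, G and T are ignored. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seq, k):
    """Relative frequencies of overlapping k-mers; non-ACGT k-mers are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in kmers)
    return {w: (counts[w] / total if total else 0.0) for w in kmers}


def symmetrize(freqs):
    """ASSUMPTION: average each k-mer with its reverse complement."""
    return {w: (f + freqs[reverse_complement(w)]) / 2 for w, f in freqs.items()}


def genome_signature(seq, k=2):
    """Observed/expected ratios, with expectation from the 1-mer frequencies."""
    fs_k = symmetrize(kmer_frequencies(seq, k))
    fs_1 = symmetrize(kmer_frequencies(seq, 1))
    signature = {}
    for w, observed in fs_k.items():
        expected = 1.0
        for base in w:
            expected *= fs_1[base]  # zero-order Markov (Bernoulli) expectation
        signature[w] = observed / expected if expected else 0.0
    return signature
```

For k = 2 this returns 16 ratios, of which a k-mer and its reverse complement always share the same value, so a sequence and its reverse complement yield identical signatures under the stated assumption.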

Self-organizing map

Self-organizing maps (SOMs) were run following Tamayo et al. (1999), using a Perl script. Input consisted of an array of the genome signatures generated as described above. The dimensions of the SOM and the number of iterations in training were variables entered by the user. Euclidean distances were used when comparing vectors. Once the dimensions of the SOM were set, x columns by y rows, weight vectors initializing each of the xy cells of the SOM were selected at random from the entire set of genome signature data vectors. The SOM is thus initially simply filled with a random subset of the data.

Training then commences, for the nominated t iterations. At each iteration m, each data vector in turn is compared to each weight vector, and the closest weight vector for each data vector is designated the winning weight vector of that data vector in that iteration. Each time a winning weight vector is identified, the winning weight vector, and the weight vectors of cells within a spatial range ℜ on the SOM, are trained by the data vector as follows. Each value c in the winning weight vector w is altered, so that its value at iteration m becomes at the next iteration m+1: $w_{m+1} = w_{m} - \tau\,(w_{m} - v)$, where $w_{m} - v$ represents the difference between the winning weight vector and the data vector for each value c along the vectors. In other words, one simply aligns the data vector and the winning weight vector and subtracts them. Each value of the winning weight vector is then altered to bring it closer to the data vector by a factor of τ, the training effect, which is derived as follows: τ changes at each iteration of the process, and is the ratio of two other values α and γ. α is calculated for each iteration m as follows: where m is the number of the current iteration, and t the number of total iterations requested. Therefore, the number of iterations of the SOM, a parameter chosen at the start of the process, determines the gradient at which α will decrease as the iterations progress.

Whereas α is the same for all cells in the SOM and changes according to the iteration number only, γ is the Euclidean distance on the SOM between the winning weight vector and the weight vector being trained within range ℜ. τ can therefore be seen to decrease as the SOM progresses, since α decreases, and also to decrease the further one goes away from the winning weight vector, since γ increases. The range within which weight vectors are trained at each iteration is calculated: where S is the length or breadth of the SOM, whichever is the smaller. The area of the SOM being trained therefore also shrinks as α decreases with increasing iterations. Once each data vector has found its winning weight vector and trained it, also training the weight vectors within range ℜ of the winning weight vector, then one iteration is completed. New values of α, τ and ℜ are then calculated, and the second iteration can commence.

It can be intuitively grasped that there is a great deal of “churn” in initial iterations of the SOM. When α is close to 1, data vectors will effectively change their winning weight vector to copies of themselves. Only at the limits of the trained area ℜ will the effect be subtler. However, as the number of iterations mounts, α will decrease and each data vector will have a relatively weaker effect on its winning weight vector and even less on those weight vectors in its vicinity. Observation (data not shown) of the distribution of a simple data set over a SOM through the iterative process shows that a relatively chaotic process dominates until approximately halfway through the nominated number of iterations, at which point structure rapidly builds in the SOM. The final 10% or so of iterations consist mostly of fine-tuning of the final weight vector values.

Training SOMs can also be time consuming, especially for large data sets of high dimensionality vectors trained over large numbers of iterations. The longest run presented here (that in Fig. 2) took in excess of 3 weeks on a single 2.8 GHz Intel processor under a Linux operating system. One of the major motivations of this paper was to define ways to reduce SOM training time without losing accuracy or sensitivity.

After the final iteration, each data vector is again compared to each weight vector and assigned to the closest. This results in partition of each data vector to one cell in the SOM, thus spreading the multi-dimensional data across the two-dimensional surface of the SOM. Conversely, each final weight vector in the SOM is assigned to its closest data vector, the centroid nearest neighbour (cnn). If the data vectors belong to several categories, each cell in the SOM can be colored according to the origin of its cnn, which is then said to dominate that cell in the SOM. This allows the production of color-coded dominance maps indicating the general spread of the data vector set over the SOM. NCBI taxonomic categories were used throughout, except for herpesviruses where the International Committee on the Taxonomy of Viruses (ICTV) usage is followed (Davison 2002; Davison et al. 2005; Fauquet et al. 2005).
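The training loop described in this section can be sketched as follows. This is an illustrative Python stand-in for the author's Perl script, not a reproduction of it. The formulas for α, τ and ℜ are not recoverable from the text, so the sketch substitutes simple forms that match the verbal description: α falls linearly from 1 as iterations proceed, τ is α divided by (1 + γ) so that the winning cell itself (γ = 0) does not cause a division by zero, and ℜ is α times half the smaller SOM dimension S. All three are assumptions, as are the function names.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_som(data, cols, rows, t, seed=0):
    """Train a cols x rows SOM for t iterations; returns {(x, y): weight vector}."""
    rng = random.Random(seed)
    # Initial weight vectors are random picks from the data vectors.
    weights = {(x, y): list(rng.choice(data))
               for x in range(cols) for y in range(rows)}
    s = min(cols, rows)
    for m in range(t):
        alpha = 1.0 - m / t        # ASSUMED schedule: linear decay from 1
        radius = alpha * s / 2.0   # ASSUMED training range, shrinking with alpha
        for v in data:
            # Winning cell: weight vector closest to this data vector.
            win = min(weights, key=lambda cell: euclidean(weights[cell], v))
            for cell, w in weights.items():
                gamma = math.hypot(cell[0] - win[0], cell[1] - win[1])
                if gamma > radius:
                    continue
                tau = alpha / (1.0 + gamma)  # ASSUMED form of the alpha/gamma ratio
                for c in range(len(w)):
                    w[c] -= tau * (w[c] - v[c])  # move toward the data vector
    return weights


def partition(data, weights):
    """Assign each data vector to the cell holding its closest weight vector."""
    return [min(weights, key=lambda cell: euclidean(weights[cell], v)) for v in data]
```

Because every cell is compared with every data vector at every iteration, the cost grows with the product of the number of cells, the number of data vectors, the vector length and t, which is consistent with the long run times reported for large maps.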

Availability of scripts

All Perl scripts for processing genomes, calculating genome signatures, and running SOMs are available on request from the author (d.gatherer@mrcvu.gla.ac.uk).

Results

SOMs on large sequence datasets

The ability of SOMs to distinguish the origin of fragments of DNA based on their genome signatures was initially tested using GS-2 (see Methods, section 2, above) measured over fragments of 100 kb. At the time of analysis there were 79 eukaryotic, 156 eubacterial, 30 archaeal and 122 viral genomes with more than 100 kb of sequence each (Table 2). The dimension of the SOM was 50 × 50 and 100 iterations were used. At the end of the iterations, dominance areas (see Methods, section 3, above) were used to color the SOM. For the entire data set, “all life” in Figure 1, the superkingdoms of archaea, eubacteria and eukaryota were chosen, along with the unranked category of viruses. Within each of the SOMs applied to the superkingdoms and the viruses, the next level down was used for coloring dominance maps. This is the phylum level in the archaea and eubacteria, and the family level in the viruses. In the eukaryota, the relative scarcity of completely sequenced genomes required a more ad hoc classification.

When all input sets are pooled, GS-2 produces a SOM in which eubacterial sequences cluster together (Fig. 1; “All life”, green). Archaeal sequences are split into several groups that are situated along the boundary between the eubacteria and the eukaryotes. Likewise, viral sequences are split into one group in the top left corner and other clusters along the eubacterial-eukaryotic border. It is evident that this “all life” SOM does not contribute to the issue of the phylogeny of the three superkingdoms, except to underline that archaea are not derivatives of either eukaryotes or eubacteria.

When the SOM is confined to archaeal sequences (Fig. 1; “Archaea”), those genomes designated “unclassified” by NCBI are located well within the territory of the Euryarchaeota, strongly suggesting that they belong to this phylum. In general the archaeal inter-phylum boundaries are clear, although the Crenarchaeota are split into two clusters. The predominance of Euryarchaeota in terms of area is a reflection of the larger number of complete genomes in that phylum. Likewise, in the eukaryotes (Fig. 1; “Eukaryota”), the large size of the human genome contributes to a large area dominated by the Vertebrata. It should be remembered that the classification in the eukaryotes is ad hoc owing to the relatively small number of complete genomes. However, it is interesting that the boundaries between the dominance areas are as distinct as those in the archaea.

The situation is considerably more complicated within the eubacteria (Fig. 1; “Eubacteria”), the superkingdom with the greatest number of completely sequenced genomes. Some eubacterial phyla are rather fragmented in their dominance areas. For instance, the phylum Firmicutes occupies several partly adjacent areas. The phylum Deinococcus has two small and rather distant dominance areas, and the Bacteroidetes and Spirochaetes both have small outlying fragments. The Proteobacteria dominate the right side of the SOM and penetrate between the various groups on the left side. The overall impression is of less clear-cut differences in GS-2 between phyla in eubacteria than in eukaryotes or archaea.

A similar situation is observed in the SOM on viral sequences (Fig. 1; “Viruses”). A few viral families, such as the Baculoviridae, the family Mimivirus and the Nimaviridae do manage coherent dominance areas, but all others are extensively mixed. The Baculoviridae are the only family of any size that maintains a distinctive dominance area. This basic illustration of the SOM in action demonstrates that for a single parameter set, namely a 50 × 50 SOM and 100 iterations, different phylogenetic groups exhibit variable degrees of partition across the SOM.
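The coloring of dominance areas used for these maps follows the centroid nearest neighbour (cnn) rule given in the Methods, and can be sketched as below. This is a hypothetical Python illustration rather than the author's Perl code. It assumes a trained SOM held as a dictionary mapping each cell to its weight vector, with one taxonomic label per data vector; the function names and the toy values in the usage note are invented for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dominance_map(weights, data, labels):
    """Label each SOM cell with the category of its centroid nearest neighbour."""
    dominance = {}
    for cell, w in weights.items():
        # cnn: index of the data vector closest to this cell's weight vector.
        cnn = min(range(len(data)), key=lambda i: euclidean(w, data[i]))
        dominance[cell] = labels[cnn]
    return dominance


def dominance_areas(dominance):
    """Number of cells dominated by each category."""
    return Counter(dominance.values())
```

With four toy cells and two labeled data vectors, for example, `dominance_areas(dominance_map(weights, data, labels))` gives the number of cells each category would color. As the text notes for the Euryarchaeota and Vertebrata, such areas partly reflect how many data vectors each category contributes.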

Increased resolution SOM on viruses

To increase the resolution of the SOM against viral sequences, GS-2 was reapplied to viral sequences only, using 10 kb fragments. This enables a larger number of viral genomes to be analysed, up from 122 to 579, as genomes of 10 kb or more can be included (Table 3). The number of iterations was increased to 1000. The resulting dominance map is shown in Figure 2. When viral sequences alone are considered at higher resolution, the SOM becomes very complex. The family level classification is maintained for the dominance map but there are now more families, since viruses as small as 10 kb are eligible. Perhaps the most salient feature is that Poxviridae are divisible into sheep/goat pox viruses and others (Fig. 2: “sheep/goat” and “other pox”). Additionally, phages within the family Caudovirales tend to occupy four major areas of the SOM, two of which belong to the mycophages (Fig. 2: “myco-ϕ”, “entero-ϕ” and “cocco-ϕ”). Again the Baculoviridae form a noticeably large and coherent cluster. Herpesviridae, by contrast, are spread across the entire map. Herpesviridae (Table 1) are next considered alone under the same conditions as in Figure 2. Dominance maps for this narrower selection are shown in Figure 3. Figure 3 shows that when family-level taxonomy is considered within herpesviruses, GS-2 distinguishes the ostreid herpesviruses and the ictaluriviruses as two fairly homogeneous blocs distinct from the Alloherpesviridae (Davison, 2002), comprising the alpha, beta and gamma families. At the genus level, Muromegalovirus alone forms a nearly contiguous bloc, although Mardivirus nearly does so. The remaining genera, like the families, are considerably mixed across the SOM. Like the wide spread of herpesvirus signatures across the viral SOM, this is a reflection of the degree of sequence heterogeneity within the Herpesviridae. 
The three figures presented above demonstrate that the SOM is an intriguing tool for the conceptualisation of relationships between genome signatures. However, the evident complexity of some of the topographical arrangements raises serious questions concerning its utility as a diagnostic tool for phylogeny. Therefore, some experiments are described which address this issue in a quantitative way.

Effect of length of k-mer used to generate genome signature

To investigate whether genome signatures of longer k give better resolution than k = 2, 10 kb herpesvirus sequences were processed into genome signatures from GS-2 to GS-6 and the SOM was trained for 100 iterations (Fig. 4). On first inspection, it does not appear that a higher genome signature provides any better resolution than a lower one. The GS-3 SOM was also run on a 20 × 20 map, but again this produces no major change to the overall pattern. In all cases, ostreid herpesvirus and ictalurivirus have coherent dominance areas on the SOM. At GS-5, alpha herpesviruses also have a coherent dominance area, but this disappears again at GS-6. To investigate further this apparent lack of improvement at higher values of k, the density of sequences of each family was plotted onto the SOM (Fig. 5). Instead of the dominance map approach, in which each cell is colored according to the affiliation of its cnn (Figs. 1–4 are all of this type), cells in which more than 95% of allocated sequences are of a single type are colored red, and those with fewer than 5% of that type are white. Cells between these two extremes are colored yellow. A ratio is then produced of red-to-yellow in each SOM. A perfectly partitioned SOM will therefore have a ratio of infinity, indicating no mixed cells, or more accurately no cells with greater than 5% mixture of the “wrong” family.
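The coloring rule and the red-to-yellow ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the per-cell count dictionaries, the treatment of empty cells as white, and the choice to count only sequences of the family in question are all assumptions made for the example.

```python
import math

def density_colour(n_family, n_total, hi=0.95, lo=0.05):
    """Colour one SOM cell by the fraction of its sequences from one family."""
    if n_total == 0:
        return "white"  # assumption: a cell with no sequences is left white
    frac = n_family / n_total
    if frac > hi:
        return "red"
    if frac < lo:
        return "white"
    return "yellow"

def red_to_yellow_ratio(cells, family):
    """Ratio of the family's sequences in red cells to those in yellow cells.

    cells: list of dicts, one per SOM cell, mapping family label -> count.
    Returns infinity when no yellow (mixed) cells contain the family.
    """
    red = yellow = 0
    for counts in cells:
        n_family = counts.get(family, 0)
        colour = density_colour(n_family, sum(counts.values()))
        if colour == "red":
            red += n_family
        elif colour == "yellow":
            yellow += n_family
    return math.inf if yellow == 0 else red / yellow

# Hypothetical four-cell map, for illustration only.
cells = [
    {"alpha": 20},
    {"alpha": 10, "beta": 10},
    {"beta": 30},
    {"alpha": 1, "beta": 99},
]
print(red_to_yellow_ratio(cells, "alpha"))
```

In this toy map the alpha family has 20 sequences in a red cell and 10 in a yellow cell; the single alpha sequence in the last cell falls below the 5% threshold, so that cell is white for alpha and does not enter the ratio.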
Figure 4.

Dominance maps of herpesvirus families, illustrating the effect of varying GS values using 10 kb herpesvirus sequences, on a 10 × 10 SOM (except for GS-3 at 20 × 20) over 100 iterations.

Figure 5.

The density of herpesviral sequences, classed by family, on a 10 × 10 SOM after 100 iterations. >95% density: red; 5%–95% density: yellow; <5% density: white. The figure in each box is the ratio of sequences in red to yellow areas of the SOM.

Figure 5 demonstrates that family level taxonomy is better determined at higher GS in all five families of herpesviruses. The ratio of high alpha-density (>95%, red) to medium alpha-density (5% to 95%, yellow) increases from 0.88 to 2.83 as the GS increases from 2 to 4. The corresponding increases for the beta and gamma families are from 0.52 to 2.33 and from 1.91 to 2.11 respectively. For the ostreid herpesviral sequences, perfect partition is reached at GS-4 and for the ictalurid viruses at GS-3. This is probably a reflection of the presence of a single virus in each of these categories with a correspondingly lower number of sequences analysed.

Effect of length of training phase of SOM

It is therefore apparent that genome signatures of longer k produce some improvement in the accuracy of the final partition on the SOM. However, longer k results in longer data vectors, whose length increases fourfold with each increment of k, and therefore in much slower training of the SOM. One way to speed training of the SOM is simply to reduce the number of training cycles. The effect of the number of iterations on density of each family is displayed in Figure 6.
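To make the growth in vector length concrete, the sketch below builds a raw k-mer frequency vector and prints the number of components for GS-2 to GS-6, assuming one component per possible k-mer over A, C, G and T. It is illustrative only: the function name is hypothetical, and the vector is a plain single-strand frequency table without the correction for compositional bias that a genome signature applies.

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Raw k-mer frequency vector over A, C, G, T (no bias correction)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]

# Vector length is 4**k, so each increment of k quadruples the input size.
for k in range(2, 7):
    print(f"GS-{k}: {4 ** k} components")
```

Under this assumption the vector grows from 16 components at k = 2 to 4096 at k = 6, which is why each distance calculation during SOM training becomes correspondingly more expensive.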
Figure 6.

The density of herpesviral sequences, classed by family, on a 10 × 10 SOM of GS-2, run over a varying number of iterations, i. >95% density: red; 5%–95% density: yellow; <5% density: white. The figure in each box is the ratio of sequences in red to yellow areas of the SOM.

Figure 6 shows that increasing the number of iterations has a mixed effect on the density of family sequences. The alpha herpesviral sequences increase in density from 0.92 to 1.35 as the number of iterations increases from 10 to 1000, and the beta herpesviruses from 0.52 to 0.83. The ostreid herpesviral sequences are also perfectly clustered at 100 iterations. However, the gamma and ictalurid sequences are more poorly partitioned at higher numbers of iterations.

Jack-knifing analysis

Figures 1–6 provide a largely qualitative impression of the effectiveness of SOMs in correctly assigning the origins of DNA sequences based on their genome signature. To provide a more quantitative assessment of the parameters of the process, a jack-knifing analysis was carried out. All herpesviral sequences were divided randomly into two groups. Genome signatures and SOMs were constructed as appropriate using one half. Then the remaining half was applied to the SOM to predict their origin at the family and genus level. To make a prediction concerning the origin of a data vector, the Euclidean distances between that vector and all of the weight vectors of the preconstructed SOM are calculated. The origin of the nearest weight vector is taken to be the classification of the data vector being tested. Where a data vector falls into a cell on the SOM containing none of the original data vectors used to construct the SOM, its origin is deemed to be “undecided” (Fig. 7). When SOM size is varied for GS-2 at 100 iterations (Fig. 7, top left table), SOMs of greater than 10 × 10 introduce considerable uncertainty into the assignment. However, for those sequences that can be assigned, 95% accuracy at the subfamily level is achieved in a 50 × 50 SOM. Likewise, a 30 × 30 SOM gives 94% accuracy at the genus level. When SOM size is held at 10 × 10 and the signature length at GS-2 and the number of iterations is varied (Fig. 7, lower left table), there is little effect on the sensitivity. At the subfamily level, there are never more than 4.4% of sequences that cannot be assigned, and never more than 7.2% at the genus level. Where sequences can be assigned, optimal accuracy is achieved at 1000 or 5000 iterations, but the variation in accuracy is low. Increasing the iterations from 10 to 5000 only gives a 4% increase in accuracy of assignment at the subfamily level. When 100 iterations are used and the SOM size is held at 10 × 10 (Fig. 7, top right table), GS-4 or GS-5 appear to be optimal.
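The assignment step of the jack-knife can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used here: `classify`, `score`, the `weights` and `cell_labels` dictionaries and the toy data are all hypothetical, and the scoring treats sensitivity as the fraction of held-out sequences that receive any assignment and accuracy as the fraction correct among those assigned, which is one reading of how the terms are used in the text.

```python
import math

def classify(vector, weights, cell_labels):
    """Label a held-out vector by its nearest SOM weight vector.

    weights: dict mapping cell -> weight vector (list of floats).
    cell_labels: dict mapping cell -> label derived from the training half;
                 cells that received no training vectors are absent.
    """
    best = min(weights, key=lambda cell: math.dist(vector, weights[cell]))
    return cell_labels.get(best, "undecided")

def score(predictions, truths):
    """Return (sensitivity, accuracy); assumes at least one prediction."""
    assigned = [(p, t) for p, t in zip(predictions, truths) if p != "undecided"]
    sensitivity = len(assigned) / len(predictions)
    if not assigned:
        return sensitivity, math.nan
    accuracy = sum(p == t for p, t in assigned) / len(assigned)
    return sensitivity, accuracy

# Toy three-cell map; cell (1, 0) attracted no training vectors.
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0], (1, 0): [0.0, 1.0]}
cell_labels = {(0, 0): "alpha", (0, 1): "beta"}
held_out = [[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.2, 0.0]]
truths = ["alpha", "beta", "gamma", "beta"]

predictions = [classify(v, weights, cell_labels) for v in held_out]
sensitivity, accuracy = score(predictions, truths)
print(predictions, sensitivity, accuracy)
```

The toy example also shows why a larger map can lose sensitivity: the more cells there are relative to training vectors, the more cells remain unlabeled, and any held-out vector landing in one of them is returned as undecided.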

Discussion

Genome signatures provide a summary of the k-mer content of a genome, corrected for compositional bias. Various studies in a wide range of species have revealed that genome signatures are generally constant within genomes and similar in related genomes (Karlin and Ladunga, 1994; Karlin et al. 1998; Gentles and Karlin, 2001). The extent to which this is a phenomenon of neutral drift or one of active conservation is unknown. It is intuitively obvious that two identical genomes will have identical genome signatures, and that as they diverge the genome signatures will also diverge. Indeed this is the basis of at least one bioinformatical tool that assesses sequence relatedness (Li et al. 2001; Li et al. 2002). However, various suggestions have been made for conservative selection pressures which would act to maintain genome signature similarity in related organisms, including dinucleotide stacking energies, curvature, methylation, superhelicity, context-dependent mutation biases and effects deriving from related replication machinery (Karlin and Burge, 1995; Blaisdell et al. 1996). If these factors are similar within a clade, they might act as a brake on genome signature divergence. The conservation of genome signatures within genomes (which is what originally gave rise to the term “signature” in this context) would tend to suggest that signatures do not drift neutrally, at least within genomes. Figure 1 demonstrates that at the phylum level within the three superkingdoms of cellular life, satisfactory partition of GS-2 can be obtained by the SOM. However, this is less true for eubacteria than it is for eukaryotes and archaea. At the family level in viruses the picture is considerably more confused, with only the Baculoviridae demonstrating anything like territorial coherence on the SOM at GS-2 (Figs. 1 and 2). This may well be a reflection of speed of substitution in viral genomes. 
However, at the species level, the same coherence within genomes as found in cellular organisms may well be the norm. For instance, when the ostreid and ictalurid herpesvirus families are included in a SOM with the Alloherpesviridae, these two families, both represented by a single viral genome, have strongly discrete areas on the SOM (Figs. 3 and 4). This does not mean that genome signatures are not diagnostic tools for phylogenetic assignment at the family and subfamily level in herpesviruses, merely that the results should be interpreted with caution. The use of higher values of k appears to have a marginal effect on improving the discrete distribution of family-level herpesviral signatures on the SOM (Fig. 5) but jack-knifing indicates that this does not improve above k = 5 (Fig. 7). The effects of larger dimension SOMs and increased iterations are ambiguous at best. Optimal values appear to be around GS-4 or GS-5 with 500 to 1000 iterations of the SOM. The size of the SOM might be varied, with an initial run at high dimension (e.g. 50 × 50) followed by a lower dimension run (e.g. 10 × 10) for sequences unassigned by the first run (Fig. 7). The use of genome signatures in the identification of pathogenicity islands is by now well established (Karlin, 1998; Karlin, 2001; Dufraigne et al. 2005). They are valuable in this context in that they indicate regions within genomes that have characteristics different from the rest of the genome. However, it is apparent from the present work that it is difficult on the basis of genome signatures to accurately identify the origin of the exogenous DNA. A BLAST search is more likely to generate informative hits in this context. Nevertheless, for sequences that cannot be precisely identified on the basis of alignment-based methods such as BLAST, genome signatures with SOMs hold out the prospect of identification of origin to a reasonable level. 
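The two-stage use of SOM size suggested above, a high dimension run first and a lower dimension run for whatever remains unassigned, amounts to a simple cascade. The sketch below is only an outline of that idea: `cascade_classify` is a hypothetical helper, and the two lambda functions are toy stand-ins for trained 50 × 50 and 10 × 10 maps, not real classifiers.

```python
def cascade_classify(vector, classifiers):
    """Try each classifier in order and return the first decided label."""
    for classify in classifiers:
        label = classify(vector)
        if label != "undecided":
            return label
    return "undecided"

# Toy stand-ins: the "large" map only commits on clear cases,
# while the "small" map always returns some label.
large_som = lambda v: "alpha" if v[0] > 0.9 else "undecided"
small_som = lambda v: "alpha" if v[0] > 0.5 else "beta"

for vec in ([0.95], [0.6], [0.2]):
    print(vec, cascade_classify(vec, [large_som, small_som]))
```

Ordering the classifiers from large to small means the more accurate but less sensitive map is consulted first, and the smaller map is used only as a fallback.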
The optimization of SOM parameters reported here may also extend to other applications of SOMs. Of particular interest in bioinformatics is their use for the analysis of microarray data. The experimental design would be the same, with a standard microarray data set (e.g. the breast cancer data provided by Reid et al. 2005) substituting for the genome signature arrays. Dominance mapping would be done by clinical outcome, and jack-knife analysis could test the accuracy and sensitivity of assignment of that outcome.
  46 in total

Review 1.  Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes.

Authors:  S Karlin
Journal:  Trends Microbiol       Date:  2001-07       Impact factor: 17.079

2.  Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome.

Authors:  S Kanaya; M Kinouchi; T Abe; Y Kudo; Y Yamada; T Nishi; H Mori; T Ikemura
Journal:  Gene       Date:  2001-10-03       Impact factor: 3.688

3.  A genomic schism in birds revealed by phylogenetic analysis of DNA strings.

Authors:  Scott V Edwards; Bernard Fertil; Alain Giron; Patrick J Deschavanne
Journal:  Syst Biol       Date:  2002-08       Impact factor: 15.683

4.  Limits of predictive models using microarray data for breast cancer clinical treatment outcome.

Authors:  James F Reid; Lara Lusa; Loris De Cecco; Danila Coradini; Silvia Veneroni; Maria Grazia Daidone; Manuela Gariboldi; Marco A Pierotti
Journal:  J Natl Cancer Inst       Date:  2005-06-15       Impact factor: 13.506

5.  Identification of a new motif on nucleic acid sequence data using Kohonen's self-organizing map.

Authors:  P Arrigo; F Giuliano; F Scalia; A Rapallo; G Damiani
Journal:  Comput Appl Biosci       Date:  1991-07

6.  What drives codon choices in human genes?

Authors:  S Karlin; J Mrázek
Journal:  J Mol Biol       Date:  1996-10-04       Impact factor: 5.469

7.  Similarity of the general designs of protochordates and invertebrates.

Authors:  G J Russell; J H Subak-Sharpe
Journal:  Nature       Date:  1977-04-07       Impact factor: 49.962

8.  Molecular classification of cancer: unsupervised self-organizing map analysis of gene expression microarray data.

Authors:  David G Covell; Anders Wallqvist; Alfred A Rabow; Narmada Thanki
Journal:  Mol Cancer Ther       Date:  2003-03       Impact factor: 6.261

9.  Laterally transferred elements and high pressure adaptation in Photobacterium profundum strains.

Authors:  Stefano Campanaro; Alessandro Vezzi; Nicola Vitulo; Federico M Lauro; Michela D'Angelo; Francesca Simonato; Alessandro Cestaro; Giorgio Malacrida; Giulio Bertoloni; Giorgio Valle; Douglas H Bartlett
Journal:  BMC Genomics       Date:  2005-09-14       Impact factor: 3.969

10.  Detection and characterization of horizontal transfers in prokaryotes using genomic signature.

Authors:  Christine Dufraigne; Bernard Fertil; Sylvain Lespinats; Alain Giron; Patrick Deschavanne
Journal:  Nucleic Acids Res       Date:  2005-01-13       Impact factor: 16.971

