Literature DB >> 28293072

An Analysis of Adenovirus Genomes Using Whole Genome Software Tools.

Abstract

The evolution of sequencing technology has lead to an enormous increase in the number of genomes that have been sequenced. This is especially true in the field of virus genomics. In order to extract meaningful biological information from these genomes, whole genome data mining software tools must be utilized. Hundreds of tools have been developed to analyze biological sequence data. However, only some of these tools are user-friendly to biologists. Several of these tools that have been successfully used to analyze adenovirus genomes are described here. These include Artemis, EMBOSS, pDRAW, zPicture, CoreGenes, GeneOrder, and PipMaker. These tools provide functionalities such as visualization, restriction enzyme analysis, alignment, and proteome comparisons that are extremely useful in the bioinformatics analysis of adenovirus genomes.

Entities: CellLine Chemical Disease Gene Species

Keywords: Adenovirus; Bioinformatics; CoreGenes; GeneOrder; Genome; Genomics; Percent identity; Restriction enzyme; Software

Year: 2016 PMID： 28293072 PMCID： PMC5320926 DOI： 10.6026/97320630012301

Source DB: PubMed Journal: Bioinformation ISSN： 0973-2063

Background

Human adenoviruses (HAdVs) were first discovered in human adenoid tissue in the 1950s [1]. Since then, many different HAdVs have been identified. HAdVs, like all adenoviruses, possess double stranded DNA genomes [2]. The size of the HAdV genome is approximately 35 kb. There are seven species of HAdVs (A through G) and each species consists of different HAdV types. HAdVs cause many diseases such as respiratory disease, conjunctivitis, and gastroenteritis. HAdVs are classified based on several criteria including serum neutralization assays, restriction enzyme analysis (REA), hemagglutination, phylogenetic analysis, and whole genome analysis [3]. The improvement of genome sequencing technology has revolutionized the field of genomics and this impact has certainly been felt on adenovirus genomics. The number of HAdV genomes that have been sequenced has increased at an incredible rate. Bioinformatics analysis can be applied to whole genomes in order to distinguish between HAdVs and gain insight into their evolution [4]. Indeed, whole genome sequence analysis has emerged as the gold standard for the classification of HAdVs [5]. In order to perform whole genome bioinformatics analysis on HAdVs, the appropriate whole genome software tools must be used. These myriad tools vary from standalone to web-based programs. Some of these tools include Artemis, EMBOSS, pDRAW, zPicture, CoreGenes, GeneOrder, and PipMaker (Table 1). The use of these whole genome tools in describing the relationships between HAdVs is presented here. In addition, the use of whole genome percent identities and the use of inverted terminal repeats (ITRs) as techniques to complement these tools and to describe the relatedness between HAdVs are also explored.

Table 1

The names and locations of the various bioinformatics tools used to analyze adenovirus genomes.

Tool	Location
Artemis	http://www.sanger.ac.uk/science/tools/artemis
% identity analysis (EMBOSS)	http://emboss.sourceforge.net/
pDRAW for virtual REA	http://www.acaclone.com
ClustalO for ITR analysis	http://www.ebi.ac.uk/Tools/msa/clustalo/
zPicture	http://zpicture.dcode.org
CoreGenes	http://binf.gmu.edu:8080/CoreGenes3.5
GeneOrder	http://binf.gmu.edu:8080/GeneOrder4.0
PipMaker	http://pipmaker.bx.psu.edu/pipmaker/

Methodology

Artemis

Artemis is a genome browser that can be used to annotate genomes [6]. It can be downloaded from http://www.sanger.ac. uk/resources/software/artemis/. In addition to annotation, Artemis can be used to view and compare annotated genomes. These genomes can be downloaded from GenBank (www.ncbi.nlm.nih.gov) and examined in Artemis. Of particular value is the ability to compare two HAdV genomes by opening up two instances of Artemis and laying the windows on top of each other. This technique allows the comparison of individual proteins and their corresponding nucleotide sequences. This makes it easy to spot mutations such as mis-sense and non-sense mutations. For example, the genome of the HAdV-B3 GB prototype strain is 98% identical to the HAdV-B3 NHRC 1276 field strain. Despite this high level of sequence identity, there are differences in some proteins that can be spotted using Artemis. For example, as described previously [7], a 20.6 kDa protein in the E2B region is found in HAdV-3 GB (Figure 1A), but using Artemis, this protein can be seen to be severely truncated by 1 stop codon in the HAdV-3 NHRC 1276 field strain (Figure 1B).

Figure 1

(A) Artemis view of a 20.6 kDa protein in HAdV-B3 GB; (B) Artemis view of the same protein in HAdV-B3 NHRC1276 truncated by a stop codon which is symbolized by the “+” sign.

Whole genome percent identity analysis

It is very useful to know how closely related adenoviruses are to each other. One way of determining this is to examine the whole genome nucleotide percent identity of an adenovirus genome and compare it to the percent identity of another adenovirus genome. This can be accomplished using the EMBOSS package [8] which contains programs that perform pairwise alignment of two sequences and output the percent identity between them. Specifically, the two programs are called needle and stretcher. Needle performs a classic Needleman-Wunsch global alignment of two sequences, while stretcher uses a modified version of the same algorithm to deal with longer sequences. An example of the utility of using percent identity in comparing HAdVs can be seen in the case of HAdV-G52 which is associated with gastroenteritis [9]. There was debate as to whether HAdV-G52 was a new type of HAdV or whether it belonged to the existing HAdV-F species which are also associated with gastroenteritis [10]. One piece of evidence that argues that HAdV-G52 is indeed a new type is whole genome nucleotide percent identity of this genome compared to the genomes of SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41. These percent identities are shown in Table 2. The HAdV-F40 and HAdV-F41 genomes have significantly lower percent identities when compared to SAdV-1 and SAdV-7. This suggests that HAdV-G52 is more closely related to the simian adenoviruses SAdV-1 and SAdV-7 than to HAdV-F40 and HAdV-F41.

Table 2

Percent identities of SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41 compared to HAdV-G52.

HAdV type (GenBank accession #)	% identity to HAdV-G52
HAdV-G52 (DQ923122)	100
SAdV-1 (NC_006879)	95.5
SAdV-7 (DQ792570)	82.9
HAdV-F40 (NC_001454)	69.1
HAdV-F41 (DQ315364)	69.2

In addition to downloading the EMBOSS package (http://emboss.sourceforge.net/), the needle and stretcher programs are also available online. Needle is available at http://www.ebi.ac.uk/Tools/psa/ emboss_needle/ and stretcher is available at http://www.ebi.ac.uk/ Tools/psa/ emboss_ stretcher/.

Virtual restriction enzyme analysis

REA has been used for a long time as an inexpensive and quick way to distinguish between HADV strains belonging to a certain type. For example, twelve restriction enzymes were used to distinguish between numerous strains of HAdV-3 obtained from Africa, Asia, Australia, Europe, North America, and South America [11]. REA has also been useful in distinguishing between HAdV types associated with outbreaks of lower respiratory tract infections in children [12]. With the increasing number of HAdV genomes that are available, it can be very useful to perform a virtual REA using these genomes. Since the whole genome is available, it is unnecessary to extract DNA and perform REA in the lab. The program pDRAW (www.acaclone.com) can perform REA on HAdV genomes using a wide variety of restriction enzymes. A virtual gel plot is then produced so that the results can be viewed and analyzed. Virtual REA can be used to determine differences between HAdVs. For example, HAdV-B3 GB is 98% identical to HAdV-B3 NHRC 1276. The whole genome percent identity alone may suggest only a few differences between these HAdVs. However, when a virtual REA is done on these two genomes, it can be seen that they are distinct from each other. Figure 2 shows a virtual REA gel plot produced by pDRAW using the BclI enzyme for these two genomes. Lane 2 corresponds to HAdV-B3 and lane 3 corresponds to HAdV-B3 NHRC 1276. The restriction patterns are quite different between the two strains, despite their percent identity being very high.

Figure 2

Virtual restriction enzyme analysis of HAdV-3 strains using BclI. The standards lane is labeled “1.” Lane 2 is HAdV-B3 GB and lane 3 is HAdV-B3 NHRC 1276.

Inverted terminal repeats

The ITRs of adenoviruses are located at both ends of the linear double-stranded DNA genome. The ITRs are essential for viral DNA replication because they contain sequence motifs that serve as binding sites for cellular and viral proteins [13]. One sequence motif is the “core” origin of replication (ATAATATACC), which is highly conserved in mastadenoviruses. This site binds the preterminal protein-DNA polymerase heterodimer [14]. The analysis of ITRs can be used to distinguish between HAdV types as will be seen in the case of HAdV-G52, SAdV-1, SAdV-7, HAdVF40, and HAdV-F41. Figure 3 shows an alignment of the ITRs from these adenoviruses using ClustalW [15] (Please note that the ClustalW server has been replaced by ClustalO at EBI). The core origin is perfectly conserved in the ITRs (nucleotides 9-18), as shown in the boxed region. The ITRs also contain transcription factor binding sites to which cellular factors bind, which may reflect cell tropism. One of these is the NFI site (TGGAAACGTGCCAA), which is highly conserved between HAdV-G52, SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41. The NFI site is identical between HAdV-52 and the simian adenoviruses. Similarly, the site is exactly the same between the two HAdV-F species adenoviruses. The NFIII site (TATGATAAT) is identical between the five adenoviruses. The host provided NF1 and NFIII transcription factors serve to enhance adenovirus replication [16].

Figure 3

Alignment of ITRs from HAdV-G52, SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41. The boxed region consists of a motif that is highly conserved in mastadenoviruses. The uppercase bold sequences correspond to NFI binding sites, the underlined sequences correspond to NFIII sites, the bold italic sequences correspond to SP1 sites, and the lowercase bold sequences correspond to ATF sites.

Two putative SP1 sites (denoted by bold italics in Figure 3) are also present in the SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41 ITRs. One of these SP1 sites is not found in the HAdV-G52 ITR because it is significantly shorter (84 nucleotides) than the other ITRs. However, when the HAdV-G52 ITR is extended to 210 nucleotides, this SP1 site is present (Figure 4). The ATF site (TGACGT) is present in all of the five analyzed ITRs. Interestingly, there is an extra ATF site present in the HAdV-F40 and HAdV-F41 ITRs. Even in the extended ITRs (Figure 4 and Figure 5), this ATF site does not appear to be present in HAdV-G52, SAdV- 1, and SAdV-7. Figure 5 shows all the ITRs extended to 420 nucleotides, and includes the TATA box of the E1A gene towards the end of the alignment. In this extended alignment, putative ATF sites appear in HAdV-52, SAdV-1, and SAdV-7. However, these ATF sites differ from the ATF sites found in HAdV-F40 and HAdV-F41. The difference is that the last 2 nucleotides in the ATF sites are switched (TGACGT vs. TGACTG). In summary, this in depth sequence analysis shows that the ITR of HAdV-G52 is more similar to the ITRs of SAdV-1 and SAdV-7 than to the ITRs of HAdV-F40 and HAdV-F41. This suggests that HAdVG52 is more closely related to the simian adenoviruses than to the species F adenoviruses. This provides more evidence that HAdV-G52 is a new type.

Figure 4

The HAdV-G52, SAdV-1, and SAdV-7 ITRs have been extended to 210 nucleotides. The second ATF binding site (lowercase bold) still appears to be only present in HAdV-F40 and HAdV-F41. In contrast, the SP1 site missing in HAdV-52 in the original alignment appears to be present in this extended alignment.

Figure 5

The HAdV-G52, SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41 ITRs have been extended to 420 bp. Putative ATF sites appear in HAdV-G52, SAdV-1, and SAdV-7. The conserved TATA box of the E1A gene is also shown (TATTTA).

zPicture

Percent identity gives a broad overview of the differences between HAdV genomes. However, in order to determine where in the genome these differences are located, a whole genome visualization tool such as zPicture must be used. zPicture uses BLASTZ [17] to align two genomes and produces a plot of percent identity between the two genomes [18]. By looking at the plot, regions of high percent identity and regions of low percent identity can be easily identified. This is especially useful in the comparison of HAdV-G52 with SAdV-1, SAdV-7, HAdV-F40, and HAdV-F41. Figure 6A shows a zPicture plot of HAdV-G52 vs. SAdV-1 and it can be seen that the percent identity is very high in almost all regions between these two genomes. In contrast, Figure 6B shows a zPicture plot of HAdV-G52 vs. SAdV-7 that indicates there is zero percent identity between the two at the E1A and E3 regions. A possible explanation for this is that parts of these regions may have been artificially deleted by researchers using SAdV-7 as a viral vector. Indeed, the genome size of SAdV-7 is smaller at 31,045 bp than HAdV-G52 whose genome size is 34,250 bp, supporting this hypothesis. Figure 6C and Figure 6D show plots of HAdV-G52 vs. HAdV-F40 and HAdV-F41. These figures show lower percent identity at the E3 and E4 regions and relatively high percent identity for the rest of the genome.

Figure 6

zPicture plots of HAdV-G52 vs. A) SAdV-1, B) SAdV-7, C)HAdV-F40, D) HAdV-F41. The red regions are evolutionarily conserved regions (ECRs) of at least 100 bp in length and at least 70% identity.

CoreGenes

CoreGenes is a tool that is used to determine the “core” or common set of proteins in a set of genomes. It has previously been used in the classification of bacteriophages [19,20] CoreGenes is implemented in the Java programming language and uses a combination of servlets and HTML to provide the required functionality [21,22]. The CoreGenes algorithm takes GenBank accession numbers as input via a web interface. These genome files are then retrieved and the protein sequences are parsed and extracted from the files. Protein similarity analysis is performed for each protein from the query genome against the reference genome protein database using BLASTP from the WU-BLAST package. If the sequence alignments are equal to or greater than a user specified threshold BLASTP score, then that pair of proteins is stored and a consensus genome of related genes is created. These scores can be “custom-specified” by the user by entry into text fields in the CoreGenes web interface. It is available at http://binf.gmu.edu:8080/CoreGenes3.5. If more than two accession numbers are entered, the CoreGenes3.0 algorithm proceeds in an iterative manner. The consensus genome created from the analysis of the first query genome and the reference genome is analyzed against the second query genome. A second consensus genome is created and stored, which is then analyzed against the third query genome. This process is repeated and the fourth query genome is treated in the same way. The final output is a table of related genes across all five genomes. CoreGenes also outputs unique genes between two genomes. From the whole genome percent identity analysis, ITR alignments, and zPicture plots, there is strong evidence that HAdVG52 is closely related to SAdV-1 and SAdV-7. In order to further investigate the relationship between HAdV-G52, SAdV-1, and SAdV-7, a whole proteome approach is undertaken here. The CoreGenes whole proteome analysis reveals that HAdV-G52 and SAdV-1 share a total of 35 proteins at a BLASTP threshold score of “75”. Figure 7 shows a partial table of shared proteins between HAdV-G52 and SAdV-1 that is produced by CoreGenes. The total number of proteins in HAdV-G52 is 36, while the total number in SAdV-1 is 35. Interestingly, a protein that is not annotated in SAdV-1, but which is found in HAdV-G52 is the U protein. This is likely an essential protein. For example, this protein may be involved in adenovirus DNA replication and RNA transcription [23]. Additional analysis using the annotation and genome visualization tool Artemis reveals that the U protein is in fact present in SAdV-1. These results suggest that HAdV-G52 and SAdV-1 are very closely related since they share all the same proteins with each other. This is consistent with whole genome percent identity analysis, ITR alignments, and zPicture plots. CoreGenes analysis reveals that HAdV-G52 shares fewer proteins with SAdV-7 than with SAdV-1 with a total of 26 shared proteins at a BLASTP threshold score of “75.” The total number of proteins in SAdV-7 is 27. There appears to be several proteins unique to HAdV-G52 that are absent in SAdV-7. These are the E3 CR1-alpha1, E3 CR1beta1, E3 RIDalpha, E3 RIDbeta, E3 14.7 kDa, and U proteins. Further analysis using TBLASTN (BLAST a protein query against a translated nucleotide database) confirms that these proteins are missing in SAdV-7. In contrast, these proteins are all present in SAdV-1 and this indicates that HAdV-G52 is more closely related to SAdV-1 than to SAdV-7. This is sup ported by the genome identity between HAdV-G52 and SAdV-1 which is 95.5%. The genome identity between HAdV-G52 and SAdV-7 is significantly lower at 82.9%. As mentioned earlier in the zPicture analysis, a possible explanation for these missing proteins in SAdV-7 is artificial deletion of segments of the genome for use as a viral vector.

GeneOrder

GeneOrder4.0 is a versatile user-friendly web-based tool developed for the analysis of gene order and synteny [24]. This software tool has been updated to analyze larger sized bacterial genomes of around 4-5 megabases (Mb). It performs “on-the-fly” analysis of two genomes and produces a dot plot of gene pairs. GeneOrder4.0 uses the BLAST-like Alignment Tool (BLAT) [25] to perform efficient and fast “all-against-all” protein comparisons. GeneOrder4.0 also provides for the analysis of custom or proprietary data, that is, data not submitted to GenBank for one reason or another. Since GeneOrder4.0 is web-based, users do not have to download or install any software packages. Webbased access is especially useful for non-computationally based scientists such as bench-based biologists. Other user-friendly features of GeneOrder4.0 include zooming, printing, and customizing the final graphical plot. In addition, clicking on the data points on the plot leads to the popping up of new browser windows, leading to the GenBank record of the gene pairs on the plot. In order to visualize the relatedness of the HAdV-G52, SAdV-1, and SAdV-7 genomes, they are analyzed as pairs using GeneOrder4.0. The plot between HAdV-G52 and SAdV-1 shows several related proteins and confirms that these two genomes are related to each other (Figure 8A). The plot between HAdV-G52 and SAdV-7 shows several related proteins (Figure 8B), but the number is less than that of the plot between HAdV-G52 and SAdV-1. These related proteins are indicated by red dots (BLASTP score ≥ 200) and blue “x” symbols (100 ≤ BLASTP score < 200) on the plots. In addition, there is a noticeable gap between one segment of related proteins and the other (Figure 8B). As explained previously, it appears that a part of the genome of SAdV-7 has been deleted. This illustrates the utility of GeneOrder4.0 in visualizing abnormalities in a genome when compared to a reference genome (HAdV-G52).

Figure 8

GeneOrder plots of HAdV-G52 vs. A) SAdV-1 and B) SAdV-7. The red circles indicate BLASTP scores ≥ 200, while the blue “x” symbols indicate BLASTP scores of ≥ 100 but < than 200.

PipMaker

PipMaker compares two sequences using the BLASTZ algorithm and produces a dot plot that shows the segments that are conserved between the sequences [26]. The PipMaker web server accepts sequences in FASTA format and also produces a percent identity plot (pip). A textual form of the sequence alignments can also be created. When PipMaker is used to produce a dotplot of HAdV-G52 vs. SAdV-7 (Figure 9), it can be seen that there are gaps in the plot showing the regions of artificial deletion in SAdV-7, particularly the E1A and E3 regions. Thus, PipMaker allows for the visualization of differences between whole adenovirus genomes.

Figure 9

PipMaker dot plot of HAdV-G52 vs. SAdV-7. The gaps in the plot reflect differences between the HADV-G52 genome and the SAdV-7 genome. The differences indicate gaps present in the SAdV-7 genome which correspond to artificial deletions in that genome with respect to HAdV-G52.

Discussion

The evolution of sequencing technology from second generation to third generation sequencing promises to deliver higher throughput at a cheaper cost and faster rate [27]. This will lead to even more genomes being sequenced. In order to deal with this data deluge, the development of whole genome software tools must continue. The utility of whole genome tools such as Artemis, EMBOSS, pDRAW, zPicture, CoreGenes, and GeneOrder in the analysis of adenovirus genomes has been demonstrated here. Whole genome percent identity analysis using the program in EMBOSS provides a broad overview of the similarity between adenovirus genomes, while zPicture enables the visualization of regions of high percent identity in these genomes. These two tools are useful in determining that HAdV-G52 is more related to SAdV-1 and SAdV-7 than to HAdV-F40 and HAdV-F41. The ITR analysis also agrees with the whole genome percent identity and zPicture results. REA analysis using pDRAW allows the differentiation of two HAdV types that may initially look the same, but are in fact distinct. HAdV-3 GB and the field strain HAdV-3 NHRC 1276 share a very high percent identity, but pDRAW analysis shows distinct restriction patterns that distinguish these two genomes. The whole genome visualization tool Artemis allows the viewing and inspection of these two HAdV-3 genomes. Upon closer inspection with Artemis, a 20.6 kDA protein was found to be truncated in the HAdV-3 NHRC 1276 field strain. This illustrates the use of Artemis in discovering minor differences between these two HAdV-3 genomes. CoreGenes finds the common genes between a set of up to five genomes. CoreGenes analysis reveals that HAdV-G52 shares more proteins with SAdV-1 than with SAdV-7. The missing proteins in SAdV-7 likely due to an artificial deletion are also found using the CoreGenes analysis. GeneOrder analysis also visualizes these missing proteins as a gap in the synteny plot that it produces. Similarly, PipMaker also shows gaps in the dot plot between HAdV-G52 and SAdV-7, reflecting the differences between HAdV-G52 and SAdV-7. In summary, all of these whole genome tools are invaluable in analyzing adenovirus genomes. Therefore, their development and the development of new tools must be encouraged and supported.

25 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

Review 2. Genetic content and evolution of adenoviruses.

Authors: Andrew J Davison; Mária Benkő; Balázs Harrach
Journal: J Gen Virol Date: 2003-11 Impact factor: 3.891

3. Isolation of a cytopathogenic agent from human adenoids undergoing spontaneous degeneration in tissue culture.

Authors: W P ROWE; R J HUEBNER; L K GILMORE; R H PARROTT; T G WARD
Journal: Proc Soc Exp Biol Med Date: 1953-12

4. Transcription factors NFI and NFIII/oct-1 function independently, employing different mechanisms to enhance adenovirus DNA replication.

Authors: Y M Mul; C P Verrijzer; P C van der Vliet
Journal: J Virol Date: 1990-11 Impact factor: 5.103

5. Taxonomic parsing of bacteriophages using core genes and in silico proteome-based CGUG and applications to small bacterial genomes.

Authors: Padmanabhan Mahadevan; Donald Seto
Journal: Adv Exp Med Biol Date: 2010 Impact factor: 2.622

6. Overreliance on the hexon gene, leading to misclassification of human adenoviruses.

Authors: Gurdeep Singh; Christopher M Robinson; Shoaleh Dehghan; Timothy Schmidt; Donald Seto; Morris S Jones; David W Dyer; James Chodosh
Journal: J Virol Date: 2012-02-01 Impact factor: 5.103

7. Human adenovirus type 52: a type 41 in disguise?

Authors: J C de Jong; A D M E Osterhaus; Morris S Jones; Balázs Harrach
Journal: J Virol Date: 2008-04 Impact factor: 5.103

8. Natural variants of human adenovirus type 3 provide evidence for relative genome stability across time and geographic space.

Authors: Padmanabhan Mahadevan; Jason Seto; Clark Tibbetts; Donald Seto
Journal: Virology Date: 2009-11-24 Impact factor: 3.616