Human Cytomegalovirus Genomes Sequenced Directly From Clinical Material: Variation, Multiple-Strain Infection, Recombination, and Gene Loss.

Nicolás M Suárez1, Gavin S Wilkie1, Elias Hage2,3, Salvatore Camiolo1, Marylouisa Holton1, Joseph Hughes1, Maha Maabar1, Sreenu B Vattipally1, Akshay Dhingra2, Ursula A Gompels4, Gavin W G Wilkinson5, Fausto Baldanti6,7, Milena Furione6, Daniele Lilleri8, Alessia Arossa9, Tina Ganzenmueller2,3,10, Giuseppe Gerna8, Petr Hubáček11, Thomas F Schulz2,3, Dana Wolf12, Maurizio Zavattoni6, Andrew J Davison1.   

Abstract

The genomic characteristics of human cytomegalovirus (HCMV) strains sequenced directly from clinical pathology samples were investigated, focusing on variation, multiple-strain infection, recombination, and gene loss. A total of 207 datasets generated in this and previous studies using target enrichment and high-throughput sequencing were analyzed, in the process enabling the determination of genome sequences for 91 strains. Key findings were that (i) it is important to monitor the quality of sequencing libraries in investigating variation; (ii) many recombinant strains have been transmitted during HCMV evolution, and some have apparently survived for thousands of years without further recombination; (iii) mutants with nonfunctional genes (pseudogenes) have been circulating and recombining for long periods and can cause congenital infection and resulting clinical sequelae; and (iv) intrahost variation in single-strain infections is much less than that in multiple-strain infections. Future population-based studies are likely to continue illuminating the evolution, epidemiology, and pathogenesis of HCMV.
© The Author(s) 2019. Published by Oxford University Press for the Infectious Diseases Society of America.

Keywords:  gene loss; genome sequence; genotype; human cytomegalovirus; multiple-strain infection; mutation; recombination; target enrichment; variation

Year:  2019        PMID: 31050742      PMCID: PMC6667795          DOI: 10.1093/infdis/jiz208

Source DB:  PubMed          Journal:  J Infect Dis        ISSN: 0022-1899            Impact factor:   5.226


Human cytomegalovirus (HCMV) poses a risk, particularly to people with immature or compromised immune systems, and can have serious outcomes in congenitally infected children, transplant recipients, and people with human immunodeficiency virus/AIDS. Prior to the advent of high-throughput technologies, studies of HCMV genomes in natural infections were limited to Sanger sequencing of polymerase chain reaction (PCR) amplicons, often focusing on a small number of polymorphic (hypervariable) genes [1]. This left out most of the genome and also restricted the characterization of multiple-strain infections, which may have more serious outcomes.
The first complete HCMV genome sequence to be determined was that of the high-passage strain AD169 [2], from a plasmid library. Over a decade later, additional genomes were sequenced from bacterial artificial chromosomes [3–5], virion DNA [6] and overlapping PCR amplicons [7, 8]. These sequences were also determined using Sanger technology, and were complemented subsequently by many others, increasingly using high-throughput methods [7, 9–13]. With only 3 exceptions [7, 11], all were derived from laboratory strains isolated in cell culture.
Mounting evidence of the existence of multiple-strain infections and the propensity of HCMV to mutate during cell culture [6–8, 14, 15] added impetus to sequencing genomes directly from clinical material to define natural populations. One strategy for this involves sequencing overlapping PCR amplicons [7, 16]. Another utilizes an oligonucleotide bait library representing known HCMV diversity to select target sequences from random DNA fragments. This target enrichment technology originated in commercial kits for cellular exome sequencing, and was subsequently applied to various pathogens [17, 18], including HCMV [19–21]. We have applied it to HCMV since 2012 and have systematically released via GenBank many genome sequences that have proved pivotal in other studies [11, 12, 19–21].
The HCMV genome exhibits several evolutionary phenomena, including variation, multiple-strain infection, recombination, and gene loss, all of which were discovered prior to high-throughput sequencing and have since been illuminated by this technology (early references are [22–26]). We explore these and other key genomic features of HCMV, with an emphasis on the strains present in clinical material.

METHODS

Samples

For convenience, samples were analyzed as collections 1–3, which are summarized in Table 1 and described in Supplementary Tables 1–3, respectively. Collection 3 represents samples sequenced by others in previous studies using target enrichment with a different oligonucleotide bait library. The features of the samples are shown in Supplementary Tables 1–3 (rows 3–6), and the clinical outcomes of congenital infection are in Supplementary Table 1 (row 205).
Table 1.

Selected Characteristics of Sample Collections 1–3

Characteristic | Collection 1 | Collection 2 | Collection 3
Patients, No.^a | 48 | 29 | 25
Patient condition | Congenital infection | Mostly transplant recipients | Various
Samples, No. | 53 | 89 | 57
Sample source, city (prefix) | Pavia (PAV), Jerusalem (JER), Prague (PRA) | Hannover (Child, RTR, SCTR), Pavia (PAV) | Rotterdam (Rot), London (Lon, Pat_)
Datasets, No. | 53 | 97^b | 57^c
Duplicated libraries, No. | 0 | 7 | 0
HCMV load, IU/µL^d | 26–559 968 | 5–194 840 | 104–18 377
Genome copies for library, No.^e | 225–8 399 520 | 280–3 896 800 | Unknown
Reads in Merlin alignment, % | 2–91 | 0–85 | 0–90
Coverage ratio in Merlin alignment, % unique/total reads | 0.40–83.12 | 0.00–76.09 | 0.00–90.21
Genome sequences determined, No.^f | 42 | 25 | 24

Details are provided in Supplementary Tables 1–3.

Abbreviation: HCMV, human cytomegalovirus.

^a Archived diagnostic samples were used, and clinical data were retrieved, with the approval of the institutional review boards of Policlinico San Matteo, Pavia (reference numbers 35853/2010 and 35854/2010), Hadassah University Hospital, Jerusalem (reference number HMO-063911), Motol University Hospital, Prague (reference number EK-701a/16) and Hannover Medical School, Hannover (reference number 2527-2014).

^b We reported 68 of the Hannover datasets previously [21].

^c These datasets were reported previously by others, and were either provided by the authors [19] or downloaded from the European Nucleotide Archive (study PRJEB12814) [20].

^d Viral load in most extracted samples was quantified in the laboratory of origin or the sequencing laboratory. In some instances, the entire sample was used blind to generate a sequencing library.

^e Assumes that 1 IU is equivalent to 1 genome copy.

^f The trimmed paired-read data were aligned to the UCSC hg19 human reference genome (http://genome.ucsc.edu/) using Bowtie2. Nonmatching reads were assembled de novo into contigs using SPAdes version 3.5.0 [27]. The contigs were ordered using Scaffold_builder version 2.2 [28] by reference to a version of the strain Merlin sequence lacking all but 100 nt of the terminal repeat regions (TRL at the left end and TRS at the right end; Figure 1), and merged into a draft genome sequence. Residual gaps were filled by identifying relevant reads anchored in flanking regions and assembling them manually in a reiterative fashion. TRL and TRS were reinstated, and the complete genome sequence was verified by aligning it against the read data using Bowtie2 and inspecting the alignment in Tablet. An annotated genome sequence was produced using Sequin (https://www.ncbi.nlm.nih.gov/Sequin/).

Figure 1.

Locations in the human cytomegalovirus strain Merlin genome of genes used for genotyping. The genome consists of 2 unique regions, UL (1325–194 343 bp) and US (197 627–233 108 bp), the former flanked by inverted repeats TRL (1–1324 bp) and IRL (194 344–195 667 bp), and the latter flanked by inverted repeats IRS (195 090–197 626 bp) and TRS (233 109–235 646 bp). Protein-coding regions are indicated by shaded arrows, and noncoding RNAs as narrower, white arrows, with gene nomenclature below. Introns are shown as narrow white bars. The 12 genes (RL5A, RL6, RL12, RL13, UL1, UL9, UL11, UL73, UL74, UL120, UL146, and UL139) used for motif read-matching are in dark gray (red in online version). Two of these genes (RL13 and UL146) were also used for genotype read-matching. The additional 5 genes (UL20, UL33, UL37, UL55, and US9) used to genotype sequences by alignment are medium gray (orange in online version). All other genes are shown in white (pink in online version).

DNA Sequencing

Target enrichment and sequencing library preparation were performed using the SureSelect XT version 1.7 system for Illumina paired-end libraries with biotinylated RNA bait libraries (Agilent) [21]. Bait libraries representing known HCMV diversity were designed in February 2012 and April 2014 from 31 and 64 complete genome sequences, respectively. Information on and access to the latter library (55 210 baits of 120 nucleotides [nt] with overrepresentation of G + C–rich regions) are available from the corresponding author. Data on viral loads and library construction are shown in Supplementary Tables 1–3 (rows 9–12). Datasets of 300 or 150 nt paired-end reads were generated using a MiSeq (Illumina). Their names are shown in Supplementary Tables 1–3 (row 7). They were prepared for analysis using Trim Galore version 0.4.0 (program available at http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/; length = 21, quality = 10, and stringency = 3). The numbers of trimmed reads are in Supplementary Tables 1–3 (row 15).

Library Diversity

Estimating the number of reads in a dataset derived from unique HCMV fragments initially involved using Bowtie2 version 2.2.6 [29] to align the reads against the strain Merlin sequence (GenBank accession number AY446894.2), and, where it could be determined, the consensus genome sequence derived from the dataset. The relevant data are in Supplementary Tables 1–3 (rows 17–19 and 23–26). Reads containing insertions or deletions were removed to preserve coordinate numbering, as were duplicate read pairs sharing both end coordinates and duplicate unpaired reads sharing one end coordinate, thereby producing an alignment file for unique reads derived from unique HCMV fragments (program available at https://centre-for-virus-research.github.io/VATK/AssemblyPostProcessing). This file was viewed using Tablet version 1.14.11.7 [30]. The coverage depth values for total and unique fragment reads are in Supplementary Tables 1–3 (rows 20–21 and 27–28).
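The duplicate-removal step described above can be sketched as follows. This is a minimal illustration and not the VATK program itself: the `Fragment` record, its field names, and the helper functions are hypothetical, and only read pairs are handled (the rule for unpaired reads sharing one end coordinate is omitted).

```python
from typing import Iterable, NamedTuple


class Fragment(NamedTuple):
    """Aligned read pair reduced to the coordinates of its two ends (hypothetical record)."""
    name: str
    start: int        # leftmost aligned position on the reference
    end: int          # rightmost aligned position on the reference
    has_indel: bool   # True if either read aligned with an insertion or deletion


def unique_fragments(fragments: Iterable[Fragment]) -> list[Fragment]:
    """Keep the first read pair seen for each (start, end) pair, skipping indel reads."""
    seen: set[tuple[int, int]] = set()
    kept: list[Fragment] = []
    for frag in fragments:
        if frag.has_indel:
            continue  # removed to preserve coordinate numbering
        key = (frag.start, frag.end)
        if key in seen:
            continue  # duplicate pair sharing both end coordinates
        seen.add(key)
        kept.append(frag)
    return kept


def coverage_ratio(total: int, unique: int) -> float:
    """Unique reads as a percentage of total reads (0.0 when nothing aligned)."""
    return 100.0 * unique / total if total else 0.0
```

A low ratio returned by `coverage_ratio` would correspond to the highly clonal libraries that result from excessive PCR cycles.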

Strain Enumeration

The number of strains represented in a dataset was estimated by 2 strategies: genotype read-matching and motif read-matching (program available at https://centre-for-virus-research.github.io/VATK/HCMV_pipeline). Both strategies utilized datasets concatenated from the paired-end datasets. The genotype designations used were either based on reported phylogenies [6, 12, 25, 31, 32], amended or extended as appropriate, or constructed afresh using Clustal Omega version 1.2.4 [33] and MEGA version 6.0.6 [34] with data for the genomes listed in Supplementary Table 4 and individual genes for which additional sequences were available in GenBank. Alignments and phylogenetic reconstructions are in Supplementary Figures 1 and 2, respectively. For genotype read-matching, Bowtie2 was used to align the reads to sequences representing the genotypes of 2 hypervariable genes, UL146 and RL13 [6, 12, 35]. The sequences from the entire coding region of UL146 and the central coding region of RL13 are in Supplementary Tables 1–3 (rows 34–58). In contrast to the UL146 genotypes, the RL13 genotypes cross-matched within 4 groups (G1, G2, G3; G4A, G4B; G6, G10; and G7, G8). In these instances, the genotype within the group with most matching reads was scored. The number of reads aligned to each genotype is in Supplementary Tables 1–3 (rows 34–58). A genotype was scored if the number of reads was >10 and represented >2% of the total number detected for all genotypes of that gene. For 14 samples in collection 1 that had been sequenced prior to the availability of ultrapure (TruGrade) oligonucleotides, these values were >25 and >5%, respectively. The number of strains in a sample was scored as the greater of the numbers of genotypes detected for the 2 target genes, and is in Supplementary Tables 1–3 (row 13). 
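As a minimal sketch of the scoring rule for genotype read-matching, the thresholds (>10 reads and >2% of the total for that gene, or >25 and >5% for the earlier libraries) can be applied as shown below. The function names and the dictionary input format are assumptions made for illustration, and the tie-breaking within cross-matching RL13 groups is not implemented.

```python
def score_genotypes(read_counts, min_reads=10, min_percent=2.0):
    """Return genotypes whose read count exceeds both thresholds.

    read_counts maps genotype name -> number of reads aligned to it for one gene.
    """
    total = sum(read_counts.values())
    if total == 0:
        return []
    return sorted(
        genotype
        for genotype, n in read_counts.items()
        if n > min_reads and 100.0 * n / total > min_percent
    )


def strains_by_genotype_matching(ul146_counts, rl13_counts, **thresholds):
    """Number of strains: the greater of the genotype counts for the 2 target genes."""
    return max(
        len(score_genotypes(ul146_counts, **thresholds)),
        len(score_genotypes(rl13_counts, **thresholds)),
    )
```

Passing `min_reads=25, min_percent=5.0` reproduces the stricter values stated for the 14 samples sequenced before ultrapure oligonucleotides were available.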
For motif read-matching, conserved genotype-specific motifs (20–31 nt) were identified by visual inspection of alignments (Supplementary Figure 1) for 12 hypervariable genes [6, 12, 19, 35]. Additional motifs for identifying common intergenotypic recombinants were included. The motif sequences and number of reads containing perfect matches to a sequence or its reverse complement are in Supplementary Tables 1–3 (rows 60–170). Genotypes were scored as described above. The number of strains in a sample was estimated as the maximum number of genotypes detected for at least 2 genes, and is in Supplementary Tables 1–3 (row 14).
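Motif read-matching can be illustrated with the sketch below, which counts reads containing a perfect match to a motif or its reverse complement. The function names are hypothetical and the motifs in any real analysis would be those listed in the supplementary tables. The strain estimate is written under one reading of the rule above, namely the largest number of genotypes that is reached by at least 2 genes; this interpretation is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_motif_reads(reads, motifs):
    """Count reads with a perfect match to each motif or its reverse complement.

    motifs maps genotype name -> uppercase motif sequence.
    """
    counts = {genotype: 0 for genotype in motifs}
    for read in reads:
        read = read.upper()
        for genotype, motif in motifs.items():
            if motif in read or reverse_complement(motif) in read:
                counts[genotype] += 1
    return counts


def strains_by_motif_matching(genotypes_per_gene):
    """Largest genotype count reached by at least 2 genes (assumed interpretation).

    genotypes_per_gene maps gene name -> number of scored genotypes.
    Returns 0 when fewer than 2 genes are supplied.
    """
    counts = sorted(genotypes_per_gene.values(), reverse=True)
    return counts[1] if len(counts) >= 2 else 0
```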

Pseudogene Analysis

The genomes of some HCMV strains exhibit gene loss apparent as pseudogenes resulting from mutations causing premature translational termination [7, 11, 12, 26]. These mutations are substitutions that introduce in-frame stop codons or ablate splice sites, or insertions or deletions that cause frameshifting or loss of protein-coding regions. Motif read-matching was used to assess the presence of common mutations and also to determine the prevalence of mutations identified in collection 1. These data are in Supplementary Tables 1–3 (rows 171–178) and Supplementary Table 1 (rows 180–203), respectively.
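Two of the mutation classes named above, in-frame stop codons and frameshifting insertions or deletions, can be recognized with simple checks such as the following. This is an illustrative sketch using the standard stop codons, not the procedure used in the study; the function names are hypothetical, and splice-site ablation and loss of coding regions are not covered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based codon index of the first in-frame stop codon
    located before the final codon, or None if there is none."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    for i in range(n_codons - 1):  # the final codon is the expected stop
        if cds[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None


def is_frameshifted(reference_len, variant_len):
    """True if the length difference is not a multiple of 3 nucleotides."""
    return (variant_len - reference_len) % 3 != 0
```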

Intrahost Variation

Minor genome populations were analyzed by enumerating single-nucleotide polymorphisms (SNPs) in datasets for which consensus genome sequences had been determined. Thus, the term mutant applies hereafter to a strain that has a mutation in the consensus sequence resulting in a pseudogene, and the term SNP applies to a minor variation from the consensus within a population. To enumerate SNPs, original datasets were prepared for analysis using Trim Galore (length = 100, quality = 30, and stringency = 1), and trimmed reads were mapped using Bowtie2. Alignment files in SAM format were converted into BAM format, sorted using SAMtools version 1.3 [36], and analyzed using LoFreq version 2.1.2 [37] and V-Phaser 2 [38].

Data Deposition

Original datasets were purged of human reads and deposited in the European Nucleotide Archive (ENA; project number PRJEB29585), and consensus genome sequences were deposited in GenBank. The accession numbers are in Supplementary Tables 1–3 (rows 8 and 29, respectively). Updated genome sequence determinations in collection 3 were deposited by the original submitters in GenBank [19] or by us as third-party annotations in ENA (project number PRJEB29374) [20]. Sequence features are in Supplementary Tables 1–3 (rows 30–32).

RESULTS

Operational Limitations

A total of 207 datasets from 199 samples and 102 individuals were analyzed (Table 1 and Supplementary Tables 1–3). Library quality was represented in the percentage of HCMV reads and the coverage depth by unique fragment reads. These values were related to sample type, being higher for urine than blood presumably because of a higher proportion of viral to host DNA. They also depended on the number of viral genome copies used to make the library, with >1000 copies generally being needed to determine a complete genome sequence. However, despite high library diversity, it was not possible to assemble complete genome sequences from most datasets in collection 3 because of gaps in RL12 and some G + C–rich regions, perhaps as a result of limitations in the bait library. The use of excessive PCR cycles with some samples in collections 1 and 2 led to high coverage depth by total fragment reads but low coverage depth by unique fragment reads, and thus to highly clonal libraries (eg, PAV2 in collection 1). Genotypes present at subthreshold levels may represent multiple-strain infections or cross-contamination during the complex sample processing pathway (eg, PRA4 reads in PRA6A in collection 1).

Genome Sequences

A total of 91 complete or almost complete HCMV genome sequences were determined (Table 1). We reported 5 previously [21], and 16 are improvements on published sequences [19]. Most originated from single-strain infections or multiple-strain infections in which one strain was predominant, and some originated from different strains that predominated in a patient at different times. Defining a strain as a viral genome present in an individual, these 91 sequences, plus an additional 49 deposited by our group and 104 by others, brought the number of strains sequenced to 244 (Supplementary Table 4). Of these, 91 were sequenced directly from clinical material, and all but one were determined in this and our previous study [21]. The average size of the HCMV genome, based on the 78 complete sequences in this set, is 235 465 bp (range 234 316–237 120 bp).

Multiple-Strain Infections

Genotypic differences in hypervariable genes (Figure 1 and Supplementary Figures 1 and 2) were exploited to distinguish single-strain from multiple-strain infections by genotype read-matching and motif read-matching with threshold values. To our knowledge, these methods, employed in the present work and the companion study [39], have not been used previously for categorizing HCMV infections. Single strains were common in congenitally infected patients (n = 43/50 in collections 1 and 2), but significantly less so in transplant recipients (n = 11/25 in collections 2 and 3; χ2 = 14.583, P < .05). Intrahost variation is discussed below.
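The comparison of single-strain frequencies can be worked through as a 2 × 2 contingency table (43 single and 7 multiple in congenital infection; 11 single and 14 multiple in transplant recipients). The sketch below assumes Pearson's χ2 without continuity correction, which for these counts gives the value of 14.583 quoted above; the function names are illustrative, and the P value uses the closed form for 1 degree of freedom.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: congenital infection, transplant recipients; columns: single, multiple.
chi2 = chi_square_2x2(43, 7, 11, 14)
p = p_value_1df(chi2)
```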

Recombination

The 244 genome sequences were genotyped in the 12 hypervariable genes used for motif read-matching and then in 5 additional genes (Figure 1 and Supplementary Table 4). Hypervariation in UL55, which encodes glycoprotein B (gB), is located in 2 regions (UL55N near the N terminus, and UL55X encompassing the proteolytic cleavage site) [23, 40]. Five genotypes (G1–G5) have been assigned to each region [23, 40–42], which are separated by 927 bp that are 80% identical in all strains. All genomes had a recognized UL55X genotype (Supplementary Table 5). As reported previously [40], UL55N G2 and G3 could not be distinguished reliably from each other, and 2 additional genotypes (G6–G7) were detected that may have arisen from ancient recombination events within UL55N (Supplementary Tables 4 and 5 and Supplementary Figure 1). There was evidence for recombination in the region between UL55N and UL55X in only 8 genomes. This low proportion of recombination (3.3%) contrasts with the higher levels proposed in UL55 from PCR-based studies [40, 43], which may have been affected by artefactual recombination. UL73 and UL74, which encode glycoproteins N and O (gN and gO), respectively, are adjacent hypervariable genes that exist as 8 genotypes each [25, 32, 44]. There was evidence for recombination between them in only 7 genomes (2.9%), in accordance with the low levels (2.2%) detected previously in PCR-based studies [25, 32, 45]. In the region containing adjacent hypervariable genes RL12, RL13, and UL1, recombinants were also rare (1.2%) within RL12 and absent from RL13 and UL1. In contrast, hypervariable genes UL146 and UL139, which encode a CXC chemokine and a membrane glycoprotein, respectively, are separated by a well-conserved region of over 5 kbp. The number (66) of the 126 possible genotype combinations represented in the 244 genomes is too large to allow any underlying genotypic linkage to be discerned, consistent with previous conclusions from PCR-based studies [31]. No recombinants were noted within UL146.
In principle, strains in multiple-strain infections have the opportunity to recombine. In our previous analysis of RTR1 in collection 2, we noted that one strain (RTR1A) predominated at earlier times and another (RTR1B) at later times [21]. From the low frequency of SNPs across a large part of the genome, we concluded that the second strain had arisen either by recombination involving the first strain or by reinfection with, or reactivation of, a second strain fortuitously similar to the first. In the present study, recombination was strongly supported by a comparison of the 2 genome sequences, which showed that approximately two-thirds of the genome is almost identical (differing by 3 substitutions in noncoding regions), whereas the remaining third is highly dissimilar.
To investigate whether strains have been transmitted without recombination occurring, identical genotypic constellations were identified among the 244 genomes (Table 2). This revealed the existence of 12 haplotype groups within which multiple strains lack signs of having recombined since diverging from their last common ancestor; these are henceforth termed nonrecombinant strains. As an incidental outcome, the 2 strains in group 1 (PRA8 and CZ/3/2012), which were characterized in different studies, were confirmed as having originated from the same patient, reducing the set of sequenced strains to 243. The results from the other 11 groups suggest that nonrecombinant strains have been circulating, some for periods sufficient to allow the accumulation of >100 substitutions. Among the highly divergent groups, group 9 (3 strains) exhibited 135 differences, with the 50 that would affect protein coding distributed among 38 genes, and group 10 (2 strains) exhibited 138 differences, with the 38 that would affect protein coding distributed among 27 genes. No obvious bias was observed toward greater diversity in any particular gene or group of genes, including those in the hypervariable category.
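Table 2.

Groups of Nonrecombinant Strains

Genotypes^a are given in columns RL5A to US9. Differences^b and Shared Mutations apply to the whole group and are given on its first row.

Group | Strain | RL5A | RL6 | RL12 | RL13 | UL1 | UL9 | UL11 | UL20 | UL33 | UL37 | UL55N | UL73 | UL74 | UL120 | UL146 | UL139 | US9 | Mutated Genes | Differences^b | Shared Mutations
1 | PRA8 | 1 | 1 | 6 | 6 | 6 | 6 | 2 | 5 | 1 | 5 | 2/3 | 4C | 1C | 2B | 1 | 4 | 1 | UL145 | 0 | These strains share a UL145 mutation, were characterized in different studies, and were confirmed as having been derived from the same patient
 | CZ/3/2012 | 1 | 1 | 6 | 6 | 6 | 6 | 2 | 5 | 1 | 5 | 2/3 | 4C | 1C | 2B | 1 | 4 | 1 | UL145 | |
2 | BE/3/2011 | 2 | 4 | 1B | 1 | 1 | 4 | 1 | 6 | 2 | 2 | 2/3 | 4A | 3 | 1A | 8 | 2 | 1 | None | 1 | None
 | BE/21/2011 | 2 | 4 | 1B | 1 | 1 | 4 | 1 | 6 | 2 | 2 | 2/3 | 4A | 3 | 1A | 8 | 2 | 1 | None | |
3 | UK/Lon6/Urine/2011 | 5 | 1 | 7 | 7 | 7 | 1 | 1 | 6 | 2 | 1 | 4 | 3A | 1B | 3B | 13 | 1A | 1 | None | 23 | None
 | 2CEN15 | 5 | 1 | 7 | 7 | 7 | 1 | 1 | 6 | 2 | 1 | 4 | 3A | 1B | 3B | 13 | 1A | 1 | None | |
 | BE/5/2012 | 5 | 1 | 7 | 7 | 7 | 1 | 1 | 6 | 2 | 1 | 4 | 3A | 1B | 3B | 13 | 1A | 1 | None | |
4 | BE/14/2012 | 3 | 5 | 4A | 4A | 4 | 6 | 6 | 5 | 4 | 5 | 4 | 3A | 1B | 2A | 9 | 5 | 1 | RL6 UL9 UL40 US7 | 26 | These strains share a UL9 mutation and also RL6 and UL40 mutations that are present in other strains
 | BE/36/2011 | 3 | 5 | 4A | 4A | 4 | 6 | 6 | 5 | 4 | 5 | 4 | 3A | 1B | 2A | 9 | 5 | 1 | RL6 UL9 UL40 | |
5 | BE/10/2012 | 6 | 3 | 1A | 1 | 1 | 1 | 1 | 2 | 4 | 5 | 4 | 3A | 1B | 2A | 3 | 7 | 1 | None | 35 | None
 | BE/26/2011 | 6 | 3 | 1A | 1 | 1 | 1 | 1 | 2 | 4 | 5 | 4 | 3A | 1B | 2A | 3 | 7 | 1 | None | |
6 | BE/1/2011 | 1 | 1 | 4A | 4A | 4 | 6 | 6 | 7 | 4 | 6 | 2/3 | 4A | 3 | 2A | 1 | 4 | 1 | UL1 UL9 | 65 | These strains bear a UL9 mutation that is present in other strains, and 2 strains share a UL1 mutation
 | BE/8/2010 | 1 | 1 | 4A | 4A | 4 | 6 | 6 | 7 | 4 | 6 | 2/3 | 4A | 3 | 2A | 1 | 4 | 1 | UL9 | |
 | BE/9/2012 | 1 | 1 | 4A | 4A | 4 | 6 | 6 | 7 | 4 | 6 | 2/3 | 4A | 3 | 2A | 1 | 4 | 1 | UL1 UL9 | |
7 | NAN1LA | 3 | 5 | 5 | 5 | 5 | 7 | 3 | 2 | 2 | 6 | 2/3 | 4D | 5 | 3B | 7 | 5 | 2 | RL6 US9 | 73 | These strains share RL6 and US9 mutations that are present in other strains
 | BE/6/2012 | 3 | 5 | 5 | 5 | 5 | 7 | 3 | 2 | 2 | 6 | 2/3 | 4D | 5 | 3B | 7 | 5 | 2 | RL6 US9 US27 | |
8 | BE/7/2012 | 2 | 4 | 1A | 1 | 1 | 4 | 1 | 7 | 5 | 3 | 2/3 | 4A | 3 | 2B | 13 | 5 | 1 | RL5A RL13 UL150 | 125 | These strains share a UL150 mutation that is present in other strains
 | BE/11/2012 | 2 | 4 | 1A | 1 | 1 | 4 | 1 | 7 | 5 | 3 | 2/3 | 4A | 3 | 2B | 13 | 5 | 1 | UL150 | |
 | BE/16/2012 | 2 | 4 | 1A | 1 | 1 | 4 | 1 | 7 | 5 | 3 | 2/3 | 4A | 3 | 2B | 13 | 5 | 1 | UL150 | |
 | BE/26/2010 | 2 | 4 | 1A | 1 | 1 | 4 | 1 | 7 | 5 | 3 | 2/3 | 4A | 3 | 2B | 13 | 5 | 1 | UL150 | |
 | BE/30/2011 | 2 | 4 | 1A | 1 | 1 | 4 | 1 | 7 | 5 | 3 | 2/3 | 4A | 3 | 2B | 13 | 5 | 1 | UL150 | |
9 | JER851 | 1 | 1 | 3 | 3 | 3 | 2 | 1 | 3 | 4 | 6 | 1 | 2 | 2B | 3B | 7 | 4 | 1 | UL1 UL9 UL111A | 135 | These strains share a UL111A mutation that is present in another strain
 | JER4041 | 1 | 1 | 3 | 3 | 3 | 2 | 1 | 3 | 4 | 6 | 1 | 2 | 2B | 3B | 7 | 4 | 1 | UL111A | |
 | BE/25/2010 | 1 | 1 | 3 | 3 | 3 | 2 | 1 | 3 | 4 | 6 | 1 | 2 | 2B | 3B | 7 | 4 | 1 | UL111A | |
10 | JER5695 | 1 | 1 | 7 | 7 | 7 | 1 | 1 | 6 | 2 | 1 | 2/3 | 3B | 2A | 4B | 13 | 2 | 1 | UL9 UL111A | 138 | These strains share a UL111A mutation that is present in other strains, and have different UL9 mutations
 | BE/15/2010 | 1 | 1 | 7 | 7 | 7 | 1 | 1 | 6 | 2 | 1 | 2/3 | 3B | 2A | 4B | 13 | 2 | 1 | RL1 UL9 UL111A | |
11 | PRA7 | 1 | 1 | 4B | 4B | 4 | 9 | 6 | 5 | 5 | 6 | 6 | 4D | 5 | 4B | 10 | 2 | 1 | RL5A UL111A | 143 | These strains share RL5A and UL111A mutations that are present in other strains
 | JP | 1 | 1 | 4B | 4B | 4 | 9 | 6 | 5 | 5 | 6 | 6 | 4D | 5 | 4B | 10 | 2 | 1 | RL5A UL111A | |
 | BE/4/2010 | 1 | 1 | 4B | 4B | 4 | 9 | 6 | 5 | 5 | 6 | 6 | 4D | 5 | 4B | 10 | 2 | 1 | RL5A UL111A | |
12 | BE/6/2011 | 5 | 1 | 4B | 4B | 4 | 9 | 6 | 1 | 5 | 3 | 2/3 | 3A | 1B | 1B | 9 | 5 | 1 | UL9 | 155 | Two strains share a UL9 mutation that is present in other strains
 | BE/18/2011 | 5 | 1 | 4B | 4B | 4 | 9 | 6 | 1 | 5 | 3 | 2/3 | 3A | 1B | 1B | 9 | 5 | 1 | None | |
 | BE/27/2011 | 5 | 1 | 4B | 4B | 4 | 9 | 6 | 1 | 5 | 3 | 2/3 | 3A | 1B | 1B | 9 | 5 | 1 | UL9 | |

^a See Supplementary Figures 1 and 2 for genotype definitions. G prefix omitted.

^b Total number of differences among all strains in the group, not including size variations in tandem repeats. To exclude repeat regions, sequences were aligned from the TATA box of RL1 to the end of US, omitting the region from the AATAAA polyadenylation signal of UL150A to the beginning of TRS.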


Pseudogenes

Among the strains sequenced from clinical material, 77% are mutated in at least one gene (compared with 79% among all sequenced strains), and one is mutated in as many as 6 genes (Pat_D in collection 3) (Supplementary Table 4). The most frequently mutated genes are UL9, RL5A, UL1 and RL6 (members of the RL11 family), US7 and US9 (members of the US6 gene family), and UL111A (encoding viral interleukin 10) (Table 3). In addition, there was evidence from the PAV6 datasets (collection 1) for maternal transmission of a US7 mutant (Supplementary Table 1), and from PCR data (not shown) for maternal transmission of a UL111A mutant to PAV16 (collection 1). Focusing on the most common mutations, strains in which UL9, RL5A, UL1, US9, US7, and UL111A were affected (singly or in combination) were, like strains that were not mutated in any gene, transmitted in congenital infections and, in some cases, linked to defects in neurological development (Supplementary Table 1).
Table 3.

Mutated Genes in Order of Decreasing Frequency

Gene | Feature(s) | Strains Mutated, No.^a: Passaged^b | Strains Mutated, No.^a: Clinical^c | Strains Mutated, No.^a: All^d | Strains Mutated, %^a: Passaged^b | Strains Mutated, %^a: Clinical^c | Strains Mutated, %^a: All^d
UL9 | RL11 family; type 1 membrane protein | 50 | 31 | 81 | 32.89 | 34.07 | 33.33
RL5A | RL11 family | 31 | 27 | 58 | 20.39 | 29.67 | 23.87
UL1 | RL11 family; type 1 membrane protein | 20 | 18 | 38 | 13.16 | 19.78 | 15.64
RL6 | RL11 family | 23 | 14 | 37 | 15.13 | 15.38 | 15.23
US9 | US6 family; type 1 membrane protein | 26 | 11 | 37 | 17.11 | 12.09 | 15.23
UL111A | Viral interleukin-10 | 16 | 7 | 23 | 10.53 | 7.69 | 9.47
UL150 | Unknown | 11 | 3 | 14 | 7.24 | 3.30 | 5.76
US7 | US6 family; type 1 membrane protein | 7 | 7 | 14 | 4.61 | 7.69 | 5.76
UL40 | Type 1 membrane protein | 8 | 2 | 10 | 5.26 | 2.20 | 4.12
UL30 | UL30 family | 2 | 3 | 5 | 1.32 | 3.30 | 2.06
UL142 | MHC family; type 1 membrane protein | 2 | 3 | 5 | 1.32 | 3.30 | 2.06
RL12 | RL11 family; type 1 membrane protein | 3 | 1 | 4 | 1.97 | 1.10 | 1.65
RL1 | RL1 family | 1 | 2 | 3 | 0.66 | 2.20 | 1.23
UL136 | Potential transmembrane domain | 3 | 0 | 3 | 1.97 | 0.00 | 1.23
US13 | US12 family; type 3 membrane protein | 3 | 0 | 3 | 1.97 | 0.00 | 1.23
UL133 | Potential transmembrane domain | 2 | 0 | 2 | 1.32 | 0.00 | 0.82
US6 | US6 family; type 1 membrane protein | 1 | 1 | 2 | 0.66 | 1.10 | 0.82
US8 | US6 family; type 1 membrane protein | 0 | 2 | 2 | 0.00 | 2.20 | 0.82
US27 | GPCR family; type 3 membrane protein | 2 | 0 | 2 | 1.32 | 0.00 | 0.82
UL11 | RL11 family; type 1 membrane protein | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
UL13 | Unknown | 0 | 1 | 1 | 0.00 | 1.10 | 0.41
UL14 | UL14 family; type 1 membrane protein | 0 | 1 | 1 | 0.00 | 1.10 | 0.41
UL15A | Potential transmembrane domain | 0 | 1 | 1 | 0.00 | 1.10 | 0.41
UL20 | Type 1 membrane protein | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
UL43 | US22 family | 0 | 1 | 1 | 0.00 | 1.10 | 0.41
UL99 | Envelope-associated protein | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
UL148 | Type 1 membrane protein | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
UL147 | CXCL family | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
UL145 | Unknown | 0 | 1 | 1 | 0.00 | 1.10 | 0.41
UL150A | Unknown | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
IRS1 | US22 family | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
US1 | US1 family | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
US12 | US12 family; type 3 membrane protein | 1 | 0 | 1 | 0.66 | 0.00 | 0.41
US19 | US12 family; type 3 membrane protein | 0 | 1 | 1 | 0.00 | 1.10 | 0.41

Abbreviations: CXCL, chemokine (CXC motif) ligand; GPCR, G protein–coupled receptor; MHC, major histocompatibility complex.

^a Omitting mutations that occurred in RL13, UL128, UL130, and UL131A probably during passage, or that were engineered during bacterial artificial chromosome construction.

^b Strains sequenced from strains passaged in cell culture, not taking into account the minority of mutations confirmed from the clinical samples (n = 152, excludes CZ/3/2012, which is the same strain as PRA8).

^c Strains sequenced directly from clinical material (n = 91).

^d Strains sequenced directly from clinical material or passaged virus (n = 243).


Intrahost Diversity

LoFreq and V-Phaser analyses showed that single-strain infections contained markedly fewer SNPs (median values of 60 and 140, respectively) than multiple-strain infections (median values of 2444 and 2955, respectively; Figure 2). The differences between the values for single- and multiple-strain infections were significant (Kruskal–Wallis rank-sum test; LoFreq: χ2 = 67.918, P < 2.2 × 10^-16; V-Phaser: χ2 = 63.536, P = 1.6 × 10^-15).
Figure 2.

Box-and-whisker graphs created using ggplot2 (https://ggplot2.tidyverse.org) showing the total number of single-nucleotide polymorphisms (SNPs) detected at a frequency of >2% in single-strain and multiple-strain infections using LoFreq (A) and V-Phaser (B). Single-strain (n = 134 and 131, respectively) and multiple-strain datasets (n = 29 and 29, respectively) for which consensus genome sequences had been derived were identified by motif read-matching, and the total number of SNPs in each dataset was enumerated (insertions, deletions, and length polymorphisms were not considered). LoFreq employed a minimal coverage depth of 10 reads (minimal SNP quality [phred] 64) and strand-bias significance with a false discovery rate correction of P < .001. V-Phaser employed phasing with a window size of 500 nucleotides and quality score (phred) 20 for calibrating the significance of strand-bias at P < .05. Each box (light gray for single strains and dark gray for multiple strains) encompasses the first to third quartiles (Q1–Q3) and shows the median as a thick line. For each box, the horizontal line at the end of the upper dashed whisker marks the upper extreme (defined as the smaller of Q3 + 1.5 × [Q3–Q1] and the highest single value), and the horizontal line at the end of the lower dashed whisker marks the lower extreme (the greater of Q1 – 1.5 × [Q3–Q1] and the lowest single value).
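The whisker definition in the legend can be sketched as follows. This is an illustrative Python version of the stated formula, not the ggplot2 implementation; the quartile convention (`method="inclusive"`) is an assumption, the function name is hypothetical, and the input values are toy numbers.

```python
from statistics import quantiles


def box_whisker_extremes(values):
    """Compute quartiles and whisker extremes as defined in the figure legend."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Upper extreme: the smaller of Q3 + 1.5 * IQR and the highest value.
    upper = min(q3 + 1.5 * iqr, max(values))
    # Lower extreme: the greater of Q1 - 1.5 * IQR and the lowest value.
    lower = max(q1 - 1.5 * iqr, min(values))
    return {"q1": q1, "median": median, "q3": q3, "lower": lower, "upper": upper}


# Toy SNP counts with one outlier (hypothetical values).
toy_counts = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_whisker_extremes(toy_counts)
print(summary)
```

In this toy input the value 100 lies beyond the upper extreme and would therefore be treated as an outlier rather than extending the whisker.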


DISCUSSION

Advances in high-throughput sequencing technology have made it possible to generate a wealth of viral genome information directly from clinical material. However, operational limitations should be recognized. These include sample characteristics (source, viral content and presence of multiple strains), confounding factors (technical limitations, logistical errors and cross-contamination), design of the bait library (ability to enrich all strains and acquire data across the genome), and quality and extent of the sequencing data (library diversity and coverage depth). Since perceived levels of intrahost variation are particularly sensitive to these factors, we proceeded cautiously with this aspect. However, as indicated in our previous study [21], it is clear that the number of SNPs in single-strain infections was markedly lower than that in multiple-strain infections. It was also far lower than that reported by others in samples from congenital infections [16]. The factors listed above may have been responsible for the outliers observed in single-strain infections; for example, the PAV6 (collection 1) library was made using non-TruGrade oligonucleotides, RTR6B (collection 2) had a low coverage depth and also came from a patient from whom other samples contained multiple strains, and CMV-35 (collection 3) may have contained subthreshold levels of additional strains or cross-contaminants. In our view, accurate estimates of the levels of intrahost variation in single-strain infections are not available from the present and previous studies, and will require sequencing and bioinformatic approaches that are demonstrably reliable, robust, and reproducible [46, 47].
Whole-genome analyses have confirmed the significant role of recombination during HCMV evolution reported in numerous earlier studies [12, 19]. Recombination has occurred over a very long period but nonetheless remains limited in extent, with surviving events being more numerous in long regions, less numerous in short regions, and rare or absent in hypervariable regions, consistent with the role of homologous recombination. Recombination frequency may be restricted in some circumstances by functional interdependence within the same protein (eg, gB) or possibly between separate proteins (eg, gN and gO [25, 32, 44]). However, it is not known whether differential recombination due to sequence relatedness is of general biological significance for the virus. Also, strains have circulated that seem not to have recombined for long periods. Application of an evolutionary rate estimated for herpesviruses (3.5 × 10⁻⁸ substitutions/nt/year) [48] implies that these periods may have extended to many thousands of years. Moreover, as suggested by the lack of diversity within genotypes in comparison with the marked diversity among them, the distribution of substitutions in nonrecombinant strains fits with the view that intense diversification of the hypervariable genes occurred early in human or prehuman history [25, 31] and has long since ceased.
Assessing the extent to which recombinants arise and survive in individuals with multiple-strain infections is problematic. Except where populations fluctuate significantly and are sampled serially (eg, RTR1 in collection 2), it is difficult to approach this using short-read data, as they are based on PCR methodologies prone to generating recombinational artefacts. Long- or single-read sequencing technologies and demonstrably reliable bioinformatic approaches are needed. Also, conclusions drawn from transplant recipients, who are immunosuppressed and in whom HCMV populations may be diversified by transplantation from HCMV-positive donors or selected with antiviral drugs, are unlikely to represent other situations, such as maternal transmission via breast milk [39].
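The evolutionary-rate argument above, in which a rate of 3.5 × 10⁻⁸ substitutions/nt/year implies periods of many thousands of years, can be illustrated with a simple strict-clock calculation. This is a sketch under stated assumptions, not necessarily the calculation used in the study: two lineages are assumed to accumulate substitutions independently at the same constant rate, so pairwise divergence d = 2 × rate × time. The function name, the number of differences, and the region length are hypothetical.

```python
# Rate quoted in the text for herpesviruses [48], in substitutions/nt/year.
RATE = 3.5e-8


def years_since_divergence(differences, aligned_length, rate=RATE):
    """Estimate the time since two sequences shared an ancestor.

    Assumes a strict molecular clock and that both lineages accumulate
    substitutions independently, hence the factor of 2.
    """
    divergence = differences / aligned_length  # substitutions per site
    return divergence / (2 * rate)


# Hypothetical example: 70 substitutions across a 100,000-nt nonrecombinant region.
estimate = years_since_divergence(differences=70, aligned_length=100_000)
print(f"Approximate time since divergence: {estimate:,.0f} years")
```

Under these assumptions, even a few dozen substitutions across a long nonrecombinant region correspond to thousands of years, which is in line with the scale of the periods described in the text.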
Evidence for pseudogenes was largely derived previously from strains isolated in cell culture, and it was unclear to what extent pseudogenes are present in natural populations. For example, in a study reporting that 75% of strains carry pseudogenes [12], 157 mutations were identified in 101 strains, with all but one of these strains having been passaged in cell culture, although 35 mutations were confirmed by PCR of the clinical material. Nonetheless, we found that the distribution of pseudogenes among the 91 strains sequenced in the present study directly from clinical material is similar to that among strains isolated in cell culture, thus generally validating the earlier suppositions. The likelihood that many of these mutants are ancient is supported by the finding that all were detected at levels very close to 100% in collection 1, and by previous observations identifying the same mutation in different strains [7, 12]. Moreover, 9 of the groups of nonrecombinant strains contained pseudogenes, and some of the mutations were common to group members and even to additional strains among the 243, indicating that they have been transferred by recombination. The implication that some mutants have a selective advantage in certain individuals may be extended to their presence in pathogenic congenital infections, probably in combination with host factors. The genes from which pseudogenes have arisen are involved, or are suspected to be involved, in immune modulation. They include UL111A, which encodes viral interleukin 10 [49]; UL40, which is involved in protecting infected cells against natural killer cell lysis [50] via its cleaved signal peptide, in which mutations occur; and UL9, which bears a potential immunoglobulin-binding domain [2]. These findings also suggest, but do not prove, that maternal HCMV genotyping might be useful in developing strategies for preventing congenital CMV.
Modern approaches offer a powerful means for analyzing HCMV genomes directly from clinical material, with the important proviso that the data should be quality assessed and interpreted in the context of the known evolutionary and biological characteristics of the virus. Extensive high-throughput sequence data are likely to illuminate further the epidemiology, pathogenesis, and evolution of HCMV in clinical and natural settings, thus facilitating the identification of virulence determinants and the development of new interventions.

Supplementary Data

Supplementary materials are available at The Journal of Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.