
Breaking the 1000-gene barrier for Mimivirus using ultra-deep genome and transcriptome sequencing.

Matthieu Legendre, Sébastien Santini, Alain Rico, Chantal Abergel, Jean-Michel Claverie.

Abstract

BACKGROUND: Mimivirus, a giant dsDNA virus infecting Acanthamoeba, is the prototype of the Mimiviridae family, the latest addition to the nucleocytoplasmic large DNA viruses (NCLDVs). Its 1.2-Mb genome was initially predicted to encode 917 genes. A subsequent RNA-Seq analysis precisely mapped many transcript boundaries and identified 75 new genes.
FINDINGS: We now report a much deeper analysis using the SOLiD™ technology, combining RNA-Seq of the Mimivirus transcriptome during the infectious cycle (202.4 million reads) and a complete genome re-sequencing (45.3 million reads). This study corrected the genome sequence and identified several single nucleotide polymorphisms. Our results also provided clear evidence of previously overlooked transcription units, including an important RNA polymerase subunit distantly related to Euryarchaea homologues. The total Mimivirus gene count is now 1018, 11% greater than the original annotation.
CONCLUSIONS: This study highlights the huge progress brought about by ultra-deep sequencing for the comprehensive annotation of virus genomes, opening the door to a complete one-nucleotide resolution level description of their transcriptional activity, and to the realistic modeling of the viral genome expression at the ultimate molecular level. This work also illustrates the need to go beyond bioinformatics-only approaches for the annotation of short protein and non-coding genes in viral genomes.

Year:  2011        PMID: 21375749      PMCID: PMC3058096          DOI: 10.1186/1743-422X-8-99

Source DB:  PubMed          Journal:  Virol J        ISSN: 1743-422X            Impact factor:   4.099


Findings

Mimivirus, a nucleocytoplasmic large double-stranded DNA virus infecting Acanthamoeba species, is the largest virus identified to date. Its icosahedral fibrillated capsid has a diameter of 750 nm. Besides its outstanding particle size, the genome of Mimivirus is also exceptional in both size and complexity. The initial sequencing revealed a linear genome of 1,181,404 nt (roughly the size of the genome of the spirochaete bacterium Treponema pallidum) harboring 911 protein-coding genes and 6 tRNAs [1]. Some of these genes were observed for the first time in a virus, the most salient being those involved in protein translation and DNA repair. These unique features reawakened conceptual discussions on the nature of viruses and the frontier between viruses and cellular organisms [2-4]. We recently reported the first RNA-Seq study of a large DNA virus using the 454-Flex technology [5]. The transcriptome analysis of Mimivirus during its infection cycle modified the initial gene map in several respects. First, the exact mapping of polyadenylated transcripts allowed the precise location of untranslated regions (UTRs) and intron-exon boundaries. Comparison of the RNA-Seq reads to the reference genome also corrected some phase-shifting sequencing errors, causing a few ORFs to be merged. In addition, 75 new genes were revealed by their transcripts, among them 26 non-coding RNA genes that could not be identified by ORF-based gene-finding approaches. Such transcriptome analyses using massively parallel pyrosequencing nicely complemented ab initio bioinformatic annotations. However, one limitation inherent to the RNA-Seq approach is that sequence reads are unevenly distributed along the genome. Genomic positions located in weakly expressed genes and intergenic regions exhibit a lower coverage and are thus less likely to be corrected.
To circumvent these limitations while keeping the power of RNA-Seq for gene discovery, we performed a comprehensive re-sequencing and thorough re-annotation of the Mimivirus genome using two larger and complementary data sets: ultra-deep sequencing of genomic DNA and of total RNA, both on the SOLiD™ platform. The total number of generated 50-bp reads was about 45 million for the genomic DNA dataset and about 202 million for the total RNA dataset. This huge amount of new data allowed us to i) further improve the quality of the Mimivirus genome sequence, ii) identify polymorphic genomic positions (SNPs), and iii) discover previously overlooked genes, one of which encodes an RNA polymerase II subunit, increasing the Mimivirus gene count to 1018.

A new Mimivirus reference genome sequence

The Mimivirus genomic DNA library was constructed using 4.7 μg of input DNA with the SOLiD™ Fragment Library Construction kit (standard protocol). After emulsion PCR the monoclonal beads were loaded on one quarter of a slide of a SOLiD™ 3 Plus System and sequenced (50-base pair reads) with the SOLiD™ Opti Fragment Library Sequencing chemistry. This raw sequence dataset (45,275,001 genomic reads) was used to build iteratively improved versions of the Mimivirus genome sequence, using the following bioinformatic pipeline (Figure 1). Starting from the original genome sequence (RefSeq ID NC_006450) as template, we first mapped the reads onto it using the Bfast program [6] in the color space, with default parameters for the match, localalign and postprocess subroutines. To avoid overweighting of some genomic positions caused by inhomogeneous PCR amplifications, we removed duplicated reads with the MarkDuplicates subroutine (Picard program suite: http://picard.sourceforge.net). To improve the base-resolution consensus, a micro re-alignment was performed on each read with the SRMA program [7]. With this stringent selection we only used the best representatives (4 to 5%) of the initial dataset. The mapped dataset was then searched for variants (substitutions or indels) using the Samtools [8] and VarScan [9] programs. A substitution relative to the (current) reference genome was validated when represented in more than 70% of the aligned reads. Indels were validated when represented in more than 60% of the aligned reads. The validated variations were then incorporated into a new version of the genome sequence that became the new reference for the next round of corrections. The procedure was applied iteratively until convergence, i.e. until no more indels or substitutions were validated, for a total of 14 cycles. The final 1,181,549 nucleotide-long genome sequence resulting from the above corrections is now the reference Mimivirus genome sequence (RefSeq ID NC_014649).
It differs by 196 substitutions, 29 deletions and 174 insertions from the original genome sequence (RefSeq ID NC_006450).
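The validation thresholds and the convergence loop described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `Variant` record and the `call_variants` and `apply_variants` callables are hypothetical stand-ins for the Bfast, Picard, SRMA, Samtools and VarScan steps, and counting the final empty cycle in the returned cycle number is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Thresholds from the text: a variant is validated when supported by MORE
# than this fraction of the reads aligned at its position.
SUBSTITUTION_MIN_FRACTION = 0.70
INDEL_MIN_FRACTION = 0.60


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one candidate difference from the reference."""
    position: int          # coordinate on the current reference
    kind: str              # "substitution", "insertion" or "deletion"
    supporting_reads: int  # aligned reads carrying the variant
    aligned_reads: int     # all reads aligned at this position


def is_validated(v: Variant) -> bool:
    """Apply the read-fraction threshold that matches the variant type."""
    if v.aligned_reads == 0:
        return False
    if v.kind == "substitution":
        threshold = SUBSTITUTION_MIN_FRACTION
    else:
        threshold = INDEL_MIN_FRACTION
    return v.supporting_reads / v.aligned_reads > threshold


def correct_until_convergence(
    reference: str,
    call_variants: Callable[[str], Iterable[Variant]],
    apply_variants: Callable[[str, list], str],
    max_cycles: int = 50,
) -> Tuple[str, int]:
    """Repeat map/call/apply until a cycle validates no variant.

    call_variants and apply_variants are assumed helpers supplied by the
    caller. Returns the final reference and the number of cycles run,
    counting the last cycle in which nothing was validated.
    """
    for cycle in range(1, max_cycles + 1):
        validated = [v for v in call_variants(reference) if is_validated(v)]
        if not validated:
            return reference, cycle
        reference = apply_variants(reference, validated)
    raise RuntimeError("no convergence within max_cycles")
```

Because each corrected sequence becomes the template for the next round of mapping, reads that failed to align near an uncorrected error can map in a later cycle, which is why a single pass is not sufficient.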
Figure 1

Flow chart of the Mimivirus genome correction pipeline. The upper panel illustrates the correction procedure and the lower panel the annotation method. Colors are used for clarity: datasets are in purple, genomes are in green, sequence manipulations (mapping, duplicate removal, or modifications) are in yellow, computation steps are in blue and genes in red. The upper left graph represents the decrease in substitutions (in red) and indels (in black) identified during the iterative genome correction process, together with the increase in the total number of reads (in green) mapped to genome.


Identification of single nucleotide polymorphisms

Next-generation sequencing platforms are now providing deep enough data to readily identify single nucleotide polymorphisms (SNPs). While using SOLiD™ reads in the course of the above correction procedure, we observed a number of polymorphic positions that could not be interpreted as sequencing errors given their high frequency of occurrence. SNPs in the Mimivirus genome were then systematically pinpointed as follows: we recorded all the positions with a nucleotide differing from the reference genome sequence in more than 10% of the aligned reads and seen at least once on both strands. In addition, we excluded all the variant positions less than 25 nt apart as they could correspond to mapping errors. The same procedure was independently applied to extract the polymorphic positions showing in 10% or more of the reads within the SOLiD™ RNA-seq dataset described hereafter. We then took the intersection of these two independent analyses to confidently identify 27 SNPs in the Mimivirus genome (see Table 1).
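The SNP selection rules above can be expressed as a short filter. The sketch below is illustrative only: the `Pileup` record is a hypothetical per-position summary, and two points are interpretations rather than statements from the text, namely that a position is discarded whenever any other candidate lies less than 25 nt away, and that the strand requirement is also applied to the RNA-seq data.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_ALT_FRACTION = 0.10  # variant allele frequency threshold (from the text)
MIN_SPACING_NT = 25      # candidates closer than this are discarded


@dataclass(frozen=True)
class Pileup:
    """Hypothetical per-position summary of the aligned reads."""
    position: int
    depth: int        # reads aligned at this position
    alt_forward: int  # variant-allele reads on the forward strand
    alt_reverse: int  # variant-allele reads on the reverse strand


def candidate_positions(pileups: Iterable[Pileup], inclusive: bool = False) -> List[int]:
    """Positions whose variant allele passes the frequency and strand rules.

    inclusive=False means "more than 10%" (genomic reads);
    inclusive=True means "10% or more" (RNA-seq reads).
    """
    kept = []
    for p in pileups:
        if p.depth == 0 or p.alt_forward < 1 or p.alt_reverse < 1:
            continue  # variant must be seen at least once on both strands
        fraction = (p.alt_forward + p.alt_reverse) / p.depth
        passes = fraction >= MIN_ALT_FRACTION if inclusive else fraction > MIN_ALT_FRACTION
        if passes:
            kept.append(p.position)
    return kept


def drop_clustered(positions: Iterable[int]) -> List[int]:
    """Remove every position that has a neighbour less than 25 nt away."""
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        near_prev = i > 0 and pos - ordered[i - 1] < MIN_SPACING_NT
        near_next = i + 1 < len(ordered) and ordered[i + 1] - pos < MIN_SPACING_NT
        if not (near_prev or near_next):
            kept.append(pos)
    return kept


def confident_snps(dna: Iterable[Pileup], rna: Iterable[Pileup]) -> List[int]:
    """Intersect the two independently filtered position sets."""
    dna_set = set(drop_clustered(candidate_positions(dna)))
    rna_set = set(drop_clustered(candidate_positions(rna, inclusive=True)))
    return sorted(dna_set & rna_set)
```

Requiring agreement between two independently prepared libraries means that a library-specific artifact, such as a PCR error, is unlikely to be reported as a polymorphism.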
Table 1
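| Genomic position | Gene | Gene annotation | Codon (SNP position in bold) | Reference allele | Reference allele coverage (%) | Second allele | Second allele coverage (%) | Reference encoded AA | Second allele encoded AA |
|---|---|---|---|---|---|---|---|---|---|
| 2746 | L1c | Uncharacterized probable non-coding RNA gene | - | C | 86.6 | T | 13.4 | - | - |
| 5402 | L3 | Uncharacterized protein | GAA | G | 78.0 | A | 22.0 | E | K |
| 9911 | L6 | Uncharacterized protein | GTA | A | 74.2 | G | 25.8 | V | V |
| 22248 | R13 | Uncharacterized protein | TAT | T | 83.7 | G | 16.3 | Y | * |
| 28580 | L18 | Putative sel1-like repeat-containing protein | ATT | A | 76.9 | T | 23.1 | I | F |
| 47300 | L37 | Putative KilA-N domain-containing protein | ATC | A | 86.3 | G | 13.7 | I | V |
| 54207 | L42 | Putative ankyrin repeat protein | TTG | A | 63.8 | G | 36.2 | L | V |
| 97232 | L77b | Uncharacterized protein | - | C | 88.3 | T | 11.7 | A | V |
| 166952 | R135 | Putative GMC-type oxidoreductase | GAT | T | 87.0 | C | 13.0 | D | D |
| 322426 | L254 | Heat shock protein 70 homolog | ATT | T | 88.9 | A | 11.1 | I | I |
| 328586 | R260 | DnaJ-like protein | TTC | T | 81.3 | G | 18.8 | F | V |
| 329434 | R261 | Uncharacterized protein | CAA | A | 85.7 | C | 14.3 | Q | H |
| 399891 | R313 | Ribonucleoside-diphosphate reductase large subunit | ATT | A | 83.9 | C | 16.1 | I | L |
| 440978 | R343 | Probable ribonuclease 3 | TGG | T | 89.9 | A | 10.1 | W | R |
| 483113 | R367 | Uncharacterized protein | AAA | A | 86.2 | T | 13.8 | K | I |
| 504876 | - | - | - | T | 88.0 | G | 12.0 | - | - |
| 601715 | L454 | Uncharacterized protein | ATC | T | 87.3 | C | 12.7 | I | T |
| 649432 | L485 | Uncharacterized protein | GAA | A | 88.9 | C | 11.1 | E | D |
| 655506 | L490 | Uncharacterized protein | ACC | G | 85.1 | T | 14.9 | T | I |
| 734179 | R547 | Uncharacterized protein | AAC | A | 84.7 | C | 15.3 | N | H |
| 736530 | R549b | Uncharacterized probable non-coding RNA gene | - | T | 88.0 | C | 12.0 | - | - |
| 787617 | L594 | Uncharacterized protein | AAA | A | 73.5 | C | 26.5 | K | T |
| 918583 | R699 | Uncharacterized protein | AAA | A | 87.5 | C | 12.5 | K | N |
| 939044 | R714 | Uncharacterized protein | TTT | T | 69.0 | G | 31.0 | F | C |
| 962204 | R735 | Uncharacterized protein | CAA | A | 82.3 | C | 17.7 | Q | H |
| 1069573 | R822 | Uncharacterized protein | ATT | A | 89.2 | G | 10.8 | I | V |
| 1170156 | R903 | Putative ankyrin repeat protein | TTT | T | 89.1 | G | 10.9 | F | C |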
The number of synonymous substitutions (3 out of 24 coding SNPs) is surprisingly low compared to non-synonymous substitutions. Although paradoxical at first glance, such a high proportion of non-synonymous substitutions was already noticed when comparing closely related bacterial strains exhibiting a small number of mutations [10]. This is usually explained by the fact that those mutations are not deleterious enough to be rapidly eliminated from the population, i.e. the observed variations are not yet fixed. Accordingly, the observed distribution of non-synonymous vs. synonymous variations is not significantly different from what is expected by chance from the relative frequency of non-synonymous (79%) vs. synonymous (21%) substitutions computed from the Mimivirus genome codon composition (Fisher exact test on the counts [3, 21; 5, 19], p > 0.7) [11]. To our knowledge this is the first genome-wide SNP analysis of a large DNA virus. It remains to be determined whether the observed polymorphisms are representative of the true Mimivirus population diversity.
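The Fisher exact test quoted above can be reproduced with the standard library alone. The sketch below assumes the usual two-sided definition (summing the probabilities of all tables with the same margins that are no more likely than the observed one); the function name is ours, and the expected counts are obtained by rounding 21% and 79% of the 24 coding SNPs.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> Fraction:
        # Hypergeometric probability of x counts in the top-left cell.
        return Fraction(comb(row1, x) * comb(row2, col1 - x), denom)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Exact fractions avoid spurious ties or near-ties in the comparison.
    tail = sum(p for p in (prob(x) for x in range(lowest, highest + 1)) if p <= observed)
    return float(tail)


coding_snps = 24
observed_syn, observed_nonsyn = 3, 21               # counted from Table 1
expected_syn = round(0.21 * coding_snps)            # 5
expected_nonsyn = coding_snps - expected_syn        # 19

p_value = fisher_exact_two_sided(observed_syn, observed_nonsyn,
                                 expected_syn, expected_nonsyn)
```

With these counts the p-value comes out slightly above 0.7, in line with the value reported in the text, so the excess of non-synonymous changes is not statistically distinguishable from the codon-composition expectation.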

Mimivirus genome harbors 1018 genes

In addition to correcting the genome sequence we sought to thoroughly revise the Mimivirus gene annotation (Figure 1). We first identified the open reading frames (ORFs) using the "self-training" option of the Genemark™ program suite [12]. Beyond ORF annotation we delineated the exact boundaries of transcripts using two large transcriptome data sets: one from a previously published study of Mimivirus polyadenylated RNAs [5], the other from a SOLiD™ sequencing of total RNA. The latter was generated from nine barcoded transcriptome libraries constructed at various times during the entire Mimivirus infection cycle, each using 1 μg of total RNA from Acanthamoeba castellanii cells with the SOLiD™ Whole Transcriptome Analysis kit, and pooled at equimolar concentrations. After emulsion PCR the monoclonal beads were loaded on one slide of a SOLiD™ 3 Plus System and sequenced (50 base pairs) with the SOLiD™ Opti Fragment Library Sequencing chemistry. A total of 202,436,309 reads were generated and subsequently aligned to the Mimivirus genome using Bfast [6]. The two combined RNA-seq datasets allowed the unambiguous identification of the 5' end of 555 Mimivirus transcripts as well as the 3' end of 601 transcripts at single base-pair resolution. We completed the genome annotation by mapping previously identified transcription regulation signals (i.e. the palindromic transcription termination signal [13], the early expression promoter element [14] and the late expression promoter element [5]) using the previously described protocols [5]. The combination of the deep transcriptome data mentioned above with the location of the predicted regulatory elements led to a substantial update of the Mimivirus gene map. The Appendix lists the new genes identified from previously overlooked transcripts, as well as the new genes resulting from the correction of phase-shifting sequencing errors.
The Mimivirus gene number is now 1018, among which 979 putatively encode proteins, 6 encode tRNAs and 33 correspond to non-coding RNA genes. All these annotations are now included in the new reference Mimivirus entry (RefSeq NC_014649).
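As a quick consistency check of the tally, the category counts given above can be summed and compared with the original annotation (911 protein-coding genes and 6 tRNAs). The variable names below are ours; the numbers are those stated in the text.

```python
# Gene categories in the revised annotation, as stated in the text.
gene_counts = {"protein-coding": 979, "tRNA": 6, "non-coding RNA": 33}

# Original annotation: 911 protein-coding genes plus 6 tRNAs.
original_total = 911 + 6

total = sum(gene_counts.values())
increase_pct = 100 * (total - original_total) / original_total

print(f"{total} genes, {increase_pct:.1f}% more than the original {original_total}")
```

The sum gives the 1018 genes quoted in the section title, and the relative increase rounds to the 11% stated in the abstract.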

One more Mimivirus-encoded component of the transcription apparatus

Mimivirus was already known to encode many (if not all) of the components of its transcription apparatus: the two largest RNA polymerase II subunits (R501 and L244), and four smaller subunits: Rpb3/Rpb11 (R470), Rpb5 (L235), Rpb6 (R209), Rpb7/E (L376). Mimivirus also possesses its own poly(A) polymerase (R341), and a series of transcription factors (L250, R339, R350, R429, R450, R559). Such a virally-encoded transcription system is required because Mimivirus genes are transcribed within well-defined cytoplasmic virion factories, with little or no participation of the host transcription apparatus localized in the cell nucleus. In order to bootstrap the infectious cycle, the above Mimivirus genes follow a late expression pattern allowing their protein products to be incorporated in the mature virions [15]. It turned out that the inventory of Mimivirus transcription-associated genes was not yet complete. Deeper sequencing of the total RNA of Mimivirus-infected cells revealed a transcriptional activity (classified as "late") between genes L357 and R358 (Figure 2A). This location corresponds to a short ORF (now denoted R357b) spanning 73 residues that exhibited no significant database similarity at the time of our original annotation [1]. However, analyzing this predicted amino-acid sequence now suggests that it is a divergent homologue of the subunit N of RNA polymerase II. Interestingly, the closest relative (30% identity) of this new Mimivirus protein is found (Figure 2B) within the recently published 730 kb-genome of a giant virus infecting the marine microflagellate Cafeteria roenbergensis [16]. These findings strongly suggest that R357b encodes a real protein, thus adding one more component to the already complex transcriptional machinery of Mimivirus.
We hope that the accurate genome sequence and comprehensive transcript map now available for Mimivirus will make it a reference micro-organism for future experimental and computational studies aiming at elucidating the physiology of giant DNA viruses.
Figure 2

Discovery of a component of the Mimivirus transcription apparatus. A) Mimivirus genome browser (URL: http://www.igs.cnrs-mrs.fr/mimivirus/) screenshot showing the newly discovered component of the transcription apparatus (R357b) in its genomic context. Three informative tracks are displayed: the protein coding genes, the late gene expression signals, and the gene expression data from the SOLiD™ RNA-seq experiment. Transcriptome data are shown at each genomic position (for each of the 9 samples) going from white (not expressed) to red (highly expressed) in the forward strand, and white to blue (highly expressed) in the reverse strand. B) Protein sequence alignment of the Mimivirus R357b gene product and the most similar homologous sequences from the giant virus CroV and the two archaea Methanocella paludicola and Ferroplasma acidarmanus.


Appendix

List of newly identified genes: R2b, L10c, R13b, R14, R14b, L34b, L37b, L38b, R61b, R61c, L61d, L66b, L78b, L83b, L88b, L98b, L173b, L174b, R191c, R213b, L309c, R328b, R357b, R365b, R437b, R437c, R449b, L482b, R485b, L487b, R538c, R559b, L565b, L577b, R607b, R661b, R676b, L681b, L684b, L692b, L696b, L769b, L794b, R878b, R884b, R908b, R910b, L911b, L911c. List of genes generated from the fusion of previously identified ORFs: L91/L90, L93/L92, R391/R392, R527/R528, R568/R569, R744/R745, R844/R845. List of deleted or renamed genes: L14, L61b, R70, R847, R886.

Competing interests

Life Technologies financed the chemicals and sequencing for the project. AR is an employee of Life Technologies. There are no other financial or non-financial competing interests.

Authors' contributions

ML designed the study, conducted the data analysis and wrote the manuscript; SS participated in the data analysis and drafted the manuscript; AR performed library construction and sequencing; CA participated in the data analysis, produced the initial material and drafted the manuscript; JMC designed the study and wrote the manuscript. All authors read and approved the final manuscript.
References (16 in total)

1.  GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.

Authors:  J Besemer; A Lomsadze; M Borodovsky
Journal:  Nucleic Acids Res       Date:  2001-06-15       Impact factor: 16.971

2.  The 1.2-megabase genome sequence of Mimivirus.

Authors:  Didier Raoult; Stéphane Audic; Catherine Robert; Chantal Abergel; Patricia Renesto; Hiroyuki Ogata; Bernard La Scola; Marie Suzan; Jean-Michel Claverie
Journal:  Science       Date:  2004-10-14       Impact factor: 47.728

3.  Mimivirus gene promoters exhibit an unprecedented conservation among all eukaryotes.

Authors:  Karsten Suhre; Stéphane Audic; Jean-Michel Claverie
Journal:  Proc Natl Acad Sci U S A       Date:  2005-10-03       Impact factor: 11.205

4.  Comparisons of dN/dS are time dependent for closely related bacterial genomes.

Authors:  Eduardo P C Rocha; John Maynard Smith; Laurence D Hurst; Matthew T G Holden; Jessica E Cooper; Noel H Smith; Edward J Feil
Journal:  J Theor Biol       Date:  2005-10-18       Impact factor: 2.691

Review 5.  Redefining viruses: lessons from Mimivirus.

Authors:  Didier Raoult; Patrick Forterre
Journal:  Nat Rev Microbiol       Date:  2008-03-03       Impact factor: 60.633

Review 6.  Mimivirus.

Authors:  J M Claverie; C Abergel; H Ogata
Journal:  Curr Top Microbiol Immunol       Date:  2009       Impact factor: 4.291

7.  The polyadenylation site of Mimivirus transcripts obeys a stringent 'hairpin rule'.

Authors:  Deborah Byrne; Renata Grzela; Audrey Lartigue; Stéphane Audic; Sabine Chenivesse; Stéphanie Encinas; Jean-Michel Claverie; Chantal Abergel
Journal:  Genome Res       Date:  2009-04-29       Impact factor: 9.043

8.  VarScan: variant detection in massively parallel sequencing of individual and pooled samples.

Authors:  Daniel C Koboldt; Ken Chen; Todd Wylie; David E Larson; Michael D McLellan; Elaine R Mardis; George M Weinstock; Richard K Wilson; Li Ding
Journal:  Bioinformatics       Date:  2009-06-19       Impact factor: 6.937

9.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

10.  Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA.

Authors:  Nils Homer; Stanley F Nelson
Journal:  Genome Biol       Date:  2010-10-08       Impact factor: 13.583
