Three minimal sequences found in Ebola virus genomes and absent from human DNA.

Raquel M Silva¹, Diogo Pratas², Luísa Castro¹, Armando J Pinho², Paulo J S G Ferreira².

Abstract

MOTIVATION: Ebola virus causes high-mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available; thus, novel diagnostic tools and druggable targets are needed.
RESULTS: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences with lengths between 12 and 14. Only three absent sequences of length 12 exist and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example.

Substances:
DNA, Viral
Viral Proteins

Year: 2015 PMID: 25840045 PMCID: PMC4514932 DOI: 10.1093/bioinformatics/btv189

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Ebola virus (EBOV) is a negative-strand RNA virus from the Filoviridae family that causes high-mortality hemorrhagic fevers, for which no vaccine or treatment currently exists (Sarwar ). There are five Ebolavirus species, namely Zaire ebolavirus, Sudan ebolavirus, Bundibugyo ebolavirus, Tai Forest ebolavirus and Reston ebolavirus, with the first (1976) and largest (2014) outbreaks caused by the type species Zaire ebolavirus (Baize ). The figures for this largest-ever EBOV outbreak are worrying and continue to escalate, with over 25 000 cases and 10 000 deaths, mainly in Guinea, Liberia and Sierra Leone, according to the World Health Organization. The current outbreak is also the first in which transmission has occurred outside Africa, with reported cases in Europe (Spain) and America (USA; Butler and Morello, 2014). Promising vaccine candidates are being rushed into testing to face the epidemic and could be available within a few months (Gulland, 2014). These still-experimental therapies include, for example, recombinant viral vectors (Jones ) or antibodies that target the viral glycoprotein (GP; Friedrich ; Sarwar ), but innovative approaches are still needed for the development of diagnostic tools and the identification of druggable targets.
Minimal absent words are the shortest sequence fragments that are not present in the genomic data of a given organism. They have been studied before to describe properties of prokaryotic and eukaryotic genomes and to develop methods for phylogeny construction or PCR primer design (Chairungsee and Crochemore, 2012; Falda ; Garcia ; Herold ; Pinho ; Wu ). Here, we introduce minimal relative absent words (RAWs), a concept that has not so far been used in the context of personalized medicine, but that is useful for the differential identification of sequences that are derived from a pathogen genome but absent from its host.
We use the recently published sequences from the current EBOV outbreak (Gire ) to discover and characterize the minimal RAWs that are present in EBOV genomes but absent from the human genome. Moreover, we show that these words are also absent from the other Ebolavirus species and even from the genomes obtained from previous outbreaks. Thus, the sequences that we identify are species-specific and important for the future development of diagnostic or therapeutic strategies for EBOV. The method that we introduce can be applied to other emerging pathogens or used to reveal evolutionary patterns and signatures across species.

2 Methods

2.1 Relative absent words

Consider a target sequence (e.g. a virus sequence), x, and a reference sequence (e.g. the human genome), y, both drawn from the finite alphabet $\Sigma = \{A, C, G, T\}$. We say that α is a factor of x if x can be expressed as x = uαv, with uv denoting the concatenation of sequences u and v. We denote by $\mathcal{W}_k(x)$ the set of all k-size words (or factors) of x. Also, we represent the set of all k-size words not in x as $\overline{\mathcal{W}_k(x)}$. For each word size k, we define the set of all words that exist in x but do not exist in y by $\mathcal{R}_k(x, y) = \mathcal{W}_k(x) \cap \overline{\mathcal{W}_k(y)}$, and the subset of words that are minimal, in the sense presented in Pinho , as $\mathcal{M}_k(x, y) = \{\alpha \in \mathcal{R}_k(x, y) : \mathcal{W}_{k-1}(\alpha) \cap \mathcal{M}_{k-1}(x, y) = \emptyset\}$, i.e. a minimal absent word of size k cannot contain any minimal absent word of size less than k. In particular, lαr is a minimal absent word of sequence x, where l and r are single letters from $\Sigma$, if lαr is not a word of x but both lα and αr are (Pinho ). In this work, we were particularly interested in the non-empty set $\mathcal{M}_k(x, y)$ corresponding to the smallest k. These are referred to as RAWs.
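
The definition above can be illustrated with a minimal Python sketch. It is not the EAGLE implementation: it stores k-mers in plain in-memory sets, so it is only suitable for toy inputs, and it considers the forward strand only. The function names (`kmers`, `relative_absent_words`, `minimal_raws`) and the two short sequences are hypothetical, chosen for illustration. At the smallest k for which words of x are absent from y, every such word is minimal, because no shorter relative absent word exists for it to contain.

```python
from typing import Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return every distinct word of length k that occurs in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relative_absent_words(x: str, y: str, k: int) -> Set[str]:
    """Return the k-size words present in target x but absent from reference y."""
    return kmers(x, k) - kmers(y, k)


def minimal_raws(x: str, y: str, k_max: int) -> Tuple[int, Set[str]]:
    """Scan k = 1..k_max and return the smallest k with a non-empty result.

    Returns (0, empty set) when no relative absent word exists up to k_max.
    """
    for k in range(1, k_max + 1):
        raws = relative_absent_words(x, y, k)
        if raws:
            return k, raws
    return 0, set()


# Hypothetical toy sequences (not real viral or human data).
target = "ACGTTGCA"
reference = "ACGTACGGTTGA"
k, raws = minimal_raws(target, reference, k_max=8)
print(k, sorted(raws))  # 2 ['CA', 'GC']
```

In this toy case every single letter of the target occurs in the reference, so k = 1 yields nothing, whereas the 2-mers GC and CA occur in the target only.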

2.2 Protein structural models

Protein 3D structural models were built by homology modeling as previously described (Duarte-Pereira ). Appropriate templates were selected from PDB (www.rcsb.org; Berman ), where several nucleoprotein (NP) structures from viruses within Mononegavirales (negative-sense, single-stranded RNA viruses) are available, whereas for the region of interest in the L-protein only structures from more distant viruses exist. Structures from the Nipah virus NP (PDB ID: 4CO6; Yabukarski ) and the BVDV (bovine viral diarrhea virus) RNA polymerase (PDB ID: 1S48; Choi ) were used as templates in MODELLER (Eswar ; Sali and Blundell, 1993) to predict the structure of the N-terminal regions of Ebola virus NP (residues 1–380) and RNA polymerase (residues 177–805), respectively (Supplementary Figs. S3 and S4). The accuracy of the predicted models (Supplementary Fig. S5) was estimated using ProSA-web (https://prosa.services.came.sbg.ac.at/prosa.php; Sippl, 1993; Wiederstein and Sippl, 2007) and structures were visualized with PyMOL (Schrodinger, 2010).

3 Results

To identify RAWs, we have developed the EAGLE tool, which implements the method described above (Supplementary Data). We used the full GRCh38 human reference genome (Church ) downloaded from the NCBI, including the mitochondrial, unplaced and unlocalized sequences. The sequences of 99 EBOV genomes from the current outbreak in Sierra Leone (Gire ) and an additional 66 Ebolavirus genomes were also downloaded from NCBI (Supplementary Table S1). The code used in this analysis is available (Pratas, 2015). Figure 1 shows the computation for word sizes 12, 13 and 14 (for computer characteristics see Supplementary Section Software and Hardware). As expected, the number of absent words decreases as the k-mer size decreases. Specifically, for k = 11 (not represented), there are no EBOV RAWs. On the other hand, for k = 12, three groups of points emerge (RAW1, RAW2 and RAW3), representing the position of a RAW in each of the 99 unaligned viral genomes (Fig. 1a).

Fig. 1.

Ebola virus minimal absent words relative to the complete human genome. (a) RAWs were identified in 99 unaligned genomes from the current outbreak in Sierra Leone (2014) and are highlighted in red (k = 12, arrows), blue (k = 13) and grey (k = 14). (b) Whole genome alignments from 124 published Ebolavirus genomes were obtained from Gire and visualized in Geneious (created by Biomatters, available from http://www.geneious.com). Sequence logos and identity define conserved regions. (c) Regions corresponding to the identified RAWs are shown in genome location and both as nucleic acid and protein alignments. The Ebolavirus reference genomes are displayed, as well as selected representative sequences where nucleotide differences are observed.

Alignments of 124 Ebolavirus sequences (Gire ), including additional EBOV genomes from the current outbreak in Guinea (Baize ) and from previous outbreaks, show that the identified minimal RAWs fall into conserved protein regions (Fig. 1b). However, several mutations can be found in the genomes that discriminate between the different species of Ebolavirus and even between EBOV sequences from the current and previous outbreaks (Fig. 1c). The identification of these viral genome signatures is important for quick diagnosis in outbreak scenarios. Additional analysis with all 165 Ebolavirus genomes confirmed these results (Supplementary Fig. S1). In particular, RAW1 is conserved within EBOV and can distinguish EBOV from other Ebolavirus species. RAW2 is conserved in all sequences from the West African 2014 outbreak in Guinea, Sierra Leone and Liberia, and only one nucleotide difference exists between these sequences and unrelated outbreak genomes. RAW3 is also conserved at the species level, excluding the four EBOV 1976/77 genomes, and can distinguish between all Ebolavirus species (Supplementary Fig. S2).
Of the three EBOV sequence motifs absent from the human genome, the first (RAW1) lies within the virus NP, while the other two (RAW2 and RAW3) fall within the sequence of the viral RNA polymerase (L-protein; Fig. 1c). Previous studies show that the N-terminal region of EBOV NP participates both in the formation of nucleocapsid-like structures through NP–NP interactions and in the replication of the viral genome (Watanabe ), and the RAW1 sequence (TTTCGCCCGACT) is part of this N-terminal region. The L-protein (LP) produces the viral transcripts to be translated by host ribosomes and is also involved in the replication of the viral genome. The LP contains the two additional minimal RAWs, RAW2 (TACGCCCTATCG) and RAW3 (CCTACGCGCAAA). Both NP and LP are critical for the virus life cycle and constitute good targets for therapeutic intervention. Screening for new antiviral compounds could benefit from knowledge of their protein structures. For EBOV, most protein structures are unknown except for the C-terminal domain of NP, GP, VP24 and VP35 (Shurtleff ); thus, we have predicted the structure of the N-terminal regions of the EBOV NP and LP by homology modeling (Supplementary Figs. S3–S5). These structural models show that the amino acids corresponding to the RAW1 motif are enclosed within the structure, while RAW2 and RAW3 are exposed at the protein surface, which may explain the higher degree of conservation of RAW1.
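
The position of each 12-mer RAW in an unaligned genome, as plotted in Fig. 1a, can be recovered with a plain substring search. The sketch below is illustrative only: the helper names (`find_positions`, `locate_raws`) and the toy sequence are hypothetical, and positions are 0-based, so 1 must be added to compare them with 1-based sequence coordinates. A real analysis would read the downloaded genomes from FASTA files instead of using a hand-built string.

```python
from typing import Dict, List

# The three 12-mer RAWs reported in the text.
RAWS: Dict[str, str] = {
    "RAW1": "TTTCGCCCGACT",
    "RAW2": "TACGCCCTATCG",
    "RAW3": "CCTACGCGCAAA",
}


def find_positions(genome: str, word: str) -> List[int]:
    """Return the 0-based start of every match of word, overlaps included."""
    positions = []
    start = genome.find(word)
    while start != -1:
        positions.append(start)
        start = genome.find(word, start + 1)
    return positions


def locate_raws(genome: str) -> Dict[str, List[int]]:
    """Map each RAW name to the list of positions where it occurs in genome."""
    genome = genome.upper()
    return {name: find_positions(genome, word) for name, word in RAWS.items()}


# Hypothetical toy sequence: filler bases around RAW1 and RAW3 (not a real genome).
toy = "ACGT" * 3 + RAWS["RAW1"] + "GGAACC" + RAWS["RAW3"] + "TT"
hits = locate_raws(toy)
print(hits)  # {'RAW1': [12], 'RAW2': [], 'RAW3': [30]}
```

Running the same search over every genome in a collection and plotting the positions per genome would give a picture of the kind shown in Fig. 1a.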

4 Discussion

The personalized medicine field is now closer to clinical practice with the advances of next-generation sequencing technologies. Personalized therapeutics are a possibility and their development is essential given the emergence of resistance to currently available drugs. Additionally, quick diagnosis is required for emerging pathogens and in epidemics such as the current Ebola outbreak. Here, we have detected minimal RAWs that are present in EBOV genomes but absent from the human genome, and identified nucleotide differences in some of these sequences that can distinguish between Ebolavirus species and outbreaks. Also, we show that the corresponding amino acid sequences are conserved within EBOV. These results can now be further explored for diagnosis and therapeutics, a combination sometimes referred to as theranostics (Picard and Bergeron, 2002). Namely, RAW nucleotide sequences can be used in diagnosis to design primers that identify Ebolavirus infections or distinguish between Ebolavirus species. For PCR-based methods, longer sequences and multiplex reactions can be developed to avoid primer binding bias. Additional nucleotide- or protein-based strategies for therapeutics can be envisaged, as discussed below. One problem in developing efficient EBOV treatments is the virus's ability to evade the immune system. The viral GP is a major target because it mediates attachment and entry into the host cells. However, in addition to the surface envelope protein, the GP gene also produces soluble GP fragments that are secreted and direct the immune system to produce antibodies against variable and non-essential regions of the virus (Cook and Lee, 2013; Mohan ). As current efforts based on the viral GP might prove ineffective, additional targets should be sought. Our results show that the viral NP and polymerase (LP) can be attractive targets. As the amino acid sequences of all three 12-mer RAWs are conserved within EBOV, these regions can be used to screen for small-molecule inhibitors.
In particular, RAW1 is conserved in all Ebolavirus NP proteins, which may indicate a functional or structural role. Moreover, considering that the protein model predicts that RAW2 and RAW3 are relatively close in the 3D structure and lie in exposed domains, these regions can be used to develop novel antibodies. Also, a recently described mechanism shows that the polymerase (LP) from Ebola and Marburg viruses is capable of editing transcripts, resulting in increased variability in the produced proteins, and that the most edited mRNAs are the Ebola GP and Marburg NP and LP itself (Shabman ). Thus, the use of combined therapies directed at multiple proteins can be more effective, as suggested by studies to develop vaccines for Lassa virus that target both NP and GP (Fisher-Hoch ; Lukashevich, 2012). RNA-based strategies such as RNA interference (RNAi) or antisense therapies are also promising approaches to silence target-specific gene expression. The RAW sequences that we have identified can be used to develop RNAi or antisense probes that bind viral transcripts and prevent their translation, thus inhibiting viral replication without blocking the host mRNAs. Translation of these technologies into clinical applications has been slowed by challenges in the delivery of small RNAs into cells, but recent developments in delivery systems are bridging the bench-to-bedside gap (Hayden, 2014; Yin ). Among these, gold or lipid nanoparticles (Conde ; Draz ) were shown to be effective against cancer and viral infections, including EBOV (Geisbert ). Gold-nanobeacons can be applied as a combined diagnosis and therapy tool for effective testing, including in low-cost settings (Costa ) and, with this purpose, advances in peptide nucleic acid probes for viral detection are also taking place (Joshi ; Zhang ). Whatever the technology, the identification of genome signatures for rapidly evolving species such as Ebola viruses will be useful for the development of both diagnosis and therapeutics.
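
An oligonucleotide that binds a target by base pairing has the reverse-complement sequence of that target. As a minimal sketch, assuming the three RAWs are written in the same sense as the strand to be bound (an assumption made for illustration; the strand orientation of the deposited genomes and of the viral transcripts must be checked before any real probe or primer design), the complementary sequences can be derived as follows. The function name `reverse_complement` is illustrative and not part of the published code.

```python
# Base-pairing table for DNA; assumes an unambiguous A/C/G/T alphabet.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (upper-cased)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


# The three 12-mer RAWs reported in the text.
raws = {
    "RAW1": "TTTCGCCCGACT",
    "RAW2": "TACGCCCTATCG",
    "RAW3": "CCTACGCGCAAA",
}

# Candidate complementary 12-mers, under the strand assumption stated above.
probes = {name: reverse_complement(seq) for name, seq in raws.items()}
for name, probe in probes.items():
    print(name, probe)
# RAW1 AGTCGGGCGAAA
# RAW2 CGATAGGGCGTA
# RAW3 TTTGCGCGTAGG
```

Such 12-mers would serve only as a seed: as noted above for PCR-based methods, longer sequences around each RAW are likely to be needed in practice.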

36 in total

1. Rapid label-free visual assay for the detection and quantification of viral RNA using peptide nucleic acid (PNA) and gold nanoparticles (AuNPs).

Authors: Vinay G Joshi; Kantaraja Chindera; Arvind Kumar Singh; Aditya P Sahoo; Vikas D Dighe; Dimpal Thakuria; Ashok K Tiwari; Satish Kumar
Journal: Anal Chim Acta Date: 2013-06-28 Impact factor: 6.558

Review 2. Therapeutics for filovirus infection: traditional approaches and progress towards in silico drug design.

Authors: Amy C Shurtleff; Tam L Nguyen; David A Kingery; Sina Bavari
Journal: Expert Opin Drug Discov Date: 2012-08-08 Impact factor: 6.098

Review 3. Advantages of peptide nucleic acids as diagnostic platforms for detection of nucleic acids in resource-limited settings.

Authors: Ning Zhang; Daniel H Appella
Journal: J Infect Dis Date: 2010-04-15 Impact factor: 5.226

4. Antigenic subversion: a novel mechanism of host immune evasion by Ebola virus.

Authors: Gopi S Mohan; Wenfang Li; Ling Ye; Richard W Compans; Chinglai Yang
Journal: PLoS Pathog Date: 2012-12-13 Impact factor: 6.823

5. Minimal absent words in prokaryotic and eukaryotic genomes.

Authors: Sara P Garcia; Armando J Pinho; João M O S Rodrigues; Carlos A C Bastos; Paulo J S G Ferreira
Journal: PLoS One Date: 2011-01-31 Impact factor: 3.240

6. Modernizing reference genome assemblies.

Authors: Deanna M Church; Valerie A Schneider; Tina Graves; Katherine Auger; Fiona Cunningham; Nathan Bouk; Hsiu-Chuan Chen; Richa Agarwala; William M McLaren; Graham R S Ritchie; Derek Albracht; Milinn Kremitzki; Susan Rock; Holland Kotkiewicz; Colin Kremitzki; Aye Wollam; Lee Trani; Lucinda Fulton; Robert Fulton; Lucy Matthews; Siobhan Whitehead; Will Chow; James Torrance; Matthew Dunn; Glenn Harden; Glen Threadgold; Jonathan Wood; Joanna Collins; Paul Heath; Guy Griffiths; Sarah Pelan; Darren Grafham; Evan E Eichler; George Weinstock; Elaine R Mardis; Richard K Wilson; Kerstin Howe; Paul Flicek; Tim Hubbard
Journal: PLoS Biol Date: 2011-07-05 Impact factor: 8.029

7. Postexposure protection of non-human primates against a lethal Ebola virus challenge with RNA interference: a proof-of-concept study.

Authors: Thomas W Geisbert; Amy C H Lee; Marjorie Robbins; Joan B Geisbert; Anna N Honko; Vandana Sood; Joshua C Johnson; Susan de Jong; Iran Tavakoli; Adam Judge; Lisa E Hensley; Ian Maclachlan
Journal: Lancet Date: 2010-05-29 Impact factor: 79.321