Literature DB >> 22276113

RNA-seq based transcriptional map of bovine respiratory disease pathogen "Histophilus somni 2336".

Ranjit Kumar1, Mark L Lawrence, James Watt, Amanda M Cooksey, Shane C Burgess, Bindu Nanduri.   

Abstract

Genome structural annotation, i.e., identification and demarcation of the boundaries for all the functional elements in a genome (e.g., genes, non-coding RNAs, proteins and regulatory elements), is a prerequisite for systems level analysis. Current genome annotation programs do not identify all of the functional elements of the genome, especially small non-coding RNAs (sRNAs). Whole genome transcriptome analysis is a complementary method to identify "novel" genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. In particular, the identification of non-coding RNAs has revealed their widespread occurrence and functional importance in gene regulation, stress and virulence. However, very little is known about non-coding transcripts in Histophilus somni, one of the causative agents of Bovine Respiratory Disease (BRD) as well as bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis. In this study, we report a single nucleotide resolution transcriptome map of H. somni strain 2336 using RNA-Seq method.The RNA-Seq based transcriptome map identified 94 sRNAs in the H. somni genome of which 82 sRNAs were never predicted or reported in earlier studies. We also identified 38 novel potential protein coding open reading frames that were absent in the current genome annotation. The transcriptome map allowed the identification of 278 operon (total 730 genes) structures in the genome. When compared with the genome sequence of a non-virulent strain 129Pt, a disproportionate number of sRNAs (∼30%) were located in genomic region unique to strain 2336 (∼18% of the total genome). This observation suggests that a number of the newly identified sRNAs in strain 2336 may be involved in strain-specific adaptations.

Entities:  

Mesh:

Substances:

Year:  2012        PMID: 22276113      PMCID: PMC3262788          DOI: 10.1371/journal.pone.0029435

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Systems biology approaches are designed to facilitate the study of complex interactions among genes, proteins, and other genomic elements [1], [2], [3]. In the context of infectious disease, systems biology has the potential to complement reductionist approaches to resolve the complex interactions between host and pathogen that determine disease outcome. However, a prerequisite for systems biology is the description of the system's components. Therefore, genome structural annotation or the identification and demarcation of boundaries of functional elements in a genome (e.g., genes, non-coding RNAs, proteins, and regulatory elements) are critical elements in infectious disease systems biology. Bovine Respiratory Disease (BRD) costs the cattle industry in the United States as much as $3 billion annually [4], [5]. BRD is the outcome of complex interactions among host, environment, bacterial, and viral pathogens [6]. Histophilus somni, a gram-negative, pleomorphic species, is one of the important causative agents of BRD [6]. H. somni causes bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis [7]. H. somni strain 2336, the serotype used in this study and isolated from pneumonic calf lung, has a 2.2 Mbp genome and 2044 predicted open reading frames (ORFs), of which 1569 (76%) have an assigned biological function. Genome structural annotation is a multi-level process that includes prediction of coding genes, pseudogenes, promoter regions, repeat elements, regulatory elements in intergenic regions such as small non-coding RNAs (sRNA), and other genomic features of biological significance. Computational gene prediction methods such as Glimmer [8] or GenMark [9] use Hidden Markov models which are based on a training set of well annotated genes. Although these methods are quite efficient, they often miss genes with anomalous nucleotide composition and have several well-described shortcomings: because bacterial genomes do not have introns, detecting gene boundaries is comparatively difficult; due to the usage of more than one start codon, computational genome annotation methods may predict overlapping ORFs [10]; prediction programs use arbitrary minimum cutoff lengths to filter short ORFs, which may lead to under-representation of small genes. In case of sRNA (small non-coding RNA) prediction, the lack of DNA sequence conservation, lack of a protein coding frame, and the limited accuracy of transcriptional signal prediction programs (promoter/Rho terminator prediction) confound computational prediction [11], [12]. Computational prediction methods are a “first pass” genome structural annotation. Whole genome transcriptome studies (such as whole genome tiling arrays [13], [14], [15] and high throughput sequencing [16], [17]) are complementary experimental approaches for bacterial genome annotation and can identify “novel” genes, gene boundaries, regulatory regions, intergenic regions, and operon structures. For example, a transcriptomic analysis of Mycoplasma pneumoniae identified 117 previously unknown transcripts, many of which were non-coding RNAs, and two novel genes [18]. Transcriptome analyses identified novel, non-coding regions in other species, including 27 sRNAs in Caulobacter crescentus [15], 64 sRNAs in Salmonella Typhimurium [17], and a large number of putative sRNAs in Vibrio cholerae [16]. sRNAs found in pathogen genomes are known to be involved in various housekeeping activities and virulence [19]. In this study we used RNA-Seq for the experimental annotation of the H. somni strain 2336 genome and to construct a single nucleotide resolution transcriptome map. Novel expressed elements were identified, and where appropriate, computational predictions of previously described gene boundaries were corrected.

Results

Mapping of reads onto the H. somni genome

In 2008 the complete genome sequence of the H. somni strain 2336 became available (GenBank CP000947). The 2,263,857 bp circular genome has a GC content of 37.4%, and 87% of the sequence is annotated to coding regions. The genome has 2065 computationally predicted genes, of which 1980 are protein coding. We sequenced the transcriptome of H. somni using Illumina RNA-Seq methodology, and obtained 9,015,318 reads, with an average read length of approximately 76 bp. We mapped approximately 9.4% reads onto the reference DNA sequence of H. somni strain 2336 using the alignment program Bowtie [20]. To determine expressed regions in the genome, we estimated the average coverage depth of reads mapped per nucleotide/base. We used pileup format, which represents the signal map file for the whole genome in which alignment results (coverage depth) are represented in per-base format. Regions where coverage depth was greater than the lower tenth percentile of expressed genes were considered significantly expressed [21]; in the current study, this corresponded to a coverage depth of 7 reads/bp in pileup format. As another measure for estimating background expression level, we analyzed the coverage in the intergenic regions of the genome. We assumed that at least half of the intergenic region is not expressed (considering the presence of known expressed regions, such as 3′ and 5′ UTR of genes, intergenic region of the operons, and sRNAs) and calculated the coverage, which corresponded to ≤6 reads per base, lower than our first cutoff estimate. We retained the most conservative cutoff for expression, i.e., 7 reads per base for describing the expression map of H. somni. Nucleotides in the genome sequence with coverage depth above our threshold value were considered to be expressed. This resulted in the generation of a whole genome transcriptome profile of H. somni 2336 at a single nucleotide resolution. Figure 1 show the steps involved in the analysis of expressed intergenic regions.
Figure 1

RNA-Seq data analysis workflow for intergenic expression analysis.

Analysis workflow includes identification of novel protein coding genes and sRNAs in the intergenic region of H. somni 2336 genome.

RNA-Seq data analysis workflow for intergenic expression analysis.

Analysis workflow includes identification of novel protein coding genes and sRNAs in the intergenic region of H. somni 2336 genome.

Expression in the intergenic region of the genome

We compared the RNA-Seq based transcriptome map with the available genome annotation to identify expressed, novel, and intergenic regions in the genome. Promoters and terminators were predicted across the genome to add confidence to the identified novel elements. For the first time, we report the identification of 94 sRNAs (Table 1) in the H. somni genome. The start and end for sRNA in Table 1 refer to the boundaries of transcriptionally active regions (TAR, putative sRNAs). Of these, twelve were similar to well-characterized sRNA families that are described in many bacterial species, such as tmRNA, 6S, and FMN (Figure 2). The total of 82 novel sRNAs reported in this study has not been reported earlier. The majority of the identified sRNAs (>75%) were shorter than 200 nucleotides (length range 70–695 nucleotides). The average GC content of sRNA at 39.3% was slightly higher compared with the 37.4% GC content of the genome. Promoters within 50 nt upstream/downstream of the TAR boundaries were predicted for 68 sRNA. Similarly, Rho-independent transcription terminators were predicted within 50 bp upstream/downstream of 40 sRNA. Figure 3 shows the depth of coverage for one of the identified novel sRNA “HS46” viewed in the Artemis genome browser [22].
Table 1

H. Somni 2336 sRNAs, their genome location, additional features and comparative genomics.

IDStart# End# Length (nt)PromoterRho independent terminatorFlanking gene (left)Flanking gene(right)Rfam annotationConservation across other genome*
HS181098210101-YHSM0009 (+)HSM0010 (+)-C
HS2161191619071YYHSM0018 (+)HSM0019 (−)-B
HS32769327843150Y-HSM0034 (+)HSM0035 (+)-B
HS42821128327116Y-HSM0035 (+)HSM0036 (+)-B
HS52944929733284Y-HSM0037 (+)HSM0038 (+)-C
HS63064430884240YYHSM0039 (+)HSM0040 (−)-B
HS79191392041128-YHSM0081 (−)HSM0082 (+)-C
HS8113922114088166YYHSM0102 (+)HSM0103 (−)-C
HS9161819161939120Y-HSM0149 (+)HSM0150 (+)-B
HS1018309718318083Y-HSM0171 (−)HSM0172 (+)-B
HS11197465197654189--HSM0185 (−)HSM0186 (−)-B
HS12229676229780104Y-HSM0213 (+)HSM0214 (−)-C
HS1324359224368391-YHSM0224 (+)HSM0225 (−)-A
HS14258140258443303YYHSM0242 (−)HSM0243 (−)-A
HS15258962259092130YYHSM0244 (−)HSM0245 (+)-A
HS16260385260577192YYHSM0245 (+)HSM0246 (−)-A
HS17261314261511197-YHSM0246 (−)HSM0247 (−)-A
HS18261829262158329YYHSM0246 (−)HSM0247 (−)-A
HS1926426326436299Y-HSM0250 (+)HSM0251 (+)-A
HS20279541279733192--HSM0266 (−)HSM0267 (+)-B
HS21306503306957454Y-HSM0284 (+)HSM0285 (−)tmrnaD
HS22318188318443255--HSM0292 (+)HSM0293 (−)-D
HS23319335319635300--HSM0295 (+)HSM0296 (+)-B
HS24341911342015104--HSM0316 (−)HSM0317 (−)-B
HS25343637343745108--HSM0318 (−)HSM0319 (+)-B
HS2637795337803784Y-HSM0345 (+)HSM0346 (−)-B
HS27383294383421127Y-HSM0346 (−)HSM0347 (+)-A
HS28385627385733106YYHSM0348 (−)HSM0349 (+)-B
HS29391757392003246--HSM0353 (−)HSM0354 (−)-C
HS3041242541252297--HSM0368 (−)HSM0369 (−)-B
HS3147225647233175YYHSM0407 (−)HSM0408 (−)-B
HS32524433524805372Y-HSM0448 (+)HSM0450 (−)-A
HS33599224599416192YYHSM0521 (−)HSM0522 (+)-B
HS34614906615013107--HSM0538 (+)HSM0539 (−)intron_gpIIA
HS35616791617486695Y-HSM0539 (−)HSM0540 (−)-A
HS36617726618078352Y-HSM0539 (−)HSM0540 (−)-A
HS37618122618228106YYHSM0539 (−)HSM0540 (−)-A
HS38637931638076145YYHSM0552 (+)HSM0553 (+)-B
HS39638222638366144--HSM0552 (+)HSM0553 (+)-B
HS40653813653962149Y-HSM0561 (+)HSM0562 (−)-A
HS41694580694680100Y-HSM0594 (−)HSM0595 (−)-A
HS4270333370342390YYHSM0607 (+)HSM0608 (+)-B
HS4371036371045087Y-HSM0611 (+)HSM0612 (+)-B
HS44747233747386153Y-HSM0644 (−)HSM0645 (−)-A
HS45800181800295114YYHSM0704 (+)HSM0705 (+)-B
HS46851529851662133YYHSM0740 (−)HSM0741 (+)-B
HS47853988854118130Y-HSM0742 (−)HSM0743 (+)-B
HS4887635587643378--HSM0758 (+)HSM0759 (+)glycineC
HS49979462979681219Y-HSM0844 (+)HSM0845 (+)-B
HS50981925982225300--HSM0847 (+)HSM0848 (+)-C
HS5199423499431480Y-HSM0853 (+)HSM0854 (+)-B
HS5210077991008007208YYHSM0868 (−)HSM0869 (−)-C
HS5310080861008580494-YHSM0868 (−)HSM0869 (−)-A
HS5410126171012823206-YHSM0874 (−)HSM0875 (−)-B
HS5510144251014768343Y-HSM0875 (−)HSM0876 (−)-A
HS5610151891015390201Y-HSM0877 (+)HSM0878 (+)-A
HS5710219191022474555Y-HSM0888 (−)HSM0889 (−)-A
HS5810319801032132152YYHSM0900 (+)HSM0901(+)-C
HS5910322061032458252YYHSM0900 (+)HSM0901 (+)-D
HS6010525871052754167Y-HSM0920 (+)HSM0921 (+)-A
HS611147201114729089Y-HSM1005 (+)HSM1006 (+)-B
HS6212606211260860239--HSM1095 (+)HSM1096 (+)6 sD
HS6312924131292563150Y-HSM1125 (−)HSM1126 (+)-B
HS6413077571307987230Y-HSM1136 (+)HSM1137 (−)-A
HS6513126931312855162YYHSM1143 (+)HSM1144 (+)-A
HS6613202281320349121YYHSM1155 (−)HSM1156 (−)-A
HS6713374121337590178YYHSM1172 (+)HSM1173 (+)-B
HS681343583134365976YYHSM1182 (+)HSM1183 (+)-D
HS6913773091377411102Y-HSM1218 (+)HSM1219 (+)-C
HS7014137411413887146--HSM1254 (−)HSM1255 (+)lysineB
HS7114555291455708179Y-HSM1275 (−)HSM1276 (−)MOCORNAB
HS721513886151395569Y-HSM1330 (+)HSM1331 (+)-B
HS731537168153726799--HSM1355 (−)HSM1356 (−)LR-PK1B
HS741591107159118780YYHSM1392 (−)HSM1393 (−)-B
HS7515939531594392439Y-HSM1395 (−)HSM1396 (+)RNaseP_bact_aD
HS7615960111596138127YYHSM1397 (+)HSM1398 (+)-B
HS7717485631748820257YYHSM1521 (−)HSM1522 (+)-D
HS7817526531752795142Y-HSM1525 (+)HSM1526 (+)-A
HS791839524183961692YYHSM1590 (−)HSM1591 (+)-B
HS8018591681859317149YYHSM1612 (−)HSM1613 (−)-B
HS8118743981874609211YYHSM1626 (+)HSM1627 (+)isrKB
HS8219258141925932118--HSM1675 (−)HSM1676 (−)-B
HS8319277971928029232Y-HSM1676 (−)HSM1677 (+)-D
HS8419281571928331174YYHSM1676 (−)HSM1677 (+)-A
HS8519424451942617172YYHSM1692 (+)HSM1693 (+)-A
HS8619624871962618131YYHSM1719 (−)HSM1720 (−)-A
HS8720205452020668123--HSM1776 (−)HSM1777 (+)gcvBD
HS882124794212488490YYHSM1868 (−)HSM1869 (−)-A
HS892136245213632479Y-HSM1881 (−)HSM1882 (−)-A
HS9021395632139823260YYHSM1887 (+)HSM1888 (+)-A
HS9121462862146459173--HSMR0065 (+)HSM1893 (+)-B
HS9222101482210318170--HSM1950 (+)HSM1951 (+)-B
HS9322238022223946144--HSM1974 (+)HSM1975 (+)alpha_RBSD
HS9422292692229450181--HSM1982 (−)HSM1983 (+)FMND

*sRNA sequences conserved in; A - unique to H. somni 2336. B - H. somni strain 129PT only. C – phylogenetically closer bacterial genomes specially members of Pasteurellaceae family (M. haemolytica, P. multocida. H. influenza etc). D - across distant bacterial species.

The start and end represents the boundaries of identified TAR (transcriptionally active region) which is a potential sRNA region.

Any cell with no predicted result is marked with ‘−’.

Figure 2

Identification of sRNA annotated to Rfam.

The figure shows identification of well conserved sRNA “tmRNA” using RNA-Seq based method. “tmRNA” was computationally predicted as a sRNA by Rfam using sequence similarity across other bacterial families.

Figure 3

Identification of a novel sRNA.

A highly expressed sRNA “HS46” found in the intergenic region of H. somni 2336 genome.

Identification of sRNA annotated to Rfam.

The figure shows identification of well conserved sRNA “tmRNA” using RNA-Seq based method. “tmRNA” was computationally predicted as a sRNA by Rfam using sequence similarity across other bacterial families.

Identification of a novel sRNA.

A highly expressed sRNA “HS46” found in the intergenic region of H. somni 2336 genome. *sRNA sequences conserved in; A - unique to H. somni 2336. B - H. somni strain 129PT only. C – phylogenetically closer bacterial genomes specially members of Pasteurellaceae family (M. haemolytica, P. multocida. H. influenza etc). D - across distant bacterial species. The start and end represents the boundaries of identified TAR (transcriptionally active region) which is a potential sRNA region. Any cell with no predicted result is marked with ‘−’. BLAST analysis of the sRNA sequences against the non-redundant, nucleotide database at NCBI revealed that 31 of the sRNA sequences were unique to the H. somni 2336 genome. Another 41 were highly conserved (>95% identity with >95% coverage) only in H. somni strain 129PT, which is a commensal, preputial isolate. A set of 11 sRNAs were conserved in the related Pasteurellaceae family, which includes genomes such as P. multocida, H. influenzae, H. parainfluenzae, and H. ovis. Only 11 sRNAs were conserved in distant bacterial genomes from genera Streptococcus, Clostrodium, Actinobacillus, Vibrio, and others. This lack of sRNA sequence conservation beyond the species could indicate that sRNA sequences are under strong selection pressure, and that they could be responsible for the adaptation of many species to different environmental niches. We searched all H. somni sRNA sequences against the Rfam database [23] to determine their putative functions. We found that 12 sRNAs were homologs to well characterized sRNAs in other genomes. The identified functional categories included FMN riboswitches, gcvB, glycine, intron_gpII, lysine, alpha_RBS, LR-PK1, isrK, MOCORNA, RNaseP_bact_a, tmRNA, and 6S. sRNAs for which no Rfam function could be predicted represent a completely novel set of non-coding sRNAs. Functions of these novel sRNA need to be determined by further experiments.

Identification and characterization of novel genes

We evaluated the coding potential of all expressed intergenic regions, by conducting BLASTX based sequence searches against the non-redundant protein database at NCBI followed by manual analysis and interpretation. We identified 38 novel protein coding regions (Table 2). The average length of the identified novel proteins was around 60 amino acids (ranged from 19 to 135 amino acids). The majority of the novel proteins (30) were conserved hypothetical proteins present in related species such as H. somni 129PT, M. haemolytica, and H. influenzae. Some of the novel proteins had predicted functions, such as DnaK suppressor protein, toxic membrane protein TnaC, and predicted toxic peptide ibsB3 (Table 2). Figure 4 shows an example of a novel protein “HSP7” that is similar (74% similarity and 100% coverage) to a putative, phage-related DNA-binding protein of Neisseria polysaccharea.
Table 2

Novel proteins identified in the H. somni 2336 genome along with closest matching homolog and its annotation.

IDStartEndStrandLength (nt)Top BLASTX HitAnnotation
HSP1140988141083+96ZP_04978675.1hypothetical protein MHA_2182 [Mannheimia haemolytica PHL213]
HSP226001926007860YP_002791255.1toxic membrane protein [Escherichia coli str. K-12 substr. MG1655]
HSP3260229260408180YP_001784202.1hypothetical protein HSM_0870 [Haemophilus somnus 2336]
HSP4260707260850144YP_001784202.1hypothetical protein HSM_0870 [Haemophilus somnus 2336]
HSP526095126100757CBY77851.1predicted toxic peptide IbsB3 [Escherichia coli BL21(DE3)]
HSP6692328692543216ZP_04977604.1hypothetical protein MHA_1062 [Mannheimia haemolytica PHL213]
HSP7748664748849+186ZP_06863963.1putative phage-related DNA-binding protein [Neisseria polysaccharea ATCC 43768]
HSP8752366752593+228ZP_01791588.1hypothetical protein CGSHiAA_00240 [Haemophilus influenzae PittAA]
HSP9753226753384+159ZP_05848096.1conserved hypothetical protein [Haemophilus influenzae RdAW]
HSP10754234754398+165NP_873053.1hypothetical protein HD0492 [Haemophilus ducreyi 35000HP]
HSP11758474758698+225ZP_05848108.1conserved hypothetical protein [Haemophilus influenzae RdAW]
HSP12764501764686+186ABX51978.1hypothetical protein [Haemophilus phage SuMu]
HSP13771653771787+135ZP_04464387.1hypothetical protein CGSHi6P18H1_07995 [Haemophilus influenzae 6P18H1]
HSP14782712782840+129YP_001344686.1hypothetical protein Asuc_1392 [Actinobacillus succinogenes 130Z]
HSP15858416858721+306ZP_04976950.1hypothetical protein MHA_0367 [Mannheimia haemolytica PHL213]
HSP16982362982619+258NP_660225.1repressor-like protein [Haemophilus influenzae biotype aegyptius]
HSP1710083331008518+186YP_001784474.1hypothetical protein HSM_1144 [Haemophilus somnus 2336]
HSP1810140641014276+213ZP_02478185.1hypothetical protein HPS_04457 [Haemophilus parasuis 29755]
HSP1910238541024255+402YP_719605.1hypothetical protein HS_1393 [Haemophilus somnus 129PT]
HSP2010261991026414+216YP_002475212.1putative lytic protein Rz1, bacteriophage protein [Haemophilus parasuis SH0165]
HSP2110312091031379+171YP_002475190.1hypothetical protein HAPS_0589 [Haemophilus parasuis SH0165]
HSP2210317091031906198ZP_05731317.1hypothetical protein Pat9bDRAFT_4634 [Pantoea sp. At-9b]
HSP2310439421044073+132ZP_02479029.1hypothetical protein HPS_00455 [Haemophilus parasuis 29755]
HSP2413063831306649267ZP_04976986.1hypothetical protein MHA_0405 [Mannheimia haemolytica PHL213]
HSP2513096671309807141ZP_01787689.1hypothetical protein CGSHi22421_00792 [Haemophilus influenzae R3021]
HSP2613245411324765225YP_002475146.1DnaK suppressor protein/C4-type zinc finger protein, DksA/TraR family [Haemophilus parasuis SH0165]
HSP2713458681346077+210YP_001088372.1putative conjugative transposon egulatory protein [Clostridium difficile 630]
HSP281448209144829284AAB96578.1TnaC [Haemophilus influenzae]
HSP2917472211747361+141YP_002476351.1hypothetical protein HAPS_1915 [Haemophilus parasuis SH0165]
HSP3017500201750235+216ZP_04752631.1hypothetical protein AM305_05314 [Actinobacillus minor NM305]
HSP3118529681853201+234YP_718779.1hypothetical protein HS_0567a [Haemophilus somnus 129PT]
HSP3219598181959922105ZP_04977712.1hypothetical protein MHA_1177 [Mannheimia haemolytica PHL213]
HSP3319628851963013+129ZP_05993368.1hypothetical protein COI_2717 [Mannheimia haemolytica serotype A2 str. OVINE]
HSP3419629851963176192ZP_05993369.1hypothetical protein COI_2718 [Mannheimia haemolytica serotype A2 str. OVINE]
HSP3519660851966366282ZP_04977704.1hypothetical protein MHA_1169 [Mannheimia haemolytica PHL213]
HSP3619771311977247+117ZP_07538596.1hypothetical protein appser10_8220 [Actinobacillus pleuropneumoniae serovar 10 str. D13039]
HSP3720715452071745201YP_719865.1hypothetical protein HS_1660 [Haemophilus somnus 129PT]
HSP3821657332165906174YP_718223.1hypothetical protein HS_0017a [Haemophilus somnus 129PT]
Figure 4

Identification of a novel protein coding gene.

Novel protein coding gene “HSP7” identified using transcriptome analysis shows homology (similarity 74%, sequence coverage 100%) to a phage related DNA binding protein from Neisseria polysaccharea.

Identification of a novel protein coding gene.

Novel protein coding gene “HSP7” identified using transcriptome analysis shows homology (similarity 74%, sequence coverage 100%) to a phage related DNA binding protein from Neisseria polysaccharea.

Corrections made to the existing genome annotation

The single nucleotide resolution map described in this study enabled us to correct the start site for five genes based on the current genome annotation (Table 3). These genes were annotated as phospholipid synthesis protein, ribosomal protein S2, aconitate hydratase 2, peptide chain release factor 2, and DUF411, a protein of unknown function. Based on evidence from RNA-Seq data, we performed a BLAST comparison with other phylogenetically similar proteins to confirm the new gene boundaries (Table 3).
Table 3

Genes with revised coordinate information based on transcriptome map.

Gene idPrevious annotation (Start-End)New corrected annotation (Start-End)
HSM_003124651–2492924597–24929
HSM_0525602547–603416602547–603602
HSM_0789909036–911534909036–911642
HSM_10191164444–11651631164444–1165244
HSM_17291972283–19726001972283–1972765

Non-functional start codons and frameshifts

The comparison of the transcriptome map of the H. somni genome with predicted proteins revealed the presence of frameshift mutations. Four genes have non-functional start codons, resulting in a predicted protein, truncated at the amino terminus (based on BLAST comparison with homologous proteins in other species), although full length mRNA was present. An example is presented for the gene “HSM_0748”, annotated as “Alpha-L-fucosidase” (Figure S1). The other three genes, HSM_0603, HSM_1666 and HSM_1668, encode a hypothetical protein, type III restriction protein res subunit, and CTP synthase, respectively. Two genes with frameshifts causing protein truncations (based on BLAST comparison with homologous proteins) are HSM_1385 (beta-hydroxyacyl dehydratase, FabA) and HSM_1744 (alcohol dehydrogenase zinc-binding domain protein). The transcriptome map revealed a full length mRNA for these two genes that code for truncated proteins.

Gene expression and operon structures

Our transcriptome map of H. somni identified expression from 1636 (approximately 80%) of the predicted genes. The expressed genes were distributed evenly across all TIGRFAM functional categories (Table S1). The transcriptome map allowed identification of operon structures at a genome scale, critical for identifying co-expressed genes and for understanding coordinated regulation of the bacterial transcriptome. We identified co-expression for 452 pairs (total 730 genes) of H. somni genes (Table S2) that were transcribed together and constituted a minimal operon. By joining consecutive overlapping pairs of co-expressed genes, we identified 278 distinct transcription units (Table S3). We compared our experimentally identified co-expressed genes with computationally predicted operons. The overlap between computational prediction of co-expressed genes using DOOR [24] and this study was 86% (394 gene pairs) (Table S4). Thus, our dataset validates expression of 394 computational gene-pair predictions. We identified 59 new gene pairs that are co-expressed and were not predicted by DOOR, which could be part of unidentified, new operon structures. For example, further in-depth analysis indicated a new operon consisting of three genes: HSM1354, HSM1355 and HSM1356, annotated as ribosomal protein L20, ribosomal protein L35, and translation initiation factor IF-3 respectively, which were not predicted computationally (Figure 5). The orthologs of these genes are well known to form a functional operon of ribosomal proteins (IF3-L35-L20) in Escherichia coli [25].
Figure 5

Identification of a novel operon structure comprised of three genes: HSM_1354, HSM_1355, and HSM_1356.

The RNA-Seq coverage shows three genes annotated as ribosomal proteins (IF3, L35, and L20) being expressed as a transcription unit.

Identification of a novel operon structure comprised of three genes: HSM_1354, HSM_1355, and HSM_1356.

The RNA-Seq coverage shows three genes annotated as ribosomal proteins (IF3, L35, and L20) being expressed as a transcription unit.

Discussion

In this study using RNA-Seq we describe the whole genome transcriptome profile of H. somni 2336, a bovine respiratory disease pathogen. The single nucleotide resolution map helped uncover the structure and complexity of this pathogen's transcriptome and led to the identification of novel, small RNAs and protein coding genes as well as gene co-expression. Prokaryotic genome annotation is performed often using computational gene prediction programs [8], [9]. However, these prediction algorithms are not able to identify the non-coding sRNAs, antisense transcripts, and other small proteins. To overcome the shortcomings of computational genome structural annotation, various experimental methods are used for identification of novel expressed elements [13], [14], [15], [16], [17], [18], [26], [27], [28]. Deep transcriptome sequencing (RNA-Seq) has emerged recently as a method that enables the study of RNA-based structural and regulatory regions at the genome scale. RNA-Seq technology has many advantages compared with existing array based methods for transcriptome analysis. In particular, RNA-Seq does not require probes, so the process is free from probe design issues or bias from hybridization issues. Also, the transcriptome coverage from RNA-Seq is very high [29], [30]. RNA-Seq was demonstrated to be effective for the discovery of bacterial non- coding RNAs, accurate operon definition, and correction of gene annotation [27], [31], [32]. Therefore, in the current study, we used RNA-Seq for profiling H. somni 2336 transcriptome. Mapping of RNA-Seq reads onto the H. somni genome sequence resulted in more than 94% coverage with at least one read per base. This observation is consistent with the reported 94% genome expression in Bacillus anthracis, 89.5% in Sulfolobus solfataricus, and 95% in Burkholderia cenocepacia, studied under one or more experimental growth conditions using RNA-Seq [32], [33], [34]. These results indicate that most of the bacterial genome sequence is expressed at some basal level. To identify significantly expressed regions above this baseline, we used two alternative methods (discussed in Results section) to estimate the background expression. Both methods yielded similar results (6–7 reads per base). We selected the higher stringency cutoff of 7 reads per base to minimize the number of false positives. We identified a total of 95 sRNAs in the H. somni genome. Twelve of these were predicted by Rfam [23] and are similar to conserved sRNA (e.g., 6S, tmRNA, FMN) in other bacterial species, which helps validate our approach. The 83 novel H. somni sRNAs may have housekeeping function, regulatory activity, or participate in virulence as described in other pathogenic bacteria [19], [35], [36]. The identified sRNAs did not show any location specific bias across the genome. Similarly, genes known to be associated with virulence are known to be scattered across bacterial genomes [37], [38]. However, the tendency to form clusters was observed with sRNAs, which could indicate that functionally related sRNAs tend to be located in close proximity. The RNA-Seq based transcriptome map of H. somni identified 38 novel protein coding genes that were missed by the initial annotation. The average length of the proteins coded by these genes exceeds 60 amino acids, suggesting that length based cutoff was not the main reason that these genes were missed by computational gene prediction programs. The novel protein coding genes identified in the current study could serve as a training set to improve gene prediction algorithms. The transcriptome map helped to identify incorrect annotation of start codons in the genome. Transcriptional mapping does not provide direct evidence of translational start sites. However, location of identified transcriptional start sites suggest that the annotated start codons are incorrect, an observation that is confirmed by BLAST comparisons against homologous genes in other bacterial species. Transcriptional mapping revealed genes where the 5′ untranslated sequence extended well beyond the translational start. BLAST comparisons indicated that these genes have either nonsense or missense base changes relative to homologous genes in other bacterial species, causing apparent “truncated” proteins compared with those in other species. Further work is needed to determine whether these 5′ untranslated regions serve regulatory functions or they are vestigial. RNA-Seq data enabled us to determine operon structures at a genome scale, and it allowed identification of some operons not predicted by the computational operon prediction method. Operon structures that include genes not expressed under the experimental growth condition used in the current study, could not be identified. Our results support the notion that using a combination of experimental operon identification by RNA-Seq and computational prediction can improve operon identification in bacterial genomes [39]. For the first time, we report the RNA-Seq based transcriptome map of H. somni 2336 and describe novel expressed regions in the genome. Whereas the results are interesting, we are aware of the limitations of the study. Because the RNA-Seq protocol was not strand specific, we could not determine the strand specificity of expressed novel transcripts. Therefore, Table 1 lacks information about sRNA orientation in the genome. Because strand specific information was missing, we could not describe antisense expression in the genome. For protein coding genes, we derived strand specificity based on alignment of the BLAST hit. Despite this shortcoming, we identified novel expressed regions and transcriptional patterns across the whole genome at a high coverage, which is not possible by other transcriptome analysis methods. Overall, this study describes RNA-Seq based transcriptome map of H. somni for identification of functional elements in a pathogen of importance to agriculture. Our genome-wide survey predicts numerous, novel, expressed regions that need biological characterization for understanding disease pathogenesis. Description of all functional elements in the H. somni system is a prerequisite for conducting holistic systems approaches to understand the complex pathogenesis of bovine respiratory disease.

Methods

RNA isolation and sequencing

We propagated H. somni 2336 on three TSA-blood plates (with 5% sheep red blood cells) for 16 hr or until a fresh lawn of cells was visible. IBC approval was not required for acquiring the plates as they were purchased through a commercial vendor: Fisher Scientific (Pittsburgh, PA), and manufactured by Becton Dickinson Diagnostic Systems, (Franklin Lakes, NJ). We washed the plates with brain heart infusion (BHI) broth, adjusted the culture to an OD620 nm = 0.8, and supplemented with RNAprotect reagent. The cells were harvested by centrifugation and stored at −80°C. We extracted total RNA using the RNeasy mini kit (Qiagen, Valencia, CA) following the manufacturer's protocol. Total RNA was treated with RNase-free DNAse (Invitrogen, Carlsbad, CA). Using Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA), we determined the RNA integrity number (RIN) of total RNA to be greater than 8. MICROBExpress™ Kit (Ambion, TX, USA), which specifically removes rRNAs, was used for mRNA enrichment. Small RNAs (i.e., tRNA and 5S rRNA) are not removed with this enrichment step (confirmed by Bioanalyzer). We used 100 ng enriched mRNA with Illumina mRNA-Seq sample preparation kit (Illumina, San Diego, CA) for library construction following the manufacturer's protocols. Briefly, mRNA was fragmented chemically by divalent zinc cations and randomly primed for cDNA synthesis. After ligating paired-end sequence adaptors to cDNA, we isolated fragments of approximately 200 bp by gel electrophoresis and amplified. We sequenced one nM of mRNA-Seq library on the Illumina GAII (Illumina, San Diego, CA), according to the manufacturer's protocol. Single read sequencing (36 bp) of the clustered flow cell was performed by Illumina's SBS chemistry (v3) and SCS data analysis pipeline v2.4. We used Illumina Real Time Analysis (RTA v1.4.15.0) software for flow-cell image analysis and cluster intensity. Subsequent base-calling was performed using the Illumina GA Pipeline v1.5.1 software.

Mapping and analysis of Illumina reads

We checked all Illumina reads for quality, and removed sequence reads containing “Ns”. Custom perl script was written to convert Illumina reads into fastq format. The script “fq_all2std.pl” from MAQ [40] converted fastq format to Sanger fastq format. Reads in sanger fastq format, were mapped onto the Histophilus somni 2336 genome sequence (GenBank Accession number. CP000947) using the alignment tool Bowtie [41], allowing for a maximum of two mismatches. The reads that mapped to more than one location were discarded. We used Samtools [42] to convert data into SAM/BAM format, and to generate alignment results in a pileup format. Pileup format provides the signal map file and has per-base format coverage. Custom perl scripts were written to calculate the background expression. Processed data was deposited in GEO with the accession number GSE29578.

Analysis of intergenic regions of H. somni genome

We used in-house perl scripts to extract novel expressed intergenic regions to identify novel small RNAs, riboswitches, and putative novel proteins. sRNA <70 bp in length were discarded to minimize the number of false positives. For each novel expressed region, BLAST sequence searches were performed against the non-redundant protein database at NCBI to identify potential protein coding regions. Intergenic regions within predicted operons [24] represent expressed regions and can be mis-classified as sRNAs. Therefore, these regions were excluded. We analyzed BLAST results manually, to identify novel protein coding regions and start codon corrections. If no protein coding region was found in the intergenic expressed regions, the presence of a promoter or a rho-independent terminator allowed us to classify the regions as sRNA. Bacterial promoter sequences were predicted by Neural Network Promoter Prediction program (http://www.fruitfly.org/seq_tools/promoter.html) [43]. Rho-independent transcription terminators were identified using the program TransTermHP [44]. For functional annotation, all identified identified sRNA sequences were searched against the Rfam database [23]. sRNA sequence conservation among other genomes was determined by blastn searches against non-redundant nucleotide database at NCBI. We mapped sRNAs, along with additional features, onto genome browsers like IGV [45] and Artemis [46] for further visualization, manual analysis, and interpretation.

Analysis of annotated regions of H. somni genome

Gene expression: expressed reads with coverage above background were mapped onto the annotated genes of H. somni 2336. Genes that had a significantly higher proportion of their length (>60%) covered by expressed reads were considered to be expressed. Operons: RNA-Seq can identify and predict operon structures in bacteria. We considered two or more consecutive genes to be part of an operon, if they fulfilled the following criteria: (a) they are expressed; (b) they are transcribed in the same direction; and (c) the intergenic region between the genes is expressed. Overlapping pairs of such genes were joined together to identify large operon structures. We used in-house perl scripts for the analyses. Mutated start codon. The Figure shows that the predicted protein coding frame (MH_748) is shorter at the 5′ end than the corresponding transcript level shown by the RNA-Seq coverage. Although the transcript is longer near 5′ end, no start codon is found in that region which might be a result of the mutation in that region of the start codon. This was further validated using homology searches of the full length transcript which shows high homology (95% Identity and >95% coverage) to a alpha-L-fucosidase protein from M. haemolytica PHL213. (TIF) Click here for additional data file. genes expressed in the present study according to the TIGRFAM categories. (XLS) Click here for additional data file. Pairs of co-expressed genes identified in by RNA-Seq data analysis. (XLS) Click here for additional data file. Transcription units identified by joining co-expressed genes in 2336. (XLS) Click here for additional data file. Comparison of co-expressed gene pairs identified from RNA-Seq data and operon prediction program “DOOR”. (XLS) Click here for additional data file.
  46 in total

1.  Artemis: sequence visualization and annotation.

Authors:  K Rutherford; J Parkhill; J Crook; T Horsnell; P Rice; M A Rajandream; B Barrell
Journal:  Bioinformatics       Date:  2000-10       Impact factor: 6.937

Review 2.  Economic impact associated with respiratory disease in beef cattle.

Authors:  D Griffin
Journal:  Vet Clin North Am Food Anim Pract       Date:  1997-11       Impact factor: 3.357

3.  Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome.

Authors:  M G Reese
Journal:  Comput Chem       Date:  2001-12

4.  Microbial gene identification using interpolated Markov models.

Authors:  S L Salzberg; A L Delcher; S Kasif; O White
Journal:  Nucleic Acids Res       Date:  1998-01-15       Impact factor: 16.971

5.  Transcriptome analysis of Escherichia coli using high-density oligonucleotide probe arrays.

Authors:  Brian Tjaden; Rini Mukherjee Saxena; Sergey Stolyar; David R Haynor; Eugene Kolker; Carsten Rosenow
Journal:  Nucleic Acids Res       Date:  2002-09-01       Impact factor: 16.971

6.  Bacillus anthracis genome organization in light of whole transcriptome sequencing.

Authors:  Jeffrey Martin; Wenhan Zhu; Karla D Passalacqua; Nicholas Bergman; Mark Borodovsky
Journal:  BMC Bioinformatics       Date:  2010-04-29       Impact factor: 3.169

Review 7.  A genomic window into the virulence of Histophilus somni.

Authors:  Indra Sandal; Thomas J Inzana
Journal:  Trends Microbiol       Date:  2009-12-23       Impact factor: 17.079

8.  Rfam: annotating non-coding RNAs in complete genomes.

Authors:  Sam Griffiths-Jones; Simon Moxon; Mhairi Marshall; Ajay Khanna; Sean R Eddy; Alex Bateman
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

Review 9.  The immunology of the bovine respiratory disease complex.

Authors:  J A Ellis
Journal:  Vet Clin North Am Food Anim Pract       Date:  2001-11       Impact factor: 3.357

Review 10.  Infectious bovine rhinotracheitis, parainfluenza-3, and respiratory coronavirus.

Authors:  S Kapil; R J Basaraba
Journal:  Vet Clin North Am Food Anim Pract       Date:  1997-11       Impact factor: 3.357

View more
  14 in total

1.  Single nucleotide polymorphisms in the bovine Histophilus somni genome; a comparison of new and old isolates.

Authors:  Claudia Avis Madampage; Neil Rawlyk; Gordon Crockford; Joyce Van Donkersgoed; Craig Dorin; Andrew Potter
Journal:  Can J Vet Res       Date:  2015-07       Impact factor: 1.310

2.  Two outer membrane lipoproteins from Histophilus somni are immunogenic in rabbits and sheep and induce protection against bacterial challenge in mice.

Authors:  Carolina Guzmán-Brambila; Argelia E Rojas-Mayorquín; Beatriz Flores-Samaniego; Daniel Ortuño-Sahagún
Journal:  Clin Vaccine Immunol       Date:  2012-09-12

3.  Analysis of strand-specific RNA-seq data using machine learning reveals the structures of transcription units in Clostridium thermocellum.

Authors:  Wen-Chi Chou; Qin Ma; Shihui Yang; Sha Cao; Dawn M Klingeman; Steven D Brown; Ying Xu
Journal:  Nucleic Acids Res       Date:  2015-03-12       Impact factor: 16.971

4.  Gene expression profile analysis and target gene discovery of Mycobacterium tuberculosis biofilm.

Authors:  Fangxue Ma; Hong Zhou; Zhiqiang Yang; Chao Wang; Yanan An; Lihui Ni; Mingyuan Liu; Yang Wang; Lu Yu
Journal:  Appl Microbiol Biotechnol       Date:  2021-06-14       Impact factor: 4.813

5.  Transcriptome analysis of Acidovorax avenae subsp. avenae cultivated in vivo and co-culture with Burkholderia seminalis.

Authors:  Bin Li; Muhammad Ibrahim; Mengyu Ge; Zhouqi Cui; Guochang Sun; Fei Xu; Michael Kube
Journal:  Sci Rep       Date:  2014-07-16       Impact factor: 4.379

6.  A Transcriptome Map of Actinobacillus pleuropneumoniae at Single-Nucleotide Resolution Using Deep RNA-Seq.

Authors:  Zhipeng Su; Jiawen Zhu; Zhuofei Xu; Ran Xiao; Rui Zhou; Lu Li; Huanchun Chen
Journal:  PLoS One       Date:  2016-03-28       Impact factor: 3.240

7.  Transcriptome profile of a bovine respiratory disease pathogen: Mannheimia haemolytica PHL213.

Authors:  Joseph S Reddy; Ranjit Kumar; James M Watt; Mark L Lawrence; Shane C Burgess; Bindu Nanduri
Journal:  BMC Bioinformatics       Date:  2012-09-11       Impact factor: 3.169

8.  Computational small RNA prediction in bacteria.

Authors:  Jayavel Sridhar; Paramasamy Gunasekaran
Journal:  Bioinform Biol Insights       Date:  2013-03-07

9.  RNA-seq and microarray complement each other in transcriptome profiling.

Authors:  Sunitha Kogenaru; Yan Qing; Yinping Guo; Nian Wang
Journal:  BMC Genomics       Date:  2012-11-15       Impact factor: 3.969

10.  Transcriptome dynamics-based operon prediction in prokaryotes.

Authors:  Vittorio Fortino; Olli-Pekka Smolander; Petri Auvinen; Roberto Tagliaferri; Dario Greco
Journal:  BMC Bioinformatics       Date:  2014-05-16       Impact factor: 3.169

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.