| Literature DB >> 25168586 |
Panagiotis Ioannidis, Yong Lu, Nikhil Kumar, Todd Creasy, Sean Daugherty, Marcus C Chibucos, Joshua Orvis, Amol Shetty, Sandra Ott, Melissa Flowers, Naomi Sengamalay, Luke J Tallon, Leslie Pick, Julie C Dunning Hotopp1.
Abstract
BACKGROUND: Halyomorpha halys (Stål) (Insecta:Hemiptera;Pentatomidae), commonly known as the Brown Marmorated Stink Bug (BMSB), is an invasive pest of the mid-Atlantic region of the United States, causing economically important damage to a wide range of crops. Native to Asia, BMSB was first observed in Allentown, PA, USA, in 1996, and this pest is now well-established throughout the US mid-Atlantic region and beyond. In addition to the serious threat BMSB poses to agriculture, BMSB has become a nuisance to homeowners, invading home gardens and congregating in large numbers in human-made structures, including homes, to overwinter. Despite its significance as an agricultural pest with limited control options, only 100 bp of BMSB sequence data was available in public databases when this project began.Entities:
Mesh:
Substances:
Year: 2014 PMID: 25168586 PMCID: PMC4174608 DOI: 10.1186/1471-2164-15-738
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1Life stages of BMSB. The life stages of BMSB are shown starting with eggs followed by 1st instar nymphs, 2nd instar nymphs, 3rd instar nymphs, 4th instar nymphs, 5th instar nymphs, and an adult in a counter-clockwise spiral outwards and from largest to smallest. The bar in the low left represents 1 cm.
Figure 2Strand-specific sequencing with DSN normalization. In strand-specific sequencing, the final sequencing reads can be assigned to a particular strand of DNA. This is accomplished by a second-strand synthesis with dUTP, ligation of adaptors to double-stranded DNA, and degradation of the second strand with UNG. The result is a fragment to be sequenced that is differentially labeled on the two ends in a manner that dictates how they are loaded on the sequencer and thus the order in which the ends are sequenced.
Figure 3Features of the trinity assembly. (A) The size distribution of the putative transcripts on a logarithmic scale reveals that the vast majority of assembled transcripts are <1 kbp but transcripts >25 kbp can be assembled. (B) For the vast majority of genes there is only a single transcript assembled. (C) The % GC content of the putative transcripts peaks around 35%.
Figure 4Assessment of the strand-specificity of the RNA-Seq library. The strand specificity of the sequencing library was based on comparing the number of first-in-pair reads mapped to the plus strand, to the corresponding number of second-in-pair reads. In a strand-specific library, the strand-specific log ratio (SSLR) should be away from zero. In the protocol used in this study, where the second-in-pair read comes from the sense strand, a negative SSLR is expected. (A) The distribution of SSLRs for putative transcripts having at least one proper read pair mapping to them demonstrates a preference for those with a negative SSLR. (B) When the plot is limited to those with twenty read pairs mapping to them, the bimodal distribution becomes more distinct, likely from the removal of contigs arising from contaminating genomic DNA.
Figure 5Transcript properties relative to transcript coverage. The mean length (orange), number of putative transcripts <300 bp (gray), and homology to an E. coli sequence (yellow) were plotted on the left axis while the number of putative transcripts (blue) were plotted on the right axis after applying a coverage threshold to the transcript (x-axis). A coverage cutoff of 20 maximizes the mean length and minimizes the number of transcripts with homology to E. coli.
Variables and their weights used to filter putative transcripts
| Attribute | Weight | Variable | Reason |
|---|---|---|---|
|
| |||
| 1. What proportion of the database protein is covered in the first Uniref100 hit? | 10 | Proportion covered | Reward the putative transcript based on the proportion of the database protein that is covered in the best BLASTX hit |
| 2. What proportion of the putative transcript is covered by the first Uniref100 hit? | 8 | Proportion covered | Reward the putative transcript based on the proportion of the query putative transcript that is covered in the best BLASTX hit |
| 3. What is the length covered on the database protein in the first Uniref100 hit? | 7 | Database hit length/longest database hit length | Reward based on the absolute database protein length covered in the best BLASTX hit, compared to the longest hit length in the component |
| 4. What is the length covered on the putative transcript in the first Uniref100 hit? | 5 | Putative transcript hit length/longest putative transcript hit length | Reward based on the absolute query putative transcript length covered in the best BLASTX hit, compared to the longest hit length in the component |
| 5. Is the strand of the Uniref100 match, the expected one (based on SSLR)? | 4 | Match strand* (−SSLR/max |SSLR|) | Reward matches in plus strand if SSLR <0 or matches in minus strand if SSLR >0. In contrast, penalize matches in plus strand if SSLR >0 or matches in minus strand if SSLR <0 |
| 6. What proportion of the database protein is covered in the first NR hit? | 9 | Proportion covered | Same as the corresponding metric for Uniref100 |
| 7. What proportion of the putative transcript is covered by the best NR hit? | 7 | Proportion covered | Same as the corresponding metric for Uniref100 |
| 8. What is the relative length covered on the database protein in the first NR hit? | 6 | Database hit length/longest database hit length | Same as the corresponding metric for Uniref100 |
| 9. What is the relative length covered on the Trinity putative transcript in the first NR hit? | 4 | Putative transcript hit length/longest putative transcript hit length | Same as the corresponding metric for Uniref100 |
| 10. Is the strand of the NR match, the expected one (based on SSLR)? | 3 | Match strand* (−SSLR/max |SSLR|) | Same as the corresponding metric for Uniref100 |
| 11. Is the SSLR negative (i.e. the expected)? | 7 | - SSLR / max |SSLR| | Reward putative transcripts with the normal, negative SSLR |
| 12. How long is the putative transcript compared to the longest in the component? | 7 | Putative transcript length/longest putative transcript length | Reward longer putative transcripts |
|
| |||
| 1. Is the best match for each ORF the same? | 10 | (1 - Number of best matches)/number of best matches | Penalize putative transcripts having ORFs that have different hits. |
| 2. Are there ORFs in both strands with both having an NR hit? | 10 | - Number of ORFs in strand "A"/number of ORFs in strand "B" | Maximum penalty if both ORFs have a NR hit |
| 3. Are there ORFs in both strands with only one having an NR hit? | 8 | - Number of ORFs in strand "A"/number of ORFs in strand "B" | Intermediate penalty if only one of the ORFs has a NR hit |
| 4. Are there ORFs in both strands with none of the two having an NR hit? | 3 | - Number of ORFs in strand "A"/number of ORFs in strand "B" | Small penalty if none of the ORFs have a NR hit |
| 5. How many ORFs are called? | 8 | (1 - number of ORFs)/number of ORFs | Penalize putative transcripts having >1 ORFs |
| 6. Are the ORFs found only in the expected strand (SSLR)? | 8 | ORF strand* (−SSLR/max |SSLR|) | Reward putative transcripts having ORFs called in only the expected strand |
|
| |||
| 1. How many sequencing coverage dips? | 10 | - Number of dips/max number of dips in the component | Penalize putative transcripts with sequencing coverage dips |
Summary of assembly and annotation
| Characteristic | Before filtering | After filtering |
|---|---|---|
| Number of reads (both pools) | 366,689,206 | N/A |
| Number of putative transcripts (both pools) | 194,729 | 13,211 |
| Average transcript length (bp) | 1,005 | 2,026 |
| Standard deviation (bp) | 1,474 | 1,592 |
| Median transcript length (bp) | 439 | 1,649 |
| Maximum transcript length (bp) | 27,655 | 24,046 |
| Transcripts >1000 bp | 50,599 | 9,657 |
| Transcripts with a Uniref100 hit (e-value < 1e-10) | 80,536 | 11,513 |
| Transcripts matching unique Uniref100 proteins | 37,160 | 9,993 |
| Transcripts with a NR hit (e-value < 1e-10) | 80,262 | 11,497 |
| Transcripts matching unique NR proteins | 37,346 | 10,007 |
| Number of Trinity components | 123,175 | 13,211 |
| Number of ORFs | 89,684 | 13,210 |
| Number of ORFs >450 bp | 61,569 | 11,141 |
| Number of ORFs with a function assigned | 57,197 | 9,811 |
Figure 6Transcript functional categories. The NCBI Cluster of Orthologous Groups (COG) database was used to classify the predicted proteins in the 13,211 representative transcripts. Assignment of COG categories showed that a large number of ORFs belonged to categories of proteins whose functions are poorly characterized, namely those that have general function prediction only and those with unknown function.
Differentially expressed detoxification genes
| Name | Functional annotation | Stage up-regulated |
|---|---|---|
| Comp25785_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp26191_c0_seq1 | Glutathione S-transferase | Adults |
| Comp15122_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp16607_c1_seq2 | Cytochrome P450 gene | Adults |
| Comp15049_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp15912_c1_seq2 | Cytochrome P450 gene | Adults |
| Comp20672_c0_seq4 | Cytochrome P450 gene | Adults |
| Comp13685_c1_seq2 | Cytochrome P450 gene | Adults |
| Comp40610_c0_seq2 | Cytochrome P450 gene | Adults |
| Comp8954_c0_seq1 | Glutathione S-transferase | Adults |
| Comp20241_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp18070_c0_seq2 | Cytochrome P450 (2 ORFs) | Adults |
| Comp11443_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp18881_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp7095_c0_seq2 | Cytochrome P450 gene | Adults |
| Comp14891_c0_seq2 | Cytochrome P450 gene | Adults |
| Comp8170_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp4236_c0_seq1 | Probable cytochrome P450 gene | Adults |
| Comp2339_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp6322_c0_seq2 | Cytochrome P450 gene | Adults |
| Comp17381_c0_seq6 | Cytochrome P450 gene | Adults |
| Comp23582_c1_seq1 | Cytochrome P450 gene | Adults |
| Comp4238_c0_seq1 | Cytochrome P450 gene | Adults |
| Comp3991_c0_seq3 | Glutathione peroxidase (3 ORFs) | Adults |
| Comp8540_c0_seq1 | Probable cytochrome P450 gene | Adults |
| Comp6146_c0_seq1 | Glutathione S-transferase | Pre-adults |
| Comp11026_c2_seq1 | Cytochrome P450 gene | Pre-adults |
| Comp21713_c0_seq1 | Cytochrome P450 gene | Pre-adults |
| Comp3892_c0_seq1 | Cytochrome P450 gene | Pre-adults |
| Comp18921_c0_seq3 | Catalase | Pre-adults |
| Comp25932_c0_seq1 | Cytochrome P450 gene | Pre-adults |
| Comp10873_c0_seq1 | Probable cytochrome P450 gene | Pre-adults |
| Comp12303_c0_seq1 | Cytochrome P450 gene | Pre-adults |
| Comp8344_c0_seq1 | Probable cytochrome P450 gene | Pre-adults |
Figure 7Mannanase C-terminus phylogeny. Proteins that were homologous to the C-terminus of the putative mannanase proteins were identified using a BLASTX [35] search of NR, aligned with CLUSTALW [36], and maximum likelihood phylogenies generated using RAxML [37]. This reveals the presence of the BMSB mannanase in a well-supported clade of bacterial homologues from bacteria in the genera Dickeya Pectobacterium, Enterobacter, and Pantoea.
Figure 8Lysozyme phylogeny. Homologues to the two putative lysozymes from BMSB were identified using a BLASTX [35] search of NR and the NCBI whole transcriptome database, aligned with MAFFT [40], and maximum likelihood phylogenies generated with RAxML [37]. The two lysozymes cluster with lysozymes from other Hemiptera as well as α-Proteobacteria. This cluster includes the lysozymes that have previously been attributed to lateral gene transfer from bacteria in the order Rickettsiales, which includes Wolbachia endosymbionts of arthropods. Presence in multiple hemipteran lineages may suggest that this transfer predates the divergence of all Hemiptera. Outside the eukaryotic and Wolbachia clades, only the genus is provided, in order to aid visualization. Sequences taken from whole transcriptome assemblies are denoted with asterisks since it is important to carefully consider that the sequences may have arisen from contamination from the microbiome. The inset in the lower left corner illustrates the phylogeny of the suborders of Hemiptera [41].
Figure 9Heat map of qRT-PCR data. All ten individual RNA samples were subjected to qRT-PCR for five insect housekeeping genes and nine genes with best hits to bacterial genes using BLAST. Within each sample the ∆Ct is illustrated in a heat map and was calculated as the difference between the average Ct across the four constitutively expressed genes (ThRS, L30, Tubulin, and L27) and the Ct value for the gene being interrogated. In this way the color coding represents the difference in expression between the gene queried and the average constitutively expressed housekeeping gene, which all had similar abundance levels. In most samples, comp18511_c0_seq1, comp549_c15_seq3, comp2753_c5_seq1, comp7015_c0_seq1, and comp11444_c0_seq1 were poorly expressed relative to the constitutively expressed genes. However, in fifth instar nymphs (dark green), comp18511_c0_seq1 was transcribed at levels similar to the constitutively expressed genes while in active adult females, comp549_c15_seq3 and comp2753_c5_seq1 were transcribed at levels 64-fold higher than the average constitutively expressed gene. Both bLysB, the putative amylase, and the putative BMSB mannanase transcripts were transcribed constitutively at levels similar to the other constitutively expressed genes. For RNA that did not amplify a Ct value was assigned of 43, which was the lowest Ct value measured across the dataset. This reflects the low abundance of a transcript that did not amplify, but still allowed for clustering. The value for comp11444_c0_seq1 in the 4th instar larvae replicate 3 is missing due to an aberrant amplification curve.