Haibo Liu, Timothy P L Smith, Dan J Nonneman, Jack C M Dekkers, Christopher K Tuggle.
Abstract
BACKGROUND: High-throughput gene expression profiling assays of peripheral blood are widely used in biomedicine, as well as in animal genetics and physiology research. Accurate, comprehensive, and precise interpretation of such high-throughput assays relies on well-characterized reference genomes and/or transcriptomes. However, neither the reference genome nor the peripheral blood transcriptome of the pig has been sufficiently assembled and annotated to support such profiling assays in this emerging biomedical model organism. We aimed to assemble published and novel RNA-seq data to provide a comprehensive, well-annotated blood transcriptome for pigs by integrating a de novo assembly with a genome-guided assembly.
Keywords: De novo transcriptome assembly; Genome-guided transcriptome assembly; Peripheral blood; Sus scrofa
Year: 2017 PMID: 28646867 PMCID: PMC5483264 DOI: 10.1186/s12864-017-3863-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
RNA-seq data used for the blood transcriptome assembly
| Study accession | Sample description | Read length | Layout | Total raw read count | Breed | Source | Run accession | Reference |
|---|---|---|---|---|---|---|---|---|
| PRJEB5250 | 16 pigs (7 weeks old) were infected with | 51 | SE | 1,611,707,563 | Yorkshire crossbred | ENA | ERR413315 - ERR413346 | [ |
| PRJNA189967 | Whole blood from 3 individual healthy pigs | 51 | SE | 106,157,368 | unknown | ENA | SRR747924-SRR747926 | [ |
| Swine Genome Sequencing Consortium | Whole blood from one healthy pig | 83 | SE | 48,973,230 | Duroc | ENSEMBL( | Unknown | http://www. |
| PRJEB12300 | Whole blood was sampled from 31 post-weaning (5 ~ 6 weeks old) pigs from lines divergently selected for residual feed intake. Globin was depleted from the total RNA. | 100 | PE | 2 × 632,557,790 | Yorkshire | ENA | ERR1199492-ERR1199522 | [ |
| PRJEB20136 | 28 pigs (~ 63 kg of body weight) from lines divergently selected for residual feed intake were muscularly injected with | 49 | PE | 2 × 1,509,173,972 | Yorkshire | ENA | ERR1898446-ERR1898477 | H. Liu, K. Feye et al., unpublished |
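In the table above, paired-end (PE) libraries report their read counts as "2 × N", while single-end (SE) libraries report a plain number. The following is a minimal sketch, assuming only the cell values shown in the table, of how those cells could be parsed and summed; the function name `parse_read_count` is hypothetical and is not taken from the study.

```python
import re


def parse_read_count(cell: str) -> int:
    """Convert a 'Total raw read count' cell to an integer.

    Accepts a plain count ('106,157,368') or a paired-end count
    written as a multiplier ('2 × 632,557,790').
    """
    match = re.fullmatch(r"(?:(\d+)\s*[×x]\s*)?([\d,]+)", cell.strip())
    if match is None:
        raise ValueError(f"unrecognized read count: {cell!r}")
    multiplier = int(match.group(1)) if match.group(1) else 1
    return multiplier * int(match.group(2).replace(",", ""))


# Cell values copied from the 'Total raw read count' column above.
cells = [
    "1,611,707,563",
    "106,157,368",
    "48,973,230",
    "2 × 632,557,790",
    "2 × 1,509,173,972",
]
total_reads = sum(parse_read_count(c) for c in cells)
print(f"{total_reads:,} raw reads across all five data sets")
```

Counting both mates of each pair is a choice made here for illustration; counting fragments instead would mean dropping the multiplier.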
Summary of assessment of the de novo and integrated transcriptome assemblies
| Transcriptome assembly | Type of assessment | Purpose | Reference data | Software | Results |
|---|---|---|---|---|---|
| The de novo transcriptome assembly | RNA-seq read representation of the assembly | To determine representation of RNA-seq reads | Normalized RNA-seq reads | Trinity [ | 66.2% of normalized RNA-seq reads could be mapped back to the de novo assembly |
| | Representation of full-length assembled protein-coding transcripts | To assess the number of full-length PTs | All protein sequences in the Swiss-Prot database | BLASTX [ | 22,831 (nearly) full-length PTs covered more than 80% of the full length of 10,097 protein sequences in the Swiss-Prot database |
| | Representation of full-length assembled transcripts | To assess the number of full-length PTs | NCBI pig RefSeq mRNAs | DC-megaBLAST [ | 16,010 (nearly) full-length PTs covered more than 80% of the full length of 9228 pig RefSeq mRNAs |
| | Origin of assembled transcripts | To assess whether the assembled PTs were of porcine genomic origin | Pig reference genomes: SSC10.2 and USMARCv1.0 | GMAP [ | 94.2% and 99.4% of the PTs could be mapped to SSC10.2 and USMARCv1.0, respectively |
| | Similarity-based assessment | To annotate the assembled PTs with known sequences of significant similarity | Sequences in the NCBI NT and NR databases | DC-megaBLAST and BLASTX [ | 69.42% and 21.9% of the PTs shared significant similarities to sequences in the NCBI NT and NR databases, respectively |
| The integrated transcriptome assembly | Similarity-based assessment | To annotate the assembled PTs with known sequences of significant similarity | Sequences in the NCBI NT and NR databases | DC-megaBLAST and BLASTX [ | ~90% and 63% of the PTs shared significant similarities to sequences in the NCBI NT and NR databases, respectively |
| | Correctness of exon-intron splicing junctions of PTs | To validate the exon-intron splicing junctions of PTs | Porcine IsoSeq full-length cDNA read data from the liver, spleen and thymus, SSC10.2 transcripts and NCBI RefSeq mRNAs | Bedtools [ | 15,303 PTs and 106,483 IsoSeq sequences had the same exon-intron junctions; and 63,845 uniquely mapping, spliced PTs shared at least one intron or exon with 390,943 IsoSeq reads; 4155 and 6641 PTs shared the same exon-intron junctions as 4010 SSC10.2 annotated transcripts and 6418 RefSeq mRNA sequences, respectively; 54,402 and 60,180 PTs shared at least one intron or one exon with 18,437 SSC10.2 transcripts and 33,870 RefSeq mRNA sequences, respectively |
| | Completeness of 5′ termini of PTs | To validate the completeness of 5′ termini of PTs | FANTOM5 CAGE data for human and mouse, and porcine macrophage CAGE data | CAGEr [ | Completeness of the 5′ termini of 37,569 PTs was verified by 43,845 proximal promoters determined by CAGE data |
| | Length extension of existing transcripts | To determine to what extent the assembled PTs improved over the existing porcine annotation | SSC10.2 transcripts and NCBI pig RefSeq mRNAs | Bedtools [ | 12,262 PTs had both longer 5′ and 3′ termini than the maximally overlapping SSC10.2 transcripts; 9764 PTs had only longer 3′ termini; and 14,650 PTs had only longer 5′ termini |
| | Novelty of PTs | To determine novel PTs | SSC10.2 transcripts and NCBI pig RefSeq mRNAs | Bedtools [ | 41,838 and 35,738 spliced PTs that did not overlap any spliced, uniquely mapping SSC10.2 transcript or any spliced, uniquely mapping pig RefSeq mRNA sequence were potential novel transcripts relative to the two reference sets, respectively |
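Several rows in the table above count transcripts that "covered more than 80% of the full length" of a reference sequence. The sketch below shows one way such a coverage criterion could be evaluated from aligned intervals on a reference; it is an illustration under stated assumptions (1-based inclusive coordinates, overlapping alignments merged before counting), not the authors' actual procedure, and the function names are hypothetical.

```python
def covered_fraction(hsps, subject_length):
    """Fraction of a reference sequence covered by the union of alignments.

    hsps: iterable of (start, end) positions on the reference,
          1-based and inclusive; start may exceed end for
          reverse-strand hits.
    subject_length: full length of the reference sequence.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e in hsps)
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / subject_length


def is_nearly_full_length(hsps, subject_length, threshold=0.80):
    """True when coverage strictly exceeds the threshold ('more than 80%')."""
    return covered_fraction(hsps, subject_length) > threshold


# Two overlapping alignments covering positions 1-90 of a 100-residue reference.
print(is_nearly_full_length([(1, 50), (40, 90)], 100))
```

Merging intervals before counting avoids double-counting positions hit by more than one alignment; the strict inequality mirrors the "more than 80%" wording in the table.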
Fig. 1 Characterization of the integrated transcriptome assembly. a Length distribution of the PTs of the integrated transcriptome; b Exon number distribution for the uniquely mapping spliced PTs of the integrated transcriptome; c Distribution of the number of isoforms per PT for PTs with at least two isoforms in the integrated transcriptome; d, e Boxplots showing the distributions of percentage of identity, percentage of query coverage, bit scores and E-values of the top BLAST hits of the PTs in the NCBI NT (d) and NR (e) databases by using DC-megaBLAST (d) and BLASTX (e), respectively. f Boxplots showing the distributions of percentage of identity, percentage of query coverage, bit scores and E-values of the best reciprocal DC-megaBLAST hits of the PTs in the human transcriptome (GENCODE v25). For clearer visualization, larger outliers of bit scores and E-values are not displayed in d-f
Fig. 2 GO terms and EC code annotation of the integrated transcriptome assembly. a Distribution of the top 20 GO terms at level two, where available; b Distribution of the six main EC classes
Fig. 3 IsoSeq full-length cDNA reads (a) and CAGE data (b) validate fine structures of PTs in the integrated transcriptome assembly. a An example in which the intron arrangement of one assembled PT (shaded gray in the “Assembly” panel) was validated by one IsoSeq full-length cDNA read (shaded gray in the “IsoSeq” panel). For reference, displayed from top to bottom are genomic coordinates, genome coverage by the normalized RNA-seq reads, aligned RNA-seq reads, pig RefSeq mRNAs, SSC10.2 transcripts and IsoSeq read alignments. In the panel labeled “RNA-seq Cov”, heights of the gray or colored bars represent CPB by the RNA-seq reads. In the “RNA-seq” panel, purple and blue boxes represent reads mapped to the forward and reverse strands of the chromosome, respectively, while the thin segments represent introns spanned by spliced reads. In the panels labeled “Assembly”, “RefSeq”, “SSC10.2” or “IsoSeq”, red boxes represent exons and thin segments represent introns, with arrows indicating the orientation of the sense strand. The “Assembly” panel shows PTs mapped to this genomic window. b An example in which a proximal promoter conserved among pigs, humans and mice, determined by CAGE, overlaps the 5′ termini of several assembled isoforms of a gene, indicating completeness of the 5′ termini of those PTs. The symbols have the same meaning as in (a); in addition, the blue boxes represent proximal promoters determined by CAGE in porcine macrophage, human and mouse cells in the three panels labeled “Pig CAGE”, “Human CAGE” and “Mouse CAGE”
Fig. 4 The integrated blood transcriptome assembly improves the structural annotation of the porcine genome compared to the SSC10.2 annotation. The example shown is the Artemis gene locus, whose SSC10.2 annotation was improved by extending the 3′ UTR and adding novel isoforms. The symbols have the same meaning as in Fig. 3. The two assembled Artemis isoforms are verified by IsoSeq reads, which show many more isoforms for the Artemis gene. For reference, proximal promoters determined by CAGE data and a RefSeq mRNA of the Artemis gene are also shown
Fig. 5 Length comparison between PTs and transcripts in the reference sets. The lengths of uniquely mapping spliced PTs were compared with those of SSC10.2 transcripts and pig RefSeq mRNAs. Displayed are the numbers of PTs with both longer 5′ and 3′ termini, only a longer 3′ terminus, only a longer 5′ terminus, or neither terminus longer than their maximally overlapping reference transcripts in the SSC10.2 annotation (red) and the RefSeq mRNA collection (blue)