Amit Singh, Géza Schermann, Sven Reislöhner, Nikola Kellner, Ed Hurt, Michael Brunner.
Abstract
A correct genome annotation is fundamental for research in the field of molecular and structural biology. The annotation of the reference genome of Chaetomium thermophilum has been reported previously, but it is essentially limited to open reading frames (ORFs) of protein coding genes and contains only a few noncoding transcripts. In this study, we identified and annotated full-length transcripts of C. thermophilum by deep RNA sequencing. We annotated 7044 coding genes and 4567 noncoding genes. Astonishingly, 23% of the coding genes are alternatively spliced. We identified 679 novel coding genes as well as 2878 novel noncoding genes and corrected the structural organization of more than 50% of the previously annotated genes. Furthermore, we substantially extended the Gene Ontology (GO) and Enzyme Commission (EC) lists, which provide comprehensive search tools for potential industrial applications and basic research. The identified novel transcripts and improved annotation will help to understand the gene regulatory landscape in C. thermophilum. The analysis pipeline developed here can be used to build transcriptome assemblies and identify coding and noncoding RNAs of other species.
Keywords: Chaetomium thermophilum; Enzyme Commission number; Gene Ontology; R package; genome-wide annotation; industrial application; novel genes; transcriptome assembly
Year: 2021 PMID: 34680944 PMCID: PMC8535861 DOI: 10.3390/genes12101549
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Transcriptome-based annotation of the genome of C. thermophilum. (A) Schematic overview of the analysis pipeline. (B) Fraction of newly annotated transcripts that differ from the previous annotation by the indicated gffcompare feature. Note that the fractions add up to more than 100%. (C) Stacked bar graph comparing the number of genes in the old and new annotations. Differences are indicated by colored boxes. Coding genes, old annotation: the stacked grey bar represents the number of coding genes (749) that were not expressed (detected) under our conditions. Coding genes, new annotation: Beige: genes without change in the intron structure (n = 2440). Yellow: genes where at least one isoform has a novel splicing variant (n = 493). Green: genes where all transcripts have at least one intron-exon junction different from the previous annotation (n = 2132). Pink: genes where all junctions are different (n = 1234). Blue: genes that are flipped (opposite strand with same or similar splice junctions) (n = 19). Orange: completely novel genes (n = 679). Aquamarine: uncategorized genes (n = 49). (D) Histogram representing the lengths of coding transcripts from the new annotation (beige) compared with the previous annotation of ORFs (black). Cutoff at 10,000 nt.
(A): RNA-seq data alignment results for reads of different samples. (B): Different classes of assembled transcripts.
| Sample | Number of reads | Alignment rate |
|---|---|---|
| G1 | 69350606 | 95.94% |
| G2 | 73230226 | 95.87% |
| G3 | 59976502 | 95.83% |

| Class code | Description | Number of transcripts |
|---|---|---|
| = | Complete match of intron chain | 3078 |
| c | Contained in reference (and intron isoform compatible) | 323 |
| k | Containment of reference (reverse containment) | 0 |
| j | At least one splice junction match | 4234 |
| e | Single exon overlapping an intron, possibly a pre-mRNA fragment (unspliced intron) | 420 |
| o | Other same-strand overlap with reference exons | 1896 |
| s | Intron match on the opposite strand (likely a mapping error) | 158 |
| x | Exonic overlap on the opposite strand (like ‘o’ or ‘e’, but on the opposite strand) | 1754 |
| i | Fully contained in a reference intron | 28 |
| y | Contains a reference within its intron(s) | 5 |
| p | Possible polymerase run-on (no actual overlap) | 706 |
| r | Repeat (at least 50% of bases soft-masked) | 0 |
| u | None of the above (unknown, intergenic) | 2744 |
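
A per-class tally like the one above could be derived from a gffcompare transcript map. The sketch below is only illustrative: it assumes a tab-separated table with a header row containing a `class_code` column (as in gffcompare's `.tmap` output), and the example rows are hypothetical, not data from the study.

```python
from collections import Counter


def class_code_fractions(tmap_text):
    """Count class codes in .tmap-style text; return (counts, fractions)."""
    lines = [ln for ln in tmap_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    # Assumption: the class code lives in a column named "class_code".
    col = header.index("class_code")
    counts = Counter(ln.split("\t")[col] for ln in lines[1:])
    total = sum(counts.values())
    fractions = {code: n / total for code, n in counts.items()}
    return counts, fractions


# Hypothetical four-transcript example (made-up identifiers).
example = "\n".join([
    "ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id",
    "geneA\ttxA\t=\tG.1\tG.1.1",
    "geneB\ttxB\tj\tG.2\tG.2.1",
    "-\t-\tu\tG.3\tG.3.1",
    "geneB\ttxB\tj\tG.2\tG.2.2",
])
counts, fractions = class_code_fractions(example)
for code, n in counts.most_common():
    print(code, n, f"{fractions[code]:.0%}")
```

Here each transcript contributes exactly one class code, so the fractions sum to 1; a figure that groups transcripts by several overlapping features can exceed 100%.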
RNA-seq data alignment results for three different samples.
| Software | Usage | Parameter Settings/Webpage |
|---|---|---|
| Ensembl genome (version 2.8) | | |
| FastQC (Version: 0.11.5) | Quality assessment of the raw sequence data | |
| HISAT2 (Version: 1.3.3b) | The raw reads were mapped to | hisat2 -p 8 -x --max-intronlen 2000 --dta -U |
| StringTie (v1.3.4 release) | The mapped reads from HISAT2 were assembled separately for each sample | stringtie -o -m 50 -p 8 -j 3 -c 5 -g 15 |
| GffCompare (Version: v0.10.1) | Compares, merges, and annotates “query” files and estimates their accuracy against a reference annotation | gffcompare-merge -K -o gffcomp -i |
| CPC2 | Assesses the coding potential of each transcript (coding or noncoding) | CPC2.py -i input.fasta -o output.txt |
| dc-megablast (BLAST Version: 2.7.1+) | Sequence conservation analysis between different species | |
| Blast2GO (Version 5.1.1) | Functional annotations such as GO terms and EC numbers were extracted with this software | |
| blastx | Finds regions of local similarity between sequences | |
| T-Coffee software | Pairwise similarity scores were calculated | |
| phyloT | Visualization of the phylogenetic tree of the different species | |
| pyfaidx Python package | Fast random access to any subsequence in an indexed FASTA file | |
| R (Version 3.3.3) | Creating figures and the GO annotation package | |
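
To make the order of the mapping, assembly, and coding-potential steps concrete, the following sketch assembles the command lines from the parameter settings listed in the table. It is an illustration, not the authors' script: all file names and the index prefix are hypothetical, the HISAT2 long options are written with double dashes, the `-S` option is added here to name the SAM output, and the conversion of SAM to a coordinate-sorted BAM (which StringTie expects) is assumed to happen in a separate step not listed in the table.

```python
import shlex


def hisat2_cmd(index_prefix, reads_fastq, out_sam):
    """HISAT2 mapping of single-end reads with the settings from the table."""
    return ["hisat2", "-p", "8", "-x", index_prefix,
            "--max-intronlen", "2000", "--dta",
            "-U", reads_fastq,
            "-S", out_sam]  # -S added here (assumption) to name the output


def stringtie_cmd(sorted_bam, out_gtf):
    """Per-sample StringTie assembly with the settings from the table."""
    return ["stringtie", sorted_bam, "-o", out_gtf,
            "-m", "50", "-p", "8", "-j", "3", "-c", "5", "-g", "15"]


def cpc2_cmd(transcripts_fasta, out_txt):
    """CPC2 coding-potential call on the assembled transcript sequences."""
    return ["CPC2.py", "-i", transcripts_fasta, "-o", out_txt]


def sample_commands(sample, index_prefix):
    """Return the ordered commands for one sample (hypothetical file names)."""
    return [
        hisat2_cmd(index_prefix, f"{sample}.fastq", f"{sample}.sam"),
        # A SAM -> sorted BAM conversion is assumed between these two steps.
        stringtie_cmd(f"{sample}.sorted.bam", f"{sample}.gtf"),
    ]


commands = [c for s in ("G1", "G2", "G3")
            for c in sample_commands(s, "ct_index")]
commands.append(cpc2_cmd("transcripts.fasta", "cpc2_output.txt"))
for cmd in commands:
    print(shlex.join(cmd))
```

The commands are only printed, not executed; each list could be passed to `subprocess.run(cmd, check=True)` once the tools and input files are available.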
Figure 2. Location of noncoding transcripts. (A) Schematic representation of possible locations of noncoding transcripts relative to coding genes. (B) Bar graph showing the number of noncoding transcripts at the indicated genomic locations.
Figure 3. Phylogenetic conservation of coding and noncoding C. thermophilum transcripts. (A) Phylogenetic tree of the six indicated species. (B) Double bar graph representing the sequence similarity of C. thermophilum transcripts with the six indicated species. Light bars correspond to the percentage of similar coding transcripts. The dark bars in front correspond to the percentage of noncoding transcripts. (C) Number of transcripts associated with the GO annotations “biological process” (BP), “molecular function” (MF), and “cellular compartment” (CC). The right bar indicates the total number of transcripts associated with at least one GO term. The percentage of transcripts associated with the respective GO terms is indicated. Note that a given transcript can be associated with more than one GO term.