Hui Luo1,2, Haiping Liu3, Jie Zhang1, Bingjie Hu1, Chaowei Zhou1,2, Mengbin Xiang1, Yuejing Yang1,2, Mingrui Zhou1,2, Tingsen Jing1,2, Zhe Li1, Xinghua Zhou1,2, Guangjun Lv1,2, Wenping He1,2, Benhe Zeng3, Shijun Xiao4, Qinglu Li5, Hua Ye6,7.
Abstract
Gymnocypris namensis, the only commercial fish in Namtso Lake in Tibet, China, is rated as a near-threatened species in the Red List of China's Vertebrates. As one of the highest-altitude schizothoracine fishes in China, G. namensis is strongly adapted to the harsh plateau environment. Although it is an indigenous economic fish of high research value, its biological characteristics, genetic diversity, and plateau adaptability remain unclear. Here, we used Pacific Biosciences single-molecule real-time (SMRT) long-read sequencing to generate full-length transcripts of G. namensis. Sequence clustering and error correction with Illumina short reads yielded 319,044 polished isoforms. After redundant sequences were removed, 125,396 non-redundant isoforms remained. Among all transcripts, 103,286 were annotated in public databases. Natural selection has acted on 42 genes in G. namensis, which were enriched in mismatch repair and glutathione metabolism functions. In total, 89,736 open reading frames, 95,947 microsatellites, and 21,360 long non-coding RNAs were identified across all transcripts. This is the first transcriptome study of G. namensis using PacBio Iso-seq. The full-length transcript isoforms obtained here may accelerate transcriptome research on G. namensis and provide a basis for further studies.
Year: 2020 PMID: 32541658 PMCID: PMC7296019 DOI: 10.1038/s41598-020-66582-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A picture of Gymnocypris namensis in Namtso Lake, Tibet.
PacBio Iso-seq output statistics.
| Libraries | 09061 | 09062 |
|---|---|---|
| Polymerase reads | 492,350 | 520,123 |
| Mean length of polymerase reads | 17,327 | 17,542 |
| Polymerase reads N50 | 34,250 | 33,250 |
| Subreads | 5,790,289 | 6,032,472 |
| Mean length of subreads | 1,396 | 1,434 |
| Number of circular consensus sequence reads (CCS) | 295,248 | 320,626 |
| Mean length of CCSs | 1,663 | 1,690 |
| Total bases of CCSs | 491,225,617 | 542,141,219 |
| Number of reads with 5′ adapter sequence | 282,632 | 308,526 |
| Number of reads with 3′ adapter sequence | 284,198 | 309,794 |
| Number of poly-A reads | 281,763 | 307,616 |
| Number of full-length reads | 270,520 | 296,536 |
| Number of full-length non-chimeric reads | 267,490 | 292,946 |
| Mean full-length non-chimeric read length | 1,542 | 1,568 |
| Full-length percentage (FL%) | 54.94 | 57.01 |
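The FL% row is numerically consistent with the number of full-length reads divided by the number of polymerase reads. That definition is inferred from the table values, not stated in the source, so the following Python check is only a sketch under that assumption; the function name `full_length_percentage` is illustrative.

```python
def full_length_percentage(full_length_reads: int, polymerase_reads: int) -> float:
    """Full-length reads as a percentage of polymerase reads, rounded to 2 decimals.

    Assumption: this is how FL% was derived; the source does not define it.
    """
    return round(100.0 * full_length_reads / polymerase_reads, 2)


# Read counts copied from the table above.
libraries = {
    "09061": {"polymerase_reads": 492_350, "full_length_reads": 270_520},
    "09062": {"polymerase_reads": 520_123, "full_length_reads": 296_536},
}

fl_percent = {
    name: full_length_percentage(v["full_length_reads"], v["polymerase_reads"])
    for name, v in libraries.items()
}

for name, value in fl_percent.items():
    print(f"library {name}: FL% = {value:.2f}")
```

Under this assumed definition, both computed values match the reported 54.94 and 57.01.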
Summary statistics of the isoforms.
| Isoform types | Polished high-quality isoforms | Short reads corrected isoforms | Non-redundant isoforms |
|---|---|---|---|
| Total bases | 488,919,715 | 488,137,232 | 228,095,655 |
| Total number | 319,044 | 319,044 | 125,396 |
| Average length | 1,532 | 1,530 | 1,819 |
| Maximum length | 11,351 | 11,289 | 11,289 |
| Minimum length | 132 | 132 | 132 |
| Median length | 1,315 | 1,312 | 1,577 |
| N50 | 1,553 | 1,552 | 2,044 |
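For readers unfamiliar with the N50 row, the sketch below uses the common definition: the length L at which sequences of length L or longer contain at least half of all bases. The source does not say which tool or exact convention produced its N50 values, so this is an assumption, and `toy_lengths` is hypothetical data. The second part only checks that each reported average length equals total bases divided by total number, rounded to the nearest integer.

```python
def n50(lengths):
    """Return the length L such that sequences >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical isoform lengths, for illustration only.
toy_lengths = [1000, 1500, 2000, 2500, 3000]
print(n50(toy_lengths))  # 2500

# (total bases, total number, reported average length) copied from the table above.
reported = {
    "polished": (488_919_715, 319_044, 1_532),
    "short_read_corrected": (488_137_232, 319_044, 1_530),
    "non_redundant": (228_095_655, 125_396, 1_819),
}

# True where the reported average matches bases / count after rounding.
average_ok = {
    name: round(bases / count) == avg
    for name, (bases, count, avg) in reported.items()
}
print(average_ok)
```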
Figure 2. Length distribution of transcript isoforms. The x-axis represents transcript isoform length; the y-axis represents the number of transcript isoforms.
Figure 3. Species identified by homology search against the NCBI NR database. Only the best hit for each transcript is included in the analysis.
Figure 4. KOG functional classification of G. namensis transcripts. Letters on the x-axis represent KOG categories, which are detailed in the legend on the right; the y-axis represents the number of transcripts.
Figure 5. GO annotation of the G. namensis transcriptome. The x-axis represents the number of genes; the y-axis represents GO categories.
Figure 6. KEGG pathways identified for transcript isoforms. The x-axis represents the number of genes; the y-axis represents KEGG pathways.
Repeat numbers and unit length distribution of putative pure SSR markers in the transcriptome.
| Repeat number | Mono | Di | Tri | Tetra | Penta | Hexa | Total | Percent (%) |
|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0 | 3,455 | 227 | 37 | 18 | 3,737 | 6.78 |
| 6 | 0 | 5,419 | 1,326 | 104 | 8 | 2 | 6,859 | 12.44 |
| 7 | 0 | 2,815 | 616 | 23 | 6 | 1 | 3,461 | 6.28 |
| 8 | 0 | 1,878 | 298 | 16 | 1 | 0 | 2,193 | 3.98 |
| 9 | 0 | 1,301 | 135 | 9 | 1 | 0 | 1,446 | 2.62 |
| 10 | 9,782 | 1,030 | 71 | 14 | 0 | 0 | 10,897 | 19.76 |
| 11 | 5,170 | 700 | 37 | 6 | 2 | 0 | 5,915 | 10.73 |
| 12 | 3,252 | 507 | 17 | 10 | 0 | 0 | 3,786 | 6.87 |
| 13 | 2,253 | 457 | 17 | 3 | 5 | 0 | 2,735 | 4.96 |
| 14 | 1,534 | 316 | 3 | 20 | 1 | 0 | 1,874 | 3.40 |
| 15 | 1,035 | 210 | 12 | 5 | 1 | 1 | 1,264 | 2.29 |
| 16 | 719 | 195 | 6 | 2 | 1 | 0 | 923 | 1.67 |
| 17 | 538 | 193 | 3 | 4 | 0 | 0 | 738 | 1.34 |
| 18 | 435 | 135 | 11 | 6 | 0 | 0 | 587 | 1.06 |
| 19 | 371 | 87 | 0 | 2 | 2 | 0 | 462 | 0.84 |
| ≥20 | 7,185 | 1,018 | 7 | 47 | 3 | 0 | 8,260 | 14.98 |
| Total | 32,274 | 16,261 | 6,014 | 498 | 68 | 22 | 55,137 | 100.00 |
| Percent (%) | 58.53 | 29.49 | 10.91 | 0.90 | 0.12 | 0.04 | 100.00 | |
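The first non-zero row in each motif column suggests minimum repeat counts of 10 for mononucleotide motifs, 6 for dinucleotide motifs, and 5 for tri- to hexanucleotide motifs. The source does not name the SSR-detection tool or its settings here, so these thresholds are an inference from the table. The sketch below is a simple regular-expression search for perfect (pure) tandem repeats under that assumption, not the authors' pipeline; `find_pure_ssrs` and `demo` are hypothetical names.

```python
import re

# Minimum repeat counts per motif length, inferred from the table above
# (assumption: the source does not state the thresholds that were used).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_pure_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # since those belong to the shorter motif class.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical sequence: a 12-copy A run and a 7-copy AC run.
demo = "CCG" + "A" * 12 + "CCG" + "AC" * 7 + "GGT"
print(find_pure_ssrs(demo))  # [(3, 'A', 12), (18, 'AC', 7)]
```

Compound or interrupted microsatellites are not handled by this sketch.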
Figure 7. Length distribution of the coding sequences of complete ORFs. The x-axis represents coding sequence length; the y-axis represents the number of predicted ORFs.