Ji Yeon Kim, Hye Young Lim, Sang Eon Shin, Hyo Kyeong Cha, Jeong-Han Seo, Suel-Kee Kim, Seong Hwan Park, Gi Hoon Son.
Abstract
Sarcophaga peregrina (flesh fly) is a frequently found fly species in Palaearctic, Oriental, and Australasian regions that can be used to estimate minimal postmortem intervals important for forensic investigations. Despite its forensic importance, the genome information of S. peregrina has not been fully described. Therefore, we generated a comprehensive gene expression dataset using RNA sequencing and carried out de novo assembly to characterize the S. peregrina transcriptome. We obtained precise sequence information for RNA transcripts using two different methods. Based on primary sequence information, we identified sets of assembled unigenes and predicted coding sequences. Functional annotation of the aligned unigenes was performed using the UniProt, Gene Ontology, and Kyoto Encyclopedia of Genes and Genomes databases. As a result, 26,580,352 and 83,221 raw reads were obtained using the Illumina MiSeq and PacBio RS II Iso-Seq sequencing applications, respectively. From these reads, 55,730 contigs were successfully annotated. The present study provides the resulting genome information of S. peregrina, which is valuable for forensic applications.
Year: 2018 PMID: 30398471 PMCID: PMC6219405 DOI: 10.1038/sdata.2018.220
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Schematic overview of the study.
Samples were prepared by pooling equal amounts of RNA from each developmental stage of S. peregrina, including early-, middle-, and late-instar larvae; middle-stage pupae; and adults. cDNA was synthesized and sequenced using the Illumina MiSeq platform with paired-end reads. For more accurate gene prediction of S. peregrina, we also employed the PacBio RS II Iso-Seq system for full-length transcript sequencing. Analysis started with the assembly of full-length transcripts and corresponding isoforms using the de novo assembly program CLC Assembly Cell, without a reference genome. The transcripts derived from Iso-Seq and MiSeq were combined, and the CD-HIT-EST program was used to eliminate redundancy and construct the final standard transcript set. We then predicted coding sequences with the TransDecoder program, examined the completeness of the S. peregrina unigenes by BUSCO analysis, and continued with functional analysis using the BLAST program. Quality control assessments were performed at each step.
Raw data deposit.
The dataset consists of two samples. Sample 1 is from the paired-end sequencing dataset obtained using the Illumina MiSeq platform. Sample 2 is from the full-length sequencing dataset obtained using the PacBio RS II system. Sequence data were deposited in the Sequence Read Archive (SRA, accession numbers SRR6265701 and SRR6265702) (Data Citation 1).
| Sample no. | SRA Runs | BioSample | Title |
|---|---|---|---|
| 1 | SRR6265701 | SAMN07981104 | Peregrina-Pooled-RNA_1.fastq and Peregrina-Pooled-RNA_2.fastq |
| 2 | SRR6265702 | SAMN07981103 | NIR_all_quivered_hq.100_30_0.99.fastq |
Unigene deposit.
The dataset contains unigenes from the longest contigs per transcript generated using the CLC Assembly Cell and CD-HIT-EST programs. The SPER_Unigenes file contains the total unigenes from S. peregrina.
| File name | File type | Data |
|---|---|---|
| SPER_Unigenes | fasta | unigenes |
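
The "longest contig per transcript" rule described for the unigene set can be illustrated with a minimal sketch. It assumes the redundancy clustering has already been done (in the study, by CD-HIT-EST) and that the result is available as a mapping from sequence ID to cluster ID. The function name, the IDs, and the toy sequences are hypothetical and are not taken from the deposited files.

```python
def pick_unigenes(sequences, cluster_of):
    """Keep the longest sequence in each cluster.

    sequences  : dict mapping sequence ID -> nucleotide string
    cluster_of : dict mapping sequence ID -> cluster ID (assumed to come
                 from an earlier clustering step; not computed here)
    Returns a dict mapping cluster ID -> (sequence ID, sequence).
    Ties are resolved in favour of the sequence seen first.
    """
    best = {}
    for seq_id, seq in sequences.items():
        cluster = cluster_of[seq_id]
        current = best.get(cluster)
        if current is None or len(seq) > len(current[1]):
            best[cluster] = (seq_id, seq)
    return best


# Hypothetical toy input: three isoforms in two clusters.
toy_sequences = {
    "iso_1": "ATGGCC",
    "iso_2": "ATGGCCAAATTT",
    "iso_3": "ATGTTT",
}
toy_clusters = {"iso_1": "c0", "iso_2": "c0", "iso_3": "c1"}

unigenes = pick_unigenes(toy_sequences, toy_clusters)
for cluster, (seq_id, seq) in sorted(unigenes.items()):
    print(cluster, seq_id, len(seq))
```

This only shows the selection step; it does not reproduce the sequence-identity clustering itself.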
Annotation deposit.
The dataset contains functional annotations and gene coding sequence annotations for S. peregrina.
| File name | File type | Data description |
|---|---|---|
| SPER_blast2go_GO | Xls | GO database annotation |
| SPER_blast2go_kegg | Xls | KEGG database annotation |
| SPER_blast2go_uniprot | Xls | UniProt database annotation |
| SPER_denovo_Transcriptome_CDS | fasta | Predicted coding sequence |
| SPER_Transcriptome_protein | fasta | Predicted protein sequence |
| DMEL_SPER_ortholog_genes | Xls | Ortholog gene annotation |
Quality control and data statistics of the raw reads.
| Type | MiSeq | Iso-Seq |
|---|---|---|
| Read number | 26,580,352 | 83,221 |
| Total read length (Mb) | 7,872.2 | 191.3 |
| Q20 (%) | 90.42 | NA |
| GC (%) | 38.14 | 36.10 |
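
As a rough cross-check of the table above, the mean read length can be derived from the read count and the total length, and Q20 and GC percentages can be computed per read. The sketch below assumes that "Mb" means 10^6 bases and that quality strings use the Phred+33 encoding; neither is stated in the source, and the function names are hypothetical.

```python
def mean_read_length(total_mb, read_count):
    """Mean read length in bases, assuming 1 Mb = 1,000,000 bases."""
    return total_mb * 1_000_000 / read_count


def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide string."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def q20_percent(quality, offset=33):
    """Percentage of bases with Phred quality >= 20 (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality]
    return 100.0 * sum(score >= 20 for score in scores) / len(scores)


# Values taken from the table above.
miseq_mean = mean_read_length(7872.2, 26_580_352)
isoseq_mean = mean_read_length(191.3, 83_221)
print(f"MiSeq mean read length:   {miseq_mean:.1f} bases")
print(f"Iso-Seq mean read length: {isoseq_mean:.1f} bases")

# Toy read, not from the dataset: '5' encodes Q20 and '+' encodes Q10.
print(gc_percent("GGCCAATT"), q20_percent("5555++++"))
```

Under these assumptions the MiSeq reads average just under 300 bases and the Iso-Seq reads about 2.3 kb.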
Assembly statistics.
| Type | Value |
|---|---|
| Total numbers of unigenes | 55,730 |
| Total numbers of transcripts | 77,089 |
| Total length (bp) | 34,742,946 |
| N50 (bp) | 1,245 |
| Average length (bp) | 623.43 |
| Max length (bp) | 13,704 |
| Min length (bp) | 237 |
| GC (%) | 39.87 |
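
For readers who want to recompute statistics of this kind from a FASTA file, a minimal sketch follows. It uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that covers at least half of the total length); the source does not state which tool or definition produced the reported figures, so small differences are possible. The function names are hypothetical.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(lengths):
    """Count, total, N50, mean, max, and min for a list of contig lengths."""
    if not lengths:
        raise ValueError("no sequences given")
    total = sum(lengths)
    running = 0
    n50 = None
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the total length
            n50 = length
            break
    return {
        "count": len(lengths),
        "total_bp": total,
        "n50_bp": n50,
        "mean_bp": total / len(lengths),
        "max_bp": max(lengths),
        "min_bp": min(lengths),
    }


# Toy FASTA records, not from the deposited SPER_Unigenes file.
toy = [">a", "ACGT", "AC", ">b", "ACG", ">c", "A"]
stats = assembly_stats(fasta_lengths(toy))
print(stats)
```

Applied to the table, 34,742,946 bp divided by 55,730 unigenes gives about 623.4 bp, close to the reported average length, which suggests the average is taken per unigene rather than per transcript.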
BUSCO analysis of assembly completeness.
| BUSCO results | Arthropoda (n) | Arthropoda (%) | Diptera (n) | Diptera (%) | Insecta (n) | Insecta (%) |
|---|---|---|---|---|---|---|
| Complete BUSCOs | 970 | 90.99% | 2,112 | 75.46% | 1,466 | 88.42% |
| Complete single-copy BUSCOs | 718 | 67.35% | 1,446 | 51.66% | 1,062 | 64.05% |
| Complete duplicated BUSCOs | 252 | 23.64% | 666 | 23.79% | 404 | 24.37% |
| Fragmented BUSCOs | 58 | 5.44% | 424 | 15.15% | 112 | 6.76% |
| Missing BUSCOs | 38 | 3.56% | 263 | 9.40% | 80 | 4.83% |
| Total BUSCO groups searched | 1,066 | 100% | 2,799 | 100% | 1,658 | 100% |
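
The relationships among the BUSCO categories can be checked with a short sketch: complete BUSCOs are the sum of single-copy and duplicated ones, and complete, fragmented, and missing BUSCOs together make up the groups searched. The counts are copied from the table above; the dictionary layout and function name are hypothetical.

```python
BUSCO_COUNTS = {
    "Arthropoda": {"single": 718, "duplicated": 252, "fragmented": 58, "missing": 38, "total": 1066},
    "Diptera": {"single": 1446, "duplicated": 666, "fragmented": 424, "missing": 263, "total": 2799},
    "Insecta": {"single": 1062, "duplicated": 404, "fragmented": 112, "missing": 80, "total": 1658},
}


def busco_summary(counts):
    """Derive the complete count and percentages from raw category counts."""
    total = counts["total"]
    complete = counts["single"] + counts["duplicated"]
    accounted = complete + counts["fragmented"] + counts["missing"]
    return {
        "complete": complete,
        "complete_pct": 100.0 * complete / total,
        "fragmented_pct": 100.0 * counts["fragmented"] / total,
        "missing_pct": 100.0 * counts["missing"] / total,
        "consistent": accounted == total,  # categories sum to groups searched
    }


summaries = {name: busco_summary(c) for name, c in BUSCO_COUNTS.items()}
for name, s in summaries.items():
    print(f"{name}: complete={s['complete']} ({s['complete_pct']:.2f}%), consistent={s['consistent']}")
```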
Annotation statistics.
| Type | Number of unigenes |
|---|---|
| Unigene number | 55,730 |
| UniProt | 33,991 |
| GO | 23,269 |
| KEGG | 6,335 |
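
To put the annotation counts in proportion, the fraction of unigenes annotated by each database can be computed directly from the table. This is plain arithmetic on the reported numbers; the variable and function names are hypothetical.

```python
UNIGENES = 55_730
ANNOTATED = {"UniProt": 33_991, "GO": 23_269, "KEGG": 6_335}


def annotation_rates(annotated, total):
    """Percentage of unigenes annotated in each database."""
    return {db: 100.0 * n / total for db, n in annotated.items()}


rates = annotation_rates(ANNOTATED, UNIGENES)
for db, pct in rates.items():
    print(f"{db}: {pct:.2f}% of unigenes")
```

By this calculation roughly 61% of unigenes have a UniProt hit, about 42% carry a GO annotation, and about 11% map to KEGG.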
Figure 2. Characteristics of homology search of contigs against the UniProt protein database.
(a) E-value distribution of the top BLAST hits for each contig (E-value < 1.0e-3). (b) Hit species distribution.
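
Selecting the top hit per contig under the E-value cutoff in panel (a) could look like the sketch below. It assumes BLAST results in the standard 12-column tabular layout (`-outfmt 6`: query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and picks the hit with the highest bit score; the source does not say which output format or tie-breaking rule was used. The function name and the example IDs are hypothetical.

```python
def top_hits(lines, max_evalue=1e-3):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Only hits with E-value strictly below max_evalue are considered, and the
    best hit is the one with the highest bit score.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular BLAST lines; IDs are placeholders.
example = [
    "contig_1\tPROT_A\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180.0",
    "contig_1\tPROT_B\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-10\t70.0",
    "contig_2\tPROT_C\t40.0\t50\t30\t0\t1\t150\t1\t50\t2e-3\t30.0",
]
hits = top_hits(example)
print(hits)
```

Here `contig_2` is dropped because its only hit has an E-value above the cutoff.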
Figure 3. GO classification.
Results are summarized in three main categories: biological process, cellular component, and molecular function. The y-axis indicates the number of contigs.
Ortholog groups of S. peregrina and D. melanogaster identified by OrthoMCL.
| Organism | Total proteins | Orthologs | Ortholog groups | Specific genes (no BLAST hit + specific paralogs) | Extra (with BLAST hit, no grouping) | Orthologs/ortholog groups |
|---|---|---|---|---|---|---|
| S. peregrina | 55,730 (100%) | 14,584 (26.17%) | 8,378 | 7,463 groups (1,685 + 12,435) | 27,026 (48.49%) | 1.74 |
| D. melanogaster | 30,362 (100%) | 22,670 (74.67%) | 8,378 | 2,250 groups (132 + 5,350) | 2,210 (7.28%) | 2.70 |
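
The ortholog table can be sanity-checked with a short sketch. In each row the orthologs, the species-specific genes (the two numbers in parentheses), and the "extra" proteins appear to partition the total protein set. The code assumes the first row is S. peregrina, because its total matches the 55,730 unigenes, and the second is D. melanogaster; the field names are hypothetical.

```python
ORTHOLOG_ROWS = {
    "S. peregrina": {"total": 55_730, "orthologs": 14_584, "groups": 8_378,
                     "no_blast": 1_685, "specific_paralogs": 12_435, "extra": 27_026},
    "D. melanogaster": {"total": 30_362, "orthologs": 22_670, "groups": 8_378,
                        "no_blast": 132, "specific_paralogs": 5_350, "extra": 2_210},
}


def check_row(row):
    """Return partition consistency, ortholog share, and orthologs per group."""
    specific = row["no_blast"] + row["specific_paralogs"]
    return {
        "partitions_total": row["orthologs"] + specific + row["extra"] == row["total"],
        "ortholog_pct": 100.0 * row["orthologs"] / row["total"],
        "orthologs_per_group": row["orthologs"] / row["groups"],
    }


checks = {name: check_row(row) for name, row in ORTHOLOG_ROWS.items()}
for name, c in checks.items():
    print(f"{name}: partition ok={c['partitions_total']}, "
          f"orthologs={c['ortholog_pct']:.2f}%, per group={c['orthologs_per_group']:.3f}")
```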