Literature DB >> 34917712

De novo transcriptome assembly and annotation of parthenogenetic lizard Darevskia unisexualis and its parental ancestors Darevskia valentini and Darevskia raddei nairensis.

Sergei S Ryakhovsky1,2, Victoria A Dikaya2, Vitaly I Korchagin1, Andrey A Vergun1,3, Lavrentii G Danilov4, Sofia D Ochkalova1,2, Anastasiya E Girnyk1, Daria V Zhernakova1,5, Marine S Arakelyan6, Vladimir B Brukhin7, Aleksey S Komissarov1,2, Alexey P Ryskov1.   

Abstract

Darevskia rock lizards include 29 sexual and seven parthenogenetic species of hybrid origin distributed in the Caucasus. All seven parthenogenetic species of the genus Darevskia were formed as a result of interspecific hybridization of only four sexual species. It remains unknown what are the main advantages of interspecific hybridization along with switching on parthenogenetic reproduction in evolution of reptiles. Data on whole transcriptome sequencing of parthenogens and their parental ancestors can provide value impact in solving this problem. Here we have sequenced ovary tissue transcriptomes from unisexual parthenogenetic lizard D. unisexualis and its parental bisexual ancestors to facilitate the subsequent annotation and to obtain the collinear characteristics for comparison with other lizard species. Here we report generated RNAseq data from total mRNA of ovary tissues of D. unisexualis, D. valentini and D. raddei with 58932755, 51634041 and 62788216 reads. Obtained RNA reads were assembled by Trinity assembler and 95141, 62123, 61836 contigs were identified with N50 values of 2409, 2801 and 2827 respectively. For further analysis top Gene Ontology terms were annotated for all species and transcript number was calculated. The raw data were deposited in the NCBI SRA database (BioProject PRJNA773939). The assemblies are available in Mendeley Data and can be accessed via doi:10.17632/rtd8cx7zc3.1.
© 2021 The Author(s). Published by Elsevier Inc.

Entities:  

Keywords:  AMP; Darevskia lizards; Ovaries; Parthenogenesis; Transcriptome analysis

Year:  2021        PMID: 34917712      PMCID: PMC8666336          DOI: 10.1016/j.dib.2021.107685

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

Data provides information about the first assembled ovary transcriptomes of three genetically related Darevskia lizards species and information about their genes and proteins. This data may benefit evolutionary biologists because it shows genetic differences between unisexual (parthenogenetic) and bisexual parental lizards. The data may provide insight into the genetic underpinning of parthenogenetic reproduction and can be used in further study of these genes.

Data Description

Ovary RNA from three individuals of each species was pooled together and used to prepare the three cDNA libraries: D. unisexualis, D. raddei nairensis, D. valentini. Table 1 shows the total number of bases, reads, GC (%), Q20 (%), and Q30 (%) that were calculated for the three samples. The characteristics of assembled transcriptome sequences are presented in Table 2. Structural characteristics of three transcriptomes are shown in Fig. 1. Obtained Trinity assemblies contain 60132 transcripts for D. unisexualis, 41680 for D. valentini, and 413664 for D. r. nairensis. TransDecoder peptide output was used for BLASTP, Pfam, and EggNOG search (Fig. 1A, Supplementary 1). BLASTP v. 2.9.0+ revealed 14049, 12331, and 11865 proteins for D. unisexualis, D. valentini, and D. nairensis respectively (Fig. 1A). Parthenogenetic species D. unisexualis showed greater TRINITY contigs (> 81.4% and > 87.2%) and transcripts (> 44.3% and > 45.4%) numbers than D. valentini and D. r. raddei respectively (Fig. 1B). The D. unisexualis showed more hits for each of the searching instruments. Top 10 GO terms taken from all GO terms datasets and distribution graphs are presented in Fig. 2 (Supplementary 2). The biggest number of annotated genes and the most annotated category was a cellular component, biological processes were less annotated. In the molecular functions category, most genes were related to binding. The most highly enriched genes in biological processes were related to the regulation of transcription of RNA polymerase II. It was found that in cellular components over-represented molecules were the nucleus and cytoplasm origin. In total, 38844, 38756, 63219 transcripts with GO terms were annotated in Table 3 for D. valentini, D.raddei, and D. unisexualis respectively. The summary of Trinotate shows a more prevalent number of annotated transcripts with GO in D. unisexualis than in D. valentini and D. raddei nairensis (> 62.8% and > 63.1%). The total number of GO in the parthenogenetic sample exceeds D. valentini and D. raddei nairensis on 58.9% and 60% respectively (Table 3). The final TransDecoder stats are presented in Table 4. The overall number of ORFs in D. unisexualis was 45.7% more than in bisexual parental samples, according to the TransDecoder results.. The analysis of common and unique genes on Venn diagrams (Fig. 3A, B) displays that D. unisexualis has more unique genes in BLASTP (> 221.6% and > 250%) and EggNOG (> 228.7% and > 281.6%) than D. valentini and D. nairensis respectively (Supplementary 4). The antimicrobial peptides have been searched in D. valentini, D. nairensis, and D. unisexualis with 59, 81, and 70 possible matches respectively. These sequences were found in 29, 34, and 36 transcripts. The antimicrobial peptides detected have antibacterial activity against Gram+, Gram- as well as against fungi (Supplementary 5). The raw RNA sequence reads for each lizard are available in the NCBI SRA database (PRJNA773939). The assembled transcripts are available in the Mendeley data (doi:https://doi.org/10.17632/rtd8cx7zc3.1).
Table 1

Statistics of the RNA-seq generated from three lizards.

SpeciesTotal readsTotal basesQ20 basesaQ30 basesbGC content
D. unisexualis58.932755 M17,313222 G97.98%94.34%48.05%
D. valentini51.634041 M15,593480 G97.65%93.35%46.59%
D. r. nairensis62.788216 M18,962041 G97.13%92.33%46.77%

Q20 - ratio of bases with probability of containing no more than one error in 100 bases.

Q30 - ratio of bases with probability of containing no more than one error in 1,000 bases.

Table 2

Summary characteristics of transcriptome sequence assembly of all three samples data.

D. raddei nairensisD. valentiniD. unisexualis
# contigs (>= 0 bp)122746126141228862
# contigs (>= 1000 bp)393983914557984
# contigs (>= 5000 bp)351735332447
# contigs (>= 10000 bp)1341188
# contigs (>= 25000 bp)000
# contigs (>= 50000 bp)000
Total length (>= 0 bp)139903614140424203204321108
Total length (>= 1000 bp)105256962104453618138481341
Total length (>= 5000 bp)229372152285198914738047
Total length (>= 10000 bp)1524044131507390998
Total length (>= 25000 bp)000
Total length (>= 50000 bp)000
# contigs618366212395141
Largest contig161871417012181
Total length121008445120525010164367910
GC (%)46,2346,1946,06
N50282728012409
N75156715581378
L50136771371622778
L75278872795145070
# N's per 100 kbp0.00.00.0
Fig. 1

Structural characteristics of three transcriptomes. A number of annotated proteins (A), and TRINITY contigs, and transcripts (B).

Fig. 2

Distribution of the top 10 GO terms for each of the three species.

Table 3

Summary of Trinotate/Gene Ontologies.

SpeciesTotal transcripts with GOTotal transcripts with only one GOTotal transcripts with multiple GOTotal GO in the fileTotal unique GO in the file
D. valentini3884412413760355382716934
D. nairensis3875611953756155018916885
D. unisexualis6321917966142388020017771
Table 4

Open Reading Frames (ORFs) prediction numbers using TransDecoder.

SpeciesTotalComplete5-prime partial3-prime partialInternal
D. valentini558163433312663313322136
D. nairensis55808344991289330265327
D. unisexualis813444340822136539010473
Fig. 3

Venn diagrams showing overlapping hits for three species (A) and overlapping of EggNOG genes (B).

Statistics of the RNA-seq generated from three lizards. Q20 - ratio of bases with probability of containing no more than one error in 100 bases. Q30 - ratio of bases with probability of containing no more than one error in 1,000 bases. Summary characteristics of transcriptome sequence assembly of all three samples data. Structural characteristics of three transcriptomes. A number of annotated proteins (A), and TRINITY contigs, and transcripts (B). Distribution of the top 10 GO terms for each of the three species. Summary of Trinotate/Gene Ontologies. Open Reading Frames (ORFs) prediction numbers using TransDecoder. Venn diagrams showing overlapping hits for three species (A) and overlapping of EggNOG genes (B).

Experimental Design, Materials and Methods

Species sampling and tissues collection

Samples of D. valentini, D. r. nairensis, and D. unisexualis for transcriptome analysis were collected in Armenia in 2019, outside of the protected areas. Several adult lizards of female D. unisexualis from the Hrazdan population (40.503493 N 44.748097 E), females D. r. nairensis from the Vahramaberd population (40.844394 N, 43.755720 E), and females D. valentini from the Sepasar population (41.027492 N, 43.816634 E) were used to surgically extract ovary. Before dissecting out the organs, the animals were subjected to chloroform euthanasia. All tissue samples were stored in RNAlater® reagent at −20°C according to the manufacturer's recommended protocol (Qiagen Inc.) until they were shipped to Macrogen Inc. (Korea) for RNA extraction and further transcriptome preparation.

RNA sequencing and raw data quality control

Total RNA was isolated from an organ/tissue using standard Trizol Tissue RNA Extraction protocol (Standard protocol for QIAzol Lysis Reagent, Qiagen). RNA RIN scores ranged from 6.4 to 6.7. Ovary RNA from three individuals of each species was pooled together and used to prepare the three cDNA libraries: D. unisexualis, D. r. nairensis, D. valentini. Inside the procedure was a cleanup on a carrier with polyT and random primers from TruSeq Stranded mRNA kit were used for preparation cDNA. The paired-end sequencing libraries were prepared by random fragmentation of the cDNA samples into 350-500 bp fragments, followed by 5′ and 3′ adapter ligation using TruSeq RNA Sample Prep Kit v2 (Illumina Inc.) according to TruSeq RNA Sample Preparation Guide (Version 2, Part #15026495 Rev.F). Sequencing of transcriptome libraries was performed on Illumina HiSeq2500 with a mean read length of 101 bp. The Illumina Hiseq generated raw sequencing data utilizing HiSeq Control Software v2.2 for system control and base calling through an integrated primary analysis software. The BCL (base calls) binaries were converted into FASTQ by the Illumina package bcl2fastq (v1.8.4) [1] (RRID:SCR_015058). Raw transcriptome data were trimmed by Trimmomatic v0.39 [2] to remove adapters. Optical duplicates from reads were removed by the rmdup tool [3]. Raw transcriptomes contained 58932755, 51634041, and 62788216 reads for D. unisexualis, D. valentini and D. r. nairensis with GC content of 48.05%, 46.59%, and 46.77% respectively. Filtered reads quality was estimated by FastQC v0.11.9 [4] and became prepared for assembling.

Transcriptome annotation and assembly

Reads obtained after trimming and quality estimating by FastQC and Seq2fun pipeline [5] were assembled using Trinity v2.1.1 [6]. Transcriptome assembly with Trinity can be divided into several parts: searching and calculating k-mers, assembling contigs from k-mers, clustering contigs into components. For Trinity assembler, the default parameters were taken, where the minimum contig length value was 200, k-mer size was 25. TransDecoder v5.5.0 [7] program was used to predict translated proteins and ORFs (open reading frames) from assembled transcripts with at least 100 amino acids length. NCBI-blast-2.9.0+ [8] was used for homology search and protein domain identification on TransDecoder predicted proteins with such parameters as e-value < 1e-5 and percentage of similarity > 95%. The total number of bases, reads, GC (%), Q20 (%), and Q30 (%) were calculated for the three samples (Table 1). Obtained protein sequences from TransDecoder were cross-referenced with the Gene Ontology (GO) [9,10] database using the EggNog v2.0.1 [11] tool. This tool provides functional information in the context of structure, molecular functions, the biological process of query sequences, search matches, and performs them as GO terms. Top GO terms were determined and visualized using the Trinotate package in the TransPi [12] pipeline. PFAM [13] and BLASTP searches were also performed by the TransPi pipeline with OnlyAnn (only annotation) option. This mode used such databases as Swissprot, Uniprot custom database (available under request), and Pfam.

AMPs identification

To identify antimicrobial peptides (AMP) in the transcriptome, we blasted the assembled transcripts against the known AMPs from the DRAMP 3.0 database (Data Repository of Antimicrobial Peptides) [14] using BLAST-2.2.26+ [15] with the similarity cutoff of 70%.

Ethics Statement

All individuals were hand-caught; alive-animal handling procedures were approved by Yerevan State University according to the ethical guidelines, capture permit Code 5/22.1/51043 was issued by the Ministry of Nature Protection of the Republic of Armenia for scientific studies. The study was approved by the Ethics Committee of the Moscow State University (Permit Number: 24–01) and conducted strictly according to ethical principles and scientific standards.

CRediT authorship contribution statement

Sergei S. Ryakhovsky: Formal analysis, Writing – original draft, Writing – review & editing. Victoria A. Dikaya: Formal analysis. Vitaly I. Korchagin: Data curation. Andrey A. Vergun: Writing – original draft, Writing – review & editing, Data curation, Methodology. Lavrentii G. Danilov: Formal analysis, Writing – original draft, Writing – review & editing. Sofia D. Ochkalova: . Anastasiya E. Girnyk: Data curation. Daria V. Zhernakova: Conceptualization, Visualization. Marine S. Arakelyan: Methodology, Supervision. Vladimir B. Brukhin: Writing – original draft, Writing – review & editing. Aleksey S. Komissarov: Project administration, Conceptualization, Visualization, Data curation, Writing – original draft, Writing – review & editing, Supervision. Alexey P. Ryskov: Project administration, Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

All authors have read and approved the final manuscript. Consent for publication: Not applicable. The authors declare that they have no competing interests.
SubjectBiology
Specific subject areaTranscriptomics
Type of dataTranscriptome assemblies, raw sequences
How data were acquiredOvary RNA from three lizard species were isolated and used for sequencing by the Macrogen Inc. (Korea)
Data formatAnalyzed, Raw
Parameters for data collectionData collection contains raw transcriptome data for ovary tissues of three lizard species: unisexual (parthenogenetic) D. unisexualis and parental bisexual D. valentini and D. raddei nairensis
Description of data collectionData collection includes total Illumina HiSeq2500 generated transcriptome reads, transcripts, TRINITY contigs, predicted proteins, and ORFs.
Data source locationAll lizards were collected from Armenia populations. D. unisexualis from the Hrazdan population (40.503493 N 44.748097 E)D. r. nairensis from Vahramaberd population (40.844394 N, 43.755720 E)D. valentini from Sepasar population (41.027492 N, 43.816634 E)
Data accessibilityRaw data - BioProject PRJNA773939 in NCBI SRA database.Trinity assemblies - doi:https://doi.org/10.17632/rtd8cx7zc3.1 in Mendeley Data
  10 in total

1.  Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors:  M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal:  Nat Genet       Date:  2000-05       Impact factor: 38.330

2.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

Review 3.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

4.  eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.

Authors:  Jaime Huerta-Cepas; Damian Szklarczyk; Davide Heller; Ana Hernández-Plaza; Sofia K Forslund; Helen Cook; Daniel R Mende; Ivica Letunic; Thomas Rattei; Lars J Jensen; Christian von Mering; Peer Bork
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

5.  The Gene Ontology resource: enriching a GOld mine.

Authors: 
Journal:  Nucleic Acids Res       Date:  2021-01-08       Impact factor: 16.971

6.  Ultrafast functional profiling of RNA-seq data for nonmodel organisms.

Authors:  Peng Liu; Jessica Ewald; Jose Hector Galvez; Jessica Head; Doug Crump; Guillaume Bourque; Niladri Basu; Jianguo Xia
Journal:  Genome Res       Date:  2021-03-17       Impact factor: 9.043

7.  DRAMP 3.0: an enhanced comprehensive data repository of antimicrobial peptides.

Authors:  Guobang Shi; Xinyue Kang; Fanyi Dong; Yanchao Liu; Ning Zhu; Yuxuan Hu; Hanmei Xu; Xingzhen Lao; Heng Zheng
Journal:  Nucleic Acids Res       Date:  2022-01-07       Impact factor: 16.971

8.  Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors:  Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal:  Nat Biotechnol       Date:  2011-05-15       Impact factor: 54.908

9.  Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors:  Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal:  Bioinformatics       Date:  2014-04-01       Impact factor: 6.937

10.  Pfam: The protein families database in 2021.

Authors:  Jaina Mistry; Sara Chuguransky; Lowri Williams; Matloob Qureshi; Gustavo A Salazar; Erik L L Sonnhammer; Silvio C E Tosatto; Lisanna Paladin; Shriya Raj; Lorna J Richardson; Robert D Finn; Alex Bateman
Journal:  Nucleic Acids Res       Date:  2021-01-08       Impact factor: 16.971

  10 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.