
SAGD: a comprehensive sex-associated gene database from transcriptomes.

Meng-Wei Shi1, Na-An Zhang1, Chuan-Ping Shi1, Chun-Jie Liu2, Zhi-Hui Luo1, Dan-Yang Wang1, An-Yuan Guo2, Zhen-Xia Chen1.   

Abstract

Many animal species present sex differences. Sex-associated genes (SAGs), which have female-biased or male-biased expression, have major influences on the remarkable sex differences in important traits such as growth, reproduction, disease resistance and behaviors. However, the SAGs resulting in the vast majority of phenotypic sex differences are still unknown. To provide a useful resource for the functional study of SAGs, we manually curated public RNA-seq datasets with paired female and male biological replicates from the same condition and systematically re-analyzed the datasets using standardized methods. We identified 27,793 female-biased SAGs and 64,043 male-biased SAGs from 2,828 samples of 21 species, including human, chimpanzee, macaque, mouse, rat, cow, horse, chicken, zebrafish, seven fly species and five worm species. All these data were cataloged into SAGD, a user-friendly database of SAGs (http://bioinfo.life.hust.edu.cn/SAGD) where users can browse SAGs by gene, species, drug and dataset. In SAGD, the expression, annotation, targeting drugs, homologs, ontology and related RNA-seq datasets of SAGs are provided to help researchers to explore their functions and potential applications in agriculture and human health.


Year:  2019        PMID: 30380119      PMCID: PMC6323940          DOI: 10.1093/nar/gky1040

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Sexually reproducing animals usually show remarkable differences between females and males in morphological, physiological and behavioral phenotypes (1,2). Such differences are caused by a large number of sex-associated genes (SAGs), whose expression varies between females and males (3,4). The study of SAGs is important not only for understanding gene regulation and evolution, but also for applications in animal reproduction and pest control (3–7). Moreover, increasing evidence indicates that SAGs are a key factor affecting the risk of developing a wide range of diseases, including neurodegenerative diseases, cardiovascular diseases and cancers, and they have been linked to precision medicine (8–10). With the advent of RNA-seq technologies, it has become possible to accurately quantify expression differences between males and females on a genome-wide scale. Numerous RNA-seq studies have been performed to identify SAGs, and the results revealed that a large fraction of genes are SAGs (11–14). Depending on the samples and on the experimental and statistical methods used, up to 95% of genes may be identified as SAGs (12,15–17). However, a comprehensive database characterizing the SAGs derived from RNA-seq data of sequenced animal genomes through a single pipeline is still lacking. To date, there has been only one database of SAGs, Sebida (18), which collected SAGs from microarray data of three insect species (Drosophila melanogaster, Drosophila simulans and Anopheles gambiae). It was established in 2006 and has not been updated in recent years. Some central repositories of gene expression (e.g. Expression Atlas (19), GEO Profiles (20)) cover comprehensive expression profiles, including those from RNA-seq datasets with sex variables. However, these repositories are not designed specifically for comparing expression between paired female and male biological replicates under the same condition, and are therefore not well suited to SAG studies.
To make gene expression comparisons between sexes across species possible, we present SAGD (sex-associated gene database), which integrates data from 2,828 RNA-seq samples to compare male versus female gene expression in 21 sequenced genomes. Users can compare the expression changes of SAGs across species, tissues, and developmental stages, and can screen for candidate genes. This database will be a valuable resource for researchers and clinicians who study the function of SAGs or apply that knowledge in practice.

MATERIALS AND METHODS

Extraction of metadata information from public resources of RNA-seq samples

We extracted metadata of RNA-seq samples by integrating three databases: Expression Atlas (https://www.ebi.ac.uk/gxa/) (19), the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra/), and the NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) (20). We first extracted metadata from Expression Atlas (19), whose sample information is well curated. We downloaded the gzipped tar archive of all Expression Atlas analysis results (4 April 2018) and extracted the condensed-sdrf.tsv files for all the assays. Sample feature information, including organism, organism part, sex and age, was organized into a matrix with species, tissue, sex, and stage as columns. We then extracted metadata from SRA, where sample attributes are described in free text that is difficult to parse. We queried SRA projects with ‘sex’ and ‘gonad’, and curated the species, tissue, sex, and stage features from the various attributes of SRA samples. The curated information was organized into a second matrix. As a complement, we queried ‘sex’ for series in GEO (20) on 14 July 2018, and refined the 2,545 search results with the study type ‘Expression profiling by high throughput sequencing’. We examined the series and manually curated the feature values of the series samples into a third matrix.
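
The Expression Atlas step above can be illustrated with a short Python sketch. It is not the authors' code. It assumes that each row of a condensed-sdrf.tsv file is tab-separated with the fields accession, array design, sample, type (characteristic or factor), property and value; the accession and run names in the example are made up.

```python
import csv
import io

# Mapping from SDRF property names to the matrix columns named in the text.
FEATURES = {"organism": "species", "organism part": "tissue",
            "sex": "sex", "age": "stage"}


def sdrf_to_matrix(handle):
    """Collapse condensed-SDRF rows into one record per (accession, sample).

    Assumed row layout (no header line):
    accession, array design, sample, type, property, value[, ontology URI]
    """
    samples = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip empty or malformed lines
        accession, _array, sample, _kind, prop, value = row[:6]
        column = FEATURES.get(prop.strip().lower())
        if column is None:
            continue  # property not needed for the matrix
        record = samples.setdefault(
            (accession, sample),
            {"project": accession, "sample": sample,
             "species": None, "tissue": None, "sex": None, "stage": None},
        )
        # Characteristics and factors often repeat; keep the first value seen.
        if record[column] is None:
            record[column] = value.strip()
    return list(samples.values())


# Hypothetical input: two runs from a made-up experiment accession.
EXAMPLE = (
    "E-XXXX-0000\t\tRUN1\tcharacteristic\torganism\tHomo sapiens\n"
    "E-XXXX-0000\t\tRUN1\tcharacteristic\torganism part\tliver\n"
    "E-XXXX-0000\t\tRUN1\tcharacteristic\tsex\tfemale\n"
    "E-XXXX-0000\t\tRUN1\tfactor\tsex\tfemale\n"
    "E-XXXX-0000\t\tRUN2\tcharacteristic\torganism\tHomo sapiens\n"
    "E-XXXX-0000\t\tRUN2\tcharacteristic\torganism part\tliver\n"
    "E-XXXX-0000\t\tRUN2\tcharacteristic\tsex\tmale\n"
    "E-XXXX-0000\t\tRUN2\tcharacteristic\tage\t45 year\n"
)

matrix = sdrf_to_matrix(io.StringIO(EXAMPLE))
for record in matrix:
    print(record)
```

Free-text SRA attributes and GEO series would still need manual curation, as described above; a parser of this kind only covers the well-structured Expression Atlas files.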

Subsequent selection of RNA-seq samples

We merged the three matrices described above and unified the nomenclature of the feature values. Then, we grouped the samples by the combination of project, species, tissue, and stage. Subsequently, we extracted sample library information from SRA using the Bioconductor package SRAdb (1.42.2) (21) along with its sqlite database (modified on 8 June 2018). To select all potential RNA-Seq runs consisting of raw reads of sequenced RNAs, we retained SRA runs with ‘TRANSCRIPTOMIC’ library source, ‘ILLUMINA’ platform, and ‘RNA-Seq’ library strategy. Then, we selected RNA-Seq runs from species with sequenced genomes in Ensembl or Ensembl Metazoa (22), so that gene expression could be analyzed on the basis of annotation. Next, we retained groups with both female and male biological replicates for differential expression analysis, with sex as the major variable. When a species/tissue/stage combination contained more than one group, we selected only the group with the most biological replicates for further analysis. Up to 20 biological replicates for each sex from each group were randomly picked. In total, 15,718 SRA runs of potential RNA-Seq data for sequenced genomes were retained for further selection.
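
The selection rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the record field names are hypothetical, "most biological replicates" is taken to mean the largest total number of female plus male samples, and the minimum of two replicates per sex follows the Discussion.

```python
import random
from collections import defaultdict


def select_groups(runs, max_per_sex=20, min_per_sex=2, seed=0):
    """Pick one group per species/tissue/stage following the rules in the text."""
    # 1. Keep runs that look like Illumina RNA-Seq of transcriptomic libraries.
    kept = [r for r in runs
            if r["library_source"] == "TRANSCRIPTOMIC"
            and r["platform"] == "ILLUMINA"
            and r["library_strategy"] == "RNA-Seq"]

    # 2. Group samples by project/species/tissue/stage and split them by sex.
    groups = defaultdict(lambda: {"female": set(), "male": set()})
    for r in kept:
        if r["sex"] in ("female", "male"):
            key = (r["project"], r["species"], r["tissue"], r["stage"])
            groups[key][r["sex"]].add(r["sample"])

    # 3. Require replicates of both sexes; 4. keep the largest group per condition.
    best = {}
    for (project, species, tissue, stage), sexes in groups.items():
        if min(len(sexes["female"]), len(sexes["male"])) < min_per_sex:
            continue
        condition = (species, tissue, stage)
        size = len(sexes["female"]) + len(sexes["male"])
        if condition not in best or size > best[condition][0]:
            best[condition] = (size, project, sexes)

    # 5. Randomly cap each sex at max_per_sex samples.
    rng = random.Random(seed)
    selected = {}
    for condition, (_size, project, sexes) in best.items():
        entry = {"project": project}
        for sex in ("female", "male"):
            samples = sorted(sexes[sex])
            if len(samples) > max_per_sex:
                samples = sorted(rng.sample(samples, max_per_sex))
            entry[sex] = samples
        selected[condition] = entry
    return selected


def make_run(project, sample, sex, tissue="liver", strategy="RNA-Seq"):
    """Build a hypothetical run record for the example below."""
    return {"project": project, "sample": sample, "sex": sex,
            "species": "Mus musculus", "tissue": tissue, "stage": "adult",
            "library_source": "TRANSCRIPTOMIC", "platform": "ILLUMINA",
            "library_strategy": strategy}


runs = (
    [make_run("P1", f"a{i}", "female") for i in range(2)]
    + [make_run("P1", f"b{i}", "male") for i in range(2)]
    + [make_run("P2", f"c{i}", "female") for i in range(3)]
    + [make_run("P2", f"d{i}", "male") for i in range(3)]
    # Brain group has no males, so it is dropped.
    + [make_run("P3", f"e{i}", "female", tissue="brain") for i in range(4)]
    # Not RNA-Seq, so these runs are filtered out in step 1.
    + [make_run("P4", f"f{i}", "male", strategy="WGS") for i in range(9)]
)
selected = select_groups(runs)
print(selected)
```

In this toy input, the two liver groups compete and the larger one (P2) is kept, while the brain group is discarded because it lacks male replicates.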

Analysis of RNA-seq data

All selected groups of raw RNA-seq datasets were processed through the same pipeline (Figure 1). The raw RNA-seq data were downloaded from SRA and mapped to the corresponding reference genomes by HISAT2 (version 2.0.5) (23), guided by gene annotation from Ensembl (release 92) or Ensembl Metazoa (release 40) (22). HTSeq-count (version 0.9.1) (24) was used to count the reads uniquely aligned to each gene, so that no read would be assigned to several paralogs. Read counts were merged by sample so that technical replicates were integrated. We then normalized read counts and identified differentially expressed genes between female and male samples in each group by DESeq2 (version 1.20.0) (25). We also converted read counts into FPKM values (Fragments Per Kilobase of transcript per Million mapped reads). We defined genes with padj < 0.05 and |log2(M/F ratio)| ≥ 2 in each group as SAGs.
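
As a minimal sketch of the SAG definition above, the following Python functions apply the stated thresholds to one gene and compute FPKM with its standard formula. The function and parameter names are illustrative assumptions; the sketch also assumes a positive log2(M/F ratio) means higher expression in males, and it treats a missing padj (DESeq2 reports NA for genes removed by independent filtering) as not significant.

```python
import math


def classify_sag(log2_mf, padj, min_abs_log2=2.0, max_padj=0.05):
    """Label one gene as 'male-biased', 'female-biased' or 'unbiased'.

    log2_mf is log2(male/female expression); padj is the adjusted p-value.
    """
    if padj is None or math.isnan(padj) or padj >= max_padj:
        return "unbiased"  # not significant, or padj not available
    if log2_mf >= min_abs_log2:
        return "male-biased"
    if log2_mf <= -min_abs_log2:
        return "female-biased"
    return "unbiased"  # significant, but the fold change is too small


def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical genes: (name, log2(M/F ratio), padj)
genes = [("geneA", 3.1, 0.001), ("geneB", -2.0, 0.04),
         ("geneC", 4.0, 0.20), ("geneD", 1.2, 0.0001)]
for name, log2_mf, padj in genes:
    print(name, classify_sag(log2_mf, padj))
```

Exposing the two thresholds as parameters mirrors the adjustable sex-bias and significance filters that the database offers to users.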
Figure 1.

Overall design of SAGD. SAGD curated metadata information from Expression Atlas, SRA and GEO, and selected groups of RNA-seq datasets with female and male biological replicates with the same project/species/tissue/stage combination. All RNA-seq raw data was processed using a standard pipeline. SAGD includes ‘Browse’, ‘Search’, ‘Download’ and ‘Submission’.

For quality control, we measured the replicability among biological replicates. We defined a standard sample for each sex/project/species/tissue/stage combination as the median value of normalized read counts of samples in the combination (15), and calculated the Spearman correlation coefficients of normalized read counts between each sample and its corresponding standard sample. Samples with a correlation over 0.8 were defined as qualified (15).
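
The quality-control rule above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the pipeline's code: the input is a hypothetical mapping from sample name to a list of normalized read counts (one value per gene, same gene order in every sample) for one sex within one group, and ties are handled with average ranks.

```python
from statistics import median


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def quality_control(counts, threshold=0.8):
    """Correlate each sample with the per-gene median 'standard sample'."""
    names = list(counts)
    n_genes = len(counts[names[0]])
    standard = [median(counts[s][g] for s in names) for g in range(n_genes)]
    rho = {s: spearman(counts[s], standard) for s in names}
    qualified = [s for s in names if rho[s] > threshold]
    return rho, qualified


# Hypothetical normalized counts for three female replicates and five genes.
counts = {
    "F1": [10, 20, 30, 40, 50],
    "F2": [12, 18, 33, 41, 55],
    "F3": [50, 40, 30, 20, 10],  # reversed profile, should fail QC
}
rho, qualified = quality_control(counts)
print(rho, qualified)
```

Because the standard sample is a per-gene median, a single outlying replicate such as F3 has little influence on it and is flagged by its own low correlation.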

Database implementation

The SAGD database was built with the open-source Flask framework (http://flask.pocoo.org/). All data were integrated into MongoDB (version 3.2.11). The web interface was designed and implemented using AngularJS (version 1.6.9), supplemented with several AngularJS and JavaScript libraries to make the interface more usable. The website was tested with several popular web browsers; Google Chrome is recommended.

RESULTS

Data summary

In total, we identified 27,793 female-biased SAGs and 64,043 male-biased SAGs in 21 species by curating high-throughput datasets of 2,828 samples from 38 projects (Table 1). There were more male-biased genes than female-biased genes in 117 of the 150 groups (Table 1, Supplementary Table S1). In XX/XY sex chromosome systems, 121 of the 142 groups showed a higher percentage of female-biased genes than male-biased genes on the X chromosome (Supplementary Table S1). Among SAGs, there were 4,871 female-biased human SAGs (8.3% of human genes) and 17,223 male-biased human SAGs (29.5% of human genes) derived from 1,800 samples covering 4 developmental stages and 60 tissues in 16 projects (Table 1). By combining the drug information, we found that 1,126 SAGs were drug targets and thus might be associated with sex differences in drug response.
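Table 1.

Statistics of RNA-seq datasets and SAGs in each species

| Species | Project | Sample | Tissue | Stage | Group | SAG_F (#) | SAG_F (%) | SAG_M (#) | SAG_M (%) |
|---|---|---|---|---|---|---|---|---|---|
| Bos taurus | 1 | 39 | 4 | 1 | 4 | 45 | 0.2 | 43 | 0.2 |
| Caenorhabditis brenneri | 1 | 6 | 1 | 1 | 1 | 3,583 | 10.8 | 5,465 | 16.4 |
| Caenorhabditis elegans | 2 | 28 | 2 | 2 | 3 | 1,249 | 2.7 | 3,408 | 7.3 |
| Caenorhabditis japonica | 1 | 6 | 1 | 1 | 1 | 1,855 | 5.7 | 3,384 | 10.4 |
| Caenorhabditis remanei | 1 | 6 | 1 | 1 | 1 | 2,799 | 8.5 | 4,485 | 13.6 |
| Danio rerio | 1 | 4 | 1 | 1 | 1 | 3 | 0.0 | 14 | 0.0 |
| Drosophila ananassae | 1 | 4 | 1 | 1 | 1 | 295 | 1.9 | 2,406 | 15.2 |
| Drosophila melanogaster | 3 | 160 | 9 | 1 | 9 | 2,906 | 16.4 | 7,425 | 41.9 |
| Drosophila mojavensis | 1 | 4 | 1 | 1 | 1 | 1,001 | 6.8 | 2,433 | 16.6 |
| Drosophila pseudoobscura | 2 | 12 | 3 | 1 | 3 | 2,991 | 17.6 | 5,866 | 34.6 |
| Drosophila simulans | 2 | 11 | 2 | 1 | 2 | 930 | 6.0 | 2,800 | 18.2 |
| Drosophila virilis | 1 | 4 | 1 | 1 | 1 | 417 | 2.8 | 2,370 | 15.7 |
| Drosophila yakuba | 1 | 4 | 1 | 1 | 1 | 595 | 3.7 | 2,526 | 15.5 |
| Equus caballus | 1 | 24 | 2 | 1 | 2 | 10 | 0.0 | 20 | 0.1 |
| Gallus gallus | 1 | 87 | 9 | 1 | 9 | 2,349 | 9.4 | 469 | 1.9 |
| Homo sapiens | 16 | 1,800 | 60 | 4 | 66 | 4,871 | 8.3 | 17,223 | 29.5 |
| Macaca mulatta | 1 | 12 | 1 | 1 | 1 | 26 | 0.1 | 131 | 0.4 |
| Mus musculus | 7 | 276 | 20 | 3 | 20 | 606 | 1.1 | 1,092 | 2.0 |
| Pan troglodytes | 1 | 12 | 1 | 1 | 1 | 68 | 0.2 | 125 | 0.4 |
| Pristionchus pacificus | 1 | 6 | 1 | 1 | 1 | 973 | 3.3 | 1,898 | 6.4 |
| Rattus norvegicus | 3 | 323 | 12 | 2 | 21 | 221 | 0.7 | 460 | 1.4 |
| Total | 38 | 2,828 | 93 | 6 | 150 | 27,793 | 4.6 | 64,043 | 10.6 |

Notes: Total, duplicates were removed before summing up. #, number of the SAGs. %, percent of the genes in the genome. SAG_F, female-biased genes. SAG_M, male-biased genes.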

To explore the conservation of sex bias within and among species, we compared the sex bias of SAGs in adult somatic tissues among different human groups, as well as between human groups and groups from other species. The comparison revealed that 2.4–38.9% of SAGs shared the same sex bias among different human groups (Supplementary Table S2), whereas only 0.2–9.2% of SAGs shared the same sex bias between human and other species (Supplementary Table S3). Multiple SAGs were found to be human-specific. For example, the gene LTF was female-biased in adult human liver but unbiased in other species. It was reported to affect endometriosis (26), and is the drug target of NIMESULIDE for the treatment of excessive uterine bleeding during menstruation. The low conservation of sex bias across species could result from the varied sample sizes and experimental methods among groups. Alternatively, it might suggest that sex bias depends on species, and thus researchers need to be cautious when using animal models to study sex differences in drug response.

Browse and search of the database

We designed a user-friendly webpage for the database. A quick search box is provided on the top navigation bar to search by keywords (i.e. gene symbol, Ensembl ID, tissue and stage). Users can also browse SAGs of multiple species by gene, species, dataset and drug (Figure 2A).
Figure 2.

An overview of SAGD. (A) The homepage of SAGD. (B) Browse by gene. (C) Browse by species. The species tree was plotted by TimeTree (www.timetree.org) (29) with modifications. (D) Browse by drug. (E) Browse by dataset. (F) Information of each SAG.

On the webpage of gene, users can browse and search SAGs by species, tissue and developmental stage, and can refine the results by the range of sex bias (log2(M/F ratio)) and significance (padj) (Figure 2B). For example, to browse SAGs in human liver, users only need to select ‘Human’ and ‘liver’ in the drop-down menus of ‘Species’ and ‘Tissue’ on the top left, and click the ‘search’ button on the top right. The search results are shown in a table that contains padj, FPKM of each sex, and log2(M/F ratio) for each gene (Figure 2B). Users can start a new search after clicking the ‘clear’ button (Figure 2B).
On the webpage of species, users can view the phylogeny of the 21 species covered by SAGD, and can browse SAGs in each dataset of the selected species (Figure 2C). The phylogeny is presented as a species tree with a time scale. Users can select a species of interest and browse all its groups (Figure 2C). Group information contains project, tissue, stage, SAG number and the top 3 significant genes. For visualization, we colored the groups based on the log2(M/F ratio) of the most significant gene. Users can browse all the genes in a group of interest via the links of SAGD ID and find the corresponding group datasets (Figure 2C).
On the webpage of drug, users can browse and search the SAG-targeting drugs by the keywords Gene ID, DrugBank ID, Drug Name and Drug Type (Figure 2D).
On the webpage of dataset, users can browse all the RNA-seq datasets, and search by the keywords SRA accession, species, tissue, stage and sex to find datasets of interest (Figure 2E).
All four browse methods lead users to gene information pages, on which we integrated basic gene information from Ensembl BioMarts (27), expression comparison information from our RNA-seq data analysis, and drug target information from DrugBank (28). Sex-biased gene expression across groups is shown in bubble plots (Figure 2F).

Downloads

All the search results can be downloaded as CSV files for customized analysis by clicking the Download button on the top right of almost all pages. Alternatively, SAGD offers users the RNA-seq data analysis results of each group in CSV files on the Download page.

Data submission

Users can submit relevant data by sending us a data information table via email. Currently, SAGD only accepts open access RNA-seq data from SRA for species with reference genomes and annotations from Ensembl or Ensembl Metazoa. The submitted data will be added to SAGD after curation and analysis as described in the Materials and Methods section.

DISCUSSION

SAGD aims to provide users with a comprehensive resource for SAGs by curating available high-quality raw RNA-seq datasets and processing them through the same pipeline. Multiple efforts were made to ensure the validity of this database. (i) We curated metadata, including project, species, sex, tissue and developmental stage, for the datasets from multiple sources, and inspected it manually to ensure correctness and comprehensiveness. (ii) We only compared datasets from the same project, species, tissue and developmental stage for SAG identification, so that sex would be the major variable. (iii) For each condition, we selected the group with the most biological replicates, required at least two biological replicates, and performed quality control to ensure good replicability among biological replicates. (iv) We provided customizable, rather than fixed, filters for sex bias and statistical significance level so that users can define their own SAGs. However, the determination of SAGs is a complicated issue that rests on a vast number of assumptions and hypotheses. Whether a gene appears sex-biased or unbiased in this database, users should examine the level of evidence carefully.

SUMMARY AND FUTURE DIRECTIONS

With the rapid accumulation of RNA-seq data, it is worthwhile to explore the function of SAGs by curating RNA-seq data from multiple species. SAGD helps users explore SAGs of interest across projects, species, tissues and stages through customized browsing options. The comparative analysis of SAGs within and across species requires comparable group pairs under the same condition. To support such analysis, we will update SAGD regularly by adding more SAGs when additional RNA-seq datasets and reference genomes become available. SAGD will also provide more experimentally supported data as a solid resource for studies of sex differences and comparative genomics.
  29 in total

Review 1.  The evolution of sex-biased genes and sex-biased gene expression.

Authors:  Hans Ellegren; John Parsch
Journal:  Nat Rev Genet       Date:  2007-08-07       Impact factor: 53.242

Review 2.  The evolutionary causes and consequences of sex-biased gene expression.

Authors:  John Parsch; Hans Ellegren
Journal:  Nat Rev Genet       Date:  2013-02       Impact factor: 53.242

3.  Decreased lactoferrin levels in peritoneal fluid of women with minimal endometriosis.

Authors:  Grzegorz Polak; Iwona Wertel; Rafał Tarkowski; Dorota Morawska; Jan Kotarski
Journal:  Eur J Obstet Gynecol Reprod Biol       Date:  2006-04-27       Impact factor: 2.435

4.  Sebida: a database for the functional and evolutionary analysis of genes with sex-biased expression.

Authors:  Florian Gnad; John Parsch
Journal:  Bioinformatics       Date:  2006-07-31       Impact factor: 6.937

5.  Sex-specific and lineage-specific alternative splicing in primates.

Authors:  Ran Blekhman; John C Marioni; Paul Zumbo; Matthew Stephens; Yoav Gilad
Journal:  Genome Res       Date:  2009-12-15       Impact factor: 9.043

6.  NCBI GEO: archive for functional genomics data sets--update.

Authors:  Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva
Journal:  Nucleic Acids Res       Date:  2012-11-27       Impact factor: 16.971

7.  Ensembl BioMarts: a hub for data retrieval across taxonomic space.

Authors:  Rhoda J Kinsella; Andreas Kähäri; Syed Haider; Jorge Zamora; Glenn Proctor; Giulietta Spudich; Jeff Almeida-King; Daniel Staines; Paul Derwent; Arnaud Kerhornou; Paul Kersey; Paul Flicek
Journal:  Database (Oxford)       Date:  2011-07-23       Impact factor: 3.451

8.  Disentangling the relationship between sex-biased gene expression and X-linkage.

Authors:  Richard P Meisel; John H Malone; Andrew G Clark
Journal:  Genome Res       Date:  2012-04-12       Impact factor: 9.043

9.  Sex-biased transcriptome evolution in Drosophila.

Authors:  Raquel Assis; Qi Zhou; Doris Bachtrog
Journal:  Genome Biol Evol       Date:  2012       Impact factor: 3.416

10.  SRAdb: query and use public next-generation sequencing data from within R.

Authors:  Yuelin Zhu; Robert M Stephens; Paul S Meltzer; Sean R Davis
Journal:  BMC Bioinformatics       Date:  2013-01-17       Impact factor: 3.169

