Genome-wide analysis of positively selected genes in seasonal and non-seasonal breeding species.

Yuhuan Meng¹, Wenlu Zhang¹, Jinghui Zhou¹, Mingyu Liu¹, Junhui Chen¹, Shuai Tian¹, Min Zhuo¹, Yu Zhang², Yang Zhong³, Hongli Du¹, Xiaoning Wang⁴.

Abstract

Some mammals breed throughout the year, while others breed only at certain times of year. These differences in reproductive behavior can be explained by evolution. We identified positively-selected genes in two sets of species with different degrees of relatedness including seasonal and non-seasonal breeding species, using branch-site models. After stringent filtering by sum of pairs scoring, we revealed that more genes underwent positive selection in seasonal compared with non-seasonal breeding species. Positively-selected genes were verified by cDNA mapping of the positive sites with the corresponding cDNA sequences. The design of the evolutionary analysis can effectively lower the false-positive rate and thus identify valid positive genes. Validated, positively-selected genes, including CGA, DNAH1, INVS, and CD151, were related to reproductive behaviors such as spermatogenesis and cell proliferation in non-seasonal breeding species. Genes in seasonal breeding species, including THRAP3, TH1L, and CMTM6, may be related to the evolution of sperm and the circadian rhythm system. Identification of these positively-selected genes might help to identify the molecular mechanisms underlying seasonal and non-seasonal reproductive behaviors.

Year: 2015 PMID: 26000771 PMCID: PMC4441472 DOI: 10.1371/journal.pone.0126736

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The environment can influence gene evolution and thus animal behaviors, including reproduction-related behaviors. Some mammals can breed throughout the year, while others only breed successfully at certain times of year. Such animals are defined as non-seasonal and seasonal breeding species, respectively. Day length, temperature, and food supply can all influence the reproductive behavior of seasonal breeding species and subsequent survival of offspring [1]; if they breed too early, the growing offspring may be exposed to low temperatures and scarce resources, whereas late breeding limits the time available for reproductive behaviors and preparation for the following winter. Accurate timing is therefore an essential component of life-history strategies for organisms living in seasonal environments [2]. The different reproductive behaviors of seasonal and non-seasonal breeding species may result from natural selection pressures [3]. Both strategies help the respective species to survive by adapting their breeding behaviors to the environment over their long evolutionary histories. Genome-wide analysis of genes that are positively selected in mammal lineages using the respective breeding strategies may help us to understand the mechanisms responsible for the divergent reproductive behaviors as a result of adaptive evolution. Positive Darwinian selection on protein-coding genes is a major driving force of adaptive evolution and species diversification. The modified version of the branch-site test (Model A) [4, 5] was designed to detect localized episodic bouts of positive selection that affect only a few amino acid residues in particular lineages. This test has been shown to be a reasonably powerful tool, and has been widely used to investigate the adaptive evolution of genes in many species [6-8]. However, alignment errors may influence the results of branch-site gene analysis in mammalian and vertebrate species.
It is therefore necessary to use reliable alignment methods to reduce the incidence of false-positive results [9]. Although the aligner software PRANK [10, 11] cannot eliminate false-positive results, it is nonetheless more powerful than other aligners [9, 12] such as MUSCLE [13] and ClustalW [14]. In addition to misalignments in multiple sequences, other factors such as sequence errors, misassembly, and annotation mistakes also increase the incidence of falsely-identified positive selection [15, 16]. More stringent filters are needed to ensure that branch-site analysis has a low and acceptable false positive rate. In this genome-wide study, we investigated the evolution of seasonal breeding strategies by identifying positively-selected genes in non-seasonal and seasonal breeding species using modified branch-site models. We established Distant-Species and Close-Species sets, each of which included seasonal and non-seasonal groups. We then identified positively-selected genes in these groups. PRANK (codon) software was used to align all the gene orthologs in the two gene sets. However, because PRANK generates a relatively high false-positive rate with the branch-site model, stringent filtering using sum of pairs (SP) [17, 18] scoring was used to remove potentially unreliable alignments generated by multiple sequence alignments. Sequence errors, misassembly, and annotation mistakes were also detected by cDNA mapping. Functional analysis of genes identified as positively-selected after this stringent filtering process might help us to understand the molecular mechanisms that determine non-seasonal and seasonal breeding.

Materials and Methods

Materials preparation

Five non-seasonal breeding species and five seasonal breeding species were chosen as the Distant-Species set. The five non-seasonal species included: human (Homo sapiens, GRCh37), chimpanzee (Pan troglodytes, CHIMP2.1) [19], cynomolgus monkey (or crab-eating macaque, Macaca fascicularis) [20], mouse (Mus musculus, NCBIM37) [21] and rat (Rattus norvegicus, RGSC3.4) [22]. The five seasonal breeding species were Indian rhesus monkey (Macaca mulatta, MMUL_1) [23], Chinese rhesus monkey (M. mulatta lasiota, CR) [24], dog (Canis familiaris, BROADD2) [25], horse (Equus caballus, EquCab2) [26] and rabbit (Oryctolagus cuniculus, oryCun2) [27]. The long lineages between species in the Distant-Species set mean that behaviors may have changed back and forth between seasonal and non-seasonal breeding strategies several times, while the divergent sequences might influence the branch-site model analysis and generate false positives [28]. To address this problem, we also established a Close-Species set that only included closely-related, non-seasonal (human, gorilla (Gorilla gorilla, gorGor3.1) [29], chimpanzee, and cynomolgus monkey), and seasonal-breeding species (orangutan (Pongo abelii, PPYG2) [30], Indian rhesus monkey, Chinese rhesus monkey, and marmoset (Callithrix jacchus, C_jacchus3.2.1) [31]). The protein-coding sequences for human, gorilla, chimpanzee, orangutan, Indian rhesus monkey, marmoset, mouse, rat, dog, horse, and rabbit were downloaded from the Ensembl database (version 64, Sep. 2011; http://www.ensembl.org/info/data/ftp/index.html) [32]. The sequences for cynomolgus monkey (http://climb.genomics.cn/10.5524/100003) and Chinese rhesus macaque (http://climb.genomics.cn/10.5524/100002) were provided by BGI [33]. The corresponding cDNA sequences used in the accuracy assessment were downloaded from NCBI. Detailed information on the cDNA sequences used in this study is listed in S1 Table.

Calculating positively-selected sites

To identify 1:1 gene orthologs, human protein sequences were used to conduct BLAST [34] searches against other species sequences (blastp -F T -e 1e-5 -m 8). It is difficult to select a set of transcripts to minimize alignment gaps and potential errors and thus false-positive branch-site test results [35]. In simple analyses in previous studies [6–8, 33, 36–41], the longest transcript for a given gene was chosen. Reciprocal searches were then performed for each species' protein sequences relative to human protein sequences. In each search, pairwise sequences with identities <60% were excluded, and the highest hit for each query was retained to determine the pairwise orthologs between humans and other species. Modified branch-site models [5] for adaptive evolution analysis used each species in one breeding series as the foreground species, and all the other species in that breeding series as background species. For example, to test for positive selection in humans in the Distant-Species set, the human branch was designated as the foreground branch, and the other five species in the seasonal breeding group were designated as background branches. Positive selection signals for all species were tested similarly. Protein-coding sequences associated with the corresponding 1:1 gene orthologs were aligned using PRANK (codon). The corresponding gene-based phylogenetic trees were constructed using the maximum likelihood method in the PHYLIP 3.69 [42] software package, according to the tested aligned protein-coding sequences. The aligned protein-coding sequences and the corresponding phylogenetic trees were then used to analyze the adaptive evolution using the branch-site model in PAML’s codeML program [4]. Branch-site modified model A (model = 2, NSsites = 2) and the corresponding null model (model = 2, NSsites = 2, fix_omega = 1 and omega = 1) [5] were used to identify sequences under positive selection in both test sets of animals.
Significance was calculated using the χ2 statistic, with one degree of freedom. Genes with p ≤0.01 were considered to be positively selected [5]. The p values were adjusted according to the FDR method (multiple testing correction with the method of Benjamini and Hochberg) [43] to allow for multiple testing, with a strict criterion of FDR <0.05. Positively-selected sites were obtained based on the Bayes Empirical Bayes (BEB) analysis [5], with a posterior probability >95%.
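
The significance test and multiple-testing correction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the input layout (a mapping from gene name to the alternative-model and null-model log-likelihoods reported by codeML) are assumptions, and applying both the p ≤0.01 and FDR <0.05 cut-offs together is one reading of the text.

```python
import math

def lrt_pvalue(lnl_alt, lnl_null):
    """p-value of the likelihood ratio test against a chi-square with 1 df."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For 1 degree of freedom the chi-square survival function
    # equals erfc(sqrt(x / 2)), so no external library is needed.
    return math.erfc(math.sqrt(stat / 2.0))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_genes(log_likelihoods, p_cut=0.01, fdr_cut=0.05):
    """log_likelihoods: {gene: (lnL_alternative, lnL_null)} (hypothetical layout)."""
    genes = list(log_likelihoods)
    pvals = [lrt_pvalue(*log_likelihoods[g]) for g in genes]
    fdrs = benjamini_hochberg(pvals)
    return [g for g, p, q in zip(genes, pvals, fdrs) if p <= p_cut and q < fdr_cut]
```

For example, `select_genes({"geneA": (-1000.0, -1010.0), "geneB": (-500.0, -500.1)})` keeps only the hypothetical `geneA`, whose likelihood ratio statistic of 20 is far beyond the 1-df critical value.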

Screening for valid positive sites by SP penalty scoring

To ensure the accuracy of the positive sites, extended sequences were extracted including 15 amino acids (45 base pairs) upstream and downstream from the positive sites. SP [17, 18] measurements were then performed for penalty scoring of the sequences in both streams. (1) Some of the positive sites were near the beginning or end of the gene, so that the upstream or downstream sequence was shorter than 15 amino acids; the base penalty score was therefore set separately for each stream (regarded as S, S = 15/n, where n is equal to the number of amino acids in the upstream or downstream sequence). (2) A score of 0 was assigned to each position in perfect alignment, while mismatched sites and gaps in the alignment were assigned penalty scores of −S and −2S, respectively. (3) Penalty scores for the upstream and downstream sequences were calculated separately, and the total penalty scores were the sum of the upstream and downstream scores. (4) Average penalty scores were calculated as the final scores (average penalty score = total penalty score/N, where N is the number of sequences used in each alignment). General and individual penalty scores were used. General penalty scores were equal to the sum of the penalty scores from each of the two compared species. For individual penalty scores, sequences with positive sites were compared with each of the other sequences used in the alignment in turn, and the total penalty scores were regarded as the individual penalty score. Threshold values were set for general and individual penalty scores to filter sequences with valid positive sites. In this study, the threshold values for the general and individual penalty scores were −50 and −15, respectively. If both the general and individual penalty scores were greater than their respective threshold values, the sequences were retained and the sites were regarded as valid positive sites.
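
The scoring rules above leave some details open, so the following is only a sketch under stated assumptions, not the authors' implementation. It assumes aligned protein sequences of equal length with `-` for gaps, counts the 15-residue window in alignment columns, treats a column as a gap if either sequence has one, takes the general score as the sum over all sequence pairs divided by N, and takes the individual score as the sum over comparisons of the focal sequence with every other sequence divided by N. All function names are hypothetical.

```python
from itertools import combinations

WINDOW = 15  # residues inspected on each side of the positive site

def flank_penalty(seq_a, seq_b):
    """Penalty for one flank of a pairwise comparison (0 means identical)."""
    n = len(seq_a)
    if n == 0:
        return 0.0
    s = WINDOW / n  # base penalty S = 15/n; S = 1 for a full-length flank
    penalty = 0.0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":   # assumption: a gap in either sequence
            penalty -= 2 * s
        elif a != b:               # mismatch
            penalty -= s
    return penalty

def pair_penalty(seq_a, seq_b, site):
    """Upstream plus downstream penalty around alignment column `site`."""
    up = slice(max(0, site - WINDOW), site)
    down = slice(site + 1, site + 1 + WINDOW)
    return (flank_penalty(seq_a[up], seq_b[up])
            + flank_penalty(seq_a[down], seq_b[down]))

def individual_score(alignment, focal, site):
    """Focal sequence against every other sequence, averaged over N sequences."""
    total = sum(pair_penalty(alignment[focal], seq, site)
                for name, seq in alignment.items() if name != focal)
    return total / len(alignment)

def general_score(alignment, site):
    """All pairs of sequences, averaged over N sequences (one possible reading)."""
    total = sum(pair_penalty(alignment[a], alignment[b], site)
                for a, b in combinations(alignment, 2))
    return total / len(alignment)

def passes_filter(alignment, focal, site, general_cut=-50.0, individual_cut=-15.0):
    """Keep the site only if both scores exceed the thresholds from the text."""
    return (general_score(alignment, site) > general_cut
            and individual_score(alignment, focal, site) > individual_cut)
```

Under these assumptions a fully gapped 15-residue flank costs 30 points per pairwise comparison, which is consistent with the observation that gaps contribute most to low scores.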

Accuracy of positive sites according to cDNA sequences

Mistakes can occur during genome sequencing, sequence assembly, or gene annotation, and cDNA sequences can be used as references to assess the accuracy of the positive sites. Corresponding cDNA sequences were first matched to the gene sequences using BLAST [34] (blastn -e 1e-10 -a 4 -m 8). cDNA sequences that included the positions corresponding to the positive sites were then retained. Further analysis was conducted using MEGA5 [44]. The gene sequences and their corresponding cDNA sequences were then subjected to alignment analysis using the MUSCLE [13] function. If the nucleotide sequences of the positive sites were identical to those of the corresponding positions in the cDNA sequences, the positive sites were regarded as valid.
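
The final comparison step can be illustrated with a short sketch. It assumes that the gene coding sequence and a cDNA sequence have already been aligned (for example with MUSCLE) into two equal-length strings with `-` for gaps, that the coding sequence starts in frame, and that the positive site is given as a 0-based codon index; the function name is hypothetical and this is not the authors' code.

```python
def site_is_supported(gene_aln, cdna_aln, codon_index):
    """Return True if the codon at `codon_index` of the gene sequence is
    covered by the cDNA and identical to it, base for base."""
    start = codon_index * 3
    columns = []
    pos = 0  # position in the ungapped gene coding sequence
    for col, base in enumerate(gene_aln):
        if base == "-":
            continue
        if start <= pos < start + 3:
            columns.append(col)
        pos += 1
    if len(columns) != 3:
        return False  # codon index lies outside the gene sequence
    return all(
        cdna_aln[c] != "-" and cdna_aln[c].upper() == gene_aln[c].upper()
        for c in columns
    )
```

A site that fails this check is not necessarily a false positive in the evolutionary sense; in this sketch it only flags a disagreement between the genome-derived sequence and the cDNA.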

Results

Preliminary filtering of positively-selected genes using PRANK and branch-site model

Totals of 11,031 and 13,171 1:1 gene orthologs with >60% identities were filtered from the Distant- and Close-Species sets, respectively, by BLAST [34]. The corresponding protein sequences were used for subsequent alignments. The numbers of pairwise gene orthologs between humans and other species are listed in S2 Table. After alignment using PRANK (codon), 10,918 gene orthologs in the Distant-Species set and 12,485 in the Close-Species set were tested for positive selection signals using the codeML program in the PAML package [4], with the modified branch-site model [5]. Positively-selected genes in each species with a p value <0.01 (comparing the likelihood ratio test (LRT) statistic with the χ2 distribution) and with a false-discovery rate (FDR) <5% are shown in Table 1.

Table 1

Numbers of positively-selected genes under different filtering conditions.

Class	Distant-Species	χ2 test p<0.01	Correction FDR<0.05	SP score filtered genes	Close-Species	χ2 test p<0.01	Correction FDR<0.05	SP score filtered genes
Non-seasonal	Human	88	16	4	Human	116	20	4
	Chimpanzee	207	68	27	Gorilla	274	163	34
	Cynomolgus	113	62	27	Chimpanzee	289	117	48
	Mouse	228	15	4	Cynomolgus	266	159	69
	Rat	274	43	18
	Mean	182	40.8	16	Mean	236.25	114.75	38.75
Seasonal	Indian rhesus	453	361	131	Orangutan	446	303	147
	Chinese rhesus	203	110	51	Indian rhesus	603	464	157
	Dog	499	158	54	Chinese rhesus	229	130	57
	Horse	463	157	55	Marmoset	688	314	107
	Rabbit	444	129	58
	Mean	412.4	183	69.8	Mean	491.5	302.75	117

In the Distant-Species set, the mean number of positively-selected genes (FDR <0.05) in the seasonal species was about 4.5-fold greater than in the non-seasonal species (183 vs. 40.8; Fig 1A, Table 1). The equivalent increase in the Close-Species set was about 2.6-fold (302.75 vs. 114.75; Fig 1B, Table 1). These results demonstrate that there were more positively-selected genes in seasonal compared with non-seasonal breeders in both species sets.
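
The fold differences can be recomputed directly from the FDR <0.05 columns of Table 1. The short sketch below does only that arithmetic; the dictionary layout and function names are illustrative choices, while the counts are copied from the table.

```python
# Per-species counts of positively-selected genes with FDR < 0.05 (Table 1).
fdr_counts = {
    "Distant-Species": {
        "non_seasonal": [16, 68, 62, 15, 43],      # human, chimpanzee, cynomolgus, mouse, rat
        "seasonal": [361, 110, 158, 157, 129],     # Indian rhesus, Chinese rhesus, dog, horse, rabbit
    },
    "Close-Species": {
        "non_seasonal": [20, 163, 117, 159],       # human, gorilla, chimpanzee, cynomolgus
        "seasonal": [303, 464, 130, 314],          # orangutan, Indian rhesus, Chinese rhesus, marmoset
    },
}

def mean(values):
    return sum(values) / len(values)

def seasonal_fold(counts):
    """Ratio of the seasonal mean to the non-seasonal mean."""
    return mean(counts["seasonal"]) / mean(counts["non_seasonal"])

for name, counts in fdr_counts.items():
    print(name, round(seasonal_fold(counts), 2))
```

The ratios come out near 4.5 for the Distant-Species set and near 2.6 for the Close-Species set.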

Fig 1

Numbers of positively-selected genes (FDR <0.05) and sites (after SP-score filtering).

(A). Positively-selected genes corrected by FDR. Sites (BEB >0.95) were filtered by SP scores in the Distant-Species set. (B). Positively-selected genes (FDR <0.05) and positive sites (BEB >0.95) filtered by general SP score >-50 and individual SP score >-15 in the Close-Species set.

However, there were more positively-selected genes in the Close-Species than in the Distant-Species set (mean numbers with FDR <0.05 208.75 and 111.9, respectively). In addition to the different numbers of orthologs (12,485 vs. 10,918), it is also possible that more gaps were generated by alignment in the Distant-Species gene ortholog set compared with the Close-Species set (mean gap length 244 in the Close-Species set and 322 in the Distant-Species set) (S3 Table), because the sequence divergence was smaller in the Close-Species set. The number of gaps may influence the results of branch-site analysis, because the branch-site model removes columns with gaps from the aligned sequences and would thus exclude more potential positive sites in the Distant-Species set compared with the Close-Species set.

Identification of false-positive sites through sequence misalignment

Putative positively-selected sites in the genome (FDR<0.05) were obtained by Bayes Empirical Bayes (BEB) analysis (posterior probability >95%) [5]. The numbers of putative positively-selected sites in each species are listed in Table 2. The details of all the positive sites with BEB >0.95 are listed in S4 Table.

Table 2

Positive sites after BEB and SP-score filtering.

Class	Distant-Species	Sites (BEB>0.95)	SP scores filtered sites	FPR	Close-Species	Sites (BEB>0.95)	SP scores filtered sites	FPR
Non-seasonal	Human	26	16	38.46%	Human	9	6	33.33%
	Chimpanzee	103	65	36.89%	Gorilla	158	84	46.84%
	Cynomolgus	92	54	41.30%	Chimpanzee	132	90	31.82%
	Mouse	10	5	50.00%	Cynomolgus	237	131	44.73%
	Rat	66	42	36.36%
Seasonal	Indian rhesus	532	206	61.28%	Orangutan	444	246	44.59%
	Chinese rhesus	153	77	49.67%	Indian rhesus	531	299	43.69%
	Dog	261	106	59.39%	Chinese rhesus	189	89	52.91%
	Horse	262	134	48.85%	Marmoset	364	232	36.26%
	Rabbit	241	127	47.30%

Alignment problems may influence the performance of the branch-site test, with poor alignment increasing the incidence of false-positive sites. We therefore filtered out sites with obvious signs of unreliable alignment. We also calculated the SP [17, 18] score for each of the positive sites’ extended sequences (± 15 amino acids/45 base pairs). Most unreliable alignments are represented by numerous gaps and sequence divergences (S1 Fig and S5 Table). After filtering, a total of 2009/3810 (52.73%) positive sites remained. Sites with extended alignments with low divergence are listed in S6 Table. The results after filtering revealed more sites with positive selection in the seasonal compared with the non-seasonal breeding species (Table 2). The false-positive rate due to misalignment was 31.82%–61.28% (Table 2), which was similar to that of 50%–55% in a previous report [12]. After alignment filtering, differences in gene numbers between species in the Distant- and Close-Species sets were consistent with those after FDR-adjusted filtering. However, the false-positive rate (FPR) statistics only considered misalignment and did not take account of other factors such as sequence errors, misassembly, or annotation problems. According to extended-sequence alignments of the positive sites, SP scores <-50 were generally caused by excessive gaps or deficient matches, of which gaps contributed more to the low SP penalty scores (S1 Fig and S6 Table). Gaps and deficient matches may arise as a result of diversity between species or different transcript lengths, because we used the longest human transcripts to BLAST other species’ protein-coding sequences [35]. Columns with gaps in the alignments would be deleted in branch-site models, even though positive sites may be located within deficient sequence alignments surrounded by gaps or mismatched sequences. A threshold SP score of −50 can filter out most false-positive sites caused by divergent sequence alignments.
SP scoring thus improves the reliability of the results by reducing the false-positive rate caused by unreliable alignments. Details of the positive genes filtered by SP scoring are shown in S7 Table.
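
The FPR column of Table 2 is the fraction of BEB sites removed by SP-score filtering. The sketch below recomputes it from the table; the data layout and function name are illustrative, and the counts are copied from Table 2.

```python
# (sites with BEB > 0.95, sites kept after SP-score filtering) per species, from Table 2.
table2 = {
    "Distant-Species": {
        "Human": (26, 16), "Chimpanzee": (103, 65), "Cynomolgus": (92, 54),
        "Mouse": (10, 5), "Rat": (66, 42), "Indian rhesus": (532, 206),
        "Chinese rhesus": (153, 77), "Dog": (261, 106), "Horse": (262, 134),
        "Rabbit": (241, 127),
    },
    "Close-Species": {
        "Human": (9, 6), "Gorilla": (158, 84), "Chimpanzee": (132, 90),
        "Cynomolgus": (237, 131), "Orangutan": (444, 246),
        "Indian rhesus": (531, 299), "Chinese rhesus": (189, 89),
        "Marmoset": (364, 232),
    },
}

def false_positive_rate(beb_sites, kept_sites):
    """Percentage of BEB sites discarded by SP-score filtering."""
    return 100.0 * (beb_sites - kept_sites) / beb_sites

rows = [pair for species in table2.values() for pair in species.values()]
total_beb = sum(beb for beb, _ in rows)
total_kept = sum(kept for _, kept in rows)
rates = [false_positive_rate(beb, kept) for beb, kept in rows]

print(total_kept, total_beb, round(100.0 * total_kept / total_beb, 2))
print(round(min(rates), 2), round(max(rates), 2))
```

The totals reproduce the 2009 of 3810 sites (52.73%) reported in the text, and the per-species rates span the lowest and highest FPR values listed in Table 2.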

cDNA mapping as a novel method of filtering positive sites

The quality of the genome may limit the accuracy of evolutionary analysis. Low-quality genomes can produce false-positive results associated with sequencing errors, alternative splicing, amino acid repeats, and frameshift mutations, causing mistakes in gene annotation [8, 15]. However, cDNA sequences are much shorter than genome sequences and are thus more reliable. The reliability of positive sites will therefore be increased if sequences with positive sites are mapped to the corresponding cDNA sequences and aligned with most of the bases. We therefore used cDNA mapping as a novel means of testing sequence errors. cDNA sequences corresponding to the positive sites were analyzed. In this study, we aligned a total of 193 positive sites in perfect alignment with at least one cDNA sequence of the corresponding species using the MUSCLE function [13] in MEGA5 [44]. The coverage between positive sites and corresponding cDNA sequences was low (<10%, 193/2009), and the false-positive rate was 61.66% (119/193). Most inconsistent sites were in cynomolgus monkey, horse, and orangutan, which had genome sequences of low quality or with annotation mistakes. In contrast, the human, mouse and rat genome sequences showed high accuracy. The details of the positive sites mapped with the corresponding cDNA sequences are shown in S1 Table. A total of 74 corresponding cDNA sites were finally identified that were consistent with the positive sites (S1 Table). No corresponding cDNA sequences mapped to the positive sites in gorillas, Chinese rhesus monkeys, and marmosets. After verification by cDNA filtering, 39 genes remained, including 15 genes that were positively-selected in non-seasonal species (Table 3), and 24 in seasonal species (Table 4). Although the limited availability of cDNA sequences meant that only a few positive sites remained after mapping, these sites were likely to be more accurate.

Table 3

Positively-selected genes in non-seasonal species filtered by SP scoring and corrected by cDNA mapping.

Species	Gene Symbol	Species ID	Set	P-χ2 test	FDR correction
Human	CGA	ENST00000369582	Distant-Species	0.000000	0.000779
Human	TOMM6	ENST00000398884	Distant-Species	0.000046	0.035968
Human	CD151	ENST00000397420	Close-Species	0.000045	0.029687
Human	RRP8	ENST00000254605	Distant-Species	0.000040	0.033188
Human	ACCN4	ENST00000358078	Distant-Species	0.000000	0.000808
Human	ACCN4	ENST00000358078	Close-Species	0.000000	0.000609
Human	CHRNA1	ENST00000261007	Close-Species	0.000000	0.000115
CE	SNX5	CE_ENSP00000366998	Distant-Species	0.000000	0.000004
CE	SNX5	CE_ENSP00000366998	Close-Species	0.000000	0.000002
CE	NCAPG	CE_ENSP00000251496	Close-Species	0.000013	0.002031
CE	VPS33A	CE_ENSP00000267199	Distant-Species	0.000046	0.009533
Mouse	SWI5	ENSMUST00000113400	Distant-Species	0.000032	0.032039
Mouse	NID2	ENSMUST00000022340	Distant-Species	0.000005	0.005636
Mouse	DHDH	ENSMUST00000011526	Distant-Species	0.000066	0.047987
Mouse	DNAH1	ENSMUST00000048603	Distant-Species	0.000004	0.005318
Rat	INVS	ENSRNOT00000011622	Distant-Species	0.000001	0.001202
Rat	GALK2	ENSRNOT00000012447	Distant-Species	0.000146	0.037931

Table 4

Positively-selected genes in seasonal species filtered by SP scoring and corrected by cDNA mapping.

Species	Gene Symbol	Species ID	Set	P-χ2 test	FDR correction
Orangutan	TADA1	ENSPPYT00000000676	Close-Species	0.000001	0.000150
Orangutan	LGALS3BP	ENSPPYT00000010154	Close-Species	0.000000	0.000000
Orangutan	ZFR	ENSPPYT00000017875	Close-Species	0.000000	0.000001
Orangutan	THRAP3	ENSPPYT00000001838	Close-Species	0.000000	0.000001
Orangutan	MTMR12	ENSPPYT00000017872	Close-Species	0.000000	0.000000
Orangutan	TMCC2	ENSPPYT00000000349	Close-Species	0.000000	0.000018
Orangutan	SLC44A2	ENSPPYT00000011142	Close-Species	0.000016	0.001520
Orangutan	MIPEP	ENSPPYT00000006166	Close-Species	0.000010	0.001011
Orangutan	XRN2	ENSPPYT00000012494	Close-Species	0.000000	0.000000
Orangutan	RBM47	ENSPPYT00000017075	Close-Species	0.000000	0.000000
Orangutan	MBTPS1	ENSPPYT00000008921	Close-Species	0.000138	0.008620
Orangutan	FAM69A	ENSPPYT00000001379	Close-Species	0.000199	0.011474
Orangutan	SLC43A2	ENSPPYT00000009117	Close-Species	0.000992	0.042126
Orangutan	RAB1B	ENSPPYT00000003634	Close-Species	0.000026	0.002287
Orangutan	CMTM6	ENSPPYT00000016330	Close-Species	0.000000	0.000003
Orangutan	DARS2	ENSPPYT00000000592	Close-Species	0.000004	0.000458
Orangutan	AARS	ENSPPYT00000008865	Close-Species	0.000002	0.000217
Orangutan	TH1L	ENSPPYT00000012980	Close-Species	0.000000	0.000000
Rabbit	PLEK	ENSOCUT00000023428	Distant-Species	0.000113	0.016744
Rabbit	SNX25	ENSOCUT00000024747	Distant-Species	0.000005	0.002487
Dog	ALB	ENSCAFT00000037121	Distant-Species	0.000019	0.004967
Horse	SMC4	ENSECAT00000024113	Distant-Species	0.000108	0.015518
Horse	ANO6	ENSECAT00000013517	Distant-Species	0.000007	0.003137
Horse	GLIPR1	ENSECAT00000016491	Distant-Species	0.000035	0.007439

Discussion

Influence of alignment and annotation

The results of evolutionary analysis are influenced by the quality of the genome sequence; false-positive sites may be detected and important information may be missed as a result of low-quality sequences [15, 16]. Unfortunately, current genome-sequencing techniques are still unable to provide sequences reliable enough for evolutionary analysis. Stringent filtering functions and parameters are therefore needed to obtain reliable positive sites, and careful analytical design can achieve reliable results, even from low-quality genome sequences. Evolutionary analysis usually starts with sequence alignment using software such as ClustalW, MUSCLE or PRANK. In this study, we used PRANK (codon), because this software takes evolutionary information into consideration before placing the gaps [11], resulting in fewer mismatches but larger gaps compared with the other programs (S3 Table). Valid positive sites are likely to be located in alignments with low divergence and few gaps or mismatches, and sequence misalignments can thus generate false-positive sites in branch-site models. The branch-site model usually deletes columns with gaps in the alignments when calculating positive sites, so some sites located in deficient alignments may be regarded as positive, whereas some true-positive sites may be missed. SP-score filtering, which focuses on filtering out such false-positive sites, can be used to reduce the false-positive rate and ensure the quality of the filtered positive sites. On the other hand, cDNA mapping can exclude false-positive sites that originate from mistakes in genome sequence assembly and gene annotation. The combination of these processes can thus filter out many false-positive sites and identify low-quality genome sequences, such as those for cynomolgus monkey, horse, and orangutan in this study. cDNA sequences in previous genome-wide studies have generally been used as references for gene annotation [45-47].
In contrast, we used cDNA mapping as a novel method to identify positive sites with high quality. Because cDNA sequences are usually relatively short, current sequencing techniques can provide reliable sequences. Moreover, some sites can be mapped to more than one corresponding cDNA sequence. cDNA mapping can thus ensure the quality of the remaining positive sites. However, there are some limitations. More than 90% of sites cannot be matched with corresponding cDNA sequences, and the validity of these sites therefore cannot be checked using this method. Because cDNA sequences are usually sequenced for a specific purpose, corresponding cDNA sequences may not be available for some putative positive sites, and genes with important evolutionary implications may be missed.

Positively-selected genes in seasonal and non-seasonal breeders

Evolutionary analysis of genome sequences can be used to identify specific, positively-selected genes in various species. The genetic mechanisms and potential environmental adaptations associated with seasonal and non-seasonal breeding can then be inferred by functional analysis of positively-selected genes in the respective species. The functions of positively-selected genes in non-seasonal breeding species reflect reproductive tendencies such as sperm generation and cell proliferation. Two key genes perform these functions in humans: CGA (glycoprotein hormones, alpha polypeptide) is a gonadotropin subunit [48, 49], while CD151 functions in promoting metastasis, and increases the expression of phospho-extracellular signal-regulated kinase (ERK) [50, 51]. Given that ERK is a component of the mitogen-activated protein kinase pathway, positive selection pressure on this gene may influence cell proliferation and differentiation [52, 53]. Mutation of Dnah1 in mice has been reported to cause male infertility [54, 55], suggesting that it may play an important role in influencing mating behavior. Another crucial gene in rats, Invs, is involved in controlling cytoskeletal organization and cell division, which are essential for reproduction [56, 57]. Moreover, this gene can interact with NPHP1 and NPHP3 that influence the Wnt signaling pathway, which may in turn influence kidney function and renal cell formation linked to spermatocyte and spermatid generation in the testis [58-60]. These positively-selected genes may reflect modulation of the reproductive system under environmental pressure in non-seasonal breeding species, enabling them to breed throughout the year. The identification of positive sites focused on sperm generation and cell proliferation suggests that mutations in these genes may influence sperm quantity or reproductive capacity. 
Genes that were positively selected in seasonal breeding species differed from those in non-seasonal species in having less focused functions. However, the orangutan provided the most valid positive genes among these species, and their functional analysis may help to explain some predominant characteristics of seasonal breeding species. The key gene, THRAP3 (thyroid hormone receptor associated protein 3, also known as Thrap150), is a selective coactivator for CLOCK-BMAL1 and promotes CLOCK-BMAL1 binding to target genes [61]. Moreover, THRAP3 can also interact with HELZ2, which regulates adipocyte differentiation [62]. Because Clock and Bmal1 have previously been reported to be closely related to seasonal breeding behaviors [63], the THRAP3 mutation may influence the circadian rhythm of the reproductive system. This is supported by a previous study showing that thyroid hormone catabolism within the mediobasal hypothalamus regulated seasonal gonadotropin-releasing secretion [64]. However, because orangutans live in Indonesia, which has high temperature throughout the year [30, 65], they may not need to adjust their physical condition, such as lipid storage, to cope with cold weather. THRAP3 may thus influence adipocyte differentiation, while other functionally-related genes such as MTMR12 [66] and ZFR [67] would be positively selected because of such environmental conditions. In addition to THRAP3, the positively-selected genes TH1L and CMTM6 may also help to explain the seasonal breeding behavior: TH1L may have a similar function to TH1, which attenuates androgen signaling [68], while CMTM6 functions in spermatogenesis [69-71]. Evidence from previous studies suggests that orangutans produce 14 times less sperm than chimpanzees, which are closely related but non-seasonal breeders [72].
Seasonal breeding in orangutans may thus be a consequence of circadian rhythm and limited sperm production, which restrict their breeding to the period from December to May, the most productive months in terms of food (fruit) supply, to ensure adequate food and energy for effective reproduction [73]. Diversity in breeding behaviors can generally be attributed to mutations affecting endocrine mechanisms. Such mutations may be related to specific environmental conditions, such as temperature and food supply. In this study, positively-selected genes related to sperm generation were identified in both types of breeding species. Indeed, previous reports have indicated rapid evolution of sperm proteins in mammals [74, 75]. Evolutionary mutations in these genes may not lead to the unique consequences associated with different breeding strategies. However, previous studies have indicated that the reproduction behavior in seasonal breeding species is largely under the regulation of the circadian rhythm system [64]. This is consistent with our results, which showed that THRAP3, which is functionally-related to the CLOCK-BMAL1 system, was under positive selection pressure. The mechanisms determining breeding behaviors can be complicated, but evolution leads to adaptation to the environment, enabling well-adapted lineages to persist for many generations.

Conclusions

In this study, we conducted a precise, genome-wide scan to detect genes that were positively selected between seasonal and non-seasonal breeding species. The evolutionary analysis was designed to reduce the incidence of false-positive sites by SP filtering and cDNA mapping. Although the lack of cDNA sequences means that some positive genes may have been missed, the identification of valid, positively-selected genes with functions relating to spermatogenesis, cell proliferation, and circadian rhythm might indicate possible molecular mechanisms underlying the seasonal and non-seasonal reproductive behaviors. Further developments in genome-sequencing technologies will allow the sequencing and assembly of higher-quality genomes, and more accurate gene annotation, while the availability of more cDNA sequences will increase the value of cDNA mapping for improving the accuracy of evolutionary analysis.

Sites with extended sequence alignments.

(A). Perfect alignment. (B). Acceptable alignment. (C). Unacceptable alignment because of large number of gaps. (D). Unacceptable alignment because of putative positive sites located in poorly-aligned sequences. (E). False negative. SP scoring filtered out mistaken acceptable alignments. (TIF)

Positive sites mapped with the corresponding cDNA sequences.