
Cynomolgus monkey testicular cDNAs for discovery of novel human genes in the human genome sequence.

Naoki Osada, Munetomo Hida, Jun Kusuda, Reiko Tanuma, Makoto Hirata, Yumiko Suto, Momoki Hirai, Keiji Terao, Sumio Sugano, Katsuyuki Hashimoto.

Abstract

BACKGROUND: In order to contribute to the establishment of a complete map of transcribed regions of the human genome, we constructed a testicular cDNA library for the cynomolgus monkey, and attempted to find novel transcripts for identification of their human homologues. RESULT: The full-insert sequences of 512 cDNA clones were determined. Ultimately we found 302 non-redundant cDNAs carrying open reading frames of 300 bp-length or longer. Among them, 89 cDNAs were found not to be annotated previously in the Ensembl human database. After searching against the Ensembl mouse database, we also found 69 putative coding sequences have no homologous cDNAs in the annotated human and mouse genome sequences in Ensembl. We subsequently designed a DNA microarray including 396 non-redundant cDNAs (with and without open reading frames) to examine the expression of the full-sequenced genes. With the testicular probe and a mixture of probes of 10 other tissues, 316 of 332 effective spots showed intense hybridized signals and 75 cDNAs were shown to be expressed very highly in the cynomolgus monkey testis, but not ubiquitously.
CONCLUSIONS: In this report, we determined 302 full-insert sequences of cynomolgus monkey cDNAs with enough length of open reading frames to discover novel transcripts as human homologues. Among 302 cDNA sequences, human homologues of 89 cDNAs have not been predicted in the annotated human genome sequence in the Ensembl. Additionally, we identified 75 dominantly expressed genes in testis among the full-sequenced clones by using a DNA microarray. Our cDNA clones and analytical results will be valuable resources for future functional genomic studies.


Year:  2002        PMID: 12498619      PMCID: PMC140308          DOI: 10.1186/1471-2164-3-36

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Progress in genome biology has revealed the complete genome sequences of many non-mammalian species, such as yeast, nematodes, and the fruit fly. In addition, the much larger and more complicated genome sequences of the mouse and the human will soon be made completely available. However, decoding the genome sequences, especially the human sequence, will be a long process. In order to achieve a comprehensive understanding of how an organism is established by its genome sequence, we must identify the structure, function, and interaction of as many genes as possible. First, we should accumulate and compile many types of evidence from computational and empirical data. The immediate challenge is establishing a complete map of transcribed regions in the human genome. Current comprehensive studies predicting protein-coding genes from the human genome [1,2] mainly employ three sources of information: empirical evidence provided by expressed sequence tags (ESTs) and cDNAs, nucleotide and protein sequence similarity to those of known genes, and statistical probability calculated by computer algorithms (ab initio prediction). All of these sources are prone, to some degree, to false-positive or false-negative errors. EST and cDNA sequences usually contain sequences that are not actually transcribed in vivo, i.e. artifacts arising from splicing intermediates, genomic DNA contamination, and transcription from nongenic regions [3,4]. Moreover, rarely expressed genes, which may represent only a small portion of all transcripts, are not easily represented in cDNA libraries. Predictions based on nucleotide and protein sequence similarities to those of other gene families and organisms might misassign pseudogenes, and cannot identify evolutionarily diverged genes that have no sequence similarity to known genes. Ab initio prediction works well for some organisms, such as yeast, nematodes, and the fruit fly.
However, the human genome makes ab initio prediction of protein-coding genes difficult because it generally consists of small exons separated by long introns. Ultimately, in order to make a complete catalog of human genes, it will be necessary to gather undiscovered evidence from experiments and discard spurious evidence. Our strategy for finding novel genes is to perform cDNA analysis using an organism closely related to humans, the cynomolgus monkey (Macaca fascicularis). In previous studies, we accumulated a number of 5'-end sequences of many clones derived from the oligo-capped cDNA libraries of the brain with high mRNA complexity, and determined approximately 1,500 full insert sequences of the clones whose 5'-end sequences showed no significant similarity to sequences in the public databases [5,6]. This method allowed us to identify many novel transcripts in the human genome sequence. Using fresh cynomolgus monkey tissues makes it possible to isolate rarely expressed genes, because mRNAs are so fragile that considerable portions of them degrade during the usual construction of a cDNA library for humans. A further advantage of using the cynomolgus monkey is that evolutionary comparison can also provide information on gene function. If there are genes that evolved rapidly after the divergence of humans and cynomolgus monkeys, the function of the proteins and parts of the proteins might be important for human evolution. Moreover, biomedical interest in nonhuman primate genomes has increased rapidly [7], especially in macaques, which have also been used to produce transgenic primates [8], and thus genomic analysis of macaques will be important after the completion of human genome sequencing. In this study, we analyzed the cDNA library of the cynomolgus monkey testis.
Analysis of testicular cDNA libraries has high potential for finding novel genes [9,10], because the testis is an organ in which transcripts have high complexity and where important biological processes, such as cell differentiation and meiosis, occur. The genes expressed in the testis are also important for medical, evolutionary, and developmental research. It is ironic that one of the most attractive tissues for biology expresses a number of undiscovered genes. We anticipated that analysis of the testicular cDNA library would lead to the discovery of novel genes that would facilitate post-genomic studies attempting to unravel the complex genomes of higher organisms. Further, we conducted an expression analysis of our full-sequenced cDNAs with a cDNA microarray. DNA microarrays are a versatile tool for evaluating gene expression and sequence variation [11]. We used the cDNA microarray to determine whether our putative genes were actually transcribed in cynomolgus monkey tissues and whether they were expressed dominantly in the cynomolgus monkey testis.

Results

We constructed the cDNA library derived from the cynomolgus monkey testis (library name: QtsA) by the oligo-capping method. The 5'-ends of 10,426 clones isolated from the library were sequenced and yielded 5,381 clusters of sequences (the redundancy rate was 1.94). To classify these cynomolgus monkey cDNAs and find their human homologues, we performed a BLAST search [12] against the human RefSeq database [13]. The 5'-end sequences of 6151 clones were found to have high similarity to 2321 human RefSeq genes with a cut-off value of e-60. The results showed that most of the frequently represented genes in the QtsA cDNA library were those specifically expressed in testis and sperm. The most frequently represented genes are listed in Table 1. The clones whose 5'-end sequences had no homology to sequences in the nr and EST databases in GenBank and that potentially carried an ORF of sufficient length were subjected to full-insert sequencing. The entire sequences of 512 clones were determined as a result, but the total number of non-redundant transcripts was smaller because we could not completely exclude the 5'-truncated clones of the same transcripts at the stage of 5'-end sequencing. Further, we may have obtained some alternatively spliced transcripts from the same gene. In such cases, we used the longest transcripts in this study. Ultimately, we obtained a total of 394 non-redundant full insert cDNA sequences (Figure 1). After masking the common repetitive elements in the Repbase Update database [14], we assigned all cDNA sequences to the human genome draft sequence by using the BLAST program. With 85% or greater sequence identity and 50% or greater overlap of cDNA sequence length as criteria, 12 clones were deduced to be chimeric because they could be divided into two regions whose DNA sequences showed homology to sequences of different human chromosomes.
Sequences of 317 cDNAs had only one homologous region in the human genome sequences, while 18 cDNA sequences had homology to two or more human chromosomal regions. The remaining 47 had no homologous sequences in the human genome based on the above criteria. The average nucleotide length of all full-sequenced clones was 2016 bp. Of the 382 non-chimeric sequences, 302 carried a putative CDS (coding sequence) longer than or equal to 300 bp. In order to determine how many human homologues of our full-sequenced cDNAs have been annotated from the human genome sequences, we searched the 302 putative CDSs against the Ensembl human database (Release 5.28.1) [15], which comprised 29,076 putative transcribed sequences classified as 'known' and 'novel' genes (BLAST cut-off value was e-60, and the coverage was ≥ 50% of each putative CDS length). Genes classified as 'known' in Ensembl are more reliable and have valid cDNA and/or evolutionary evidence, whereas 'novel' genes lack credible sources of expression and are sometimes supported only by ab initio methods and ESTs. As a result, 124 and 89 putative CDSs had human homologous sequences in the known category and novel category, respectively. The other 89 putative CDSs had no homologous sequences in the Ensembl human database based on these criteria. We also searched the 302 cDNA sequences against the Ensembl mouse database (Release 7.3b.2), in which 28,097 putative transcribed sequences were annotated (cut-off: e-30, coverage: ≥ 50% of ORF length), and found that 74 and 67 cDNAs had homologous mouse sequences predicted as Ensembl 'known' and 'novel' genes, respectively. Finally, 69 putative CDSs had no homologous sequences in the annotated mammalian genome sequences in Ensembl. The putative functions of the 302 hypothetical proteins were predicted by searching against the InterPro database [16]. Their names and other information on the 302 cDNAs are provided in the supplementary table (additional file 1).
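The homology criteria quoted above amount to a simple filter applied to each BLAST hit, and the reported counts can be cross-checked with basic arithmetic. The following is a minimal sketch only: the function name, its arguments, and the idea of summarizing a hit by its E-value and aligned length are illustrative assumptions, not the authors' pipeline.

```python
def is_homologous(evalue, aligned_len, cds_len, max_evalue=1e-60, min_coverage=0.5):
    """Return True if a hit meets both criteria quoted in the text.

    evalue      -- BLAST E-value of the hit
    aligned_len -- number of CDS bases covered by the alignment (assumed summary)
    cds_len     -- length of the putative CDS in bases
    """
    if cds_len <= 0:
        raise ValueError("cds_len must be positive")
    coverage = aligned_len / cds_len
    return evalue <= max_evalue and coverage >= min_coverage


# Cross-check of counts reported in the text.
non_chimeric = 394 - 12                      # non-redundant clones minus chimeric clones
genome_hits = {"one region": 317, "several regions": 18, "no region": 47}
assert sum(genome_hits.values()) == non_chimeric

ensembl_human = {"known": 124, "novel": 89, "unidentified": 89}
assert sum(ensembl_human.values()) == 302    # all putative CDSs of 300 bp or longer
```

For the mouse search the same function could be called with `max_evalue=1e-30`, matching the looser cut-off stated in the text.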
We then constructed the putative human transcribed sequences corresponding to the 302 cynomolgus monkey cDNA sequences carrying ORFs of sufficient length by using the human genome draft sequences (see Materials and methods). The result showed that 117 putative human transcribed sequences, including 55 'known' and 48 'novel' genes in Ensembl, had almost identical genomic structure to those of cynomolgus monkeys. We tested how many exons of the 48 'novel' and 12 'unidentified' putative transcribed sequences could be predicted by the ab initio program GENSCAN [17]. In total, 240 (53%) and 79 (17%) of 455 exons were correctly and partially predicted by GENSCAN, respectively; however, 136 exons (30%) were not predicted. The list of putative human transcribed sequences is presented in Table 2 and their sequences are provided in the supplementary data (additional file 2), but the sequences have not been registered in the public database because they were not actually sequenced. We also compared the nucleotide and protein sequence similarity of the 117 putative transcribed sequences between humans and cynomolgus monkeys. Amino acid sequence identity, nucleotide sequence identity for CDS, synonymous substitution per synonymous site, and non-synonymous substitution per non-synonymous site are presented in Table 2.
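The GENSCAN percentages quoted above follow directly from the exon counts. A short check, using only the numbers given in the text (the dictionary keys are illustrative labels):

```python
# Exon counts reported for the GENSCAN comparison.
exons = {"correct": 240, "partial": 79, "not predicted": 136}

total = sum(exons.values())
percent = {label: round(100 * count / total) for label, count in exons.items()}

for label, count in exons.items():
    print(f"{label}: {count}/{total} = {percent[label]}%")
```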
Table 1

The list of 10 most frequently represented genes in the library QtsA.

| Accession | No. of clones | Gene name (Gene symbol) |
|---|---|---|
| NM_002762 | 318 | Protamine 2 (PRM2) |
| NM_004645 | 121 | Coilin (COIL) |
| NM_004362 | 108 | Calmegin (CLGN) |
| NM_005425 | 105 | Transition protein 2 (TNP2) |
| NM_004396 | 95 | DEAD/H box polypeptide 5 (DDX5) |
| NM_017769 | 83 | Hypothetical protein FLJ20333 (FLJ20333) |
| NM_030941 | 80 | Exonuclease NEF-sp (LOC81691) |
| NM_001402 | 77 | Eukaryotic translation elongation factor 1 alpha 1 (EEF1A1) |
| NM_021009 | 76 | Ubiquitin C (UBC) |
| NM_004724 | 63 | ZW10 (Drosophila) homolog (ZW10) |

Figure 1

Flow of full-sequencing analysis of unidentified clones. 1) The 512 full-sequenced clones were reduced to 394 because slightly different 5'-end sequences could be derived from the same transcripts. 2) 394 non-redundant clones were assigned to the human genome sequence. 3) 302 of 382 non-chimeric clones carried ORFs of 300 bp or longer. 4) 302 putative genes were classified as 'known' and 'novel' categories of Ensembl human cDNA sequences. *CHR: Chromosome

Table 2

The list of 117 putative human transcribed sequences.

| Referenced macaca cDNA (a) | Ensembl status (b) | Length | CDS length: start..end (bp) | aa identity (%) | nt identity of CDS (%) | Ka (c) | Ks (d) |
|---|---|---|---|---|---|---|---|
| QtsA-10152 | novel | 1789 | 413 AA: 42..1283 | 96.1 | 97.6 | 0.019 | 0.039 |
| QtsA-10154 | known | 2010 | 502 AA: 377..1885 | 98.2 | 98.3 | 0.009 | 0.032 |
| QtsA-10162 | novel | 2444 | 718 AA: 72..2228 | 96.5 | 97.6 | 0.017 | 0.040 |
| QtsA-10245 | known | 2598 | 752 AA: 298..2556 | 94.4 | 96.1 | 0.027 | 0.078 |
| QtsA-10439 | novel | 2566 | 538 AA: 271..1887 | 88.6 | 93.7 | 0.059 | 0.076 |
| QtsA-10472 | known | 2159 | 418 AA: 76..1332 | 95.5 | 96.7 | 0.025 | 0.062 |
| QtsA-10491 | known | 2439 | 346 AA: 1322..2362 | 98.0 | 98.6 | 0.009 | 0.027 |
| QtsA-10636 | novel | 2627 | 440 AA: 57..1379 | 100.0 | 100.0 | 0.000 | 0.000 |
| QtsA-10679 | known | 2415 | 523 AA: 726..2297 | 96.0 | 96.8 | 0.021 | 0.061 |
| QtsA-10739 | novel | 1880 | 231 AA: 141..836 | 93.4 | 97.0 | 0.034 | 0.022 |
| QtsA-10833 | known | 2234 | 673 AA: 89..2110 | 92.2 | 95.2 | 0.037 | 0.077 |
| QtsA-10891 | novel | 2049 | 343 AA: 2..1033 | 93.2 | 95.2 | 0.038 | 0.084 |
| QtsA-10947 | known | 2132 | 540 AA: 112..1734 | 93.2 | 94.4 | 0.033 | 0.126 |
| QtsA-10963 | known | 1924 | 462 AA: 433..1821 | 98.9 | 98.6 | 0.005 | 0.039 |
| QtsA-11068 | unidentified | 2299 | 594 AA: 405..2189 | 91.8 | 94.9 | 0.042 | 0.080 |
| QtsA-11127 | known | 2084 | 550 AA: 54..1706 | 89.8 | 95.5 | 0.053 | 0.034 |
| QtsA-11181 | known | 3414 | 566 AA: 84..1784 | 99.8 | 98.9 | 0.001 | 0.036 |
| QtsA-11319 | unidentified | 1559 | 104 AA: 168..482 | 100.0 | 100.0 | 0.000 | 0.000 |
| QtsA-11379 | known | 2805 | 690 AA: 106..2178 | 98.8 | 98.2 | 0.005 | 0.050 |
| QtsA-11535 | known | 2116 | 474 AA: 304..1728 | 98.3 | 98.0 | 0.008 | 0.057 |
| QtsA-11567 | unidentified | 1376 | 376 AA: 200..1330 | 96.0 | 97.7 | 0.020 | 0.034 |
| QtsA-11570 | unidentified | 2437 | 117 AA: 1902..2255 | 90.6 | 95.5 | 0.046 | 0.042 |
| QtsA-11661 | novel | 2228 | 588 AA: 227..1993 | 97.6 | 98.0 | 0.013 | 0.038 |
| QtsA-11670 | unidentified | 1785 | 325 AA: 413..1390 | 99.7 | 99.2 | 0.002 | 0.028 |
| QtsA-11842 | known | 2173 | 225 AA: 142..819 | 100.0 | 100.0 | 0.000 | 0.000 |
| QtsA-12007 | novel | 2316 | 724 AA: 28..2202 | 96.6 | 97.1 | 0.017 | 0.053 |
| QtsA-12095 | novel | 710 | 231 AA: 16..711 | 94.4 | 96.8 | 0.030 | 0.039 |
| QtsA-12142 | known | 1731 | 404 AA: 395..1609 | 94.1 | 96.3 | 0.027 | 0.060 |
| QtsA-12155 | known | 1305 | 329 AA: 252..1241 | 95.5 | 97.7 | 0.024 | 0.034 |
| QtsA-12190 | unidentified | 1962 | 600 AA: 21..1823 | 94.0 | 96.6 | 0.030 | 0.044 |
| QtsA-12219 | known | 2480 | 793 AA: 18..2399 | 100.0 | 100.0 | 0.000 | 0.000 |
| QtsA-12282 | novel | 2270 | 700 AA: 93..2195 | 97.1 | 97.9 | 0.013 | 0.036 |
| QtsA-12354 | known | 2082 | 674 AA: 5..2029 | 84.4 | 90.3 | 0.095 | 0.119 |
| QtsA-12362 | novel | 2177 | 695 AA: 91..2178 | 93.3 | 95.9 | 0.034 | 0.068 |
| QtsA-12457 | novel | 2405 | 689 AA: 105..2174 | 94.9 | 96.5 | 0.026 | 0.059 |
| QtsA-12579 | novel | 1499 | 491 AA: 8..1483 | 93.1 | 94.9 | 0.034 | 0.103 |
| QtsA-12649 | novel | 2114 | 634 AA: 33..1937 | 97.6 | 98.2 | 0.012 | 0.038 |
| QtsA-12757 | known | 1280 | 278 AA: 201..1037 | 98.6 | 97.7 | 0.008 | 0.055 |
| QtsA-12767 | novel | 2100 | 622 AA: 51..1919 | 98.6 | 98.0 | 0.007 | 0.060 |
| QtsA-12769 | known | 1530 | 395 AA: 75..1262 | 92.2 | 95.8 | 0.038 | 0.052 |
| QtsA-12850 | known | 2825 | 854 AA: 262..2826 | 97.5 | 97.7 | 0.011 | 0.053 |
| QtsA-13222 | known | 2806 | 873 AA: 68..2689 | 92.7 | 95.0 | 0.038 | 0.085 |
| QtsA-13252 | novel | 2229 | 669 AA: 65..2074 | 95.7 | 96.9 | 0.021 | 0.059 |
| QtsA-13272 | novel | 1833 | 207 AA: 184..807 | 98.1 | 98.6 | 0.010 | 0.033 |
| QtsA-13343 | novel | 1960 | 131 AA: 171..566 | 91.6 | 96.2 | 0.041 | 0.026 |
| QtsA-13392 | unidentified | 1761 | 438 AA: 313..1629 | 98.9 | 98.9 | 0.005 | 0.022 |
| QtsA-13406 | known | 1855 | 266 AA: 930..1730 | 92.0 | 96.2 | 0.039 | 0.040 |
| QtsA-13432 | known | 1718 | 428 AA: 360..1646 | 97.7 | 98.1 | 0.011 | 0.038 |
| QtsA-13460 | known | 1492 | 427 AA: 26..1309 | 95.0 | 97.2 | 0.023 | 0.035 |
| QtsA-13672 | novel | 1824 | 363 AA: 734..1825 | 95.3 | 96.9 | 0.023 | 0.046 |
| QtsA-13918 | known | 1730 | 537 AA: 120..1733 | 97.2 | 98.1 | 0.012 | 0.047 |
| QtsA-13925 | novel | 1678 | 515 AA: 114..1661 | 92.0 | 95.4 | 0.041 | 0.061 |
| QtsA-14022 | novel | 1653 | 517 AA: 102..1655 | 96.3 | 96.8 | 0.020 | 0.064 |
| QtsA-14166 | novel | 1121 | 293 AA: 177..1058 | 89.5 | 94.7 | 0.052 | 0.060 |
| QtsA-14245 | known | 1784 | 531 AA: 5..1600 | 96.6 | 97.2 | 0.017 | 0.056 |
| QtsA-14351 | novel | 2938 | 839 AA: 225..2744 | 91.8 | 96.0 | 0.041 | 0.046 |
| QtsA-14618 | known | 1273 | 363 AA: 57..1148 | 94.5 | 97.0 | 0.027 | 0.037 |
| QtsA-14653 | known | 996 | 150 AA: 405..857 | 100.0 | 98.7 | 0.000 | 0.049 |
| QtsA-14746 | known | 2049 | 528 AA: 17..1603 | 96.8 | 97.2 | 0.016 | 0.060 |
| QtsA-14752 | known | 904 | 235 AA: 184..891 | 97.5 | 97.0 | 0.012 | 0.078 |
| QtsA-14816 | unidentified | 2515 | 134 AA: 242..646 | 98.4 | 99.2 | 0.008 | 0.013 |
| QtsA-14824 | known | 2965 | 891 AA: 151..2826 | 95.7 | 97.5 | 0.020 | 0.036 |
| QtsA-14970 | known | 1282 | 168 AA: 639..1145 | 100.0 | 99.2 | 0.000 | 0.017 |
| QtsA-15013 | novel | 2349 | 303 AA: 364..1275 | 90.1 | 95.3 | 0.045 | 0.047 |
| QtsA-15139 | novel | 2487 | 740 AA: 75..2297 | 96.4 | 96.9 | 0.017 | 0.061 |
| QtsA-15186 | known | 2181 | 588 AA: 97..1863 | 93.1 | 96.4 | 0.034 | 0.044 |
| QtsA-15224 | novel | 2089 | 290 AA: 336..1208 | 96.1 | 96.8 | 0.021 | 0.074 |
| QtsA-15268 | novel | 1808 | 396 AA: 203..1393 | 96.2 | 96.4 | 0.018 | 0.092 |
| QtsA-15315 | known | 1284 | 344 AA: 217..1251 | 88.8 | 94.0 | 0.054 | 0.073 |
| QtsA-15384 | known | 2169 | 565 AA: 213..1910 | 95.9 | 96.9 | 0.019 | 0.062 |
| QtsA-15676 | novel | 2153 | 581 AA: 184..1929 | 92.1 | 95.8 | 0.039 | 0.068 |
| QtsA-15696 | novel | 1856 | 563 AA: 144..1835 | 93.1 | 96.5 | 0.034 | 0.046 |
| QtsA-15812 | novel | 2293 | 576 AA: 491..2221 | 96.4 | 97.2 | 0.019 | 0.052 |
| QtsA-15844 | known | 2569 | 653 AA: 174..2135 | 95.6 | 96.1 | 0.023 | 0.086 |
| QtsA-15875 | known | 2327 | 654 AA: 357..2321 | 96.6 | 97.7 | 0.017 | 0.038 |
| QtsA-16005 | known | 1987 | 518 AA: 433..1989 | 100.0 | 99.4 | 0.000 | 0.021 |
| QtsA-16015 | known | 2389 | 671 AA: 345..2360 | 98.1 | 97.7 | 0.009 | 0.054 |
| QtsA-16028 | known | 1624 | 447 AA: 23..1366 | 99.8 | 97.7 | 0.001 | 0.082 |
| QtsA-16077 | known | 2307 | 571 AA: 576..2291 | 99.7 | 98.7 | 0.002 | 0.047 |
| QtsA-16107 | known | 2039 | 432 AA: 301..1599 | 100.0 | 98.8 | 0.000 | 0.034 |
| QtsA-16118 | known | 1396 | 415 AA: 57..1304 | 97.6 | 96.7 | 0.012 | 0.096 |
| QtsA-16284 | novel | 1199 | 342 AA: 31..1059 | 96.5 | 96.8 | 0.017 | 0.071 |
| QtsA-16373 | known | 2085 | 433 AA: 619..1920 | 99.5 | 98.5 | 0.002 | 0.048 |
| QtsA-16429 | known | 1783 | 413 AA: 42..1283 | 96.1 | 97.6 | 0.019 | 0.039 |
| QtsA-16453 | novel | 1757 | 185 AA: 793..1350 | 87.3 | 93.5 | 0.066 | 0.065 |
| QtsA-16496 | novel | 1599 | 448 AA: 72..1418 | 95.5 | 96.1 | 0.023 | 0.081 |
| QtsA-16602 | unidentified | 2482 | 263 AA: 315..1106 | 96.6 | 97.1 | 0.015 | 0.079 |
| QtsA-16622 | novel | 1364 | 313 AA: 291..1232 | 95.1 | 96.9 | 0.024 | 0.045 |
| QtsA-16678 | known | 2325 | 688 AA: 91..2157 | 98.5 | 96.5 | 0.008 | 0.096 |
| QtsA-16765 | novel | 2501 | 415 AA: 862..2109 | 98.1 | 97.4 | 0.010 | 0.068 |
| QtsA-16837 | known | 3268 | 586 AA: 873..2633 | 99.3 | 98.5 | 0.004 | 0.041 |
| QtsA-17449 | known | 1858 | 506 AA: 262..1782 | 90.9 | 95.4 | 0.044 | 0.053 |
| QtsA-17495 | novel | 1026 | 261 AA: 62..847 | 96.2 | 97.3 | 0.018 | 0.068 |
| QtsA-17616 | known | 2471 | 617 AA: 435..2288 | 98.4 | 98.2 | 0.008 | 0.044 |
| QtsA-18070 | novel | 1997 | 585 AA: 134..1891 | 97.8 | 97.7 | 0.009 | 0.054 |
| QtsA-18134 | known | 1832 | 309 AA: 592..1521 | 99.7 | 98.2 | 0.002 | 0.053 |
| QtsA-18363 | novel | 1807 | 315 AA: 638..1585 | 95.9 | 97.5 | 0.020 | 0.040 |
| QtsA-18372 | novel | 972 | 128 AA: 337..723 | 97.7 | 97.4 | 0.011 | 0.069 |
| QtsA-18427 | known | 2198 | 565 AA: 416..2113 | 99.3 | 98.9 | 0.003 | 0.033 |
| QtsA-18831 | unidentified | 2133 | 555 AA: 47..1714 | 92.1 | 95.1 | 0.041 | 0.082 |
| QtsA-18885 | known | 3250 | 642 AA: 314..2242 | 96.6 | 95.8 | 0.017 | 0.102 |
| QtsA-19023 | novel | 2072 | 500 AA: 84..1586 | 91.0 | 95.6 | 0.043 | 0.047 |
| QtsA-19036 | novel | 955 | 214 AA: 313..957 | 100.0 | 99.5 | 0.000 | 0.014 |
| QtsA-19380 | unidentified | 2158 | 412 AA: 625..1863 | 98.1 | 97.3 | 0.009 | 0.071 |
| QtsA-19788 | novel | 1080 | 295 AA: 116..1003 | 98.6 | 98.3 | 0.006 | 0.040 |
| QtsA-19856 | known | 2055 | 352 AA: 497..1555 | 98.9 | 98.4 | 0.005 | 0.039 |
| QtsA-19961 | known | 1025 | 283 AA: 62..913 | 100.0 | 98.0 | 0.000 | 0.069 |
| QtsA-20273 | novel | 1783 | 420 AA: 79..1341 | 92.7 | 96.1 | 0.039 | 0.029 |
| QtsA-20302 | known | 2889 | 882 AA: 87..2735 | 94.6 | 97.2 | 0.026 | 0.042 |
| QtsA-20424 | unidentified | 2056 | 505 AA: 147..1664 | 99.2 | 98.4 | 0.005 | 0.041 |
| QtsA-20433 | novel | 1981 | 559 AA: 73..1752 | 94.8 | 96.5 | 0.027 | 0.057 |
| QtsA-20664 | known | 2396 | 616 AA: 231..2081 | 97.1 | 96.2 | 0.015 | 0.095 |
| QtsA-20987 | known | 3090 | 561 AA: 636..2321 | 97.9 | 97.7 | 0.011 | 0.053 |
| QtsA-21536 | novel | 1409 | 350 AA: 134..1186 | 92.3 | 95.7 | 0.042 | 0.052 |
| QtsA-21565 | novel | 1810 | 367 AA: 276..1379 | 94.2 | 95.6 | 0.028 | 0.093 |
| QtsA-21583 | novel | 2640 | 761 AA: 260..2545 | 90.4 | 95.0 | 0.046 | 0.060 |
| QtsA-21585 | known | 2252 | 202 AA: 38..646 | 91.8 | 94.5 | 0.045 | 0.085 |

a) Cynomolgus monkey cDNA sequence that was used to deduce the putative human transcribed sequence. b) Classification of the human transcribed sequence in the Ensembl human database. c) Ka: non-synonymous substitution rate per non-synonymous site between human and cynomolgus monkey genes. d) Ks: synonymous substitution rate per synonymous site between human and cynomolgus monkey genes.

In order to investigate the expression pattern of the testicular full-sequenced cDNAs, we designed a DNA microarray containing approximately 400 spots of cDNA, comprising full-sequenced samples and controls. Fifty clones carrying common repetitive elements and 12 clones deduced to be chimeric were excluded from further analysis, although they were spotted on the slides. Ultimately, 332 spots were used for quantification of gene expression. First, we investigated whether the putative genes were transcribed in a ubiquitous manner or had a tissue-specific pattern of expression, especially in the testis. RNA pools from the testis of the cynomolgus monkey and a mixture of equal amounts of RNA from 10 other cynomolgus monkey tissues (brain, heart, skin, liver, spleen, kidney, pancreas, stomach, small intestine, and large intestine) were independently labeled and co-hybridized to the DNA microarray.
When the signal intensity of the testicular probe was greater than that of the mixed probe, the gene was concluded to be over-expressed in the testis, or to be transcribed in the testis and a few other tissues, but not ubiquitously. When the intensities of both signals were equal, the gene was concluded to be expressed ubiquitously. When the signal intensity of the testicular probe was lower than that of the mixed probe, the gene was concluded to be transcribed mainly in non-testicular tissues. We calculated the ratio of the testicular probe intensity to the mixed probe intensity and normalized it to the ratio obtained for the beta-actin cDNA spot. A total of 316 (95%) of the 332 effective spots showed an intense and reproducible signal with the testicular RNA probe, the mixed RNA probe, or both. The signals of 75 spots were fourfold or more intense with the testicular probe, and the human homologues of 15 of these 75 cDNAs had been registered in the RefSeq database (Table 3). Eight of the 15 RefSeq genes were reported to be expressed exclusively or dominantly in the human testis in the literature and the databases: TSGA10, expressed during spermatogenesis [18]; ACTL7B, an intronless gene strongly expressed in the testis and weakly expressed in the prostate [19]; SOX30, an Sry-related transcription factor specifically expressed in the testis [20]; and five NYD-SP genes, functionally anonymous but highly expressed in the testis in other DNA microarray experiments [21]. The other seven genes encoded hypothetical proteins and were deduced from cDNA sequence evidence alone. Four of these cDNA clones were derived from human testis, and the other three from brain, placenta, or teratocarcinoma (Table 3). The results indicated that the remaining 60 clones, which have no human RefSeq homologues, are expressed exclusively or dominantly in the cynomolgus monkey testis.
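The interpretation rules above can be written as a small classifier. This is a sketch under stated assumptions rather than the authors' procedure: the fourfold cut-off comes from this section and the 1.5-fold window for ubiquitous expression from the Discussion, while the function names, the symmetric lower bound, and the "intermediate" label are hypothetical additions for illustration.

```python
def normalized_ratio(testis, mixed, actin_testis, actin_mixed):
    """Testis/mixed intensity ratio divided by the same ratio for the beta-actin spot."""
    return (testis / mixed) / (actin_testis / actin_mixed)


def classify(ratio, dominant_fold=4.0, ubiquitous_fold=1.5):
    """Assign an expression category to a normalized testis/mixed ratio."""
    if ratio >= dominant_fold:
        return "testis-dominant"
    if ratio < 1.0 / ubiquitous_fold:        # assumed symmetric lower bound
        return "mainly non-testicular"
    if ratio <= ubiquitous_fold:
        return "ubiquitous"
    return "intermediate"                    # hypothetical label for 1.5- to 4-fold


# Example with made-up intensities: a spot 8.7 times brighter with the testicular probe.
ratio = normalized_ratio(testis=870.0, mixed=100.0, actin_testis=200.0, actin_mixed=200.0)
print(round(ratio, 1), classify(ratio))
```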
Table 3

The list of genes that were highly expressed in the testis and had human RefSeq homologues

| Macaca clone | Human RefSeq | Description | Ratio (a) | Expression (Reference) |
|---|---|---|---|---|
| QtsA-10833 | NM_032559 | kinesin protein (LOC84643) | 8.7 | derived from testis |
| QtsA-13647 | NM_025244 | testis specific, 10 (TSGA10) | 8.6 | testis specific [18] |
| QtsA-16118 | NM_006686 | actin-like 7B (ACTL7B) | 8.5 | testis and prostate [19] |
| QtsA-14409 | NM_018418 | hypothetical protein (HSD-3.1) | 7.8 | derived from testis |
| QtsA-13567 | NM_033122 | testis development protein NYD-SP26 (NYD-SP26) | 7.5 | testis |
| QtsA-14035 | NM_033123 | testis-development related NYD-SP27 (NYD-SP27) | 7.2 | testis |
| QtsA-11842 | NM_032130 | hypothetical protein DKFZp434J0113 (DKFZP434J0113) | 6.9 | derived from testis |
| QtsA-15256 | NM_032126 | hypothetical protein DKFZp564J047 (DKFZP564J047) | 6.6 | derived from brain |
| QtsA-14560 | NM_032599 | testes development-related NYD-SP18 (NYD-SP18) | 6.6 | testis |
| QtsA-12850 | NM_019038 | hypothetical protein (FLJ11045) | 6.4 | derived from placenta |
| QtsA-15384 | NM_030672 | hypothetical protein FLJ10312 (FLJ10312) | 5.1 | derived from teratocarcinoma |
| QtsA-10245 | NM_007017 | SRY (sex determining region Y)-box 30 (SOX30) | 5.0 | testis specific [20] |
| QtsA-18012 | NM_032596 | testes development-related NYD-SP22 (NYD-SP22) | 5.0 | testis |
| QtsA-14618 | NM_032598 | testes development-related NYD-SP20 (NYD-SP20) | 4.7 | testis |
| QtsA-19865 | NM_033364 | AAT1-alpha (AAT1) | 4.5 | derived from testis |
| QtsA-16015 | NM_006845 | kinesin-like 6 (mitotic centromere-associated kinesin) (KNSL6) | 4.1 | thymus and testis [21] |

a) The ratio of signal intensity of testicular probe to mixed probe.


Discussion

In this study we analyzed a cDNA library derived from a cynomolgus monkey testis. Although most of the human genome sequence has been determined, many unidentified genes remain, and a complete catalog of protein-coding genes is desired. Sequence similarity search of our full-sequenced cDNAs against the human draft genome sequence resulted in the assignment of 347 cDNA sequences to at least one human chromosome, indicating that most genes in the cynomolgus monkey have homologous regions in the human genome. The primary objective of this analysis was to find genes that have not been experimentally identified in the human genome. Among the 302 cDNAs carrying ORFs of sufficient length (≥ 300 bp), we succeeded in identifying 89 putative genes that have no counterparts in the Ensembl 29,076-gene set. Another 89 genes that had highly similar sequences to Ensembl 'novel' genes were discovered in our full-sequenced cDNAs. The latter 89 genes strongly support the existence of the predicted 'novel' cDNA sequences, whose annotation is relatively less reliable. Many genes expressed in the testis cause male infertility in humans [22]. Since it is estimated that up to 11% of all genes in the fruit fly might lead to male sterility [23], in view of the complexity of the human genome, at least 4000 genes might be responsible for male infertility in humans, and there must be many as yet unidentified genes that are related to male fertility. Functional analysis of the 75 genes found to be highly expressed in the cynomolgus monkey testis may contribute to medical research on male infertility. DNA microarray analysis is an appropriate method not only for annotating the expression patterns of our full-length cDNAs, but also for demonstrating that our strategy for finding novel genes works well.
In the first set of DNA microarray experiments, among the 199 genes that displayed twofold or higher expression with the testicular probe than with the mixed probe, 67 (34%) were classified as Ensembl 'known' genes, whereas among the 45 genes that showed a ubiquitous pattern of expression (signal intensities within 1.5 fold of each other with both probes), 23 (51%) were classified as Ensembl 'known' genes. This finding indicated that transcripts overexpressed in the testis are significantly more likely than ubiquitous transcripts to be derived from unidentified novel genes (p = 0.028, Fisher's exact test). Evolutionary inspection is also important, especially for gene analysis of the testis, because genetic diversity in the male reproductive system is quite large, even among closely related species. Many reproductive proteins have evolved rapidly at the molecular level [24,25]. We compared the 117 sequences of cynomolgus monkey cDNA and the corresponding human genome sequences described above, and use of the cDNA microarray revealed that 79 of the 117 cDNAs were overexpressed in the testis (≥ 2.0 fold in testis) and 15 were ubiquitously expressed (within 1.5 fold of each other). We estimated the sequence divergence of putative coding sequences between humans and cynomolgus monkeys and found that the average non-synonymous nucleotide divergence of testis-dominantly expressed genes (0.024) was significantly greater than that of ubiquitously expressed genes (0.012; p value < 0.01), whereas divergence at synonymous sites did not differ significantly (testis-dominant genes: 0.54, ubiquitous genes: 0.51). This finding is also highly consistent with a report that the proteins of genes expressed in a tissue-specific manner evolve on average twice as fast as those that are ubiquitously expressed [26].
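The comparison of 'known' versus other genes between the two expression classes is a 2 × 2 contingency test. The sketch below implements a two-sided Fisher's exact test from first principles using only the standard library and applies it to the counts given in the text (67 of 199 and 23 of 45). The text does not state whether the reported p = 0.028 is one- or two-sided, so the value printed here may differ somewhat from it; the function name and the two-sided convention (summing all tables no more probable than the observed one) are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Rows: testis-overexpressed (199 genes) and ubiquitous (45 genes).
# Columns: Ensembl 'known' and all others.
p_value = fisher_exact_two_sided(67, 199 - 67, 23, 45 - 23)
print(f"two-sided p = {p_value:.3f}")
```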
Although a number of full and partial sequences of human genes have been deposited in the public databases, many of the genes in the human genome have yet to be discovered experimentally. Most of the undiscovered genes may be expressed very rarely, or their expression may be restricted to certain tissues and developmental stages. The complete human genome will be available in 2003, and a search of the entire genome for novel genes by oligonucleotide-based microarray analysis is planned, i.e. an attempt to predict all candidate human genes from the human genome and experimentally confirm the transcript status of the predicted regions, as well as of the entire region, by using an oligonucleotide-based microarray [27,28]. However, it is difficult to overcome the problem of rarely or transiently expressed genes for practical reasons. The transcriptional and genomic approaches will compensate for each other's blind spots, and many tissues, developmental stages, and other organisms should become experimental subjects for finding undiscovered genes to complete the human gene catalog.

Materials and Methods

cDNA library from cynomolgus monkey testis

A 15-year-old male cynomolgus monkey was used as the source of the testis, and two female cynomolgus monkeys, 1 and 21 years old, were used for the other RNA samples. The monkeys were cared for and handled according to guidelines established by the Institutional Animal Care and Use Committee of the National Institute of Infectious Diseases (NIID) of Japan and the standard operating procedures for monkeys at the Tsukuba Primate Center, NIID, Tsukuba, Ibaraki, Japan. Tissue excision was carried out in accordance with all guidelines in the Laboratory Biosafety Manual, World Health Organization, at the P3 facility for monkeys of the Tsukuba Primate Center, NIID. Immediately after collection, the tissues were frozen in liquid nitrogen and used for RNA extraction. Oligo-capped cDNA libraries were constructed according to the method described previously [29,30].

DNA sequencing

The 5'-end sequences of the clones were determined using an ABI 3700 sequencer (Applied Biosystems), and categorized using DYNACLUST (DYNACOM), based on a BLAST search against the GenBank database. The entire sequences of clones were determined by the primer walking method. Cycle sequencing was performed with an ABI PRISM BigDye Terminator Sequencing kit (Applied Biosystems) according to the manufacturer's instructions.

Computational analyses

The Sim4 program was used to align each cynomolgus monkey cDNA sequence with the human genome sequence [31]. Whenever Sim4 failed to align a cynomolgus monkey cDNA sequence with the human genome DNA sequence, the sequences were compared with the BLAST program, and the alignment was corrected manually. In the intron sequences, GT at the 5' splice site and AG at the 3' site (GT-AG pattern), and the GC-AG pattern, were regarded as conserved splice sites, and the corresponding human genome regions were concatenated to construct a hypothetical human transcribed sequence. The 117 cynomolgus monkey cDNA sequences and the putative human transcribed sequences were aligned by using the ClustalW program [32]. Synonymous substitution per synonymous site and non-synonymous substitution per non-synonymous site were estimated by the method of Li [33].
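The construction of a hypothetical human transcribed sequence described above can be sketched as follows. This is an illustration only, assuming that the exon coordinates (for example from a Sim4-style alignment) are already available as 0-based, half-open intervals on the forward strand; the function name and the error handling are hypothetical, and reverse-strand genes would need reverse complementation, which is omitted.

```python
# Intron boundary patterns treated as conserved splice sites in the text.
CONSERVED_SPLICE_SITES = {("GT", "AG"), ("GC", "AG")}


def build_transcript(genome, exons):
    """Concatenate exon regions after checking every intron for a conserved pattern.

    genome -- genomic DNA string (forward strand)
    exons  -- list of (start, end) pairs, 0-based half-open, sorted by position
    """
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        donor = genome[end1:end1 + 2].upper()          # first two bases of the intron
        acceptor = genome[start2 - 2:start2].upper()   # last two bases of the intron
        if (donor, acceptor) not in CONSERVED_SPLICE_SITES:
            raise ValueError(f"non-conserved splice site {donor}-{acceptor} at intron {end1}..{start2}")
    return "".join(genome[start:end] for start, end in exons)


# Toy example: two exons separated by a GT...AG intron.
toy_genome = "ATGAAA" + "GTCCCCAG" + "TTTTAA"
print(build_transcript(toy_genome, [(0, 6), (14, 20)]))
```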

cDNA microarray

An aliquot of the same DNA preparation used in the 5'-end-sequencing reactions provided material for the PCRs. Inserts were amplified by PCR using 5'-CTTCTGCTCTAAAAGCTGCG-3' as a forward primer and 5'-CGACCTGCAGCTCGAGCACA-3' as a reverse primer, in a volume of 100 μl. Successful amplification was confirmed by agarose gel electrophoresis. When the first PCR failed to yield enough product, the first PCR products were amplified again. Four hundred cDNA clones were amplified, and samples of approximately 300 μg/ml DNA in 2 × Solution-T reagent (Takara Bio) were printed in duplicate on glass slides with a GMS 417 arrayer (Genetic MicroSystems). The testicular RNA was obtained from a single 15-year-old male cynomolgus monkey, and the other RNA was a mixture of RNA obtained from 10 tissues (brain, heart, skin, liver, spleen, kidney, pancreas, stomach, small intestine, and large intestine) of two cynomolgus monkeys, a 1-year-old female and a 21-year-old female. RNA was isolated with Trizol (Life Technologies) and purified with Oligo-Tex (Takara Bio). The two mRNA probes (0.7 μg each) were labeled with Cy3- and Cy5-deoxynucleotide (Pharmacia) and co-hybridized to the DNA spots. The mixed RNA probe contained 0.07 μg of RNA from each tissue. After the hybridization and washing procedure, slides were scanned with ScanArray (GSI Inc.). Several experiments were conducted, and the duplicate spots on the slides that gave the most intense signals were processed to measure the transcriptional status. When the Cy3/Cy5 signal ratio of a spot differed by more than 1.5-fold from that of its duplicate, the spots were excluded from the subsequent analyses. Finally, the ratio of the signal intensities of Cy3 (the testicular probe) to Cy5 (the mixed probe) was obtained from the average value of the duplicate spots and normalized by dividing by the corresponding ratio of the beta-actin spots, which served as a control.
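The filtering and normalization steps at the end of this procedure can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the function names are hypothetical, and the sketch assumes that the "average value of duplicated spots" means the mean of the two per-spot Cy3/Cy5 ratios (averaging intensities before taking the ratio would be another reading of the text).

```python
from statistics import mean

# Illustrative sketch only; function names and the averaging convention
# (mean of per-spot ratios) are assumptions, not taken from the paper.


def duplicates_agree(ratio_a, ratio_b, max_fold=1.5):
    """True if the two Cy3/Cy5 ratios differ by no more than max_fold."""
    return max(ratio_a, ratio_b) / min(ratio_a, ratio_b) <= max_fold


def normalized_ratio(duplicate_spots, control_spots, max_fold=1.5):
    """Normalized testis/mixed-tissue expression ratio for one cDNA.

    duplicate_spots, control_spots: two (cy3, cy5) intensity pairs each,
    where the control pairs are the beta-actin spots.
    Returns None when the duplicates disagree by more than max_fold.
    """
    ratio_a, ratio_b = (cy3 / cy5 for cy3, cy5 in duplicate_spots)
    if not duplicates_agree(ratio_a, ratio_b, max_fold):
        return None  # spot pair excluded from further analysis
    control_ratio = mean(cy3 / cy5 for cy3, cy5 in control_spots)
    return mean([ratio_a, ratio_b]) / control_ratio
```

For example, duplicate spots with intensities (800, 100) and (1000, 100) have ratios of 8 and 10, which agree within 1.5-fold; with beta-actin ratios of 2, the normalized value is 9 / 2 = 4.5.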

List of abbreviations

EST: expressed sequence tag. CDS: coding sequence. ORF: open reading frame.

Authors' contributions

NO was involved in the design of the study, construction of the cDNA library, in silico analysis, expression analysis with the DNA microarray and preparation of the manuscript. M. Hida and SS performed construction of the cDNA libraries and 5'-end sequence analysis. JK participated in the design and implementation of the study and contributed to writing and revising the manuscript. RT and M. Hirata participated in the sequencing of cDNAs and in silico analysis of cDNA sequences. YS and M. Hirai participated in the design and implementation of the microarray study, and obtained funding for the study. KT contributed to obtaining the tissues for cDNA libraries and total RNA from cynomolgus monkeys. KH was involved in the design and implementation of the study and in writing and editing the manuscript, and obtained funding for the study. All authors read and approved the final manuscript.
  31 in total

1.  Large-scale transcriptional activity in chromosomes 21 and 22.

Authors:  Philipp Kapranov; Simon E Cawley; Jorg Drenkow; Stefan Bekiranov; Robert L Strausberg; Stephen P A Fodor; Thomas R Gingeras
Journal:  Science       Date:  2002-05-03       Impact factor: 47.728

Review 2.  Biomedical applications and studies of molecular evolution: a proposal for a primate genomic library resource.

Authors:  Evan E Eichler; Pieter J DeJong
Journal:  Genome Res       Date:  2002-05       Impact factor: 9.043

3.  Identification of a novel Sry-related gene and its germ cell-specific expression.

Authors:  E Osaki; Y Nishina; J Inazawa; N G Copeland; D J Gilbert; N A Jenkins; M Ohsugi; T Tezuka; M Yoshida; K Semba
Journal:  Nucleic Acids Res       Date:  1999-06-15       Impact factor: 16.971

4.  A computer program for aligning a cDNA sequence with a genomic DNA sequence.

Authors:  L Florea; G Hartzell; Z Zhang; G M Rubin; W Miller
Journal:  Genome Res       Date:  1998-09       Impact factor: 9.043

Review 5.  Male infertility and the genetics of spermatogenesis.

Authors:  M Okabe; M Ikawa; J Ashkenas
Journal:  Am J Hum Genet       Date:  1998-06       Impact factor: 11.025

6.  Generation and analysis of 280,000 human expressed sequence tags.

Authors:  L D Hillier; G Lennon; M Becker; M F Bonaldo; B Chiapelli; S Chissoe; N Dietrich; T DuBuque; A Favello; W Gish; M Hawkins; M Hultman; T Kucaba; M Lacy; M Le; N Le; E Mardis; B Moore; M Morris; J Parsons; C Prange; L Rifkin; T Rohlfing; K Schellenberg; M Bento Soares; F Tan; J Thierry-Meg; E Trevaskis; K Underwood; P Wohldman; R Waterston; R Wilson; M Marra
Journal:  Genome Res       Date:  1996-09       Impact factor: 9.043

Review 7.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

8.  Prediction of complete gene structures in human genomic DNA.

Authors:  C Burge; S Karlin
Journal:  J Mol Biol       Date:  1997-04-25       Impact factor: 5.469

9.  Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence.

Authors:  M D Adams; A R Kerlavage; R D Fleischmann; R A Fuldner; C J Bult; N H Lee; E F Kirkness; K G Weinstock; J D Gocayne; O White
Journal:  Nature       Date:  1995-09-28       Impact factor: 49.962

10.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors:  J D Thompson; D G Higgins; T J Gibson
Journal:  Nucleic Acids Res       Date:  1994-11-11       Impact factor: 16.971

  6 in total

1.  Full-length cDNA sequences from Rhesus monkey placenta tissue: analysis and utility for comparative mapping.

Authors:  Dae-Soo Kim; Jae-Won Huh; Young-Hyun Kim; Sang-Je Park; Sang-Rae Lee; Kyu-Tae Chang
Journal:  BMC Genomics       Date:  2010-07-12       Impact factor: 3.969

2.  Large-scale transcriptome sequencing and gene analyses in the crab-eating macaque (Macaca fascicularis) for biomedical research.

Authors:  Jae-Won Huh; Young-Hyun Kim; Sang-Je Park; Dae-Soo Kim; Sang-Rae Lee; Kyoung-Min Kim; Kang-Jin Jeong; Ji-Su Kim; Bong-Seok Song; Bo-Woong Sim; Sun-Uk Kim; Sang-Hyun Kim; Kyu-Tae Chang
Journal:  BMC Genomics       Date:  2012-05-04       Impact factor: 3.969

3.  Collection of Macaca fascicularis cDNAs derived from bone marrow, kidney, liver, pancreas, spleen, and thymus.

Authors:  Naoki Osada; Makoto Hirata; Reiko Tanuma; Yutaka Suzuki; Sumio Sugano; Keiji Terao; Jun Kusuda; Yosuke Kameoka; Katsuyuki Hashimoto; Ichiro Takahashi
Journal:  BMC Res Notes       Date:  2009-09-29

4.  Selection of new appropriate reference genes for RT-qPCR analysis via transcriptome sequencing of cynomolgus monkeys (Macaca fascicularis).

Authors:  Sang-Je Park; Young-Hyun Kim; Jae-Won Huh; Sang-Rae Lee; Sang-Hyun Kim; Sun-Uk Kim; Ji-Su Kim; Kang-Jin Jeong; Kyoung-Min Kim; Heui-Soo Kim; Kyu-Tae Chang
Journal:  PLoS One       Date:  2013-04-15       Impact factor: 3.240

5.  Large-scale analysis of Macaca fascicularis transcripts and inference of genetic divergence between M. fascicularis and M. mulatta.

Authors:  Naoki Osada; Katsuyuki Hashimoto; Yosuke Kameoka; Makoto Hirata; Reiko Tanuma; Yasuhiro Uno; Itsuro Inoue; Munetomo Hida; Yutaka Suzuki; Sumio Sugano; Keiji Terao; Jun Kusuda; Ichiro Takahashi
Journal:  BMC Genomics       Date:  2008-02-24       Impact factor: 3.969

6.  Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs.

Authors:  Sreedhar Oduru; Janee L Campbell; SriTulasi Karri; William J Hendry; Shafiq A Khan; Simon C Williams
Journal:  BMC Genomics       Date:  2003-06-03       Impact factor: 3.969
