
hORFeome v3.1: a resource of human open reading frames representing over 10,000 human genes.

Philippe Lamesch, Ning Li, Stuart Milstein, Changyu Fan, Tong Hao, Gabor Szabo, Zhenjun Hu, Kavitha Venkatesan, Graeme Bethel, Paul Martin, Jane Rogers, Stephanie Lawlor, Stuart McLaren, Amélie Dricot, Heather Borick, Michael E Cusick, Jean Vandenhaute, Ian Dunham, David E Hill, Marc Vidal.

Abstract

Complete sets of cloned protein-encoding open reading frames (ORFs), or ORFeomes, are essential tools for large-scale proteomics and systems biology studies. Here we describe human ORFeome version 3.1 (hORFeome v3.1), currently the largest publicly available resource of full-length human ORFs (available at ). Generated by Gateway recombinational cloning, this collection contains 12,212 ORFs, representing 10,214 human genes, and corresponds to a 51% expansion of the original hORFeome v1.1. An online human ORFeome database, hORFDB, was built and serves as the central repository for all cloned human ORFs (http://horfdb.dfci.harvard.edu). This expansion of the original ORFeome resource greatly increases the potential experimental search space for large-scale proteomics studies, which will lead to the generation of more comprehensive datasets.

Substances: DNA, Complementary

Year: 2007 PMID: 17207965 PMCID: PMC4647941 DOI: 10.1016/j.ygeno.2006.11.012

Source DB: PubMed Journal: Genomics ISSN: 0888-7543 Impact factor: 5.736

With the availability of complete genome sequences for many organisms [1], [2], [3], [4], [5], [6], [7], it is now possible to begin systematically to identify all functional genomic elements. Of particular interest are the elements of the genes that encode proteins, called open reading frames (ORFs). Full-length cDNA collections, which contain 5′ and/or 3′ UTRs in addition to the ORF, have been generated for several organisms, including Arabidopsis thaliana [8], Drosophila melanogaster [9], and Homo sapiens [10], [11]. While these collections are of immense value, they do not serve directly as ORF resources, but rather as collections of potential ORFs that must first be subcloned without UTRs before subsequent analysis of the encoded proteins can be performed [12]. One such example is the Mammalian Gene Collection (MGC) [10], [13]. This extensive collection of cDNAs was cloned into a vector that is not immediately useful for downstream functional experimentation. Ideally, clones should be archived in a convenient vector that would allow for high-throughput transfer of ORFs into a variety of different expression vectors, such as Gateway [12], [14] or any other recombinational cloning system [15], [16], [17]. In an effort to generate usable ORF collections, large-scale cloning projects, with the goal of cloning all predicted ORFs into flexible, recombinational vectors, have been described for a few model organisms including Brucella melitensis [18], Saccharomyces cerevisiae [19], and Caenorhabditis elegans [20], [21], [22]. These ORFeome resources [12] represent essential tools for large-scale protein characterization and therefore serve as a necessary bridge between genome annotation and systems biology. Previously, we have described human ORFeome v1.1 [23], in which we used cDNAs from the MGC as templates to clone more than 8000 full-length ORFs. 
The utility of the resource was exemplified by its use in the generation of a large-scale human protein–protein interaction or “interactome” map, in which 6.4 × 10⁷ (8000 × (8000 + 1)) possible pair-wise combinations were tested for yeast two-hybrid (Y2H) interactions, resulting in the identification of 2754 Y2H interactions between the products of 1549 ORFs [24]. Since linear increases in the number of ORFs in the ORFeome collection result in quadratic expansions in the biological search space that can be tested, the expansion of the human ORFeome will play an essential role in enhancing this interactome mapping effort as well as other systematic ORF studies. For example, using a matrix-based Y2H approach (testing all pair-wise combinations), an increase of 4000 ORFs (from 8000 to 12,000) would allow for the testing of 14.4 × 10⁷ combinations, corresponding to an additional 8 × 10⁷ pair-wise combinations and a 125% increase in the search space. Likewise, a more complete ORFeome resource will yield more comprehensive datasets for all systematic studies of ORF function from protein arrays [25] to high-content screening [26]. One of the main strategies of systems biology, the integration of genome-wide data generated by multiple orthogonal proteomic techniques [27], has been hampered by incomplete datasets. As the complete human ORFeome becomes one of the standard sets of clones used in reverse proteomic studies, the number of analyzed proteins in large-scale experiments should gradually improve, facilitating the integration of these data and ultimately leading to a better understanding of the properties of biological systems. Here, we describe human ORFeome version 3.1 (hORFeome v3.1), a resource of 12,212 distinct ORFs, and introduce an improved human ORFeome database.
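The search-space figures quoted above can be reproduced with a few lines of arithmetic. The sketch below uses the same n × (n + 1) count given in the text; the function name is ours and purely illustrative.

```python
def pairwise_search_space(n_orfs: int) -> int:
    """Pair-wise combinations counted as n * (n + 1), as in the text."""
    return n_orfs * (n_orfs + 1)


before = pairwise_search_space(8_000)    # 64,008,000, i.e. about 6.4e7
after = pairwise_search_space(12_000)    # 144,012,000, i.e. about 14.4e7
gain = after - before                    # additional combinations to test
percent_increase = 100 * gain / before   # relative growth of the search space

print(f"{before:.2e} -> {after:.2e} (+{gain:.1e}, {percent_increase:.0f}% larger)")
```

A 50% increase in the number of ORFs thus more than doubles the number of combinations that a matrix-based screen can test.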

Results

Defining hORFeome v3.1

We define an ORF as the protein-coding sequence of a gene from its start codon to its stop codon, excluding the 5′ and 3′ UTRs. A major milestone of the human ORFeome project will be the generation of the complete ORFeome, defined as the collection of protein-encoding ORFs representing at least one splicing isoform for every gene predicted in the human genome. Subsequently this resource will include splice variants and polymorphic variants for each gene. In the first human ORFeome project (hORFeome v1.1) we used directed PCR on the available set of MGC cDNAs and successfully cloned 8107 ORFs into the Gateway entry vector. In our second iteration of the human ORFeome effort, referred to here as the human ORFeome 3 project, we have attempted to clone ORFs from an additional 6027 cDNAs that are not part of v1.1. These cDNAs fall into two classes: 4806 newly available MGC clones, mostly obtained by random cDNA library screening, and 1221 MGC clones that failed to clone during the first human ORFeome project. All ORFs were passed through a semiautomated pipeline (Fig. 1) that allowed for efficient cloning and data analysis. First, clones shorter than 100 nucleotides (a threshold three times smaller than the convention of 300 nucleotides), and clones for which no complete coding sequence was available from NCBI, were eliminated from further analysis. Isoforms or polymorphic clones of the same gene were processed individually and treated as separate ORFs. ORFs that we failed to clone in the first round were attempted a second time and, if successfully cloned, were consolidated with the ORFs successfully cloned in hORFeome v1.1. This consolidated ORFeome collection is called human ORFeome version 3.1 (Fig. 1). Even-numbered version names have been reserved for ORFeome collections that contain single isolated wild-type clones for each ORF [21].
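The computational filter at the head of the pipeline can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the record type, field names, and function name are hypothetical, and only the 100-nucleotide threshold comes from the text.

```python
from dataclasses import dataclass

MIN_ORF_LENGTH = 100  # nucleotides; the threshold stated in the text


@dataclass(frozen=True)
class CdnaRecord:
    clone_id: str         # hypothetical identifier, e.g. an MGC clone name
    coding_sequence: str  # ORF from start to stop codon; "" if no complete CDS


def filter_candidates(records):
    """Drop records lacking a complete CDS, short ORFs, and duplicate ORFs."""
    seen = set()
    kept = []
    for rec in records:
        cds = rec.coding_sequence.upper()
        if not cds:                    # no complete coding sequence available
            continue
        if len(cds) < MIN_ORF_LENGTH:  # shorter than the length threshold
            continue
        if cds in seen:                # identical ORF already retained
            continue
        seen.add(cds)
        kept.append(rec)
    return kept
```

Because redundancy is judged on the coding sequence alone, isoforms and polymorphic clones of the same gene differ in sequence and are therefore retained as separate ORFs, in line with the description above.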

Fig. 1

Automated human ORFeome pipeline. (A) A filter computationally removed ORFs, extracted from MGC cDNAs, that were not full-length; short ORFs (< 100 nucleotides); and redundantly cloned ORFs. Isoforms and SNP variants of each gene were retained and treated as individual clones. (B) Clones were PCR amplified, Gateway cloned, and sequenced at the 5′ end using universal primers. (C) The resulting ORF sequence tags (OSTs) were aligned to the ORFeome database containing all attempted ORF sequences. Clone attempts that produced a PCR band but whose 5′ OST did not correspond to the expected cDNA underwent a second round of cloning. Successfully cloned ORFs from hORFeome v1 and v3 were combined to form hORFeome v3.1. (D) To investigate the quality of this resource, we picked isolated colonies for 564 ORFs and sequenced them at their 5′ and 3′ ends. In the upcoming ORFeome version 4 project, clones without mutations in their end sequences will undergo full-length sequencing to generate a resource of wild-type clones for each ORF in the hORFeome v3.1.

ORF sequence tag (OST) analysis

Following BP recombinational cloning and transformation, ORFs were sequenced from the 5′ end to confirm their identity. Sequencing reads were truncated after the first 400 nucleotides (or fewer if the sequence read was short or of low quality before the 400th nucleotide) and used as queries for BLAST alignment [22] against an internal database containing sequences for all of the ORFs we attempted to clone. ORFs whose 5′ OST aligned to the predicted sequence and contained the predicted start codon were scored as successfully cloned. Following two rounds of cloning, we successfully isolated 4111 ORFs. Of these, 659 corresponded to ORFs that we failed to clone in the first version of the human ORFeome project [23], representing a 54% recovery (659/1220). Since the primers used here were identical to those used in our first attempt [23], the initial cloning failures were likely due to technical errors. As previously observed, the success rate correlated with the size of the ORFs, with small ORFs showing a higher success rate than larger ORFs (see Supplementary Fig. 1) [22]. In total, hORFeome v3.1 contains 12,212 ORFs, corresponding to 10,214 genes, representing a 51% expansion of the original human ORFeome resource. The ORFs range in size from 102 to 5499 bp and include 650 polymorphic ORFs and 1160 ORFs that correspond to multiple splice forms.
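The scoring rule described above can be illustrated with a simplified sketch. It is not the BLAST-based procedure itself: exact seed matching stands in for the alignment step, and the quality cutoff, seed length, and mismatch tolerance are hypothetical parameters chosen only for illustration.

```python
MAX_OST_LENGTH = 400  # nucleotides kept from each 5' read, as in the text


def truncate_read(read, qualities, min_quality=20):
    """Keep at most 400 nt, stopping early at the first low-quality base.

    The per-base cutoff (min_quality) is an assumption, not a value from the text.
    """
    end = min(len(read), len(qualities), MAX_OST_LENGTH)
    for i in range(end):
        if qualities[i] < min_quality:
            return read[:i]
    return read[:end]


def looks_cloned(ost, expected_orf, seed=12, max_mismatches=2):
    """Stand-in for the alignment step: find the ORF start in the OST and compare."""
    if not expected_orf.startswith("ATG"):
        return False
    start = ost.find(expected_orf[:seed])  # position of the predicted start codon
    if start < 0:
        return False
    overlap = ost[start:]
    reference = expected_orf[:len(overlap)]
    mismatches = sum(a != b for a, b in zip(overlap, reference))
    return mismatches <= max_mismatches
```

A read that begins in vector sequence and then runs into the expected ORF from its ATG onward passes the check, whereas a read from an empty vector or an unrelated insert does not.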

Quality assessment of hORFeome v3.1

hORFeome v3.1 is a collection of clones that were generated by PCR from unique, individual cDNA templates. Among the PCR products from individual templates are clones with mutations that originate during primer synthesis and clones that acquired mutations during PCR amplification. Following recombination, clones can also contain empty Gateway donor vector in which the toxic ccdB gene, which normally prevents growth of the empty vector, is no longer functional due to mutation [12], [14]. Since our cloning strategy generates minipools rather than individual isolated clones for each ORF, we did extensive sequence analysis on a set of individual isolated clones to assess the overall quality of hORFeome v3.1. A thorough investigation of the quality of hORFeome v3.1 was carried out by isolating single colonies from a large number of minipools and end-sequencing them from the 5′ and 3′ ends. Five hundred sixty-four ORFs (six plates) were chosen at random from hORFeome v3.1 (three plates previously generated during the human ORFeome project 1 and three plates of newly cloned ORFs) and six single isolated colonies were picked from each well. These 3384 clones (6 plates × 94 wells × 6 colonies) were end-sequenced using two different pairs of sequencing primers, corresponding to two forward and two reverse oligonucleotides that anneal to distinct vector sequences. In total 13,536 sequence reads were generated (3384 clones × 2 pairs of primers × 2 reads) and only high-quality sequence reads (at least 100 nucleotides with a PHRED score of >19) were retained for further analysis. We expected to see mutations that arise from two sources: mutations in the primer sequence likely originated during primer synthesis, while those that were found in the ORF were most likely due to PCR-induced errors. If this were the case we should find different rates of mutation depending on the source of mutation. 
We identified mutations in 9.8% of the primers from ORFeome project 1 and in 2.6% of the primers from ORFeome project 3. This difference in primer quality is most likely due to a less error-prone primer synthesis protocol used for ORFeome project 3. The analysis of 4,068,518 nt of ORF sequence (excluding primer sequence) revealed 316 mutations that were distributed among 275 sequences (Table 1). The resulting misincorporation rate using KOD polymerase (Novagen) [28] amounts to one nucleotide substitution every 12,875 bp. This mutation rate is higher than previously reported in hORFeome v1.1 (one mutation every ∼ 35,000 bp) using the same polymerase, but that analysis was limited to only 70,000 nt [23]. Nevertheless, this rate is substantially lower than the mutation rate observed in the C. elegans ORFeome (1/1500 bp), which was generated using a high-fidelity Taq DNA polymerase [21]. Considering the much larger dataset analyzed here (4 × 10⁶ nt in v3.1 vs 7 × 10⁴ nt in v1.1), this study provides the most extensive quality assessment of any large-scale ORFeome cloning project to date.
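The clone and read counts and the misincorporation rate reported above follow directly from the stated numbers, as the short calculation below shows. Variable and function names are illustrative only.

```python
PLATES, WELLS_PER_PLATE, COLONIES_PER_WELL = 6, 94, 6

clones = PLATES * WELLS_PER_PLATE * COLONIES_PER_WELL  # 3384 isolated clones
reads = clones * 2 * 2                                 # 2 primer pairs x 2 reads each = 13,536


def bases_per_mutation(analyzed_nt: int, mutations: int) -> float:
    """Average number of analyzed nucleotides per observed substitution."""
    return analyzed_nt / mutations


orf_rate = bases_per_mutation(4_068_518, 316)  # about 12,875 bp per substitution
fold_vs_worm = orf_rate / 1500                 # relative to the 1/1500 bp C. elegans figure

print(f"{clones} clones, {reads} reads, 1 substitution per {orf_rate:,.0f} bp")
```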

Table 1

Summary of the analysis of the nucleotide substitution rate in ORF and primer sequences in human ORFeome v3.1

	No. of analyzed nucleotides	No. of mutations	1 mutation every x nucleotides	No. of analyzed sequences	No. of mutated sequences	Percentage of mutated sequences
ORF sequences	4 × 10⁶	316	12,875	9400	275	2.0
Primer sequences	17 × 10⁴	588	293	9118	557	6.1

hORFeome v3.1 properties

Distribution of ORFs on chromosomes

Most MGC clones were generated by screening a diverse set of cDNA libraries for full-length cDNAs [29], [30]. The probability of finding a particular clone is dependent on its representation in the library; therefore, it may be difficult to identify cDNAs that are expressed under restricted conditions or in small subsets of cells. Given this expression bias, are our cloned ORFs distributed equally throughout the genome, or are there regions that are relatively under- or overrepresented with respect to cloned ORFs? For example, in C. elegans, there is a marked underrepresentation of cloned ORFs on chromosome 5, in a region containing a large cluster of G-protein-coupled receptors [21]. We used BLAT to align the cloned ORFs to the human genome using UCSC's human genome build Golden Path hg35 [31], [32]. The number of ORFs associated with each chromosome was then compared to the number of RefSeq models [33], defined as the most comprehensive nonredundant set of full-length cDNAs. On 22 chromosomes ORF cloning was uniformly successful, with a cloning success rate ranging between ∼ 42 and ∼ 53%. In contrast, cloned ORFs on chromosome 21 were slightly underrepresented (Table 2).

Table 2

Summary of successfully cloned ORFs compared to RefSeq annotations on each chromosome

Chromosome	No. of RefSeqs	No. of ORFs	Percentage of success
1	2396	1207	50.3
2	1499	775	51.7
3	1294	676	52.2
4	838	416	49.6
5	1030	514	49.9
6	1227	620	50.5
7	1077	565	52.4
8	780	397	50.8
9	904	439	48.5
10	942	435	46.2
11	1474	675	45.8
12	1219	604	49.5
13	367	189	51.5
14	748	395	52.8
15	695	346	49.8
16	972	511	52.6
17	1342	667	49.7
18	321	156	48.6
19	1539	773	50.2
20	762	321	42.1
21	372	116	31.2
22	62	30	48.3
X	573	303	52.9
Y	963	408	42.4
All	23,396	11,538	49.3
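The success percentages in Table 2 are the ratio of cloned ORFs to RefSeq models. A minimal sketch, using four rows copied from the table (the function name is ours):

```python
# (chromosome, number of RefSeq models, number of cloned ORFs), from Table 2
rows = [
    ("2", 1499, 775),
    ("20", 762, 321),
    ("21", 372, 116),
    ("All", 23396, 11538),
]


def success_rate(n_refseq: int, n_orfs: int) -> float:
    """Cloned ORFs as a percentage of RefSeq models."""
    return 100 * n_orfs / n_refseq


for chrom, n_refseq, n_orfs in rows:
    print(f"chromosome {chrom}: {success_rate(n_refseq, n_orfs):.1f}%")
```

Comparing each chromosome with the genome-wide value in the "All" row is what singles out chromosome 21 as underrepresented.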

To investigate ORF distribution along each chromosome, we divided each chromosome into 1-Mb bins and counted the number of ORFs in each bin. We calculated the cloning success rate in each bin as the ratio of the number of cloned ORFs to RefSeq sequences (Fig. 2A). To check quantitatively whether there is a bias toward sparse or dense RefSeq regions in the cloning success rate, we plotted the number of cloned ORFs versus the number of RefSeq models for each bin for three chosen chromosomes (Fig. 2B). We find that the ORF density is linearly proportional to the RefSeq density and that the overall cloning success rate is ∼ 49% for every bin of chromosomes, showing that the cloned ORFs are equally distributed within chromosomes and that there are no regions of obvious over- or underrepresentation. We then compared the distribution of the local success rates among chromosomes and noticed a significantly different local success rate distribution on chromosomes 19, 20, 21, X, and Y (Supplementary Fig. 3). On chromosomes 20, 21, and X, this shift could be explained by the lower overall cloning success rate. On chromosomes 19 and Y, for which the cloning success rate was high, this shift might be due to erroneous gene annotation or related to the fact that these two chromosomes are among the shortest of chromosomes.

Fig. 2

Distribution of cloned ORFs within each chromosome. (A) To determine whether chromosomes contain regions that are under- or overrepresented in the ORFeome, we divided each chromosome into 1-Mb bins and counted the number of cloned ORFs and the number of RefSeq sequences in each bin. The x axis represents the length (Mb) of chromosome 1 and the y axis the number of RefSeq sequences in each bin. The colors of the bars reflect the percentage of RefSeqs in each bin that were cloned in the ORFeome, as indicated by the color key. If the cloning success rate was uniformly independent of the position on the chromosome, every bar should be colored the same. Gray lines correspond to bins without RefSeq models and the wide gray vertical region in the middle of the chromosome corresponds to the centromere (Supplementary Fig. 2 shows graphs of the remaining chromosomes). (B) The number of cloned ORFs in bins 1 Mb in length, N_ORF, shown as a function of the number of predictions in the same respective bins, N_RefSeq. Three chromosomes were taken as examples in this graph (chromosomes 1, 2, and 3). The straight line represents the linear regression to the data points. While only three of the chromosomes have been shown for clarity, the fitting yields N_ORF = (0.49 ± 0.006) N_RefSeq + (0.42 ± 0.32) if all chromosomes are taken into account, predicting an overall cloning success rate of about 49% for every chromosomal bin.

GO Slim terms

We turned to Gene Ontology (GO) annotations [34], [35] to assess whether specific functional categories were over- or underrepresented in human ORFeome version 3.1. Instead of the full GO hierarchy, we used the broader GO Slim terms of each GO branch (cellular component, biological process, and molecular function). We compared the fraction of each GO term found in clones in the ORFeome to the fraction found in the entire proteome (Fig. 3). We find that the ORFeome has a very similar profile of functional categories compared to the complete human proteome, with no obvious over- or underenriched categories.

Fig. 3

Classification of cloned ORFs by GO Slim terms. To identify over- or underrepresented functional categories of proteins in the ORFeome, we classified ORFs by GO Slim terms within their three GO branches, (A) cellular component, (B) molecular function, and (C) biological process, and compared the fraction of each GO Slim term found in the ORFeome to that of the entire proteome. No GO Slim term in any of the three branches is over- or underrepresented in the ORFeome.

Disease genes

Disease-associated genes are obviously of great interest to the research community. The OMIM (Online Mendelian Inheritance in Man) database [35] represents the central repository for information about inherited disease-related genes. OMIM currently contains information for about 2801 genes that are associated with 1585 different diseases. hORFeome v3.1 contains 956 disease genes associated with 828 distinct diseases described in OMIM (Fig. 4). We classified all OMIM diseases into 22 categories (containing between 6 and 239 different diseases) based on the physiological system affected. We then determined how many diseases in each disease category were represented by at least one ORF in hORFeome v3.1. We could identify ORFs associated with 40–60% of the diseases within a given category, except a few slightly over- (cancer, hematological diseases) or underrepresented (ear–nose–throat-related diseases) categories. For example, v3.1 contains ORFs for 86 of 132 diseases that belong to the cancer category. Despite the good representation of OMIM genes in the ORFeome, only 9.7% of all cloned ORFs have been associated with an inherited disease. The generation of large ORF collections, such as hORFeome v3.1, will be crucial for the identification and characterization of additional disease associations.

Fig. 4

Representation of disease genes in hORFeome v3.1. The list of inherited diseases and their associated genes was retrieved from the OMIM database, and the diseases were grouped into 22 disease categories based on the physiological system affected. The length of each bar represents the percentage of diseases in each disease category for which we cloned at least one associated ORF.

hORFDB 3.1 Web site

A new Web site (http://horfdb.dfci.harvard.edu) that improves both the user interface and the back end has been developed. Searches on the hORFDB 3.1 Web site can be performed for single or multiple clones using different queries, including MGC name, GI, GenBank accession number, EntrezGene ID, OST accession number, symbol, or plate position. The database can also be searched by description or keyword for ORFs involved in specific biological functions or diseases. The result page of a successfully cloned ORF provides information about the location of the ORF in the ORFeome resource, primer and GenBank sequences, and alternative IDs and descriptions for the ORF. Any yeast two-hybrid interactions based on the human interaction dataset produced by Rual et al. [24] are also listed. These interactions can be visualized using the network visualization tool VisANT [36]. If the queried protein has been detected as bait or prey in the above-mentioned interaction dataset, hORFDB links directly to a first-level interaction network (proteins that interact directly with the queried protein) and a second-level interaction network (proteins that interact with the interaction partners of the queried protein). The user can expand the visible network by clicking on each node of interest, thereby revealing the next level of interactors. Each protein in the network contains links back to its corresponding hORFeome v3.1 Web page, as well as to its corresponding pages on the NCBI EntrezGene, NCBI Nucleotide, and KEGG Web sites. All ORFs labeled as cloned in hORFDB are part of the physical resource of ORF Entry minipools and are available from Open Biosystems, Inc. (http://www.openbiosystems.com). The complete list of cloned human ORFs is also available as a downloadable Fasta file on our home page.

Discussion

hORFeome v3.1 greatly expands the human ORFeome collection. Unique MGC cDNAs, initially generated largely by random cDNA library screening, were used individually as templates to clone successfully 4111 additional ORFs, generating a consolidated collection of 12,212 ORFs representing 10,214 genes. Although random library screening followed by PCR amplification and Gateway cloning is an excellent method to clone ORFs corresponding to more than half of the well-defined RefSeq predictions, this approach would be less efficient for the identification of “rare” ORFs. Strategies to overcome this hurdle are to generate normalized cDNA libraries or to presubtract cDNAs retrieved in previous screens. An alternative approach is to perform directed PCR from cDNA using primers that have been designed based on ORF predictions, as has been successful for C. elegans [21]. Recently, the MGC, Integrated Molecular Analysis of Genomes and Their Expression Consortium, Wellcome Trust Sanger Institute, DFCI–CCSB (Dana Farber Cancer Institute–Center for Cancer Systems Biology), Harvard Institute of Proteomics, Deutsches Krebsforschungszentrum, Kazusa DNA Research Institute, and RIKEN Yokohama Institute initiated the human “ORFeome Collaboration” with the aim of sharing existing resources and dividing the task of completing the human ORFeome [37]. This effort is using directed PCR to clone missing ORFs whose exon–intron structure is annotated based on literature or full-length cDNAs. About 4700 ORFs that meet these criteria are currently being processed. In addition to library screening and directed PCR, direct ORF synthesis is a third approach to expand the human ORFeome and will be particularly valuable for ORFs that prove difficult to clone. 
In a small pilot project to demonstrate the feasibility of the synthetic approach, the MGC recently contracted for the successful synthesis and cloning of 72 ORF sequences, ranging in size from several hundred nucleotides to over 11 kb (Gary Temple, personal communication). In addition to gene coverage, future versions of the human ORFeome will increase coverage of alternatively spliced genes. While recent estimates predict that up to 80% of all human genes code for multiple isoforms, only 1160 ORFs correspond to splice variants in hORFeome v3.1. Finally, while the current ORFeome is a collection of minipools, each initially derived from a single, fully sequenced cDNA template, we ultimately want to generate a resource of wild-type clones, which will require the isolation and full-length sequencing of single colonies for each ORF in the minipools.

Materials and methods

Gateway cloning of the human ORFeome v3.1

For PCR amplification, we designed primers using the automatic primer design program OSP [38]. Although this program is no longer publicly available, we suggest using Primer3 [39] as an alternative primer design program. Forward primers start from the A of the ATG, whereas the reverse primers start from the second nucleotide in the stop codon. Consequently, the reverse attB2.1 primers do not contain the last nucleotide of the termination codon, so as to allow subsequent generation of C-terminal fusion proteins. For ORFs that failed in the first ORFeome project and that we reattempted to clone in ORFeome version 3, we did not synthesize new primers but instead used the primers generated for the previous project. To generate hORFeome v3.1 we closely followed the protocol of Reboul et al. [21], except that we applied the improved PCR conditions and used the improved donor vector pDONR223 [23]. All nonredundant MGC clones were consolidated into a unique set (some MGC clones exist in duplicates) and arranged by size of the ORF and by antibiotic resistance marker. Plasmid preps were obtained using a Qiagen Biorobot 8000. PCR was performed in 25-μl reactions containing 1 unit of KOD Hot Start DNA polymerase according to the manufacturer (Novagen). Gateway BP reactions were performed as described [23] using 2 μl of unpurified PCR product in 10 μl final volume. A 2-μl aliquot of the BP reaction was used to transform Escherichia coli DH5α to spectinomycin resistance (50 μg/ml). Plasmid preps were obtained from 1.0-ml overnight cultures and then used for PCR with M13-based Fwd and Rev primers to generate templates for cycle-sequencing reactions [23]. PCR products were sequenced at the 5′ end using the M13-Fwd primer, generating an OST.
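The primer boundaries described above can be made concrete with a short sketch. It derives only the gene-specific segment of each primer; the Gateway att tails are omitted, and the fixed segment length is a hypothetical simplification, since a design program such as OSP or Primer3 would choose lengths from melting temperature and other criteria.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gene_specific_primers(orf: str, length: int = 20):
    """Return (forward, reverse) gene-specific primer segments for an ORF.

    `length` is an illustrative fixed size, not a value from the text.
    """
    orf = orf.upper()
    forward = orf[:length]           # begins at the A of the ATG
    without_last_stop_nt = orf[:-1]  # drop the final nucleotide of the stop codon
    reverse = reverse_complement(without_last_stop_nt[-length:])
    return forward, reverse


fwd, rev = gene_specific_primers("ATGGCCAAAGGGTTTCCCTAA", length=10)
print(fwd, rev)
```

Because the reverse segment covers only the first two nucleotides of the termination codon, the amplified insert lacks a complete stop codon, which is what permits C-terminal fusions after transfer to an expression vector.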

Sequence analysis of the initial MGC cDNAs

For this ORFeome project, we attempted to clone ORFs from 9236 MGC cDNA clones that either were not yet available or remained uncloned in hORFeome v1.1. The coding sequences of all these cDNAs were retrieved from the NCBI Web site and compared to one another to eliminate any cDNAs containing redundant open reading frames (this includes duplicate clones as well as those cDNAs with different 5′ and/or 3′ UTRs but otherwise identical ORF sequences). Next, we aligned the set of unique coding sequences to the human genome (Golden Path hg35) and identified ORFs that were splice variants or polymorphic clones of the same gene.

Sequence analysis of OSTs from minipools

First, OSTs were used as queries for BLASTN searches against our internal database containing all coding sequences that we attempted to clone. In a second step, aligned OSTs were truncated after the first 400 nucleotides (or fewer if the sequence read was short or of low quality before the 400th nucleotide) and a BLAST (blast2seq) was performed between each OST sequence and its best hit. Based on these results, OSTs were grouped into the following classes: (1) good, (2) good but potential polymorphism detected, (3) good but not full length, (4) wrong identity, and (5) empty clones. Only OSTs of categories (1) and (2) were retained for further analysis.

Sequence analysis of OSTs from isolated colonies

Five hundred sixty-four ORFs (six 96-well plates) were selected from the ORFeome 3.1 collection to represent a variety of insert sizes, including the smallest and largest ORFs. Minipools were streaked to single colonies on LB agar containing 100 μg/ml spectinomycin and incubated at 37 °C for 16 h. Six colonies were selected for further analysis. Individual colonies were picked into 0.8-ml 96-well plates (ABgene AB-0859) containing 0.5 ml of selective growth medium (Circlegrow supplemented with 100 μg/ml spectinomycin) and grown in a shaking incubator at 37 °C for 16 h. The sequencing template was prepared for successfully cultivated colonies by standard alkaline-lysis plasmid purification. Initial end sequencing was performed with BigDye terminator v3 Cycle Sequencing Kits (Applied Biosystems) using M13 forward (TGTAAAACGACGGCCAGT) and reverse (CAGGAAACAGCTATGACC) primers and primers designed to pDONR223 (CCCAGTCACGACGTTGTAAAACG; GTAACATCAGAGATTTTGAGACAC) on ABI 3730 sequencing machines. Reads were analyzed for the presence of a complete att site, the correct insert sequence, and the presence of the gene-specific oligonucleotide using crossmatch (Green P, http://www.phrap.org/phredphrap/general.html) and Blastn.

Analysis of successful ORF clones on chromosomes

Sequences of the RefSeq set (June 2005), NCBI's consensus set of nonredundant transcripts, were used as queries to perform a BLAT alignment to the human genome build hg35.1. We chose only those RefSeq models that fulfill the following requirements: (1) RefSeqs are of the “NM” category, which corresponds to sequences that have been validated by one or more cDNAs and (2) they are known as “protein-coding” by NCBI. Using their genomic coordinates, cloned ORFs and RefSeqs were grouped into 1-Mb bins on all chromosomes. The distribution of RefSeq models and the ORF cloning success rate on the chromosomes were plotted using Matlab 6. In the scatter graph of Fig. 2B, we find that the ORF density is linearly proportional to the RefSeq density, as described by the function N_ORF = 0.49 N_RefSeq + 0.42 (the standard errors are 0.006 and 0.32 for the slope and the intercept of the regression function, respectively) for the given binning and considering every chromosome.
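The binning and regression described above can be sketched in a few lines. The original analysis was carried out in Matlab; the version below is an independent illustration in plain Python, with hypothetical function names and with each feature reduced to a (chromosome, start coordinate) pair.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1-Mb bins, as in the text


def bin_counts(features):
    """Count features per (chromosome, bin index); features are (chrom, start) pairs."""
    return Counter((chrom, start // BIN_SIZE) for chrom, start in features)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def orf_vs_refseq(refseq_features, orf_features):
    """Regress cloned-ORF counts on RefSeq counts over all bins with RefSeq models."""
    refseq_bins = bin_counts(refseq_features)
    orf_bins = bin_counts(orf_features)
    keys = sorted(refseq_bins)  # bins without RefSeq models are skipped
    xs = [refseq_bins[k] for k in keys]
    ys = [orf_bins.get(k, 0) for k in keys]
    return fit_line(xs, ys)
```

The slope of the fit estimates the per-bin cloning success rate, so a slope near 0.49 with a small intercept corresponds to the roughly 49% success rate reported for every chromosomal bin.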

Analysis of ORF distribution by functional classes

Gene ontology functional classification was obtained from the EntrezGene database at ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go (April 6, 2006). Each gene-to-GO term association was mapped to a GO Slim association as defined in ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/goslim/goaslim.map (February 27, 2006). The frequency distribution of ORFs in each GO Slim class was then calculated for ORFs in hORFeome v3.1 as well as for the entire proteome.
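The frequency comparison described above can be sketched as follows. The two mappings stand for the contents of the gene2go and GO Slim map files after parsing; file parsing is omitted, the function name is hypothetical, and counting each gene once per GO Slim term is our assumption about how associations were tallied.

```python
from collections import Counter


def slim_fractions(gene_to_go, go_to_slim, genes):
    """Fraction of gene-to-GO-Slim associations that fall in each GO Slim term."""
    counts = Counter()
    for gene in genes:
        # Map each GO term to its slim term; count a gene once per slim term.
        slims = {go_to_slim[go] for go in gene_to_go.get(gene, ()) if go in go_to_slim}
        counts.update(slims)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}
```

Calling the function once with the genes represented in hORFeome v3.1 and once with all genes in the proteome gives two profiles that can be compared term by term.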

Analysis of ORF distribution by disease category

The list of human diseases and their associated genes was obtained from the OMIM database at ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap. Similar diseases were collapsed into just one disease. We then manually curated these diseases and divided them into 22 classes mostly based on the type of disease (such as cancer) and the physiological system affected.
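Given a curated disease-to-category table, the per-category coverage plotted in Fig. 4 reduces to a simple count. The sketch below assumes three inputs that are not specified in this form in the text: a mapping from disease to category, a mapping from disease to associated genes, and the set of genes with a cloned ORF. All names are hypothetical.

```python
def category_coverage(disease_to_category, disease_to_genes, cloned_genes):
    """Percentage of diseases per category with at least one cloned associated gene."""
    cloned = set(cloned_genes)
    totals, covered = {}, {}
    for disease, category in disease_to_category.items():
        totals[category] = totals.get(category, 0) + 1
        if cloned & set(disease_to_genes.get(disease, ())):
            covered[category] = covered.get(category, 0) + 1
    return {cat: 100 * covered.get(cat, 0) / n for cat, n in totals.items()}
```

For the cancer category, 86 covered diseases out of 132 corresponds to a coverage of about 65%.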

References

1. GATEWAY recombinational cloning: application to the cloning of large numbers of open reading frames or ORFeomes.

Authors: A J Walhout; G F Temple; M A Brasch; J L Hartley; M A Lorson; S van den Heuvel; M Vidal
Journal: Methods Enzymol Date: 2000 Impact factor: 1.600

2. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

3. Towards a proteome-scale map of the human protein-protein interaction network.

Authors: Jean-François Rual; Kavitha Venkatesan; Tong Hao; Tomoko Hirozane-Kishikawa; Amélie Dricot; Ning Li; Gabriel F Berriz; Francis D Gibbons; Matija Dreze; Nono Ayivi-Guedehoussou; Niels Klitgord; Christophe Simon; Mike Boxem; Stuart Milstein; Jennifer Rosenberg; Debra S Goldberg; Lan V Zhang; Sharyl L Wong; Giovanni Franklin; Siming Li; Joanna S Albala; Janghoo Lim; Carlene Fraughton; Estelle Llamosas; Sebiha Cevik; Camille Bex; Philippe Lamesch; Robert S Sikorski; Jean Vandenhaute; Huda Y Zoghbi; Alex Smolyar; Stephanie Bosak; Reynaldo Sequerra; Lynn Doucette-Stamm; Michael E Cusick; David E Hill; Frederick P Roth; Marc Vidal
Journal: Nature Date: 2005-09-28 Impact factor: 49.962

4. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences.

Authors: Robert L Strausberg; Elise A Feingold; Lynette H Grouse; Jeffery G Derge; Richard D Klausner; Francis S Collins; Lukas Wagner; Carolyn M Shenmen; Gregory D Schuler; Stephen F Altschul; Barry Zeeberg; Kenneth H Buetow; Carl F Schaefer; Narayan K Bhat; Ralph F Hopkins; Heather Jordan; Troy Moore; Steve I Max; Jun Wang; Florence Hsieh; Luda Diatchenko; Kate Marusina; Andrew A Farmer; Gerald M Rubin; Ling Hong; Mark Stapleton; M Bento Soares; Maria F Bonaldo; Tom L Casavant; Todd E Scheetz; Michael J Brownstein; Ted B Usdin; Shiraki Toshiyuki; Piero Carninci; Christa Prange; Sam S Raha; Naomi A Loquellano; Garrick J Peters; Rick D Abramson; Sara J Mullahy; Stephanie A Bosak; Paul J McEwan; Kevin J McKernan; Joel A Malek; Preethi H Gunaratne; Stephen Richards; Kim C Worley; Sarah Hale; Angela M Garcia; Laura J Gay; Stephen W Hulyk; Debbie K Villalon; Donna M Muzny; Erica J Sodergren; Xiuhua Lu; Richard A Gibbs; Jessica Fahey; Erin Helton; Mark Ketteman; Anuradha Madan; Stephanie Rodrigues; Amy Sanchez; Michelle Whiting; Anup Madan; Alice C Young; Yuriy Shevchenko; Gerard G Bouffard; Robert W Blakesley; Jeffrey W Touchman; Eric D Green; Mark C Dickson; Alex C Rodriguez; Jane Grimwood; Jeremy Schmutz; Richard M Myers; Yaron S N Butterfield; Martin I Krzywinski; Ursula Skalska; Duane E Smailus; Angelique Schnerch; Jacqueline E Schein; Steven J M Jones; Marco A Marra
Journal: Proc Natl Acad Sci U S A Date: 2002-12-11 Impact factor: 11.205

5. C. elegans ORFeome version 1.1: experimental verification of the genome annotation and resource for proteome-scale protein expression.

Authors: Jérôme Reboul; Philippe Vaglio; Jean-François Rual; Philippe Lamesch; Monica Martinez; Christopher M Armstrong; Siming Li; Laurent Jacotot; Nicolas Bertin; Rekin's Janky; Troy Moore; James R Hudson; James L Hartley; Michael A Brasch; Jean Vandenhaute; Simon Boulton; Gregory A Endress; Sarah Jenna; Eric Chevet; Vasilis Papasotiropoulos; Peter P Tolias; Jason Ptacek; Mike Snyder; Raymond Huang; Mark R Chance; Hongmei Lee; Lynn Doucette-Stamm; David E Hill; Marc Vidal
Journal: Nat Genet Date: 2003-05 Impact factor: 38.330

6. Global analysis of protein activities using proteome chips.

Authors: H Zhu; M Bilgin; R Bangham; D Hall; A Casamayor; P Bertone; N Lan; R Jansen; S Bidlingmaier; T Houfek; T Mitchell; P Miller; R A Dean; M Gerstein; M Snyder
Journal: Science Date: 2001-07-26 Impact factor: 47.728

7. Finishing the euchromatic sequence of the human genome.

Authors:
Journal: Nature Date: 2004-10-21 Impact factor: 49.962

Review 8. Life with 6000 genes.

Authors: A Goffeau; B G Barrell; H Bussey; R W Davis; B Dujon; H Feldmann; F Galibert; J D Hoheisel; C Jacq; M Johnston; E J Louis; H W Mewes; Y Murakami; P Philippsen; H Tettelin; S G Oliver
Journal: Science Date: 1996-10-25 Impact factor: 47.728

9. Genome sequence of the Brown Norway rat yields insights into mammalian evolution.

Authors: Richard A Gibbs; George M Weinstock; Michael L Metzker; Donna M Muzny; Erica J Sodergren; Steven Scherer; Graham Scott; David Steffen; Kim C Worley; Paula E Burch; Geoffrey Okwuonu; Sandra Hines; Lora Lewis; Christine DeRamo; Oliver Delgado; Shannon Dugan-Rocha; George Miner; Margaret Morgan; Alicia Hawes; Rachel Gill; Robert A Holt; Mark D Adams; Peter G Amanatides; Holly Baden-Tillson; Mary Barnstead; Soo Chin; Cheryl A Evans; Steve Ferriera; Carl Fosler; Anna Glodek; Zhiping Gu; Don Jennings; Cheryl L Kraft; Trixie Nguyen; Cynthia M Pfannkoch; Cynthia Sitter; Granger G Sutton; J Craig Venter; Trevor Woodage; Douglas Smith; Hong-Mei Lee; Erik Gustafson; Patrick Cahill; Arnold Kana; Lynn Doucette-Stamm; Keith Weinstock; Kim Fechtel; Robert B Weiss; Diane M Dunn; Eric D Green; Robert W Blakesley; Gerard G Bouffard; Pieter J De Jong; Kazutoyo Osoegawa; Baoli Zhu; Marco Marra; Jacqueline Schein; Ian Bosdet; Chris Fjell; Steven Jones; Martin Krzywinski; Carrie Mathewson; Asim Siddiqui; Natasja Wye; John McPherson; Shaying Zhao; Claire M Fraser; Jyoti Shetty; Sofiya Shatsman; Keita Geer; Yixin Chen; Sofyia Abramzon; William C Nierman; Paul H Havlak; Rui Chen; K James Durbin; Amy Egan; Yanru Ren; Xing-Zhi Song; Bingshan Li; Yue Liu; Xiang Qin; Simon Cawley; Kim C Worley; A J Cooney; Lisa M D'Souza; Kirt Martin; Jia Qian Wu; Manuel L Gonzalez-Garay; Andrew R Jackson; Kenneth J Kalafus; Michael P McLeod; Aleksandar Milosavljevic; Davinder Virk; Andrei Volkov; David A Wheeler; Zhengdong Zhang; Jeffrey A Bailey; Evan E Eichler; Eray Tuzun; Ewan Birney; Emmanuel Mongin; Abel Ureta-Vidal; Cara Woodwark; Evgeny Zdobnov; Peer Bork; Mikita Suyama; David Torrents; Marina Alexandersson; Barbara J Trask; Janet M Young; Hui Huang; Huajun Wang; Heming Xing; Sue Daniels; Darryl Gietzen; Jeanette Schmidt; Kristian Stevens; Ursula Vitt; Jim Wingrove; Francisco Camara; M Mar Albà; Josep F Abril; Roderic Guigo; Arian Smit; Inna Dubchak; Edward M Rubin; Olivier Couronne; Alexander 
Poliakov; Norbert Hübner; Detlev Ganten; Claudia Goesele; Oliver Hummel; Thomas Kreitler; Young-Ae Lee; Jan Monti; Herbert Schulz; Heike Zimdahl; Heinz Himmelbauer; Hans Lehrach; Howard J Jacob; Susan Bromberg; Jo Gullings-Handley; Michael I Jensen-Seaman; Anne E Kwitek; Jozef Lazar; Dean Pasko; Peter J Tonellato; Simon Twigger; Chris P Ponting; Jose M Duarte; Stephen Rice; Leo Goodstadt; Scott A Beatson; Richard D Emes; Eitan E Winter; Caleb Webber; Petra Brandt; Gerald Nyakatura; Margaret Adetobi; Francesca Chiaromonte; Laura Elnitski; Pallavi Eswara; Ross C Hardison; Minmei Hou; Diana Kolbe; Kateryna Makova; Webb Miller; Anton Nekrutenko; Cathy Riemer; Scott Schwartz; James Taylor; Shan Yang; Yi Zhang; Klaus Lindpaintner; T Dan Andrews; Mario Caccamo; Michele Clamp; Laura Clarke; Valerie Curwen; Richard Durbin; Eduardo Eyras; Stephen M Searle; Gregory M Cooper; Serafim Batzoglou; Michael Brudno; Arend Sidow; Eric A Stone; J Craig Venter; Bret A Payseur; Guillaume Bourque; Carlos López-Otín; Xose S Puente; Kushal Chakrabarti; Sourav Chatterji; Colin Dewey; Lior Pachter; Nicolas Bray; Von Bing Yap; Anat Caspi; Glenn Tesler; Pavel A Pevzner; David Haussler; Krishna M Roskin; Robert Baertsch; Hiram Clawson; Terrence S Furey; Angie S Hinrichs; Donna Karolchik; William J Kent; Kate R Rosenbloom; Heather Trumbower; Matt Weirauch; David N Cooper; Peter D Stenson; Bin Ma; Michael Brent; Manimozhiyan Arumugam; David Shteynberg; Richard R Copley; Martin S Taylor; Harold Riethman; Uma Mudunuri; Jane Peterson; Mark Guyer; Adam Felsenfeld; Susan Old; Stephen Mockrin; Francis Collins
Journal: Nature Date: 2004-04-01 Impact factor: 49.962

Review 10. Genome sequence of the nematode C. elegans: a platform for investigating biology.

Authors:
Journal: Science Date: 1998-12-11 Impact factor: 47.728
