Philippe Lamesch, Ning Li, Stuart Milstein, Changyu Fan, Tong Hao, Gabor Szabo, Zhenjun Hu, Kavitha Venkatesan, Graeme Bethel, Paul Martin, Jane Rogers, Stephanie Lawlor, Stuart McLaren, Amélie Dricot, Heather Borick, Michael E Cusick, Jean Vandenhaute, Ian Dunham, David E Hill, Marc Vidal.
Abstract
Complete sets of cloned protein-encoding open reading frames (ORFs), or ORFeomes, are essential tools for large-scale proteomics and systems biology studies. Here we describe human ORFeome version 3.1 (hORFeome v3.1), currently the largest publicly available resource of full-length human ORFs (available at ). Generated by Gateway recombinational cloning, this collection contains 12,212 ORFs, representing 10,214 human genes, and corresponds to a 51% expansion of the original hORFeome v1.1. An online human ORFeome database, hORFDB, was built and serves as the central repository for all cloned human ORFs (http://horfdb.dfci.harvard.edu). This expansion of the original ORFeome resource greatly increases the potential experimental search space for large-scale proteomics studies, which will lead to the generation of more comprehensive datasets.
Year: 2007 PMID: 17207965 PMCID: PMC4647941 DOI: 10.1016/j.ygeno.2006.11.012
Source DB: PubMed Journal: Genomics ISSN: 0888-7543 Impact factor: 5.736
Fig. 1. Automated human ORFeome pipeline. (A) A filter computationally removed ORFs, extracted from MGC cDNAs, that were not full-length; short ORFs (< 100 nucleotides); and redundantly cloned ORFs. Isoforms and SNP variants of each gene were retained and treated as individual clones. (B) Clones were PCR amplified, Gateway cloned, and sequenced at the 5′ end using universal primers. (C) The resulting ORF sequence tags (OSTs) were aligned to the ORFeome database containing all attempted ORF sequences. Clone attempts that produced a PCR band but whose 5′ OST did not correspond to the expected cDNA underwent a second round of cloning. Successfully cloned ORFs from hORFeome v1 and v3 were combined to form hORFeome v3.1. (D) To investigate the quality of this resource, we picked isolated colonies for 564 ORFs and sequenced them at their 5′ and 3′ ends. In the upcoming ORFeome version 4 project, clones without mutations in their end sequences will undergo full-length sequencing to generate a resource of wild-type clones for each ORF in the hORFeome v3.1.
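The filtering step in panel (A) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `CandidateOrf` record, its field names, and the choice to treat only identical nucleotide sequences as redundant are all assumptions made here. Under that assumption, isoforms and SNP variants differ in sequence and are therefore retained as separate clones, as the caption describes.

```python
from dataclasses import dataclass

# From the caption: ORFs shorter than 100 nucleotides are removed.
MIN_ORF_LENGTH_NT = 100


@dataclass(frozen=True)
class CandidateOrf:
    """Hypothetical record for one ORF extracted from a cDNA."""
    clone_id: str       # illustrative identifier, not a real MGC accession
    sequence: str       # ORF nucleotide sequence
    full_length: bool   # assumed to be set by an upstream annotation step


def filter_candidates(candidates):
    """Drop non-full-length, short, and redundant ORFs, preserving order.

    Assumption: two ORFs are redundant only if their sequences are identical,
    so isoforms and SNP variants pass through as distinct clones.
    """
    seen_sequences = set()
    kept = []
    for orf in candidates:
        if not orf.full_length:
            continue
        if len(orf.sequence) < MIN_ORF_LENGTH_NT:
            continue
        if orf.sequence in seen_sequences:
            continue
        seen_sequences.add(orf.sequence)
        kept.append(orf)
    return kept
```

A stricter redundancy rule (for example, collapsing by gene) would discard the variants that the pipeline keeps, which is why sequence identity is used in this sketch.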
Summary of the analysis of the nucleotide substitution rate in ORF and primer sequences in human ORFeome v3.1
| | No. of analyzed nucleotides | No. of mutations | 1 mutation every | No. of analyzed sequences | No. of mutated sequences | Percentage of mutated sequences |
|---|---|---|---|---|---|---|
| ORF sequences | 4 × 10^6 | 316 | 12,875 | 9400 | 275 | 2.0 |
| Primer sequences | 17 × 10^4 | 588 | 293 | 9118 | 557 | 6.1 |
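The derived columns of this table are simple ratios, sketched below. The function names are illustrative. Because the table reports the nucleotide counts in rounded form, an interval recomputed from them only approximates the tabulated "1 mutation every" value.

```python
def nucleotides_per_mutation(analyzed_nucleotides, mutations):
    """Average number of analyzed nucleotides per observed mutation."""
    if mutations <= 0:
        raise ValueError("mutations must be positive")
    return analyzed_nucleotides / mutations


def percent_mutated(mutated_sequences, analyzed_sequences):
    """Percentage of analyzed sequences carrying at least one mutation."""
    if analyzed_sequences <= 0:
        raise ValueError("analyzed_sequences must be positive")
    return 100.0 * mutated_sequences / analyzed_sequences


# Primer row of the table, using the rounded nucleotide count as printed.
primer_interval = nucleotides_per_mutation(17e4, 588)
primer_percent = percent_mutated(557, 9118)
print(f"about 1 mutation every {primer_interval:.0f} nt; "
      f"{primer_percent:.1f}% of sequences mutated")
```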
Summary of successfully cloned ORFs compared to RefSeq annotations on each chromosome
| Chromosome | No. of RefSeqs | No. of ORFs | Percentage of success |
|---|---|---|---|
| 1 | 2396 | 1207 | 50.3 |
| 2 | 1499 | 775 | 51.7 |
| 3 | 1294 | 676 | 52.2 |
| 4 | 838 | 416 | 49.6 |
| 5 | 1030 | 514 | 49.9 |
| 6 | 1227 | 620 | 50.5 |
| 7 | 1077 | 565 | 52.4 |
| 8 | 780 | 397 | 50.8 |
| 9 | 904 | 439 | 48.5 |
| 10 | 942 | 435 | 46.2 |
| 11 | 1474 | 675 | 45.8 |
| 12 | 1219 | 604 | 49.5 |
| 13 | 367 | 189 | 51.5 |
| 14 | 748 | 395 | 52.8 |
| 15 | 695 | 346 | 49.8 |
| 16 | 972 | 511 | 52.6 |
| 17 | 1342 | 667 | 49.7 |
| 18 | 321 | 156 | 48.6 |
| 19 | 1539 | 773 | 50.2 |
| 20 | 762 | 321 | 42.1 |
| 21 | 372 | 116 | 31.2 |
| 22 | 62 | 30 | 48.3 |
| X | 573 | 303 | 52.9 |
| Y | 963 | 408 | 42.4 |
| All | 23,396 | 11,538 | 49.3 |
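As a consistency check on the table above, the per-chromosome counts can be summed and the success percentage recomputed. The counts below are copied from the table; the variable and function names are illustrative.

```python
# chromosome -> (number of RefSeqs, number of cloned ORFs), as tabulated above
COUNTS = {
    "1": (2396, 1207), "2": (1499, 775), "3": (1294, 676), "4": (838, 416),
    "5": (1030, 514), "6": (1227, 620), "7": (1077, 565), "8": (780, 397),
    "9": (904, 439), "10": (942, 435), "11": (1474, 675), "12": (1219, 604),
    "13": (367, 189), "14": (748, 395), "15": (695, 346), "16": (972, 511),
    "17": (1342, 667), "18": (321, 156), "19": (1539, 773), "20": (762, 321),
    "21": (372, 116), "22": (62, 30), "X": (573, 303), "Y": (963, 408),
}


def success_percentage(n_refseq, n_orf):
    """Cloned ORFs as a percentage of RefSeq annotations."""
    return 100.0 * n_orf / n_refseq


def totals(counts):
    """Sum RefSeq and ORF counts over all chromosomes."""
    total_refseq = sum(n_refseq for n_refseq, _ in counts.values())
    total_orf = sum(n_orf for _, n_orf in counts.values())
    return total_refseq, total_orf


total_refseq, total_orf = totals(COUNTS)
overall = success_percentage(total_refseq, total_orf)
print(total_refseq, total_orf, round(overall, 1))
```

The summed counts match the "All" row, and the recomputed overall percentage agrees with the tabulated value.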
Fig. 2. Distribution of cloned ORFs within each chromosome. (A) To determine whether chromosomes contain regions that are under- or overrepresented in the ORFeome, we divided each chromosome into 1-Mb bins and counted the number of cloned ORFs and the number of RefSeq sequences in each bin. The x axis represents the length (Mb) of chromosome 1 and the y axis the number of RefSeq sequences in each bin. The colors of the bars reflect the percentage of RefSeqs in each bin that were cloned in the ORFeome, as indicated by the color key. If the cloning success rate were uniform and independent of position on the chromosome, every bar would be colored the same. Gray lines correspond to bins without RefSeq models and the wide gray vertical region in the middle of the chromosome corresponds to the centromere (Supplementary Fig. 2 shows graphs of the remaining chromosomes). (B) The number of cloned ORFs in each 1-Mb bin, N_ORF, shown as a function of the number of predictions in the same bin, N_RefSeq. Three chromosomes (1, 2, and 3) are shown as examples. The straight line represents the linear regression to the data points. Although only three chromosomes are shown for clarity, fitting all chromosomes yields N_ORF = (0.49 ± 0.006) N_RefSeq + (0.42 ± 0.32), predicting an overall cloning success rate of about 49% for every chromosomal bin.
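The binning and fit described in this caption can be sketched as below. This is an illustration under stated assumptions rather than the authors' analysis code: each sequence is assigned to a bin by a single start coordinate, bins without any RefSeq are left out of the fit, and an ordinary least-squares line is used. The caption does not specify any of these choices, and all function names are hypothetical.

```python
from collections import Counter

BIN_SIZE_BP = 1_000_000  # 1-Mb bins, as in the caption


def count_per_bin(start_positions_bp, bin_size_bp=BIN_SIZE_BP):
    """Count sequences per bin, keyed by integer bin index."""
    return Counter(pos // bin_size_bp for pos in start_positions_bp)


def paired_bin_counts(refseq_starts, orf_starts, bin_size_bp=BIN_SIZE_BP):
    """Return (N_RefSeq, N_ORF) pairs for every bin containing a RefSeq.

    Assumption: bins with no RefSeq models are skipped.
    """
    refseq_bins = count_per_bin(refseq_starts, bin_size_bp)
    orf_bins = count_per_bin(orf_starts, bin_size_bp)
    # Counter returns 0 for bins in which no ORF was cloned.
    return [(refseq_bins[b], orf_bins[b]) for b in sorted(refseq_bins)]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

In this form the slope plays the role of the per-bin cloning success rate, and an intercept near zero indicates that bins with few RefSeqs are not cloned at a systematically different rate.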
Fig. 3. Classification of cloned ORFs by GO Slim terms. To identify over- or underrepresented functional categories of proteins in the ORFeome, we classified ORFs by GO Slim terms within their three GO branches, (A) cellular component, (B) molecular function, and (C) biological process, and compared the fraction of each GO Slim term found in the ORFeome to that of the entire proteome. No GO Slim term in any of the three branches is over- or underrepresented in the ORFeome.
Fig. 4. Representation of disease genes in hORFeome v3.1. The list of inherited diseases and their associated genes was retrieved from the OMIM database, and the diseases were grouped into 22 disease categories based on the physiological system affected. The length of each bar represents the percentage of diseases in each disease category for which we cloned at least one associated ORF.
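The bar lengths described in this caption amount to a per-category coverage percentage, which can be sketched as follows. The input layout (a disease name, a category, and a set of gene symbols per disease) and the function name are assumptions for illustration; the identifiers in the usage example are placeholders, not real OMIM entries.

```python
from collections import defaultdict


def category_coverage(diseases, cloned_genes):
    """Percentage of diseases per category with at least one cloned gene.

    diseases: iterable of (disease_name, category, gene_symbols) tuples.
    cloned_genes: set of gene symbols with at least one cloned ORF.
    """
    total = defaultdict(int)
    covered = defaultdict(int)
    for _name, category, gene_symbols in diseases:
        total[category] += 1
        if any(gene in cloned_genes for gene in gene_symbols):
            covered[category] += 1
    return {cat: 100.0 * covered[cat] / total[cat] for cat in total}


# Placeholder data only; names and memberships are made up for the example.
example_diseases = [
    ("disease_1", "category_A", {"GENE1", "GENE2"}),
    ("disease_2", "category_A", {"GENE3"}),
    ("disease_3", "category_B", {"GENE4"}),
]
example_cloned = {"GENE2", "GENE4"}
print(category_coverage(example_diseases, example_cloned))
```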