Michael J Gilchrist1, Daniel Sobral2, Pierre Khoueiry2, Fabrice Daian2, Batiste Laporte2, Ilya Patrushev3, Jun Matsumoto4, Ken Dewar4, Kenneth E M Hastings4, Yutaka Satou5, Patrick Lemaire6, Ute Rothbächer7.
Abstract
Genome-wide resources, such as collections of cDNA clones encoding complete proteins (full-ORF clones), are crucial tools for studying the evolution of gene function and genetic interactions. Non-model organisms, in particular marine organisms, provide a rich source of functional diversity. Marine organism genomes are, however, frequently highly polymorphic and encode proteins that diverge significantly from those of well-annotated model genomes. The construction of full-ORF clone collections from non-model organisms is hindered by the difficulty of accurately predicting the N-terminal ends of proteins and of distinguishing recent paralogs from highly polymorphic alleles. We report a computational strategy that overcomes these difficulties and allows accurate gene-level clustering of transcript data, followed by the automated identification of full-ORFs with correct 5'- and 3'-ends. It is robust to polymorphism, includes paralog calling and does not require evolutionary proximity to well-annotated model organisms. We developed this pipeline for the ascidian Ciona intestinalis, a highly polymorphic member of the divergent sister group of the vertebrates, which is emerging as a powerful model organism for studying chordate gene function, gene regulatory networks and the molecular mechanisms underlying human pathologies. Using this pipeline we have generated the first full-ORF collection for a highly polymorphic marine invertebrate. It contains 19,163 full-ORF cDNA clones covering 60% of Ciona coding genes, and full-ORF orthologs for approximately half of curated human disease-associated genes.
Keywords: Ascidians; Full-ORF; Functional genomics; Human disease; Prediction pipeline; Transcriptomics
Year: 2015 PMID: 26025923 PMCID: PMC4528069 DOI: 10.1016/j.ydbio.2015.05.014
Source DB: PubMed Journal: Dev Biol ISSN: 0012-1606 Impact factor: 3.582
Fig. 1. Coding genome of Ciona intestinalis. (A) Phylogenetic position of Ciona intestinalis relative to major model organisms, with branch length indicating degree of amino acid divergence (adapted from Putnam et al. (2007)). (B) Length distribution of 5′ UTRs in Ciona intestinalis determined from assembled EST sequences where the open reading frame is probably complete. The red line indicates the proportion at any given length expected to include at least one in-frame stop codon. (C) Lack of conservation of the N-terminus of Ciona intestinalis proteins relative to well-annotated model systems, and compared to Xenopus tropicalis. Comparison of BLASTp alignment data using sets of mutual orthologs between Ciona intestinalis, Xenopus tropicalis, and either human or mouse. The schematic of BLAST alignments indicates how N-terminus divergence is measured.
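The expectation plotted as the red line in panel B can be illustrated with a simple null model. The caption does not state how the authors computed it, so the sketch below is only an assumption: it treats every in-frame codon of the 5′ UTR as independent and equally likely to be any of the 64 codons, 3 of which are stops. The function name `prob_in_frame_stop` and its parameters are hypothetical.

```python
def prob_in_frame_stop(utr_length_nt: int, stop_codon_fraction: float = 3 / 64) -> float:
    """Chance that a 5' UTR of the given length holds at least one in-frame stop.

    Assumption (not from the paper): codons are independent and uniformly
    distributed, so each in-frame codon is a stop with probability 3/64.
    """
    # Only complete codons in frame with the start codon can act as stops.
    n_codons = utr_length_nt // 3
    # P(at least one stop) = 1 - P(no codon is a stop).
    return 1.0 - (1.0 - stop_codon_fraction) ** n_codons


if __name__ == "__main__":
    for length in (30, 60, 150, 300):
        print(length, round(prob_in_frame_stop(length), 3))
```

Under this model short 5′ UTRs will often lack an in-frame stop codon, so the absence of an upstream stop is weak evidence against a complete open reading frame.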
Fig. 3. ‘Cliff’ algorithm for confirming full-ORF status. A concentration of the positions of 5′ ends of clones in assembled clusters identifies the likely start of transcription, which is, by definition, upstream of the start of translation. (A) Cluster with 176 ESTs showing truncated open reading frame and no start of transcription. (B) Cluster with 163 ESTs showing ‘cliff’ of 5′ end positions likely containing the start of transcription. (C) Cliff finding: histogram of numbers of 5′ ends in sliding windows of 100 bp determined every 50 bp along 1000 bp of cluster, and used to find the ‘peak’ region of 5′ end density (N100/N1000). (D) Cliff steepness and transcription start site (TSS) prediction: analysis of cumulative 5′ end count across the ‘peak’ 100 bp window, used to find the steepest part of the cliff for a determined fraction of reads in the window. (E) Cliff threshold: plots to test the cluster size dependent term for the limiting value N100/N1000, used to determine the presence of a ‘cliff’ and hence the likely start of translation (see text). The heavy dashed line follows the form where Z is the cluster size (number of ESTs). Individual EST clusters (spots) are plotted according to their ‘peak’ of 5′ ends (N100/N1000) on the y-axis, and cluster size (Z) on the x-axis; those falling to the right of and above the limiting curve are assumed likely to contain a sufficient cliff and the start of transcription. (Upper panel) Verification of cliff algorithm: (blue dots) clusters with upstream stop codon confirming open reading frame, showing that the score is a good predictor of full-ORF status. (Lower panel) Clone selection with cliff algorithm: clusters without upstream stop codon, showing clear bimodal distribution with cluster consensus sequences assumed full-ORF (green) and those assumed truncated (orange). Spots corresponding to the example genes in panels A and B are marked.
The light dashed line shows the curve used to determine the proportion of 5′ ends in the peak window used to look for the steepest section of the cliff (see D).
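The window scan of panel C and the steepness search of panel D can be sketched roughly as follows. This is a minimal illustration, not the geneDistiller code: it assumes 5′ end positions are given as 0-based offsets along the first 1000 bp of a cluster, and the function names, the `fraction` default and the tie-breaking rule (first best window wins) are all hypothetical. The cluster-size-dependent threshold of panel E is not reproduced, because its form is not given in this caption.

```python
import math
from bisect import bisect_left


def peak_window(five_prime_ends, span=1000, window=100, step=50):
    """Scan 100 bp windows every 50 bp; return (start, N100, N1000) of the densest."""
    ends = sorted(p for p in five_prime_ends if 0 <= p < span)
    n1000 = len(ends)
    best_start, best_count = 0, 0
    for start in range(0, span - window + 1, step):
        # Number of 5' ends with start <= position < start + window.
        count = bisect_left(ends, start + window) - bisect_left(ends, start)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count, n1000


def cliff_score(five_prime_ends, **kwargs):
    """Peak density N100/N1000 (0.0 for a cluster with no usable 5' ends)."""
    _, n100, n1000 = peak_window(five_prime_ends, **kwargs)
    return n100 / n1000 if n1000 else 0.0


def steepest_section(five_prime_ends, start, window=100, fraction=0.5):
    """Tightest interval holding the given fraction of 5' ends in the peak window."""
    in_window = sorted(p for p in five_prime_ends if start <= p < start + window)
    if not in_window:
        return None
    k = max(1, math.ceil(fraction * len(in_window)))
    # Slide over the sorted positions and keep the narrowest run of k ends.
    best = min(range(len(in_window) - k + 1),
               key=lambda i: in_window[i + k - 1] - in_window[i])
    return in_window[best], in_window[best + k - 1]
```

A cluster would then be compared against the limiting curve using its `cliff_score` and its size Z, and the interval returned by `steepest_section` would serve as a candidate transcription start region.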
Numbers of clones and clusters affected by novel solutions to the pipeline. These numbers relate to the total of 19,107 clones selected from 26,186 gene clusters and 9380 singletons covering 9083 KH2012 protein coding genes.
| Novel solution | Pipeline stage | Level | Number affected | Notes |
|---|---|---|---|---|
| Opposite strand splitting | Gene clustering | Cluster | +476 | Each split cluster may provide full-ORF clones |
| Cliff score | 5′ end detection | Cluster | +3687 | May not be the only evidence used to assess that a cluster is full-ORF |
| SL trans-splicing | 5′ end detection | Clone | +5049 | All clusters containing SL reads are considered to have the 5′ end of the ORF, irrespective of cliff score |
| Alternative transcripts | Non-redundant clone selection | Clone | +449 | Additional clones selected in case of alternative transcripts |
| Exon mapping analysis | Final clone list | Clone | −5000 | Excessive number of clones mapping to same locus with same exon structure |
| Manual addition of clones | Final clone list | Clone | +37 | Clones for low-abundance developmental genes with known ORF |
Fig. 2. Workflow of full-ORF pipeline showing novelties. Boxes show the schematic workflow of the geneDistiller pipeline for the analysis and definition of full-ORF clones from a large collection. Colour blocks show major sections of the process. Ovals indicate important additions or updates made in this work; the two most important conceptual novelties (vi, xi) are described in the text. The other improvements are detailed in Section 2.
Fig. 4. Full-ORF clone coverage of KH gene loci. (A) Proportion of KH loci covered by one or more full-ORF clones. (B) Size distribution of KH2010 loci covered or not by full-ORF clones. (C) Coverage relative to transcript abundance (EST count from all C. intestinalis cDNA libraries). (D) Full-ORF coverage of regulatory developmental genes.
Over- and under-represented GO Slim (v1.2, 2008) terms in the KH2010 gene loci associated with one or more full-ORF clones, with corrected p-values < 0.01 (see Section 2). The comparison uses only gene loci with associated GO terms: n = number of genes in the whole comparison set with this GO term, and x = number of covered genes with the same GO term.
| GO ID | Corrected p-value | x | n | GO Slim term |
|---|---|---|---|---|
| Over-represented | | | | |
| 5622 | 5.35E−17 | 1183 | 1323 | Intracellular |
| 5737 | 2.55E−16 | 536 | 574 | Cytoplasm |
| 8152 | 1.72E−09 | 2471 | 2911 | Metabolic process |
| 43226 | 1.72E−09 | 722 | 810 | Organelle |
| 166 | 5.45E−08 | 1044 | 1199 | Nucleotide binding |
| 9058 | 1.79E−07 | 611 | 688 | Biosynthetic process |
| 6139 | 1.66E−06 | 514 | 578 | Nucleobase, nucleoside, nucleotide and nucleic acid metabolic process |
| 44238 | 6.06E−06 | 1786 | 2108 | Primary metabolic process |
| 5634 | 1.62E−05 | 315 | 349 | Nucleus |
| 6412 | 1.81E−05 | 205 | 222 | Translation |
| 15031 | 1.16E−04 | 177 | 192 | Protein transport |
| 3824 | 2.34E−04 | 2680 | 3216 | Catalytic activity |
| 5783 | 3.57E−04 | 48 | 48 | Endoplasmic reticulum |
| 6350 | 5.90E−04 | 57 | 58 | Transcription |
| 3723 | 1.68E−03 | 141 | 154 | RNA binding |
| 5654 | 2.92E−03 | 37 | 37 | Nucleoplasm |
| 5840 | 3.14E−03 | 135 | 148 | Ribosome |
| 8135 | 6.01E−03 | 33 | 33 | Translation factor activity, nucleic acid binding |
| 16043 | 6.82E−03 | 152 | 169 | Cellular component organisation |
| Under-represented | | | | |
| 4872 | 4.85E−26 | 159 | 296 | Receptor activity |
| 4871 | 1.83E−18 | 209 | 346 | Signal transducer activity |
| 30246 | 2.30E−11 | 105 | 180 | Carbohydrate binding |
| 5216 | 2.30E−11 | 55 | 108 | Ion channel activity |
| 5509 | 2.79E−07 | 270 | 388 | Calcium ion binding |
| 5576 | 4.96E−04 | 115 | 169 | Extracellular region |
| 3700 | 5.73E−04 | 167 | 237 | Transcription factor activity |
| 5215 | 6.52E−04 | 378 | 509 | Transporter activity |
| 30528 | 8.03E−04 | 171 | 241 | Transcription regulator activity |
| 6811 | 2.27E−03 | 176 | 245 | Ion transport |
| 5578 | 3.35E−03 | 11 | 23 | Proteinaceous extracellular matrix |
| 3774 | 5.55E−03 | 42 | 66 | Motor activity |
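The statistical test behind these p-values is described in Section 2, which is not reproduced here. As an illustration only, the sketch below assumes a hypergeometric test, a common choice for this kind of over- and under-representation analysis. Here x and n follow the table caption, while `total_genes` (all loci with GO terms) and `total_covered` (those covered by a full-ORF clone) are hypothetical parameter names whose values are not given in this record. No multiple-testing correction is applied.

```python
from math import comb


def hypergeom_tails(x, n, total_covered, total_genes):
    """Return (P[X >= x], P[X <= x]) for X ~ Hypergeometric.

    Assumed model (not confirmed by the source): total_covered genes are
    drawn without replacement from total_genes, of which n carry the GO term;
    X counts how many drawn genes carry the term.
    """
    lowest = max(0, total_covered - (total_genes - n))
    highest = min(n, total_covered)
    denom = comb(total_genes, total_covered)

    def pmf_numerator(i):
        # Ways to pick i term genes and the remainder from non-term genes.
        return comb(n, i) * comb(total_genes - n, total_covered - i)

    upper = sum(pmf_numerator(i) for i in range(max(x, lowest), highest + 1))
    lower = sum(pmf_numerator(i) for i in range(lowest, min(x, highest) + 1))
    return upper / denom, lower / denom


if __name__ == "__main__":
    # Toy numbers only, chosen for illustration; they are not from the paper.
    over_p, under_p = hypergeom_tails(x=4, n=4, total_covered=5, total_genes=10)
    print(over_p, under_p)
```

A small upper-tail probability would suggest over-representation of the term among covered genes, and a small lower-tail probability under-representation.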
Fig. 5. Ciona disease orthologs. (A) Human disease-associated genes represented by Ciona intestinalis orthologs and full-ORF clones. Numbers of Ciona orthologs and full-ORF clones are depicted for five human diseases affecting neural or muscular tissue. Disease-associated genes and disease complexes are from an integrated interactome (Lage et al., 2008) and contain potentially conserved functional modules to be analysed in simpler Ciona embryos. (B) Conservation of functionally relevant domains in Ciona despite little overall sequence conservation. Dotpath (EMBOSS) of human BACE-1 (GI:6912266) to orthologous protein sequences of zebrafish (GI:45387815), Ciona (KH.L156.2.v4.A.SL1-2) and nematode (GI:17549909).
The following terms used in this study are explained explicitly to avoid possible confusion with similar terms used elsewhere.