Gill Bejerano (1), David Haussler, Mathieu Blanchette. (1) Center for Biomolecular Science and Engineering, Baskin School of Engineering, University of California in Santa Cruz, Santa Cruz, CA 95064, USA. jill@soe.ucsc.edu
Abstract
MOTIVATION: It is currently believed that the human genome contains about twice as much sequence in functional non-coding regions as in protein-coding genes, yet our understanding of these regions is very limited.
RESULTS: We examine the intersection between syntenically conserved sequences in the human, mouse and rat genomes, and sequence similarities within the human genome itself, in search of families of non-protein-coding elements. For this purpose we develop a graph-theoretic clustering algorithm, akin to the highly successful methods used to elucidate protein sequence family relationships. The algorithm is applied to a highly filtered set of about 700 000 human-rodent evolutionarily conserved regions that do not resemble any known coding sequence and together encompass 3.7% of the human genome. From these, we obtain roughly 12 000 non-singleton clusters, dense in significant sequence similarities. Further analysis of genomic location, evidence of transcription and RNA secondary structure reveals many clusters to be significantly homogeneous in one or more of these characteristics. This subset of the highly conserved non-protein-coding elements in the human genome thus contains rich family-like structures, which merit in-depth analysis.
AVAILABILITY: Supplementary material to this work is available at http://www.soe.ucsc.edu/~jill/dark.html
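The abstract does not spell out the clustering algorithm itself, so the following is only a minimal sketch of the general idea of graph-based clustering of conserved regions. It assumes that regions are graph nodes, that a significant pairwise similarity is an edge, that clusters are connected components, and that density is the fraction of possible within-cluster pairs that are linked. The function name `cluster_by_similarity`, the input tuple layout and the `max_evalue` cutoff are hypothetical, and the authors' actual method may differ substantially.

```python
from collections import defaultdict


def cluster_by_similarity(n_regions, hits, max_evalue=1e-3):
    """Illustrative sketch only, not the published algorithm.

    n_regions  -- number of conserved regions, indexed 0..n_regions-1
    hits       -- iterable of (i, j, evalue) pairwise similarities
    max_evalue -- hypothetical significance cutoff for keeping an edge

    Returns a sorted list of (members, density) for non-singleton clusters.
    """
    parent = list(range(n_regions))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep each significant, non-self similarity once as an undirected edge.
    edges = set()
    for i, j, evalue in hits:
        if i != j and evalue <= max_evalue:
            edges.add((min(i, j), max(i, j)))
            parent[find(i)] = find(j)

    # Collect the members of each connected component.
    members = defaultdict(list)
    for region in range(n_regions):
        members[find(region)].append(region)

    # Count the edges that fall inside each component.
    edge_count = defaultdict(int)
    for a, _b in edges:
        edge_count[find(a)] += 1

    clusters = []
    for root, nodes in members.items():
        if len(nodes) < 2:
            continue  # singletons are not reported as clusters
        possible_pairs = len(nodes) * (len(nodes) - 1) // 2
        clusters.append((nodes, edge_count[root] / possible_pairs))
    return sorted(clusters)


if __name__ == "__main__":
    # Toy input: regions 0-2 are mutually similar, 3-4 are similar,
    # and the 4-5 hit is too weak to count as significant.
    toy_hits = [(0, 1, 1e-10), (1, 2, 1e-8), (0, 2, 1e-5),
                (3, 4, 1e-4), (4, 5, 0.5)]
    for nodes, density in cluster_by_similarity(6, toy_hits):
        print(nodes, density)
```

Connected components are the simplest choice here and tend to merge loosely related groups through chains of weak links. Reporting a density next to each cluster makes such chained clusters easy to spot and filter out.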