GENCODE reference annotation for the human and mouse genomes.

Adam Frankish¹, Mark Diekhans², Anne-Maud Ferreira³, Rory Johnson^4,5, Irwin Jungreis^6,7, Jane Loveland¹, Jonathan M Mudge¹, Cristina Sisu^8,9, James Wright¹⁰, Joel Armstrong², If Barnes¹, Andrew Berry¹, Alexandra Bignell¹, Silvia Carbonell Sala¹¹, Jacqueline Chrast³, Fiona Cunningham¹, Tomás Di Domenico¹², Sarah Donaldson¹, Ian T Fiddes², Carlos García Girón¹, Jose Manuel Gonzalez¹, Tiago Grego¹, Matthew Hardy¹, Thibaut Hourlier¹, Toby Hunt¹, Osagie G Izuogu¹, Julien Lagarde¹¹, Fergal J Martin¹, Laura Martínez¹², Shamika Mohanan¹, Paul Muir^13,14, Fabio C P Navarro⁸, Anne Parker¹, Baikang Pei⁸, Fernando Pozo¹², Magali Ruffier¹, Bianca M Schmitt¹, Eloise Stapleton¹, Marie-Marthe Suner¹, Irina Sycheva¹, Barbara Uszczynska-Ratajczak¹⁵, Jinuri Xu⁸, Andrew Yates¹, Daniel Zerbino¹, Yan Zhang^8,16, Bronwen Aken¹, Jyoti S Choudhary¹⁰, Mark Gerstein^8,17,18, Roderic Guigó^11,19, Tim J P Hubbard²⁰, Manolis Kellis^6,7, Benedict Paten², Alexandre Reymond³, Michael L Tress¹², Paul Flicek¹.

Abstract

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

Year: 2019 PMID: 30357393 PMCID: PMC6323946 DOI: 10.1093/nar/gky955

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The GENCODE consortium produces foundational reference genome annotation for the human and mouse genomes as well as tools and data to maintain and improve these annotations. Our overall goal is to identify and classify, with high accuracy, all gene features in the human and mouse genomes based on defined biological evidence and to make these annotations freely available for the benefit of biomedical research and genome interpretation. The GENCODE project was founded in 2003 as part of the pilot phase of the ENCODE project to provide reference quality manual gene annotation for the 30Mb (∼1%) of the reference human genome targeted by the ENCODE pilot (1–3). In 2007, we expanded our scope to the whole human genome as the ENCODE project did the same (4,5). In 2012, we began annotating the mouse reference genome to the same standards as human, while continuing to improve the existing gene annotation in both species via targeted reinvestigation of loci flagged by external users and internal QC pipelines. Today, the GENCODE consortium is a long-running partnership of manual annotation, computational biology and experimental groups including four of the founding groups (HAVANA, CRG, Yale and UCSC) and three groups that joined in 2007 (Ensembl, MIT and CNIO). Our gene annotations are regularly released as the Ensembl/GENCODE gene sets. The gene sets are comprehensive and include protein-coding and non-coding loci including alternatively spliced isoforms and pseudogenes. To produce the annotations, we leverage computational and experimental methods to identify new genes and new transcript isoforms, directing manual annotation to regions requiring expert investigation. The Ensembl/GENCODE annotations are the default human and mouse annotation for the Ensembl project (6), while the UCSC Genome Browser (7) uses the human annotation as default and the mouse annotation as a secondary resource until the mouse clone-by-clone annotation is complete (see below). 
For each versioned release, the underlying genome annotation is exactly the same whether it is accessed at Ensembl, UCSC or https://www.gencodegenes.org, although there are minor differences in presentation associated with genome assembly patches and representation of the pseudoautosomal regions on the X and Y chromosomes. We also provide subsets of the annotation as described below. For simplicity, we will here refer to the annotation holistically as GENCODE. GENCODE is the reference annotation of choice adopted by many large international consortia including ENCODE, GTEx (8), the International Cancer Genome Consortium (ICGC) (9), component projects of the International Human Epigenome Consortium (10), the 1000 Genomes Project (11), the Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD) (12) and the Human Cell Atlas (HCA) (13).

GENCODE ANNOTATION METHODS AND RESULTS

The GENCODE consortium annotates protein-coding genes, pseudogenes, long non-coding RNAs (lncRNAs) and small non-coding RNAs (sncRNAs). We define protein-coding genes as loci where the weight of available evidence supports the presence of a coding sequence (CDS). Evidence for a CDS may come from high-throughput experimental assays, the demonstration of physiological function in the research literature, the observation of homology to a known protein-coding gene, or the interpretation of evolutionary conservation data. Pseudogenes are sequences derived from protein-coding genes, containing disabling mutations such as in-frame stop codons, frameshifting indels, truncations or insertions, or for which there is no evidence of transcription. lncRNA genes are identified by a combination of transcriptional evidence and a lack of potential to be assigned as protein-coding. We do not absolutely require lncRNA genes to be longer than 200 bp, but very few annotated lncRNAs fall below this threshold, as we also require annotated lncRNAs to be free of secondary structures found in known functional sncRNAs. Currently, sncRNAs are almost entirely annotated by computational pipelines that use homology to known sncRNA sequences and predicted secondary structure to identify functional copies. Our annotation processes use primary transcript and proteomics data, evolutionary conservation, computational methods and curated public databases such as UniProt (14). These data are integrated using a combination of expert manual annotators and computational methods to identify regions of the genome with genic potential, annotate the exon-intron structures of transcripts identified at the locus under investigation and assign a functional classification to both the individual transcript and the locus. Broad functional classes (referred to as ‘biotypes’) of protein-coding, pseudogene, lncRNA and sncRNA are assigned as described above. More detailed functional categories are also added. 
For example, at the locus level we describe the provenance of pseudogenes as processed (derived via retrotransposition), unprocessed (defined by a genome duplication event) or unitary (arising from the lineage specific disruption of an ancestral protein-coding gene). At the transcript level we define transcripts belonging to protein-coding loci as protein-coding, nonsense mediated decay (NMD) (containing a premature stop codon believed likely to lead to the transcript being targeted by the nonsense-mediated decay pathway) or retained intron (containing sequence that is intronic in other transcripts from the locus). Following the structural and functional classification of transcripts, a subset of GENCODE annotation is subject to targeted experimental validation as described below to ensure consistent high quality of the gene annotation. To cater for a variety of use cases, we create a number of annotation sets. Examples of these are our ‘GENCODE comprehensive’ and ‘GENCODE basic’ gene sets. GENCODE comprehensive includes the complete set of annotations including partial transcripts (i.e. transcripts that are not full length, but represent a unique splice form based on available evidence) and biotypes such as NMD. GENCODE basic is a subset of GENCODE comprehensive that contains only transcripts with full-length CDS. For non-coding loci, GENCODE basic includes the smallest number of transcripts that cover 80% of the exonic features, while ensuring all loci are represented by at least 1 transcript. Computational methods add additional information. For example, APPRIS, described in more detail below, identifies the most likely functional translations at protein-coding loci and TSL (transcript support level) calculates the amount and quality of supporting evidence for each transcript.
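The GENCODE basic rule for non-coding loci described above (the fewest transcripts covering 80% of exonic features, with at least one transcript per locus) can be illustrated with a greedy selection. This is a minimal sketch, not the consortium's implementation: the function name `select_basic_transcripts`, the representation of exonic features as string IDs, and the greedy strategy itself are assumptions made for illustration. A greedy choice approximates, but does not guarantee, the smallest possible set.

```python
from typing import Dict, List, Set


def select_basic_transcripts(
    transcripts: Dict[str, Set[str]], coverage: float = 0.8
) -> List[str]:
    """Greedily pick transcripts until `coverage` of exonic features is covered.

    `transcripts` maps a (hypothetical) transcript ID to the set of exonic
    feature IDs it contains. At least one transcript is always returned for
    a non-empty locus.
    """
    if not transcripts:
        return []
    all_features = set().union(*transcripts.values())
    target = coverage * len(all_features)
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = sorted(transcripts)  # sorted so ties break deterministically
    while remaining and (not chosen or len(covered) < target):
        # Take the transcript that adds the most not-yet-covered features.
        best = max(remaining, key=lambda t: len(transcripts[t] - covered))
        chosen.append(best)
        covered |= transcripts[best]
        remaining.remove(best)
    return chosen


# Toy locus: five exonic features spread over three transcripts.
locus = {
    "T1": {"e1", "e2", "e3"},
    "T2": {"e3", "e4"},
    "T3": {"e5"},
}
selected = select_basic_transcripts(locus)
```

In this toy locus, T1 covers three of five features and a second transcript is needed to reach the 80% threshold.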

Manual annotation

The GENCODE gene set is created by merging the results of manual and computational gene annotation methods. Manual gene annotation has two major modes of operation: clone-by-clone and targeted annotation. ‘Clone-by-clone’ annotation involves ‘walking’ across a genomic region, investigating the sequence, aligned expression data and computational predictions for each BAC clone. In doing so, an expert annotator investigates all possible genic features and considers all possible annotations and biotypes simultaneously. We believe this approach carries substantial advantages. For example, the decision to annotate a locus as protein-coding or pseudogenic benefits from being able to weigh both possibilities in light of all available evidence. This process helps prevent false positive and false negative misclassifications. Targeted annotation is designed to answer specific questions such as ‘is there an unannotated protein-coding gene in this position?’ Ranked target lists are generated by computational analysis based, for example, on transcriptomic data, shotgun proteomic data or conservation measures. Over the last two years mouse annotation has been dominated by the clone-by-clone approach while the human genome has been refined entirely via targeted reannotation except for the annotation of human assembly patches and haplotypes released by the Genome Reference Consortium (15), which take a clone-by-clone approach. Over the last two years, we have focused on two broad areas: completing the first pass manual annotation across the entire mouse reference genome and a dedicated effort to improve the annotation of protein-coding genes in human and mouse. We have completed the annotation of novel protein-coding genes, lncRNAs and pseudogenes, plus QC and updating previous annotation where necessary for mouse chromosomes 9, 10, 11, 12, 13, 14, 15, 16 and 17. These updates bring the fraction of the mouse genome with completed first pass manual annotation to approximately 97%. 
In addition, we have continued to work with the NCBI and Mouse Genome Informatics project at the Jackson Laboratory to resolve annotation differences for protein-coding, pseudogene and lncRNA loci. For protein-coding genes this is under the umbrella of the Consensus Coding Sequence (CCDS) project (16). We have also manually investigated unannotated regions of high protein-coding potential identified by whole genome analysis using PhyloCSF (17) (a tool described in more detail below). In human, this led to the addition of 144 novel protein-coding genes and 271 pseudogenes (of which 42 were unitary pseudogenes). In mouse, we annotated orthologous loci for all but 11 of the 144 human protein-coding genes. We have also revisited the annotation of all olfactory receptor loci in both human and mouse, using RNAseq data to define 5′ and 3′ UTR sequences for ∼1400 loci. In human we have also targeted a ‘deep dive’ manual reannotation of genes on clinical panels for paediatric neurological disorders to identify missing functional alternative splicing. Incorporating second and third generation transcriptomic data, we reannotated ∼190 genes and added more than 3600 alternatively spliced transcripts, including ∼1400 entirely novel exons and an additional ∼30kb of CDS. We have also completed an effort to capture all recently described unannotated microexons (18) into GENCODE, and further added an additional 146 novel microexons mined from public SLRseq data (19). As part of the CCDS collaboration with RefSeq, we have checked a large subset of human loci where there was disagreement over gene biotype. Similarly, we have checked all UniProt manually annotated and reviewed (i.e. Swiss-Prot) accessions that lack an equivalent in GENCODE. As a result, we added 32 novel protein-coding loci to GENCODE and rejected more than 200 putative coding loci. 
Finally, we are manually reviewing genes previously annotated as protein-coding, but with weak or no support based on a method incorporating UniProt, APPRIS, PhyloCSF, Ensembl comparative genomics, RNA-seq, mass spectrometry and variation data (20,21). Of the 821 loci investigated to date, 54 have had their coding status removed while a further 110 potentially dubious cases remain under review. The approach taken is reflected in the kinds of updates captured in the annotation. For example, the targeted reannotation in human leads to the annotation of few novel protein-coding loci but many novel transcripts at updated protein-coding and lncRNA loci. Conversely, in mouse the emphasis on clone-by-clone annotation identifies many more novel loci and transcripts across a broader range of biotypes (Figure 1).

Figure 1.

New and updated manually annotated genes and transcripts from July 2016 to June 2018. For both human (left) and mouse (right), the panels show the numbers of completely new genes and transcripts, updated genes and transcripts and the total number of manually added or edited genes and transcripts for each of four broad categories of annotation. A new gene annotation can represent a completely de novo locus with no overlap with pre-existing annotation or the reclassification of an existing complex locus into multiple loci to better represent the biology of the locus inferred from transcriptomic and/or proteomic data. A new transcript represents the annotation of a unique exon-intron structure, including novel alternative splicing at an annotated locus. Updated genes and transcripts represent pre-existing loci or transcript models that have been edited to improve the representation of biotype (e.g. changed from lncRNA to protein-coding) or structure (e.g. by extension, addition of novel exons).

Computational annotation of small RNAs

We annotate small non-coding RNAs (sncRNAs) using a variety of mechanisms. Specifically, miRNA annotations are imported directly from miRBase (22), while tRNAs are identified ab initio using tRNAscan-SE (23), although they are not included directly in the gene set. For other classes of sncRNA, including small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs) and small Cajal body-specific RNAs (scaRNAs), we use a homology-based computational pipeline (24), which first compares sequences of known RNA families in Rfam (25) to the genome using BLAST (26). This initial step reduces the genomic search space and excludes sequences with sub-optimal alignments to the genome. We define putative sncRNA models after clustering top BLAST hits and evaluating these predictions by performing sequence and structure searches against covariance models in the Infernal suite of tools (27).
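The clustering of BLAST hits mentioned above can be pictured as merging overlapping genomic intervals and keeping the best-scoring hit from each cluster. The sketch below is illustrative only: the `Hit` record, the use of closed coordinates, the overlap criterion, and the "best score per cluster" rule are assumptions, and strand handling and the subsequent Infernal search are left out.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A hypothetical BLAST hit on the genome (closed coordinates)."""
    chrom: str
    start: int
    end: int
    score: float


def cluster_hits(hits: List[Hit]) -> List[List[Hit]]:
    """Group hits that overlap on the same chromosome into clusters."""
    clusters: List[List[Hit]] = []
    cluster_chrom = None
    cluster_end = -1
    for hit in sorted(hits, key=lambda h: (h.chrom, h.start)):
        if hit.chrom == cluster_chrom and hit.start <= cluster_end:
            # Overlaps the current cluster: extend it.
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            # No overlap: start a new cluster.
            clusters.append([hit])
            cluster_chrom, cluster_end = hit.chrom, hit.end
    return clusters


def top_hits(hits: List[Hit]) -> List[Hit]:
    """Keep the highest-scoring hit from each cluster."""
    return [max(cluster, key=lambda h: h.score) for cluster in cluster_hits(hits)]


hits = [
    Hit("chr1", 100, 200, 50.0),
    Hit("chr1", 150, 250, 80.0),
    Hit("chr1", 400, 500, 30.0),
    Hit("chr2", 120, 180, 60.0),
]
candidates = top_hits(hits)
```

Each retained candidate would then be a natural input for the covariance-model search described in the text.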

Pseudogenes

Pseudogene annotations across 18 mouse strains were generated using a combination of manual annotation liftover and computational methods. Additionally, we were able to annotate 88 new human and 131 new mouse unitary pseudogenes relative to each other. Amongst the strains we find roughly 20 unitary pseudogenes per strain. We identified nearly 3000 ancestral pseudogenes conserved across all strains. Meanwhile, ∼20% of the pseudogenes in each strain are strain specific. In line with previous results in human, 15% of pseudogenes exhibit transcriptional activity (bioRxiv: https://doi.org/10.1101/386656).

EXPERIMENTAL ANNOTATION APPROACHES

lncRNA annotation using capture long Seq

Determining the precise boundaries and the exonic structure of low-abundance transcripts, such as lncRNAs, is challenging. We previously showed that 3′ and 5′ boundaries of lncRNAs annotated in GENCODE V7 (April 2011) were less supported by CAGE and PET tags than those of protein-coding genes, even when accounting for differences in expression (28). Methods to assemble transcript sequences from short sequence reads have also been shown to produce poor results when used to resolve the exonic structure of lncRNAs (29,30). To improve lncRNA annotation, we developed the RNA Capture Long Seq (CLS) method (31). Here, probes are first designed against targeted lncRNAs (or suspected, unannotated lncRNA loci). Full-length cDNAs generated from diverse cell types are then captured, resulting in cDNA libraries that are highly enriched for the targeted lncRNAs. Libraries are then sequenced using long-read sequencing technologies (31,32). Our initial efforts created a comprehensive capture library targeting the set of intergenic GENCODE lncRNAs in human and mouse, and used it in a set of matched human and mouse tissues (31). This resulted in novel lncRNA transcripts at 3574 loci in human and 561 in mouse. The long transcript sequences obtained, which often correspond to complete 5′-to-3′ RNA molecules, substantially informed manual annotation. Indeed, CLS produces near manual-quality full-length transcript models at high-throughput scales (32). Our current efforts aim to include samples across a more diverse panel of tissues, such as fetal timepoints.

Proteomics

Proteomic mass spectrometry datasets are a powerful resource contributing to the validation and annotation of protein-coding genes and transcripts. In GENCODE, we use proteomics data as an additional layer of evidence when defining the structure and protein-coding potential of a genomic locus. We apply strict criteria to the peptide evidence we consider from mass spectrometry datasets (33–35) to minimize the incorporation of false positive, ambiguous or variant peptide species. In highly curated genomes such as human, finding additional protein-coding genes from mass spectrometry experiments requires considerable scale of data and effort, with correspondingly small returns. Our experimental efforts in GENCODE therefore incorporate targeted proteomics experiments, specific experimental designs and synthetically generated peptides to find these elusive protein-coding genes.

Annotation validation and RACEseq

We used RT-PCR amplification followed by highly multiplexed sequencing readout (36) to assess the quality of the annotations. This method evaluates low confidence transcribed loci (novel or putative). Splice site loci were systematically experimentally tested in eight tissues (brain, heart, kidney, liver, lung, spleen, skeletal muscle, and testis) by RT-PCR-seq (36). From human GENCODE versions 3 to 19, a total of 18 132 splice junctions were analyzed and experimentally tested. Seventy-eight percent of all assessed junctions were confirmed through experimental validation. We assessed the quality of the mouse annotation in the same way. A total of 3956 splice junctions from GENCODE versions M2 and M4 were tested with a validation rate of 53%. Finally, to assess the completeness of the annotations we amplified and sequenced the transcripts of 527 deeply annotated human protein-coding genes, which are routinely used for diagnostic tests by the UK Genetic Testing Network (UKGTN). We performed 5′ and 3′ nested RACE in seven different tissues (brain, testis, heart, kidney, liver, lung, and spleen) followed by long-read sequencing, which revealed 10 380 novel splice junction candidates.

GENCODE ANNOTATION TOOLS

Comparative annotation toolkit

We developed the Comparative Annotation Toolkit (CAT) (37) to leverage the GENCODE annotations of mouse and human to annotate laboratory mouse strains (38) and great apes (39,40). CAT uses whole genome alignments from Cactus (41) to project GENCODE annotations from mouse or human to related species, and then performs a variety of filtering and clean-up steps to generate a high quality annotation set for these other genomes. The GENCODE M11 mouse annotation was used with CAT to annotate 16 laboratory mouse strains, and these annotations are available in Ensembl. Over 20 000 protein-coding and 12 000 non-coding genes were comparatively annotated in each lab strain. Novel gene predictions using Comparative Augustus (42) also found an average of 22 new loci in classical strains, including the discovery of the gene Efcab3-like in the reference mouse, which was included in subsequent GENCODE releases. Additionally, the GENCODE 27 (August 2017) human annotation set was used to annotate chimpanzee, gorilla and orangutan, and these annotations were incorporated into GenBank, with over 19 000 protein-coding and 36 000 non-coding genes comparatively annotated in all of the great apes.

APPRIS

The APPRIS Database (http://appris-tools.org) (43) was developed to provide annotations for alternative splice variants. APPRIS also determines principal splice isoforms based on cross-species conservation and the conservation of protein structure and function. Most coding genes have a single dominant protein isoform and this main isoform is almost always the APPRIS principal isoform (44). APPRIS maintains up-to-date annotations for the GENCODE and RefSeq reference sets and has been extended to the UniProtKB proteome and to six model species as well as human and mouse (45). Technical improvements include incremental improvements to the core modules that make up the APPRIS pipeline, the implementation of a UCSC Track Hub to make annotation access easier, and Docker images to allow the execution of the annotation pipeline (45). APPRIS is an integral part of the pipeline for the prediction of potential non-coding genes (20). For the GENCODE 27 (August 2017) human annotation the completed pipeline flagged 2432 genes.

PhyloCSF

Comparative genomics is one of the most powerful tools available for distinguishing protein-coding genomic regions. Previously, we developed PhyloCSF to support annotation of coding sequences based on the alignment of multiple genome sequences (17). As described above, we combine whole-genome PhyloCSF data with experimental evidence and expert manual annotation to detect novel coding sequences. The workflow begins with PhyloCSF scores computed on every codon in the human genome in each of the six reading frames; applies a Hidden Markov Model to these scores to find candidate coding intervals; excludes intervals previously annotated as coding or pseudogene, or antisense to such intervals, as well as very short intervals; and uses a Support Vector Machine to prioritize the resulting ‘Novel PhyloCSF Regions’. We have created publicly available PhyloCSF track hubs for viewing the whole-genome PhyloCSF data and novel PhyloCSF Regions from human and mouse in the UCSC and Ensembl genome browsers.
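The Hidden Markov Model step of the workflow above can be pictured with a toy two-state model (non-coding and coding) decoded with the Viterbi algorithm over per-codon scores in a single reading frame. This is a sketch under stated assumptions rather than the published model: it treats each score as log-odds in favour of the coding state, and the two-state topology, the `switch_prob` value, and the function name are illustrative choices. The exclusion of annotated intervals and the Support Vector Machine ranking are not shown.

```python
import math
from typing import List, Sequence, Tuple


def viterbi_coding_intervals(
    scores: Sequence[float], switch_prob: float = 0.01
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) codon-index intervals decoded as coding.

    State 0 is non-coding (log-emission 0), state 1 is coding (log-emission
    equal to the codon score, assumed to behave like a log-odds ratio).
    """
    if not scores:
        return []
    stay, switch = math.log(1.0 - switch_prob), math.log(switch_prob)
    best = [0.0, float(scores[0])]  # best path score ending in each state
    back: List[List[int]] = []      # back-pointers for codons 1..n-1
    for x in scores[1:]:
        emit = (0.0, float(x))
        new_best, pointers = [], []
        for state in (0, 1):
            cands = [best[p] + (stay if p == state else switch) for p in (0, 1)]
            prev = 0 if cands[0] >= cands[1] else 1
            new_best.append(cands[prev] + emit[state])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back the most likely state path.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse runs of the coding state into intervals.
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(path)))
    return intervals


toy_scores = [-5, -5, -5, 8, 8, 8, 8, -5, -5, -5]
candidate_intervals = viterbi_coding_intervals(toy_scores)
```

The transition penalty is what keeps a single positive codon in a negative background from being called coding, which is the practical benefit of smoothing the raw scores with an HMM instead of thresholding each codon independently.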

PseudoPipe

PseudoPipe identifies and annotates pseudogenes across the genome (46). It takes as input an organism's protein-coding gene set and searches for homology across the genome using BLAST. Hits overlapping functional genes are removed and the remaining hits are then assembled into pseudogene annotations. Each annotation is also assigned a parent gene, the functional paralog that gave rise to the pseudogene, as well as a biotype (processed, duplicated, or ambiguous). Unitary pseudogenes are also identified via PseudoPipe by using a different organism's protein-coding gene set as the input. We inform our annotation with results from RetroFinder (47) and RCPedia (48). In addition to our core annotation files, further information is available at http://www.pseudogene.org. These computational annotations are then combined with manual annotations in order to produce the full pseudogene complement. Pseudogene annotations are given a confidence level based on the intersection with manual annotations. Annotations detected by both the computational pipelines and manual annotators are assigned level 1, those only detected by manual annotators are given level 2, and the consensus annotations detected by PseudoPipe and RetroFinder are given level 3 and made available in a separate annotation file at https://www.gencodegenes.org.
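The confidence-level rule described above can be written as a small decision function. This is an illustrative sketch: the function name and boolean inputs are hypothetical, and it assumes that support from either computational pipeline alongside manual annotation is enough for level 1, which the text does not state explicitly. A call made by only one pipeline and no annotator is not covered by the description, so the sketch returns `None` for it.

```python
from typing import Optional


def pseudogene_level(manual: bool, pseudopipe: bool, retrofinder: bool) -> Optional[int]:
    """Assign a confidence level from the sources that detected a pseudogene."""
    if manual and (pseudopipe or retrofinder):
        return 1  # manual annotation confirmed by a computational pipeline (assumed: either one)
    if manual:
        return 2  # detected by manual annotators only
    if pseudopipe and retrofinder:
        return 3  # computational consensus without manual support
    return None   # case not described in the text


levels = [
    pseudogene_level(manual=True, pseudopipe=True, retrofinder=False),
    pseudogene_level(manual=True, pseudopipe=False, retrofinder=False),
    pseudogene_level(manual=False, pseudopipe=True, retrofinder=True),
    pseudogene_level(manual=False, pseudopipe=True, retrofinder=False),
]
```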

DATA ACCESS

Versioned GENCODE gene sets are currently released approximately four times a year for mouse and twice a year for human. This asymmetric update pattern reflects the fact that the first pass of the human annotation was completed in GENCODE 15 (January 2013), while the mouse first pass is approaching completion (expected for GENCODE M20) and therefore has been the subject of more intensive annotation. The most recent release of the human geneset is GENCODE 29 (October 2018), while the most recent mouse update is GENCODE M19 (October 2018). Each release incorporates the continuous updates arising from expert manual annotation. Figure 2 shows the increase in the numbers of genes and transcripts in human and mouse GENCODE releases over the past two years. The human genesets look relatively static, although headline figures do not capture updates made to existing annotation and the balancing effect of both adding and removing loci during a release cycle. In mouse however, there is clear growth in the numbers of both genes and transcripts driven predominantly by the addition of lncRNAs and pseudogenes.

Figure 2.

Annotation statistics for human and mouse GENCODE releases from July 2016 to June 2018, encompassing human releases GENCODE 25–28 and mouse releases M10 to M18. The panels on the left show the total number of genes by broad biotype (protein-coding, lncRNA, pseudogene and sncRNA) for each release for human and mouse respectively and panels on the right show the total numbers of genes and transcripts of all biotypes.

Extensive data resources for current and archival GENCODE releases are available at https://www.gencodegenes.org. As described above, the GENCODE gene sets are available as default in the Ensembl genome browser and also accessible via the UCSC genome browser. Other interfaces include the Ensembl FTP site (ftp://ftp.ensembl.org/pub/), which includes gene sets in GFF3, GenBank and GTF formats and full download of the complete Ensembl databases. More complex and customizable gene set queries can be created via the Ensembl Biomart (https://www.ensembl.org/biomart/). Programmatic access to the GENCODE gene sets is possible via the extensive Ensembl Perl API and the language-agnostic Ensembl REST API. Programmatic access facilitates advanced genome-wide analysis such as retrieval of supporting features and associated gene trees. Examples of REST endpoint usage and starter scripts in different languages are at https://rest.ensembl.org. GENCODE has been created exclusively on the GRCh38 human assembly since GENCODE 20 (August 2014). However, versions of selected releases since then that have been projection mapped from GRCh38 to GRCh37 are available at UCSC and from https://www.gencodegenes.org. Referred to as the ‘lift37’ annotation set, these data help identify genes where the annotations may have changed between GRCh37 and GRCh38. Because accurate projections are difficult to generate, the ‘lift37’ annotation set is not considered official reference annotation and only minimal support is available.
We welcome questions and feedback from the community directly via the helpdesks at https://www.gencodegenes.org, Ensembl and UCSC. In addition, the Ensembl and UCSC outreach activities annually reach thousands of researchers via workshops at institutions and meetings, web-based training forums and ‘how-to’ guides focused on using the genome browsers and making best use of their features and data.
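As an illustration of the programmatic access via the Ensembl REST API mentioned above, the sketch below builds a request for the `lookup/id` endpoint and decodes the JSON response using only the Python standard library. The helper names `lookup_url` and `fetch_gene` are hypothetical, and the example identifier ENSG00000139618 (human BRCA2) is chosen for illustration; the fields returned should be checked against the current documentation at https://rest.ensembl.org.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id: str, expand: bool = False) -> str:
    """Build the URL for a lookup-by-stable-ID request returning JSON."""
    params = {"content-type": "application/json"}
    if expand:
        # Ask the server to include child features (e.g. transcripts).
        params["expand"] = "1"
    return f"{REST_SERVER}/lookup/id/{stable_id}?{urlencode(params, safe='/')}"


def fetch_gene(stable_id: str, expand: bool = False, timeout: float = 30.0) -> dict:
    """Fetch and decode the record for one stable ID (needs network access)."""
    with urlopen(lookup_url(stable_id, expand), timeout=timeout) as response:
        return json.load(response)


example_url = lookup_url("ENSG00000139618")
```

With network access, `fetch_gene("ENSG00000139618")` returns a dictionary describing the gene; reading keys with `.get()` is a cautious way to handle fields that may differ between API versions.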

CONCLUSION

The GENCODE consortium continues to improve the quality of the reference gene annotation in human and mouse. We have integrated cutting-edge developments in the technology and scientific understanding of genome biology into our annotation workflows to improve the representation of existing loci and extend annotation coverage via the addition of entirely novel loci and alternatively spliced transcripts. While the high quality of our existing transcript annotation is extensively supported by both public data and data generated within the consortium, the abundance of evidence from new transcriptomic and proteomic datasets makes it clear that the annotation is not yet complete.

48 in total

Review 1. Computational genomics of noncoding RNA genes.

Authors: Sean R Eddy
Journal: Cell Date: 2002-04-19 Impact factor: 41.582

2. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

3. The ENCODE (ENCyclopedia Of DNA Elements) Project.

Authors:
Journal: Science Date: 2004-10-22 Impact factor: 47.728

4. Cactus: Algorithms for genome multiple sequence alignment.

Authors: Benedict Paten; Dent Earl; Ngan Nguyen; Mark Diekhans; Daniel Zerbino; David Haussler
Journal: Genome Res Date: 2011-06-10 Impact factor: 9.043

5. PseudoPipe: an automated pseudogene identification pipeline.

Authors: Zhaolei Zhang; Nicholas Carriero; Deyou Zheng; John Karro; Paul M Harrison; Mark Gerstein
Journal: Bioinformatics Date: 2006-03-30 Impact factor: 6.937

6. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; 
Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather 
Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

7. International network of cancer genome projects.

Authors: Thomas J Hudson; Warwick Anderson; Axel Artez; Anna D Barker; Cindy Bell; Rosa R Bernabé; M K Bhan; Fabien Calvo; Iiro Eerola; Daniela S Gerhard; Alan Guttmacher; Mark Guyer; Fiona M Hemsley; Jennifer L Jennings; David Kerr; Peter Klatt; Patrik Kolar; Jun Kusada; David P Lane; Frank Laplace; Lu Youyong; Gerd Nettekoven; Brad Ozenberger; Jane Peterson; T S Rao; Jacques Remacle; Alan J Schafer; Tatsuhiro Shibata; Michael R Stratton; Joseph G Vockley; Koichi Watanabe; Huanming Yang; Matthew M F Yuen; Bartha M Knoppers; Martin Bobrow; Anne Cambon-Thomsen; Lynn G Dressler; Stephanie O M Dyke; Yann Joly; Kazuto Kato; Karen L Kennedy; Pilar Nicolás; Michael J Parker; Emmanuelle Rial-Sebbag; Carlos M Romeo-Casabona; Kenna M Shaw; Susan Wallace; Georgia L Wiesner; Nikolajs Zeps; Peter Lichter; Andrew V Biankin; Christian Chabannon; Lynda Chin; Bruno Clément; Enrique de Alava; Françoise Degos; Martin L Ferguson; Peter Geary; D Neil Hayes; Thomas J Hudson; Amber L Johns; Arek Kasprzyk; Hidewaki Nakagawa; Robert Penny; Miguel A Piris; Rajiv Sarin; Aldo Scarpa; Tatsuhiro Shibata; Marc van de Vijver; P Andrew Futreal; Hiroyuki Aburatani; Mónica Bayés; David D L Botwell; Peter J Campbell; Xavier Estivill; Daniela S Gerhard; Sean M Grimmond; Ivo Gut; Martin Hirst; Carlos López-Otín; Partha Majumder; Marco Marra; John D McPherson; Hidewaki Nakagawa; Zemin Ning; Xose S Puente; Yijun Ruan; Tatsuhiro Shibata; Michael R Stratton; Hendrik G Stunnenberg; Harold Swerdlow; Victor E Velculescu; Richard K Wilson; Hong H Xue; Liu Yang; Paul T Spellman; Gary D Bader; Paul C Boutros; Peter J Campbell; Paul Flicek; Gad Getz; Roderic Guigó; Guangwu Guo; David Haussler; Simon Heath; Tim J Hubbard; Tao Jiang; Steven M Jones; Qibin Li; Nuria López-Bigas; Ruibang Luo; Lakshmi Muthuswamy; B F Francis Ouellette; John V Pearson; Xose S Puente; Victor Quesada; Benjamin J Raphael; Chris Sander; Tatsuhiro Shibata; Terence P Speed; Lincoln D Stein; Joshua M Stuart; Jon W Teague; Yasushi Totoki; 
Tatsuhiko Tsunoda; Alfonso Valencia; David A Wheeler; Honglong Wu; Shancen Zhao; Guangyu Zhou; Lincoln D Stein; Roderic Guigó; Tim J Hubbard; Yann Joly; Steven M Jones; Arek Kasprzyk; Mark Lathrop; Nuria López-Bigas; B F Francis Ouellette; Paul T Spellman; Jon W Teague; Gilles Thomas; Alfonso Valencia; Teruhiko Yoshida; Karen L Kennedy; Myles Axton; Stephanie O M Dyke; P Andrew Futreal; Daniela S Gerhard; Chris Gunter; Mark Guyer; Thomas J Hudson; John D McPherson; Linda J Miller; Brad Ozenberger; Kenna M Shaw; Arek Kasprzyk; Lincoln D Stein; Junjun Zhang; Syed A Haider; Jianxin Wang; Christina K Yung; Anthony Cros; Anthony Cross; Yong Liang; Saravanamuttu Gnaneshan; Jonathan Guberman; Jack Hsu; Martin Bobrow; Don R C Chalmers; Karl W Hasel; Yann Joly; Terry S H Kaan; Karen L Kennedy; Bartha M Knoppers; William W Lowrance; Tohru Masui; Pilar Nicolás; Emmanuelle Rial-Sebbag; Laura Lyman Rodriguez; Catherine Vergely; Teruhiko Yoshida; Sean M Grimmond; Andrew V Biankin; David D L Bowtell; Nicole Cloonan; Anna deFazio; James R Eshleman; Dariush Etemadmoghadam; Brooke B Gardiner; Brooke A Gardiner; James G Kench; Aldo Scarpa; Robert L Sutherland; Margaret A Tempero; Nicola J Waddell; Peter J Wilson; John D McPherson; Steve Gallinger; Ming-Sound Tsao; Patricia A Shaw; Gloria M Petersen; Debabrata Mukhopadhyay; Lynda Chin; Ronald A DePinho; Sarah Thayer; Lakshmi Muthuswamy; Kamran Shazand; Timothy Beck; Michelle Sam; Lee Timms; Vanessa Ballin; Youyong Lu; Jiafu Ji; Xiuqing Zhang; Feng Chen; Xueda Hu; Guangyu Zhou; Qi Yang; Geng Tian; Lianhai Zhang; Xiaofang Xing; Xianghong Li; Zhenggang Zhu; Yingyan Yu; Jun Yu; Huanming Yang; Mark Lathrop; Jörg Tost; Paul Brennan; Ivana Holcatova; David Zaridze; Alvis Brazma; Lars Egevard; Egor Prokhortchouk; Rosamonde Elizabeth Banks; Mathias Uhlén; Anne Cambon-Thomsen; Juris Viksna; Fredrik Ponten; Konstantin Skryabin; Michael R Stratton; P Andrew Futreal; Ewan Birney; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; 
Sancha Martin; Jorge S Reis-Filho; Andrea L Richardson; Christos Sotiriou; Hendrik G Stunnenberg; Giles Thoms; Marc van de Vijver; Laura van't Veer; Fabien Calvo; Daniel Birnbaum; Hélène Blanche; Pascal Boucher; Sandrine Boyault; Christian Chabannon; Ivo Gut; Jocelyne D Masson-Jacquemier; Mark Lathrop; Iris Pauporté; Xavier Pivot; Anne Vincent-Salomon; Eric Tabone; Charles Theillet; Gilles Thomas; Jörg Tost; Isabelle Treilleux; Fabien Calvo; Paulette Bioulac-Sage; Bruno Clément; Thomas Decaens; Françoise Degos; Dominique Franco; Ivo Gut; Marta Gut; Simon Heath; Mark Lathrop; Didier Samuel; Gilles Thomas; Jessica Zucman-Rossi; Peter Lichter; Roland Eils; Benedikt Brors; Jan O Korbel; Andrey Korshunov; Pablo Landgraf; Hans Lehrach; Stefan Pfister; Bernhard Radlwimmer; Guido Reifenberger; Michael D Taylor; Christof von Kalle; Partha P Majumder; Rajiv Sarin; T S Rao; M K Bhan; Aldo Scarpa; Paolo Pederzoli; Rita A Lawlor; Massimo Delledonne; Alberto Bardelli; Andrew V Biankin; Sean M Grimmond; Thomas Gress; David Klimstra; Giuseppe Zamboni; Tatsuhiro Shibata; Yusuke Nakamura; Hidewaki Nakagawa; Jun Kusada; Tatsuhiko Tsunoda; Satoru Miyano; Hiroyuki Aburatani; Kazuto Kato; Akihiro Fujimoto; Teruhiko Yoshida; Elias Campo; Carlos López-Otín; Xavier Estivill; Roderic Guigó; Silvia de Sanjosé; Miguel A Piris; Emili Montserrat; Marcos González-Díaz; Xose S Puente; Pedro Jares; Alfonso Valencia; Heinz Himmelbauer; Heinz Himmelbaue; Victor Quesada; Silvia Bea; Michael R Stratton; P Andrew Futreal; Peter J Campbell; Anne Vincent-Salomon; Andrea L Richardson; Jorge S Reis-Filho; Marc van de Vijver; Gilles Thomas; Jocelyne D Masson-Jacquemier; Samuel Aparicio; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; Hendrik G Stunnenberg; Laura van't Veer; Douglas F Easton; Paul T Spellman; Sancha Martin; Anna D Barker; Lynda Chin; Francis S Collins; Carolyn C Compton; Martin L Ferguson; Daniela S Gerhard; Gad Getz; Chris Gunter; Alan Guttmacher; Mark Guyer; D Neil Hayes; 
Eric S Lander; Brad Ozenberger; Robert Penny; Jane Peterson; Chris Sander; Kenna M Shaw; Terence P Speed; Paul T Spellman; Joseph G Vockley; David A Wheeler; Richard K Wilson; Thomas J Hudson; Lynda Chin; Bartha M Knoppers; Eric S Lander; Peter Lichter; Lincoln D Stein; Michael R Stratton; Warwick Anderson; Anna D Barker; Cindy Bell; Martin Bobrow; Wylie Burke; Francis S Collins; Carolyn C Compton; Ronald A DePinho; Douglas F Easton; P Andrew Futreal; Daniela S Gerhard; Anthony R Green; Mark Guyer; Stanley R Hamilton; Tim J Hubbard; Olli P Kallioniemi; Karen L Kennedy; Timothy J Ley; Edison T Liu; Youyong Lu; Partha Majumder; Marco Marra; Brad Ozenberger; Jane Peterson; Alan J Schafer; Paul T Spellman; Hendrik G Stunnenberg; Brandon J Wainwright; Richard K Wilson; Huanming Yang
Journal: Nature Date: 2010-04-15 Impact factor: 49.962

8. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions.

Authors: Michael F Lin; Irwin Jungreis; Manolis Kellis
Journal: Bioinformatics Date: 2011-07-01 Impact factor: 6.937

9. GENCODE: producing a reference annotation for ENCODE.

Authors: Jennifer Harrow; France Denoeud; Adam Frankish; Alexandre Reymond; Chao-Kung Chen; Jacqueline Chrast; Julien Lagarde; James G R Gilbert; Roy Storey; David Swarbreck; Colette Rossier; Catherine Ucla; Tim Hubbard; Stylianos E Antonarakis; Roderic Guigo
Journal: Genome Biol Date: 2006-08-07 Impact factor: 13.583

10. Retrocopy contributions to the evolution of the human genome.

Authors: Robert Baertsch; Mark Diekhans; W James Kent; David Haussler; Jürgen Brosius
Journal: BMC Genomics Date: 2008-10-08 Impact factor: 3.969

859 in total

1. Molecular subgrouping of primary pineal parenchymal tumors reveals distinct subtypes correlated with clinical parameters and genetic alterations.

Authors: Elke Pfaff; Christian Aichmüller; Martin Sill; Damian Stichel; Matija Snuderl; Matthias A Karajannis; Martin U Schuhmann; Jens Schittenhelm; Martin Hasselblatt; Christian Thomas; Andrey Korshunov; Marina Rhizova; Andrea Wittmann; Anna Kaufhold; Murat Iskar; Petra Ketteler; Dietmar Lohmann; Brent A Orr; David W Ellison; Katja von Hoff; Martin Mynarek; Stefan Rutkowski; Felix Sahm; Andreas von Deimling; Peter Lichter; Marcel Kool; Marc Zapatka; Stefan M Pfister; David T W Jones
Journal: Acta Neuropathol Date: 2019-11-25 Impact factor: 17.088

2. Fluent genomics with plyranges and tximeta.

Authors: Stuart Lee; Michael Lawrence; Michael I Love
Journal: F1000Res Date: 2020-02-12

3. A Pharmacological Interactome between COVID-19 Patient Samples and Human Sensory Neurons Reveals Potential Drivers of Neurogenic Pulmonary Dysfunction.

Authors: Pradipta Ray; Andi Wangzhou; Nizar Ghneim; Muhammad Yousuf; Candler Paige; Diana Tavares-Ferreira; Juliet Mwirigi; Stephanie Shiers; Ishwarya Sankaranarayanan; Amelia McFarland; Sanjay Neerukonda; Steve Davidson; Gregory Dussor; Michael Burton; Theodore Price
Journal: SSRN Date: 2020-05-04

4. HiC-ACT: improved detection of chromatin interactions from Hi-C data via aggregated Cauchy test.

Authors: Taylor M Lagler; Armen Abnousi; Ming Hu; Yuchen Yang; Yun Li
Journal: Am J Hum Genet Date: 2021-02-04 Impact factor: 11.025

5. CARD9-Associated Dectin-1 and Dectin-2 Are Required for Protective Immunity of a Multivalent Vaccine against Coccidioides posadasii Infection.

Authors: Althea Campuzano; Hao Zhang; Gary R Ostroff; Lucas Dos Santos Dias; Marcel Wüthrich; Bruce S Klein; Jieh-Juen Yu; Humberto H Lara; Jose L Lopez-Ribot; Chiung-Yu Hung
Journal: J Immunol Date: 2020-05-01 Impact factor: 5.422

6. The Caenorhabditis elegans RIG-I Homolog DRH-1 Mediates the Intracellular Pathogen Response upon Viral Infection.

Authors: Jessica N Sowa; Hongbing Jiang; Lakshmi Somasundaram; Eillen Tecle; Guorong Xu; David Wang; Emily R Troemel
Journal: J Virol Date: 2020-01-06 Impact factor: 5.103

7. ARMOR: An Automated Reproducible MOdular Workflow for Preprocessing and Differential Analysis of RNA-seq Data.

Authors: Stephany Orjuela; Ruizhu Huang; Katharina M Hembach; Mark D Robinson; Charlotte Soneson
Journal: G3 (Bethesda) Date: 2019-07-09 Impact factor: 3.154

8. Long Noncoding RNAs in Host-Pathogen Interactions. (Review)

Authors: Federica Agliano; Vijay A Rathinam; Andrei E Medvedev; Sivapriya Kailasan Vanaja; Anthony T Vella
Journal: Trends Immunol Date: 2019-04-30 Impact factor: 16.687

9. Nonparametric expression analysis using inferential replicate counts.

Authors: Anqi Zhu; Avi Srivastava; Joseph G Ibrahim; Rob Patro; Michael I Love
Journal: Nucleic Acids Res Date: 2019-10-10 Impact factor: 16.971

10. Muscle-Specific Insulin Receptor Overexpression Protects Mice From Diet-Induced Glucose Intolerance but Leads to Postreceptor Insulin Resistance.

Authors: Guoxiao Wang; Yingying Yu; Weikang Cai; Thiago M Batista; Sujin Suk; Hye Lim Noh; Michael Hirshman; Pasquale Nigro; Mengyao Ella Li; Samir Softic; Laurie Goodyear; Jason K Kim; C Ronald Kahn
Journal: Diabetes Date: 2020-08-31 Impact factor: 9.461