Literature DB >> 22121212

NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy.

Kim D Pruitt¹, Tatiana Tatusova, Garth R Brown, Donna R Maglott.

Abstract

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,00 organisms, 2.4 × 0(6) genomic records, 13 × 10(6) proteins and 2 × 10(6) RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).

Entities: Chemical Disease Gene Species

Mesh：

Year: 2011 PMID： 22121212 PMCID： PMC3245008 DOI： 10.1093/nar/gkr1079

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

RefSeq integrates an organisms’ genomic, transcript and protein sequence with descriptive feature annotation and bibliographic information (1,2). National Center for Biotechnology Information (NCBI) builds RefSeq from sequence data available in public archival sequence databases of the International Nucleotide Sequence Database Collaboration (INSDC, including the DNA Data Bank of Japan, the European Nucleotide Archive and GenBank). Unique features of the RefSeq collection includes its broad taxonomic scope, reduced redundancy, informative cross-links between nucleic acid and protein records (both curated and computationally derived) and daily curation and maintenance. Data linkages include names, protein domains, orthologs, Enzyme Commission (E.C.) numbers, phenotypes and disease. Curation and maintenance reflect new information and enable the RefSeq collection to support numerous research directions, including associating sequence with phenotype, providing a stable and consistent coordinate system to report clinical variation, comparative genomics and evolutionary studies. The RefSeq collection is a critical element of additional resources at NCBI, including dbSNP, dbVar, Gene, Genomes, Protein Clusters and Map Viewer, enabling the integration of these resources within and among organisms. The RefSeq database is a product of NCBI, a division of the National Library of Medicine at the US National Institutes of Health. Records are freely available by multiple methods, including Internet query, FTP downloads, BLAST or scripted query using NCBI's E-Utilities. A comprehensive FTP release is available on a bi-monthly schedule with incremental daily updates provided between releases. RefSeq records can be identified by a distinct accession format which includes an underscore (‘_’) at the third position. More information is available online (http://www.ncbi.nlm.nih.gov/books/NBK21091/).

GROWTH OF THE REFSEQ DATA SET

The comprehensive bi-monthly RefSeq release continues to grow as new genome and transcript sequence become publicly available. To support the needs of different research communities, the release is provided both comprehensively in the ‘complete’ directory and based on general taxonomic groups, mitochondrial or plastid genomes or plasmid molecules. Release 49 (September 2011) includes records from 16 248 species representing 13 137 813 protein records. Table 1 indicates an annual growth of 49.7 and 14.5% in the number of organisms and the number of accessions, respectively. Records included in the release incorporate over 200 million feature annotation links (denoted ‘db_xref=’) to 60 different Web-based resources. These links allow navigation to related information from these resources, including those within NCBI, for e.g. Gene (3), the Conserved Domain database [CDD (4)], dbSNP (5) and externally, including nomenclature groups, model organism databases, protein-focused resources and many more. Links are managed by collaboration and propagation from the INSDC records upon which the RefSeq is based.

Table 1.

Annual Growth of the RefSeq release

Release directory	Number of organisms			Number of records
Release directory	Release 43^a	Release 49	Increase (%)	Release 43	Release 49	Increase (%)
Complete	10 854	16 248	49.7	15 934 055	18 236 994	14.5
Fungi	280	301	7.5	1 178 671	1 319 842	12.0
Invertebrate	637	754	18.4	1 993 670	2 232 026	12.0
Microbial	5585	10 346	85.2	9 031 974	10 711 822	18.6
mitochondrion	2266	2654	17.1	34 688	40 664	17.2
Plant	182	229	25.8	817 648	842 720	3.1
Plasmid	952	1061	11.4	160 065	191 018	19.3
Plastid	186	233	25.3	16 908	21 103	24.8
Protozoa	134	146	9.0	932 990	956 479	2.5
Vertebrate_mammalian	327	354	8.3	1 492 157	1 587 895	6.4
Vertebrate_other	1120	1334	19.1	398 084	483 449	21.4
Viral	2250	2745	22.0	87 759	101 664	15.8

aRelease 43 included data available on 7 September 2010; release 49 included data available on 5 September 2011.

Annual Growth of the RefSeq release aRelease 43 included data available on 7 September 2010; release 49 included data available on 5 September 2011. Microbial organisms as a group account both for the greatest number of organisms and accessions in Release 49 (Table 2) and displayed the most significant annual growth in number of organisms (85.2%; Table 1). Note, however, that the number of microbial group accessions increased by only 18.6%. This value is skewed downward relative to the growth in the number of organisms or RNA records. Release 49 actually saw a 156% increase in the number of microbial RNA records (data not shown); this reflects activity of the RefSeq Targeted Locus project (http://www.ncbi.nlm.nih.gov/genomes/static/refseqtarget.html), whose mandate is to provide a single representative 16S ribosomal RNA sequence for bacterial and archaeal genomes and strains. Release 49 included 6949 organisms with a single record, 5680 organisms with more than one but fewer than 100 accessions and 184 organisms with more than 10 000 accessions.

Table 2.

Distribution of RefSeq release 49 by ftp directory

Release directory	Percent of total
Release directory	Organisms	Accessions
Fungi	1.9	7.2
Invertebrate	4.6	12.2
Microbial	63.7	58.7
Mitochondrion	16.3	0.2
Plant	1.4	4.6
Plasmid	6.5	1.0
Plastid	1.4	0.1
Protozoa	0.9	5.2
Vertebrate_mammalian	2.2	8.7
Vertebrate_other	8.2	2.7
Viral	16.9	0.6

Distribution of RefSeq release 49 by ftp directory

STATUS OF CURATING HUMAN REFSEQ RECORDS

NCBI staff actively curate several subsets of the RefSeq collection for Homo sapiens. Curation improves multiple aspects of the human RefSeq collection by (i) providing quality reference sequence records for genomic regions, transcripts and proteins; (ii) maintaining and expanding functionally relevant information integrated into both RefSeq records and NCBI's Gene database; (iii) communicating and coordinating with international curation groups to generate a unified, consistent view of human genes and their primary products (see the Consensus CDS (CCDS) project, http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi); and (iv) supporting the scientific community in response to suggestions, questions or error reports.

Genomic regions

RefSeq provides region-specific genomic region records for non-transcribed pseudogenes and for the RefSeqGene project (http://www.ncbi.nlm.nih.gov/refseq/rsg/). Pseudogene loci are defined through collaboration with the HUGO Gene Nomenclature Committee [HGNC (6)] downloaded from Pseudogene.org [http://pseudogene.org/ (7)], or defined by RefSeq curation staff when reviewing transcripts having more than one high-quality alignment to the human genome. Curation involves defining the length and location of the pseudogene locus, determining whether it is transcribed and providing a link between the pseudogene locus and a functional homolog. As an example, please see NG_002746.2, which represents a eukaryotic translation initiation factor pseudogene and the ‘General gene information’ section of its Gene record (GeneID 1986) where a link to the related functional gene (EIF5A, GeneID 1984) is provided. Additionally, putative exon regions are annotated on the pseudogene RefSeq record based on alignment to a RefSeq transcript of the functional homolog. The number of non-transcribed pseudogene records increased by 7.7% in the past year. RefSeqGene, as part of the international Locus Reference Genomic initiative [LRG (8)], provides stable, gene-specific human genomic sequence records for reporting sequence variation in medical records and locus-specific databases (see http://www.ncbi.nlm.nih.gov/refseq/rsg/). The RefSeqGene and LRG records often represent explicitly only a subset of the known mRNA and coding regions. Identification of the sequences to use as standards depends on evaluation by the user base, but usually corresponds to the RefSeq transcript and protein records that have already been curated and reviewed by RefSeq and CCDS staff. If a question arises, review of evidence from the stakeholder, the literature and sequence evidence may result in an update to a RefSeqGene record, revision of the reference transcripts and proteins annotated on the RefSeqGene record or additional splice variants to represent in the RefSeq collection before assigning the LRG identifier. Transcript variants and protein isoforms not part of the explicit annotation are represented by alignments which can be seen in NCBI's graphical display. The number of RefSeqGene records grew by 25.8% in the past year. To request a RefSeqGene for a gene, contact rsgene@ncbi.nlm.nih.gov.

Transcripts and proteins

Transcripts and proteins are an important focus for curation at NCBI. This data set has two major categories—the ‘model’ subset generated directly by NCBI's genome annotation pipeline and the ‘known’ subset maintained independently of the genome annotation process using a combination of automated analyses and manual review. These subsets can be distinguished by the accession number prefix (models begin with ‘X’) as well as by the annotation in the COMMENT block of the record when viewed in flatfile format (see http://www.ncbi.nlm.nih.gov/books/NBK21091/ for additional details). Records in the model subset are created or updated only upon whole-genome re-annotation but may be removed from the collection following manual review between such updates. The human model set was reviewed last year, which resulted in revision of the gene type designation (e.g. protein-coding, non-coding, pseudogene, etc.), replacement of model records with known records and removal of records considered to be insufficiently supported. For example, 2068 model RefSeqs met the evidence criteria to be replaced by a known RefSeq type in the 1-year period between releases 43 and 49. A series of status codes are annotated on the known RefSeq data set to indicate information about the level of curation (these codes are not applicable for the model record subset). Records with a status of either ‘validated’ or ‘reviewed’ are considered to be curated. As of RefSeq release 49, 92.5% of the human protein coding transcripts (and their associated proteins) are tracked with a curated status, and 57.2% of the non-coding transcripts are tracked with a curated status (Table 3). This includes curation to add or update over 7500 human transcript records between releases 43 and 49. RefSeq continues to represent protein-coding regions that are considered to be full length, and transcripts that are considered to be at least near complete. Transcripts that are obviously partial are not represented but are presented in NCBI's genome browser (Map Viewer).

Table 3.

Current status of human transcripts and proteins

Type	Accessions in Release 49
Type	Total	Curated^a	Percent curated
Known protein-coding transcripts	31 933	29 531	92.5
Model protein-coding transcripts	1118	NA
Known non-coding transcripts	5932	3396	57.2
Model non-coding transcripts	3762	NA
Total	42 745	32 927	77.0

aCurated records have a review status of ‘Validated’ or ‘Reviewed’ which is not applied to model RefSeq records.

Current status of human transcripts and proteins aCurated records have a review status of ‘Validated’ or ‘Reviewed’ which is not applied to model RefSeq records. NCBI staff coordinates closely with other major databases and curation groups to maximize consistent data representation at NCBI and other web sites. The Consensus Coding Sequence collaboration [http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi (9)] is a central hub for curation of protein-coding loci and all members must agree to updates affecting the genomic coordinates of a CDS. Ambiguous or complex cases are discussed among CCDS members in light of available supporting evidence and published reports to achieve consensus on the likely annotated protein product. CCDS review often includes communication and coordination with the HGNC, UniProt or the Genome Reference Consortium [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/ (10)] when annotation cannot be well represented due to a concern over the sequence represented in the reference human genome assembly. The human CCDS data set was updated twice in the past year, adding 2126 CCDS IDs for 456 genes. The database currently includes 26 473 distinct human protein identifiers corresponding to 18 471 genes and is available at http://www.ncbi.nlm.nih.gov/CCDS/. In addition to ongoing review of transcripts and proteins, several changes affecting the content of RefSeq transcript and protein records were implemented recently. These include: new policy for management of protein names; new policy for management of readthrough (conjoined) transcripts; expanded representation of non-coding RNAs to include microRNAs; and expanded feature annotation for transcript and protein records.

Protein names

RefSeq is in the process of adopting UniProtKB (11) guidelines for protein naming (http://www.uniprot.org/docs/gennameprot) for both prokaryotic and eukaryotic records. Implementation of this policy is relatively new and remains variable across taxa. For prokaryotic proteins, protein name curation occurs in conjunction with NCBI's Protein Clusters resource (http://www.ncbi.nlm.nih.gov/proteinclusters). For vertebrate proteins associated in NCBI's Gene database with a related Swiss-Prot accession number, the Swiss-Prot preferred name is used verbatim. For some vertebrate RefSeq records, a distinct protein name continues to be provided if a Swiss-Prot name is not available, or one does not adhere to the revised UniProt guidelines.

Readthrough transcripts

Transcripts representing exons from what are typically considered to be neighboring, yet distinct loci pose a particular challenge for curation. This category of data resulted in conflicting annotation, requiring extensive discussion among the CCDS collaboration members, as well as HGNC. NCBI and the CCDS collaboration recently defined an annotation policy that tracks most readthrough transcript as a distinct locus as it is not solely the product of either of the two underlying loci. This approach improves consistency of the gene extent annotated for the two smaller loci while also reflecting the transcriptional complexity of the region. RefSeq opts to annotate the readthrough transcript when it appears to be full length and there is a minimum of two independent lines of support for the readthrough event. The Gene database reports this transcriptional complexity in the ‘General gene information’ section of the record. Examples can be found using available Gene queries [e.g. ‘readthrough parent’ (properties)]. In the last year, RefSeq curators reviewed the data reported in the ConjoinG database (12) to expand representation of this type. RefSeq currently tracks 120 human loci as an instantiated readthrough locus tracked with a distinct GeneID (for example, NME1-NME2, Gene ID 654364), and 358 loci with any type of readthrough association which includes reports of readthrough transcripts that do not meet the requirements to represent in RefSeq (for example, GeneID 6728).

Non-protein-coding transcripts

RefSeq representation of non-coding RNAs grew by 30% between September 2010 and 2011. Non-coding transcripts are managed in part by downloading other publicly available data sets including that available from miRBase http://www.mirbase.org/ (13). MicroRNAs, represented in RefSeq as the stem–loop precursor product with feature annotation of the functional RNA product, currently number 6848 records, 1409 of which are for human. Other types of functional RNAs, for instance small nucleolar RNAs, may be initially defined by HGNC, or by NCBI curation staff. Long non-coding transcripts, including splice variants, have been added to the database as well. Some of these include transcripts considered unlikely to encode a protein for several reasons, including non-sense-mediated decay issues, inhibitory alternative open reading frames or an alternate splice variant for which there are concerns about significant protein truncation or numerous upstream ORFs. There are currently 6057 non-coding transcripts for 4421 human genes, including 1134 non-coding transcripts for 797 human protein-coding loci.

Expanded feature annotation

RefSeq feature annotation has expanded to indicate localization or function, and to highlight details of the sequence considered during manual review. For many years, RefSeq protein records have displayed protein annotation computed by NCBI's CDD group, including protein domains, intra- or inter-molecular binding sites and metal-binding sites. While some signal peptide, mature peptide and other features have been manually annotated by NCBI staff, these features are now also propagated from UniProtKB/Swiss-Prot records and predicted by SignalP 4.0 (14). Criteria for propagation include a high-quality alignment and confirmation that the sequence and feature length are consistently maintained. Feature types that are already provided by the CDD group are not propagated. The source of the annotated feature is indicated with a ‘/inference’ qualifier which cites SignalP4.0 or with a note ‘propagated from UniProtKB/Swiss-Prot’ and an indication of the Swiss-Prot accession number (for example, see NP_001028219.1, NP_001171622.1). Protein-coding RefSeq transcripts now display evidence for the 5′ completeness of the annotated coding region following a computational search for an in-frame stop codon upstream of the annotated start codon. Identified stop codons are annotated with a misc_feat (see Figure 1 and NM_145204.3). Non-protein-coding transcripts, when provided for a protein-coding gene, are also computationally analysed to identify an open reading frame that shares the same start codon as a protein-coding transcript (for that gene) but that renders the transcript a candidate for non-sense-mediated mRNA (NMD) decay. Putative NMD ORFs are annotated with a misc_feat (see NR_040252.1). Misc_feat annotation is also added to a non-coding transcript if it contains an upstream ORF likely to be inhibitory to translation of the predicted ORF (see CCDS documentation at http://www.ncbi.nlm.nih.gov/CCDS/docs/CCDS-AUGguidelines.pdf and NR_003253.1 for an example).

Figure 1.

NM_145204.3 is shown in the Nucleotide Graphical display format. The display was configured to show the six-frame translation track restricted to the sense strand, and to add three markers highlighting the annotated upstream in-frame stop codon, the translation initiation codon and a second in-frame AUG codon located further downstream. The observation of a stop codon upstream of, and in the same reading frame, suggests the annotated CDS is 5′ complete.

GENOME ANNOTATION POLICY

Assembled genome sequence data are selected for inclusion in RefSeq based on several considerations including quality and completeness of a sequencing project, phylogenetic distance, model organism status, impact on disease and health studies, and identified utility to targeted research projects. Over the last several years, NCBI has developed robust whole genome annotation pipelines for both prokaryotes and eukaryotes. The prokaryotic pipeline has matured to the point that it is routinely offered as a service to submitters if genome sequences are submitted to GenBank without annotation. RefSeq genome representation for prokaryotes is currently managed by propagating annotation from the primary genome data in GenBank, calculating annotation for RefSeq when annotation is not available in GenBank within 6 months of submission of the genome, supplemented with curation to represent rRNAs, tRNAs and to provide improved protein names based on curated protein clusters. Eukaryotic genomes are managed based on general taxonomic groups, availability of submitted annotation and existence of an active model organism database. For mammalian genomes included in RefSeq, genome annotation is always provided using the NCBI annotation pipeline. For other organisms, RefSeq genome annotation is propagated from GenBank when available. Otherwise, annotation is provided using NCBI's eukaryotic annotation pipeline if a quality genome assembly is submitted with no intent to annotate, or if annotation is not submitted within a reasonable period of time, or is considered to need updating and the research group is not able to maintain it over time. When possible, the RefSeq group works with research communities and model organism databases to provide a single standard annotation for the reference genome; examples include Drosophila melanogaster, Arabidopsis thaliana, Anopheles gambiae, Saccharomyces cerevisiae and Escherichia coli K-12.

FUTURE DIRECTIONS

One of the short-term goals for the RefSeq group is to be more transparent with regard to curation decisions and support evidence used. The expanded transcript feature annotation mentioned above is a small step in this direction that will be further extended. Curators and programmers supporting the vertebrate RefSeq data set store a wide variety of gene and transcript data attributes that are potentially of use to consumers of the RefSeq data set. Attribute categories, and available stored data, are being reviewed and a subset will be selected for reporting in a structured comment on RefSeq records. Examples of stored attributes include reported RNA editing, potential alternate translation initiation codons, loci reported to be imprinted, use of non-AUG initiation codons and more. In addition, the vertebrate RefSeq group is working on reporting more explicit information about the underlying support for the exon combination that is instantiated in a RefSeq transcript record, to highlight proteins that are highly conserved, and to provide a comparison utility to evaluate putative functional consequence among transcript variants.

FUNDING

Funding for open access charge: Intramural Research Program of the National Institutes of Health, National Library of Medicine. Conflict of interest statement. None declared.

14 in total

1. dbSNP: the NCBI database of genetic variation.

Authors: S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI.

Authors: K D Pruitt; K S Katz; H Sicotte; D R Maglott
Journal: Trends Genet Date: 2000-01 Impact factor: 11.639

3. SignalP 4.0: discriminating signal peptides from transmembrane regions.

Authors: Thomas Nordahl Petersen; Søren Brunak; Gunnar von Heijne; Henrik Nielsen
Journal: Nat Methods Date: 2011-09-29 Impact factor: 28.547

4. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043

5. Locus Reference Genomic sequences: an improved basis for describing human DNA variants.

Authors: Raymond Dalgleish; Paul Flicek; Fiona Cunningham; Alex Astashyn; Raymond E Tully; Glenn Proctor; Yuan Chen; William M McLaren; Pontus Larsson; Brendan W Vaughan; Christophe Béroud; Glen Dobson; Heikki Lehväslaiho; Peter Em Taschner; Johan T den Dunnen; Andrew Devereau; Ewan Birney; Anthony J Brookes; Donna R Maglott
Journal: Genome Med Date: 2010-04-15 Impact factor: 11.117

6. Expression of conjoined genes: another mechanism for gene regulation in eukaryotes.

Authors: Tulika Prakash; Vineet K Sharma; Naoki Adati; Ritsuko Ozawa; Naveen Kumar; Yuichiro Nishida; Takayoshi Fujikake; Tadayuki Takeda; Todd D Taylor
Journal: PLoS One Date: 2010-10-12 Impact factor: 3.240

7. genenames.org: the HGNC resources in 2011.

Authors: Ruth L Seal; Susan M Gordon; Michael J Lush; Mathew W Wright; Elspeth A Bruford
Journal: Nucleic Acids Res Date: 2010-10-06 Impact factor: 16.971

8. Modernizing reference genome assemblies.

Authors: Deanna M Church; Valerie A Schneider; Tina Graves; Katherine Auger; Fiona Cunningham; Nathan Bouk; Hsiu-Chuan Chen; Richa Agarwala; William M McLaren; Graham R S Ritchie; Derek Albracht; Milinn Kremitzki; Susan Rock; Holland Kotkiewicz; Colin Kremitzki; Aye Wollam; Lee Trani; Lucinda Fulton; Robert Fulton; Lucy Matthews; Siobhan Whitehead; Will Chow; James Torrance; Matthew Dunn; Glenn Harden; Glen Threadgold; Jonathan Wood; Joanna Collins; Paul Heath; Guy Griffiths; Sarah Pelan; Darren Grafham; Evan E Eichler; George Weinstock; Elaine R Mardis; Richard K Wilson; Kerstin Howe; Paul Flicek; Tim Hubbard
Journal: PLoS Biol Date: 2011-07-05 Impact factor: 8.029

9. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2010-11-28 Impact factor: 16.971

10. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation.

Authors: John E Karro; Yangpan Yan; Deyou Zheng; Zhaolei Zhang; Nicholas Carriero; Philip Cayting; Paul Harrrison; Mark Gerstein
Journal: Nucleic Acids Res Date: 2006-11-11 Impact factor: 16.971

601 in total

Review 1. Methods for biological data integration: perspectives and challenges.

Authors: Vladimir Gligorijević; Nataša Pržulj
Journal: J R Soc Interface Date: 2015-11-06 Impact factor: 4.118

2. Global diversity and biogeography of deep-sea pelagic prokaryotes.

Authors: Guillem Salazar; Francisco M Cornejo-Castillo; Verónica Benítez-Barrios; Eugenio Fraile-Nuez; X Antón Álvarez-Salgado; Carlos M Duarte; Josep M Gasol; Silvia G Acinas
Journal: ISME J Date: 2015-08-07 Impact factor: 10.302

3. The genome of the alga-associated marine flavobacterium Formosa agariphila KMM 3901T reveals a broad potential for degradation of algal polysaccharides.

Authors: Alexander J Mann; Richard L Hahnke; Sixing Huang; Johannes Werner; Peng Xing; Tristan Barbeyron; Bruno Huettel; Kurt Stüber; Richard Reinhardt; Jens Harder; Frank Oliver Glöckner; Rudolf I Amann; Hanno Teeling
Journal: Appl Environ Microbiol Date: 2013-08-30 Impact factor: 4.792

4. Neuronal Kmt2a/Mll1 histone methyltransferase is essential for prefrontal synaptic plasticity and working memory.

Authors: Mira Jakovcevski; Hongyu Ruan; Erica Y Shen; Aslihan Dincer; Behnam Javidfar; Qi Ma; Cyril J Peter; Iris Cheung; Amanda C Mitchell; Yan Jiang; Cong L Lin; Venu Pothula; A Francis Stewart; Patricia Ernst; Wei-Dong Yao; Schahram Akbarian
Journal: J Neurosci Date: 2015-04-01 Impact factor: 6.167

5. Integrative analysis of many RNA-seq datasets to study alternative splicing.

Authors: Wenyuan Li; Chao Dai; Shuli Kang; Xianghong Jasmine Zhou
Journal: Methods Date: 2014-02-28 Impact factor: 3.608

6. Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm.

Authors: Supatcha Lertampaiporn; Chinae Thammarongtham; Chakarida Nukoolkit; Boonserm Kaewkamnerdpong; Marasri Ruengjitchatchawalya
Journal: Nucleic Acids Res Date: 2014-04-25 Impact factor: 16.971

Review 7. CELFish ways to modulate mRNA decay.

Authors: Irina Vlasova-St Louis; Alexa M Dickson; Paul R Bohjanen; Carol J Wilusz
Journal: Biochim Biophys Acta Date: 2013-01-15

8. Impact of human pathogenic micro-insertions and micro-deletions on post-transcriptional regulation.

Authors: Xinjun Zhang; Hai Lin; Huiying Zhao; Yangyang Hao; Matthew Mort; David N Cooper; Yaoqi Zhou; Yunlong Liu
Journal: Hum Mol Genet Date: 2014-01-16 Impact factor: 6.150

9. Phylogenomic reconstruction of archaeal fatty acid metabolism.

Authors: Daria V Dibrova; Michael Y Galperin; Armen Y Mulkidjanian
Journal: Environ Microbiol Date: 2014-04 Impact factor: 5.491

10. Epstein-Barr virus and human herpesvirus 6 detection in a non-Hodgkin's diffuse large B-cell lymphoma cohort by using RNA sequencing.

Authors: Michael J Strong; Tina O'Grady; Zhen Lin; Guorong Xu; Melody Baddoo; Chris Parsons; Kun Zhang; Christopher M Taylor; Erik K Flemington
Journal: J Virol Date: 2013-09-18 Impact factor: 5.103