Elmar Pruesse, Christian Quast, Katrin Knittel, Bernhard M Fuchs, Wolfgang Ludwig, Jörg Peplies, Frank Oliver Glöckner.
Abstract
Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction, nucleic acid based detection and quantification of microbial diversity. The ARB software suite with its corresponding rRNA datasets has been accepted by researchers worldwide as a standard tool for large scale rRNA analysis. However, the rapid increase of publicly available rRNA sequence data has recently hampered the maintenance of comprehensive and curated rRNA knowledge databases. A new system, SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence associated contextual information, have multiple taxonomic classifications, and the latest validly described nomenclature. Furthermore, two precompiled sequence datasets compatible with ARB are offered for download on the SILVA website: (i) the reference (Ref) datasets, comprising only high quality, nearly full length sequences suitable for in-depth phylogenetic analysis and probe design and (ii) the comprehensive Parc datasets with all publicly available rRNA sequences longer than 300 nucleotides suitable for biodiversity analyses. The latest publicly available database release 91 (August 2007) hosts 547 521 sequences split into 461 823 small subunit and 85 689 large subunit rRNAs.
Year: 2007 PMID: 17947321 PMCID: PMC2175337 DOI: 10.1093/nar/gkm864
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Sequence retrieval and processing for SILVA 91
| | SSU Parc | LSU Parc |
|---|---|---|
| Candidates | 900 573 | 417 217 |
| <300 Bases | 320 327 | 297 218 |
| >2% Ambiguities | 8018 | 2193 |
| >2% Homopolymers | 19 240 | 4772 |
| >5% Vector contamination | 14 973 | 2573 |
| Insufficient relatives | 49 063 | 13 081 |
| <300 Gene bases | 25 961 | 7510 |
| <30 Alignment quality or base pair score | 6583 | 3390 |
| Total sequences in Parcs | 461 823 | 85 689 |
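
The rows of the table above can be read as a cascade of threshold checks applied to each candidate sequence. The following is a minimal sketch of that reading, not the SILVA pipeline itself. The `SeqMetrics` record, its field names and the `rejection_reason` function are hypothetical; the thresholds are the ones named in the table rows. Two points are assumptions: "Insufficient relatives" is modelled as a plain boolean because the table does not say how it is computed, and the last row is interpreted as rejecting a sequence when either the alignment quality or the base pair score is below 30.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeqMetrics:
    """Hypothetical per-sequence metrics; names are illustrative only."""
    length: int            # total bases in the imported sequence
    ambig_pct: float       # percent ambiguous bases
    homop_pct: float       # percent bases in homopolymer stretches
    vector_pct: float      # percent vector contamination
    has_relatives: bool    # assumption: enough related reference sequences
    gene_bases: int        # aligned bases within the rRNA gene boundaries
    align_quality: float   # alignment quality value
    bp_score: float        # base pair score


def rejection_reason(m: SeqMetrics) -> Optional[str]:
    """Return the first table row that rejects the sequence, or None."""
    if m.length < 300:
        return "<300 Bases"
    if m.ambig_pct > 2.0:
        return ">2% Ambiguities"
    if m.homop_pct > 2.0:
        return ">2% Homopolymers"
    if m.vector_pct > 5.0:
        return ">5% Vector contamination"
    if not m.has_relatives:
        return "Insufficient relatives"
    if m.gene_bases < 300:
        return "<300 Gene bases"
    # Assumption: either value below 30 is enough to reject.
    if m.align_quality < 30 or m.bp_score < 30:
        return "<30 Alignment quality or base pair score"
    return None
```

A sequence for which `rejection_reason` returns `None` would, under these assumptions, be counted in the "Total sequences in Parcs" row; any other return value names the row it would be counted in.
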
Figure 1. Sequence length distribution of rRNA genes in the SILVA 91 SSU database. The dotted line represents the sequence distribution directly after importing, the solid line after quality checks and alignment. The large number of sequences around 100 bases reflects the first impact of tag sequencing approaches.
Figure 2. Sequence length distribution in the SILVA 91 LSU database. The dotted line represents the sequence distribution directly after importing, the solid line after quality checks and alignment. The large number of sequences around 100 bases reflects the first impact of tag sequencing approaches.
Description of database fields in ARB files exported from SILVA: ARB-specific fields and entries
| ARB field name | Owned by | Description |
|---|---|---|
| aligned | User | User-defined entry, e.g. name and date of the person who aligned the sequence |
| ambig | ARB | Ambiguities calculated in ARB using ‘count ambiguities’ |
| ARB_color | ARB | Stores the information about sequence colors |
| name | ARB | Internal ARB database ID, do not change! |
| nuc | ARB | Number of nucleotides; calculated by ARB using ‘count nucleotides’ |
| nuc_term | ARB | Number of nucleotides coding for the respective rRNA gene; calculated by ‘count nucleotides gene’ |
| remark | User | Field for remarks |
| tmp | ARB | Used by several ARB modules |
Description of database fields in ARB files exported from SILVA: fields and entries imported from EMBL
| ARB field name | EMBL field | Description |
|---|---|---|
| acc | AC | Accession number |
| ali_xx/data | sequence | Sequence information |
| author | RA | Reference author(s) |
| clone | FT/clone | Clone from which the sequence was obtained |
| collected_by | FT/collected_by | Name of the person who collected the specimen |
| collection_date | FT/collection_date | Date that the specimen was collected |
| country | FT/country | Geographical origin of sequenced sample |
| date | DT | Entry creation and update dates, separated by a semicolon |
| description | DE | Description |
| full_name | OS | Organism species |
| gene | FT/gene | Symbol of the gene corresponding to a sequence region |
| insdc | PR | The International Nucleotide Sequence Database Collaboration (INSDC) Project Identifier that has been assigned to the entry |
| isolate | FT/isolate | Individual isolate from which the sequence was obtained |
| isolation_source | FT/isolation_source | Describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived |
| journal | RL | Reference location |
| lat_lon | FT/lat_lon | Geographical coordinates of the location where the specimen was collected |
| nuc_region | FT source | Identifies the biological source of the specified span of the sequence |
| nuc_rp | RP | Reference positions |
| product | FT/product | Name of the product associated with the feature |
| publication_doi | RX | Cross-reference DOI number |
| pubmed_id | RX | Cross-reference Pubmed ID |
| specific_host | FT/specific_host | Natural host from which the sequence was obtained |
| specimen_voucher | FT/specimen_voucher | An identifier of the individual or collection of the source organism and the place where it is currently stored, usually an institution |
| start | FT rRNA | Start of the ribosomal RNA gene |
| stop | FT rRNA | Stop of the ribosomal RNA gene |
| strain | FT/strain | Strain from which the sequence was obtained |
| submit_author | RL | Submission authors from reference location |
| submit_date | RL | Submission date from reference location |
| tax_embl | OC | Organism classification according to EMBL |
| tax_embl_name | OC | Organism name taken from the classification field |
| tax_xref_embl | FT/db_xref | Database cross-reference: pointer to related information in another database |
| title | RT | Reference title |
| version | ID SV | Subversion from identification line |
Description of database fields in ARB files exported from SILVA: SILVA-specific fields and entries
| ARB field name | Description |
|---|---|
| align_bp_score_slv | Number of bases in helices in the aligned sequence, taking into account canonical and non-canonical base pairing. The cost matrix is taken from ARB Probe_Match ( |
| align_cutoff_head_slv | Unaligned bases at the beginning of the sequence |
| align_cutoff_tail_slv | Unaligned bases at the end of the sequence |
| align_family_slv | Names and scores of reference sequences in the alignment process |
| align_log_slv | Detailed aligner comments |
| align_quality_slv | Maximal similarity to reference sequence in the seed |
| aligned_slv | Date and time of alignment by SILVA |
| ambig_slv | Calculated percentage of ambiguities in the sequence; a maximum of 2% is allowed |
| homop_slv | Calculated percentage of repetitive bases in stretches of more than four bases; a maximum of 2% is allowed |
| homop_events_slv | Absolute number of repetitive elements with more than four bases |
| nuc_gene_slv | Aligned bases within gene boundaries |
| pintail_slv | Information about potential sequence anomalies detected by Pintail ( |
| alternative_name_slv | Synonyms or basonyms of the species according to the DSMZ ‘nomenclature up to date’ catalogue |
| seq_quality_slv | Summary sequence quality value calculated from the vector, ambiguity and homopolymer values; 100 means very good |
| tax_gg | Taxonomy mapped from greengenes |
| tax_gg_name | Organism name in greengenes |
| tax_rdp | Nomenclatural taxonomy mapped from RDP II |
| tax_rdp_name | Organism name in RDP II |
| vector_slv | Percent vector contamination, a maximum of 5% is allowed |
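
To make the `ambig_slv`, `homop_slv` and `homop_events_slv` descriptions above concrete, here is a minimal sketch of how such values could be computed from a raw sequence string. It is an illustration under stated assumptions, not the SILVA implementation. The function names `ambiguity_percent` and `homopolymer_stats` are hypothetical. It assumes that any letter other than A, C, G, T or U counts as an ambiguity, that a homopolymer is a stretch of five or more identical unambiguous bases ("more than four bases"), and that every base in such a stretch counts toward the percentage. The table does not specify these details.

```python
import re
from typing import Tuple

UNAMBIGUOUS = set("ACGTU")

# Assumption: a homopolymer is one base repeated five or more times.
HOMOPOLYMER = re.compile(r"([ACGTU])\1{4,}")


def _bases(seq: str) -> str:
    """Uppercase the sequence and drop gap or other non-letter characters."""
    return "".join(c for c in seq.upper() if c.isalpha())


def ambiguity_percent(seq: str) -> float:
    """Percentage of letters that are not A, C, G, T or U."""
    bases = _bases(seq)
    if not bases:
        return 0.0
    ambiguous = sum(1 for c in bases if c not in UNAMBIGUOUS)
    return 100.0 * ambiguous / len(bases)


def homopolymer_stats(seq: str) -> Tuple[int, float]:
    """Return (number of homopolymer stretches, percent of bases inside them)."""
    bases = _bases(seq)
    if not bases:
        return 0, 0.0
    runs = [m.group(0) for m in HOMOPOLYMER.finditer(bases)]
    covered = sum(len(r) for r in runs)
    return len(runs), 100.0 * covered / len(bases)
```

Under these assumptions, the two percentages could be compared against the 2% limits stated in the table, and the stretch count corresponds to the absolute number described for `homop_events_slv`.
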