Towards Structuring Unstructured GenBank Metadata for Enhancing Comparative Biological Studies.

Elizabeth S Chen, Indra Neil Sarkar.

Abstract

Within large sequence repositories such as GenBank there is a wealth of metadata providing contextual information that may enhance search and retrieval of relevant sequences for a range of subsequent analyses. One challenge is the use of free text in these metadata fields, where approaches are needed to extract, structure, and encode essential information. The goal of the present study was to explore the feasibility of using a combination of existing resources for annotating unstructured GenBank metadata, initially focusing on the "host" and "isolation_source" fields. This paper summarizes early results for 10 host organisms that include a characterization of associated isolation sources with respect to biomedical ontologies and semantic types. The findings from this preliminary study provide insights into the rich information captured within these unstructured metadata, offer guidance for addressing the challenges and issues encountered, and highlight the potential value for enriching comparative biological studies towards improving human health.

Year:  2011        PMID: 22211174      PMCID: PMC3248757     

Source DB:  PubMed          Journal:  AMIA Jt Summits Transl Sci Proc


INTRODUCTION

The availability of molecular sequence data for a broad range of organisms in centralized resources such as GenBank presents great opportunities for advancing biological discoveries1. Given the exponential growth of such repositories, there is an increasing need to organize information within metadata fields in order to facilitate the identification and retrieval of relevant sequences for biological and biomedical studies. Each entry in GenBank is associated with a detailed set of information about a sequence including a description, scientific name of the source organism, bibliographic references, and a table of features2. This “Feature Table” provides contextual information through a series of biological annotations for each sequence. Collectively, these metadata fields represent both structured and unstructured data. For example, “organism” contains the formal scientific name for the source organism and can be considered a structured field since it is organized according to the NCBI Taxonomy3. There are also numerous unstructured (free-text) fields such as “host” and “isolation_source” in the Feature Table, which are respectively defined as “natural (as opposed to laboratory) host to the organism from which sequenced molecule was obtained” and “describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived”4. There have been some efforts for identifying and standardizing key terms in such free-text fields. Towards the creation of Habitat-Lite for use in relevant specifications for habitat information, the isolation_source field in GenBank was examined5,6. The approaches used revealed a variety of information in this field with a majority of values falling into the broad “organism-associated” category where further work is needed to extract more specific information such as organism and anatomy. 
Another recent study explored the use of existing biomedical ontologies and annotation services available through the National Center for Biomedical Ontology (NCBO) for identifying anatomical sources in the GenBank isolation_source and note fields for ten domesticated mammalian species towards enabling comparative microbiome hypotheses7. Other studies and resources further highlight the value of capturing these contextual data in a structured format8,9.

METHODS & RESULTS

Building upon the aforementioned previous work, the goal of this feasibility study was to explore and develop approaches for annotating information within the unstructured “host” and “isolation_source” metadata in GenBank. Using a local GenBank database (Release 175), the following approach was followed (Figure 1): (1) identify and map host organisms to the NCBI Taxonomy, (2) annotate and characterize information in the isolation_source field using the NCBO BioPortal and UMLS Metathesaurus, and (3) describe how the structured host, isolation_source, and organism fields might be combined to enable host-oriented or cross-species studies.
Figure 1: Overview of Methods

Identifying and Merging Organism Names in Host Metadata

All host values were extracted from the local GenBank database (n=1,350,040) and a list of unique values along with frequency counts was generated (n=28,907). In addition to including organism names (scientific, common, and synonyms) as anticipated, a manual review of this list revealed other types of information and varying formats. Given this, a combination of four approaches was initially explored for identifying and mapping organism information to the NCBI Taxonomy (downloaded on June 23, 2010) that would facilitate the merging of values through the Taxonomy ID:

– Exact matching: find a completely exact match for the host value in the NCBI Taxonomy database (e.g., values mapping to ID 9796, Equus caballus).
– Basic pattern matching: use basic rules to find organism names relative to specific delimiters such as ‘;’, ‘,’, and ‘(’ (e.g., values mapping to ID 9606, Homo sapiens).
– TNR tool10: this named entity recognition approach identifies taxonomic names in text and maps to universal identifiers if possible, which may link to Taxonomy IDs. Some recognized names have no such links.
– N-gram approach: each host value is viewed as a sequence of n words and an attempt is made to find a match for n-grams from size 1 to n (e.g., values containing “Holstein dairy” or “Australian feedlot” map to ID 9913, Bos taurus).

These four approaches were applied sequentially where each was meant to build upon the results of the previous one (while recognizing that the subsequent approaches may introduce noise or inaccuracies). For the 28,907 unique host values, organism names were identified for 40.5% of the values with exact matching, 60.5% with basic pattern matching, 87.9% with the TNR tool, and 94.97% with the n-gram approach. Given that a portion of the organism names identified by the TNR tool could not be mapped to the NCBI Taxonomy, a final total of 75% of the values could be mapped to Taxonomy IDs. These values were subsequently merged according to these identifiers in order to identify a more comprehensive set of sequences for a given host organism.
For example, the single value Homo sapiens is associated with 504,967 sequences; through the mapping process, there were found to be over 600 different host values that mapped to Homo sapiens resulting in a total of 545,470 sequences when merged. For the purposes of this study, the top 10 host organisms ranked at the species-level were considered for further examination (thus excluding those ranked as genus, family, or subspecies as defined in the NCBI Taxonomy). Table 1A lists each organism along with the total number of host values (roughly equivalent to the number of GenBank entries) and number of unique host values (after manual review and removal of false positives).
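The sequential mapping and merging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `TAXONOMY` lookup is a toy stand-in for the NCBI Taxonomy names table, the host strings in the usage note are hypothetical, the external TNR tool step is omitted, and trying the longest n-grams first is an assumption of this sketch.

```python
import re
from collections import Counter

# Toy name-to-ID lookup standing in for the NCBI Taxonomy names table
# (assumption: lowercase keys; only a few illustrative entries).
TAXONOMY = {
    "homo sapiens": 9606,
    "human": 9606,
    "equus caballus": 9796,
    "horse": 9796,
    "bos taurus": 9913,
    "cattle": 9913,
    "cow": 9913,
}

# Delimiters named in the text: ';', ',' and '('.
DELIMITERS = re.compile(r"[;,(]")


def exact_match(host):
    """Return the Taxonomy ID only if the whole value is a known name."""
    return TAXONOMY.get(host.strip().lower())


def pattern_match(host):
    """Split on the delimiters and try an exact match on each segment."""
    for segment in DELIMITERS.split(host):
        tax_id = exact_match(segment.strip(" )"))
        if tax_id is not None:
            return tax_id
    return None


def ngram_match(host):
    """Try every word n-gram of the value, longest first (sketch assumption)."""
    words = re.findall(r"[A-Za-z]+", host.lower())
    n = len(words)
    for size in range(n, 0, -1):
        for start in range(n - size + 1):
            tax_id = TAXONOMY.get(" ".join(words[start:start + size]))
            if tax_id is not None:
                return tax_id
    return None


def map_host(host):
    """Apply the approaches in order; return (Taxonomy ID, approach name)."""
    for approach in (exact_match, pattern_match, ngram_match):
        tax_id = approach(host)
        if tax_id is not None:
            return tax_id, approach.__name__
    return None, None


def merge_by_taxonomy_id(host_counts):
    """Sum per-value frequency counts under the Taxonomy ID each value maps to."""
    merged = Counter()
    for host, count in host_counts.items():
        tax_id, _ = map_host(host)
        if tax_id is not None:
            merged[tax_id] += count
    return merged
```

For instance, `map_host("Holstein dairy cow")` falls through the first two approaches and is resolved by the n-gram step, while `merge_by_taxonomy_id` pools the counts of every value that resolves to the same ID, which is how many distinct host strings can contribute to one host organism.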
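Table 1: Top 10 Host Organisms with Frequencies for Host (A), Isolation Source (B), and Organism (C).

| Taxonomy ID | Scientific Name | GenBank Common Name | A. Host: Total Values | A. Host: Unique Values | B. Isolation Source: Total Values | B. Isolation Source: Unique Values | B. Isolation Source: Ontologies | B. Isolation Source: Semantic Types | C. Organism: Total Values | C. Organism: Unique Values |
|---|---|---|---|---|---|---|---|---|---|---|
| 9606 | Homo sapiens | human | 545470 | 609 | 337437 | 3628 | 123 | 83 | 545357 | 19645 |
| 10116 | Rattus norvegicus | Norway rat | 156894 | 19 | 77399 | 30 | 71 | 20 | 80888 | 184 |
| 10118 | Rattus sp. | | 76008 | 7 | 75933 | 4 | 20 | 5 | 75963 | 13 |
| 9805 | Diceros bicornis | black rhinoceros | 49500 | 3 | 49494 | 3 | 12 | 2 | 49499 | 7 |
| 9796 | Equus caballus | horse | 27582 | 38 | 4338 | 59 | 71 | 32 | 27575 | 323 |
| 9792 | Equus grevyi | Grevy’s zebra | 23280 | 3 | 23270 | 4 | 14 | 4 | 23276 | 4 |
| 10090 | Mus musculus | house mouse | 21088 | 33 | 14710 | 44 | 80 | 26 | 21071 | 172 |
| 9913 | Bos taurus | cattle | 19540 | 78 | 10454 | 191 | 96 | 47 | 19462 | 884 |
| 9891 | Antilocapra americana | pronghorn | 12951 | 1 | 12950 | 1 | 12 | 2 | 12951 | 2 |
| 9844 | Lama glama | llama | 11582 | 2 | 11579 | 3 | 31 | 5 | 11582 | 7 |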

Analyzing, Characterizing, and Merging Information in Isolation Source Metadata

A preliminary analysis of all isolation_source values in GenBank (n=1,837,706) consisting of 35,980 unique values revealed more complex semantics and syntax than the host field. Given this, a different approach was used that involved focusing on host-specific sets of values. The rationale for this was that these subsets may be used to develop a generalizable approach that could then be applied for all values. For the 10 host organisms identified in the first step of this study, isolation_source values were extracted and each set of unique host-specific values was annotated using the NCBO Annotator Web service11. The default settings for this service were used for most parameters with the exception of “longestValue” (set to true), “mappingTypes” (set to inter-cui), and “format” (set to text). Each annotation includes a score, source ontology ID (e.g., 42789 = SNOMED Clinical Terms), concept ID, preferred name, synonym(s), and semantic type(s). As an initial pass, annotations with a score of less than 10 were removed and the remaining annotations underwent further semantic analysis that involved summarizing the source ontologies (from NCBO BioPortal12 or UMLS Metathesaurus13) and semantic types (from the UMLS Semantic Network). Since a given value may map to multiple concepts and semantic types in one or multiple ontologies, a unique list of ontologies and semantic types was identified for each value and the total counts were calculated by summarizing across all values. For each host organism, Table 1B presents the total number of isolation_source values, number of unique isolation_source values, number of source ontologies, and number of semantic types. As these results demonstrate, there is variation across host organisms, which highlights the potential differences in the content and format of isolation_source values. 
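The filtering and per-value summarization described above might look like the following sketch. The record layout (`value`, `score`, `ontology`, `semantic_types`) is hypothetical and does not mirror the actual response format of the NCBO Annotator; only the score threshold of 10 and the "unique list per value, then sum across values" logic come from the text.

```python
from collections import Counter

MIN_SCORE = 10  # annotations scoring below this are removed, as in the text


def summarize_annotations(annotations, min_score=MIN_SCORE):
    """Count how many distinct values map to each ontology and semantic type.

    Each annotation is a dict with hypothetical keys:
    'value', 'score', 'ontology', 'semantic_types'.
    """
    ontologies_by_value = {}
    types_by_value = {}
    for ann in annotations:
        if ann["score"] < min_score:
            continue  # initial pass: drop low-scoring annotations
        value = ann["value"]
        # Sets give the unique ontologies and semantic types per value.
        ontologies_by_value.setdefault(value, set()).add(ann["ontology"])
        types_by_value.setdefault(value, set()).update(ann["semantic_types"])

    ontology_counts = Counter()
    for ontologies in ontologies_by_value.values():
        ontology_counts.update(ontologies)

    type_counts = Counter()
    for semantic_types in types_by_value.values():
        type_counts.update(semantic_types)

    return ontology_counts, type_counts
```

Because each value contributes a set, a value annotated several times by the same ontology is counted once for that ontology, which matches the idea of a unique list per value.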
When combining results across the host organisms, the top 10 ontologies (out of a total of 124) were found to be: NCI Thesaurus, SNOMED CT, LOINC, Galen, BRENDA Tissue/Enzyme Source, MeSH, Uber Anatomy Ontology, Foundational Model of Anatomy, Mouse Adult Gross Anatomy, and RadLex. Other top host-specific ontologies included: HL7, ICNP, and Environment Ontology. Across the 10 host organisms, the top 5 UMLS semantic types (out of a total of 88 and excluding “NCBO BioPortal Concept”) were: Qualitative Concept, Body Substance, Disease or Syndrome, Patient or Disabled Group, and Body Part. The following two examples depict multiple semantic types within a given isolation_source value:

– Body Part = “lymph node”; Patient or Disabled Group = “patient”; Qualitative Concept = “with”; Disease or Syndrome = “sarcoidosis” (remaining word in the value: “of”)
– Body Substance = “milk”; Mammal = “cow”; Qualitative Concept = “from”; Disease or Syndrome = “mastitis” (remaining words in the value: “from”, “suffering”)

Semantic types were used to further categorize the host-specific isolation_source values. For three of the top semantic types (Body Part, Body Substance, and Disease or Syndrome), the preferred names associated with each annotation were extracted (regardless of source ontology) and used to generate a preliminary ranked list of values in each category (recognizing that future efforts should involve use of the concept IDs and linkages between ontologies to generate such lists). With this strategy, the following are example isolation_source values that map to the single preferred name of “plasma” (semantic type = Body Substance) for Homo sapiens: human serum or host.

Table 2 (shaded rows) highlights the total number of isolation_source values for the three semantic types (along with the proportion of all isolation_source values) and the number of unique preferred names for five of the host organisms.
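The ranked lists per semantic type can be derived with a small aggregation such as the one below. It is a sketch under stated assumptions: the record keys (`semantic_type`, `preferred_name`, `count`) and the function name are hypothetical, and the two proportions follow the description in the text (category total over all isolation_source values for the host, and each preferred name over the category total).

```python
from collections import Counter


def rank_preferred_names(records, semantic_type, total_values, top_n=5):
    """Rank preferred names within one semantic type for a single host.

    records: dicts with hypothetical keys 'semantic_type', 'preferred_name',
    and 'count' (number of isolation_source values behind the annotation).
    total_values: number of all isolation_source values for the host.
    Returns (category_total, share_of_all_values, [(name, share_of_category)]).
    """
    counts = Counter()
    for rec in records:
        if rec["semantic_type"] == semantic_type:
            # Different raw values may share one preferred name; pool them.
            counts[rec["preferred_name"]] += rec["count"]

    category_total = sum(counts.values())
    share_of_all = category_total / total_values if total_values else 0.0
    ranked = [
        (name, count / category_total)
        for name, count in counts.most_common(top_n)
    ]
    return category_total, share_of_all, ranked
```

Pooling by preferred name is what lets several differently worded isolation_source values contribute to a single row such as “plasma”.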
Table 2: Top 5 Body Parts, Body Substances, Diseases or Syndromes, and Organisms for Selected Host Organisms.

| | Homo sapiens | Rattus norvegicus | Equus caballus | Mus musculus | Bos taurus |
|---|---|---|---|---|---|
| ISOLATION SOURCE: Body Part | Total: 34950 (0.104); Unique: 94 | Total: 132 (0.002); Unique: 15 | Total: 71 (0.016); Unique: 8 | Total: 3303 (0.225); Unique: 13 | Total: 1056 (0.101); Unique: 20 |
| | esophagus 0.212 | lung 0.432 | brain 0.775 | cecum 0.540 | rumen 0.729 |
| | external auditory canal 0.143 | rat colon 0.326 | vagina 0.113 | ileum 0.449 | teat 0.050 |
| | umbilicus 0.140 | ileum 0.068 | hoof 0.028 | spleen 0.002 | omasum 0.028 |
| | manubrium 0.128 | caecum 0.030 | gastric mucosa 0.028 | lung 0.002 | brain 0.031 |
| | glabella 0.123 | kidney 0.023 | uterus 0.014 | intestinal 0.002 | nasal 0.026 |
| ISOLATION SOURCE: Body Substance | Total: 44991 (0.133); Unique: 59 | Total: 32209 (0.416); Unique: 3 | Total: 3958 (0.912); Unique: 10 | Total: 444 (0.030); Unique: 3 | Total: 6084 (0.582); Unique: 11 |
| | saliva 0.317 | feces >99.999 | feces 0.959 | feces 0.980 | feces 0.947 |
| | feces 0.259 | blood <0.001 | semen 0.022 | blood 0.011 | blood 0.021 |
| | plasma 0.166 | isolate <0.001 | blood 0.014 | lysate 0.009 | milk 0.014 |
| | serum 0.142 | | peripheral blood 0.003 | | serum 0.006 |
| | blood 0.027 | | serum <0.001 | | exudate 0.004 |
| ISOLATION SOURCE: Disease or Syndrome | Total: 3363 (0.010); Unique: 137 | Total: 0; Unique: 0 | Total: 14 (0.003); Unique: 4 | Total: 983 (0.067); Unique: 1 | Total: 445 (0.430); Unique: 9 |
| | subgingival plaque 0.161 | | sarcoid 0.714 | Salmonella 1.000 | interdigital necrobacillosis 0.892 |
| | chronic hepatitis b 0.140 | | encephalitis 0.143 | | mastitis 0.070 |
| | pneumococcal infection 0.121 | | valvular endocarditis 0.071 | | dermatitis 0.020 |
| | liver abscess 0.050 | | endometritis 0.071 | | septicemia 0.004 |
| | acute hepatitis b 0.049 | | | | warts 0.004 |
| ORGANISM | uncultured bacterium (0.589) | uncultured bacterium (0.986) | uncultured Neocallimastigales (0.897) | uncultured bacterium (0.957) | uncultured Neocallimastigales (0.280) |
| | Human immunodeficiency virus 1 (0.112) | uncultured Escherichia sp. (0.002) | Equine infectious anemia virus (0.022) | Lactobacillus Reuteri (0.005) | uncultured bacterium (0.277) |
| | Hepatitis C virus (0.027) | Seoul virus (0.002) | Burkholderia mallei PRL-2 (0.010) | uncultured Clostridiales Bacterium (0.005) | Rabies virus (0.055) |
| | uncultured organism (0.020) | Lactobacillus reuteri (0.001) | Burkholderia mallei GB8 horse 4 (0.007) | Lymphocytic choriomeningitis virus (0.005) | uncultured rumen archaeon (0.036) |
| | Hepatitis B virus (0.018) | uncultured Bacillus sp. (0.001) | Equine arteritis virus (0.005) | Hepatitis C virus (0.004) | uncultured rumen bacterium (0.035) |

Enabling Comparative Biology Inquiries

The ability to extract, structure, and encode contextual information captured within the host and isolation_source fields in GenBank may be valuable for a range of subsequent uses. As suggested in a previous study7, the organization of data within GenBank could potentially facilitate initiatives like the Human Microbiome Project (study variation in the human microbiome and its impact on disease) or comparative microbiome studies (compare microbiomes in similar environments across species). An essential component of such studies is the identification of relevant sequences for a given host organism and a better understanding of the context or environment in which they were collected. As shown earlier, the identification of organism names within the host field and their subsequent mapping to Taxonomy IDs can increase the number of relevant sequences for a given host (e.g., an increase of about 8% for Homo sapiens, from 504,967 to 545,470 sequences). Based on the enhanced sets of host-specific sequences, Table 2 depicts the top 5 body parts, body substances, diseases or syndromes, and organisms associated with five of the hosts based on the isolation_source field (along with the proportion of total values for the host-specific semantic type). With respect to microbiome studies, a potential use of this contextual information is enabling comparisons between organism sequences obtained from different body parts of the same host organism (e.g., “cecum” versus “ileum” for Mus musculus). In addition to the aforementioned host-specific implications, the organization of unstructured GenBank fields may ultimately be used to enrich and facilitate cross-species studies by enabling context-specific questions such as: (1) for organism X, what are possible host organisms to study; (2) for body substance Y, what host organisms have been sources; or (3) across a specified set of host organisms, how do the isolation sources and organisms compare?
For example, as shown in Table 2, “feces” and “blood” are both among the top 5 body substances across the five host organisms.
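Once host, isolation source, and organism are structured, the three kinds of questions listed above reduce to simple lookups. The sketch below assumes a hypothetical flat record layout; the host, body substance, and organism names are taken from Table 2, but the specific pairings in `RECORDS` are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical structured records; pairings are illustrative, not real data.
RECORDS = [
    {"host": "Homo sapiens", "body_substance": "feces", "organism": "uncultured bacterium"},
    {"host": "Homo sapiens", "body_substance": "blood", "organism": "Hepatitis C virus"},
    {"host": "Mus musculus", "body_substance": "feces", "organism": "uncultured bacterium"},
    {"host": "Mus musculus", "body_substance": "blood", "organism": "Hepatitis C virus"},
    {"host": "Bos taurus", "body_substance": "feces", "organism": "uncultured bacterium"},
    {"host": "Bos taurus", "body_substance": "milk", "organism": "uncultured bacterium"},
]


def hosts_for_organism(records, organism):
    """Question 1: which hosts have been sources of a given organism?"""
    return sorted({r["host"] for r in records if r["organism"] == organism})


def hosts_for_body_substance(records, substance):
    """Question 2: which hosts have been sampled via a given body substance?"""
    return sorted({r["host"] for r in records if r["body_substance"] == substance})


def shared_body_substances(records, hosts):
    """Question 3 (one facet): body substances common to all listed hosts."""
    by_host = defaultdict(set)
    for rec in records:
        by_host[rec["host"]].add(rec["body_substance"])
    sets = [by_host[host] for host in hosts]
    return sorted(set.intersection(*sets)) if sets else []
```

In practice the keys would more plausibly be Taxonomy IDs and ontology concept IDs rather than strings, so that synonyms collapse before any comparison is made.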

DISCUSSION

Through this feasibility study, we have gained valuable insights into the richness and variation of information captured within two unstructured metadata fields in GenBank (host and isolation_source). The methods and results presented in this paper represent early attempts to structure this information towards enriching subsequent analyses. Next steps include performing extensive evaluations, addressing the various challenges and issues encountered, refining the techniques accordingly towards a more generalized approach, and demonstrating the potential impact on biological and biomedical studies. The analysis of GenBank host metadata involved using four consecutive approaches for identifying organism names and mapping those names to NCBI Taxonomy IDs. While organism names were identified in 97% of the values (and 75% could be mapped to the NCBI Taxonomy), host organisms could not be identified or mapped for the remaining values for several reasons including: organism is not in the NCBI Taxonomy (e.g., Pachnoda ephippiata and Thamnomys rutilans), common name or synonym for an organism is not in the NCBI Taxonomy (e.g., snail, white-fronted wallaby, and avian), and typographical errors (e.g., Lcopersicon esculentum instead of Lycopersicon esculentum and Biomalaria pfeifferi instead of Biomphalaria pfeifferi). Further evaluation of the results from each approach is needed to quantify and further examine both the false negatives and false positives in order to improve the techniques. In addition, techniques will be needed to extract other contextual information that is captured in the host field aside from organism names such as organism attributes (e.g., “adult two-spotted spider mite” and “female Ixodes persulcatus”), diseases (e.g., diabetes-prone (BB-DP) rat), and relationships (e.g., Scolytus ratzeburgi on Betula pendula). For isolation_source metadata in GenBank, a key goal was to gain a better understanding of the types of information found within this field.
The NCBO Annotator Web service was used to annotate host-specific values where no restrictions to ontologies or semantic types were applied. The initial semantic analysis provided insights into the coverage of concepts for guiding next steps for both host-specific and host-independent analysis. Future work includes evaluating the annotations produced by NCBO Annotator to determine if and how parameters should be adjusted, for example by limiting annotation to specific ontologies (e.g., guided by NCBO Recommender14) and focusing on particular semantic types.

CONCLUSION

This study involved examining the free-text host and isolation_source metadata fields in GenBank towards organizing key contextual information using a combination of existing biomedical ontology and annotation resources. Preliminary results for ten host organisms demonstrate how the structuring of these fields may contribute to comparative studies.
REFERENCES

1.  The Unified Medical Language System (UMLS): integrating biomedical terminology.

Authors:  Olivier Bodenreider
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

2.  Biodiversity informatics: organizing and linking information across the spectrum of life.

Authors:  Indra Neil Sarkar
Journal:  Brief Bioinform       Date:  2007-08-17       Impact factor: 11.622

3.  MetaBar - a tool for consistent contextual data acquisition and standards compliant submission.

Authors:  Wolfgang Hankeln; Pier Luigi Buttigieg; Dennis Fink; Renzo Kottmann; Pelin Yilmaz; Frank Oliver Glöckner
Journal:  BMC Bioinformatics       Date:  2010-06-30       Impact factor: 3.169

4.  Leveraging biomedical ontologies and annotation services to organize microbiome data from Mammalian hosts.

Authors:  Indra Neil Sarkar
Journal:  AMIA Annu Symp Proc       Date:  2010-11-13

5.  The minimum information about a genome sequence (MIGS) specification.

Authors:  Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal:  Nat Biotechnol       Date:  2008-05       Impact factor: 54.908

6.  Building a biomedical ontology recommender web service.

Authors:  Clement Jonquet; Mark A Musen; Nigam H Shah
Journal:  J Biomed Semantics       Date:  2010-06-22

7.  BioPortal: ontologies and integrated data resources at the click of a mouse.

Authors:  Natalya F Noy; Nigam H Shah; Patricia L Whetzel; Benjamin Dai; Michael Dorf; Nicholas Griffith; Clement Jonquet; Daniel L Rubin; Margaret-Anne Storey; Christopher G Chute; Mark A Musen
Journal:  Nucleic Acids Res       Date:  2009-05-29       Impact factor: 16.971

8.  Comparison of concept recognizers for building the Open Biomedical Annotator.

Authors:  Nigam H Shah; Nipun Bhatia; Clement Jonquet; Daniel Rubin; Annie P Chiang; Mark A Musen
Journal:  BMC Bioinformatics       Date:  2009-09-17       Impact factor: 3.169

9.  GeMInA, Genomic Metadata for Infectious Agents, a geospatial surveillance pathogen database.

Authors:  Lynn M Schriml; Cesar Arze; Suvarna Nadendla; Anu Ganapathy; Victor Felix; Anup Mahurkar; Katherine Phillippy; Aaron Gussman; Sam Angiuoli; Elodie Ghedin; Owen White; Neil Hall
Journal:  Nucleic Acids Res       Date:  2009-10-22       Impact factor: 16.971

10.  GenBank.

Authors:  Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal:  Nucleic Acids Res       Date:  2007-12-11       Impact factor: 16.971
