
Towards Structuring Unstructured GenBank Metadata for Enhancing Comparative Biological Studies.

Abstract

Within large sequence repositories such as GenBank there is a wealth of metadata providing contextual information that may enhance search and retrieval of relevant sequences for a range of subsequent analyses. One challenge is the use of free text in these metadata fields, for which approaches are needed to extract, structure, and encode essential information. The goal of the present study was to explore the feasibility of using a combination of existing resources for annotating unstructured GenBank metadata, initially focusing on the "host" and "isolation_source" fields. This paper summarizes early results for 10 host organisms, including a characterization of the associated isolation sources with respect to biomedical ontologies and semantic types. The findings from this preliminary study provide insights into the rich information captured within these unstructured metadata, offer guidance for addressing the challenges and issues encountered, and highlight the potential value for enriching comparative biological studies towards improving human health.


Year: 2011 PMID: 22211174 PMCID: PMC3248757

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

INTRODUCTION

The availability of molecular sequence data for a broad range of organisms in centralized resources such as GenBank presents great opportunities for advancing biological discoveries1. Given the exponential growth of such repositories, there is an increasing need to organize information within metadata fields in order to facilitate the identification and retrieval of relevant sequences for biological and biomedical studies. Each entry in GenBank is associated with a detailed set of information about a sequence including a description, scientific name of the source organism, bibliographic references, and a table of features2. This “Feature Table” provides contextual information through a series of biological annotations for each sequence. Collectively, these metadata fields represent both structured and unstructured data. For example, “organism” contains the formal scientific name for the source organism and can be considered a structured field since it is organized according to the NCBI Taxonomy3. There are also numerous unstructured (free-text) fields such as “host” and “isolation_source” in the Feature Table, which are respectively defined as “natural (as opposed to laboratory) host to the organism from which sequenced molecule was obtained” and “describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived”4. There have been some efforts for identifying and standardizing key terms in such free-text fields. Towards the creation of Habitat-Lite for use in relevant specifications for habitat information, the isolation_source field in GenBank was examined5,6. The approaches used revealed a variety of information in this field with a majority of values falling into the broad “organism-associated” category where further work is needed to extract more specific information such as organism and anatomy. 
Another recent study explored the use of existing biomedical ontologies and annotation services available through the National Center for Biomedical Ontology (NCBO) for identifying anatomical sources in the GenBank isolation_source and note fields for ten domesticated mammalian species towards enabling comparative microbiome hypotheses7. Other studies and resources further highlight the value of capturing these contextual data in a structured format8,9.

METHODS & RESULTS

Building upon the aforementioned previous work, the goal of this feasibility study was to explore and develop approaches for annotating information within the unstructured “host” and “isolation_source” metadata in GenBank. Using a local GenBank database (Release 175), the following approach was followed (Figure 1): (1) identify and map host organisms to the NCBI Taxonomy, (2) annotate and characterize information in the isolation_source field using the NCBO BioPortal and UMLS Metathesaurus, and (3) describe how the structured host, isolation_source, and organism fields might be combined to enable host-oriented or cross-species studies.

Figure 1:

Overview of Methods

Identifying and Merging Organism Names in Host Metadata

All host values were extracted from the local GenBank database (n=1,350,040) and a list of unique values along with frequency counts was generated (n=28,907). In addition to the anticipated organism names (scientific names, common names, and synonyms), a manual review of this list revealed other types of information and varying formats. Given this, a combination of four approaches was explored for identifying organism information and mapping it to the NCBI Taxonomy (downloaded on June 23, 2010), so that values could be merged through the Taxonomy ID:
– Exact matching: find an exact match for the complete host value in the NCBI Taxonomy database. Values matched this way include those that map to ID 9796.
– Basic pattern matching: use basic rules to find organism names relative to specific delimiters (e.g., ‘;’, ‘,’, and ‘(’). Values matched this way include those that map to ID 9606.
– TNR tool10: this named entity recognition approach identifies taxonomic names in text and maps them to universal identifiers if possible, which may link to Taxonomy IDs. Some of the identified names have no such links.
– N-gram matching: each host value is viewed as a sequence of n words and an attempt is made to find a match for n-grams from size 1 to n. Values that map to ID 9913 this way include those containing “Holstein dairy” and “Australian feedlot”.
These four approaches were applied sequentially, with each meant to build upon the results of the previous one (while recognizing that the later approaches may introduce noise or inaccuracies). For the 28,907 unique host values, organism names were identified for 40.5% of the values with exact matching, 60.5% with basic pattern matching, 87.9% with the TNR tool, and 94.97% with the n-gram approach. Given that a portion of the organism names identified by the TNR tool could not be mapped to the NCBI Taxonomy, a final total of 75% of the values could be mapped to Taxonomy IDs. These values were subsequently merged according to these identifiers in order to identify a more comprehensive set of sequences for a given host organism. 
For example, the single value Homo sapiens is associated with 504,967 sequences; through the mapping process, there were found to be over 600 different host values that mapped to Homo sapiens resulting in a total of 545,470 sequences when merged. For the purposes of this study, the top 10 host organisms ranked at the species-level were considered for further examination (thus excluding those ranked as genus, family, or subspecies as defined in the NCBI Taxonomy). Table 1A lists each organism along with the total number of host values (roughly equivalent to the number of GenBank entries) and number of unique host values (after manual review and removal of false positives).
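The sequential matching and merging described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the small `TAXONOMY` dictionary stands in for the NCBI Taxonomy name table, all function names are hypothetical, the TNR step is omitted because it relies on an external tool, and the n-gram step tries the longest n-grams first, which is one possible ordering.

```python
import re
from collections import Counter

# Hypothetical miniature name table: lowercase name -> NCBI Taxonomy ID.
# The IDs are those listed in Table 1; the names are illustrative only.
TAXONOMY = {
    "homo sapiens": 9606,
    "human": 9606,
    "equus caballus": 9796,
    "horse": 9796,
    "bos taurus": 9913,
    "cattle": 9913,
    "cow": 9913,
}


def exact_match(value):
    """Approach 1: the whole host value must equal a known name."""
    return TAXONOMY.get(value.strip().lower())


def delimiter_match(value):
    """Approach 2: split on ';', ',' and '(' and test each piece."""
    for piece in re.split(r"[;,(]", value):
        tax_id = TAXONOMY.get(piece.strip(" )").lower())
        if tax_id is not None:
            return tax_id
    return None


def ngram_match(value):
    """Approach 4: test every word n-gram, longest first (an assumption)."""
    words = re.findall(r"[A-Za-z]+", value.lower())
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            tax_id = TAXONOMY.get(" ".join(words[start:start + size]))
            if tax_id is not None:
                return tax_id
    return None


def map_host(value):
    """Apply the approaches in order and stop at the first hit."""
    for approach in (exact_match, delimiter_match, ngram_match):
        tax_id = approach(value)
        if tax_id is not None:
            return tax_id
    return None


def merge_counts(host_counts):
    """Sum the sequence counts of all host values sharing a Taxonomy ID."""
    merged = Counter()
    for value, count in host_counts.items():
        tax_id = map_host(value)
        if tax_id is not None:
            merged[tax_id] += count
    return merged
```

Calling `merge_counts` on a dictionary of host values and their frequencies yields one total per Taxonomy ID, which mirrors how the many host values mapping to Homo sapiens were combined into a single, larger set of sequences.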

Table 1:

Top 10 Host Organisms with Frequencies for Host (A), Isolation Source (B), and Organism (C).

			A. HOST		B. ISOLATION SOURCE				C. ORGANISM
Taxonomy ID	Scientific Name	GenBank Common Name	Total Values	Unique Values	Total Values	Unique Values	Ontologies	Semantic Types	Total Values	Unique Values
9606	Homo sapiens	human	545470	609	337437	3628	123	83	545357	19645
10116	Rattus norvegicus	Norway rat	156894	19	77399	30	71	20	80888	184
10118	Rattus sp.		76008	7	75933	4	20	5	75963	13
9805	Diceros bicornis	black rhinoceros	49500	3	49494	3	12	2	49499	7
9796	Equus caballus	horse	27582	38	4338	59	71	32	27575	323
9792	Equus grevyi	Grevy’s zebra	23280	3	23270	4	14	4	23276	4
10090	Mus musculus	house mouse	21088	33	14710	44	80	26	21071	172
9913	Bos taurus	cattle	19540	78	10454	191	96	47	19462	884
9891	Antilocapra americana	pronghorn	12951	1	12950	1	12	2	12951	2
9844	Lama glama	llama	11582	2	11579	3	31	5	11582	7

Analyzing, Characterizing, and Merging Information in Isolation Source Metadata

A preliminary analysis of all isolation_source values in GenBank (n=1,837,706) consisting of 35,980 unique values revealed more complex semantics and syntax than the host field. Given this, a different approach was used that involved focusing on host-specific sets of values. The rationale for this was that these subsets may be used to develop a generalizable approach that could then be applied for all values. For the 10 host organisms identified in the first step of this study, isolation_source values were extracted and each set of unique host-specific values was annotated using the NCBO Annotator Web service11. The default settings for this service were used for most parameters with the exception of “longestValue” (set to true), “mappingTypes” (set to inter-cui), and “format” (set to text). Each annotation includes a score, source ontology ID (e.g., 42789 = SNOMED Clinical Terms), concept ID, preferred name, synonym(s), and semantic type(s). As an initial pass, annotations with a score of less than 10 were removed and the remaining annotations underwent further semantic analysis that involved summarizing the source ontologies (from NCBO BioPortal12 or UMLS Metathesaurus13) and semantic types (from the UMLS Semantic Network). Since a given value may map to multiple concepts and semantic types in one or multiple ontologies, a unique list of ontologies and semantic types was identified for each value and the total counts were calculated by summarizing across all values. For each host organism, Table 1B presents the total number of isolation_source values, number of unique isolation_source values, number of source ontologies, and number of semantic types. As these results demonstrate, there is variation across host organisms, which highlights the potential differences in the content and format of isolation_source values. 
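The score filtering and per-value summarization described above might look like the following sketch. The record layout (dictionaries with `score`, `ontology`, and `semantic_types` keys) is an assumption made for illustration and is not the Annotator's actual response format; the example annotations are invented.

```python
from collections import Counter

MIN_SCORE = 10  # annotations scoring below this threshold were removed


def summarize(annotations_by_value, min_score=MIN_SCORE):
    """Count how many values touch each ontology and each semantic type."""
    ontology_counts = Counter()
    semantic_type_counts = Counter()
    for annotations in annotations_by_value.values():
        kept = [a for a in annotations if a["score"] >= min_score]
        # Build unique sets per value so that a value counts at most once
        # per ontology and once per semantic type.
        ontologies = {a["ontology"] for a in kept}
        semantic_types = {t for a in kept for t in a["semantic_types"]}
        ontology_counts.update(ontologies)
        semantic_type_counts.update(semantic_types)
    return ontology_counts, semantic_type_counts


# Hypothetical annotations for two isolation_source values.
EXAMPLE = {
    "milk from cow": [
        {"score": 20, "ontology": "SNOMED CT", "semantic_types": ["Body Substance"]},
        {"score": 20, "ontology": "NCI Thesaurus", "semantic_types": ["Body Substance"]},
        {"score": 8, "ontology": "MeSH", "semantic_types": ["Mammal"]},
    ],
    "feces": [
        {"score": 20, "ontology": "SNOMED CT", "semantic_types": ["Body Substance"]},
    ],
}

ontology_counts, semantic_type_counts = summarize(EXAMPLE)
```

The number of distinct keys in each counter corresponds to the "Ontologies" and "Semantic Types" columns of Table 1B for a given host.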
When combining results across the host organisms, the top 10 ontologies (out of a total of 124) were found to be: NCI Thesaurus, SNOMED CT, LOINC, Galen, BRENDA Tissue/Enzyme Source, MeSH, Uber Anatomy Ontology, Foundational Model of Anatomy, Mouse Adult Gross Anatomy, and RadLex. Other top host-specific ontologies included: HL7, ICNP, and Environment Ontology. Across the 10 host organisms, the top 5 UMLS semantic types (out of a total of 88 and excluding “NCBO BioPortal Concept”) were: Qualitative Concept, Body Substance, Disease or Syndrome, Patient or Disabled Group, and Body Part. The following two examples depict multiple semantic types within a given isolation_source value:
– “lymph node of patient with sarcoidosis”: Body Part = “lymph node”; Patient or Disabled Group = “patient”; Qualitative Concept = “with”; Disease or Syndrome = “sarcoidosis”
– “milk from cow suffering from mastitis”: Body Substance = “milk”; Mammal = “cow”; Qualitative Concept = “from”; Disease or Syndrome = “mastitis”
Semantic types were used to further categorize the host-specific isolation_source values. For three of the top semantic types (Body Part, Body Substance, and Disease or Syndrome), the preferred names associated with each annotation were extracted (regardless of source ontology) and used to generate a preliminary ranked list of values in each category (recognizing that future efforts should involve use of the concept IDs and linkages between ontologies to generate such lists). With this strategy, several distinct isolation_source values for Homo sapiens (for example, values that also contain “human”, “serum or”, or “host”) map to the single preferred name of “plasma” (semantic type = Body Substance). Table 2 (Total and Unique rows) reports the total number of isolation_source values for the three semantic types (along with the proportion of all isolation_source values) and the number of unique preferred names for five of the host organisms.
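A ranked list of preferred names for one semantic type could be produced along these lines. This is a sketch under assumptions: the record layout and the function name `rank_preferred_names` are hypothetical, the sample values are invented, and each value is counted once per distinct preferred name, so a value that maps to two names contributes to both.

```python
from collections import Counter


def rank_preferred_names(records, semantic_type, top_n=5):
    """Return (total, unique name count, top names with proportions)."""
    counts = Counter()
    for record in records:
        # Collect the distinct preferred names of the requested type.
        names = {
            a["preferred_name"]
            for a in record["annotations"]
            if semantic_type in a["semantic_types"]
        }
        for name in names:
            counts[name] += record["sequence_count"]
    total = sum(counts.values())
    ranked = [(name, n / total) for name, n in counts.most_common(top_n)]
    return total, len(counts), ranked


# Hypothetical isolation_source values with their sequence counts.
RECORDS = [
    {
        "value": "saliva",
        "sequence_count": 6,
        "annotations": [
            {"preferred_name": "saliva", "semantic_types": ["Body Substance"]},
        ],
    },
    {
        "value": "plasma sample",
        "sequence_count": 3,
        "annotations": [
            {"preferred_name": "plasma", "semantic_types": ["Body Substance"]},
        ],
    },
    {
        "value": "blood plasma",
        "sequence_count": 1,
        "annotations": [
            {"preferred_name": "plasma", "semantic_types": ["Body Substance"]},
        ],
    },
]

total, unique, ranked = rank_preferred_names(RECORDS, "Body Substance")
```

Grouping by preferred name is what lets differently worded values collapse into a single row, as in the “plasma” case described above.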

Table 2:

Top 5 Body Parts, Body Substances, Diseases or Syndromes, and Organisms for Selected Host Organisms.

		Homo sapiens		Rattus norvegicus		Equus caballus		Mus musculus		Bos taurus

ISOLATION SOURCE	Body Part	Total: 34950 (0.104)		Total: 132 (0.002)		Total: 71 (0.016)		Total: 3303 (0.225)		Total: 1056 (0.101)
		Unique: 94		Unique: 15		Unique: 8		Unique: 13		Unique: 20

		esophagus	0.212	lung	0.432	brain	0.775	cecum	0.540	rumen	0.729
		external auditory canal	0.143	rat colon	0.326	vagina	0.113	ileum	0.449	teat	0.050
		umbilicus	0.140	ileum	0.068	hoof	0.028	spleen	0.002	omasum	0.028
		manubrium	0.128	caecum	0.030	gastric mucosa	0.028	lung	0.002	brain	0.031
		glabella	0.123	kidney	0.023	uterus	0.014	intestinal	0.002	nasal	0.026

	Body Substance	Total: 44991 (0.133)		Total: 32209 (0.416)		Total: 3958 (0.912)		Total: 444 (0.030)		Total: 6084 (0.582)
		Unique: 59		Unique: 3		Unique: 10		Unique: 3		Unique: 11

		saliva	0.317	feces	>0.999	feces	0.959	feces	0.980	feces	0.947
		feces	0.259	blood	<0.001	semen	0.022	blood	0.011	blood	0.021
		plasma	0.166	isolate	<0.001	blood	0.014	lysate	0.009	milk	0.014
		serum	0.142			peripheral blood	0.003			serum	0.006
		blood	0.027			serum	<0.001			exudate	0.004

	Disease or Syndrome	Total: 3363 (0.010)		Total: 0		Total: 14 (0.003)		Total: 983 (0.067)		Total: 445 (0.043)
		Unique: 137		Unique: 0		Unique: 4		Unique: 1		Unique: 9

		subgingival plaque	0.161			sarcoid	0.714	Salmonella	1.000	interdigital necrobacillosis	0.892
		chronic hepatitis b	0.140			encephalitis	0.143			mastitis	0.070
		pneumococcal infection	0.121			valvular endocarditis	0.071			dermatitis	0.020
		liver abscess	0.050			endometritis	0.071			septicemia	0.004
		acute hepatitis b	0.049							warts	0.004

ORGANISM		uncultured bacterium (0.589)		uncultured bacterium (0.986)		uncultured Neocallimastigales (0.897)		uncultured bacterium (0.957)		uncultured Neocallimastigales (0.280)
		Human immunodeficiency virus 1 (0.112)		uncultured Escherichia sp. (0.002)		Equine infectious anemia virus (0.022)		Lactobacillus reuteri (0.005)		uncultured bacterium (0.277)
		Hepatitis C virus (0.027)		Seoul virus (0.002)		Burkholderia mallei PRL-2 (0.010)		uncultured Clostridiales bacterium (0.005)		Rabies virus (0.055)
		uncultured organism (0.020)		Lactobacillus reuteri (0.001)		Burkholderia mallei GB8 horse 4 (0.007)		Lymphocytic choriomeningitis virus (0.005)		uncultured rumen archaeon (0.036)
		Hepatitis B virus (0.018)		uncultured Bacillus sp. (0.001)		Equine arteritis virus (0.005)		Hepatitis C virus (0.004)		uncultured rumen bacterium (0.035)

Enabling Comparative Biology Inquiries

The ability to extract, structure, and encode contextual information captured within the host and isolation_source fields in GenBank may be valuable for a range of subsequent uses. As suggested in a previous study7, organizing data within GenBank could facilitate initiatives like the Human Microbiome Project (which studies variation in the human microbiome and its impact on disease) or comparative microbiome studies (which compare microbiomes in similar environments across species). An essential component of such studies is the identification of relevant sequences for a given host organism and a better understanding of the context or environment in which they were collected. As shown earlier, identifying organism names within the host field and mapping them to Taxonomy IDs can increase the number of relevant sequences for a given host (e.g., merging raised the count for Homo sapiens by about 8%, from 504,967 to 545,470 sequences). Based on the enhanced sets of host-specific sequences, Table 2 depicts the top 5 body parts, body substances, diseases or syndromes, and organisms associated with five of the hosts based on the isolation_source field (along with the proportion of total values for the host-specific semantic type). With respect to microbiome studies, a potential use of this contextual information is enabling comparisons between organism sequences obtained from different body parts of the same host organism (e.g., “cecum” versus “ileum” for Mus musculus). In addition to the aforementioned host-specific implications, the organization of unstructured GenBank fields may ultimately be used to enrich and facilitate cross-species studies by enabling context-specific questions such as: (1) for organism X, what are possible host organisms to study? (2) for body substance Y, what host organisms have been sources? or (3) across a specified set of host organisms, how do the isolation sources and organisms compare? 
For example, as shown in Table 2, “feces” and “blood” are both among the top 5 body substances across the five host organisms.
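Once host, isolation source, and organism are structured, the three kinds of questions above reduce to simple lookups. The sketch below assumes a hypothetical flat record layout of (host Taxonomy ID, body substance, organism); the records are invented for illustration and are not GenBank entries.

```python
from collections import defaultdict

# Hypothetical structured records: (host Taxonomy ID, body substance, organism).
RECORDS = [
    (9606, "feces", "uncultured bacterium"),
    (9606, "blood", "Hepatitis C virus"),
    (10090, "feces", "uncultured bacterium"),
    (10090, "blood", "Lymphocytic choriomeningitis virus"),
    (9913, "milk", "uncultured bacterium"),
]


def hosts_for_organism(records, organism):
    """Question 1: which hosts has this organism been sequenced from?"""
    return sorted({host for host, _, org in records if org == organism})


def hosts_for_substance(records, substance):
    """Question 2: which hosts have been sources of this body substance?"""
    return sorted({host for host, sub, _ in records if sub == substance})


def compare_hosts(records, hosts):
    """Question 3: collect substances and organisms for each selected host."""
    summary = defaultdict(lambda: {"substances": set(), "organisms": set()})
    for host, sub, org in records:
        if host in hosts:
            summary[host]["substances"].add(sub)
            summary[host]["organisms"].add(org)
    return dict(summary)


def shared_substances(records, hosts):
    """Body substances reported for every selected host that has records."""
    per_host = compare_hosts(records, hosts)
    sets = [per_host[h]["substances"] for h in hosts if h in per_host]
    return set.intersection(*sets) if sets else set()
```

For instance, `shared_substances(RECORDS, [9606, 10090])` returns the substances common to both hosts, which is the kind of comparison behind the observation about “feces” and “blood”.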

DISCUSSION

Through this feasibility study, we have gained valuable insights into the richness and variation of information captured within two unstructured metadata fields in GenBank (host and isolation_source). The methods and results presented in this paper represent early attempts to structure this information towards enriching subsequent analyses. Next steps include performing extensive evaluations, addressing the various challenges and issues encountered, refining the techniques accordingly towards a more generalized approach, and demonstrating the potential impact on biological and biomedical studies. The analysis of GenBank host metadata involved using four consecutive approaches for identifying organism names and mapping those names to NCBI Taxonomy IDs. While organism names were identified in 97% of the values (and 75% could be mapped to the NCBI Taxonomy), host organisms could not be identified or mapped for the remaining values for several reasons, including: the organism is not in the NCBI Taxonomy (e.g., Pachnoda ephippiata and Thamnomys rutilans), the common name or synonym for an organism is not in the NCBI Taxonomy (e.g., snail, white-fronted wallaby, and avian), and typographical errors (e.g., Lcopersicon esculentum instead of Lycopersicon esculentum and Biomalaria pfeifferi instead of Biomphalaria pfeifferi). Further evaluation of the results from each approach is needed to quantify and examine both the false negatives and false positives in order to improve the techniques. In addition, techniques will be needed to extract other contextual information that is captured in the host field aside from organism names, such as organism attributes (e.g., “adult two-spotted spider mite” and “female Ixodes persulcatus”), diseases (e.g., diabetes-prone (BB-DP) rat), and relationships (e.g., Scolytus ratzeburgi on Betula pendula). For isolation_source metadata in GenBank, a key goal was to gain a better understanding of the types of information found within this field. 
The NCBO Annotator Web service was used to annotate host-specific values, with no restrictions on ontologies or semantic types. The initial semantic analysis provided insights into the coverage of concepts for guiding next steps for both host-specific and host-independent analysis. Future work includes evaluating the annotations produced by NCBO Annotator to determine if and how parameters should be adjusted, for example by limiting annotation to specific ontologies (e.g., guided by NCBO Recommender14) and focusing on particular semantic types.

CONCLUSION

This study involved examining the free-text host and isolation_source metadata fields in GenBank towards organizing key contextual information using a combination of existing biomedical ontology and annotation resources. Preliminary results for ten host organisms demonstrate how the structuring of these fields may contribute to comparative studies.

REFERENCES

1. The Unified Medical Language System (UMLS): integrating biomedical terminology.

Authors: Olivier Bodenreider
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. Biodiversity informatics: organizing and linking information across the spectrum of life.

Authors: Indra Neil Sarkar
Journal: Brief Bioinform Date: 2007-08-17 Impact factor: 11.622

3. MetaBar - a tool for consistent contextual data acquisition and standards compliant submission.

Authors: Wolfgang Hankeln; Pier Luigi Buttigieg; Dennis Fink; Renzo Kottmann; Pelin Yilmaz; Frank Oliver Glöckner
Journal: BMC Bioinformatics Date: 2010-06-30 Impact factor: 3.169

4. Leveraging biomedical ontologies and annotation services to organize microbiome data from Mammalian hosts.

Authors: Indra Neil Sarkar
Journal: AMIA Annu Symp Proc Date: 2010-11-13

5. The minimum information about a genome sequence (MIGS) specification.

Authors: Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal: Nat Biotechnol Date: 2008-05 Impact factor: 54.908

6. Building a biomedical ontology recommender web service.

Authors: Clement Jonquet; Mark A Musen; Nigam H Shah
Journal: J Biomed Semantics Date: 2010-06-22

7. BioPortal: ontologies and integrated data resources at the click of a mouse.

Authors: Natalya F Noy; Nigam H Shah; Patricia L Whetzel; Benjamin Dai; Michael Dorf; Nicholas Griffith; Clement Jonquet; Daniel L Rubin; Margaret-Anne Storey; Christopher G Chute; Mark A Musen
Journal: Nucleic Acids Res Date: 2009-05-29 Impact factor: 16.971

8. Comparison of concept recognizers for building the Open Biomedical Annotator.

Authors: Nigam H Shah; Nipun Bhatia; Clement Jonquet; Daniel Rubin; Annie P Chiang; Mark A Musen
Journal: BMC Bioinformatics Date: 2009-09-17 Impact factor: 3.169

9. GeMInA, Genomic Metadata for Infectious Agents, a geospatial surveillance pathogen database.

Authors: Lynn M Schriml; Cesar Arze; Suvarna Nadendla; Anu Ganapathy; Victor Felix; Anup Mahurkar; Katherine Phillippy; Aaron Gussman; Sam Angiuoli; Elodie Ghedin; Owen White; Neil Hall
Journal: Nucleic Acids Res Date: 2009-10-22 Impact factor: 16.971