| Literature DB >> 15471541 |
Diane E Oliver1, Gaurav Bhalotia, Ariel S Schwartz, Russ B Altman, Marti A Hearst.
Abstract
BACKGROUND: Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers.Entities:
Mesh:
Year: 2004 PMID: 15471541 PMCID: PMC524480 DOI: 10.1186/1471-2105-5-146
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1Representation of information related to authors in the DTD. Selected portions of the DTD are shown. Database schema designers determine how entities and elements are converted to table names or field names in the database schema. See Figure 2 for the author table.
Figure 2Representation of author information in the database schema. The typical table has a PubMed identifier (pmid) associated with other fields.
Figure 3Dependencies in the database schema. Parent tables contain primary keys that child tables reference as foreign keys. The main table medline_citation, is a parent of thirteen other tables. The table medline_mesh_heading is a parent of medline_mesh_heading_qualifier.
Figure 4MEDLINE database development process. In Step 1, the user loads the schema, creating empty tables in the database. In Step 2, the conversion software parses the XML files and either loads the data directly into the database (2a), or writes the data out to intermediate text files (2b). If intermediate text files are generated, data from those files are loaded into the database as a separate step in Step 3.
Metadata for medline_author table.
| pmid | PMID | Types.INTEGER |
| last_name | Author.LastName | Types.VARCHAR |
| fore_name | Author.ForeName | Types.VARCHAR |
| first_name | Author.FirstName | Types.VARCHAR |
| middle_name | Author.MiddleName | Types.VARCHAR |
| initials | Author.Initials | Types.VARCHAR |
| suffix | Author.Suffix | Types.VARCHAR |
| affiliation | Author.Affiliation | Types.VARCHAR |
| collective_name | Author.CollectiveName | Types.VARCHAR |
: column names for the table,
: XML element names that correspond to column names,
: data type of each column
Loading time and disk-space utilization.
| Site | Language | Processor | Database | Input Size | Loading Time | Disk Space |
| Berkeley | Java | Intel | DB2 | 44.4 GB (500 files) | 76 hours | 46.3 GB |
| Stanford | Java | Sun | Oracle | 40.8 GB (396 files) | 196 hours | 37.7 GB |
| Stanford | Perl | Sun | Oracle | 40.8 GB (396 files) | 132 hours | 31.6 GB |
| SELECT pmid |
| FROM medline_citation |
| SELECT | mc.medline_ta, count(mc.pmid) as num_of_publications |
| FROM | medline_citation mc |
| JOIN medline_mesh_heading msh ON | |
| mc.pmid = msh.pmid | |
| WHERE | msh.descriptor_name = 'Leukemia' |
| GROUP BY | mc.medline_ta |
| ORDER BY | count(mc.pmid) desc |
| FETCH | first 10 rows only; |
| Blood | 940 |
| Cancer | 619 |
| Rinsho Ketsueki | 610 |
| Cancer Res | 588 |
| Br J Haematol | 524 |
| Bone Marrow Transplant | 520 |
| Lancet | 515 |
| Leuk Res | 476 |
| Leukemia | 463 |
| The N Engl J Med | 342 |
| Blood | 6721 |
| Cancer Res | 4653 |
| Leukemia | 4640 |
| Br J Haematol | 3918 |
| Leuk Res | 3061 |
| Cancer | 2772 |
| Rinsho Ketsueki | 2628 |
| Cancer Genet Cytogenet | 2192 |
| Bone Marrow Transplant | 2123 |
| Lancet | 1931 |
| SELECT | mc.medline_ta, count(mc.pmid) as num_of_publications |
| FROM | medline_citation mc |
| JOIN medline_mesh_heading msh ON | |
| mc.pmid = msh.pmid | |
| JOIN mesh_descriptor md ON | |
| md.descriptor_name = msh.descriptor_name | |
| JOIN mesh_desc_tree_number mdtn ON | |
| md.descriptor_ui = mdtn.descriptor_ui | |
| WHERE | mdtn.tree_number LIKE 'C04.557.337%' |
| GROUP BY | mc.medline_ta |
| ORDER BY | count(mc.pmid) desc |
| FETCH | first 10 rows only; |
| Blood | 7361 |
| Leukemia | 5168 |
| Cancer Res | 4595 |
| Br J Haematol | 4249 |
| Leuk Res | 3274 |
| Cancer | 2856 |
| Rinsho Ketsueki | 2789 |
| Cancer Genet Cytogenet | 2362 |
| Leuk Lymph | 2226 |
| Bone Marrow Transplant | 2183 |
| SELECT | 'Berkeley' as institution, count(pmid) as num_of_publications |
| FROM | medline_citation |
| WHERE | CONTAINS(article_affiliation,'"Berkeley"') = 1 |
| AND | date_created > current date – 3 years |
| UNION | |
| SELECT | 'Stanford' as institution, count(pmid) as num_of_publications |
| FROM | medline_citation |
| WHERE | CONTAINS(article_affiliation,'"Stanford"') = 1 |
| AND | date_created > current date – 3 years; |
| Berkeley | 2623 |
| Stanford | 4226 |