Xiaoli Zhang, Jie Zou, Daniel X Le, George R Thoma.
Abstract
BACKGROUND: Automated extraction of bibliographic data, such as article titles, author names, abstracts, and references, is essential to the affordable creation of large citation databases. References, which typically appear at the end of journal articles, can also provide valuable information for extracting other bibliographic data. Therefore, parsing individual references to extract author, title, journal, year, etc. is sometimes a necessary preprocessing step in building citation-indexing systems. The regular structure of references enables us to treat reference parsing as a sequence learning problem and to study structural Support Vector Machine (structural SVM), a newly developed structured learning algorithm, for parsing references.
Year: 2011 PMID: 21658294 PMCID: PMC3111593 DOI: 10.1186/1471-2105-12-S3-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Examples of references following different styles in medical journal articles
| <N>19</N> <A>S. Miyazaki, K. Takahashi, M. Shiraki, T. Saito, Y. Tezuka, K. Kasuya,</A> <T>Properties of a poly(3-hydroxybutyrate) depolymerase from Penicillium funiculosum,</T> <J>J. Polym. Environ.</J> <V>8</V> <Y>(2002),</Y> <P>pp. 175–182.</P> |
| <N>Sofuoglu and Kosten, 2005</N> <A>M. Sofuoglu and T.R. Kosten,</A> <T>Novel approaches to the treatment of cocaine addiction,</T> <J>CNS Drugs</J> <V>19</V> <Y>(2005),</Y> <P>pp. 13–25.</P> <O>Full Text via CrossRef | Abstract + References in Scopus | Cited By in Scopus</O> |
| <A>Czarnetzki, A. B., and C. C. Tebbe.</A> <Y>2004.</Y> <T>Diversity of bacteria associated with Collembola: a cultivation-independent survey based on PCR-amplified 16S rRNA genes.</T> <J>FEMS Microbiol. Ecol.</J> <V>49:</V> <P>217-227.</P> <O>[CrossRef]</O> |
| <A>Rios R, Carneiro I, Arce VM, and Devesa J.</A> <T>Myostatin is an inhibitor of myogenic differentiation.</T> <J>Am J Physiol Cell Physiol</J> <V>282:</V> <P>C993–C999,</P> <Y>2002.</Y> <O>[Abstract/Free Full Text]</O> |
| <N>12.</N> <A>T.J. McCarthy et al.,</A> <J>Chem. Biol.</J> <V>12,</V> <P>1221</P> <Y>(2005).</Y> <O>[CrossRef] [ISI] [Medline]</O> |
| <N>18</N> <A>J. Cavanagh, W.J. Fairbrother, A.G. Palmer and N.J. Skelton,</A> <J>Protein NMR Spectroscopy,</J> <O>Academic Press, San Diego, CA</O> <Y>(1996)</Y> |
| <A>Anonymous.</A> <Y>2005.</Y> <T>Microbiology of food animal feeding stuffs. Polymerase chain reaction (PCR) for the detection of food-borne pathogens. Requirements for amplification detection for qualitative methods.</T> <O>Draft International Standard ISO/FDIS 20838</O> <Y>2005.</Y> <O>DIN, Berlin, Germany.</O> |
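The tagged examples above can be turned into per-token training labels with a few lines of code. The following is a minimal sketch, not the authors' tooling: the mapping from single-letter tags to entity names is inferred from the eight entities listed in the evaluation tables, and `tagged_to_tokens` is a hypothetical helper name.

```python
import re

# Assumed mapping from the one-letter tags to the eight entity names.
LABELS = {
    "N": "CitationNumber", "A": "Author", "T": "Title", "J": "Journal",
    "V": "Volume", "Y": "Year", "P": "Pagination", "O": "Other",
}

# Matches <X>text</X>; tolerates stray whitespace inside the closing tag.
TAG_RE = re.compile(r"<([A-Z])>(.*?)</\1\s*>", re.S)


def tagged_to_tokens(line):
    """Split a tagged reference into (word, label) pairs, one per word."""
    pairs = []
    for tag, text in TAG_RE.findall(line):
        for word in text.split():
            pairs.append((word, LABELS[tag]))
    return pairs


if __name__ == "__main__":
    ref = ("<N>12.</N> <A>T.J. McCarthy et al.,</A> <J>Chem. Biol.</J> "
           "<V>12,</V> <P>1221</P> <Y>(2005).</Y>")
    for word, label in tagged_to_tokens(ref):
        print(f"{word}\t{label}")
```

Words are split on whitespace only, so punctuation stays attached to the word it follows; whether the original system tokenizes this way is an assumption.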
Features extracted from each token in a reference
| 1. Author Name Feature | Is the word in the Author Name dictionary? |
| 2. Article Title Feature | Is the word in the Article Title dictionary? |
| 3. Journal Title Feature | Is the word in the Journal Title dictionary? |
| 4. Pagination Pattern | Does the word match a pagination pattern, e.g., 200-5, H100-H105? |
| 5. Name Initial Pattern | Does the word match a name-initial pattern, e.g., J.Z., J.-Z.? |
| 6. Four Digit Year Pattern | Is the word a four-digit year, e.g., 2005? The year must be no earlier than 1500 and no later than the current year. |
| 7. et, al | Is the word “et”, “al”, “et.”, or “al.”? |
| 8. pp., p. | Is the word “pp.”, “p.”, “pp”, or “p”? |
| 9. Ended With “.” | Does the word end with “.”? |
| 10. Upper Case First Char | Is the first character of the word upper case? |
| 11. Letter Only | Does the word contain letters only? |
| 12. Digit Only | Does the word contain digits only? |
| 13. Digit and Letter | Does the word contain both digits and letters? |
| 14. Digit and Letter Only | Does the word contain digits and letters only? |
| 15. Normalized position | The position of the word normalized by the total number of words in the reference. |
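To make the feature table concrete, here is a minimal sketch of how the 15 per-token features might be computed. Several details are assumptions rather than the authors' implementation: the three dictionaries are tiny hypothetical stand-ins, the regular expressions are one plausible reading of the listed patterns, surrounding punctuation is stripped before most checks, and the normalized position is taken as the zero-based index divided by the word count.

```python
import re
from datetime import date

# Hypothetical stand-in dictionaries; real ones would be far larger.
AUTHOR_DICT = {"zhang", "zou", "thoma"}
TITLE_DICT = {"inhibitor", "differentiation", "treatment"}
JOURNAL_DICT = {"physiol", "chem", "biol", "drugs"}

# Assumed readings of the patterns in the table.
PAGINATION_RE = re.compile(r"^[A-Za-z]?\d+[-\u2013][A-Za-z]?\d+$")  # 200-5, H100-H105
INITIAL_RE = re.compile(r"^(?:[A-Z]\.-?)+$")                        # J.Z., J.-Z.
YEAR_RE = re.compile(r"^\d{4}$")


def token_features(word, index, total):
    """Return the 15 features for one word as a list of floats."""
    core = word.strip(".,;:()[]")   # assumption: ignore surrounding punctuation
    bare = word.rstrip(",;:")       # keep periods for abbreviation checks
    lower = core.lower()
    is_year = bool(YEAR_RE.match(core)) and 1500 <= int(core) <= date.today().year
    has_digit = any(c.isdigit() for c in core)
    has_alpha = any(c.isalpha() for c in core)
    return [
        float(lower in AUTHOR_DICT),                  # 1. author dictionary
        float(lower in TITLE_DICT),                   # 2. title dictionary
        float(lower in JOURNAL_DICT),                 # 3. journal dictionary
        float(bool(PAGINATION_RE.match(core))),       # 4. pagination pattern
        float(bool(INITIAL_RE.match(bare))),          # 5. name initial pattern
        float(is_year),                               # 6. four-digit year
        float(bare in {"et", "al", "et.", "al."}),    # 7. et, al
        float(bare in {"pp.", "p.", "pp", "p"}),      # 8. pp., p.
        float(word.endswith(".")),                    # 9. ends with "."
        float(word[:1].isupper()),                    # 10. upper-case first char
        float(core.isalpha()),                        # 11. letters only
        float(core.isdigit()),                        # 12. digits only
        float(has_digit and has_alpha),               # 13. digits and letters
        float(core.isalnum()),                        # 14. only digits/letters
        index / total,                                # 15. normalized position
    ]
```

Calling `token_features("2005.", 3, 10)` yields a 15-element vector in which the year, ends-with-period, and digit-only features are set.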
The total number of words and chunks for each of the 8 entities in references for evaluation
| | Citation Number | Author | Title | Journal | Volume | Year | Pagination | Other | Overall |
| Total number of words | 742 | 18273 | 16346 | 4608 | 1739 | 1791 | 2106 | 8017 | 53622 |
| Total number of chunks | 627 | 1800 | 1308 | 1758 | 1735 | 1791 | 1751 | 1708 | 12478 |
Token classification accuracy obtained by SVM and structural SVM
| | SVM | Structural SVM |
| Features from token itself (15 features) | 93.03% | 98.41% |
| Features from the token and its two neighbors (45 features) | 98.20% | 98.91% |
| Features from the token and its four neighbors (75 features) | 98.65% | 99.02% |
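The feature counts in this table (15, 45, 75) are consistent with concatenating each token's 15 features with those of one or two neighbors on each side. The sketch below illustrates that construction; `window_features` is a hypothetical helper, and zero-padding at the ends of a reference is an assumption, since the source does not say how boundaries are handled.

```python
def window_features(per_token, radius):
    """Concatenate each token's features with those of its neighbors.

    per_token: list of equal-length feature vectors, one per token.
    radius:    neighbors taken on each side (1 -> two neighbors, 2 -> four).
    Positions outside the reference are zero-padded (an assumption).
    """
    if not per_token:
        return []
    dim = len(per_token[0])
    pad = [0.0] * dim
    rows = []
    for i in range(len(per_token)):
        row = []
        for j in range(i - radius, i + radius + 1):
            row.extend(per_token[j] if 0 <= j < len(per_token) else pad)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Three dummy tokens with 15 features each.
    tokens = [[float(i)] * 15 for i in range(1, 4)]
    print(len(window_features(tokens, 1)[0]))  # 15 * 3 features
    print(len(window_features(tokens, 2)[0]))  # 15 * 5 features
```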
Chunk-level accuracies of SVM method
| | Citation Number | Author | Title | Journal | Volume | Year | Pagination | Other | Overall |
| Features from token itself | 93.47% | 74.28% | 41.90% | 51.82% | 94.52% | 99.50% | 93.95% | 83.37% | 79.12% |
| Features from the token and its two neighbors | 98.73% | 92.78% | 81.04% | 89.48% | 99.25% | 99.83% | 98.63% | 93.91% | 94.27% |
| Features from the token and its four neighbors | 98.73% | 95.11% | 84.33% | 92.61% | 99.31% | 99.83% | 98.91% | 94.91% | 95.59% |
Chunk-level accuracies of structural SVM method
| | Citation Number | Author | Title | Journal | Volume | Year | Pagination | Other | Overall |
| Features from token itself | 99.04% | 98.94% | 78.59% | 91.24% | 98.90% | 99.50% | 98.63% | 95.90% | 95.35% |
| Features from the token and its two neighbors | 99.04% | 96.39% | 90.60% | 94.31% | 99.14% | 99.94% | 98.74% | 96.08% | 96.81% |
| Features from the token and its four neighbors | 99.20% | 97.17% | 90.29% | 94.94% | 99.14% | 99.83% | 98.63% | 95.84% | 96.95% |
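The gap between token-level and chunk-level accuracy is easier to see with a small computation. The sketch below rests on an assumed definition that this record does not spell out: a chunk is a maximal run of consecutive tokens sharing a gold label, and it counts as correct only if every token in it is predicted with that label. Under this reading, two adjacent gold chunks of the same entity would be merged into one.

```python
from itertools import groupby


def chunks(labels):
    """Return (start, end, label) spans for maximal runs of equal labels."""
    spans, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((start, start + length, label))
        start += length
    return spans


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunk_accuracy(gold, pred):
    """Fraction of gold chunks in which every token is labeled correctly."""
    gold_chunks = chunks(gold)
    correct = sum(
        all(pred[i] == label for i in range(start, end))
        for start, end, label in gold_chunks
    )
    return correct / len(gold_chunks)


if __name__ == "__main__":
    gold = ["A", "A", "T", "T", "T", "Y"]
    pred = ["A", "A", "T", "J", "T", "Y"]
    print(token_accuracy(gold, pred))  # 5 of 6 tokens correct
    print(chunk_accuracy(gold, pred))  # 2 of 3 chunks correct
```

A single mislabeled word inside a long title invalidates the whole title chunk, which is one way to read why the Title column trails the token-level figures.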
Token classification accuracy obtained by SVM, structural SVM and CRF
| | SVM | Structural SVM | CRF |
| Features from the token and its four neighbors (70 features) | 97.84% | 98.99% | 98.91% |
Chunk-level accuracies of SVM, structural SVM and CRF
| | Citation Number | Author | Title | Journal | Volume | Year | Pagination | Other | Overall |
| SVM | 99.04% | 93.06% | 78.44% | 92.38% | 98.85% | 99.78% | 98.74% | 93.03% | 94.29% |
| Structural SVM | 98.89% | 96.39% | 90.21% | 94.99% | 99.25% | 99.83% | 98.80% | 95.78% | 96.82% |
| CRF | 98.57% | 97.83% | 90.75% | 94.99% | 98.96% | 99.22% | 98.91% | 95.61% | 96.93% |