Miguel García-Remesal, Alejandro Cuevas, Victoria López-Alonso, Guillermo López-Campos, Guillermo de la Calle, Diana de la Iglesia, David Pérez-Rey, José Crespo, Fernando Martín-Sánchez, Víctor Maojo.
Abstract
BACKGROUND: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to navigate this information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information.
Year: 2010 PMID: 20682041 PMCID: PMC2923139 DOI: 10.1186/1471-2105-11-410
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the primer and probe extraction process.
Figure 2. ST corresponding to a PDF paper from the Virology Journal identified by PubMed ID 18234069. Documents are organized hierarchically. The root node <, null, null> represents the entire document. Complex sections (e.g. those containing multiple subsections) are hierarchically decomposed according to the original paper structure. For instance, the section <, "Abstract", null> can be decomposed into its three child sections: <, "Background", "Dengue is...">, <, "Results", "An optimal..."> and <, "Conclusion", "These findings...">. Table nodes (e.g. <, "Table 2: Comparison of...", "M-RT-PCR\tVirus isolation\nPositive\t96 (15.48%)\t...">) and figure nodes (e.g. <, "Figure 1", "1.5% Agarose gel electrophoresis...">) are considered special sections and are thus allocated as children of the root node. The escape sequences "\t" and "\n" denote tab and newline characters. The natural reading order of the PDF paper can be reproduced by iterating the ST in depth-first order.
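The depth-first reading order mentioned in the caption can be illustrated with a minimal sketch. The `Node` class, its field names and the type labels `"Document"`, `"Section"` and `"Figure"` are assumptions made for illustration only; the source does not show how the authors implement the ST.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One tree node: a <type, title, content> triple plus its child sections."""
    kind: str                       # hypothetical type label, e.g. "Section"
    title: Optional[str] = None
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def depth_first(node: Node) -> Iterator[Node]:
    """Yield nodes in pre-order: each parent before its children."""
    yield node
    for child in node.children:
        yield from depth_first(child)


# Toy tree mirroring the caption: an abstract with three subsections,
# and a figure attached directly to the root as a special section.
abstract = Node("Section", "Abstract", None, [
    Node("Section", "Background", "Dengue is..."),
    Node("Section", "Results", "An optimal..."),
    Node("Section", "Conclusion", "These findings..."),
])
figure = Node("Figure", "Figure 1", "1.5% Agarose gel electrophoresis...")
root = Node("Document", None, None, [abstract, figure])

titles = [n.title for n in depth_first(root) if n.title is not None]
```

Here `titles` lists the section titles in the order a reader would meet them, with the figure visited after the regular sections because it hangs off the root.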
Equivalences between alphabet symbols and nucleotides
| Alphabet Symbol | Permissible Nucleotides | Complement | Meaning |
|---|---|---|---|
| A | A | T | [A]denine |
| B | C, G, T | V | Any but Adenine |
| C | C | G | [C]ytosine |
| D | A, G, T | H | Any but Cytosine |
| G | G | C | [G]uanine |
| H | A, C, T | D | Any but Guanine |
| K | G, T | M | [K]eto |
| M | A, C | K | A[M]ino |
| N | A, C, G, T | N | A[N]y nucleotide |
| R | A, G | Y | Pu[R]ine |
| S | C, G | S | [S]trong (3 H-bonds) |
| T | T | A | [T]hymine |
| V | A, C, G | B | Any but Thymine |
| W | A, T | W | [W]eak (2 H-bonds) |
| Y | C, T | R | P[Y]rimidine |
This table (adapted from http://www.geneinfinity.org/sp_nucsymbols.html) shows the mappings between wildcard symbols used to represent DNA sequences in scientific papers and their permissible nucleotide types.
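As an illustration of how these equivalences can be used in practice, the following sketch encodes the symbol table as two Python dictionaries and derives a reverse complement and a regular expression from a degenerate sequence. The function names `reverse_complement` and `to_regex` are hypothetical and are not taken from the paper.

```python
import re

# Symbol -> permissible nucleotides, following the standard wildcard codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "K": "GT", "M": "AC", "R": "AG", "Y": "CT",
    "S": "CG", "W": "AT", "N": "ACGT",
}

# Symbol -> complementary symbol.
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "B": "V", "V": "B", "D": "H", "H": "D",
    "K": "M", "M": "K", "R": "Y", "Y": "R",
    "S": "S", "W": "W", "N": "N",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def to_regex(seq: str) -> "re.Pattern[str]":
    """Compile a degenerate sequence into a regex over plain A/C/G/T strings."""
    parts = []
    for base in seq.upper():
        allowed = IUPAC[base]
        # Single nucleotides match literally; wildcards become character classes.
        parts.append(allowed if len(allowed) == 1 else "[" + allowed + "]")
    return re.compile("".join(parts))
```

For example, `to_regex("CGTCCM")` matches both `CGTCCA` and `CGTCCC`, and `reverse_complement("CGTCCM")` returns `"KGGACG"`.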
Sample sequences recognized by each detector
| Detector | PMID | Text String | List of Tokens |
|---|---|---|---|
| | 19781080 | ...primers AA247 (5'-TGCCATTGCCAAAGAGAC-3') and pLQ510-rp1... | {"TGCCATTGCCAAAGAGAC"} |
| | 19664269 | ...mecA gene, mecAR (5'-TTACTCATGCCATACATAAATGGATA-\ | {"TTACTCATGCCATACATAAATGGATA", "GACG"} |
| | 19379498 | ...specific primer pair traD-F (5'-caatgcttgatctatttggtag-3') and traD-R... | {"caatgcttgatctatttggtag"} |
| | 19758438 | ...MY 09, 5-CGT CCM\ | {"CGT", "CCM", "ARR", "GGA", "WAC", "TGA", "TC"} |
| | 19799780 | B-globin outside R @ CTC AAG TTC TCA GGA TCC A @ 1st round PCR primer for Human Beta globin DNA | {"CTC", "AAG", "TTC", "TCA", "GGA", "TCC", "A"} |
| | 18847469 | btherm @ GAT GTG CCG GGC TCC TGC ATG @ This study | {"GAT", "GTG", "CCG", "GGC", "TCC", "TGC", "ATG"} |
| | 18154687 | Stx1 @ GTA CGT CTT TAC TGA TGA TTG ATA GTG GCA CAG GG @ 35 @ 73.5 | {"GTA", "CGT", "CTT", "TAC", "TGA", "TGA", "TTG", "ATA", "GTG", "GCA", "CAG", "GG"} |
| | 19558693 | ...are listed below.\n | {"ATG", "GTG", "GGC", "CAG", "CTT", "GTC"} |
| | 19754958 | ...with primer N309 (ACATGCGGATCCCTCGAGCCTTTGAA-\nGATGACTAACTCCCCA) and N297... | {"ACATGCGGATCCCTCGAGCCTTTGAA", "GATGACTAACTCCCCA"} |
| | 19737401 | ...and 3' AAGCT TGGTA CCTCA CTGCA\nGCAGA GCGCT GAGGC CCAGC AGCAC. The resulting PCR... | {"AAGCT", "TGGTA", "CCTCA", "CTGCA", "GCAGA", "GCGCT", "GAGGC", "CCAGC", "AGCAC"} |
| | 19149882 | 1 @ XAC0340 @ 432 @ gATACCCCATATgAATgCgAT | {"gATACCCCATATgAATgCgAT"} |
| | 19775435 | 20 @ F:GAGATGGATTAACCAGATGTCTTAAAAACTATCGTAAC | {":","GAGATGGATTAACCAGATGTCTTAAAAACTATCGTAAC"} |
This table shows some examples of sequences that can be recognized by each detector. The number under the column "Detector" identifies the detector that recognized the "List of Tokens" from the string "Text String", which can be found in the manuscript whose PubMed Identifier (PMID) is shown under the "PMID" column. The symbols \n and @ denote the newline and the table cell separator characters respectively.
Figure 3. State transition diagrams describing the preliminary sequence recognizers. Circles represent regular states, whereas double circles stand for final (accepting) states. Edges denote state transitions triggered by the occurrence of any of the symbols drawn on the edges. These include 's' symbols in blue that represent strings of any length belonging to Σ+, whereas 's1', 's2' and 's3' are strings of symbols from Σ+ of lengths 1, 2 and 3 respectively. Green items represent different literals such as dashes, colons, newline tokens, etc. States labeled with the number 0 that are pointed at by an arrow with no origin represent initial states.
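To make the idea of a finite state machine-based recognizer concrete, here is a minimal sketch of one such machine. It is not one of the authors' diagrams: it is a hypothetical recognizer for the single pattern 5'-s-3', where s is a non-empty string over the nucleotide alphabet, and the state numbering and the name `find_flanked_sequences` are assumptions.

```python
from typing import List

# Alphabet of nucleotide symbols: the four bases plus the wildcard codes.
SIGMA = set("ACGTBDHVKMRYSWN")


def find_flanked_sequences(text: str) -> List[str]:
    """Return every non-empty string s over SIGMA that occurs as 5'-s-3'."""
    found: List[str] = []
    buf: List[str] = []
    state = 0  # initial state
    for ch in text:
        if state == 1 and ch == "'":
            state = 2
        elif state == 2 and ch == "-":
            state, buf = 3, []
        elif state in (3, 4) and ch.upper() in SIGMA:
            buf.append(ch)  # accumulate the candidate sequence
            state = 4
        elif state == 4 and ch == "-":
            state = 5
        elif state == 5 and ch == "3":
            state = 6
        elif state == 6 and ch == "'":
            found.append("".join(buf))  # accepting transition
            state = 0
        else:
            # No valid transition: restart, reusing ch if it opens a new match.
            state = 1 if ch == "5" else 0
    return found
```

Applied to the string `...primers AA247 (5'-TGCCATTGCCAAAGAGAC-3') and pLQ510-rp1...`, this sketch returns `["TGCCATTGCCAAAGAGAC"]`. It deliberately ignores harder cases such as sequences split across line breaks or table cells, which would need additional states and literals.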
The knowledge base in a nutshell (I)
| Rules | |
|---|---|
| ∃ | |
| ∃ | |
| ∃( | |
| in_dictionary(s1) ∧ length(s1) ≥ 3 → discard(s) ∧ add(s') | |
| in_dictionary( | |
| size(s) ≥ 2 → merge(s) | |
This table shows the complete knowledge base for refining DNA sequences. R1 is designed to discard short sequences. R2, R3, R6 and R7 are designed to refine noisy sequences, whereas R5 deals with incorrectly merged sequences. R4, by contrast, removes concatenations of dictionary words recognized by the detectors as valid sequences. Finally, R8 converts a list of tokens containing two or more elements into a singleton whose only element represents the refined sequence. The symbol s denotes a list of tokens s = {s1, s2, ..., sn} of size n. See Table 4 for details on the functions, actions and symbols used by the different rules.
The knowledge base in a nutshell (II)
| Length( |
| discard( |
| add( |
| affix_in_sequence_tail( |
| affix_in_sequence_head( |
| affix_within_sequence( |
| in_dictionary( |
| size( |
| merge( |
Some examples of automated sequence refinement
| List of Tokens | Execution Trace | Refined Singleton(s) |
|---|---|---|
| {"CATATTCACCTTTTCAGGCGTTTTGACCGT", "TAMRA", "T"} | <R2> | {"CATATTCACCTTTTCAGGCGTTTTGACCGT"} |
| {"ATAAC", "TCGAG", "GTGGA", "ATTCA", "TGGCA", "TCTAC", "TTCGT", "ATGAC", "TATTGC", "and", "AAGCT", "TGGTA", "CCTCA", "CTGCA", "GCAGA", "GCGCT", "GAGGC", "CCAGC", "AGCAC"} | <R5, R8, R8> | {"ATAACTCGAGGTGGAATTCATGGCATCTACTTCGTATGACTATTGC"}, |
| {"than", "standard"} | <R4> | - |
| {"DNA"} | <R1> | - |
| {"TTCTTTTGGTGGACGATGTG", "and", "GAGGGACGC", "TTGGTAACG", "TAMRA", "and", "TCGCAAGCC", "AAGCAAATAC", "TAMRA", "T", "and", "GAGATAGGGTGCGATGGTTG", "TCGGCGATGACTACGACA"} | <R5, R3, R5, R5, R3, R8, R8, R8> | {"TTCTTTTGGTGGACGATGTG"}, |
| {"RNA", "strand"} | <R7, R1> | - |
| {":","GCGGCCTGATAAGGGATATTGGAAGC", "R", ":", "GGCGAAATTCATTAAAGAGGATCCTGACAC"} | <R3, R5> | {"GCGGCCTGATAAGGGATATTGGAAGC"}, {"GGCGAAATTCATTAAAGAGGATCCTGACAC"} |
This table shows the results of using the knowledge-based system to refine some sample lists of tokens produced by different recognizers in phase 2. Each row of the table presents the refinement process of a single list of tokens, including: (1) the initial contents of the facts base, (2) the execution trace and (3) the final state of the facts base. All singletons in the facts base at the end of the execution are considered as valid and refined sequences.
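The flavor of this refinement step can be conveyed with a heavily simplified sketch covering only three behaviors: dropping lists made up of dictionary words, merging multi-token lists into a singleton, and discarding short sequences. It is not the authors' expert system. The length threshold `MIN_LENGTH`, the toy `DICTIONARY` and the fixed order in which the checks run are all assumptions, and the affix-handling rules are omitted.

```python
from typing import List, Optional

MIN_LENGTH = 5  # assumed cut-off; the source does not state the actual value
DICTIONARY = {"than", "standard", "and", "strand"}  # toy stand-in for a word list


def refine(tokens: List[str]) -> Optional[str]:
    """Return one refined sequence, or None if the token list is discarded."""
    # Dictionary check: a list consisting only of ordinary words is not a sequence.
    if all(t.lower() in DICTIONARY for t in tokens):
        return None
    # Merge step: two or more tokens are concatenated into a single sequence.
    merged = "".join(tokens)
    # Length check: very short strings (e.g. "DNA") are discarded.
    if len(merged) < MIN_LENGTH:
        return None
    return merged
```

With these assumptions, `refine(["than", "standard"])` and `refine(["DNA"])` both return `None`, while `refine(["GAT", "GTG", "CCG", "GGC", "TCC", "TGC", "ATG"])` returns `"GATGTGCCGGGCTCCTGCATG"`.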
Figure 4. Plot showing how CSs are assigned to the matched organism names depending on the length of the match. Unlike regular English noun groups, where the meaning of the noun is narrowed by the preceding words, organism names are made more specific by post-positive words. The plot shows the CSs assigned to matches of length l for different values of L. This figure shows that the more specific (i.e. the longer) the matches are, the higher the assigned CS.
Summary of results of the evaluation of activities 2 and 3
| No. of sequences | Recognized | Not Recognized | False Positives |
|---|---|---|---|
| 3999 | 3830 | 169 | 79 |
| Precision | 97.98% | | |
| Recall | 95.77% | | |
| F-score | 0.9686 | | |
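The three summary figures are consistent with precision, recall and F-score computed from the counts in the table, as the following check shows. Treating "Recognized" as true positives and "Not Recognized" as false negatives is an interpretation of the column names, not something stated in the table itself.

```python
recognized = 3830        # true positives
not_recognized = 169     # false negatives
false_positives = 79

# Precision: share of reported sequences that are real sequences.
precision = recognized / (recognized + false_positives)
# Recall: share of the 3999 real sequences that were recognized.
recall = recognized / (recognized + not_recognized)
# F-score: harmonic mean of precision and recall.
f_score = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.2%}")
print(f"recall    = {recall:.2%}")
print(f"F-score   = {f_score:.4f}")
```

Rounded to the same number of digits, the three values are 97.98%, 95.77% and 0.9686.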
Summary of results of the assessment of activity 4
| Number of Sequences | Correct Tagging | Incorrect Tagging | Information Not Found | |
|---|---|---|---|---|
| 3830 | 2936 | 168 | 726 | |
| 2936 | 1356 | 0 | 1580 | |