Tim Massingham, Nick Goldman.
Abstract
BACKGROUND: The Exact Call Chemistry for the SOLiD Next-Generation Sequencing platform augments the two-base-encoding chemistry with an additional round of ligation, using an alternative set of probes, that allows some mistakes made when reading the first set of probes to be corrected. Additionally, the Exact Call Chemistry allows reads produced by the platform to be decoded directly into nucleotide sequence rather than its two-base 'color' encoding.
Year: 2012 PMID: 22726842 PMCID: PMC3464616 DOI: 10.1186/1471-2105-13-145
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Different representations of GF4
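| GF4 | 0 | 1 | α | β |
| Nucleotides | A | C | G | T |
| Colors | | | | |

| ⊕ | 0 | 1 | α | β |
| 0 | 0 | 1 | α | β |
| 1 | 1 | 0 | β | α |
| α | α | β | 0 | 1 |
| β | β | α | 1 | 0 |

| ⊗ | 0 | 1 | α | β |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | α | β |
| α | 0 | α | β | 1 |
| β | 0 | β | 1 | α |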
For the purposes of the coding theory presented here, both nucleotides and colors represent elements of the Galois field with four elements (GF4); the correspondence between them is shown in the table above. For example, the nucleotide ‘G’, the element ‘α’ and their corresponding color are considered equivalent for the purposes of calculation. A field consists of a set of elements and rules on how to add (⊕) and multiply (⊗) them together; the results of combining two elements are given by the Cayley tables above; for example, α⊕β=1 and α⊗α=β. The standard rules of associativity and commutativity for addition and multiplication still apply in finite fields, and multiplication is still distributive over addition [8]. One notable difference from ordinary arithmetic is that every element of GF4 is its own additive inverse, so addition and subtraction are equivalent operations.
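The GF4 arithmetic described in this caption can be sketched in a few lines of Python. This is only an illustration: the integer encoding (0, 1, 2 for α, 3 for β) and the names `gf4_add`, `gf4_mul` and `NUCLEOTIDES` are assumptions made for the example, not notation from the paper.

```python
# Assumed encoding of GF(4): 0 -> 0, 1 -> 1, 2 -> alpha, 3 -> beta.
ALPHA, BETA = 2, 3
NUCLEOTIDES = "ACGT"  # position i holds the nucleotide equivalent to element i

def gf4_add(x, y):
    """Addition in GF(4); with this encoding it is bitwise XOR."""
    return x ^ y

# The nonzero elements are the powers of alpha: alpha^0 = 1, alpha^1 = alpha,
# alpha^2 = beta, so multiplication adds exponents modulo 3.
_LOG = {1: 0, ALPHA: 1, BETA: 2}
_EXP = [1, ALPHA, BETA]

def gf4_mul(x, y):
    """Multiplication in GF(4) via discrete logarithms."""
    if x == 0 or y == 0:
        return 0
    return _EXP[(_LOG[x] + _LOG[y]) % 3]

# The two examples quoted in the caption.
assert gf4_add(ALPHA, BETA) == 1
assert gf4_mul(ALPHA, ALPHA) == BETA
```

Because every element is its own additive inverse, `gf4_add(x, x)` is 0 for all four elements, which is why no separate subtraction function is needed.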
Figure 1. Architecture of the SOLiD ECC encoder. The architecture of the convolutional code for the SOLiD ECC consists of two streams. The nucleotide sequence is passed through the encoder, progressing one position at a time, and the color obtained from the additions and multiplications indicated is emitted from each stream. The top ‘color stream’ is that produced by the two-base-encoding chemistry and the bottom ‘SOLiD ECC stream’ is punctured so that only every fifth color of the sequence is used.
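A two-stream encoder of this kind might be sketched as follows. The two-base-encoding taps (1, 1) reflect the standard color code, in which each color is the GF4 sum of two adjacent bases. The second tap vector, `PLACEHOLDER_TAPS`, is purely hypothetical and is not the SOLiD ECC probe generator; the puncturing phase (keeping positions 0, 5, 10, …) is likewise an assumption, as are all function names.

```python
ALPHA, BETA = 2, 3  # assumed encoding: 0, 1, 2 = alpha, 3 = beta
_LOG = {1: 0, ALPHA: 1, BETA: 2}
_EXP = [1, ALPHA, BETA]

def gf4_mul(x, y):
    """Multiplication in GF(4) via discrete logarithms."""
    if x == 0 or y == 0:
        return 0
    return _EXP[(_LOG[x] + _LOG[y]) % 3]

def encode_stream(bases, taps):
    """Slide `taps` along `bases`; emit the GF(4) sum of taps[j] * bases[i + j]."""
    colors = []
    for i in range(len(bases) - len(taps) + 1):
        acc = 0
        for j, tap in enumerate(taps):
            acc ^= gf4_mul(tap, bases[i + j])  # GF(4) addition is XOR
        colors.append(acc)
    return colors

def puncture(colors, period=5, phase=0):
    """Keep only every `period`-th color, starting at index `phase` (assumed)."""
    return colors[phase::period]

TWO_BASE_TAPS = (1, 1)         # color = base_i + base_(i+1)
PLACEHOLDER_TAPS = (1, ALPHA)  # hypothetical; NOT the real ECC generator

bases = ["ACGT".index(b) for b in "ACGTTGCAAC"]
color_stream = encode_stream(bases, TWO_BASE_TAPS)
ecc_stream = puncture(encode_stream(bases, PLACEHOLDER_TAPS))
print(color_stream)  # [1, 3, 1, 0, 1, 3, 1, 0, 1]
```

Passing the taps as a parameter keeps the sketch independent of any particular probe design, so the same function serves both streams.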
Figure 2. Block structure of encoded read. A read partitioned into blocks of five bases, showing how the two-base-encoding and ECC color calls for block i are split into five ‘data’ colors, from which the block can be called, and a ‘parity’ color, which straddles the block and its downstream neighbour. The data colors are used to determine the nucleotide sequence of the blocks and the parity color is used to detect whether an error has occurred. Note that the data colors are a mixture of both color streams, with the parity color coming from the ‘color stream’ of the code.
Syndromes for ECC generator
| Error type | Interpretation | Syndrome entries |
| +β | Complement | 0 | 0 | 0 |
| +α | Transition | 0 | 1 | 1 | 0 | 0 |
| +1 | Transcomplement | 10 | 01 | 01 | 11 | 01 |
All syndromes caused by a single error in a block of five code letters and two parity letters of the SOLiD ECC code. Each row in the table corresponds to a specific type of error at the given position of the code word, with the ‘interpretation’ of an error type being the effect it would have when considered as a nucleotide substitution. The table entries are the values of the relevant elements of the syndrome, corresponding to the upstream and downstream parity checks for the block for each error type. The error type +0 is not shown since it represents no error; its syndromes would be 00 at all positions. The equivalence classes are listed separately and do not correspond to specific error types. Note that the upstream parity letter of a block is also the downstream parity letter of the preceding block.
Patterns of error for the ECC generator
| Positions | Errors | Error pattern |
| | Single | 00100 |
| | Single | 00001 |
| | Triple | 01110 |
| | Quadruple | 01111 |
All possible patterns of error caused in corrected base-space sequence by a single wrongly corrected error in the observed sequence of type +d (where d = 1, α or β). The ‘Positions’ column indicates all possible pairs of positions at which an error can occur and be wrongfully corrected; it is not necessary to identify which member of the pair corresponds to which form of error as the resulting pattern is the same. In all cases the error pattern should be multiplied element-wise by the error type (d) to get the actual pattern (e.g. a wrongly corrected +β error with pattern 00100 results in an error pattern of 00β00).
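The element-wise scaling described in this caption can be illustrated with a short sketch. The integer encoding of GF4 and the helper names `scale_pattern` and `apply_error` are assumptions introduced for the example; the block "ACGTA" is an arbitrary illustration, not data from the paper.

```python
ALPHA, BETA = 2, 3  # assumed encoding: 0, 1, 2 = alpha, 3 = beta
SYMBOLS = "01αβ"    # printable symbol for each element
NUCLEOTIDES = "ACGT"
_LOG = {1: 0, ALPHA: 1, BETA: 2}
_EXP = [1, ALPHA, BETA]

def gf4_mul(x, y):
    """Multiplication in GF(4) via discrete logarithms."""
    if x == 0 or y == 0:
        return 0
    return _EXP[(_LOG[x] + _LOG[y]) % 3]

def scale_pattern(pattern, d):
    """Multiply each entry of a 0/1 pattern string by the error type d."""
    return "".join(SYMBOLS[gf4_mul(SYMBOLS.index(c), d)] for c in pattern)

def apply_error(block, scaled_pattern):
    """Add a scaled error pattern to a block of bases (GF(4) addition is XOR)."""
    return "".join(
        NUCLEOTIDES[NUCLEOTIDES.index(b) ^ SYMBOLS.index(e)]
        for b, e in zip(block, scaled_pattern)
    )

print(scale_pattern("00100", BETA))   # 00β00
print(scale_pattern("01110", ALPHA))  # 0ααα0
print(apply_error("ACGTA", "00β00"))  # ACCTA
```

In the last line the central G becomes C, which agrees with reading a +β error as a complement substitution.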
Syndromes for alternative codes
| Generator | Error type | Syndrome entries |
| 10 | + | 1 | 1 | 0 | 0 | 0 |
| | + | 0 | 0 | 11 | 0 |
| | +1 | 10 | 01 | 01 | 01 |
| 1 | + | 1 | 1 | 0 | 11 | 0 |
| | + | 1 | 0 | 0 |
| | +1 | 10 | 01 | 01 |
All syndromes caused by a single error in a block of five code letters and two parity letters for codes with the specified generator. Each row corresponds to a specific type of error at the given position of the code word and the table entries are the values of the relevant elements of the syndrome corresponding to the upstream and downstream parity checks for the block for each error type. The error type +0 is not shown since it represents no error. The equivalence classes are listed separately and do not correspond to specific error types. Note that the upstream parity letter of a block is also the downstream parity letter of the preceding block.
Patterns of error for alternative codes
| Generator | Positions | Errors | Error pattern |
| 10 | | Single | 01000 |
| | | Single | 00010 |
| | | Double | 00011 |
| | | Single | 00001 |
| 1 | | Single | 00100 |
| | | Single | 00001 |
All possible patterns of error caused in corrected base-space sequence by a single, wrongly corrected error in the observed sequence for a code with the given generator. An error occurs at one of the positions in the ‘Positions’ column; the other member of ‘Positions’ is where the observed sequence is wrongfully corrected. As in Table 3, all error patterns should be multiplied element-wise by the error type (1, α or β) to get the actual pattern.
Number of errors for simulated data
| Space | Probe generator | Mapped (%) | 0 errors (%) | 1 error (%) | 2 errors (%) | 3 errors (%) | 4 errors (%) | 5 errors (%) |
| Color | None | 77.1 | 38.4 | 16.7 | 10.5 | 6.2 | 3.6 | 1.8 |
| | 1 | 78.4 | 48.6 | 10.2 | 9.7 | 5.0 | 3.2 | 1.7 |
| | 10 | 78.3 | 48.4 | 10.3 | 9.7 | 5.0 | 3.2 | 1.7 |
| | 1 | 78.3 | 49.8 | 9.1 | 9.7 | 4.9 | 3.2 | 1.7 |
| Base | None | 47.2 | 38.4 | 2.3 | 1.6 | 1.8 | 1.8 | 1.2 |
| | 1 | 64.3 | 49.6 | 4.7 | 2.1 | 3.1 | 2.6 | 2.1 |
| | 10 | 65.5 | 49.4 | 6.2 | 3.0 | 2.3 | 2.4 | 2.2 |
| | 1 | 64.9 | 50.8 | 5.0 | 2.2 | 2.4 | 2.4 | 2.2 |
Percentage of reads mapped (five edits or fewer), and mapped with a given number of errors, for one million simulated reads using the codes with probe generators as specified. For comparison, figures are also given using only the two-base-encoding probe set (probe generator ‘None’). Since color-space reads have their first position trimmed before mapping to produce reads 49 colors long, the percentages of mapped reads and reads with a given number of errors are slightly inflated compared to those given for base-space reads.