| Literature DB >> 23186305 |
Bilal Wajid1, Erchin Serpedin, Mohamed Nounou, Hazem Nounou.
Abstract
: Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome aims at counting the number of reads that align to the reference sequence, and then choosing the reference sequence which has the highest number of reads aligning to it. This article explores the use of minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the MDL based proposed scheme with the standard method coming to the conclusion that "counting the number of reads of the novel genome present in the reference sequence" is not a sufficient condition. Therefore, the proposed MDL scheme includes within itself the standard method of "counting the number of reads that align to the reference sequence" and also moves forward towards looking at the model, the reference sequence, as well, in identifying the optimal reference sequence. The proposed MDL based scheme not only becomes the sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.Entities:
Year: 2012 PMID: 23186305 PMCID: PMC3608252 DOI: 10.1186/1687-4153-2012-18
Source DB: PubMed Journal: EURASIP J Bioinform Syst Biol ISSN: 1687-4145
Counting number of reads not enough
| 1 | Fibrobacter succinogenes subsp. succinogenes S85 (NC_013410.1) | 3842635 | 157 |
| 2 | Human Chromosome 21 (AC_000044.1) | 32992206 | 158 |
The table shows that choosing the reference sequence which has the highest number of reads present is not a sufficient condition. Just by looking at the “Data given the model” ≡“Number of reads found” one ends up choosing Human Chromosome 21. However, looking at the fact that Chromosome 21 is about 9×larger than S85 one realizes that actually S85 is the model of choice. Furthermore, S85 is a bacterial genome whereas Chromosome 21 comes from a eukaryote genome. PAb1 is also a bacteria, therefore, S85 is most definitely the model of choice.
Figure 1MDL proposed scheme: The output of the system shows that the three components of the encoding scheme are separated from one another by “>”. The scheme follows the format “Model > Model given the Data > Data given the hypothesis”. In the genome assembly framework the scheme mentioned above translates into “Reference Sequence >Reference Sequence according to the set of reads > Set of reads according to the Reference sequence”. “Model given the Data” is identified using {−1, 0, 1}. “1”(s) represent the base locations where the reads are found. “0”(s) represents the locations which are not covered by any read. “−1”(s) represents the locations of the genome that are inverted.
Summary of the experiment using three reads {ATAT, GGGG, CCAA} and three reference sequences {1, 2, 3}
| | | | | | |||
|---|---|---|---|---|---|---|---|
| | | | | | | ||
| 1 | 1111011110-1-1-1-1 | CCAA | 12 | 0 | ATATCGGGGCATAT>1111 0 1111 0 -1-1-1-1>CCAA | 102 | |
| 2 | ATGGGCCCTTATTGC | 000000000000000 | ATAT>GGGG>CCAA | 42 | 30 | ATGGGCCCTTATTGC> 000000000000000 >ATAT>GGGG >CCAA | 138 |
| 3 | 1111-1-1-1-11111 | ATAT>CCAA | 27 | 15 | GGGGCCCCGGGG>1111-1-1-1-11111>ATAT>CCAA | 105 |
Regret is defined as . Here the loss function, , happens to be code-length of the data , given the model class M. Whereas, “Data given the hypothesis”, is the code-length of the “Reads that do not align to the reference sequence”. The code-length in the last column is measured according to Equation (3). The experiment shows that given the MDL proposed scheme Ref. 1 is the optimal choice for a reference sequence.
Figure 2Correcting inversions in the reference sequence.(a) Reads are derived from the novel sequence. (b) The reference sequence, S, contains two inversions, shown as yellow and blue regions. (c) The sequence generated θhas both yellow and blue regions rectified. Notice that using a simple ad-hoc scheme of counting the number of reads in the reference sequence one would have made use of (b) for assembly of novel genome. However, using MDL one can now use (c) for the assembly of the novel genome.
Figure 3Removing insertions in the reference sequence.(a) Reads are derived from the novel sequence. (b) The reference sequence, S, contains two insertions, shown as shaded grey boxes. (c) The proposed MDL process generates θ. The process removes only those insertions which are larger than τ1 but smaller than τ2; where τ1 and τ2 are user-defined. To remove the other insertion the value of τ2 could be increased.
Variable number of SNPs: the experiment shows the effect of increasing the number of SNPs on choice of the reference sequence
| 1 | 183 | 52 / 52 | 62 / 59 | 62 | 1815.14 |
| 2 | 224 | 50 / 51 | 66 / 58 | 63 | 1843.35 |
S has higher number of SNPs as opposed to S. The code-length suggests that S is the model of choice as it has a smaller code-length. The results show that the MDL scheme works successfully on variable number of SNPs by choosing the model with a lower number of SNPs in them.
Variable number of insertions: the experiment shows the effect of increasing the number of insertions on choice of the reference sequence
| 1 | 0 | 0 | 136 / 196 | 0 | 1200.3 |
| 2 | 0 | 0 | 132 / 203 | 0 | 1228.25 |
The location and length of these insertions was chosen randomly. shows that out of 196 insertions in S only 136 were removed. The remaining insertions were not recovered due to the choice of τ1 and τ2. S has higher number of insertions as opposed to S. The code-length suggests that Sis the model of choice as it has a smaller code-length.
Variable number of deletions: the experiment shows the effect of increasing the number of deletions on choice of the reference sequence
| 1 | 0 | 0 | 2 / 0 | 182 | 1997.28 |
| 2 | 0 | 0 | 3 / 0 | 189 | 2015.35 |
The location and length of these deletions was chosen randomly. S has higher number of deletions as opposed to S. The code-length suggests that S is the model of choice as it has a smaller code-length. The experiment show that although no insertions were put in the actual sequence yet still two and three insertions were found for Sand S, respectively. This may be due to a large section of reads that could not align to the reference sequence on the edges of these deletions.
Variable number of inversions: the experiment shows the proposed scheme is robust to the number of inversions in the reference sequence
| 1 | 0 | 0 | 0 | 0 | 586.04 |
| 2 | 0 | 176 / 176 | 0 | 0 | 586.04 |
Both S and S have the same code-length. This is because the MDL scheme not only detected all the inversions for S but also recovered all of them. So effectively S ≡ S after the MDL process as explained in Figure 2.
Simulations with Influenza virus A, B, and C
| 1 | A, H1N1 (NC_002023.1) | 0 / 4 | 1 | 254.109 |
| 2 | A, H5N1 (NC_007357.1) | 0 / 4 | 1 | 254.109 |
| 3 | A, H2N2 (NC_007378.1) | 0 / 4 | 1 | 254.109 |
| 4 | A, H3N2 (NC_007373.1) | 0 / 4 | 1 | 254.109 |
| 5 | A, H9N2 (NC_004910.1) | 0 / 4 | 1 | 254.109 |
| 6 | B (NC_002204.1) | 4 / 4 | 1 | 68.62 |
| 7 | C (NC_006307.1) | 0 / 4 | 1 | 254.027 |
One of the sequences from Influenza virus {A, B, C} was randomly selected and modified to include {SNPs =7, inversions =4, deletions =1, insertions =3}. As Influenza virus A has five different strains while both Influenza viruses B and C each have one the MDL process was used to compare the seven sequences to determine which is the best reference sequence. Ref. Seq. 6, Influenza virus B was found to have the smallest code-length (68.62Kb), and is therefore, the model of choice. The experiment also shows that given the optimal reference sequence, in this case Influenza virus B, the MDL process rectifies all inversions (4/4). However, given non-optimal reference sequences, the proposed MDL process is not able to rectify the inversions (0/4). So the proposed algorithm chooses the optimal reference sequence, and given the optimal reference sequence if not all, at least most of the inversions are also corrected.
The experiment uses the proposed MDL scheme on the same set of reads but different set of reference sequences
| 1 | 1 | 696 | 128.60 | 0.046 | 14 |
| 2 | 2 | 696 | 128.73 | 0.031 | 47 |
| 3 | 5 | 693 | 128.575 | 0.046 | 113 |
| 4 | 10 | 684 | 127.576 | 0.046 | 229 |
| 5 | 25 | 668 | 126.615 | 0.093 | 565 |
| 6 | 50 | 650 | 126.615 | 0.109 | 650 |
| 8 | 150 | 2 | 21.164 | 0.062 | 2341 |
| 9 | 200 | 2 | 27.808 | 0.124 | 2341 |
| 10 | 300 | 2 | 41.525 | 0.140 | 2341 |
The set of reads contained 3817 reads all of which were derived from ‘Influenza A virus (A Puerto Rico 834 (H1N1)) segment 1, complete sequence’. Out of 3817 reads the method extracted 696 unique reads which were then used in the MDL proposed scheme. All the reference sequences were derived from the same Influenza A (H1N1) virus. Ref. Seq. 1% used in S.No. 1, has a length which is 1% of the actual genome. Similarly Ref. Seq. 25% has a length which is a quarter of the length of the actual genome. All other genomes were derived in a similar way. For, e.g., Ref. Seq. 200% has two H1N1 viruses concatenated together making the length twice that of the original H1N1 sequence. The code-length is calculated using Equation (3). The results show that the MDL proposed scheme chooses the best reference sequence, one which has the smallest code-length as determined by Equation (3). The MDL scheme does not choose smaller reference sequences with more unaligned reads rather than choosing larger reference sequence with smaller unaligned reads. The experiment also proves the correctness of the optimal reference sequence as it chooses Ref. Seq. 7, (shown underlined), since it has the smallest code-length, as the optimal reference sequence. It was Ref. Seq. 7 from which all the reads were derived from. Since the MDL scheme chooses Ref. Seq. 7 as the optimal sequence, the experiment also proves the correctness of the reference sequence chosen.
The exeriment tests the proposed MDL scheme on a single set of reads yet on a number of reference sequences
| 1 | 75 | 172 | 25.91 | 1755 |
| 2 | 85 | 148 | 25.10 | 1989 |
| 3 | 95 | 123 | 24.20 | 2223 |
| 5 | 105 | 108 | 24.22 | 2458 |
| 6 | 115 | 107 | 25.50 | 2692 |
| 7 | 125 | 106 | 26.78 | 2926 |
The set of reads, 390 in total, were derived from ‘Influenza A virus (A Puerto Rico 834 (H1N1)) segment 1, complete sequence’ using the ART read simulator for NGS with read length 30, standard deviation 10, and mean fragment length of 100, [79]. Similarly the reference sequences were also derived from the same H1N1 virus. Ref. Seq. 75% used in S.No. 1, has a length which is 75% of the actual genome. Similarly Ref. Seq. 125% has a quarter of the actual genome concatenated with the complete H1N1 genome making the total length 125% of H1N1. All other genomes were derived in a similar way. The code-length is calculated using Equation (3). The results show that the MDL proposed scheme chooses the correct reference sequence, Ref. Seq. 100%, (shown underlined) even when all the contending sequences are closely related to one another in terms of their genome and length.