Vinhthuy Phan, Shanshan Gao, Quang Tran, Nam S Vo.
Abstract
BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder when the genomes contain complex repeat patterns, there has been little effort to quantify the relationship between genome complexity and the difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for understanding and quantifying this relationship.
Year: 2015 PMID: 26678826 PMCID: PMC4674900 DOI: 10.1186/1471-2105-16-S17-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Information on the selected 100 genomic sequences [Part 1].
| ID | Genome size | Description | Lineage | Source |
|---|---|---|---|---|
| AE017198 | 1992676 | Lactobacillus johnsonii NCC 533 | Bacteria, Firmicutes | EBI |
| AJ270060 | 14497843 | Arabidopsis thaliana DNA chr. 4, long arm | Eukaryota, Viridiplantae | EBI |
| AM055943 | 2013089 | Toxoplasma gondii RH, genomic DNA chr. Ib | Eukaryota, Alveolata | EBI |
| AM263198 | 2814130 | Listeria welshimeri serovar 6b str. SLCC5334 | Bacteria, Firmicutes | EBI |
| AM269894 | 1347714 | Eimeria tenella chr. 1, ordered contigs | Eukaryota, Alveolata | EBI |
| BA000004 | 4202352 | Bacillus halodurans C-125 DNA | Bacteria, Firmicutes | EBI |
| BN001302 | 4011161 | TPA: Aspergillus nidulans FGSC A4 chr. II | Eukaryota, Fungi | EBI |
| BX284601 | 15072434 | Caenorhabditis elegans Bristol N2 genomic chr., I | Eukaryota, Metazoa | EBI |
| CAID01000012 | 521582 | Ostreococcus tauri WGS project CAID00000000 data, contig chr. 12 | Eukaryota, Viridiplantae | EBI |
| CM000001 | 122678785 | Canis lupus familiaris chr. 1 | Eukaryota, Metazoa | EBI |
| CM000038 | 23914537 | Canis lupus familiaris chr. 38 | Eukaryota, Metazoa | EBI |
| CM000043 | 1786351 | Cryptococcus neoformans var. neoformans B-3501A chr. 4 | Eukaryota, Fungi | EBI |
| CM000071 | 19787792 | Drosophila pseudoobscura pseudoobscura strain MV2-25 chr. 3 | Eukaryota, Metazoa | EBI |
| CM000091 | 57791882 | Rattus norvegicus strain BN/SsNHsdMCW chr. 20 | Eukaryota, Metazoa | EBI |
| CM000110 | 11219875 | Gallus gallus chr. 18 | Eukaryota, Metazoa | EBI |
| CM000134 | 21712932 | Oryza sativa (indica cultivar-group) chr. 9 | Eukaryota, Viridiplantae | EBI |
| CM000152 | 6357299 | Dictyostelium discoideum AX4 chr. 3 | Eukaryota, Amoebozoa | EBI |
| CM000157 | 22324452 | Drosophila yakuba strain Tai18E2 chr. 2L | Eukaryota, Metazoa | EBI |
| CM000158 | 21139217 | Drosophila yakuba strain Tai18E2 chr. 2R | Eukaryota, Metazoa | EBI |
| CM000169 | 4918979 | Aspergillus fumigatus Af293 chr. 1 | Eukaryota, Fungi | EBI |
| CM000177 | 161428367 | Bos taurus chr. 1 | Eukaryota, Metazoa | EBI |
| CM000201 | 44081797 | Bos taurus chr. 25 | Eukaryota, Metazoa | EBI |
| CM000208 | 4054025 | Trypanosoma brucei brucei strain 927/4 GUTat10.1 chr. 10 | Eukaryota, Euglenozoa | EBI |
| CM000209 | 199526509 | Mus musculus chr. 1 | Eukaryota, Metazoa | EBI |
| CM000302 | 78773432 | Macaca mulatta chr. 16 | Eukaryota, Metazoa | EBI |
| CM000377 | 185838109 | Equus caballus chr. 1 | Eukaryota, Metazoa | EBI |
| CM000452 | 2067354 | Plasmodium vivax chr. 11 | Eukaryota, Alveolata | EBI |
| CM000515 | 118548696 | Taeniopygia guttata chr. 1 | Eukaryota, Metazoa | EBI |
| CM000530 | 16962381 | Taeniopygia guttata chr. 13 | Eukaryota, Metazoa | EBI |
| CM000572 | 46535552 | Pongo abelii chr. 22 | Eukaryota, Metazoa | EBI |
| CM000575 | 8914601 | Fusarium graminearum PH-1 chr. 2 | Eukaryota, Fungi | EBI |
| CM000580 | 4643527 | Gibberella moniliformis 7600 chr. 3 | Eukaryota, Fungi | EBI |
| CM000592 | 5212762 | Fusarium oxysporum f. sp. lycopersici | Eukaryota, Fungi | EBI |
Information on the selected 100 genomic sequences [Part 2].
| ID | Genome size | Description | Lineage | Source |
|---|---|---|---|---|
| CM000612 | 1002813 | Phaeodactylum tricornutum CCAP | Eukaryota, Stramenopiles | EBI |
| CM000638 | 3042585 | Thalassiosira pseudonana CCMP1335 | Eukaryota, Stramenopiles | EBI |
| CM000692 | 1385275 | Saccharomyces kluyveri NRRL Y-12651 | Eukaryota, Fungi | EBI |
| CM000767 | 55460251 | Sorghum bicolor chr. 8 | Eukaryota, Viridiplantae | EBI |
| CM000769 | 60981646 | Sorghum bicolor chr. 10 | Eukaryota, Viridiplantae | EBI |
| CM000777 | 301354135 | Zea mays chr. 1 | Eukaryota, Viridiplantae | EBI |
| CM000799 | 47997241 | Oryctolagus cuniculus chr. 10 | Eukaryota, Metazoa | EBI |
| CM000829 | 61220071 | Sus scrofa chr. 18 | Eukaryota, Metazoa | EBI |
| CM000831 | 1255352 | Drosophila virilis strain 15010-1051.88 | Eukaryota, Metazoa | EBI |
| CM000850 | 41906774 | Glycine max chr. 17 | Eukaryota, Viridiplantae | EBI |
| CM000875 | 44557958 | Callithrix jacchus chr. 20 | Eukaryota, Metazoa | EBI |
| CM000906 | 55886266 | Ovis aries chr. 22 | Eukaryota, Metazoa | EBI |
| CM000907 | 66770968 | Ovis aries chr. 23 | Eukaryota, Metazoa | EBI |
| CM000917 | 27037145 | Nasonia vitripennis chr. 3 | Eukaryota, Metazoa | EBI |
| CM001221 | 42630297 | Medicago truncatula chr. 5 | Eukaryota, Viridiplantae | EBI |
| CM001222 | 23282162 | Medicago truncatula chr. 6 | Eukaryota, Viridiplantae | EBI |
| CM001276 | 232296185 | Macaca fascicularis chr. 1 | Eukaryota, Metazoa | EBI |
| CM001294 | 65364038 | Macaca fascicularis chr. 19 | Eukaryota, Metazoa | EBI |
| CP000048 | 922307 | Borrelia hermsii DAH | Bacteria, Spirochaetes | EBI |
| CP000496 | 2740984 | Scheffersomyces stipitis CBS 6054 chr. 2, complete sequence | Eukaryota, Fungi | EBI |
| CP000828 | 6503724 | Acaryochloris marina MBIC11017 | Bacteria, Cyanobacteria | EBI |
| CP001037 | 8234322 | Nostoc punctiforme PCC 73102 | Bacteria, Cyanobacteria | EBI |
| CP001141 | 945026 | Phaeodactylum tricornutum CCAP | Eukaryota, Stramenopiles | EBI |
| CP001681 | 5167383 | Pedobacter heparinus DSM 2366 | Bacteria, Bacteroidetes | EBI |
| CP001699 | 9127347 | Chitinophaga pinensis DSM 2588 | Bacteria, Bacteroidetes | EBI |
| CP001982 | 5097447 | Bacillus megaterium DSM319 | Bacteria, Firmicutes | EBI |
| CP002287 | 7013095 | Achromobacter xylosoxidans A8 | Bacteria, Proteobacteria | EBI |
| CP002987 | 4044777 | Acetobacterium woodii DSM 1030 | Bacteria, Firmicutes | EBI |
| CP003170 | 9239851 | Actinoplanes sp. SE50/110 | Bacteria, Actinobacteria | EBI |
| CP003348 | 4321753 | Desulfitobacterium dehalogenans ATCC 51507 | Bacteria, Firmicutes | EBI |
| CP003872 | 5196935 | Acidovorax sp. KKS102 | Bacteria, Proteobacteria | EBI |
| CR380954 | 1050361 | Candida glabrata strain CBS138 chr. H complete sequence | Eukaryota, Fungi | EBI |
| CU234118 | 7456587 | Bradyrhizobium sp. ORS278, complete sequence | Bacteria, Proteobacteria | EBI |
| CU329672 | 2452883 | Schizosaccharomyces pombe chr. III, complete sequence | Eukaryota, Fungi | EBI |
Information on the selected 100 genomic sequences [Part 3].
| ID | Genome size | Description | Lineage | Source |
|---|---|---|---|---|
| CU928173 | 1114666 | Zygosaccharomyces rouxii strain CBS732 chr. A complete sequence | Eukaryota, Fungi | EBI |
| DG000010 | 27390870 | Oryzias latipes DNA, chr. 10, strain: HdrR | Eukaryota, Metazoa | EBI |
| FA000001 | 10049037 | Drosophila melanogaster unordered unlocalized genomic scaffolds (chrUn) | Eukaryota, Metazoa | EBI |
| FM178379 | 3325165 | Aliivibrio salmonicida LFI1238 chr. 1 | Bacteria, Proteobacteria | EBI |
| FN543502 | 5346659 | Citrobacter rodentium ICC168 | Bacteria, Proteobacteria | EBI |
| FN554974 | 4531609 | Trypanosoma brucei gambiense DAL972 chr. 11, complete sequence | Eukaryota, Euglenozoa | EBI |
| FO082874 | 3568623 | Babesia microti strain RI chr. III, complete sequence | Eukaryota, Alveolata | EBI |
| FP929060 | 3108859 | Clostridiales sp. SM4/1 draft genome | Bacteria, Firmicutes | EBI |
| FR798980 | 512965 | Leishmania braziliensis MHOM/BR/75/M2904, chr. 6 | Eukaryota, Euglenozoa | EBI |
| GCA_000002035.2 | 60348388 | Danio rerio genome assembly, chr. 1 | Eukaryota, Metazoa | EBI |
| GCA_000151905.1 | 229507203 | Gorilla gorGor3.1 chr. 1 | Eukaryota | Ensembl |
| HE601630 | 9743550 | Schistosoma mansoni strain Puerto Rico chr. 7 | Eukaryota, Metazoa | EBI |
| HE616744 | 1292049 | Torulaspora delbrueckii CBS 1146 chr. 3 | Eukaryota, Fungi | EBI |
| HE616749 | 833973 | Torulaspora delbrueckii CBS 1146 chr. 8 | Eukaryota, Fungi | EBI |
| HE806319 | 1449145 | Tetrapisispora blattae CBS 6284 chr. 4 | Eukaryota, Fungi | EBI |
| HE978314 | 1290777 | Kazachstania naganishii CBS 8797 chr. 1 | Eukaryota, Fungi | EBI |
| NC_003070.9 | 30427671 | Arabidopsis thaliana chr. 1, complete sequence | Eukaryota, Viridiplantae | NCBI |
| NC_007605 | 171823 | Human herpesvirus 4 complete wild type genome | Viruses, dsDNA viruses | NCBI |
| NC_008394.4 | 45064769 | Oryza sativa Japonica Group DNA, chr. 1, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008397.2 | 30039014 | Oryza sativa Japonica Group DNA, chr. 4, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008398.2 | 32124789 | Oryza sativa Japonica Group DNA, chr. 5, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008399.2 | 30357780 | Oryza sativa Japonica Group DNA, chr. 6, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008400.2 | 28530027 | Oryza sativa Japonica Group DNA, chr. 7, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008401.2 | 23661561 | Oryza sativa Japonica Group DNA, chr. 8, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008403.2 | 35571569 | Oryza sativa Japonica Group DNA, chr. 10, complete sequence, cultivar: Nipponbare | Eukaryota, Viridiplantae | NCBI |
| NC_008467.1 | 35863200 | Populus trichocarpa linkage group I, whole genome shotgun sequence | Eukaryota, Viridiplantae | NCBI |
| NT_024477.14 | 1034903 | Homo sapiens chr. 12 genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
| NT_024498.12 | 369930 | Homo sapiens chr. 13 genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
| NT_029928.13 | 3915179 | Homo sapiens chr. 3 genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
| NT_077528.2 | 556644 | Homo sapiens chr. 7 genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
| NT_078094.2 | 868660 | Homo sapiens chr. 15 genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
| NT_167185.1 | 3353625 | Homo sapiens chr. 1 genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
| NT_167196.1 | 754004 | Homo sapiens chr. X genomic contig, GRCh37.p13 Primary Assembly | Eukaryota, Metazoa | NCBI |
Figure 1. Running time (in seconds) of aligners as a function of genome size, with 2x coverage, a read length of 100, a sequencing error rate of 2%, and a mutation rate of 0.1%.
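The Figure 1 setup fixes coverage and read length, which together determine how many reads must be simulated for each genome. The sketch below works through that arithmetic. It assumes the usual definition of coverage (total read bases divided by genome length) and rounds up; the function name `reads_for_coverage` is hypothetical, and the simulator actually used may round or count differently.

```python
import math


def reads_for_coverage(genome_size: int, coverage: float, read_length: int) -> int:
    """Smallest read count with reads * read_length >= coverage * genome_size.

    Assumes coverage = (number of reads * read length) / genome size.
    """
    if genome_size <= 0 or read_length <= 0 or coverage <= 0:
        raise ValueError("genome_size, coverage and read_length must be positive")
    return math.ceil(coverage * genome_size / read_length)


# Genome sizes copied from the tables of selected sequences.
genomes = {
    "AE017198": 1_992_676,    # Lactobacillus johnsonii NCC 533
    "CM000777": 301_354_135,  # Zea mays chr. 1
}

for accession, size in genomes.items():
    n_reads = reads_for_coverage(size, coverage=2, read_length=100)
    print(f"{accession}: {n_reads} reads")
```

Under these assumptions the read count grows linearly with genome size, which is why running time is plotted against genome size.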
Precision and recall averaged across 100 genomes at read lengths 50, 75, 100.
| Aligner | Prec-50 | Rec-50 | Prec-75 | Rec-75 | Prec-100 | Rec-100 |
|---|---|---|---|---|---|---|
| Bowtie2 | 0.9871 | 0.9062 | 0.9943 | 0.9721 | 0.9965 | 0.9891 |
| BWA-SW | 0.9886 | 0.8983 | 0.9952 | 0.9831 | 0.9972 | 0.9951 |
| CUSHAW2 | 0.9882 | 0.9868 | 0.9956 | 0.9956 | 0.9975 | 0.9975 |
| GASSST | 0.9836 | 1.1109 | 0.9897 | 1.0339 | 0.9914 | 0.9757 |
| Masai | 0.9889 | 0.9861 | 0.9958 | 0.9903 | 0.9976 | 0.9790 |
| mrFAST | 0.9408 | 0.5700 | 0.9862 | 0.9166 | 0.9833 | 0.9268 |
| SeqAlto | 0.9875 | 0.8851 | 0.9956 | 0.9748 | 0.9976 | 0.9925 |
| SHRiMP2 | 0.9892 | 0.9798 | 0.9958 | 0.9905 | 0.9975 | 0.9974 |
| Smalt | 0.9858 | 0.9714 | 0.9954 | 0.9944 | 0.9974 | 0.9974 |
| SOAP2 | 0.9893 | 0.9025 | 0.9959 | 0.7904 | 0.9976 | 0.6526 |
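To make the averaged figures concrete, the following sketch computes per-genome precision and recall from read counts and then averages them across genomes. The definitions are assumptions, not taken from this section: precision is taken as correctly aligned reads over aligned reads, and recall as correctly aligned reads over all reads. The class `AlignmentCounts` and the counts are made up for illustration. With each read counted at most once, these definitions cannot yield a recall above 1, so the sketch does not attempt to reproduce such values in the table.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AlignmentCounts:
    total_reads: int    # reads given to the aligner
    aligned_reads: int  # reads for which the aligner reported an alignment
    correct_reads: int  # aligned reads placed at their true origin


def precision(c: AlignmentCounts) -> float:
    # Assumed definition: correct / aligned.
    return c.correct_reads / c.aligned_reads if c.aligned_reads else 0.0


def recall(c: AlignmentCounts) -> float:
    # Assumed definition: correct / total.
    return c.correct_reads / c.total_reads if c.total_reads else 0.0


def average_scores(per_genome: list[AlignmentCounts]) -> tuple[float, float]:
    """Unweighted mean of per-genome precision and recall."""
    return (
        mean(precision(c) for c in per_genome),
        mean(recall(c) for c in per_genome),
    )


# Illustrative (made-up) counts for two genomes.
counts = [
    AlignmentCounts(total_reads=1000, aligned_reads=950, correct_reads=940),
    AlignmentCounts(total_reads=2000, aligned_reads=1800, correct_reads=1790),
]
avg_precision, avg_recall = average_scores(counts)
print(f"precision={avg_precision:.4f} recall={avg_recall:.4f}")
```

An unweighted mean gives every genome equal influence regardless of size; whether the table weights genomes equally is not stated here.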
Figure 2. Correlation coefficients between different measures of complexity and aligners' performance (precision and recall) at read length 100.
Figure 3. Correlation coefficients between different measures of complexity and aligners' performance (precision and recall) at read length 75.
Figure 4. Correlation coefficients between different measures of complexity and aligners' performance (precision and recall) at read length 50.
Figure 5. Correlation coefficients between D100 and aligners' performance (precision and recall) when aligning reads of length 100 at sequencing error rates of 0.5%, 1%, and 2%.
Figure 6. Correlation coefficients between D100 and aligners' performance (precision and recall) when aligning reads of length 100 at mutation rates between 0.1% and 1%.
Figure 7. Top: box plots of accuracy (precision and recall) of aligners across 100 genomes at read length 100. Bottom: correlation between performance and D100.
Figure 8. Cumulative distributions of .
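The correlation coefficients in the figure captions relate one complexity value per genome to one performance value per genome. A minimal sketch of such a computation follows. It assumes the Pearson coefficient, which this section does not confirm, and the D100 and recall values are made up for illustration rather than taken from the study.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-genome values: a complexity score and an aligner's recall.
d100_values = [0.99, 0.95, 0.90, 0.80, 0.70]
recall_values = [0.998, 0.990, 0.975, 0.940, 0.900]

r = pearson(d100_values, recall_values)
print(f"r = {r:.3f}")
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear.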