Yvan Le Bras1, Olivier Collin1, Cyril Monjeaud1, Vincent Lacroix2, Éric Rivals3, Claire Lemaitre4, Vincent Miele2, Gustavo Sacomoto2, Camille Marchet2, Bastien Cazaux3, Amal Zine El Aabidine3, Leena Salmela5, Susete Alves-Carvalho4, Alexan Andrieux4, Raluca Uricaru6, Pierre Peterlongo4.
Abstract
BACKGROUND: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, which requires large amounts of computing resources and may remove or modify parts of the biological information contained in the data. Our approach instead focuses directly on biological questions by analyzing raw, unassembled NGS data through a suite of six command-line tools.
Keywords: Bloom filter; De Bruijn graph; Metagenomics; NGS; RNA-seq; Variant calling; Whole-genome assembly-less treatment; long read correction
Year: 2016 PMID: 26870323 PMCID: PMC4750246 DOI: 10.1186/s13742-015-0105-2
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1 Overview of the six tools from the Colib’read project integrated with Galaxy and presented in this paper
Summary of the Colib’read tools inputs and outputs
| Tool | In | Out |
|---|---|---|
| KISSPLICE | One or more RNA-seq read set(s) | SNPs, small indels, alternative splicing events |
| DISCOSNP | One or more raw genomic read set(s) | SNP sequences with their coverages |
| TAKEABREAK | One or more raw genomic read set(s) | Inversion breakpoints |
| MAPSEMBLER2 | Pieces of known sequences, and associated raw read sets | Validation and visualization of genome structure near a locus of interest |
| COMMET | Several raw metagenomic complex read sets | Global comparison of input sets at the read level |
| LORDEC | Illumina and PacBio read sets | Corrected PacBio read set |
Fig. 2 Toy example of a ‘bubble’ in the de Bruijn graph (k=4). The bubble is generated by an SNP present in two polymorphic sequences, …CTGACCT… and …CTGTCCT…
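The bubble described in this caption can be reproduced with a small sketch. It is illustrative only and is not the implementation used by any Colib’read tool. The flanking bases `TA` and `AG` are hypothetical additions (the caption elides the surrounding sequence with "…"), chosen so that the two alleles share k-mers on both sides of the SNP. Reverse complements are ignored, and the function names are assumptions made for this example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(sequences, k):
    """Nodes are k-mers; an edge u -> v links k-mers consecutive in a sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        ks = kmers(seq, k)
        for u, v in zip(ks, ks[1:]):
            graph[u].add(v)
    return dict(graph)


def find_simple_bubbles(graph):
    """Find branching nodes whose two linear branches merge at the same node."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            indeg[v] += 1
    bubbles = []
    for source, succs in graph.items():
        if len(succs) != 2:
            continue
        paths = []
        for start in sorted(succs):
            path, node = [start], start
            # Follow the branch while it stays linear (one way in, one way out).
            while indeg[node] == 1 and len(graph.get(node, ())) == 1:
                node = next(iter(graph[node]))
                path.append(node)
            paths.append(path)
        p1, p2 = paths
        if p1[-1] == p2[-1] and indeg[p1[-1]] == 2:
            bubbles.append((source, p1[:-1], p2[:-1], p1[-1]))
    return bubbles


def spell(path):
    """Spell the sequence of a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical flanks "TA" and "AG" around the two alleles from the caption.
alleles = ["TA" + "CTGACCT" + "AG", "TA" + "CTGTCCT" + "AG"]
graph = build_de_bruijn(alleles, k=4)
for source, upper, lower, sink in find_simple_bubbles(graph):
    print(spell([source] + upper + [sink]))
    print(spell([source] + lower + [sink]))
```

With k=4, each branch of the bubble contains k=4 internal nodes, one for every k-mer that overlaps the variable position, which is the pattern an SNP leaves in the graph.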
Fig. 3 de Bruijn graph with k=3 for the sequences: (awb) and (ab). The pattern in the sequence generates an (s,t)-bubble, from to . In this case, b= and w= GGA have their first letter in common, so the path corresponding to the junction ab has k−1−1=1 vertex
Time and memory consumption examples
| Tool | Sample type | Number of reads | Computation time | Max. RAM use |
|---|---|---|---|---|
| KISSPLICE | | 71 million | 3 h | 8 GB |
| DISCOSNP | | 1.4 billion | 34 h | 6 GB |
| MAPSEMBLER2 | | 430 million | 24 h | 1 GB |
| TAKEABREAK | | 430 million | 2 h | 4 GB |
| COMMET | Soil and seawater metagenomes | 71 million | 14 h | 7 GB |
| LORDEC | | 11 million and 0.08 million | 3.3 h | 0.66 GB |
| LORDEC | | 2.25 million and 0.26 million | 25 h | 0.74 GB |
Fig. 4 Running KisSplice on Galaxy. a KisSplice tool form allowing selection of input datasets and tool parameters. b KisSplice outputs
Fig. 5 Running MAPSEMBLER2 on Galaxy. a MAPSEMBLER2 tool form allowing selection of input datasets and tool parameters. b MAPSEMBLER2 FASTA output
Fig. 6 Running GSV on Galaxy. The JSON graph generated by MAPSEMBLER2 can be navigated, filter parameters can be used to adjust the visualization, and results can be exported
Isolated SNPs found in S. cerevisiae and validated in [26]
First population studied (5 found among 6)

| Chromosome | Position | Ref | Alt | Predicted by DISCOSNP |
|---|---|---|---|---|
| 1 | 39425 | A | G | Yes |
| 3 | 235882 | C | A | Yes |
| 4 | 1014740 | G | C | Yes |
| 6 | 71386 | G | C | Yes |
| 12 | 200286 | C | T | Yes |
| 15 | 438512 | A | C | No |

Second population studied (9 found among 9)

| Chromosome | Position | Ref | Alt | Predicted by DISCOSNP |
|---|---|---|---|---|
| 1 | 39261 | G | A | Yes |
| 4 | 1014763 | T | G | Yes |
| 4 | 1014850 | T | A | Yes |
| 6 | 71813 | A | C | Yes |
| 7 | 146779 | T | C | Yes |
| 10 | 179074 | C | A | Yes |
| 12 | 162304 | A | T | Yes |
| 14 | 681026 | T | G | Yes |
| 15 | 412148 | G | T | Yes |

Third population studied (13 found among 14)

| Chromosome | Position | Ref | Alt | Predicted by DISCOSNP |
|---|---|---|---|---|
| 1 | 191184 | A | G | No |
| 2 | 521881 | C | T | Yes |
| 4 | 1014981 | A | T | Yes |
| 4 | 1015077 | G | T | Yes |
| 6 | 70913 | C | T | Yes |
| 9 | 401526 | G | A | Yes |
| 10 | 250988 | G | A | Yes |
| 10 | 619870 | G | T | Yes |
| 11 | 64697 | T | C | Yes |
| 11 | 434707 | A | G | Yes |
| 12 | 404866 | G | T | Yes |
| 15 | 174575 | T | G | Yes |
| 15 | 1013813 | C | A | Yes |
| 16 | 79761 | T | G | Yes |

The chr16:581589 mutation in experiment E2, originally presented in [26], is not reported in this table because it could not be validated
Fig. 7 MAPSEMBLER2 output graph obtained from the Saccharomyces cerevisiae dataset, visualized using GSV. A zoomed view shows the first nodes. The grey node is the starter. Node size depicts the length of the sequence stored by the node. Node and edge colors depict the read coverage (here for one of the datasets) of the sequence stored by the node. The ‘bubbles’ to the right of the starter reveal the presence of SNPs and small indels in the datasets. Note that by changing the read set selected for visualizing the coverage (node and edge colors), one can deduce whether these variants are heterozygous or homozygous
Fig. 8 Dendrograms from the MetaSoil study. a Figure from [29] showing the cluster tree, constructed using Euclidean distances, comparing 13 samples with other soil metagenomes (Puerto Rican Forest soil and Italian Forest Soil) and a metagenome from the Sargasso Sea (SargassoSea). DNA extraction methods are indicated: “MP BIO 101” means Fast prep MP Bio101 Biomedical, Eschwege, Germany; “In plugs” means indirect lysis in plug; “DNA Tissue” means Nucleospin Tissue kit; “MoBio” means MoBio Powersoil DNA Isolation Kit (Carlsbad, CA, USA); and “Gram positive” means the Gram-positive kit. b COMMET analyses comparing the same 15 samples
Datasets used to evaluate the efficiency and impact of LoRDEC read correction on the assembly
| | E. coli | Yeast |
|---|---|---|
| Reference organism | | |
| Name | Escherichia coli | Saccharomyces cerevisiae |
| Strain | K-12 substr. MG1655 | W303 |
| Reference sequence | NC_000913 | S288C |
| Genome size | 4.6 Mbp | 12 Mbp |
| PacBio data | | |
| Accession number | PacBio reads | DevNet PacBio |
| Number of reads | 75152 | 261964 |
| Average read length (bp) | 2415 | 5891 |
| Max. read length (bp) | 19416 | 30164 |
| Number of bases | 181 Mbp | 1.5 Gbp |
| Coverage | 30× | 129× |
| Illumina data | | |
| Accession number | Illumina reads | SRR567755 |
| Number of reads (millions) | 11 | 2.25 |
| Read length (bp) | 114 | 100 |
| Number of bases | 1.276 Gbp | 225 Mbp |
| Coverage | 277× | 18× |

For the short read data of yeast, we used only half of the available reads. The reference yeast genome is available from [40]
Comparison of the assemblies obtained for E. coli and S. cerevisiae from either uncorrected or corrected PacBio reads
| Statistical metrics | E. coli, corrected | E. coli, uncorrected | S. cerevisiae, corrected | S. cerevisiae, uncorrected |
|---|---|---|---|---|
| Number of contigs | 2349 | 1721 | 61496 | 39127 |
| Number of contigs ≥ 1 kbp | 321 | 0 | 1657 | 0 |
| Genome coverage (%) | 98 | 0 | 91 | 0 |
| Total length (Mbp) | 4.71 | 0.12 | 15.00 | 2.39 |
| Largest contig (bp) | 93000 | 127 | 52444 | 378 |
| GC (%) | 50.19 | 3.77 | 38.75 | 40.00 |
| N50 | 23473 | 69 | 6943 | 57 |
The genome coverage accounts only for contigs longer than 1 kbp. With uncorrected reads, the N50 remains close to the k-mer length (whatever the value of k); this strongly suggests that ABySS fails to assemble uncorrected reads. In contrast, the metrics obtained with corrected PacBio reads indicate satisfactory assemblies for both genomes
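To make the N50 metric in this table concrete, the following sketch computes it under the common definition: the largest length L such that contigs of length at least L together account for at least half of the total assembly length. The evaluation tool behind the table may apply extra filters (for example a minimum contig length), so this definition is an assumption. The contig lengths are toy values and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Accumulate contigs from longest to shortest until half the total is reached.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly with a few long contigs: the N50 reflects the long contigs.
print(n50([100, 80, 60, 40, 20]))

# Toy fragmented assembly: every contig is short, so the N50 is short too.
print(n50([70] * 10))
```

An N50 barely above the k-mer length, as in the uncorrected columns, means that most contigs span only a few consecutive k-mers, which is what a graph fragmented by sequencing errors would produce.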