A Iacoangeli, A Al Khleifat, W Sproviero, A Shatunov, A R Jones, S L Morgan, A Pittman, R J Dobson, S J Newhouse, A Al-Chalabi.
Abstract
BACKGROUND: Next Generation Sequencing (NGS) is a commonly used technology for studying the genetic basis of biological processes, and it underpins the aspirations of precision medicine. However, working with NGS data poses significant challenges. Firstly, a huge number of bioinformatics tools exist for a wide range of uses, which makes designing an analysis pipeline challenging. Secondly, NGS analysis is computationally intensive and requires expensive infrastructure; many medical and research centres lack adequate high-performance computing facilities, and cloud computing is not always an option because of privacy and ownership issues. Finally, interpreting the results is not trivial, and most available pipelines lack utilities to support this crucial step.
Keywords: Annotation; Bioinformatics; Next generation sequencing; Repeat expansion; Structural variants; Variant calling; Viral detection
Year: 2019 PMID: 31029080 PMCID: PMC6487045 DOI: 10.1186/s12859-019-2791-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Pipeline overview. Central panel: DNAscan accepts sequencing data and, optionally, variant files. The pipeline first performs an alignment step (details in the left panel), followed by a customisable data analysis protocol (details in the right panel). Finally, results are annotated and user-friendly QC and result reports are generated. The annotation step uses Annovar to enrich the results with functional information from external databases. Right panel: detailed description of the post-alignment analysis pipeline. Aligned reads are used by the variant calling pipeline (Freebayes + GATK HC); both aligned and unaligned reads are used by Manta and ExpansionHunter (for which repeat description files have to be provided) to look for structural variants. The unaligned reads are mapped to a database of known viral genomes (NCBI database) to screen for their DNA in the input sequencing data. Left panel: description of the alignment stage. Raw reads are aligned with HISAT2. The resulting soft-clipped and unaligned reads are realigned with BWA mem and then merged with the other reads using Samtools.
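The selection step in the alignment stage (deciding which reads from the first aligner are passed on for realignment) can be illustrated with a minimal sketch. It assumes plain-text SAM records as input and uses only the SAM specification's unmapped flag bit (0x4) and the `S` (soft clip) CIGAR operation. The function names `needs_realignment` and `split_records` are hypothetical and are not taken from DNAscan.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def needs_realignment(sam_line: str) -> bool:
    """Return True if a SAM record is unmapped or contains soft-clipped bases."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    cigar = fields[5]       # column 6: CIGAR string ("*" when unavailable)
    if flag & UNMAPPED_FLAG:
        return True
    return "S" in cigar     # "S" is the soft-clip CIGAR operation


def split_records(sam_lines):
    """Split SAM records into (keep, realign) lists, skipping header lines."""
    keep, realign = [], []
    for line in sam_lines:
        if line.startswith("@"):  # header lines start with "@"
            continue
        (realign if needs_realignment(line) else keep).append(line)
    return keep, realign
```

In a real pipeline the `realign` records would be converted back to FASTQ and passed to the second aligner; that step is left out here because its exact invocation is not described in the caption.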
Key tools used by DNAscan in the three modes
| Stage | Fast mode | Normal mode | Intensive mode |
|---|---|---|---|
| Alignment | HISAT2 | HISAT2 + BWA mem | HISAT2 + BWA mem |
| SNV calling | Freebayes | Freebayes | Freebayes |
| Small indel calling | Freebayes | Freebayes | GATK HC |
DNAscan mode usage recommendations
| Type of analysis | Fast mode | Normal mode | Intensive mode |
|---|---|---|---|
| SNVs | Yes | Yes | Yes |
| Small indels (< 50 bp) | No | No | Yes |
| Structural Variants | No | Yes | Yes |
| Repeat expansions | No | Yes | Yes |
| Non-human microbes | Yes | Yes | Yes |
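The recommendations table above can be read as a lookup: given the analyses a user needs, pick a mode that is recommended for all of them. The sketch below encodes the table as a dictionary and returns the first suitable mode in the order Fast, Normal, Intensive. That ordering (lightest first) is an assumption based on the mode names, and the identifiers `MODE_SUPPORT` and `lightest_mode` are hypothetical, not part of DNAscan.

```python
# Analyses each mode is recommended for, transcribed from the table above.
MODE_SUPPORT = {
    "fast": {"snvs", "non_human_microbes"},
    "normal": {"snvs", "structural_variants", "repeat_expansions",
               "non_human_microbes"},
    "intensive": {"snvs", "small_indels", "structural_variants",
                  "repeat_expansions", "non_human_microbes"},
}

# Assumed ordering from least to most computationally demanding.
MODE_ORDER = ["fast", "normal", "intensive"]


def lightest_mode(required):
    """Return the first mode (in MODE_ORDER) recommended for every required analysis."""
    required = set(required)
    unknown = required - MODE_SUPPORT["intensive"]
    if unknown:
        raise ValueError(f"unknown analysis types: {sorted(unknown)}")
    for mode in MODE_ORDER:
        if required <= MODE_SUPPORT[mode]:
            return mode
```

For example, requesting SNVs together with structural variants selects Normal mode, while any request that includes small indels selects Intensive mode.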
Fig. 2 Variant calling assessment. Graph a shows the precision, sensitivity and F-measure of DNAscan in Fast, Normal and Intensive mode, SpeedSeq and the GATK best practice workflow in calling SNVs and small indels on whole-exome sequencing data of NA12878. Illumina Platinum calls were used as true positives. The first three columns show the results for SNVs and the last three columns for indels. Graph b shows the time needed and the memory footprint for the same pipelines.
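The three metrics reported in the figure can be computed from a set of called variants and a truth set. The sketch below assumes each variant is represented by a hashable key such as `(chrom, pos, ref, alt)` and takes the F-measure to be the harmonic mean of precision and sensitivity (the F1 score); the caption does not state which F-measure variant was used, so that choice is an assumption. The function name `call_metrics` is hypothetical.

```python
def call_metrics(called, truth):
    """Return (precision, sensitivity, f_measure) for a call set against a truth set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f_measure
```

Running the function separately on SNV keys and indel keys gives per-class metrics of the kind shown in graph a.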
Analysis of two ALS patients
| | Case A | Case B |
|---|---|---|
| Analysis time (minutes) | 30 | 460 |
| Data size (MBs) | 40 | 70,000 |
| N. of ALS-related variants | 13 | 33 |
| N. of FTD-related variants | 4 | 3 |
| N. of non-synonymous variants | 6 | 64 |
| N. of variants with CADD > 13 | 6 | 748 |
| N. of long insertions | 0 | 1 |
| N. of long deletions | 0 | 3 |
| N. of duplications | 0 | 1 |
| N. of inversions | 0 | 0 |
| C9orf72 expansion | – | Yes |
| rs121909670 | Yes | – |
Case A was sequenced with a targeted MiSeq ALS gene panel and carries a pathogenic non-synonymous mutation (rs121909670) in the FUS gene. Case B was whole-genome sequenced and carries a pathogenic C9orf72 expansion.
Fig. 3 Identification of non-human reads. Panel a shows the proportion of human reads (blue), viral reads (red) and unknown reads (yellow). Panel b shows the proportion of viral reads belonging to HIV (blue), PhiX174 (red) and other viruses (yellow). Human reads are defined as reads that aligned to the human reference genome, viral reads as reads that did not align to the human reference genome but aligned to at least one of the NCBI viral genomes, and unknown reads as reads that aligned neither to the human reference genome nor to any viral reference genome. Panel c shows, on a logarithmic scale, the numbers of aligned reads for the 20 non-human microbe genomes to which the highest numbers of reads aligned.
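The three read categories defined in the caption translate directly into a small classification rule. The sketch below assumes each read is summarised by two booleans (whether it aligned to the human reference and whether it aligned to at least one viral genome) and returns the proportions plotted in panel a. The names `classify_read` and `read_proportions` are hypothetical.

```python
from collections import Counter


def classify_read(aligned_to_human: bool, aligned_to_viral: bool) -> str:
    """Label a read as 'human', 'viral' or 'unknown' following the caption's definitions."""
    if aligned_to_human:
        return "human"
    if aligned_to_viral:
        return "viral"    # not human, but matches at least one viral genome
    return "unknown"      # matches neither reference


def read_proportions(reads):
    """Return the fraction of reads in each category.

    `reads` is an iterable of (aligned_to_human, aligned_to_viral) pairs.
    """
    counts = Counter(classify_read(h, v) for h, v in reads)
    total = sum(counts.values())
    return {label: (counts[label] / total if total else 0.0)
            for label in ("human", "viral", "unknown")}
```

Note that a read aligning to the human reference is counted as human even if it would also match a viral genome, which mirrors the order of the definitions in the caption.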