
Sequencing Technologies and Analyses: Where Have We Been and Where Are We Going?

Vikas Bansal, Christina Boucher

Abstract

A wave of technologies transformed sequencing over a decade ago into the high-throughput era, demanding research in new computational methods to analyze these data. The applications of these sequencing technologies have continuously expanded since then. The RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq), established in 2011, brings together leading researchers in computational genomics and genomic biology to discuss emerging frontiers in algorithm development for massively parallel sequencing data. The ninth edition of this workshop was held at George Washington University in Washington, DC, on May 3 and 4, 2019. The workshop explored several traditional topics in sequence analysis, including genome assembly, sequence alignment, and data compression, as well as the development of methods for new sequencing technologies, including linked reads and single-molecule long-read sequencing. Here we revisit these topics and discuss the current status and perspectives of sequencing technologies and analyses.
Copyright © 2019 The Author(s). Published by Elsevier Inc. All rights reserved.

Year:  2019        PMID: 31472161      PMCID: PMC6733309          DOI: 10.1016/j.isci.2019.06.035

Source DB:  PubMed          Journal:  iScience        ISSN: 2589-0042


Main Text

Advances in high-throughput sequencing technologies provide holistic investigatory capabilities to address critically important and complex problems in virtually every area of biology. These technologies have led to an explosive growth of the amount of sequencing data being generated every year. For example, the Human Genome Project cost billions of dollars and took a decade to complete, whereas more than 100,000 human genomes have been sequenced over the past 5 years. In addition to the advancements in throughput and cost, a number of novel sequencing technologies have emerged, including ultra-long read sequencing (e.g., Nanopore), high-resolution restriction maps (e.g., Bionano data), linked-read sequencing technologies, and cross-linking methods (see Table 1 for an overview of various sequencing technologies). All these sequencing technologies have been a source of discussion and intense research. Here, we summarize some of the most recent findings and opportunities.
Table 1

Comparison of the Read Lengths, Error Rates, and Costs of Various DNA Sequencing Technologies

Technology                 Method      Read Length                 Error Rate (%)   Throughput (GB/run)
Illumina                   Synthesis   100–300 bp                  0.1              200–600
Pacific Biosciences SMRT   Synthesis   10–100 kb                   5–15             10–20
Oxford Nanopore MinION     Nanopore    Variable (up to 1,000 kb)   5–20             5–10
Assembly of genomes using shotgun sequencing of chromosomal DNA remains a fundamental problem in the bioinformatics community. As stated by Adam Phillippy in his talk "40 Years of Genome Assembly: Are We Done Yet?", strategies to assemble sequence reads were already being described in the late 1970s, when Staden (1979) wrote that "With modern fast sequencing technologies and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps." Four decades later, although reference genomes have been assembled for a number of organisms, including humans, significant challenges remain in routinely obtaining complete assemblies. As described by Phillippy, the current human reference genome (GRCh38) contains 102 gaps and lacks sequence for centromeres and other repetitive regions. Phillippy described the first "telomere-to-telomere" (T2T) assembly of the human chromosome X using a combination of long (e.g., PacBio) and ultra-long Nanopore sequence reads. This notable success could lead to a T2T assembly of all human chromosomes in the near future. The challenges that remain in this area are not new; they have haunted researchers for many years: long repeats, heterozygosity, data accuracy, and measuring assembly quality. Toward overcoming these challenges, a number of computational solutions have been proposed that improve the quality of both the assembly and the sequence data. Morisse et al. (2019) proposed a method for self-correction of long-read data that combines the algorithmic approaches of current state-of-the-art long-read error correction methods, namely construction and use of a multiple alignment of the reads and, subsequently, a de Bruijn graph.
They demonstrate that the method is able to error correct both long and ultra-long sequence reads and is highly scalable, as it is the only method able to scale to a human dataset containing ultra-long reads. Marijon et al. (2019) describe a method to analyze assembly graphs produced from long reads to recover contigs that were lost during the assembly process. They demonstrate that their method recovers useful adjacency information between contigs and show that it is able to "provide a more informative representation of fragmented assemblies, examine repeat structures, and propose likely contig orderings." In a similar spirit, Shlemov and Korobeynikov (2019) develop a method for analyzing the assembly graph by aligning profile hidden Markov models to the graph to discover the most probable paths through it. It is suggested that this information can be used for putative gene finding in metagenomic samples, repeat resolution, or scaffolding. These works suggest that there is significant opportunity to improve upon error correction and assembly of long-read sequencing data, and a surge in interest in and potential use of ultra-long sequencing in genome assembly. Last, large sequencing projects, such as the Vertebrate Genome Project, foreshadow the need for hybrid assembly approaches and assembly frameworks wherein algorithmic ideas and approaches can be easily validated.

For species with an assembled genome, the sequence reads can be aligned to this reference genome to identify genetic variants and perform a variety of other biological analyses. Therefore, alignment of DNA sequences to a genome is a fundamental computational problem. A number of methods have been developed over the past 10 years for the problem of aligning short reads (50–200 bases in length) to a reference genome (Reinert et al., 2015).
Many of these alignment tools have been developed specifically for aligning reads generated using next-generation sequencing (NGS) protocols such as RNA sequencing (RNA-seq) and microRNA (miRNA) sequencing. Reads generated using RNA-seq can span exon-exon junctions, and therefore accurate mapping of RNA-seq reads requires the ability to detect spliced alignments. Zhong and Zhang (2019) described an alignment tool designed to enable the accurate mapping of cross-linked miRNA-mRNA reads. This tool uses a Burrows-Wheeler Transform (BWT)-based index for finding short matches but implements a number of additional optimizations to enable the sensitive mapping of duplex reads formed by miRNA-mRNA interactions. Compared with existing alignment tools such as STAR and BLASTN, this specialized alignment tool, CLAN, maps more reads and has greater accuracy. With the emergence of single-molecule long-read sequencing technologies such as Pacific Biosciences SMRT and Oxford Nanopore MinION (Pollard et al., 2018), there is an increasing need for alignment tools capable of aligning long reads. Existing NGS alignment tools are optimized for low error rates and short read lengths, whereas these technologies generate reads that are tens of kilobases long and have high error rates (5%–20%, see Table 1). Almost all short-read alignment tools use a hash table or a BWT-based index to efficiently find short matches between a query sequence and a genome. Hash-table-based approaches require the storage of a large index for finding the seed matches, which can be space prohibitive for large genomes such as those of humans. Li described the use of “minimizers,” an elegant idea that enables the detection of seed matches while storing only a fraction of the seeds (Roberts et al., 2004), to design a long-read alignment tool, Minimap2 (Li, 2018). This tool combines the use of minimizers with chaining and affine gap alignment to efficiently align both long DNA reads and cDNA/mRNA reads. 
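The minimizer idea can be illustrated with a short sketch. This is a simplified toy version: it orders k-mers lexicographically, whereas tools such as Minimap2 order them by a hash value; the function name and parameter defaults are illustrative, not taken from any tool.

```python
def minimizers(seq, k=15, w=10):
    """Collect the (w,k)-minimizers of seq: for every window of w
    consecutive k-mers, keep only the smallest one.  Adjacent windows
    usually share their minimizer, so far fewer than all k-mers are
    retained as seeds, which shrinks the seed index accordingly."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(max(0, len(kmers) - w + 1)):
        picked.add(min(kmers[start:start + w]))
    return picked

seq = "ACGTACGTGGTACCTTAGCATCGATCGGCTAAGT"
print(f"{len(minimizers(seq))} minimizers kept out of {len(seq) - 15 + 1} k-mers")
```

Because any exact match of length w + k - 1 between two sequences is guaranteed to share a minimizer, the sampled seeds still suffice for seed-and-extend alignment.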
Detection of genetic variants using sequence reads aligned to a reference genome is perhaps the most common application of NGS technologies. Similar to read alignment, many tools have been developed to detect short sequence variants (single nucleotide variants and short insertions or deletions). Using state-of-the-art tools such as GATK (DePristo et al., 2011) both these types of variants can be reliably detected using whole-genome or whole-exome DNA sequencing. Nevertheless, other types of variants such as structural variants remain challenging to detect using NGS reads. Melissa Gymrek highlighted one such limitation of NGS for short tandem repeats (STRs). STRs (tandem repeats of 1- to 6-base-long motifs) are abundant in the human genome and are prone to mutations that can expand or contract the repeat. Expansions of STRs have been shown to cause a number of rare Mendelian diseases (Ashley, 2016). One example of such a disease is Huntington disease, which is caused by the expansion of a trinucleotide repeat. Genotyping of STRs and detection of repeat expansions requires careful analysis to capture the signal for such events in short sequence reads. Gymrek described a computational tool, GangSTR (Mousavi et al., 2019), that can accurately genotype STRs at more than 500,000 tandem repeat loci and even detect repeat expansions that are longer than the length of Illumina reads. Nevertheless, many challenges remain in this area, including genotyping GC-rich repeats and accounting for nonuniformity in sequence coverage. Similar to read mapping, detection of variants in different applications (e.g., somatic variants in cancer genomes) requires specialized tools. Charlotte Darby presented a clever approach (Darby et al., 2019) to detect mosaic variants using the 10X Genomics linked-read technology (Zheng et al., 2016). Unlike germline variants, mosaic variants are those that are present in only a subset of the cells of an individual and are harder to detect. 
In contrast with standard Illumina sequencing, linked reads provide long-range haplotype information that can be leveraged to discriminate mosaic mutations (present on a subset of reads from one haplotype) from sequencing errors and other artifacts. The novel method, Samovar, assigns reads to haplotypes using the linked reads and enables accurate detection of mosaic mutations in pediatric cancer genomes without the use of matched normal datasets. Tools such as GangSTR and Samovar are crucial for realizing the full potential of whole-genome sequencing and will further enhance the use of NGS as a diagnostic tool. One limitation of these tools is that they rely on the alignment of reads to a reference genome and are designed to detect specific types of variants.

There is a growing interest in alignment-free variant detection and genotyping methods for NGS data. Such methods utilize the information contained within the set of k-mers (and their counts) observed in the sequence reads and can be used to detect almost all types of sequence variants (Nordstrom et al., 2013). Daniel Standage presented a method, Kevlar (Standage et al., 2019), that detects de novo variants in an individual's genome by identifying frequent k-mers that are either completely absent or appear at very low frequency in the genomes of the parents. This alignment-free approach can detect SNVs, short indels, and even structural variants. Alignment-free approaches are also valuable for genotyping known variants in a sequenced genome. Luca Denti described MALVA (Bernardini et al., 2019), which can genotype both SNVs and short indels efficiently and improves upon previous methods for this problem. The success of alignment-free methods suggests that approaches combining k-mer-based analysis with reference-based mapping could maximize accuracy for variant detection and genotyping using NGS reads.
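The intuition behind alignment-free de novo detection can be sketched in a few lines. This is a toy illustration of the general idea, not Kevlar itself (which uses probabilistic counting and extensive filtering); the function name and thresholds are illustrative.

```python
from collections import Counter

def novel_kmers(child_reads, parent_reads, k=21, min_child=5, max_parent=1):
    """Flag k-mers that are frequent in the child's reads but absent
    (or nearly absent) from the parents' reads.  A heterozygous de novo
    variant leaves a trail of roughly k such k-mers, one for each
    position of the variant within a k-mer window."""
    def count(reads):
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return counts

    child = count(child_reads)
    parents = count(parent_reads)
    return {km for km, n in child.items()
            if n >= min_child and parents.get(km, 0) <= max_parent}
```

Reads carrying the flagged k-mers can then be assembled locally to reconstruct the variant allele, which is how such signatures are turned into SNV, indel, or structural-variant calls without a reference alignment.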
The 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) is now largely completed, and the 100,000 Genomes Project is well underway (Turnbull et al., 2018). With no compression, the raw data for 100,000 human genomes requires roughly 300 terabytes of disk space. Given the size of the data and its continual growth, efficient compression and decompression of the data is vital to any sort of analysis. There are different approaches to data compression, and the choice among them frequently, but not necessarily, depends on the analysis goals. General compression algorithms, such as Lempel-Ziv parsing (Ziv and Lempel, 1977), the BWT (Burrows and Wheeler, 1994), and Huffman encoding (Huffman, 1952), aim to transform the input file(s) into a representation that requires fewer bits than the original file(s); decompression, conversely, aims to recover the original files from the compressed format. Sequence data offers unique opportunities for significant compression because it contains high levels of redundancy, and a number of methods have been developed that use novel data structures to exploit this characteristic for efficient compression and decompression. Sequence Bloom Trees and space-efficient de Bruijn graph representations are two examples of such data structures that have been continuously improved upon in the past few years. Sequence Bloom Trees were first proposed by Solomon and Kingsford (2016) as a means to efficiently index sequence data in a manner that supports queries about the presence of transcripts. SeqOthello (Yu et al., 2018), Split-SBT (Solomon and Kingsford, 2018), and AllSome-SBT (Sun et al., 2017) improve upon this original representation. Paul Medvedev described a representation (Harris and Medvedev, 2019) that requires substantially less time and space to construct the index, demonstrating that there remains opportunity for further improvement to existing representations.
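The building block of a Sequence Bloom Tree is a Bloom filter at each tree node summarizing the k-mer content of the datasets below it. A minimal sketch of that per-node structure follows; the class name, filter size, and hashing scheme are illustrative choices, not drawn from any particular implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash functions.
    Membership queries may return false positives but never false
    negatives, so a negative answer at an internal tree node safely
    prunes the entire subtree during a query."""

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))
```

In a Sequence Bloom Tree, a query breaks the transcript into k-mers and descends from the root, recursing only into children whose filters report that enough of the query k-mers may be present.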
Comparably, de Bruijn graphs, which were originally proposed for genome assembly, have been used to compactly index all k-length subsequences (k-mers) from a set of sequence reads. Although there have been numerous improvements in the representation of de Bruijn graphs (Muggli et al., 2017, Almodaresi et al., 2019, Karasikov et al., 2019, Almodaresi et al., 2017, Alipanahi et al., 2018, Mustafa et al., 2017, Pandey et al., 2018), we continue to witness substantial gains over existing representations. For example, the representation of Marchet et al. (2019) was able to index all k-mers from the human genome in 8 GB of space and 30 min, and all k-mers from the axolotl genome (10 times the size of the human genome) in 63 GB of space and within 10 h. The use of de Bruijn graphs for compactly representing and indexing k-mers still has unexplored avenues. Last, there is still significant work in developing targeted parallelism to rapidly compress and decompress large gzip files. Kerbiriou and Chikhi presented a parallel algorithm for fast decompression of gzip-compressed files that allows random access to compressed DNA sequence data in the FASTQ format. Demonstrations of their method show that it is an order of magnitude faster than gunzip and five times faster than a highly optimized sequential implementation.

Two recurring themes emerged at RECOMB-Seq 2019. First, although computational methods for alignment, assembly, and variant detection have advanced tremendously over the past decade, significant challenges remain, e.g., end-to-end genome assembly using long reads and detection of repeat variants using high-throughput sequencing. Second, there is a need for new algorithms and data structures to process data generated from multiple genomes and from newer sequencing technologies.
In particular, long-read sequencing technologies such as Pacific Biosciences SMRT and 10X Genomics linked-read sequencing are becoming increasingly ubiquitous and we expect to see the development of new methods that leverage these technologies in the near future. Last, the meeting would not have been a success without the diligent work of the members of the program and steering committees. We would like to thank everyone who contributed to making RECOMB-Seq 2019 a success.
References

1.  Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers.

Authors:  Karl J V Nordström; Maria C Albani; Geo Velikkakam James; Caroline Gutjahr; Benjamin Hartwig; Franziska Turck; Uta Paszkowski; George Coupland; Korbinian Schneeberger
Journal:  Nat Biotechnol       Date:  2013-03-10       Impact factor: 54.908

2.  Towards precision medicine.

Authors:  Euan A Ashley
Journal:  Nat Rev Genet       Date:  2016-08-16       Impact factor: 53.242

3.  A strategy of DNA sequencing employing computer programs.

Authors:  R Staden
Journal:  Nucleic Acids Res       Date:  1979-06-11       Impact factor: 16.971

4.  Profiling the genome-wide landscape of tandem repeat expansions.

Authors:  Nima Mousavi; Sharona Shleizer-Burko; Richard Yanicky; Melissa Gymrek
Journal:  Nucleic Acids Res       Date:  2019-09-05       Impact factor: 16.971

5.  The 100 000 Genomes Project: bringing whole genome sequencing to the NHS.

Authors:  Clare Turnbull; Richard H Scott; Ellen Thomas; Louise Jones; Nirupa Murugaesu; Freya Boardman Pretty; Dina Halai; Emma Baple; Clare Craig; Angela Hamblin; Shirley Henderson; Christine Patch; Amanda O'Neill; Katherine Smith; Antonio Rueda Martin; Alona Sosinsky; Ellen M McDonagh; Razvan Sultana; Michael Mueller; Damian Smedley; Adam Toms; Lisa Dinh; Tom Fowler; Mark Bale; Tim Hubbard; Augusto Rendon; Sue Hill; Mark J Caulfield
Journal:  BMJ       Date:  2018-04-24

6.  A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors:  Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal:  Nat Genet       Date:  2011-04-10       Impact factor: 38.330

7.  MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants.

Authors:  Luca Denti; Marco Previtali; Giulia Bernardini; Alexander Schönhuth; Paola Bonizzoni
Journal:  iScience       Date:  2019-07-12

8.  Haplotyping germline and cancer genomes with high-throughput linked-read sequencing.

Authors:  Grace X Y Zheng; Billy T Lau; Michael Schnall-Levin; Mirna Jarosz; John M Bell; Christopher M Hindson; Sofia Kyriazopoulou-Panagiotopoulou; Donald A Masquelier; Landon Merrill; Jessica M Terry; Patrice A Mudivarti; Paul W Wyatt; Rajiv Bharadwaj; Anthony J Makarewicz; Yuan Li; Phillip Belgrader; Andrew D Price; Adam J Lowe; Patrick Marks; Gerard M Vurens; Paul Hardenbol; Luz Montesclaros; Melissa Luo; Lawrence Greenfield; Alexander Wong; David E Birch; Steven W Short; Keith P Bjornson; Pranav Patel; Erik S Hopmans; Christina Wood; Sukhvinder Kaur; Glenn K Lockwood; David Stafford; Joshua P Delaney; Indira Wu; Heather S Ordonez; Susan M Grimes; Stephanie Greer; Josephine Y Lee; Kamila Belhocine; Kristina M Giorda; William H Heaton; Geoffrey P McDermott; Zachary W Bent; Francesca Meschi; Nikola O Kondov; Ryan Wilson; Jorge A Bernate; Shawn Gauby; Alex Kindwall; Clara Bermejo; Adrian N Fehr; Adrian Chan; Serge Saxonov; Kevin D Ness; Benjamin J Hindson; Hanlee P Ji
Journal:  Nat Biotechnol       Date:  2016-02-01       Impact factor: 54.908

9.  SeqOthello: querying RNA-seq experiments at scale.

Authors:  Ye Yu; Jinpeng Liu; Xinan Liu; Yi Zhang; Eamonn Magner; Erik Lehnert; Chen Qian; Jinze Liu
Journal:  Genome Biol       Date:  2018-10-19       Impact factor: 13.583

10.  Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants.

Authors:  Daniel S Standage; C Titus Brown; Fereydoun Hormozdiari
Journal:  iScience       Date:  2019-07-23
