
Human Brain Single Nucleotide Polymorphism: Validation of DNA Sequencing.

Ángel J Picher^1, Félix Hernández^2,3, Bettina Budeus^4, Eduardo Soriano^3,5,6,7, Jesús Avila^2,3.

Abstract

Genetic factors may be involved in the onset of neurodegenerative diseases like Alzheimer's disease. In the case of the familial type, the disease is due to an inherited mutation at specific sites in three genes. In addition, some genetic risk factors facilitate the development of sporadic Alzheimer's disease. All of these genetic analyses were performed using blood samples as a source of DNA. However, the presence of somatic mutations in the brain can be identified only using brain samples. In this review, we comment on a method that correctly identifies single nucleotide variations in the human brain and that can be used to validate high-throughput sequencing techniques. This method involves selective enrichment of the DNA population bearing the nucleotide variations, thereby facilitating subsequent validation of the data by Sanger's sequencing.


Keywords: Brain somatic mutations; DNA sequencing; neurodegeneration

Year: 2018 PMID: 30480253 PMCID: PMC6159612 DOI: 10.3233/ADR-170039

Source DB: PubMed Journal: J Alzheimers Dis Rep ISSN: 2542-4823

INTRODUCTION

The primary cause of some cases of Alzheimer’s disease (AD) (familial AD) is an inherited mutation(s) at a specific site(s) in the genes APP, PSEN1, or PSEN2. These mutations can be detected in blood cells since they are already present in the germinal cells [1]. In most cases of AD (sporadic AD), however, the primary cause is not well determined, although several non-modifiable (aging and genetic) and modifiable (non-genetic) risk factors may facilitate the onset of the condition [2]. It has recently been proposed that brain somatic mutations [3] are involved in the development of sporadic AD [4, 5] and that the presence of these mutations causes mosaic genomic heterogeneity [6]. Somatic mutations may be due to a change in a single nucleotide or to the insertion or deletion (indels) of a small number of linked (oligo) nucleotides that can sometimes “jump” from one part of the genome to another [7]. These oligonucleotides “jump” via a mechanism involving Piwi-RNAs [8]. This mechanism is impaired in aging, a physiological process known to be a major risk factor for neurodegenerative disorders. When a somatic single-nucleotide mutation is present in a small number of brain cells, methods like Sanger’s sequencing cannot be used for detection purposes, since Sanger’s sequencing requires an allelic frequency of 20% (or higher) for the detection of a somatic mutation [9]. Other techniques, like massively parallel sequencing (e.g., Illumina), allow the detection of mutations present in a very small number of cells [10, 11]. However, these techniques can introduce a low proportion of errors when reading sequence alignments. Here we comment on the validation of brain somatic mutations detected by high-throughput sequencing techniques.

HIGH-THROUGHPUT SEQUENCING TECHNIQUES: POSSIBLE ERRORS

When only a small number of cells bear the somatic mutation compared to the total cell number, detection techniques like Sanger’s method are unsuitable and others are required. These somatic mutations sometimes occur at CpG nucleotides, at cytosines in modified (methylated) or unmodified form [12]. In this regard, as glycine (Gly) or arginine (Arg) amino acids begin with a CG dinucleotide, a larger proportion of somatic mutations have been reported to involve the replacement of Gly or Arg by other residues [13]. Thus, reported mutations at CpG sites can often be real mutations. Nevertheless, errors can also be introduced by new sequencing technologies. Among sequencing technologies, the Illumina platform enjoys widespread use in the field because of its low costs for a high number of fragments and samples [14, 15]. However, despite its popularity and advantages, the platform has some drawbacks [16]. Its performance depends greatly on the starting material and its uniformity. For example, starting material with a high uniformity, such as 16S DNA, leads to a higher number of false read nucleotides [17, 18]. Two kinds of errors can be distinguished: those involving incorrect sorting of the sequence and those that are inside the sequence itself.

Index errors

A second issue is the correct identification of indices [19]. Under normal circumstances, the index of the read is analyzed in a separate step to identify those that belong to a sample in a multiplex setting. Sinha reported that up to 10% of sequencing reads (or signals) are incorrectly assigned in a multiplexed pool of samples. In contrast to substitution errors, which can be tackled by bioinformatics tools and analysis, this kind of error is often undetectable. It is not possible to discern whether a human genome read in one sample truly originates in this sample or whether it is from another that has been included in the sequencing process with the intention to reduce costs. Of course, in the case of distinct genomes, such incorrect sequence alignments would be resolved, but with the same genome this is not possible. Therefore, it is also important to analyze non-matched reads. This can be done with the high-throughput tools Kraken [20] or Centrifuge [21] when an analysis of all reads is needed, or, for example, with TruePure, a tool that focuses on only a small part of the reads and provides a first impression [22].

Substitution errors

The most important error, especially in a clinical analysis, is an undiscovered SNP or an artificially introduced SNP that is not present in the original DNA. These SNPs are often mutations with a frequency below 50% and are thus low level mutations (LLMs) [23]. It is widely known that the Illumina platform gives such substitution errors. The occurrence of these types of error contrasts with those given by Roche/454 sequencers, which give more indel errors [24, 25]. The behavior of the Illumina platform is explained not only by the sequencing step itself, but also by the amplification steps, which are needed to add adapters and during cluster generation on the flow cell [26]. Substitution errors are not evenly distributed: some substitutions, like A to C and C to G, are more common, as are errors at inverted repeats [27]. The technology itself, which uses fluorophores with similar emission spectra for A and C, and for G and T, is responsible for the errors [24]. Errors gather at the end of the sequence due to technical accumulation of phasing and pre-phasing [24, 28]. Thus, longer reads, which cover more of the genome and can be more easily aligned, do not provide a simple solution for this issue. Also, quality scores are not always useful to detect such errors, as the errors are sometimes associated with high quality scores [17]. Schirmer et al. analyzed these substitution errors and found the bias to be associated with ddGTPs [15].
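As a rough illustration of why low level mutations are hard to separate from substitution errors, the following Python sketch computes the allele frequency of a candidate variant and the probability that at least that many variant reads would arise from sequencing error alone. It is not the procedure used by any tool cited here. The function names and the per-base error rate are assumptions chosen for illustration, and the binomial model assumes independent errors, an assumption that the motif-dependent errors described above do not satisfy.

```python
import math


def allele_frequency(alt_reads, depth):
    """Fraction of reads at a position that carry the candidate variant."""
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")
    return alt_reads / depth


def error_tail_probability(alt_reads, depth, error_rate):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    error_rate is a hypothetical per-base substitution error rate;
    terms are summed in log space to avoid overflow at high depth.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")
    if alt_reads == 0:
        return 1.0
    log_p = math.log(error_rate)
    log_q = math.log1p(-error_rate)
    total = 0.0
    for k in range(alt_reads, depth + 1):
        log_choose = (math.lgamma(depth + 1) - math.lgamma(k + 1)
                      - math.lgamma(depth - k + 1))
        total += math.exp(log_choose + k * log_p + (depth - k) * log_q)
    return min(total, 1.0)


if __name__ == "__main__":
    # Illustrative numbers only: 50 variant reads at 1000x depth,
    # with an assumed error rate of 0.1% per base.
    print(allele_frequency(50, 1000))
    print(error_tail_probability(50, 1000, 0.001))
```

A very small tail probability only argues against random, independent errors. A systematic error that recurs at the same position would pass this check, which is why an orthogonal validation remains necessary.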

Occurrence and distribution

Substitution errors are not evenly distributed across the genome, contrary to what general analyses of sequencing data have usually assumed when showing how few errors occur across a read [29-32]. The above-mentioned motifs occur in certain regions of the genome; such regions are therefore prone to errors and difficult to scan for LLMs. Another issue is that repeated sequencing of the same region may result in detection of the same error, which leads to the assumption that a certain mutation is present at a high percentage [23]. The best way to tackle this issue is to increase the number of molecules with the mutation of interest without relying on a simple, error-prone PCR.

VALIDATION OF HIGH-THROUGHPUT SEQUENCING TECHNIQUES

It is difficult to confirm the somatic SNVs detected by high-throughput sequencing techniques like the Illumina platform. Although only a few of the total mutations that are identified may be caused by an error in the sequencing, it is advisable to validate all the SNVs detected, thus moving from suspected SNVs to true SNVs. Sanger’s sequencing remains one of the most reliable techniques with respect to errors. However, it cannot be used when there is a very low proportion of a specific SNV. In this regard, it is convenient to have a procedure by which to remove the population lacking the specific SNV and thus enrich the population with it. This SNV could then be validated by Sanger’s sequencing. In addition, prior to removal of the molecules lacking the SNV, a first validation of the SNV based on a data filter using software such as Virmid is recommended [33]. After filtering the data, the DNA molecules bearing the specific SNV can be enriched by treating the whole DNA population with specific restriction enzymes that recognize a motif present only in those molecules in which the specific SNV is absent. In this way, those DNA molecules lacking the specific SNV can be removed, thus achieving the enrichment of the population with the SNV of interest.

RESTRICTION NUCLEASES

Restriction (endo)nucleases are enzymes that cut DNA at a specific sequence (motif). These nucleases were discovered in the last century during research into bacterial DNases [34]. Several types of restriction nucleases have been described, including the most recent, the Cas9-gRNA complex (CRISPR), which uses RNA to target specific DNA sequences [35] (Table 1). However, for the purpose of validating Illumina sequencing, we will focus on type II restriction nucleases, which were discovered by H.O. Smith [36]. These nucleases recognize a specific sequence of nucleotides and then cut the DNA at a specific site. The motif can be 4 to 8 base pairs long. Thousands of restriction enzymes have been analyzed and many are commercially available. The structure and function of type II restriction endonucleases have been extensively reviewed, and type II enzymes have been classified into eight subtypes: orthodox, IIS, IIE, IIF, IIT, IIG, IIB, and IIM. Each subtype is characterized by a specific example of a restriction enzyme that shows the characteristic features of the subtype [37, 38]. Among those features, the presence of metal ions may play a role in the properties of the protein-DNA interaction [39].
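To make the idea of a recognition motif concrete, the following Python sketch locates every occurrence of a motif in a DNA sequence, allowing the standard IUPAC ambiguity codes (for example, R for A or G, N for any base). The function name `find_sites` and the test sequences are hypothetical and chosen for illustration. Only the given strand is scanned, which is sufficient for palindromic motifs such as GAATTC (the EcoRI site); the sketch reports where a motif occurs, not where within it the enzyme cuts.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, motif):
    """Return 0-based start positions of all (possibly overlapping) motif matches."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead keeps the match zero-width so overlapping sites are reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


if __name__ == "__main__":
    print(find_sites("TTGAATTCAAGAATTC", "GAATTC"))  # two sites for the EcoRI motif
    print(find_sites("AGGTCCT", "RGGNCCY"))          # a degenerate 7-bp motif
```

A single nucleotide change inside the motif is enough to make `find_sites` return an empty list for that fragment, which is the property the validation method relies on.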

Table 1

Restriction nucleases (type). Different types (I-IV) of restriction nucleases, reviewed in reference [34], together with others discovered later [35], are shown

Type	Cut
I	DNA at a random distance from the recognition sequence
II	DNA at the site of the recognition sequence
III	DNA 20–30 nucleotides away from the recognition sequence
IV	Modified (usually methylated) DNA
Others	CRISPRs, zinc finger nucleases, TALENs

On the other hand, the use of restriction endonucleases has recently decreased, and for genome engineering techniques the use of artificial DNases, like zinc finger nucleases, to modify specific target sequences has increased [40]. However, these artificial nucleases have not yet been used for validation of DNA sequencing. As previously indicated, for our sequencing validation analysis, we used type II restriction nucleases, which were selected after identifying the motif in which the somatic mutation is found (see Fig. 1 and reference [41]). The specific restriction nuclease is used to remove the normal sequence, thus allowing the detection of the rare variant.

Fig.1

Removal of fragments with a specific length by means of nuclease digestion. The rationale of the process is shown. When a mixture of DNA fragments of the same length is digested with an enzyme that cuts only those fragments bearing a specific nucleotide, the uncleaved fragments can be isolated by gel electrophoresis, since they maintain their length. These fragments can then be amplified and sequenced in further steps.

Thus, the use of type II restriction nucleases to remove DNA fragments of a specific length and lacking the SNV of interest was tested in blood and brain, with the aim of identifying specific brain somatic mutations. Curiously, DNA cleavage by these nucleases provided clearer data and better resolution when blood cell DNA was used than when brain DNA was tested [41, 42]. After the uncleaved brain DNA fragments containing the mutation were isolated by gel electrophoresis, they were amplified and sequenced in further steps.
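The effect of the digestion step on the proportion of variant molecules can be estimated with a short calculation. The Python sketch below is a simplified model, not the authors' protocol: it assumes that variant fragments are never cut, that a fraction `digestion_efficiency` of the fragments lacking the variant is cut and removed, and that all uncut fragments are recovered equally from the gel. The function names and the example numbers are hypothetical; the 20% threshold is the Sanger detection limit quoted earlier in the text [9].

```python
SANGER_MIN_FRACTION = 0.20  # allelic frequency quoted in the text for Sanger detection [9]


def enriched_fraction(variant_fraction, digestion_efficiency):
    """Variant fraction among uncut fragments after digestion.

    variant_fraction: starting fraction of fragments bearing the SNV (0..1).
    digestion_efficiency: fraction of SNV-lacking fragments cut (0..1).
    """
    if not 0.0 <= variant_fraction <= 1.0:
        raise ValueError("variant_fraction must be between 0 and 1")
    if not 0.0 <= digestion_efficiency <= 1.0:
        raise ValueError("digestion_efficiency must be between 0 and 1")
    uncut_normal = (1.0 - variant_fraction) * (1.0 - digestion_efficiency)
    remaining = variant_fraction + uncut_normal
    if remaining == 0.0:
        raise ValueError("no uncut fragments remain")
    return variant_fraction / remaining


def detectable_by_sanger(fraction):
    """True if the fraction reaches the Sanger threshold quoted in the text."""
    return fraction >= SANGER_MIN_FRACTION


if __name__ == "__main__":
    before = 0.05                            # hypothetical starting frequency
    after = enriched_fraction(before, 0.99)  # hypothetical 99% digestion
    print(detectable_by_sanger(before), detectable_by_sanger(after))
```

Under these assumptions, a variant that starts below the Sanger threshold can rise above it once most of the fragments lacking the variant are removed. Incomplete digestion lowers the achievable enrichment, which is consistent with adding a variant-specific amplification step afterwards.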

AMPLIFICATION

Whole genome amplification (WGA) is a non-sequence-specific technique that allows amplification of the entire DNA sample. Given the insufficient amount of DNA present in certain samples (e.g., a single cell), WGA is a useful tool for several research fields (e.g., genetic diseases), as well as for new technologies such as next-generation sequencing (NGS) and comparative genomic hybridization (CGH) arrays. Unfortunately, DNA amplification is prone to the introduction of bias and errors and to co-amplification of minute levels of contaminating DNA. In recent years, various techniques have been developed for WGA. These can be broadly categorized into PCR-related protocols and methods based on multiple displacement amplification (MDA). The former, in turn, can be classified into degenerate oligonucleotide-primed polymerase chain reaction (DOP-PCR; iDOP-PCR) [43, 44], linker-adapter PCR (LA-PCR) [45], primer extension pre-amplification PCR (PEP-PCR/I-PEP-PCR) [46, 47], and variations thereof. MDA methods are based on using the highly processive Phi29 DNA polymerase [48] either in combination with random hexamers [49-52] or with a DNA primase (TthPrimPol) responsible for synthesizing the primers for the polymerase during the reaction [53]. Another variant of the MDA method, called pWGA, is based on the reconstituted T7 replication system [54]. A hybrid PCR/MDA method called multiple annealing and looping-based amplification cycles (MALBAC), which relies on the Bst polymerase for the MDA, has also been reported [55]. Finally, a method called Linear Amplification via Transposon Insertion (LIANTI), which combines Tn5 transposition and T7 in vitro transcription, has recently been described [56]. Each of these methods has its own merits and limitations.

The quality of the amplification result is determined by the following key parameters: the absence of contamination and artefacts in the reaction products; coverage breadth and uniformity; nucleotide error rates; and the ability to recover single-nucleotide variants (SNVs), copy number variants (CNVs) and structural variants. In general, PCR-based methods are thought to be appropriate for CNV detection [57], whereas MDA-based methods have the advantage that they give extremely low nucleotide error rates due to the high fidelity of Phi29 DNA polymerase and they produce very long amplification products, thus providing more complete coverage of the genome. Other issues affecting all amplification methods to some extent are chimera formation and preferential amplification of one allele (allelic dropout, ADO). These WGA techniques can be adapted to the needs of studies that require specific amplification of particular DNA molecules, such as those containing somatic mutations or any other structural variants. In our validation method [35], the genomic region containing the SNV of interest in the COL3A1 gene was amplified by PCR and digested with a restriction enzyme (EcoO109I) that cleaves only the molecules lacking the SNV. The low recovery yield of the uncleaved DNA after the digestion and gel-based purification steps was the main bottleneck for subsequent Sanger sequencing. Therefore, an amplification step after DNA digestion and before sequencing was introduced. For this purpose, we first heat-denatured and circularized the DNA molecules remaining after the enzymatic digestion and purification, in order to generate a substrate suitable for rolling circle amplification (RCA).

To enrich the amplification products with molecules bearing the SNV of interest and reduce the effect of non-digested or contaminating DNA molecules lacking the SNV, we then performed TruePrime™ RCA in the presence of specific forward and reverse oligonucleotides complementary only to the DNA molecules of interest. Two distinct amplification protocols were followed in the presence of increasing concentrations of the SNV-specific oligonucleotides. In the first case, all the components of the amplification mixture were added simultaneously. In the alternative protocol, to prioritize the use of the specific primers as starting sites of RCA and therefore increase the specificity of the procedure, TthPrimPol DNA primase was added after incubating the rest of the amplification mixture for 1 h. The subsequent addition of TthPrimPol allowed an increase in the number of starting points for the amplification and therefore in the efficiency of the process, thereby enhancing the final amplification yield. Sanger sequencing of the amplified DNA samples demonstrated the effectiveness of the method to validate low frequency allele variations present in at least 10% of the original DNA molecules. Thus, this method could be suitable to validate SNPs identified by high-throughput sequencing techniques.

CONCLUSIONS

Somatic mutations in the brain may be involved in the onset of various neurodegenerative disorders. However, if there is a low proportion of brain cells bearing the mutation, the proper characterization of these mutations is not straightforward as brain tissue, in contrast to blood, is not suitable material for genetic studies. Furthermore, the use of high-throughput sequencing techniques can introduce errors. Thus, to identify true brain-specific mutations, a novel procedure has been proposed [41, 42]. This procedure involves the use of suitable software for data processing [33], the removal of DNA fragments lacking the mutation by specific restriction nucleases, amplification of the uncleaved DNA fragments bearing the mutation, and characterization of the mutation by Sanger sequencing (see Fig. 2).

Fig.2

Schematic diagram of the method for validating somatic mutations in the brain characterized by Illumina sequencing.

55 in total

Review 1. Epidemiology of neurodegeneration.

Authors: Richard Mayeux
Journal: Annu Rev Neurosci Date: 2003-01-24 Impact factor: 12.449

Review 2. The origins, determinants, and consequences of human mutations.

Authors: Jay Shendure; Joshua M Akey
Journal: Science Date: 2015-09-24 Impact factor: 47.728

3. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer.

Authors: H Telenius; N P Carter; C E Bebb; M Nordenskjöld; B A Ponder; A Tunnacliffe
Journal: Genomics Date: 1992-07 Impact factor: 5.736

4. Whole-Genome Amplification by Improved Primer Extension Preamplification PCR (I-PEP-PCR).

Authors: Nona Arneson; Simon Hughes; Richard Houlston; Susan Done
Journal: CSH Protoc Date: 2008-01-01

5. Comparison of Sanger sequencing, pyrosequencing, and melting curve analysis for the detection of KRAS mutations: diagnostic and clinical implications.

Authors: Athanasios C Tsiatis; Alexis Norris-Kirby; Roy G Rich; Michael J Hafez; Christopher D Gocke; James R Eshleman; Kathleen M Murphy
Journal: J Mol Diagn Date: 2010-04-29 Impact factor: 5.568

6. Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes.

Authors: Mingkun Li; Anna Schönberg; Michael Schaefer; Roland Schroeder; Ivane Nasidze; Mark Stoneking
Journal: Am J Hum Genet Date: 2010-08-13 Impact factor: 11.025

Review 7. An updated evolutionary classification of CRISPR-Cas systems.

Authors: Kira S Makarova; Yuri I Wolf; Omer S Alkhnbashi; Fabrizio Costa; Shiraz A Shah; Sita J Saunders; Rodolphe Barrangou; Stan J J Brouns; Emmanuelle Charpentier; Daniel H Haft; Philippe Horvath; Sylvain Moineau; Francisco J M Mojica; Rebecca M Terns; Michael P Terns; Malcolm F White; Alexander F Yakunin; Roger A Garrett; John van der Oost; Rolf Backofen; Eugene V Koonin
Journal: Nat Rev Microbiol Date: 2015-09-28 Impact factor: 60.633