
Long-read genome sequencing for the molecular diagnosis of neurodevelopmental disorders.

Susan M Hiatt¹, James M J Lawlor¹, Lori H Handley¹, Ryne C Ramaker¹, Brianne B Rogers¹,², E Christopher Partridge¹, Lori Beth Boston¹, Melissa Williams¹, Christopher B Plott¹, Jerry Jenkins¹, David E Gray¹, James M Holt¹, Kevin M Bowling¹, E Martina Bebin³, Jane Grimwood¹, Jeremy Schmutz¹, Gregory M Cooper¹.

Abstract

Exome and genome sequencing have proven to be effective tools for the diagnosis of neurodevelopmental disorders (NDDs), but large fractions of NDDs cannot be attributed to currently detectable genetic variation. This is likely, at least in part, a result of the fact that many genetic variants are difficult or impossible to detect through typical short-read sequencing approaches. Here, we describe a genomic analysis using Pacific Biosciences circular consensus sequencing (CCS) reads, which are both long (>10 kb) and accurate (>99% bp accuracy). We used CCS on six proband-parent trios with NDDs that were unexplained despite extensive testing, including genome sequencing with short reads. We identified variants and created de novo assemblies in each trio, with global metrics indicating these datasets are more accurate and comprehensive than those provided by short-read data. In one proband, we identified a likely pathogenic (LP), de novo L1-mediated insertion in CDKL5 that results in duplication of exon 3, leading to a frameshift. In a second proband, we identified multiple large de novo structural variants, including insertion-translocations affecting DGKB and MLLT3, which we show disrupt MLLT3 transcript levels. We consider this extensive structural variation likely pathogenic. The breadth and quality of variant detection, coupled to finding variants of clinical and research interest in two of six probands with unexplained NDDs, support the hypothesis that long-read genome sequencing can substantially improve rare disease genetic discovery rates.


Year: 2021 PMID: 33937879 PMCID: PMC8087252 DOI: 10.1016/j.xhgg.2021.100023

Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477

Introduction

Neurodevelopmental disorders (NDDs) are a heterogeneous group of conditions that lead to a range of physical and intellectual disabilities and collectively affect 1%–3% of children.[1] Many NDDs result from large-effect genetic variation, which often occurs de novo,[2] with hundreds of genes known to associate with disease.[3] Owing to this combination of factors, exome and genome sequencing (ES/GS) have proven to be powerful tools for both clinical diagnostics and research on the genetic causes of NDDs. However, while discovery power and diagnostic yield of genomic testing have consistently improved over time,[4] most NDDs cannot be attributed to currently detectable genetic variation.[5] Several hypotheses might explain why most NDDs cannot be traced to a causal genetic variant after ES/GS, including potential environmental causes and complex genetic effects driven by small-effect variants.[6] However, one likely possibility is that at least some NDDs result from highly penetrant variants that are missed by typical genomic testing. ES/GS are generally performed by generating millions of “short” sequencing reads, often paired-end 150 bp reads, followed by alignment of those reads to the human reference assembly and detection of variation from the reference. Various limitations of this process, such as the difficulty of confidently aligning variant reads to a unique genomic location, make it difficult to detect many variants, including some known to be highly penetrant contributors to disease. 
Examples of NDD-associated variation that might be missed include low-complexity repeat variants,[7] small to moderately sized structural variants (SVs),[4,8] and mobile element insertions (MEIs).[9,10] Indeed, despite extensive effort from many groups, detection of such variation remains plagued by high error rates, both false positives (FPs) and false negatives (FNs), and it is likely that many such variants are simply invisible to short-read analysis.[11] One potential approach to overcome variant detection limitations in ES/GS is to use sequencing platforms that provide longer reads. Long reads allow for more comprehensive and accurate read alignment to the reference assembly, including within and near to repetitive regions, and de novo assembly.[12] However, to date, the utility of these long reads has been limited for several reasons, including cost, requirements on size, quantity and quality of input DNA, and high base-pair-level error rates. Recently, Pacific Biosciences released an approach, called circular consensus sequencing (CCS), or “HiFi,” in which fragments of DNA are circularized and then sequenced repeatedly.[13] This leads to sequence reads that are both long (>10 kb) and accurate at the base pair level (>99%). In principle, such an approach holds great potential for more comprehensive and accurate detection of human genetic variation, especially in the context of rare genetic disease. We have used CCS to analyze six proband-parent trios affected with NDDs that we previously sequenced using a typical Illumina genome sequencing (IGS) approach but in whom no causal or even potentially causal genetic variant was found. The CCS data were used to detect variation within each trio and generate de novo genome assemblies, with a variety of metrics indicating that the results are more comprehensive and accurate, especially for complex variation, than those seen in short-read datasets. 
In one proband, we identified an L1-mediated de novo insertion within CDKL5 that leads to a duplicated coding exon and is predicted to lead to a frameshift and loss of function. Transcript analyses confirm that the duplicated exon is spliced into mRNA in the proband. We have classified this variant as likely pathogenic (LP) using American College of Medical Genetics (ACMG) standards.[14] In a second proband, we found multiple large SVs that together likely disrupt at least seven protein-coding genes. Our observations support the hypothesis that long-read genome analysis can substantially improve success rates for the detection of variation associated with rare genetic conditions.

Material and methods

Illumina sequencing, variant calling, and analysis

Six probands and their unaffected parents were enrolled in a research study aimed at identifying genetic causes of NDDs,[15] which was monitored by the Western Institutional Review Board (IRB) (20130675). All six of these families underwent trio IGS between 4 and 5 years ago, which was performed as described.[15] Briefly, whole-blood genomic DNA was isolated using the QIAsymphony (QIAGEN), and sequencing libraries were constructed by the HudsonAlpha Genomic Services Lab, using a standard protocol that included PCR amplification. Sequencing was performed on the Illumina HiSeqX using paired-end reads with a read length of 150 bp. Each genome was sequenced at an approximate mean depth of 30×, with at least 80% of base positions reaching 20× coverage. While originally analyzed using hg37, for this study reads were aligned to hg38 using DRAGEN version 07.011.352.3.2.8b. Variants were discovered (in gvcf mode) with DRAGEN, and joint genotyping was performed across six trios using GATK version 3.8-1-0-gf15c1c3ef. SVs were called using a combination of Delly (v0.6.01),[16] CNVnator (v0.3.2),[17] ERDS (v1.1),[18] and Manta (v1.1.1).[19] Individual SVs were then annotated with gene features and allele frequencies from 1000 Genomes,[20] gnomAD,[21] NDD publications,[22,23] and an internal SV database. We merged SVs from the various callers when they were of the same SV type and exhibited at least 50% reciprocal overlap. SVs that were called by only one caller were discarded unless they were >400 kb. MEIs were called using MELT (v2.02)[24] run in MELT-SINGLE mode. Variant analysis and interpretation were performed using ACMG guidelines,[14] as in our previous work.[4,15] None of the probands had a pathogenic (P), likely pathogenic, or variant of uncertain significance (VUS) identified by IGS, either at the time of original analysis or after a reanalysis performed at the time of generation of long-read data. 
In all trios, expected relatedness was confirmed.[25] IGS data for probands 1–5 are available via dbGAP (project accession number dbGAP: phs001089). Complete IGS data for proband 6 is not available due to consent restrictions.
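The SV merging rule described above (same SV type, at least 50% reciprocal overlap, single-caller calls kept only when >400 kb) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the dictionary layout, the use of half-open coordinates, and the added same-chromosome check are all assumptions.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions for two half-open intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_event(sv_a, sv_b, min_fraction=0.5):
    """Merge candidates: same type, same chromosome (assumed), and
    reciprocal overlap of at least min_fraction."""
    return (
        sv_a["type"] == sv_b["type"]
        and sv_a["chrom"] == sv_b["chrom"]
        and reciprocal_overlap(
            sv_a["start"], sv_a["end"], sv_b["start"], sv_b["end"]
        ) >= min_fraction
    )


def keep_call(n_callers, length_bp, min_single_caller_bp=400_000):
    """Keep calls seen by two or more callers; keep single-caller calls
    only when they are longer than 400 kb."""
    return n_callers >= 2 or length_bp > min_single_caller_bp
```

Taking the smaller of the two overlap fractions means a small call nested inside a much larger one is not merged with it, which is the usual reason for requiring reciprocal rather than one-sided overlap.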

Long-read sequencing, variant calling, analysis, and de novo assemblies

Long-read sequencing was performed using CCS mode on a PacBio Sequel II instrument (Pacific Biosciences of California). Libraries were constructed using a SMRTbell Template Prep Kit 1.0 and tightly sized on a SageELF instrument (Sage Science, Beverly, MA, USA). Sequencing was performed using a 30 h movie time with 2 h pre-extension, and the resulting raw data were processed using either the CCS3.4 or CCS4 algorithm, as the latter was released during the course of the study. Comparison of the number of high-quality insertion or deletion (indel) events in a read versus the number of passes confirmed that these algorithms produced comparable results. Probands were sequenced to an average CCS depth of 32× (range, 25× to 44×), while parents were covered at an average depth of 16× (range, 10× to 22×; Table 1). CCS reads were aligned to the complete GRCh38.p13 human reference. For single-nucleotide variants (SNVs) and indels, CCS reads were aligned using the Sentieon v.201808.07 implementation of the BWA-MEM aligner,[26] and variants were called using DeepVariant v0.10[27] and joint-genotyped using GLNexus v1.2.6.[28] For SVs, reads were aligned using pbmm2 1.0.0, and SVs were called using pbsv v2.2.2. Candidate de novo SVs required a proband genotype of 0/1 and parent genotypes of 0/0, with ≥6 alternate reads in the proband and 0 alternate reads, and ≥5 reference reads in the parents.
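The candidate de novo SV criteria stated above can be written as a simple predicate. The sketch below is illustrative only; the per-sample dictionary keys (`gt`, `ref_reads`, `alt_reads`) are hypothetical names and do not correspond to specific pbsv output fields.

```python
def is_candidate_de_novo_sv(proband, father, mother):
    """Apply the trio filter: proband 0/1 with >=6 alternate reads;
    each parent 0/0 with 0 alternate reads and >=5 reference reads."""
    if proband["gt"] != "0/1" or proband["alt_reads"] < 6:
        return False
    for parent in (father, mother):
        if parent["gt"] != "0/0":
            return False
        if parent["alt_reads"] != 0 or parent["ref_reads"] < 5:
            return False
    return True
```

The requirement of at least 5 reference reads in each parent guards against calling a variant de novo at sites where a parent was simply too thinly covered for the alternate allele to be seen.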

Table 1.

Probands selected for PacBio sequencing

				Previous genetic testing
Family ID	Proband gender	Race	Major phenotypic features	Array	Single gene test(s) or panel(s)[a]	ES/GS	Other normal test results	PacBio CCS coverage (P/D/M)	Average insert size (bp) (P/D/M)
1	F	C	seizures, facial dysmorphism, hypotonia	normal	normal×2	no findings (both)	karyotype	25×/10×/11×	12,655/12,238/12,884
2	F	AA	ID, seizures, hypotonia	normal	normal×7	no findings (both)	mito	26×/16×/12×	12,651/12,865/12,600
3	M	C	ID, seizures	VUS dup	normal×3	no findings (GS)	fragile X	35×/19×/22×	14,393/16,604/16,344
4	F	C/AA	ID, facial dysmorphism, hypotonia	normal	normal×1	no findings (GS)	fragile X	44×/14×/20×	11,420/11,555/11,197
5	M	C	ID, seizures, speech delay, brain MRI abnormalities	normal	normal×4	no findings (GS)	mito	30×/16×/20×	21,145/19,264/21,568
6	F	C	ID, seizures, speech delay	normal	NP	no findings (GS)	NP	33×/19×/14×	12,452/12,183/13,641

ES/GS, exome sequencing/genome sequencing; P, proband; D, dad; M, mom; F, female; M, male; C, Caucasian; AA, African American; ID, intellectual disability; NP, not performed.

[a] Some VUS SNVs have been reported in these probands.

For one proband (proband 4), we used several strategies to create de novo assemblies using 44× CCS data. Assemblies were generated using canu (v1.8),[29] Falcon unzip (falcon-kit 1.8.1),[30] HiCanu (hicanu_rc +325 changes [r9818 86bb2e221546c76437887d3a0ff5ab9546f85317]),[31] and hifiasm (v0.5-dirty-r247).[32] Hifiasm was used to create two assemblies. First, the default parameters were used, followed by two rounds of Racon (v1.4.10) polishing of contigs. Second, trio-binned assemblies were built using the same input CCS reads, in addition to kmers generated from a 36× paternal Illumina library and a 37× maternal Illumina library (singletons were excluded). The kmers were generated using yak (r55) with the suggested parameters for running a hifiasm trio assembly (kmer size = 31 and Bloom filter size of 2**37). Maternal and paternal contigs went through two rounds of Racon (v1.4.10) polishing. Trio-binned assemblies were built for the remaining probands in the same way. Individual parent assemblies were also built with hifiasm (v0.5-dirty-r247) using default parameters. The resulting contigs went through two rounds of Racon (v1.4.10) polishing. Coordinates of breakpoints were defined by a combination of assembly-assembly alignments using minimap2[33] (followed by use of bedtools bamToBed), visual inspection of CCS read alignments, and BLAT. Rearranged segments in the chromosome 6 region were restricted to those >4 kb. Dot plots illustrating sequence differences were created using Gepard.[34]

QC statistics

SNV and indel concordance and de novo variant counts were calculated using bcftools v1.9 and rtg-tools vcfeval v3.9.1. “High-quality de novo” variants were defined as PASS variants (IGS/GATK only) on autosomes (on primary contigs only) that were biallelic with total allele depth (DP) ≥ 7 and genotype quality (GQ) ≥ 35. Additional requirements were a proband genotype of 0/1, with ≥2 alternate reads and an allele balance ≥0.3 and ≤0.7. Required parent genotypes were 0/0, with alternate allele depth of 0. Mendelian error rates were also calculated using bcftools. “Rigorous” error rates were restricted to PASS variants (IGS/GATK only) on autosomes with GQ > 20 and DP > 5. Total variant counts per trio were calculated using Variant Effect Predictor (VEP, v98), counting multi-allelic sites as one variant. SV counts were calculated using bcftools and R. Counts were restricted to calls designated as “PASS,” with an alternate allele depth (AD) ≥ 2. Candidate de novo SVs required a proband genotype of 0/1 and parent genotypes of 0/0, with ≥6 alternate reads in the proband and 0 alternate reads and ≥5 reference reads in the parents. De novo MELT calls in IGS data were defined as isolated proband calls where the parent did not have the same type (ALU, L1, or SVA) of call within 1 kb as calculated by bedtools closest v.2.25.0. These calls were then filtered (using bcftools) for “PASS” calls and varying depths, defined as the number of read pairs supporting both sides of the breakpoint (left read pairs, LP; right read pairs, RP). To create a comparable set of de novo mobile element calls in CCS data, individual calls were extracted from the pbsv joint-called VCF using bcftools and awk, and isolated proband calls were defined as they were for the IGS data and filtered (using bcftools) for PASS calls and varying depths, defined as the proband alternate allele depth (AD[1]).
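The “high-quality de novo” thresholds for SNVs and indels can be sketched as a per-site check. Two points are assumptions rather than statements from the text: allele balance is computed here as alternate depth divided by reference plus alternate depth, and the DP and GQ thresholds are applied to all three trio members. The dictionary keys are hypothetical, and the PASS, autosome, and biallelic filters are assumed to have been applied upstream.

```python
def allele_balance(ref_depth, alt_depth):
    """Fraction of reads supporting the alternate allele (assumed definition)."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


def passes_depth_quality(sample, min_dp=7, min_gq=35):
    """DP >= 7 and GQ >= 35 for one sample."""
    return sample["dp"] >= min_dp and sample["gq"] >= min_gq


def is_high_quality_de_novo(proband, parents):
    """Proband 0/1 with >=2 alternate reads and allele balance in
    [0.3, 0.7]; both parents 0/0 with alternate allele depth 0."""
    if not all(passes_depth_quality(s) for s in [proband, *parents]):
        return False
    if proband["gt"] != "0/1" or proband["alt"] < 2:
        return False
    if not 0.3 <= allele_balance(proband["ref"], proband["alt"]) <= 0.7:
        return False
    return all(p["gt"] == "0/0" and p["alt"] == 0 for p in parents)
```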

Simple repeat and low-mappability regions

We generated a bed file of disease-related low-complexity repeat regions in 35 genes from previous studies.[7,35] Most regions (25) include triplet nucleotide repeats, while the remainder include repeat units of 4–12 bp. Reads aligning to these regions were extracted from bwa-mem-aligned bams and visualized using the Integrative Genomics Viewer (IGV[36]). Proband depths of MAPQ60 reads spanning each region were calculated using bedtools multicov v2.28.0. For the depth calculations, regions were expanded by 15 bp on either side (using bedtools slop) to count reads anchored into non-repeat sequence. The mean length of these regions was 83 bp, with a max of 133 bp. Low-mappability regions were defined as the regions of the genome that do not lie in Umap k100 mappable regions.[37] Regions ≥100,000 nt long and those on non-primary contigs were removed, leaving a total of 242,222 difficult-to-map regions with average length of 411 bp. Proband depths of MAPQ60 reads spanning each region were calculated using bedtools multicov v2.28.0. High-quality protein-altering variants in probands were defined using VEP annotations and counted using bcftools v1.9. Requirements included a heterozygous or homozygous genotype in the proband, with ≥4 alternate reads, an allele balance ≥0.3 and ≤0.7, GQ > 20, and DP > 5. Reads supporting 57 loss-of-function variants (high quality and low quality) in proband 5 were visualized with IGV and semiquantitatively scored to assess call accuracy. Approximate counts of reads were recorded and grouped by mapping quality (MapQ = 0 and MapQ ≥ 1), along with subjective descriptions of the reads. The total evidence across CCS and IGS reads was used to estimate truth and score each variant call as true positive (TP), FP, true negative (TN), FN, or undetermined (UN).
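The idea behind the depth calculation for repeat regions, expanding each region by 15 bp and counting only MAPQ60 reads that span the whole expanded interval, can be illustrated in plain Python. This is a conceptual sketch and does not reproduce bedtools behavior or options; the read tuples and 0-based half-open coordinates are assumptions.

```python
def slop(start, end, flank=15):
    """Expand a 0-based half-open interval by `flank` bp per side,
    clamping the start at 0."""
    return max(0, start - flank), end + flank


def count_spanning_reads(reads, start, end, min_mapq=60):
    """Count reads that cover the whole interval with MAPQ >= min_mapq.
    `reads` is an iterable of (read_start, read_end, mapq) tuples."""
    return sum(
        1
        for read_start, read_end, mapq in reads
        if mapq >= min_mapq and read_start <= start and read_end >= end
    )
```

With a maximum expanded region size of about 144 bp, a 150 bp short read has almost no room to span the largest regions, whereas a >10 kb CCS read spans them easily, which is consistent with the coverage differences reported in the Results.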

CDKL5 cDNA amplicon sequencing

Total RNA was extracted from whole blood in PAXgene tubes using a PAXgene Blood RNA Kit version 2 (PreAnalytiX, #762164) according to the manufacturer’s protocol. cDNA was generated with a High-Capacity Reverse Transcription Kit (Applied Biosystems, #4368814) using 500 ng of extracted RNA from each individual as input. Primers were designed to CDKL5 exons 2, 5, and 6 to generate two amplicons spanning the potentially disrupted region of CDKL5 mRNA. Select amplicons were purified and sent to MCLAB (Molecular Cloning Laboratories, South San Francisco, CA, USA) for Sanger sequencing. See Supplemental methods for additional details, including primers.

Genomic DNA PCR to confirm relevant breakpoints in probands 4 and 6 and Alu insertions

We performed PCR to amplify products spanning junctions of various insertions and breakpoints, using the genomic DNA (gDNA) of the probands and parents as template. Select amplicons were purified and sent to MCLAB (Molecular Cloning Laboratories, South San Francisco, CA, USA) for Sanger sequencing. See Supplemental methods for additional details, including primers.

DGKB/MLLT3 qPCR

Total RNA was extracted from whole blood using a PAXgene Blood RNA Kit version 2 (PreAnalytiX, #762164), and cDNA was generated with a High-Capacity Reverse Transcription Kit (Applied Biosystems, #4368814) in an identical fashion as described for CDKL5 cDNA amplicon sequencing. For qPCR, two TaqMan probes targeting the MLLT3 exon 3–4 and exon 9–10 splice junctions (ThermoFisher, Hs00971092_m1 and Hs00971099_m1) were used with cDNA diluted 1:5 in dH2O, with six replicates per sample, on an Applied Biosystems Quant Studio 6 Flex. Differences in CT values from the median CT values for either an unrelated family or the proband’s parents were used to compute relative expression levels. See Supplemental methods for additional details, including primers.
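The text does not give the formula used to convert CT differences into relative expression. One common convention, shown here purely as an assumption, treats each PCR cycle as a doubling of product, so that relative expression is 2 raised to the negative CT difference from the reference median. Whether the study also normalized to a control gene is not stated, and no such normalization is modeled here.

```python
from statistics import median


def relative_expression(sample_cts, reference_cts):
    """Relative expression per replicate as 2 ** -(CT - median reference CT).
    Assumes perfect doubling per cycle (an idealization)."""
    reference = median(reference_cts)
    return [2 ** -(ct - reference) for ct in sample_cts]
```

Under this convention, a replicate whose CT is one cycle higher than the reference median corresponds to half the reference expression level.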

Results

Affected probands and their unaffected parents were enrolled in a research study aimed at identifying genetic causes of NDDs.[15] All trios were originally subject to IGS and analysis using ACMG standards[14] to find pathogenic or likely pathogenic variants, or VUSs. Within the subset of probands for which no variants of interest (pathogenic, likely pathogenic, VUS) were identified either originally or after subsequent reanalyses,[4,15] six trios were selected for sequencing using the PacBio Sequel II CCS approach (Table 1). Trios were chosen on the basis of a strong suspicion of a genetic disorder and to provide diversity with respect to gender and ethnicity. Parents were sequenced at a lower depth than probands to facilitate identification of de novo variation.

QC of CCS data

Variant calls from CCS data and IGS data were largely concordant (Table S1A). When comparing each individual’s variant calls in the Genome in a Bottle (GIAB) high-confidence regions[38] between CCS and IGS, concordance was 94.63%, with higher concordance for SNVs (96.88%) than indels (75.96%). Concordance was slightly higher for probands only, likely due to the lower CCS read-depth coverage in parents. While CCS data showed a consistently lower number of SNV calls than IGS (mean = 7.0 M versus 7.45 M, per trio), more de novo SNVs at high QC stringency were produced in CCS data than IGS (mean SNVs = 89 versus 38; Tables S1B and S1C). CCS yielded far fewer de novo indels at these same thresholds (mean indels, 11 versus 148), with the IGS de novo indel count being much higher than biological expectation[39] and likely mostly FP calls (Table S1C). In examining reads supporting variation that was uniquely called in each set, we found that CCS FP de novos were usually FN calls in the parent, due to lower genome-wide coverage in the parent and the effects of random sampling (i.e., sites at which there were 7 or more CCS reads in a parent that randomly happened to all derive from the same allele; Table S1C). Mendelian error rates in autosomes were lower in CCS data relative to IGS (harmonic mean of high-quality calls, 0.18% versus 0.34%; Table S1D), suggesting the CCS SNV calls are of higher accuracy, consistent with previously published data.[13] Each trio had an average of ~56,000 SVs among all three members, including an average of 59 candidate de novo SVs per proband (Table S1E). Trio SVs mainly represent insertions (48%) and deletions (43%), followed by duplications (6%), single breakends (BND) (3%), and inversions (<1%). Trio-binned hifiasm de novo assemblies were built for each proband. The average N50 for proband trio-based assemblies was 35.4 Mb (Table S2A). Several assemblers were used to build de novo assemblies for one proband (proband 4). 
Canu, Falcon, and HiCanu all produced high-quality assemblies, but hifiasm assemblies were of highest quality (Table S2B). Use of trio-binned hifiasm allowed assembly of high-quality maternal- and paternal-specific contigs with an average N50 of 45.65 Mb, approaching that of hg38.
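Assembly contiguity above is summarized by N50, the contig length at which contigs of that length or longer account for at least half of the total assembled bases. A minimal sketch of the standard calculation follows; the function name is illustrative, and the contig lengths in the usage note are made up.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L together cover at
    least half of the total assembly length. Returns 0 for no contigs."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, contigs of 40, 30, 20, and 10 Mb total 100 Mb; the two largest cover 70 Mb, so the N50 is 30 Mb.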

Variation in simple repeat regions

Accurate genotyping of simple repeat regions like trinucleotide repeat expansions presents a challenge in short-read data where the reads are often not long enough to span variant alleles. We assessed the ability of CCS to detect variation in these genomic regions and compared that to IGS, which in this case was produced from libraries produced with a PCR amplification step. We first examined variation in FMR1 (MIM: 309550). Expansion of a trinucleotide repeat in the 5′ UTR of FMR1 is associated with fragile X syndrome (MIM: 300624), the second-most common genetic cause of intellectual disability.[40] Visualization of this region in all 18 individuals indicated insertions in all but two samples in the CGG repeat region of FMR1 relative to hg38, with a range of insertion sizes from 6–105 bp (Table S3; Figure S1). When manually inspecting these regions, while one or two major alternative alleles are clearly visible, there are often minor discrepancies in insertion lengths, often by multiples of 3. It is unclear if this represents true somatic variation or if this represents inaccuracy of consensus generation in CCS processing. Like that for FMR1, manual curation of 34 other disease-causal repeat regions in each proband indicated that alignment of CCS reads provides a more accurate assessment of variation in these regions compared to IGS. When looking at region-spanning reads with high-quality alignment (mapQ = 60), 97% (34 of 35) of the regions were covered by at least 10 CCS reads in all six probands, as compared to 11% (4 of 35) of regions with high-quality IGS reads (Table S4A). While all query regions measured ≤144 bp (which includes an extension of 15 bp on either end of the repeat region), seven query regions were ≥100 bp. When considering only regions of interest <100 bp, 14% (4 of 28 regions) are covered by at least 10 high-quality IGS reads in each proband. 
Mean coverage of high-quality, region-spanning reads across probands was higher in CCS data than in IGS (29 versus 11; Table S4A). Of all repeat regions studied, none harbored variation classified as pathogenic/likely pathogenic/VUS. We also compared coverage of high-quality CCS and IGS reads in low-mappability regions of the genome, specifically those that cannot be uniquely mapped by 100 bp kmers.[37] While over half of these regions (62.5%) were fully covered by at least 10 high-quality CCS reads (mapQ = 60) in all six probands, only 19.3% of the regions met the same coverage metrics in the IGS data (Table S4B). The average CCS read depth in these regions was 26 reads, versus 8 reads in IGS. Within these regions, CCS yielded twice as many high-quality, protein-altering variants in each proband when compared to IGS (182 in CCS versus 85 in IGS) (Table S4C). Outside of the low-mappability regions, counts of protein-altering variants were similar (6,627 in CCS versus 6,759 in IGS). To assess the accuracy of the protein-altering variant calls in low-mappability regions, we visualized reads for 57 loss-of-function variants detected by CCS, IGS, or both in proband 5 and used the totality of read evidence to score each variant as TP, FP, TN, FN, or UN. Six of these were “high-quality” calls (see Material and methods), and all of these were correctly called in CCS (TPs, 100%); in IGS, two were correctly called (TPs, 33%) and four were undetected (FNs, 67%) (Table S4D). Among all 57 unfiltered variant calls, most CCS calls were correct (29 TP, 15 TN, total 77%), while most IGS calls were incorrect (16 FP, 22 FN, total 67%) (Table S4E).
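The accuracy percentages quoted for the 57 manually scored loss-of-function calls follow directly from the reported tallies. The short calculation below re-derives them; only the counts stated in the text are used, and the helper name is illustrative.

```python
def percent(part, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / total)


TOTAL_CALLS = 57
ccs_correct = percent(29 + 15, TOTAL_CALLS)    # CCS true positives + true negatives
igs_incorrect = percent(16 + 22, TOTAL_CALLS)  # IGS false positives + false negatives

HIGH_QUALITY_CALLS = 6
igs_hq_true_positive = percent(2, HIGH_QUALITY_CALLS)
igs_hq_false_negative = percent(4, HIGH_QUALITY_CALLS)
```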

MEIs

We searched for MEIs in these six probands within the IGS data using MELT (Tables S5A and S5B)[24] and within CCS data using pbsv (Tables S1E, S5C, and S5D; see also Material and methods). Our results suggest that CCS detection of MEIs is far more accurate. For example, it has been estimated that there exists a de novo Alu insertion in ~1 in every 20 live births (mean of 0.05 per individual).[41,42] However, at stringent QC filters (i.e., ≥5 read-pairs at both breakpoints, PASS, and no parental calls of the same MEI type within 1 kb), a total of 82 candidate de novo Alu insertions (average of 13.7) were called across the six probands using the IGS data (Table S5B), a number far larger than expected. Inspection of these calls indicated that most were bona fide heterozygous Alu insertions in the proband that were inherited but undetected in the parents. Filtering changes to improve sensitivity come at a cost of elevated FP rates; for example, requiring only 2 supporting read pairs at each breakpoint leads to an average of ~55 candidate de novo Alu insertions per proband (Table S5B). In contrast, using the CCS data and stringent QC filters (≥5 alternate reads, PASS, and no parental calls within 1 kb), we identified a total of only 6 candidate de novo Alu MEIs among the 6 probands (Table S5D), an observation that is far closer to biological expectation. We retained 4 candidate de novo Alu MEIs after further inspection of genotype and parental reference read depth (Table S1E). One of these 4 appears genuine, while the other three appear to be correctly called in the proband but missed in the parents owing to low read-depth, such that the Alu insertion-bearing allele was not covered by any CCS reads (Figure S2). Three of these four were confirmed by PCR, with PCR at the fourth yielding unclear results, and amplification and results were consistent with observations in IGV (Figure S3; Supplemental methods).
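The “isolated proband call” rule for MEIs (no parental call of the same element type within 1 kb) and the comparison against the expected de novo Alu rate can be sketched as follows. The tuple layout and function name are hypothetical, and treating a distance of exactly 1,000 bp as “within 1 kb” is an assumption.

```python
def is_isolated_proband_call(proband_call, parent_calls, window=1000):
    """True if no parent has a call of the same MEI type on the same
    chromosome within `window` bp. Calls are (chrom, pos, mei_type)."""
    chrom, pos, mei_type = proband_call
    return not any(
        p_chrom == chrom and p_type == mei_type and abs(p_pos - pos) <= window
        for p_chrom, p_pos, p_type in parent_calls
    )


# Expected number of de novo Alu insertions across six probands,
# using the cited estimate of 0.05 per individual.
DE_NOVO_ALU_RATE = 0.05
N_PROBANDS = 6
expected_de_novo_alu = DE_NOVO_ALU_RATE * N_PROBANDS
```

The expectation of roughly 0.3 events across six probands puts the 82 IGS candidates and the 6 CCS candidates in context: both exceed it, but the CCS count is far closer.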

A likely pathogenic de novo SV in CDKL5

Analysis of SV calls and visual inspection of CCS data in proband 6 indicated a de novo SV within the CDKL5 gene (MIM: 300203; Figure 1). Given the de novo status of this event, the association of CDKL5 with early infantile epileptic encephalopathy 2 (EIEE2, MIM: 300672), and the overlap of disease with the proband’s phenotype (see Supplemental note), which includes intellectual disability, developmental delay, and seizures, we prioritized this event as the most compelling candidate variant in this proband.

Figure 1.

Proband 6 has a de novo insertion resulting in duplication of exon 3 of CDKL5

(A) Ideogram showing location of CDKL5 on chromosome X. Ideogram is from the NCBI Genome Decoration Page.

(B) Gene structure of CDKL5, RS1, and PPEF1, indicating the location of the 6,993 bp insertion in CDKL5 (blue/red/gray bars) and location of the origin of the duplicated PPEF1 intronic sequence (red).

(C) Zoomed-in view of the insertion. The gray box indicates the entire 6,993 nt insertion, which consists of a partial L1HS retrotransposon (blue box), duplicated PPEF1 intronic sequence (red box), and target site duplication (TSD, yellow box) with duplicated exon 3 (3*). Green boxes indicate RepeatMasker annotation of the proband’s insertion-bearing, contig sequence.

(D) Alignment of CCS reads near exon 3 of CDKL5 in IGV in proband 6 and her parents. Gray reads represent alignment to reference, and multicolor alignments represent unaligned ends of reads. The TSD is indicated by a yellow box. Reads highlighted by the pink box include examples of reads that align to reference upstream of the insertion, contain the TSD, and then have inserted sequence at their 3′ end. Those highlighted in the turquoise box represent inserted sequence, TSD, and reference sequence downstream of the insertion. Note that some reads have hard-clipped bases, which are designated with a black diamond.

A trio-based de novo assembly in this proband identified a 45.3 Mb paternal contig and a 50.6 Mb maternal contig in the region surrounding CDKL5. While these contigs align linearly across the majority of the p arm of chromosome X (Figure S4), alignment of the paternal contig to GRCh38 revealed a heterozygous 6,993 bp insertion in an intron of CDKL5 (chrX: 18,510,871–18,510,872_ins6993 [GenBank: GRCh38]; Figure 1; Figure S5). Analysis of SNVs in the region surrounding the insertion confirms that it lies on the proband’s paternal allele. However, mosaicism is suspected, as there exist paternal haplotype reads within the proband that do not harbor the insertion (5 of 8 paternal reads without the insertion at the 5′ end of the event, and 7 of 16 paternal reads without insertion at the 3′ end of the event; Figure S6). Annotation of the insertion indicated that it contains three distinct segments: 4,272 bp of a retrotransposed, 5′ truncated L1HS mobile element (including a poly[A] tail), 2,602 bp of sequence identical to an intron of the nearby PPEF1 gene (g.18738310_18740911 [GenBank: NC_000023.11]; [c.235+4502_235+7103 (GenBank: NM_006240.2)]), and a 119 bp target-site duplication (TSD) that includes a duplicated exon 3 of CDKL5 (35 bp) and surrounding intronic sequence (chrX: 18510753–18510871 [GenBank: GRCh38]; [c.65–67 (GenBank: NM_003159.2) to c.99+17 (GenBank: NM_003159.2)]; 119 bp total) (Figure 1; Figure S7). The 2,602 bp copy of PPEF1 intronic sequence includes the 5′ end (1,953 bp) of an L1PA5 element that is ~6.5% divergent from its consensus L1, an AluSx element, and additional repetitive and non-repetitive intronic sequence. The size and identity of this insert in the proband, and its absence in both parents, were confirmed by PCR amplification and partially confirmed by Sanger sequencing (see Supplemental methods; Figure S7). 
Exon 3 of CDKL5, which lies within the target-site duplication of the L1-mediated insertion, is a coding exon that is 35 bp long; inclusion of a second copy of exon 3 into CDKL5 mRNA is predicted to lead to a frameshift (Thr35ProfsTer52; Figure 2). To determine the effect of this insertion on CDKL5 transcripts, we performed RT-PCR from RNA isolated from each member of the trio. Using primers designed to span from exon 2 to exon 5, all three members of the trio had an expected amplicon of 240 bp. However, the proband had an additional amplicon of 275 bp (Figure 2A). Sanger sequencing of this amplicon indicated that a duplicate exon 3 was spliced into this transcript (Figure 2B). The presence of transcripts with a second copy of exon 3 strongly supports the hypothesis that the variant leads to a CDKL5 loss-of-function effect in the proband.
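The sizes reported for the CDKL5 insertion are internally consistent, and the frameshift prediction follows from the exon length. The sketch below simply re-derives these numbers from values given in the text; the constant and function names are illustrative.

```python
L1HS_BP = 4272    # retrotransposed, 5'-truncated L1HS segment
PPEF1_BP = 2602   # duplicated PPEF1 intronic sequence
TSD_BP = 119      # target-site duplication containing exon 3
EXON3_BP = 35     # CDKL5 exon 3

insertion_bp = L1HS_BP + PPEF1_BP + TSD_BP


def interval_length(start, end):
    """Length of a 1-based, inclusive genomic interval."""
    return end - start + 1


def causes_frameshift(extra_coding_bp):
    """Extra coding sequence shifts the frame unless its length is a
    multiple of 3."""
    return extra_coding_bp % 3 != 0


def expected_amplicon(wild_type_bp, extra_bp):
    """Amplicon size when extra sequence is spliced between the primers."""
    return wild_type_bp + extra_bp
```

The three segments sum to the 6,993 bp insertion, the stated coordinates give 119 bp for the TSD and 2,602 bp for the PPEF1 segment, and a second 35 bp copy of exon 3 both shifts the reading frame and accounts for the 275 bp band seen alongside the 240 bp amplicon.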

Figure 2.

The duplicated CDKL5 exon 3 is present in a subset of the proband’s CDKL5 transcripts

(A) RT-PCR using primers specific to exons 2–5 of CDKL5 cDNA results in a 240 bp amplicon in proband (P), dad (D), and mom (M). An additional 275 bp amplicon is present only in the proband (asterisk).

(B) Sanger sequencing of both amplicons from the proband confirmed that the 240 bp amplicon contains the normal, expected sequence and that the upper, 275 bp band includes a duplicated exon 3. This is predicted to lead to a frameshift (red circle) and downstream stop, p.Thr35ProfsTer52. Yellow outlined box, exon 3 sequence; orange outlined box, duplicated exon 3 sequence.

Multiple large de novo SVs in proband 4

Analysis of SV calls in proband 4 indicated several large, complex, de novo events affecting multiple chromosomes (6, 7, and 9). To assess the structure of the proband’s derived chromosomes, we inspected the trio-binned de novo assembly for this proband. Four paternal contigs were assembled for chromosome 6, which showed many structural changes compared to reference chromosome 6 (Figure 3). The proband harbors a pericentric inversion, with breakpoints at chr6: 16,307,569 (6p22.3) and chr6: 142,572,070 (6q24.2; Figures 3A and 3B; Table S6A). In addition, a 9.3 Mb region near 6q22.31–6q23.3 contained at least eight additional breakpoints, with local rearrangement of eight segments, some of which are inverted (ABCDEFGH in reference versus DCAGHFEB; Figure 3C; Table S6B). The median fragment size is just over 400 kb (range, 99 kb to 5.7 Mb; Table S6B). While the ends of several fragments do overlap annotated repeats, many do not. We were not able to identify microhomology at the junctions of these eight segments, the majority of which (7/8) were PCR confirmed in the proband (Table S6B; Figures S8 and S9; Supplemental methods). Together, the 10 breakpoints identified across chromosome 6 are predicted to disrupt at least six genes, five of which are annotated as protein coding (Table S6A). None of these have been associated with neurodevelopmental disease.

Figure 3.

Proband 4 has several large structural changes on chromosome 6

(A) Ideogram with annotation of chromosome 6 breakpoints identified in proband 4, including pericentric inversion breakpoints (pinv1, pinv2) and multiple breakpoints of a complex genomic rearrangement (red arrows). Ideogram is from the NCBI Genome Decoration Page.

(B) Schematic of proband 4’s maternal (pink box) and paternal (blue box) chromosome 6 structures. The maternal structure matches reference, while the paternally inherited derived chromosome 6 has pericentric inversion breakpoints (pinv1/pinv2) and a complex cluster of rearranged fragments (DCAGHFEB).

(C) Zoomed-in view of (B), showing the schematic of additional fragmentation near 6q22.31–6q23.3 (vertical dashed lines). Asterisks indicate inverted sequence as compared to hg38 reference. See Table S6 for additional breakpoint coordinates and details.

(D) Alignment of four sequential paternal contigs to reference chromosome 6 identified a pericentric inversion spanning 6p22.3 to 6q24.2 and a 9.3 Mb region near 6q22.31–6q23.3 with several additional breaks.

(E) Zoomed-in view of (D), showing additional fragmentation near 6q22.31–6q23.3.

CCS reads and contigs from the de novo paternal assembly of proband 4 also support structural variation involving chromosomes 7 and 9, with five breakpoints (Figure 4). The proband has two insertional translocations in addition to an inversion at the 5′ end of the chromosome 7 sequence within the derived 9p arm. Manual curation of SNVs surrounding all breakpoints confirmed that all variation lies on the paternal allele, and no mosaicism is suspected. Manual curation of the proband’s de novo assembly (specifically tig66) was required to resolve an assembly artifact (Figure S10; Supplemental methods).

Figure 4.

Proband 4 has two insertional translocations between chromosomes 7 and 9 and an inversion

(A) Ideogram with annotation of chromosome 7 and 9 breakpoints identified in proband 4. Ideograms are from the NCBI Genome Decoration Page.

(B) Schematic of the proband’s maternal (pink box) and paternal (blue box) p arms of chromosomes 7 and 9. The proband’s maternal alleles match reference. The paternal sequences represent the outcome of translocations (7A;9A and 7B;9B) and inversion (7A;7C), with fragment sizes shown. The red fragment in paternal der9p is inverted with respect to hg38 reference.

(C) Alignment of three paternal contigs to reference chromosomes 7 and 9 identified two insertional translocations. See Figure S6 and Supplemental methods regarding blue and red boxed areas.

The net effect of the translocations and inversion is likely disruption of two protein-coding genes: DGKB (MIM: 604070) on chromosome 7 and MLLT3 (MIM: 159558) on chromosome 9, neither of which has been associated with disease (Table S6A). To determine if MLLT3 transcripts are disrupted in this proband, we performed qPCR using RNA from each member of the trio, in addition to three unrelated individuals (family 3). Using two validated TaqMan probes near the region of interest (exons 3–4 and exons 9–10), we found that proband 4 showed a ~35%–39% decrease in MLLT3 compared to her parents and a 38%–45% decrease relative to unrelated individuals (Figure S11; Table S7). Expression of DGKB was not examined, as the gene is not expressed at appreciable levels in blood.[43]

Analysis of CCS-detected SVs in IGS reads

None of the disease-associated variation described here and detected by CCS analysis was identified in our IGS analyses. We analyzed raw variant calls and IGS reads at each of the relevant breakpoints to determine why such variants were not detected (Figures S12–S15). In the case of CDKL5, MELT did not call any L1, SVA, or Alu-mediated insertions within 1 Mb of CDKL5. This is likely, at least in part, because the insertion is L1-mediated but has a non-L1 sequence at one breakpoint. However, in retrospectively searching for structural variation near CDKL5 from our standard SV pipeline, we found that Delly and Manta both called a 230 kb duplication event in CDKL5. The call passed our frequency filters and was flagged as de novo. However, upon inspection, read depth and allele ratios clearly did not support a duplication event (Figure S16). Retrospectively, it is clear that this “230 kb duplication” call resulted from the duplication and insertion of a segment of PPEF1 intronic sequence into the CDKL5 intron. However, the Delly and Manta calls are plainly not correct and at the time of initial IGS analysis were disregarded. In the case of the multiple complex breakpoints identified in proband 4, most of the breakpoints were in fact called as BND or inversions by Manta (Table S6). However, Manta is the only tool capable of detecting such variation, and our pipeline requires concordance from at least two callers for small SVs (see Material and methods); thus, these events were disregarded. Furthermore, it is important to note that the proband had 814 potentially de novo BND/inversion calls from Manta, a count that indicates an untenably high number of false de novo calls (be they inherited or simply FP variants). In addition, typical strategies to curate and interpret candidate variation, including filtration using population frequencies, are unavailable for these categories of variation. 
The net result is that these variants were not evaluated in our routine analysis process. Lastly, even to the extent that individual breakpoints were flagged in IGS analysis, the lack of a coherent assembly of how the individual breakpoints and fragments relate to one another would have precluded meaningful evaluation.
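
The effect of a two-caller concordance requirement can be illustrated with a toy filter. This is a hedged sketch only: the call records, the exact-coordinate matching, and the function name `concordant_calls` are hypothetical and do not reproduce the authors' pipeline, which is described in Material and methods.

```python
from collections import defaultdict

# Hypothetical toy calls: (caller, chrom, position, sv_type).
# Values are illustrative and are not actual pipeline output.
calls = [
    ("manta", "chr6", 16307569, "BND"),  # supported by one caller only
    ("manta", "chrX", 18510871, "DUP"),
    ("delly", "chrX", 18510871, "DUP"),  # same site, second caller
]

def concordant_calls(calls, min_callers=2):
    """Keep sites reported by at least `min_callers` distinct callers.

    Assumption: sites match only on identical (chrom, position, type);
    real pipelines typically allow positional tolerance.
    """
    support = defaultdict(set)
    for caller, chrom, pos, sv_type in calls:
        support[(chrom, pos, sv_type)].add(caller)
    return {site for site, callers in support.items()
            if len(callers) >= min_callers}

kept = concordant_calls(calls)
print(kept)
```

In this toy example the single-caller BND record is dropped while the two-caller duplication record is retained, mirroring how breakpoints reported by only one tool can be filtered out before review.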

Discussion

Here we describe CCS long-read sequencing of six probands with NDDs who had previously undergone extensive genetic testing with no variants found to be relevant to disease. Generally, the CCS genomes appeared to be highly comprehensive and accurate in terms of variant detection, facilitating detection of a diversity of variant types across many loci, including those that prove challenging to analyze with short reads. Detection of simple-repeat expansions and variants within low-mappability regions, for example, was more accurate and comprehensive in CCS data than that seen in IGS, and many complex SVs were plainly visible in CCS data but missed by IGS. Given the importance of de novo variation in rare disease diagnostics, especially for NDDs, it is also important to note the qualities of discrepant de novo calls between the two technologies. We found that most of the erroneously called de novo variants in the CCS data were correctly called as heterozygous in the proband but missed in the parents due to lower coverage and random sampling effects such that the variant allele was simply not covered by any reads in the transmitting parent. Such errors could be mitigated by sequencing parents more deeply. In contrast, de novo variants unique to IGS were enriched for systematic artifacts that cannot be corrected for with higher read depth. Indels, for example, are a well-known source of error and heavily enriched among IGS de novo variant calls.

In one proband we identified a likely pathogenic, de novo L1-mediated insertion in CDKL5. 
CDKL5 encodes cyclin-dependent kinase-like 5, a serine-threonine protein kinase that plays a role in neuronal morphology, possibly via regulation of microtubule dynamics.[44] Variation in CDKL5 has been associated with EIEE2 (MIM: 300672), an X-linked dominant syndrome characterized by infantile spasms, early-onset intractable epilepsy, hypotonia, and variable additional Rett-like features.[45,46] CDKL5 is one of the most commonly implicated genes identified by ES/GS in individuals with epilepsy.[47] SNVs, small insertions and deletions, copy-number variants (CNVs), and balanced translocations have all been identified in affected individuals, each supporting a haploinsufficiency model of disease.[48] We also note that de novo SVs, including deletions and at least one translocation, have been reported with a breakpoint in intron 3, near the breakpoint identified here[48-51] (Table S8; Figure S17). The variant observed here appears to be mosaic, and we note that a recent study found that 8.8% of previously reported CDKL5 mutations are also mosaic.[52] While most such mutations have been identified in males rather than females, noting that pathogenic CDKL5 variation is often lethal in males, there is not an obvious relationship between phenotypic severity, gender, variant type, and mosaicism.[53] The variant harbors two classic marks of an L1HS insertion: the preferred L1 EN consensus cleavage site (5′-TTTT/G-3′) and a 119 bp TSD, which, in this case, includes exon 3 of CDKL5. Although TSDs are often fewer than 50 bp long, TSDs up to 323 bp have been detected.[54] The variant appears to be a chimeric L1 insertion. The 3′ end of the insertion represents retrotransposition of an active L1HS mobile element, with a signature poly(A) tail. However, the 5′ portion of the L1 sequence has greater identity to an L1 sequence within an intron of PPEF1, which lies about 230 kb downstream of CDKL5. 
Additional non-L1 sequence at the 5′ end of the insertion is identical to an intronic segment of PPEF1. While transduction of sequences at the 3′ end of L1 sequence has been described,[55] the PPEF1 intronic sequence here lies at the 5′ end of the L1. A chimeric insertion similar to that observed here has been described previously and has been proposed to result from a combination of retrotransposition and a synthesis-dependent strand annealing (SDSA)-like mechanism.[54] Using ACMG variant classification guidelines, we classified this variant as likely pathogenic. The variant was experimentally confirmed to result in frameshifted transcripts due to exon duplication and was shown to be de novo, allowing for use of both the PVS1 (loss of function)[56] and PM2 (de novo)[57] evidence codes. Use of likely pathogenic, as opposed to pathogenic, reflects the uncertainty resulting from the intrinsically unusual nature of the variant and its potential somatic mosaicism, in addition to the fact that its absence from population variant databases is not in principle a reliable indicator of true rarity. Identification of additional MEIs and other complex SVs in other individuals will likely aid in disease interpretation by both facilitating more accurate allele frequency estimation and by improving interpretation guidelines. 
More generally, MEIs have been previously described as a pathogenic mechanism of gene disruption, but their contribution to developmental disorders has been limited to a modest number of individuals in a few studies, each of which reports pathogenic/likely pathogenic variation lying within coding exons.[9,10] However, the MEI observed here in CDKL5 would likely be missed by exome sequencing as the breakpoints are intronic, and in fact it was also missed in our previous short-read genome sequencing analysis.[15] Global analyses of MEIs, such as our assessment of de novo Alu insertion rates (Table S5), also support the conclusion that MEI events are far more effectively detected in CCS data than in short-read genomes. We find it likely that long-read sequencing will uncover MEIs that disrupt gene function and lead to NDDs in many currently unexplained cases. CCS data also led to the detection of multiple large, complex, de novo SVs in proband 4, affecting at least three chromosomes. Both complex chromosomal rearrangements (CCRs), which involve at least three cytogenetically visible breakpoints on two or more chromosomes, and complex genomic rearrangements (CGRs), which are often on a smaller scale but more complex, have been reported in individuals with NDDs or other congenital anomalies.[58-61] Proband 4 appears to have both a CGR and a CCR, the latter of which includes insertional translocations and an inversion on chromosomes 7 and 9. 
The CGR consists of local rearrangement of eight segments near 6q22.31–6q23.3 and appears to represent chromothripsis, as the segments are localized, do not have microhomology at their breaks, and show no significant copy gain or loss in the region (Figure S18), all of which are characteristics of chromothripsis.[62] The location of this cluster near one of the breakpoints of the pericentric inversion is consistent with observations that missegregated chromosomes can undergo micronucleus formation and shattering.[63] However, we cannot rule out other related mechanisms under the umbrella term of chromoanagenesis.[64] One of the most compelling disease causal candidate genes affected in proband 4 is MLLT3, which is predicted to be moderately intolerant to loss-of-function variation (pLI = 1, o/e = 0 [0–0.13];[21] RVIS = 21.1%[65]). MLLT3, also known as AF9, undergoes somatic translocation with the MLL gene, also known as KMT2A (MIM: 159555), in individuals with acute leukemia; pathogenicity in these cases results from expression of an in-frame KMT2A-MLLT3 fusion protein and subsequent deregulation of target HOX genes.[66] Balanced translocations between chromosome 4 and chromosome 9, resulting in disruption of MLLT3, have been previously reported in two individuals, each with NDDs including intractable seizures.[67,68] Although proband 4 does not exhibit seizures, she does have features that overlap the described probands, including speech delay, hypotonia, and fifth-finger clinodactyly. While we cannot be certain of the pathogenic contribution of any one SV in proband 4, we consider the number, size, and extent of de novo structural variation to be likely pathogenic. 
ACMG recommendations on the interpretation of copy number variation were recently published, and although the events in proband 4 appear to be copy neutral, we attempted to apply modifications of these guidelines to these events.[69] The most compelling evidence for pathogenicity of these events is their de novo status (evidence code 5A); disruption of at least six protein-coding genes at the breakpoints (3A), at least one of which is predicted to be haploinsufficient (2H); and the total number and genomic extent of large SVs. While several of these can be captured by current evidence codes, they are weakened by the lack of affected disease-associated genes and the lack of a highly specific phenotype in the proband. Further, although the SVs are large events, including a shattering of a >9 Mb region of the genome, we do not know the molecular effect on genes that are nearby but not spanning the breakpoints. Identification of additional complex structural variation like that in this proband will aid in development of additional guidelines for classification of these events. Retrospective analysis of the disease-associated events described here did identify reads in the IGS data that support the majority of the breakpoints (Figures S12–S15; Table S6). However, there are multiple reasons why these events were not originally identified by our standard IGS analyses, including discrepancies among calling algorithms, incorrect or incomplete descriptions of the sizes and natures of the events, and filtration steps that are required to make IGS interpretation pipelines effective and sustainable. We note that our sample size, with only six total trios and two individuals with clinically relevant discoveries, is clearly too small to make precise predictions about the diagnostic yield of long-read sequencing. However, we believe the yield will be substantial. 
As a baseline, it is likely to be at least as high as that from short reads, given that there is no evidence of a sensitivity loss for short-read-detectable variation (e.g., SNVs and short indels). The key unknown is thus the additional yield from long-read sequencing in cases that harbor no clinically relevant variation detected by short-read sequencing. In that light, our observations are inconsistent with a very low yield. If we were to assume, as an example, that the true yield for long reads in unsolved cases is only 1%, it is unlikely that we would have observed 2 successes in 6 individuals (p = 0.0015, binomial test). Of course, the 6 unsolved probands were not randomly sampled from the set of all unsolved probands, and small counts are always intrinsically uncertain. Thus, studies of larger cohorts are necessary to estimate the magnitude of increased diagnostic yield from long-read genome sequencing. In addition to the need for larger studies, it is also important to consider factors like costs and DNA input requirements, which remain obstacles to widespread adoption of long-read genome sequencing. Additionally, refining and optimizing computational pipelines and establishing benchmarks and quality-control metrics will also be necessary. That said, there have been considerable improvements, especially recently, on cost and DNA input requirements,[70] and the computational and analytical challenges, while non-trivial, are tractable. Considering the evidence supporting the superior variant detection ability of long reads presented here and elsewhere,[70,71] we believe that the overall diagnostic yield for long reads will prove to be substantially better than current yields and that long-read genome analysis will supplant short-read analysis of individuals with rare disease in the coming years.
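
The binomial calculation quoted above can be reproduced in a few lines. This is a minimal sketch assuming a one-sided test, that is, the probability of observing at least 2 successes in 6 independent trials when the true per-case yield is 1%; the function name `binom_upper_tail` is an illustrative choice.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

# 2 or more diagnoses among 6 probands, assuming a true yield of 1%.
p_value = binom_upper_tail(k=2, n=6, p=0.01)
print(f"{p_value:.4f}")  # 0.0015
```

The result depends on the assumed 1% yield; the same function can be evaluated at other hypothetical yields to see how quickly the observation stops being surprising.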

65 in total

1. A t(4;9)(q34;p22) translocation associated with partial epilepsy, mental retardation, and dysmorphism.

Authors: Pasquale Striano; Maurizio Elia; Lucia Castiglia; Ornella Galesi; Sabina Pelligra; Salvatore Striano
Journal: Epilepsia Date: 2005-08 Impact factor: 5.864

2. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.

Authors: Sergey Nurk; Brian P Walenz; Arang Rhie; Mitchell R Vollger; Glennis A Logsdon; Robert Grothe; Karen H Miga; Evan E Eichler; Adam M Phillippy; Sergey Koren
Journal: Genome Res Date: 2020-08-14 Impact factor: 9.043

Review 3. MLL translocations, histone modifications and leukaemia stem-cell development.

Authors: Andrei V Krivtsov; Scott A Armstrong
Journal: Nat Rev Cancer Date: 2007-11 Impact factor: 60.716

4. Prevalence of carriers of premutation-size alleles of the FMRI gene--and implications for the population genetics of the fragile X syndrome.

Authors: F Rousseau; P Rouillard; M L Morel; E W Khandjian; K Morgan
Journal: Am J Hum Genet Date: 1995-11 Impact factor: 11.025

5. A copy number variation morbidity map of developmental delay.

Authors: Gregory M Cooper; Bradley P Coe; Santhosh Girirajan; Jill A Rosenfeld; Tiffany H Vu; Carl Baker; Charles Williams; Heather Stalker; Rizwan Hamid; Vickie Hannig; Hoda Abdel-Hamid; Patricia Bader; Elizabeth McCracken; Dmitriy Niyazov; Kathleen Leppig; Heidi Thiese; Marybeth Hummel; Nora Alexander; Jerome Gorski; Jennifer Kussmann; Vandana Shashi; Krys Johnson; Catherine Rehder; Blake C Ballif; Lisa G Shaffer; Evan E Eichler
Journal: Nat Genet Date: 2011-08-14 Impact factor: 38.330

6. Analysis of protein-coding genetic variation in 60,706 humans.

Authors: Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal: Nature Date: 2016-08-18 Impact factor: 49.962

7. Sentieon DNASeq Variant Calling Workflow Demonstrates Strong Computational Performance and Accuracy.

Authors: Katherine I Kendig; Saurabh Baheti; Matthew A Bockol; Travis M Drucker; Steven N Hart; Jacob R Heldenbrand; Mikel Hernaez; Matthew E Hudson; Michael T Kalmbach; Eric W Klee; Nathan R Mattson; Christian A Ross; Morgan Taschuk; Eric D Wieben; Mathieu Wiepert; Derek E Wildman; Liudmila S Mainzer
Journal: Front Genet Date: 2019-08-20 Impact factor: 4.599

Review 8. Structural variant calling: the long and the short of it.

Authors: Medhat Mahmoud; Nastassia Gobet; Diana Ivette Cruz-Dávalos; Ninon Mounier; Christophe Dessimoz; Fritz J Sedlazeck
Journal: Genome Biol Date: 2019-11-20 Impact factor: 13.583

9. DELLY: structural variant discovery by integrated paired-end and split-read analysis.

Authors: Tobias Rausch; Thomas Zichner; Andreas Schlattl; Adrian M Stütz; Vladimir Benes; Jan O Korbel
Journal: Bioinformatics Date: 2012-09-15 Impact factor: 6.937

10. The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology.

Authors: Eugene J Gardner; Vincent K Lam; Daniel N Harris; Nelson T Chuang; Emma C Scott; W Stephen Pittard; Ryan E Mills; Scott E Devine
Journal: Genome Res Date: 2017-08-30 Impact factor: 9.043
