Literature DB >> 27153570

Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing.

Volkan Sevim¹, Ali Bashir^2,3, Chen-Shan Chin¹, Karen H Miga⁴.

Abstract

MOTIVATION: Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. Unlike reads shorter than the underlying repeat structure that rely on indirect inference methods, e.g. assembly, long reads allow direct inference of satellite higher order repeat structure. To automate characterization of local centromeric tandem repeat sequence variation we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), that takes advantage of Pacific Bioscience long-reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not impacted by rearrangements or sequence underrepresentation due to misassembly.
RESULTS: We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha satellite containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere high order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, offering a method to generate a sequence repeat resolution that could be readily performed using consensus sequences available for other satellite families in genomes without high-quality reference assemblies.
AVAILABILITY AND IMPLEMENTATION: Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI CONTACT: ali.bashir@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Disease Gene Species

Mesh：

Year: 2016 PMID： 27153570 PMCID： PMC4920115 DOI： 10.1093/bioinformatics/btw101

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Centromeres and other heterochromatic regions in most complex genomes are enriched with long arrays of near-identical tandem repeats, known as satellite DNAs, which challenge standard assembly efforts and are vastly underrepresented in genomic studies (Eichler ; Willard, 1990). Local arrangements of the fundamental repeat unit, or monomers, within each satellite array are commonly organized into multi-monomeric repeating units or, higher-order repeats (HORs; Willard and Waye, 1987a). These sites are observed to be highly variable in copy number and repeat structure (Warburton and Willard, 1990; Warburton ), providing a new and largely uncharacterized source of sequence variation to study in the context of sequence evolution and biomedical research. Previous efforts to characterize repeat variants from whole-genome sequencing (WGS) data relied on predictions made by short read assembly or k-mer frequency (Alkan ; Macas ). Although useful in predicting the most common repeat unit organizations in a given genome, such strategies are not designed to efficiently survey local rearrangement of repeat patterns involving variant sequence organization between adjacent repeats, precise sites of inversion in arrays and transition boundaries with non-satellite DNA that may interrupt an array. In contrast, long read analysis, where a complete repeating unit can be observed in its entirety more than once, enables direct characterization of these events without assembly based inference (Melters ). To automate the survey of repeat variants in quality-assessed long read datasets, we developed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), a bioinformatics pipeline for evaluating local HOR structure across long read WGS datasets. When provided with long-read data and a training dataset of repeat units from a satellite family of interest, Alpha-CENTAURI is designed to predict HOR patterns based on clustering discrete repeat units on each read. In doing so, the user is provided with characterization of HOR structure on the read: defined either as regular, or containing the same ordering of near-identical monomers, or irregular, that is, containing a rearrangement, inversion or discontinuous spacing in monomers when compared to other HORs on the same read. Further, regions of the read that do not contain an identifiable repeat unit are flagged to go through a secondary analysis using Repbase libraries (Jurka ) to label sites of characterized repeat insertions. We demonstrate the utility of Alpha-CENTAURI on WGS data from a hydatiform mole genome (CHM1), using a collection of previously characterized 171 bp monomers from an AT-rich human centromeric satellite family, known as alpha satellite (Manuelidis and Wu, 1978).

2 Materials and methods

2.1 Higher order repeat prediction of ordered monomers

Alpha-CENTAURI’s workflow is designed to detect tandem repeats containing at least two ordered monomers (i.e. dimers), providing a minimal definition for HOR prediction (Willard and Waye, 1987b) (Fig. 1). As outlined in Figure 1a, the user is initially required to provide two input databases: (i) a file of quality-assessed long reads and (ii) a monomer training set of fasta sequences to define the basic repeat unit for a given satellite family. Initial monomer positions are determined using a hidden Markov method using a satellite model provided by the input consensus sequence (HMMER, Eddy, 1998). Minimum monomer lengths are defined by default to be 150 bases. Characterization of discrete, ordered monomer units provide an index of start positions for each read. Monomers are clustered into groups based on pair-wise sequence identity using an implementation of an O(ND) alignment algorithm (Myers, 1986) within the FALCON genome assembler (https://github.com/PacificBiosciences/FALCON/). Cluster similarity threshold is determined per read by evaluating a range of identity values (98% to 88% by 1% decrements) and selecting the monomer clusters assignments with the highest percent identity that permits inference of HOR organization. Consensus sequence fasta and tab-delimited description of HOR summary statistics are provided for each read.

Fig. 1.

Alpha-CENTAURI workflow and the HOR detection algorithm illustrated. (a) The workflow. Alpha-CENTAURI takes two input files: a FASTA file containing long reads and an HMM database built using known alpha-satellite monomers. The HMM database is used to infer monomeric sequences in each read. Then, HOR structure is predicted based on the start and end positions of each monomer on the read. The repeat structure on the read is classified under three categories regular, irregular (including inversion), or cases where no HOR is detected. (b) An illustration of a read consisting of an array of alpha-satellite monomers, which are identified from the HMM database. Each block arrow corresponds to a monomer. Similar colors indicate similar sequences. (c) Identified monomers are clustered-based sequence similarity. Here, each cluster is labeled by a different letter. (d) HOR structure is predicted based on the start positions, end positions and the distances between monomers

2.2 Evaluation of regular or irregular repeat structures

HOR predictions for each read are classified as regular if cluster order, as provided by monomer position-based indexing and similarity clustering, follows a consistent pattern across the entirety of the read. Let, Xi be the start position of monomer in cluster X. Additionally, let d be an ordered set of distances between monomers in cluster X, where d,i = Xi + 1-Xi is the distance between monomer i and i + 1 (Fig. 1c), and d is the superset of such distances over all clusters. Conditions for regularity are listed below. All identified monomers are assigned to a cluster containing more than one monomer. Distances between the monomers in the same cluster are consistent across the read. Namely, 1.05 ≥ d,i/Median(d) ≥ 0.95. Distance between the tail of each monomer and the head of the adjacent monomer is equal or less than 5 bases. Orientation of all monomers within the same cluster is consistent, i.e. there are no inversions. Reads that fail any of these conditions are considered irregular, while reads failing condition 4 are additionally labeled as inversions. Irregular reads contain rearrangements in monomer ordering (regions of monomer insertion and deletion), shifts in orientation of monomer repeat units, and/or stretches of bases that are interspersed or are >5 bp at the start or end of monomers. HORs that contain more than one copy of a certain monomer fail condition 2, and therefore are currently characterized as an irregular. Monomeric satellite structures or highly divergent ordered arrays of monomers that are expected to have single monomer membership in each cluster group, are reported under the no-HOR category.

2.3 Evaluation of regular or irregular repeat structures

To evaluate alpha-CENTAURI, we characterized repeat structure predictions using a test dataset of 6438 alpha satellite containing reads (mean read length: 10,536 bp; min: 2000 bp; max: 31,594 bp) from the hydatidiform mole genome (CHM1) (CHM1 error-corrected sequences provided on the Alpha-Centuari git repository.). Predicted repeat structures were compared to a reference database of previously annotated HOR patterns (reviewed Alexandrov ) and using monomers obtained from the HuRef genome (Levy ; GRCh38 BioProject: PRJNA193213). A summary of repeat regularity is provided in Table 1.

Table 1.

Centromeric satellite structure predictions

Repeat structure predictions	Reads	Frequency
Regular	1470	0.23
Irregular	3723	0.58
Inversions	154
No repeat structure detected	1244	0.19

A summary of observed repeat regularity predictions using alpha satellite containing reads from CHM1. This dataset is included with the Alpha-CENTAURI source-code repository.

Centromeric satellite structure predictions A summary of observed repeat regularity predictions using alpha satellite containing reads from CHM1. This dataset is included with the Alpha-CENTAURI source-code repository. We observed 23% (1470/6438) of alpha satellite containing reads contained a regular HOR. Each regular read was assigned to one of 49 previously characterized HORs, with representation across all autosomes and the X chromosome (Supplementary Tables S1 and S2). Within the regular predictions, subsets of reads were observed to describe the same HOR, for example on the X chromosome (DXZ1 or ‘cenX’), 80 reads correctly identified the 12 monomer HOR with a median monomer alignment identity of 98.9%. We were are able to detect HORs that varied in monomer organization within a single array or between nearly identical inter-chromosomal arrays, as seen in the dimeric repeat assigned to chromosome 1, 5 and 19 centromeric regions, where the majority of D1Z7 (or ‘cen1_1’) reads were characterized either as a dimer (66% of reads, 143/218) or a 6-mer repeat (27% of reads, 60/218). Initially, the majority of predictions were labeled as irregular. After alpha satellite sequence annotation, it was determined that a subset of irregular assignments were due to missing information: events where HMMER did not correctly predict a monomer position (44 reads, or 1.2% of irregular predictions); events where a monomer length was less than the default threshold due to either rearrangement or incomplete HMMER monomer prediction (150 bp); and events where the read length was insufficient to span the full HOR more than once. Additionally, HORs that re-used near-identical monomers present within a single higher order repeat unit, as observed for the D21Z1 HOR, were labeled as irregular. The remaining irregular reads are expected to represent ‘true’ HOR variants. These include inversions (154, or 4% of irregular reads, with the highest prevalence on acrocentric array cen22, an HOR that is found on both chromosomes 14 and 22) as well as rearrangements of the HOR monomer patterns on the same read. We observed 1244 alpha satellite containing reads that did not provide evidence for an HOR. Roughly 44% (544/1244 reads) were classified a ‘monomeric’ satellites, or highly divergent alpha satellite monomers (i.e. monomers with less than the lower threshold of 88% pairwise identity) typically found in the pericentromeric arms. The remaining 690 reads in our study either represented incomplete HORs single copy HORs or included a transition into non-satellite adjacent sequences (defined here as containing at least 1 kb of non-alpha satellite sequence on the same read). Further analysis of these transitions with RepeatMasker (http://www.repeatmasker.org), using Repbase libraries (Jurka ), demonstrates the ability to track sites of transposable element insertion, of which we find LINE1 to be the most prevalent (18.9% of total bases, i.e. 273 042/1 393 472, surveyed from the 140 reads containing at least 1 kb of non-alpha satellite sequence), consistent with the literature of alpha satellite DNA (Schueler ).

3 Conclusion

We present Alpha-CENTAURI, a python-based workflow for characterizing satellite sequence variants in long read datasets. This automated method is useful in reporting de novo higher order repeat sequence organization and predicting structural variation within HORs on a per read basis, thereby characterizing a window of local repeat organization defined by the length of the read. Users are able to rapidly characterize three distinct subgroups of repeat structure, defined here as regular or irregular higher order repeats, thus defining HOR structures and sites of rearrangement. While deep, long-read sequencing human datasets are becoming increasingly available, one of the concerns with third-generation sequencing approaches is their relative expensive compared to second-generation technologies at comparable sequence depths. Given the high-copy number of HORs relative to the genome, one can provide highly-informative information of satellite structure at low-depth (potentially even <1× coverage of a genome). Therefore, we believe Alpha-CENTAURI to provide an opportunity to not only provide high-resolution surveys of HORs from deep sequencing projects, but also to enable low depth, low cost surveys across a population or in genomes without high-quality references.

Funding

This work was supported in part through the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai. Conflict of Interest: C.-S.C. is an employee and a shareholder of Pacific Biosciences, Inc.

14 in total

1. Alpha-satellite DNA of primates: old and new families.

Authors: I Alexandrov; A Kazakov; I Tumeneva; V Shepelev; Y Yurov
Journal: Chromosoma Date: 2001-08 Impact factor: 4.316

2. Genomic and genetic definition of a functional human centromere.

Authors: M G Schueler; A W Higgins; M K Rudd; K Gustashaw; H F Willard
Journal: Science Date: 2001-10-05 Impact factor: 47.728

3. Genomic analysis of sequence variation in tandemly repeated DNA. Evidence for localized homogeneous sequence domains within arrays of alpha-satellite DNA.

Authors: P E Warburton; H F Willard
Journal: J Mol Biol Date: 1990-11-05 Impact factor: 5.469

Review 4. Repbase Update, a database of eukaryotic repetitive elements.

Authors: J Jurka; V V Kapitonov; A Pavlicek; P Klonowski; O Kohany; J Walichiewicz
Journal: Cytogenet Genome Res Date: 2005 Impact factor: 1.636

Review 5. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

6. Chromosome-specific subsets of human alpha satellite DNA: analysis of sequence divergence within and between chromosomal subsets and evidence for an ancestral pentameric repeat.

Authors: H F Willard; J S Waye
Journal: J Mol Evol Date: 1987 Impact factor: 2.395

7. Nonrandom localization of recombination events in human alpha satellite repeat unit variants: implications for higher-order structural characteristics within centromeric heterochromatin.

Authors: P E Warburton; J S Waye; H F Willard
Journal: Mol Cell Biol Date: 1993-10 Impact factor: 4.272

8. The diploid genome sequence of an individual human.

Authors: Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029

9. Organization and evolution of primate centromeric DNA from whole-genome shotgun sequence data.

Authors: Can Alkan; Mario Ventura; Nicoletta Archidiacono; Mariano Rocchi; S Cenk Sahinalp; Evan E Eichler
Journal: PLoS Comput Biol Date: 2007-09 Impact factor: 4.475

10. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution.

Authors: Daniël P Melters; Keith R Bradnam; Hugh A Young; Natalie Telis; Michael R May; J Graham Ruby; Robert Sebra; Paul Peluso; John Eid; David Rank; José Fernando Garcia; Joseph L DeRisi; Timothy Smith; Christian Tobias; Jeffrey Ross-Ibarra; Ian Korf; Simon W L Chan
Journal: Genome Biol Date: 2013-01-30 Impact factor: 13.583

17 in total

Review 1. Chromatin dynamics during the cell cycle at centromeres.

Authors: Sebastian Müller; Geneviève Almouzni
Journal: Nat Rev Genet Date: 2017-01-31 Impact factor: 53.242

Review 2. Satellite DNA evolution: old ideas, new approaches.

Authors: Sarah Sander Lower; Michael P McGurk; Andrew G Clark; Daniel A Barbash
Journal: Curr Opin Genet Dev Date: 2018-03-23 Impact factor: 5.578

3. Global Repeat Map (GRM): Advantageous Method for Discovery of Largest Higher-Order Repeats (HORs) in Neuroblastoma Breakpoint Family (NBPF) Genes, in Hornerin Exon and in Chromosome 21 Centromere.

Authors: Vladimir Paar; Ines Vlahović; Marija Rosandić; Matko Glunčić
Journal: Prog Mol Subcell Biol Date: 2021

4. Automated annotation of human centromeres with HORmon.

Authors: Olga Kunyavskaya; Tatiana Dvorkina; Andrey V Bzikadze; Ivan A Alexandrov; Pavel A Pevzner
Journal: Genome Res Date: 2022-05-11 Impact factor: 9.438

Review 5. Genomic Tackling of Human Satellite DNA: Breaking Barriers through Time.

Authors: Mariana Lopes; Sandra Louzada; Margarida Gama-Carvalho; Raquel Chaves
Journal: Int J Mol Sci Date: 2021-04-29 Impact factor: 5.923

6. Centromere evolution and CpG methylation during vertebrate speciation.

Authors: Kazuki Ichikawa; Shingo Tomioka; Yuta Suzuki; Ryohei Nakamura; Koichiro Doi; Jun Yoshimura; Masahiko Kumagai; Yusuke Inoue; Yui Uchida; Naoki Irie; Hiroyuki Takeda; Shinich Morishita
Journal: Nat Commun Date: 2017-11-28 Impact factor: 14.919

7. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications.

Authors: Matthias H Weissensteiner; Andy W C Pang; Ignas Bunikis; Ida Höijer; Olga Vinnere-Petterson; Alexander Suh; Jochen B W Wolf
Journal: Genome Res Date: 2017-03-30 Impact factor: 9.043

8. Higher-order organisation of extremely amplified, potentially functional and massively methylated 5S rDNA in European pikes (Esox sp.).

Authors: Radka Symonová; Konrad Ocalewicz; Lech Kirtiklis; Giovanni Battista Delmastro; Šárka Pelikánová; Sonia Garcia; Aleš Kovařík
Journal: BMC Genomics Date: 2017-05-18 Impact factor: 3.969

9. Hybrid de novo genome assembly and centromere characterization of the gray mouse lemur (Microcebus murinus).

Authors: Peter A Larsen; R Alan Harris; Yue Liu; Shwetha C Murali; C Ryan Campbell; Adam D Brown; Beth A Sullivan; Jennifer Shelton; Susan J Brown; Muthuswamy Raveendran; Olga Dudchenko; Ido Machol; Neva C Durand; Muhammad S Shamim; Erez Lieberman Aiden; Donna M Muzny; Richard A Gibbs; Anne D Yoder; Jeffrey Rogers; Kim C Worley
Journal: BMC Biol Date: 2017-11-16 Impact factor: 7.431

10. Classification and monomer-by-monomer annotation dataset of suprachromosomal family 1 alpha satellite higher-order repeats in hg38 human genome assembly.

Authors: L I Uralsky; V A Shepelev; A A Alexandrov; Y B Yurov; E I Rogaev; I A Alexandrov
Journal: Data Brief Date: 2019-03-08