Volkan Sevim, Ali Bashir, Chen-Shan Chin, Karen H Miga.
Abstract
MOTIVATION: Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences are a source of repeat-structure diversity that standard genomic tools commonly ignore. Reads shorter than the underlying repeat structure require indirect inference methods, e.g. assembly, whereas long reads allow direct inference of satellite higher-order repeat (HOR) structure. To automate characterization of local centromeric tandem repeat sequence variation, we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), which takes advantage of Pacific Biosciences long reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not affected by rearrangements or sequence underrepresentation due to misassembly.
Year: 2016 PMID: 27153570 PMCID: PMC4920115 DOI: 10.1093/bioinformatics/btw101
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Alpha-CENTAURI workflow and the HOR detection algorithm illustrated. (a) The workflow. Alpha-CENTAURI takes two input files: a FASTA file containing long reads and an HMM database built using known alpha-satellite monomers. The HMM database is used to infer monomeric sequences in each read. Then, HOR structure is predicted based on the start and end positions of each monomer on the read. The repeat structure on the read is classified into one of three categories: regular, irregular (including inversions), or no HOR detected. (b) An illustration of a read consisting of an array of alpha-satellite monomers, which are identified using the HMM database. Each block arrow corresponds to a monomer. Similar colors indicate similar sequences. (c) Identified monomers are clustered based on sequence similarity. Here, each cluster is labeled by a different letter. (d) HOR structure is predicted based on the start positions, end positions and the distances between monomers.
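The classification step described in panels (c) and (d) can be illustrated with a minimal Python sketch. This is not the Alpha-CENTAURI implementation: the function name `classify_read`, the `tolerance` parameter, and the median-spacing rule are assumptions introduced here for illustration, and inversion detection is omitted. The sketch assumes monomers have already been located on the read and assigned cluster labels, and it calls a read regular when every pair of consecutive same-cluster monomers is separated by roughly the same distance.

```python
from collections import defaultdict
from statistics import median


def classify_read(monomers, tolerance=0.05):
    """Classify one read as 'regular', 'irregular', or 'no_hor'.

    monomers:  list of (start, end, cluster_label) tuples for one read.
    tolerance: hypothetical relative deviation allowed around the
               median spacing before a read is called irregular.
    """
    # Collect the start positions of each cluster in read order.
    starts_by_cluster = defaultdict(list)
    for start, _end, label in sorted(monomers):
        starts_by_cluster[label].append(start)

    # Distance between consecutive monomers that share a cluster label.
    spacings = []
    for starts in starts_by_cluster.values():
        spacings.extend(b - a for a, b in zip(starts, starts[1:]))

    # No cluster recurs on the read, so no repeat period can be estimated.
    if not spacings:
        return "no_hor"

    # Assumed rule: the median spacing is taken as the candidate HOR period.
    period = median(spacings)
    is_regular = all(abs(s - period) <= tolerance * period for s in spacings)
    return "regular" if is_regular else "irregular"


def toy_read(labels, monomer_length=171):
    """Build evenly spaced toy monomers from a string of cluster labels."""
    return [(i * monomer_length, (i + 1) * monomer_length, label)
            for i, label in enumerate(labels)]


regular_call = classify_read(toy_read("ABCABCABC"))
irregular_call = classify_read(toy_read("ABCABABC"))
no_hor_call = classify_read(toy_read("ABCD"))
```

In the toy reads, `ABCABCABC` repeats a three-monomer unit with constant spacing, `ABCABABC` drops one monomer from the second unit, and `ABCD` contains no recurring cluster. The 171 bp monomer length is a convenient round figure for alpha satellite, used here only to generate coordinates.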
Centromeric satellite structure predictions
| Repeat structure predictions | Reads | Frequency |
|---|---|---|
| Regular | 1470 | 0.23 |
| Irregular | 3723 | 0.58 |
| Inversions | 154 | |
| No repeat structure detected | 1244 | 0.19 |
A summary of observed repeat regularity predictions for alpha-satellite-containing reads from CHM1. This dataset is included with the Alpha-CENTAURI source code repository.
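The frequencies in the table can be reproduced from the read counts with a short Python check. One assumption is made here that the table does not state: reads with inversions are treated as a subset of the irregular reads and are therefore excluded from the total. This reading is consistent with the arithmetic, since the three listed frequencies already sum to 1.00.

```python
# Read counts taken from the table above.
counts = {
    "Regular": 1470,
    "Irregular": 3723,
    "No repeat structure detected": 1244,
}

# Assumption: inversions are counted within "Irregular", so they are
# not added to the total a second time.
inversions = 154

total_reads = sum(counts.values())

# Fraction of reads in each category, rounded as in the table.
frequencies = {name: round(n / total_reads, 2) for name, n in counts.items()}

for name, freq in frequencies.items():
    print(f"{name}: {counts[name]} reads, frequency {freq:.2f}")
```

Under this assumption the total is 6437 reads, and the rounded frequencies match the table values of 0.23, 0.58 and 0.19.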