
Cassis: detection of genomic rearrangement breakpoints.

Christian Baudet, Claire Lemaitre, Zanoni Dias, Christian Gautier, Eric Tannier, Marie-France Sagot.

Abstract

SUMMARY: Genomes undergo large structural changes that alter their organization. The chromosomal regions affected by these rearrangements are called breakpoints, while those that have not been rearranged are called synteny blocks. Lemaitre et al. presented a new method to precisely delimit rearrangement breakpoints in a genome by comparison with the genome of a related species. Taking as input a list of one2one orthologous genes found in the genomes of two species, the method builds a set of reliable and non-overlapping synteny blocks and refines the regions that are not contained in them. By aligning each breakpoint sequence against its specific orthologous sequences in the other species, we can look for weak similarities inside the breakpoint, thus extending the synteny blocks and narrowing the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed. Here, we present the package Cassis, which implements this method for the precise detection of genomic rearrangement breakpoints.

AVAILABILITY: Perl and R scripts are freely available for download at http://pbil.univ-lyon1.fr/software/Cassis/. Documentation with methodological background, technical aspects, download and setup instructions, as well as examples of applications, is available together with the package. The package was tested on Linux and Mac OS environments and is distributed under the GNU GPL License.

Year: 2010 PMID: 20576622 PMCID: PMC2905553 DOI: 10.1093/bioinformatics/btq301

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Large-scale modifications of the genome, such as inversions or transpositions of DNA segments, translocations between non-homologous chromosomes, fusions or fissions of chromosomes, and deletions or duplications of small or large portions, are called rearrangements. They are involved in evolution, speciation and cancer. A crucial step before analysing rearrangements and their possible relation to other genomic features is to locate these events on a genome. When two genomes are compared, conserved regions, also known as synteny blocks, can be identified by comparing the order and orientation of orthologous markers along the chromosome sequences. A region located between two consecutive synteny blocks on one genome, whose orthologous blocks are rearranged in the other genome (not consecutive or not in the same relative orientations), is called a breakpoint.
As far as we know, current methods for detecting breakpoints [Grimm-synteny (Pevzner and Tesler, 2003) and Mauve (Darling et al., 2004), for example] are in fact strategies for detecting synteny blocks: they provide the coordinates of the breakpoint regions only as a byproduct, simply by returning the regions that are not found in a conserved synteny. Lemaitre et al. (2008) developed a formal method that goes one step further and extends the synteny blocks by focusing on the breakpoints themselves. This method was shown to significantly improve the precision of breakpoint locations on mammalian genomes and to enable a better characterization of breakpoint sequences and distributions (Lemaitre et al., 2008, 2009) (see also the datasets and comparisons available together with the package).
The first step of the method processes a list of orthologous genes to identify synteny blocks between the genomes of two related species (a reference genome G_r and a second genome G_o). This step outputs a list of ordered and non-intersecting synteny blocks that are used to identify the breakpoints.
For each breakpoint on the genome G_r, we can define three sequences: the breakpoint sequence S_r, and its two orthologous sequences on the second genome G_o, S_oA and S_oB (Fig. 1).
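The definition of a breakpoint given above (two consecutive blocks on the reference genome whose orthologous blocks are not consecutive, or not in the same relative orientation, on the second genome) can be illustrated with a minimal Python sketch. This is not the Cassis implementation: the `Block` record, its field names and the function `find_breakpoints` are hypothetical, and the sketch assumes the synteny blocks are already built and non-overlapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical synteny block record (field names are assumptions)."""
    name: str
    r_chrom: str   # chromosome on the reference genome
    r_start: int
    r_end: int
    o_chrom: str   # chromosome on the second genome
    o_start: int
    o_end: int
    strand: int    # +1: same orientation in both genomes, -1: inverted


def find_breakpoints(blocks):
    """Return (chrom, start, end, left_block, right_block) for each pair of
    consecutive reference blocks whose adjacency is not conserved."""
    # Rank of every block along the second genome.
    by_o = sorted(blocks, key=lambda b: (b.o_chrom, b.o_start))
    o_rank = {b.name: i for i, b in enumerate(by_o)}

    by_r = sorted(blocks, key=lambda b: (b.r_chrom, b.r_start))
    breakpoints = []
    for a, b in zip(by_r, by_r[1:]):
        if a.r_chrom != b.r_chrom:
            continue  # no region between blocks of different chromosomes
        # Conserved adjacency: same chromosome, same orientation, and
        # neighbours in the second genome in the direction given by strand.
        conserved = (
            a.o_chrom == b.o_chrom
            and a.strand == b.strand
            and o_rank[b.name] - o_rank[a.name] == a.strand
        )
        if not conserved:
            breakpoints.append((a.r_chrom, a.r_end, b.r_start, a.name, b.name))
    return breakpoints


# Toy example: the segment carrying blocks B and C is inverted.
example_blocks = [
    Block("A", "chr1", 1, 100, "chrX", 1, 100, +1),
    Block("B", "chr1", 200, 300, "chrX", 500, 600, -1),
    Block("C", "chr1", 400, 500, "chrX", 300, 400, -1),
    Block("D", "chr1", 600, 700, "chrX", 700, 800, +1),
]
for bp in find_breakpoints(example_blocks):
    print(bp)
```

In this toy example the two ends of the inverted segment are reported as breakpoints, while the B/C adjacency is treated as conserved because both blocks are inverted together.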

Fig. 1. Sequence S_r is defined by the boundaries of two consecutive synteny blocks A and B on the genome G_r. S_oA (S_oB) is defined by the boundaries of the orthologous block A (B) and of the previous/next synteny block (according to the orientation of the blocks) in the genome G_o. To perform the segmentation, the package considers the extended version of the sequences S_r, S_oA and S_oB, which includes the first/last genes of the synteny blocks.

In a second step, the method aligns the breakpoint sequence S_r against S_oA and S_oB, and the information provided by the hits of the alignments is coded along S_r as a sequence of discrete values. A segmentation algorithm calculates the best segmentation of this sequence of discrete values into at most three segments: a segment related to S_oA, a segment related to S_oB and a central segment that represents the refined breakpoint.
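To make the segmentation step more concrete, here is a minimal Python sketch of one way to split a discrete-valued sequence into at most three segments. It is an illustration under stated assumptions, not the algorithm of Lemaitre et al. (2008): the coding (+1 for a position hit only by the first orthologous sequence, -1 for a position hit only by the second, 0 otherwise), the least-squares criterion and the function name `segment` are all assumptions, and the statistical test is not reproduced.

```python
from itertools import accumulate


def segment(values):
    """Split `values` (each in {+1, 0, -1}) into left / middle / right parts.

    Returns cut points (i, j) with 0 <= i <= j <= len(values) such that
    values[:i] is fitted by +1, values[i:j] by 0 and values[j:] by -1,
    minimizing the total squared error. Empty segments are allowed, so the
    result has at most three segments. Ties are resolved towards the
    smallest cut points.
    """
    n = len(values)
    # Prefix sums of the squared error against each target level.
    err_left = [0, *accumulate((v - 1) ** 2 for v in values)]
    err_mid = [0, *accumulate(v ** 2 for v in values)]
    err_right = [0, *accumulate((v + 1) ** 2 for v in values)]

    best_cost, best_cuts = None, (0, 0)
    best_i = 0  # best left cut among positions <= j
    for j in range(n + 1):
        if err_left[j] - err_mid[j] < err_left[best_i] - err_mid[best_i]:
            best_i = j
        cost = (
            err_left[best_i]                      # left segment vs +1
            + (err_mid[j] - err_mid[best_i])      # middle segment vs 0
            + (err_right[n] - err_right[j])       # right segment vs -1
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_cuts = cost, (best_i, j)
    return best_cuts


coded = [1, 1, 0, 1, 1, 0, 0, 0, -1, -1, 0, -1]
i, j = segment(coded)
print("refined breakpoint spans positions", i, "to", j - 1)
```

With prefix sums the search over both cut points runs in linear time; a sequence with no hits at all yields a middle segment covering the whole sequence, i.e. no narrowing.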

2 CASSIS

Cassis is a package that contains the Perl and R implementation of the methods developed by Lemaitre et al. (2008). The package takes as input a list of pairs of one2one orthologous genes found in the genomes G_r and G_o. First, all pairs of intersecting genes that have the same order and direction in both genomes are merged. Overlapping genes that do not meet this criterion are discarded. The list of genes is then used to create synteny blocks according to the algorithm described by Lemaitre et al. (2008), using k = 2. The parameter k controls the degree of flexibility of the method. With k = 2, the algorithm allows individual isolated genes to be out of order without disrupting a synteny block, and all synteny blocks must contain at least two genes. For each breakpoint on the genome G_r, we define the boundaries of the sequences S_r, S_oA and S_oB according to the synteny blocks. We use LASTZ (Harris, 2007) to align S_r against S_oA and S_r against S_oB. LASTZ was chosen because it was shown to be more sensitive in the alignment of intergenic sequences. To obtain better results in the segmentation step, we align the extended versions of the sequences S_r, S_oA and S_oB, which include the genes on the boundaries of the blocks that define each sequence (Fig. 1). If at least one of the alignments (S_r against S_oA or S_r against S_oB) yields a hit, the breakpoint sequence can be refined. The segmentation algorithm is applied to the breakpoint, giving its refined coordinates. During this step, we validate the result with a statistical test that checks whether the breakpoint region is actually structured into three segments.
The package Cassis also works with lists of orthologous synteny blocks. In this case, the steps of overlap identification and synteny block definition are not executed and the input data are submitted directly to the breakpoint identification step.
As we do not have information about the genes inside synteny blocks supplied by the user, we build the extended sequences by adding a fragment of length L on each side of the sequence. If the resulting extended sequence is shorter than Lmin, there is a considerable overlap between consecutive blocks. In that case we cannot properly define the sequence and the corresponding breakpoint is not refined. The default value of both L and Lmin is 50 kbp. This value was chosen because it is close to the average size of a gene.
The package contains a main script that controls the whole process of breakpoint identification and refinement. The script is very simple to use and takes the following parameters:
Input table: tab-separated values file that contains the orthology information. It can be a list of pairs of one2one orthologous genes or a list of pairs of orthologous synteny blocks found on the genomes G_r and G_o;
Input type: flag that indicates the type of the given input table: G for genes and B for synteny blocks;
Directory G_r (G_o): directory where the script can find the chromosome sequences of the genome G_r (G_o);
Output directory: directory that will receive the results; and
Other optional parameters, including a stringency level for the LASTZ alignments and the values for sequence extensions (L and Lmin).
The script generates a table that contains, for each breakpoint, the chromosome of the genome G_r where the breakpoint is located, the coordinates of the breakpoint before and after the segmentation process, and a flag that can take the values −1, 0 and 1. The value −1 denotes that the segmentation could not be executed because the alignments produced no hit. The values 0 and 1 denote, respectively, that the segmentation failed or passed the statistical test. The package also produces, for each breakpoint, a plot with a graphical representation of the segmentation.
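The extension rule controlled by L and Lmin, described above for synteny-block input, can be sketched in Python as follows. This is a minimal illustration, not the package's Perl code: the function name `extend_sequence`, the 1-based inclusive coordinates and the clamping to the chromosome ends are assumptions made for the example.

```python
def extend_sequence(start, end, chrom_length, L=50_000, L_min=50_000):
    """Extend the interval [start, end] by L bases on each side.

    Coordinates are assumed to be 1-based and inclusive. `start` may be
    larger than `end` when the two flanking synteny blocks overlap.
    Returns the extended (start, end), or None when the extended sequence
    is shorter than L_min, in which case the breakpoint is not refined.
    """
    ext_start = max(1, start - L)            # assumption: clamp to chromosome
    ext_end = min(chrom_length, end + L)     # assumption: clamp to chromosome
    if ext_end - ext_start + 1 < L_min:
        return None
    return ext_start, ext_end


# A 200 kbp region between two blocks is extended to 300 kbp.
print(extend_sequence(1_000_000, 1_200_000, 5_000_000))
# Blocks overlapping by 100 kbp: the extended sequence is too short.
print(extend_sequence(1_000_000, 900_000, 5_000_000))
```

With the default values, a breakpoint is skipped only when the flanking blocks overlap by roughly L or more, which matches the "considerable overlap" case described in the text.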
We recommend the use of chromosome sequences whose repeats have been masked. The alignment of masked sequences results in more relevant hits and, consequently, in better segmentation results. The main script controls the execution of a set of scripts that each perform an atomic task. This modular implementation meets the needs of advanced users who may wish to create their own breakpoint refinement pipelines.

Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (4676/08-4 to C.B.); Conselho Nacional de Desenvolvimento Científico e Tecnológico (472504/2007-0, 479207/2007-0 and 483177/2009-1 to Z.D., partial); French project ANR (MIRI BLAN08-1335497); French-UK project ANR-BBSRC (MetNet4SysBio ANR-07-BSYS 003 02); Project ERC Advanced Grant Sisyphe.

Conflict of Interest: none declared.

REFERENCES

Darling, A.C.E., Mau, B., Blattner, F.R. and Perna, N.T. (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res.

Lemaitre, C., Tannier, E., Gautier, C. and Sagot, M.-F. (2008) Precise detection of rearrangement breakpoints in mammalian chromosomes. BMC Bioinformatics.

Lemaitre, C., Zaghloul, L., Sagot, M.-F., Gautier, C., Arneodo, A., Tannier, E. and Audit, B. (2009) Analysis of fine-scale mammalian evolutionary breakpoints provides new insight into their relation to genome organisation. BMC Genomics.

Pevzner, P. and Tesler, G. (2003) Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res.
