Guruprasad Ananda, Francesca Chiaromonte, Kateryna D Makova.
Abstract
BACKGROUND: While the abundance of available sequenced genomes has led to many studies of regional heterogeneity in mutation rates, the co-variation among rates of different mutation types remains largely unexplored, hindering a deeper understanding of mutagenesis and genome dynamics. Here, utilizing primate and rodent genomic alignments, we apply two multivariate analysis techniques (principal components and canonical correlations) to investigate the structure of rate co-variation for four mutation types and simultaneously explore the associations with multiple genomic features at different genomic scales and phylogenetic distances.
Year: 2011 PMID: 21426544 PMCID: PMC3129677 DOI: 10.1186/gb-2011-12-3-r27
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Mutation rates investigated in the present study
| Type | Measurement | Alignment used |
|---|---|---|
| Insertion rate | Insertions/bp | Human-orangutan-macaque |
| Deletion rate | Deletions/bp | Human-orangutan-macaque |
| Nucleotide substitution rate | Substitutions/bp | Human-orangutan |
| Mononucleotide microsatellite mutability | Mutability/bp | Human-orangutan |
Mutation rates, which are used as input to PCA and as response set in CCA, are listed, along with the measurement unit and alignments used for their estimation.
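As a rough illustration of how the four rates could enter a PCA, the sketch below standardizes each rate across windows, forms the correlation matrix, and extracts the first principal component by power iteration. The per-window values are made-up toy numbers, not data from the study. The function names are hypothetical, and running PCA on standardized rates (a correlation matrix) is an assumption rather than a documented detail of the authors' Galaxy tools.

```python
import math

# Toy per-window values (made up for illustration, not data from the study).
rates = {
    "INS": [1.0, 2.1, 2.9, 4.2, 5.0, 6.1],
    "DEL": [1.2, 1.9, 3.1, 3.8, 5.2, 5.9],
    "SUB": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "MS":  [3.0, 1.0, 2.0, 6.0, 4.0, 5.0],
}


def standardize(column):
    """Center a column to mean 0 and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def correlation_matrix(columns):
    """Pairwise Pearson correlations between columns of equal length."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit-length vector v.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return v, eigenvalue


names = list(rates)
corr = correlation_matrix([rates[name] for name in names])
loadings, eigenvalue = first_component(corr)

for name, loading in zip(names, loadings):
    print(f"{name}: loading on PC1 = {loading:.3f}")
# The trace of a correlation matrix equals the number of variables.
print(f"PC1 explains {eigenvalue / len(names):.1%} of the standardized variance")
```

The loadings printed here play the role of the INS, DEL, SUB, and MS vectors in a biplot. A production analysis would use a full eigendecomposition to obtain every component, not only the first.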
Genomic features investigated in the present study
| Feature | Measurement (per Mb) | Source |
|---|---|---|
| GC content | Percentage of G and C bases | 'GC Percent' track from the UCSC Genome Browser |
| CpG islands | Count | 'CpG island' track from the UCSC Genome Browser |
| Non-CG methyl-cytosines | Count | [ |
| LINE | Count | 'RepeatMasker' track from the UCSC Genome Browser |
| SINE | Count | 'RepeatMasker' track from the UCSC Genome Browser |
| Nuclear lamina | Number of LaminB1 interaction sites with positive intensity | 'NKI LaminB1' track from the UCSC Genome Browser |
| Telomere | Distance in bp | 'Gap' track from the UCSC Genome Browser |
| Female recombination rate (1 Mb) | Centimorgan (cM) | 'Recomb rate' track from the UCSC Genome Browser |
| Male recombination rate (1 Mb) | Centimorgan (cM) | 'Recomb rate' track from the UCSC Genome Browser |
| Recombination rate (0.5 Mb and 0.1 Mb) | Centimorgan (cM) | [ |
| SNP | Count | 'SNPs 129' track from the UCSC Genome Browser |
| Replication timing | Time through S-phase | [ |
| Nucleosome-free regions | Coverage | [ |
| Coding exons | Coverage | 'UCSC Genes' track from the UCSC Genome Browser |
| Conserved elements | Coverage | '28-way most conserved' track from the UCSC Genome Browser |
Genomic features, used as predictors in CCA, are listed along with their measurement unit and source. LINE, long interspersed repetitive elements; SINE, short interspersed repetitive elements.
Figure 1. Biplots of the first two PCA components for our four mutation rates, as obtained from the AR and NCNR subgenomes along the human-orangutan branch for 1-Mb windows. Black dots represent projected observations (that is, projected windows). The vectors labeled INS, DEL, SUB, and MS depict loadings for insertion rate, deletion rate, substitution rate, and mononucleotide microsatellite mutability, respectively. See Tables S1 and S2 in Additional file 1 for summary statistics.
Figure 2. Genome-wide locations of windows driving non-linear signals in the data. (a-c) Black circles denote windows without marked non-linearity. Green and blue circles denote windows displaying mutation rate non-linearity in PCA (a) and CCA in the response space (b). Red circles denote windows displaying genomic feature non-linearity in CCA in the predictor space (c). Yellow triangles represent the location of the centromeres on each of the chromosomes.
Figure 3. Helioplots for CCA performed on the AR and NCNR subgenomes along the human-orangutan branch for 1-Mb windows. The labels on the plots are as follows: CV, canonical variate; GC, GC content; CpG, number of CpG islands; nCGm, number of non-CpG methyl-cytosines; LINE, number of LINE elements; SINE, number of SINE elements; NLp, number of nuclear lamina associated regions; telo, distance to the telomere; fRec and mRec, female and male recombination rates; SNPd, SNP density; RepT, replication time; nucFree, density of nucleosome-free regions; cExon, coverage by coding exons; mostCons, coverage by most conserved elements. Red bars indicate positive loadings, and blue bars negative loadings. See Table S6 in Additional file 1 for summary statistics.
Figure 4. Galaxy workflow developed for estimating mutation rates and computing principal components. A similar workflow (not shown) was implemented to compute canonical correlation component pairs. MAF, multiple alignment format.
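The first steps of such a workflow (partitioning the genome into windows, then aggregating events per window) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical and are not the Galaxy tools themselves, positions are 0-based, and the rate is divided by window length, whereas a real analysis would presumably divide by the number of aligned bases that survive filtering.

```python
def make_windows(chrom_length, window_size):
    """Tile [0, chrom_length) with half-open (start, end) windows.

    The last window is shorter when chrom_length is not a multiple
    of window_size.
    """
    return [
        (start, min(start + window_size, chrom_length))
        for start in range(0, chrom_length, window_size)
    ]


def events_per_bp(positions, chrom_length, window_size):
    """Count events (e.g. insertions) per window and divide by window length.

    Assumption: every base in the window counts toward the denominator.
    """
    windows = make_windows(chrom_length, window_size)
    counts = [0] * len(windows)
    for pos in positions:
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1
    return [
        (start, end, count / (end - start))
        for (start, end), count in zip(windows, counts)
    ]


# Hypothetical example: three events on a 2.5-Mb sequence, 1-Mb windows.
for start, end, rate in events_per_bp([10, 20, 1_500_000], 2_500_000, 1_000_000):
    print(f"{start}-{end}: {rate:.2e} events/bp")
```

Each window then contributes one observation (one row) to the matrix of rates passed on to PCA or CCA.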
'Regional variation', 'multiple regression' and 'multivariate analysis' toolsets in Galaxy
| Tool | Description |
|---|---|
| Data pre-processing tools | |
| Make windows | To partition genome into windows of a user-specified size |
| Feature coverage | To apportion various genomic features in genomic windows |
| Filter nucleotides | To identify and mask low-quality nucleotides from alignments based on a quality score cutoff specified by the user |
| Mask CpG/non-CpG sites | To identify and mask CpG/non-CpG-containing sites from alignments |
| Tools for identifying mutations and computing their rates | |
| Fetch Indels | To identify insertions and deletions from three-way alignments using a user-specified outgroup |
| Estimate indel rates | To estimate indel rates by aggregating insertions and deletions in genomic regions specified by the user |
| Fetch substitutions | To identify nucleotide substitutions from pair-wise alignments |
| Estimate substitution rates | To estimate substitution rate according to Jukes-Cantor JC69 model |
| Extract orthologous microsatellites | To fetch microsatellites using SPUTNIK, and detect orthologous repeats |
| Estimate microsatellite mutability | To estimate microsatellite mutability by grouping (and sub-grouping) repeats based on their size, unit and motif |
| Multiple regression tools | |
| Perform linear regression | To construct a linear regression model using the user-selected predictors and response variables |
| Perform best-subsets regression | To examine all of the linear regression models that can be created from all possible combinations of the predictor variables |
| Compute RCVE | To compute RCVE (relative contribution to variability explained) for all possible variable subsets |
| Multivariate analysis tools | |
| PCA | To perform PCA on a set of variables |
| CCA | To perform CCA on two sets of variables |
| Kernel PCA | To perform kernel PCA on a set of variables, using a user-specified kernel |
| Kernel CCA | To perform kernel CCA on two sets of variables, using a user-specified kernel |
RCVE, relative contribution to variability explained.
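For reference, the Jukes-Cantor (JC69) correction named in the table converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below applies that standard formula. The function name and the input counts are hypothetical, and whether the Galaxy tool computes the rate in exactly this way is an assumption.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor (JC69) estimate of substitutions per site.

    p is the observed proportion of aligned sites that differ.
    The correction is undefined for p >= 0.75 (saturation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical window: 10,000 differing sites among 1,000,000 aligned sites.
differences, aligned_sites = 10_000, 1_000_000
p = differences / aligned_sites
print(f"observed p = {p:.4f}, JC69 estimate = {jc69_distance(p):.6f} substitutions/bp")
```

For the small divergences typical of closely related species, the corrected value stays close to the raw proportion; the correction grows as p approaches saturation at 0.75.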