Mark Hills, Kieran O'Neill, Ester Falconer, Ryan Brinkman, Peter M Lansdorp.
Abstract
Strand-seq is a single-cell sequencing technique to finely map sister chromatid exchanges (SCEs) and other rearrangements. To analyze these data, we introduce BAIT, software that assigns templates and identifies and localizes SCEs. We demonstrate that BAIT can refine completed reference assemblies, identifying approximately 21 Mb of incorrectly oriented fragments and placing over half (2.6 Mb) of the orphan fragments in mm10/GRCm38. BAIT also stratifies scaffold-stage assemblies, potentially accelerating the assembly and finishing of reference genomes. BAIT is available at http://sourceforge.net/projects/bait/.
Year: 2013 PMID: 24028793 PMCID: PMC3971352 DOI: 10.1186/gm486
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Strand-seq involves sequencing of template strands only. Newly formed DNA strands containing BrdU (dashed lines) in parental cells (left panels) are removed in daughter cells after cell division, hence only the original template-strand DNA is sequenced (solid lines, right panels). One template is derived from the Watson (W) strand (shown in orange), and the other template is derived from the Crick (C) strand (shown in blue); centromeres are shown in green. (a) Identification of template strands by Strand-seq. Daughter cells inherit two template strands because there is a maternal (m) and paternal (p) copy of each chromosome (chromosome 1 shown). Chromatids segregate either with both Watson strands inherited into one daughter and both Crick strands in the other (top panel), or with one Watson and one Crick strand in each daughter cell (bottom panel). Sequence read density is plotted onto ideograms (gray bars) representing the template state of each chromosome; the template-strand 'dose' is inferred from W and C read counts (scale bar shown at bottom of ideograms). (b) Sister chromatid exchange (SCE) results in changes to templates on chromosomes. An SCE event (red outline) has reads aligning to different template strands on either side of it. These events are reciprocal between daughter cells, and will always be seen as a change from a WC state to either a CC or WW state. (c) Translocations and inversions are identified by Strand-seq. Translocations will align in the direction of the template strand of the chromosome to which they translocated, but still map to their original chromosome location. For example, for the Philadelphia translocation between chr9 and chr22, sequence reads from the translocated portion of chr22 will still map to chr22, but will have the template inheritance pattern of chr9 (chr9 fragments shown as solid boxes, chr22 fragments shown as open boxes).
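As a rough illustration of how a template-strand state might be inferred from W and C read counts, the sketch below converts the counts to a value between -1 and 1 and makes one of three calls. The function names, the sign convention (W positive), and the 0.6 cutoff are assumptions made for this example; they are not taken from BAIT.

```python
def strand_ratio(w_reads: int, c_reads: int) -> float:
    """Return a value in [-1, 1]: +1 means all Watson reads, -1 all Crick reads.

    The sign convention is an assumption for this sketch.
    """
    total = w_reads + c_reads
    if total == 0:
        raise ValueError("no reads in this region")
    return (w_reads - c_reads) / total


def call_template_state(w_reads: int, c_reads: int, cutoff: float = 0.6) -> str:
    """Call WW, WC, or CC from read counts.

    `cutoff` is a hypothetical threshold, chosen only for illustration.
    """
    ratio = strand_ratio(w_reads, c_reads)
    if ratio >= cutoff:
        return "WW"
    if ratio <= -cutoff:
        return "CC"
    return "WC"


if __name__ == "__main__":
    # A region with mostly Watson reads, one with mixed reads, one mostly Crick.
    for w, c in [(95, 5), (52, 48), (5, 95)]:
        print(w, c, call_template_state(w, c))
```

A real analysis would also need to account for background reads and uneven coverage, which this sketch ignores.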
Figure 2. Automated identification of sister chromatid exchange (SCE) from Strand-seq data. (a) Gross directional mapping data are thresholded to remove bins with unexpectedly high or low read numbers, and analyzed using DNAcopy. Inherited template numbers are converted to a value between 1 and -1 for DNAcopy to make only one of three calls: WW, WC, or CC. DNAcopy defines an interval across two bins, so with a bin size set to 200 kb, the SCE event will be located to within 400 kb. (b) Localization is then iterated by subdividing the identified region into bins one-fifth of the original size (80 kb on first iteration), and re-running DNAcopy. A single bin size is used as padding to aid detection of SCE events at bin boundaries. The iterations of re-running DNAcopy continue until fewer than 50 reads remain within the interval. (c) A second algorithm identifies the first read to map in a different direction (W read at chr13:19,203,283), then checks that the 10 preceding reads are all in the expected direction (10 C reads), and that at least 20% of succeeding reads are in the other direction. The interval is refined to a distance between two reads. Abbreviations: C, Crick; W, Watson.
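The read-level check described in panel (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not BAIT's implementation: the function name `refine_breakpoint`, the `(position, strand)` read representation, and the decision to skip candidates with no succeeding reads are all hypothetical. The 10-read and 20% values come from the caption.

```python
from typing import List, Optional, Tuple

Read = Tuple[int, str]  # (position, strand), strand is "W" or "C"


def refine_breakpoint(
    reads: List[Read],
    expected: str,
    n_preceding: int = 10,
    min_other_frac: float = 0.2,
) -> Optional[Tuple[int, int]]:
    """Return (last expected position, first switched position), or None.

    `reads` must be sorted by position; `expected` is the strand seen
    before the exchange.
    """
    for i, (pos, strand) in enumerate(reads):
        if strand == expected or i < n_preceding:
            continue
        # The preceding reads must all map in the expected direction.
        preceding = reads[i - n_preceding:i]
        if any(s != expected for _, s in preceding):
            continue
        # Enough of the succeeding reads must map in the other direction,
        # otherwise this read is treated as background.
        succeeding = reads[i + 1:]
        if not succeeding:
            continue
        other = sum(1 for _, s in succeeding if s != expected)
        if other / len(succeeding) >= min_other_frac:
            return (reads[i - 1][0], pos)
    return None


if __name__ == "__main__":
    # Toy data: 12 Crick reads, then a mixed W/C region (a CC-to-WC change).
    toy = [(100 * k, "C") for k in range(1, 13)]
    toy += [(1300, "W"), (1400, "C"), (1500, "W"), (1600, "C"), (1700, "W")]
    print(refine_breakpoint(toy, expected="C"))
```

On the toy data the interval is narrowed to the gap between the last C read before the switch and the first W read after it, which mirrors the caption's statement that the interval is refined to a distance between two reads.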
Figure 3. Clustering contigs into linkage groups for early-assembly genomes. Using template strand directionality as a unique signature, all contigs in the early mouse assembly MGSCv3 were compared with each other across all 62 Strand-seq libraries. All contigs with similar (>85%) template inheritance patterns were stratified into linkage groups (LGs). (a) Heat plots of all BAIT-called LGs show limited similarity between groups. Through analysis of homozygous template states only (WW and CC, left panel) 57,581 contigs cluster into 33 LGs, with the association between linkage groups appearing as yellow points if groups are in the same orientation, or blue points if the groups are in opposite orientations. The LGs are then reanalyzed after merging and reorientation of associated clusters, resulting in only 20 linkage groups consisting of 54,832 contigs. (b) Histogram of the number of fragments within a linkage group that map to a particular chromosome. The LG with the largest number of contigs is shown at the bottom in dark gray, with groups that contain the next largest numbers of contigs shown in progressively lighter grays. Most LGs contain contigs that belong to the same chromosome (see Additional file 4: Figure S3), and in general, most chromosomes are represented by one or two linkage groups. Note: contigs derived from sex chromosomes in male libraries can be distinguished as they are haploid, and are not computed as an initial heat plot. Any contigs derived from haploid chromosomes are separated and clustered independently. Almost all contigs clustered into this linkage group mapped to the X chromosome (right histogram). Abbreviations: C, Crick; W, Watson.
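One way to picture the stratification into linkage groups is a greedy clustering on homozygous template states, sketched below. The greedy strategy, the function names, and the use of the first contig in each group as its representative are assumptions for illustration; only the 85% similarity threshold and the restriction to WW and CC states come from the caption.

```python
from typing import Dict, List, Optional, Sequence, Tuple

HOMOZYGOUS = ("WW", "CC")


def homozygous_concordance(
    a: Sequence[Optional[str]], b: Sequence[Optional[str]]
) -> Optional[float]:
    """Fraction of libraries with identical states, using only libraries
    in which both contigs are WW or CC. Returns None if none are shared."""
    shared = [(x, y) for x, y in zip(a, b) if x in HOMOZYGOUS and y in HOMOZYGOUS]
    if not shared:
        return None
    return sum(1 for x, y in shared if x == y) / len(shared)


def cluster_contigs(
    states: Dict[str, Sequence[Optional[str]]], threshold: float = 0.85
) -> List[List[Tuple[str, str]]]:
    """Greedily assign contigs to linkage groups.

    Each member is (contig, orientation), where "-" means the contig matches
    the group's first contig only after flipping WW and CC.
    """
    groups: List[List[Tuple[str, str]]] = []
    for name, calls in states.items():
        for group in groups:
            representative = group[0][0]
            conc = homozygous_concordance(states[representative], calls)
            if conc is None:
                continue
            if conc > threshold:
                group.append((name, "+"))
                break
            if 1 - conc > threshold:
                group.append((name, "-"))
                break
        else:
            # No existing group matched, so start a new linkage group.
            groups.append([(name, "+")])
    return groups


if __name__ == "__main__":
    # Toy states across six libraries (not real data).
    toy = {
        "contig1": ["WW", "CC", "WW", "WC", "CC", "WW"],
        "contig2": ["WW", "CC", "WW", "WC", "CC", "WW"],
        "contig3": ["CC", "WW", "CC", "WC", "WW", "CC"],
        "contig4": ["WW", "WW", "CC", "CC", "WC", "WC"],
    }
    for group in cluster_contigs(toy):
        print(group)
```

In the toy data, `contig3` joins the first group in the opposite orientation, which corresponds to the blue points in the heat plot, while `contig4` starts its own group.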
Figure 4. Bioinformatic Analysis of Inherited Templates (BAIT) localizes unplaced scaffolds in late-version assemblies. Orphan scaffolds can be correctly oriented and localized relative to the rest of the genome by comparing template-strand inheritance. The orientation of an orphan scaffold is arbitrary, because it is not anchored to the rest of the genome, so it can be correctly oriented with respect to the chromosome on which it is located, or misoriented. (a) For a single library where the unplaced scaffold GL456239.1 is WW, BAIT maps its potential location (shown in red) to both WW genomic regions (correctly oriented), and CC genomic regions (misoriented). If only one library is analyzed, all locations map with 100% concordance. Note that a WW scaffold will not locate to a WC chromosome, so chr8, chr14, chr16, chr18, and chr19 are 0% concordant. (b) BAIT iterates over a second library where GL456239.1 is CC. The results of the two libraries combined reduce the number of potential mapping locations from 17 to only 3 that map with 100% concordance. Because chr8, chr14, and chr16 are WC in this library also, these chromosomes map with 0% concordance. (c) BAIT iterates over a third library where GL456239.1 is WC, and thus maps to all chromosomes that are WC. The result of the three combined libraries reduces the number of potential mapping locations to 2: the centromeric tips of chr1 and chr4. (d) The combined results after iteration of all 62 libraries refine the location of GL456239.1 to the first 10 Mb of chr1 in the reverse orientation (with a concordance of 91%). The fragment was further refined to an unbridged gap occupying the first 3 Mb of chr1. Abbreviations: C, Crick; chr, chromosome; W, Watson.
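The library-by-library narrowing described in panels (a) to (d) can be sketched as a concordance calculation over candidate regions. The code below is an illustrative sketch only: the function name `locate_scaffold`, the data layout, and the toy region states are assumptions, and the scoring rule (best of forward and reverse concordance) is one plausible reading of the caption rather than BAIT's documented method.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A misoriented scaffold shows the opposite homozygous state; WC is unchanged.
FLIP = {"WW": "CC", "CC": "WW", "WC": "WC"}


def locate_scaffold(
    scaffold_calls: Sequence[Optional[str]],
    region_calls: Dict[str, Sequence[Optional[str]]],
) -> List[Tuple[str, str, float, int]]:
    """Rank regions by percentage concordance with the scaffold.

    Returns (region, orientation, % concordance, informative libraries),
    best first. None marks a library with too few reads to call.
    """
    results = []
    for region, calls in region_calls.items():
        pairs = [
            (s, r) for s, r in zip(scaffold_calls, calls)
            if s is not None and r is not None
        ]
        if not pairs:
            continue
        n = len(pairs)
        forward = sum(1 for s, r in pairs if r == s) / n
        reverse = sum(1 for s, r in pairs if r == FLIP[s]) / n
        orientation, conc = ("+", forward) if forward >= reverse else ("-", reverse)
        results.append((region, orientation, 100 * conc, n))
    results.sort(key=lambda item: item[2], reverse=True)
    return results


if __name__ == "__main__":
    # The scaffold is WW, CC, and WC in three libraries, as in the caption.
    # The region states below are toy values, not real data.
    scaffold = ["WW", "CC", "WC"]
    regions = {
        "chr1:0-10": ["CC", "WW", "WC"],
        "chr4:0-10": ["CC", "WW", "WW"],
        "chr8": ["WC", "WC", "WW"],
    }
    for row in locate_scaffold(scaffold, regions):
        print(row)
```

With these toy states, the chr1 region is fully concordant in the reverse orientation, the chr4 region drops out once the third library is added, and the chr8 region never matches.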
Figure 5. Accuracy of automated sister chromatid exchange (SCE) detection by Bioinformatic Analysis of Inherited Templates (BAIT). (a) By comparing the number of SCE events identified by BAIT to those determined manually, we calculated the percentage of computational calls that were incorrect (false positives) or not detected (false negatives). Filtering the data by only including bins that deviated minimally from the mean changed the results, with highly conservative filtering increasing the level of false negatives, and very broad filtering increasing the level of false positives. (b) The frequency of (left) false positives and (right) false negatives with respect to library background. Cleaner, high-quality libraries with <1% of reads mapping incorrectly had a lower false-positive rate than libraries with medium background (<5% incorrectly mapped reads), and an even lower rate than libraries with high background (<10% incorrectly mapped reads). Error bars are ± standard deviation.
Figure 6. Validation of using Strand-seq to map unplaced scaffolds to built genomes. To confirm that Bioinformatic Analysis of Inherited Templates (BAIT) can successfully locate orphan scaffolds, the reads were aligned to MGSCv37/mm9, which has 202 orphan scaffolds, of which 60 can be mapped to a specific location in GRCm38/mm10. We used BAIT to locate these scaffolds in MGSCv37/mm9, and then cross-referenced these locations to the actual location in the GRCm38/mm10 assembly version. BAIT correctly located all regions in which there were more than 10 libraries to analyze, and where the percentage concordance was above 68%. Green points indicate correctly mapped fragments, and red points indicate incorrectly mapped fragments. Dashed lines show the minimum number of libraries and minimal concordance needed to make confident calls.
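The two dashed-line thresholds in Figure 6 amount to a simple filter, sketched below. The function name and the use of strict inequalities are assumptions based on the caption's wording ("more than 10 libraries", "above 68%"). The example values are taken from the table of unplaced scaffolds on GRCm38/mm10.

```python
MIN_LIBRARIES = 10       # calls need more than this many informative libraries
MIN_CONCORDANCE = 68.0   # calls need a percentage concordance above this value


def is_confident_call(concordance: float, n_libraries: int) -> bool:
    """Return True when both validation thresholds are exceeded."""
    return n_libraries > MIN_LIBRARIES and concordance > MIN_CONCORDANCE


if __name__ == "__main__":
    # (scaffold, % concordance, number of libraries)
    examples = [
        ("GL456239.1", 91.1, 56),
        ("JH584296.1", 80.0, 10),
        ("GL456389.1", 63.6, 33),
    ]
    for scaffold, conc, libs in examples:
        print(scaffold, is_confident_call(conc, libs))
```

Under this reading, a scaffold with exactly 10 libraries or a concordance at or below 68% would fall outside the region validated in Figure 6.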
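Locations of unplaced scaffolds on GRCm38/mm10
| Scaffold | Size (kb) | BAIT location (Mb) | Strand | % conc | Libraries | Gap (u) | Gap (b) | Primary gap location | Alternate gap location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |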
| GL456382.1 | 23.2 | chrX:0–57.6 | + | 100 | 18 | 14 | 26 | | |
| GL456379.1 | 72.4 | chrX:0–57.6 | - | 97.8 | 46 | 14 | 26 | | |
| GL456233.1 | 336.9 | chrX:0–57.6 | - | 96.1 | 51 | 14 | 26 | | |
| JH584299.1 | 953.0 | chr5:90.4-96.6 | + | 94.2 | 52 | 0 | 1 | chr5:94,088,336-94,138,335 | |
| GL456239.1 | 40.1 | chr1:0–12.6 | - | 91.1 | 56 | 3 | 1 | chr1:0–3,000,000 | |
| GL456367.1 | 42.1 | chrX:0–57.6 | - | 90.0 | 40 | 14 | 26 | | |
| GL456381.1 | 25.9 | chrX:0–57.6 | - | 90.0 | 50 | 14 | 26 | | |
| GL456393.1 | 55.7 | chr3:28.6-31.6 | + | 89.3 | 56 | 0 | 0 | chr3:40,550,618-40,650,617 | |
| GL456359.1 | 23.0 | chr4:136.4-156.2 | - | 88.7 | 53 | 1 | 19 | chr4:156,408,117-156,508,116 | chr4:130,393,226-130,516,309 |
| GL456354.1 | 196.0 | chr5:90.4-100.6 | - | 88.6 | 35 | 0 | 1 | chr5:94,088,336-94,138,335 | |
| GL456385.1 | 35.2 | chr13:0–6.8 | - | 87.3 | 55 | 3 | 0 | chr13:1–3,000,000 | |
| GL456360.1 | 31.7 | chr15:88.4-103.8 | + | 87.0 | 54 | 1 | 3 | chr15:103,943,686-104,043,685 | |
| GL456366.1 | 47.1 | chr15:62.4-103.8 | + | 85.5 | 55 | 1 | 3 | chr15:103,943,686-104,043,685 | |
| GL456216.1 | 66.7 | chr4:136.4-156.2 | + | 80.4 | 51 | 2 | 19 | chr4:156,408,117-156,508,116 | chr4:130,393,226-130,516,309 |
| JH584296.1 | 199.4 | chr5:83.6-113.2 | - | 80.0 | 10 | 1 | 1 | chr5:113,521,975-113,535,974 | |
| JH584297.1 | 205.8 | chr5:88.6-100.6 | - | 77.8 | 18 | 0 | 1 | chr5:94,088,336-94,138,335 | |
| GL456368.1 | 20.2 | chr4:129.2-156.2 | - | 76.2 | 42 | 2 | 19 | chr4:130,393,226-130,516,309 | chr4:156,408,117-156,508,116 |
| GL456221.1 | 207.0 | chr1:79.8-123.2 | + | 73.7 | 57 | 2 | 4 | chr1:85,347,104-85,447,103 | chr1:75,055,557-75,121,556 |
| GL456392.1 | 23.6 | chr2:0–8.2 | - | 73.5 | 34 | 4 | 0 | chr2:0–3,050,000 | |
| JH584292.1 | 14.9 | chr4:107.8-108.6 | + | 73.5 | 49 | 0 | 1 | chr4:99,842,111-99,876,234 | |
| GL456372.1 | 28.7 | chr1:127.2-146.2 | - | 69.2 | 52 | 1 | 1 | chr1:156,118,744-156,168,743 | |
| GL456389.1 | 28.8 | chrX:0–57.6 | - | 63.6 | 33 | 14 | 26 | | |
| GL456370.1 | 26.8 | chr4:67.6-68.8 | - | 62.0 | 50 | 0 | 3 | chr4:61,344,177-61,394,176 | |
a. Of the 44 orphan scaffolds, 23 had enough reads to determine their genomic location by calculating mapping concordance.
b. The scaffold accession numbers and BAIT-determined locations are given, together with the strand direction, which gives the relative orientation of the scaffolds with respect to the genome.
c. The percentage concordance (% conc) and the number of libraries with enough information to make a concordance call are also given.
d. Finally, BAIT cross-referenced these locations to unbridged and bridged gaps falling within the interval (gap (u) and gap (b), respectively).
e. Primary and alternate gap locations are given. To ensure that no regions were missed, gaps lying within 10 Mb of the determined interval were included.