Marcela Sandoval-Velasco, Juan Antonio Rodríguez, Cynthia Perez Estrada, Guojie Zhang, Erez Lieberman Aiden, Marc A Marti-Renom, M Thomas P Gilbert, Oliver Smith.
Abstract
BACKGROUND: Hi-C experiments couple DNA-DNA proximity with next-generation sequencing to yield an unbiased description of genome-wide interactions. Previous methods describing Hi-C experiments have focused on the industry-standard Illumina sequencing. With new next-generation sequencing platforms such as BGISEQ-500 becoming more widely available, protocol adaptations to fit platform-specific requirements are useful to give increased choice to researchers who routinely generate sequencing data.
Keywords: BGISEQ-500; Hi-C; chromosome conformation capture; next-generation sequencing
Year: 2020 PMID: 32845983 PMCID: PMC7448675 DOI: 10.1093/gigascience/giaa087
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
TADbit mapping and quality statistics of the Hi-C-BGI experimental results
| Sample | Index | Read | Initial reads | Digested sites (%), out of 100,000 reads | Reads with ligation site (%), out of 100,000 reads | Uniquely mapped pairs (% initial) | Total No. interactions; both reads mapped (% initial) |
|---|---|---|---|---|---|---|---|
| Oz13 | 10 | 1 | 29,436,352 | 83.6 | 31.6 | 18,670,087 (63.2%) | 18,269,316 (62.1%) |
| | | 2 | 29,436,352 | 80.4 | 29.8 | 18,127,333 (61.6%) | |
| Mz13 | 18 | 1 | 3,069,136 | 68.5 | 14.8 | 2,396,713 (78.1%) | 2,201,431 (71.7%) |
| | | 2 | 3,069,136 | 66.2 | 14.4 | 2,297,654 (74.9%) | |
| Mz17 | 19 | 1 | 2,274,286 | 44.8 | 7.8 | 1,711,896 (75.3%) | 1,554,764 (68.4%) |
| | | 2 | 2,274,286 | 42.5 | 7.3 | 1,673,397 (73.6%) | |
For each of the 3 experiments: reads lost after each applied filter, an approximate comparison with the values expected for a standard Illumina experiment, and the final number of valid pairs considered
| Filter type | Illumina expected % (for MboI) | Oz13 | % | Mz13 | % | Mz17 | % |
|---|---|---|---|---|---|---|---|
| Self-circle | <1 | 11,385 | 0.06 | 540 | 0.02 | 440 | 0.03 |
| Dangling end | 2.3–58.4 | 6,072,260 | 33.2 | 934,066 | 42.4 | 781,281 | 50.3 |
| Error | <1–3.4 | 8,941 | 0.05 | 586 | 0.02 | 587 | 0.03 |
| Extra dangling end | 9.5–70.1 | 5,275,752 | 28.9 | 744,126 | 33.8 | 517,306 | 33.3 |
| Too close from RES | 23–93.7 | 5,348,693 | 29.3 | 642,844 | 29.2 | 221,954 | 14.3 |
| Too short | 3.6–20 | 924,068 | 5 | 94,940 | 4.3 | 44,972 | 2.9 |
| Too large | <1–0.14 | 21 | 0.0001 | 1 | 0.000045 | 0 | 0 |
| Over-represented | 3.16–13.2 | 3,362,668 | 18.4 | 446,571 | 20.3 | 64,224 | 4.1 |
| Duplicated | 1.94–64.2 | 13,637,482 | 74.6 | 1,186,193 | 53.9 | 259,406 | 16.7 |
| Random breaks | <1–3.5 | 979,909 | 5.4 | 129,690 | 6.3 | 173,515 | 11.4 |
| Valid pairs | 7.5–98 | 4,043,904 | 22.1 | 848,181 | 38.5 | 1,117,826 | 71.7 |
Note that the number of reads does not necessarily add up to the total number of interactions because the same read can be categorized within >1 of the filter categories. Valid pairs represents the number of reads used for generating the Hi-C maps seen in Fig. 1. RES: restriction enzyme sites.
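As a quick consistency check on the filter table, the percentage columns appear to be each filter's read count divided by the sample's total number of interactions (both reads mapped) from the mapping table. This is an inference from the numbers, not something the legend states, and the helper name `percent_of_interactions` is ours. A minimal sketch:

```python
# Total interactions (both reads mapped), copied from the mapping table.
TOTAL_INTERACTIONS = {
    "Oz13": 18_269_316,
    "Mz13": 2_201_431,
    "Mz17": 1_554_764,
}


def percent_of_interactions(count, sample):
    """Return `count` as a percentage of the sample's total interactions.

    Assumption: this is the denominator behind the table's % columns.
    """
    return 100.0 * count / TOTAL_INTERACTIONS[sample]


# Counts copied from the filter table.
oz13_dangling = percent_of_interactions(6_072_260, "Oz13")
oz13_valid = percent_of_interactions(4_043_904, "Oz13")
mz13_valid = percent_of_interactions(848_181, "Mz13")

print(f"Oz13 dangling end: {oz13_dangling:.1f}%")
print(f"Oz13 valid pairs:  {oz13_valid:.1f}%")
print(f"Mz13 valid pairs:  {mz13_valid:.1f}%")
```

Because a read pair can fall into more than one filter category, these percentages are not expected to sum to 100.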
Figure 1: Comparison of values obtained for 15 parameters evaluated in our 3 samples within a context of 316 in situ Hi-C samples processed with the same restriction enzyme (RE). From the left, the first 4 parameters are intrinsic quality-control values for the experimental processing before mapping, where “dangling ends r1,2” refers, for each read, to the number of reads that have been digested but have not been mapped, and “ligated” is the number of sites that have been re-ligated (contacting different fragments). The remaining 11 parameters are as follows:

- Self-circle: both read ends are mapped to the same RE fragment in opposed orientation.
- Dangling-ends: both read ends are mapped to the same RE fragment in facing orientation.
- Error: both read ends are mapped to the same RE fragment in the same orientation.
- Extra dangling end: the read ends are mapped to different RE fragments in facing orientation but are close enough (less than max_molecule_length bp) to the RE cut-site to be considered part of adjacent RE fragments that were not separated by digestion. The max_molecule_length parameter can be inferred from the fragment_size function previously detailed.
- Too close from RE sites (RES): the start position of 1 of the read ends is too close (5 bp by default) to the RE cutting site.
- Too short: 1 of the read ends is mapped to an RE fragment of <75 bp. These pairs are removed because the mapping is ambiguous: the read end could also belong to either of the 2 neighbouring RE fragments.
- Too large: the read ends are mapped to long RE fragments (default: 100 kb, P < 10⁻⁵ to occur in a randomized genome), which likely represent poorly assembled or repetitive regions.
- Over-represented: the read ends come from the top 0.5% most frequently detected RE fragments; they may represent PCR artefacts, random breaks, or genome assembly errors.
- PCR artefacts or duplicated: the combination of the start positions, mapped lengths, and strands of both read ends is identical. In this case, only 1 copy is kept.
- Random breaks: the start position of 1 read end is too far (more than minimum_distance_to_RE) from the RE cut-site. These are most probably produced by non-canonical enzyme activity or by random physical breakage of the chromatin.
- Valid pairs: the pairs of contacting reads that were kept as valid contacts after removing the other 10 categories of read pairs.

Additional details can be found in the filtering function of the TADbit method: https://3dgenomes.github.io/TADbit/tutorial/tutorial_6-Filtering_mapped_reads.html.

In the plot, the lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles) and the line in the middle is the median value. The upper whisker extends from the top hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the bottom hinge to the smallest value no further than 1.5 * IQR from the hinge. Data beyond the end of the whiskers are called "outlying" points and are plotted individually.
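The orientation-based categories in the Figure 1 legend (self-circle, dangling-end, error, extra dangling end) can be illustrated with a small classifier. This is a simplified sketch and not TADbit's implementation: the `ReadEnd` type, the `classify_pair` function, the default of 500 bp for `max_molecule_length`, and the use of the distance between the two read starts as the proximity test are all assumptions made for illustration. It also assumes both ends map to the same chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEnd:
    fragment: int   # index of the RE fragment the read end maps to
    pos: int        # mapped start coordinate (bp) on the chromosome
    forward: bool   # True if mapped to the + strand


def classify_pair(a, b, max_molecule_length=500):
    """Assign one read pair to an orientation-based category.

    Hypothetical helper; the 500 bp default is an assumed value.
    """
    # Order the two ends by coordinate so orientation is well defined.
    left, right = sorted((a, b), key=lambda r: r.pos)
    # "Facing": the left end points right and the right end points left.
    facing = left.forward and not right.forward

    if left.fragment == right.fragment:
        if left.forward == right.forward:
            return "error"              # same fragment, same orientation
        # Same fragment, opposite strands: facing or pointing outwards.
        return "dangling-end" if facing else "self-circle"

    # Different fragments, facing, and close together (simplified test).
    if facing and (right.pos - left.pos) < max_molecule_length:
        return "extra dangling-end"
    return "candidate valid pair"


dangling = classify_pair(ReadEnd(7, 1000, True), ReadEnd(7, 1200, False))
circle = classify_pair(ReadEnd(7, 1000, False), ReadEnd(7, 1200, True))
error = classify_pair(ReadEnd(7, 1000, True), ReadEnd(7, 1200, True))
extra = classify_pair(ReadEnd(7, 1000, True), ReadEnd(8, 1300, False))
distant = classify_pair(ReadEnd(7, 1000, True), ReadEnd(90, 250_000, False))
print(dangling, circle, error, extra, distant, sep="\n")
```

A pair labelled "candidate valid pair" here would still have to pass the remaining filters (too close from RES, too short, too large, over-represented, duplicated, random breaks) before being counted as valid.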
Figure 2: Subsampling of unique reads as a function of total mapped reads. Each point within the dotted lines marks a 5% increase in the total number of reads (X axis) for each sample. The Y axis shows the proportion of unique mapped reads at each 5% increment.
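A curve of the kind shown in Figure 2 could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: the function name `unique_fraction_curve` is hypothetical, reads are shuffled before subsampling, each read pair is represented by any hashable key (for example its start positions and strands), and the proportion is taken cumulatively over all reads sampled so far.

```python
import random


def unique_fraction_curve(pair_keys, n_steps=20, seed=0):
    """Return (reads_sampled, unique_fraction) at each 1/n_steps increment.

    With n_steps=20, each point corresponds to a 5% increase in reads.
    """
    shuffled = list(pair_keys)
    random.Random(seed).shuffle(shuffled)  # subsample in random order
    total = len(shuffled)

    seen = set()
    done = 0
    curve = []
    for step in range(1, n_steps + 1):
        upto = total * step // n_steps     # integer cut-off, no float drift
        seen.update(shuffled[done:upto])   # add only the new increment
        done = upto
        if upto:                           # skip empty subsamples
            curve.append((upto, len(seen) / upto))
    return curve


# Toy inputs: a library with no duplicates and one made of a single duplicate.
all_unique = unique_fraction_curve(range(100))
all_duplicated = unique_fraction_curve(["same"] * 100)
print(all_unique[-1], all_duplicated[0], all_duplicated[-1])
```

A library with many duplicates yields a curve that drops as more reads are sampled, whereas a library far from saturation stays close to 1.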
Figure 3: (A) Hi-C Vanilla-normalized contact matrix representation of the genome of sample Oz13. Note the presence of the rearrangement/translocations in chromosome NC_011465.1, as mentioned in the text. (B) Close-up detail of chromosome NC_011465.1, Vanilla normalized at a 500-kb resolution. Blue and green arrows mark the intrachromosomal rearrangements relative to the reference genome.
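For readers unfamiliar with the term, "Vanilla" normalization generally refers to vanilla coverage normalization, in which each contact count is divided by the product of its row sum and its column sum. The sketch below shows that general idea on a toy matrix in plain Python. The function name `vanilla_normalize` is ours, and the exact scaling used by TADbit for the maps in Figure 3 may differ.

```python
def vanilla_normalize(matrix):
    """Divide each entry by (row sum * column sum); empty bins give 0.0."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]

    normalized = []
    for i in range(n_rows):
        out_row = []
        for j in range(n_cols):
            denom = row_sums[i] * col_sums[j]
            # Bins with no coverage cannot be normalized; leave them at 0.
            out_row.append(matrix[i][j] / denom if denom else 0.0)
        normalized.append(out_row)
    return normalized


# Toy symmetric contact matrix with two bins.
raw = [[2, 2],
       [2, 6]]
norm = vanilla_normalize(raw)
for row in norm:
    print(row)
```

The second bin has twice the coverage of the first, so its contacts are down-weighted accordingly, which is the bias this normalization is meant to remove.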