3-dimensional genome: chromatin conformation analysis. Part I.
Based on materials by: Anna Kononkova,
(IITP RAS, JS in Khrameeva lab, Skoltech)
Hi-C: whole-genome chromosome conformation capture
Hi-C map: general approach
Paired reads (R1 and R2 share the same read id)
Fragment length depends on the restriction enzyme; the mean is about 500 bp for DpnII and about 4000 bp for HindIII.
Walks policy
.pairs format
Walks policy
from MultiQC report for Hi-C
Multiple ligations (walks)
option --walks-policy all (for pairtools)
Dummies example: how is a Hi-C map generated from pairs?
pairs
Read id | chr1 | pos1 | chr2 | pos2 |
id1 | chr1 | 10000 | chr1 | 10900 |
id2 | chr1 | 10000 | chr1 | 12000 |
id12 | chr1 | 10200 | chr1 | 12500 |
id4 | chr1 | 10250 | chr1 | 15500 |
id16 | chr1 | 13000 | chr1 | 17000 |
id6 | chr1 | 14100 | chr1 | 21300 |
id7 | chr1 | 17000 | chr1 | 17200 |
id3 | chr1 | 18000 | chr1 | 19460 |
id10 | chr1 | 18500 | chr1 | 18700 |
id9 | chr1 | 18800 | chr1 | 22800 |
id17 | chr1 | 21000 | chr1 | 24999 |
id8 | chr1 | 22000 | chr1 | 24999 |
bins
bin | chr | start | end |
0 | chr1 | 0 | 5000 |
1 | chr1 | 5000 | 10000 |
2 | chr1 | 10000 | 15000 |
3 | chr1 | 15000 | 20000 |
4 | chr1 | 20000 | 25000 |
a piece of 5kb matrix with raw number of contacts (rows and columns are bins 2, 3 and 4; the matrix is symmetric)
bin | 2 | 3 | 4 |
2 | 3 | 2 | 1 |
3 | 2 | 3 | 1 |
4 | 1 | 1 | 2 |
heatmap (colour scale from 1 to 3 contacts)
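The binning step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the way pairtools or cooler do it internally: the names `PAIRS`, `bin_pairs` and `to_dense` are hypothetical, and the sketch assumes that all pairs lie on one chromosome, so the chromosome columns are ignored.

```python
from collections import Counter

BIN_SIZE = 5000  # 5 kb bins, as in the bins table

# (read id, chr1, pos1, chr2, pos2), copied from the pairs table of the example
PAIRS = [
    ("id1", "chr1", 10000, "chr1", 10900),
    ("id2", "chr1", 10000, "chr1", 12000),
    ("id12", "chr1", 10200, "chr1", 12500),
    ("id4", "chr1", 10250, "chr1", 15500),
    ("id16", "chr1", 13000, "chr1", 17000),
    ("id6", "chr1", 14100, "chr1", 21300),
    ("id7", "chr1", 17000, "chr1", 17200),
    ("id3", "chr1", 18000, "chr1", 19460),
    ("id10", "chr1", 18500, "chr1", 18700),
    ("id9", "chr1", 18800, "chr1", 22800),
    ("id17", "chr1", 21000, "chr1", 24999),
    ("id8", "chr1", 22000, "chr1", 24999),
]


def bin_pairs(pairs, bin_size=BIN_SIZE):
    """Count contacts per (bin1, bin2) with bin1 <= bin2 (upper triangle)."""
    counts = Counter()
    for _read_id, _chrom1, pos1, _chrom2, pos2 in pairs:
        # integer division maps a coordinate to its bin number
        b1, b2 = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(b1, b2)] += 1
    return counts


def to_dense(counts, first_bin, last_bin):
    """Turn the sparse counts into a symmetric dense matrix for a bin range."""
    n = last_bin - first_bin + 1
    matrix = [[0] * n for _ in range(n)]
    for (b1, b2), count in counts.items():
        i, j = b1 - first_bin, b2 - first_bin
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix


counts = bin_pairs(PAIRS)
matrix = to_dense(counts, first_bin=2, last_bin=4)
for row in matrix:
    print(row)
# [3, 2, 1]
# [2, 3, 1]
# [1, 1, 2]
```

Storing only the upper triangle in `counts` mirrors how sparse Hi-C formats avoid keeping each symmetric contact twice.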
P(s) curve
Distance dependence of Hi-C counts: a prominent decay of contact frequency with genomic distance is an inherent feature of Hi-C data.
Valid pairs and artefacts
How to deal with artefacts?
Most Hi-C tools provide a MultiQC report.
It helps to judge data quality and to decide how many diagonals of the map should be dropped.
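Dropping diagonals simply means ignoring the bin pairs closest to the main diagonal, where short-range artefacts tend to accumulate. A minimal sketch on a toy dense matrix follows; the function name `drop_diagonals` and the matrix values are illustrative assumptions, and real tools usually mask these entries rather than overwrite them with zeros.

```python
def drop_diagonals(matrix, n_diags):
    """Return a copy with the first n_diags diagonals (|i - j| < n_diags) set to 0."""
    return [
        [0 if abs(i - j) < n_diags else value for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# toy 3x3 raw contact matrix (hypothetical values)
raw = [
    [3, 2, 1],
    [2, 3, 1],
    [1, 1, 2],
]

masked = drop_diagonals(raw, n_diags=1)  # removes only the main diagonal
for row in masked:
    print(row)
# [0, 2, 1]
# [2, 0, 1]
# [1, 1, 0]
```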
Hi-C map resolution
The choice of resolution depends on:
human, mouse – long genome, long genes
Drosophila – shorter genome, shorter genes
Organism | Compartments | TADs | Loops, point interactions |
Human, mouse | 100 kb | 40 kb | 10 kb |
Drosophila | 20 kb | 5 kb | 2-5 kb |
The highest resolution is required for SNP annotation and for studying promoter-enhancer interactions.
Accounting for experimental bias
Explicit method (Yaffe and Tanay): assumes that the systematic biases are known (three systematic biases); statistically effective but computationally intensive.
Implicit method: assumes that the systematic biases are unknown and that all of the bias is captured by the sequencing coverage of each bin (equal visibility); much faster. It relies on the Sinkhorn–Knopp matrix-balancing algorithm.
Iterative correction* is a widely used application of the implicit method
*Imakaev et al, doi: 10.1038/nmeth.2148
Iterative correction (matrix balancing)
Row step: $X_{ij} \leftarrow X_{ij} / \sum_{k} X_{ik}$
Column step: $X_{ij} \leftarrow X_{ij} / \sum_{k} X_{kj}$
…
During iterative correction, the elements of the first row are divided by the total sum of that row, and the elements of the first column are divided by the total sum of that column. The procedure is then repeated consecutively for all rows and columns, from the first to the last.
1 round of iterative correction
Repeating this procedure yields a balanced matrix in which the total sum of each row (and of each column) is equal to 1, or nearly so.
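The row and column steps can be sketched as a plain Sinkhorn–Knopp loop. This is one reading of the description (all rows first, then all columns, repeated until the row sums settle) and not the exact procedure of Imakaev et al. or of `cooler balance`, which as far as I know work with a per-bin bias vector and ignore the first diagonals. The function name `balance`, its parameters and the toy matrix are assumptions made for illustration.

```python
def balance(matrix, n_rounds=100, tol=1e-9):
    """Alternate row and column normalisation until every row sums to ~1."""
    x = [[float(value) for value in row] for row in matrix]
    n = len(x)
    for _ in range(n_rounds):
        # row step: divide each row by its sum
        for i in range(n):
            row_sum = sum(x[i])
            if row_sum > 0:
                x[i] = [value / row_sum for value in x[i]]
        # column step: divide each column by its sum
        for j in range(n):
            col_sum = sum(x[i][j] for i in range(n))
            if col_sum > 0:
                for i in range(n):
                    x[i][j] /= col_sum
        # columns now sum to 1; stop once the non-empty rows do as well
        if all(sum(row) == 0 or abs(sum(row) - 1.0) < tol for row in x):
            break
    return x


# toy 3x3 raw contact matrix (hypothetical values)
raw = [
    [3, 2, 1],
    [2, 3, 1],
    [1, 1, 2],
]

balanced = balance(raw)
for row in balanced:
    print([round(value, 3) for value in row], "row sum:", round(sum(row), 6))
```

Empty rows and columns are left as zeros here, which is why the stopping test skips them.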
Alternative bias correction
Coverage based normalization – covNorm
https://doi.org/10.1016/j.csbj.2021.05.041
The data are approximated with a negative binomial distribution, and the expected ligation frequency is calculated.
The expected-normalized data are ready for further analysis.
Primary analysis
P(s) curves:
The P(s) curve provides a rough estimate of data quality and of the similarity between replicates.
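For a dense matrix, a simple P(s) estimate is the mean contact value on each diagonal, with s equal to the diagonal offset times the bin size. The sketch below makes several assumptions: the function name `scaling` is hypothetical, the main diagonal is skipped, no log-spaced binning of distances is applied, and the toy matrix is far too small to show a realistic curve.

```python
def scaling(matrix, bin_size):
    """Return [(distance_bp, mean contacts)] for every diagonal except the main one."""
    n = len(matrix)
    curve = []
    for offset in range(1, n):
        diagonal = [matrix[i][i + offset] for i in range(n - offset)]
        curve.append((offset * bin_size, sum(diagonal) / len(diagonal)))
    return curve


# toy 3x3 contact matrix at 5 kb resolution (hypothetical values)
toy = [
    [3, 2, 1],
    [2, 3, 1],
    [1, 1, 2],
]

for distance, mean_contacts in scaling(toy, bin_size=5000):
    print(distance, mean_contacts)
# 5000 1.5
# 10000 1.0
```

Plotting both columns on logarithmic axes gives the scaling plot; curves from good replicates are expected to nearly overlap.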
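Comparison of replicates
Hi-C data need a dedicated approach: the stratum-adjusted correlation coefficient (SCC), where a stratum is the set of bin pairs at the same genomic distance (one diagonal of the map).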
HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research, 2017
Look for the Python implementation (not R!): https://github.com/dejunlin/hicrep
The SCC statistic is calculated by computing a Pearson correlation coefficient for each stratum and then aggregating the stratum-specific correlation coefficients using a weighted average, with the weights derived from the generalized Cochran–Mantel–Haenszel (CMH) statistic.
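The idea of a per-stratum Pearson correlation combined by a weighted average can be sketched as follows. This is a simplified illustration, not the HiCRep implementation: it omits the smoothing and variance-stabilising steps, and the weights used here (number of elements in the stratum times the square root of the product of the two variances) follow my reading of the HiCRep paper and should be treated as an assumption. The function name `scc` and both toy matrices are hypothetical.

```python
from math import sqrt


def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    mean = _mean(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def _covariance(x, y):
    mean_x, mean_y = _mean(x), _mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def scc(m1, m2):
    """Weighted average of per-diagonal Pearson correlations of two square matrices."""
    n = len(m1)
    numerator = denominator = 0.0
    for offset in range(n):
        x = [m1[i][i + offset] for i in range(n - offset)]
        y = [m2[i][i + offset] for i in range(n - offset)]
        if len(x) < 2:
            continue  # a correlation needs at least two points
        var_x, var_y = _variance(x), _variance(y)
        if var_x == 0 or var_y == 0:
            continue  # correlation is undefined for a constant stratum
        weight = len(x) * sqrt(var_x * var_y)
        pearson = _covariance(x, y) / sqrt(var_x * var_y)
        numerator += weight * pearson
        denominator += weight
    if denominator == 0:
        raise ValueError("no stratum with non-zero variance in both matrices")
    return numerator / denominator


# two toy 4x4 contact maps (hypothetical values)
map_a = [[9, 5, 2, 1], [5, 8, 4, 2], [2, 4, 7, 3], [1, 2, 3, 9]]
map_b = [[8, 4, 3, 1], [4, 9, 5, 1], [3, 5, 6, 2], [1, 1, 2, 9]]

print(round(scc(map_a, map_a), 6))  # a map compared with itself
print(round(scc(map_a, map_b), 3))
```

A distance such as 1 - SCC between every pair of samples can then be used as input for hierarchical clustering.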
Comparison of data with different quality
Hi-C maps with dramatically different coverage should not be compared.
Hi-C maps of roughly similar depth (number of contacts) should be downsampled to the same total number of contacts before comparison:
cooltools random-sample -c …..
https://cooltools.readthedocs.io/en/latest/
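What downsampling does can be illustrated on a toy sparse map: draw a fixed number of individual contacts at random, without replacement, and count them again per bin pair. This sketch is not the cooltools implementation; the function name `downsample`, the pixel dictionary and the fixed seed are assumptions, and expanding every contact into a list is only workable for toy data.

```python
import random


def downsample(pixels, target, seed=0):
    """Sample `target` contacts without replacement from {(bin1, bin2): count}."""
    total = sum(pixels.values())
    if target > total:
        raise ValueError("target exceeds the total number of contacts")
    rng = random.Random(seed)
    # one list entry per individual contact (toy-sized data only)
    contacts = [key for key, count in pixels.items() for _ in range(count)]
    sampled = {}
    for key in rng.sample(contacts, target):
        sampled[key] = sampled.get(key, 0) + 1
    return sampled


# toy upper-triangle pixels with 12 contacts in total (hypothetical values)
pixels = {(2, 2): 3, (2, 3): 2, (2, 4): 1, (3, 3): 3, (3, 4): 1, (4, 4): 2}

smaller = downsample(pixels, target=6)
print(sum(smaller.values()))  # 6
```

Sampling without replacement guarantees that no bin pair ends up with more contacts than it had originally.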
Forcato et al. Nature Methods 2017
Methods for Hi-C map generation from raw reads
Nf-core pipelines | |
Distiller-nf, our 'workhorse' | nf-core/hic |
Based on: bwa-mem, pairtools, cooler | Based on: HiC-Pro and cooler |
old, but useful from time to time
Distiller-nf: a modular Hi-C mapping pipeline for reproducible data analysis.
The distiller pipeline aims to provide the following functionality:
https://pairtools.readthedocs.io/en/latest/
Distiller-nf output
The cool and mcool file formats are becoming more and more popular thanks to the wide range of utilities developed for them.
Cool/mcool file formats
Distiller-nf, HiC-Pro: .cool or .mcool format:
The cooler and cooltools packages (Python) are popular tools for working with cool and mcool files.
trans-contacts, unbalanced data
cis-contacts, balanced data
Cool/mcool file formats
27
raw interactions
information about bin weights
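The relation between the raw interactions and the bin weights can be sketched directly: to my understanding, cooler obtains a balanced value by multiplying the raw count by the weights of both bins. The pixel list and the weights below are made-up numbers, not values from a real file.

```python
# (bin1_id, bin2_id, raw count): hypothetical upper-triangle pixels
pixels = [(0, 0, 3), (0, 1, 2), (0, 2, 1), (1, 1, 3), (1, 2, 1), (2, 2, 2)]

# one balancing weight per bin (hypothetical values)
weights = [0.5, 0.5, 1.0]

# balanced value = raw count * weight of bin1 * weight of bin2
balanced = [
    (bin1, bin2, count * weights[bin1] * weights[bin2])
    for bin1, bin2, count in pixels
]

for bin1, bin2, value in balanced:
    print(bin1, bin2, value)
# 0 0 0.75
# 0 1 0.5
# 0 2 0.5
# 1 1 0.75
# 1 2 0.5
# 2 2 2.0
```

Keeping raw counts and weights separately means the balanced matrix can always be recomputed, and the raw data are never lost.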
Other formats:
Juicer output: the binary .hic format, intended for Juicer tools.
hiclib output: the .hdf format, which can be opened with the h5py Python library.
hiclib is an old and almost deprecated tool, but some bioinformaticians still find it more accurate (more precise artefact filtering).
Conversion of Hi-C matrices of different file formats
From: h5, cool, hic, homer, hicpro, 2D-text
To: cool, h5, homer, ginteractions, mcool, hicpro
One can also create a cool file from a matrix in txt format:
1) create bins
2) create pixels
3) create the cooler with cooler.create_cooler
(see slide 27, Cool/mcool file formats)
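These three steps could look roughly like the sketch below. The helpers `make_bins`, `make_pixels` and `write_cooler` are hypothetical names; only `cooler.create_cooler` comes from the cooler package, and the sketch assumes it accepts a bins table with columns chrom, start, end and a pixels table with columns bin1_id, bin2_id, count. The dense toy matrix stands in for a matrix already parsed from a txt file.

```python
def make_bins(chrom, chrom_length, bin_size):
    """Step 1: fixed-size bins covering one chromosome."""
    return [
        {"chrom": chrom, "start": start, "end": min(start + bin_size, chrom_length)}
        for start in range(0, chrom_length, bin_size)
    ]


def make_pixels(matrix):
    """Step 2: non-zero upper-triangle entries, ordered by (bin1_id, bin2_id)."""
    n = len(matrix)
    return [
        {"bin1_id": i, "bin2_id": j, "count": int(matrix[i][j])}
        for i in range(n)
        for j in range(i, n)
        if matrix[i][j]
    ]


def write_cooler(path, bins, pixels):
    """Step 3: write the file; needs pandas and cooler, imported only when called."""
    import cooler
    import pandas as pd

    cooler.create_cooler(path, pd.DataFrame(bins), pd.DataFrame(pixels))


# toy 5x5 matrix at 5 kb resolution for a 25 kb chromosome (hypothetical values)
dense = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2],
]

bins = make_bins("chr1", chrom_length=25000, bin_size=5000)
pixels = make_pixels(dense)
print(len(bins), len(pixels))  # 5 6
# write_cooler("toy.cool", bins, pixels)  # uncomment when cooler is installed
```

Only the upper triangle is stored because the matrix is symmetric.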
Useful links
Practical part
https://drive.google.com/drive/folders/1uIAsVJF-JTjkHK7VPAP8FcocDrNkoIKs?usp=sharing
Goals:
We have replicates for two Drosophila cell lines: Bg3 and Kc167.
Bg3: a nervous-system cell line.
Kc167: an embryonic cell line.
The aim is to cluster the replicates using SCC and to demonstrate that replicates are closer to each other than to samples of the other cell type.
Homework report:
1) scaling plot in log-log coordinates with description;
2) clustering of replicates for all files (in the directory for the lecture) with description (plot a dendrogram).