1 of 32

3-dimensional genome: chromatin conformation analysis. Part I.

Based on materials by: Anna Kononkova (IITP RAS, JS in Khrameeva lab, Skoltech)

2 of 32

Hi-C: whole-genome chromosome conformation capture

output

3 of 32

Hi-C map: general approach

Paired reads

(R1 and R2 of a pair share the same read id)

Fragment length depends on the restriction enzyme; the mean is about 500 bp for DpnII and about 4000 bp for HindIII.

4 of 32

Walks policy


.pairs format

5 of 32

Walks policy


from MultiQC report for Hi-C

6 of 32

Multiple ligations (walks)


option --walks-policy all (for pairtools)

From https://pairtools.readthedocs.io/en/latest/

7 of 32

Read id	chr1	pos1	chr2	pos2
id1	chr1	10000	chr1	10900
id2	chr1	10000	chr1	12000
id12	chr1	10200	chr1	12500
id4	chr1	10250	chr1	15500
id16	chr1	13000	chr1	17000
id6	chr1	14100	chr1	21300
id7	chr1	17000	chr1	17200
id3	chr1	18000	chr1	19460
id10	chr1	18500	chr1	18700
id9	chr1	18800	chr1	22800
id17	chr1	21000	chr1	24999
id8	chr1	22000	chr1	24999

[Figure: a piece of the 5 kb matrix with raw contact counts; rows and columns are bins 2, 3 and 4]

bins

bin	chr	start	end
0	chr1	0	5000
1	chr1	5000	10000
2	chr1	10000	15000
3	chr1	15000	20000
4	chr1	20000	25000

An example for dummies: how is a Hi-C map generated from pairs?
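The binning step on this slide can be sketched in a few lines of plain Python. This is only an illustration using the read pairs and the 5 kb bins from the tables above; the names `bin_index`, `pixels` and `matrix` are chosen for this sketch, and it assumes that all pairs lie on a single chromosome, so the bin index is simply the position divided by the bin size.

```python
from collections import Counter

BIN_SIZE = 5000  # 5 kb bins, as in the bins table

# Read pairs from the table: (read id, chr1, pos1, chr2, pos2)
pairs = [
    ("id1", "chr1", 10000, "chr1", 10900),
    ("id2", "chr1", 10000, "chr1", 12000),
    ("id12", "chr1", 10200, "chr1", 12500),
    ("id4", "chr1", 10250, "chr1", 15500),
    ("id16", "chr1", 13000, "chr1", 17000),
    ("id6", "chr1", 14100, "chr1", 21300),
    ("id7", "chr1", 17000, "chr1", 17200),
    ("id3", "chr1", 18000, "chr1", 19460),
    ("id10", "chr1", 18500, "chr1", 18700),
    ("id9", "chr1", 18800, "chr1", 22800),
    ("id17", "chr1", 21000, "chr1", 24999),
    ("id8", "chr1", 22000, "chr1", 24999),
]


def bin_index(pos, bin_size=BIN_SIZE):
    """Return the index of the bin [start, end) that contains pos."""
    return pos // bin_size


# Count contacts per (bin1, bin2) with bin1 <= bin2 (upper triangle only).
pixels = Counter()
for _read_id, _chrom1, pos1, _chrom2, pos2 in pairs:
    b1, b2 = sorted((bin_index(pos1), bin_index(pos2)))
    pixels[(b1, b2)] += 1

# Expand the upper-triangle counts into a symmetric matrix for bins 2, 3, 4.
bins = [2, 3, 4]
matrix = [[pixels[tuple(sorted((i, j)))] for j in bins] for i in bins]

for row in matrix:
    print(row)
# prints:
# [3, 2, 1]
# [2, 3, 1]
# [1, 1, 2]
```

Only the upper triangle is counted because a Hi-C map is symmetric; the dense matrix is rebuilt from it just for display.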

8 of 32

An example for dummies: how is a Hi-C map generated from pairs?

[Figure: the same read-pair table and 5 kb bins table as on the previous slide, with the raw contact counts (values 1 to 3) filled into the matrix for bins 2, 3 and 4 and shown as a heatmap]

9 of 32

P(s) curve

Distance dependence of Hi-C counts: a prominent decay of contact frequency with genomic distance is an inherent feature of Hi-C data.

10 of 32

Valid pairs and artefacts


From https://doi.org/10.1371/journal.pcbi.1008839

11 of 32

How to deal with artefacts?

Most Hi-C tools provide the user with a MultiQC report.

This report helps to judge data quality and to decide how many diagonals of the matrix should be dropped.

12 of 32

Hi-C map resolution

The choice of resolution depends on:

Genome:

human, mouse – large genome, long genes

Drosophila – smaller genome, shorter genes

Sequencing depth
Restriction enzyme used to cut DNA in the Hi-C protocol
General aim:

	Compartments	TADs	Loops, point interactions
Human, mouse	100 kb	40 kb	10 kb
Drosophila	20 kb	5 kb	2-5 kb

13 of 32

Hi-C map resolution

The highest resolution is required for SNP annotation and for studying promoter-enhancer interactions.

14 of 32

Accounting for experimental bias

There are three major sources of experimental bias in Hi-C data: restriction fragment length, GC content and sequence mappability.

In general, there are two types of approaches to account for biases in chromosome conformation capture (C) data:

Explicit method (Yaffe and Tanay): assumes that the systematic biases are known (the three listed above); statistically effective but computationally intensive.

Implicit method: assumes that the systematic biases are unknown and that all bias is captured by the sequencing coverage of each bin (equal visibility); much faster. It relies on the Sinkhorn–Knopp matrix-balancing algorithm.

15 of 32

Iterative correction* is a widely used application of the implicit method


*Imakaev et al, doi: 10.1038/nmeth.2148

16 of 32

Iterative correction (matrix balancing)

X_ij = X_ij / sum_k(X_ik)   (row i is divided by its total)

X_ij = X_ij / sum_k(X_kj)   (column j is divided by its total)

…

During iterative correction, the elements of the first row are divided by the total of that row, and the elements of the first column are divided by the total of that column. The procedure is then repeated consecutively for all rows and columns, from the first to the last.

1 round of iterative correction

Repeating this procedure yields a balanced matrix in which the total of each row (and of each column) is equal, or nearly equal, to 1.
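The balancing procedure can be illustrated with a minimal plain-Python sketch. It is a simplified variant written for this slide, not the implementation used by any particular tool: in each round it rescales all rows and then all columns, and it assumes a small dense matrix without empty rows or columns. The function name `balance` and the parameter `n_rounds` are chosen for the example.

```python
def balance(matrix, n_rounds=50):
    """Alternately rescale rows and columns so that each sums to 1."""
    x = [[float(v) for v in row] for row in matrix]
    n = len(x)
    for _ in range(n_rounds):
        # Divide every row by its total.
        for i in range(n):
            row_sum = sum(x[i])
            x[i] = [v / row_sum for v in x[i]]
        # Divide every column by its total.
        for j in range(n):
            col_sum = sum(x[i][j] for i in range(n))
            for i in range(n):
                x[i][j] /= col_sum
    return x


# Toy symmetric matrix of raw contact counts (3 bins).
raw = [[3, 2, 1], [2, 3, 1], [1, 1, 2]]
balanced = balance(raw)

row_sums = [sum(row) for row in balanced]
col_sums = [sum(col) for col in zip(*balanced)]
print(row_sums, col_sums)  # all totals are close to 1
```

On real maps, bins with zero or very low coverage have to be excluded before balancing, otherwise the division by the row total fails or amplifies noise.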

17 of 32

Alternative bias correction

Coverage based normalization – covNorm

https://doi.org/10.1016/j.csbj.2021.05.041

The data are approximated with a negative binomial distribution, and the expected ligation frequency is calculated.

The normalized data are then ready for further analysis.

18 of 32

Primary analysis

P(s) curves:

calculate by hand
or with HiCExplorer (and other tools)

The P(s) curve provides a rough estimate of data quality and of the similarity between replicates.
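Calculating a P(s) curve "by hand" could look roughly like the sketch below. It takes the genomic distances between the two sides of cis pairs, counts them in logarithmically spaced distance bins and divides each count by the bin width, so that the result is a contact density. The function name `ps_curve` and its parameters are assumptions of this example; a more careful calculation would also account for the number of locus pairs available at each distance.

```python
import math


def ps_curve(distances, s_min=1_000, s_max=10_000_000, n_bins=20):
    """Return (bin midpoints, contact density) for log-spaced distance bins."""
    log_min, log_max = math.log10(s_min), math.log10(s_max)
    step = (log_max - log_min) / n_bins
    edges = [10 ** (log_min + k * step) for k in range(n_bins + 1)]

    counts = [0] * n_bins
    for s in distances:
        if s_min <= s < s_max:
            k = min(int((math.log10(s) - log_min) / step), n_bins - 1)
            counts[k] += 1

    total = sum(counts)
    mids, density = [], []
    for k in range(n_bins):
        width = edges[k + 1] - edges[k]
        mids.append(math.sqrt(edges[k] * edges[k + 1]))  # geometric midpoint
        density.append(counts[k] / (total * width) if total else 0.0)
    return mids, density


# Distances |pos2 - pos1| for the 12 toy pairs from the binning example.
toy_distances = [900, 2000, 2300, 5250, 4000, 7200, 200, 1460, 200, 4000, 3999, 2999]
mids, density = ps_curve(toy_distances, s_min=100, s_max=10_000, n_bins=4)
for s, p in zip(mids, density):
    print(f"{s:10.0f} {p:.3e}")
```

Plotting `density` against `mids` with both axes on a log scale gives the scaling plot; with real data the curve should decay with distance.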

19 of 32

Comparison of replicates

Peculiar data need a peculiar approach: the stratum-adjusted correlation coefficient (SCC)

HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research, 2017

Look for the implementation in Python (not R!): https://github.com/dejunlin/hicrep

20 of 32

The SCC statistic is calculated by computing a Pearson correlation coefficient for each stratum and then aggregating the stratum-specific correlation coefficients using a weighted average, with the weights derived from the generalized Cochran–Mantel–Haenszel (CMH) statistic.
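The idea of a stratum-wise weighted correlation can be sketched as follows. This is a simplified illustration, not the HiCRep implementation: a stratum is taken to be one diagonal of the matrix (all bin pairs at the same distance), the weight of stratum k is assumed to be N_k * sd(X_k) * sd(Y_k), and the smoothing step that HiCRep applies beforehand is omitted. The matrices `rep1` and `rep2` are made-up toy data, and the main diagonal is skipped by choice.

```python
import math


def pearson(x, y):
    """Return the Pearson correlation of x and y and their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    rho = cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")
    return rho, sx, sy


def stratum(matrix, k):
    """Values on the k-th diagonal: all bin pairs separated by k bins."""
    return [matrix[i][i + k] for i in range(len(matrix) - k)]


def scc(a, b, max_offset):
    """Weighted average of per-stratum Pearson correlations (simplified SCC)."""
    num = den = 0.0
    for k in range(1, max_offset + 1):
        x, y = stratum(a, k), stratum(b, k)
        if len(x) < 2:
            continue
        rho, sx, sy = pearson(x, y)
        if math.isnan(rho):
            continue  # a constant stratum carries no correlation information
        weight = len(x) * sx * sy
        num += weight * rho
        den += weight
    return num / den if den > 0 else float("nan")


# Hypothetical toy maps (5 bins); only the upper triangle is read.
rep1 = [[9, 5, 3, 2, 1], [5, 8, 4, 3, 2], [3, 4, 9, 6, 2], [2, 3, 6, 7, 5], [1, 2, 2, 5, 8]]
rep2 = [[8, 6, 3, 2, 1], [6, 9, 4, 2, 2], [3, 4, 8, 7, 3], [2, 2, 7, 9, 5], [1, 2, 3, 5, 7]]
print(scc(rep1, rep2, max_offset=3))
```

Correlating within strata avoids the inflated agreement that a plain Pearson correlation of two Hi-C maps shows simply because both decay with distance.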

21 of 32

Comparison of data with different quality

What if we have 2 Hi-C maps with different total number of contacts?

Hi-C maps with dramatically different coverage should not be compared.

Hi-C maps with more or less similar quality (number of contacts) should be downsampled to the same total number of contacts:

cooltools random-sample -c …..

https://cooltools.readthedocs.io/en/latest/
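What downsampling to a fixed number of contacts means can be shown with a small plain-Python sketch. It is not how cooltools implements it: here every contact is expanded into a list entry and the requested number of contacts is drawn without replacement, which is only practical for toy data. The function name `downsample` and the `seed` parameter are assumptions of this example; the pixel counts are those of the toy matrix for bins 2, 3 and 4.

```python
import random


def downsample(pixels, target, seed=0):
    """Draw `target` contacts without replacement from upper-triangle pixel counts."""
    total = sum(pixels.values())
    if target > total:
        raise ValueError("target exceeds the number of contacts in the map")
    # One list entry per individual contact, labelled by its (bin1, bin2) pixel.
    contacts = [pixel for pixel, count in pixels.items() for _ in range(count)]
    rng = random.Random(seed)
    sampled = {}
    for pixel in rng.sample(contacts, target):
        sampled[pixel] = sampled.get(pixel, 0) + 1
    return sampled


# Upper-triangle pixel counts of the toy map: 12 contacts in total.
pixels = {(2, 2): 3, (2, 3): 2, (2, 4): 1, (3, 3): 3, (3, 4): 1, (4, 4): 2}
small = downsample(pixels, target=6)
print(sum(small.values()))  # 6
```

Sampling individual contacts rather than scaling the counts keeps the result integer-valued and preserves the counting noise expected at the lower depth.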

22 of 32


Forcato et al. Nature Methods 2017

Methods for Hi-C map generation from raw reads

Nf-core pipelines
Distiller-nf – our 'workhorse'	nf-core/hic
Based on: bwa-mem, pairtools, cooler	Based on: HiC-Pro and cooler

old, but useful from time to time

23 of 32

Distiller-nf: a modular Hi-C mapping pipeline for reproducible data analysis.

The distiller pipeline aims to provide the following functionality:

Align the sequences of Hi-C molecules to the reference genome (bwa-mem, which allows reads to be split)
Parse the .sam alignments and produce files with Hi-C pairs (pairtools)
Filter PCR duplicates
Aggregate pairs into binned matrices of Hi-C interactions (cooler)

https://github.com/open2c/distiller-nf

24 of 32

Distiller-nf: a modular Hi-C mapping pipeline for reproducible data analysis.

https://pairtools.readthedocs.io/en/latest/

25 of 32

Distiller-nf output

.cool and .mcool files
.cool file – 1 resolution (default is 1 kb)
.mcool file – multiple resolutions (can be specified by the user)
separate replicates and merged replicates
the file (cool/mcool) stores both raw and balanced (iteratively corrected) interactions

The cool and mcool file formats are becoming more and more popular thanks to the wide range of utilities developed for them.

26 of 32

Cool/mcool file formats

Distiller-nf, HiC-Pro: .cool or .mcool format:

cooler and cooltools (Python) are popular packages for working with cool and mcool files

trans-contacts, unbalanced data

cis-contacts, balanced data

27 of 32

Cool/mcool file formats

raw interactions

information about bin weights

28 of 32

Other formats:

Juicer output – .hic format, which is binary and suited to Juicer tools.

Hiclib output – .hdf format, which can be opened with the h5py Python library.

Hiclib is an old and almost deprecated tool, but some bioinformaticians still find it more accurate (more precise artefact filtering).

29 of 32

Conversion of Hi-C matrices of different file formats

HiCExplorer can be used to convert between file formats:

from: h5, cool, hic, homer, hicpro, 2D-text

to: cool, h5, homer, ginteractions, mcool, hicpro

One can also create a cool file from a matrix in txt format (see slide 27):

create bins

create pixels

create the cooler with cooler.create_cooler

30 of 32

Useful links


31 of 32

Practical part

As with every type of omics data, raw Hi-C fastq files should be trimmed first (cutadapt, trimgalore).

https://drive.google.com/drive/folders/1uIAsVJF-JTjkHK7VPAP8FcocDrNkoIKs?usp=sharing

Goals:

To get acquainted with distiller-nf
To practise working with mcool/cool files, matrices, bins and pixels using cooler from the command line and from Python

Clustering of replicates

We have replicates for 2 Drosophila cell lines: Bg3 and Kc167.

Bg3 - nervous system cell line.

Kc167 - embryonic cell line.

The aim is to cluster the replicates using SCC and to demonstrate that replicates of the same cell line are closer to each other than to replicates of the other cell line.
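The clustering step can be sketched with a tiny average-linkage procedure in plain Python. The SCC values below are hypothetical placeholders invented for illustration, not results for the lecture data, and the sample names are made up from the two cell lines; the distance between two samples is taken as 1 - SCC, which is an assumption of this sketch. For an actual dendrogram plot, a library such as SciPy (`scipy.cluster.hierarchy`) is the usual choice.

```python
def average_linkage(labels, dist):
    """Agglomerative clustering; returns a list of (merged tree, distance)."""
    # cluster id -> (tree as nested tuples, member sample indices)
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                ma, mb = clusters[a][1], clusters[b][1]
                # Average distance over all pairs of members.
                d = sum(dist[i][j] for i in ma for j in mb) / (len(ma) * len(mb))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        tree = (clusters[a][0], clusters[b][0])
        members = clusters[a][1] + clusters[b][1]
        merges.append((tree, d))
        del clusters[a], clusters[b]
        clusters[next_id] = (tree, members)
        next_id += 1
    return merges


labels = ["Bg3_rep1", "Bg3_rep2", "Kc167_rep1", "Kc167_rep2"]
# HYPOTHETICAL pairwise SCC values, for illustration only.
scc_matrix = [
    [1.00, 0.95, 0.70, 0.72],
    [0.95, 1.00, 0.71, 0.69],
    [0.70, 0.71, 1.00, 0.94],
    [0.72, 0.69, 0.94, 1.00],
]
dist = [[1.0 - v for v in row] for row in scc_matrix]

merges = average_linkage(labels, dist)
for tree, d in merges:
    print(f"{d:.3f}  {tree}")
```

With these placeholder values the replicates of each cell line merge first, and the two cell lines are joined only in the last step.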

Homework report:
1) scaling plot in log-log coordinates with description;
2) clustering of replicates for all files (in the directory for the lecture) with description (plot a dendrogram).

32 of 32

https://colab.research.google.com/drive/1EFRLARKwyqEeHVciZFLR1jmaP-ZaOXo-?usp=sharing

https://nf-co.re/hic/2.1.0/docs/usage

https://github.com/open2c/distiller-nf

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5668950/
