1 of 32

3-dimensional genome: chromatin conformation analysis. Part I.

Based on materials by: Anna Kononkova (IITP RAS, JS in Khrameeva lab, Skoltech)

2 of 32

Hi-C: the whole genome chromosome conformation capture


output

3 of 32

Hi-C map: general approach

Paired reads

(R1 and R2 of a pair share the same read id)

Fragment length depends on the restriction enzyme; the mean is about 500 bp for DpnII and about 4000 bp for HindIII.

4 of 32

Walks policy


.pairs format

5 of 32

Walks policy


from MultiQC report for Hi-C

6 of 32

Multiple ligations (walks)


option --walks-policy all (for pairtools)

7 of 32

An example for dummies: how is a Hi-C map generated from pairs?

Pairs:

Read id  chr1  pos1   chr2  pos2
id1      chr1  10000  chr1  10900
id2      chr1  10000  chr1  12000
id12     chr1  10200  chr1  12500
id4      chr1  10250  chr1  15500
id16     chr1  13000  chr1  17000
id6      chr1  14100  chr1  21300
id7      chr1  17000  chr1  17200
id3      chr1  18000  chr1  19460
id10     chr1  18500  chr1  18700
id9      chr1  18800  chr1  22800
id17     chr1  21000  chr1  24999
id8      chr1  22000  chr1  24999

Bins:

bin  chr   start  end
0    chr1  0      5000
1    chr1  5000   10000
2    chr1  10000  15000
3    chr1  15000  20000
4    chr1  20000  25000

Matrix: a piece of the 5 kb matrix with raw numbers of contacts (rows and columns are bins 2, 3, 4).

8 of 32

An example for dummies: how is a Hi-C map generated from pairs?

Each pair from the table on the previous slide is assigned to two bins (one for pos1, one for pos2), and the pairs falling into each combination of bins are counted.

A piece of the 5 kb matrix with raw numbers of contacts (upper triangle; rows and columns are bins):

bin  2  3  4
2    3  2  1
3       3  1
4          2

Heatmap: the same counts shown as colors (color scale: 1, 2, 3).
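The counting step in this example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what pairtools or cooler do internally; the names `PAIRS`, `BIN_SIZE`, `N_BINS` and `pairs_to_matrix` are made up for this sketch, and the pairs are copied from the table above.

```python
# Pairs from the example table: (read_id, chrom1, pos1, chrom2, pos2)
PAIRS = [
    ("id1", "chr1", 10000, "chr1", 10900),
    ("id2", "chr1", 10000, "chr1", 12000),
    ("id12", "chr1", 10200, "chr1", 12500),
    ("id4", "chr1", 10250, "chr1", 15500),
    ("id16", "chr1", 13000, "chr1", 17000),
    ("id6", "chr1", 14100, "chr1", 21300),
    ("id7", "chr1", 17000, "chr1", 17200),
    ("id3", "chr1", 18000, "chr1", 19460),
    ("id10", "chr1", 18500, "chr1", 18700),
    ("id9", "chr1", 18800, "chr1", 22800),
    ("id17", "chr1", 21000, "chr1", 24999),
    ("id8", "chr1", 22000, "chr1", 24999),
]

BIN_SIZE = 5000  # 5 kb resolution
N_BINS = 5       # bins 0..4 cover chr1:0-25000


def pairs_to_matrix(pairs, bin_size, n_bins):
    """Count contacts per pair of bins; the matrix is kept symmetric."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for _read_id, _chrom1, pos1, _chrom2, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size  # bin index of each side
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1
    return matrix


matrix = pairs_to_matrix(PAIRS, BIN_SIZE, N_BINS)
for row in matrix[2:5]:
    print(row[2:5])
# [3, 2, 1]
# [2, 3, 1]
# [1, 1, 2]
```

The printed block is the symmetric version of the upper triangle shown on the slide; bins 0 and 1 stay empty because no pair starts before position 10000.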

9 of 32


P(s) curve

Distance dependence of Hi-C counts: a prominent decay of contact frequency with genomic distance is an inherent feature of Hi-C data.

10 of 32

Valid pairs and artefacts


11 of 32

How to deal with artefacts?


Most Hi-C tools provide the user with a MultiQC report.

The report helps to assess data quality and to decide how many diagonals of the matrix should be dropped.

12 of 32

Hi-C map resolution

The choice of resolution depends on:

  • Genome:
      human, mouse – long genome, long genes
      Drosophila – shorter genome, shorter genes
  • Sequencing depth
  • Restriction enzyme used to cut DNA in the Hi-C protocol
  • General aim:

              Compartments  TADs   Loops, point interactions
Human, mouse  100 kb        40 kb  10 kb
Drosophila    20 kb         5 kb   2-5 kb

13 of 32

Hi-C map resolution

The highest resolution is required for SNP annotation and for the study of promoter-enhancer interactions.

14 of 32

Accounting for experimental bias

  • Three major sources of experimental bias in Hi-C data: restriction enzyme fragment lengths, GC content and sequence mappability.

  • In general, there are two types of approaches to account for biases in C-data:

Explicit method (Yaffe and Tanay): assumes that the systematic biases are known (the three listed above); statistically effective but computationally intensive.

Implicit method: assumes that the systematic biases are unknown and that all the bias is captured by the sequencing coverage of each bin (equal visibility); much faster. Leverages the Sinkhorn–Knopp matrix-balancing algorithm.


15 of 32

Iterative correction* is a widely used application of the implicit method


*Imakaev et al, doi: 10.1038/nmeth.2148

16 of 32

Iterative correction (matrix balancing)


Row step: $X_{ij} \leftarrow X_{ij} / \sum_{k} X_{ik}$

Column step: $X_{ij} \leftarrow X_{ij} / \sum_{k} X_{kj}$

During iterative correction, the elements of the first row are divided by the total sum of that row, and the elements of the first column are divided by the total sum of that column. The procedure is repeated consecutively for all rows and columns, from the first to the last.

1 round of iterative correction

Repeating this procedure yields a balanced matrix in which the total sum of each row (and of each column) is equal, or nearly equal, to 1.
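The row and column steps can be sketched in plain Python as follows. This is a minimal Sinkhorn–Knopp style illustration that normalizes all rows and then all columns in each round; it is not the implementation used by cooler or by Imakaev et al., and the function name `iterative_correction`, the parameter `n_rounds` and the toy matrix are assumptions made for the example.

```python
def iterative_correction(matrix, n_rounds=50):
    """Alternate row and column normalization for a fixed number of rounds."""
    balanced = [[float(value) for value in row] for row in matrix]
    n = len(balanced)
    for _ in range(n_rounds):
        # Row step: divide every element by the sum of its row.
        for i in range(n):
            row_sum = sum(balanced[i])
            if row_sum > 0:
                balanced[i] = [value / row_sum for value in balanced[i]]
        # Column step: divide every element by the sum of its column.
        for j in range(n):
            col_sum = sum(balanced[i][j] for i in range(n))
            if col_sum > 0:
                for i in range(n):
                    balanced[i][j] /= col_sum
    return balanced


# Toy symmetric matrix of raw counts (bins 2-4 of the 5 kb example).
raw = [[3, 2, 1], [2, 3, 1], [1, 1, 2]]
balanced = iterative_correction(raw)
row_sums = [sum(row) for row in balanced]
col_sums = [sum(col) for col in zip(*balanced)]
print(row_sums, col_sums)
```

For a small matrix with no empty rows the sums settle close to 1 after a handful of rounds. Real maps contain empty or poorly covered bins, which are usually masked before balancing; that step is left out here.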

17 of 32

Alternative bias correction

Coverage based normalization – covNorm

https://doi.org/10.1016/j.csbj.2021.05.041

The data are approximated with a negative binomial distribution, and the expected ligation frequency is calculated.

The resulting normalized data are ready for further analysis.


18 of 32

Primary analysis

P(s)-curves:

  • calculate by hand
  • or with hicexplorer (and other tools)

P(s) curves

The P(s) curve provides a rough estimate of data quality and of the similarity between replicates.
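Calculating the curve "by hand" amounts to averaging the contacts on each diagonal of the matrix, since all cells on diagonal k are separated by k bins. Below is a minimal sketch in plain Python; the function name `contacts_vs_distance` and the toy matrix are assumptions for illustration, and the log-spaced distance bins and removal of bad bins that real analyses use are omitted.

```python
def contacts_vs_distance(matrix, bin_size):
    """Return (distance, mean contacts) for every diagonal of a square matrix."""
    n = len(matrix)
    curve = []
    for k in range(n):
        # Diagonal k holds all bin pairs separated by k bins.
        diagonal = [matrix[i][i + k] for i in range(n - k)]
        curve.append((k * bin_size, sum(diagonal) / len(diagonal)))
    return curve


# Toy symmetric matrix of raw counts (bins 2-4 of the 5 kb example).
raw = [[3, 2, 1], [2, 3, 1], [1, 1, 2]]
curve = contacts_vs_distance(raw, bin_size=5000)
for distance, mean_contacts in curve:
    print(distance, mean_contacts)
```

For plotting in log-log coordinates the zero-distance point has to be skipped, because log(0) is undefined.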

19 of 32

Comparison of replicates

Peculiar data need a peculiar approach – the stratum-adjusted correlation coefficient (SCC)

HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research, 2017

Look for the implementation in Python (not R!): https://github.com/dejunlin/hicrep

20 of 32


The SCC statistic is calculated by computing a Pearson correlation coefficient for each stratum (a set of bin pairs at the same genomic distance) and then aggregating the stratum-specific correlation coefficients using a weighted average, with the weights derived from the generalized Cochran–Mantel–Haenszel (CMH) statistic.
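The aggregation can be illustrated with a simplified sketch in plain Python. It treats each diagonal as a stratum and weights the per-stratum Pearson coefficient by the stratum size times the square root of the product of the two variances. This weight is an assumption of the sketch: the smoothing and variance stabilisation performed by HiCRep are left out, so the numbers will not match the hicrep package. The names `stratum_stats`, `simple_scc` and the two toy matrices are made up for illustration.

```python
import math


def stratum_stats(x, y):
    """Return (Pearson r, sqrt(var_x * var_y)) or None if a variance is zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return None
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(ss_x * ss_y), math.sqrt(ss_x * ss_y) / n


def simple_scc(map1, map2, max_offset):
    """Weighted average of per-diagonal Pearson correlations (simplified SCC)."""
    n = len(map1)
    weighted_sum = total_weight = 0.0
    for k in range(1, max_offset + 1):
        x = [map1[i][i + k] for i in range(n - k)]
        y = [map2[i][i + k] for i in range(n - k)]
        stats = stratum_stats(x, y) if len(x) >= 2 else None
        if stats is None:
            continue  # stratum too short or constant: no correlation defined
        rho, spread = stats
        weight = len(x) * spread  # assumed weight: N_k * sqrt(var_x * var_y)
        weighted_sum += weight * rho
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else float("nan")


# Two hypothetical 5x5 contact maps (toy values, not real data).
map_a = [[9, 5, 2, 1, 0], [5, 8, 4, 3, 2], [2, 4, 9, 6, 2],
         [1, 3, 6, 7, 3], [0, 2, 2, 3, 8]]
map_b = [[8, 6, 2, 1, 0], [6, 9, 3, 3, 1], [2, 3, 8, 5, 3],
         [1, 3, 5, 9, 2], [0, 1, 3, 2, 7]]
scc_same = simple_scc(map_a, map_a, max_offset=3)
scc_ab = simple_scc(map_a, map_b, max_offset=3)
print(scc_same, scc_ab)
```

A map compared with itself gives a value of 1 (up to rounding), which is a quick sanity check before using any SCC implementation on real replicates.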

21 of 32

Comparison of data with different quality

  • What if we have 2 Hi-C maps with different total number of contacts?

Hi-C maps with dramatically different coverage should not be compared.

Hi-C maps with more or less similar quality (number of contacts) should be downsampled to the same total number of contacts:

cooltools random-sample -c …..

https://cooltools.readthedocs.io/en/latest/
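The idea behind downsampling can be sketched in plain Python: every contact is treated as one item, and a fixed number of contacts is drawn without replacement. This only illustrates the principle and is not how cooltools implements random-sample; the function name `downsample_pixels`, the `seed` parameter and the toy pixel counts are assumptions for the example.

```python
import random
from collections import Counter


def downsample_pixels(pixels, target_total, seed=0):
    """Draw target_total contacts without replacement.

    pixels: dict mapping (bin1_id, bin2_id) to a raw contact count.
    """
    # Expand each pixel into one entry per contact.
    contacts = [key for key, count in pixels.items() for _ in range(count)]
    if target_total > len(contacts):
        raise ValueError("target_total exceeds the number of contacts")
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return dict(Counter(rng.sample(contacts, target_total)))


# Upper triangle of the 5 kb toy matrix: 12 contacts in total.
pixels = {(2, 2): 3, (2, 3): 2, (2, 4): 1, (3, 3): 3, (3, 4): 1, (4, 4): 2}
sampled = downsample_pixels(pixels, target_total=8)
print(sum(sampled.values()))
# 8
```

Expanding every contact into a list is only practical for toy data; for real maps with many millions of contacts the sampling is done on the counts directly.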


22 of 32

 


Forcato et al. Nature Methods 2017

Methods for Hi-C map generation from raw reads

Nextflow pipelines

  • Distiller-nf – our "workhorse". Based on: bwa-mem, pairtools, cooler
  • nf-core/hic. Based on: HiC-Pro and cooler

Old, but useful from time to time.

23 of 32

Distiller-nf: a modular Hi-C mapping pipeline for reproducible data analysis.

The distiller pipeline aims to provide the following functionality:

  • Align the sequences of Hi-C molecules to the reference genome (bwa-mem, which allows split alignments of reads)
  • Parse the .sam alignments and write files with Hi-C pairs (pairtools)
  • Filter PCR duplicates
  • Aggregate pairs into binned matrices of Hi-C interactions (cooler)

https://github.com/open2c/distiller-nf

24 of 32

Distiller-nf: a modular Hi-C mapping pipeline for reproducible data analysis.

https://pairtools.readthedocs.io/en/latest/

25 of 32

Distiller-nf output

  • .cool and .mcool files
  • .cool file – 1 resolution (default is 1kb)
  • .mcool files – multiple resolution (can be specified by user)
  • separate replicates and merged replicates
  • the file (cool/mcool) stores information for raw and balanced (iteratively corrected) interactions

The cool and mcool file formats are becoming increasingly popular thanks to the wide range of utilities developed for them.

26 of 32

Cool/mcool file formats

Distiller-nf, HiC-Pro: .cool or .mcool format:

cooler and cooltools (Python) are popular packages for working with cool and mcool files

trans-contacts, unbalanced data

cis-contacts, balanced data

27 of 32

Cool/mcool file formats


raw interactions

information about bin weights

28 of 32

Other formats:

Juicer output – .hic format which is binary and appropriate for Juicer tools.

Hiclib output – .hdf format, which can be opened with the h5py Python library.

Hiclib is an old and almost deprecated tool, but some bioinformaticians still find it more accurate (more precise artefact filtering).

29 of 32

Conversion of Hi-C matrices of different file formats

  • HiCExplorer can be used to convert between file formats:

from h5, cool, hic, homer, hicpro, 2D-text

to cool, h5, homer, ginteractions, mcool, hicpro

One can also create a cool file from a matrix in txt format (see slide 27):

  • create bins
  • create pixels
  • create the cooler with cooler.create_cooler
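The first two steps can be sketched in plain Python. A cooler stores bins as chrom/start/end records and pixels as bin1_id/bin2_id/count records for the upper triangle of the matrix. The helper names `make_bins` and `make_pixels` and the toy matrix are assumptions for illustration. The last step is not shown: cooler.create_cooler expects these two tables as pandas DataFrames, so the lists below would first have to be converted.

```python
def make_bins(chrom, chrom_length, bin_size):
    """Fixed-size bins as (chrom, start, end); the last bin may be shorter."""
    return [
        (chrom, start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]


def make_pixels(matrix):
    """Nonzero upper-triangle cells as (bin1_id, bin2_id, count)."""
    n = len(matrix)
    return [
        (i, j, matrix[i][j])
        for i in range(n)
        for j in range(i, n)
        if matrix[i][j] != 0
    ]


# Dense symmetric 5 kb matrix for chr1:0-25000 from the toy example.
dense = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2],
]
bins = make_bins("chr1", 25000, 5000)
pixels = make_pixels(dense)
print(bins[2])
# ('chr1', 10000, 15000)
print(pixels)
# [(2, 2, 3), (2, 3, 2), (2, 4, 1), (3, 3, 3), (3, 4, 1), (4, 4, 2)]
```

Only the upper triangle is kept because the matrix is symmetric, and the pixels are listed sorted by bin1_id and then bin2_id.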

30 of 32

Useful links


31 of 32

Practical part

  • As with every type of omics data, raw FASTQ files from a Hi-C experiment should be trimmed first (cutadapt, trimgalore).

https://drive.google.com/drive/folders/1uIAsVJF-JTjkHK7VPAP8FcocDrNkoIKs?usp=sharing

Goals:

  • Get acquainted with distiller-nf
  • Learn to work with mcool/cool files (matrix, bins, pixels) using cooler from the command line and from Python

  • Clustering of replicates

We have replicates for 2 Drosophila cell lines: Bg3 and Kc167.

Bg3 - nervous cell line.

Kc167 - embryonic cell line.

The aim is to cluster the replicates using SCC and to demonstrate that replicates of the same cell line are closer to each other than to samples of the other cell type.

Homework report:
1) scaling plot in log-log coordinates, with a description;
2) clustering of replicates for all files (in the directory for the lecture), with a description (plot a dendrogram).


32 of 32
