JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 151

Atelier Chip-Seq

Elodie Darbo, INSERM U1218, Bordeaux

Stéphanie Le Gras, IGBMC, Strasbourg

Tao Ye, IGBMC, Strasbourg

Morgane Thomas-Chollier, IBENS, Paris

École de bioinformatique AVIESAN-IFB 2021

2 of 151

Contents

Introduction
Experimental Design
Quality Control of the reads
Mapping
Quality control on mapped reads
Visualisation et normalisation
Peak Calling
Motifs Analysis
Annotation
Conclusions

3 of 151

TIPS

Keep track of all command lines you run. You can for example, create a text file in which you write every commands you run.
Give content-explicit names to the files you’re generating.
Give to files the right extension.
Create directories with explicit names!!
Compress big files (with gzip for instance).

4 of 151

Introduction

5 of 151

Chip-Seq analysis

Experimental design, Quality Controls, Mapping

Normalization & peak calling

Reads

Peaks

6 of 151

Chip-Seq analysis

Experimental design, Quality Controls, Mapping

Normalization & peak calling

Motif analysis

Peak annotation

Reads

Peaks

Annotations

Motifs

7 of 151

ChIP-seq

ChIP (=Chromatin Immuno-Precipitation)

differences in methods to detect the bound DNA

- small-scale: PCR / qPCR

- large-scale:

- microarray = ChIP-on-chip

- sequencing = ChIP-seq

8 of 151

Experimental design

9 of 151

ENCODE

The Encyclopedia of DNA Elements (ENCODE) Consortium has carried out thousands of ChIP–seq experiments and has used this experience to develop a set of working standards and guidelines

See: https://www.encodeproject.org/about/experiment-guidelines/

Landt SG, Marinov GK, Kundaje A et al. (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research 22, 1813–1831.

10 of 151

Considerations on ChIP

Antibody

Antibody quality varies, even between independently prepared lots of the same antibody (Egelhofer, T. A. et al. 2011)

Number of cells

large number of cells are required for a ChIP experiment (limitation for small organisms or precious samples)

Shearing of DNA (Mnase I, sonication, Covaris): trying to narrow down the size distribution of DNA fragments

Complexity in DNA fragments

11 of 151

Library prep

Size selection (200 or 400 bp)

PCR amplification

Single-end Sequencing Paired-end Sequencing

Ligation of Adapters

ChIP

Step between ChIP and sequencing
Starting material: ChIP sample (1-10ng of sheared DNA)

12 of 151

Sequencing

Sequencer : Illumina HiSeq 4000
No. of reads per sample:

(HiSeq 2500) 4 samples per lane :~41 millions per sample
(HiSeq 4000) 8 samples per lane :~43 millions per sample

Length of DNA fragment : ~200bp
No. of cycle per run : 50

13 of 151

Single end or paired end ?

Single end (most of the time until 2016)
Paired-end (more and more these days)

Improve identification of duplicated reads
Better estimation of the fragment size distribution
Increase the mapping efficiency to repeat regions
The price! But 2 x 40bp is affordable

14 of 151

Sequencing depth

Consider the depth needed depending on:

Chipped protein

15 of 151

Sequencing depth

Consider the depth needed depending on:

Chipped protein
Number of expected binding sites

16 of 151

Sequencing depth

Consider the depth needed depending on:

Chipped protein
Number of expected binding sites
Size of the genome of interest

Ex:

For human genomes

20 million uniquely mapped read sequences for point-source peaks,
40 million for broad-source peaks.

For fly genome: 8 million reads
For worm genome: 10 million reads

17 of 151

Controls

Used mostly to filter out false positives (high level of noise)

Idea: potential false positive will be enriched in both treatment and control.

A control will fail to filter out false positives if its enrichment profile is very different from the enrichment profile of false positive regions in the treatment sample
Most commonly used control: Input DNA (a portion of DNA sample removed prior to IP)
Choice of control is extremely important
It is recommended to cover the control in a higher extent than the IPs

18 of 151

Why an Input is required ?

The input is used to model local noise level

Accessible regions are expected to produce more reads��
Amplified regions (CNV) are expected to produce more reads��

Closed

Open

Closed

Open

Closed

Open

19 of 151

Why an Input is required ?

The input is used to model local noise level

Accessible regions are expected to produce more reads��
Amplified regions (CNV) are expected to produce more reads
Moreover, most peak callers are configured with an input as control ��

Closed

Open

Closed

Open

Closed

Open

20 of 151

Other controls

IgG (mock IP): controls for non-specific IP enrichment

Problem : low-complexity library (few reads)

Histone H3 (for H3 variants)
Uninduced condition (for inducible TFs)

Example : Glucocorticoid Réceptor
Induced by Dexamethasone (Dex)
Control vehicule = Ethanol (EthOH)

KO of your protein of interest
Non flagged cell lines
...

�

21 of 151

Replicates

A minimum of two replicates should be carried out per experiment.
Each replicate should be a biological rather than a technical replicate; that is, it results from an independent cell culture, embryo pool or tissue sample.

22 of 151

Data analysed in this course

23 of 151

Dataset used

All experiments (GEO): GSE41187
Experiment: FNR IP ChIP-seq Anaerobic A (SRX189773 - SRR576933)
Control: anaerobic INPUT DNA (SRX189778 - SRR576938)

24 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

25 of 151

Downloading ChIP-seq reads from NCBI

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

26 of 151

Connect to the server and set up your environment

Protocol

27 of 151

Quality control of the reads

28 of 151

Quality control of the reads

As for any NGS datasets
FastQC program (c.f. Course “preprocessing” Denis Puthier Monday afternoon)

29 of 151

Quality control of the reads and statistics

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

30 of 151

Mapping

31 of 151

Mapping

Find out the position of the reads within the reference genome

Ref. Genome

Reads

One position in the genome
Many possible positions (Repeat regions, duplicate regions, pseudogenes…)

32 of 151

Mapping example

Human chromosomes

ATGCGATTA

33 of 151

Genomic coordinates

chromosome 5

ATGCGATTA

181,538,259

alignment

125,326,318 start

125,326,358 end

Genomic coordinate of the mapped read :

chr5 125326318 125326358 +

GPS

34 of 151

Mapping tool used: Bowtie

(c.f. Course “mapping” Denis Puthier Monday afternoon)
Designed to align reads if:

many of the reads have at least one good, valid alignment,
many of the reads are relatively high-quality
the number of alignments reported per read is small (close to 1)

Langmead B. et al, Genome Biology 2009
Langmead B (2010) Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics Chapter 11: Unit 11 17

35 of 151

Duplicated genomic regions

Keep 1 position randomly

Keep all possible position

Keep none

3 possible alignments

36 of 151

Mappability

Mappability (a): how many times a read of a given length can align at a given position in the genome

a=1 (read align once)
a=1/n (read align n times)

37 of 151

Mapping the reads with Bowtie

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

38 of 151

Mapping: expected signal

For a transcription factor signal is expected to be sharp

Sens alignments

Rev/comp alignments

The binding site itself is

generally not sequenced !

39 of 151

Mapping: the expected signal

For most histone marks the signal is expected to be broad
Asymmetry is less/not pronounced
Peak calling algorithms need to adapt to these various signals

40 of 151

Mapping: observed signal

Histone mark

(H3K4me1)

Trans. Factor

(ESR1)

41 of 151

Filtering mapped reads

42 of 151

Which reads to filter ?

Low-quality read alignments

Tool : samtools

Multi-mapped reads (unless removed during the mapping step)

Tool : samtools

Duplicated reads (PCR duplicates)

Tool : Picard MarkDuplicates

43 of 151

Source of confusion

uniquely mapped reads = reads that “matches” only 1 region in the genome

duplicated reads = reads that “match” at the SAME location (same start, strand)

Uniquely mapped

multi-mapped

“Stack” of duplicated reads

44 of 151

PCR duplicates

PCR duplicates

Related to poor library complexity
The same set of fragments are amplified, may indicates that immuno-precipitation failed
Tools to check for

FastQC report (duplicate diagram)
PCR bottleneck metric (ENCODE)

45 of 151

QC : PBC (PCR Bottleneck Coefficient)

An approximate measure of library complexity
PBC = N₁/N_d

N₁= Genomic position with 1 read aligned
N_d = Genomic position with ≧ 1 read aligned

Value :

0-0.5: severe bottlenecking
0.5-0.8: moderate bottlenecking
0.8-0.9: mild bottlenecking
0.9-1.0: no bottlenecking

https://genome.ucsc.edu/ENCODE/qualityMetrics.html

✅

❌

46 of 151

Estimating the number of duplicated reads

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

47 of 151

Quality Control on mapped reads

48 of 151

Assessing ChIP quality

Guidelines from ENCODE
Various metrics

Check duplicate rate (see previous Filtering section)
Use a Lorenz Curve (implemented in Deeptools fingerprint)
Look at strand cross-correlation (implemented in SPP BioC package and phantompeakqualtools)
Fraction of reads in peaks (FRiP, as proposed by ENCODE), but requires to find peaks.

49 of 151

Lorenz curve

Analyze income among workers by computing cumulative sum.

If uniform income distribution :

Straight line

Line of equality

Fraction of total income

100%

“Ideal” world

Cumulative income of people from lowest

to highest income

50 of 151

Lorenz curve

Analyze income among workers by computing cumulative sum.

If uniform income distribution :

Straight line

If they were trumpized

Lorenz curve

Line of equality

Lorenz curve

Fraction of total income

100%

“Ideal” world

“Trumpized” world

Cumulative income of people from lowest

to highest income

51 of 151

Lorenz curve

Analyze income among workers by computing cumulative sum.

If uniform income distribution :

Straight line

If they were trumpized

Lorenz curve

Here the workers are the genome windows and incomes are reads

Line of equality

Lorenz curve

Fraction of total reads

100%

“Ideal” world

(Bad ChIP)

“Trumpized” world

(Good ChIP)

Cumulative genome windows from lowest read counts to highest

52 of 151

Plot the Lorenz curve with Deeptools

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

53 of 151

Strand cross-correlation

Compute strand cross correlation for each window w across the genome.
Use various distance d and compute the mean cross-correlation observed

Coverage windows

Strand cross-correlation

for each window and various d values

r = 1

Mean cross-correlation

100

54 of 151

55 of 151

Strand cross-correlation

56 of 151

Strand cross-correlation

NSC: normalized strand coefficient

Relative strand correlation (RSC)

NSC ≥ 1.05 is recommended� �

RSC ≥ 0.8 is recommended�

57 of 151

Visualization: computing a genomic coverage file

58 of 151

Visualizing the data in a genome browser

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

59 of 151

Bam files are fat

BAM files are fat as they do contain exhaustive information about read alignments

Memory issues (can only visualize fraction of the BAM)�

Need a more lightweight file format containing only genomic coverage information:

❌ Wig (not compressed, not indexed)
✅ TDF (compressed, indexed)
✅ BigWig (compressed, indexed)

60 of 151

Coverage file and read extension

BAM files do not contain fragment location but read location
We need to extend reads to compute fragments coordinates before coverage analysis
Not required for PE

w_i

w_i+1

w_i+2

w_i+3

w_i+4

Window

Coverage

61 of 151

Library size normalization

IP condition A

input condition A

IP condition B

input condition B

IP vs input

normalization

IP vs input

normalization

Inter-condition

normalization

62 of 151

Library size normalization (input vs IP)

63 of 151

Library size normalization (input vs IP)

64 of 151

Library size normalization (input vs IP)

65 of 151

Library size normalization

Example of PeakSeq

All peaks

Signal peaks removed

66 of 151

Viewing scaled data

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

67 of 151

Peak Calling

Reads

Peaks

68 of 151

From reads to peaks

Chip-seq peaks are a mixture of two signals:

+ strand reads (Watson)
- strand reads (Cricks)

The sequence read density accumulates on forward and reverse strands centered around the binding site

69 of 151

From reads to peaks

Get the signal at the right position

Read shift
Extension

Estimate the fragment size
Do paired-end

70 of 151

Peak callers

The peak caller should be chosen based on

Experimental design

SE or PE (E.g MACS1.4 vs MACS2)

Expected signal

Sharp peaks (e.g. Transcription Factors).

E.g. MACS

Broad peaks (e.g. epigenetic marks).

E.g MACS, SICER,...

Sharp peaks

Broad peaks

71 of 151

A variety of peak callers

60 programs listed on OMICTOOLS
Most support a control

72 of 151

MACS [Zhang et al, 2008]

1. Modeling the shift size of ChIP-Seq tags

slides 2*bandwidth windows across the genome to find regions with tags more than mfold enriched relative to a random tag genome distribution
randomly samples 1,000 of these highly enriched regions
separates their + and - reads, and aligns them by the midpoint between their + and - read centers
define d as the distance in bp between the summit of the two distribution

73 of 151

MACS [Zhang et al, 2008]

2. Peak detection

Scales the total Input read count to be the same as the total ChIP read count
Duplicate read removal
Reads are shifted by d/2

(d value is the model obtained

in step 1)

Pepke et al, 2009

74 of 151

MACS [Zhang et al, 2008]

Slides 2d windows across the genome to find candidate peaks with a significant read enrichment (Poisson distribution p-value based on λ_BG, default 10^-5)
Estimate parameter λ_localof Poisson distribution
Keep peaks significant under λ_BG and λ_local and with p-value < threshold

75 of 151

MACS [Zhang et al, 2008]

3. Multiple testing correction (FDR)

Swap treatment and input and call negative peaks
Take all the peaks (neg + pos) and sort them by increasing p-values

FDR(p) =

# Negative peaks with p-value < p

# Selected peaks

FDR = 2/27 = 0.074

76 of 151

MACS in summary

Step 1 : search for candidate regions that look like good peaks, to produce a fine-tuned model of the peaks (d value) to search in Step 2
Step 2 : actual peak calling

sIiding window length = 2*d
In each window : test if the region is a peak, by comparing the number of reads in the treatment and the expected number of reads
Comparison is based on a statistical test with a Poisson distribution, keeping only regions with p-value < threshold

Step 3 : correction for multiple testing (many windows were tested), calculation of FDR

77 of 151

Peak calling with MAC S

(stop after step 3)

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

78 of 151

How to deal with replicates

79 of 151

How to deal with replicates

Sample 1.a

Sample 1.b

Analyze samples separately and takes union or intersection of resulting peaks

Sample 1.a

Sample 1.b

Sample 1

Merge samples prior to the peak calling (e.g recommended by MACS) => “pooling”

80 of 151

IDR - Irreproducible Discovery Rate (ENCODE)

Measures consistency between replicates
Uses reproducibility in score rankings between peaks in each replicate to determine an optimal cutoff for significance.
Idea:

The most significant peaks are expected to have high consistency between replicates
The peaks with low significance are expected to have low consistency

https://sites.google.com/site/anshulkundaje/projects/idr

81 of 151

IDR

(!) IDR doesn’t work on broad source data!

82 of 151

Combine replicate (IDR)

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

83 of 151

Processing steps are over !

Let’s do Biology !!!

peaks

84 of 151

85 of 151

« see if you can find something in the data »

86 of 151

« see if you can find something in the data »

87 of 151

What is the biological question ?

Where do a transcription factor (TF) bind ?

How do a transcription factor (TF) bind ?

Which binding motif(s) (can be several for a given TF !!)
Is the binding direct to DNA or via protein-protein interactions ?
Are there cofactors (maybe affecting the motif !!), and if so, identify them

Which regulated genes are directly regulated by a given TF ?

Where are the promoters (PolII) and chromatin marks ?

88 of 151

Should drive all « downstream » analyses

Will take time

to « do it all » !!!

89 of 151

cell biology (eg: luciferase assay) ?
in vitro assays (eg: EMSA) ?
Proteomic (eg: mass spectrometry) ?
Transgenics ?
Will depend on

the organism
available infrastructure

90 of 151

Discovering motifs in peaks

Annotations

Motifs

91 of 151

Biological concepts of transcriptional regulation

92 of 151

Transcription factor specificity

How do TF « know » where to bind DNA ?

TF recognize TFBS with specific DNA sequences

a given TF is able to bind DNA on TFBSs with different sequences

93 of 151

Binding specificity

94 of 151

From binding sites to binding motif�

Collection of binding sites �used to build the Sox2 matrix (TRANSFAC M01272 )

R15133 GCCCTCATTGTTATGC

R15201 AAACTCTTTGTTTGGA

R15231 TTCACCATTGTTCTAG

R15267 GACTCTATTGTCTCTG

R16367 GATATCTTTGTTTCTT

R17099 TGCACCTTTGTTATGC

R19276 AATTCCATTGTTATGA

R19367 AAACTCTTTGTTTGGA

R19510 ATGGACATTGTAATGC

R22342 AGGCCTTTTGTCCTGG

R22344 TGTGCTTTTGTNNNNN

R22359 CTCAACTTTGTAATTT

R22961 GCAGCCATTGTGATGC

R23679 CACCCCTTTGTTATGC

R25928 TTTTCTATTGTTTTTA

R27428 AAAGGCATTGTGTTTC

A	6	7	4	4	2	0	8	0	0	0	0	2	7	0	1	4
C	2	2	6	5	9	12	0	0	0	0	0	2	2	2	0	6
G	4	3	2	4	1	0	0	0	0	16	0	2	0	2	9	3
T	4	4	4	3	4	4	8	16	16	0	16	9	6	11	5	2

Position-specific scoring matrix (PSSM)

Sequence logo

95 of 151

A [24 54 59 0 65 71 4 24 9 ]

C [ 7 6 4 72 4 2 0 6 9 ]

G [31 7 0 2 0 1 1 38 55 ]

T [14 9 13 2 7 2 71 8 3 ]

>mm9_chr1_39249116_39251316_+

gagaggaagggggagaaagagggagggggagGGTGATAGGTAGCCAGGAG

CCAATGGGGGCGTTTTCCTTGTCCAGGCCACTTGCTGGAATGTGAGATGT

AGAATGACCCAAAGAGAGCTGCCAAGACAGAGCTCTGCCCCAGGAATTGA

ACTCAAAGGGTGTCAGAAAGCAGGTGGCCTTTGTGCACCTGGCGCGGGGA

CGTGGCTCCCCTCTTCCGGCTGGTCTAGCCAGGtgcctgcctgcctgcct

gccGTGATCTCTGGACGCCAGTAGAGGGTTGTTGTGGGTTTGGGTGAAAC

ACGCCACCCCTGAGCTCTTCCGCGGGGCTAGCAATCTCCCCATCACCCCA

TTCGCGCTCAGAACCCCCTCAGCGAGTCTAACAGCAGGCCTGGTTCCCCG

Discovered motif

ChIP-seq peaks

DNA sequence

Motif logo

96 of 151

De novo motif discovery

97 of 151

De novo motif discovery

Find exceptional motifs based on the sequence only

(No prior knowledge of the motif to look for)

Criteria of exceptionality:

Over-/under-representation: higher/lower frequency than expected by chance
Position bias: concentration at specific positions relative to some reference coordinates (e.g. TSS, peak center, …).

98 of 151

Some motif discovery tools

MEME (Bailey et al., 1994)
RSAT oligo-analysis (van Helden et al., 1998)
AlignACE (Roth et al. 1998)
RSAT position-analysis (van Helden et al., 2000)
Weeder (Pavesi et al. 2001)
MotifSampler (Thijs et al., 2001)
… many others

99 of 151

New approaches for ChIP-seq datasets

100 of 151

De novo motif discovery

target gene

cis-regulatory

elements

target gene

Case 1: promoters of co-expressed genes

TF binding site

Case 2: ChIP-seq peaks

binding motif �(represented as a� sequence logo)

101 of 151

Regulatory sequence Analysis Tools (rsat.eu)

243 Fungi

9638 Bacteria + Archaea

186 Protists

92 Metazoa

39 Plants

102 of 151

Peak-motifs

•fast and scalable

•treat full-size datasets

•complete pipeline

•web interface

•accessible to non-specialists

103 of 151

Peak-motifs: why providing yet another tool?

typical ChIP-seq dataset

Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012

size limit of other websites

fast and scalable
treat full-size datasets
using 4 complementary algorithms

Global over-representation

oligo-analysis
dyad-analysis (spaced motifs)

Positional bias

position-analysis
local-words

104 of 151

RSAT menu

1. Get sequences

2. Run the analysis

3. Visualization

Help: tutorials,

105 of 151

RSAT Web forms

Tool name

Tool description

Tool parameters

Output

Go button (launches the analysis)

Demo button (fill in the form for test purposes)

Help

106 of 151

Motif analysis

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

107 of 151

Motif discovery: frequency

108 of 151

Motif discovery: positional bias

109 of 151

Direct versus indirect binding

ChIP-seq does not necessarily reveal direct binding: The motif of the targeted TF is not always found in peaks!

Direct binding

Indirect binding

110 of 151

Annotating peaks

Annotations

111 of 151

Are peaks biased towards any genomic features?

How are the peaks distributed on the chromosomes?

Are there genomic features (promoters, intergenic, intronic, exonic regions) enriched in the peaks?

How are the peaks distributed compared to gene structures (TSS, TTS, introns, exons)?

How are they distributed compared to the genes?

112 of 151

Various tools available

ChIPseeker (Bioconductor) https://goo.gl/BemEsw
bedtools annotate : http://bedtools.readthedocs.io
HOMER annotatePeaks.pl

Warning : rely on the organism annotation and assembly version

=> not all organisms supported by all programs !

113 of 151

Which are the closest genes?

HOMER is a well-maintained suite of tools for functional genomics sequencing data sets. It can perform peak-calling and motif analysis, but we will use it for annotation of the peaks only.

114 of 151

What are the genes associated to the peaks ?

Are there some functional categories over-represented ?

ChIP-seq peaks

GO Molecular Function

GO Biological Process

Disease Ontology

Pathways

…

Genes

Ontology terms

115 of 151

Various tools available

These tools work with gene lists

GSEA: http://www.broadinstitute.org/gsea
gProfiler: http://biit.cs.ut.ee/gprofiler/gost
GSAn: https://gsan.labri.fr/
HOMER: http://homer.salk.edu/homer
DAVID: http://david.abcc.ncifcrf.gov

These tools work with regions (BED files)

EnrichR: http://amp.pharm.mssm.edu/Enrichr/enrich
LOLA (Bioconductor) https://goo.gl/pWDZEs
GREAT: http://great.stanford.edu/public/html/

116 of 151

Associate peaks to closest genes

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

117 of 151

Now that we have the genes,

Are there some functional categories over-represented ?

ChIP-seq peaks

GO Molecular Function

GO Biological Process

Disease Ontology

Pathways

…

Genes

Ontology terms

118 of 151

HOMER [Heinz et al., Mol Cell 2010]

Gene Ontology Analysis of Associated Genes: annotatePeaks go option

Genome Ontology: Looking for Enriched Genomic Annotations: annotatePeaks genomeOntology option

119 of 151

GREAT

Note: Only human (hg19), mouse (mm9, mm10) and zebrafish (danRer7) genomes are supported

120 of 151

GREAT

Note: Only human (hg19), mouse (mm9, mm10) and zebrafish (danRer7) genomes are supported

121 of 151

GREAT

Input

bed file with peaks

Output

Enriched GO terms and functions

122 of 151

Other analyses

Clustering peaks

(Deeptools, HOMER, seqMINER)

The darker the red the higher the read enrichment

Ye et al, 2011

123 of 151

Other analyses

ReMAP (http://tagc.univ-mrs.fr/remap/)

Is my peak dataset enriched for known TF peaks?

124 of 151

Identification of Super-Enhancers from ChIP-seq peaks with ROSE [Loven et al., Cell 2013]

https://bitbucket.org/young_computation/rose

Identify SE-associated genes

125 of 151

Conclusions

atelier Chip-Seq

École de bioinformatique AVIESAN-IFB 2019

126 of 151

Bilan du pipeline ChIP-seq

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

127 of 151

128 of 151

Beyond ChIP-seq : ChIP-exo

129 of 151

Beyond ChIP-seq

130 of 151

Beyond ChIP-seq : ChIP-nexus

131 of 151

Beyond ChIP-seq : native ChIP

132 of 151

Beyond ChIP-seq : native ChIP

133 of 151

Beyond ChIP-seq : low-input and single-cell

More recent proof-of-concept

Grosselin et al, Nature Genetics, 2019

134 of 151

Beyond ChIP-seq : Cut&TAG (2019)

135 of 151

Beyond ChIP-seq : Cut&TAG (2019)

136 of 151

ATAC-seq

Assay for Transposase-Accessible Chromatin with highthroughput sequencing

137 of 151

Chromatin accessibility assays

•Chromatin accessibility is the degree to which nuclear macromolecules are able to physically contact chromatinized DNA and is determined by the occupancy and topological organization of nucleosomes as well as other chromatin-binding factors that occlude access to DNA (Klemm et al, 2019)

•Open chromatin regions contains generally transcriptionally active genes

•The accessible genome comprises ~2–3% of total DNA sequence yet captures more than 90% of regions bound by TFs (Thurman et al, 2012)

•Chromatin accessibility is measured by quantifying the susceptibility of chromatin to either enzymatic methylation or cleavage of its constituent DNA

•Chromatin accessibility assays (non exhaustive list)

FAIRE-seq, DNAse-seq, MNAse-seq, ATAC-seq

138 of 151

Chromatin accessibility assays

ATAC-seq has become the most widely used method to detect and analyze open chromatin regions

139 of 151

ATAC-seq

•Buenrostro et al, 2013

•ATAC-seq is a method for determining chromatin accessibility across the genome

•Transcription factor binding sites and positions of nucleosomes can be identified from the analysis of ATAC-Seq

•Advantages of ATAC-seq over other chromatin accessibility assays:

•The sensitivity and specificity are comparable to DNase-seq but superior to FAIRE-seq

•Straightforward and rapidly implemented protocol. ATAC-seq libraries can be generated in less than 3 hours

•Low number of cells required (500 - 50,000 cells cells)

•single-cell ATAC-seq (scATAC-seq) protocol (Cusanovich et al, 2015)

140 of 151

ATAC-seq process

Sample processing

ATAC-seq

Data analysis

141 of 151

ATAC-seq

Sample processing

ATAC-seq

Data analysis

142 of 151

ATAC-seq

•ATAC-seq protocol utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions

•In a process called "tagmentation", Tn5 transposase cleaves and tags double-stranded DNA with sequencing adaptors

•No additional library prep is needed

•Expected results are enrichments of sequenced reads in open chromatin regions as closed chromatin regions, DNA regions bound by TFs or wrapped around nucleosomes should be protected from Tn5 cleavage activity.

143 of 151

ATAC-seq

• Paired-end sequencing so that by looking at the distance between the two reads of a pair, we know in which the chromatin environment (Nucleosome Free Region (NFR), around a mono, di,-nucleosome, around a TF) of the DNA fragment.

144 of 151

ATAC-seq process

Sample processing

ATAC-seq

Data analysis

145 of 151

Analysis of ATAC-seq data

• Overall analysis resemble ChIP-seq data analysis

• Description of particularities of ATAC-seq data analysis

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

mapping

Post alignment processing & QC

Peak detection

Peak annotation

146 of 151

Analysis of ATAC-seq data

• Some cleaning steps are required for ATAC-seq. For example:

> A large percentage of reads are derived from mitochondrial DNA. These reads are removed as mitochondrial genome is generally not of interest.

> Omni-ATAC (Corces et al, 2017)

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

mapping

Post alignment processing & QC

Peak detection

Peak annotation

147 of 151

Analysis of ATAC-seq data

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

mapping

Post alignment processing & QC

Peak detection

Peak annotation

ChIP-seq

ATAC-seq

Adapted parameters for peak calling (MACS2) : --shift 75 --extsize 150 --nomodel -B --SPMR --keep-dup all --call-summits

148 of 151

Analysis of ATAC-seq data

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

mapping

Post alignment processing & QC

Peak detection

Peak annotation

149 of 151

Footprinting analysis

• Tn5 cuts in open chromatin regions

• DNA is protected from cleavage at position of TF binding creating a “notch” in ATAC-seq signal

• Footprinting analysis identifies TF activities

> Height of the notch reflects TF activity

> Compare TF activity between different conditions

150 of 151

Footprinting analysis

• Volcano plots showing differential TF binding activity as predicted by TOBIAS footprinting analysis in ATAC-seq data of NKp, iNK and mNK from Shin et al. (c) iNK vs NKp; (d) mNK vs NKp; (e) mNK vs iNK.

• Each dot represents a TF

• TFs which activity is changing between the two compared developmental stages are colored (see color legend below volcano plots)

Figure: Zhang et al, 2021

Data: Shih et al, 2016

151 of 151

Atelier ChIP-seq: tour de table des données

Les questions qui pourraient moduler le pipeline d’analyse

Narrow peak ou broad peak ?

Paired-end ou single-end ?

Disponibilité du génome de référence (partie annotation) ?

Utilisation de spike-ins

Qualité de l’assemblage du génome ?

A	6	7	4	4	2	0	8	0	0	0	0	2	7	0	1	4
C	2	2	6	5	9	12	0	0	0	0	0	2	2	2	0	6
G	4	3	2	4	1	0	0	0	0	16	0	2	0	2	9	3
T	4	4	4	3	4	4	8	16	16	0	16	9	6	11	5	2

A	6	7	4	4	2	0	8	0	0	0	0	2	7	0	1	4
C	2	2	6	5	9	12	0	0	0	0	0	2	2	2	0	6
G	4	3	2	4	1	0	0	0	0	16	0	2	0	2	9	3
T	4	4	4	3	4	4	8	16	16	0	16	9	6	11	5	2

A	6	7	4	4	2	0	8	0	0	0	0	2	7	0	1	4
C	2	2	6	5	9	12	0	0	0	0	0	2	2	2	0	6
G	4	3	2	4	1	0	0	0	0	16	0	2	0	2	9	3
T	4	4	4	3	4	4	8	16	16	0	16	9	6	11	5	2