Atelier Chip-Seq
Elodie Darbo, INSERM U1218, Bordeaux
Stéphanie Le Gras, IGBMC, Strasbourg
Tao Ye, IGBMC, Strasbourg
Morgane Thomas-Chollier, IBENS, Paris
École de bioinformatique AVIESAN-IFB 2021
Contents
TIPS
Introduction
Chip-Seq analysis
Reads
Peaks
Chip-Seq analysis
Reads
Peaks
Annotations
Motifs
ChIP-seq
ChIP (=Chromatin Immuno-Precipitation)
differences in methods to detect the bound DNA
- small-scale: PCR / qPCR
- large-scale:
- microarray = ChIP-on-chip
- sequencing = ChIP-seq
Experimental design
ENCODE
See: https://www.encodeproject.org/about/experiment-guidelines/
Landt SG, Marinov GK, Kundaje A et al. (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research 22, 1813–1831.
Considerations on ChIP
Complexity in DNA fragments
Library prep
Size selection (200 or 400 bp)
PCR amplification
Single-end Sequencing Paired-end Sequencing
Ligation of Adapters
ChIP
Sequencing
Single end or paired end ?
Sequencing depth
Sequencing depth
Sequencing depth
Ex:
Controls
Why an Input is required ?
Closed
Closed
Open
Closed
Closed
Open
Closed
Closed
Open
Why an Input is required ?
Closed
Closed
Open
Closed
Closed
Open
Closed
Closed
Open
Other controls
�
Replicates
21
Data analysed in this course
Dataset used
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Protocol
Quality control of the reads
Quality control of the reads
28
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Mapping
Mapping
Ref. Genome
Reads
Mapping example
Human chromosomes
ATGCGATTA
?
?
?
Genomic coordinates
chromosome 5
ATGCGATTA
1
181,538,259
alignment
125,326,318 start
125,326,358 end
Genomic coordinate of the mapped read :
chr5 125326318 125326358 +
GPS
Mapping tool used: Bowtie
Duplicated genomic regions
?
Keep 1 position randomly
Keep all possible position
Keep none
3 possible alignments
Mappability
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Mapping: expected signal
TF
TF
TF
TF
Sens alignments
Rev/comp alignments
The binding site itself is
generally not sequenced !
TF
Mapping: the expected signal
H
H
H
H
Mapping: observed signal
Histone mark
(H3K4me1)
Trans. Factor
(ESR1)
Filtering mapped reads
Which reads to filter ?
Source of confusion
uniquely mapped reads = reads that “matches” only 1 region in the genome
duplicated reads = reads that “match” at the SAME location (same start, strand)
Uniquely mapped
multi-mapped
“Stack” of duplicated reads
PCR duplicates
QC : PBC (PCR Bottleneck Coefficient)
https://genome.ucsc.edu/ENCODE/qualityMetrics.html
✅
❌
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Quality Control on mapped reads
Assessing ChIP quality
Lorenz curve
Line of equality
Fraction of total income
100%
100%
“Ideal” world
Cumulative income of people from lowest
to highest income
Lorenz curve
Line of equality
Lorenz curve
Fraction of total income
100%
100%
“Ideal” world
“Trumpized” world
Cumulative income of people from lowest
to highest income
Lorenz curve
Line of equality
Lorenz curve
Fraction of total reads
100%
100%
“Ideal” world
(Bad ChIP)
“Trumpized” world
(Good ChIP)
Cumulative genome windows from lowest read counts to highest
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Strand cross-correlation
2
Coverage windows
4
6
4
2
0
0
2
4
6
4
2
0
0
Strand cross-correlation
for each window and various d values
r = 1
d
Mean cross-correlation
0
100
d
Strand cross-correlation
Strand cross-correlation
NSC: normalized strand coefficient
Relative strand correlation (RSC)
NSC ≥ 1.05 is recommended� �
RSC ≥ 0.8 is recommended�
Visualization: computing a genomic coverage file
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Bam files are fat
Coverage file and read extension
wi
wi+1
wi+2
wi+3
wi+4
15
6
20
14
5
Window
Coverage
Library size normalization
IP condition A
input condition A
IP condition B
input condition B
IP vs input
normalization
IP vs input
normalization
Inter-condition
normalization
Library size normalization (input vs IP)
Library size normalization (input vs IP)
Library size normalization (input vs IP)
Library size normalization
Example of PeakSeq
All peaks
Signal peaks removed
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Peak Calling
Reads
Peaks
From reads to peaks
From reads to peaks
Peak callers
Sharp peaks
Broad peaks
A variety of peak callers
MACS [Zhang et al, 2008]
1. Modeling the shift size of ChIP-Seq tags
MACS [Zhang et al, 2008]
2. Peak detection
(d value is the model obtained
in step 1)
Pepke et al, 2009
MACS [Zhang et al, 2008]
MACS [Zhang et al, 2008]
3. Multiple testing correction (FDR)
FDR(p) =
# Negative peaks with p-value < p
# Selected peaks
FDR = 2/27 = 0.074
MACS in summary
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
How to deal with replicates
How to deal with replicates
Sample 1.a
Sample 1.b
Analyze samples separately and takes union or intersection of resulting peaks
Sample 1.a
Sample 1.b
Sample 1
Merge samples prior to the peak calling (e.g recommended by MACS) => “pooling”
IDR - Irreproducible Discovery Rate (ENCODE)
https://sites.google.com/site/anshulkundaje/projects/idr
IDR
(!) IDR doesn’t work on broad source data!
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Processing steps are over !
Let’s do Biology !!!
peaks
« see if you can find something in the data »
« see if you can find something in the data »
What is the biological question ?
Should drive all « downstream » analyses
Will take time
to « do it all » !!!
Discovering motifs in peaks
Annotations
Motifs
Biological concepts of transcriptional regulation
Transcription factor specificity
How do TF « know » where to bind DNA ?
TF recognize TFBS with specific DNA sequences
a given TF is able to bind DNA on TFBSs with different sequences
Binding specificity
From binding sites to binding motif�
Collection of binding sites �used to build the Sox2 matrix (TRANSFAC M01272 )
R15133 GCCCTCATTGTTATGC
R15201 AAACTCTTTGTTTGGA
R15231 TTCACCATTGTTCTAG
R15267 GACTCTATTGTCTCTG
R16367 GATATCTTTGTTTCTT
R17099 TGCACCTTTGTTATGC
R19276 AATTCCATTGTTATGA
R19367 AAACTCTTTGTTTGGA
R19510 ATGGACATTGTAATGC
R22342 AGGCCTTTTGTCCTGG
R22344 TGTGCTTTTGTNNNNN
R22359 CTCAACTTTGTAATTT
R22961 GCAGCCATTGTGATGC
R23679 CACCCCTTTGTTATGC
R25928 TTTTCTATTGTTTTTA
R27428 AAAGGCATTGTGTTTC
A | 6 | 7 | 4 | 4 | 2 | 0 | 8 | 0 | 0 | 0 | 0 | 2 | 7 | 0 | 1 | 4 |
C | 2 | 2 | 6 | 5 | 9 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 0 | 6 |
G | 4 | 3 | 2 | 4 | 1 | 0 | 0 | 0 | 0 | 16 | 0 | 2 | 0 | 2 | 9 | 3 |
T | 4 | 4 | 4 | 3 | 4 | 4 | 8 | 16 | 16 | 0 | 16 | 9 | 6 | 11 | 5 | 2 |
Position-specific scoring matrix (PSSM)
Sequence logo
A [24 54 59 0 65 71 4 24 9 ]
C [ 7 6 4 72 4 2 0 6 9 ]
G [31 7 0 2 0 1 1 38 55 ]
T [14 9 13 2 7 2 71 8 3 ]
>mm9_chr1_39249116_39251316_+
gagaggaagggggagaaagagggagggggagGGTGATAGGTAGCCAGGAG
CCAATGGGGGCGTTTTCCTTGTCCAGGCCACTTGCTGGAATGTGAGATGT
AGAATGACCCAAAGAGAGCTGCCAAGACAGAGCTCTGCCCCAGGAATTGA
ACTCAAAGGGTGTCAGAAAGCAGGTGGCCTTTGTGCACCTGGCGCGGGGA
CGTGGCTCCCCTCTTCCGGCTGGTCTAGCCAGGtgcctgcctgcctgcct
gccGTGATCTCTGGACGCCAGTAGAGGGTTGTTGTGGGTTTGGGTGAAAC
ACGCCACCCCTGAGCTCTTCCGCGGGGCTAGCAATCTCCCCATCACCCCA
TTCGCGCTCAGAACCCCCTCAGCGAGTCTAACAGCAGGCCTGGTTCCCCG
Discovered motif
ChIP-seq peaks
DNA sequence
Motif logo
De novo motif discovery
De novo motif discovery
(No prior knowledge of the motif to look for)
Some motif discovery tools
New approaches for ChIP-seq datasets
De novo motif discovery
target gene
cis-regulatory
elements
target gene
target gene
Case 1: promoters of co-expressed genes
TF binding site
Case 2: ChIP-seq peaks
binding motif �(represented as a� sequence logo)
Regulatory sequence Analysis Tools (rsat.eu)
243 Fungi
9638 Bacteria + Archaea
186 Protists
92 Metazoa
39 Plants
Peak-motifs
•fast and scalable
•treat full-size datasets
•complete pipeline
•web interface
•accessible to non-specialists
Peak-motifs: why providing yet another tool?
typical ChIP-seq dataset
1h
Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012
size limit of other websites
RSAT menu
1. Get sequences
2. Run the analysis
3. Visualization
Help: tutorials,
RSAT Web forms
Tool name
Tool description
Tool parameters
Output
Go button (launches the analysis)
Demo button (fill in the form for test purposes)
Help
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Motif discovery: frequency
Motif discovery: positional bias
Direct versus indirect binding
ChIP-seq does not necessarily reveal direct binding: The motif of the targeted TF is not always found in peaks!
Direct binding
Indirect binding
Annotating peaks
Annotations
Are peaks biased towards any genomic features?
Various tools available
Warning : rely on the organism annotation and assembly version
=> not all organisms supported by all programs !
Which are the closest genes?
HOMER is a well-maintained suite of tools for functional genomics sequencing data sets. It can perform peak-calling and motif analysis, but we will use it for annotation of the peaks only.
What are the genes associated to the peaks ?
Are there some functional categories over-represented ?
ChIP-seq peaks
GO Molecular Function
GO Biological Process
Disease Ontology
Pathways
…
Genes
Ontology terms
Various tools available
These tools work with gene lists
These tools work with regions (BED files)
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Protocol
Now that we have the genes,
Are there some functional categories over-represented ?
ChIP-seq peaks
GO Molecular Function
GO Biological Process
Disease Ontology
Pathways
…
Genes
Ontology terms
HOMER [Heinz et al., Mol Cell 2010]
Gene Ontology Analysis of Associated Genes: annotatePeaks go option
Genome Ontology: Looking for Enriched Genomic Annotations: annotatePeaks genomeOntology option
GREAT
Note: Only human (hg19), mouse (mm9, mm10) and zebrafish (danRer7) genomes are supported
GREAT
Note: Only human (hg19), mouse (mm9, mm10) and zebrafish (danRer7) genomes are supported
GREAT
Other analyses
(Deeptools, HOMER, seqMINER)
The darker the red the higher the read enrichment
Ye et al, 2011
Other analyses
Identification of Super-Enhancers from ChIP-seq peaks with ROSE [Loven et al., Cell 2013]
https://bitbucket.org/young_computation/rose
Identify SE-associated genes
Conclusions
atelier Chip-Seq
École de bioinformatique AVIESAN-IFB 2019
Bilan du pipeline ChIP-seq
Quality control
Mapping (bam)
Filtering (bam)
Annotation
Motif discovery
Visualization
Peak calling (bed)
Quality Control
Raw Data (fastq)
Beyond ChIP-seq : ChIP-exo
Beyond ChIP-seq
Beyond ChIP-seq : ChIP-nexus
Beyond ChIP-seq : native ChIP
Beyond ChIP-seq : native ChIP
Beyond ChIP-seq : low-input and single-cell
More recent proof-of-concept
Grosselin et al, Nature Genetics, 2019
Beyond ChIP-seq : Cut&TAG (2019)
Beyond ChIP-seq : Cut&TAG (2019)
ATAC-seq
Assay for Transposase-Accessible Chromatin with highthroughput sequencing
Chromatin accessibility assays
•Chromatin accessibility is the degree to which nuclear macromolecules are able to physically contact chromatinized DNA and is determined by the occupancy and topological organization of nucleosomes as well as other chromatin-binding factors that occlude access to DNA (Klemm et al, 2019)
•Open chromatin regions contains generally transcriptionally active genes
•The accessible genome comprises ~2–3% of total DNA sequence yet captures more than 90% of regions bound by TFs (Thurman et al, 2012)
•Chromatin accessibility is measured by quantifying the susceptibility of chromatin to either enzymatic methylation or cleavage of its constituent DNA
•Chromatin accessibility assays (non exhaustive list)
FAIRE-seq, DNAse-seq, MNAse-seq, ATAC-seq
Chromatin accessibility assays
ATAC-seq has become the most widely used method to detect and analyze open chromatin regions
ATAC-seq
•Buenrostro et al, 2013
•ATAC-seq is a method for determining chromatin accessibility across the genome
•Transcription factor binding sites and positions of nucleosomes can be identified from the analysis of ATAC-Seq
•Advantages of ATAC-seq over other chromatin accessibility assays:
•The sensitivity and specificity are comparable to DNase-seq but superior to FAIRE-seq
•Straightforward and rapidly implemented protocol. ATAC-seq libraries can be generated in less than 3 hours
•Low number of cells required (500 - 50,000 cells cells)
•single-cell ATAC-seq (scATAC-seq) protocol (Cusanovich et al, 2015)
ATAC-seq process
Sample processing
ATAC-seq
Data analysis
ATAC-seq
Sample processing
ATAC-seq
Data analysis
ATAC-seq
•ATAC-seq protocol utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions
•In a process called "tagmentation", Tn5 transposase cleaves and tags double-stranded DNA with sequencing adaptors
•No additional library prep is needed
•Expected results are enrichments of sequenced reads in open chromatin regions as closed chromatin regions, DNA regions bound by TFs or wrapped around nucleosomes should be protected from Tn5 cleavage activity.
ATAC-seq
• Paired-end sequencing so that by looking at the distance between the two reads of a pair, we know in which the chromatin environment (Nucleosome Free Region (NFR), around a mono, di,-nucleosome, around a TF) of the DNA fragment.
ATAC-seq process
Sample processing
ATAC-seq
Data analysis
Analysis of ATAC-seq data
• Overall analysis resemble ChIP-seq data analysis
• Description of particularities of ATAC-seq data analysis
Advanced analysis
Motif enrichment analysis
Differential analysis
Meta profiles /
Clustering
Footprinting analysis
Nucleosome positioning
...
QC
mapping
Post alignment processing & QC
Peak detection
Peak annotation
Analysis of ATAC-seq data
• Some cleaning steps are required for ATAC-seq. For example:
> A large percentage of reads are derived from mitochondrial DNA. These reads are removed as mitochondrial genome is generally not of interest.
> Omni-ATAC (Corces et al, 2017)
Advanced analysis
Motif enrichment analysis
Differential analysis
Meta profiles /
Clustering
Footprinting analysis
Nucleosome positioning
...
QC
mapping
Post alignment processing & QC
Peak detection
Peak annotation
Analysis of ATAC-seq data
Advanced analysis
Motif enrichment analysis
Differential analysis
Meta profiles /
Clustering
Footprinting analysis
Nucleosome positioning
...
QC
mapping
Post alignment processing & QC
Peak detection
Peak annotation
ChIP-seq
ATAC-seq
Adapted parameters for peak calling (MACS2) : --shift 75 --extsize 150 --nomodel -B --SPMR --keep-dup all --call-summits
Analysis of ATAC-seq data
Advanced analysis
Motif enrichment analysis
Differential analysis
Meta profiles /
Clustering
Footprinting analysis
Nucleosome positioning
...
QC
mapping
Post alignment processing & QC
Peak detection
Peak annotation
Footprinting analysis
• Tn5 cuts in open chromatin regions
• DNA is protected from cleavage at position of TF binding creating a “notch” in ATAC-seq signal
• Footprinting analysis identifies TF activities
> Height of the notch reflects TF activity
> Compare TF activity between different conditions
Footprinting analysis
• Volcano plots showing differential TF binding activity as predicted by TOBIAS footprinting analysis in ATAC-seq data of NKp, iNK and mNK from Shin et al. (c) iNK vs NKp; (d) mNK vs NKp; (e) mNK vs iNK.
• Each dot represents a TF
• TFs which activity is changing between the two compared developmental stages are colored (see color legend below volcano plots)
Figure: Zhang et al, 2021
Data: Shih et al, 2016
Atelier ChIP-seq: tour de table des données
Les questions qui pourraient moduler le pipeline d’analyse
Narrow peak ou broad peak ?
Paired-end ou single-end ?
Disponibilité du génome de référence (partie annotation) ?
Utilisation de spike-ins
Qualité de l’assemblage du génome ?