1 of 151

Atelier Chip-Seq

Elodie Darbo, INSERM U1218, Bordeaux

Stéphanie Le Gras, IGBMC, Strasbourg

Tao Ye, IGBMC, Strasbourg

Morgane Thomas-Chollier, IBENS, Paris

École de bioinformatique AVIESAN-IFB 2021

2 of 151

Contents

3 of 151

TIPS

  • Keep track of all command lines you run. You can for example, create a text file in which you write every commands you run.
  • Give content-explicit names to the files you’re generating.
  • Give to files the right extension.
  • Create directories with explicit names!!
  • Compress big files (with gzip for instance).

4 of 151

Introduction

5 of 151

Chip-Seq analysis

  • Experimental design, Quality Controls, Mapping

  • Normalization & peak calling

Reads

Peaks

6 of 151

Chip-Seq analysis

  • Experimental design, Quality Controls, Mapping

  • Normalization & peak calling

  • Motif analysis

  • Peak annotation

Reads

Peaks

Annotations

Motifs

7 of 151

ChIP-seq

ChIP (=Chromatin Immuno-Precipitation)

differences in methods to detect the bound DNA

- small-scale: PCR / qPCR

- large-scale:

- microarray = ChIP-on-chip

- sequencing = ChIP-seq

8 of 151

Experimental design

9 of 151

ENCODE

  • The Encyclopedia of DNA Elements (ENCODE) Consortium has carried out thousands of ChIP–seq experiments and has used this experience to develop a set of working standards and guidelines

See: https://www.encodeproject.org/about/experiment-guidelines/

Landt SG, Marinov GK, Kundaje A et al. (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research 22, 1813–1831.

10 of 151

Considerations on ChIP

  • Antibody
    • Antibody quality varies, even between independently prepared lots of the same antibody (Egelhofer, T. A. et al. 2011)
  • Number of cells
    • large number of cells are required for a ChIP experiment (limitation for small organisms or precious samples)
  • Shearing of DNA (Mnase I, sonication, Covaris): trying to narrow down the size distribution of DNA fragments

Complexity in DNA fragments

11 of 151

Library prep

Size selection (200 or 400 bp)

PCR amplification

Single-end Sequencing Paired-end Sequencing

Ligation of Adapters

ChIP

  • Step between ChIP and sequencing
  • Starting material: ChIP sample (1-10ng of sheared DNA)

12 of 151

Sequencing

  • Sequencer : Illumina HiSeq 4000
  • No. of reads per sample:
    • (HiSeq 2500) 4 samples per lane :~41 millions per sample
    • (HiSeq 4000) 8 samples per lane :~43 millions per sample
  • Length of DNA fragment : ~200bp
  • No. of cycle per run : 50

13 of 151

Single end or paired end ?

  • Single end (most of the time until 2016)
  • Paired-end (more and more these days)
    • Improve identification of duplicated reads
    • Better estimation of the fragment size distribution
    • Increase the mapping efficiency to repeat regions
    • The price! But 2 x 40bp is affordable

14 of 151

Sequencing depth

  • Consider the depth needed depending on:
    • Chipped protein

15 of 151

Sequencing depth

  • Consider the depth needed depending on:
    • Chipped protein
    • Number of expected binding sites

16 of 151

Sequencing depth

  • Consider the depth needed depending on:
    • Chipped protein
    • Number of expected binding sites
    • Size of the genome of interest

Ex:

      • For human genomes
        • 20 million uniquely mapped read sequences for point-source peaks,
        • 40 million for broad-source peaks.
      • For fly genome: 8 million reads
      • For worm genome: 10 million reads

17 of 151

Controls

  • Used mostly to filter out false positives (high level of noise)
    • Idea: potential false positive will be enriched in both treatment and control.
  • A control will fail to filter out false positives if its enrichment profile is very different from the enrichment profile of false positive regions in the treatment sample
  • Most commonly used control: Input DNA (a portion of DNA sample removed prior to IP)
  • Choice of control is extremely important
  • It is recommended to cover the control in a higher extent than the IPs

18 of 151

Why an Input is required ?

  • The input is used to model local noise level
    • Accessible regions are expected to produce more reads����
    • Amplified regions (CNV) are expected to produce more reads�������

Closed

Closed

Open

Closed

Closed

Open

Closed

Closed

Open

19 of 151

Why an Input is required ?

  • The input is used to model local noise level
    • Accessible regions are expected to produce more reads����
    • Amplified regions (CNV) are expected to produce more reads
    • Moreover, most peak callers are configured with an input as control �������

Closed

Closed

Open

Closed

Closed

Open

Closed

Closed

Open

20 of 151

Other controls

  • IgG (mock IP): controls for non-specific IP enrichment
    • Problem : low-complexity library (few reads)
  • Histone H3 (for H3 variants)
  • Uninduced condition (for inducible TFs)
    • Example : Glucocorticoid Réceptor
    • Induced by Dexamethasone (Dex)
    • Control vehicule = Ethanol (EthOH)
  • KO of your protein of interest
  • Non flagged cell lines
  • ...

21 of 151

Replicates

  • A minimum of two replicates should be carried out per experiment.
  • Each replicate should be a biological rather than a technical replicate; that is, it results from an independent cell culture, embryo pool or tissue sample.

21

22 of 151

Data analysed in this course

23 of 151

Dataset used

  • All experiments (GEO): GSE41187
  • Experiment: FNR IP ChIP-seq Anaerobic A (SRX189773 - SRR576933)
  • Control: anaerobic INPUT DNA (SRX189778 - SRR576938)

24 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

25 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

26 of 151

Protocol

27 of 151

Quality control of the reads

28 of 151

Quality control of the reads

  • As for any NGS datasets
  • FastQC program (c.f. Course “preprocessing” Denis Puthier Monday afternoon)

28

29 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

30 of 151

Mapping

31 of 151

Mapping

  • Find out the position of the reads within the reference genome

Ref. Genome

Reads

  • One position in the genome
  • Many possible positions (Repeat regions, duplicate regions, pseudogenes…)

32 of 151

Mapping example

Human chromosomes

ATGCGATTA

?

?

?

33 of 151

Genomic coordinates

chromosome 5

ATGCGATTA

1

181,538,259

alignment

125,326,318 start

125,326,358 end

Genomic coordinate of the mapped read :

chr5 125326318 125326358 +

GPS

34 of 151

Mapping tool used: Bowtie

  • (c.f. Course “mapping” Denis Puthier Monday afternoon)
  • Designed to align reads if:
    • many of the reads have at least one good, valid alignment,
    • many of the reads are relatively high-quality
    • the number of alignments reported per read is small (close to 1)

  • Langmead B. et al, Genome Biology 2009
  • Langmead B (2010) Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics Chapter 11: Unit 11 17

35 of 151

Duplicated genomic regions

?

Keep 1 position randomly

Keep all possible position

Keep none

3 possible alignments

36 of 151

Mappability

  • Mappability (a): how many times a read of a given length can align at a given position in the genome
    • a=1 (read align once)
    • a=1/n (read align n times)

37 of 151

  • Mapping the reads with Bowtie

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

38 of 151

Mapping: expected signal

  • For a transcription factor signal is expected to be sharp

TF

TF

TF

TF

Sens alignments

Rev/comp alignments

The binding site itself is

generally not sequenced !

TF

39 of 151

Mapping: the expected signal

  • For most histone marks the signal is expected to be broad
  • Asymmetry is less/not pronounced
  • Peak calling algorithms need to adapt to these various signals

H

H

H

H

40 of 151

Mapping: observed signal

Histone mark

(H3K4me1)

Trans. Factor

(ESR1)

41 of 151

Filtering mapped reads

42 of 151

Which reads to filter ?

  • Low-quality read alignments
    • Tool : samtools
  • Multi-mapped reads (unless removed during the mapping step)
    • Tool : samtools
  • Duplicated reads (PCR duplicates)
    • Tool : Picard MarkDuplicates

43 of 151

Source of confusion

uniquely mapped reads = reads that “matches” only 1 region in the genome

duplicated reads = reads that “match” at the SAME location (same start, strand)

Uniquely mapped

multi-mapped

“Stack” of duplicated reads

44 of 151

PCR duplicates

  • PCR duplicates
    • Related to poor library complexity
    • The same set of fragments are amplified, may indicates that immuno-precipitation failed
    • Tools to check for
      • FastQC report (duplicate diagram)
      • PCR bottleneck metric (ENCODE)

45 of 151

QC : PBC (PCR Bottleneck Coefficient)

  • An approximate measure of library complexity
  • PBC = N1/Nd
    • N1= Genomic position with 1 read aligned
    • Nd = Genomic position with ≧ 1 read aligned
  • Value :
    • 0-0.5: severe bottlenecking
    • 0.5-0.8: moderate bottlenecking
    • 0.8-0.9: mild bottlenecking
    • 0.9-1.0: no bottlenecking

https://genome.ucsc.edu/ENCODE/qualityMetrics.html

46 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

47 of 151

Quality Control on mapped reads

48 of 151

Assessing ChIP quality

  • Guidelines from ENCODE
  • Various metrics
    • Check duplicate rate (see previous Filtering section)
    • Use a Lorenz Curve (implemented in Deeptools fingerprint)
    • Look at strand cross-correlation (implemented in SPP BioC package and phantompeakqualtools)
    • Fraction of reads in peaks (FRiP, as proposed by ENCODE), but requires to find peaks.

49 of 151

Lorenz curve

  • Analyze income among workers by computing cumulative sum.
    • If uniform income distribution :
      • Straight line

Line of equality

Fraction of total income

100%

100%

“Ideal” world

Cumulative income of people from lowest

to highest income

50 of 151

Lorenz curve

  • Analyze income among workers by computing cumulative sum.
    • If uniform income distribution :
      • Straight line
    • If they were trumpized
      • Lorenz curve

Line of equality

Lorenz curve

Fraction of total income

100%

100%

“Ideal” world

“Trumpized” world

Cumulative income of people from lowest

to highest income

51 of 151

Lorenz curve

  • Analyze income among workers by computing cumulative sum.
    • If uniform income distribution :
      • Straight line
    • If they were trumpized
      • Lorenz curve
  • Here the workers are the genome windows and incomes are reads

Line of equality

Lorenz curve

Fraction of total reads

100%

100%

“Ideal” world

(Bad ChIP)

“Trumpized” world

(Good ChIP)

Cumulative genome windows from lowest read counts to highest

52 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

53 of 151

Strand cross-correlation

  • Compute strand cross correlation for each window w across the genome.
  • Use various distance d and compute the mean cross-correlation observed

2

Coverage windows

4

6

4

2

0

0

2

4

6

4

2

0

0

Strand cross-correlation

for each window and various d values

r = 1

d

Mean cross-correlation

0

100

d

54 of 151

55 of 151

Strand cross-correlation

56 of 151

Strand cross-correlation

NSC: normalized strand coefficient

Relative strand correlation (RSC)

NSC ≥ 1.05 is recommended� �

RSC ≥ 0.8 is recommended�

57 of 151

Visualization: computing a genomic coverage file

58 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

59 of 151

Bam files are fat

  • BAM files are fat as they do contain exhaustive information about read alignments
    • Memory issues (can only visualize fraction of the BAM)�
  • Need a more lightweight file format containing only genomic coverage information:
    • ❌ Wig (not compressed, not indexed)
    • ✅ TDF (compressed, indexed)
    • ✅ BigWig (compressed, indexed)

60 of 151

Coverage file and read extension

  • BAM files do not contain fragment location but read location
  • We need to extend reads to compute fragments coordinates before coverage analysis
  • Not required for PE

wi

wi+1

wi+2

wi+3

wi+4

15

6

20

14

5

Window

Coverage

61 of 151

Library size normalization

IP condition A

input condition A

IP condition B

input condition B

IP vs input

normalization

IP vs input

normalization

Inter-condition

normalization

62 of 151

Library size normalization (input vs IP)

63 of 151

Library size normalization (input vs IP)

64 of 151

Library size normalization (input vs IP)

65 of 151

Library size normalization

Example of PeakSeq

All peaks

Signal peaks removed

66 of 151

  • Viewing scaled data

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

67 of 151

Peak Calling

Reads

Peaks

68 of 151

From reads to peaks

  • Chip-seq peaks are a mixture of two signals:
    • + strand reads (Watson)
    • - strand reads (Cricks)
  • The sequence read density accumulates on forward and reverse strands centered around the binding site

69 of 151

From reads to peaks

  • Get the signal at the right position
    • Read shift
    • Extension
  • Estimate the fragment size
  • Do paired-end

70 of 151

Peak callers

  • The peak caller should be chosen based on
    • Experimental design
      • SE or PE (E.g MACS1.4 vs MACS2)
    • Expected signal
      • Sharp peaks (e.g. Transcription Factors).
        • E.g. MACS
      • Broad peaks (e.g. epigenetic marks).
        • E.g MACS, SICER,...

Sharp peaks

Broad peaks

71 of 151

A variety of peak callers

  • 60 programs listed on OMICTOOLS
  • Most support a control

72 of 151

MACS [Zhang et al, 2008]

1. Modeling the shift size of ChIP-Seq tags

  • slides 2*bandwidth windows across the genome to find regions with tags more than mfold enriched relative to a random tag genome distribution
  • randomly samples 1,000 of these highly enriched regions
  • separates their + and - reads, and aligns them by the midpoint between their + and - read centers
  • define d as the distance in bp between the summit of the two distribution

73 of 151

MACS [Zhang et al, 2008]

2. Peak detection

  • Scales the total Input read count to be the same as the total ChIP read count
  • Duplicate read removal
  • Reads are shifted by d/2

(d value is the model obtained

in step 1)

Pepke et al, 2009

74 of 151

MACS [Zhang et al, 2008]

  • Slides 2d windows across the genome to find candidate peaks with a significant read enrichment (Poisson distribution p-value based on λBG, default 10-5)
  • Estimate parameter λlocal of Poisson distribution
  • Keep peaks significant under λBG and λlocal and with p-value < threshold

75 of 151

MACS [Zhang et al, 2008]

3. Multiple testing correction (FDR)

  • Swap treatment and input and call negative peaks
  • Take all the peaks (neg + pos) and sort them by increasing p-values

FDR(p) =

# Negative peaks with p-value < p

# Selected peaks

FDR = 2/27 = 0.074

76 of 151

MACS in summary

  • Step 1 : search for candidate regions that look like good peaks, to produce a fine-tuned model of the peaks (d value) to search in Step 2
  • Step 2 : actual peak calling
    • sIiding window length = 2*d
    • In each window : test if the region is a peak, by comparing the number of reads in the treatment and the expected number of reads
    • Comparison is based on a statistical test with a Poisson distribution, keeping only regions with p-value < threshold
  • Step 3 : correction for multiple testing (many windows were tested), calculation of FDR

77 of 151

  • Peak calling with MACS

(stop after step 3)

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

78 of 151

How to deal with replicates

79 of 151

How to deal with replicates

Sample 1.a

Sample 1.b

Analyze samples separately and takes union or intersection of resulting peaks

Sample 1.a

Sample 1.b

Sample 1

Merge samples prior to the peak calling (e.g recommended by MACS) => “pooling”

80 of 151

IDR - Irreproducible Discovery Rate (ENCODE)

  • Measures consistency between replicates
  • Uses reproducibility in score rankings between peaks in each replicate to determine an optimal cutoff for significance.
  • Idea:
    • The most significant peaks are expected to have high consistency between replicates
    • The peaks with low significance are expected to have low consistency

https://sites.google.com/site/anshulkundaje/projects/idr

81 of 151

IDR

(!) IDR doesn’t work on broad source data!

82 of 151

  • Combine replicate (IDR)

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

83 of 151

Processing steps are over !

Let’s do Biology !!!

peaks

84 of 151

85 of 151

« see if you can find something in the data »

86 of 151

« see if you can find something in the data »

87 of 151

What is the biological question ?

  • Where do a transcription factor (TF) bind ?

  • How do a transcription factor (TF) bind ?
    • Which binding motif(s) (can be several for a given TF !!)
    • Is the binding direct to DNA or via protein-protein interactions ?
    • Are there cofactors (maybe affecting the motif !!), and if so, identify them

  • Which regulated genes are directly regulated by a given TF ?

  • Where are the promoters (PolII) and chromatin marks ?

88 of 151

Should drive all « downstream » analyses

Will take time

to « do it all » !!!

89 of 151

  • cell biology (eg: luciferase assay) ?
  • in vitro assays (eg: EMSA) ?
  • Proteomic (eg: mass spectrometry) ?
  • Transgenics ?
  • Will depend on
    • the organism
    • available infrastructure

90 of 151

Discovering motifs in peaks

Annotations

Motifs

91 of 151

Biological concepts of transcriptional regulation

92 of 151

Transcription factor specificity

How do TF « know » where to bind DNA ?

TF recognize TFBS with specific DNA sequences

a given TF is able to bind DNA on TFBSs with different sequences

93 of 151

Binding specificity

94 of 151

From binding sites to binding motif�

Collection of binding sites �used to build the Sox2 matrix (TRANSFAC M01272 )

R15133 GCCCTCATTGTTATGC

R15201 AAACTCTTTGTTTGGA

R15231 TTCACCATTGTTCTAG

R15267 GACTCTATTGTCTCTG

R16367 GATATCTTTGTTTCTT

R17099 TGCACCTTTGTTATGC

R19276 AATTCCATTGTTATGA

R19367 AAACTCTTTGTTTGGA

R19510 ATGGACATTGTAATGC

R22342 AGGCCTTTTGTCCTGG

R22344 TGTGCTTTTGTNNNNN

R22359 CTCAACTTTGTAATTT

R22961 GCAGCCATTGTGATGC

R23679 CACCCCTTTGTTATGC

R25928 TTTTCTATTGTTTTTA

R27428 AAAGGCATTGTGTTTC

A

6

7

4

4

2

0

8

0

0

0

0

2

7

0

1

4

C

2

2

6

5

9

12

0

0

0

0

0

2

2

2

0

6

G

4

3

2

4

1

0

0

0

0

16

0

2

0

2

9

3

T

4

4

4

3

4

4

8

16

16

0

16

9

6

11

5

2

Position-specific scoring matrix (PSSM)

Sequence logo

95 of 151

A [24 54 59 0 65 71 4 24 9 ]

C [ 7 6 4 72 4 2 0 6 9 ]

G [31 7 0 2 0 1 1 38 55 ]

T [14 9 13 2 7 2 71 8 3 ]

>mm9_chr1_39249116_39251316_+

gagaggaagggggagaaagagggagggggagGGTGATAGGTAGCCAGGAG

CCAATGGGGGCGTTTTCCTTGTCCAGGCCACTTGCTGGAATGTGAGATGT

AGAATGACCCAAAGAGAGCTGCCAAGACAGAGCTCTGCCCCAGGAATTGA

ACTCAAAGGGTGTCAGAAAGCAGGTGGCCTTTGTGCACCTGGCGCGGGGA

CGTGGCTCCCCTCTTCCGGCTGGTCTAGCCAGGtgcctgcctgcctgcct

gccGTGATCTCTGGACGCCAGTAGAGGGTTGTTGTGGGTTTGGGTGAAAC

ACGCCACCCCTGAGCTCTTCCGCGGGGCTAGCAATCTCCCCATCACCCCA

TTCGCGCTCAGAACCCCCTCAGCGAGTCTAACAGCAGGCCTGGTTCCCCG

Discovered motif

ChIP-seq peaks

DNA sequence

Motif logo

96 of 151

De novo motif discovery

97 of 151

De novo motif discovery

  • Find exceptional motifs based on the sequence only

(No prior knowledge of the motif to look for)

  • Criteria of exceptionality:
    • Over-/under-representation: higher/lower frequency than expected by chance
    • Position bias: concentration at specific positions relative to some reference coordinates (e.g. TSS, peak center, …).

98 of 151

Some motif discovery tools

  • MEME (Bailey et al., 1994)
  • RSAT oligo-analysis (van Helden et al., 1998)
  • AlignACE (Roth et al. 1998)
  • RSAT position-analysis (van Helden et al., 2000)
  • Weeder (Pavesi et al. 2001)
  • MotifSampler (Thijs et al., 2001)
  • … many others

99 of 151

New approaches for ChIP-seq datasets

100 of 151

De novo motif discovery

target gene

cis-regulatory

elements

target gene

target gene

Case 1: promoters of co-expressed genes

TF binding site

Case 2: ChIP-seq peaks

binding motif �(represented as a� sequence logo)

101 of 151

Regulatory sequence Analysis Tools (rsat.eu)

243 Fungi

9638 Bacteria + Archaea

186 Protists

92 Metazoa

39 Plants

102 of 151

Peak-motifs

fast and scalable

treat full-size datasets

complete pipeline

web interface

accessible to non-specialists

103 of 151

Peak-motifs: why providing yet another tool?

typical ChIP-seq dataset

1h

Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012

size limit of other websites

  • fast and scalable
  • treat full-size datasets
  • using 4 complementary algorithms
    • Global over-representation
        • oligo-analysis
        • dyad-analysis (spaced motifs)
    • Positional bias
        • position-analysis
        • local-words

104 of 151

RSAT menu

1. Get sequences

2. Run the analysis

3. Visualization

Help: tutorials,

105 of 151

RSAT Web forms

Tool name

Tool description

Tool parameters

Output

Go button (launches the analysis)

Demo button (fill in the form for test purposes)

Help

106 of 151

  • Motif analysis

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

107 of 151

Motif discovery: frequency

108 of 151

Motif discovery: positional bias

109 of 151

Direct versus indirect binding

ChIP-seq does not necessarily reveal direct binding: The motif of the targeted TF is not always found in peaks!

Direct binding

Indirect binding

110 of 151

Annotating peaks

Annotations

111 of 151

Are peaks biased towards any genomic features?

  • How are the peaks distributed on the chromosomes?

  • Are there genomic features (promoters, intergenic, intronic, exonic regions) enriched in the peaks?

  • How are the peaks distributed compared to gene structures (TSS, TTS, introns, exons)?

  • How are they distributed compared to the genes?

112 of 151

Various tools available

  • ChIPseeker (Bioconductor) https://goo.gl/BemEsw
  • bedtools annotate : http://bedtools.readthedocs.io
  • HOMER annotatePeaks.pl

Warning : rely on the organism annotation and assembly version

=> not all organisms supported by all programs !

113 of 151

Which are the closest genes?

HOMER is a well-maintained suite of tools for functional genomics sequencing data sets. It can perform peak-calling and motif analysis, but we will use it for annotation of the peaks only.

114 of 151

What are the genes associated to the peaks ?

Are there some functional categories over-represented ?

ChIP-seq peaks

GO Molecular Function

GO Biological Process

Disease Ontology

Pathways

Genes

Ontology terms

115 of 151

Various tools available

These tools work with gene lists

These tools work with regions (BED files)

116 of 151

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

Protocol

117 of 151

Now that we have the genes,

Are there some functional categories over-represented ?

ChIP-seq peaks

GO Molecular Function

GO Biological Process

Disease Ontology

Pathways

Genes

Ontology terms

118 of 151

HOMER [Heinz et al., Mol Cell 2010]

Gene Ontology Analysis of Associated Genes: annotatePeaks go option

Genome Ontology: Looking for Enriched Genomic Annotations: annotatePeaks genomeOntology option

119 of 151

GREAT

Note: Only human (hg19), mouse (mm9, mm10) and zebrafish (danRer7) genomes are supported

120 of 151

GREAT

Note: Only human (hg19), mouse (mm9, mm10) and zebrafish (danRer7) genomes are supported

121 of 151

GREAT

  • Input
    • bed file with peaks
  • Output
    • Enriched GO terms and functions

122 of 151

Other analyses

  • Clustering peaks

(Deeptools, HOMER, seqMINER)

The darker the red the higher the read enrichment

Ye et al, 2011

123 of 151

Other analyses

  • ReMAP (http://tagc.univ-mrs.fr/remap/)
    • Is my peak dataset enriched for known TF peaks?

124 of 151

Identification of Super-Enhancers from ChIP-seq peaks with ROSE [Loven et al., Cell 2013]

https://bitbucket.org/young_computation/rose

Identify SE-associated genes

125 of 151

Conclusions

atelier Chip-Seq

École de bioinformatique AVIESAN-IFB 2019

126 of 151

Bilan du pipeline ChIP-seq

Quality control

Mapping (bam)

Filtering (bam)

Annotation

Motif discovery

Visualization

Peak calling (bed)

Quality Control

Raw Data (fastq)

127 of 151

128 of 151

Beyond ChIP-seq : ChIP-exo

129 of 151

Beyond ChIP-seq

130 of 151

Beyond ChIP-seq : ChIP-nexus

131 of 151

Beyond ChIP-seq : native ChIP

132 of 151

Beyond ChIP-seq : native ChIP

133 of 151

Beyond ChIP-seq : low-input and single-cell

More recent proof-of-concept

Grosselin et al, Nature Genetics, 2019

134 of 151

Beyond ChIP-seq : Cut&TAG (2019)

135 of 151

Beyond ChIP-seq : Cut&TAG (2019)

136 of 151

ATAC-seq

Assay for Transposase-Accessible Chromatin with highthroughput sequencing

137 of 151

Chromatin accessibility assays

Chromatin accessibility is the degree to which nuclear macromolecules are able to physically contact chromatinized DNA and is determined by the occupancy and topological organization of nucleosomes as well as other chromatin-binding factors that occlude access to DNA (Klemm et al, 2019)

Open chromatin regions contains generally transcriptionally active genes

The accessible genome comprises ~2–3% of total DNA sequence yet captures more than 90% of regions bound by TFs (Thurman et al, 2012)

Chromatin accessibility is measured by quantifying the susceptibility of chromatin to either enzymatic methylation or cleavage of its constituent DNA

Chromatin accessibility assays (non exhaustive list)

FAIRE-seq, DNAse-seq, MNAse-seq, ATAC-seq

138 of 151

Chromatin accessibility assays

ATAC-seq has become the most widely used method to detect and analyze open chromatin regions

139 of 151

ATAC-seq

Buenrostro et al, 2013

ATAC-seq is a method for determining chromatin accessibility across the genome

Transcription factor binding sites and positions of nucleosomes can be identified from the analysis of ATAC-Seq

Advantages of ATAC-seq over other chromatin accessibility assays:

The sensitivity and specificity are comparable to DNase-seq but superior to FAIRE-seq

Straightforward and rapidly implemented protocol. ATAC-seq libraries can be generated in less than 3 hours

Low number of cells required (500 - 50,000 cells cells)

single-cell ATAC-seq (scATAC-seq) protocol (Cusanovich et al, 2015)

140 of 151

ATAC-seq process

Sample processing

ATAC-seq

Data analysis

141 of 151

ATAC-seq

Sample processing

ATAC-seq

Data analysis

142 of 151

ATAC-seq

ATAC-seq protocol utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions

In a process called "tagmentation", Tn5 transposase cleaves and tags double-stranded DNA with sequencing adaptors

No additional library prep is needed

Expected results are enrichments of sequenced reads in open chromatin regions as closed chromatin regions, DNA regions bound by TFs or wrapped around nucleosomes should be protected from Tn5 cleavage activity.

143 of 151

ATAC-seq

Paired-end sequencing so that by looking at the distance between the two reads of a pair, we know in which the chromatin environment (Nucleosome Free Region (NFR), around a mono, di,-nucleosome, around a TF) of the DNA fragment.

144 of 151

ATAC-seq process

Sample processing

ATAC-seq

Data analysis

145 of 151

Analysis of ATAC-seq data

Overall analysis resemble ChIP-seq data analysis

Description of particularities of ATAC-seq data analysis

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

QC

mapping

Post alignment processing & QC

Peak detection

Peak annotation

146 of 151

Analysis of ATAC-seq data

• Some cleaning steps are required for ATAC-seq. For example:

> A large percentage of reads are derived from mitochondrial DNA. These reads are removed as mitochondrial genome is generally not of interest.

> Omni-ATAC (Corces et al, 2017)

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

QC

mapping

Post alignment processing & QC

Peak detection

Peak annotation

147 of 151

Analysis of ATAC-seq data

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

QC

mapping

Post alignment processing & QC

Peak detection

Peak annotation

ChIP-seq

ATAC-seq

Adapted parameters for peak calling (MACS2) : --shift 75 --extsize 150 --nomodel -B --SPMR --keep-dup all --call-summits

148 of 151

Analysis of ATAC-seq data

Advanced analysis

Motif enrichment analysis

Differential analysis

Meta profiles /

Clustering

Footprinting analysis

Nucleosome positioning

...

QC

mapping

Post alignment processing & QC

Peak detection

Peak annotation

149 of 151

Footprinting analysis

Tn5 cuts in open chromatin regions

DNA is protected from cleavage at position of TF binding creating a “notch” in ATAC-seq signal

Footprinting analysis identifies TF activities

> Height of the notch reflects TF activity

> Compare TF activity between different conditions

150 of 151

Footprinting analysis

Volcano plots showing differential TF binding activity as predicted by TOBIAS footprinting analysis in ATAC-seq data of NKp, iNK and mNK from Shin et al. (c) iNK vs NKp; (d) mNK vs NKp; (e) mNK vs iNK.

Each dot represents a TF

TFs which activity is changing between the two compared developmental stages are colored (see color legend below volcano plots)

Figure: Zhang et al, 2021

Data: Shih et al, 2016

151 of 151

Atelier ChIP-seq: tour de table des données

Les questions qui pourraient moduler le pipeline d’analyse

Narrow peak ou broad peak ?

Paired-end ou single-end ?

Disponibilité du génome de référence (partie annotation) ?

Utilisation de spike-ins

Qualité de l’assemblage du génome ?