Chip_seq_protocol_main

In this protocol we discuss and outline the process of 1. peak calling 2. finding differentially expressed peaks 3. annotating peaks and 4. finding common motifs from a set of peaks.

What is ChIP-Seq?

A method to analyze the interaction between proteins and DNA. Chromatin immunoprecipitation (ChIP) is combined with next generation sequencing technologies to discover binding sites of DNA-associated proteins.

ChIP allows extraction of the regions of the DNA that a target protein has bound. DNA that has been crosslinked to protein is sheared, the target DNA-protein complex is immunoprecipitated using an antibody against the protein of interest, finally, the DNA is unlinked and sequenced.

The technique is often used for transcription factors, chromatin-modifying enzymes, modified histones interacting with genomic DNA, and components of the basal transcriptional machinery. Theoretically any DNA-binding protein for which an antibody can be used to immunoprecipitate can be used to extract the DNA segments for which the protein was bound.

Sequence DNA reads correspond to regions of the genome where the protein was bound. The reads (tags) are then mapped to the reference genome. This step computationally determines where on the genome a read originated from. Regions of the genome where several reads have been mapped to are likely to be where the protein of interest was bound. To determine such regions, the process of peak calling is used.

http://en.wikipedia.org/wiki/ChIP-sequencing

http://www.biomedcentral.com/1741-7007/8/56

What is peak calling?

Peak calling looks for enriched regions of the genome where a significant number of tags have aligned usually compared to a control library. These reflect regions where the protein of interest has bound to the genome. Note that this is an assumption that does not always hold true. High GC content, high repetitive and mappable regions are just some factors that also cause enriched alignment. It is therefore important to have controls in the experiment design and filter for highly mappable regions of the genome.

Section 1: Read Quality Control

Purpose:

This section of the protocol will discuss your raw sequence and help you make informed decisions on how to handle it and maximise your chances of getting a good quality alignment. Knowledge of the read types, the number of reads, their GC content, possible contamination and other issues are important. This information will give you an idea of any quality issues with the data and guide you on the choice of data trimming/cleanup methods to use. Cleaning up the raw data before alignment can reduce misalignments and reduce issues that may occur in downstream analysis.

Steps involved and suggested tools:

Examine the quality

(The following is adapted from the De novo Genome Assembly for Illumina Data)

For FastQ files (the most common), the suggested tool is FastQC. Details can be found here. FastQC can be run from within Galaxy or on a local computer via command line.

FastQC on Galaxy is usually located in: NGS: QC and Manipulation → FastQC
Command line: fastqc. Details on installation and using can be found here.

Some important information

Quality encoding type - Typically most NGS software work with the Sanger format for quality scores. Modern sequencing centers will provide fastq files with this standard quality format, however it is best to check as not all software will alert you of this error.

In the case that your data is not in the correct quality score, you can use NGS: QC and Manipulation → FastQ Groomer

If the Galaxy you use has it, use the parallel version (FastQ Parallel Groomer) as it runs several times faster.

Total sequences - Gives an idea of coverage. You can also compare these between your samples. For example discrepancies in counts between replicates may indicate library prep issues, sample quality or may just be an indication of the biology. These need to be considered carefully depending on the experiment.
% GC - Generally this should be around 50% but will depend on the samples.
Dips in quality near the ends of reads are typical. Sometimes they are expected

For Illumina.

This can be due to the sequencing chemistry itself where the base calling confidence of the machine decreases as it reads further and further away from the start of the DNA fragment.
However If the fragment size is smaller than the read length. quality can drop off dramatically as the sequencer reads past the end of the fragment.

Quality dips can also be attributed to the way the sample was prepared. Dips at one base may be indicative of issues during a machine cycle.
Highly recurring k-mers or over-represented sequences could be

sequence adapters in which case you will need to trim them
explained by the biology of your experiment (you should not trim these)

N’s are bases where the sequencer is unsure of what base to call.

They can be seen in poor quality sequencing runs
For Illumina. They often appear in the first few reads of a file as these reads come from clusters located at the corner of a slide.
They may appear at the end of a read if the fragment size is short.
Reads with N’s can be trimmed or taken out entirely. What you do is personal preference, some points to consider:

Most aligners will ignore N’s at the end of the read, ie the alignment would be the same with the N’s trimmed
A low percentage of N’s may not drastically affect downstream analysis as some reads will be put into the “unaligned” bin, and appropriate mapping quality will be assigned to the ones that do align. It will be up to the downstream tools to take this into account.
Trimming / removing reads could cut down processing time downstream

Quality trimming/cleanup of read files.

there are numerous adapter trimming and quality trimming software. Most trim and cut out low quality reads at the same time.

For a description of Trimmomatic see De novo Genome Assembly for Illumina Data

Some trimming tools on the galaxy toolshed:

If you are using bwa aln the -q option can be used to filter tail end of reads by quality during alignment.

Read Length

Reads across all samples to be compared should have the same length. This is because reads of different lengths have different mapping properties.

For example: shorter reads are more likely to align, but with less likelihood of the alignment being unique.
Such reads can affect differential expression analysis downstream.

Several of the tools above allows an option to truncate reads

Section 2a: Mapping

Purpose:

Read mapping aligns the cleaned reads from Section 1 onto a reference genome. The process finds the part of the genome which best matches with the reads.

Steps involved and suggested tools:

Mapping is usually assessed by looking at the percentage of reads mapped and uniqueness of the mapping. Some eads will map but not uniquely. Some some will not map at all. Contributing factors may be: Contamination, adapter sequences (that somehow passed filtering). Mapping tools have several features which may not be relevant to ChIP-Seq analysis. Decisions on the aligner and aligner settings will vary between experiments, but generally for Chip-Seq: You would be aligning DNA to a DNA reference; Depending on the lab prep, reads would typically be short (<= 100bp).

BWA

BWA comes with several algorithms, bwa align, bwa bwasw, and bwa mem. The most commonly used for next generation sequencing data is bwa align, which is suitable for short reads up to 150bp. bwa bwasw supports longer 100bp-100kp reads, but does not (amongst other things) support paired end. The author (Heng Li) is developing a new algorithm, bwa mem which is a combination of the two algorithms designed as a replacement for bwa align with similar performance and support for long 150bp+ reads. (source)

At the time of writing, there is a wrapper for bwa align in the galaxy toolshed.

Running any of the bwa algorithms requires firstly, building of a genome index. Pre building genome indices is often required by many mappers, indices help speed up lookup of the genome. The input is a fasta file (.fa or .fasta) of the genome you want to align to. Generally you only need to do this once for a genome, the indices can be used again for mapping jobs. The program will create several index files in the same folder as the fasta file.

bwa index ref.fa

Note that BWA will assume the index files are in the same folder as ref.fa.

Using default arguments for these may get you workable results initially but is worthwhile to refine your arguments as this will affect mapping quality and speed. The simplest way to speed up your alignment is increasing the the “Number of threads” (-t). The number of threads should be less than or equal to the number of cores on your machine. NTHREADS <= NUM_CORES

Running bwa mem:

For single end reads:

bwa mem -t NTHREADS ref.fa reads.fq > aln-se.sam

For paired end reads:

bwa mem -t NTHREADS ref.fa read1.fq read2.fq > aln-pe.sam

Running bwa align:

This is done in two steps, first an intermediate .sai file is created, these are then used together to create a sam file.

For single end reads:

bwa aln -t NTHREADS ref.fa short_read.fq > aln_sa.sai

bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam

For paired end reads:

bwa aln -t NTHREADS ref.fa read1.fq > aln_sa1.sai

bwa aln -t NTHREADS ref.fa read2.fq > aln_sa2.sai

bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam

Bowtie / Bowtie 2

This is another popular and fast mapper. Bowtie 2 is the newer algorithm designed for longer reads, gapped alignments, better map quality scoring scheme and other improvements. (source). Bowtie 1 has two alignment modes. The -v alignment mode scores mismatches across the entire read. The -n alignment mode (default) scores mismatches by looking at a section at the beginning of the read first and then the entire read. Bowtie 2 has a simpler scoring scheme.

Similar to the previous mapper, a genome index needs to be built first. Bowtie 2 no longer needs the original fasta once the index has been built, so you can have this in a different folder. The indexer will create 6 files: NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.

bowtie2-build ref.fa <bt2_base>

Running bowtie2

For single end reads:

bowtie2 -x <bt2_base> -U reads.fq -S aln-se.sam

For paired end reads:

bowtie2 -x <bt2_base> -1 reads_1.fq -2 reads_2.fq -S aln-pe.sam

Possible alternative software:

There are numerous alternatives in mapping software some of them are listed here (look for DNA mappers, not RNA):

Tools for mapping high-throughput sequencing data http://wwwdev.ebi.ac.uk/fg/hts_mappers/
http://seqanswers.com/wiki/Mapping

Section 2b: Mapping Postprocessing

Purpose:

Most mapping tools will output files in SAM format, a text file with a description of how and where each read aligns (or not align) to the reference genome along with the mapping score of each. The binary version of a SAM file is the BAM file. Computer programs can read BAM files more efficiently than SAM files, and most downstream programs take BAM as input. So it is often beneficial to convert SAM files to BAM after alignment.

Many tools will also benefit/require a BAM index (.bai), which is similar to a genome index which helps programs look up reads.

Some tools requires the BAM file to be sorted in a specific way. Sorting can be by chromosomal coordinate or read name

Steps involved and suggested tools:

The most common tools are samtools and picard. Picard is generally more versatile with more options.

Running samtools:

Creating bam:

samtools view -bS eg.sam > eg.bam

Sorting bam by chromosomal coordinates:

samtools sort eg.bam sorted

(will create a sorted.bam)

Creating bam index:

samtools index sorted.bam

(a eg.bai file will be created in the same folder)

Running picard:

Creating bam, sorting and creating index in one step:

java -Xmx2g -jar SortSam.jar CREATE_INDEX=true SORT_ORDER=coordinate INPUT=eg.sam OUTPUT=eg.bam

Possible alternative software:

Note that in Galaxy, BAM index files are created automatically

picard in Galaxy
samtools in Galaxy

Section 3: Peak Calling

Purpose:

Valouev 2008 doi: 10.1038/nmeth.1246

A Chip-Seq experiment pulls down proteins of interest that has bound to DNA. As such, it would be expected that more DNA fragments with affinity to this protein is sequenced than other DNA. In an alignment view, you would expect to see high coverage at regions where the protein has high affinity to the genome. Peak callers assess several properties of the alignment to identify such regions.

In a typical laboratory prep for ChIP-Seq, the DNA fragment is size selected, sequenced and alignment.

In single-end sequencing; the sequencer reads from the 5’ end of both the forward and reverse strands. Forward and reverse reads are sometimes referred to as forward tags and reverse tags. Together the forward and reverse tags would form an area of high coverage (or “peak”) of reads to certain areas of the genome. Raw alignments of these tags will look like the two blue and red peaks in the figure above.

Paired-end sequencing should produce similar looking peaks. Provided fragment lengths and read lengths are the same, forward tags peak would consist of reads from “read1” and reverse tags would be composed of reads from “read2”. The difference from single-end sequencing is that a read1 and read2 pair is guaranteed to originate from the same DNA fragment. Whereas there is no such pairing between forward and reverse tags. The extra information from paired-end sequencing provides higher confidence calculations for peak shifts (detailed below) and by extension peak calling

Most peak callers normalise coverage at a peak against the background coverage around the area of interest and if available a negative control (usually “input” DNA). An input is non-specific antibody ChIP, a background model which can account for biases in chromatin fragmentation, variations in sequencing efficiency and coverage at high/low mappability regions. Coverage in relation to the control guides the peak caller to regions of interest.

Furthermore, with similar fragment and read lengths the base pair distance between the forward and reverse tags should be a certain distance apart and follow a distribution. As shown in the figure, two peaks are generally formed, one from the forward tags and the other from reverse tags. The distance between the two peaks plus the tag length should model the length of the original fragment.

[Fragment length] = [Forward tag length]/2 + [Peak shift] + [Reverse tag length]/2

= [tag length] + [Peak shift]

Peak callers build peak models based around this property to estimate fragment size and to guide peak calling.

Caveats:

Blacklisted regions: Artifact regions which have extreme artifactual high signal in control datasets; Regions where the mapped reads are easily mapped elsewhere (not unique); Repeat regions such as centromeric, telomeric and satellite repeats (Gerstein 2012). The coverage in these regions are often extremely high but they are not real peaks. A peak caller may/may not be able to normalise and account for these regions based on the input, During or after the peak calling step, “peaks” which lie in these regions can be excluded. A file containing a list of such regions excluded in the ENCODE project can be found here.

Some reads may not map to the reference genome at all if the genomic source (a cell line or tissue) varies greatly in the areas of interest to the reference genome. This may be the cause for a large amount of high quality, non artifactual reads to not be mapped.

Tools:

MACS / MACS 2

Homer

FindPeaks

Possible alternative software:

Evaluation of Algorithm Performance in ChIP-Seq Peak Detection

Section 4: Assessing and Refining Peak Calls

Purpose:

Depending on the quality of the data, peak callers can call peaks which are actually noise. A good experiment design should include biological replicates which ideally will have high conservation in their peak calls. However, independent thresholding can lead to peaks that are above the threshold for one replicate while below the threshold in another, despite them having identical binding profiles.

To assess the correspondence between samples and to obtain a set of highly confident peaks between replicates the ENCODE project uses a framework based around the IDR (Irreproducible Discovery Rate). The approach finds peaks that are consistent across a number of replicates. Instead of only keeping peaks that occur across all replicates, this approach reduces true positives from being filtered out.

Steps involved and suggested tools:

Details of the Process can be found here.

Galaxy ToolShed has idr_package by modENCODE team

Section 5: Differential Analysis

Purpose:

Quantitative comparison of transcription factor or histone binding under different cell types or conditions can be done by comparing peak heights (ie read densities) for higher or lower binding affinity.

This is based on the assumption that the rate of binding to a specific region can correspond to the number of reads at this region. Because variations in noise can vary substantially in ChIP experiments, appropriate normalisation is necessary for comparison between samples.

You will need to start off with a high quality list of peaks (see Section 4: Assessing and Refining Peak Calls) which have been called from all of the samples you are comparing against.

Steps involved and suggested tools:

DiffBind identifies statistically significantly differentially bound sites based on evidence of binding affinity.
For instructions, see the manual and User’s Guide for details.
Here are some tips on usage

For Peaks use the high quality peaks you obtained from Section 4: Assessing and Refining Peak Calls
The tool supports BAM format reads. However in the case that the BAM files don’t work convert them to bed.gz format.

BAM/SAM to bed converters:

Galaxy’s NGS: SAM Tools -> Convert SAM (use the Print all = No mode)
bedtools bamtobed

you may also need to compress the read files using gzip

DiffBind strongly advises at least two biological replicates for each sample, if you have just 1 you will need to set the “minMembers” and “group1” and “group2” parameters explicitly

Possible alternative software (see Bailey et al 2013):

DIME - assumes that a significant proportion of peaks are common to the conditions under compar- ison
DBChIP
MAnorm - assumes that peaks that are common in both conditions do not change significantly,
ChIPDiff - for Histone
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287170/ - for RNA Pol II

Section 6: Motif Analysis

Purpose:

The binding sites obtained from peak lists provide regions of interest across the genome. The nucleotide sequence of these regions can then be extracted and compared for similarities between regions of interest within the same experiment or with other known motifs.

We expect to find binding motifs for important transcription factors to be relatively conserved. Conservation can be analysed at peaks that are conserved, or with loss/gain between different different conditions.

An example of such an investigation can be splitting the refined peaks (from Section 4: Assessing and Refining Peak Calls) and by how they are differentially expressed (see Section 5: Differential Analysis) and then analysing motif binding patterns separately for

Higher binding affinity peaks
Lower binding affinity peaks
Non significantly differentiated binding affinity peaks.

Steps involved and suggested tools:

Convert refined peak calls from interval (eg BED) format to DNA sequences

In Galaxy use Fetch Sequences -> Extract Genomic DNA
bedtools getfasta

eg. bedtools getfasta -fi <reference fasta file> -bed <refined peaks> -fo <output.fasta>

Use Sequences as input to a motif analysis tool

MEME-ChIP. Motif analysis especially designed for ChIP-seq experiments

Alternative online service here
Galaxy meme_chip

Possible alternative software:

Section 7: Functional Analysis

Purpose:

Functional analysis attempts to assign more dynamic aspects to the regions of interest such as what may be happening in transcription, translation and interactions with other proteins in terms of pathways.

The most straightforward method comes from recent high throughput sequencing data where regions of interest (in our case, our peaks) span across the whole genome that lie not only in genes but non-coding regions, pseudogenes, introns etc Recent functional analysis tools such as GREAT have been developed to take advantage of the gene agnostic nature of new sequencing data. The entire list of confident peaks in the format of a BED file is used as input for GREAT.

The traditional gene-by-gene approach can be performed on ChIP seq data by determining target genes to the binding peaks of the transcription factor. Doing this is nontrivial as proximity to genes can be dependent on spatial arrangement of the chromosome. Gene lengths can also bias the assignment of peaks to genes.

Comparative analysis with genes as well as other tracks derived from public data from UCSC, ENCODE can be done in an integrative level with the Genomic HyperBrowser

Steps involved and suggested tools:

Regions of interest

GREAT

Visit the GREAT website http://bejerano.stanford.edu/great/public/html/index.php
From a peak called, obtain a BED file of the regions of interest and use this as input.
GREAT is very specific about the input format and will not accept chromosomes it does not recognise. more details here

Genomic HyperBrowser

http://hyperbrowser.uio.no/hb/
See the paper

Possible alternative software:

Gene Ontology analysis with Genes of interest

See (Bardet 2012)

From (Bailey et al 2013):

References

Bardet, A. F., He, Q., Zeitlinger, J., & Stark, A. (2012). A computational pipeline for comparative ChIP-seq analyses. Nature protocols, 7(1), 45–61. doi:10.1038/nprot.2011.420

Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K.-K., Cheng, C., … Alexander, R. (2012). Architecture of the human regulatory network derived from ENCODE data. Nature, 489(7414), 91–100. doi:10.1038/nature11245

Landt, S. G., Marinov, G. K., Kundaje, a., Kheradpour, P., Pauli, F., Batzoglou, S., … Snyder, M. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22(9), 1813–1831. doi:10.1101/gr.136184.111

Wilbanks, E. G., & Facciotti, M. T. (2010). Evaluation of algorithm performance in ChIP-seq peak detection. PloS one, 5(7), e11471. doi:10.1371/journal.pone.0011471

Kidder, B. L., Hu, G., & Zhao, K. (2011). ChIP-Seq: technical considerations for obtaining high-quality data. Nature immunology, 12(10), 918–22. doi:10.1038/ni.2117

IDR: Reproducibility and automatic thresholding of ChIP-seq data - Anshul Kundaje. (n.d.). Retrieved August 20, 2013, from https://sites.google.com/site/anshulkundaje/projects/idr

Galaxy ToolShed idr_package. (n.d.). Retrieved August 20, 2013, from http://toolshed.g2.bx.psu.edu/repository/view_repository?sort=name&operation=view_or_manage_repository&f-free-text-search=idr&id=fcd21c613cf419b1

McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495–501. doi:10.1038/nbt.1630

Subramanian, A. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. Retrieved from http://www.pnas.org/content/102/43/15545.short

Sandve, G. K., Gundersen, S., Rydbeck, H., Glad, I. K., Holden, L., Holden, M., … Hovig, E. (2010). The Genomic HyperBrowser: inferential genomics at the sequence level. Genome biology, 11(12), R121. doi:10.1186/gb-2010-11-12-r121

Taslim, C., Huang, T., & Lin, S. (2011). DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models. Bioinformatics (Oxford, England), 27(11), 1569–70. doi:10.1093/bioinformatics/btr165

Stark, R., & Brown, G. (2012). DiffBind: differential binding analysis of ChIP-Seq peak data. Retrieved from http://www.bioconductor.org/packages/devel/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf

Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., … Zhang, J. (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Computational Biology, 9(11), e1003326. doi:10.1371/journal.pcbi.1003326