ChIP-Seq Analysis Protocol | ||
Contents Protocol Overview / Introduction Section 1: Read Quality Control Steps involved and suggested tools: Steps involved and suggested tools: Possible alternative software: Section 2b: Mapping Postprocessing Steps involved and suggested tools: Steps involved and suggested tools: Possible alternative software: Section 4: Assessing and Refining Peak Calls Steps involved and suggested tools: Possible alternative software: Section 5: Differential Analysis Steps involved and suggested tools: Possible alternative software: Steps involved and suggested tools: Possible alternative software: Section 7: Functional Analysis Steps involved and suggested tools: Possible alternative software: Gene Ontology analysis with Genes of interest | ||
Protocol Overview / IntroductionIn this protocol we discuss and outline the process of 1. peak calling 2. finding differentially expressed peaks 3. annotating peaks and 4. finding common motifs from a set of peaks. What is ChIP-Seq? A method to analyze the interaction between proteins and DNA. Chromatin immunoprecipitation (ChIP) is combined with next generation sequencing technologies to discover binding sites of DNA-associated proteins. ChIP allows extraction of the regions of the DNA that a target protein has bound. DNA that has been crosslinked to protein is sheared, the target DNA-protein complex is immunoprecipitated using an antibody against the protein of interest, finally, the DNA is unlinked and sequenced. The technique is often used for transcription factors, chromatin-modifying enzymes, modified histones interacting with genomic DNA, and components of the basal transcriptional machinery. Theoretically any DNA-binding protein for which an antibody can be used to immunoprecipitate can be used to extract the DNA segments for which the protein was bound. Sequence DNA reads correspond to regions of the genome where the protein was bound. The reads (tags) are then mapped to the reference genome. This step computationally determines where on the genome a read originated from. Regions of the genome where several reads have been mapped to are likely to be where the protein of interest was bound. To determine such regions, the process of peak calling is used. http://en.wikipedia.org/wiki/ChIP-sequencing http://www.biomedcentral.com/1741-7007/8/56 What is peak calling? Peak calling looks for enriched regions of the genome where a significant number of tags have aligned usually compared to a control library. These reflect regions where the protein of interest has bound to the genome. Note that this is an assumption that does not always hold true. High GC content, high repetitive and mappable regions are just some factors that also cause enriched alignment. It is therefore important to have controls in the experiment design and filter for highly mappable regions of the genome. | ||
Section 1: Read Quality ControlPurpose:This section of the protocol will discuss your raw sequence and help you make informed decisions on how to handle it and maximise your chances of getting a good quality alignment. Knowledge of the read types, the number of reads, their GC content, possible contamination and other issues are important. This information will give you an idea of any quality issues with the data and guide you on the choice of data trimming/cleanup methods to use. Cleaning up the raw data before alignment can reduce misalignments and reduce issues that may occur in downstream analysis. Steps involved and suggested tools:
(The following is adapted from the De novo Genome Assembly for Illumina Data) For FastQ files (the most common), the suggested tool is FastQC. Details can be found here. FastQC can be run from within Galaxy or on a local computer via command line.
Some important information
In the case that your data is not in the correct quality score, you can use NGS: QC and Manipulation → FastQ Groomer If the Galaxy you use has it, use the parallel version (FastQ Parallel Groomer) as it runs several times faster.
| ||
Section 2a: MappingPurpose:Read mapping aligns the cleaned reads from Section 1 onto a reference genome. The process finds the part of the genome which best matches with the reads. Steps involved and suggested tools:Mapping is usually assessed by looking at the percentage of reads mapped and uniqueness of the mapping. Some eads will map but not uniquely. Some some will not map at all. Contributing factors may be: Contamination, adapter sequences (that somehow passed filtering). Mapping tools have several features which may not be relevant to ChIP-Seq analysis. Decisions on the aligner and aligner settings will vary between experiments, but generally for Chip-Seq: You would be aligning DNA to a DNA reference; Depending on the lab prep, reads would typically be short (<= 100bp). BWABWA comes with several algorithms, bwa align, bwa bwasw, and bwa mem. The most commonly used for next generation sequencing data is bwa align, which is suitable for short reads up to 150bp. bwa bwasw supports longer 100bp-100kp reads, but does not (amongst other things) support paired end. The author (Heng Li) is developing a new algorithm, bwa mem which is a combination of the two algorithms designed as a replacement for bwa align with similar performance and support for long 150bp+ reads. (source) At the time of writing, there is a wrapper for bwa align in the galaxy toolshed. Running any of the bwa algorithms requires firstly, building of a genome index. Pre building genome indices is often required by many mappers, indices help speed up lookup of the genome. The input is a fasta file (.fa or .fasta) of the genome you want to align to. Generally you only need to do this once for a genome, the indices can be used again for mapping jobs. The program will create several index files in the same folder as the fasta file. bwa index ref.fa Note that BWA will assume the index files are in the same folder as ref.fa. Using default arguments for these may get you workable results initially but is worthwhile to refine your arguments as this will affect mapping quality and speed. The simplest way to speed up your alignment is increasing the the “Number of threads” (-t). The number of threads should be less than or equal to the number of cores on your machine. NTHREADS <= NUM_CORES Running bwa mem: For single end reads: bwa mem -t NTHREADS ref.fa reads.fq > aln-se.sam For paired end reads: bwa mem -t NTHREADS ref.fa read1.fq read2.fq > aln-pe.sam Running bwa align: This is done in two steps, first an intermediate .sai file is created, these are then used together to create a sam file. For single end reads: bwa aln -t NTHREADS ref.fa short_read.fq > aln_sa.sai bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam For paired end reads: bwa aln -t NTHREADS ref.fa read1.fq > aln_sa1.sai bwa aln -t NTHREADS ref.fa read2.fq > aln_sa2.sai bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam Bowtie / Bowtie 2This is another popular and fast mapper. Bowtie 2 is the newer algorithm designed for longer reads, gapped alignments, better map quality scoring scheme and other improvements. (source). Bowtie 1 has two alignment modes. The -v alignment mode scores mismatches across the entire read. The -n alignment mode (default) scores mismatches by looking at a section at the beginning of the read first and then the entire read. Bowtie 2 has a simpler scoring scheme. Similar to the previous mapper, a genome index needs to be built first. Bowtie 2 no longer needs the original fasta once the index has been built, so you can have this in a different folder. The indexer will create 6 files: NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>. bowtie2-build ref.fa <bt2_base> Running bowtie2 For single end reads: bowtie2 -x <bt2_base> -U reads.fq -S aln-se.sam For paired end reads: bowtie2 -x <bt2_base> -1 reads_1.fq -2 reads_2.fq -S aln-pe.sam Possible alternative software:There are numerous alternatives in mapping software some of them are listed here (look for DNA mappers, not RNA):
| ||
Section 2b: Mapping PostprocessingPurpose:Most mapping tools will output files in SAM format, a text file with a description of how and where each read aligns (or not align) to the reference genome along with the mapping score of each. The binary version of a SAM file is the BAM file. Computer programs can read BAM files more efficiently than SAM files, and most downstream programs take BAM as input. So it is often beneficial to convert SAM files to BAM after alignment. Many tools will also benefit/require a BAM index (.bai), which is similar to a genome index which helps programs look up reads. Some tools requires the BAM file to be sorted in a specific way. Sorting can be by chromosomal coordinate or read name Steps involved and suggested tools:The most common tools are samtools and picard. Picard is generally more versatile with more options. Running samtools: Creating bam: samtools view -bS eg.sam > eg.bam Sorting bam by chromosomal coordinates: samtools sort eg.bam sorted (will create a sorted.bam) Creating bam index: samtools index sorted.bam (a eg.bai file will be created in the same folder) Running picard: Creating bam, sorting and creating index in one step: java -Xmx2g -jar SortSam.jar CREATE_INDEX=true SORT_ORDER=coordinate INPUT=eg.sam OUTPUT=eg.bam Possible alternative software: Note that in Galaxy, BAM index files are created automatically Section 3: Peak CallingPurpose:
A Chip-Seq experiment pulls down proteins of interest that has bound to DNA. As such, it would be expected that more DNA fragments with affinity to this protein is sequenced than other DNA. In an alignment view, you would expect to see high coverage at regions where the protein has high affinity to the genome. Peak callers assess several properties of the alignment to identify such regions. In a typical laboratory prep for ChIP-Seq, the DNA fragment is size selected, sequenced and alignment. In single-end sequencing; the sequencer reads from the 5’ end of both the forward and reverse strands. Forward and reverse reads are sometimes referred to as forward tags and reverse tags. Together the forward and reverse tags would form an area of high coverage (or “peak”) of reads to certain areas of the genome. Raw alignments of these tags will look like the two blue and red peaks in the figure above. Paired-end sequencing should produce similar looking peaks. Provided fragment lengths and read lengths are the same, forward tags peak would consist of reads from “read1” and reverse tags would be composed of reads from “read2”. The difference from single-end sequencing is that a read1 and read2 pair is guaranteed to originate from the same DNA fragment. Whereas there is no such pairing between forward and reverse tags. The extra information from paired-end sequencing provides higher confidence calculations for peak shifts (detailed below) and by extension peak calling Most peak callers normalise coverage at a peak against the background coverage around the area of interest and if available a negative control (usually “input” DNA). An input is non-specific antibody ChIP, a background model which can account for biases in chromatin fragmentation, variations in sequencing efficiency and coverage at high/low mappability regions. Coverage in relation to the control guides the peak caller to regions of interest. Furthermore, with similar fragment and read lengths the base pair distance between the forward and reverse tags should be a certain distance apart and follow a distribution. As shown in the figure, two peaks are generally formed, one from the forward tags and the other from reverse tags. The distance between the two peaks plus the tag length should model the length of the original fragment. [Fragment length] = [Forward tag length]/2 + [Peak shift] + [Reverse tag length]/2 = [tag length] + [Peak shift] Peak callers build peak models based around this property to estimate fragment size and to guide peak calling. Caveats: Blacklisted regions: Artifact regions which have extreme artifactual high signal in control datasets; Regions where the mapped reads are easily mapped elsewhere (not unique); Repeat regions such as centromeric, telomeric and satellite repeats (Gerstein 2012). The coverage in these regions are often extremely high but they are not real peaks. A peak caller may/may not be able to normalise and account for these regions based on the input, During or after the peak calling step, “peaks” which lie in these regions can be excluded. A file containing a list of such regions excluded in the ENCODE project can be found here. Some reads may not map to the reference genome at all if the genomic source (a cell line or tissue) varies greatly in the areas of interest to the reference genome. This may be the cause for a large amount of high quality, non artifactual reads to not be mapped. Tools:MACS / MACS 2 Homer FindPeaks Possible alternative software:Section 4: Assessing and Refining Peak CallsPurpose:Depending on the quality of the data, peak callers can call peaks which are actually noise. A good experiment design should include biological replicates which ideally will have high conservation in their peak calls. However, independent thresholding can lead to peaks that are above the threshold for one replicate while below the threshold in another, despite them having identical binding profiles. To assess the correspondence between samples and to obtain a set of highly confident peaks between replicates the ENCODE project uses a framework based around the IDR (Irreproducible Discovery Rate). The approach finds peaks that are consistent across a number of replicates. Instead of only keeping peaks that occur across all replicates, this approach reduces true positives from being filtered out. Steps involved and suggested tools:Details of the Process can be found here. Galaxy ToolShed has idr_package by modENCODE team Section 5: Differential AnalysisPurpose:Quantitative comparison of transcription factor or histone binding under different cell types or conditions can be done by comparing peak heights (ie read densities) for higher or lower binding affinity. This is based on the assumption that the rate of binding to a specific region can correspond to the number of reads at this region. Because variations in noise can vary substantially in ChIP experiments, appropriate normalisation is necessary for comparison between samples. You will need to start off with a high quality list of peaks (see Section 4: Assessing and Refining Peak Calls) which have been called from all of the samples you are comparing against. Steps involved and suggested tools:
Possible alternative software (see Bailey et al 2013):
Section 6: Motif AnalysisPurpose:The binding sites obtained from peak lists provide regions of interest across the genome. The nucleotide sequence of these regions can then be extracted and compared for similarities between regions of interest within the same experiment or with other known motifs. We expect to find binding motifs for important transcription factors to be relatively conserved. Conservation can be analysed at peaks that are conserved, or with loss/gain between different different conditions. An example of such an investigation can be splitting the refined peaks (from Section 4: Assessing and Refining Peak Calls) and by how they are differentially expressed (see Section 5: Differential Analysis) and then analysing motif binding patterns separately for
Steps involved and suggested tools:
Possible alternative software:Section 7: Functional AnalysisPurpose:Functional analysis attempts to assign more dynamic aspects to the regions of interest such as what may be happening in transcription, translation and interactions with other proteins in terms of pathways. The most straightforward method comes from recent high throughput sequencing data where regions of interest (in our case, our peaks) span across the whole genome that lie not only in genes but non-coding regions, pseudogenes, introns etc Recent functional analysis tools such as GREAT have been developed to take advantage of the gene agnostic nature of new sequencing data. The entire list of confident peaks in the format of a BED file is used as input for GREAT. The traditional gene-by-gene approach can be performed on ChIP seq data by determining target genes to the binding peaks of the transcription factor. Doing this is nontrivial as proximity to genes can be dependent on spatial arrangement of the chromosome. Gene lengths can also bias the assignment of peaks to genes. Comparative analysis with genes as well as other tracks derived from public data from UCSC, ENCODE can be done in an integrative level with the Genomic HyperBrowser Steps involved and suggested tools:Regions of interest
Possible alternative software:Gene Ontology analysis with Genes of interestSee (Bardet 2012) From (Bailey et al 2013): | ||
ReferencesBardet, A. F., He, Q., Zeitlinger, J., & Stark, A. (2012). A computational pipeline for comparative ChIP-seq analyses. Nature protocols, 7(1), 45–61. doi:10.1038/nprot.2011.420 Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K.-K., Cheng, C., … Alexander, R. (2012). Architecture of the human regulatory network derived from ENCODE data. Nature, 489(7414), 91–100. doi:10.1038/nature11245 Landt, S. G., Marinov, G. K., Kundaje, a., Kheradpour, P., Pauli, F., Batzoglou, S., … Snyder, M. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22(9), 1813–1831. doi:10.1101/gr.136184.111 Wilbanks, E. G., & Facciotti, M. T. (2010). Evaluation of algorithm performance in ChIP-seq peak detection. PloS one, 5(7), e11471. doi:10.1371/journal.pone.0011471 Kidder, B. L., Hu, G., & Zhao, K. (2011). ChIP-Seq: technical considerations for obtaining high-quality data. Nature immunology, 12(10), 918–22. doi:10.1038/ni.2117 IDR: Reproducibility and automatic thresholding of ChIP-seq data - Anshul Kundaje. (n.d.). Retrieved August 20, 2013, from https://sites.google.com/site/anshulkundaje/projects/idr Galaxy ToolShed idr_package. (n.d.). Retrieved August 20, 2013, from http://toolshed.g2.bx.psu.edu/repository/view_repository?sort=name&operation=view_or_manage_repository&f-free-text-search=idr&id=fcd21c613cf419b1 McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., … Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5), 495–501. doi:10.1038/nbt.1630 Subramanian, A. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. Retrieved from http://www.pnas.org/content/102/43/15545.short Sandve, G. K., Gundersen, S., Rydbeck, H., Glad, I. K., Holden, L., Holden, M., … Hovig, E. (2010). The Genomic HyperBrowser: inferential genomics at the sequence level. Genome biology, 11(12), R121. doi:10.1186/gb-2010-11-12-r121 Taslim, C., Huang, T., & Lin, S. (2011). DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models. Bioinformatics (Oxford, England), 27(11), 1569–70. doi:10.1093/bioinformatics/btr165 Stark, R., & Brown, G. (2012). DiffBind: differential binding analysis of ChIP-Seq peak data. Retrieved from http://www.bioconductor.org/packages/devel/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., … Zhang, J. (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Computational Biology, 9(11), e1003326. doi:10.1371/journal.pcbi.1003326 |