Background

Introduction to ChIP-Seq

ChIP-Seq tutorials

ChIP-Seq with MACS

Quality control and alignment

Peak calling

Identification of motifs with MEME

Importing sequences from UCSC Genome Browser into Galaxy

Motif calling with MEME

Background

Introduction to ChIP-Seq

ChIP-Seq, or ChIP-sequencing (link to wiki), is an approach for analysing protein binding sites in genomes (Figure 1). It combines chromatin immunoprecipitation (ChIP) with high-throughput sequencing. In a nutshell, proteins are crosslinked to DNA, the DNA is sheared into small fragments, and DNA-protein complexes are purified using antibodies specific to a protein of interest (such as CTCF), which enriches for fragments bound to that protein. The crosslinks are then reversed to release the DNA, which is sequenced on a high-throughput platform such as Illumina HiSeq. The sequences (reads) are aligned to the reference genome, and dedicated peak calling software such as MACS (link to github) is used to find regions with high read density, so-called peaks. The rationale is that the enrichment of DNA fragments in the immunoprecipitation correlates with the strength of protein binding, and that this enrichment can be estimated from the number of reads obtained in sequencing. It is good practice to sequence an input control alongside the ChIP sample (link to Ren Lab ENCODE2 Chromatin Immunoprecipitation Protocol). Biological replicates are essential for ChIP-Seq analysis. A detailed review of the ChIP-Seq approaches used in the ENCODE and modENCODE projects was published by Landt et al. (2012).

Many DNA-binding proteins associate with specific DNA motifs. Such motifs are present in the regions (peaks) identified in ChIP-Seq experiments. DNA-binding motifs can be identified in sequences from the peak regions by dedicated software packages such as MEME (link to the MEME site). The predicted peak regions are larger than the binding sites (Figure 1), but it is reasonable to expect that binding sites for a given protein have similar sequences in different peak regions. MEME searches for common motifs in the provided set of sequences.

ChIP-Seq tutorials

We provide two ChIP-Seq tutorials. The basic tutorial uses a single experimental ChIP-Seq dataset, without an input control, and consists of quality assessment of the sequencing data, alignment, and peak calling with MACS. The advanced tutorial uses both experiment and control datasets, and the peaks are called with MACS2. The advanced tutorial also covers data import from the UCSC Genome Browser into Galaxy and includes motif finding with MEME. The quality assessment step for the input data is omitted from the advanced tutorial to shorten it. Both tutorials use mouse CTCF ChIP-Seq data obtained by the Bing Ren Lab, UCSD, as part of the ENCODE project. The tutorials are developed for the Galaxy workflow platform. Learn more about Galaxy.

Figure 1. Scheme of ChIP-Seq



ChIP-Seq with MACS

Quality control and alignment

Input data (reads) generally come in FASTQ format (link to wiki). It is essential to check the quality of high-throughput sequencing data. FastQC is one of the most popular tools for assessing read quality. Note that ChIP-Seq data is generally derived from a very small fraction of the genome, so the reads will be enriched for duplicated (repeated) sequences. Some proteins bind to regions with biased composition, so the GC content of ChIP-Seq data may differ from the genomic average.
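
To make the quantities discussed above concrete, here is a minimal Python sketch that reads FASTQ records and reports mean base quality, GC fraction and the fraction of duplicated reads. It is an illustration only and not a substitute for FastQC; the function names, the Phred+33 quality encoding and the tiny in-memory example are assumptions made for this sketch.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def summarise(handle, phred_offset=33):
    """Return read count, mean base quality, GC fraction and duplicate fraction."""
    total_bases = gc_bases = quality_sum = 0
    counts = Counter()
    for _, sequence, quality in read_fastq(handle):
        total_bases += len(sequence)
        gc_bases += sum(sequence.upper().count(base) for base in "GC")
        # Assumption: qualities are Phred+33 encoded.
        quality_sum += sum(ord(char) - phred_offset for char in quality)
        counts[sequence] += 1
    n_reads = sum(counts.values())
    duplicated = n_reads - len(counts)  # reads beyond the first copy of each sequence
    return {
        "reads": n_reads,
        "mean_quality": quality_sum / total_bases,
        "gc_fraction": gc_bases / total_bases,
        "duplicate_fraction": duplicated / n_reads,
    }


# Three made-up reads; two of them are identical.
example = StringIO(
    "@read1\nGGCC\n+\nIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
    "@read3\nATAT\n+\n5555\n"
)
stats = summarise(example)
print(stats)
```

For ChIP-Seq data, a raised duplicate fraction or a GC fraction away from the genomic average is expected to some degree, for the reasons given above, and is not necessarily a sign of a failed library.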

Parameters used for alignment can influence the peak prediction. For the tutorials we used “Bowtie (Langmead et al., 2009) with a seed length of 25 and allowing up to two mismatches. Only the sequences that mapped to one location were used for further analysis” (link to GEO database, see Data processing). With these parameters, reads that map to more than one location, such as those from identical or repeated regions, are excluded from the subsequent analysis.
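
Outside Galaxy, the quoted settings could be passed to the Bowtie 1 command line roughly as sketched below. This is one possible reading, not the exact command used to produce the tutorial data: the source does not say whether the two mismatches were set with -n (mismatches in the seed) or -v (mismatches in the whole read), so -n is an assumption here, and the index prefix and file names are hypothetical.

```python
import shlex


def bowtie_command(index_prefix, fastq_path, sam_path,
                   seed_length=25, max_mismatches=2):
    """Build a Bowtie 1 argument list that keeps uniquely mapped reads only."""
    return [
        "bowtie",
        "-l", str(seed_length),     # seed length
        "-n", str(max_mismatches),  # mismatches allowed in the seed (assumed; could be -v)
        "-m", "1",                  # drop reads with more than one reportable alignment
        "-S",                       # write SAM output
        index_prefix,
        fastq_path,
        sam_path,
    ]


# Hypothetical index prefix and file names, for illustration only.
command = bowtie_command("mm9", "ctcf_chip.fastq", "ctcf_chip.sam")
print(shlex.join(command))
# Where Bowtie 1 is installed: subprocess.run(command, check=True)
```

The -m 1 option is what implements "only the sequences that mapped to one location"; without it, reads from repeated regions would be assigned to one of their possible locations.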

Bowtie produces unsorted alignments. Many software packages require sorted alignments and extra pieces of information called ‘read groups’. The current best practice on Galaxy for sorting alignments and adding read groups is the Add or Replace Groups tool from the Picard package. The tool can also convert alignments from the text SAM format into the compact binary BAM format.
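
The following Python sketch shows, on SAM text only, what "sorted" and "read group" mean in practice: records are ordered by reference and position, an @RG header line is added, and every read receives a matching RG tag. It is a toy illustration under stated assumptions and not how Picard is implemented; the function name, the read group ID and the sample name are hypothetical, and conversion to BAM is not shown.

```python
def sort_and_tag(sam_lines, group_id="rg1", sample="sample1"):
    """Coordinate-sort SAM records and attach one read group to every read."""
    headers = [line for line in sam_lines if line.startswith("@")]
    records = [line.split("\t") for line in sam_lines if not line.startswith("@")]

    # Reference order follows the @SQ header lines.
    ref_order = {}
    for line in headers:
        if line.startswith("@SQ"):
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            ref_order[fields["SN"]] = len(ref_order)

    # Column 3 is the reference name, column 4 the 1-based leftmost position.
    # Reads on unknown references (for example '*') are placed last.
    records.sort(key=lambda r: (ref_order.get(r[2], len(ref_order)), int(r[3])))

    tagged = []
    for r in records:
        # Optional fields start at column 12; replace any existing RG tag.
        optional = [f for f in r[11:] if not f.startswith("RG:Z:")]
        tagged.append("\t".join(r[:11] + optional + ["RG:Z:" + group_id]))

    kept_headers = [h for h in headers if not h.startswith(("@HD", "@RG"))]
    new_headers = ["@HD\tVN:1.6\tSO:coordinate"] + kept_headers
    new_headers.append("@RG\tID:%s\tSM:%s" % (group_id, sample))
    return new_headers + tagged


# Made-up alignments in arbitrary order.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:1000",
    "r1\t0\tchr2\t50\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t300\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t20\t255\t4M\t*\t0\t0\tACGT\tIIII",
]
result = sort_and_tag(example)
print("\n".join(result))
```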

Peak calling

The basic tutorial uses an early version of the MACS peak calling software, while the advanced tutorial employs the newer MACS2. MACS2 can also be used to predict very broad peaks (Landt et al. 2012).
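
The idea behind peak calling, that regions bound by the protein collect more reads than expected, can be sketched in a few lines of Python. The example below counts reads in fixed windows and reports windows where the ChIP sample is enriched over a control. It is deliberately simplistic and is not the algorithm used by MACS or MACS2; the window size, fold threshold, pseudocount and read positions are all made-up values for illustration.

```python
from collections import Counter


def window_counts(positions, window):
    """Count read start positions per fixed-size window."""
    return Counter(pos // window for pos in positions)


def call_peaks(chip, control, window=100, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, fold) for windows enriched in ChIP over control."""
    chip_counts = window_counts(chip, window)
    control_counts = window_counts(control, window)
    scale = len(chip) / len(control)  # normalise for sequencing depth
    peaks = []
    for index in sorted(chip_counts):
        expected = control_counts[index] * scale + pseudocount
        fold = (chip_counts[index] + pseudocount) / expected
        if fold >= min_fold:
            peaks.append((index * window, (index + 1) * window, round(fold, 2)))
    return peaks


# Made-up read start positions on a single chromosome.
chip_reads = [105, 110, 120, 130, 140, 150, 160, 170, 180, 510]
control_reads = [15, 115, 215, 315, 415, 515, 615, 715, 815, 915]
peaks = call_peaks(chip_reads, control_reads)
print(peaks)
```

In this toy data only the window from 100 to 200 is reported, because nine of the ten ChIP reads fall there while the control is spread evenly. When no control is available, as in the basic tutorial, the expected count has to be estimated from the ChIP sample itself.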


Identification of motifs with MEME

Importing sequences from UCSC Genome Browser into Galaxy

Note that the GVL Galaxy instances, such as Galaxy-tut, are linked to the GVL mirror of UCSC Genome Browser at https://ucsc.genome.edu.au.

Keep in mind that BED files linked to the UCSC Genome Browser through the Galaxy History menu are served through the track hub and cannot be renamed in the Table Browser. Any subsequently uploaded BED file without a track name will replace the existing file. To use multiple BED files in the UCSC Genome Browser, do the following:

  • download the BED file to your computer
  • upload the file to the GVL mirror of the UCSC Genome Browser
  • rename the track in the Table menu
  • click the manage custom tracks button
  • click on the track name
  • replace the text between the quotation marks for the track name and description
  • click the Submit button
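
The text between quotation marks that the steps above edit is part of a UCSC custom track line of the form track name="..." description="...". As an alternative to renaming in the browser, the track line can be written into the downloaded BED file before upload. The Python sketch below does this; the function name, track labels and BED records are hypothetical and serve only as an illustration.

```python
def with_track_line(bed_text, name, description):
    """Return BED text whose first line is a track line with the given labels."""
    lines = bed_text.splitlines()
    # Drop an existing track line so the new one is not duplicated.
    body = [line for line in lines if not line.startswith("track ")]
    header = 'track name="%s" description="%s"' % (name, description)
    return "\n".join([header] + body) + "\n"


# Made-up peak records, for illustration only.
bed = "chr1\t100\t200\tpeak_1\nchr1\t500\t650\tpeak_2\n"
labelled = with_track_line(bed, "CTCF_rep1", "CTCF peaks, replicate 1")
print(labelled)
```

Giving each file a distinct track name in this way should keep a later upload from replacing an earlier one.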

Motif calling with MEME

MEME runs on an external service and is computationally intensive.

The maximum input for MEME is 60,000 characters (September 2014).
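
Because of this limit, it can help to check the size of a FASTA file before submitting it. The Python sketch below parses FASTA text and keeps sequences, in their original order, until the character budget is reached. Whether the limit counts header lines as well as sequence characters is not stated in the source; counting sequence characters only is an assumption here, and the function names and example sequences are hypothetical.

```python
MEME_MAX_CHARACTERS = 60000  # limit quoted in the text (September 2014)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line
    return [(header, sequence) for header, sequence in records]


def fit_to_limit(records, limit=MEME_MAX_CHARACTERS):
    """Keep records, in order, while the summed sequence length stays within limit."""
    kept, used = [], 0
    for header, sequence in records:
        if used + len(sequence) > limit:
            break
        kept.append((header, sequence))
        used += len(sequence)
    return kept, used


# Made-up sequences and a small limit so the effect is visible.
fasta = ">peak_1\nACGTACGT\nACGT\n>peak_2\nGGGGCCCC\n>peak_3\nTTTTTTTTTTTT\n"
records = parse_fasta(fasta)
kept, used = fit_to_limit(records, limit=20)
print(len(records), len(kept), used)
```

If the peaks are sorted by score before the sequences are extracted, truncating in order keeps the strongest peaks, which is usually what one wants for motif finding.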


References

Landt, Stephen G., et al. "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia." Genome Research 22.9 (2012): 1813-1831.