ChIP-Seq

Introductory tutorial

Igor Makunin

Research Computing Centre

The University of Queensland

http://genome.edu.au

Contents

Tutorial Overview

Background [10 min]

Preparation [10 min]

Section 1: Quality control with FastQC [10 mins]

Section 2: Alignment with BOWTIE [25 min]

Section 3: Peak calling with MACS [20 min]

Tutorial Overview

In this tutorial we cover the concepts of chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq), using a small subset of data obtained from a mouse cell culture.

What’s not covered

The tutorial is designed as an introduction to ChIP-Seq analysis. In practice, the analysis can include control datasets, and the datasets would be much larger.

Background [10 min]

Where is the data in this tutorial from?

The data for this tutorial came from a ChIP-Seq experiment for the CTCF protein in the mouse MEL cell line (link). We used a small subset of the original dataset containing ~400,000 reads mapped to the first 30 Mb of chromosome 19. The Phred quality scores in the FASTQ file used in the tutorial are encoded in Sanger format (offset Phred+33). We will use only one replicate and no control dataset.
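To make the Sanger (Phred+33) encoding concrete, the sketch below converts a quality string into Phred scores by subtracting 33 from each character's ASCII code, and then into error probabilities. The function names and the example quality string are made up for illustration and are not taken from the tutorial data.

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Convert a Sanger (Phred+33) quality string to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(score: int) -> float:
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical quality string, not taken from the tutorial file.
example_quality = "II5+!"
scores = phred33_to_scores(example_quality)
print(scores)  # [40, 40, 20, 10, 0]
print([error_probability(q) for q in scores])
```

A score of 20 therefore means a 1 in 100 chance that the base call is wrong, and a score of 40 means 1 in 10,000.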

The protocol:

The protocol in a nutshell:

  • Import reads in FASTQ format
  • Do a quality control of the input data
  • Align reads to the mouse genome
  • Call peaks using MACS software package

Preparation [10 min]

  1. Register as a new user in Galaxy if you don’t already have an account.
  1. Open a browser and go to the Galaxy server: http://galaxy-tut.genome.edu.au
  1. NOTE: Firefox, Safari and Chrome all work well; Internet Explorer works less well.
  1. Register as a new user (User > Register), or log in if you already have an account.
  1. Import the ChIP-Seq data (reads in FASTQ format) for the workshop. You can do this in a few ways, of which by far the easiest is:

Go to Shared Data > Published Histories and click on ‘Data for ChIP-Seq Basic tutorial’. Then click ‘Import History’ at the top right, wait for the history to be imported to your account, and then click ‘start using this history’.

  1. This will create a new Galaxy history in your account with the required data file

Alternatively:

Download the data directly to your computer using this URL, then upload it to Galaxy:

FASTQ file with sequence reads:

https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/ChIP-Seq_tutorial/mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

Once the file is downloaded, in Galaxy click on Get data > Upload File

Click 'Choose File', locate the local copy of the FASTQ file and upload

Make sure you select 'fastqsanger' under File Format or the next steps won't work!

Click Execute and wait for the file to upload to Galaxy. It can take a couple of minutes.

Or:

Upload the data directly to Galaxy using the file URLs

Follow the steps above, but instead of downloading the file locally and then uploading it, copy the URL given above into the URL/Text box and Galaxy will download and import the file directly.

Remember to select ‘fastqsanger’ under File Format for the sequence file.

  1. You should now have the following file in your Galaxy history:

mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

  1. View the FASTQ file
  1. Click on the eye icon to the top right of the FASTQ file in your history to view the first part of the file. Note that the reads are very short (36 nt).
  2. If you’re not familiar with the FASTQ format, click here for an overview
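The four-line layout of a FASTQ record (header, sequence, separator, quality) can also be seen in a few lines of code. This is a minimal sketch that assumes unwrapped records of exactly four lines each; the two 36 nt records are invented stand-ins for the tutorial file, and `read_fastq` is a hypothetical helper name.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text stream.

    Assumes the simple layout of four lines per record with no line wrapping.
    """
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header.strip())
        yield header[1:].strip(), sequence, quality


# Two made-up 36 nt records standing in for the tutorial file.
example_text = "\n".join([
    "@read1", "ACGT" * 9, "+", "I" * 36,
    "@read2", "GGCA" * 9, "+", "5" * 36,
]) + "\n"

records = list(read_fastq(io.StringIO(example_text)))
length_counts = Counter(len(seq) for _, seq, _ in records)
print(length_counts)  # every read in this toy input is 36 nt long
```

To run it on a real file, replace the `io.StringIO` object with `open(path)` for a local copy of the FASTQ file.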

Section 1: Quality control with FastQC [10 mins]

ChIP-Seq reads are generally fairly short, e.g. the reads used in this tutorial are just 36 nt long. Because the reads are obtained from regions around the binding sites of specific proteins, the GC content might differ from that of the whole genome. Because only a tiny proportion of the genome is sequenced in ChIP-Seq experiments, such datasets generally contain a high proportion of identical (duplicated) sequences.

  1. Run FastQC over the input data

From the tools menu in the left hand panel of Galaxy, select NGS: QC and manipulation > FastQC:Read QC and run with the default parameters.

  1. Examine the output file

Click on the eye icon.

The input dataset has 375,301 sequences.

The GC content is 54%, which is very high (the mouse genome as a whole is roughly 42% GC).

The dataset has a high duplication level, but this is acceptable for ChIP-Seq experiments.
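The two numbers discussed above, GC content and duplication level, are simple to compute in principle. The sketch below shows one straightforward way to do it on a toy list of reads. It is not FastQC's own method (FastQC uses its own estimate for duplication, so its figures will not match exactly), and the function names and reads are hypothetical.

```python
from collections import Counter


def gc_percent(reads):
    """Percentage of G and C bases across all reads."""
    total_bases = sum(len(read) for read in reads)
    gc_bases = sum(read.count("G") + read.count("C") for read in reads)
    return 100.0 * gc_bases / total_bases


def duplicate_percent(reads):
    """Percentage of reads that repeat a sequence already seen."""
    distinct = len(Counter(reads))
    return 100.0 * (len(reads) - distinct) / len(reads)


# Toy reads for illustration only.
toy_reads = ["GGCC", "GGCC", "ATAT", "GCAT"]
print(gc_percent(toy_reads))         # 62.5
print(duplicate_percent(toy_reads))  # 25.0
```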

Section 2: Alignment with BOWTIE [25 min]

The aim here is to:

  • Align the reads to the reference genome
  • Sort the reads in the alignment and add read group tags
  • Visualise the alignment in the UCSC Genome Browser

  1. Align the reads using BOWTIE

From the tools menu in the left hand panel of Galaxy, select NGS: Mapping > Map with Bowtie for Illumina.

Select the reference genome Mouse (mm10).

Change “Bowtie settings to use:” to Full parameters list:

Seed length (-l): 25

Suppress all alignments for a read if more than n reportable alignments exist (-m): 1

(this keeps only the reads that map to a single position in the mouse genome)

This step generates an alignment in the SAM format (a readable text file) that can be examined by clicking on the eye icon.
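When you look at the SAM output, each alignment is one tab-separated line whose first fields are the read name, a bitwise flag, the chromosome, the 1-based position, the mapping quality and the CIGAR string. The sketch below parses those fields and decodes two flag bits defined in the SAM specification (0x4 for unmapped, 0x10 for reverse strand). The example line is invented, not taken from the tutorial output, and `parse_sam_line` is a hypothetical helper.

```python
def parse_sam_line(line):
    """Extract the leading mandatory fields from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_name": fields[0],
        "flag": flag,
        "chrom": fields[2],
        "position": int(fields[3]),  # 1-based leftmost mapped position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "is_unmapped": bool(flag & 0x4),
        "strand": "-" if flag & 0x10 else "+",
    }


# Made-up alignment line for a 36 nt read on the reverse strand.
example_line = "\t".join([
    "read1", "16", "chr19", "4083250", "255", "36M",
    "*", "0", "0", "ACGT" * 9, "I" * 36,
])
alignment = parse_sam_line(example_line)
print(alignment["chrom"], alignment["position"], alignment["strand"])
```

Header lines in a SAM file start with `@` and should be skipped before calling the function.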

  1. Sort reads in the alignment, add read groups and convert SAM to BAM

Some tools work only with sorted alignments. Some tools also require extra information for alignments, called read groups. The text-based SAM format is bulky compared to the binary BAM format, and BAM files are indexed for faster access. The Add or Replace Groups tool from the Picard package adds read group tags and sorts the alignments.

From the tools menu select NGS: Picard > Add or Replace Groups.

Use the following groups:

Read group ID (ID tag): group1

Read group sample name (SM tag): replicate1

Read group library (LB tag): lib1

Read group platform (PL tag): illumina

Read group platform unit: rgpu1

Output bam instead of sam: yes (tick the box)  

Note: the output binary file is not human-readable, and clicking on the eye icon will download the file to your computer.
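In the SAM/BAM format a read group is stored as an `@RG` line in the file header, and each alignment refers to it through an `RG:Z:` tag. The sketch below builds the header line that the values entered above would correspond to, assuming the platform unit maps to the `PU` tag as in the SAM specification. It illustrates the file format only and is not how Picard itself is implemented.

```python
def read_group_header(tags):
    """Build a SAM @RG header line from a mapping of tag name to value."""
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in tags.items())


# Values entered in the Add or Replace Groups form in this tutorial.
tutorial_read_group = {
    "ID": "group1",
    "SM": "replicate1",
    "LB": "lib1",
    "PL": "illumina",
    "PU": "rgpu1",  # assumed tag name for "Read group platform unit"
}

header_line = read_group_header(tutorial_read_group)
alignment_tag = "RG:Z:" + tutorial_read_group["ID"]
print(header_line)
print(alignment_tag)  # RG:Z:group1
```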

  1. Visualise the BAM alignment on the UCSC Genome Browser

Click on the name of the alignment (Add or Replace Groups…) to expand the info.

Click display at UCSC main

This connects the Galaxy server to the UCSC Genome Browser, which will open in a new tab in your browser.

Once the data has been passed to the browser, paste chr19:4,083,203-4,085,162 into the position box and press Enter or click go.

Click on the user track to expand the data. Alternatively, select the ‘pack’ option in the pull-down menu under your custom track.

Different colors represent reads mapped in forward and reverse-complementary orientations.

Section 3: Peak calling with MACS [20 min]

The aim here is to:

  • Identify regions with high read density (peaks)
  • Visualise the called peaks and aligned reads on UCSC Genome Browser

  1. Identify peaks using MACS

From the tools menu select NGS: Peak Calling > MACS Model-based Analysis of ChIP-Seq. Use the BAM file produced by the Add or Replace Groups step as the input.

Change:

Effective genome size: 30000000 [7 zeroes] (the reads for the tutorial were selected from a 30 Mb region on chr19)
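The effective genome size matters because MACS uses it when estimating the genome-wide background read density. The rough sketch below shows how much the expected background changes between the 30 Mb used here and a whole-genome value. It assumes a uniform background, uses the FastQC read count as an upper bound on the number of aligned reads, takes 1.87e9 as the value commonly used with MACS for mouse, and picks an arbitrary 200 bp window. It is a back-of-the-envelope calculation, not the MACS algorithm.

```python
def expected_reads_per_window(total_reads, genome_size, window):
    """Expected read count in one window if reads were spread uniformly."""
    return total_reads * window / genome_size


TOTAL_READS = 375_301          # sequences reported by FastQC in this tutorial
TUTORIAL_GENOME = 30_000_000   # effective genome size entered above
WHOLE_MOUSE_GENOME = 1.87e9    # assumed whole-genome value for mouse
WINDOW = 200                   # hypothetical window width in bp

with_tutorial_size = expected_reads_per_window(TOTAL_READS, TUTORIAL_GENOME, WINDOW)
with_whole_genome = expected_reads_per_window(TOTAL_READS, WHOLE_MOUSE_GENOME, WINDOW)
print(round(with_tutorial_size, 2))  # about 2.5 reads per window
print(round(with_whole_genome, 2))   # about 0.04 reads per window
```

With the whole-genome value the expected background would be about 60 times too low for this subset, so far more regions would look enriched than really are.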

In Galaxy, MACS generates two files: a BED file with genomic intervals (peaks) and an HTML report.

Display the html report by clicking on the eye icon.

Download the MACS_in_Galaxy_peaks.xls file by clicking on its name.

Open the downloaded file, which contains the peaks identified by MACS, in Excel.

The header of the transformed P-value column, ‘=-10*LOG10(pvalue)’, might be displayed as #NAME? because Excel treats it as a formula. You can fix this by typing an apostrophe (') at the start of the text.
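The transformed P-value in that column is simply the P-value on a negative log scale, multiplied by 10. The sketch below shows the conversion in both directions; the function names are made up for illustration.

```python
import math


def transformed_p_value(p_value):
    """Return -10 * log10(p), the score shown in the MACS peaks table."""
    return -10 * math.log10(p_value)


def p_value_from_transformed(score):
    """Invert the transformation to recover the P-value."""
    return 10 ** (-score / 10)


print(transformed_p_value(1e-5))      # a P-value of 1e-5 gives a score of about 50
print(p_value_from_transformed(100))  # a score of 100 corresponds to P = 1e-10
```

Larger scores therefore mean stronger evidence for a peak: every increase of 10 corresponds to a tenfold smaller P-value.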

  1. Upload the identified peak regions to UCSC Genome Browser

Click on MACS on data ... (peaks: bed) file name in your history.

Click on display at UCSC main.

This will open a new tab with UCSC Genome Browser.

Go to chr19:4,083,203-4,085,162

Note that not all regions with mapped reads were classified as peaks.
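Checking whether a browser region contains a called peak can also be done outside the browser. The sketch below converts a UCSC-style position (1-based, inclusive) to BED-style coordinates (0-based, half-open) and tests it against the first three columns of BED lines. The two peak lines are invented for illustration and are not real MACS output; the helper names are hypothetical.

```python
def parse_ucsc_position(text):
    """Convert 'chr:start-end' (1-based, inclusive) to a 0-based half-open tuple."""
    chrom, span = text.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chrom, start - 1, end


def parse_bed_line(line):
    """Read chromosome, start and end from the first three BED columns."""
    chrom, start, end = line.split("\t")[:3]
    return chrom, int(start), int(end)


def overlaps(region, peak):
    """True if two half-open intervals on the same chromosome share a base."""
    return region[0] == peak[0] and region[1] < peak[2] and peak[1] < region[2]


# Hypothetical peak lines, not taken from the tutorial results.
hypothetical_bed = [
    "chr19\t4083900\t4084300\tpeak_1\t100.0",
    "chr19\t5000000\t5000400\tpeak_2\t60.0",
]

region = parse_ucsc_position("chr19:4,083,203-4,085,162")
hits = [peak for peak in map(parse_bed_line, hypothetical_bed) if overlaps(region, peak)]
print(region[2] - region[1])  # width of the viewed region in bp: 1960
print(hits)
```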

The completed tutorial ‘ChIP-Seq Basic tutorial’ is available on Galaxy-tut in Shared Data > Published Histories