ChIP-Seq and discovery of motifs

Advanced tutorial

Igor Makunin

Research Computing Centre

The University of Queensland

http://genome.edu.au

Contents

Tutorial Overview

Background [10 min]

Preparation [10 min]

Section 1: Alignment with BOWTIE [30 min]

Section 2: Peak calling with MACS2 [20 min]

Section 3: Import sequences of the peak regions from UCSC Table Browser into Galaxy [20 min]

Section 4: Identification of the CTCF motif using MEME [20 min]

Tutorial Overview

In this tutorial we cover the concepts of chromatin immunoprecipitation followed by massively parallel sequencing using small subsets of data obtained in mouse cell culture. The tutorial includes peak calling for CTCF ChiP-Seq data with control, import of sequences corresponding to the peak regions from UCSC Genome Browser into Galaxy, and identification of CTCF binding sites by MEME software.

What’s not covered

The tutorial does not include any quality control steps. Quality control for input files were discussed in ChIP-Seq Basic tutorial. In practice the datasets would be much larger.  

Background [10 min]

Where is the data in this tutorial from?

The data for this tutorial came from an ChIP-Seq experiment for CTCF protein in the mouse MEL cell line (link). We use a small subset of the original dataset (Biological replicate 1) containing ~400,000 reads mapped to the first 30 Mb on chromosome 19. As a control we use ~80,000 reads mapped to the same region from the Biological replicate - 1 (link). The Phred quality score encoding of the FASTQ files is in Sanger format (offset Phred+33).

The protocol:

The protocol in a nutshell:

  • Import reads in FASTQ format
  • Align reads from experiment and control to the mouse genome
  • Call peaks using MACS2 software package
  • Select 50 peak regions with the lowest P-value and import DNA sequences for these regions from UCSC Table Browser into Galaxy
  • Identify the CTCF motif in 50 sequences using MEME package

Preparation [10 min]

  1. Register as a new user in Galaxy if you don’t already have an account.
  1. Open a browser and go to the the Galaxy server: http://galaxy-tut.genome.edu.au
  1. NOTE: Firefox/Safari/Chrome all work well, Internet Explorer not so well
  1. Register as a new user: User > Register or login if you already have an account
  1. Import the ChIP-Seq data (reads in FASTQ format) for the workshop. You can do this in a few ways, of which by far the easiest is:

Go to Shared Data > Published Histories and click on  ‘Data for ChIP-Seq Advanced tutorial. Then click 'Import History' at top right, wait for the history to be imported to your account, and then ‘start using this history’.

  1. This will create a new Galaxy history in your account with the required data files

Alternatively:

Download the data directly to your computer using these URLs, then upload to Galaxy:

FASTQ files with sequence reads:

https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/ChIP-Seq_tutorial/mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/ChIP-Seq_tutorial/mouse_ChIP-Seq_control_reads

Once files are downloaded, in Galaxy click on Get data > Upload File

Click 'Choose File', locate the local copy of the FASTQ file and upload

Make sure you select 'fastqsanger' under File Format or the next steps won't work!

Click Execute and wait for the file to upload to Galaxy. It can take a couple of minutes.

Or:

Upload the data directly to Galaxy using the file URLs

Follow the steps above but instead of downloading the file locally and then uploading, copy the URLs from the previous step into the URL/Text box and Galaxy will download and import it directly.

Remember to select the ‘fastqsangerFile Format for the sequence file.

  1. You should now have the following files in your Galaxy history:

mouse_ChIP-Seq_control_reads
mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

(If you uploaded the data to Galaxy, edit the file names by clicking on the pencil icon

eg rename

https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/ChIP-Seq_tutorial/mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

to

mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

  1. View the FASTQ file
  1. Click on the eye icon to the top right of each fastq file to view the first part of the file. Note that the reads are very short (36 nt).
  2. If you’re not familiar with the FASTQ format, click here for an overview

Section 1: Alignment with BOWTIE [30 min]

The aim here is to:

  • Align the reads to the reference genome
  • Sort reads in alignment and add group tags
  • Visualise the alignment on UCSC Genome Browser

  1. Align the reads using BOWTIE

From the tools menu in the left hand panel of Galaxy, select NGS: Mapping > Map with Bowtie for Illumina.

Select a reference genome Mouse (mm10).

FASTQ file: mouse_CTCF_ChIP-Seq_reads_30Mb_chr19

Change “Bowtie settings to use:” to Full parameters list:

Seed length (-l): 25

Suppress all alignments for a read if more than n reportable alignments exist (-m): 1

(this will keep reads mapped only to one position in the mouse genome)

This step generates an alignment in the SAM format (a readable text file) that can be examined by clicking on the eye icon.

Repeat for the control data as following:

click on the name of the active BOWTIE job (# 3) to expand the info

click the re-run button      

change the input FASTQ file to: mouse_ChIP-Seq_control_reads

click Execute button

  1. Sort reads in the alignment, add read groups and convert SAM to BAM

Some tools work only with sorted alignments. Some tools also require extra information for alignments called read groups. The text SAM format is bulky compared to the binary BAM format. BAM files are indexed for the faster access. Add or Replace Groups tool from Picard package adds group tags and sorts alignments.

From the tools menu select NGS: Picard > Add or Replace Groups.

Select file “Map with Bowtie for Illumina on data 1:mapped reads”  (the experiment reads)

Use the following groups:

Read group ID (ID tag): idexp

Read group sample name (SM tag): smexp

Read group library (LB tag): lbexp

Read group platform (PL tag): illumina

Read group platform unit: plexp 

Output bam instead of sam: yes (tick the box)  

Note: the output binary file is not readable, and clicking on the eye icon will download file to your computer. 

Change the name of the new file by clicking on the pencil icon

 Edit the name of the new file by clicking on the pencil icon.  Rename

Add or Replace Groups on data 3: bam with read groups replaced

to

Experiment

Repeat for the control data as following:

click on the name of the active or queued Add or Replace Groups job (# 5) to expand the info

click the re-run button

change the input file on “Map with Bowtie for Illumina on data 2:mapped reads”  (the control)

change the read groups to

Read group ID (ID tag): idctl

Read group sample name (SM tag): smctl

Read group library (LB tag): lbctl

Read group platform (PL tag): illumina

Read group platform unit: plctl 

click Execute button

 Edit name of the new file by clicking on the pencil icon.  Rename

Add or Replace Groups on data 4: bam with read groups replaced

to

Control

  1. Visualise the BAM alignment on the UCSC Genome Browser

Click on the name of the alignment (Add or Replace Groups…, step 5) to expand the info.

Click display at UCSC gvlmain

This will connect Galaxy server with UCSC Genome Browser which will be open in a new tab on your browser

Close the window with UCSC Genome Browser

Repeat for the control data (step 6)

Once the data is passed to the browser, paste chr19:4,083,203-4,085,162 into position window on the browser and hit Enter / click go.

Expand the tracks to ‘pack’ option in the pull-down menu under your custom track, and refresh.

Different colors represent reads mapped in forward and reverse-complementary orientations. Note the difference in reads density between the experiment and control data.

Section 2: Peak calling with MACS2 [20 min]

The aim here is to:

  • Identify regions with high read density (peaks)
  • Visualise the called peaks and aligned reads on UCSC Genome Browser

  1. Identify peaks using MACS

From the tools menu select NGS: Peak Calling > MACS2 Model-based Analysis of ChIP-Seq.

Use the BAM file from the step 5 (“Experiment”) as an input in ChIP-Seq Tag File.

Use the BAM file from the step 6 (“Control”) as ChIP-Seq Control File.

Change:

Effective genome size: 30000000 [7 zeroes] (the reads for the tutorial were selected from 30 Mb region on chr19)

Keep the rest default

Execute

In Galaxy MACS2 generates four files including a BED file with genomic intervals and html report.  

Click on the bed file name to display the info. It shows MACS2 identified 539 peak regions.

Display the html report by clicking on the eye icon.

Download MACS_in_Galaxy_peaks.xls file by clicking on the name.

Open Excel file containing peaks identified by MACS.

The column transformed P-value ‘=-10*LOG10(pvalue)’ might be displayed as #NAME? You can fix it by adding ‘ at the start of the text.

Section 3: Import sequences of the peak regions from UCSC Table Browser into Galaxy [20 min]

The aim here is to:

  • Select 50 peaks with the lowest P-values
  • Upload the coordinates of 50 regions to GVL mirror of UCSC Genome Browser
  • Import sequences of the peak regions from UCSC Table Browser into Galaxy

MEME has an upper limit for the input data. Total length of sequences for all peaks called by MACS2 exceeds MEME limit, so we will use top 50 peak regions.  

  1. Sort the peak regions on P-value

From the tools menu select Filter and Sort > Sort data in ascending or descending order.

Use BED file produced by MACS2: 7 MACS2: callpeak on data 6 and data 5 (peaks: bed) containing 539 regions.

Sort on column: c5

With flavor: Numerical sort

Everything in: Descending order

  1. Select 50 peak regions with the lowest P-values

From the tools menu select Text Manipulation > Select first lines from a dataset.

Select first: 50

Use output of the previous step (11 Sort on data 7) as an input.

  1. Display the 50 peak regions to UCSC Genome Browser

Click on 12 Select first on data 11 file name in your history.

Click on display at UCSC gvlmain.

This will open a new tab with UCSC Genome Browser. (If you see Warning/Error(s): redirected to non-http(s): /root click Ok)

Note that in this case the BED file is used through the track hub in the same way as the BAM files; that is, they are not loaded to the UCSC Browser as custom tracks, so the names of these tracks cannot be changed in the browser.

Alternatively, download the BED file 12 Select first on data 11 to your computer, and than upload it to GVL mirror of UCSC Genome Browser http://ucsc.genome.edu.au as a custom track for the mm10 genome assembly. This time the BED file is loaded to the UCSC Browser, so you can rename the manually uploaded custom tracks in Genome / Table  Browser. 

Go to chr19:4,083,203-4,085,162

Note that not all regions with the mapped reads were classified as peaks.

  1. Import sequences from UCSC Genome Browser into Galaxy

In Galaxy Tools menu go to Get Data > UCSC GVL Mirror table browser.

You should see UCSC Table Browser in the middle (working) window with setting for the mouse genome (mm10 assembly).

Select:

group: Custom Tracks

track: User Track (unless you rename the track in the previous step)

output format: sequence

Make sure that Send output to Galaxy is set to Yes (ticked)

Click Get output.

Do not use header.

In new window click Send query to Galaxy.

The sequences will be imported into Galaxy as a multi-fasta file 13 UCSC GVL Mirror on Mouse: ct_UserTrack_3545 (genome).

Section 4: Identification of the CTCF motif using MEME [20 min]

The aim here is to:

  • identify a nucleotide motif (a binding site consensus sequence) for CTCF protein

  1. Run MEME on 50 sequences

In Galaxy Tools menu go to Motif > MEME - Multiple Em for Motif Elicitation

Select multi-fasta file 13 UCSC GVL Mirror on Mouse: ct_UserTrack_3545 (genome) from the previous step as an input

Use BRAEMBL (Australia) mirror

MEME runs on an external BRAEMBL server.

I certify that I am not using this tool for commercial purposes.: yes

Click Execute

MEME generates three files in Galaxy history. Display the html file 16 MEME on data 13 (html) by clicking on the eye icon.

Click on MEME html output.

Compare the motif logo with known CTCF binding site consensus sequences, e.g. 

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3091324/figure/F2/

from Essien et al., Genome Biol. 2009; 10(11): R131.

Hint: select Reverse Complement for Motif 1.

Check positions of the CTCF sites within 50 peak regions.

 

The completed tutorial ‘ChIP-Seq Advanced tutorial’ is available on Galaxy-tut in Shared Data > Published Histories.

Extra. Run MACS2 without the control dataset. Check the difference in the number of identified peaks compared to Experiment/Control setup.