ENCODE3 pipeline v1 specifications

Pipeline Overview

Screen Shot 2014-05-01 at 4.10.40 PM.png


0. FASTQ read quality filtering

Optional, was not specified


0a. Crop FASTQ

Program(s)

  • Trimmomatic 0.39

Input(s)

  • Input: $fastq
  • crop_length: no default. works only when it’s defined.
  • crop_length_tol: works only when both crop_length and crop_length_tol are defined. Tolerance for cropping. 2 by default.

Output(s)

  • Cropped FASTQ:
  • SE: $cropped_fastq
  • PE: $cropped_fastq1, $cropped_fastq2

Commands

CROP=${crop_length}

MINLEN=$((crop_length-crop_length_tol))

# Any reads longer than CROP will be cropped to CROP

# Any reads shorter than MINLEN will be removed.

# For example of crop_length=100 , crop_length_tol=2

# CROP is 100

# MINLEN is 98 (=100 - 2)

# Trimmomatic will

#     crop all reads longer than 100.

#     keep reads with length 98 99 100 untrimmed.

#     drop all reads shorter than 98.

# NOTE: We have a separate cropping function (trimfastq.py) for cross-correlation analysis (xcor). We will use original FASTQ without cropping for xcor and orignal fastqs will be cropped there.

NTH=2

# FOR SINGLE ENDED FASTQ fastq

java -jar Trimmomatic.0.39.jar SE -threads ${NTH} ${fastq} ${cropped_fastq} MINLEN:${MINLEN} CROP:${CROP}

# FOR PAIRED END FASTQS:  fastq1, fastq2

#

java -jar Trimmomatic.0.39.jar PE -threads ${NTH} ${fastq1} ${fastq12} ${cropped_fastq1} ${tmp_cropped1} ${cropped_fastq2} ${tmp_cropped2} MINLEN:${MINLEN} CROP:${CROP}

rm -f ${tmp_cropped1} ${tmp_cropped2}

QC to report

N/A

Status

Frozen

1a. Read alignment (bowtie2 aligner)

Single-End ChIP-seq parameters

Program(s)

  • Bowtie2 version 2.2.6
  • SAMtools 1.7
  • SAMstats (version 0.2.1)

Input(s)

  • Input: $fastq
  • use $cropped_fastq in Step 0a if fastq is cropped (chip.crop_length>0)
  • Bowtie2 index: $bwt2_idx

Output(s)

  • BAM file $bam
  • mapping stats from flagstat (SAMstats) $flagstat_qc

Commands

bam   = "$prefix.bam"

log     = "$prefix.align.log"

flagstat_qc = “$prefix.flagstat.qc”

bowtie2 --mm -x $bwt2_idx --threads $nth_bwt2 -U <(zcat -f $fastq) 2> $log | \

                        samtools view -Su /dev/stdin | samtools sort - $prefix

samtools sort -n --threads 10 ${bam} -O SAM  | SAMstats  --sorted_sam_file -  --outf ${flagstat_qc}

QC to report

Output from last command i.e. samtools flagstat

Status

Frozen

Paired-End ChIP-seq parameters

Program(s)

  • Bowtie2 version 2.2.6
  • SAMtools 1.7
  • SAMstats (version 0.2.1)

Input(s)

  • Input: $fastq1, $fastq2
  • use $cropped_fastq1, $cropped_fastq2 in Step 0a if fastq is cropped (chip.crop_length>0)
  • Bowtie2 index: $bwt2_idx

Output(s)

  • BAM file $bam
  • mapping stats from flagstat (SAMstats) $flagstat_qc

Commands

bam   = "$prefix.bam"

log     = "$prefix.align.log"

flagstat_qc     = "$prefix.flagstat.qc"

bowtie2 -X2000 --mm --threads $nth_bwt2 -x $bwt2_idx \

                    -1 $fastq1 -2 $fastq2 \

                    2>$log | \

                             samtools view -Su /dev/stdin | samtools sort - $prefix

samtools sort -n --threads 10 ${bam} -O SAM  | SAMstats --sorted_sam_file -  --outf ${flagstat_qc}

QC to report

Output from last command i.e. samtools flagstat

Comment

Status

Frozen

 1a. Read alignment (BWA aligner)

Single-End ChIP-seq parameters

  • With dynamic read trimming q = 5
  • seed length l = 32
  • max. mismatches in seed k = 2

Program(s)

  • BWA version 0.7.13
  • SAMtools (version 1.7)
  • SAMstats (version 0.2.1)

Input(s)

  • FASTQ file ${FASTQ_FILE_1}
  • hg19 male or female or mm9 reference sequence ${BWA_INDEX_NAME}

Output(s)

  • BAM file ${RAW_BAM_FILE}
  • mapping stats from flagstat (SAMstats) ${RAW_BAM_FILE_MAPSTATS}

Commands

# ========================================

# Map reads to create raw SAM file

# ========================================

SAI_FILE_1="${OFPREFIX}.sai"

RAW_BAM_PREFIX="${OFPREFIX}.raw.srt"

RAW_BAM_FILE="${RAW_BAM_PREFIX}.bam" # To be stored

RAW_BAM_FILE_MAPSTATS="${RAW_BAM_PREFIX}.flagstat.qc" # QC File

module add bwa/0.7.13

module add samtools/1.7

bwa aln -q 5 -l 32 -k 2 -t ${NTHREADS} ${BWA_INDEX_NAME} ${FASTQ_FILE_1} > ${SAI_FILE_1}

bwa samse ${BWA_INDEX_NAME} ${SAI_FILE_1} ${FASTQ_FILE_1} | samtools view -Su - | samtools sort -o ${RAW_BAM_FILE} -

rm ${SAI_FILE_1}

# Use bwa-mem for reads >= 70 bp

# bwa mem -M -t ${NTHREADS} ${BWA_INDEX_NAME} ${FASTQ_FILE_1} | samtools sort -o ${RAW_BAM_FILE} -

samtools sort -n --threads 10 ${RAW_BAM_FILE} -O SAM  | SAMstats --sorted_sam_file -  --outf ${RAW_BAM_FILE_MAPSTATS}

QC to report

Output from last command i.e. samtools flagstat

Status

Frozen


Paired-End ChIP-seq parameters

  • With dynamic read trimming q = 5
  • seed length l = 32
  • max. mismatches in seed k = 2

Program(s)

  • BWA version 0.7.10
  • SAMtools (version 1.7)
  • SAMstats (version 0.2.1)

Input(s)

  • FASTQ file 1 ${FASTQ_FILE_1}
  • FASTQ file 2 ${FASTQ_FILE_2}
  • hg19 male or female or mm9 indexes ${BWA_INDEX_NAME}

Output(s)

  • Raw BAM file ${RAW_BAM_FILE}
  • mapping stats from flagstat (SAMstats) ${RAW_BAM_FILE_MAPSTATS}

Commands

SAI_FILE_1="${OFPREFIX}_1.sai"

SAI_FILE_2="${OFPREFIX}_2.sai"

RAW_SAM_FILE="${OFPREFIX}.raw.sam.gz"

bwa aln -q 5 -l 32 -k 2 -t ${NTHREADS} ${BWA_INDEX_NAME} ${FASTQ_FILE_1} > ${SAI_FILE_1}

bwa aln -q 5 -l 32 -k 2 -t ${NTHREADS} ${BWA_INDEX_NAME} ${FASTQ_FILE_2} > ${SAI_FILE_2}

bwa sampe ${BWA_INDEX_NAME} ${SAI_FILE_1} ${SAI_FILE_2} ${FASTQ_FILE_1} ${FASTQ_FILE_2} | gzip -nc > ${RAW_SAM_FILE}

rm ${SAI_FILE_1} ${SAI_FILE_2}

# Use bwa-mem for reads >= 70 bp

# bwa mem -M -t ${NTHREADS} ${BWA_INDEX_NAME} ${FASTQ_FILE_1} ${FASTQ_FILE_2} | gzip -nc > ${RAW_SAM_FILE}

# ==============================================================                                                                                                                                                                                                              

# Remove read pairs with bad CIGAR strings and sort by position                                                                                                                                                                                                              

# ==============================================================

RAW_BAM_PREFIX="${OFPREFIX}.raw.srt"

RAW_BAM_FILE="${RAW_BAM_PREFIX}.bam" # To be stored                                                                                                                                                                                                                          

BADCIGAR_FILE="${TMP}/badReads${RANDOM}.tmp"

RAW_BAM_FILE_MAPSTATS="${RAW_BAM_PREFIX}.flagstat.qc" # QC File

# Find bad CIGAR read names                                                                                                                                                                                                                                                  

zcat ${RAW_SAM_FILE} | awk 'BEGIN {FS="\t" ; OFS="\t"} ! /^@/ && $6!="*" { cigar=$6; gsub("[0-9]+D","",cigar); n = split(cigar,vals,"[A-Z]"); s = 0; for (i=1;i<=n;i++) s=s+vals[i]; seqlen=length($10) ; if (s!=seqlen) print $1”\t” ; }' | sort | uniq > ${BADCIGAR_FILE}

# Remove bad CIGAR read pairs                                                                                                                                                                                                                                                

if [[ $(cat ${BADCIGAR_FILE} | wc -l) -gt 0 ]]

then

    zcat ${RAW_SAM_FILE} | grep -v -F -f ${BADCIGAR_FILE} | samtools view -Su - | samtools sort - ${RAW_BAM_PREFIX}

else

    samtools view -Su ${RAW_SAM_FILE} | samtools sort - ${RAW_BAM_PREFIX}

fi

rm ${BADCIGAR_FILE} ${RAW_SAM_FILE}

samtools sort -n --threads 10 ${RAW_BAM_FILE} -O SAM  | SAMstats --sorted_sam_file -  --outf ${RAW_BAM_FILE_MAPSTATS}

samtools flagstat ${RAW_BAM_FILE} > ${RAW_BAM_FILE_MAPSTATS}

QC to report

Output from last command i.e. samtools flagstat

Comment

We don’t directly convert to BAM since BWA has some weird bugs where it sometimes generated Clipped CIGAR strings which are not compatible with samtools

Status

Frozen


1b. Post-alignment filtering

Single-End ChIP-seq parameters

  • Remove reads unmapped, not primary alignment, reads failing platform, duplicates (-F 1796 or -F 1804 to keep it the same as PE)
  • Remove multi-mapped reads (i.e. those with MAPQ < 30, using -q in SAMtools)

Program(s)

  • SAMtools (version 1.7)
  • MarkDuplicates (Picard - version 1.126)
  • bedtools 2.26.0

Input(s)

  • Raw BAM file ${RAW_BAM_FILE}

Output(s)

  • Filtered deduped position sorted BAM and index file ${FINAL_BAM_FILE} ${FINAL_BAM_INDEX_FILE}
  • Flagstat Metric for filtered BAM file ${FINAL_BAM_FILE_MAPSTATS}
  • Duplication metrics from MarkDuplicates ${DUP_FILE_QC}
  • Library complexity measures ${PBC_FILE_QC}

Commands

# =============================

# Remove  unmapped, mate unmapped

# not primary alignment, reads failing platform

# Remove low MAPQ reads

# ==================  

FILT_BAM_PREFIX="${OFPREFIX}.filt.srt"

FILT_BAM_FILE="${FILT_BAM_PREFIX}.bam"

MAPQ_THRESH=30

samtools view -F 1804 -q ${MAPQ_THRESH} -b ${RAW_BAM_FILE} -o ${FILT_BAM_FILE}

# ========================

# Mark duplicates

# ======================

module add picard-tools/1.126

TMP_FILT_BAM_FILE="${FILT_BAM_PREFIX}.dupmark.bam"

MARKDUP="/srv/gs1/software/picard-tools/1.126/picard MarkDuplicates"

DUP_FILE_QC="${FILT_BAM_PREFIX}.dup.qc" # QC file

java -Xmx4G -jar ${MARKDUP} INPUT=${FILT_BAM_FILE} OUTPUT=${TMP_FILT_BAM_FILE} METRICS_FILE=${DUP_FILE_QC} VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true REMOVE_DUPLICATES=false

mv ${TMP_FILT_BAM_FILE} ${FILT_BAM_FILE}

# ============================

# Remove duplicates

# Index final position sorted BAM

# ============================

FINAL_BAM_PREFIX="${OFPREFIX}.filt.nodup.srt"

FINAL_BAM_FILE="${FINAL_BAM_PREFIX}.bam" # To be stored

FINAL_BAM_INDEX_FILE="${FINAL_BAM_PREFIX}.bai" # To be stored

FINAL_BAM_FILE_MAPSTATS="${FINAL_BAM_PREFIX}.flagstat.qc" # QC file

samtools view -F 1804 -b ${FILT_BAM_FILE} -o ${FINAL_BAM_FILE}

# Index Final BAM file

samtools index ${FINAL_BAM_FILE} ${FINAL_BAM_INDEX_FILE}

samtools sort -n --threads 10 ${FINAL_BAM_FILE} -O SAM  | SAMstats --sorted_sam_file -  --outf ${FINAL_BAM_FILE_MAPSTATS}

# =============================

# Compute library complexity

# =============================

# sort by position and strand

# Obtain unique count statistics

module add bedtools/2.26.0

PBC_FILE_QC="${FINAL_BAM_PREFIX}.pbc.qc"

# PBC File output

# TotalReadPairs [tab] DistinctReadPairs [tab] OneReadPair [tab] TwoReadPairs [tab] NRF=Distinct/Total [tab] PBC1=OnePair/Distinct [tab] PBC2=OnePair/TwoPair

bedtools bamtobed -i ${FILT_BAM_FILE} | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$6}' | grep -v 'chrM' | sort | uniq -c | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n",mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > ${PBC_FILE_QC}

rm ${FILT_BAM_FILE}

QC to report

(1) Flagstat output from Final filtered deduped BAM file ${FINAL_BAM_FILE_MAPSTATS}

(2) PICARD MarkDup output  ${DUP_FILE_QC} http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_What_is_meaning_of_the_histogram_produced_by_MarkDuplicates.3F

(3) Library complexity measures ${PBC_FILE_QC}

  • Format of file

TotalReads [tab] DistinctReads [tab] OneRead [tab] TwoRead [tab] NRF=Distinct/Total [tab] PBC1=OnePair/Distinct [tab] PBC2=OnePair/TwoPair

  • NRF (non redundant fraction)
  • PBC1 (PCR Bottleneck coefficient 1)
  • PBC2 (PCR Bottleneck coefficient 2)
  • PBC1 is the primary measure. Provisionally,
  • 0-0.5 is severe bottlenecking
  • 0.5-0.8 is moderate bottlenecking
  • 0.8-0.9 is mild bottlenecking
  • 0.9-1.0 is no bottlenecking.

Status

Frozen

2a. Convert SE BAM to tagAlign (BED 3+3 format)

Program(s)

  • bedtools 2.26.0
  • gawk
  • shuf

Input(s)

  • Filtered BAM file ${FINAL_BAM_FILE}

Output(s)

  • tagAlign file ${FINAL_TA_FILE}

Commands

# ===================

# Create tagAlign file

# ===================

module add bedtools/2.26.0

# Create SE tagAlign file

FINAL_TA_FILE="${FINAL_BAM_PREFIX}.SE.tagAlign.gz"

bedtools bamtobed -i ${FINAL_BAM_FILE} | awk 'BEGIN{OFS="\t"}{$4="N";$5="1000";print $0}' | gzip -nc > ${FINAL_TA_FILE}

QC to report

None

Status

Frozen


Paired-End ChIP-seq parameters

  • Remove reads unmapped, mate unmapped, not primary alignment, reads failing platform, duplicates (-F 1804).
  • Retain properly paired reads -f 2
  • Remove multi-mapped reads (i.e. those with MAPQ < 30, using -q in SAMtools)
  • Remove PCR duplicates (using Picard’s MarkDuplicates or FixSeq)

Program(s)

  • SAMtools (version 1.7)
  • MarkDuplicates (Picard - version 1.126)
  • bedtools 2.26.0

Input(s)

  • Raw BAM file ${RAW_BAM_FILE}

Output(s)

  • Filtered deduped position sorted BAM and index file ${FINAL_BAM_FILE} ${FINAL_BAM_INDEX_FILE}
  • Filtered deduped name sorted BAM file ${FINAL_NMSRT_BAM_FILE}
  • Flagstat Metric for filtered BAM file ${FINAL_BAM_FILE_MAPSTATS}
  • Duplication metrics from MarkDuplicates ${DUP_FILE_QC}
  • Library complexity measures ${PBC_FILE_QC}

Commands

# =============================

# Remove  unmapped, mate unmapped

# not primary alignment, reads failing platform

# Remove low MAPQ reads

# Only keep properly paired reads

# Obtain name sorted BAM file

# ==================

FILT_BAM_PREFIX="${OFPREFIX}.filt.srt"

FILT_BAM_FILE="${FILT_BAM_PREFIX}.bam"

TMP_FILT_BAM_PREFIX="tmp.${FILT_BAM_PREFIX}.nmsrt"

TMP_FILT_BAM_FILE="${TMP_FILT_BAM_PREFIX}.bam"

MAPQ_THRESH=30

samtools view -F 1804 -f 2 -q ${MAPQ_THRESH} -u ${RAW_BAM_FILE} | samtools sort -n - ${TMP_FILT_BAM_PREFIX} # Will produce name sorted BAM                                                                                                                                   

# Remove orphan reads (pair was removed)

# and read pairs mapping to different chromosomes

# Obtain position sorted BAM

samtools fixmate -r ${TMP_FILT_BAM_FILE} ${OFPREFIX}.fixmate.tmp

samtools view -F 1804 -f 2 -u ${OFPREFIX}.fixmate.tmp | samtools sort - ${FILT_BAM_PREFIX}

rm ${OFPREFIX}.fixmate.tmp

rm ${TMP_FILT_BAM_FILE}

# =============

# Mark duplicates

# =============

module add picard-tools/1.126

TMP_FILT_BAM_FILE="${FILT_BAM_PREFIX}.dupmark.bam"

MARKDUP="/srv/gs1/software/picard-tools/1.126/picard MarkDuplicates"

DUP_FILE_QC="${FILT_BAM_PREFIX}.dup.qc"

java -Xmx4G -jar ${MARKDUP} INPUT=${FILT_BAM_FILE} OUTPUT=${TMP_FILT_BAM_FILE} METRICS_FILE=${DUP_FILE_QC} VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true REMOVE_DUPLICATES=false

mv ${TMP_FILT_BAM_FILE} ${FILT_BAM_FILE}

# ============================

# Remove duplicates

# Index final position sorted BAM

# Create final name sorted BAM

# ============================

FINAL_BAM_PREFIX="${OFPREFIX}.filt.srt.nodup"

FINAL_BAM_FILE="${FINAL_BAM_PREFIX}.bam" # To be stored

FINAL_BAM_INDEX_FILE="${FINAL_BAM_PREFIX}.bai"

FINAL_BAM_FILE_MAPSTATS="${FINAL_BAM_PREFIX}.flagstat.qc" # QC file

FINAL_NMSRT_BAM_PREFIX="${OFPREFIX}.filt.nmsrt.nodup"

FINAL_NMSRT_BAM_FILE="${FINAL_NMSRT_BAM_PREFIX}.bam" # To be stored

samtools view -F 1804 -f 2 -b ${FILT_BAM_FILE} > ${FINAL_BAM_FILE}

samtools sort -n ${FINAL_BAM_FILE} ${FINAL_NMSRT_BAM_PREFIX}

# Index Final BAM file

samtools index ${FINAL_BAM_FILE} ${FINAL_BAM_INDEX_FILE}

samtools sort -n --threads 10 ${FINAL_BAM_FILE} -O SAM  | SAMstats --sorted_sam_file -  --outf ${FINAL_BAM_FILE_MAPSTATS}

# =============================

# Compute library complexity

# =============================

# Sort by name

# convert to bedPE and obtain fragment coordinates

# sort by position and strand

# Obtain unique count statistics

module add bedtools/2.26.0

PBC_FILE_QC="${FINAL_BAM_PREFIX}.pbc.qc"

# TotalReadPairs [tab] DistinctReadPairs [tab] OneReadPair [tab] TwoReadPairs [tab] NRF=Distinct/Total [tab] PBC1=OnePair/Distinct [tab] PBC2=OnePair/TwoPair

samtools sort -n ${FILT_BAM_FILE} ${OFPREFIX}.srt.tmp

bedtools bamtobed -bedpe -i ${OFPREFIX}.srt.tmp.bam | awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' | grep -v 'chrM' | sort | uniq -c | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n",mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > ${PBC_FILE_QC}

rm ${OFPREFIX}.srt.tmp.bam

rm ${FILT_BAM_FILE}

QC to report

(1) Flagstat output from Final filtered deduped BAM file ${FINAL_BAM_FILE_MAPSTATS}

(2) PICARD MarkDup output  ${DUP_FILE_QC} http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_What_is_meaning_of_the_histogram_produced_by_MarkDuplicates.3F

(3) Library complexity measures ${PBC_FILE_QC}

  • Format of file

TotalReadPairs [tab] DistinctReadPairs [tab] OneReadPair [tab] TwoReadPairs [tab] NRF=Distinct/Total [tab] PBC1=OnePair/Distinct [tab] PBC2=OnePair/TwoPair

  • NRF (non redundant fraction)
  • PBC1 (PCR Bottleneck coefficient 1)
  • PBC2 (PCR Bottleneck coefficient 2)
  • PBC1 is the primary measure. Provisionally,
  • 0-0.5 is severe bottlenecking
  • 0.5-0.8 is moderate bottlenecking
  • 0.8-0.9 is mild bottlenecking
  • 0.9-1.0 is no bottlenecking.

Status

Frozen


2a. Convert SE BAM to tagAlign (BED 3+3 format)

Program(s)

  • bedtools 2.26.0
  • gawk
  • shuf

Input(s)

  • Filtered BAM file ${FINAL_BAM_FILE}

Output(s)

  • tagAlign file ${FINAL_TA_FILE}

Commands

# ===================

# Create tagAlign file

# ===================

module add bedtools/2.26.0

# Create SE tagAlign file

FINAL_TA_FILE="${FINAL_BAM_PREFIX}.SE.tagAlign.gz"

bedtools bamtobed -i ${FINAL_BAM_FILE} | awk 'BEGIN{OFS="\t"}{$4="N";$5="1000";print $0}' | gzip -nc > ${FINAL_TA_FILE}

QC to report

None

Status

Frozen

2a. Convert PE BAM to tagAlign (BED 3+3 format)

Program(s)

  • bedtools 2.26.0
  • gawk
  • shuf

Input(s)

  • Filtered BAM file ${FINAL_BAM_FILE}

Output(s)

  • tagAlign file (virtual single end) ${FINAL_TA_FILE}
  • BEDPE file (with read pairs on each line) ${FINAL_BEDPE_FILE}

Commands

# ===================

# Create tagAlign file

# ===================

module add bedtools/2.26.0

# ================

# Create BEDPE file

# ================

FINAL_BEDPE_FILE="${FINAL_NMSRT_BAM_PREFIX}.bedpe.gz"

bedtools bamtobed -bedpe -mate1 -i ${FINAL_NMSRT_BAM_FILE} | gzip -nc > ${FINAL_BEDPE_FILE}

zcat ${FINAL_BEDPE_FILE} | awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' | gzip -nc > ${FINAL_TA_FILE}

QC to report

None

Status

Frozen


2b. Calculate Cross-correlation QC scores

  • Code package: https://github.com/kundajelab/phantompeakqualtools.
  • Dependencies: unix, bash, R-3.20 and above, gawk, samtools, boost C++ libraries, R packages: SPP, caTools, snow
  • Cross-correlation analysis is done on a filtered (but not-deduped) and subsampled BAM. There is a special fastq trimming for cross-correlation analysis. Read1 fastq is trimmed to 50bp first using trimfastq.py (last modified 2017/11/08, https://github.com/ENCODE-DCC/chip-seq-pipeline2/blob/master/src/trimfastq.py). And then it is separately mapped as SE. Reads are filtered but duplicates are not removed. Then 15 million reads are randomly sampled and used for cross-correlation analysis.

Program(s)

  • phantompeakqualtools (v1.2.1)

Input(s)

  • Read1 FASTQ for PE ${FASTQ_R1} or FASTQ for SE

Output(s)

  • outFile containing NSC/RSC results in tab-delimited file of 11 columns (same file can be appended to from multiple runs) ${CC_SCORES_FILE}
  • cross-correlation plot ${CC_PLOT_FILE

Commands

#### for both PE and SE samples

# Trim R1 fastq to 50bp

python trimfastq.py $FASTQ_R1 50 | gzip -nc >  $TRIMMED_FASTQ_R1

# Align $TRIMMED_FASTQ_R1 (not paired) with bowtie2 (step 1a SE) and use it for filtering step (1b) and then get $FILT_BAM_FILE (not the deduped $FINAL_BAM_FILE), which is filtered but not deduped.

# =================================

# make tagAlign for filtered (but not deduped) BAM

# and subsample it for cross-correlation analysis

# ================================

bedtools bamtobed -i ${FILT_BAM_FILE} | awk 'BEGIN{OFS="\t"}{$4="N";$5="1000";print $0}' | gzip -nc > ${TA_FILE}

NREADS=15000000

SUBSAMPLED_TA_FILE="${OFPREFIX}.filt.sample.$((NREADS / 1000000)).SE.tagAlign.gz"

zcat ${TA_FILE} | grep -v “chrM” | shuf -n ${NREADS} --random-source=<(openssl enc -aes-256-ctr -pass pass:$(zcat -f ${TA_FILE} | wc -c) -nosalt </dev/zero 2>/dev/null) | gzip -nc > ${SUBSAMPLED_TA_FILE}

# Estimate read length from first 100 reads.

zcat ${TA_FILE} > ${TA_FILE}.tmp

READ_LEN=$(head -n 100 ${TA_FILE}.tmp | awk 'function abs(v) {{return v < 0 ? -v : v}} BEGIN{{sum=0}} {{sum+=abs($3-$2)}} END{{print int(sum/NR)}}')

# Determine exclusion range for fragment length estimation.

# Use a fixed lowerbound at -500.

# Upperbound EXCLUSION_RANGE_MAX is

#   TF ChIP-seq:  max(read_len + 10, 50)

#   Histone ChIP-seq:  max(read_len + 10, 100)

# lowerbound is fixed at 500 for both

EXCLUSION_RANGE_MIN=-500

rm -f ${TA_FILE}.tmp

### cross-correlation analysis

CC_SCORES_FILE="${SUBSAMPLED_TA_FILE}.cc.qc"

CC_PLOT_FILE="${SUBSAMPLED_TA_FILE}.cc.plot.pdf"

# CC_SCORE FILE format

# Filename <tab> numReads <tab> estFragLen <tab> corr_estFragLen <tab> PhantomPeak <tab> corr_phantomPeak <tab> argmin_corr <tab> min_corr <tab> phantomPeakCoef <tab> relPhantomPeakCoef <tab> QualityTag

Rscript $(which run_spp.R) -c=${SUBSAMPLED_TA_FILE} -p=${NTHREADS} -filtchr=chrM -savp=${CC_PLOT_FILE} -out=${CC_SCORES_FILE} -x=${EXCLUSION_RANGE_MIN}:${EXCLUSION_RANGE_MAX}

sed -r 's/,[^\t]+//g' ${CC_SCORES_FILE} > temp

mv temp ${CC_SCORES_FILE}

QC to report

format:Filename<tab>numReads<tab>estFragLen<tab>corr_estFragLen<tab>PhantomPeak<tab>corr_phantomPeak<tab>argmin_corr<tab>min_corr<tab>phantomPeakCoef<tab>relPhantomPeakCoef<tab>QualityTag

  • Normalized strand cross-correlation coefficient (NSC) = col9 in outFile
  • Relative strand cross-correlation coefficient (RSC) = col10 in outFile
  • Estimated fragment length = col3 in outFile, take the top value
  • Important columns highlighted, but all/whole file can be stored for display

Status

Frozen

2c. Generate self-pseudoreplicates for each replicate (SE datasets)

Program(s)

  • UNIX shuf
  • UNIX split
  • gawk

Input(s)

  • TagAlign file ${FINAL_TA_FILE}

Output(s)

  • 2 pseudoreplicate virtual SE tagAlign files ${PR1_TA_FILE}  ${PR2_TA_FILE}

Commands

# ========================

# Create pseudoReplicates

# =======================

PR_PREFIX="${OFPREFIX}.filt.nodup"

PR1_TA_FILE="${PR_PREFIX}.SE.pr1.tagAlign.gz"

PR2_TA_FILE="${PR_PREFIX}.SE.pr2.tagAlign.gz"

# Get total number of read pairs

nlines=$( zcat ${FINAL_TA_FILE} | wc -l )

nlines=$(( (nlines + 1) / 2 ))

# Shuffle and split BED file into 2 equal parts

zcat ${FINAL_TA_FILE} | shuf --random-source=<(openssl enc -aes-256-ctr -pass pass:$(zcat -f ${FINAL_TA_FILE} | wc -c) -nosalt </dev/zero 2>/dev/null)  | split -d -l ${nlines} - ${PR_PREFIX} # Will produce ${PR_PREFIX}00 and ${PR_PREFIX}01

# Convert reads into standard tagAlign file

gzip -nc “${PR_PREFIX}00" > ${PR1_TA_FILE}

rm "${PR_PREFIX}00"

gzip -nc “${PR_PREFIX}01" > ${PR2_TA_FILE}

rm "${PR_PREFIX}01"

QC to report

None

Status

Frozen

2c. Generate self-pseudoreplicates for each replicate (PE datasets)

Program(s)

  • UNIX shuf
  • UNIX split
  • gawk

Input(s)

  • TAGALIGN file ${FINAL_TA_FILE}

Output(s)

  • 2 pseudoreplicate virtual SE tagAlign files ${PR1_TA_FILE}  ${PR2_TA_FILE}

Commands

# ========================

# Create pseudoReplicates

# =======================

PR_PREFIX="${OFPREFIX}.filt.nodup"

PR1_TA_FILE="${PR_PREFIX}.PE2SE.pr1.tagAlign.gz"

PR2_TA_FILE="${PR_PREFIX}.PE2SE.pr2.tagAlign.gz"

joined=”temp.bedpe”

# Make temporary fake BEDPE file from FINAL_TA_FILE

zcat ${FINAL_TA_FILE} | sed 'N;s/\n/\t/' > $joined

# Get total number of read pairs

nlines=$( zcat ${joined} | wc -l )

nlines=$(( (nlines + 1) / 2 ))

# Shuffle and split BEDPE file into 2 equal parts

zcat ${joined} | shuf --random-source=<(openssl enc -aes-256-ctr -pass pass:$(zcat -f ${FINAL_TA_FILE} | wc -c) -nosalt </dev/zero 2>/dev/null)  | split -d -l ${nlines} - ${PR_PREFIX} # Will produce ${PR_PREFIX}00 and ${PR_PREFIX}01

# Convert fake BEDPE into standard tagAlign file

awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\t%s\t%s\t%s\n%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12} "${PR_PREFIX}00" | gzip -nc > ${PR1_TA_FILE}

rm "${PR_PREFIX}00"

awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\t%s\t%s\t%s\n%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}'"${PR_PREFIX}01" | gzip -nc > ${PR2_TA_FILE}

rm "${PR_PREFIX}01"

rm -f ${joined}

QC to report

None

Status

Frozen

2d. Generate pooled dataset and pooled-pseudoreplicates

Program(s)

  • gzip

Input(s)

  • Final tagalign files for all replicates ${REP1_TA_FILE} ${REP2_TA_FILE} obtained from ${FINAL_TA_FILE} of the step 2a.
  • Self-consistency pseudoreplicates for all replicates REP*_PR1_TA_FILE and REP*_PR2_TA_FILE obtained from ${PPR1_TA_FILE}  ${PPR2_TA_FILE} of step 2c.

Output(s)

  • Pooled tagAlign file ${POOLED_TA_FILE}
  • 2 pooled-pseudoreplicate tagAlign files

Commands

# ========================

# Create pooled datasets

# =======================

REP1_TA_FILE=”${DATASET_PREFIX}.Rep1.tagAlign.gz”

REP2_TA_FILE=”${DATASET_PREFIX}.Rep2.tagAlign.gz”

POOLED_TA_FILE=”${DATASET_PREFIX}.Rep0.tagAlign.gz”

zcat ${REP1_TA_FILE} ${REP2_TA_FILE} | gzip -nc > ${POOLED_TA_FILE}

# ========================

# Create pooled pseudoreplicates

# =======================

REP1_PR1_TA_FILE=”${DATASET_PREFIX}.Rep1.pr1.tagAlign.gz”

REP1_PR2_TA_FILE=”${DATASET_PREFIX}.Rep1.pr2.tagAlign.gz”

REP2_PR1_TA_FILE=”${DATASET_PREFIX}.Rep2.pr1.tagAlign.gz”

REP2_PR2_TA_FILE=”${DATASET_PREFIX}.Rep2.pr2.tagAlign.gz”

PPR1_TA_FILE=”${DATASET_PREFIX}.Rep0.pr1.tagAlign.gz”

PPR2_TA_FILE=”${DATASET_PREFIX}.Rep0.pr2.tagAlign.gz”

zcat ${REP1_PR1_TA_FILE} ${REP2_PR1_TA_FILE} | gzip -nc > ${PPR1_TA_FILE}

zcat ${REP1_PR2_TA_FILE} ${REP2_PR2_TA_FILE} | gzip -nc > ${PPR2_TA_FILE}

QC to report

None

Status

Frozen

2f. Calculate Jensen-Shannon distance (JSD)

Program(s)

  • deeptools (v3.3.0) plotFingerprint

Input(s)

  • Filtered/deduped BAM ${NODUP_BAM_REP1}, ${NODUP_BAM_REP2}
  • (TF only) Filtered/deduped control BAM ${CTL_NODUP_BAM} (only one BAM file is allowed, pick 1st BAM if you have multiple controls)
  • Blacklist BED file ${BLACKLIST}
  • mapping quality threshold = 30 (${MAPQ_THRESH})

Output(s)

  • JSD Plot ${JSD_PLOT}
  • JSD log ${JSD_LOG}: column description: https://github.com/deeptools/deepTools/blob/master/deeptools/plotFingerprint.py#L454
  • TF: 11 col including JS distance with a control
  • auc,  syn_auc, x_intercept, syn_x_intercept, elbow_pt, syn_elbow_pt, jsd, syn_jsd, pct_genome_enrich, diff_enrich, ch_div
  • histone: 7 col including synthetic JS distance without a control
  • auc,  syn_auc, x_intercept, syn_x_intercept, elbow_pt, syn_elbow_pt, syn_jsd

Commands

NTH=4 # number of threads

# BAMs are blacklist-filtered first for each replicate and control

NODUP_BFILT_BAM_REP1=${NODUP_BAM_REP1}.bfilt.bam

NODUP_BFILT_BAM_REP2=${NODUP_BAM_REP2}.bfilt.bam

CTL_NODUP_BFILT_BAM=${CTL_NODUP_BFILT_BAM}.bfilt.bam

bedtools intersect -nonamecheck -v -abam ${NODUP_BAM_REP1} -b ${BLACKLIST} > ${NODUP_BFILT_BAM_REP1}'

bedtools intersect -nonamecheck -v -abam ${NODUP_BAM_REP2} -b ${BLACKLIST} > ${NODUP_BFILT_BAM_REP2}'

bedtools intersect -nonamecheck -v -abam ${CTL_NODUP_BFILT_BAM} -b ${BLACKLIST} > ${CTL_NODUP_BFILT_BAM}'

# For TF chip-seq,

C_ALL=en_US.UTF-8 LANG=en_US.UTF-8 plotFingerprint -b ${NODUP_BFILT_BAM_REP1}

${NODUP_BFILT_BAM_REP2} --JSDsample ${CTL_NODUP_BFILT_BAM} --labels rep1 rep2 ctl1 --outQualityMetrics ${JSD_LOG} --minMappingQuality ${MAPQ_THRESH} -T "Fingerprints of different samples" --numberOfProcessors ${NTH} --plotFile ${JSD_PLOT}

# For histone chip-seq, (--JSDsample is absent)

C_ALL=en_US.UTF-8 LANG=en_US.UTF-8 plotFingerprint -b ${NODUP_BFILT_BAM_REP1}

${NODUP_BFILT_BAM_REP2} --labels rep1 rep2 --outQualityMetrics ${JSD_LOG} --minMappingQuality ${MAPQ_THRESH} -T "Fingerprints of different samples" --numberOfProcessors ${NTH} --plotFile ${JSD_PLOT}

QC to report

None

Status

Frozen

2g. Calculate GC bias

Program(s)

  • CollectGcBiasMetrics (Picard - version 1.126)
  • python packages: pandas, matplotlib

Input(s)

  • Filtered/deduped BAM for each true replicate ${NODUP_BAM_REPX}
  • Reference genome fasta ${REF_FA}

Output(s)

  • GC bias Plot ${GC_BIAS_PLOT}
  • GC bias log ${GC_BIAS_LOG}

Commands

# we don’t use plot directly generated from picard

# we process picard’s text output and make a plot

java -Xmx6G -XX:ParallelGCThreads=1 -jar \

picard.jar \

CollectGcBiasMetrics R=${REF_FA} I=${NODUP_BAM_REPX} O=${GC_BIAS_LOG} \

USE_JDK_DEFLATER=TRUE USE_JDK_INFLATER=TRUE \

VERBOSITY=ERROR QUIET=TRUE \

ASSUME_SORTED=FALSE \

CHART=${GC_BIAS_PLOT} S=summary.txt

# use ${GC_BIAS_LOG} into the following pyhton script

# data_file: ${GC_BIAS_LOG}

# prefix: any good prefix for output file name

def plot_gc(data_file, prefix):

    '''

    Replot the Picard output as png file to put into the html

    '''

    # Load data

    data = pd.read_table(data_file, comment="#")

    # Plot the data

    fig = plt.figure()

    ax = fig.add_subplot(111)

    plt.xlim((0, 100))

    lin1 = ax.plot(data['GC'], data['NORMALIZED_COVERAGE'],

                   label='Normalized coverage', color='r')

    ax.set_ylabel('Normalized coverage')

    ax2 = ax.twinx()

    lin2 = ax2.plot(data['GC'], data['MEAN_BASE_QUALITY'],

                    label='Mean base quality at GC%', color='b')

    ax2.set_ylabel('Mean base quality at GC%')

    ax3 = ax.twinx()

    lin3 = ax3.plot(data['GC'], data['WINDOWS']/np.sum(data['WINDOWS']),

                    label='Windows at GC%', color='g')

    ax3.get_yaxis().set_visible(False)

    lns = lin1 + lin2 + lin3

    labs = [l.get_label() for l in lns]

    ax.legend(lns, labs, loc='best')

    # plot_img = BytesIO()

    # fig.savefig(plot_img, format='png')

    prefix = data_file.rstrip('.gc.txt')

    plot_png = prefix + '.gc_plot.png'

    fig.savefig(plot_png, format='png')

QC to report

None

Status

Frozen

3. Call peaks on replicates, self-pseudoreplicates, pooled data and pooled-pseudoreplicates

Call peaks on all replicates, pooled data, self-pseudoreplicates of each replicate and the pooled-pseudoreplicates using 3 peak callers SPP, GEM and PeakSeq

Pooling controls: If control datasets (input DNA or Igg) have replicates as far as possible match ChIP replicates to appropriate control replicates. However, under some conditions listed below, its best to pool the control replicates.

  • If the no. of reads per replicate is < the no. of reads per ChIP replicate then pool the control replicate reads into a single control
  • If the no. of reads between control replicates differ by > a factor of 1.2 (“chip.ctl_depth_ratio”), then pool replicates (this is to avoid artificial differences in peak scores due to sequencing depth differences in different control replicates)

Subsamping of chosen control:

  • If each replicate’s chosen control’s read depth, whether is a pooled control or not, is deeper than the hard limit 200M (“chip.ctl_depth_limit”) or 5 (“chip.exp_ctl_depth_ratio_limit”) times each replicate’s read depth, then control is subsampled down to maximum of these two. Pseudo replicates 1 and 2 for each will use the same subsampled control.
  • The above subsampling logic for individual replicate is also applied to pooled replicate. Pooled-pseudo replicate uses the same control as pooled replicate’s.

Pooled-replicates or pooled-pseudoreplicates should always be compared to pooled controls

Self-pseudoreplicates for a particular ReplicateN should be compared to the same control that was used for ReplicateN.

3a. Peak calling - SPP

  • Use the estimated fragment length from column 3 from ${CC_SCORES_FILE}

Program(s)

SPP (v1.14) in phantompeakqualtool https://github.com/kundajelab/phantompeakqualtools

Input(s)

RepN ChIP ${REP1_TA_FILE} ${REP2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled replicate  ${POOLED_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate1 ${REP*_PR1_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate2  ${REP*_PR2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 1 ${PPR1_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 2 ${PPR2_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

a blacklist ${BLACKLIST} 

hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz

hg38: http://mitra.stanford.edu/kundaje/genome_data/hg38/hg38.blacklist.bed.gz


exclusion range min/max: exclude this interval from fragment length estimation algorithm

Output(s)

Narrowpeak file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.regionPeak.gz

Cross-correlation plot (for diagnosis only) ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.pdf

Cross-correlation score output (for diagnosis only) ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.ccscores

Commands

EXCLUSION_RANGE_MIN=-500 # fixed at -500

# EXCLUSION_RANGE_MAX can be fixed or estimated from read length by looking at first 100 reads

zcat ${CHIP_TA_PREFIX}.tagAlign.gz > ${CHIP_TA_PREFIX}.tagAlign.gz.tmp

EXCLUSION_RANGE_MAX=$(head -n 100 ${CHIP_TA_PREFIX}.tagAlign.gz.tmp | awk 'function abs(v) {{return v < 0 ? -v : v}} BEGIN{{sum=0}} {{sum+=abs($3-$2)}} END{{print int(sum/NR)}}')

rm -f ${CHIP_TA_PREFIX}.tagAlign.gz.tmp

# run SPP

Rscript run_spp.R -c=${CHIP_TA_PREFIX}.tagAlign.gz  -i=${CONTROL_TA_PREFIX}.tagAlign.gz -npeak=300000 -odir=${PEAK_OUTPUT_DIR} -speak=${FRAGLEN} -savr -savp -rf -x=${EXCLUSION_RANGE_MIN}:${EXCLUSION_RANGE_MAX} -out=${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.ccscores

| grep -P 'chr[\dXY]+[ \t]' | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}

# filter out peaks in blacklisted region

rpeakfile_raw=${CHIP_TA_PREFIX}.tagAlign_VS_${CONTROL_TA_PREFIX}.tagAlign.regionPeak.gz

rpeakfile=${CHIP_TA_PREFIX}.tagAlign_x_${CONTROL_TA_PREFIX}.tagAlign.regionPeak.gz

rpeakfile_filt=${CHIP_TA_PREFIX}.tagAlign_x_${CONTROL_TA_PREFIX}.tagAlign.filt.regionPeak.gz

zcat $rpeakfile_raw | awk 'BEGIN{OFS="\t"}{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}' | gzip -f -nc > $rpeakfile

bedtools intersect -v -a <(zcat -f $rpeakfile) -b <(zcat -f $blacklist) | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' | grep -P 'chr[\dXY]+[ \t]' | gzip -nc > $filt_rpeakfile;

QC to report

Number of peaks called

Status

Frozen


3b. Peak calling - GEM

Program(s)

GEM v2.4.1  http://cgs.csail.mit.edu/gem/ 

Input(s)

RepN ChIP ${REP1_TA_FILE} ${REP2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled replicate  ${POOLED_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate1 ${REP*_PR1_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate2  ${REP*_PR2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 1 ${PPR1_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 2 ${PPR2_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Genome sequence

In order to run the motif discovery of GEM algorithm, a genome sequence is needed. The path to directory containing the genome sequence files (by chromosome, *.fa or *.fasta files, with the prefix "chr") can be specified using option --genome (for example, --genome your_path/mm8/). Note that the chromosome name should match those in the "--g" genome_info file, as well as those in your read alignment file.

For hg19 these can be obtained from

Female: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/referenceSequences/femaleByChrom/ 

Male: Add chromosome Y from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/referenceSequences/maleByChrom/ 

Read distribution file

Obtain from http://cgs.csail.mit.edu/gem/ 

Output(s)

  1. Narrowpeak file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.narrowPeak.gz
  2. K-mer set motifs (KSM.txt)
  3. PFM file of PWM motifs (PFM.txt)
  4. HTML file summarizing the GEM event and motif results

See http://cgs.csail.mit.edu/gem/ for more details

Commands

# =============================

# See http://wiki.encodedcc.org/index.php/GPS/GEM of additional information

# =============================

gunzip ${CHIP_TA_PREFIX}.tagAlign.gz

java -Xmx15G -jar gem.jar --g hg19.info --d Read_Distribution_default.txt --s 2400000000 --expt ${CHIP_TA_PREFIX}.tagAlign  --ctrl ${CONTROL_TA_PREFIX}.tagAlign.gz --f BED --out  ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX} --genome ${SEQ_DIR} --k_min 6 --k_max 13 --outNP --q 0

rm ${CHIP_TA_PREFIX}.tagAlign

# =============================

# Sort peaks by signal value and truncate peaks to top 300K

# =============================

sort -k7nr,7nr ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}/${CHIP_TA_PREFIX}_GEM_events.narrowPeak | head -n 300000 | gzip -nc > temp

mv temp ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.narrowPeak.gz

QC to report

Number of peaks called

Status

Frozen


3c. Peak calling - PeakSeq

  • Use the estimated fragment length from column 3 from ${CC_SCORES_FILE}

Program(s)

PeakSeq v 1.25 http://wiki.encodedcc.org/index.php/PeakSeq 

Input(s)

RepN ChIP ${REP1_TA_FILE} ${REP2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled replicate  ${POOLED_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate1 ${REP*_PR1_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate2  ${REP*_PR2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 1 ${PPR1_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 2 ${PPR2_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Mappability file ${PEAKSEQ_MAP_FILE}.txt (Obtain from http://archive.gersteinlab.org/proj/PeakSeq/Mappability_Map/H.sapiens/Mapability_HG.txt  )

Output(s)

Narrowpeak file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.regionPeak.gz

Commands

# =============================

# The chip and input reads (chip.bam and input.bam) should be preprocessed before running:

# =============================

mkdir ${TMP_CHIP_DIR}

mkdir ${TMP_CONTROL_DIR}

zcat ${CHIP_TA_PREFIX}.tagAlign.gz | PeakSeq -preprocess tagAlign stdin ${TMP_CHIP_DIR}

zcat ${CONTROL_TA_PREFIX}.tagAlign.gz | PeakSeq -preprocess tagAlign stdin ${TMP_CONTROL_DIR}

# =============================

# Then it is necessary to setup the configuration file (config.dat). An example configuration file is included with the PeakSeq download. An example:

# =============================

Experiment_id $(basename ${CHIP_TA_PREFIX})

Enrichment_mapped_fragment_length ${FRAGLEN}

target_FDR 0.05

N_Simulations 10

Minimum_interpeak_distance ${FRAGLEN}

Mappability_map_file ${PEAKSEQ_MAP_FILE}.txt

ChIP_Seq_reads_data_dirs ${TMP_CHIP_DIR}

Input_reads_data_dirs

max_Qvalue 0.1

Background_model Simulated

# =============================

#Finally, the peaks are called using the configuration file:

# =============================

PeakSeq -peak_select config.dat

QC to report

Number of peaks called

Status

Frozen


4. Run IDR on all pairs of replicates, self-pseudoreplicates and pooled pseudoreplicates

Use IDR to compare all pairs of matched replicates

(1) True replicates narrowPeak files: ${REP1_PEAK_FILE} vs. ${REP2_PEAK_FILE} IDR results transferred to Pooled-replicates narrowPeak file  ${POOLED_PEAK_FILE}

(2) Pooled-pseudoreplicates: ${PPR1_PEAK_FILE} vs. ${PPR2_PEAK_FILE} IDR results transferred to Pooled-replicates narrowPeak file ${POOLED_PEAK_FILE}

(3) Rep1 self-pseudoreplicates: ${REP1_PR1_PEAK_FILE} vs. ${REP1_PR2_PEAK_FILE} IDR results transferred to Rep1 narrowPeak file ${REP1_PEAK_FILE}

(4) Rep2 self-pseudoreplicates: ${REP2_PR1_PEAK_FILE} vs. ${REP2_PR2_PEAK_FILE} IDR results transferred to Rep2 narrowPeak file ${REP2_PEAK_FILE}

IDR Threshold: Use IDR threshold of 5% for all pairwise analyses

4a. For True Replicates

Below we show the use for true replicates. The same steps can be applied for all other pairs.

Program(s)

IDR (https://github.com/kundajelab/idr) 2.0.4 / Installation instructions (https://github.com/kundajelab/idr#installation). NOTE: Works only with Python3

Input(s)

  • a pair of narrowPeak files for replicates  ${REP1_PEAK_FILE}  ${REP2_PEAK_FILE}
  • a pooled-replicate narrowPeak file ${POOLED_PEAK_FILE}
  • a blacklist ${BLACKLIST} 
  • For SPP use signal.value for ranking with a parameter ‘--rank signal.value’
  • For GEM use signal.value for ranking with a parameter ‘--rank signal.value’
  • For PeakSeq use q.value for ranking with a parameter ‘--rank q.value

Output(s)

  • The output from EM fitting: suffixed by overlapped-peaks.txt.png
  • The full set of peaks that overlap between the replicates with local and global IDR: suffixed by overlapped-peaks.txt ${IDR_OUTPUT}
  • IDR output file ${IDR_OUTPUT}
  • # Columns 1-10 are same as pooled common peaks narrowPeak columns
  • # Col 11: -log10(local IDR value)
  • # Col 12: -log10(global IDR value)
  • # Col 15: ranking measure from Rep1
  • # Col 19: ranking measure from Rep2
  • Final IDR thresholded file ${REP1_VS_REP2}.IDR0.05.narrowPeak.gz
  • Final IDR thresholded file filtered using a blacklist ${REP1_VS_REP2}.IDR0.05.filt.narrowPeak.gz

Commands

IDR_THRESH=0.05

# =============================

# Perform IDR analysis.

# Generate a plot and IDR output with additional columns including IDR scores.

# =============================

idr --samples ${REP1_PEAK_FILE} ${REP2_PEAK_FILE} --peak-list ${POOLED_PEAK_FILE} --input-file-type narrowPeak --output-file ${IDR_OUTPUT} --rank signal.value --soft-idr-threshold ${IDR_THRESH} --plot --use-best-multisummit-IDR

# =============================

# Get peaks passing IDR threshold of 5%

# =============================

IDR_THRESH_TRANSFORMED=$(awk -v p=${IDR_THRESH} 'BEGIN{print -log(p)/log(10)}')

awk 'BEGIN{OFS="\t"} $12>='"${IDR_THRESH_TRANSFORMED}"' {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' ${IDR_OUTPUT} | sort | uniq | sort -k7n,7n | gzip -nc > ${REP1_VS_REP2}.IDR0.05.narrowPeak.gz

NPEAKS_IDR=$(zcat ${REP1_VS_REP2}.IDR0.05.narrowPeak.gz | wc -l)

# =============================

# Filter using black list

# =============================

bedtools intersect -v -a ${REP1_VS_REP2}.IDR0.05.narrowPeak.gz -b ${BLACKLIST} | grep -P 'chr[\dXY]+[ \t]' | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}

| gzip -nc > ${REP1_VS_REP2}.IDR0.05.filt.narrowPeak.gz

Parameters

  • --samples : [REP1_PEAK_FILE] and [REP2_PEAK_FILE] are the peak calls for the pair of replicates in narrowPeak format. They must be compressed files.
    e.g. /peaks/reps/chipSampleRep1_VS_controlSampleRep0.narrowPeak.gz AND /peaks/reps/chipSampleRep2_VS_controlSampleRep0.narrowPeak.gz
  • --input-file-type : the peak file format (narrowPeak or broadPeak). Set to narrowPeak if it is narrowPeak/regionPeak or broadPeak if it is broadPeak. BroadPeak files do not contain Column 10.
  • --rank : the ranking measure to use. It can take only one of the following values signal.value , p.value or q.value
  • --soft-idr-threshold : IDR threshold, Set to ${IDR_THRESH}

QC to report

  • Number of peaks passing IDR thresholds of 5% ${NPEAKS_IDR}
  • For each pairwise analysis, we have a *overlapped-peaks.txt file. The 12th column of the overlapped-peaks.txt file has the global IDR score for each pair of overlapping peaks.
  • Also store ${POOLED_COMMON_PEAKS_IDR}
  • To get the number of peaks that pass an IDR threshold of T (e.g. 0.01) you simply find the number of lines in ${POOLED_COMMON_PEAKS_IDR} that have Column 14 <= T

Status

Frozen

  • If you have more than 2 true replicates select the longest peak list from all pairs that passes the IDR threshold.
  • Nt = Best no. of peaks passing IDR threshold by comparing true replicates

4b. IDR analysis - self-pseudoreplicates

  • Perform as with real replicates, but comparing pseudoreplicate 1 vs pseudoreplicate 2 made from each of the real biological replicate peaks
  • Rep1 self-pseudoreplicates: ${REP1_PR1_PEAK_FILE} vs. ${REP1_PR2_PEAK_FILE} and use ${REP1_PEAK_FILE} as pooled file
  • Rep2 self-pseudoreplicates: ${REP2_PR1_PEAK_FILE} vs. ${REP2_PR2_PEAK_FILE} and use ${REP2_PEAK_FILE} as pooled file
  • This gives the self-consistent IDR peaks
  • N1 and N2 = No. of peaks passing IDR threshold by comparing self-pseudoReplicates for Rep1 and Rep2 respectively

4c. IDR analysis - pooled pseudoreplicates

  • Perform as with real replicates, but comparing pooled-pseudoreplicate 1 vs pooled-pseudoreplicate 2 made from the pooled biological replicate peaks
  • ${PPR1_PEAK_FILE} vs. ${PPR2_PEAK_FILE} and use ${POOLED_PEAK_FILE} as pooled file
  • Np = No. of peaks passing IDR threshold by comparing pooled pseudo-replicates

4d. Select final peak calls - conservative set

  • If you have more than 2 true replicates select the longest peak list from all pairs that passes the 5% IDR threshold. This is the conservative peak set.
  • Nt = Best no. of peaks passing IDR threshold by comparing true replicates
  • Filter using black list:

hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz

hg38:

http://mitra.stanford.edu/kundaje/genome_data/hg38/hg38.blacklist.bed.gz

4e. Select final peak calls - optimal set

4f. Compute IDR QC scores

  • Rescue Ratio = max(Np,Nt) / min(Np,Nt)
    Nt and Np should be within a factor of 2 of each other
  • Self-consistency Ratio = max(N1,N2) / min(N1,N2)
    N1 and N2 should be within a factor of 2 of each other
  • If Rescue Ratio AND self-consistency Ratio are both > 2, Flag the file for reproducibility FAIL (-1)
  • If Rescue Ratio OR self-consistency Ratio are > 2, Flag the file for reproducibility Borderline (0)

Rescue Ratio v2 <TODO>

Self-Consistency Ratio v2 <TODO>

4f. Compute Fraction of Reads in Peaks (FRiP) : function frip( ta_file, cc_qc_log, idr_peak_file )

You compute the fraction of reads from each replicate tagAlign and pooled tagAlign that fall within the Np and Nt peak sets.

Program(s)

bedtools 2.26.0

Input(s)

Final tagAlign file for Rep1 ${REP1_TA_FILE} vs. IDR peak from pseudo replicates of Rep1 ${REP1_PR_IDR_PEAK_FILE} with estimated fragment length for replicate 1

Final tagAlign file for Rep2 ${REP2_TA_FILE} vs. IDR peak from pseudo replicates of Rep2 ${REP2_PR_IDR_PEAK_FILE} with estimated fragment length for replicate 2

Pooled tagAlign file ${POOLED_TA_FILE} vs. IDR peak from true replicates (Nt) ${CONS_IDR_PEAK_FILE} with mean estimated fragment length of rep1 and rep2

Pooled tagAlign file ${POOLED_TA_FILE} vs. IDR peak from pooled pseudo replicates (Np) ${OPTIMAL_IDR_PEAK_FILE} with mean estimated fragment length of rep1 and rep2

Output(s)

FRiP text file ${FRiP}

Commands

# get estimated fragment length from cross-corr. analysis log

FRAGLEN=$(cat ${CC_QC_LOG} | awk ‘{print $3}’)

HALF_FRAGLEN=$(( (FRAGLEN+1)/2 )) # rounding to integer

CHRSIZEFILE=<path_of_file_containing_chromosome_sizes>

# This file is a tab delimited file with 2 columns Col1 (chromosome name), Col2 (chromosome size in bp).

val1=$(bedtools slop -i ${TA_FILE} -g $CHRSIZEFILE -s -l -$HALF_FRAGLEN -r $HALF_FRAGLEN | \

    awk '{if ($2>=0 && $3>=0 && $2<=$3) print $0}' | \

    bedtools intersect -a stdin -b ${IDR_PEAK_FILE} -wa -u | wc -l)

val2=$(zcat $TA_FILE | wc -l)

awk 'BEGIN {print '${val1}'/'${val2}'}' > ${FRIP}

QC to report

Fraction of reads in peaks

Status

Frozen

5. Create signal tracks

Peak calling  and signal tracks using MACSv2 for TFs and histone marks

  • Use the estimated fragment length from column 3 from ${CC_SCORES_FILE}

Program(s)

MACSv2  https://github.com/taoliu/MACS/ Installation Instructions (https://github.com/taoliu/MACS/blob/master/INSTALL.rst ). NOTE: Works only with Python 2.7 (>=2.7.5). Does not work with Python 3.

Also requires slopBed, bedClip and bedGraphToBigWig from KentTools

Input(s)

RepN ChIP ${REP1_TA_FILE} ${REP2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled replicate  ${POOLED_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate1 ${REP*_PR1_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

RepN pseudoreplicate2  ${REP*_PR2_TA_FILE} vs. appropriate control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 1 ${PPR1_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Pooled-pseudoreplicate 2 ${PPR2_TA_FILE} vs. pooled control tagAlign ${CONTROL_TA_PREFIX}.tagAlign.gz

Output(s)

Narrowpeak file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.narrowPeak.gz

Gappedpeak file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.gappedPeak.gz

Fold-enrichment bigWig file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.fc.signal.bw

-log10(pvalue) bigWig file ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.pval.signal.bw

Commands

GENOMESIZE=’hs’ # for human

GENOMESIZE=’mm’ #for mouse

NPEAKS=500000 # capping number of peaks called from MACS2

===========================================

# Generate narrow peaks and preliminary signal tracks

============================================

macs2 callpeak -t ${REP1_TA_FILE}.tagAlign.gz -c ${CONTROL_TA_PREFIX}.tagAlign.gz -f BED -n ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX} -g ${GENOMESIZE} -p 1e-2 --nomodel --shift 0 --extsize ${FRAGLEN} --keep-dup all -B --SPMR

# Sort by Col8 in descending order and replace long peak names in Column 4 with Peak_<peakRank>

sort -k 8gr,8gr ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.narrowPeak | awk ‘BEGIN{OFS=”\t”}{$4=”Peak_”NR ; print $0}’ | head -n ${NPEAKS}  | gzip -nc > ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.narrowPeak.gz

# remove additional files

rm -f ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.xls ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.narrowPeak ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_summits.bed

===========================================

# Generate Broad and Gapped Peaks

============================================

macs2 callpeak -t ${REP1_TA_FILE}.tagAlign.gz -c ${CONTROL_TA_PREFIX}.tagAlign.gz -f BED -n ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX} -g ${GENOMESIZE} -p 1e-2 --broad --nomodel --shift 0 --extsize ${FRAGLEN} --keep-dup all

# Sort by Col8 (for broadPeak) or Col 14(for gappedPeak)  in descending order and replace long peak names in Column 4 with Peak_<peakRank>

sort -k 8gr,8gr ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.broadPeak | awk ‘BEGIN{OFS=”\t”}{$4=”Peak_”NR ; print $0}’ | gzip -nc > ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.broadPeak.gz

sort -k 14gr,14gr ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.gappedPeak | awk ‘BEGIN{OFS=”\t”}{$4=”Peak_”NR ; print $0}’ | gzip -nc > ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.gappedPeak.gz

# remove additional files

rm -f ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.xls ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.broadPeak ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_peaks.gappedPeak ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_summits.bed

===========================================

# For Fold enrichment signal tracks

============================================

CHRSIZEFILE=<path_of_file_containing_chromosome_sizes>

# This file is a tab delimited file with 2 columns Col1 (chromosome name), Col2 (chromosome size in bp).

macs2 bdgcmp -t ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_treat_pileup.bdg -c ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_control_lambda.bdg --outdir ${PEAK_OUTPUT_DIR} -o ${CHIP_TA_PREFIX}_FE.bdg -m FE

# Remove coordinates outside chromosome sizes (stupid MACS2 bug)

slopBed -i ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_FE.bdg -g ${CHRSIZEFILE} -b 0 |   awk '{if ($3 != -1) print $0}' |  bedClip stdin ${CHRSIZEFILE} ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.fc.signal.bedgraph

rm -f ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_FE.bdg

# Convert bedgraph to bigwig

bedGraphToBigWig ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.fc.signal.bedgraph ${CHRSIZEFILE} ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.fc.signal.bw

rm -f ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.fc.signal.bedgraph

===========================================

# For -log10(p-value) signal tracks

============================================

# Compute sval = min(no. of reads in ChIP, no. of reads in control) / 1,000,000

chipReads=$(zcat ${REP1_TA_FILE}.tagAlign.gz | wc -l | awk '{printf "%f", $1/1000000}');

controlReads=$(zcat ${CONTROL_TA_FILE}.tagAlign.gz | wc -l | awk '{printf "%f", $1/1000000}');

sval=$(echo "${chipReads} ${controlReads}" | awk '$1>$2{printf "%f",$2} $1<=$2{printf "%f",$1}');

macs2 bdgcmp -t ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_treat_pileup.bdg -c ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_control_lambda.bdg --outdir ${PEAK_OUTPUT_DIR} -o ${CHIP_TA_PREFIX}_ppois.bdg -m ppois -S ${sval}

# Remove coordinates outside chromosome sizes (stupid MACS2 bug)

slopBed -i ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_ppois.bdg -g ${CHRSIZEFILE} -b 0 |   awk '{if ($3 != -1) print $0}' |  bedClip stdin ${CHRSIZEFILE} ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.pval.signal.bedgraph

rm -rf ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_ppois.bdg

# Convert bedgraph to bigwig

bedGraphToBigWig ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.pval.signal.bedgraph ${CHRSIZEFILE} ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.pval.signal.bw

rm -f ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}.pval.signal.bedgraph

rm -f ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX}_treat_pileup.bdg ${peakFile}_control_lambda.bdg

QC to report

Status

Beta

Count signal track generation

Program(s)

  • bedtools 2.26.0
  • bedGraphToBigWig

Input(s)

RepX ChIP ${REPX_TA_FILE}, or Pooled replicate  ${POOLED_TA_FILE}

Chromosome sizes file ${CHRSZ}

Output(s)

Positive bigWig file prefix.positive.bigwig

Negative bigWig file prefix.negative.bigwig

Commands

zcat -f ${TA_FILE} | sort -k1,1 -k2,2n | bedtools genomecov -5 -bg -strand + -g ${CHRSZ} -i stdin > TMP.POS.BED

bedGraphToBigWig TMP.POS.BED ${CHRSZ} ${TA_FILE_PREFIX}.positive.bigwig

zcat -f ${TA_FILE} | sort -k1,1 -k2,2n | bedtools genomecov -5 -bg -strand - -g ${CHRSZ} -i stdin > TMP.NEG.BED

bedGraphToBigWig TMP.NEG.BED ${CHRSZ} ${TA_FILE_PREFIX}.negative.bigwig

QC to report

Status

Beta


6. Peak calling for Histone Marks

Naive overlap thresholding for histone peak calls

NOTE: We haven’t yet finalized an IDR protocol for histone marks. For now this is a simple overlap version that works reasonably well. IDR protocol for histone marks is in development

But here we do a similar analysis as IDR described in Section 4. Repeat the same procedure for the following set of combination of (Rep1, Rep2 and Pooled) to do reproducibility QC.

(Rep1, Rep2, Pooled_rep)

(Rep1-PR1, Rep1-PR2, Rep1)

(Rep2-PR1, Rep2-PR2, Rep2)

(PPR1,PPR2,Pooled_rep)

(I've just split the piped commands on separate lines for clarity)

# ======================

# For narrowPeak files

# ======================

# Find pooled peaks that overlap Rep1 and Rep2 where overlap is defined as the fractional overlap wrt any one of the overlapping peak pairs  >= 0.5

intersectBed -wo -a Pooled.narrowPeak.gz -b Rep1.narrowPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$13-$12; if (($21/s1 >= 0.5) || ($21/s2 >= 0.5)) {print $0}}' | cut -f 1-10 | sort | uniq |

intersectBed -wo -a stdin -b Rep2.narrowPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$13-$12; if (($21/s1 >= 0.5) || ($21/s2 >= 0.5)) {print $0}}' | cut -f 1-10 | sort | uniq > PooledInRep1AndRep2.narrowPeak.gz

# filter through blacklist

zcat PooledInRep1AndRep2.narrowPeak.gz PooledInPsRep1AndPsRep2.narrowPeak.gz | sort | uniq | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \

| grep -P 'chr[\dXY]+[ \t]' > PooledInRep1AndRep2.filt.narrowPeak.gz

# ======================

For BroadPeak files (there is just a difference is the awk commands wrt the column numbers)

# ======================

# Find pooled peaks that overlap Rep1 and Rep2 where overlap is defined as the fractional overlap wrt any one of the overlapping peak pairs  >= 0.5

intersectBed -wo -a Pooled.broadPeak.gz -b Rep1.broadPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$12-$11; if (($19/s1 >= 0.5) || ($19/s2 >= 0.5)) {print $0}}' | cut -f 1-9 | sort | uniq |

intersectBed -wo -a stdin -b Rep2.broadPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$12-$11; if (($19/s1 >= 0.5) || ($19/s2 >= 0.5)) {print $0}}' | cut -f 1-9 | sort | uniq > PooledInRep1AndRep2.broadPeak.gz

# Find pooled peaks that overlap PooledPseudoRep1 and PooledPseudoRep2 where overlap is defined as the fractional overlap wrt any one of the overlapping peak pairs  >= 0.5

intersectBed -wo -a Pooled.broadPeak.gz -b PsRep1.broadPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$12-$11; if (($19/s1 >= 0.5) || ($19/s2 >= 0.5)) {print $0}}' | cut -f 1-9 | sort | uniq |

intersectBed -wo -a stdin -b PsRep2.broadPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$12-$11; if (($19/s1 >= 0.5) || ($19/s2 >= 0.5)) {print $0}}' | cut -f 1-9 | sort | uniq > PooledInPsRep1AndPsRep2.broadPeak.gz

# Combine peak lists

zcat PooledInRep1AndRep2.broadPeak.gz PooledInPsRep1AndPsRep2.broadPeak.gz | sort | uniq > finalPeakList.broadPeak.gz

# ======================

For gappedPeak files (there is just a difference is the awk commands wrt the column numbers)

# ======================

# Find pooled peaks that overlap Rep1 and Rep2 where overlap is defined as the fractional overlap wrt any one of the overlapping peak pairs  >= 0.5

intersectBed -wo -a Pooled.gappedPeak.gz -b Rep1.gappedPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$18-$17; if (($31/s1 >= 0.5) || ($31/s2 >= 0.5)) {print $0}}' | cut -f 1-15 | sort | uniq |

intersectBed -wo -a stdin -b Rep2.gappedPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$18-$17; if (($31/s1 >= 0.5) || ($31/s2 >= 0.5)) {print $0}}' | cut -f 1-15 | sort | uniq > PooledInRep1AndRep2.gappedPeak.gz

# Find pooled peaks that overlap PooledPseudoRep1 and PooledPseudoRep2 where overlap is defined as the fractional overlap wrt any one of the overlapping peak pairs  >= 0.5

intersectBed -wo -a Pooled.gappedPeak.gz -b PsRep1.gappedPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$18-$17; if (($31/s1 >= 0.5) || ($31/s2 >= 0.5)) {print $0}}' | cut -f 1-15 | sort | uniq |

intersectBed -wo -a stdin -b PsRep2.gappedPeak.gz |

awk 'BEGIN{FS="\t";OFS="\t"}{s1=$3-$2; s2=$18-$17; if (($31/s1 >= 0.5) || ($31/s2 >= 0.5)) {print $0}}' | cut -f 1-15 | sort | uniq > PooledInPsRep1AndPsRep2.gappedPeak.gz

# Combine peak lists

zcat PooledInRep1AndRep2.gappedPeak.gz PooledInPsRep1AndPsRep2.gappedPeak.gz | sort | uniq > finalPeakList.gappedPeak.gz