A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | SORT | ORGANISM | DATA TYPE | DATA LEVEL | PROCESSING STEP | DATA FORMATS | DATA LOCATION | NOTES | DATA/PROCESSING CONTACT | DATA CONTACT PI | DATE | ||||||||
2 | 1 | Worm (WS220) | Genome Sequence | 0 | Genome Sequence | FASTA | http://hgdownload.cse.ucsc.edu/goldenPath/ce10/chromosomes/ | Per chromosome FASTA file | |||||||||||
3 | 2 | Worm (WS220) | ChIP-seq | 1 | Raw Alignments | SAM | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/alignments/bam/ | Uniformly mapped data using BWA (only unique mapping reads) | Carlos Araya (claraya@gmail.com) ; Philip Cayting (pcayting@stanford.edu) | Mike Snyder | |||||||||
4 | Worm (WS220) | ChIP-seq | Filtered Alignments | TAGALIGN/BED | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/alignments/distinctTagAlign/reps/ | Unique mapping reads, duplicates filtered | Carlos Araya (claraya@gmail.com) ; Anshul Kundaje (anshul@kundaje.net) | ||||||||||||
5 | 3 | Worm (WS220) | ChIP-seq | 2 | MetaData and Data Quality | EXCEL/TAB tables | https://docs.google.com/spreadsheet/ccc?key=0Algk3BSZDYzgdDlYNU00d2p3azJyZWlrZ09OQXNXTGc#gid=0 | Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
6 | 19 | Worm (WS220) | Blacklist | 4 | Blacklists | BED | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/blacklist/ce10-blacklist.bed.gz | Empirical blacklist of regions with artifactual unstructured signal | Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||
7 | 4 | Worm (WS220) | ChIP-seq | 3 | IDR Peak Calls | NARROWPEAK | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/idr/ | The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.05 was used. chrM peaks were removed as these were unreliable in most cases. See https://sites.google.com/site/anshulkundaje/projects/idr for details | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
8 | 4 | Worm (WS220) | ChIP-seq | 3 | Blacklist filtered IDR peak Calls (Use these) | NARROWPEAK | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/uniformPk/ | IDR Peak calls are filtered against blacklists | Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||
9 | 5 | Worm (WS220) | ChIP-seq | 5 | Relaxed peak calls (unthresholded) | NARROWPEAK | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/unthresholdPk/ | A large set of unthresholded peak calls (up to ~100K peaks) from SPP. Useful for analyses of low-signal peaks. | Carlos Araya (claraya@gmail.com) ; Philip Cayting (pcayting@stanford.edu) | Mike Snyder | |||||||||
10 | 6 | Worm (WS220) | Mappability | 7 | Unique Mappability track (Read Lengths 20 to 54) | BINARY | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/mappability/globalmap_k20tok54.tgz | A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. The globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54). (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by <k-1>, i.e. if position 1 is UNIQUE on the + strand for read-length <k=3> then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
11 | 22 | Worm (WS220) | ChIP-seq | 6 | Signal tracks (Input normalized) | BIGWIG/BEDGRAPH | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/signal/foldChange/ | Signal tracks are generated for each dataset using MACSv2's signal processing module. Signal tracks represent ChIP signal compared to input control signal. - FoldEnrichment: These tracks represent the fold enrichment of ChIP over input. This type of track is useful for detecting and analyzing regions with moderate to low enrichment. - Does NOT correct for local mappability. - Does NOT differentiate between "missing data" at unmappable locations and true 0 signal. - The signal is not smoothed. Reads are extended to the predominant fragment length. - It is recommended to smooth the signal before using it unless you are averaging over multiple sites (e.g. aggregation plots). - Separate tracks are generated for individual replicates and pooled data. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
12 | Worm (WS220) | TIP | TF target prediction | TIP | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/TFtargets/ | See the README file in the directory. Chao Cheng's TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore the score -1 results and treat the score=0 results cautiously. | Chao Cheng (chengchao12@gmail.com) | ||||||||||||
13 | Worm (WS220) | HOT | HOT Regions | BED | http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/hotRegions/ | See README at http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/hotRegions/readme.txt | Carlos Araya (claraya@gmail.com) ; Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||||
14 | |||||||||||||||||||
15 | 7 | Fly (FB5.45) | Genome Sequence | 0 | Genome Sequence | FASTA | http://hgdownload.cse.ucsc.edu/goldenPath/dm3/chromosomes/ | Per chromosome FASTA file | |||||||||||
16 | 8 | Fly (FB5.45) | ChIP-seq | 1 | Raw data (Alignments) | SAM | http://www.broadinstitute.org/~anshul/projects/fly/mapped/sam/ | Uniformly mapped data using BWA (only unique mapping reads) | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
17 | Fly (FB5.45) | ChIP-seq | Filtered Alignments | TAGALIGN/BED | http://www.broadinstitute.org/~anshul/projects/fly/mapped/distinctTagAlign/qcmerge/ | Unique mapping reads, duplicates filtered | |||||||||||||
18 | 9 | Fly (FB5.45) | ChIP-seq | 2 | MetaData and Data Quality | EXCEL/TAB tables | https://docs.google.com/spreadsheet/ccc?key=0Algk3BSZDYzgdDU3cXVVMHdQeHRTUWtnYk1aSG13NEE&pli=1#gid=4 | Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
19 | 19 | Fly (FB5.45) | Blacklist | 4 | Blacklists | BED | http://www.broadinstitute.org/~anshul/projects/fly/blacklist/dm3-blacklist.bed.gz | Empirical blacklist of regions with artifactual unstructured signal | Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||
20 | 10 | Fly (FB5.45) | ChIP-seq | 3 | IDR Peak Calls | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/fly/peaks_macs/release/idrOptimal/pass/ | The MACSv2 peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.05 was used. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
21 | 10 | Fly (FB5.45) | ChIP-seq | 3 | Blacklist filtered IDR peak Calls | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/fly/peaks_macs/release/idrOptimalBlacklistFiltered/ | IDR Peak calls are filtered against blacklists | Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||
22 | 11 | Fly (FB5.45) | ChIP-seq | 5 | Relaxed peak calls (unthresholded) | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/fly/peaks_macs/release/combrep/regionPeak/ | A large set of unthresholded peak calls from MACSv2. Useful for analyses of low-signal peaks. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
23 | 12 | Fly (FB5.45) | Mappability | 7 | Unique Mappability track (Read Lengths 20 to 54) | BINARY | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/umap/dm3_build5/dm3_build5.all.globalmap_k20tok54.tgz | A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54). (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by <k-1>, i.e. if position 1 is UNIQUE on the + strand for read-length <k=3> then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
24 | 22 | Fly (FB5.45) | ChIP-seq | 6 | Signal tracks (Input normalized) | BIGWIG/BEDGRAPH | http://www.broadinstitute.org/~anshul/projects/fly/uniformSignal/macs2signal/combrep/ | Signal tracks are generated for each dataset using MACSv2's signal processing module. Signal tracks represent ChIP signal compared to input control signal. - FoldEnrichment: These tracks represent the fold enrichment of ChIP over input. This type of track is useful for detecting and analyzing regions with moderate to low enrichment. - Does NOT correct for local mappability. - Does NOT differentiate between "missing data" at unmappable locations and true 0 signal. - The signal is not smoothed. Reads are extended to the predominant fragment length. - It is recommended to smooth the signal before using it unless you are averaging over multiple sites (e.g. aggregation plots). - Separate tracks are generated for individual replicates and pooled data. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
25 | Fly (FB5.45) | TIP | TF target prediction | TIP | http://www.broadinstitute.org/~anshul/projects/fly/TFtargets/jun2012/ | See the README.fly file in the directory. Chao Cheng's TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore the score -1 results and treat the score=0 results cautiously. | Chao Cheng (chengchao12@gmail.com) | ||||||||||||
26 | Fly (FB5.45) | HOT | HOT Regions | BED | http://stanford.edu/~claraya/metrn/data/hot/regions/dm/ | See README: http://stanford.edu/~claraya/metrn/data/hot/ | Carlos Araya (claraya@gmail.com) ; Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||||
27 | |||||||||||||||||||
28 | 7 | Human (hg19) | Genome Sequence | 0 | Genome Sequence | FASTA | http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/referenceSequences/ | Per chromosome FASTA file (Random contigs are not used for mapping or computing unique mappability) | |||||||||||
29 | 15 | Human (hg19) | ChIP-seq | 1 | Raw data (Alignments) | BAM | http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC | FASTQ and BAM files can be downloaded from the URL. Different labs used different mappers and mapping strategies. Hence, these files should be filtered to standardize them. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
30 | 16 | Human (hg19) | ChIP-seq | 1 | Raw data (Unique mapping distinct alignments) | TAGALIGN | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/mapped/mar2012/distinctTagAlign/ | BAM files above are filtered to only keep unique mapping reads (tagAlign/ directory). Then duplicate reads were removed (only one read per position). These can be obtained in the distinctTagAlign/ directory | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
31 | 17 | Human (hg19) | ChIP-seq | 2 | MetaData and Data Quality | EXCEL/TAB tables | https://docs.google.com/spreadsheet/ccc?key=0Am6FxqAtrFDwdHdRcHNQUy03SjBoSVMxdUNyZV9Rdnc#gid=9 | Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
32 | 19 | Human (hg19) | Blacklist | 4 | Blacklists | BED | http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz | Brief summary of how the blacklist was generated can be found at http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability . More detailed analysis is at http://goo.gl/9FyQF | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
33 | 18 | Human (hg19) | ChIP-seq | 3 | IDR Peak Calls | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/peaks_spp/mar2012/distinct/idrOptimal/ | The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.02 was used. See https://sites.google.com/site/anshulkundaje/projects/idr for details | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
34 | 20 | Human (hg19) | ChIP-seq | 4 | Blacklist filtered IDR peak Calls (Use these) | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/peaks_spp/mar2012/distinct/idrOptimalBlackListFilt/ | IDR Peak calls are filtered against blacklists. THESE ARE THE HUMAN PEAK CALLS EVERYONE SHOULD USE. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
35 | 21 | Human (hg19) | ChIP-seq | 5 | Relaxed peak calls (unthresholded) | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/peaks_spp/mar2012/distinct/combrep/regionPeak/ | A large set of unthresholded peak calls (up to 300K peaks) from SPP. Useful for analyses of low-signal peaks. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
36 | 22 | Human (hg19) | ChIP-seq | 6 | Signal tracks (Input normalized) | BIGWIG/BEDGRAPH | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/signal/mar2012/pooledReps/bigwig/macs2signal/foldChange/ | Signal tracks are generated for each dataset using MACSv2's signal processing module. Signal tracks represent ChIP signal compared to input control signal. - FoldEnrichment: These tracks represent the fold enrichment of ChIP over input. This type of track is useful for detecting and analyzing regions with moderate to low enrichment. - Does NOT correct for local mappability. - Does NOT differentiate between "missing data" at unmappable locations and true 0 signal. - The signal is not smoothed. Reads are extended to the predominant fragment length. - It is recommended to smooth the signal before using it unless you are averaging over multiple sites (e.g. aggregation plots). - Separate tracks are generated for individual replicates and pooled data. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
37 | Human (hg19) Male genome (with chrY) | Mappability | 7 | Unique Mappability track (Read Lengths 20 to 54) | BINARY | http://www.broadinstitute.org/~anshul/projects/umap/encodeHg19Male/globalmap_k20tok54.tgz | A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54). (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by <k-1>, i.e. if position 1 is UNIQUE on the + strand for read-length <k=3> then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | ||||||||||
38 | Human (hg19) Female genome (no chrY) | Mappability | 7 | Unique Mappability track (Read Lengths 20 to 54) | BINARY | http://www.broadinstitute.org/~anshul/projects/umap/encodeHg19Female/globalmap_k20tok54.tgz | A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54). (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by <k-1>, i.e. if position 1 is UNIQUE on the + strand for read-length <k=3> then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | ||||||||||
39 | Human (hg19) | TIP | TF target predictions | TIP | http://www.broadinstitute.org/~anshul/projects/encode/rawdata/TFtargets/ | See the README.hsa file in the directory. Chao Cheng's TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. | Chao Cheng (chengchao12@gmail.com) | Chao Cheng (chengchao12@gmail.com) | |||||||||||
40 | Human (hg19) | HOT | HOT Regions | BED | http://stanford.edu/~claraya/metrn/data/hot/regions/hs/ | See README: http://stanford.edu/~claraya/metrn/data/hot/ | Carlos Araya (claraya@gmail.com) ; Alan Boyle (aboyle@stanford.edu) | Mike Snyder | |||||||||||
41 | |||||||||||||||||||
42 | 7 | Mouse (mm9) | Genome Sequence | 0 | Genome Sequence | FASTA | http://hgdownload-test.cse.ucsc.edu/goldenPath/mm9/chromosomes/ | Per chromosome FASTA file (Random contigs are not used for mapping or computing unique mappability) | |||||||||||
43 | 23 | Mouse (mm9) | ChIP-seq | 1 | Raw data (Alignments) | BAM | http://hgdownload-test.cse.ucsc.edu/goldenPath/mm9/encodeDCC | Datasets were mapped by individual labs | Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng | Mike Snyder | |||||||||
44 | 24 | Mouse (mm9) | ChIP-seq | 2 | MetaData and Data Quality | EXCEL/TAB tables | https://docs.google.com/spreadsheet/ccc?key=0Ao3-Or4FCMJEdFpPY2lwWnlZTV92MUNLOHYxbEl4Vnc#gid=0 | Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics | Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng | Mike Snyder | |||||||||
45 | 19 | Mouse (mm9) | Blacklist | 4 | Blacklists | BED | http://www.broadinstitute.org/~anshul/projects/mouse/blacklist/mm9-blacklist.bed.gz | Brief summary of how the blacklist was generated can be found at http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability . More detailed analysis is at http://goo.gl/9FyQF | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
46 | 25 | Mouse (mm9) | ChIP-seq | 3 | IDR Peak Calls | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/mouse/peaks_spp/idrOptimal/ | The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.02 was used. See https://sites.google.com/site/anshulkundaje/projects/idr for details | Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng | Mike Snyder | |||||||||
47 | 26 | Mouse (mm9) | ChIP-seq | 4 | Blacklist filtered IDR peak Calls (Use these) | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/mouse/peaks_spp/idrOptimalBlacklistFiltered/ | IDR Peak calls are filtered against blacklists. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | |||||||||
48 | 28 | Mouse (mm9) | ChIP-seq | 5 | Relaxed peak calls (unthresholded) | NARROWPEAK | http://www.broadinstitute.org/~anshul/projects/mouse/peaks_spp/combrep/ | A large set of unthresholded peak calls (up to 300K peaks) from SPP. Useful for analyses of low-signal peaks. | Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng | Mike Snyder | |||||||||
49 | Mouse (mm9) | Mappability | 7 | Unique Mappability track (Read Lengths 20 to 54) | BINARY | http://www.broadinstitute.org/~anshul/projects/umap/mm9/globalmap_k20tok54.tgz | A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54). (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by <k-1>, i.e. if position 1 is UNIQUE on the + strand for read-length <k=3> then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer. | Anshul Kundaje (anshul@kundaje.net) | Manolis Kellis | ||||||||||
50 | Mouse (mm9) | ChIP-seq | Signal tracks (Input normalized) | BIGWIG | Not sure where these are or if they were generated |
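
The mappability notes in the table describe how to decode the per-chromosome uint8 vectors and say the files can be read in any language. The following is a minimal Python sketch of those steps using only the standard library. The function names (`read_umap`, `plus_strand_unique`, `minus_strand_unique`) are illustrative and not part of any distributed tool, and filling the positions vacated by the right-shift with "not unique" is an assumption, since the notes do not say how the ends of the vector should be handled.

```python
from typing import List


def read_umap(path: str) -> bytes:
    """Read one per-chromosome file (e.g. chr1.uint8.unique) as raw bytes.

    Each byte is one unsigned 8-bit integer, one per genomic position.
    """
    with open(path, "rb") as handle:
        return handle.read()


def plus_strand_unique(umap: bytes, k: int) -> List[bool]:
    """Uniqueness map on the + strand for read length k.

    Implements rule (e) from the notes: (vector > 0) & (vector <= k).
    """
    # Iterating over a bytes object yields integers in the range 0..255.
    return [0 < value <= k for value in umap]


def minus_strand_unique(plus: List[bool], k: int) -> List[bool]:
    """Uniqueness map on the - strand, obtained by right-shifting by k-1.

    Assumption: positions shifted in at the start are marked not unique.
    """
    shift = k - 1
    n = len(plus)
    if shift >= n:
        return [False] * n
    return [False] * shift + plus[: n - shift]


if __name__ == "__main__":
    # Small in-memory vector standing in for a real chromosome file.
    demo = bytes([0, 20, 36, 54, 0, 25])
    plus = plus_strand_unique(demo, k=36)
    minus = minus_strand_unique(plus, k=36)
    print(sum(plus), "of", len(demo), "positions are unique on the + strand at k=36")
    print(sum(minus), "of", len(demo), "positions are unique on the - strand at k=36")
```

For a real file, `read_umap('chr1.uint8.unique')` would replace the in-memory `demo` vector after the archive has been extracted. Keeping the data as `bytes` uses one byte per position, which matches the notes' remark that converting to a wider type wastes memory.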