jointAWG_Data_Locations

	A	B	C	D	E	F	G	H	I	J	K
1	SORT	ORGANISM	DATA TYPE	DATA LEVEL	PROCESSING STEP	DATA FORMATS	DATA LOCATION	NOTES	DATA/PROCESSING CONTACT	DATA CONTACT PI	DATE

2	1	Worm (WS220)	Genome Sequence	0	Genome Sequence	FASTA	http://hgdownload.cse.ucsc.edu/goldenPath/ce10/chromosomes/	Per chromosome FASTA file
3	2	Worm (WS220)	ChIP-seq	1	Raw Alignments	SAM	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/alignments/bam/	Uniformly mapped data using BWA (only unique mapping reads)	Carlos Araya (claraya@gmail.com) ; Philip Cayting (pcayting@stanford.edu)	Mike Snyder
4		Worm (WS220)	ChIP-seq		Filtered Alignments	TAGALIGN/BED	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/alignments/distinctTagAlign/reps/	Unique mapping reads, duplicates filtered	Carlos Araya (claraya@gmail.com) ; Anshul Kundaje (anshul@kundaje.net)
5	3	Worm (WS220)	ChIP-seq	2	MetaData and Data Quality	EXCEL/TAB tables	https://docs.google.com/spreadsheet/ccc?key=0Algk3BSZDYzgdDlYNU00d2p3azJyZWlrZ09OQXNXTGc#gid=0	Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
6	19	Worm (WS220)	Blacklist	4	Blacklists	BED	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/blacklist/ce10-blacklist.bed.gz	Empirical blacklist of regions with artifactual unstructured signal	Alan Boyle (aboyle@stanford.edu)	Mike Snyder
7	4	Worm (WS220)	ChIP-seq	3	IDR Peak Calls	NARROWPEAK	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/idr/	The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.05 was used. chrM peaks were removed as these were unreliable in most cases. See https://sites.google.com/site/anshulkundaje/projects/idr for details	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
8	4	Worm (WS220)	ChIP-seq	3	Blacklist filtered IDR peak Calls (Use these)	NARROWPEAK	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/uniformPk/	IDR Peak calls are filtered against blacklists	Alan Boyle (aboyle@stanford.edu)	Mike Snyder
9	5	Worm (WS220)	ChIP-seq	5	Relaxed peak calls (unthresholded)	NARROWPEAK	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/unthresholdPk/	A large set of unthresholded peak calls (up to ~100K peaks) from SPP. Useful for analyses of low-signal peaks.	Carlos Araya (claraya@gmail.com) ; Philip Cayting (pcayting@stanford.edu)	Mike Snyder
10	6	Worm (WS220)	Mappability	7	Unique Mappability track (Read Lengths 20 to 54)	BINARY	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/mappability/globalmap_k20tok54.tgz	A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. The globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector take the values 0 or 20 to 54. (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by k-1, i.e. if position 1 is UNIQUE on the + strand for read-length k=3 then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
11	22	Worm (WS220)	ChIP-seq	6	Signal tracks (Input normalized)	BIGWIG/BEDGRAPH	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/signal/foldChange/	Signal tracks are generated for each dataset using MACSv2's signal processing module. Signal tracks represent ChIP signal compared to input control signal. - FoldEnrichment: The first type of track represents the fold enrichment of ChIP over input. This type of track is useful for detecting and analyzing regions with moderate to low enrichment. - Does NOT correct for local mappability. - Does NOT differentiate between "missing data" at unmappable locations and true 0 signal. - The signal is not smoothed. Reads are extended to the predominant fragment length. - It is recommended to smooth the signal before using it unless you are averaging over multiple sites (e.g. aggregation plots). - Separate tracks are generated for individual replicates and pooled data.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
12		Worm (WS220)	TIP		TF target prediction	TIP	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/TFtargets/	See README file in the directory. Chao Cheng's TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore those results and treat the score=0 results cautiously.	Chao Cheng (chengchao12@gmail.com)
13		Worm (WS220)	HOT		HOT Regions	BED	http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/hotRegions/	See README at http://encodedcc.sdsc.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/hotRegions/readme.txt	Carlos Araya (claraya@gmail.com) ; Alan Boyle (aboyle@stanford.edu)	Mike Snyder
14
15	7	Fly (FB5.45)	Genome Sequence	0	Genome Sequence	FASTA	http://hgdownload.cse.ucsc.edu/goldenPath/dm3/chromosomes/	Per chromosome FASTA file
16	8	Fly (FB5.45)	ChIP-seq	1	Raw data (Alignments)	SAM	http://www.broadinstitute.org/~anshul/projects/fly/mapped/sam/	Uniformly mapped data using BWA (only unique mapping reads)	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
17		Fly (FB5.45)	ChIP-seq		Filtered Alignments	TAGALIGN/BED	http://www.broadinstitute.org/~anshul/projects/fly/mapped/distinctTagAlign/qcmerge/	Unique mapping reads, duplicates filtered
18	9	Fly (FB5.45)	ChIP-seq	2	MetaData and Data Quality	EXCEL/TAB tables	https://docs.google.com/spreadsheet/ccc?key=0Algk3BSZDYzgdDU3cXVVMHdQeHRTUWtnYk1aSG13NEE&pli=1#gid=4	Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
19	19	Fly (FB5.45)	Blacklist	4	Blacklists	BED	http://www.broadinstitute.org/~anshul/projects/fly/blacklist/dm3-blacklist.bed.gz	Empirical blacklist of regions with artifactual unstructured signal	Alan Boyle (aboyle@stanford.edu)	Mike Snyder
20	10	Fly (FB5.45)	ChIP-seq	3	IDR Peak Calls	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/fly/peaks_macs/release/idrOptimal/pass/	The MACSv2 peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.05 was used.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
21	10	Fly (FB5.45)	ChIP-seq	3	Blacklist filtered IDR peak Calls	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/fly/peaks_macs/release/idrOptimalBlacklistFiltered/	IDR Peak calls are filtered against blacklists	Alan Boyle (aboyle@stanford.edu)	Mike Snyder
22	11	Fly (FB5.45)	ChIP-seq	5	Relaxed peak calls (unthresholded)	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/fly/peaks_macs/release/combrep/regionPeak/	A large set of unthresholded peak calls from MACSv2. Useful for analyses of low-signal peaks.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
23	12	Fly (FB5.45)	Mappability	7	Unique Mappability track (Read Lengths 20 to 54)	BINARY	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/umap/dm3_build5/dm3_build5.all.globalmap_k20tok54.tgz	A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector take the values 0 or 20 to 54. (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by k-1, i.e. if position 1 is UNIQUE on the + strand for read-length k=3 then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
24	22	Fly (FB5.45)	ChIP-seq	6	Signal tracks (Input normalized)	BIGWIG/BEDGRAPH	http://www.broadinstitute.org/~anshul/projects/fly/uniformSignal/macs2signal/combrep/	Signal tracks are generated for each dataset using MACSv2's signal processing module. Signal tracks represent ChIP signal compared to input control signal. - FoldEnrichment: The first type of track represents the fold enrichment of ChIP over input. This type of track is useful for detecting and analyzing regions with moderate to low enrichment. - Does NOT correct for local mappability. - Does NOT differentiate between "missing data" at unmappable locations and true 0 signal. - The signal is not smoothed. Reads are extended to the predominant fragment length. - It is recommended to smooth the signal before using it unless you are averaging over multiple sites (e.g. aggregation plots). - Separate tracks are generated for individual replicates and pooled data.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
25		Fly (FB5.45)	TIP		TF target prediction	TIP	http://www.broadinstitute.org/~anshul/projects/fly/TFtargets/jun2012/	See README.fly file in the directory. Chao Cheng's TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore those results and treat the score=0 results cautiously.	Chao Cheng (chengchao12@gmail.com)
26		Fly (FB5.45)	HOT		HOT Regions	BED	http://stanford.edu/~claraya/metrn/data/hot/regions/dm/	See README: http://stanford.edu/~claraya/metrn/data/hot/	Carlos Araya (claraya@gmail.com) ; Alan Boyle (aboyle@stanford.edu)	Mike Snyder
27
28	7	Human (hg19)	Genome Sequence	0	Genome Sequence	FASTA	http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/referenceSequences/	Per chromosome FASTA file (Random contigs are not used for mapping or computing unique mappability)
29	15	Human (hg19)	ChIP-seq	1	Raw data (Alignments)	BAM	http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC	FASTQ and BAM files can be downloaded from the URL. Different labs used different mappers and mapping strategies. Hence, these files should be filtered to standardize them.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
30	16	Human (hg19)	ChIP-seq	1	Raw data (Unique mapping distinct alignments)	TAGALIGN	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/mapped/mar2012/distinctTagAlign/	The BAM files above were filtered to keep only unique mapping reads (tagAlign/ directory). Duplicate reads were then removed (only one read per position); the results are in the distinctTagAlign/ directory.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
31	17	Human (hg19)	ChIP-seq	2	MetaData and Data Quality	EXCEL/TAB tables	https://docs.google.com/spreadsheet/ccc?key=0Am6FxqAtrFDwdHdRcHNQUy03SjBoSVMxdUNyZV9Rdnc#gid=9	Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
32	19	Human (hg19)	Blacklist	4	Blacklists	BED	http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz	Brief summary of how the blacklist was generated can be found at http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability . More detailed analysis is at http://goo.gl/9FyQF	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
33	18	Human (hg19)	ChIP-seq	3	IDR Peak Calls	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/peaks_spp/mar2012/distinct/idrOptimal/	The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.02 was used. See https://sites.google.com/site/anshulkundaje/projects/idr for details	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
34	20	Human (hg19)	ChIP-seq	4	Blacklist filtered IDR peak Calls (Use these)	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/peaks_spp/mar2012/distinct/idrOptimalBlackListFilt/	IDR Peak calls are filtered against blacklists. THESE ARE THE HUMAN PEAK CALLS EVERYONE SHOULD USE.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
35	21	Human (hg19)	ChIP-seq	5	Relaxed peak calls (unthresholded)	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/peaks_spp/mar2012/distinct/combrep/regionPeak/	A large set of unthresholded peak calls (up to 300K peaks) from SPP. Useful for analyses of low-signal peaks.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
36	22	Human (hg19)	ChIP-seq	6	Signal tracks (Input normalized)	BIGWIG/BEDGRAPH	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/signal/mar2012/pooledReps/bigwig/macs2signal/foldChange/	Signal tracks are generated for each dataset using MACSv2's signal processing module. Signal tracks represent ChIP signal compared to input control signal. - FoldEnrichment: The first type of track represents the fold enrichment of ChIP over input. This type of track is useful for detecting and analyzing regions with moderate to low enrichment. - Does NOT correct for local mappability. - Does NOT differentiate between "missing data" at unmappable locations and true 0 signal. - The signal is not smoothed. Reads are extended to the predominant fragment length. - It is recommended to smooth the signal before using it unless you are averaging over multiple sites (e.g. aggregation plots). - Separate tracks are generated for individual replicates and pooled data.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
37		Human (hg19) Male genome (with chrY)	Mappability	7	Unique Mappability track (Read Lengths 20 to 54)	BINARY	http://www.broadinstitute.org/~anshul/projects/umap/encodeHg19Male/globalmap_k20tok54.tgz	A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector take the values 0 or 20 to 54. (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by k-1, i.e. if position 1 is UNIQUE on the + strand for read-length k=3 then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
38		Human (hg19) Female genome (no chrY)	Mappability	7	Unique Mappability track (Read Lengths 20 to 54)	BINARY	http://www.broadinstitute.org/~anshul/projects/umap/encodeHg19Female/globalmap_k20tok54.tgz	A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector take the values 0 or 20 to 54. (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by k-1, i.e. if position 1 is UNIQUE on the + strand for read-length k=3 then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
39		Human (hg19)	TIP		TF target predictions	TIP	http://www.broadinstitute.org/~anshul/projects/encode/rawdata/TFtargets/	See README.hsa file in the directory. Chao Cheng's TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method.	Chao Cheng (chengchao12@gmail.com)	Chao Cheng (chengchao12@gmail.com)
40		Human (hg19)	HOT		HOT Regions	BED	http://stanford.edu/~claraya/metrn/data/hot/regions/hs/	See README: http://stanford.edu/~claraya/metrn/data/hot/	Carlos Araya (claraya@gmail.com) ; Alan Boyle (aboyle@stanford.edu)	Mike Snyder
41
42	7	Mouse (mm9)	Genome Sequence	0	Genome Sequence	FASTA	http://hgdownload-test.cse.ucsc.edu/goldenPath/mm9/chromosomes/	Per chromosome FASTA file (Random contigs are not used for mapping or computing unique mappability)
43	23	Mouse (mm9)	ChIP-seq	1	Raw data (Alignments)	BAM	http://hgdownload-test.cse.ucsc.edu/goldenPath/mm9/encodeDCC	Datasets were mapped by individual labs	Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng	Mike Snyder
44	24	Mouse (mm9)	ChIP-seq	2	MetaData and Data Quality	EXCEL/TAB tables	https://docs.google.com/spreadsheet/ccc?key=0Ao3-Or4FCMJEdFpPY2lwWnlZTV92MUNLOHYxbEl4Vnc#gid=0	Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics	Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng	Mike Snyder
45	19	Mouse (mm9)	Blacklist	4	Blacklists	BED	http://www.broadinstitute.org/~anshul/projects/mouse/blacklist/mm9-blacklist.bed.gz	Brief summary of how the blacklist was generated can be found at http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability . More detailed analysis is at http://goo.gl/9FyQF	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
46	25	Mouse (mm9)	ChIP-seq	3	IDR Peak Calls	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/mouse/peaks_spp/idrOptimal/	The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. IDR threshold of 0.02 was used. See https://sites.google.com/site/anshulkundaje/projects/idr for details	Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng	Mike Snyder
47	26	Mouse (mm9)	ChIP-seq	4	Blacklist filtered IDR peak Calls (Use these)	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/mouse/peaks_spp/idrOptimalBlacklistFiltered/	IDR Peak calls are filtered against blacklists.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
48	28	Mouse (mm9)	ChIP-seq	5	Relaxed peak calls (unthresholded)	NARROWPEAK	http://www.broadinstitute.org/~anshul/projects/mouse/peaks_spp/combrep/	A large set of unthresholded peak calls (up to 300K peaks) from SPP. Useful for analyses of low-signal peaks.	Philip Cayting (pcayting@stanford.edu); Alan Boyle ; Yong Cheng	Mike Snyder
49		Mouse (mm9)	Mappability	7	Unique Mappability track (Read Lengths 20 to 54)	BINARY	http://www.broadinstitute.org/~anshul/projects/umap/mm9/globalmap_k20tok54.tgz	A position 'i' on a particular genomic strand 's' is considered uniquely mappable for a read-length 'k' if the k-mer starting at 'i' on strand 's' maps uniquely, i.e. only to position 'i' on strand 's' (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an "optimistic" idealized mappability mask that does not account for mismatches. A whole genome index (except for the human female mask, for which chrY was excluded from the index) was created and the Bowtie mapper was used to try to map each k-mer against both strands of the genome. Each <organism>.<sex>.globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome). (a) The files are in uint8 (unsigned 8-bit integer) binary format (saves disk space). (b) Each file is a vector of unsigned 8-bit integers that is the length of the chromosome. The elements of the vector take the values 0 or 20 to 54. (c) A value of 'x' at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand. (d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54). (e) To obtain the uniqueness map for a particular read-length 'k', perform the following operation on each element of the vector: (vector > 0) & (vector <= k). (f) To obtain the uniqueness map for the - strand, right-shift the vector by k-1, i.e. if position 1 is UNIQUE on the + strand for read-length k=3 then position 3 is UNIQUE on the - strand. How to read the files in a programming language such as MATLAB/Octave: first gunzip and untar the globalmap_k20tok54.tgz file; you will see one file for each chromosome, e.g. chr1.uint8.unique. Read each file as a contiguous binary vector of unsigned 8-bit integers: tmp_uMap = fopen('chr1.uint8.unique','r'); uMapdata = fread(tmp_uMap,'*uint8'); fclose(tmp_uMap); You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer.	Anshul Kundaje (anshul@kundaje.net)	Manolis Kellis
50		Mouse (mm9)	ChIP-seq		Signal tracks (Input normalized)	BIGWIG	Location unknown; it is unclear whether these tracks were generated
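
The mappability notes above say the uint8 files can be read in any language, so the following is a minimal Python sketch of steps (e) and (f), using only the standard library. The function names read_umap, unique_plus and unique_minus are hypothetical and not part of any distributed tool. The sketch assumes the whole file fits in memory and that index 0 of the vector corresponds to the first base of the chromosome.

```python
from pathlib import Path


def read_umap(path):
    """Read one per-chromosome file (e.g. chr1.uint8.unique) as raw bytes."""
    # Each byte is one genomic position; indexing a bytes object yields
    # unsigned integers in 0..255, so no further decoding is needed.
    return Path(path).read_bytes()


def unique_plus(umap, k):
    """Step (e): 1 where the k-mer starting at a + strand position is unique."""
    # Equivalent to (vector > 0) & (vector <= k) applied element by element.
    return bytearray(1 if 0 < v <= k else 0 for v in umap)


def unique_minus(umap, k):
    """Step (f): right-shift the + strand map for read-length k by k-1."""
    plus = unique_plus(umap, k)
    shift = k - 1
    if shift >= len(plus):
        # The vector is shorter than the shift, so nothing remains.
        return bytearray(len(plus))
    # The first k-1 positions become 0; values shifted past the end are dropped.
    return bytearray(shift) + plus[: len(plus) - shift]
```

For example, after untarring the archive, `unique_plus(read_umap("chr1.uint8.unique"), 36)` would give the + strand uniqueness map for 36 bp reads. The generator-based loop favours portability over speed; for whole mammalian chromosomes a vectorized array library would likely be much faster.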