Bioconductor Packages for Cached File Management
BiocFileCache, AnnotationHub, ExperimentHub
BiocFileCache
Local File Management
Motivation:
It can be time-consuming to download remote resources from the web. Let’s design a way to check whether a local copy of a resource needs to be updated.
Uses functions from httr to capture the Last-Modified time:
> library(httr)
> cache_info(HEAD("https://en.wikipedia.org/wiki/Bioconductor"))$modified
[1] "2017-11-14 03:41:58 GMT"
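The comparison this motivates can be sketched as a small helper that checks a remote Last-Modified time against the time a local copy was written. This is an illustration only: `needs_update()` is a hypothetical name, not part of httr or BiocFileCache, and a missing remote timestamp is reported as unknown (NA) rather than guessed.

```r
# Hypothetical helper (not part of httr or BiocFileCache): decide whether a
# local copy is stale, given the remote Last-Modified time.
needs_update <- function(local_time, remote_time) {
    # No Last-Modified information available: we cannot tell, so report NA.
    if (is.null(remote_time) || is.na(remote_time)) return(NA)
    # Stale when the remote resource changed after the local copy was written.
    remote_time > local_time
}

# Remote time taken from the httr output shown above.
remote <- as.POSIXct("2017-11-14 03:41:58", tz = "GMT")

older_copy <- as.POSIXct("2017-11-01 00:00:00", tz = "GMT")
newer_copy <- as.POSIXct("2017-11-28 16:42:45", tz = "GMT")

stale_old <- needs_update(older_copy, remote)  # local copy predates the change
stale_new <- needs_update(newer_copy, remote)  # local copy is newer than the change

# With network access, the remote time could come from httr instead
# ("Bioconductor.html" is a made-up local file name):
# remote <- cache_info(HEAD("https://en.wikipedia.org/wiki/Bioconductor"))$modified
# needs_update(file.mtime("Bioconductor.html"), remote)
```

Returning NA for a missing header keeps "unknown" distinct from "up to date", so the caller can decide whether to re-download in that case.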
Motivation:
Let’s also have a way to better organize local files
BiocFileCache( )
Cache Info:
Adding Resources:
Investigating Resources:
Web Resources:
Updating Resources:
Removing Resources:
Clean/Remove Cache:
MetaData:
Example:
> BiocFileCache()
class: BiocFileCache
bfccache: /home/lori/.cache/BiocFileCache
bfccount: 0
For more information see: bfcinfo() or bfcquery()
> bfcinfo()
# A tibble: 0 x 8
# ... with 8 variables: rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>
Example:
> bfcadd(rname="Wiki", fpath="https://en.wikipedia.org/wiki/Bioconductor")
  |======================================================================| 100%
                                                      BFC1
"/home/lori/.cache/BiocFileCache/282e8be47f6_Bioconductor"
> bfcinfo()
# A tibble: 1 x 8
  rid   rname create_time         access_time
  <chr> <chr> <chr>               <chr>
1 BFC1  Wiki  2017-11-28 16:42:45 2017-11-28 16:42:45
# ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>
> library(dplyr)
> bfcinfo() %>% select(last_modified_time, rpath)
# A tibble: 1 x 2
  last_modified_time  rpath
  <chr>               <chr>
1 2017-11-14 03:41:58 /home/lori/.cache/BiocFileCache/282e8be47f6_Bioconductor
Example:
> pathToSave = bfcnew(rname="My RDS File", ext="rds")
> pathToSave
                                                            BFC2
"/home/lori/.cache/BiocFileCache/2feb30a96058_2feb30a96058.rds"
> bfcinfo()
# A tibble: 2 x 8
  rid   rname       create_time         access_time
  <chr> <chr>       <chr>               <chr>
1 BFC1  Wiki        2017-11-28 16:42:45 2017-11-28 16:42:45
2 BFC2  My RDS File 2017-11-28 16:43:14 2017-11-28 16:43:14
# ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>
> saveRDS(myObj, file=pathToSave)
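The reserve-then-save pattern above can be tried end to end without touching the default cache. The sketch below is a minimal illustration under assumptions: it uses a throwaway cache in a temporary directory, and `my_obj` is a stand-in object invented for the example.

```r
library(BiocFileCache)

# Use a throwaway cache directory so the default cache is left untouched.
bfc <- BiocFileCache(tempfile(), ask = FALSE)

# Reserve a path in the cache; nothing is written to that path yet.
path <- bfcnew(bfc, rname = "My RDS File", ext = ".rds")

# Write the object to the reserved path ourselves.
my_obj <- data.frame(sample = c("a", "b"), value = c(1, 2))  # stand-in object
saveRDS(my_obj, file = path)

# Later: look the file up by its resource id and read it back.
rid <- names(path)                      # bfcnew() names its result by rid
restored <- readRDS(bfcrpath(bfc, rids = rid))
round_trip_ok <- identical(restored, my_obj)
```

Passing the cache object explicitly (`bfc`) instead of relying on the default makes it clear which cache each call operates on.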
Example:
> bfcneedsupdate()
 BFC1
FALSE
> bfcquery(query="RDS")
# A tibble: 1 x 8
  rid   rname       create_time         access_time
  <chr> <chr>       <chr>               <chr>
1 BFC2  My RDS File 2017-11-28 16:43:14 2017-11-28 16:43:14
# ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>
> bfcrid(bfcquery(query="RDS"))
[1] "BFC2"
> bfcrpath(rids="BFC2")
                                                            BFC2
"/home/lori/.cache/BiocFileCache/2feb30a96058_2feb30a96058.rds"
> readRDS(bfcrpath(rids="BFC2"))
Example:
# data.frame or tibble
> meta = data.frame(rid="BFC2", info="pipeLine project X", numSamples=2000)
> bfc = BiocFileCache()
> bfcmeta(bfc, name="pipeLineXmeta") <- meta
> bfcmetalist()
[1] "pipeLineXmeta"
> library(dplyr)
> bfcinfo(bfc) %>% select(rid, rname, info, numSamples)
# A tibble: 2 x 4
  rid   rname       info               numSamples
  <chr> <chr>       <chr>                   <dbl>
1 BFC1  Wiki        <NA>                       NA
2 BFC2  My RDS File pipeLine project X       2000
Example:
> bfcquery(query="project X", field="info")
# A tibble: 1 x 10
  rid   rname       create_time         access_time
  <chr> <chr>       <chr>               <chr>
1 BFC2  My RDS File 2017-11-28 14:56:26 2017-11-28 14:58:03
# ... with 6 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>, info <chr>, numSamples <dbl>
> bfcquerycols()
 [1] "rid"                "rname"              "create_time"
 [4] "access_time"        "rpath"              "rtype"
 [7] "fpath"              "last_modified_time" "info"
[10] "numSamples"
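Removing resources and removing the cache itself are listed in the function overview but not demonstrated. A minimal sketch, assuming a throwaway cache in a temporary directory and a small local text file created only for this example:

```r
library(BiocFileCache)

bfc <- BiocFileCache(tempfile(), ask = FALSE)   # throwaway cache

# Add a local file; action = "copy" leaves the original file in place.
src <- tempfile(fileext = ".txt")
writeLines("example", src)
path <- bfcadd(bfc, rname = "Example file", fpath = src, action = "copy")
n_before <- bfccount(bfc)          # number of resources now tracked

# Remove one resource (the cache entry and its cached copy) by resource id.
bfcremove(bfc, rids = names(path))
n_after <- bfccount(bfc)
cached_copy_gone <- !file.exists(path)

# cleanbfc(bfc, days = 30, ask = FALSE) would instead drop only resources
# that have not been accessed within the last 30 days.

# Remove the whole cache directory; ask = FALSE skips the confirmation prompt.
removebfc(bfc, ask = FALSE)
```

`bfcremove()` targets individual resources, while `removebfc()` deletes the entire cache, so the latter is best reserved for caches that are truly disposable.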
AnnotationHub/ExperimentHub
AnnotationHub
AnnotationHub( )
Example:
> hub = AnnotationHub()
snapshotDate(): 2017-11-28
> hub
AnnotationHub with 42282 records
# snapshotDate(): 2017-11-28
# $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus,...
# $rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, Rle, ChainFile, OrgD...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH2"]]'
Example:
> length(unique(tolower(hub$species)))
[1] 1207
> length(unique(hub$rdataclass))
[1] 19
> unique(hub$rdataclass)
 [1] "FaFile"           "GRanges"          "data.frame"       "Inparanoid8Db"
 [5] "TwoBitFile"       "ChainFile"        "SQLiteConnection" "biopax"
 [9] "BigWigFile"       "AAStringSet"      "MSnSet"           "mzRpwiz"
[13] "mzRident"         "list"             "TxDb"             "Rle"
[17] "EnsDb"            "VcfFile"          "OrgDb"
Example:
> query(hub, c("Homo sapien", "UCSC", "GRanges"))
AnnotationHub with 5788 records
# snapshotDate(): 2017-11-28
# $dataprovider: UCSC, Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5012"]]'

            title
  AH5012  | Chromosome Band
  AH5013  | STS Markers
  AH5014  | FISH Clones
  AH5015  | Recomb Rate
  AH5016  | ENCODE Pilot
  ...       ...
  AH27622 | wgEncodeUwTfbsWi38CtcfStdPkRep2.narrowPeak.gz
  AH49554 | gencode.v23.2wayconspseudos.gff3.gz
  AH53176 | UCSC cytoBand track for hg18
  AH53177 | UCSC cytoBand track for hg19
  AH53178 | UCSC cytoBand track for hg38
Example:
> hub["AH53178"]
AnnotationHub with 1 record
# snapshotDate(): 2017-11-28
# names(): AH53178
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2017-01-05
# $title: UCSC cytoBand track for hg38
# $description: Approximate location of bands seen on Giemsa-stained chromos...
# $taxonomyid: 9606
# $genome: hg38
# $sourcetype: UCSC track
# $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBa...
# $sourcesize: NA
# $tags: c("cytoBand", "AHCytoBands")
# retrieve record with 'object[["AH53178"]]'
Example:
> gr = hub[["AH53178"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.AnnotationHub/59916'
> gr = hub[["AH53178"]]
downloading 0 resources
loading from cache
    '/home/lori//.AnnotationHub/59916'
> summary(gr)
[1] "GRanges object with 1293 ranges and 2 metadata columns"
> head(gr)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand |     name gieStain
         <Rle>            <IRanges>  <Rle> | <factor> <factor>
  [1]     chr1 [       1,  2300000]      * |   p36.33     gneg
  [2]     chr1 [ 2300001,  5300000]      * |   p36.32   gpos25
  [3]     chr1 [ 5300001,  7100000]      * |   p36.31     gneg
  [4]     chr1 [ 7100001,  9100000]      * |   p36.23   gpos25
  [5]     chr1 [ 9100001, 12500000]      * |   p36.22     gneg
  [6]     chr1 [12500001, 15900000]      * |   p36.21   gpos50
  -------
  seqinfo: 455 sequences from an unspecified genome; no seqlengths
ExperimentHub
ExperimentHub( )
ExperimentHub data is associated with a Bioconductor package!
Example:
> eh = ExperimentHub()
snapshotDate(): 2017-10-30
> length(eh)
[1] 872

> query(eh, "HMP16SData")
ExperimentHub with 2 records
# snapshotDate(): 2017-10-30
# $dataprovider: NIH Common Fund Human Microbiome Project
# $species: Homo Sapiens
# $rdataclass: SummarizedExperiment
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH1037"]]'

           title
  EH1037 | V13
  EH1038 | V35
Example:
> query(eh, "TENxBrainData")
ExperimentHub with 4 records
# snapshotDate(): 2017-10-30
# $dataprovider: 10X Genomics
# $species: Mus musculus
# $rdataclass: character
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH1039"]]'

           title
  EH1039 | Brain scRNA-seq data, 'RLE-compressed'
  EH1040 | Brain scRNA-seq data, 'rectangular'
  EH1041 | Brain scRNA-seq data, sample (column) annotation
  EH1042 | Brain scRNA-seq data, gene (row) annotation
What’s the advantage?
Keeps the Package Lightweight!
Only download data as needed
Make large files accessible as simple objects
Resources are documented through package documentation
ExperimentHub Associated Package
Requires inst/scripts/make-data.R
R files/functions
TENxBrainData: inst/scripts/make-data.R
TENxBrainData: R function
> library(TENxBrainData)
> SCE = TENxBrainData( )
snapshotDate(): 2017-10-30
downloading 1 resources
retrieving 1 resource
  |=====================================================================| 100%
loading from cache
    '/home/lori//.ExperimentHub/1042'
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.ExperimentHub/1041'
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.ExperimentHub/1040'
> class(SCE)
[1] "SingleCellExperiment"
attr(,"package")
[1] "SingleCellExperiment"
HMP16SData: R functions
R/zzz.R
.onLoad <- function
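A common pattern for `R/zzz.R` in an ExperimentHub data package, sketched here as an assumption rather than as HMP16SData's actual source, is to generate one accessor function per resource title when the package loads. `createHubAccessors()` comes from ExperimentHub; the `extdata/metadata.csv` location and its `Title` column follow the usual hub-package layout and are assumed for this example.

```r
# R/zzz.R (sketch): build accessor functions such as V13() and V35() on load.
.onLoad <- function(libname, pkgname) {
    # Assumed layout: inst/extdata/metadata.csv with one row per hub resource.
    fl <- system.file("extdata", "metadata.csv", package = pkgname)
    titles <- read.csv(fl, stringsAsFactors = FALSE)$Title
    # Defines one function per title in the package namespace; each one
    # fetches its resource from ExperimentHub when called.
    ExperimentHub::createHubAccessors(pkgname, titles)
}
```

With this hook in place, the accessors still need to be exported in the package NAMESPACE before users can call them directly.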
> library(HMP16SData)
> SE = V13()
snapshotDate(): 2017-10-30
see ?HMP16SData and browseVignettes('HMP16SData') for documentation
downloading 0 resources
loading from cache
    '/home/lori//.ExperimentHub/1037'
> class(SE)
[1] "SummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
> query(eh, "HMP16SData")
ExperimentHub with 2 records
# snapshotDate(): 2017-10-30
# $dataprovider: NIH Common Fund Human Microbiome Project
# $species: Homo Sapiens
# $rdataclass: SummarizedExperiment
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH1037"]]'

           title
  EH1037 | V13
  EH1038 | V35
> eh[["EH1037"]]
Future…
Lori Shepherd
Bioconductor Core Team
lori.shepherd@roswellpark.org