1 of 32

Bioconductor Packages

For

Cached File Management

BiocFileCache, AnnotationHub, ExperimentHub

2 of 32

BiocFileCache

Local File Management

3 of 32

Motivation:

It can be time consuming to download remote resources from the web. Let’s design a way to check whether a local copy of a resource needs to be updated.

4 of 32

Utilizes functions from httr to capture Last-modified time

  • HEAD()
  • cache_info()

> library(httr)

> cache_info(HEAD("https://en.wikipedia.org/wiki/Bioconductor"))$modified
[1] "2017-11-14 03:41:58 GMT"
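
The update check motivated above can be sketched roughly as follows. The helper names `remote_last_modified` and `needs_refresh` are hypothetical and not part of httr; only `HEAD()` and `cache_info()` come from the package. The comparison is illustrated offline with fixed timestamps, and it is an assumption of this sketch that a missing Last-Modified header should yield `NA` rather than a forced download.

```r
# Hypothetical helper: ask the server when the resource last changed.
# httr::HEAD() transfers only the headers, not the body.
remote_last_modified <- function(url) {
    # $modified is NULL when the server sends no Last-Modified header
    httr::cache_info(httr::HEAD(url))$modified
}

# Hypothetical helper: decide whether a local copy is stale.
# Returns NA when there is no remote timestamp to compare against.
needs_refresh <- function(local_time, remote_time) {
    if (is.null(remote_time) || is.na(remote_time)) return(NA)
    remote_time > local_time
}

# Offline illustration with fixed timestamps
local_time  <- as.POSIXct("2017-11-01 00:00:00", tz = "GMT")
remote_time <- as.POSIXct("2017-11-14 03:41:58", tz = "GMT")
needs_refresh(local_time, remote_time)   # TRUE: the remote copy is newer
```

With network access, the two helpers could be combined as `needs_refresh(file.mtime(path), remote_last_modified(url))`, where `path` is a previously downloaded copy.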

5 of 32

Motivation:

It can be time consuming to download remote resources from the web. Let’s design a way to check whether a local copy of a resource needs to be updated.

Let’s also have a way to better organize local files

6 of 32

BiocFileCache()

  • creates a cache object
  • sqlite database backend
  • add ‘resources’ (files) to the cache object to track

Cache Info:

  • bfccache()
  • length()
  • show()
  • bfcinfo()

Adding Resources:

  • bfcadd()
  • bfcnew()

Investigating Resources:

  • bfcquerycols()
  • bfcquery()
  • bfccount()
  • bfcrid()
  • bfcpath()
  • bfcrpath()
  • [

Web Resources:

  • bfcneedsupdate()
  • bfcdownload()

Updating Resources:

  • bfcupdate()
  • [[

Removing Resources:

  • bfcremove()
  • bfcsync()

Clean/Remove Cache:

  • cleanbfc()
  • removebfc()

MetaData:

  • bfcmetalist()
  • bfcmeta()
  • bfcmeta()<-
  • bfcmetaremove()
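
A minimal sketch of how several of the functions listed above fit together, assuming the BiocFileCache package is installed. It uses a throwaway cache in a temporary directory and a small local text file as a stand-in resource, so the default user cache is never touched; the file name and the `rname` value are made up for illustration.

```r
library(BiocFileCache)

# Throwaway cache; ask = FALSE suppresses the interactive creation prompt.
bfc <- BiocFileCache(tempfile("bfc_demo_"), ask = FALSE)

# A small local file stands in for a real resource.
src <- tempfile(fileext = ".txt")
writeLines("hello cache", src)

# Adding resources: by default the file is copied into the cache.
rpath <- bfcadd(bfc, rname = "demo-text", fpath = src)

# Cache info and investigating resources.
n_before <- bfccount(bfc)                  # number of tracked resources
rid <- bfcquery(bfc, "demo-text")$rid      # resource id(s) matching the query
txt <- readLines(bfcrpath(bfc, rids = rid))

# Removing the resource, then the whole temporary cache.
bfcremove(bfc, rid)
removebfc(bfc, ask = FALSE)
```

Working against a temporary cache like this is a convenient way to try the API before pointing it at the default location reported by `bfccache()`.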

7 of 32

Example:

> BiocFileCache()
class: BiocFileCache
bfccache: /home/lori/.cache/BiocFileCache
bfccount: 0
For more information see: bfcinfo() or bfcquery()

> bfcinfo()
# A tibble: 0 x 8
# ... with 8 variables: rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>

8 of 32

Example:

> bfcadd(rname="Wiki", fpath="https://en.wikipedia.org/wiki/Bioconductor")
  |======================================================================| 100%
                                                      BFC1
"/home/lori/.cache/BiocFileCache/282e8be47f6_Bioconductor"

> bfcinfo()
# A tibble: 1 x 8
  rid   rname create_time         access_time
  <chr> <chr> <chr>               <chr>
1 BFC1  Wiki  2017-11-28 16:42:45 2017-11-28 16:42:45
# ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>

> bfcinfo() %>% select(last_modified_time, rpath)
# A tibble: 1 x 2
  last_modified_time  rpath
  <chr>               <chr>
1 2017-11-14 03:41:58 /home/lori/.cache/BiocFileCache/282e8be47f6_Bioconductor

9 of 32

Example:

> pathToSave = bfcnew(rname="My RDS File", ext="rds")

> pathToSave
                                                           BFC2
"/home/lori/.cache/BiocFileCache/2feb30a96058_2feb30a96058.rds"

> bfcinfo()
# A tibble: 2 x 8
  rid   rname       create_time         access_time
  <chr> <chr>       <chr>               <chr>
1 BFC1  Wiki        2017-11-28 16:42:45 2017-11-28 16:42:45
2 BFC2  My RDS File 2017-11-28 16:43:14 2017-11-28 16:43:14
# ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>

> saveRDS(myObj, file=pathToSave)
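
The `bfcnew()` workflow on this slide leaves `myObj` undefined, so here is a self-contained sketch of the same round trip, assuming BiocFileCache is installed. The object `my_obj` is a made-up stand-in, the cache lives in a temporary directory, and the extension is written with a leading dot (`".rds"`), which is an assumption about the form current package versions expect.

```r
library(BiocFileCache)

# Throwaway cache so the default user cache is left untouched.
bfc <- BiocFileCache(tempfile("bfc_demo_"), ask = FALSE)

# Stand-in for the `myObj` used on the slide.
my_obj <- list(project = "X", samples = 1:3)

# Reserve a path in the cache; nothing is written until we save to it.
path_to_save <- bfcnew(bfc, rname = "My RDS File", ext = ".rds")
saveRDS(my_obj, file = path_to_save)

# The returned path is named by its resource id, which can be used to
# look the file up again later.
restored <- readRDS(bfcrpath(bfc, rids = names(path_to_save)))
identical(restored, my_obj)
```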

10 of 32

Example:

> bfcneedsupdate()
 BFC1
FALSE

> bfcquery(query="RDS")
# A tibble: 1 x 8
  rid   rname       create_time         access_time
  <chr> <chr>       <chr>               <chr>
1 BFC2  My RDS File 2017-11-28 16:43:14 2017-11-28 16:43:14
# ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>

> bfcrid(bfcquery(query="RDS"))
[1] "BFC2"

> bfcrpath(rids="BFC2")
                                                           BFC2
"/home/lori/.cache/BiocFileCache/2feb30a96058_2feb30a96058.rds"

> readRDS(bfcrpath(rids="BFC2"))

11 of 32

Example:

# data.frame or tibble

> meta = data.frame(rid="BFC2", info="pipeLine project X", numSamples=2000)

> bfc = BiocFileCache()

> bfcmeta(bfc, name="pipeLineXmeta") <- meta
> bfcmetalist()
[1] "pipeLineXmeta"

> library(dplyr)
> bfcinfo(bfc) %>% select(rid, rname, info, numSamples)
# A tibble: 2 x 4
  rid   rname       info               numSamples
  <chr> <chr>       <chr>                   <dbl>
1 BFC1  Wiki        <NA>                       NA
2 BFC2  My RDS File pipeLine project X       2000

12 of 32

Example:

> bfcquery(query="project X", field="info")
# A tibble: 1 x 10
  rid   rname       create_time         access_time
  <chr> <chr>       <chr>               <chr>
1 BFC2  My RDS File 2017-11-28 14:56:26 2017-11-28 14:58:03
# ... with 6 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <chr>, info <chr>, numSamples <dbl>

> bfcquerycols()
 [1] "rid"                "rname"              "create_time"
 [4] "access_time"        "rpath"              "rtype"
 [7] "fpath"              "last_modified_time" "info"
[10] "numSamples"
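
The metadata steps from the last two slides can be collected into one self-contained sketch, assuming BiocFileCache is installed. It runs against a temporary cache with a single placeholder resource, so resource ids will differ from the `BFC2` shown in the slides; the metadata values are copied from the example above.

```r
library(BiocFileCache)

# Throwaway cache with one placeholder resource to attach metadata to.
bfc <- BiocFileCache(tempfile("bfc_demo_"), ask = FALSE)
path <- bfcnew(bfc, rname = "My RDS File", ext = ".rds")
rid <- names(path)

# The metadata table must contain an `rid` column matching cache entries.
meta <- data.frame(rid = rid, info = "pipeLine project X", numSamples = 2000,
                   stringsAsFactors = FALSE)
bfcmeta(bfc, name = "pipeLineXmeta") <- meta

# The new columns become queryable fields.
meta_tables <- bfcmetalist(bfc)
query_cols <- bfcquerycols(bfc)
hits <- bfcquery(bfc, query = "project X", field = "info")

# Drop the metadata table again when it is no longer needed.
bfcmetaremove(bfc, "pipeLineXmeta")
```

Keeping project annotations in a metadata table rather than in file names means they can be searched with `bfcquery()` and removed without touching the cached files themselves.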

13 of 32

AnnotationHub/ExperimentHub

14 of 32

AnnotationHub

15 of 32

AnnotationHub()

  • creates a hub object
  • sqlite database backend
  • Files are stored remotely and downloaded as needed
    • Bioconductor AWS S3 Buckets
    • Once downloaded, files are cached for quick access in future runs

16 of 32

Example:

> hub = AnnotationHub()
snapshotDate(): 2017-11-28

> hub
AnnotationHub with 42282 records
# snapshotDate(): 2017-11-28
# $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus,...
# $rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, Rle, ChainFile, OrgD...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH2"]]'

17 of 32

Example:

> length(unique(tolower(hub$species)))
[1] 1207

> length(unique(hub$rdataclass))
[1] 19

> unique(hub$rdataclass)
 [1] "FaFile"           "GRanges"          "data.frame"       "Inparanoid8Db"
 [5] "TwoBitFile"       "ChainFile"        "SQLiteConnection" "biopax"
 [9] "BigWigFile"       "AAStringSet"      "MSnSet"           "mzRpwiz"
[13] "mzRident"         "list"             "TxDb"             "Rle"
[17] "EnsDb"            "VcfFile"          "OrgDb"

18 of 32

Example:

> query(hub, c("Homo sapien", "UCSC", "GRanges"))
AnnotationHub with 5788 records
# snapshotDate(): 2017-11-28
# $dataprovider: UCSC, Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5012"]]'

            title
  AH5012  | Chromosome Band
  AH5013  | STS Markers
  AH5014  | FISH Clones
  AH5015  | Recomb Rate
  AH5016  | ENCODE Pilot
  ...       ...
  AH27622 | wgEncodeUwTfbsWi38CtcfStdPkRep2.narrowPeak.gz
  AH49554 | gencode.v23.2wayconspseudos.gff3.gz
  AH53176 | UCSC cytoBand track for hg18
  AH53177 | UCSC cytoBand track for hg19
  AH53178 | UCSC cytoBand track for hg38

19 of 32

Example:

> hub["AH53178"]
AnnotationHub with 1 record
# snapshotDate(): 2017-11-28
# names(): AH53178
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2017-01-05
# $title: UCSC cytoBand track for hg38
# $description: Approximate location of bands seen on Giemsa-stained chromos...
# $taxonomyid: 9606
# $genome: hg38
# $sourcetype: UCSC track
# $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBa...
# $sourcesize: NA
# $tags: c("cytoBand", "AHCytoBands")
# retrieve record with 'object[["AH53178"]]'

20 of 32

Example:

> gr = hub[["AH53178"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.AnnotationHub/59916'

> gr = hub[["AH53178"]]
downloading 0 resources
loading from cache
    '/home/lori//.AnnotationHub/59916'

> summary(gr)
[1] "GRanges object with 1293 ranges and 2 metadata columns"

> head(gr)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand |     name gieStain
         <Rle>            <IRanges>  <Rle> | <factor> <factor>
  [1]     chr1 [       1,  2300000]      * |   p36.33     gneg
  [2]     chr1 [ 2300001,  5300000]      * |   p36.32   gpos25
  [3]     chr1 [ 5300001,  7100000]      * |   p36.31     gneg
  [4]     chr1 [ 7100001,  9100000]      * |   p36.23   gpos25
  [5]     chr1 [ 9100001, 12500000]      * |   p36.22     gneg
  [6]     chr1 [12500001, 15900000]      * |   p36.21   gpos50
  -------
  seqinfo: 455 sequences from an unspecified genome; no seqlengths

21 of 32

ExperimentHub

22 of 32

ExperimentHub()

  • creates a hub object
  • sqlite database backend
  • Files are stored remotely and downloaded as needed
    • Bioconductor AWS S3 Buckets
    • Once downloaded, files are cached for quick access in future runs

ExperimentHub data is associated with a Bioconductor package!

23 of 32

Example:

> eh = ExperimentHub()
snapshotDate(): 2017-10-30

> length(eh)
[1] 872

> query(eh, "HMP16SData")
ExperimentHub with 2 records
# snapshotDate(): 2017-10-30
# $dataprovider: NIH Common Fund Human Microbiome Project
# $species: Homo Sapiens
# $rdataclass: SummarizedExperiment
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH1037"]]'

           title
  EH1037 | V13
  EH1038 | V35

24 of 32

Example:

> query(eh, "TENxBrainData")
ExperimentHub with 4 records
# snapshotDate(): 2017-10-30
# $dataprovider: 10X Genomics
# $species: Mus musculus
# $rdataclass: character
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH1039"]]'

           title
  EH1039 | Brain scRNA-seq data, 'RLE-compressed'
  EH1040 | Brain scRNA-seq data, 'rectangular'
  EH1041 | Brain scRNA-seq data, sample (column) annotation
  EH1042 | Brain scRNA-seq data, gene (row) annotation

25 of 32

What’s the advantage?

Keeps the Package Lightweight!

Only download data as needed

Make large files accessible as simple objects

Resources are documented through package documentation

26 of 32

ExperimentHub Associated Package

Requires inst/scripts/make-data.R

  • Shows the preprocessing of raw files to R objects for reproducibility

R files/functions

  • Can construct complex structures from simple objects behind the scenes
  • Helper functions to download data directly

27 of 32

TENxBrainData: inst/scripts/make-data.R

28 of 32

TENxBrainData: R function

29 of 32

TENxBrainData: R function

> library(TENxBrainData)

> SCE = TENxBrainData()
snapshotDate(): 2017-10-30
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.ExperimentHub/1042'
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.ExperimentHub/1041'
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%
loading from cache
    '/home/lori//.ExperimentHub/1040'

> class(SCE)
[1] "SingleCellExperiment"
attr(,"package")
[1] "SingleCellExperiment"

30 of 32

HMP16SData: R functions

http://bioconductor.org/packages/3.7/bioc/vignettes/ExperimentHubData/inst/doc/CreateAnExperimentHubPackage.html

R/zzz.R

.onLoad <- function
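
The slide only shows the start of the `.onLoad` definition. As a hedged sketch, the pattern described in the ExperimentHubData vignette linked above looks roughly like the following; it is not necessarily the exact code in HMP16SData. It assumes the package ships an `inst/extdata/metadata.csv` with a `Title` column, and it relies on `createHubAccessors()` from ExperimentHub to generate one accessor function per title.

```r
# R/zzz.R (sketch): generate one accessor function per hub resource title.
.onLoad <- function(libname, pkgname) {
    # metadata.csv describes the package's hub resources, one row each
    # (assumed location: inst/extdata/metadata.csv in the package source)
    fl <- system.file("extdata", "metadata.csv", package = pkgname)
    titles <- read.csv(fl, stringsAsFactors = FALSE)$Title
    # defines functions named after the titles in the package namespace
    ExperimentHub::createHubAccessors(pkgname, titles)
}
```

Under this pattern, resources titled `V13` and `V35` would become the functions `V13()` and `V35()` once the package is loaded, which matches the calls shown on this slide.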

> library(HMP16SData)

> SE = V13()
snapshotDate(): 2017-10-30
see ?HMP16SData and browseVignettes('HMP16SData') for documentation
downloading 0 resources
loading from cache
    '/home/lori//.ExperimentHub/1037'

> class(SE)
[1] "SummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

> query(eh, "HMP16SData")
ExperimentHub with 2 records
# snapshotDate(): 2017-10-30
# $dataprovider: NIH Common Fund Human Microbiome Project
# $species: Homo Sapiens
# $rdataclass: SummarizedExperiment
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH1037"]]'

           title
  EH1037 | V13
  EH1038 | V35

> eh[["EH1037"]]

31 of 32

Future…

32 of 32

Lori Shepherd

Bioconductor Core Team

lori.shepherd@roswellpark.org