BlobToolKit
Interactive quality assessment of genome assemblies
Interactive Exercises
Command line examples will look like this
Follow the URLs for browser-based examples
$ cd ~/blobtoolkit
$ ls
blobtools2
insdc-pipeline
specification
taxdump
viewer
Overview
Why blobs?
We want to sequence a tardigrade
– but it comes with a soup of other organisms
All tardigrade DNA will be at the same molarity*
* Organellar genomes
* Sex chromosomes
* Allelic divergence
* Repeats
Contaminant DNA will be at different molarities*
* Single-copy symbionts
* Or not
[Figure: relative frequency of contigs by coverage, from <1x to 4x]
All tardigrade DNA will have similar GC content*
* Organellar genomes
* Localised variation
* Repeats
Contaminant DNA GC content may differ*
* Or not
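The GC axis of a blob plot is the proportion of G and C bases in each contig. A minimal Python sketch of that calculation follows. The function name `gc_proportion` and the toy sequences are invented for illustration, and ignoring N bases is an assumption of this sketch rather than a statement about how BlobToolKit computes the value.

```python
def gc_proportion(sequence):
    """Return the fraction of G/C bases among A, C, G and T.

    Ns and other symbols are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Toy contigs (hypothetical): one AT-rich, one GC-rich
contigs = {"contig_1": "ATATGCATAT", "contig_2": "GGCCGCGNNA"}
for name, seq in contigs.items():
    print(name, gc_proportion(seq))
```

Contigs from organisms with different base composition would then separate along this axis, subject to the caveats listed above.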
[Figure: relative frequency of contigs by GC proportion, 0% to 100%]
[Figure: blob plot of coverage against GC proportion]
Add taxonomic annotation to each contig
[Figure: blob plot of coverage against GC proportion, with contigs labelled by taxon: Hypsibius dujardini, Chitinophaga, Pseudomonas, Stenotrophomonas, alphaproteobacterium]
Blobology Sujai Kumar and colleagues 2013
BlobTools Dominik Laetsch and colleagues 2017
BlobToolKit
University of Edinburgh &
Wellcome Sanger Institute
European Nucleotide Archive
European Bioinformatics Institute
BlobToolKit
BlobDir dataset
DatasetID
|— meta.json
|— identifiers.json
|— gc.json
|— length.json
|— ncount.json
|— {LIBRARYNAME}_cov.json
|— {LIBRARYNAME}_read_cov.json
|— {TAXRULE}_positions.json
|— {TAXRULE}_{RANK}.json
|— {TAXRULE}_{RANK}_cindex.json
|— {TAXRULE}_{RANK}_positions.json
|— {TAXRULE}_{RANK}_score.json
|— {LINEAGE}_busco.json
BlobTools2
$ ./blobtools create --fasta ACVV01.fasta \
... /path/to/BlobDir
Pipeline
Viewer
ENA browser
BlobToolKit Pipeline
Pipeline configuration file
assembly:
  accession: GCA_00029833$
  alias: DroAlb_1.0
  bioproject: PRJNA39511
  level: scaffold
  span: 253560284
  prefix: ACVV01
taxon:
  taxid: 7291
  name: Drosophila albomi$
similarity:
  defaults:
    evalue: 1e-25
    max_target_seqs: 10
    root: 1
    mask_ids: [7215]
  databases:
    - {name: nt_v5, local$
    - {name: reference_pr$
  taxrule: bestsumorder
reads:
  paired:
    - [SRR01,ILLUMINA,482$
    - [SRR02,ILLUMINA,552$
  single:
    - [SRR03,PACBIO]
  coverage:
    max: 100
    min: 0.5
busco:
  lineages:
    - diptera_odb9
    - arthropoda_odb9
    - eukaryota_odb9
  lineage_dir: /busco/lin$
Snakemake command to run the Pipeline
snakemake -p \
--use-conda \
--conda-prefix $CONDA_DIR \
--directory $WORKDIR/ \
--configfile $WORKDIR/$ASSEMBLY.yaml \
--stats $ASSEMBLY.snakemake.stats \
-j $THREADS \
--resources btk=1 \
-n
Cluster configuration
__default__:
  mem: 100
  queue: 'small'
bamtools_stats:
  threads: 1
  mem: 1000
run_blastn:
  threads: 16
  mem: 100000
  queue: 'normal'
Snakemake command for running the Pipeline on a cluster
snakemake -p --cluster-config cluster.yaml \
--drmaa " -o {log}.o \
-e {log}.e \
-R \"select[mem>{cluster.mem}] rusag$
-M {cluster.mem} \
-n {cluster.threads} \
-q {cluster.queue}" \
...
Using the Viewer
Finding datasets
BlobToolKit views
Indicators of assembly quality
Exploring non-target data
Digging deeper
Customising plots
Reproducibility
Find assemblies with good and bad conventional metrics to compare plots
Search for assemblies with cobionts, e.g.:
Take a look at some of these assemblies:
Running BlobToolKit
Check BlobToolKit has been downloaded
$ cd ~/blobtoolkit
$ ls
blobtools2
insdc-pipeline
specification
taxdump
viewer
Check the Conda package manager has been installed
$ conda activate btk_env
(btk_env) $ which python3
/home/username/miniconda3/envs/btk_env/bin/python3
Hosting a local Viewer instance
Download a BlobDir dataset from the public Viewer
$ mkdir -p ~/blobtoolkit/datasets
$ cd ~/blobtoolkit/datasets
$ curl http://blobtoolkit.genomehubs.org/download/AC/ACVV01/ACVV01.blobdir.tar.gz | tar xzf -
Viewer configuration environment variables
NODE_ENV=local
BTK_CLIENT_PORT=8080
BTK_API_PORT=8000
BTK_API_URL=http://localhost:8000/api/v1
BTK_BASENAME=/view
BTK_ORIGINS='http://localhost:8080 http://localhost null'
BTK_HOST=localhost
BTK_FILE_PATH=/home/username/blobtoolkit/datasets
BTK_USE_DEFAULT_LINKS=true
BTK_STATIC_THRESHOLD=100000
BTK_NOHIT_THRESHOLD=1000000
Create file with Viewer environment variables
$ cd ~/blobtoolkit/viewer
$ cp .env.dist .env
$ pwd
/home/username/blobtoolkit/viewer
$ nano .env
...
BTK_FILE_PATH=/home/username/blobtoolkit/datasets
...
Start the Viewer API (back end server)
$ cd ~/blobtoolkit/viewer
$ conda activate btk_env
(btk_env) $ npm start
...
Start the Viewer (front end server)
$ cd ~/blobtoolkit/viewer
$ conda activate btk_env
(btk_env) $ npm run client
...
Environment variables for publicly available site
NODE_ENV=production
BTK_API_URL=https://blobtoolkit.genomehubs.org/api/v1
BTK_HTTPS=true
BTK_ORIGINS='https://localhost:8080 https://blobtoolkit.genom$
BTK_HOST='blobtoolkit.genomehubs.org'
BTK_KEYFILE='/path/to/privkey.pem'
BTK_CERTFILE='/path/to/cert.pem'
BTK_GDPR_URL=https://genomehubs.org/gdpr
BTK_DATASET_TABLE=true
Running BlobTools2
Command to create a BlobDir dataset
$ ./blobtools create --fasta ACVV01.fasta \
--meta ACVV01.yaml \
--hits ACVV01.blastn.out \
--hits ACVV01.diamond.out \
--cov ACVV01.SRR01.bam \
--busco ACVV01.busco.diptera_odb9.tsv \
--taxid 7291 \
--taxdump /path/to/taxdump \
/path/to/BlobDir
Options when adding BLAST results to a BlobDir dataset
$ ./blobtools add --hits ACVV01.blastn.vs.custom.db.out \
--taxdump /path/to/taxdump \
--taxrule bestsum=myTaxruleName \
--bitscore 500 \
--evalue 1e-75 \
--hit-count 5 \
/path/to/BlobDir
Download BLAST results from the public Viewer
$ cd ~/blobtoolkit
$ conda activate btk_env
(btk_env) $ curl http://blobtoolkit.genomehubs.org/download/AC/ACVV01/ACVV01.blastn.nt.root.1.minus.7215.out.gz | gunzip > ACVV01.blastn.out
Import BLAST results with non-default settings
$ cd ~/blobtoolkit
$ ./blobtools2/blobtools add --hits ACVV01.blastn.out --taxrule bestsum=alt --taxdump ./taxdump --bitscore 500 ./datasets/ACVV01
$ ls ./datasets/ACVV01/alt_p*
datasets/ACVV01/alt_phylum.json
datasets/ACVV01/alt_phylum_cindex.json
datasets/ACVV01/alt_phylum_positions.json
datasets/ACVV01/alt_phylum_score.json
datasets/ACVV01/alt_positions.json
Extending BlobTools2
Generic datatypes
One parser per analysis type
BlobDir validation
Validate a BlobDir dataset
$ cd ~/blobtoolkit
$ ./specification/validate.py ./datasets/ACVV01/meta.json
VALID
Which method seems most useful to you?
Which analyses would you like to see incorporated?
Where do you think development efforts should be focused?
Programmatic Access
Viewer API
Access API endpoints on the command line
$ curl -s https://blobtoolkit.genomehubs.org/api/v1/dataset/id/ACVV01/assembly/span
253560284
Use jq to process API data
$ curl -s https://blobtoolkit.genomehubs.org/api/v1/field/ACVV01/length | jq '.values | add'
253560284
$ curl -s https://blobtoolkit.genomehubs.org/api/v1/field/ACVV01/length | jq '.values | map(select(. > 5000)) | add'
214698551
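The same sums can be computed in Python once the field JSON has been fetched. This is a minimal sketch that assumes the field object carries a `values` array, as the jq filters above imply. The function name `span` and the small stand-in dataset are hypothetical and do not reproduce the ACVV01 totals.

```python
import json


def span(field, min_length=None):
    """Sum a field's `values`, optionally keeping only values above `min_length`."""
    values = field["values"]
    if min_length is not None:
        # Mirrors jq's map(select(. > N)): strictly greater than the threshold
        values = [v for v in values if v > min_length]
    return sum(values)


# Hypothetical stand-in for the JSON returned by the `length` field endpoint
field = json.loads('{"values": [1836, 2833, 23979, 7262926]}')
print(span(field))        # equivalent of jq '.values | add'
print(span(field, 5000))  # equivalent of jq '.values | map(select(. > 5000)) | add'
```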
Viewer plots
X11
Generate a plot using blobtools view
$ ./blobtools2/blobtools view --ports 8000-8099 \
--param gc--Min=0.3 --param plotShape=hex \
./datasets/ACVV01
Initializing viewer |███████████████████████████████████████| 15/15 seconds
Loading http://localhost:8013/view/dataset/ACVV01/blob?staticThreshold=Infinity&nohitThreshold=Infinity&plotGraphics=svg&gc--Min=0.3&plotShape=hex
...
waiting for file 'ACVV01.blob.hex.png'
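Each `--param` option appears as a query-string parameter in the URL that is loaded. The sketch below rebuilds that URL with Python's standard library to show the mapping. The function name `viewer_url` is hypothetical, the three default parameters are copied from the URL printed above, and this is an illustration rather than the code blobtools itself runs.

```python
from urllib.parse import urlencode


def viewer_url(host, dataset, view, params):
    """Assemble a Viewer URL from a host, dataset ID, view name and parameters."""
    # Defaults taken from the URL in the example output above
    query = {
        "staticThreshold": "Infinity",
        "nohitThreshold": "Infinity",
        "plotGraphics": "svg",
    }
    query.update(params)
    return f"{host}/view/dataset/{dataset}/{view}?{urlencode(query)}"


url = viewer_url(
    "http://localhost:8013",
    "ACVV01",
    "blob",
    {"gc--Min": 0.3, "plotShape": "hex"},
)
print(url)
```

Recording the parameters alongside a plot is one way to make it reproducible, since the same URL can be opened in a browser.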
Command we’d like to use to generate a taxon-filtered plot
$ ./blobtools2/blobtools view \
--host https://blobtoolkit.genomehubs.org \
--view snail \
--param bestsumorder_phylum--Keys=Proteobacteria \
ACVV01
Use jq to view bestsumorder_phylum category keys
$ jq '.keys' ./datasets/ACVV01/bestsumorder_phylum.json
[
"no-hit",
"Proteobacteria",
"Arthropoda",
"undef",
"Ascomycota",
"Chordata",
"Mollusca",
...
Use jq to view bestsumorder_phylum category keys
$ curl -s https://blobtoolkit.genomehubs.org/api/v1/field/ACVV01/bestsumorder_phylum | jq '.keys'
[
"no-hit",
"Proteobacteria",
"Arthropoda",
"undef",
"Ascomycota",
...
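Categories are addressed by their position in the `keys` array, so "Proteobacteria" corresponds to index 1 in the list above. A small Python sketch of that lookup follows. The function name `keys_param` is hypothetical and the list is truncated to the entries shown.

```python
def keys_param(field_name, keys, category):
    """Build a --param value that refers to `category` by its index in `keys`."""
    return f"{field_name}--Keys={keys.index(category)}"


# Category keys as listed above (truncated)
keys = ["no-hit", "Proteobacteria", "Arthropoda", "undef", "Ascomycota"]
param = keys_param("bestsumorder_phylum", keys, "Proteobacteria")
print(param)
```

`list.index` raises a `ValueError` for a name that is not in the list, which is a reasonable failure mode for a mistyped taxon.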
Use the key value to generate a plot from a publicly hosted dataset
$ ./blobtools2/blobtools view --view snail \
--host https://blobtoolkit.genomehubs.org \
--param bestsumorder_phylum--Keys=1 ACVV01
Loading https://blobtoolkit.genomehubs.org/view/dataset/ACVV01/snail?staticThreshold=Infinity&nohitThreshold=Infinity&plotGraphics=svg&bestsumorder_phylum--Keys=1
Fetching ACVV01.snail.png
waiting for element snail_save_png
waiting for file 'ACVV01.snail.png'
Alternate command to host a local Viewer instance
$ ./blobtools2/blobtools view --ports 8000-8099 --remote \
./datasets/ACVV01
Initializing viewer |███████████████████████████████████████| 15/15 seconds
Open dataset at http://localhost:8001/view/dataset/BlobDir/blob?
For remote access use:
ssh -L 8001:127.0.0.1:8001 -L 8000:127.0.0.1:8000 username@remote_host
Filtering datasets
Filter a local BlobDir dataset
$ ./blobtools2/blobtools filter --param length--Min=3000000 --table STDOUT ./datasets/ACVV01
[
["index","identifiers","gc","length","SRR026696_cov","best$
[17958,"JH855722.1",0.3877,3161164,0.6789,"Arthropoda"],
[21431,"JH859027.1",0.3881,7262926,0.753,"Arthropoda"]
]
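The table output is a list of rows with the header first, so the effect of a `--Min` filter can be sketched in a few lines of Python. This assumes the minimum is inclusive, which the output above does not confirm. The function name `filter_min` is hypothetical, only the untruncated columns are used, and the GC value on the short scaffold is invented for illustration.

```python
def filter_min(rows, field, minimum):
    """Keep data rows whose `field` is at least `minimum`; rows[0] is the header."""
    header, *records = rows
    column = header.index(field)
    return [header] + [row for row in records if row[column] >= minimum]


rows = [
    ["index", "identifiers", "gc", "length"],
    [1, "JH838199.1", 0.5, 1836],  # gc value invented for this sketch
    [17958, "JH855722.1", 0.3877, 3161164],
    [21431, "JH859027.1", 0.3881, 7262926],
]
kept = filter_min(rows, "length", 3000000)
for row in kept:
    print(row)
```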
Use filters to compare alternative taxonomic inferences
$ ./blobtools2/blobtools filter \
--param bestsumorder_phylum--Keys=no-hit \
--table ACVV01.alt_taxrule.tsv \
--table-fields bestsumorder_phylum,alt_phylum \
./datasets/ACVV01
$ head ACVV01.alt_taxrule.tsv
index identifiers bestsumorder_phylum alt_phylum
1 JH838199.1 Proteobacteria no-hit
4 JH838202.1 Arthropoda no-hit
...
Generate multiple outputs from a single filter command
$ ./blobtools2/blobtools filter \
--param bestsumorder_phylum--Keys=Proteobacteria \
--param bestsumorder_phylum--Inv=true \
--table ACVV01.proteobacteria.tsv \
--table-fields length,bestsumorder_genus,SRR026696_cov \
--summary ACVV01.proteobacteria.json \
--summary-rank genus \
--out ./datasets/ACVV01_proteobacteria \
./datasets/ACVV01
Inspect the genus-level taxonomy of scaffolds in the filtered dataset
$ head ACVV01.proteobacteria.tsv
index identifiers length bestsumorder_genus SRR026696_cov
1 JH838199.1 1836 Acetobacter 0.0403
37 JH838235.1 2833 Gluconobacter 0.041600000000000005
46 JH838244.1 1575 Acetobacter 0.0407
69 JH838267.1 2008 Gluconobacter 0
91 JH838288.1 23979 Acetobacter 0.1859
99 JH838296.1 1326 Gluconobacter 0.0463
118 JH838315.1 1342 Acetobacter 0
124 JH838321.1 1445 Gluconobacter 0.0333
View summary data for scaffolds assigned to Acetobacter
$ jq '.summaryStats.hits.Acetobacter' ACVV01.proteobacteria.json
{ "span": 4272045,
"count": 695,
"gc": [0.564,0.5814,0.4864,0.6416,0.3952,0.6496],
"cov": [0.1937,0.2394,0.0637,0.5886,0.0216,2.2387],
"n50": 48752,
"l50": 14,
"n90": 1756,
"l90": 366 }
Filtering datasets
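The `n50`, `l50`, `n90` and `l90` entries in a summary follow the usual assembly-statistics convention. The Python sketch below uses the conventional definition (sort lengths in descending order and accumulate until the chosen fraction of the span is reached). Whether blobtools handles ties and rounding in exactly this way is an assumption, and the function name `nx_stats` is hypothetical.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the sequence at which the cumulative span, counted
    from the longest sequence, first reaches `fraction` of the total span.
    Lx is the number of sequences needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    return 0, 0  # empty input


toy_lengths = [10, 8, 6, 4, 2]  # hypothetical scaffold lengths
print(nx_stats(toy_lengths))       # N50 and L50
print(nx_stats(toy_lengths, 0.9))  # N90 and L90
```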
Generate a blob plot from the filtered dataset
$ ./blobtools2/blobtools view --ports 8000-8099 \
--param plotShape=circle \
--param catField=bestsumorder_genus \
--param bestsumorder_genus--Active=true \
./datasets/ACVV01_proteobacteria
Initializing viewer |███████████████████████████████████████| 15/15 seconds
Loading
...
waiting for file 'ACVV01_proteobacteria.blob.circle.png'
Filtering assembly files
Filter a FASTA file based on taxonomic inference
$ ./blobtools2/blobtools filter --fasta ACVV01.fasta \
--param bestsumorder_phylum--Keys=no-hit \
--suffix with_taxonomy ./datasets/ACVV01
$ ./blobtools2/blobtools filter --fasta ACVV01.fasta \
--param bestsumorder_phylum--Keys=no-hit \
--param bestsumorder_phylum--Inv=true \
--suffix without_taxonomy ./datasets/ACVV01
$ ls
ACVV01.fasta ACVV01.without_taxonomy.fasta
ACVV01.with_taxonomy.fasta
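Conceptually, filtering a FASTA file means keeping only the records whose identifiers pass the dataset filter. A minimal Python sketch of that step, assuming the set of identifiers to keep has already been worked out. The function name `filter_fasta` is hypothetical and the sequences are invented; this is not how blobtools is implemented.

```python
def filter_fasta(lines, keep_ids):
    """Yield the lines of FASTA records whose identifier is in `keep_ids`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after ">"
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Identifiers taken from the tables above; sequences are made up
fasta = [">JH838199.1 scaffold", "ACGT", ">JH838202.1", "GGCC", "TTAA"]
with_taxonomy = list(filter_fasta(fasta, {"JH838199.1"}))
print("\n".join(with_taxonomy))
```

Working on an iterable of lines means the same function can be applied to an open file handle without loading the whole assembly into memory.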
Filter FASTQ files based on taxonomic inference
$ ./blobtools2/blobtools filter \
--fastq SRR01_1.fastq.gz \
--fastq SRR01_2.fastq.gz \
--cov ACVV01.SRR01.bam \
--param bestsumorder_phylum--Keys=no-hit \
--suffix with_taxonomy ./datasets/ACVV01
$ ls
ACVV01.fastq.gz ACVV01.fastq.with_taxonomy.gz
Exploring Further
Download a BlobDir dataset from the public Viewer
$ cd ~/blobtoolkit/datasets
$ curl http://blobtoolkit.genomehubs.org/download/AC/ACVV01/ACVV01.blobdir.tar.gz | tar xzf -
Try alternate taxrule parameters
Filter and compare tables and summaries
Reproduce interactive plots