Do you really need that?
Or, how to squeeze your sequencing data (even) smaller (while not losing germline or somatic calls)
Yossi Farjoun, Hussein Elgridly
Who are we?
What do we want?
Overview of this talk
We’ve been investigating two questions lately:
Q1: Do we need to store everything?
Q2: How should we store archival data?
⇒ It makes everyone’s job easier if we have less data to store/compress!
Spoiler: lossy is okay… provided you are very careful about it… :)
We have been very happy with CRAM. Most standard genomics analysis software has already been adapted to support it.
Any newly proposed format will likely not be adopted by the genomics community unless it works seamlessly with existing analysis tools.
How we want to store our data
[Diagram: a hot-to-cold storage spectrum. Hot end (Standard storage, accessed constantly): CRAM/BAM lives here today; frequently accessed, must be fast, we’ll pay $$$. Cold end (Nearline/Infrequent Access, accessed monthly; Coldline/Glacier, accessed yearly): format “???” for long-term storage; infrequently accessed, can be slow, must be cheap.]
Q1: Do we need to store everything?
What can we drop from CRAM/BAM without materially affecting downstream analysis?
[Storage-spectrum diagram repeated.]
Drop the OQ tag
We don’t use original qualities for germline or somatic variant calling, and they’re enormous.
We’ve checked that reanalysing from QS (not OQ) doesn’t affect germline results, but we still need to check somatic...
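A minimal pysam sketch of the idea; samtools view -x OQ does much the same job, and neither is necessarily the exact tooling behind these numbers. File names are placeholders.

import pysam

# Strip the OQ (original quality) tag from every read; paths are illustrative.
with pysam.AlignmentFile("sample.bam", "rb") as src, \
     pysam.AlignmentFile("sample.noOQ.bam", "wb", template=src) as dst:
    for read in src.fetch(until_eof=True):
        if read.has_tag("OQ"):
            read.set_tag("OQ", None)  # setting a tag to None deletes it
        dst.write(read)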
Drop read names
CRAM already supports dropping read names via the lossy_readnames option.
We’re planning to adopt that strategy after examining the effect on library size estimation, but that’s a story for a different day.
[Chart: read name bytes/read across 101 CRAMs, calculated using James Bonfield’s cram_size tool.]
Quality scores take up even more space
40% of the Broad’s CRAM footprint consists of quality scores!
Our current approach uses 4 bins.
Can we do better?
[Chart: quality score bytes/read across 101 CRAMs, calculated using James Bonfield’s cram_size tool.]
Various strategies for reducing quality score footprint
Prior art: Crumble
Algorithm: clamp quality values between a low and a high bound; run a consensus caller, and use fewer bins where consensus is high.
Experiment: use only 2 bins
≥ Q20 → Q30
< Q20 → Q2
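The 2-bin scheme fits in a couple of lines. A minimal sketch, with the threshold and replacement values taken from the rule above:

def two_bin(quals, threshold=20, high=30, low=2):
    """Quantize Phred scores into two bins: >= Q20 -> Q30, < Q20 -> Q2."""
    return [high if q >= threshold else low for q in quals]

print(two_bin([37, 18, 32, 2, 25]))  # -> [30, 2, 30, 2, 30]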
Experiment: remove oscillations
2 bins, then demote any run of fewer than N consecutive high quals to low
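A sketch of that deoscillation step, assuming the rule means flattening any run of high-binned values shorter than N; N=3 here is an arbitrary illustration:

def deoscillate(binned, min_run=3, high=30, low=2):
    """Demote runs of high-binned quals shorter than min_run to low."""
    out = list(binned)
    i = 0
    while i < len(out):
        if out[i] != high:
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == high:
            j += 1                       # j is one past the end of this run
        if j - i < min_run:
            out[i:j] = [low] * (j - i)   # run too short: flatten it to low
        i = j
    return out

print(deoscillate([30, 2, 30, 30, 2, 30, 30, 30, 2]))
# -> [2, 2, 2, 2, 2, 30, 30, 30, 2]  (runs shorter than 3 demoted)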
[Figure: example quality tracks for one read: original quals, 2 bins, bin + deoscillate.]
Testing on real data
Unmodified CRAM files vs. Crumble vs. 2-bins vs. 2-bins + deoscillation.
Germline: run GATK HaplotypeCaller, then GenotypeConcordance on NA12878
Somatic: run Mutect2 on the TCGA DREAM Challenge data
Takeaways
Crumble gets you smaller files, but at significant loss of somatic SNP sensitivity.
Using 2 bins gets you files barely larger than Crumble but with good variant calling performance.
Deoscillation: not a good idea!
Imagine if we’d only tested germline here...
Somatic variant calling (vs. file size) with modified quality scores
[Chart: somatic calling performance plotted against file size; the four conditions weigh in at 809GB, 529GB, 493GB, and 522GB.]
SCRAM: a hastily written super cold storage format
Q2: How should we store archival data?
[Storage-spectrum diagram repeated: SCRAM is the candidate for the “???” cold tier, with CRAM/BAM staying at the hot end.]
Strategy: trade random access for smaller files
Here’s the gamble: for infrequently accessed files, the savings in storage cost should be greater than the cost of converting it to and from CRAM/BAM.
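The gamble is just break-even arithmetic. In this sketch every number is an invented placeholder, not a real price quote or a figure from this work:

size_gb = 500.0            # hypothetical archived cohort
hot_per_gb_month = 0.020   # hypothetical Standard-tier price, $/GB/month
cold_per_gb_month = 0.004  # hypothetical Coldline/Glacier price, $/GB/month
convert_usd = 25.0         # hypothetical one-off compute cost for the
                           # SCRAM <-> CRAM/BAM round trip

monthly_saving = size_gb * (hot_per_gb_month - cold_per_gb_month)
print(f"break-even after {convert_usd / monthly_saving:.1f} months untouched")
# With these made-up numbers: saves $8/month, pays off in ~3 months.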
Results
Total file size for 8 samples in the TCGA DREAM Challenge data set (percentages are relative to the BAM in the same row):

(size in GB) | BAM | CRAM | 2-bin CRAM | 2-bin CRAM, no readnames | 2-bin SCRAM, no readnames
with OQs | 1349 | 809 (60%) | 529 (39%) | 450 (33%) | 311 (23%)
no OQs | 835 | 466 (56%) | 185 (22%) | 106 (13%) | 31 (3.7%)
Specific approach
The best kind of compression is not storing things in the first place!
Store data in columns for better compression
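A toy demonstration of why the columnar layout pays off; this is not the SCRAM implementation, the data is synthetic, and zlib stands in for the real codecs. Interleaving high-entropy bases with low-entropy 2-binned quals forces one shared statistical model, while per-column streams let each field be coded against its own statistics:

import random
import zlib

random.seed(0)
n = 2000
names = [f"MACHINE:1:1101:{i}:{i % 9999}" for i in range(n)]   # structured text
seqs = ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(n)]
quals = ["".join(random.choice("I#") for _ in range(100)) for _ in range(n)]

# Row-major: one interleaved stream of name+seq+qual records.
row = len(zlib.compress("".join(a + b + c
                                for a, b, c in zip(names, seqs, quals)).encode()))
# Column-major: each field compressed as its own stream.
col = sum(len(zlib.compress("".join(c).encode())) for c in (names, seqs, quals))

print(f"interleaved: {row} bytes, columnar: {col} bytes")  # columnar is smaller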
Further thoughts
We could keep only what’s required to reconstruct an unmapped BAM:
Less data to store (no flags, MAPQ, or tags), but need to realign to get BAM back
This would get our 8 samples down to a total of 31GB (from 37GB)
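For concreteness, here is what such a minimal record might look like through pysam. The field set is one reading of the slide (keep name, bases, quals; drop alignment-derived fields), and the file name is a placeholder:

import pysam

header = pysam.AlignmentHeader.from_dict({"HD": {"VN": "1.6", "SO": "unsorted"}})
with pysam.AlignmentFile("minimal.unmapped.bam", "wb", header=header) as out:
    rec = pysam.AlignedSegment(header)
    rec.query_name = "read1"    # read names could themselves be regenerated
    rec.query_sequence = "ACGTACGTAA"              # set sequence before quals
    rec.query_qualities = pysam.qualitystring_to_array("IIIIIIIIII")
    rec.flag = 4                # unmapped; pairing bits would be kept if paired
    out.write(rec)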
Summary
(percentages relative to the full BAM)
BAM | 1349 GB | 100%
BAM, no OQ | 835 GB | ~62%
CRAM | 810 GB | ~60%
CRAM, no OQ | 466 GB | ~35%
CRAM, lossy QS, no OQ | 185 GB | ~14%
CRAM, lossy QS, lossy RN, no OQ | 106 GB | ~7.8%
SCRAM, lossy QS, no RN, no OQ | 37 GB | ~2.8%
SCRAM “unmapped”, requires realignment | 31 GB | ~2.3%