1 of 19

Do you really need that?

Or, how to squeeze your sequencing data (even) smaller (while not losing germline or somatic calls)

Yossi Farjoun, Hussein Elgridly

(this talk is on YouTube)

2 of 19

Who are we?

  • Data scientists from a large US genome center (the Broad Institute)
  • We work very closely with data scientists from many other genome centers
  • We are not experts in compression, nor are we trying to be
  • We are users of the data, we are the ones paying for the storage, and we understand downstream effects of changes to data

3 of 19

What do we want?

  • Our storage bill should be as cheap as possible
    • It’s currently $9M/year, doubling storage every 18mo (not a typo)
  • Sometimes, we need fast random access to reads for active analysis
  • Much of the time, storage needs are archival and we just want the footprint to be as small as possible
  • Open formats! Free for us, our collaborators, and the world to use

4 of 19

Overview of this talk

We’ve been investigating two questions lately:

  1. Do we need ALL of our data? How lossy can we be without materially affecting downstream analysis?

⇒ It makes everyone’s job easier if we have less data to store/compress!

  2. For archival data, can we make files smaller by eliminating redundant information?

Spoiler: lossy is okay… provided you are very careful about it… :)

5 of 19

We have been very happy with CRAM. Most standard genomics analysis software has already been adapted to support it.

Any new formats being proposed will likely not be adopted by the genomics community unless they work seamlessly with existing analysis tools.

How we want to store our data:

  • CRAM/BAM (Standard storage class): frequently accessed, constantly; must be fast; = we’ll pay $$$
  • CRAM (Nearline / Infrequent Access storage class): accessed monthly
  • ??? (Coldline / Glacier storage class): long-term storage, infrequently accessed, roughly yearly; can be slow; must be cheap

6 of 19

What can we drop without materially affecting downstream analysis?

[Storage-tier diagram repeated from the previous slide]

Q1: Do we need to store everything?

7 of 19

Drop the OQ tag

We don’t use them for germline or somatic variant calling, and they’re enormous.

  • 38% of BAM footprint is OQ
  • 42% for CRAM

We have checked that reanalysing from QS (not OQ) doesn’t affect germline results, but we still need to check somatic calling too...
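
As a minimal sketch of what dropping OQs looks like in practice, the snippet below strips the tag with pysam; the file names are placeholders, and recent samtools versions can do the same at conversion time with a tag-stripping option (samtools view -x OQ).

    # Minimal sketch of stripping OQ tags with pysam (assumed available);
    # file names here are placeholders, not real paths.
    import pysam

    with pysam.AlignmentFile("input.bam", "rb") as src, \
         pysam.AlignmentFile("no_oq.bam", "wb", template=src) as dst:
        for read in src:
            if read.has_tag("OQ"):
                read.set_tag("OQ", None)   # setting a tag's value to None deletes it
            dst.write(read)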

8 of 19

Drop read names

CRAM already supports dropping read names via the lossy_readnames option.

We’re planning on adopting that strategy after examining effects on library size estimation -- but this is a story for a different day.

[Chart: bytes/read over 101 CRAMs, calculated using James Bonfield’s cram_size tool, with the read-name component highlighted]

9 of 19

Quality scores take up even more space

40% of the Broad CRAM footprint consists of quality scores!

Our current approach uses 4 bins.

Can we do better?

[Chart: bytes/read over 101 CRAMs, calculated using James Bonfield’s cram_size tool, with the quality-score component highlighted]

10 of 19

Various strategies for reducing quality score footprint

Prior art: Crumble

Algorithm: clamp quality scores to within a low and a high value; run a consensus caller and use fewer bins where consensus confidence is high

Experiment: use only 2 bins

  • ≥ Q20 → Q30
  • < Q20 → Q2

Experiment: remove oscillations

2 bins, then demote runs of fewer than N consecutive high quals to low

[Illustration: an example quality string shown as the original quals, after 2-bin quantization, and after binning + deoscillation]
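
To make the two experiments concrete, here is a minimal Python sketch; the Q20 cutoff and the Q30/Q2 output bins come from the slide, while the run-length cutoff (min_run, standing in for "N") and the function names are illustrative assumptions.

    def two_bin(quals, cutoff=20, high=30, low=2):
        """Quantize each quality score into one of two bins."""
        return [high if q >= cutoff else low for q in quals]

    def deoscillate(binned, min_run=3, high=30, low=2):
        """Demote runs of fewer than min_run consecutive high-bin scores to the
        low bin, removing short high/low oscillations (the strategy that turned
        out not to be a good idea)."""
        out = list(binned)
        i = 0
        while i < len(out):
            if out[i] != high:
                i += 1
                continue
            j = i
            while j < len(out) and out[j] == high:
                j += 1
            if j - i < min_run:            # short run of high quals
                out[i:j] = [low] * (j - i)
            i = j
        return out

    quals = [11, 37, 8, 32, 33, 35, 7, 36, 30, 31, 31, 12]
    print(two_bin(quals))               # [2, 30, 2, 30, 30, 30, 2, 30, 30, 30, 30, 2]
    print(deoscillate(two_bin(quals)))  # the isolated Q30 at position 1 becomes Q2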

11 of 19

Testing on real data

Unmodified CRAM files vs. Crumble vs. 2-bins vs. 2-bins + deoscillation.

  • How much can we shrink file sizes and still call variants accurately?

Germline: run GATK HaplotypeCaller, then GenotypeConcordance on NA12878

  • Results: SNP and INDEL accuracy within 1% of original data for all strategies

Somatic: run Mutect 2 on the TCGA DREAM Challenge data

  • It’s critically important to check how your changes affect somatic analysis, because somatic variant callers are much more sensitive to changes in data
  • The results here are much more interesting...

12 of 19

Takeaways

Crumble gets you smaller files, but at significant loss of somatic SNP sensitivity.

Using 2 bins gets you files barely larger than Crumble but with good variant calling performance.

Deoscillation: not a good idea!

Imagine if we’d only tested germline here...

[Chart: somatic variant calling performance vs. file size with modified quality scores; file sizes shown: 809 GB (unmodified CRAMs), 529 GB, 493 GB, 522 GB]

13 of 19

SCRAM: a hastily written super cold storage format

[Storage-tier diagram repeated from earlier slides]

Q2: How should we store archival data?

14 of 19

Strategy: trade random access for smaller files

  • Organize the reads for compression efficiency, not fast lookup
  • Drop any data that can be reconstituted

Here’s the gamble: for infrequently accessed files, the savings in storage cost should outweigh the cost of converting to and from CRAM/BAM.
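
A back-of-the-envelope way to check that gamble is sketched below; every price, size, and access rate in it is an illustrative assumption, not an actual Broad or cloud-provider figure.

    # Break-even check for the gamble above. All numbers are assumptions.
    size_tb           = 100      # data parked in cold storage (TB)
    price_warm        = 20.0     # $/TB/month for the fast storage class (assumed)
    price_cold        = 4.0      # $/TB/month for the cold storage class (assumed)
    roundtrip_cost    = 500.0    # $ per conversion to the cold format and back (assumed)
    accesses_per_year = 1        # how often the archived data is actually touched

    yearly_savings = 12 * size_tb * (price_warm - price_cold)
    yearly_penalty = accesses_per_year * roundtrip_cost

    print(f"save ${yearly_savings:,.0f}/yr, pay ${yearly_penalty:,.0f}/yr in conversions")
    print("worth archiving" if yearly_savings > yearly_penalty else "keep it warm")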

15 of 19

Results

Total file size for 8 samples in the TCGA DREAM Challenge data set (sizes in GB; percentages are relative to the BAM in the same row):

              BAM     CRAM        2-bin CRAM   2-bin CRAM,     2-bin SCRAM,
                                               no readnames    no readnames
  with OQs    1349    809 (60%)   529 (39%)    450 (33%)       311 (23%)
  no OQs       835    466 (56%)   185 (22%)    106 (13%)        31 (3.7%)

16 of 19

Specific approach

The best kind of compression is not storing things in the first place!

  • Drop OQs
  • Drop read names
  • Drop duplicate and filtered reads
  • Encode bases relative to the reference à la CRAM
  • Don’t store things that can be inferred from the mate or other reads
    • mate chr, mate pos, TLEN, some bitflags
    • bases, quals for secondary alignments
  • Require 2 bins for quals → store as 1 bit
  • Some tags can be reconstituted later: NM, MD, MC, MQ, UQ, SA

Store data in columns for better compression
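
As a rough illustration of the last two ideas (1-bit quality bins and column-oriented storage), here is a sketch using numpy; the field names and the dict-of-arrays layout are assumptions for illustration, not the actual SCRAM prototype format.

    # Sketch of "2 bins -> 1 bit" plus a column-oriented layout, using numpy.
    import numpy as np

    def pack_two_bin_quals(quals, cutoff=20):
        """Map each quality to one bit (1 = high bin, 0 = low bin) and pack
        eight bases per byte: 1 bit/base instead of 1 byte/base."""
        bits = (np.asarray(quals, dtype=np.uint8) >= cutoff).astype(np.uint8)
        return np.packbits(bits)

    def unpack_two_bin_quals(packed, n_bases, high=30, low=2):
        """Recover the binned qualities (the original values are gone for good)."""
        bits = np.unpackbits(packed)[:n_bases]
        return np.where(bits == 1, high, low)

    # Column-oriented layout: one array per field across all reads, so similar
    # values sit next to each other and compress well.
    columns = {
        "pos":   np.array([10001, 10001, 10150], dtype=np.int64),
        "quals": pack_two_bin_quals([37, 11, 32, 8, 30, 30, 35, 12, 9, 31, 40, 2]),
    }
    print(columns["quals"], unpack_two_bin_quals(columns["quals"], 12))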

17 of 19

Further thoughts

We could keep only what’s required to reconstruct an unmapped BAM:

  • chr + pos + quals + ref-relative bases for one alignment per read

Less data to store (no flags, MAPQ, or tags), but need to realign to get BAM back

  • Most of the time we’re bringing data out of cold storage because we want to realign it, so maybe this is moot?

This would get our 8 samples down to a total of 31GB (from 37GB)
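
For a sense of what "ref-relative bases" buys you, here is a toy sketch (substitutions only, with a made-up reference and made-up names) of how a read's sequence can be rebuilt from the reference plus its stored differences:

    # Toy reference-relative encoding: store only where a read differs from the
    # reference and rebuild the full sequence on demand. Substitutions only,
    # for brevity; real CRAM-style encoding also handles indels and clipping.
    REF = "ACGTACGTACGTACGTACGT"   # toy reference chromosome

    def encode(read_seq, pos):
        """Return (pos, length, diffs), where diffs lists (offset, base) mismatches."""
        diffs = [(i, b) for i, b in enumerate(read_seq) if REF[pos + i] != b]
        return pos, len(read_seq), diffs

    def decode(pos, length, diffs):
        """Rebuild the read sequence from the reference plus the stored mismatches."""
        seq = list(REF[pos:pos + length])
        for offset, base in diffs:
            seq[offset] = base
        return "".join(seq)

    record = encode("ACGTTCGTAC", pos=4)   # one substitution at offset 4
    print(record)                          # (4, 10, [(4, 'T')])
    print(decode(*record))                 # 'ACGTTCGTAC'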

18 of 19

Summary

  1. Storing data is expensive. Put it in the right storage class! This is often a bigger win than small gains from improved compression.
  2. Throw away your OQs and read names.
  3. Extremely aggressive binning of quality scores gets you much smaller files with minimal effect on variant calling performance.
  4. For archival storage, you can get down to <3% of your original BAM size by eliminating redundant data.

19 of 19

[Final chart: total size of the 8 samples at each step, as a fraction of the original BAM]

  • BAM: 1349 GB
  • BAM, no OQ: 835 GB (~62%)
  • CRAM: 810 GB (~60%)
  • CRAM, no OQ: 466 GB (~35%)
  • CRAM, lossy QS, no OQ: 185 GB (~14%)
  • CRAM, lossy QS, lossy RN, no OQ: 106 GB (~7.8%)
  • SCRAM, lossy QS, no RN, no OQ: 37 GB (~2.8%)
  • SCRAM, “unmapped”, requires realignment: 31 GB (~2.3%)