1 of 14

Sparse Project VCF

Mike Lin @DNAmlin GA4GH & MPEG-G Genome Compression Workshop

Basel, 3 October 2018

compressed genotypes for millions of genomes

2 of 14

Sequencing to mine germline variant associations

Project VCF (pVCF)

3 of 14

Project VCF (pVCF): prevailing format for small variants in cohort sequencing

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188

22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77

4 of 14

pVCF file size vs. cohort participants (N)

chr2 exomes

5 of 14

pVCF scaling inefficiencies

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188

22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77

  • Dense matrix grows ~N^1.5 as more rare variants must be genotyped across cohort
  • All genotypes accompanied by high-entropy QC measures, usually to excess
  • Large N → closely-spaced sites → similar QC values down columns (often literally repeated, from gVCF)

6 of 14

Sparse Project VCF (spVCF) aims

Modest evolution to address worst inefficiencies

Add sparse encoding to existing VCF tab format and conceptual model (+ baggage)

Elide nonessential QC details for reference-identical cells (nearly all)

7 of 14

spVCF encoding

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188

22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77

  1. Replace 0/0 cell with " if identical to cell above

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188

22 1012 . CT C ... GT:DP:AD:PL " 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL " " 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77

  2. Run-length encode runs of " along each row (e.g. "2)

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188

22 1012 . CT C ... GT:DP:AD:PL " 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77

  3. Optional: “squeeze” cells reporting zero non-reference reads (AD=*,0)
    1. Truncate to GT:DP

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35 0/0:29 0/0:22

22 1012 . CT C ... GT:DP:AD:PL " 0/0:31 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77

    2. Round DP down to power of two (0, 1, 2, 4, 8, 16, ...)

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:32 0/0:16 0/0:16

22 1012 . CT C ... GT:DP:AD:PL "2 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:16 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
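The encoding steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the reference implementation: the function names (`round_down_pow2`, `squeeze_cell`, `sparse_encode`) are hypothetical, the cells are assumed to follow the GT:DP:AD:PL layout of the example, and the real tool handles many cases (other FORMAT layouts, checkpoints, header lines) that are ignored here.

```python
def round_down_pow2(n: int) -> int:
    """Largest power of two <= n; 0 stays 0."""
    return 0 if n <= 0 else 1 << (n.bit_length() - 1)


def squeeze_cell(cell: str) -> str:
    """Reduce a GT:DP:AD:PL cell to GT:DP (DP rounded down to a power of
    two) when AD reports zero non-reference reads. Other cells pass through."""
    fields = cell.split(":")
    if len(fields) < 3 or not fields[1].isdigit():
        return cell
    gt, dp, ad = fields[0], fields[1], fields[2]
    alt_counts = ad.split(",")[1:]          # AD = ref count, then one count per ALT
    if alt_counts and all(c == "0" for c in alt_counts):
        return f"{gt}:{round_down_pow2(int(dp))}"
    return cell


def sparse_encode(rows):
    """Replace each 0/0 cell identical to the cell above with a ditto mark,
    then run-length encode consecutive ditto marks along the row."""
    encoded, prev = [], None
    for cells in rows:
        tokens, run = [], 0
        for i, cell in enumerate(cells):
            if prev is not None and cell.startswith("0/0") and cell == prev[i]:
                run += 1                    # extend the current run of ditto marks
                continue
            if run:
                tokens.append('"' if run == 1 else f'"{run}')
                run = 0
            tokens.append(cell)
        if run:                             # flush a run that reaches the row end
            tokens.append('"' if run == 1 else f'"{run}')
        encoded.append(tokens)
        prev = cells                        # compare against the dense row above
    return encoded


# Sample columns (Alice, Bob, Carol) from the example pVCF
ROWS = [
    ["0/0:35:35,0:0,117,402", "0/0:29:29,0:0,109,387", "0/0:22:22,0:0,63,188"],
    ["0/0:35:35,0:0,117,402", "0/0:31:31,0:0,117,396", "0/1:28:17,11:74,0,188"],
    ["0/0:35:35,0:0,117,402", "0/0:31:31,0:0,117,396", "1/1:27:0,27:312,87,0"],
    ["0/0:28:28,0,0:0,48,62,52,71,94", "./.:0:0,0:.,.,.,.,.,.",
     "1/2:42:4,20,18:93,83,76,87,0,77"],
]

squeezed = [[squeeze_cell(c) for c in row] for row in ROWS]
encoded = sparse_encode(squeezed)
for tokens in encoded:
    print(" ".join(tokens))
# 0/0:32 0/0:16 0/0:16
# "2 0/1:28:17,11:74,0,188
# "2 1/1:27:0,27:312,87,0
# 0/0:16 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
```

Squeezing before the sparse pass is what makes the two steps reinforce each other: rounding DP makes neighbouring reference cells identical more often, so more of them collapse into ditto runs.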

8 of 14

spVCF encoding: before & after

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188

22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:32 0/0:16 0/0:16

22 1012 . CT C ... GT:DP:AD:PL "2 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:16 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
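Going from the "after" form back to a dense matrix only requires copying cells down each column. The sketch below is an assumption-laden illustration (the name `sparse_decode` is hypothetical, and checkpoints and header handling are omitted). It shows that the sparse step is reversible, whereas squeezed cells stay squeezed: the dropped AD/PL values and the exact DP cannot be recovered.

```python
def sparse_decode(encoded_rows):
    """Expand ditto marks (") and runs ("N) by copying the cells in the
    same columns from the decoded row above."""
    dense, prev = [], None
    for tokens in encoded_rows:
        cells = []
        for tok in tokens:
            if tok.startswith('"'):
                run = int(tok[1:]) if len(tok) > 1 else 1
                if prev is None:
                    raise ValueError("ditto mark in the first row")
                start = len(cells)          # column index where the run begins
                cells.extend(prev[start:start + run])
            else:
                cells.append(tok)
        dense.append(cells)
        prev = cells
    return dense


# Sample columns of the squeezed spVCF example (Alice, Bob, Carol)
ENCODED = [
    ["0/0:32", "0/0:16", "0/0:16"],
    ['"2', "0/1:28:17,11:74,0,188"],
    ['"2', "1/1:27:0,27:312,87,0"],
    ["0/0:16", "./.:0", "1/2:42:4,20,18:93,83,76,87,0,77"],
]

decoded = sparse_decode(ENCODED)
for cells in decoded:
    print(" ".join(cells))
# 0/0:32 0/0:16 0/0:16
# 0/0:32 0/0:16 0/1:28:17,11:74,0,188
# 0/0:32 0/0:16 1/1:27:0,27:312,87,0
# 0/0:16 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
```

Because every row depends on the row above it, a reader that starts mid-file needs a nearby row with all cells written out in full.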

9 of 14

Squeezed spVCF of 50K exomes

10 of 14

spVCF file size vs. N

chr2 exomes

11 of 14

Compression ratios: synergy between sparse & squeeze

12 of 14

spVCF reference implementation

#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol

22 1000 . A G ... GT:DP:AD:PL 0/0:35 0/0:29 0/0:22

22 1012 . CT C ... GT:DP:AD:PL " 0/0:31 0/1:28:17,11:74,0,188

22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0

22 1074 . T C,G ... GT:DP:AD:PL 0/0:28 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77

github.com/mlin/spVCF (Apache license)

bgzip -dc my.vcf.gz | spvcf encode --squeeze | bgzip -c > my.spvcf.gz

Checkpoints facilitate range access, accounting for columnar data dependencies (details in SPEC.md)

tabix my.spvcf.gz

spvcf tabix my.spvcf.gz chr2:8675309-8739065 > my.slice.spvcf

spvcf decode my.slice.spvcf > my.slice.vcf

bgzip -dc my.vcf.gz | spvcf squeeze | bgzip -c > my.squeezed.vcf.gz

Squeeze only: several-fold size reduction, remains plain VCF (no-brainer!?)

13 of 14

Acknowledgements

Jeff Reid John Penn Xiaodong Bai

DiscovEHR participants

Will Salerno Olga Krasheninina

NHGRI

Albert V. Smith Yossi Farjoun Louis Bergelson Thomas Keane Rishi Nag

14 of 14

Sparse Project VCF (spVCF)

Simple evolution of VCF with sparse encoding of the (site × sample) genotype matrix

Judicious convention to elide nonessential QC details

Together, these bend the growth of compressed file sizes from ~N^1.5 to ~N^1.1

15× reduction observed for N=50K

50× projected for N=1M
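As a rough consistency check on these figures: if dense size grows as N^1.5 and sparse size as N^1.1, the reduction factor should grow as about N^0.4. The sketch below assumes the projection is a simple power-law extrapolation from the observed N=50K point, which the slides do not state explicitly. The function name and parameters are hypothetical.

```python
def projected_reduction(observed: float, n_observed: int, n_target: int,
                        dense_exp: float = 1.5, sparse_exp: float = 1.1) -> float:
    """Extrapolate a size-reduction factor, assuming dense size ~ N^dense_exp
    and sparse size ~ N^sparse_exp (so the ratio scales as N^(difference))."""
    return observed * (n_target / n_observed) ** (dense_exp - sparse_exp)


# 15x observed at N=50K, extrapolated to N=1M
estimate = projected_reduction(15, 50_000, 1_000_000)
print(round(estimate, 1))  # about 49.7, in line with the ~50x projection
```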

github.com/mlin/spVCF (Apache license)

bgzip -dc my.vcf.gz | spvcf encode | bgzip -c > my.spvcf.gz