Sparse Project VCF
Mike Lin @DNAmlin GA4GH & MPEG-G Genome Compression Workshop
Basel, 3 October 2018
compressed genotypes for millions of genomes
Sequencing to mine germline variant associations
Project VCF (pVCF)
Project VCF (pVCF): prevailing format for small variants in cohort sequencing
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188
22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77
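As a reading aid for the table above, here is a minimal Python sketch of how one cell maps onto its FORMAT string, and why cells at the multi-allelic site (POS 1074) are longer. The function names are hypothetical, and the PL count assumes diploid genotypes.

```python
def parse_cell(format_spec, cell):
    """Split one pVCF genotype cell into a dict keyed by FORMAT field name."""
    return dict(zip(format_spec.split(":"), cell.split(":")))


def expected_pl_count(num_alt_alleles):
    """Number of PL values for a diploid genotype: one per unordered allele pair."""
    alleles = num_alt_alleles + 1  # REF plus each ALT
    return alleles * (alleles + 1) // 2


# Carol's cell at POS 1012 (one ALT allele, so three PL values)
print(parse_cell("GT:DP:AD:PL", "0/1:28:17,11:74,0,188"))
# POS 1074 has two ALT alleles (C,G), so each cell carries six PL values
print(expected_pl_count(2))
```

Every added ALT allele lengthens the AD and PL vectors in every cell of the row, including the many cells that are reference-identical.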
Figure: pVCF file size vs. cohort participants (N), chr2 exomes
pVCF scaling inefficiencies
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188
22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77
Sparse Project VCF (spVCF) aims
Modest evolution to address worst inefficiencies
Add sparse encoding to existing VCF tab format and conceptual model (+ baggage)
Elide nonessential QC details for reference-identical cells (nearly all)
spVCF encoding
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188
22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188
22 1012 . CT C ... GT:DP:AD:PL " 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL " " 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188
22 1012 . CT C ... GT:DP:AD:PL " 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77
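The two steps shown so far (replace a cell identical to the one above it with a double quote, then run-length encode consecutive quotes) can be sketched in Python as follows. This is an illustrative sketch inferred from the example tables, not the reference implementation; `sparse_encode` is a hypothetical name, and SPEC.md in the repository remains the authority on the actual format.

```python
def sparse_encode(rows):
    """Sparse-encode dense genotype rows (each a list of cell strings).

    A cell equal to the cell directly above it is dropped and counted;
    each run of dropped cells is written as '"' (length 1) or '"N' (length N).
    """
    encoded, prev = [], None
    for row in rows:
        out, run = [], 0
        for i, cell in enumerate(row):
            if prev is not None and cell == prev[i]:
                run += 1          # extend the current run of repeats
                continue
            if run:               # flush a finished run before a literal cell
                out.append('"' if run == 1 else f'"{run}')
                run = 0
            out.append(cell)
        if run:                   # flush a run that reaches the end of the row
            out.append('"' if run == 1 else f'"{run}')
        encoded.append(out)
        prev = row                # compare the next row against the dense row
    return encoded


# Alice, Bob, Carol cells at POS 1000, 1012, 1018 from the slide
rows = [
    ["0/0:35:35,0:0,117,402", "0/0:29:29,0:0,109,387", "0/0:22:22,0:0,63,188"],
    ["0/0:35:35,0:0,117,402", "0/0:31:31,0:0,117,396", "0/1:28:17,11:74,0,188"],
    ["0/0:35:35,0:0,117,402", "0/0:31:31,0:0,117,396", "1/1:27:0,27:312,87,0"],
]
for tokens in sparse_encode(rows):
    print("\t".join(tokens))
```

Because each row is compared against the dense row above it, a quote always stands for the most recent value stated in that sample's column.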
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:32 0/0:16 0/0:16
22 1012 . CT C ... GT:DP:AD:PL "2 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:16 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35 0/0:29 0/0:22
22 1012 . CT C ... GT:DP:AD:PL " 0/0:31 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
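The "squeeze" step in these tables can be approximated with the sketch below. The rule is inferred from the example cells only: reference-identical and no-call cells keep just GT and DP, and DP is optionally rounded down to a power of two. The function names are hypothetical, the code assumes FORMAT begins with GT:DP, and the exact criteria used by the real tool may differ (see SPEC.md).

```python
def round_down_pow2(n):
    """Largest power of two that is <= n; 0 stays 0."""
    return 1 << (n.bit_length() - 1) if n > 0 else 0


def squeeze_cell(cell, round_dp=True):
    """Drop QC detail from a cell whose genotype has no non-reference allele."""
    fields = cell.split(":")
    gt, dp = fields[0], fields[1]           # assumes FORMAT starts with GT:DP
    alleles = gt.replace("|", "/").split("/")
    if any(a not in ("0", ".") for a in alleles):
        return cell                         # variant genotype: keep every field
    if round_dp and dp.isdigit():
        dp = str(round_down_pow2(int(dp)))  # coarser DP makes repeats likelier
    return f"{gt}:{dp}"


print(squeeze_cell("0/0:35:35,0:0,117,402"))                  # Alice, POS 1000
print(squeeze_cell("0/0:29:29,0:0,109,387", round_dp=False))  # Bob, unrounded
print(squeeze_cell("0/1:28:17,11:74,0,188"))                  # Carol, POS 1012
```

Rounding matters for the sparse encoding: Bob's depths of 29 and 31 both become 16, so his cell at POS 1012 repeats the one above it and joins Alice's in a single run.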
spVCF encoding: before & after
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:29:29,0:0,109,387 0/0:22:22,0:0,63,188
22 1012 . CT C ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL 0/0:35:35,0:0,117,402 0/0:31:31,0:0,117,396 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28:28,0,0:0,48,62,52,71,94 ./.:0:0,0:.,.,.,.,.,. 1/2:42:4,20,18:93,83,76,87,0,77
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:32 0/0:16 0/0:16
22 1012 . CT C ... GT:DP:AD:PL "2 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:16 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
Squeezed spVCF of 50K exomes
Figure: spVCF file size vs. N, chr2 exomes
Compression ratios: synergy between sparse & squeeze
spVCF reference implementation
#CHROM POS ID REF ALT ... FORMAT Alice Bob Carol
22 1000 . A G ... GT:DP:AD:PL 0/0:35 0/0:29 0/0:22
22 1012 . CT C ... GT:DP:AD:PL " 0/0:31 0/1:28:17,11:74,0,188
22 1018 . G A ... GT:DP:AD:PL "2 1/1:27:0,27:312,87,0
22 1074 . T C,G ... GT:DP:AD:PL 0/0:28 ./.:0 1/2:42:4,20,18:93,83,76,87,0,77
github.com/mlin/spVCF (Apache license)
bgzip -dc my.vcf.gz | spvcf encode --squeeze | bgzip -c > my.spvcf.gz
Checkpoints facilitate range access, accounting for columnar data dependencies (details in SPEC.md)
tabix my.spvcf.gz
spvcf tabix my.spvcf.gz chr2:8675309-8739065 > my.slice.spvcf
spvcf decode my.slice.spvcf > my.slice.vcf
bgzip -dc my.vcf.gz | spvcf squeeze | bgzip -c > my.squeezed.vcf.gz
Squeeze only: several-fold size reduction, remains plain VCF (no-brainer!?)
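To illustrate the decode step and the column dependencies that the checkpoints account for, here is a minimal Python sketch of the inverse of the sparse encoding. It is not the reference implementation: `sparse_decode` is a hypothetical name, and the real checkpoint mechanism is described in SPEC.md.

```python
def sparse_decode(encoded_rows):
    """Expand sparse rows back to dense rows of cells.

    A token '"' or '"N' copies the next 1 or N cells from the decoded row above.
    """
    decoded, prev = [], None
    for row in encoded_rows:
        dense = []
        for token in row:
            if token.startswith('"'):
                if prev is None:
                    raise ValueError("first row must be dense (no quote tokens)")
                run = int(token[1:]) if len(token) > 1 else 1
                start = len(dense)                     # column where the run begins
                dense.extend(prev[start:start + run])  # copy cells from the row above
            else:
                dense.append(token)
        decoded.append(dense)
        prev = dense
    return decoded


# Genotype columns of the four example rows on this slide
encoded = [
    ["0/0:35", "0/0:29", "0/0:22"],
    ['"', "0/0:31", "0/1:28:17,11:74,0,188"],
    ['"2', "1/1:27:0,27:312,87,0"],
    ["0/0:28", "./.:0", "1/2:42:4,20,18:93,83,76,87,0,77"],
]
for cells in sparse_decode(encoded):
    print("\t".join(cells))
```

A row containing quote tokens cannot be decoded in isolation, so a genomic range slice has to start from a row with every cell written out. The sketch makes that dependency explicit by rejecting a first row that contains quotes.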
Acknowledgements
Jeff Reid John Penn Xiaodong Bai
DiscovEHR participants
Will Salerno Olga Krasheninina
NHGRI
Albert V. Smith Yossi Farjoun Louis Bergelson Thomas Keane Rishi Nag
Sparse Project VCF (spVCF)
Simple evolution of VCF with sparse encoding of the (site × sample) genotype matrix
Judicious convention to elide nonessential QC details
Together, these bend the growth of compressed file sizes from ~N^1.5 to ~N^1.1
15× reduction observed for N=50K
50× projected for N=1M
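The 50× projection is consistent with the two growth exponents and the 15× observation, as this back-of-the-envelope Python check suggests. It assumes both file sizes follow pure power laws in N all the way to N=1M, which is an extrapolation; the function name is hypothetical.

```python
def projected_reduction(observed_ratio, observed_n, target_n,
                        dense_exp=1.5, sparse_exp=1.1):
    """Extrapolate the pVCF/spVCF size ratio, assuming power-law growth.

    If pVCF ~ N**dense_exp and spVCF ~ N**sparse_exp, the ratio grows as
    N**(dense_exp - sparse_exp).
    """
    return observed_ratio * (target_n / observed_n) ** (dense_exp - sparse_exp)


# 15x observed at N=50K, extrapolated to N=1M (a 20-fold larger cohort)
print(round(projected_reduction(15, 50_000, 1_000_000)))
```

A 20-fold larger cohort multiplies the ratio by 20^0.4, roughly 3.3, which takes 15× to about 50×.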
github.com/mlin/spVCF (Apache license)
bgzip -dc my.vcf.gz | spvcf encode | bgzip -c > my.spvcf.gz