1 of 32

Whole Genome Sequencing Data Release 1

Jina Song, PhD 

April 22, 2022

A Partnership with Veterans

2 of 32

Overview

  • Goal for MVP WGS data release 1
  • 10,390 MVP WGS samples 
  • Population-based variant quality control (QC) pipelines  
    • Individual genome QC 
    • 10k gVCF aggregation process
    • Population-based variant QC conditions
    • High quality variant dataset
  • Validation : GWAS of Height
    • Comparison with GWAS results using the imputed genotype dataset from the same samples

3 of 32

Goal for WGS Data Release 1

  • Goal : The VA Million Veteran Program (MVP) is a large population genomics cohort established to investigate the interactions of genes, environment, and military exposure upon Veteran health, with a significant commitment to whole genome sequence data. 
  • Need : The success of population-based common and rare variant association analyses depends on the accuracy of the variant call-set; specifically, sensitivity to low-frequency variants and filtering out false positives. However, it is challenging due to requiring sophisticated standards in filtering methods as well as lots of computational resources. 
  • Development : We establish a population-based quality control (QC) pipeline for generating high-quality variant sets from MVP whole genomes and use big data technologies to apply them to large datasets.  
  • Data Release :  We provide a high-quality variant dataset for 10,390 genomes in PGEN format.

4 of 32

WGS Data Release 1 demographics

5 of 32

Population-based variant quality control pipeline  

  • Based on quality-control procedures established by biobank scale studies like the Genome Aggregation Database (gnomAD) and NHGRI Centers for Common Disease Genetics (CCDG), we designed our quality control pipeline to include methods for variant-level, sample-level, and genotype-level filtration.

Computational model: We set up Hail in a Dataproc cluster in the Google Cloud and tested and optimized the performance for the population-based variant quality control process

6 of 32

Quality control pipeline - Individual genome quality control 

Analysis

Tool

Input

Results

Check Items and condition 

Sequencing Reads

FastQC 0.11.4

bam

Base sequence quality 

Read quality

GC contents

Average of per base sequence quality > 28

40 <= GC content (%) <= 41

Read Alignment

Samtools 0.1.19 (flagstat)

bam

Mapped reads

Properly mapped reads

Mapped properly paired read pct > 95%

Variant Calling

RTG Tools 3.7.1 (vcfstats)

vcf

SNV, Indels, Ti/Tv, 

SNP het/hom ratios

SNP count :  3-sigma limits

Indel count:  3-sigma limits

Ti/Tv ratio :  3-sigma limits

SNP het/hom ratio :  3-sigma limits

Contamination

verifyBAMID 

bam

Contamination rate

Contamination rate < 5%

QC tools and conditions

* 3-sigma limit : the data are within three standard deviations from a mean

  • If a sample does not satisfy any of the given conditions, we consider the sample as an outlier and exclude it. A typical outlier rate for the entire MVP WGS dataset is around 5%.

7 of 32

Quality control pipeline – gVCF aggregation

  • Hail provides the function, vcf_combiner, for combining single-sample gVCFs into a multi-sample matrix table.  It was successfully applied to jointly calling in bio-bank scale studies like the Genome Aggregation Database (gnomAD) and NHGRI Centers for Common Disease Genetics (CCDG).
  • Essential job parameters including batch and branch factors, which defined the job sizes and input-to-output sample ratios

8 of 32

Quality control pipeline – Conditions for retaining genotypes

MAF >= 1% :    lowest hwe P-value > 1e-5MAF < 1% :    lowest P-value hwe > 1e-6

Kinship Analysis

Marked up to 2nd-degree using a King tool

  • phi > 0.354 : Twin
  • 0.177 < phi <= 0.354 : 1st-degree
  • 0.0884 < phi <= 0.177 : 2nd-degree

Ancestry Prediction

- Population-specified p-value

- Used Trace tool for PCA with 1000G data as a reference and Random Forest algorithm for prediction

Blacklisted regions:

   1. Low-complexity regions (LCR)

   2. ENCODE blacklist regions 

* DP > 5 for haploid genotypes on sex chromosomes

Genotype-level QC

Pre-process

Variant-level QC

Sample-level QC

Remove blacklisted regions 

and homozygous reference calls

 10* ≤ DP (Depth) ≤ 400

hom_ref() : 

   - DP ≥ 10*, GQ ≥ 20

hom_var() : 

   - AD[1] / DP ≥ 0.9

   - PL[0] ≥ 20, 

   - variant-level AF[0]  ≤ 0.5 &  

      GQ ≥ 20

het() : 

   - (AD[0] + AD[1]) / DP ≥ 0.9    

   - PL[0] ≥ 20

   - AD[1] / DP ≥ 0.20

call rate ≥ 0.80

call rate ≥ 0.97

Mean DP ≥ 18

Applied the conditions as below to the calculated HWE p-values by the predicted ancestry per variant and marked variants with a very low MAF    

    - MAF >= 1% : 

           lowest HWE P-value > 1e-5

    - MAF < 1% : 

           lowest HWE P-value > 1e-6

9 of 32

Quality control pipeline – Performance

  • Thresholds for QC metrics were fine-tuned based on the distribution of each metric generated while calling individual variants and aggregating gVCFs. We applied this approach using Hail running in a Dataproc cluster on Google Cloud Platform. The high confidence variants generated from the pipeline pave a path towards a broad spectrum of analyses, such as genome-wide association studies and rare variant collapsing analysis.

Process

Output File

Steps

Process

Output

Format

Size

Number of samples

Number of 

variant calls

Step1

Individual genome-level QC

gVCF per qualified sample

gVCF

62.5TB

10,413

-

Aggregating gVCFs

gVCFs-merged

Matrix Table

29.7TB

10,413

2,900,012,781

Step2

Pre-processing

VCFs-merged

Matrix Table

8.14TB

10,413

242,156,902

Genotype-level QC

Qualified 

entries and variants

Matrix Table

Filtering 6.67% entries out

Variant-level QC

-

187,790,701

Sample-level QC

Qualified samples

Matrix Table

10,413

-

Other

Remove problematic samples from other analyses 

-

-

10,390

Final

QC-passed

QCed Matrix Table

Matrix Table

6.74TB

10,390

187,790,701

10 of 32

Quality control pipeline - Data Release

Process

Output  Files

Process

Output 

Format

Size

Number of columns (Sample)

Number of rows

Data 

Release

QC-passed data

QCed Matrix Table

Matrix Table

 6.74TB

10,390

187,790,701

Variant Statistical Information

Text 

8GB (compressed)

-

187,790,701

Sample Statistical Information

Text

2.5MB

10,390

-

Genotype 

PGEN

34GB

10,390

187,790,701

11 of 32

Average per genome statistics by population

ALL

AFR

EUR

HIS

Sample count

10390

3595

6347

445

Read Depth (DP)

28.16

28.06

28.20

28.38

Genotype Quality (GQ)

74.37

74.21

74.43

74.80

Fraction of calls not missing or filtered

0.99

0.99

0.99

0.99

Number of non-missing calls

186,486,498

186,332,526

186,568,120

186,563,846

Number of heterozygous calls

2,576,471

3,155,182

2,386,911

2,522,028

Number of homozygous alternate calls

1,562,213

1,581,534

1,552,180

1,548,675

Number of SNPs

5,194,374

5,679,522

4,929,257

5,057,311

Number of insertion

301,728

327,496

287,636

294,594

Number of deletion

288,427

313,057

274,970

281,439

Number of transition (A-G, C-T) 

3,526,996

3,856,714

3,346,844

3,433,447

Number of transversion

1,667,378

1,822,808

1,582,413

1,623,864

Transition/transversion ratio

2.12

2.12

2.12

2.12

Heterozygous/homozygous alternate call ratio

1.70

2.00

1.54

1.64

Insertion/deletion allele ratio

1.05

1.05

1.05

1.05

* 3 samples in ASN

12 of 32

Distribution of per genome statistics

Sequencing Read Depth (DP)

Genotype Quality (GQ)

13 of 32

Distribution of per genome statistics by population

Ratio of heterozygous to homozygous alternative calls

14 of 32

Distribution of call-rate per genome/variant 

Call-rate distribution by sample 

Call-rate distribution by variant

15 of 32

Datasets used in GWAS

WGS dataset

Imputed genotype dataset

Genomic Dataset

Sample (EUR only)

6347

6340

Phenotype data

Covariates

Age, Age-squared, Sex, BMI, 

10 principal components from PCA

Age, Age-squared, Sex, 

10 principal components from PCA

Phenotype

Height

Height

GWAS 

Algorithm

Linear Regression

Linear Regression

WGS dataset validation 

We compared the results of Height GWAS using the MVP WGS dataset with those of Height GWAS using the imputed genotype dataset from the same samples. We chose the European population because it included the largest number of samples in our dataset.

European GWAS of Height : WGS dataset vs Imputed genotype dataset 

16 of 32

European GWAS of Height : WGS dataset vs Imputed genotype dataset 

Manhattan plot comparison

- The common variants (n=8,932,105)

in two datasets are only considered

GWAS with WGS

GWAS with Imputed genotype

- Visual similarity

- Correlation coefficient : 0.957�Pearson Correlation (-1,1), which represents linear correlation between two sets of data, has been applied to the p-value datasets. The value close to 1 means that two datasets are highly correlated 

17 of 32

Top 100 significant variants 

Comparison

100 variant loci list ordered by p-value (reverse)

- The 81 common loci are indicated by green shadow in the list 

- 48 of the top-ranked 50 significant variants in each variant set are found in the other variant set

The 81 common variants exist between two variant sets

European GWAS of Height : WGS dataset vs Imputed genotype dataset 

18 of 32

Conclusions and future directions

Conclusion

    • We established a population-based quality control pipeline for generating high-quality variant sets from MVP whole genomes and used big data technologies to apply it to large datasets. The dataset was validated by comparison with the imputed genotype dataset from the same sample set. 

Future direction

    • Apply our population-based variant quality control pipeline to larger whole genome sequencing sample sets to facilitate genomic association studies.

19 of 32

Acknowledgements

Cuiping Pan, PhD

Xihong Lin, PhD

Hufeng Zhou, PhD

Paul Billing-Ross, MS

Bryan Gorman, PhD 

External Contributors

Philip Tsao, PhD

Center for Common Disease Genomics

VA Million Veteran Program

Palo Alto Group

Boston Group

Jina Song, PhD

Saiju Pyarajan, PhD

Kyriacos Markianos, PhD 

Stanford Center for Genomics & Personalized Medicine

Vandhana Krishnan, MS

Keith Bettinger, PhD

Greater Bay Area Institute of Precision Medicine (China)

Institutions and Tools

Patrick Schreiner, PhD 

Derek Harkins, MS 

Joe Sarro, PhD

Chandri Yandava, PhD 

Prathima Vembu, MS

Rodrigo Guarischi Sousa, PhD

20 of 32

Resources available online

Check out our website to find this slide deck and more:

gbsc.stanford.edu

OR

https://med.stanford.edu/gbsc/vapahcs.html

21 of 32

Supplement

22 of 32

Variant counts by chromosome

Count of variants

Chromosome

23 of 32

Chromosome

Missingness rate

by chromosome

chromosome

call counts

missingness rate

chr1

148,984,721,345

0.55%

chr2

162,727,004,496

0.46%

chr3

133,268,008,002

0.42%

chr4

130,403,237,040

0.44%

chr5

120,380,802,638

0.42%

chr6

114,497,809,429

0.44%

chr7

107,288,209,659

0.51%

chr8

104,986,537,672

0.46%

chr9

84,273,762,913

0.63%

chr10

89,940,072,168

0.49%

chr11

91,352,226,073

0.45%

chr12

87,990,042,351

0.47%

chr13

65,414,896,629

0.47%

chr14

60,364,492,280

0.53%

chr15

55,608,109,342

0.61%

chr16

61,971,527,853

0.61%

chr17

54,059,623,596

0.64%

chr18

51,225,209,956

0.42%

chr19

41,461,126,752

0.78%

chr20

42,853,850,440

0.55%

chr21

25,622,163,200

0.76%

chr22

26,914,795,208

0.85%

chrX

70,406,562,152

4.53%

chrY

5,588,216,210

6.78%

All chromosomes

1,937,583,007,404

0.68%

24 of 32

Kinship analysis

Relationship

Phi coefficient range

Sample count

Twin

phi > 0.354

0

1st-degree

0.177 < phi <= 0.354 

8

2nd-degree

0.0884 < phi <= 0.177 

12

  • Used King algorithm
  • Marked samples which have a relative pair in the dataset

Phi value

Distribution of phi values

25 of 32

Ancestry prediction

- Used Trace tool for PCA with 1000 Genomes data as a reference and Random Forest algorithm for prediction

- Ancestry prediction categories: EUR, AFR, EAS, SAS, AMR (5 Super population)

- HARE** categories:  EUR, AFR, HIS, ASN (4)

Concordance between the ancestry prediction results and self-reported ethnicity 

Predicted Pop.

HARE**

Sample count

1

EUR

EUR

6277

HIS

8

2

AFR

AFR

3591

HIS

5

3

EAS(ASN)*

ASN

2

4

SAS(ASN)*

AFR

1

5

AMR(HIS)*

EUR

70

AFR

3

HIS

432

ASN

1

Total

10390

The concordance

0.992 

()* : corrected labels between 1000G data populations and MVP HARE**, refer to 'Feb2022 All of Us Beta Release Genomic Quality Report.pdf'

** : HARE - a ethnicity classification software developed at VA 

26 of 32

Distribution of Allele Frequency (AF)

Log-scale frequency from 0% to 100%

Log-scale frequency from 0% to 1%

(Rare variants)

AF >= 1%

14,258,796 (7.6%)

AF < 1%

173,531,905 (92.4%)

Variant count

27 of 32

QQ Plot for WGS European Height GWAS

28 of 32

Information in the release matrix table

 

 

 

s

SHIPxx1

SHIPxx2

SHIPxx3

 

 

 

sample_qc

..

..

..

locus

alleles

variant_qc

 

 

 

 

chr1:10101

[‘T’]

:

 

 

 

 

chr1:10102

[‘A’]

:

 

 

 

 

chr1:10103

[‘A’]

:

 

 

 

 

Matrix table

29 of 32

Row fields in the release matrix table

Structrue

Feature

Type

example

description

memo

locus

locus<GRCh38>

chr1:10467

Variant locus

alleles

array<str>

["CTCGCGG","C"]

Variant alleles

rsid

str

-

The number that defines a human variation in the public genetic database dbSNP

No information

a_index

int

1

The original index of this alternate allele in the multiallelic representation

was_split

bool

true

True if this variant was originally multiallelic, otherwise False

variant_qc

dp_stats.

(mean,stdev,min,max)

float

2.32e+01

Read Depth

(Mean value, Standard deviation, Minimum value, Maximum value)

gq_stats.

(mean,stdev,min,max)

float

4.79e+01

Genotype Quality

(Mean value, Standard deviation, Minimum value, Maximum value)

AC

array<int>

[14605,1]

Calculated allele count, one element per allele, including the reference. Sums to AN

AF

array<float>

[1.00e+00,6.85e-05]

Calculated allele frequency, one element per allele, including the reference. Sums to one. Equivalent to AC / AN.

AN

int

14606

Total number of called alleles

Homozygote_count

int

[7302,0]

Number of homozygotes per allele. One element per allele, including the reference

call_rate

float

7.01e-01

Fraction of calls neither missing nor filtered

n_called

int

7303

Number of samples with a defined GT

n_not_called

int

0

Number of samples with a missing GT

n_filtered

int

3110

Number of filtered entries

n_het

int

1

Number of heterozygous samples

n_non_het

int

1

Number of samples with at least one called non-reference allele

het_freq_hwe

float

1.37e-04

Expected frequency of heterozygous samples under Hardy-Weinberg equilibrium

(See functions.hardy_weinberg_test() in Hail)

p_value_hwe

float

5.00e-01

p-value from test of Hardy-Weinberg equilibrium

(See functions.hardy_weinberg_test() in Hail)

hwe

AFR,EUR,AMR,EAS,SAS

float

5.09e-01,…., 5.00e-01

p-value from test of Hardy-Weinberg equilibrium by population

lowestphwe

float

5.00e-01

The lowest value of p-values from test of Hardy-Weinberg equilibrium by population

qcpass

bool

true

True if all conditions in genotype-level QC and variant-level QC are satisfied, otherwise False

qcpass_wo_hwe

bool

true

True if all conditions except the HWE condition in genotype-level QC and variant-level QC are satisfied, otherwise False

30 of 32

Column fields in the release matrix table

Structrue

Feature

Type

example

description

s

str

SHIP*******

Sample ID

sample_qc

dp_stats.

(mean,stdev,min,max)

float

2.53e+01

Read Depth 

(Mean value, Standard deviation, Minimum value, Maximum value)

gq_stats.

(mean,stdev,min,max)

float

6.77e+01

Genotype Quality

(Mean value, Standard deviation, Minimum value, Maximum value)

call_rate

float

9.58e-01

Fraction of calls not missing or filtered

n_called

int

187591690

Number of non-missing calls

n_not_called

int

0

Number of missing calls

n_filtered

int

8232804

Number of filtered entries

n_hom_ref

int

183582501

Number of homozygous reference calls

n_het

int

2464591

Number of heterozygous calls

n_hom_var

int

1544598

Number of homozygous alternate calls

n_non_ref

int

4009189

Sum of n_het and n_hom_var

n_singleton

int

8276

Number of private alleles

n_snp

int

4967631

Number of SNP alternate alleles

n_insertion

int

296715

Number of insertion alternate alleles

n_deletion

int

289441

Number of deletion alternate alleles

n_transition

int

3366311

Number of transition (A-G, C-T) alternate alleles

n_transversion

int

1601320

Number of transversion alternate alleles

n_star

int

0

Number of star (upstream deletion) alleles

r_ti_tv

float

2.1e+00

Transition/Transversion ratio

r_het_hom_var

float

1.6e+00

Het/HomVar call ratio

r_insertion_deletion

float

1.03e+00

Insertion/Deletion allele ratio

super_pop

bool

EUR

Population predicted with Random Forest algorithm and 1000 genome data as a reference

(AFR – African, EUR - European, SAS – South Asian, EAS-East Asian, AMR-Ad Mixed American) 

kin

bool

false

True if kinship coefficients are greater than 0.0884, meaning there is a relative pair, otherwise False

31 of 32

Entry fields in the release matrix table

Structrue

Feature

Type

example

description

memo

RGQ

int

22

Reference genotype quality

DP

int

12

Approximate read depth

GQ

int

22

Genotype quality

MIN_DP

int

12

Minimum DP observed within the GVCF block

PID

str

-

Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group

No information

SB

array<int>

Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias

GT

call

0/1

Genotype

PGT

call

-

Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another

No information

AD

array<int>

[29,13]

Allelic depths for the ref and alt alleles in the order listed

PL

array<int>

[273,0,774]

Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification

gvcf_info

BaseQRankSum

float

-6.06e-01

Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities

ClippingRankSum

float

0

Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases

DS

bool

falsd

Were any of the samples downsampled?

End

int

Stop position of the interval

ExcessHet

float

3.01e+00

Phred-scaled p-value for exact test of excess heterozygosity

InbreedingCoeff

float

-

Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation

No information

MLEAC

array<int>

[1,0]

Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed

MLEAF

array<float>

[5.00e-01,0.00e+00]

Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed

MQ

float

-

RMS Mapping Quality

No information

MQRankSum

float

-3.55e-01

Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities

RAW_MQ

float

2.11e+04

Raw data for RMS Mapping Quality

ReadPosRankSum

float

-1.20e+00

Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias

32 of 32

Quality control pipeline – Performance & Cost

Process

Output File

Steps

Process

Time / Cost

Output

Format

Size

Number of columns (Sample)

Number of 

rows 

Step1

Individual genome-level QC

-

gVCF per qualified sample

gVCF

62.5TB

10,413

-

Aggregating gVCFs

16h / $1200

gVCFs-merged

Matrix Table

29.7TB

10,413

2,900,012,781

Step2

Pre-processing

27h / $500

VCFs-merged

Matrix Table

8.14TB

10,413

242,156,902

Genotype-level QC

12h / $230

Qualified 

entries and variants

Matrix Table

Filtering 6.67% entries out

Variant-level QC

-

187,790,701

Sample-level QC

20m / $93

Qualified samples

Matrix Table

10,413

-

Other

Remove problematic samples from other analyzes 

-

-

10,390

Final

Total Process

~3.5 days / ~$2300

Release Matrix Table

Matrix Table

6.74TB

10,390

187,790,701