1 of 32

Whole Genome Sequencing Data Release 1

Jina Song, PhD

April 22, 2022

A Partnership with Veterans

2 of 32

Overview

Goal for MVP WGS data release 1
10,390 MVP WGS samples
Population-based variant quality control (QC) pipelines

Individual genome QC
10k gVCF aggregation process
Population-based variant QC conditions
High quality variant dataset

Validation : GWAS of Height

Comparison with GWAS results using the imputed genotype dataset from the same samples

3 of 32

Goal for WGS Data Release 1

Goal : The VA Million Veteran Program (MVP) is a large population genomics cohort established to investigate the interactions of genes, environment, and military exposure upon Veteran health, with a significant commitment to whole genome sequence data.
Need : The success of population-based common and rare variant association analyses depends on the accuracy of the variant call-set; specifically, sensitivity to low-frequency variants and filtering out false positives. However, it is challenging due to requiring sophisticated standards in filtering methods as well as lots of computational resources.
Development : We establish a population-based quality control (QC) pipeline for generating high-quality variant sets from MVP whole genomes and use big data technologies to apply them to large datasets.
Data Release : We provide a high-quality variant dataset for 10,390 genomes in PGEN format.

4 of 32

WGS Data Release 1 demographics

5 of 32

Population-based variant quality control pipeline

Based on quality-control procedures established by biobank scale studies like the Genome Aggregation Database (gnomAD) and NHGRI Centers for Common Disease Genetics (CCDG), we designed our quality control pipeline to include methods for variant-level, sample-level, and genotype-level filtration.

Computational model: We set up Hail in a Dataproc cluster in the Google Cloud and tested and optimized the performance for the population-based variant quality control process

6 of 32

Quality control pipeline - Individual genome quality control

Analysis	Tool	Input	Results	Check Items and condition
Sequencing Reads	FastQC 0.11.4	bam	Base sequence quality Read quality GC contents	Average of per base sequence quality > 28
Sequencing Reads	FastQC 0.11.4	bam	Base sequence quality Read quality GC contents	40 <= GC content (%) <= 41
Read Alignment	Samtools 0.1.19 (flagstat)	bam	Mapped reads Properly mapped reads	Mapped properly paired read pct > 95%
Variant Calling	RTG Tools 3.7.1 (vcfstats)	vcf	SNV, Indels, Ti/Tv, SNP het/hom ratios	SNP count : 3-sigma limits
				Indel count: 3-sigma limits
				Ti/Tv ratio : 3-sigma limits
				SNP het/hom ratio : 3-sigma limits
Contamination	verifyBAMID	bam	Contamination rate	Contamination rate < 5%

QC tools and conditions

* 3-sigma limit : the data are within three standard deviations from a mean

If a sample does not satisfy any of the given conditions, we consider the sample as an outlier and exclude it. A typical outlier rate for the entire MVP WGS dataset is around 5%.

7 of 32

Quality control pipeline – gVCF aggregation

Hail provides the function, vcf_combiner, for combining single-sample gVCFs into a multi-sample matrix table. It was successfully applied to jointly calling in bio-bank scale studies like the Genome Aggregation Database (gnomAD) and NHGRI Centers for Common Disease Genetics (CCDG).
Essential job parameters including batch and branch factors, which defined the job sizes and input-to-output sample ratios

8 of 32

Quality control pipeline – Conditions for retaining genotypes

MAF >= 1% : lowest hwe P-value > 1e-5MAF < 1% : lowest P-value hwe > 1e-6

Kinship Analysis

Marked up to 2^nd-degree using a King tool

phi > 0.354 : Twin
0.177 < phi <= 0.354 �: 1st-degree
0.0884 < phi <= 0.177 �: 2nd-degree�

Ancestry Prediction

- Population-specified p-value

- Used Trace tool for PCA with 1000G data as a reference and Random Forest algorithm for prediction

Blacklisted regions:

1. Low-complexity regions (LCR)

2. ENCODE blacklist regions

* DP > 5 for haploid genotypes on sex chromosomes

Genotype-level QC

Pre-process

Variant-level QC

Sample-level QC

Remove blacklisted regions and homozygous reference calls

10* ≤ DP (Depth) ≤ 400
hom_ref() : - DP ≥ 10*, GQ ≥ 20
hom_var() : - AD[1] / DP ≥ 0.9 - PL[0] ≥ 20, - variant-level AF[0] ≤ 0.5 & GQ ≥ 20
het() : - (AD[0] + AD[1]) / DP ≥ 0.9 - PL[0] ≥ 20 - AD[1] / DP ≥ 0.20

call rate ≥ 0.80

call rate ≥ 0.97 Mean DP ≥ 18

Applied the conditions as below to the calculated HWE p-values by the predicted ancestry per variant and marked variants with a very low MAF

- MAF >= 1% :

lowest HWE P-value > 1e-5

- MAF < 1% :

lowest HWE P-value > 1e-6

9 of 32

Quality control pipeline – Performance

Thresholds for QC metrics were fine-tuned based on the distribution of each metric generated while calling individual variants and aggregating gVCFs. We applied this approach using Hail running in a Dataproc cluster on Google Cloud Platform. The high confidence variants generated from the pipeline pave a path towards a broad spectrum of analyses, such as genome-wide association studies and rare variant collapsing analysis.

Process		Output File
Steps	Process	Output	Format	Size	Number of samples	Number of variant calls
Step1	Individual genome-level QC	gVCF per qualified sample	gVCF	62.5TB	10,413	-
Step1	Aggregating gVCFs	gVCFs-merged	Matrix Table	29.7TB	10,413	2,900,012,781
Step2	Pre-processing	VCFs-merged	Matrix Table	8.14TB	10,413	242,156,902
	Genotype-level QC	Qualified entries and variants	Matrix Table		Filtering 6.67% entries out
	Variant-level QC	Qualified entries and variants	Matrix Table		-	187,790,701
	Sample-level QC	Qualified samples	Matrix Table		10,413	-
Other	Remove problematic samples from other analyses	-	-		10,390
Final	QC-passed	QCed Matrix Table	Matrix Table	6.74TB	10,390	187,790,701

10 of 32

Quality control pipeline - Data Release

Process		Output Files
	Process	Output	Format	Size	Number of columns (Sample)	Number of rows
Data Release	QC-passed data	QCed Matrix Table	Matrix Table	6.74TB	10,390	187,790,701
		Variant Statistical Information	Text	8GB (compressed)	-	187,790,701
		Sample Statistical Information	Text	2.5MB	10,390	-
		Genotype	PGEN	34GB	10,390	187,790,701

11 of 32

Average per genome statistics by population

	ALL	AFR	EUR	HIS
Sample count	10390	3595	6347	445
Read Depth (DP)	28.16	28.06	28.20	28.38
Genotype Quality (GQ)	74.37	74.21	74.43	74.80
Fraction of calls not missing or filtered	0.99	0.99	0.99	0.99
Number of non-missing calls	186,486,498	186,332,526	186,568,120	186,563,846
Number of heterozygous calls	2,576,471	3,155,182	2,386,911	2,522,028
Number of homozygous alternate calls	1,562,213	1,581,534	1,552,180	1,548,675
Number of SNPs	5,194,374	5,679,522	4,929,257	5,057,311
Number of insertion	301,728	327,496	287,636	294,594
Number of deletion	288,427	313,057	274,970	281,439
Number of transition (A-G, C-T)	3,526,996	3,856,714	3,346,844	3,433,447
Number of transversion	1,667,378	1,822,808	1,582,413	1,623,864
Transition/transversion ratio	2.12	2.12	2.12	2.12
Heterozygous/homozygous alternate call ratio	1.70	2.00	1.54	1.64
Insertion/deletion allele ratio	1.05	1.05	1.05	1.05

* 3 samples in ASN

12 of 32

Distribution of per genome statistics

Sequencing Read Depth (DP)

Genotype Quality (GQ)

13 of 32

Distribution of per genome statistics by population

Ratio of heterozygous to homozygous alternative calls

14 of 32

Distribution of call-rate per genome/variant

Call-rate distribution by sample

Call-rate distribution by variant

15 of 32

Datasets used in GWAS

		WGS dataset	Imputed genotype dataset
Genomic Dataset	Sample (EUR only)	6347	6340
Phenotype data	Covariates	Age, Age-squared, Sex, BMI, 10 principal components from PCA	Age, Age-squared, Sex, 10 principal components from PCA
Phenotype data	Phenotype	Height	Height
GWAS	Algorithm	Linear Regression	Linear Regression

WGS dataset validation

We compared the results of Height GWAS using the MVP WGS dataset with those of Height GWAS using the imputed genotype dataset from the same samples. We chose the European population because it included the largest number of samples in our dataset.

European GWAS of Height : WGS dataset vs Imputed genotype dataset

16 of 32

European GWAS of Height : WGS dataset vs Imputed genotype dataset

Manhattan plot comparison

- The common variants (n=8,932,105)

in two datasets are only considered

GWAS with WGS

GWAS with Imputed genotype

- Visual similarity

- Correlation coefficient : 0.957�Pearson Correlation (-1,1), which represents linear correlation between two sets of data, has been applied to the p-value datasets. The value close to 1 means that two datasets are highly correlated

17 of 32

Top 100 significant variants

Comparison

100 variant loci list ordered by p-value (reverse)

- The 81 common loci are indicated by green shadow in the list

- 48 of the top-ranked 50 significant variants in each variant set are found in the other variant set

The 81 common variants exist between two variant sets

European GWAS of Height : WGS dataset vs Imputed genotype dataset

18 of 32

Conclusions and future directions

Conclusion

We established a population-based quality control pipeline for generating high-quality variant sets from MVP whole genomes and used big data technologies to apply it to large datasets. The dataset was validated by comparison with the imputed genotype dataset from the same sample set.

Future direction

Apply our population-based variant quality control pipeline to larger whole genome sequencing sample sets to facilitate genomic association studies.

19 of 32

Acknowledgements

Cuiping Pan, PhD

Xihong Lin, PhD

Hufeng Zhou, PhD

Paul Billing-Ross, MS

Bryan Gorman, PhD

External Contributors

Philip Tsao, PhD

Center for Common Disease Genomics

VA Million Veteran Program

Palo Alto Group

Boston Group

Jina Song, PhD

Saiju Pyarajan, PhD

Kyriacos Markianos, PhD

Stanford Center for Genomics & Personalized Medicine

Vandhana Krishnan, MS

Keith Bettinger, PhD

Greater Bay Area Institute of Precision Medicine (China)

Institutions and Tools

Patrick Schreiner, PhD

Derek Harkins, MS

Joe Sarro, PhD

Chandri Yandava, PhD

Prathima Vembu, MS

Rodrigo Guarischi Sousa, PhD

20 of 32

Resources available online

Check out our website to find this slide deck and more:

gbsc.stanford.edu

OR

https://med.stanford.edu/gbsc/vapahcs.html

21 of 32

Supplement

22 of 32

Variant counts by chromosome

Count of variants

Chromosome

23 of 32

Chromosome

Missingness rate

by chromosome

chromosome	call counts	missingness rate
chr1	148,984,721,345	0.55%
chr2	162,727,004,496	0.46%
chr3	133,268,008,002	0.42%
chr4	130,403,237,040	0.44%
chr5	120,380,802,638	0.42%
chr6	114,497,809,429	0.44%
chr7	107,288,209,659	0.51%
chr8	104,986,537,672	0.46%
chr9	84,273,762,913	0.63%
chr10	89,940,072,168	0.49%
chr11	91,352,226,073	0.45%
chr12	87,990,042,351	0.47%
chr13	65,414,896,629	0.47%
chr14	60,364,492,280	0.53%
chr15	55,608,109,342	0.61%
chr16	61,971,527,853	0.61%
chr17	54,059,623,596	0.64%
chr18	51,225,209,956	0.42%
chr19	41,461,126,752	0.78%
chr20	42,853,850,440	0.55%
chr21	25,622,163,200	0.76%
chr22	26,914,795,208	0.85%
chrX	70,406,562,152	4.53%
chrY	5,588,216,210	6.78%
All chromosomes	1,937,583,007,404	0.68%

24 of 32

Kinship analysis

Relationship	Phi coefficient range	Sample count
Twin	phi > 0.354	0
1st-degree	0.177 < phi <= 0.354	8
2nd-degree	0.0884 < phi <= 0.177	12

Used King algorithm

http://people.virginia.edu/~wc9c/KING/manual.html

Marked samples which have a relative pair in the dataset

Phi value

Distribution of phi values

25 of 32

Ancestry prediction

- Used Trace tool for PCA with 1000 Genomes data as a reference and Random Forest algorithm for prediction

- Ancestry prediction categories: EUR, AFR, EAS, SAS, AMR (5 Super population)

- HARE** categories: EUR, AFR, HIS, ASN (4)

Concordance between the ancestry prediction results and self-reported ethnicity

Predicted Pop.		HARE**	Sample count
1	EUR	EUR	6277
1	EUR	HIS	8
2	AFR	AFR	3591
2	AFR	HIS	5
3	EAS(ASN)*	ASN	2
4	SAS(ASN)*	AFR	1
5	AMR(HIS)*	EUR	70
		AFR	3
		HIS	432
		ASN	1
	Total		10390

The concordance

0.992

()* : corrected labels between 1000G data populations and MVP HARE**, refer to 'Feb2022 All of Us Beta Release Genomic Quality Report.pdf'

** : HARE - a ethnicity classification software developed at VA

26 of 32

Distribution of Allele Frequency (AF)

Log-scale frequency from 0% to 100%

Log-scale frequency from 0% to 1%

(Rare variants)

AF >= 1%
14,258,796 (7.6%)
AF < 1%
173,531,905 (92.4%)

Variant count

27 of 32

QQ Plot for WGS European Height GWAS

28 of 32

Information in the release matrix table

			s	SHIPxx1	SHIPxx2	SHIPxx3
			sample_qc	..	..	..
locus	alleles	variant_qc
chr1:10101	[‘T’]	:
chr1:10102	[‘A’]	:
chr1:10103	[‘A’]	:

Matrix table

29 of 32

Row fields in the release matrix table

Structrue	Feature	Type	example	description	memo
	locus	locus<GRCh38>	chr1:10467	Variant locus
	alleles	array<str>	["CTCGCGG","C"]	Variant alleles
	rsid	str	-	The number that defines a human variation in the public genetic database dbSNP	No information
	a_index	int	1	The original index of this alternate allele in the multiallelic representation
	was_split	bool	true	True if this variant was originally multiallelic, otherwise False
variant_qc	dp_stats. (mean,stdev,min,max)	float	2.32e+01	Read Depth (Mean value, Standard deviation, Minimum value, Maximum value)
	gq_stats. (mean,stdev,min,max)	float	4.79e+01	Genotype Quality (Mean value, Standard deviation, Minimum value, Maximum value)
	AC	array<int>	[14605,1]	Calculated allele count, one element per allele, including the reference. Sums to AN
	AF	array<float>	[1.00e+00,6.85e-05]	Calculated allele frequency, one element per allele, including the reference. Sums to one. Equivalent to AC / AN.
	AN	int	14606	Total number of called alleles
	Homozygote_count	int	[7302,0]	Number of homozygotes per allele. One element per allele, including the reference
	call_rate	float	7.01e-01	Fraction of calls neither missing nor filtered
	n_called	int	7303	Number of samples with a defined GT
	n_not_called	int	0	Number of samples with a missing GT
	n_filtered	int	3110	Number of filtered entries
	n_het	int	1	Number of heterozygous samples
	n_non_het	int	1	Number of samples with at least one called non-reference allele
	het_freq_hwe	float	1.37e-04	Expected frequency of heterozygous samples under Hardy-Weinberg equilibrium (See functions.hardy_weinberg_test() in Hail)
	p_value_hwe	float	5.00e-01	p-value from test of Hardy-Weinberg equilibrium (See functions.hardy_weinberg_test() in Hail)
hwe	AFR,EUR,AMR,EAS,SAS	float	5.09e-01,…., 5.00e-01	p-value from test of Hardy-Weinberg equilibrium by population
	lowestphwe	float	5.00e-01	The lowest value of p-values from test of Hardy-Weinberg equilibrium by population
	qcpass	bool	true	True if all conditions in genotype-level QC and variant-level QC are satisfied, otherwise False
	qcpass_wo_hwe	bool	true	True if all conditions except the HWE condition in genotype-level QC and variant-level QC are satisfied, otherwise False

30 of 32

Column fields in the release matrix table

Structrue	Feature	Type	example	description
	s	str	SHIP*******	Sample ID
sample_qc	dp_stats. (mean,stdev,min,max)	float	2.53e+01	Read Depth (Mean value, Standard deviation, Minimum value, Maximum value)
	gq_stats. (mean,stdev,min,max)	float	6.77e+01	Genotype Quality (Mean value, Standard deviation, Minimum value, Maximum value)
	call_rate	float	9.58e-01	Fraction of calls not missing or filtered
	n_called	int	187591690	Number of non-missing calls
	n_not_called	int	0	Number of missing calls
	n_filtered	int	8232804	Number of filtered entries
	n_hom_ref	int	183582501	Number of homozygous reference calls
	n_het	int	2464591	Number of heterozygous calls
	n_hom_var	int	1544598	Number of homozygous alternate calls
	n_non_ref	int	4009189	Sum of n_het and n_hom_var
	n_singleton	int	8276	Number of private alleles
	n_snp	int	4967631	Number of SNP alternate alleles
	n_insertion	int	296715	Number of insertion alternate alleles
	n_deletion	int	289441	Number of deletion alternate alleles
	n_transition	int	3366311	Number of transition (A-G, C-T) alternate alleles
	n_transversion	int	1601320	Number of transversion alternate alleles
	n_star	int	0	Number of star (upstream deletion) alleles
	r_ti_tv	float	2.1e+00	Transition/Transversion ratio
	r_het_hom_var	float	1.6e+00	Het/HomVar call ratio
	r_insertion_deletion	float	1.03e+00	Insertion/Deletion allele ratio
	super_pop	bool	EUR	Population predicted with Random Forest algorithm and 1000 genome data as a reference (AFR – African, EUR - European, SAS – South Asian, EAS-East Asian, AMR-Ad Mixed American)
	kin	bool	false	True if kinship coefficients are greater than 0.0884, meaning there is a relative pair, otherwise False

31 of 32

Entry fields in the release matrix table

Structrue	Feature	Type	example	description	memo
	RGQ	int	22	Reference genotype quality
	DP	int	12	Approximate read depth
	GQ	int	22	Genotype quality
	MIN_DP	int	12	Minimum DP observed within the GVCF block
	PID	str	-	Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group	No information
	SB	array<int>		Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias
	GT	call	0/1	Genotype
	PGT	call	-	Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another	No information
	AD	array<int>	[29,13]	Allelic depths for the ref and alt alleles in the order listed
	PL	array<int>	[273,0,774]	Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
gvcf_info	BaseQRankSum	float	-6.06e-01	Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities
	ClippingRankSum	float	0	Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases
	DS	bool	falsd	Were any of the samples downsampled?
	End	int		Stop position of the interval
	ExcessHet	float	3.01e+00	Phred-scaled p-value for exact test of excess heterozygosity
	InbreedingCoeff	float	-	Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation	No information
	MLEAC	array<int>	[1,0]	Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed
	MLEAF	array<float>	[5.00e-01,0.00e+00]	Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed
	MQ	float	-	RMS Mapping Quality	No information
	MQRankSum	float	-3.55e-01	Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities
	RAW_MQ	float	2.11e+04	Raw data for RMS Mapping Quality
	ReadPosRankSum	float	-1.20e+00	Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias

32 of 32

Quality control pipeline – Performance & Cost

Process			Output File
Steps	Process	Time / Cost	Output	Format	Size	Number of columns (Sample)	Number of rows
Step1	Individual genome-level QC	-	gVCF per qualified sample	gVCF	62.5TB	10,413	-
Step1	Aggregating gVCFs	16h / $1200	gVCFs-merged	Matrix Table	29.7TB	10,413	2,900,012,781
Step2	Pre-processing	27h / $500	VCFs-merged	Matrix Table	8.14TB	10,413	242,156,902
	Genotype-level QC	12h / $230	Qualified entries and variants	Matrix Table		Filtering 6.67% entries out
	Variant-level QC	12h / $230	Qualified entries and variants	Matrix Table		-	187,790,701
	Sample-level QC	20m / $93	Qualified samples	Matrix Table		10,413	-
Other	Remove problematic samples from other analyzes	-	-	-		10,390
Final	Total Process	~3.5 days / ~$2300	Release Matrix Table	Matrix Table	6.74TB	10,390	187,790,701