Whole Genome Sequencing Data Release 1
Jina Song, PhD
April 22, 2022
A Partnership with Veterans
Overview
Goal for WGS Data Release 1
WGS Data Release 1 demographics
Population-based variant quality control pipeline
Computational model: We set up Hail in a Dataproc cluster in the Google Cloud and tested and optimized the performance for the population-based variant quality control process
Quality control pipeline - Individual genome quality control
Analysis | Tool | Input | Results | Check Items and condition |
Sequencing Reads | FastQC 0.11.4 | bam | Base sequence quality Read quality GC contents | Average of per base sequence quality > 28 |
40 <= GC content (%) <= 41 | ||||
Read Alignment | Samtools 0.1.19 (flagstat) | bam | Mapped reads Properly mapped reads | Mapped properly paired read pct > 95% |
Variant Calling | RTG Tools 3.7.1 (vcfstats) | vcf | SNV, Indels, Ti/Tv, SNP het/hom ratios | SNP count : 3-sigma limits |
Indel count: 3-sigma limits | ||||
Ti/Tv ratio : 3-sigma limits | ||||
SNP het/hom ratio : 3-sigma limits | ||||
Contamination | verifyBAMID | bam | Contamination rate | Contamination rate < 5% |
QC tools and conditions
* 3-sigma limit : the data are within three standard deviations from a mean
Quality control pipeline – gVCF aggregation
Quality control pipeline – Conditions for retaining genotypes
MAF >= 1% : lowest hwe P-value > 1e-5MAF < 1% : lowest P-value hwe > 1e-6
Kinship Analysis
Marked up to 2nd-degree using a King tool
Ancestry Prediction
- Population-specified p-value
- Used Trace tool for PCA with 1000G data as a reference and Random Forest algorithm for prediction
* DP > 5 for haploid genotypes on sex chromosomes
Genotype-level QC
Pre-process
Variant-level QC
Sample-level QC
Remove blacklisted regions and homozygous reference calls |
10* ≤ DP (Depth) ≤ 400 |
hom_ref() : - DP ≥ 10*, GQ ≥ 20 |
hom_var() : - AD[1] / DP ≥ 0.9 - PL[0] ≥ 20, - variant-level AF[0] ≤ 0.5 & GQ ≥ 20 |
het() : - (AD[0] + AD[1]) / DP ≥ 0.9 - PL[0] ≥ 20 - AD[1] / DP ≥ 0.20 |
call rate ≥ 0.80 |
call rate ≥ 0.97 Mean DP ≥ 18 |
Applied the conditions as below to the calculated HWE p-values by the predicted ancestry per variant and marked variants with a very low MAF
- MAF >= 1% :
lowest HWE P-value > 1e-5
- MAF < 1% :
lowest HWE P-value > 1e-6
Quality control pipeline – Performance
Process | Output File | |||||
Steps | Process | Output | Format | Size | Number of samples | Number of variant calls |
Step1 | Individual genome-level QC | gVCF per qualified sample | gVCF | 62.5TB | 10,413 | - |
Aggregating gVCFs | gVCFs-merged | Matrix Table | 29.7TB | 10,413 | 2,900,012,781 | |
Step2 | Pre-processing | VCFs-merged | Matrix Table | 8.14TB | 10,413 | 242,156,902 |
Genotype-level QC | Qualified entries and variants | Matrix Table | | Filtering 6.67% entries out | ||
Variant-level QC | - | 187,790,701 | ||||
Sample-level QC | Qualified samples | Matrix Table | | 10,413 | - | |
Other | Remove problematic samples from other analyses | - | - | | 10,390 | |
Final | QC-passed | QCed Matrix Table | Matrix Table | 6.74TB | 10,390 | 187,790,701 |
Quality control pipeline - Data Release
Process | Output Files | |||||
| Process | Output | Format | Size | Number of columns (Sample) | Number of rows |
Data Release | QC-passed data | QCed Matrix Table | Matrix Table | 6.74TB | 10,390 | 187,790,701 |
Variant Statistical Information | Text | 8GB (compressed) | - | 187,790,701 | ||
Sample Statistical Information | Text | 2.5MB | 10,390 | - | ||
Genotype | PGEN | 34GB | 10,390 | 187,790,701 | ||
Average per genome statistics by population
| ALL | AFR | EUR | HIS |
Sample count | 10390 | 3595 | 6347 | 445 |
Read Depth (DP) | 28.16 | 28.06 | 28.20 | 28.38 |
Genotype Quality (GQ) | 74.37 | 74.21 | 74.43 | 74.80 |
Fraction of calls not missing or filtered | 0.99 | 0.99 | 0.99 | 0.99 |
Number of non-missing calls | 186,486,498 | 186,332,526 | 186,568,120 | 186,563,846 |
Number of heterozygous calls | 2,576,471 | 3,155,182 | 2,386,911 | 2,522,028 |
Number of homozygous alternate calls | 1,562,213 | 1,581,534 | 1,552,180 | 1,548,675 |
Number of SNPs | 5,194,374 | 5,679,522 | 4,929,257 | 5,057,311 |
Number of insertion | 301,728 | 327,496 | 287,636 | 294,594 |
Number of deletion | 288,427 | 313,057 | 274,970 | 281,439 |
Number of transition (A-G, C-T) | 3,526,996 | 3,856,714 | 3,346,844 | 3,433,447 |
Number of transversion | 1,667,378 | 1,822,808 | 1,582,413 | 1,623,864 |
Transition/transversion ratio | 2.12 | 2.12 | 2.12 | 2.12 |
Heterozygous/homozygous alternate call ratio | 1.70 | 2.00 | 1.54 | 1.64 |
Insertion/deletion allele ratio | 1.05 | 1.05 | 1.05 | 1.05 |
* 3 samples in ASN
Distribution of per genome statistics
Sequencing Read Depth (DP)
Genotype Quality (GQ)
Distribution of per genome statistics by population
Ratio of heterozygous to homozygous alternative calls
Distribution of call-rate per genome/variant
Call-rate distribution by sample
Call-rate distribution by variant
Datasets used in GWAS
| | WGS dataset | Imputed genotype dataset |
Genomic Dataset | Sample (EUR only) | 6347 | 6340 |
Phenotype data | Covariates | Age, Age-squared, Sex, BMI, 10 principal components from PCA | Age, Age-squared, Sex, 10 principal components from PCA |
Phenotype | Height | Height | |
GWAS | Algorithm | Linear Regression | Linear Regression |
WGS dataset validation
We compared the results of Height GWAS using the MVP WGS dataset with those of Height GWAS using the imputed genotype dataset from the same samples. We chose the European population because it included the largest number of samples in our dataset.
European GWAS of Height : WGS dataset vs Imputed genotype dataset
European GWAS of Height : WGS dataset vs Imputed genotype dataset
Manhattan plot comparison
- The common variants (n=8,932,105)
in two datasets are only considered
GWAS with WGS
GWAS with Imputed genotype
- Visual similarity
- Correlation coefficient : 0.957�Pearson Correlation (-1,1), which represents linear correlation between two sets of data, has been applied to the p-value datasets. The value close to 1 means that two datasets are highly correlated
Top 100 significant variants
Comparison
100 variant loci list ordered by p-value (reverse)
- The 81 common loci are indicated by green shadow in the list
- 48 of the top-ranked 50 significant variants in each variant set are found in the other variant set
The 81 common variants exist between two variant sets
European GWAS of Height : WGS dataset vs Imputed genotype dataset
Conclusions and future directions
Conclusion
Future direction
Acknowledgements
Cuiping Pan, PhD
Xihong Lin, PhD
Hufeng Zhou, PhD
Paul Billing-Ross, MS
Bryan Gorman, PhD
External Contributors
Philip Tsao, PhD
Center for Common Disease Genomics
VA Million Veteran Program
Palo Alto Group
Boston Group
Jina Song, PhD
Saiju Pyarajan, PhD
Kyriacos Markianos, PhD
Stanford Center for Genomics & Personalized Medicine
Vandhana Krishnan, MS
Keith Bettinger, PhD
Greater Bay Area Institute of Precision Medicine (China)
Institutions and Tools
Patrick Schreiner, PhD
Derek Harkins, MS
Joe Sarro, PhD
Chandri Yandava, PhD
Prathima Vembu, MS
Rodrigo Guarischi Sousa, PhD
Resources available online
Check out our website to find this slide deck and more:
gbsc.stanford.edu
OR
https://med.stanford.edu/gbsc/vapahcs.html
Supplement
Variant counts by chromosome
Count of variants
Chromosome
Chromosome
Missingness rate
by chromosome
chromosome | call counts | missingness rate |
chr1 | 148,984,721,345 | 0.55% |
chr2 | 162,727,004,496 | 0.46% |
chr3 | 133,268,008,002 | 0.42% |
chr4 | 130,403,237,040 | 0.44% |
chr5 | 120,380,802,638 | 0.42% |
chr6 | 114,497,809,429 | 0.44% |
chr7 | 107,288,209,659 | 0.51% |
chr8 | 104,986,537,672 | 0.46% |
chr9 | 84,273,762,913 | 0.63% |
chr10 | 89,940,072,168 | 0.49% |
chr11 | 91,352,226,073 | 0.45% |
chr12 | 87,990,042,351 | 0.47% |
chr13 | 65,414,896,629 | 0.47% |
chr14 | 60,364,492,280 | 0.53% |
chr15 | 55,608,109,342 | 0.61% |
chr16 | 61,971,527,853 | 0.61% |
chr17 | 54,059,623,596 | 0.64% |
chr18 | 51,225,209,956 | 0.42% |
chr19 | 41,461,126,752 | 0.78% |
chr20 | 42,853,850,440 | 0.55% |
chr21 | 25,622,163,200 | 0.76% |
chr22 | 26,914,795,208 | 0.85% |
chrX | 70,406,562,152 | 4.53% |
chrY | 5,588,216,210 | 6.78% |
All chromosomes | 1,937,583,007,404 | 0.68% |
Kinship analysis
Relationship | Phi coefficient range | Sample count |
Twin | phi > 0.354 | 0 |
1st-degree | 0.177 < phi <= 0.354 | 8 |
2nd-degree | 0.0884 < phi <= 0.177 | 12 |
Phi value
Distribution of phi values
Ancestry prediction
- Used Trace tool for PCA with 1000 Genomes data as a reference and Random Forest algorithm for prediction
- Ancestry prediction categories: EUR, AFR, EAS, SAS, AMR (5 Super population)
- HARE** categories: EUR, AFR, HIS, ASN (4)
Concordance between the ancestry prediction results and self-reported ethnicity
Predicted Pop. | HARE** | Sample count | |
1 | EUR | EUR | 6277 |
HIS | 8 | ||
2 | AFR | AFR | 3591 |
HIS | 5 | ||
3 | EAS(ASN)* | ASN | 2 |
4 | SAS(ASN)* | AFR | 1 |
5 | AMR(HIS)* | EUR | 70 |
AFR | 3 | ||
HIS | 432 | ||
ASN | 1 | ||
| Total | 10390 | |
The concordance
0.992
()* : corrected labels between 1000G data populations and MVP HARE**, refer to 'Feb2022 All of Us Beta Release Genomic Quality Report.pdf'
** : HARE - a ethnicity classification software developed at VA
Distribution of Allele Frequency (AF)
Log-scale frequency from 0% to 100%
Log-scale frequency from 0% to 1%
(Rare variants)
AF >= 1% |
14,258,796 (7.6%) |
AF < 1% |
173,531,905 (92.4%) |
Variant count
QQ Plot for WGS European Height GWAS
Information in the release matrix table
|
|
| s | SHIPxx1 | SHIPxx2 | SHIPxx3 |
|
|
| sample_qc | .. | .. | .. |
locus | alleles | variant_qc |
|
|
|
|
chr1:10101 | [‘T’] | : |
|
|
|
|
chr1:10102 | [‘A’] | : |
|
|
|
|
chr1:10103 | [‘A’] | : |
|
|
|
|
Matrix table
Row fields in the release matrix table
Structrue | Feature | Type | example | description | memo |
| locus | locus<GRCh38> | chr1:10467 | Variant locus | |
| alleles | array<str> | ["CTCGCGG","C"] | Variant alleles | |
| rsid | str | - | The number that defines a human variation in the public genetic database dbSNP | No information |
| a_index | int | 1 | The original index of this alternate allele in the multiallelic representation | |
| was_split | bool | true | True if this variant was originally multiallelic, otherwise False | |
variant_qc | dp_stats. (mean,stdev,min,max) | float | 2.32e+01 | Read Depth (Mean value, Standard deviation, Minimum value, Maximum value) | |
gq_stats. (mean,stdev,min,max) | float | 4.79e+01 | Genotype Quality (Mean value, Standard deviation, Minimum value, Maximum value) | | |
AC | array<int> | [14605,1] | Calculated allele count, one element per allele, including the reference. Sums to AN | | |
AF | array<float> | [1.00e+00,6.85e-05] | Calculated allele frequency, one element per allele, including the reference. Sums to one. Equivalent to AC / AN. | | |
AN | int | 14606 | Total number of called alleles | | |
Homozygote_count | int | [7302,0] | Number of homozygotes per allele. One element per allele, including the reference | | |
call_rate | float | 7.01e-01 | Fraction of calls neither missing nor filtered | | |
n_called | int | 7303 | Number of samples with a defined GT | | |
n_not_called | int | 0 | Number of samples with a missing GT | | |
n_filtered | int | 3110 | Number of filtered entries | | |
n_het | int | 1 | Number of heterozygous samples | | |
n_non_het | int | 1 | Number of samples with at least one called non-reference allele | | |
het_freq_hwe | float | 1.37e-04 | Expected frequency of heterozygous samples under Hardy-Weinberg equilibrium (See functions.hardy_weinberg_test() in Hail) | | |
p_value_hwe | float | 5.00e-01 | p-value from test of Hardy-Weinberg equilibrium (See functions.hardy_weinberg_test() in Hail) | | |
hwe | AFR,EUR,AMR,EAS,SAS | float | 5.09e-01,…., 5.00e-01 | p-value from test of Hardy-Weinberg equilibrium by population | |
| lowestphwe | float | 5.00e-01 | The lowest value of p-values from test of Hardy-Weinberg equilibrium by population | |
| qcpass | bool | true | True if all conditions in genotype-level QC and variant-level QC are satisfied, otherwise False | |
| qcpass_wo_hwe | bool | true | True if all conditions except the HWE condition in genotype-level QC and variant-level QC are satisfied, otherwise False | |
Column fields in the release matrix table
Structrue | Feature | Type | example | description |
| s | str | SHIP******* | Sample ID |
sample_qc | dp_stats. (mean,stdev,min,max) | float | 2.53e+01 | Read Depth (Mean value, Standard deviation, Minimum value, Maximum value) |
gq_stats. (mean,stdev,min,max) | float | 6.77e+01 | Genotype Quality (Mean value, Standard deviation, Minimum value, Maximum value) | |
call_rate | float | 9.58e-01 | Fraction of calls not missing or filtered | |
n_called | int | 187591690 | Number of non-missing calls | |
n_not_called | int | 0 | Number of missing calls | |
n_filtered | int | 8232804 | Number of filtered entries | |
n_hom_ref | int | 183582501 | Number of homozygous reference calls | |
n_het | int | 2464591 | Number of heterozygous calls | |
n_hom_var | int | 1544598 | Number of homozygous alternate calls | |
n_non_ref | int | 4009189 | Sum of n_het and n_hom_var | |
n_singleton | int | 8276 | Number of private alleles | |
n_snp | int | 4967631 | Number of SNP alternate alleles | |
n_insertion | int | 296715 | Number of insertion alternate alleles | |
n_deletion | int | 289441 | Number of deletion alternate alleles | |
n_transition | int | 3366311 | Number of transition (A-G, C-T) alternate alleles | |
n_transversion | int | 1601320 | Number of transversion alternate alleles | |
n_star | int | 0 | Number of star (upstream deletion) alleles | |
r_ti_tv | float | 2.1e+00 | Transition/Transversion ratio | |
r_het_hom_var | float | 1.6e+00 | Het/HomVar call ratio | |
r_insertion_deletion | float | 1.03e+00 | Insertion/Deletion allele ratio | |
| super_pop | bool | EUR | Population predicted with Random Forest algorithm and 1000 genome data as a reference (AFR – African, EUR - European, SAS – South Asian, EAS-East Asian, AMR-Ad Mixed American) |
| kin | bool | false | True if kinship coefficients are greater than 0.0884, meaning there is a relative pair, otherwise False |
Entry fields in the release matrix table
Structrue | Feature | Type | example | description | memo |
| RGQ | int | 22 | Reference genotype quality | |
| DP | int | 12 | Approximate read depth | |
| GQ | int | 22 | Genotype quality | |
| MIN_DP | int | 12 | Minimum DP observed within the GVCF block | |
| PID | str | - | Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group | No information |
| SB | array<int> | | Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias | |
| GT | call | 0/1 | Genotype | |
| PGT | call | - | Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another | No information |
| AD | array<int> | [29,13] | Allelic depths for the ref and alt alleles in the order listed | |
| PL | array<int> | [273,0,774] | Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification | |
gvcf_info | BaseQRankSum | float | -6.06e-01 | Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities | |
ClippingRankSum | float | 0 | Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases | | |
DS | bool | falsd | Were any of the samples downsampled? | | |
End | int | | Stop position of the interval | | |
ExcessHet | float | 3.01e+00 | Phred-scaled p-value for exact test of excess heterozygosity | | |
InbreedingCoeff | float | - | Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation | No information | |
MLEAC | array<int> | [1,0] | Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed | | |
MLEAF | array<float> | [5.00e-01,0.00e+00] | Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed | | |
MQ | float | - | RMS Mapping Quality | No information | |
MQRankSum | float | -3.55e-01 | Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities | | |
RAW_MQ | float | 2.11e+04 | Raw data for RMS Mapping Quality | | |
ReadPosRankSum | float | -1.20e+00 | Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias | |
Quality control pipeline – Performance & Cost
Process | Output File | ||||||
Steps | Process | Time / Cost | Output | Format | Size | Number of columns (Sample) | Number of rows |
Step1 | Individual genome-level QC | - | gVCF per qualified sample | gVCF | 62.5TB | 10,413 | - |
Aggregating gVCFs | 16h / $1200 | gVCFs-merged | Matrix Table | 29.7TB | 10,413 | 2,900,012,781 | |
Step2 | Pre-processing | 27h / $500 | VCFs-merged | Matrix Table | 8.14TB | 10,413 | 242,156,902 |
Genotype-level QC | 12h / $230 | Qualified entries and variants | Matrix Table | | Filtering 6.67% entries out | ||
Variant-level QC | - | 187,790,701 | |||||
Sample-level QC | 20m / $93 | Qualified samples | Matrix Table | | 10,413 | - | |
Other | Remove problematic samples from other analyzes | - | - | - | | 10,390 | |
Final | Total Process | ~3.5 days / ~$2300 | Release Matrix Table | Matrix Table | 6.74TB | 10,390 | 187,790,701 |