VCF format, VCF toolkits, Hardy-Weinberg Equilibrium
Applied Computational Genomics
https://github.com/quinlan-lab/applied-computational-genomics
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
Variant Calling Overview
FASTQ
Align
(BWA)
BAM
Detect SNP/INDELs
(GATK or FreeBayes)
VCF
VCF format
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf
VCF format
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf
VCF format. A basic example
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LF1396
chr7 117175373 . A G 90 PASS AF=0.5 GT 0/1
Heterozygous A/G. The REF allele is allele "0", ALT is allele "1"
Genotypes
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LF1396
chr7 117175373 . A G 90 PASS AF=0.0 GT 0/0
chr7 117175373 . A G 90 PASS AF=0.5 GT 0/1
chr7 117175373 . A G 90 PASS AF=1.0 GT 1/1
chr7 117175373 . A G 0 PASS AF=0.0 GT ./.
Hom. Ref.
Het.
Hom. Alt.
Unknown
Why would a genotype be unknown?
Multi-sample VCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID
chr7 2194169 . C T 210 PASS AF=0.5 GT 0/1 0/1
Heterozygous C/T.
Mom
Kid
VCF format example
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf
##fileformat=VCFv4.3
##fileDate=20090805
##source=variantcallerXYZ
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM DAD KID
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
A/A
A/G
A/G
Let's look at a real VCF file from a mom,dad,kid "trio"
curl https://s3.amazonaws.com/gemini-tutorials/trio.trim.vep.vcf.gz \
| zcat > trio.vcf
"Phased" genotypes
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID
chr7 2194169 . C T 210 PASS AF=0.5 GT 0/1 0/1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID
chr7 2194169 . C T 210 PASS AF=0.5 GT 0|1 0|1
Phased genotype
"Phased" genotypes distinguish haplotypes
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID
chr7 2194169 . C T 210 PASS AF=0.5 GT 0|1 0|1
Phased genotype
♂
♀
ATCGGGTACCATCCAATCATTACC
ATCGGGCACCATCCAATCATTACC
♂
♀
Phasing by inheritance
This is an example of a compound heterozygote. Hets where the alt allele comes from different parents at two different loci
A/A
A/G
A/G
C/T
T/T
C/T
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM DAD KID
chr1 100000 . A G 95 PASS AF=0.33 GT 0/1 0/0 1|0
chr1 200000 . T C 99 PASS AF=0.33 GT 0/0 0/1 0|1
100000
200000
chr1
A
T
Read-backed phasing: evidence of haplotype in read or pair
De novo mutation in the kid is linked to dad’s haplotype
Dad
Mom
Kid
bcftools: http://www.htslib.org/doc/bcftools.html
Bcftools examples: http://samtools.github.io/bcftools/howtos/index.html
Bcftools examples: http://samtools.github.io/bcftools/howtos/index.html
ExAC: exome sequencing of 60,706 humans
1 variant every 8 bases among 60,706 humans! Also, >50% of the 9 million variants discovered were present as a heterozygote in 1 out of the 60706 individuals (a "singleton")!!!
http://www.nature.com/nature/journal/v536/n7616/pdf/nature19057.pdf
Allele frequency spectrum: most variants are rare
http://www.nature.com/nature/journal/v536/n7616/pdf/nature19057.pdf
>50% of the 9 million variants discovered were present as a heterozygote in 1 out of the 60706 individuals (a "singleton"). >90% present on <=10 chromosomes sampled
Huge excess of rare variation from recent, rapid population growth
http://science.sciencemag.org/content/336/6082/740/
Hardy-Weinberg Equilibrium (the binomial!)
Polymorphic loci that are biallelic (e.g., A and G alleles) have two allele frequencies, p and q.
f(A) = p = 4/6 = 0.67
f(G) = q = 2/6 = 0.33
p + q = 1
A
A
G
A
G
A
Hardy-Weinberg Equilibrium
In the absence of evolutionary forces such as selection, drift, or bottlenecks, Hardy–Weinberg equilibrium states that allele and genotype frequencies in a population will remain constant from generation to generation.
If we know the allele frequencies, p and q, we can predict the genotype frequencies that should be observed (binomial expectation).
Observed allele frequencies
f(A) = p = 4/6 = 0.67
f(G) = q = 2/6 = 0.33
https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle#Deviations_from_Hardy.E2.80.93Weinberg_equilibrium
Expected genotype frequencies, given the observed allele frequencies
f(AA) = p2 = (0.67)2 = 0.4489
f(AG) = 2pq = 2(0.67)(0.33) = 0.4422
f(GG) = q2 = (0.33)2 = 0.1089
p2 + 2pq + q2 = 1
Hardy-Weinberg Equilibrium: expected genotype freqs
https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle#Deviations_from_Hardy.E2.80.93Weinberg_equilibrium
f(AA) = p2 = (0.5)2 = 0.25
f(AG) = 2pq = 2(0.5)(0.5) = 0.5
f(GG) = q2 = (0.5)2 = 0.25
p = 0.5, q = 0.5
f(AA) = p2 = (0.1)2 = 0.01
f(AG) = 2pq = 2(0.1)(0.9) = 0.18
f(GG) = q2 = (0.9)2 = 0.81
p = 0.1, q = 0.9
f(AA) = p2 = (0.01)2 = 0.0001
f(AG) = 2pq = 2(0.01)(0.99) = 0.0198
f(GG) = q2 = (0.99)2 = 0.9801
p = 0.01, q = 0.99
f(AA) = p2 = (0.001)2 = 0.000001
f(AG) = 2pq = 2(0.001)(0.999) = 0.001998
f(GG) = q2 = (0.999)2 = 0.998001
p = 0.001, q = 0.999
Hardy-Weinberg Equilibrium
https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle#Deviations_from_Hardy.E2.80.93Weinberg_equilibrium
f(AA) = p2 = (0.1)2 = 0.01
f(AG) = 2pq = 2(0.1)(0.9) = 0.18
f(GG) = q2 = (0.9)2 = 0.81
p = 0.1, q = 0.9
If we sequenced 100 individuals, how many A/G heterozygotes would we expect?
How many A/A homozygotes?
Hardy-Weinberg Equilibrium - example
The frequency of allele "Z" at a given locus on chromosome 7 is 0.3. What proportion of individuals do we expect to be heterozygous for the Z and Q alleles?
Deviations from Hardy-Weinberg Equilibrium
http://www.nature.com/scitable/knowledge/library/the-hardy-weinberg-principle-13235724
Example: a recessive, disease causing allele.
Expect p2 homozygotes for the recessive allele, yet observe significantly less than p2 * the number of individuals tested
Hardy-Weinberg Equilibrium - example
In a population that is in Hardy-Weinberg equilibrium, the frequency of the recessive homozygote genotype of a certain trait is 0.16. Calculate the percentage of individuals homozygous for the dominant
allele.
http://www.houstonisd.org/cms/lib2/TX01001591/Centricity/Domain/5363/Hardy%20Weinberg%20Problem%20Set%20KEY.pdf