1 of 26

VCF format, VCF toolkits, Hardy-Weinberg Equilibrium

Applied Computational Genomics

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 26

Variant Calling Overview

FASTQ

Align

(BWA)

BAM

Detect SNP/INDELs

(GATK or FreeBayes)

VCF

3 of 26

VCF format

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf

4 of 26

VCF format

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf

5 of 26

VCF format. A basic example

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LF1396

chr7 117175373 . A G 90 PASS AF=0.5 GT 0/1

Heterozygous A/G. The REF allele is allele "0", ALT is allele "1"

6 of 26

Genotypes

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LF1396

chr7 117175373 . A G 90 PASS AF=0.0 GT 0/0

chr7 117175373 . A G 90 PASS AF=0.5 GT 0/1

chr7 117175373 . A G 90 PASS AF=1.0 GT 1/1

chr7 117175373 . A G 0 PASS AF=0.0 GT ./.

Hom. Ref.

Het.

Hom. Alt.

Unknown

Why would a genotype be unknown?

7 of 26

Multi-sample VCF

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID

chr7 2194169 . C T 210 PASS AF=0.5 GT 0/1 0/1

Heterozygous C/T.

Mom

Kid

8 of 26

VCF format example

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/pdf/btr330.pdf

##fileformat=VCFv4.3

##fileDate=20090805

##source=variantcallerXYZ

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

##contig=<ID=20,length=62435964,assembly=B36,md5=379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM DAD KID

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3

20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4

20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

A/A

A/G

A/G

9 of 26

Let's look at a real VCF file from a mom,dad,kid "trio"

10 of 26

"Phased" genotypes

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID

chr7 2194169 . C T 210 PASS AF=0.5 GT 0/1 0/1

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID

chr7 2194169 . C T 210 PASS AF=0.5 GT 0|1 0|1

Phased genotype

11 of 26

"Phased" genotypes distinguish haplotypes

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM KID

chr7 2194169 . C T 210 PASS AF=0.5 GT 0|1 0|1

Phased genotype

ATCGGGTACCATCCAATCATTACC

ATCGGGCACCATCCAATCATTACC

12 of 26

Phasing by inheritance

This is an example of a compound heterozygote. Hets where the alt allele comes from different parents at two different loci

A/A

A/G

A/G

C/T

T/T

C/T

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOM DAD KID

chr1 100000 . A G 95 PASS AF=0.33 GT 0/1 0/0 1|0

chr1 200000 . T C 99 PASS AF=0.33 GT 0/0 0/1 0|1

100000

200000

chr1

A

T

13 of 26

Read-backed phasing: evidence of haplotype in read or pair

De novo mutation in the kid is linked to dad’s haplotype

Dad

Mom

Kid

14 of 26

bcftools: http://www.htslib.org/doc/bcftools.html

15 of 26

Bcftools examples: http://samtools.github.io/bcftools/howtos/index.html

16 of 26

Bcftools examples: http://samtools.github.io/bcftools/howtos/index.html

17 of 26

ExAC: exome sequencing of 60,706 humans

1 variant every 8 bases among 60,706 humans! Also, >50% of the 9 million variants discovered were present as a heterozygote in 1 out of the 60706 individuals (a "singleton")!!!

http://www.nature.com/nature/journal/v536/n7616/pdf/nature19057.pdf

18 of 26

Allele frequency spectrum: most variants are rare

http://www.nature.com/nature/journal/v536/n7616/pdf/nature19057.pdf

>50% of the 9 million variants discovered were present as a heterozygote in 1 out of the 60706 individuals (a "singleton"). >90% present on <=10 chromosomes sampled

19 of 26

Huge excess of rare variation from recent, rapid population growth

http://science.sciencemag.org/content/336/6082/740/

20 of 26

Hardy-Weinberg Equilibrium (the binomial!)

Polymorphic loci that are biallelic (e.g., A and G alleles) have two allele frequencies, p and q.

f(A) = p = 4/6 = 0.67

f(G) = q = 2/6 = 0.33

p + q = 1

A

A

G

A

G

A

21 of 26

Hardy-Weinberg Equilibrium

In the absence of evolutionary forces such as selection, drift, or bottlenecks, Hardy–Weinberg equilibrium states that allele and genotype frequencies in a population will remain constant from generation to generation.

If we know the allele frequencies, p and q, we can predict the genotype frequencies that should be observed (binomial expectation).

Observed allele frequencies

f(A) = p = 4/6 = 0.67

f(G) = q = 2/6 = 0.33

https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle#Deviations_from_Hardy.E2.80.93Weinberg_equilibrium

Expected genotype frequencies, given the observed allele frequencies

f(AA) = p2 = (0.67)2 = 0.4489

f(AG) = 2pq = 2(0.67)(0.33) = 0.4422

f(GG) = q2 = (0.33)2 = 0.1089

p2 + 2pq + q2 = 1

22 of 26

Hardy-Weinberg Equilibrium: expected genotype freqs

https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle#Deviations_from_Hardy.E2.80.93Weinberg_equilibrium

f(AA) = p2 = (0.5)2 = 0.25

f(AG) = 2pq = 2(0.5)(0.5) = 0.5

f(GG) = q2 = (0.5)2 = 0.25

p = 0.5, q = 0.5

f(AA) = p2 = (0.1)2 = 0.01

f(AG) = 2pq = 2(0.1)(0.9) = 0.18

f(GG) = q2 = (0.9)2 = 0.81

p = 0.1, q = 0.9

f(AA) = p2 = (0.01)2 = 0.0001

f(AG) = 2pq = 2(0.01)(0.99) = 0.0198

f(GG) = q2 = (0.99)2 = 0.9801

p = 0.01, q = 0.99

f(AA) = p2 = (0.001)2 = 0.000001

f(AG) = 2pq = 2(0.001)(0.999) = 0.001998

f(GG) = q2 = (0.999)2 = 0.998001

p = 0.001, q = 0.999

23 of 26

Hardy-Weinberg Equilibrium

https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle#Deviations_from_Hardy.E2.80.93Weinberg_equilibrium

f(AA) = p2 = (0.1)2 = 0.01

f(AG) = 2pq = 2(0.1)(0.9) = 0.18

f(GG) = q2 = (0.9)2 = 0.81

p = 0.1, q = 0.9

If we sequenced 100 individuals, how many A/G heterozygotes would we expect?

How many A/A homozygotes?

24 of 26

Hardy-Weinberg Equilibrium - example

The frequency of allele "Z" at a given locus on chromosome 7 is 0.3. What proportion of individuals do we expect to be heterozygous for the Z and Q alleles?

25 of 26

Deviations from Hardy-Weinberg Equilibrium

http://www.nature.com/scitable/knowledge/library/the-hardy-weinberg-principle-13235724

  • Inbreeding
  • Population bottlenecks
  • Positive selection
  • Purifying selection
  • Random genetic drift

Example: a recessive, disease causing allele.

Expect p2 homozygotes for the recessive allele, yet observe significantly less than p2 * the number of individuals tested

26 of 26

Hardy-Weinberg Equilibrium - example

In a population that is in Hardy-Weinberg equilibrium, the frequency of the recessive homozygote genotype of a certain trait is 0.16. Calculate the percentage of individuals homozygous for the dominant

allele.

http://www.houstonisd.org/cms/lib2/TX01001591/Centricity/Domain/5363/Hardy%20Weinberg%20Problem%20Set%20KEY.pdf