1 of 31

Genome in a Bottle Consortium Updates and Plans for X/Y Benchmark Evaluation

Justin Zook and the GIAB team

December 12, 2022

2 of 31

GIAB samples and reference materials

  • Genome in a Bottle Consortium develops metrology infrastructure for benchmarking human whole genome variant detection
  • Characterization of seven broadly consented human genomes, including two son-mother-father trios, several of which are available as NIST Reference Materials (RMs)

3 of 31

Goals for today

  • Re-convene and update the broad GIAB community

  • Outline what we plan to work on over the next year
    • Evaluate the Chromosome X/Y benchmark, and continue variant call error modeling, the tandem repeat benchmark, the T2T diploid-assembly collaboration, SV benchmarks, tumor/normal samples for somatic benchmarks, and other work we are involved in, such as RNA-seq of GIAB samples

  • Give specific update on Chromosome X/Y small variant benchmark evaluation and upcoming call for volunteers

4 of 31

Modeling sequencing and variant call errors

Nate Dwarshuis, Justin Wagner, et al

5 of 31

SV benchmarks

  1. v0.6 whole genome on GRCh37
  2. Challenging genes on GRCh37/38 and CHM13
  3. Future assembly-based

6 of 31

Tandem repeat benchmarks

  • Collaboration led by Adam English and Fritz Sedlazeck

  • Three main goals:
    • Standard list of tandem repeats
    • Benchmark for variants in tandem repeats >=5bp
    • Improved comparison tools for benchmarking

7 of 31

Somatic/mosaic benchmarks

  • Medical Device Innovation Consortium Somatic Reference Samples
    • Engineering medically-important tumor variants into GIAB cell lines
    • Also developing mosaic benchmarks

  • Tumor/Normal “21st Century Cell lines”
    • Develop matched tumor and normal cell lines with consents similar to existing GIAB samples
    • Initial Illumina and Hi-C data for a pancreatic ductal adenocarcinoma cell line

8 of 31

RNAseq of GIAB samples

  • Recently generated RNA-seq data:
    • Several GIAB lymphoblastoid cell lines (HG002/4/5) and 2 iPSCs for HG002
    • Illumina and PacBio RNA-seq completed
    • ONT RNA-seq in progress

  • Potential analyses include isoforms, variants, gene annotation

  • Current collaborators: Fritz Sedlazeck, Miten Jain, Chris Mason, Andrew Carroll, Jason Merker
    • Collaborations welcome!

9 of 31

HG002 “Q100” Project

  • T2T-GIAB collaboration to create “near-perfect” diploid assembly and associated benchmarks

  • Adam Phillippy’s recent keynote about T2T, including work on HG002
    • https://www.youtube.com/watch?v=KnHeF8Zwbq4
    • Goal is to move beyond reporting “99% accuracy on 90% of the genome”

10 of 31

Specific Update on Chromosome X/Y Benchmark Development

11 of 31

Chromosome XY Benchmark Development

  • The Telomere-to-Telomere (T2T) Consortium generated complete assemblies of the HG002 X and Y chromosomes as part of the first complete human genome
    • First T2T Y chromosome described in new preprint at https://doi.org/10.1101/2022.12.01.518724

  • Using these assemblies to create benchmark regions and benchmark variants for HG002 X and Y

12 of 31

Assembly-Based Draft Benchmark Development Pipeline

Credits: Nate Olson, Jennifer McDaniel, and GIAB team

13 of 31

What we exclude from the assembly-based benchmark

  • Regions without the expected one contig aligned per haplotype
    • Relies on dipcall bed file

  • Large repeats if they are partially aligned
    • Segmental duplications
    • Long VNTRs and satellites
    • Assembly gaps
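Conceptually, the draft benchmark regions are what remains after subtracting these exclusions from the dipcall BED. Below is a minimal, self-contained sketch of that interval subtraction; it is not the GIAB pipeline code, and the file names in the usage comment are hypothetical placeholders.

```python
# Minimal sketch of the exclusion step as BED interval subtraction (this is not
# the GIAB pipeline code; file names in the usage comment are hypothetical).

def read_bed(path):
    """Read a BED file into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def subtract(base, exclude):
    """Remove `exclude` intervals from `base` intervals; both are read_bed() dicts."""
    out = {}
    for chrom, intervals in base.items():
        cuts = sorted(exclude.get(chrom, []))
        kept = []
        for start, end in sorted(intervals):
            cur = start
            for cut_start, cut_end in cuts:
                if cut_end <= cur or cut_start >= end:
                    continue
                if cut_start > cur:
                    kept.append((cur, cut_start))
                cur = max(cur, cut_end)
            if cur < end:
                kept.append((cur, end))
        out[chrom] = kept
    return out

# Hypothetical usage:
# benchmark = read_bed("hg002_xy.dip.bed")  # regions with one aligned contig per haplotype
# for bed in ("partially_aligned_segdups.bed", "long_vntr_satellites.bed", "assembly_gaps.bed"):
#     benchmark = subtract(benchmark, read_bed(bed))
```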

14 of 31

Benchmark Regions

Variants from any method being evaluated are compared to the benchmark variant calls within the benchmark regions:

  • Variants outside benchmark regions are not assessed
  • The majority of variants unique to the evaluated method should be false positives (FPs)
  • The majority of variants unique to the benchmark should be false negatives (FNs)
  • Matching variants are assumed to be true positives

[Figure: design of our human genome reference values, showing the genome, benchmark regions, benchmark variant calls, and variants from a method being evaluated; not to scale (e.g., variants cover <1% of the genome)]

Reliable IDentification of Errors (RIDE) Criteria
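As a rough illustration of this comparison scheme, the sketch below classifies calls as TP, FP, FN, or not assessed using exact position and allele matching; real evaluations use benchmarking tools (e.g., hap.py) that also reconcile differing variant representations. The variant and region data structures are assumptions for the sketch.

```python
# Rough illustration of the comparison scheme above. Calls are matched only by
# exact position and alleles; real evaluations use benchmarking tools (e.g.,
# hap.py) that also reconcile different variant representations. Variants are
# assumed to be (chrom, pos, ref, alt) tuples; regions are read_bed()-style dicts.

def in_regions(variant, regions):
    chrom, pos = variant[0], variant[1]
    return any(start <= pos < end for start, end in regions.get(chrom, []))

def classify(query_calls, benchmark_calls, benchmark_regions):
    query = {v for v in query_calls if in_regions(v, benchmark_regions)}
    truth = {v for v in benchmark_calls if in_regions(v, benchmark_regions)}
    return {
        "TP": query & truth,                       # matching variants, assumed true positives
        "FP": query - truth,                       # unique to the evaluated method
        "FN": truth - query,                       # unique to the benchmark
        "not_assessed": set(query_calls) - query,  # outside the benchmark regions
    }
```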

15 of 31

Evaluation

  • Fitness for purpose is evaluated against a collection of variant sets from different sequencing technologies and variant calling methods

  • Manually curate a subset of FPs and FNs (selected randomly) from each comparison variant set

[Figure: benchmark regions, benchmark variant calls, and variants from a method being evaluated along the genome, as on the previous slide]
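A simple way to draw such a random curation subset from each comparison callset is sketched below; the callset names and subset size are placeholders, not the actual evaluation settings.

```python
import random

# Minimal sketch of randomly selecting a subset of putative FPs and FNs from
# each comparison callset for manual curation. Callset names and the subset
# size are placeholders.
def sample_for_curation(errors_by_callset, n=50, seed=0):
    rng = random.Random(seed)
    return {
        name: rng.sample(sorted(errors), min(n, len(errors)))
        for name, errors in errors_by_callset.items()
    }

# Hypothetical usage, where each error is a (chrom, pos, ref, alt, error_type) record:
# to_curate = sample_for_curation({"illumina_dv": fps_and_fns_ill, "hifi_dv": fps_and_fns_hifi})
```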

16 of 31

Process for independent evaluations

The callset developer curates putative errors:

  • If the benchmark is correct, no further curation is needed
  • If the benchmark appears wrong or questionable, a NIST curator reviews the site:
    • If the NIST curator agrees, the source of the potential error in the benchmark is classified
    • If the NIST curator disagrees, the site is discussed with the callset developer

17 of 31

Estimate confidence intervals for benchmark accuracy by curating differences between Illumina and HiFi callsets and the draft benchmark

[Figure: the draft benchmark regions and draft benchmark variants are compared separately to Illumina and HiFi callsets; matching variants are assumed to be true positives, while putative false positives and false negatives are stratified, sampled, and manually curated]

18 of 31

Estimate confidence intervals for benchmark accuracy by curating differences between Illumina and HiFi callsets and the draft benchmark


Fit for Purpose Goal: >95% confidence that differences between other methods and the benchmark are mostly (>50%) errors in the other methods.
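This goal can be checked with a one-sided exact binomial bound on the curated sample. The sketch below is illustrative only (it assumes scipy is available and uses made-up counts); it is not the consortium's evaluation code.

```python
from scipy.stats import beta

# Illustrative check of the fit-for-purpose goal above. Given n curated
# disagreements of which k were judged to be errors in the other method
# (i.e., the benchmark was correct), compute a one-sided 95% lower
# Clopper-Pearson bound on the fraction where the benchmark is correct.
def lower_bound_benchmark_correct(k, n, alpha=0.05):
    if k == 0:
        return 0.0
    return beta.ppf(alpha, k, n - k + 1)

# Hypothetical counts: 40 of 50 curated disagreements were errors in the
# other method. The goal is met if the lower bound exceeds 0.5.
print(lower_bound_benchmark_correct(40, 50) > 0.5)
```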

19 of 31

Approach: Use Active Evaluation to Provide Confidence Intervals and Address Curation

Active Evaluation: by carefully choosing which parts of the unlabeled test set to obtain labels for, we can minimize both the labeling effort and the measurement uncertainty.

Population: The entire set of disagreements between the benchmark and the systems

Sample: The specific variants that were human curated against the benchmark

Metric measured: The fraction of curations where the benchmark agrees with the human-curated labels (Accuracy)

Strategy: Partition the variants into strata and use stratified sampling both to provide a confidence interval on the accuracy over the unlabeled test set and to recommend additional variants for human curation. We assume the fraction of agreement follows a binomial distribution within each stratum (sketched below).

Software: Preliminary Version available at https://github.com/usnistgov/active-evaluation
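The following is a minimal sketch of this strategy under the stated binomial assumption, not the usnistgov/active-evaluation implementation; the per-stratum totals, curated counts, and the rule for choosing which stratum to curate next are illustrative assumptions.

```python
import math

# Minimal sketch of the active-evaluation strategy described above (not the
# usnistgov/active-evaluation code). Within each stratum, the number of curated
# disagreements where the benchmark was judged correct is treated as binomial;
# a Wilson interval gives a per-stratum confidence interval, strata are combined
# weighted by their total sizes, and the stratum contributing the most
# uncertainty is suggested for further curation. All counts below are made up.

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion k/n."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def stratified_summary(strata):
    """strata: {name: {"total": N in stratum, "curated": n, "benchmark_correct": k}}"""
    grand_total = sum(s["total"] for s in strata.values())
    overall = 0.0
    report = {}
    for name, s in strata.items():
        k, n = s["benchmark_correct"], s["curated"]
        p = k / n if n else 0.5
        lo, hi = wilson_interval(k, n)
        overall += s["total"] / grand_total * p
        report[name] = {"estimate": p, "ci": (lo, hi), "ci_width": hi - lo}
    # Suggest curating more variants in the stratum whose uncertainty, weighted
    # by its size, contributes most to the overall estimate.
    next_stratum = max(report, key=lambda name: report[name]["ci_width"] * strata[name]["total"])
    return overall, report, next_stratum

# Hypothetical example with made-up counts:
strata = {
    "FP_SNP_outPos": {"total": 900, "curated": 30, "benchmark_correct": 28},
    "FP_Indel_ATL": {"total": 400, "curated": 20, "benchmark_correct": 9},
}
print(stratified_summary(strata))
```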

20 of 31

Approach: Use Active Evaluation to Provide Confidence Intervals and Systematize Curation

  • Obtain strong evidence that the benchmark is more than 50% correct for cases where other systems disagree with the benchmark (i.e., putative false positives and false negatives)

  • Identify specific subregions where the benchmark disagrees with both human labels and other systems more than 50% of the time

  • Focus additional sampling on regions where we are unsure of the above two items

21 of 31

Approach: Use Active Evaluation to Provide Confidence Intervals and Address Curation

  1. Take a small initial sample

  2. Analyze different stratifications to separate regions where the benchmark has high and low accuracy

  3. Identify a reasonable stratification and then collect additional samples

  4. Provide confidence intervals

22 of 31

Use Stratifications to Sample Curations

Strata: false positive (FP) and false negative (FN) SNPs and INDELs, with SNPs split by whether they fall inside or outside a small region with many FPs (ChrY:1102000-11600000), and INDELs split by whether they fall in AT homopolymers longer than 30 bp (used to decide whether such regions should be excluded from the benchmark)

- FP_SNP_Pos: False Positive SNPs in ChrY:1102000-11600000

- FP_SNP_outPos: False Positive SNPs not in ChrY:1102000-11600000

- FN_SNP: False Negative SNPs

- FP_Indel_ATL: False Positive INDELs in an AT homopolymer longer than 30bp

- FN_Indel_ATL: False Negative INDELs in an AT homopolymer longer than 30bp

- FP_Indel_ATS: False Positive INDELs not in an AT homopolymer longer than 30bp

- FN_Indel_ATS: False Negative INDELs not in an AT homopolymer longer than 30bp
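As an illustration, a curated disagreement could be assigned to one of these strata roughly as follows. The function arguments (error type, variant type, position, and the length of any AT homopolymer containing the variant) are assumed annotations; the chrY interval is taken from the slide.

```python
# Illustrative assignment of a curated disagreement to one of the strata listed above.
FP_REGION = ("chrY", 1102000, 11600000)

def stratum(error_type, variant_type, chrom, pos, at_homopolymer_len=0):
    if variant_type == "SNP":
        if error_type == "FN":
            return "FN_SNP"
        in_region = chrom == FP_REGION[0] and FP_REGION[1] <= pos < FP_REGION[2]
        return "FP_SNP_Pos" if in_region else "FP_SNP_outPos"
    # INDELs are split by whether they fall in an AT homopolymer longer than 30 bp
    suffix = "ATL" if at_homopolymer_len > 30 else "ATS"
    return "{}_Indel_{}".format(error_type, suffix)
```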

23 of 31

Identifying poor performing regions

24 of 31

Evaluation Using Illumina and HiFi Callsets

Strata consisted of SNVs, INDELs, the small region with many false positives, homopolymers longer than 20 bp, pseudoautosomal regions, and large repetitive sequences

[Results figure omitted; only FPs and FNs were curated]

25 of 31

26 of 31

Draft Small Variant Benchmark Characteristics

  • Draft benchmark statistics:
    • Bases in benchmark regions: 163,505,305
    • SNVs: 88,772
    • INDELs: 25,461

27 of 31

Draft Small Variant Benchmark Example Comparison

  • Example preliminary performance metrics from HiFi-DeepVariant vs. draft X/Y benchmark

    • SNVs: precision 97.3%, recall 98.6%
    • INDELs: precision 88.0%, recall 90.4%
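For context, precision and recall follow the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), with true positives, false positives, and false negatives counted against the draft benchmark within the benchmark regions.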

28 of 31

Draft Small Variant Benchmark Example Comparison (HiFi FN SNVs)

29 of 31

Draft Small Variant Benchmark Example Comparison (HiFi FN SNVs)

30 of 31

Some regions are too complex for benchmarking tools

  • Region containing DAZ1/DAZ2/DAZ3/DAZ4 excluded due to complexity

31 of 31

Plan for XY Benchmark Evaluation

  • XY Benchmark Evaluation: Requesting external collaborators submit high-quality X/Y callsets from a variety of technologies and variant calling methods

  • GIAB team will…
    • compare each submitted callset to the draft benchmark
    • select FPs and FNs using the active evaluation framework
    • send ~100 variants that disagree between the draft benchmark and the comparison callsets
    • request that you curate those sites to identify whether the draft benchmark is correct

  • Stay tuned to the GIAB analysis team Google Group for this request in the next week or so