1 of 31

Genome in a Bottle Consortium Updates and Plans for X/Y Benchmark Evaluation

Justin Zook and the GIAB team

December 12, 2022

2 of 31

GIAB samples and reference materials

  • Genome in a Bottle Consortium develops metrology infrastructure for benchmarking human whole genome variant detection
  • Characterization of seven broadly consented human genomes, including two son-mother-father trios, several of which are available as NIST Reference Materials (RMs)

3 of 31

Goals for today

  • Re-convene and update the broad GIAB community

  • Outline what we plan to work on over the next year
    • Evaluate the Chromosome X/Y benchmark, and continue variant call error modeling, the tandem repeat benchmark, the T2T diploid-assembly collaboration, SV benchmarks, tumor/normal samples for somatic benchmarks, and other work we are involved in, such as RNA-seq of GIAB samples

  • Give specific update on Chromosome X/Y small variant benchmark evaluation and upcoming call for volunteers

4 of 31

Modeling sequencing and variant call errors

Nate Dwarshuis, Justin Wagner, et al

5 of 31

SV benchmarks

  1. v0.6 whole genome on GRCh37
  2. Challenging genes on GRCh37/38 and CHM13
  3. Future assembly-based

6 of 31

Tandem repeat benchmarks

  • Collaboration led by Adam English and Fritz Sedlazeck

  • Three main goals:
    • Standard list of tandem repeats
    • Benchmark for variants in tandem repeats >=5bp
    • Improved comparison tools for benchmarking

7 of 31

Somatic/mosaic benchmarks

  • Medical Device Innovation Consortium Somatic Reference Samples
    • Engineering medically-important tumor variants into GIAB cell lines
    • Also developing mosaic benchmarks

  • Tumor/Normal “21st Century Cell lines”
    • Develop matched tumor and normal cell lines with consents similar to existing GIAB samples
    • Initial Illumina and Hi-C data for a pancreatic ductal adenocarcinoma cell line

8 of 31

RNAseq of GIAB samples

  • Recently generated RNA-seq data:
    • Several GIAB lymphoblastoid cell lines (HG002/4/5) and 2 iPSCs for HG002
    • Illumina and PacBio RNA-seq completed
    • ONT RNA-seq in progress

  • Potential analyses include isoforms, variants, gene annotation

  • Current collaborators: Fritz Sedlazeck, Miten Jain, Chris Mason, Andrew Carroll, Jason Merker
    • Collaborations welcome!

9 of 31

HG002 “Q100” Project

  • T2T-GIAB collaboration to create “near-perfect” diploid assembly and associated benchmarks

  • Adam Phillippy’s recent keynote about T2T, including work on HG002
    • https://www.youtube.com/watch?v=KnHeF8Zwbq4
    • Goal is to move beyond reporting “99% accuracy on 90% of the genome”

10 of 31

Specific Update on Chromosome X/Y Benchmark Development

11 of 31

Chromosome XY Benchmark Development

  • The Telomere-to-Telomere (T2T) Consortium generated complete assemblies of the HG002 X and Y chromosomes as part of the first complete human genome
    • First T2T Y chromosome described in new preprint at https://doi.org/10.1101/2022.12.01.518724

  • Using these assemblies to create benchmark regions and benchmark variants for HG002 X and Y

12 of 31

Assembly-Based Draft Benchmark Development Pipeline

Credits: Nate Olson, Jennifer McDaniel, and GIAB team

13 of 31

What we exclude from the assembly-based benchmark

  • Regions without the expected one contig aligned per haplotype
    • Relies on dipcall bed file

  • Large repeats if they are partially aligned
    • Segmental duplications
    • Long VNTRs and satellites
    • Assembly gaps
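Conceptually, the draft benchmark regions are what remains after subtracting these exclusions from the dipcall BED. Below is a minimal, self-contained sketch of that interval subtraction; it is not the GIAB pipeline code, and the file names in the usage comment are hypothetical placeholders.

```python
# Minimal sketch of the exclusion step as BED interval subtraction (this is not
# the GIAB pipeline code; file names in the usage comment are hypothetical).

def read_bed(path):
    """Read a BED file into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def subtract(base, exclude):
    """Remove `exclude` intervals from `base` intervals; both are read_bed() dicts."""
    out = {}
    for chrom, intervals in base.items():
        cuts = sorted(exclude.get(chrom, []))
        kept = []
        for start, end in sorted(intervals):
            cur = start
            for cut_start, cut_end in cuts:
                if cut_end <= cur or cut_start >= end:
                    continue
                if cut_start > cur:
                    kept.append((cur, cut_start))
                cur = max(cur, cut_end)
            if cur < end:
                kept.append((cur, end))
        out[chrom] = kept
    return out

# Hypothetical usage:
# benchmark = read_bed("hg002_xy.dip.bed")  # regions with one aligned contig per haplotype
# for bed in ("partially_aligned_segdups.bed", "long_vntr_satellites.bed", "assembly_gaps.bed"):
#     benchmark = subtract(benchmark, read_bed(bed))
```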

14 of 31

Benchmark Regions

Variants from any method being evaluated are compared to the benchmark variant calls within the benchmark regions:

  • Variants outside benchmark regions are not assessed
  • The majority of variants unique to the evaluated method should be false positives (FPs)
  • The majority of variants unique to the benchmark should be false negatives (FNs)
  • Matching variants are assumed to be true positives

[Figure: design of our human genome reference values, showing the genome, benchmark regions, benchmark variant calls, and variants from a method being evaluated; not to scale (e.g., variants cover <1% of the genome)]

Reliable IDentification of Errors (RIDE) Criteria
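As a rough illustration of this comparison scheme, the sketch below classifies calls as TP, FP, FN, or not assessed using exact position and allele matching; real evaluations use benchmarking tools (e.g., hap.py) that also reconcile differing variant representations. The variant and region data structures are assumptions for the sketch.

```python
# Rough illustration of the comparison scheme above. Calls are matched only by
# exact position and alleles; real evaluations use benchmarking tools (e.g.,
# hap.py) that also reconcile different variant representations. Variants are
# assumed to be (chrom, pos, ref, alt) tuples; regions are read_bed()-style dicts.

def in_regions(variant, regions):
    chrom, pos = variant[0], variant[1]
    return any(start <= pos < end for start, end in regions.get(chrom, []))

def classify(query_calls, benchmark_calls, benchmark_regions):
    query = {v for v in query_calls if in_regions(v, benchmark_regions)}
    truth = {v for v in benchmark_calls if in_regions(v, benchmark_regions)}
    return {
        "TP": query & truth,                       # matching variants, assumed true positives
        "FP": query - truth,                       # unique to the evaluated method
        "FN": truth - query,                       # unique to the benchmark
        "not_assessed": set(query_calls) - query,  # outside the benchmark regions
    }
```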

15 of 31

Evaluation

  • Fitness for purpose is evaluated against a collection of variant sets from different sequencing technologies and variant calling methods

  • Manually curate a subset of FPs and FNs (selected randomly) from each comparison variant set

[Figure: benchmark regions, benchmark variant calls, and variants from a method being evaluated along the genome, as on the previous slide]
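A simple way to draw such a random curation subset from each comparison callset is sketched below; the callset names and subset size are placeholders, not the actual evaluation settings.

```python
import random

# Minimal sketch of randomly selecting a subset of putative FPs and FNs from
# each comparison callset for manual curation. Callset names and the subset
# size are placeholders.
def sample_for_curation(errors_by_callset, n=50, seed=0):
    rng = random.Random(seed)
    return {
        name: rng.sample(sorted(errors), min(n, len(errors)))
        for name, errors in errors_by_callset.items()
    }

# Hypothetical usage, where each error is a (chrom, pos, ref, alt, error_type) record:
# to_curate = sample_for_curation({"illumina_dv": fps_and_fns_ill, "hifi_dv": fps_and_fns_hifi})
```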

16 of 31

Process for independent evaluations

The callset developer curates putative errors:

  • If the benchmark is correct, no further curation is needed
  • If the benchmark appears wrong or questionable, a NIST curator reviews the site:
    • If the NIST curator agrees, the source of the potential error in the benchmark is classified
    • If the NIST curator disagrees, the site is discussed with the callset developer

17 of 31

Estimate confidence intervals for benchmark accuracy by curating differences between Illumina and HiFi callsets and the draft benchmark

[Figure: the draft benchmark regions and draft benchmark variants are compared separately to Illumina and HiFi callsets; matching variants are assumed to be true positives, while putative false positives and false negatives are stratified, sampled, and manually curated]

18 of 31

Estimate confidence intervals for benchmark accuracy by curating differences between Illumina and HiFi callsets and the draft benchmark


Fit for Purpose Goal: >95% confidence that differences between other methods and the benchmark are mostly (>50%) errors in the other methods.
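This goal can be checked with a one-sided exact binomial bound on the curated sample. The sketch below is illustrative only (it assumes scipy is available and uses made-up counts); it is not the consortium's evaluation code.

```python
from scipy.stats import beta

# Illustrative check of the fit-for-purpose goal above. Given n curated
# disagreements of which k were judged to be errors in the other method
# (i.e., the benchmark was correct), compute a one-sided 95% lower
# Clopper-Pearson bound on the fraction where the benchmark is correct.
def lower_bound_benchmark_correct(k, n, alpha=0.05):
    if k == 0:
        return 0.0
    return beta.ppf(alpha, k, n - k + 1)

# Hypothetical counts: 40 of 50 curated disagreements were errors in the
# other method. The goal is met if the lower bound exceeds 0.5.
print(lower_bound_benchmark_correct(40, 50) > 0.5)
```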

19 of 31

Approach: Use Active Evaluation to Provide Confidence Intervals and Address Curation

Active Evaluation: by carefully choosing which parts of the unlabeled test set to obtain labels for, we can minimize both the labeling effort and the measurement uncertainty.

Population: The entire set of disagreements between the benchmark and the systems

Sample: The specific variants that were human curated against the benchmark

Metric measured: The fraction of curations where the benchmark agrees with the human-curated labels (Accuracy)

Strategy: Partition the variants into strata and use stratified sampling both to provide a confidence interval on the accuracy over the unlabeled test set and to recommend additional variants for human curation. We assume the fraction of agreement follows a binomial distribution within each stratum (sketched below).

Software: Preliminary Version available at https://github.com/usnistgov/active-evaluation
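The following is a minimal sketch of this strategy under the stated binomial assumption, not the usnistgov/active-evaluation implementation; the per-stratum totals, curated counts, and the rule for choosing which stratum to curate next are illustrative assumptions.

```python
import math

# Minimal sketch of the active-evaluation strategy described above (not the
# usnistgov/active-evaluation code). Within each stratum, the number of curated
# disagreements where the benchmark was judged correct is treated as binomial;
# a Wilson interval gives a per-stratum confidence interval, strata are combined
# weighted by their total sizes, and the stratum contributing the most
# uncertainty is suggested for further curation. All counts below are made up.

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion k/n."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def stratified_summary(strata):
    """strata: {name: {"total": N in stratum, "curated": n, "benchmark_correct": k}}"""
    grand_total = sum(s["total"] for s in strata.values())
    overall = 0.0
    report = {}
    for name, s in strata.items():
        k, n = s["benchmark_correct"], s["curated"]
        p = k / n if n else 0.5
        lo, hi = wilson_interval(k, n)
        overall += s["total"] / grand_total * p
        report[name] = {"estimate": p, "ci": (lo, hi), "ci_width": hi - lo}
    # Suggest curating more variants in the stratum whose uncertainty, weighted
    # by its size, contributes most to the overall estimate.
    next_stratum = max(report, key=lambda name: report[name]["ci_width"] * strata[name]["total"])
    return overall, report, next_stratum

# Hypothetical example with made-up counts:
strata = {
    "FP_SNP_outPos": {"total": 900, "curated": 30, "benchmark_correct": 28},
    "FP_Indel_ATL": {"total": 400, "curated": 20, "benchmark_correct": 9},
}
print(stratified_summary(strata))
```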

20 of 31

Approach: Use Active Evaluation to Provide Confidence Intervals and Systematize Curation

  • Obtain strong evidence that the benchmark is more than 50% correct for cases where other systems disagree with the benchmark (i.e., putative false positives and false negatives)

  • Identify specific subregions where the benchmark disagrees with both human labels and other systems more than 50% of the time

  • Focus additional sampling on regions where we are unsure of the above two items

21 of 31

Approach: Use Active Evaluation to Provide Confidence Intervals and Address Curation

  1. Take a small initial sample

  2. Analyze different stratifications to separate regions where the benchmark has high and low accuracy

  3. Identify a reasonable stratification and then collect additional samples

  4. Provide confidence intervals

22 of 31

Use Stratifications to Sample Curations

Strata: false positive (FP) and false negative (FN) SNPs and INDELs, with SNPs split by whether they fall inside or outside a small region with many FPs (ChrY:1102000-11600000), and INDELs split by whether they fall in AT homopolymers longer than 30 bp (used to decide whether such regions should be excluded from the benchmark)

- FP_SNP_Pos: False Positive SNPs in ChrY:1102000-11600000

- FP_SNP_outPos: False Positive SNPs not in ChrY:1102000-11600000

- FN_SNP: False Negative SNPs

- FP_Indel_ATL: False Positive INDELs in an AT homopolymer longer than 30bp

- FN_Indel_ATL: False Negative INDELs in an AT homopolymer longer than 30bp

- FP_Indel_ATS: False Positive INDELs not in an AT homopolymer longer than 30bp

- FN_Indel_ATS: False Negative INDELs not in an AT homopolymer longer than 30bp
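As an illustration, a curated disagreement could be assigned to one of these strata roughly as follows. The function arguments (error type, variant type, position, and the length of any AT homopolymer containing the variant) are assumed annotations; the chrY interval is taken from the slide.

```python
# Illustrative assignment of a curated disagreement to one of the strata listed above.
FP_REGION = ("chrY", 1102000, 11600000)

def stratum(error_type, variant_type, chrom, pos, at_homopolymer_len=0):
    if variant_type == "SNP":
        if error_type == "FN":
            return "FN_SNP"
        in_region = chrom == FP_REGION[0] and FP_REGION[1] <= pos < FP_REGION[2]
        return "FP_SNP_Pos" if in_region else "FP_SNP_outPos"
    # INDELs are split by whether they fall in an AT homopolymer longer than 30 bp
    suffix = "ATL" if at_homopolymer_len > 30 else "ATS"
    return "{}_Indel_{}".format(error_type, suffix)
```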

23 of 31

Identifying poor performing regions

24 of 31

Evaluation Using Illumina and HiFi Callsets

Strata consisted of SNVs, INDELs, the small region with many false positives, homopolymers longer than 20 bp, pseudoautosomal regions, and large repetitive sequences

[Results figure omitted; only FPs and FNs were curated]

25 of 31

26 of 31

Draft Small Variant Benchmark Characteristics

  • Draft benchmark statistics:
    • Bases in benchmark regions: 163,505,305
    • SNVs: 88,772
    • INDELs: 25,461

27 of 31

Draft Small Variant Benchmark Example Comparison

  • Example preliminary performance metrics from HiFi-DeepVariant vs. draft X/Y benchmark

    • SNVs: precision 97.3%, recall 98.6%
    • INDELs: precision 88.0%, recall 90.4%
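For context, precision and recall follow the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), with true positives, false positives, and false negatives counted against the draft benchmark within the benchmark regions.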

28 of 31

Draft Small Variant Benchmark Example Comparison (HiFi FN SNVs)

29 of 31

Draft Small Variant Benchmark Example Comparison (HiFi FN SNVs)

30 of 31

Some regions are too complex for benchmarking tools

  • Region containing DAZ1/DAZ2/DAZ3/DAZ4 excluded due to complexity

31 of 31

Plan for XY Benchmark Evaluation

  • XY Benchmark Evaluation: Requesting external collaborators submit high-quality X/Y callsets from a variety of technologies and variant calling methods

  • GIAB team will…
    • compare each submitted callset to the draft benchmark
    • select FPs and FNs using the active evaluation framework
    • send ~100 variants that disagree between the draft benchmark and the comparison callsets
    • request that you curate those sites to identify whether the draft benchmark is correct

  • Stay tuned to the GIAB analysis team Google Group for this request in the next week or so