Genome in a Bottle Consortium Updates and Plans for X/Y Benchmark Evaluation
Justin Zook and the GIAB team
December 12, 2022
GIAB samples and reference materials
Goals for today
Modeling sequencing and variant call errors
Nate Dwarshuis, Justin Wagner, et al
SV benchmarks
Tandem repeat benchmarks
Somatic/mosaic benchmarks
RNAseq of GIAB samples
HG002 “Q100” Project
Specific Update on Chromosome X/Y Benchmark Development
Chromosome X/Y Benchmark Development
Assembly-Based Draft Benchmark Development Pipeline
Credits: Nate Olson, Jennifer McDaniel, and GIAB team
What we exclude from the assembly-based benchmark
Design of our human genome reference values
[Figure: schematic of a genome with benchmark regions, benchmark variant calls, and variants from any method being evaluated; not to scale (e.g., variants cover <1% of the genome)]
- Variants outside benchmark regions are not assessed
- The majority of variants unique to the method should be false positives (FPs)
- The majority of variants unique to the benchmark should be false negatives (FNs)
- Matching variants are assumed to be true positives (TPs)
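A minimal sketch of the comparison logic the figure implies, assuming a simplified set-based variant representation (real benchmarking tools such as hap.py also normalize representation differences):

```python
# Sketch of the benchmark comparison logic implied by the figure.
# Variants are (chrom, pos, ref, alt) tuples; this simplified
# representation is an assumption for illustration only.

def classify(query_calls, benchmark_calls, benchmark_positions):
    """Classify query variants against benchmark variants.

    benchmark_positions is a set of (chrom, pos) pairs inside
    the benchmark regions.
    """
    def in_regions(v):
        return (v[0], v[1]) in benchmark_positions

    # Variants outside benchmark regions are not assessed.
    query = {v for v in query_calls if in_regions(v)}
    truth = {v for v in benchmark_calls if in_regions(v)}

    tp = query & truth  # matching variants: assumed true positives
    fp = query - truth  # unique to the method: mostly false positives
    fn = truth - query  # unique to the benchmark: mostly false negatives
    return tp, fp, fn
```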
Reliable IDentification of Errors (RIDE) Criteria
Evaluation
[Figure: the same schematic — genome with benchmark regions, benchmark variant calls, and variants from any method being evaluated]
Process for independent evaluations
- Callset developer curates putative errors
- If the benchmark appears wrong or questionable:
  - NIST curator disagrees: discuss with the callset developer
  - NIST curator agrees: classify the source of the potential error in the benchmark
- If the benchmark is correct: no further curation is needed
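The decision flow above can be summarized in a small sketch; the outcome labels are illustrative names, not GIAB tooling:

```python
# Sketch of the independent-evaluation decision flow above.
# The return strings are hypothetical labels for illustration.

def curation_outcome(benchmark_questionable: bool,
                     nist_curator_agrees: bool) -> str:
    if not benchmark_questionable:
        # Benchmark is correct: no further curation needed.
        return "no_further_curation"
    if nist_curator_agrees:
        # Classify the source of the potential benchmark error.
        return "classify_benchmark_error"
    # Curators disagree: discuss with the callset developer.
    return "discuss_with_developer"
```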
Estimate confidence intervals about benchmark performance by curating differences between Illumina, HiFi, and the draft benchmark
[Figure: two panels, "Compare to Illumina" and "Compare to HiFi". Each shows the genome with draft benchmark regions and draft benchmark variants: variants unique to the comparison callset are putative false positives, variants unique to the draft benchmark are putative false negatives, and matching variants are assumed true positives.]
- For each comparison: stratify the differences, sample them, and manually curate the sampled variants
Fit for Purpose Goal: >95% confidence that differences between other methods and the benchmark are mostly (>50%) errors in the other methods.
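One way to check this goal, assuming the curated differences are an unbiased sample: compute a one-sided 95% Clopper-Pearson lower bound on the fraction of differences judged to be errors in the other method, and require it to exceed 0.5. The counts below are hypothetical:

```python
# Hypothetical worked example of the fit-for-purpose check:
# with n curated differences and k judged to be errors in the other
# method, require the one-sided 95% lower confidence bound on k/n
# to exceed 0.5.
from scipy.stats import beta

def lower_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided Clopper-Pearson lower bound for a binomial proportion."""
    if k == 0:
        return 0.0
    return beta.ppf(alpha, k, n - k + 1)

k, n = 58, 80  # hypothetical curation counts
lb = lower_bound(k, n)
print(f"lower bound = {lb:.3f}, goal met: {lb > 0.5}")
```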
Approach: Use Active Evaluation to Provide Confidence Intervals and Address Curation
Active Evaluation: By carefully choosing which parts of the unlabeled test set to obtain labels for, we can minimize labeling effort and measurement uncertainty.
Population: The entire set of disagreements between the benchmark and the systems
Sample: The specific variants that were human curated against the benchmark
Metric measured: The fraction of curations where the benchmark agrees with the human-curated labels (Accuracy)
Strategy: Partition the variants into strata and use stratified sampling both to provide a confidence interval on the accuracy of the unlabeled test set and to recommend additional variants for human curation. Assume the number of agreements within each stratum follows a binomial distribution.
Software: Preliminary Version available at https://github.com/usnistgov/active-evaluation
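A minimal sketch of the stratified estimator described above, using a normal approximation to each stratum's binomial; the stratum names and counts are hypothetical, and the repository above holds the actual implementation:

```python
# Sketch of a stratified accuracy estimate with a normal-approximation
# confidence interval; per-stratum counts below are hypothetical.
import math

def stratified_accuracy_ci(strata, z=1.96):
    """strata: dict of name -> (N_total, n_curated, n_agree).

    Each stratum's agreements are modeled as Binomial(n_curated, p_h);
    strata are combined with weights proportional to stratum size.
    """
    total = sum(N for N, _, _ in strata.values())
    est, var = 0.0, 0.0
    for N, n, k in strata.values():
        w = N / total                   # stratum weight in the population
        p = k / n                       # curated agreement rate in stratum
        est += w * p
        var += w**2 * p * (1 - p) / n   # binomial variance of the mean
    half = z * math.sqrt(var)
    return est - half, est + half

strata = {  # hypothetical counts: (population size, curated, agreeing)
    "FP_SNP_Pos":    (1200, 40, 31),
    "FP_SNP_outPos": (300, 30, 29),
    "FN_SNP":        (150, 20, 19),
}
print(stratified_accuracy_ci(strata))
```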
Approach: Use Active Evaluation to Provide Confidence Intervals and Systematize Curation
Use Stratifications to Sample Curations
Strata were false positive (FP) and false negative (FN) SNPs and INDELs, split by whether they fall inside a small region with many FPs (ChrY:1102000-11600000) and by whether they fall in an AT homopolymer longer than 30 bp, which was used to decide whether such homopolymers should be excluded from the benchmark (see the sketch after this list):
- FP_SNP_Pos: False Positive SNPs in ChrY:1102000-11600000
- FP_SNP_outPos: False Positive SNPs not in ChrY:1102000-11600000
- FN_SNP: False Negative SNPs
- FP_Indel_ATL: False Positive INDELs in an AT homopolymer longer than 30bp
- FN_Indel_ATL: False Negative INDELs in an AT homopolymer longer than 30bp
- FP_Indel_ATS: False Positive INDELs not in an AT homopolymer longer than 30bp
- FN_Indel_ATS: False Negative INDELs not in an AT homopolymer longer than 30bp
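A sketch of assigning a curated difference to one of the strata above; the region coordinates are taken verbatim from the slide, while the function and field names are hypothetical:

```python
# Sketch of assigning a curated difference to one of the strata above.
# Field names (chrom, pos, vtype, status, at_hp_len) are illustrative.

REGION = ("chrY", 1102000, 11600000)  # small region with many FPs

def stratum(chrom, pos, vtype, status, at_hp_len):
    """vtype: 'SNP' or 'INDEL'; status: 'FP' or 'FN';
    at_hp_len: length (bp) of the surrounding AT homopolymer."""
    if vtype == "SNP":
        if status == "FN":
            return "FN_SNP"
        in_region = chrom == REGION[0] and REGION[1] <= pos < REGION[2]
        return "FP_SNP_Pos" if in_region else "FP_SNP_outPos"
    long_at = at_hp_len > 30
    return f"{status}_Indel_{'ATL' if long_at else 'ATS'}"
```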
Identifying poor performing regions
Evaluation Using Illumina and HiFi Callsets
Strata consisted of SNVs, INDELs, a small region with many false positives, homopolymers longer than 20 bp, pseudoautosomal regions, and large repetitive sequences
All other trials (only FPs and FNs were curated)
Draft Small Variant Benchmark Characteristics
| Bases in Benchmark Regions | SNVs | INDELs |
|----------------------------|------|--------|
| 163,505,305 | 88,772 | 25,461 |
Draft Small Variant Benchmark Example Comparison
| Metric | SNVs | INDELs |
|---------------|------|------|
| Precision (%) | 97.3 | 88.0 |
| Recall (%) | 98.6 | 90.4 |
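For reference, the standard definitions behind these numbers; the counts below are hypothetical values that roughly reproduce the SNV row:

```python
# Standard precision/recall definitions; counts are hypothetical.
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp)  # fraction of calls that are correct
    recall = tp / (tp + fn)     # fraction of benchmark variants found
    return precision, recall

# e.g., tp=973, fp=27, fn=14 gives precision 0.973 and recall ~0.986
print(precision_recall(tp=973, fp=27, fn=14))
```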
Draft Small Variant Benchmark Example Comparison (HiFi FN SNVs)
Some regions are too complex for benchmarking tools
Plan for X/Y Benchmark Evaluation