1 of 29

May 16, 2024

Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing

Erin Lanus, Brian Lee

Laura Freeman

Laura.freeman@vt.edu

2 of 29

Overview

  • Why data is important
  • Framing problem
  • Theory
  • Published results
  • Systematic framework for exploring coverage
  • CODEX tool: goals, functions, outputs
  • Current and Future Efforts

3 of 29

Data Assurance Necessary for AI Assurance

  • Trustworthy AI Characteristics
    • Robustness to small changes in input space
    • Security against adversarial inputs
    • Mitigate unintended data bias preventing representativeness/generalizability
    • Explainable performance in a well-defined operating environment
  • Machine Learning (ML) models are data dependent
    • Given same model architecture/hyperparameters, change in training data leads to change in model performance
    • Prediction explanations are typically model-centric and may not locate the cause
    • Black box transferability of adversarial samples across models with different architectures suggests a data weakness, not the learning algorithm
  • To improve the input space, we first need to characterize the input space
    • Model learns relationships between combinations of features X and labels Y, i.e., how interactions of elements of X affect the probability of an element of Y
    • Interactions drive necessary model complexity; can't rely on 1D stats
    • Gaps in coverage 🡪 failure to model relationship
    • Applied to features, metadata, or both

[Images: wolf, husky, wolf?]

Explainability tool LIME highlights salient parts of the image (snow)

Identifying the root cause requires analysis of data coverage (lack of huskies in snow)

4 of 29

Data Coverage Operating Envelope Estimation


Problem:

  • Model trained on Southern imagery exhibited a performance drop on Northern imagery
  • North-to-South transfer did not have a drop
  • Even without an adversary, a model developed in one part of California fails to classify some images of planes elsewhere in the same state

Can we deploy a computer vision model for detecting vehicles from Southern California to Northern California?

Can we tell operators when to trust the model?

What are the top information gain data points for improving the model for this deployment?

Solution:

  • Data Coverage as a method to estimate operating envelope in which a model’s performance is well defined

5 of 29

Technical Foundation: Coverage Metrics

  • Designed metrics to detect gaps in coverage [1]
    • of input space vs. universe: combinatorial coverage (CC)
    • Between datasets: set difference combinatorial coverage (SDCC)
  • Why does this matter?
    • Classifier trained to detect huskies vs. wolves really detected snow due to a lack of images with combinations (husky, snow), (wolf, urban), etc.

[Diagrams: dataset Dt within universe Ut; five set relationships between source St and target Tt within universe Ut, including St = Tt]

Explainable at 3 levels of complexity

1. # interactions present/absent

2. Which interactions present/absent

3. Distribution of interactions

[1] E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial testing metrics for machine learning,” in 2021 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 81-84, IEEE, 2021.

Hair | Live Birth | Eco   | Class
-----+------------+-------+------
No   | Yes        | Ocean | Orca
Yes  | Yes        | Woods | Wolf
Yes  | Yes        | Ocean | Otter
No   | No         | Woods | Owl

2-way combination:

Hair, Eco
Hair, Live Birth

2-way interaction:

Hair=No, Eco=Ocean
Eco=Ocean, Live Birth=Yes
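The metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CODEX implementation: the function names (`interactions`, `cc`, `sdcc`) are ours, and the data is the four-row animal example above. CC is the fraction of the universe's t-way interactions present in a dataset; SDCC here is the fraction of the target's interactions absent from the source.

```python
from itertools import combinations, product

def interactions(rows, t=2):
    """Collect the t-way interactions (value assignments to t features) in a dataset."""
    seen = set()
    for row in rows:
        for cols in combinations(sorted(row), t):
            seen.add(tuple((c, row[c]) for c in cols))
    return seen

def cc(rows, universe, t=2):
    """Combinatorial coverage: fraction of the universe's t-way interactions present."""
    return len(interactions(rows, t) & universe) / len(universe)

def sdcc(source, target, t=2):
    """Set difference view: fraction of target's t-way interactions NOT in source."""
    tgt = interactions(target, t)
    return len(tgt - interactions(source, t)) / len(tgt)

# The four-row animal example from the slide (Class labels omitted; they are Y, not X).
data = [
    {"Hair": "No",  "Live Birth": "Yes", "Eco": "Ocean"},  # Orca
    {"Hair": "Yes", "Live Birth": "Yes", "Eco": "Woods"},  # Wolf
    {"Hair": "Yes", "Live Birth": "Yes", "Eco": "Ocean"},  # Otter
    {"Hair": "No",  "Live Birth": "No",  "Eco": "Woods"},  # Owl
]

# Universe of all possible 2-way interactions over the three binary-ish features.
values = {"Hair": ["No", "Yes"], "Live Birth": ["No", "Yes"], "Eco": ["Ocean", "Woods"]}
universe = {
    ((f1, v1), (f2, v2))
    for f1, f2 in combinations(sorted(values), 2)
    for v1, v2 in product(values[f1], values[f2])
}

print(f"CC   = {cc(data, universe):.2f}")        # 10 of the 12 possible 2-way interactions appear
print(f"SDCC = {sdcc(data[:2], data[2:]):.2f}")  # gap between a 2-row "train" and 2-row "test" split
```

The two interactions missing from the four rows, (Hair=Yes, Live Birth=No) and (Eco=Ocean, Live Birth=No), are exactly the kind of gap the husky/wolf example warns about: a model never sees them, so its behavior there is undefined.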

6 of 29

Combinatorial Coverage Published Results

Coverage metrics correlate with performance drop direction

North covers (93%) more contexts than South (83%)

North covers more of South (98%) than South covers of North (88%)

[1] E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial testing metrics for machine learning,” in 2021 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 81-84, IEEE, 2021.

Retraining the source model with a few images from the target domain selected by set difference combinatorial coverage achieves higher accuracy than the same number of randomly selected images

[2] S. F. Ahamed, P. Aggarwal, S. Shetty, E. Lanus, L. J. Freeman, “ATTL: an automated targeted transfer learning with deep neural networks,” in 2021 IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1-7.

Test samples with interactions covered by the train set have higher accuracy than samples with some interactions not covered 🡪 create representative or challenging test sets

[3] T. Cody, E. Lanus, D. Doyle, L. J. Freeman, “Systematic training and testing for machine learning using combinatorial interaction testing,” in 2022 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 102-109, IEEE, 2022.

[Plots: per-interaction performance, e.g., interaction Smean=2, Class=1]

7 of 29

Systematic Inclusion & Exclusion Experimental Framework

Challenge: How do we determine the features and metadata categories that define the operating envelopes?

Solution: Design of Experiments!

[1] E. Lanus, B. Lee, L. Pol, D. Sobien, J. Kauffman, L. J. Freeman. “Coverage for Identifying Critical Metadata in Machine Learning Operating Envelopes,” to appear in 2024 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), IEEE, 2024.

8 of 29

Results on RarePlanes

9 of 29

Results on RarePlanes - EDA

10 of 29

Modeling Approach

  • Logistic Regression
    • Frequency = Test Size
  • Response Variables
    • Precision
    • Recall
    • F1
  • Predictors:
    • Feature Excluded in Test: Yes/No
    • Covered in Training: Yes/No
  • Two models fit
    • Control
    • Systematic Inclusion/Exclusion (SIE)
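The slides do not show the fitting code; the sketch below is our minimal numpy illustration of a frequency-weighted logistic regression (frequency = test size) fit by iteratively reweighted least squares on synthetic proportions. The predictor coding (`excluded`, `covered` as 0/1 with an interaction term) and all names are our assumptions; in practice a GLM library such as statsmodels would be the natural choice.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_weighted_logistic(X, y, n, iters=25):
    """Fit a binomial GLM by IRLS: y are observed proportions (precision,
    recall, or F1 on a test set), n are frequency weights (test sizes)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = sigmoid(eta)
        w = n * mu * (1.0 - mu)                  # IRLS working weights
        z = eta + (y - mu) / (mu * (1.0 - mu))   # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

def design(excluded, covered):
    """Intercept + excluded-in-test (0/1) + covered-in-training (0/1) + interaction."""
    excluded = np.asarray(excluded, float)
    covered = np.asarray(covered, float)
    return np.column_stack([np.ones_like(excluded), excluded, covered,
                            excluded * covered])

# Synthetic check: proportions generated from a known coefficient vector
# should be recovered exactly by the fit (the 4x4 model is saturated here).
X = design([0, 0, 1, 1], [0, 1, 0, 1])
beta_true = np.array([0.5, -1.0, 1.2, 0.8])
y = sigmoid(X @ beta_true)                      # illustrative, not real results
beta = fit_weighted_logistic(X, y, np.full(4, 100.0))
```

With real data, each row would be one test partition's precision/recall/F1 weighted by its test size, and significance would come from the usual GLM standard errors.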

11 of 29

Model Results

Systematic Inclusion / Exclusion (SIE): [plots of Precision, Recall, F1]

Control model p-values (Included in training):

Response Variable | Model p-value
------------------+--------------
Precision         | 0.4225
Recall            | 0.9008
F1                | 0.7140

12 of 29

Results on RarePlanes - Precision

Contrast Significance

Contrast                                  | P-Value
------------------------------------------+--------
Avg Pan Res Excluded/Included * Covered   | 0.0296
Biome Excluded/Included * Covered         | 0.0002
Hour of Day Excluded/Included * Covered   | 0.0029
Off Nadir Max Excluded/Included * Covered | 0.6406
Season Excluded/Included * Covered        | 0.1493

13 of 29

Results on RarePlanes - Recall

Contrast Significance

Contrast                                  | P-Value
------------------------------------------+--------
Avg Pan Res Excluded/Included * Covered   | 0.0237
Biome Excluded/Included * Covered         | 0.0007
Hour of Day Excluded/Included * Covered   | 0.0011
Off Nadir Max Excluded/Included * Covered | 0.2393
Season Excluded/Included * Covered        | 0.0383

14 of 29

Results on RarePlanes – F1

Contrast Significance

Contrast                                  | P-Value
------------------------------------------+--------
Avg Pan Res Excluded/Included * Covered   | 0.0218
Biome Excluded/Included * Covered         | 0.0002
Hour of Day Excluded/Included * Covered   | 0.0012
Off Nadir Max Excluded/Included * Covered | 0.3729
Season Excluded/Included * Covered        | 0.0627

15 of 29

Coverage of Data Explorer (CODEX)

    • CODEX Tool Goals:
      • Enable experimentation and dynamic decision making on data partitions via a dashboard to visualize and interact with a trained ML model's envelope in an iterative process
      • Understand what comprises a sufficient training or test data set for ML T&E
      • Understand which factors define operating envelope and impact model performance
      • Implement functionalities utilizing combinatorial coverage from published and unpublished results

Visualize Coverage Metrics

Plot data coverage metrics, input human operator choices and objectives

Know areas of input space with little/no data

Partition, Train, Test

Algorithmically generate partitions of data w.r.t. coverage, automatically train/evaluate models

Generate minimal train sets, certify gold and robust test sets

Delta Performance Coverage

Between partitions assessment: plot change in performance against change in coverage at interaction and model scales

Know interactions with lower performance, effect of data coverage on resulting model

16 of 29

CODEX Functions

  • Lower Level:
    • Combinatorial Coverage (CC)
    • Set Difference Combinatorial Coverage (SDCC)
    • Dataset Evaluation
      • Compute CC over a dataset
    • Dataset Split Evaluation
      • Compute SDCC over a dataset split
    • Plot performance against coverage
    • Plot performance by interaction
    • Balanced Set Construction
      • Given a goal size, build a dataset that covers interactions as equally as possible while staying as close to the goal size as possible
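The slides describe Balanced Set Construction but not its algorithm; one plausible greedy sketch (our own, with hypothetical names, not the CODEX code) picks, at each step, the sample whose t-way interactions are currently least covered, keeping interaction counts as even as possible up to the goal size:

```python
from itertools import combinations

def row_interactions(row, t=2):
    """t-way interactions (feature=value assignments) contained in one sample."""
    return {tuple((c, row[c]) for c in cols) for cols in combinations(sorted(row), t)}

def balanced_subset(rows, goal, t=2):
    """Greedy sketch of balanced set construction: repeatedly take the sample
    whose interactions have the lowest running coverage count."""
    counts = {}
    remaining = list(range(len(rows)))
    chosen = []
    while remaining and len(chosen) < goal:
        best = min(remaining,
                   key=lambda i: sum(counts.get(ix, 0) for ix in row_interactions(rows[i], t)))
        chosen.append(best)
        remaining.remove(best)
        for ix in row_interactions(rows[best], t):
            counts[ix] = counts.get(ix, 0) + 1
    return [rows[i] for i in chosen]

# Duplicates of one sample lose out to a sample carrying uncovered interactions.
a = {"f1": 0, "f2": 0}
b = {"f1": 1, "f2": 1}
subset = balanced_subset([a, a, a, b], goal=2)  # → [a, b], not [a, a]
```

Greedy selection like this is a heuristic; it does not guarantee an optimal balance, but it is cheap and tends to avoid the duplicated-context skew a random subset can have.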

  • Higher Level:
    • Dataset Split Comparison
      • Uses plot performance by coverage
    • Targeted Retraining in Transfer Learning
      • Identify samples in set difference (gap)
    • Identify Critical Factors via Systematic Inclusion/Exclusion (SIE)
      • Uses Balanced Set Construction
    • Model Probing
      • Uses plot performance by interaction
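For Targeted Retraining in Transfer Learning, the published result [2] selects target-domain samples from the set difference gap. A minimal sketch of that selection step (our illustration; names are assumptions, not the CODEX API):

```python
from itertools import combinations

def row_interactions(row, t=2):
    """t-way interactions (feature=value assignments) contained in one sample."""
    return {tuple((c, row[c]) for c in cols) for cols in combinations(sorted(row), t)}

def gap_samples(source, target, t=2):
    """Target samples containing at least one t-way interaction absent from the
    source training set — the SDCC 'gap' to prioritize for retraining."""
    covered = set().union(*(row_interactions(r, t) for r in source))
    return [r for r in target if row_interactions(r, t) - covered]

# A target sample whose context already appears in source is skipped;
# the novel-context sample is surfaced for labeling/retraining.
source = [{"Eco": "Ocean", "Hair": "No"}]
target = [{"Eco": "Ocean", "Hair": "No"}, {"Eco": "Woods", "Hair": "Yes"}]
selected = gap_samples(source, target)  # → [{"Eco": "Woods", "Hair": "Yes"}]
```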

17 of 29

Coverage Applications & Future Work

    • Inform data collection/labeling
    • Reduce train sets
    • Certify gold & robust test sets

    • Systematic model probing
    • Discover and improve attacks
    • Why is model not performant?
    • What are model boundaries?
    • Which model to choose?
    • Where can we deploy?

Explainability: Selection from Model Zoo

Operator Trust: Data & Model Card Interoperability

Robustness: Blue Team/White Box Assessment

Security: Red Team/Black Box Assessment

Use case: sponsor has model trained in desert and upcoming mission in arctic.

Automate coverage computation of data to assist in course of action decision

Enough coverage

    • use the model

Better coverage elsewhere

    • use this other model

Not enough coverage in specific areas

    • alert for human oversight when operating in gaps
    • collect/synthesize data from gaps and fine tune the model
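The three courses of action above amount to a decision rule over coverage values. A hypothetical sketch (the threshold and all names are our assumptions, not a sponsor-validated policy), where each candidate model is scored by the SDCC of the mission data with respect to its training data:

```python
def course_of_action(gap, threshold=0.1, alternatives=None):
    """Map the deployed model's coverage gap (SDCC of mission data w.r.t. its
    training data, 0 = fully covered) onto the slide's three options.
    alternatives: {model_name: gap} for other models in the zoo."""
    alternatives = alternatives or {}
    best = min(alternatives, key=alternatives.get) if alternatives else None
    if gap <= threshold:
        return "use the model"
    if best is not None and alternatives[best] <= threshold:
        return f"use other model: {best}"
    return "alert for human oversight in gaps; collect data and fine tune"

# Desert-trained model, arctic mission: large gap, but a better-covered model exists.
decision = course_of_action(0.4, alternatives={"arctic-model": 0.02})
```

In practice the rule would likely be per-region rather than a single scalar, so the "alert in gaps" branch can name the specific uncovered interactions.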

Apply data coverage in model development via planned data collection and partitioning to:

– Reduce time, effort, and cost of training and testing

– Improve robustness by knowing model boundaries and filling gaps

Improve efficiency:

Training and testing

Data collection/labeling

Improve robustness:

Natural events

Adversarial actions

Enhance statistics on data cards with interaction coverage, gaps

Summarize prior to sharing 🡪 “need to share”

Expand model cards with coverage informed performance analysis, critical metadata

Combinatorial arrays for black box software testing to construct test suites to search input space:

– Construct low performing samples

– Estimate operating envelope

– Guess training data used

Coverage based approaches to

– Data poisoning

– Adversarial partitioning

18 of 29

Backup:

Adversarial Testing Future Directions

19 of 29

Data Poisoning


  • Metadata such as time of day, season, biome impact appearance of objects relative to background
  • Combinatorial coverage of different contexts leads to models that can perform in a variety of contexts and do not learn spurious correlations (e.g., detect the runway instead of the plane, leading to failure to detect the plane when the runway is obscured by snow)
  • Coverage becomes very useful for designing train and test sets
  • Research:
    • Can an adversary use coverage of metadata to direct poisoning attacks to lead to models that are brittle in specific ways?
    • Can we use coverage to minimize the amount of poisoning needed by identifying low coverage regions?

20 of 29

Private Aggregation of Teacher Ensembles (PATE)

Images from http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html

PATE Framework: N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data,” arXiv preprint arXiv:1610.05755. 

  • Monolithic focus on privacy 🡪 naïve about impact on utility
  • Model operating envelope work 🡪 can’t treat dataset as blob
        • Look inside the data, characterize what it contains (coverage)
        • Understand how subsetting dataset impacts model utility
  • Case study: PATE Framework (Google Brain) uses subsets for achieving differential privacy, treats subsets as blobs
  • Explore how different subsets as partitions impact utility
    • Apply coverage approaches to characterize partitions

21 of 29

Adversarial Partitions: Beyond User Privacy

PATE or a similar disjoint partition ensemble strategy for:  

  • protecting user private information in health or other human services datasets 
  • preventing an adversary from extracting specifics of sensitive training samples by examining a model deployed in unsecured areas;
  • partitioning sensitive data into disjoint sets so that leakage of one partition limits the damage caused by an insider threat acting alone;
  • outsourcing training teacher models on non-sensitive data, later ensembling with teachers trained on sensitive data or using sensitive auxiliary dataset to train student model.

PATE one of few approaches specific to DP in ML https://www.nist.gov/blogs/cybersecurity-insights/how-deploy-machine-learning-differential-privacy  

Utility impacting partitions could occur:

    • Accidentally (random split)
    • Naturally or Structurally (separate private datasets, sensitivity restrictions on part of dataset, keep frames from unique flight together)
    • Adversarially (insider compromises the splitting algorithm) 🡨 worst case

22 of 29

Constructing Adversarial Partitions

Coverage Metrics + Computational Construction Algorithm

[Images: Wolf, Husky, ???]

  • Hypotheses: (1) it is possible to construct a partition that negatively impacts student model utility on certain interactions when certain properties of the private dataset hold; (2) an adversarially constructed data partition is detectable by analyzing set relationships/distributions between teacher partitions
  • Similar to data poisoning attack, but no data injected 🡪 mechanisms to detect non-original (poisoned) samples should fail

23 of 29

Backup:

Systematic Inclusion/Exclusion

24 of 29

Experimental Framework

25 of 29

Results on RarePlanes

26 of 29

Results on RarePlanes

27 of 29

Results on RarePlanes

28 of 29

Results on RarePlanes

29 of 29

Results on RarePlanes