1 of 29

May 16, 2024

Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing

Erin Lanus, Brian Lee

Laura Freeman

Laura.freeman@vt.edu

2 of 29

Overview

  • Why data is important
  • Framing problem
  • Theory
  • Published results
  • Systematic framework for exploring coverage
  • CODEX tool: goals, functions, outputs
  • Current and Future Efforts

3 of 29

Data Assurance Necessary for AI Assurance

  • Trustworthy AI Characteristics
    • Robustness to small changes in input space
    • Security against adversarial inputs
    • Mitigate unintended data bias preventing representativeness/generalizability
    • Explainable performance in a well-defined operating environment
  • Machine Learning (ML) models are data dependent
    • Given same model architecture/hyperparameters, change in training data leads to change in model performance
    • Prediction explanations are typically model-centric and may not locate the cause
    • Black box transferability of adversarial samples across models with different architectures suggests a data weakness, not the learning algorithm
  • To improve the input space, we first need to characterize the input space
    • Model learns relationships between combinations of features X and labels Y, i.e., how interactions of elements of X affect the probability of an element of Y
    • Interactions drive necessary model complexity; can't rely on 1D stats
    • Gaps in coverage 🡪 failure to model relationship
    • Applied to features, metadata, or both

[Images: wolf, husky, wolf?]

Explainability tool LIME highlights salient parts of the image (snow)

Identifying the root cause requires analysis of data coverage (lack of huskies in snow)

4 of 29

Data Coverage Operating Envelope Estimation


Problem:

  • Model trained on Southern imagery exhibited a performance drop on Northern imagery
  • North-to-South transfer did not have a drop
  • Even without an adversary, a model developed in one part of California fails to classify some images of planes elsewhere in the same state

Can we deploy a computer vision model for detecting vehicles from Southern California to Northern California?

Can we tell operators when to trust the model?

What are the top information gain data points for improving the model for this deployment?

Solution:

  • Data Coverage as a method to estimate operating envelope in which a model’s performance is well defined

5 of 29

Technical Foundation: Coverage Metrics

  • Designed metrics to detect gaps in coverage [1]
    • of input space vs. universe: combinatorial coverage (CC)
    • Between datasets: set difference combinatorial coverage (SDCC)
  • Why does this matter?
    • Classifier trained to detect huskies vs. wolves really detected snow due to a lack of images with combinations (husky, snow), (wolf, urban), etc.

[Diagrams: dataset Dt within universe Ut; five set relationships between source St and target Tt within universe Ut, including St = Tt]

Explainable at 3 levels of complexity

1. # interactions present/absent

2. Which interactions present/absent

3. Distribution of interactions

[1] E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial testing metrics for machine learning,” in 2021 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 81-84, IEEE, 2021.

Hair | Live Birth | Eco   | Class
-----+------------+-------+------
No   | Yes        | Ocean | Orca
Yes  | Yes        | Woods | Wolf
Yes  | Yes        | Ocean | Otter
No   | No         | Woods | Owl

2-way combination:

Hair, Eco
Hair, Live Birth

2-way interaction:

Hair=No, Eco=Ocean
Eco=Ocean, Live Birth=Yes
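The metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CODEX implementation: the function names (`interactions`, `cc`, `sdcc`) are ours, and the data is the four-row animal example above. CC is the fraction of the universe's t-way interactions present in a dataset; SDCC here is the fraction of the target's interactions absent from the source.

```python
from itertools import combinations, product

def interactions(rows, t=2):
    """Collect the t-way interactions (value assignments to t features) in a dataset."""
    seen = set()
    for row in rows:
        for cols in combinations(sorted(row), t):
            seen.add(tuple((c, row[c]) for c in cols))
    return seen

def cc(rows, universe, t=2):
    """Combinatorial coverage: fraction of the universe's t-way interactions present."""
    return len(interactions(rows, t) & universe) / len(universe)

def sdcc(source, target, t=2):
    """Set difference view: fraction of target's t-way interactions NOT in source."""
    tgt = interactions(target, t)
    return len(tgt - interactions(source, t)) / len(tgt)

# The four-row animal example from the slide (Class labels omitted; they are Y, not X).
data = [
    {"Hair": "No",  "Live Birth": "Yes", "Eco": "Ocean"},  # Orca
    {"Hair": "Yes", "Live Birth": "Yes", "Eco": "Woods"},  # Wolf
    {"Hair": "Yes", "Live Birth": "Yes", "Eco": "Ocean"},  # Otter
    {"Hair": "No",  "Live Birth": "No",  "Eco": "Woods"},  # Owl
]

# Universe of all possible 2-way interactions over the three binary-ish features.
values = {"Hair": ["No", "Yes"], "Live Birth": ["No", "Yes"], "Eco": ["Ocean", "Woods"]}
universe = {
    ((f1, v1), (f2, v2))
    for f1, f2 in combinations(sorted(values), 2)
    for v1, v2 in product(values[f1], values[f2])
}

print(f"CC   = {cc(data, universe):.2f}")        # 10 of the 12 possible 2-way interactions appear
print(f"SDCC = {sdcc(data[:2], data[2:]):.2f}")  # gap between a 2-row "train" and 2-row "test" split
```

The two interactions missing from the four rows, (Hair=Yes, Live Birth=No) and (Eco=Ocean, Live Birth=No), are exactly the kind of gap the husky/wolf example warns about: a model never sees them, so its behavior there is undefined.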

6 of 29

Combinatorial Coverage Published Results

Coverage metrics correlate with performance drop direction

North covers (93%) more contexts than South (83%)

North covers more of South (98%) than South covers of North (88%)

[1] E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial testing metrics for machine learning,” in 2021 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 81-84, IEEE, 2021.

Retraining the source model with a few images from the target domain selected by set difference combinatorial coverage achieves higher accuracy than the same number of randomly selected images

[2] S. F. Ahamed, P. Aggarwal, S. Shetty, E. Lanus, L. J. Freeman, “ATTL: an automated targeted transfer learning with deep neural networks,” in 2021 IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1-7.

Test samples with interactions covered by the train set have higher accuracy than samples with some interactions not covered 🡪 create representative or challenging test sets

[3] T. Cody, E. Lanus, D. Doyle, L. J. Freeman, “Systematic training and testing for machine learning using combinatorial interaction testing,” in 2022 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 102-109, IEEE, 2022.

[Plots: per-interaction performance, e.g., interaction Smean=2, Class=1]

7 of 29

Systematic Inclusion & Exclusion Experimental Framework

Challenge: How do we determine the features and metadata categories that define the operating envelopes?

Solution: Design of Experiments!

[1] E. Lanus, B. Lee, L. Pol, D. Sobien, J. Kauffman, L. J. Freeman. “Coverage for Identifying Critical Metadata in Machine Learning Operating Envelopes,” to appear in 2024 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), IEEE, 2024.

8 of 29

Results on RarePlanes

9 of 29

Results on RarePlanes - EDA

10 of 29

Modeling Approach

  • Logistic Regression
    • Frequency = Test Size
  • Response Variables
    • Precision
    • Recall
    • F1
  • Predictors:
    • Feature Excluded in Test: Yes/No
    • Covered in Training: Yes/No
  • Two models fit
    • Control
    • Systematic Inclusion/Exclusion (SIE)
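The slides do not show the fitting code; the sketch below is our minimal numpy illustration of a frequency-weighted logistic regression (frequency = test size) fit by iteratively reweighted least squares on synthetic proportions. The predictor coding (`excluded`, `covered` as 0/1 with an interaction term) and all names are our assumptions; in practice a GLM library such as statsmodels would be the natural choice.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_weighted_logistic(X, y, n, iters=25):
    """Fit a binomial GLM by IRLS: y are observed proportions (precision,
    recall, or F1 on a test set), n are frequency weights (test sizes)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = sigmoid(eta)
        w = n * mu * (1.0 - mu)                  # IRLS working weights
        z = eta + (y - mu) / (mu * (1.0 - mu))   # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

def design(excluded, covered):
    """Intercept + excluded-in-test (0/1) + covered-in-training (0/1) + interaction."""
    excluded = np.asarray(excluded, float)
    covered = np.asarray(covered, float)
    return np.column_stack([np.ones_like(excluded), excluded, covered,
                            excluded * covered])

# Synthetic check: proportions generated from a known coefficient vector
# should be recovered exactly by the fit (the 4x4 model is saturated here).
X = design([0, 0, 1, 1], [0, 1, 0, 1])
beta_true = np.array([0.5, -1.0, 1.2, 0.8])
y = sigmoid(X @ beta_true)                      # illustrative, not real results
beta = fit_weighted_logistic(X, y, np.full(4, 100.0))
```

With real data, each row would be one test partition's precision/recall/F1 weighted by its test size, and significance would come from the usual GLM standard errors.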

11 of 29

Model Results

Systematic Inclusion / Exclusion (SIE): [plots of Precision, Recall, F1]

Control model p-values (Included in training):

Response Variable | Model p-value
------------------+--------------
Precision         | 0.4225
Recall            | 0.9008
F1                | 0.7140

12 of 29

Results on RarePlanes - Precision

Contrast Significance

Contrast                                  | P-Value
------------------------------------------+--------
Avg Pan Res Excluded/Included * Covered   | 0.0296
Biome Excluded/Included * Covered         | 0.0002
Hour of Day Excluded/Included * Covered   | 0.0029
Off Nadir Max Excluded/Included * Covered | 0.6406
Season Excluded/Included * Covered        | 0.1493

13 of 29

Results on RarePlanes - Recall

Contrast Significance

Contrast                                  | P-Value
------------------------------------------+--------
Avg Pan Res Excluded/Included * Covered   | 0.0237
Biome Excluded/Included * Covered         | 0.0007
Hour of Day Excluded/Included * Covered   | 0.0011
Off Nadir Max Excluded/Included * Covered | 0.2393
Season Excluded/Included * Covered        | 0.0383

14 of 29

Results on RarePlanes – F1

Contrast Significance

Contrast                                  | P-Value
------------------------------------------+--------
Avg Pan Res Excluded/Included * Covered   | 0.0218
Biome Excluded/Included * Covered         | 0.0002
Hour of Day Excluded/Included * Covered   | 0.0012
Off Nadir Max Excluded/Included * Covered | 0.3729
Season Excluded/Included * Covered        | 0.0627

15 of 29

Coverage of Data Explorer (CODEX)

    • CODEX Tool Goals:
      • Enable experimentation and dynamic decision making on data partitions via a dashboard to visualize and interact with a trained ML model's envelope in an iterative process
      • Understand what comprises a sufficient training or test data set for ML T&E
      • Understand which factors define operating envelope and impact model performance
      • Implement functionalities utilizing combinatorial coverage from published and unpublished results

Visualize Coverage Metrics

Plot data coverage metrics, input human operator choices and objectives

Know areas of input space with little/no data

Partition, Train, Test

Algorithmically generate partitions of data w.r.t. coverage, automatically train/evaluate models

Generate minimal train sets, certify gold and robust test sets

Delta Performance Coverage

Between partitions assessment: plot change in performance against change in coverage at interaction and model scales

Know interactions with lower performance, effect of data coverage on resulting model

16 of 29

CODEX Functions

  • Lower Level:
    • Combinatorial Coverage (CC)
    • Set Difference Combinatorial Coverage (SDCC)
    • Dataset Evaluation
      • Compute CC over a dataset
    • Dataset Split Evaluation
      • Compute SDCC over a dataset split
    • Plot performance against coverage
    • Plot performance by interaction
    • Balanced Set Construction
      • Given a goal size, build a dataset that covers interactions as equally as possible while staying as close to the goal size as possible
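The slides describe Balanced Set Construction but not its algorithm; one plausible greedy sketch (our own, with hypothetical names, not the CODEX code) picks, at each step, the sample whose t-way interactions are currently least covered, keeping interaction counts as even as possible up to the goal size:

```python
from itertools import combinations

def row_interactions(row, t=2):
    """t-way interactions (feature=value assignments) contained in one sample."""
    return {tuple((c, row[c]) for c in cols) for cols in combinations(sorted(row), t)}

def balanced_subset(rows, goal, t=2):
    """Greedy sketch of balanced set construction: repeatedly take the sample
    whose interactions have the lowest running coverage count."""
    counts = {}
    remaining = list(range(len(rows)))
    chosen = []
    while remaining and len(chosen) < goal:
        best = min(remaining,
                   key=lambda i: sum(counts.get(ix, 0) for ix in row_interactions(rows[i], t)))
        chosen.append(best)
        remaining.remove(best)
        for ix in row_interactions(rows[best], t):
            counts[ix] = counts.get(ix, 0) + 1
    return [rows[i] for i in chosen]

# Duplicates of one sample lose out to a sample carrying uncovered interactions.
a = {"f1": 0, "f2": 0}
b = {"f1": 1, "f2": 1}
subset = balanced_subset([a, a, a, b], goal=2)  # → [a, b], not [a, a]
```

Greedy selection like this is a heuristic; it does not guarantee an optimal balance, but it is cheap and tends to avoid the duplicated-context skew a random subset can have.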

  • Higher Level:
    • Dataset Split Comparison
      • Uses plot performance by coverage
    • Targeted Retraining in Transfer Learning
      • Identify samples in set difference (gap)
    • Identify Critical Factors via Systematic Inclusion/Exclusion (SIE)
      • Uses Balanced Set Construction
    • Model Probing
      • Uses plot performance by interaction
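For Targeted Retraining in Transfer Learning, the published result [2] selects target-domain samples from the set difference gap. A minimal sketch of that selection step (our illustration; names are assumptions, not the CODEX API):

```python
from itertools import combinations

def row_interactions(row, t=2):
    """t-way interactions (feature=value assignments) contained in one sample."""
    return {tuple((c, row[c]) for c in cols) for cols in combinations(sorted(row), t)}

def gap_samples(source, target, t=2):
    """Target samples containing at least one t-way interaction absent from the
    source training set — the SDCC 'gap' to prioritize for retraining."""
    covered = set().union(*(row_interactions(r, t) for r in source))
    return [r for r in target if row_interactions(r, t) - covered]

# A target sample whose context already appears in source is skipped;
# the novel-context sample is surfaced for labeling/retraining.
source = [{"Eco": "Ocean", "Hair": "No"}]
target = [{"Eco": "Ocean", "Hair": "No"}, {"Eco": "Woods", "Hair": "Yes"}]
selected = gap_samples(source, target)  # → [{"Eco": "Woods", "Hair": "Yes"}]
```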

17 of 29

Coverage Applications & Future Work

    • Inform data collection/labeling
    • Reduce train sets
    • Certify gold & robust test sets

    • Systematic model probing
    • Discover and improve attacks
    • Why is model not performant?
    • What are model boundaries?
    • Which model to choose?
    • Where can we deploy?

Explainability: Selection from Model Zoo

Operator Trust: Data & Model Card Interoperability

Robustness: Blue Team/White Box Assessment

Security: Red Team/Black Box Assessment

Use case: sponsor has model trained in desert and upcoming mission in arctic.

Automate coverage computation of data to assist in course of action decision

Enough coverage

    • use the model

Better coverage elsewhere

    • use this other model

Not enough coverage in specific areas

    • alert for human oversight when operating in gaps
    • collect/synthesize data from gaps and fine tune the model
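The three courses of action above amount to a decision rule over coverage values. A hypothetical sketch (the threshold and all names are our assumptions, not a sponsor-validated policy), where each candidate model is scored by the SDCC of the mission data with respect to its training data:

```python
def course_of_action(gap, threshold=0.1, alternatives=None):
    """Map the deployed model's coverage gap (SDCC of mission data w.r.t. its
    training data, 0 = fully covered) onto the slide's three options.
    alternatives: {model_name: gap} for other models in the zoo."""
    alternatives = alternatives or {}
    best = min(alternatives, key=alternatives.get) if alternatives else None
    if gap <= threshold:
        return "use the model"
    if best is not None and alternatives[best] <= threshold:
        return f"use other model: {best}"
    return "alert for human oversight in gaps; collect data and fine tune"

# Desert-trained model, arctic mission: large gap, but a better-covered model exists.
decision = course_of_action(0.4, alternatives={"arctic-model": 0.02})
```

In practice the rule would likely be per-region rather than a single scalar, so the "alert in gaps" branch can name the specific uncovered interactions.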

Apply data coverage in model development via planned data collection and partitioning to:

– Reduce time, effort, and cost of training and testing

– Improve robustness by knowing model boundaries and filling gaps

Improve efficiency:

Training and testing

Data collection/labeling

Improve robustness:

Natural events

Adversarial actions

Enhance statistics on data cards with interaction coverage, gaps

Summarize prior to sharing 🡪 “need to share”

Expand model cards with coverage informed performance analysis, critical metadata

Combinatorial arrays for black box software testing to construct test suites to search input space:

– Construct low performing samples

– Estimate operating envelope

– Guess training data used

Coverage based approaches to

– Data poisoning

– Adversarial partitioning

18 of 29

Backup:

Adversarial Testing Future Directions

19 of 29

Data Poisoning


  • Metadata such as time of day, season, biome impact appearance of objects relative to background
  • Combinatorial coverage of different contexts leads to models that can perform in a variety of contexts and do not learn spurious correlations (e.g., detect the runway instead of the plane, leading to failure to detect the plane when the runway is obscured by snow)
  • Coverage becomes very useful for designing train and test sets
  • Research:
    • Can an adversary use coverage of metadata to direct poisoning attacks to lead to models that are brittle in specific ways?
    • Can we use coverage to minimize the amount of poisoning needed by identifying low coverage regions?

20 of 29

Private Aggregation of Teacher Ensembles (PATE)

Images from http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html

PATE Framework: N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data,” arXiv preprint arXiv:1610.05755. 

  • Monolithic focus on privacy 🡪 naïve about impact on utility
  • Model operating envelope work 🡪 can’t treat dataset as blob
        • Look inside the data, characterize what it contains (coverage)
        • Understand how subsetting dataset impacts model utility
  • Case study: PATE Framework (Google Brain) uses subsets for achieving differential privacy, treats subsets as blobs
  • Explore how different subsets as partitions impact utility
    • Apply coverage approaches to characterize partitions

21 of 29

Adversarial Partitions: Beyond User Privacy

PATE or a similar disjoint partition ensemble strategy for:  

  • protecting user private information in health or other human services datasets 
  • preventing an adversary from extracting specifics of sensitive training samples by examining a model deployed in unsecured areas;
  • partitioning sensitive data into disjoint sets so that leakage of one partition limits the damage caused by an insider threat acting alone;
  • outsourcing training teacher models on non-sensitive data, later ensembling with teachers trained on sensitive data or using sensitive auxiliary dataset to train student model.

PATE one of few approaches specific to DP in ML https://www.nist.gov/blogs/cybersecurity-insights/how-deploy-machine-learning-differential-privacy  

Utility impacting partitions could occur:

    • Accidentally (random split)
    • Naturally or Structurally (separate private datasets, sensitivity restrictions on part of dataset, keep frames from unique flight together)
    • Adversarially (insider compromises the splitting algorithm) 🡨 worst case

22 of 29

Constructing Adversarial Partitions

Coverage Metrics + Computational Construction Algorithm

[Images: Wolf, Husky, ???]

  • Hypotheses: (1) it is possible to construct a partition that negatively impacts student model utility on certain interactions when certain properties of the private dataset hold; (2) an adversarially constructed data partition is detectable by analyzing set relationships/distributions between teacher partitions
  • Similar to data poisoning attack, but no data injected 🡪 mechanisms to detect non-original (poisoned) samples should fail

23 of 29

Backup:

Systematic Inclusion/Exclusion

24 of 29

Experimental Framework

25 of 29

Results on RarePlanes

26 of 29

Results on RarePlanes

27 of 29

Results on RarePlanes

28 of 29

Results on RarePlanes

29 of 29

Results on RarePlanes