5/15/2024
May 16, 2024
Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing
Erin Lanus, Brian Lee
Laura Freeman
Laura.freeman@vt.edu
Overview
Data Assurance Necessary for AI Assurance
[Figure: example images labeled wolf, husky, and wolf?]
Explainability tool LIME highlights salient parts of image (snow)
Identifying root cause requires analysis of data coverage (lack of huskies in snow)
Data Coverage Operating Envelope Estimation
Problem:
Can we deploy a computer vision model for detecting vehicles from Southern CA to Northern CA?
Can we tell operators when to trust the model?
What are the top information gain data points for improving the model for this deployment?
Solution:
Technical Foundation: Coverage Metrics
[Figure: set diagrams showing Dt, and St and Tt in several configurations (including St = Tt), within the universe Ut]
Explainable at 3 levels of complexity
1. # interactions present/absent
2. Which interactions present/absent
3. Distribution of interactions
[1] E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial testing metrics for machine learning,” in 2021 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 81-84, IEEE, 2021.
Hair | Live Birth | Eco | Class |
No | Yes | Ocean | Orca |
Yes | Yes | Woods | Wolf |
Yes | Yes | Ocean | Otter |
No | No | Woods | Owl |
2-way combination:
Hair, Eco
Hair, Live Birth
2-way interaction:
Hair=No, Eco=Ocean
Eco=Ocean, Live Birth=Yes
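The three levels of explanation can be illustrated on the four-row example table. This is a minimal sketch, not the implementation from [1]. It assumes that Class is a label and is left out of the interactions, and that each feature's domain is just the values seen in the table. The helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

FEATURES = ("Hair", "Live Birth", "Eco")
ROWS = [
    ("No", "Yes", "Ocean"),   # Orca
    ("Yes", "Yes", "Woods"),  # Wolf
    ("Yes", "Yes", "Ocean"),  # Otter
    ("No", "No", "Woods"),    # Owl
]
# Assumption: the value domains are only the values that appear in the table.
DOMAINS = (("No", "Yes"), ("No", "Yes"), ("Ocean", "Woods"))


def row_interactions(row, t):
    """All t-way interactions (tuples of (feature, value) pairs) in one row."""
    return [
        tuple((FEATURES[i], row[i]) for i in idx)
        for idx in combinations(range(len(row)), t)
    ]


def universe(t):
    """Every possible t-way interaction given DOMAINS."""
    result = set()
    for idx in combinations(range(len(FEATURES)), t):
        for values in product(*(DOMAINS[i] for i in idx)):
            result.add(tuple((FEATURES[i], v) for i, v in zip(idx, values)))
    return result


counts = Counter(x for row in ROWS for x in row_interactions(row, 2))
present = set(counts)
missing = universe(2) - present
coverage = len(present) / len(universe(2))

# Level 1: how many interactions are present/absent.
print(f"{len(present)} present, {len(missing)} absent, coverage {coverage:.2f}")
# Level 2: which interactions are absent.
print(sorted(missing))
# Level 3: how often each present interaction occurs.
print(counts)
```

Under these assumptions, 10 of the 12 possible 2-way interactions appear in the table; the two absent ones are Hair=Yes with Live Birth=No, and Live Birth=No with Eco=Ocean.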
Combinatorial Coverage Published Results
Coverage metrics correlate with performance drop direction
North covers (93%) more contexts than South (83%)
North covers more of South (98%) than South covers of North (88%)
[1] E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial testing metrics for machine learning,” in 2021 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 81-84, IEEE, 2021.
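The directional comparison between two data sets can be sketched as a set difference over t-way interactions: the fraction of one set's interactions that the other set lacks. The rows below are hypothetical toy data (columns: time, weather, road), not the data behind the percentages above, and the function names are illustrative only.

```python
from itertools import combinations


def interactions(rows, t):
    """Set of t-way interactions, as ((column_index, value), ...) tuples."""
    return {
        tuple((i, row[i]) for i in idx)
        for row in rows
        for idx in combinations(range(len(row)), t)
    }


def set_difference_coverage(target, source, t):
    """Fraction of the target's t-way interactions that are absent from source."""
    target_t = interactions(target, t)
    source_t = interactions(source, t)
    return len(target_t - source_t) / len(target_t)


# Hypothetical toy data; columns are (time, weather, road).
north = [
    ("day", "clear", "highway"),
    ("day", "fog", "city"),
    ("night", "clear", "city"),
]
south = [
    ("day", "clear", "highway"),
    ("night", "clear", "highway"),
]

south_not_in_north = set_difference_coverage(south, north, 2)
north_not_in_south = set_difference_coverage(north, south, 2)
print(f"North covers {1 - south_not_in_north:.0%} of South")
print(f"South covers {1 - north_not_in_south:.0%} of North")
```

The measure is asymmetric by construction, which is what makes it usable for predicting the direction of a performance drop.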
Retraining the source model with a few images from the target domain, selected by set difference combinatorial coverage, achieves higher accuracy than retraining with the same number of randomly selected images
[2] S. F. Ahamed, P. Aggarwal, S. Shetty, E. Lanus, L. J. Freeman, “ATTL: an automated targeted transfer learning with deep neural networks,” in 2021 IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1-7.
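One way to pick a few target-domain samples by set difference is a greedy pass that repeatedly takes the sample contributing the most interactions the source set lacks. This greedy heuristic is an assumption made for illustration and is not claimed to be the selection procedure used in [2]. The data and function names are hypothetical.

```python
from itertools import combinations


def row_interactions(row, t):
    """Set of t-way interactions, as ((column_index, value), ...), in one row."""
    return {
        tuple((i, row[i]) for i in idx)
        for idx in combinations(range(len(row)), t)
    }


def select_by_set_difference(source, target, t, k):
    """Greedily pick up to k target row indices that add the most t-way
    interactions not yet covered by the source set or by earlier picks."""
    covered = set().union(*(row_interactions(r, t) for r in source))
    chosen = []
    for _ in range(k):
        gains = {
            i: len(row_interactions(r, t) - covered)
            for i, r in enumerate(target)
            if i not in chosen
        }
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break  # nothing left in the target that the source lacks
        chosen.append(best)
        covered |= row_interactions(target[best], t)
    return chosen


# Hypothetical toy metadata; columns are (time, weather, road).
source = [("day", "clear", "highway"), ("night", "clear", "highway")]
target = [
    ("day", "clear", "highway"),
    ("day", "fog", "city"),
    ("night", "clear", "city"),
]

picked = select_by_set_difference(source, target, t=2, k=2)
print(picked)
```

The first target row adds nothing new relative to the source, so it is never selected, however large k is.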
Test samples whose interactions are all covered by the train set have higher accuracy than samples with some interactions not covered → create representative or challenging test sets
[3] T. Cody, E. Lanus, D. Doyle, L. J. Freeman, “Systematic training and testing for machine learning using combinatorial interaction testing,” in 2022 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), pp. 102-109, IEEE, 2022.
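Splitting test samples by whether the train set covers all of their interactions can be sketched as follows. This is a simplified illustration under assumed toy metadata, not the experimental code from [3]. The function names are hypothetical.

```python
from itertools import combinations


def row_interactions(row, t):
    """Set of t-way interactions, as ((column_index, value), ...), in one row."""
    return {
        tuple((i, row[i]) for i in idx)
        for idx in combinations(range(len(row)), t)
    }


def split_by_coverage(train, test, t):
    """Return (covered, uncovered) lists of test row indices.

    A test row is 'covered' when every one of its t-way interactions
    also appears somewhere in the train set.
    """
    train_t = set().union(*(row_interactions(r, t) for r in train))
    covered, uncovered = [], []
    for i, row in enumerate(test):
        if row_interactions(row, t) <= train_t:
            covered.append(i)
        else:
            uncovered.append(i)
    return covered, uncovered


# Hypothetical toy metadata; columns are (time, weather, road).
train = [("day", "clear", "highway"), ("night", "clear", "highway")]
test = [
    ("day", "clear", "highway"),
    ("day", "fog", "city"),
    ("night", "clear", "city"),
]

covered, uncovered = split_by_coverage(train, test, t=2)
print("representative test set:", covered)
print("challenging test set:", uncovered)
```

The covered rows form a test set representative of the training data, while the uncovered rows form a more challenging one.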
[Figure label: Smean=2, Class=1]
Systematic Inclusion & Exclusion Experimental Framework
Challenge: How do we determine the features and metadata categories that define the operating envelopes?
Solution: Design of Experiments!
[4] E. Lanus, B. Lee, L. Pol, D. Sobien, J. Kauffman, L. J. Freeman, “Coverage for Identifying Critical Metadata in Machine Learning Operating Envelopes,” to appear in 2024 IEEE International Conference on Software Testing Verification and Validation Workshops (ICSTW), IEEE, 2024.
Results on RarePlanes
Results on RarePlanes - EDA
Modeling Approach
Model Results
[Figure: Systematic Inclusion/Exclusion (SIE) vs. Control; panels: Precision, Recall, F1]
Response Variable | Model p-value (Included in training) |
Precision | 0.4225 |
Recall | 0.9008 |
F1 | 0.7140 |
Results on RarePlanes - Precision
Contrast Significance | P-Value |
Avg Pan Res Excluded/included *Covered | 0.0296 |
Biome Excluded/included *Covered | 0.0002 |
Hour of Day Excluded/included *Covered | 0.0029 |
Off Nadir Max Excluded/included *Covered | 0.6406 |
Season Excluded/included *Covered | 0.1493 |
Results on RarePlanes - Recall
Contrast Significance | P-Value |
Avg Pan Res Excluded/included *Covered | 0.0237 |
Biome Excluded/included *Covered | 0.0007 |
Hour of Day Excluded/included *Covered | 0.0011 |
Off Nadir Max Excluded/included *Covered | 0.2393 |
Season Excluded/included *Covered | 0.0383 |
Results on RarePlanes - F1
Contrast Significance | P-Value |
Avg Pan Res Excluded/included *Covered | 0.0218 |
Biome Excluded/included *Covered | 0.0002 |
Hour of Day Excluded/included *Covered | 0.0012 |
Off Nadir Max Excluded/included *Covered | 0.3729 |
Season Excluded/included *Covered | 0.0627 |
Coverage of Data Explorer (CODEX)
Visualize Coverage Metrics
Plot data coverage metrics, input human operator choices and objectives
Know areas of input space with little/no data
Partition, Train, Test
Algorithmically generate partitions of data w.r.t. coverage, automatically train/evaluate models
Generate minimal train sets, certify gold and robust test sets
Delta Performance Coverage
Between partitions assessment: plot change in performance against change in coverage at interaction and model scales
Know interactions with lower performance, effect of data coverage on resulting model
CODEX Functions
Coverage Applications & Future Work
Explainability: Selection from Model Zoo
Operator Trust: Data & Model Card Interoperability
Robustness: Blue Team/White Box Assessment
Security: Red Team/Black Box Assessment
Use case: sponsor has model trained in desert and upcoming mission in arctic.
Automate coverage computation of data to assist in course of action decision
Enough coverage
Better coverage elsewhere
Not enough coverage in specific areas
Apply data coverage in model development via planned data collection and partitioning to:
– Reduce time, effort, and cost of training and testing
– Improve robustness by knowing model boundaries and filling gaps
Improve efficiency:
Training and testing
Data collection/labeling
Improve robustness:
Natural events
Adversarial actions
Enhance statistics on data cards with interaction coverage, gaps
Summarize prior to sharing → “need to share”
Expand model cards with coverage informed performance analysis, critical metadata
Combinatorial arrays for black box software testing to construct test suites to search input space:
– Construct low performing samples
– Estimate operating envelope
– Guess training data used
Coverage based approaches to
– Data poisoning
– Adversarial partitioning
Backup:
Adversarial Testing Future Directions
Data Poisoning
Private Aggregation of Teacher Ensembles (PATE)
Images from http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html
PATE Framework: N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data,” arXiv preprint arXiv:1610.05755.
Adversarial Partitions: Beyond User Privacy
PATE or a similar disjoint partition ensemble strategy for:
PATE is one of few approaches specific to differential privacy (DP) in ML: https://www.nist.gov/blogs/cybersecurity-insights/how-deploy-machine-learning-differential-privacy
Utility impacting partitions could occur:
Constructing Adversarial Partitions
Coverage Metrics + Computational Construction Algorithm
[Figure: images labeled Wolf, Husky, and ???]
Backup:
Systematic Inclusion/Exclusion
Experimental Framework
Results on RarePlanes