Unsolved Problems in ML Safety
Presented by
Alexis Roger
Jean-Charles Layoun
by Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
AI Safety
The Four Main Problems
Swiss Cheese Model (Dan Hendrycks et al. 2021)
Unsolved Problems in ML Safety (Dan Hendrycks et al.)
Robustness
Robustness - Objectives
Robustness v.s. Adversaries
Monitoring
Monitoring
Anomaly detection
Representative model outputs
Hidden functionalities
Anomaly detection
Representative model outputs
Calibrating the probabilities
Know when to override them
Hidden functionality
“matte painting of a house on a hilltop at midnight with small fireflies flying around in the style of studio ghibli | artstation | unreal engine” (https://ml.berkeley.edu/blog/posts/clip-art/)
Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes
(Language Models are Few-Shot Learners)
Alignment
Alignment - Objectives
Alignment - Example
I usually give my children a birthday party but didn't this year because my children did not request a party
I deserve to get my hair dyed by my barber because I paid him to make my hair look nice.
Alignment - Example
I usually give my children a birthday party but didn't this year because my children did not request a party
I deserve to visit my friend in Atlanta, because she invited me and I would really like to see her, plus I could use a short getaway.
I deserve to get my hair dyed by my barber because I paid him to make my hair look nice.
Alignment - Research Directions
“When a measure becomes a target, it ceases to be a good measure.”
Goodhart’s Law
How do we choose the set of Values?
External Safety
External Safety
ML for cybersecurity:
External Safety
Informed decision making:
Conclusion
Any Questions?
Questions -Discord