MACHINE LEARNING AND YOUR RESEARCH
Dr. Savannah Thais
Columbia University Data Science Institute
06/22/2023
Welcome!
About me:
Savannah Thais 06/22/2023
About you:
What this course is:
What it’s not:
Outline
10 Minute Overview
Killer Robots! Or, What is Machine Learning?
ML can be a great tool to enable scientific (and other) research, but it doesn’t know what it doesn’t know
Algorithms that improve automatically through experience
Computer programs that can access data and use it to learn for themselves
Teaching a computer system how to make accurate predictions when fed data
Using statistics to find patterns in datasets without the patterns being explicitly stated
A field of study that gives computers the ability to learn without being explicitly programmed
Supervision Required?
Supervised Learning
*if that relationship can be expressed as a mathematical function
Un-supervised Learning
*but no promises that those patterns are useful!
Bonus: Reinforcement Learning – models are rewarded for meeting goals
Supervision Required?
Supervised Learning
Example: distinguishing between two plant types
Un-supervised Learning
Example: grouping data into classes
What are some problems for ML?
What are some examples of ML in your daily life and how might the problem be defined?
What are Problems for ML?
Classification:
Predict a class label for an input
Regression:
Predict a continuous variable
Clustering:
Group similar inputs
Generation:
Construct new data that follows a learned pattern
What are Problems for ML?
Association Rules:
Identify common patterns in data
Ranking:
Generate optimal orderings
Anomaly Detection:
Identify statistical outliers
Restructuring:
Transform data representations
What are NOT Problems for ML?*
Causation:
Models learn correlations, but can’t infer causality or intent
Precise Interpretability:
It’s often difficult to understand what a model is learning
Context:
Models are incapable of non-mathematical reasoning
Data Limitations:
Models can’t fix problems in data or learn without examples
Thinking Like A Machine
ML Should Be Scientific!
Designing a (good) ML model is like running a scientific experiment: we don’t know a priori what will work best
* Including how certain you are!
The Hypothesis
Your ML hypothesis is a combination of the model you want to build and the pattern you want to explore
The Experiment
Building, training, and evaluating your model is the experimental process of testing your hypothesis
Setting Up Your Data
Your model is only as good as your training data
Do you have examples of all data classes/ranges?
Are there patterns in your data you don’t want the model to exploit?
How expensive is it to create/collect more data or labels?
Is there noise in your label creation or distribution?
Are the available labels related to the decision you want to make?
How much data is available and does each entry have the same information?
Are classes and inputs balanced and normalized?
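Two of the checks above (class balance and input normalization) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the labels and measurements are made-up placeholders, and the function names are hypothetical.

```python
from collections import Counter
import statistics

def class_balance(labels):
    """Return the fraction of examples that carry each label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

def standardize(values):
    """Rescale one feature to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical toy data: three examples of one class, one of another.
labels = ["oak", "oak", "oak", "pine"]
heights = [12.0, 15.0, 11.0, 30.0]

balance = class_balance(labels)   # a 75% / 25% split signals imbalance
scaled = standardize(heights)
print(balance, scaled)
```

A strongly skewed balance or features on very different scales are both worth addressing before training.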
Example Problems
Predicting authors’ political associations from ‘mind metaphors’
Distinguishing genetic cohorts
Your Turn!
Think of a research problem where you want to employ ML:
Supervised Algorithms
How Do We Do This?
Make an Inference!
Training a Model
Data Split: Purpose
Training Set: Train Model
Validation Set: Evaluate Model During Training
Test Set: Evaluate Final Model
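A minimal sketch of the three-way split, using only the standard library. The 70/15/15 proportions, the seed, and the function name are assumptions chosen for illustration, not values from the course.

```python
import random

def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows once, then carve out test, validation, and training sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # shuffle so each split is a random sample
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]                     # touched once, for the final evaluation
    val = rows[n_test:n_test + n_val]        # used to monitor the model during training
    train = rows[n_test + n_val:]            # used to fit the model
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The splits do not overlap, so the test set stays unseen until the end.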
Over and Under Fitting
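One way to see over and under fitting is to compare training and validation error as model flexibility changes. The sketch below uses a k-nearest-neighbor regressor as a stand-in model (an assumption for illustration; the slides do not name one) on synthetic noisy data: k=1 memorizes the training set, while k=20 averages every training point into a single constant.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
# Synthetic data (assumed): a straight line y = 2x plus Gaussian noise.
train = [(float(x), 2 * x + rng.gauss(0, 1)) for x in range(20)]
val = [(x + 0.5, 2 * (x + 0.5) + rng.gauss(0, 1)) for x in range(20)]

errors = {
    k: (mean_squared_error(train, train, k), mean_squared_error(train, val, k))
    for k in (1, 5, 20)
}
for k, (train_err, val_err) in errors.items():
    print(f"k={k:2d}  train MSE={train_err:8.2f}  validation MSE={val_err:8.2f}")
```

Zero training error with nonzero validation error (k=1) is the signature of overfitting; large error on both sets (k=20) is the signature of underfitting.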
Decision Trees
Input data is processed through a series of linear cuts
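A minimal sketch of how a single cut can be learned: try every feature and every threshold between neighboring values, and keep the cut with the fewest misclassifications. A full tree would repeat this search recursively on each side of the cut. The plant measurements and labels below are invented for illustration.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_cut(rows, labels):
    """Return (feature, cut, label_below, label_above) for the best single cut."""
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2   # candidate threshold between two observed values
            below = [lab for row, lab in zip(rows, labels) if row[feature] <= cut]
            above = [lab for row, lab in zip(rows, labels) if row[feature] > cut]
            guess_below, guess_above = majority(below), majority(above)
            errors = sum(lab != guess_below for lab in below)
            errors += sum(lab != guess_above for lab in above)
            if best is None or errors < best[0]:
                best = (errors, feature, cut, guess_below, guess_above)
    return best[1:]

def predict(node, row):
    feature, cut, guess_below, guess_above = node
    return guess_above if row[feature] > cut else guess_below

# Hypothetical (petal length, petal width) measurements for two plant types.
rows = [(1.4, 0.2), (1.3, 0.3), (1.5, 0.2), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
labels = ["A", "A", "A", "B", "B", "B"]
node = fit_cut(rows, labels)
print(node, predict(node, (5.0, 1.6)))
```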
Forests of Trees
Multiple decision trees are combined into an ensemble classifier
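A rough sketch of the ensemble idea, assuming bagging with majority voting (the combination used by random forests). To stay short, each "tree" here is a single cut on one feature with labels 0/1; the data, tree count, and seed are illustrative assumptions.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-cut tree on (x, label) pairs: predict 1 when x > cut."""
    xs = sorted({x for x, _ in points})
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]

    def errors(cut):
        return sum((x > cut) != bool(label) for x, label in points)

    return min(cuts, key=errors)

def fit_forest(points, n_trees=25, seed=0):
    rng = random.Random(seed)
    # Each tree is trained on a bootstrap resample (drawn with replacement).
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]

def predict_forest(cuts, x):
    votes = Counter(int(x > cut) for cut in cuts)
    return votes.most_common(1)[0][0]   # majority vote across trees

# Hypothetical one-feature data: class 0 near 1.5, class 1 near 8.5.
points = [(1.0 + 0.1 * i, 0) for i in range(10)] + [(8.0 + 0.1 * i, 1) for i in range(10)]
forest = fit_forest(points)
print(predict_forest(forest, 1.5), predict_forest(forest, 8.5))
```

Because every tree sees a slightly different sample, their individual mistakes tend to differ, and the vote averages them out.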
Neural Networks
Data is transformed into a space with easier decision boundaries
Convolutional Neural Networks
Create multiple transformed (convolved) versions of input data
Recurrent Neural Networks
Output of previous sequence step is used to process next step
Generative Adversarial Networks
Two NNs with different learning goals train together
Graph Neural Networks
Savannah Thais 04/04/2023
Learn a smart re-embedding of the graph data that preserves the relational structure
Transformers
Use multi-head attention with an encoder and decoder for sequence-to-sequence modeling
Unsupervised Algorithms
Clustering
K-Means
Group similar data points into a cluster (you define similarity)
DBSCAN
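A minimal K-means sketch with Euclidean distance as the similarity measure. The starting centers are passed in explicitly to keep the example deterministic (an assumption made here); practical implementations usually try several random or k-means++ initializations, because a poor start can converge to a poor grouping.

```python
import math

def kmeans(points, centers, n_iter=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    k = len(centers)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]   # keep an empty cluster's center where it was
            for i, members in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical 2D data: two well-separated blobs.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(points, centers=[points[0], points[-1]])
print(centers)
```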
Dimensionality Reduction
PCA
Group features into higher-level, informative sets
SVD
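A small sketch of the idea behind PCA: find the direction along which the (centered) data varies most, then describe each point by its position along that direction. Power iteration on the covariance matrix is used here as a simple stand-in for the eigen or SVD solvers that libraries use; the data points are invented, and the fixed starting vector is an assumption that would fail if it happened to be orthogonal to the answer.

```python
import math

def first_component(points, n_iter=100):
    """Return the first principal direction and each point's score along it."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    v = [1.0] * dims   # assumed starting direction
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[d] * v[d] for d in range(dims)) for r in centered]
    return v, scores

# Hypothetical 2D data lying close to the line y = 2x.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
component, reduced = first_component(points)
print(component, reduced)
```

Here two correlated features are summarized by one score per point, with little information lost.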
Autoencoders
Lossily transform data such that it can be (approximately) transformed back
Anomaly Detection
Classification Based:
Data contains examples of outliers
Clustering Based:
Identify points outside of clusters
Dimension Based:
Assume components normal
Autoencoder Based:
Assume embedding is normal
Find outliers in data by understanding underlying distribution
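The simplest version of this idea is a z-score test: model the data with a mean and standard deviation, then flag points far from the mean. The sketch below is illustrative only; the readings and the 2.5 threshold are assumptions, and a single extreme value inflates the standard deviation it is measured against, which limits how large a z-score can get in small samples.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 25.0]
outliers = zscore_outliers(readings)
print(outliers)
```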
Your Turn!
What model(s) would you use in your experiment?
AI/ML Has A Hype Problem
Characterizing Model Output
Understanding the distribution of model errors, and how they relate to the research task you’re trying to solve, is KEY to ensuring reliability and validity of your ML models
Consider All Steps of the Pipeline
Choosing a Cost Function + Evaluation
Loss Functions
Evaluation Metrics
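As a small illustration of why the choice of metric matters, the sketch below scores a hypothetical classifier that always predicts the majority class on an imbalanced dataset. The data is invented, and returning 0.0 when a ratio is undefined is a convention assumed here.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)   # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)   # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)   # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)   # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical imbalanced data: 5 positives, 95 negatives, model always says "negative".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The accuracy looks strong even though the model never finds a single positive case, so the metric has to match the research question.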
Racial Bias in Healthcare Risk Assessment
Understanding the Distributions
Predictive Policing: Predictions
Predictive Policing: Arrests
Understanding the Distributions
Just the Beginning
Thank you!
Backup
Key Terms
further reading on: optimization, loss functions, training, inference, parameters, overfitting
Gradient Descent
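A minimal sketch of gradient descent, fitting a line y = w*x + b by repeatedly stepping the parameters against the gradient of the mean squared error. The data, learning rate, and step count are illustrative assumptions; too large a learning rate would make the updates diverge.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Minimize mean squared error of w*x + b with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(r * x for r, x in zip(residuals, xs))   # d(MSE)/dw
        grad_b = 2 / n * sum(residuals)                              # d(MSE)/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

The fitted parameters approach the slope and intercept that generated the data.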