1 of 52

MACHINE LEARNING AND YOUR RESEARCH

Dr. Savannah Thais

Columbia University Data Science Institute

06/22/2023

2 of 52

Welcome!

About me:

    • Particle physics PhD
    • Now working in ML development and AI Ethics
    • Excited about scientific ML


About you:

    • Research with a quantitative/data component
    • Curious about ML
    • Deep knowledge of your domain and its limitations

What this course is:

    • What kind of problems are appropriate for ML
    • Conceptual introduction to algorithms
    • Spark interest and learn what to follow up on

What it’s not:

    • Mathematical or coding-based
    • A how-to guide
    • An exhaustive overview of every model
    • Going to make you an ML expert

3 of 52

Outline

  • ML Basics

  • Thinking Like a Machine

  • Supervised Algorithms

  • Unsupervised Algorithms

  • Evaluating Models

  • Discussion


4 of 52


10 Minute Overview

5 of 52

Killer Robots! Or, What is Machine Learning?

ML can be a great tool to enable scientific (and other) research, but it doesn’t know what it doesn’t know


  • Algorithms that improve automatically through experience

  • Computer programs that can access data and use it to learn for themselves

  • Teaching a computer system how to make accurate predictions when fed data

  • Using statistics to find patterns in datasets without the patterns being explicitly stated

  • A field of study that gives computers the ability to learn without being explicitly programmed

6 of 52

Supervision Required?

Supervised Learning

  • Model is provided with labels
  • Algorithm learns relationship* between features and labels during training

*if that relationship can be expressed as a mathematical function


Unsupervised Learning

  • Model takes unlabeled or unclassified data
  • Algorithm learns patterns* or groupings in the data during training

*but no promises that those patterns are useful!

Bonus: Reinforcement Learning – models are rewarded for meeting goals

7 of 52

Supervision Required?

Supervised Learning

Example: distinguishing between two plant types


Unsupervised Learning

Example: grouping data into classes

8 of 52

What are some problems for ML?

What are some examples of ML in your daily life and how might the problem be defined?


9 of 52

What are Problems for ML?

Classification:

Predict a class label for an input


Regression:

Predict a continuous variable

Clustering:

Group similar inputs

Generation:

Construct new data within pattern


10 of 52

What are Problems for ML?

Association Rules:

Identify common patterns in data


Ranking:

Generate optimal orderings

Anomaly Detection:

Identify statistical outliers

Restructuring:

Transform data representations

11 of 52

What are NOT Problems for ML?*

Causation:

Models learn correlations, but can’t infer causality or intent


Precise Interpretability:

It’s often difficult to understand what a model is learning

Context:

Models are incapable of non-mathematical reasoning

Data Limitations:

Models can’t fix problems in data or learn without examples

12 of 52


Thinking Like A Machine

13 of 52

ML Should Be Scientific!

Designing a (good) ML model is like running a scientific experiment: we don’t know a priori what will work best


* Including how certain you are!


14 of 52

The Hypothesis

Your ML hypothesis is a combination of the model you want to build and the pattern you want to explore

    • “An algorithm can distinguish between normal and cancerous brain scans based only on pixel values”
    • “A model can simulate tau lepton decays within a defined margin of uncertainty”
  • Questions to consider as you construct your hypothesis:
    • What specifically do I want my model to be able to do?
    • What is the ideal outcome/use case of my experiment?
    • What will I consider a success (proving hypothesis) or failure (rejecting hypothesis)?
    • What kinds of outputs do I need the model to make and how will I use them?


15 of 52

The Experiment

Building, training, and evaluating your model is the experimental process of testing your hypothesis

  • Your learning goal, input data, and desired output structure can help determine what class of models to study
  • All components need to be quantifiable and measurable
    • What are your input features and how are they represented?
    • What is the specific learning task for the model?
    • How do you quantify how well the model is doing?
    • What metric can you use to compare different models?


16 of 52

Setting Up Your Data

Your model is only as good as your training data


  • Do you have examples of all data classes/ranges?
  • Are there patterns in your data you don’t want the model to exploit?
  • How expensive is it to create/collect more data or labels?
  • Is there noise in your label creation or distribution?
  • Are the available labels related to the decision you want to make?
  • How much data is available and does each entry have the same information?
  • Are classes and inputs balanced and normalized?

17 of 52

Example Problems

Predicting author’s political associations from ‘mind metaphors’

    • Supervised multi-class classifier
    • Each metaphor (embedded) is an input data point
    • Evaluate based on accuracy of predictions (% of correct classifications)
    • Watch out! Could learn to associate writing styles with individual authors


Distinguishing genetic cohorts

    • Unsupervised clustering
    • Patient’s full genome as input data
    • Evaluate based on measurable differences (disease manifestation) between clusters and gene pathway analysis of differences

18 of 52

Your Turn!

Think of a research problem where you want to employ ML:

  • What is your (testable!) hypothesis?
  • What data could you use as inputs?
  • What would the learning goal be?
  • How would you quantify success?


19 of 52


Supervised Algorithms

20 of 52

How Do We Do This?


Make an Inference!

21 of 52

Training a Model

Data Split and Purpose:

  • Training Set → Train Model
  • Validation Set → Evaluate Model During Training
  • Test Set → Evaluate Final Model
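This split can be sketched in Python with scikit-learn (a hypothetical 100-sample dataset; the 60/20/20 proportions are illustrative choices, not a rule):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 3 features, binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First hold out the test set (20%), then split the remainder into
# training (60% of the total) and validation (20% of the total)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is touched only once, at the very end; tuning against it would leak information into the final evaluation.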

22 of 52

Over and Under Fitting

Underfitting:

  • Didn’t learn general patterns in data
  • High bias, low variance
  • Inaccurate inference

Overfitting:

  • Learned specific details of training set
  • Low bias, high variance
  • Inaccurate inference
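Both failure modes can be seen with NumPy polynomial fits on a hypothetical noisy quadratic (the degrees are arbitrary choices): a degree-1 model can’t capture the curvature (high bias), while a high-degree model chases the noise in the training points (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x**2 + rng.normal(0, 0.05, size=x.shape)   # noisy quadratic training data
x_new = np.linspace(-1, 1, 100)                # unseen points
y_new = x_new**2                               # true values at unseen points

def fit_and_score(degree):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    return train_mse, test_mse

# degree 1 underfits: poor on training AND unseen data (high bias)
# degree 15 drives training error down, partly by fitting noise (high variance)
```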

23 of 52

Over and Under Fitting


24 of 52

Decision Trees

Input data is processed through a series of linear cuts

    • Generally, different splits on different features are tried and the one with the lowest cost is selected at each ‘branch’
    • All data in a ‘leaf’ is assigned the same label


  • Important parameters:
    • When to stop splitting
    • Pruning process
  • Data type:
    • Fixed, limited number of continuous/quantized features
  • Pros:
    • Always interpretable
    • Easy to train
  • Cons:
    • Always linear
    • Easy to overtrain
    • High variance
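A quick sketch with scikit-learn’s `DecisionTreeClassifier` on hypothetical toy data where the label is a simple threshold on one feature; `max_depth` is one of the stop-splitting knobs mentioned above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: label is 1 exactly when the first feature exceeds 0.5
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)

# max_depth caps how many times the tree may split (helps against overtraining)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[0.9, 0.1], [0.1, 0.9]]))  # [1 0]
```

Because the tree is a sequence of linear cuts, its learned rules can be printed and inspected, which is the “always interpretable” advantage.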

25 of 52

Forests of Trees

Multiple decision trees are combined into an ensemble classifier

    • Boosting: misclassified data points given a higher weight in next tree
    • Bagging: train trees on random subsets of data (can extend to random subsets of features)


  • Important parameters:
    • Depth and pruning
    • Combination function
  • Data type:
    • Fixed, limited number of continuous/quantized features
  • Pros:
    • ~Interpretable
    • Reduced variance
    • Can use higher dimensional data
  • Cons:
    • Less precise regression
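The bagging variant can be sketched with scikit-learn’s `RandomForestClassifier` on the same kind of hypothetical threshold data; since the label depends only on the first feature, the ensemble’s feature importances should reflect that.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 2))
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 matters

# Bagging: each of the 50 trees trains on a bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)  # feature 0 dominates
```

Averaging many high-variance trees is what buys the reduced variance listed above, at the cost of the single tree’s full interpretability.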

26 of 52

Neural Networks

Data is transformed into a space with easier decision boundaries

    • Transformed linearly through matrix multiplications (learned weights)
    • Transformed non-linearly through activation functions (neurons)
    • The label errors are backpropagated to adjust network weights


  • Important parameters:
    • Depth + hidden dimensions
    • Learning rate + regularization
    • Activation functions + loss
  • Data type:
    • Fixed number of continuous/quantized features
  • Pros:
    • Captures non-linear relationships
    • Can handle high-dimensional input
  • Cons:
    • Black box
    • Requires extensive training data + time
    • Sensitive to hyperparameters
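To see why the non-linear activation matters, here is a tiny NumPy forward pass with hand-set weights (in practice training would learn these by backpropagation) that computes XOR, a function no purely linear model can represent:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)   # linear transform (weights), then non-linear activation
    return W2 @ h + b2      # linear readout

# Hand-chosen weights implementing XOR on binary inputs
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, forward(np.array([a, b], float), W1, b1, W2, b2))
```

Dropping the `relu` collapses the two layers into a single matrix product, i.e. a linear model, and XOR becomes unlearnable.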

27 of 52

Convolutional Neural Networks

Create multiple transformed (convolved) versions of input data

    • Convolutions are small, learned matrices that collect local information and learn hierarchical features
    • Transformed inputs are typically used in a classification or regression


  • Important parameters:
    • Size and number of convolutions, activations, network structure, rates…
  • Data type:
    • fixed size matrices (often used with images)
  • Pros:
    • learns locational features, hierarchical features, works with high dimensional data
  • Cons:
    • complex architecture, longer training, local minima
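The convolution operation itself is small enough to sketch directly in NumPy; here a hypothetical hand-set 1×2 filter responds to a vertical edge (a real CNN learns its filters during training):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, summing elementwise products at each position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((5, 5))
image[:, 2:] = 1.0                 # bright right half: a vertical edge
kernel = np.array([[-1.0, 1.0]])   # responds where intensity jumps left-to-right
response = conv2d(image, kernel)   # strongest exactly at the edge location
```

Because the same small kernel is reused at every position, the filter detects its pattern wherever it occurs, which is the “locational features” advantage above.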

28 of 52

Recurrent Neural Networks

Output of previous sequence step is used to process next step

    • Network retains ‘memory’ of previous inputs
    • Network weights and recurrence weights are updated at each step


  • Important parameters:
    • Hidden state size, recurrent unit types, activation functions, directionality, rates…
  • Data type:
    • Variable length sequence data (often used with time series, text, video)
  • Pros:
    • Allows variable length inputs, maintains sequencing
  • Cons:
    • complex architecture, vanishing gradient, limited long-term memory, difficult to stack
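The recurrence can be sketched in a few lines of NumPy with hypothetical fixed weights: the same parameters are reused at every step, and sequences of any length end in a hidden state of the same fixed size.

```python
import numpy as np

def rnn(sequence, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])               # hidden state: the network's 'memory'
    for x in sequence:                       # one step per sequence element
        h = np.tanh(W_x * x + W_h @ h + b)   # same weights reused at every step
    return h

# Hypothetical fixed weights for a 2-unit hidden state (training would learn these)
W_x = np.array([0.5, 0.1])
W_h = np.array([[0.2, 0.0], [0.0, 0.2]])
b = np.zeros(2)

h_short = rnn([1.0, 2.0, 3.0], W_x, W_h, b)
h_long = rnn([1.0, 2.0, 3.0, 4.0, 5.0], W_x, W_h, b)
# Both hidden states have shape (2,), regardless of input length
```

The repeated `tanh` applications also hint at the vanishing-gradient con: contributions from early inputs shrink as they are squashed step after step.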

29 of 52

Generative Adversarial Networks

Two NNs with different learning goals train together

    • Generator network creates new data from random noise
    • Discriminator network distinguishes generated from real data


  • Important parameters:
    • Depth + hidden dimensions
    • Learning rate + regularization
    • Loss function/network balancing
  • Data type:
    • Fixed size labeled data (features, images, text…)
  • Pros:
    • Can be faster than other simulations
    • Allows various data types
  • Cons:
    • Training can be unstable
    • Societal dangers

30 of 52

Graph Neural Networks


Learn a smart re-embedding of the graph data that preserves the relational structure

  • Important parameters:
    • Aggregator, size of message passing and update networks, graph construction, message features
  • Data type:
    • Variable length relational or geometrically structured data
  • Pros:
    • Allows variable length inputs, maintains relational structure
  • Cons:
    • Complex architecture, over-smoothing

31 of 52

Transformers


Use multi-headed attention with encoder and decoder for sequence to sequence modeling

  • Output embedding is trained to predict sequence shifted by one word
    • Q = vector representation of one word, K = vector representation of all words, V = a (different) vector representation of all words

32 of 52


Unsupervised Algorithms

33 of 52

Clustering

K-Means

  • Centroids selected randomly
  • Points assigned to closest centroid
  • Centroid adjusted towards cluster mean
  • Best for circular clusters


Group similar data points into a cluster (you define similarity)


DBSCAN

  • For a starting point, all neighbors within a set radius are found
  • If enough neighbors, all neighbors are added to cluster
  • More parameters, but allows different cluster shapes
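Both algorithms are available in scikit-learn; a sketch on hypothetical well-separated circular blobs (where the two methods should agree), with `eps` and `min_samples` as DBSCAN’s extra parameters:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.1, size=(50, 2))
blob_b = rng.normal([3.0, 3.0], 0.1, size=(50, 2))
X = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # you choose k
db = DBSCAN(eps=0.5, min_samples=5).fit(X)                    # you choose density params

# Each blob should end up in its own cluster under both methods
```

On elongated or nested cluster shapes the two would diverge: K-means would still carve circular regions, while DBSCAN follows the density.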

34 of 52

Dimensionality Reduction

PCA

  • Derives an orthonormal basis for the dataset
  • Each dimension aims to be linearly uncorrelated
  • You select desired dimensionality


Group features into higher-level, informative sets

SVD

  • A generalization of PCA that finds all eigenvectors of the dataset covariance matrix
  • Eigenvectors ranked by their ‘singular value’
  • In highly correlated data, many SVs will be small
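The “small singular values in correlated data” point can be checked directly with NumPy on a hypothetical dataset whose two features are near-copies of each other:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)
# Feature 2 is feature 1 plus tiny noise: the features are highly correlated
X = np.column_stack([z, z + rng.normal(0.0, 0.01, size=200)])
Xc = X - X.mean(axis=0)                  # center before SVD/PCA

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of variance per direction
# Nearly all variance sits in the first direction; the second SV is tiny
```

Dropping the directions with small singular values is exactly the dimensionality reduction: here the 2-D data is effectively 1-D.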

35 of 52

Autoencoders

  • Consists of an encoder that re-embeds the data and a decoder that attempts to reconstruct the original data
    • Both are typically feed-forward NNs
  • Different ways to enforce loss
    • Sparsity (only allow some activations)
    • Undercomplete (reduced hidden dim)
    • Contractive (penalize the embedding’s sensitivity to input changes)
    • Variational (predict latent distributions)
  • Wide range of uses:
    • Learn useful embedding spaces
    • Denoise samples
    • Dimensionality reduction


Lossily transform data such that it can be transformed back
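A linear, undercomplete sketch in NumPy (assuming 2-D data compressed to a 1-D code): for linear layers with MSE loss, the optimal encoder/decoder align with the top principal direction, so we can read them off the SVD instead of training a network.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=300)
X = np.column_stack([z, 2.0 * z]) + rng.normal(0.0, 0.05, size=(300, 2))
Xc = X - X.mean(axis=0)

# Undercomplete linear autoencoder: 2 features -> 1-D code -> 2-D reconstruction
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
encode = lambda data: data @ Vt[:1].T    # encoder: project onto top direction
decode = lambda code: code @ Vt[:1]      # decoder: map the code back to 2-D
X_rec = decode(encode(Xc))
mse = np.mean((Xc - X_rec) ** 2)         # small: little information was lost
```

Real autoencoders replace these projections with non-linear neural networks, which is what lets them learn embeddings richer than PCA’s.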

36 of 52

Anomaly Detection

Classification Based:

Data contains examples of outliers


Clustering Based:

Identify points outside of clusters

Dimension Based:

Assume components normal

Autoencoder Based:

Assume embedding is normal

Find outliers in data by understanding underlying distribution

37 of 52

Your Turn!

What model(s) would you use in your experiment?

  • What parameters would you need to consider?
  • Would you have enough/the right training data?
  • How would the model enable useful research decisions?


38 of 52

Evaluating Models

39 of 52

AI/ML Has A Hype Problem


40 of 52

Characterizing Model Output

Understanding the distribution of model errors, and how they relate to the research task you’re trying to solve, is KEY to ensuring reliability and validity of your ML models


  • Some questions to help:
    • What data selections or biases contributed to the distribution? 
    • How do we ensure the distribution is modeled correctly? What kinds of tests can we use? 
    • What should the distribution look like (or what do we want it to look like given certain assumptions/goals)?
    • If anything is incorrect or needs to be changed, what knobs can we turn?
  • Sometimes these questions are very hard or impossible to answer!

41 of 52

Consider All Steps of the Pipeline


42 of 52

Choosing a Cost Function + Evaluation

Loss Functions

  • Common choices:
    • Mean Squared Error

    • Hinge Loss

    • Cross Entropy

  • Considerations:
    • Type of model
    • Presence of outliers in dataset
    • Sensitive outcomes
    • Training behavior


Evaluation Metrics

  • Common choices:
    • Accuracy
    • Confusion matrix
    • Sensitivity
    • Specificity
    • False positive rate
    • Precision
    • Recall
    • F1 Score
  • Considerations:
    • What type of outcomes matter?
    • Typically should report multiple
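The “report multiple metrics” advice is easy to demonstrate with scikit-learn on hypothetical imbalanced labels, where accuracy looks reassuring while recall reveals the problem:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# Hypothetical imbalanced task: 90 negatives, 10 positives;
# the model finds only 2 of the 10 positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 2 + [0] * 8)

print(accuracy_score(y_true, y_pred))    # 0.92 -- looks fine
print(precision_score(y_true, y_pred))   # 1.0  -- no false positives
print(recall_score(y_true, y_pred))      # 0.2  -- misses most positives
print(confusion_matrix(y_true, y_pred))  # TN=90, FP=0, FN=8, TP=2
```

If the positives are the sensitive outcome (e.g. disease cases), recall is the number that matters, and accuracy alone would hide the failure.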

43 of 52

Racial Bias in Healthcare Risk Assessment


44 of 52

Understanding the Distributions

  1. Determine an appropriate measure of model validity
  2. Check the performance (closure) of the model across all areas where it will be applied (each race)
  3. See that it underperforms in some phase spaces…why could that be?
  4. Interrogate model variables and labels to look for statistical sources of bias  
  5. Fix it…in this case by defining a better training label and potentially de-biasing historical data and/or accounting for uncertainty


45 of 52

Predictive Policing: Predictions


46 of 52

Predictive Policing: Arrests


47 of 52

Understanding the Distributions

  • What is an appropriate measure of performance here? (not all crimes are reported)
  • Where could the apparent bias in prediction arise from? How could we understand if it’s accurate?
  • How does the system itself affect the phase space the developers are trying to measure?
  • How could we start to address these concerns?


48 of 52

Just the Beginning

  • There are MANY other types of ML models + learning mechanisms
    • Geometric machine learning/graph neural networks
    • Semi-supervised, self-supervised, transfer learning, etc
    • Bayesian NNs and inverse problems/likelihood free inference
  • Other important considerations when applying ML to research
    • Interpretability/explainability: many models are black box, but we want to use them to inform our understanding of the world
    • Uncertainty quantification: need to understand how much we can trust the model prediction and under what circumstances it’s valid
  • Sociotechnical and ethical considerations around ML
  • ML is a broad, multi-faceted field and it’s difficult to stay up-to-date on innovations and best practices
    • Interdisciplinary and cross-campus collaboration is key!


49 of 52

Thank you!

  • Happy to answer any questions!

  • st3565@columbia.edu @basicsciencesav


50 of 52

Backup


51 of 52

Key Terms

  • Optimization: minimizing an objective function f(x) parameterized by x (minimize loss function over model params)
    • Converges to global min if the objective function is convex, otherwise converges within a neighborhood
  • Loss/cost function: equation that quantifies algorithm performance
    • Common functions include: MSE, MAE, Log Likelihood
  • Model Parameters: the values the model fits/learns during training
    • Ex: slope in linear regression
  • Hyperparameter: adjustable (non-learned) parameters that must be tuned for optimal training/inference
  • Training: adjusting model parameters to optimize the loss function
  • Inference: using a trained model to draw conclusions about unseen data
  • Overfitting: a model learns artifacts of the training dataset that are not representative of the full data distribution (memorization)


52 of 52

Gradient Descent

  • Common optimization algorithm for training ML models
    • First order (uses first derivatives), loss function must be differentiable
  • At each training iteration, update model parameters in the direction opposite the gradient of the loss function (the gradient points in the direction of steepest ascent)
    • Essentially, finding the slope of the loss function with current model values and adjusting values to move loss function towards the minimum
    • Step/adjustment size is determined by learning rate (hyperparameter)
  • For non-convex functions the initial parameterization matters
  • Common variations
    • Batching: sum gradient over a set of training examples
    • Stochastic: adjust model for each training example
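The whole loop fits in a few lines of NumPy; a sketch minimizing an MSE loss for a one-parameter linear model on hypothetical data y = 3x (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x                      # hypothetical data generated with true slope 3

w = 0.0                          # initial model parameter
lr = 0.5                         # learning rate: the step-size hyperparameter
for _ in range(200):
    grad = np.mean(2.0 * (w * x - y) * x)   # dL/dw for L = mean((w*x - y)^2)
    w -= lr * grad                          # step opposite the gradient

print(w)  # converges to ~3.0
```

Here the loop sums the gradient over all 50 examples at once (batch gradient descent); the stochastic variant would update `w` after each example instead.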
