1 of 52

MACHINE LEARNING AND YOUR RESEARCH

Dr. Savannah Thais

Columbia University Data Science Institute

06/22/2023

2 of 52

Welcome!

About me:

    • Particle physics PhD
    • Now working in ML development and AI Ethics
    • Excited about scientific ML


About you:

    • Research with a quantitative/data component
    • Curious about ML
    • Deep knowledge of your domain and its limitations

What this course is:

    • What kind of problems are appropriate for ML
    • Conceptual introduction to algorithms
    • Spark interest and learn what to follow up on

What it’s not:

    • Mathematical or coding-based
    • A how-to guide
    • An exhaustive overview of every model
    • Going to make you an ML expert

3 of 52

Outline

  • ML Basics

  • Thinking Like a Machine

  • Supervised Algorithms

  • Unsupervised Algorithms

  • Evaluating Models

  • Discussion


4 of 52


10 Minute Overview

5 of 52

Killer Robots! Or, What is Machine Learning?

ML can be a great tool to enable scientific (and other) research, but it doesn’t know what it doesn’t know


  • Algorithms that improve automatically through experience

  • Computer programs that can access data and use it to learn for themselves

  • Teaching a computer system how to make accurate predictions when fed data

  • Using statistics to find patterns in datasets without the patterns being explicitly stated

  • A field of study that gives computers the ability to learn without being explicitly programmed

6 of 52

Supervision Required?

Supervised Learning

  • Model is provided with labels
  • Algorithm learns relationship* between features and labels during training

*if that relationship can be expressed as a mathematical function


Unsupervised Learning

  • Model takes unlabeled or unclassified data
  • Algorithm learns patterns* or groupings in the data during training

*but no promises that those patterns are useful!

Bonus: Reinforcement Learning – models are rewarded for meeting goals

7 of 52

Supervision Required?

Supervised Learning

Example: distinguishing between two plant types


Unsupervised Learning

Example: grouping data into classes

8 of 52

What are some problems for ML?

What are some examples of ML in your daily life and how might the problem be defined?


9 of 52

What are Problems for ML?

Classification:

Predict a class label for an input


Regression:

Predict a continuous variable

Clustering:

Group similar inputs

Generation:

Construct new data within pattern


10 of 52

What are Problems for ML?

Association Rules:

Identify common patterns in data


Ranking:

Generate optimal orderings

Anomaly Detection:

Identify statistical outliers

Restructuring:

Transform data representations

11 of 52

What are NOT Problems for ML?*

Causation:

Models learn correlations, but can’t infer causality or intent


Precise Interpretability:

It’s often difficult to understand what a model is learning

Context:

Models are incapable of non-mathematical reasoning

Data Limitations:

Models can’t fix problems in data or learn without examples

12 of 52


Thinking Like A Machine

13 of 52

ML Should Be Scientific!

Designing a (good) ML model is like running a scientific experiment: we don’t know a priori what will work best


* Including how certain you are!


14 of 52

The Hypothesis

Your ML hypothesis is a combination of the model you want to build and the pattern you want to explore

    • “An algorithm can distinguish between normal and cancerous brain scans based only on pixel values”
    • “A model can simulate tau lepton decays within a defined margin of uncertainty”
  • Questions to consider as you construct your hypothesis:
    • What specifically do I want my model to be able to do?
    • What is the ideal outcome/use case of my experiment?
    • What will I consider a success (proving hypothesis) or failure (rejecting hypothesis)?
    • What kinds of outputs do I need the model to make and how will I use them?


15 of 52

The Experiment

Building, training, and evaluating your model is the experimental process of testing your hypothesis

  • Your learning goal, input data, and desired output structure can help determine what class of models to study
  • All components need to be quantifiable and measurable
    • What are your input features and how are they represented?
    • What is the specific learning task for the model?
    • How do you quantify how well the model is doing?
    • What metric can you use to compare different models?


16 of 52

Setting Up Your Data

Your model is only as good as your training data


  • Do you have examples of all data classes/ranges?
  • Are there patterns in your data you don’t want the model to exploit?
  • How expensive is it to create/collect more data or labels?
  • Is there noise in your label creation or distribution?
  • Are the available labels related to the decision you want to make?
  • How much data is available and does each entry have the same information?
  • Are classes and inputs balanced and normalized?

17 of 52

Example Problems

Predicting author’s political associations from ‘mind metaphors’

    • Supervised multi-class classifier
    • Each metaphor (embedded) is an input data point
    • Evaluate based on accuracy of predictions (% of correct classifications)
    • Watch out! Could learn to associate writing styles with individual authors


Distinguishing genetic cohorts

    • Unsupervised clustering
    • Patient’s full genome as input data
    • Evaluate based on measurable differences (disease manifestation) between clusters and gene pathway analysis of differences

18 of 52

Your Turn!

Think of a research problem where you want to employ ML:

  • What is your (testable!) hypothesis?
  • What data could you use as inputs?
  • What would the learning goal be?
  • How would you quantify success?


19 of 52


Supervised Algorithms

20 of 52

How Do We Do This?


Make an Inference!

21 of 52

Training a Model

Data Split and Purpose:

  • Training Set → Train Model
  • Validation Set → Evaluate Model During Training
  • Test Set → Evaluate Final Model
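This split can be sketched in Python with scikit-learn (a hypothetical 100-sample dataset; the 60/20/20 proportions are illustrative choices, not a rule):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 3 features, binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First hold out the test set (20%), then split the remainder into
# training (60% of the total) and validation (20% of the total)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is touched only once, at the very end; tuning against it would leak information into the final evaluation.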

22 of 52

Over and Under Fitting

Underfitting:

  • Didn’t learn general patterns in data
  • High bias, low variance
  • Inaccurate inference

Overfitting:

  • Learned specific details of training set
  • Low bias, high variance
  • Inaccurate inference
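Both failure modes can be seen with NumPy polynomial fits on a hypothetical noisy quadratic (the degrees are arbitrary choices): a degree-1 model can’t capture the curvature (high bias), while a high-degree model chases the noise in the training points (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x**2 + rng.normal(0, 0.05, size=x.shape)   # noisy quadratic training data
x_new = np.linspace(-1, 1, 100)                # unseen points
y_new = x_new**2                               # true values at unseen points

def fit_and_score(degree):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    return train_mse, test_mse

# degree 1 underfits: poor on training AND unseen data (high bias)
# degree 15 drives training error down, partly by fitting noise (high variance)
```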

23 of 52

Over and Under Fitting


24 of 52

Decision Trees

Input data is processed through a series of linear cuts

    • Generally, different splits on different features are tried and the one with the lowest cost is selected at each ‘branch’
    • All data in a ‘leaf’ is assigned the same label


  • Important parameters:
    • When to stop splitting
    • Pruning process
  • Data type:
    • Fixed, limited number of continuous/quantized features
  • Pros:
    • Always interpretable
    • Easy to train
  • Cons:
    • Always linear
    • Easy to overtrain
    • High variance
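A quick sketch with scikit-learn’s `DecisionTreeClassifier` on hypothetical toy data where the label is a simple threshold on one feature; `max_depth` is one of the stop-splitting knobs mentioned above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: label is 1 exactly when the first feature exceeds 0.5
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)

# max_depth caps how many times the tree may split (helps against overtraining)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[0.9, 0.1], [0.1, 0.9]]))  # [1 0]
```

Because the tree is a sequence of linear cuts, its learned rules can be printed and inspected, which is the “always interpretable” advantage.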

25 of 52

Forests of Trees

Multiple decision trees are combined into an ensemble classifier

    • Boosting: misclassified data points given a higher weight in next tree
    • Bagging: train trees on random subsets of data (can extend to random subsets of features)


  • Important parameters:
    • Depth and pruning
    • Combination function
  • Data type:
    • Fixed, limited number of continuous/quantized features
  • Pros:
    • ~Interpretable
    • Reduced variance
    • Can use higher dimensional data
  • Cons:
    • Less precise regression
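The bagging variant can be sketched with scikit-learn’s `RandomForestClassifier` on the same kind of hypothetical threshold data; since the label depends only on the first feature, the ensemble’s feature importances should reflect that.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 2))
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 matters

# Bagging: each of the 50 trees trains on a bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)  # feature 0 dominates
```

Averaging many high-variance trees is what buys the reduced variance listed above, at the cost of the single tree’s full interpretability.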

26 of 52

Neural Networks

Data is transformed into a space with easier decision boundaries

    • Transformed linearly through matrix multiplications (learned weights)
    • Transformed non-linearly through activation functions (neurons)
    • The label errors are backpropagated to adjust network weights


  • Important parameters:
    • Depth + hidden dimensions
    • Learning rate + regularization
    • Activation functions + loss
  • Data type:
    • Fixed number of continuous/quantized features
  • Pros:
    • Captures non-linear relationships
    • Can handle high-dimensional input
  • Cons:
    • Black box
    • Requires extensive training data + time
    • Sensitive to hyperparameters
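To see why the non-linear activation matters, here is a tiny NumPy forward pass with hand-set weights (in practice training would learn these by backpropagation) that computes XOR, a function no purely linear model can represent:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)   # linear transform (weights), then non-linear activation
    return W2 @ h + b2      # linear readout

# Hand-chosen weights implementing XOR on binary inputs
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, forward(np.array([a, b], float), W1, b1, W2, b2))
```

Dropping the `relu` collapses the two layers into a single matrix product, i.e. a linear model, and XOR becomes unlearnable.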

27 of 52

Convolutional Neural Networks

Create multiple transformed (convolved) versions of input data

    • Convolutions are small, learned matrices that collect local information and learn hierarchical features
    • Transformed inputs are typically used in a classification or regression


  • Important parameters:
    • Size and number of convolutions, activations, network structure, rates…
  • Data type:
    • fixed size matrices (often used with images)
  • Pros:
    • learns locational features, hierarchical features, works with high dimensional data
  • Cons:
    • complex architecture, longer training, local minima
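The convolution operation itself is small enough to sketch directly in NumPy; here a hypothetical hand-set 1×2 filter responds to a vertical edge (a real CNN learns its filters during training):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, summing elementwise products at each position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((5, 5))
image[:, 2:] = 1.0                 # bright right half: a vertical edge
kernel = np.array([[-1.0, 1.0]])   # responds where intensity jumps left-to-right
response = conv2d(image, kernel)   # strongest exactly at the edge location
```

Because the same small kernel is reused at every position, the filter detects its pattern wherever it occurs, which is the “locational features” advantage above.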

28 of 52

Recurrent Neural Networks

Output of previous sequence step is used to process next step

    • Network retains ‘memory’ of previous inputs
    • Network weights and recurrence weights are updated at each step


  • Important parameters:
    • Hidden state size, recurrent unit types, activation functions, directionality, rates…
  • Data type:
    • Variable length sequence data (often used with time series, text, video)
  • Pros:
    • Allows variable length inputs, maintains sequencing
  • Cons:
    • complex architecture, vanishing gradient, limited long-term memory, difficult to stack
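The recurrence can be sketched in a few lines of NumPy with hypothetical fixed weights: the same parameters are reused at every step, and sequences of any length end in a hidden state of the same fixed size.

```python
import numpy as np

def rnn(sequence, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])               # hidden state: the network's 'memory'
    for x in sequence:                       # one step per sequence element
        h = np.tanh(W_x * x + W_h @ h + b)   # same weights reused at every step
    return h

# Hypothetical fixed weights for a 2-unit hidden state (training would learn these)
W_x = np.array([0.5, 0.1])
W_h = np.array([[0.2, 0.0], [0.0, 0.2]])
b = np.zeros(2)

h_short = rnn([1.0, 2.0, 3.0], W_x, W_h, b)
h_long = rnn([1.0, 2.0, 3.0, 4.0, 5.0], W_x, W_h, b)
# Both hidden states have shape (2,), regardless of input length
```

The repeated `tanh` applications also hint at the vanishing-gradient con: contributions from early inputs shrink as they are squashed step after step.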

29 of 52

Generative Adversarial Networks

Two NNs with different learning goals train together

    • Generator network creates new data from random noise
    • Discriminator network distinguishes generated from real data


  • Important parameters:
    • Depth + hidden dimensions
    • Learning rate + regularization
    • Loss function/network balancing
  • Data type:
    • Fixed size labeled data (features, images, text…)
  • Pros:
    • Can be faster than other simulations
    • Allows various data types
  • Cons:
    • Training can be unstable
    • Societal dangers

30 of 52

Graph Neural Networks


Learn a smart re-embedding of the graph data that preserves the relational structure

  • Important parameters:
    • Aggregator, size of message passing and update networks, graph construction, message features
  • Data type:
    • Variable length relational or geometrically structured data
  • Pros:
    • Allows variable length inputs, maintains relational structure
  • Cons:
    • Complex architecture, over-smoothing

31 of 52

Transformers


Use multi-headed attention with encoder and decoder for sequence to sequence modeling

  • Output embedding is trained to predict sequence shifted by one word
    • Q = vector representation of one word, K = vector representation of all words, V = a (different) vector representation of all words

32 of 52


Unsupervised Algorithms

33 of 52

Clustering

K-Means

  • Centroids selected randomly
  • Points assigned to closest centroid
  • Centroid adjusted towards cluster mean
  • Best for circular clusters


Group similar data points into a cluster (you define similarity)


DBSCAN

  • For a starting point, all neighbors within a set radius are found
  • If enough neighbors, all neighbors are added to cluster
  • More parameters, but allows different cluster shapes
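Both algorithms are available in scikit-learn; a sketch on hypothetical well-separated circular blobs (where the two methods should agree), with `eps` and `min_samples` as DBSCAN’s extra parameters:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.1, size=(50, 2))
blob_b = rng.normal([3.0, 3.0], 0.1, size=(50, 2))
X = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # you choose k
db = DBSCAN(eps=0.5, min_samples=5).fit(X)                    # you choose density params

# Each blob should end up in its own cluster under both methods
```

On elongated or nested cluster shapes the two would diverge: K-means would still carve circular regions, while DBSCAN follows the density.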

34 of 52

Dimensionality Reduction

PCA

  • Derives an orthonormal basis for the dataset
  • Each dimension aims to be linearly uncorrelated
  • You select desired dimensionality


Group features into higher-level, informative sets

SVD

  • A generalization of PCA that finds all eigenvectors of the dataset covariance matrix
  • Eigenvectors ranked by their ‘singular value’
  • In highly correlated data, many SVs will be small
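The “small singular values in correlated data” point can be checked directly with NumPy on a hypothetical dataset whose two features are near-copies of each other:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)
# Feature 2 is feature 1 plus tiny noise: the features are highly correlated
X = np.column_stack([z, z + rng.normal(0.0, 0.01, size=200)])
Xc = X - X.mean(axis=0)                  # center before SVD/PCA

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of variance per direction
# Nearly all variance sits in the first direction; the second SV is tiny
```

Dropping the directions with small singular values is exactly the dimensionality reduction: here the 2-D data is effectively 1-D.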

35 of 52

Autoencoders

  • Consists of an encoder that re-embeds the data and a decoder that attempts to reconstruct the original data
    • Both are typically feed-forward NNs
  • Different ways to enforce loss
    • Sparsity (only allow some activations)
    • Undercomplete (reduced hidden dim)
    • Contractive (penalize the embedding’s sensitivity to input changes)
    • Variational (predict latent distributions)
  • Wide range of uses:
    • Learn useful embedding spaces
    • Denoise samples
    • Dimensionality reduction


Lossily transform data such that it can be transformed back
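A linear, undercomplete sketch in NumPy (assuming 2-D data compressed to a 1-D code): for linear layers with MSE loss, the optimal encoder/decoder align with the top principal direction, so we can read them off the SVD instead of training a network.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=300)
X = np.column_stack([z, 2.0 * z]) + rng.normal(0.0, 0.05, size=(300, 2))
Xc = X - X.mean(axis=0)

# Undercomplete linear autoencoder: 2 features -> 1-D code -> 2-D reconstruction
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
encode = lambda data: data @ Vt[:1].T    # encoder: project onto top direction
decode = lambda code: code @ Vt[:1]      # decoder: map the code back to 2-D
X_rec = decode(encode(Xc))
mse = np.mean((Xc - X_rec) ** 2)         # small: little information was lost
```

Real autoencoders replace these projections with non-linear neural networks, which is what lets them learn embeddings richer than PCA’s.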

36 of 52

Anomaly Detection

Classification Based:

Data contains examples of outliers


Clustering Based:

Identify points outside of clusters

Dimension Based:

Assume components normal

Autoencoder Based:

Assume embedding is normal

Find outliers in data by understanding underlying distribution

37 of 52

Your Turn!

What model(s) would you use in your experiment?

  • What parameters would you need to consider?
  • Would you have enough/the right training data?
  • How would the model enable useful research decisions?


38 of 52

Evaluating Models

39 of 52

AI/ML Has A Hype Problem


40 of 52

Characterizing Model Output

Understanding the distribution of model errors, and how they relate to the research task you’re trying to solve, is KEY to ensuring reliability and validity of your ML models


  • Some questions to help:
    • What data selections or biases contributed to the distribution? 
    • How do we ensure the distribution is modeled correctly? What kinds of tests can we use? 
    • What should the distribution look like (or what do we want it to look like given certain assumptions/goals)?
    • If anything is incorrect or needs to be changed, what knobs can we turn?
  • Sometimes these questions are very hard or impossible to answer!

41 of 52

Consider All Steps of the Pipeline


42 of 52

Choosing a Cost Function + Evaluation

Loss Functions

  • Common choices:
    • Mean Squared Error

    • Hinge Loss

    • Cross Entropy

  • Considerations:
    • Type of model
    • Presence of outliers in dataset
    • Sensitive outcomes
    • Training behavior


Evaluation Metrics

  • Common choices:
    • Accuracy
    • Confusion matrix
    • Sensitivity
    • Specificity
    • False positive rate
    • Precision
    • Recall
    • F1 Score
  • Considerations:
    • What type of outcomes matter?
    • Typically should report multiple
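The “report multiple metrics” advice is easy to demonstrate with scikit-learn on hypothetical imbalanced labels, where accuracy looks reassuring while recall reveals the problem:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# Hypothetical imbalanced task: 90 negatives, 10 positives;
# the model finds only 2 of the 10 positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 2 + [0] * 8)

print(accuracy_score(y_true, y_pred))    # 0.92 -- looks fine
print(precision_score(y_true, y_pred))   # 1.0  -- no false positives
print(recall_score(y_true, y_pred))      # 0.2  -- misses most positives
print(confusion_matrix(y_true, y_pred))  # TN=90, FP=0, FN=8, TP=2
```

If the positives are the sensitive outcome (e.g. disease cases), recall is the number that matters, and accuracy alone would hide the failure.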

43 of 52

Racial Bias in Healthcare Risk Assessment


44 of 52

Understanding the Distributions

  1. Determine an appropriate measure of model validity
  2. Check the performance (closure) of the model across all areas where it will be applied (each race)
  3. See that it underperforms in some phase spaces…why could that be?
  4. Interrogate model variables and labels to look for statistical sources of bias  
  5. Fix it…in this case by defining a better training label and potentially de-biasing historical data and/or accounting for uncertainty


45 of 52

Predictive Policing: Predictions


46 of 52

Predictive Policing: Arrests


47 of 52

Understanding the Distributions

  • What is an appropriate measure of performance here? (not all crimes are reported)
  • Where could the apparent bias in prediction arise from? How could we understand if it’s accurate?
  • How does the system itself affect the phase space the developers are trying to measure?
  • How could we start to address these concerns?


48 of 52

Just the Beginning

  • There are MANY other types of ML models + learning mechanisms
    • Geometric machine learning/graph neural networks
    • Semi-supervised, self-supervised, transfer learning, etc
    • Bayesian NNs and inverse problems/likelihood free inference
  • Other important considerations when applying ML to research
    • Interpretability/explainability: many models are black box, but we want to use them to inform our understanding of the world
    • Uncertainty quantification: need to understand how much we can trust the model prediction and under what circumstances it’s valid
  • Sociotechnical and ethical considerations around ML
  • ML is a broad, multi-faceted field and it’s difficult to stay up-to-date on innovations and best practices
    • Interdisciplinary and cross-campus collaboration is key!


49 of 52

Thank you!

  • Happy to answer any questions!

  • st3565@columbia.edu @basicsciencesav


50 of 52

Backup


51 of 52

Key Terms

  • Optimization: minimizing an objective function f(x) parameterized by x (minimize loss function over model params)
    • Converges to global min if the objective function is convex, otherwise converges within a neighborhood
  • Loss/cost function: equation that quantifies algorithm performance
    • Common functions include: MSE, MAE, Log Likelihood
  • Model Parameters: the values the model fits/learns during training
    • Ex: slope in linear regression
  • Hyperparameter: adjustable (non-learned) parameters that must be tuned for optimal training/inference
  • Training: adjusting model parameters to optimize the loss function
  • Inference: using a trained model to draw conclusions about unseen data
  • Overfitting: a model learns artifacts of the training dataset that are not representative of the full data distribution (memorization)


52 of 52

Gradient Descent

  • Common optimization algorithm for training ML models
    • First order (uses first derivatives), loss function must be differentiable
  • At each training iteration, update model parameters in the direction opposite the gradient of the loss function (the gradient points in the direction of steepest ascent)
    • Essentially, finding the slope of the loss function with current model values and adjusting values to move loss function towards the minimum
    • Step/adjustment size is determined by learning rate (hyperparameter)
  • For non-convex functions the initial parameterization matters
  • Common variations
    • Batching: sum gradient over a set of training examples
    • Stochastic: adjust model for each training example
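The whole loop fits in a few lines of NumPy; a sketch minimizing an MSE loss for a one-parameter linear model on hypothetical data y = 3x (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x                      # hypothetical data generated with true slope 3

w = 0.0                          # initial model parameter
lr = 0.5                         # learning rate: the step-size hyperparameter
for _ in range(200):
    grad = np.mean(2.0 * (w * x - y) * x)   # dL/dw for L = mean((w*x - y)^2)
    w -= lr * grad                          # step opposite the gradient

print(w)  # converges to ~3.0
```

Here the loop sums the gradient over all 50 examples at once (batch gradient descent); the stochastic variant would update `w` after each example instead.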
