1 of 33

Fundamentals of Dimension Reduction

Exploring high-dimensional data

Zachary del Rosario (He/Him)

2 of 33

Workshop Schedule

Extract

Wrangle + Tidy

Friday

Saturday

Visualize

Model

Sunday

Monday

Tabula + WebPlotDigitizer

Python + Jupyter

Concepts

Execution

Concepts

Execution

Concepts

Fin

Focus

Live

Take-Home

3 of 33

Sneak Preview: Alloys Dataset

4 of 33

Sneak Preview: Alloys Dataset

26 numeric columns!

325 pairs!

At best: Two axes + shape & color == 4 columns in one visual
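
The 325 figure is the number of unordered pairs that can be drawn from 26 columns. A minimal check using only the Python standard library (the integer indices are hypothetical stand-ins for the real column names):

```python
import math
from itertools import combinations

n_columns = 26  # numeric columns, as stated on the slide

# Closed form: "26 choose 2"
n_pairs = math.comb(n_columns, 2)

# Cross-check by enumerating every unordered pair of column indices
enumerated = list(combinations(range(n_columns), 2))

print(n_pairs, len(enumerated))
```

Each pair would need its own scatter plot, which is why pairwise plotting does not scale to this many columns.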

5 of 33

Dimension Reduction to the Rescue!

Dimension reduction (DR) visualizes high-dimensional data

BUT interpreting DR is tricky

SO let’s study some fundamentals

6 of 33

Outline

  • Linear dimension reduction: PCA
  • Nonlinear dimension reduction: UMAP

7 of 33

Principal Component Analysis (PCA)

Fundamental linear dimension reduction (DR)

8 of 33

9 of 33

Data-informed direction

10 of 33

Data-informed direction

Projected points

11 of 33

Data-informed direction

Projected points

Idea: Find direction of greatest variance in the data
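
Stated as an optimization problem, this idea might be written as follows. The symbols are assumptions introduced here, not notation from the slides: $x_i$ are the $n$ data points, $\bar{x}$ is their mean, and $w$ is a unit-length direction.

```latex
w_1 = \arg\max_{\|w\|_2 = 1} \; \frac{1}{n-1} \sum_{i=1}^{n} \left( w^\top (x_i - \bar{x}) \right)^2,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The projected points are then the scalars $w_1^\top (x_i - \bar{x})$, one per data point.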

12 of 33

Linear Dimension Reduction : PCA

Principal Component Analysis (PCA)

  • Linear : subspace-based (lines and angles)
  • Captures variance in data

13 of 33

Linear Dimension Reduction : PCA

Procedure:

  • Find the highest-variance directions
  • Project data onto those “lines”
  • Visualize with the first two directions (principal components)
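
The procedure above can be sketched with NumPy's singular value decomposition. This is a minimal illustration under stated assumptions, not the workshop's own code: the function name `pca_project` is hypothetical, and the data is a synthetic stand-in rather than the alloys dataset.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto its top principal directions.

    Returns (scores, directions, explained_variance).
    """
    X = np.asarray(X, dtype=float)
    # Center each column so variance is measured about the mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are unit-length directions, ordered by decreasing singular value
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    directions = Vt[:n_components]
    # Project the centered data onto those directions
    scores = X_centered @ directions.T
    # Sample variance captured along each kept direction
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, directions, explained_variance

# Synthetic stand-in data (NOT the alloys dataset): 200 rows, 5 columns
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

scores, directions, explained_variance = pca_project(X)
print(scores.shape)
print(explained_variance)
```

The two columns of `scores` are what a PCA scatter plot would put on its axes.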

14 of 33

PCA Example

26 numeric columns!

325 pairs!

At best: Two axes + shape & color == 4 columns in one visual

15 of 33

Can Visualize Pairs of Variables…

Somewhat informative…

Not using all information (variables)!

16 of 33

PCA : Projection

More informative!

Note the more distinct groups

In particular, easier to see Series 8 clusters

17 of 33

PCA : Interpreting Weights
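
One way to read PCA weights: each principal direction is a list of per-column weights, and a large absolute weight means that column drives the direction. The sketch below assumes a synthetic, composition-like table; the column names `Al`, `Zn`, `Cu` and all values are hypothetical and are not taken from the alloys dataset.

```python
import numpy as np

# Synthetic stand-in table (NOT the alloys dataset); column names are hypothetical
rng = np.random.default_rng(1)
columns = ["Al", "Zn", "Cu"]
n = 300
al = rng.normal(90.0, 5.0, size=n)  # widest spread
zn = rng.normal(5.0, 1.0, size=n)
cu = rng.normal(2.0, 0.2, size=n)   # narrowest spread
X = np.column_stack([al, zn, cu])

# First principal direction = first right singular vector of the centered data
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
weights_pc1 = Vt[0]

# The sign of a principal direction is arbitrary; flip it so the
# largest-magnitude weight is positive, which makes the printout easier to read
largest = int(np.argmax(np.abs(weights_pc1)))
weights_pc1 = weights_pc1 * np.sign(weights_pc1[largest])

for name, weight in zip(columns, weights_pc1):
    print(f"{name}: {weight:+.3f}")

dominant = columns[largest]
print("Dominant column on PC1:", dominant)
```

Because these columns are left unscaled, the one with the widest spread dominates the first direction. Standardizing columns before PCA is a common choice when units or ranges differ.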

18 of 33

PCA : Projection

Read Al content

More Al

Less Al

19 of 33

PCA : Interpreting Weights

20 of 33

PCA : Projection

More informative!

Note the more distinct groups

More Al

Less Al

More Zn

Less Cu

Less Zn

More Cu

22 of 33

Important Caveat

  • I’m not saying “throw out your materials intuition”

  • I am saying “complement your materials intuition with informatics tools”

  • PCA is one such (visualization) tool!

23 of 33

Uniform Manifold Approximation and Projection (UMAP)

Cutting-edge nonlinear dimension reduction

24 of 33

Linear vs Nonlinear DR

  • PCA is linear; limits our DR to “lines and angles”
  • What if we allow nonlinear transforms?
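
A toy illustration of the limitation, using only the Python standard library. The two-ring data and the helper `ring` are assumptions made for this sketch; the example is not UMAP, only a hand-picked nonlinear transform (distance from the origin) compared against a linear projection.

```python
import math
import random

random.seed(0)

def ring(radius, n):
    """Sample n points on a circle of the given radius, centered at the origin."""
    points = []
    for _ in range(n):
        angle = random.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

inner = ring(1.0, 200)  # group A
outer = ring(3.0, 200)  # group B

# Linear 1D reduction: project onto the x-axis
# (by symmetry, any other direction behaves the same way)
inner_x = [x for x, _ in inner]
outer_x = [x for x, _ in outer]
# Count outer points that land inside the inner ring's projected range
linear_overlap = sum(1 for x in outer_x if min(inner_x) <= x <= max(inner_x))

# Nonlinear 1D reduction: distance from the origin
inner_r = [math.hypot(x, y) for x, y in inner]
outer_r = [math.hypot(x, y) for x, y in outer]
nonlinear_gap = min(outer_r) - max(inner_r)

print("Outer points mixed into the inner range (linear):", linear_overlap)
print("Gap between the groups (nonlinear):", nonlinear_gap)
```

No single line separates the two rings in one dimension, while the radius does so with a clear gap. Nonlinear methods aim to find transforms of this kind automatically.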

25 of 33

Uniform Manifold Approximation and Projection (UMAP)

Recent (2018) approach to nonlinear dimension reduction

  • Based on graph embedding and fuzzy sets
  • Challenging to interpret
  • Extremely powerful!

26 of 33

UMAP Example

27 of 33

UMAP Example

Very distinct clusters!

28 of 33

Difficulties

Distances between UMAP clusters are not reliably meaningful!

29 of 33

Observations

Series 8 clusters with other alloys

30 of 33

UMAP : With Great Power...

  • Powerful tool for your tool chest!
  • Challenging to interpret
    • See “Understanding UMAP” for more

31 of 33

Tonight’s Exercise

32 of 33

Tonight’s Notebook: Visualizing in Python

04_vis_assignment

  • Plotnine / ggplot
  • Interpreting graphs
  • Recreating graphs
  • Dimension reduction

33 of 33

End of Today

Feel free to contact me via email:

    • Zach del Rosario: zdelrosario@olin.edu
