1 of 51

Materials Informatics

Case Studies

Zachary del Rosario (He/Him)

2 of 51

Workshop Schedule

Friday: Extract

Saturday: Wrangle + Tidy

Sunday: Visualize

Monday: Model

Tabula + WebPlotDigitizer

Python + Jupyter

Concepts

Execution

Concepts

Execution

Concepts

Fin

Focus

Live

Take-Home

3 of 51

Survey Time!

Please fill out this survey on the live notebook:

https://forms.gle/XAFZeyyfiP86CCvN8

4 of 51

Case Studies

  • MPEA data: outlier checking
  • Starrydata2: data extraction
  • DFT-DB: data pipeline, curation, statistics
  • Oliynyk Heuslers: ML + materials intuition

5 of 51

Full Disclosure

Citrine Informatics + Olin College

I work closely with the Citrine folks, which affects my selection of examples!

6 of 51

MPEA Database Checking

The power of simple statistics

8 of 51

MPEA Database

Multi-Principal Element Alloys (MPEA)

  • Choosing 6 components from 30 candidate elements yields ~594,000 possible alloy systems!
  • Would like to use ML to predict properties, but that requires data
  • Thus: construct a database from the literature

Borg, C.K.H. et al. Sci Data (2020).

DOI: 10.1038/s41597-020-00768-9
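
As a quick check on the count quoted on this slide, here is a minimal sketch. It assumes that an alloy system means an unordered choice of 6 elements from a palette of 30 (my reading of the bullet, not a definition taken from Borg et al.), so the count is the binomial coefficient C(30, 6).

```python
import math

n_elements = 30    # size of the candidate element palette (from the slide)
n_components = 6   # elements per alloy system (from the slide)

# Number of unordered element combinations: C(30, 6)
n_systems = math.comb(n_elements, n_components)
print(f"{n_systems:,}")  # 593,775, which rounds to the 594,000 on the slide
```

Note that this counts element combinations only; varying the composition within each system makes the real design space far larger.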

9 of 51

MPEA: A “Data Literature” Review

Borg, C.K.H. et al. Sci Data (2020).

DOI: 10.1038/s41597-020-00768-9

11 of 51

MPEA Finding Errors!

Borg, C.K.H. et al. Sci Data (2020).

DOI: 10.1038/s41597-020-00768-9

Plotly and boxplots!
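
A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand. The alloy names, the hardness values, and the helper `tukey_outliers` are all hypothetical and are not taken from the MPEA database.

```python
import statistics

# Hypothetical hardness records (HV); the last value mimics a unit or typing error.
records = [
    ("alloy_A", 200), ("alloy_B", 210), ("alloy_C", 220), ("alloy_D", 230),
    ("alloy_E", 240), ("alloy_F", 250), ("alloy_G", 2400),
]

def tukey_outliers(records, k=1.5):
    """Return the records whose value falls outside the boxplot fences."""
    values = [value for _, value in records]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(name, value) for name, value in records if value < lower or value > upper]

outliers = tukey_outliers(records)
print(outliers)  # only alloy_G lies outside the fences
```

In a notebook, a plotly boxplot with mouseover shows the same flagged point and lets you read off which record it is, so you can go back to the source paper and check it.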

12 of 51

MPEA Takeaways

  • Literature review can be used to build a dataset
  • The simple things we learned (boxplots, plotly mouseover) are powerful investigative tools!
  • Checking the data for errors is a key step before doing ML

13 of 51

Starrydata2

WebPlotDigitizer for data extraction

15 of 51

Starrydata2 Concept

Thermoelectric Datasets

  • UCSB TE Data, ~300 experimental records
  • TE Design Lab, >2,300 computational records
  • Starrydata2, >11,500 extracted records
    • and growing!

Katsura et al. (2019) Science and Technology of Advanced Materials

DOI: 10.1080/14686996.2019.1603885

17 of 51

Starrydata2 Schematic

Manual, streamlined data extraction

WebPlotDigitizer!

Katsura et al. (2019) Science and Technology of Advanced Materials

DOI: 10.1080/14686996.2019.1603885
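
The core idea behind digitizing a plot is axis calibration: mark two known tick values per axis, then map every clicked pixel to data coordinates. The sketch below shows that idea for linear axes only. It is not WebPlotDigitizer's actual implementation, and `make_axis_map`, the pixel positions, and the axis values are all hypothetical.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value (linear axis)."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical calibration: two known ticks on each axis.
x_map = make_axis_map(100, 300.0, 500, 700.0)  # e.g. temperature in K
y_map = make_axis_map(400, 0.0, 0, 2.0)        # pixel rows grow downward

# Hypothetical clicked data points, as (column, row) pixel positions.
clicked = [(200, 300), (300, 200)]
points = [(x_map(px), y_map(py)) for px, py in clicked]
print(points)
```

Log-scaled axes need the same mapping applied to the logarithm of the axis values, which is why the tool asks for the axis type during calibration.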

18 of 51

Starrydata2 Findings

Compared doping model to experimental data

  • n-type: model is close
  • p-type: model is inaccurate

Katsura et al. (2019) Science and Technology of Advanced Materials

DOI: 10.1080/14686996.2019.1603885

19 of 51

Starrydata2 Takeaways

  • Data extraction (WebPlotDigitizer) is an underrated tool!
  • Improved data access alone can lead to new scientific insights
  • Starrydata2 is an excellent resource for building ML models…

20 of 51

DFT Database Comparison

Data pipeline work and statistics(!)

22 of 51

High-Throughput DFT Database Comparison

?

23 of 51

DFT-DB Workflow

  1. Download
  2. Match structures (via ICSD Collection Code)
  3. Data curation

Hegde et al. (2020) arXiv preprint arXiv:2007.01988 (Under review)
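
Step 2 of the workflow is essentially a join on a shared key. A minimal sketch, assuming each database has already been downloaded into a dictionary keyed by ICSD Collection Code: the codes, the property values, and the names `db_a`, `db_b`, and `match_records` are made up for illustration and do not come from any real database.

```python
# Hypothetical records keyed by ICSD Collection Code (values are illustrative only).
db_a = {
    1001: {"formation_energy": -1.20},
    1002: {"formation_energy": -0.85},
    1003: {"formation_energy": -2.10},
}
db_b = {
    1002: {"formation_energy": -0.80},
    1003: {"formation_energy": -2.15},
    1004: {"formation_energy": -0.30},
}

def match_records(db_a, db_b):
    """Pair up the entries that share a collection code in both databases."""
    shared_codes = sorted(db_a.keys() & db_b.keys())
    return [(code, db_a[code], db_b[code]) for code in shared_codes]

matched = match_records(db_a, db_b)
print([code for code, _, _ in matched])  # codes present in both databases
```

Only the matched pairs go on to curation; entries that appear in one database alone cannot be compared.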

24 of 51

DFT-DB Bug Story!

Hegde et al. (2020) arXiv preprint arXiv:2007.01988 (Under review)

25 of 51

DFT-DB Percent Disagreement

Hegde et al. (2020) arXiv preprint arXiv:2007.01988 (Under review)
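
The slide does not state the exact disagreement metric, so the sketch below assumes one common choice: the absolute difference between two databases' values, relative to the mean of their magnitudes, summarized by the median. The value pairs and the helper `percent_difference` are hypothetical, and this should not be read as the definition used by Hegde et al.

```python
import statistics

def percent_difference(a, b):
    """Symmetric percent difference between two values of the same property."""
    denom = (abs(a) + abs(b)) / 2
    if denom == 0:
        return 0.0  # both values are zero, so they agree exactly
    return 100 * abs(a - b) / denom

# Hypothetical (database A, database B) values for matched structures.
pairs = [(4.00, 4.10), (2.50, 2.50), (10.0, 9.0)]
diffs = [percent_difference(a, b) for a, b in pairs]
median_diff = statistics.median(diffs)
print(diffs, median_diff)
```

A median is used here because a few badly mismatched structures can dominate a mean.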

26 of 51

DFT-DB Takeaways

  • Data wrangling is underrated!
  • Computational reproducibility is worth quantifying!
    • High-throughput (HT) discrepancy is larger than we might expect!
    • Discrepancies are larger for particular material classes and particular properties
  • Can treat discrepancies as a quantification of noise

Hegde et al. (2020) arXiv preprint arXiv:2007.01988 (Under review)

27 of 51

Heusler Prediction

Machine Learning + Materials Intuition

28 of 51

Heusler Prediction

Class of intermetallic compound

Attractive thermoelectric and spintronic candidates

Heusler structure unstable for some compounds!

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

29 of 51

Heusler Prediction - Methods

  1. Curated crystallographic database
    • 341 Heuslers, out of 1948 total
  2. Predict Heusler formation with ML model
  3. Select candidates to experimentally confirm

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

32 of 51

Heusler Prediction - Model

Machine Learning Features (22 total)

  • Element properties (e.g. group number)
  • Pair properties (e.g. radius difference)
  • Compound properties (e.g. total p valence electrons)

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

I could not have come up with these features!

Materials intuition important for feature engineering
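
To make the three feature types concrete, here is a minimal sketch that builds one example of each from a small element table. These are not the 22 features from Oliynyk et al.; the `ELEMENTS` table, the `featurize` helper, and the example composition are illustrative, and the radii are approximate metallic radii that I am assuming for the example.

```python
from itertools import combinations

# Illustrative element data; radii are approximate metallic radii in pm (assumption).
ELEMENTS = {
    "Ti": {"group": 4, "radius_pm": 147, "p_valence": 0},
    "Ru": {"group": 8, "radius_pm": 134, "p_valence": 0},
    "Ga": {"group": 13, "radius_pm": 135, "p_valence": 1},
}

def featurize(composition):
    """Build features from a composition given as {element: count in formula}."""
    features = {}
    # Element properties, e.g. group number
    for element in composition:
        features[f"group_{element}"] = ELEMENTS[element]["group"]
    # Pair properties, e.g. radius difference
    for a, b in combinations(sorted(composition), 2):
        diff = abs(ELEMENTS[a]["radius_pm"] - ELEMENTS[b]["radius_pm"])
        features[f"radius_diff_{a}_{b}"] = diff
    # Compound properties, e.g. total p valence electrons
    features["total_p_valence"] = sum(
        ELEMENTS[element]["p_valence"] * count for element, count in composition.items()
    )
    return features

features = featurize({"Ti": 1, "Ru": 2, "Ga": 1})
print(features)
```

Each row of the training table is one such feature dictionary plus a label saying whether the compound forms the Heusler structure.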

33 of 51

Heusler Prediction - Model

Hard to hold 22 facts in your head!

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

35 of 51

Heusler Prediction - Model

Model: Decision Tree

But decision trees tend to overfit!

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

36 of 51

Heusler Prediction - Model

Idea: Random Forest

  • Train trees on random subsets
  • Use voting / averaging

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724
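
The two bullets can be sketched in a few lines of plain Python. This is a toy, not the model from Oliynyk et al.: each "tree" is a single split (a stump) on one feature, subsets are drawn without replacement, and the data are synthetic. Real random forests typically use bootstrap samples, deeper trees, and random feature subsets as well. The names `fit_stump`, `fit_forest`, and `predict` are my own.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split tree: predict 1 when x >= threshold, else 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_forest(data, n_trees=25, subset_size=7, seed=0):
    """Train each stump on its own random subset of the data."""
    rng = random.Random(seed)
    return [fit_stump(rng.sample(data, subset_size)) for _ in range(n_trees)]

def predict(forest, x):
    """Majority vote across all stumps."""
    votes = Counter(int(x >= threshold) for threshold in forest)
    return votes.most_common(1)[0][0]

# Synthetic data: label is 1 when the single feature is at least 5.
data = [(float(x), int(x >= 5)) for x in range(10)]
forest = fit_forest(data)
print(predict(forest, 1.0), predict(forest, 9.0))
```

Because every stump sees a different subset, the stumps disagree on where exactly to split; voting averages out those individual quirks, which is how the ensemble counters overfitting.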

37 of 51

Heusler Prediction

ML model used to predict Heusler formation

Specific candidates were selected because valence electron count suggests they are unlikely to form

Formation tested by experiment

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

38 of 51

Heusler Prediction Takeaways

  • Can use ML and experiments in partnership!
    • See Saal et al. (2020), Annual Reviews, for more examples of experimentally verified MI
  • N.B. Can update the dataset with new results and re-train
    • Active (or sequential) learning

Oliynyk et al. (2016) Chem. Mater.

DOI: 10.1021/acs.chemmater.6b02724

39 of 51

Closing Takeaways

40 of 51

What We Covered

  • Data Extraction
  • Data Management
  • Data Visualization
  • Basics of Machine Learning

41 of 51

What We Covered

“Locked up” data do us no good!

Tabula + WebPlotDigitizer help us liberate data

42 of 51

What We Covered

Data are messy, untidy, and sometimes in error!

We can wrangle and tidy in a reproducible notebook
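
As a reminder of what tidying means, here is a minimal sketch that reshapes a wide table (one column per temperature) into a tidy one (one row per observation). The alloy names, column names, values, and the `tidy` helper are hypothetical.

```python
# Hypothetical wide table: one hardness column per test temperature.
wide = [
    {"alloy": "A", "hardness_300K": 210, "hardness_600K": 180},
    {"alloy": "B", "hardness_300K": 250, "hardness_600K": 205},
]

def tidy(rows, prefix="hardness_"):
    """Reshape to one row per (alloy, temperature) observation."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                # "hardness_300K" -> 300
                temperature = int(key[len(prefix):].rstrip("K"))
                long_rows.append(
                    {"alloy": row["alloy"], "temperature_K": temperature, "hardness": value}
                )
    return long_rows

long_rows = tidy(wide)
for row in long_rows:
    print(row)
```

In a notebook, pandas' `melt` does the same reshaping in one call; the point is that the steps live in code, so anyone can rerun them on the raw data.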

43 of 51

What We Covered

Tables are useless for finding patterns

Informative visuals are easy with plotnine / plotly

44 of 51

What We Covered

Simple heuristics can miss trends

Machine Learning can utilize our materials intuition (features)

45 of 51

What We Covered

I hope you enjoyed the workshop!

46 of 51

Recordings and Slides

All recordings and slides will be posted to the MI 101 Workshop Website

47 of 51

Recommended Courses (from AJ)

  • MSE-6140 - Computational Materials Science
  • ChBE 4745/6745 - Data Analytics for Chemical Engineers

48 of 51

Shhh, A Secret...

All of the solutions to all of the notebooks are on GitHub

Notebook solutions here

These slides are linked from the `Slides` page!

49 of 51

One Last Survey!

50 of 51

While You’re Still Here…

I want this workshop to be as useful as possible

Please fill this out:

https://forms.gle/D1tDSTubxLnHxVfP8

(I will paste the link in the chat)

51 of 51

Questions? Thoughts? Comments?

Feel free to contact me via email:

    • Zach del Rosario: zdelrosario@olin.edu
