1 of 64

Introduction to Machine Learning

2 of 64

Delta Analytics builds technical capacity around the world.

This course content is being actively developed by Delta Analytics, a 501(c)(3) Bay Area nonprofit that aims to empower communities to leverage their data for good.

Please reach out with any questions or feedback to inquiry@deltanalytics.org.

Find out more about our mission here.

3 of 64

Course overview:


  • Module 1: Introduction to Machine Learning
  • Module 2: Machine Learning Deep Dive
  • Module 3: Model Selection and Evaluation
  • Module 4: Linear Regression
  • Module 5: Decision Trees
  • Module 6: Ensemble Algorithms
  • Module 7: Unsupervised Learning Algorithms
  • Module 8: Natural Language Processing Part 1
  • Module 9: Natural Language Processing Part 2

4 of 64

Module 1: Introduction to Machine Learning

5 of 64

Module Checklist

  • What is machine learning?
  • How do you define a research question?
  • What are observations?
  • What are features?
  • What are outcome variables?
  • Introduction to KIVA data

6 of 64

What is machine learning?

7 of 64

What is machine learning?

Artificial Intelligence (AI)

Machine Learning

Using data science methods and sometimes big data

We call something machine learning when instead of telling a computer to do something, we allow a computer to come up with its own solution based upon the data it is given.

8 of 64

Machine learning is a subset of AI that allows machines to learn from raw data.

Humans learn from experience. Traditional software programming involves giving machines instructions, which they then perform. Machine learning involves allowing machines to learn from raw data so that the computer program can change when exposed to new data (learning from experience).

Source: https://www.youtube.com/watch?v=IpGxLWOIZy4


9 of 64

Machine learning is interdisciplinary

Machine learning is…

  • Computer science + statistics + mathematics
  • The use of data to answer questions

Critical thinking combined with technical toolkit

Data

Domain Knowledge

Insight

Action

10 of 64

There is a growing need for machine learning

Sources:

[1] “What is Big Data,” IBM,

  • There are huge amounts of data generated every day.
  • Previously impossible problems are now solvable.
  • Companies are increasingly demanding quantitative solutions.

“Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” [1]

11 of 64

Machine learning example: Predicting malaria

Dr. Delta works to diagnose patients with malaria.

However, it takes a long time for her to see everyone.

Luckily, Dr. Delta has historical patient data about what factors predict malaria, such as body temperature, travel history, age, and medical history.

Dr. Delta can use historical data as an input in a machine learning algorithm to help her predict whether a new patient will have malaria.

The algorithm (the machine) learns from past data, like a human would, and is thus able to make predictions about the future.
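As a rough illustration of "learning from past data", the sketch below picks a body-temperature cutoff from labelled historical records instead of having a person hard-code one. The records, the single feature, and the function names are all hypothetical; a real diagnostic model would use many more features and far more data.

```python
# Hypothetical historical records: (body temperature in °C, had malaria?)
history = [
    (36.6, False), (36.9, False), (37.2, False),
    (38.4, True), (38.9, True), (39.1, True),
]

def learn_threshold(records):
    """Return the temperature cutoff that classifies the most past patients correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(temp for temp, _ in records):
        # Rule being scored: predict malaria when temperature >= cutoff.
        correct = sum((temp >= cutoff) == had_malaria for temp, had_malaria in records)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def predict(cutoff, temperature):
    """Apply the learned rule to a new patient."""
    return temperature >= cutoff

cutoff = learn_threshold(history)
print("Learned cutoff:", cutoff)
print("New patient at 39.5 °C flagged:", predict(cutoff, 39.5))
```

The cutoff comes from the data: give the function different historical records and it may learn a different rule.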

12 of 64

Machine learning is a powerful tool; it can…

Determine your credit rating based upon cell phone usage.

Determine the topic of a piece of text.

Recognize your face in a photo.

Recommend movies you will like.

13 of 64

Machine learning helps us answer questions. How do we define the question?

14 of 64

Before we even get to the models/algorithms, we have to learn about our data and define our research question.

Research Question

Exploratory Analysis

Modeling Phase

Performance

Data Validation + Cleaning

Machine learning takes place during the modeling phase.

~80% of your time as a data scientist is spent here, preparing your data for analysis

15 of 64

A research question is the question we want our model to answer.

Examples of research questions:

  • Does this patient have malaria?
  • Can we monitor illegal deforestation by detecting chainsaw noises in audio streamed from rainforests?

Research Question

16 of 64

We may have a question in mind before we look at the data, but we will often use our exploration of the data to develop or refine our research question.

Research Question

Exploratory Analysis

Data Validation + Cleaning

What comes first, the chicken or the egg?

17 of 64

Data Validation and Cleaning

18 of 64

Source: Survey of 80 data scientists. Forbes article, March 23, 2016.

Data Cleaning

“Data preparation accounts for about 80% of the work of data scientists.”

19 of 64

Data Cleaning

Why do we need to validate and clean our data?

Data often comes from multiple sources

  • Do data align across different sources?

Data is created by humans

  • Does the data need to be transformed?
  • Is it free from human bias and errors?

Go further with these readings: here and here

20 of 64

Data cleaning involves identifying any issues with our data and confirming our qualitative understanding of the data.

Data Cleaning

Missing Data

Is there missing data? Is it missing systematically?

Time Series Validation

Is the data for the correct time range?

Are there unusual spikes in the volume of loans over time?

Data Type

Are all variables the right type? Is a date treated like a date?

Data Range

Are all values in the expected range? Are all loan_amounts greater than 0?

21 of 64

Let’s step through some examples:

Data Cleaning

Missing data

Time series

Data types

Data Cleaning

Transforming variables

After gaining an initial understanding of your data, you may need to transform it to be used in analysis

22 of 64

Very few datasets have no missing data; most of the time you will have to deal with missing data.

The first question you have to ask is what type of missing data you have.

Data Cleaning

Is there missing data? Is data missing at random or systematically?

Missing completely at random: there is no pattern in the missing data. This is the best type of missing data you can hope for.

Missing at random: there is a pattern in your missing data, but it is explained by other variables you have observed, not by the values that are missing.

Missing not at random: whether a value is missing depends on the missing value itself, so the pattern systematically affects your primary variables.

Missing data

23 of 64

Example: You have survey data from a random sample of high school students in the U.S. Some students didn’t participate:

Data Cleaning

Missing data

If data is missing at random, we can use the rest of the nonmissing data without worrying about bias!

If data is missing in a non-random or systematic way, your nonmissing data may be biased

Some students were sick on the day of the survey

Some students declined to participate, since the survey asks about grades

Is there missing data? Is data missing at random or systematically?


25 of 64

Data Cleaning

Sometimes, you can drop or replace missing data.

  • Drop missing observations
  • Populate missing values with average of available data
  • Impute data

Missing data

What you should do depends heavily on what makes sense for your research question, and your data.
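A minimal sketch of the first two options, using only the Python standard library. The loan amounts are made up, and `None` is assumed to mark a missing value; your dataset may encode missing values differently.

```python
from statistics import mean

# Hypothetical loan amounts; None marks a missing value.
loan_amounts = [500, None, 1200, 800, None, 1500]

# Option 1: drop observations with missing values.
dropped = [amount for amount in loan_amounts if amount is not None]

# Option 2: populate missing values with the average of the available data.
average = mean(dropped)
filled = [average if amount is None else amount for amount in loan_amounts]

print("After dropping:", dropped)
print("After filling with the average:", filled)
```

Dropping shrinks the dataset, while filling keeps every observation but assumes the missing values look like the average one. Which trade-off is acceptable depends on the research question.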

26 of 64

Data Cleaning

Common imputation techniques

Missing data

  • Use the average of nonmissing values: take the average of the observations you do have to populate missing observations, i.e., assume that this observation is also represented by the population average.
  • Use an educated guess: it sounds arbitrary and often isn’t preferred, but you can infer a missing value. For related questions, such as those often presented in a matrix, if the participant responds with all “4s”, assume that the missing value is a 4.
  • Use common point imputation: for a rating scale, use the middle point or the most commonly chosen value. For example, on a five-point scale, substitute a 3, the midpoint, or a 4, the most common value (in many cases). This is a bit more structured than guessing, but it’s still among the riskier options. Use caution unless you have good reason and data to support using the substitute value.
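Common point imputation on a five-point rating scale might look like the sketch below. The responses are invented for illustration, and `None` is assumed to mark a skipped question.

```python
from statistics import mode

# Hypothetical answers on a five-point scale; None marks a skipped question.
ratings = [4, 5, None, 4, 3, 4, None, 2]

observed = [r for r in ratings if r is not None]

most_common = mode(observed)  # the most frequently chosen value
midpoint = 3                  # the middle of a 1-5 scale

filled_with_mode = [most_common if r is None else r for r in ratings]
filled_with_midpoint = [midpoint if r is None else r for r in ratings]

print("Filled with most common value:", filled_with_mode)
print("Filled with midpoint:", filled_with_midpoint)
```

The two choices give different filled datasets, which is one reason this technique calls for caution.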

27 of 64

Ask yourself:

    • Is the data for the correct time range?
    • Are there unusual spikes in the data over time?

Data Cleaning

If we have observations over time, we need to do time series validation.

What should we do if there are unusual spikes in the data over time?

Time series

28 of 64

Data anomaly

Data Cleaning

How do we address unexpected spikes in our data?

For certain datasets (like sales data), systematic seasonal spikes are expected. For example, around Christmas we would see a spike in sales revenue. This is normal and should not necessarily be removed.

Time series

Systematic spike

Random spike

If the spike is isolated, it is probably unexpected, and we may want to remove the corrupted data. For example, if for one month sales are recorded in Kenyan Shillings rather than US dollars, it would inflate sales figures. We should do some data cleaning by converting to US dollars or perhaps removing this month.

Note: sometimes there are natural anomalies in data that should be investigated first.
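One simple way to surface an isolated spike is to compare each period with the typical (median) value. The sketch below assumes hypothetical monthly sales figures and an arbitrary cutoff of ten times the median; both are illustrative choices, not a standard rule.

```python
from statistics import median

# Hypothetical monthly sales; April looks like it was recorded in another currency.
monthly_sales = {
    "Jan": 1020, "Feb": 980, "Mar": 1010,
    "Apr": 101500, "May": 995, "Jun": 1005,
}

def flag_spikes(series, ratio=10):
    """Return the periods whose value exceeds `ratio` times the median value."""
    typical = median(series.values())
    return [period for period, value in series.items() if value > ratio * typical]

print("Months to investigate:", flag_spikes(monthly_sales))
```

A flagged month is a prompt to investigate, not an automatic deletion: a seasonal spike or a natural anomaly can trip the same check.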

29 of 64

Many functions in Python are type specific, which means we need to make sure all of our fields are being treated as the correct type:

Data Cleaning

Are all variables the right type?

  • Integer: A number with no decimal places
  • Float: A number with decimal places
  • String: Text field, or more formally, a sequence of unicode characters
  • Boolean: Can only be True or False (also called indicator or dummy variable)
  • Datetime: Values meant to hold time data.
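Data read from a file often arrives as text, so each field may need converting to its intended type. The sketch below assumes a hypothetical raw record; the field names (other than `loan_amount`) and the date format are illustrative.

```python
from datetime import datetime

# Hypothetical raw record: every value starts out as a string.
raw = {
    "loan_amount": "500",
    "share_funded": "0.75",
    "sector": "Agriculture",
    "is_funded": "True",
    "posted_date": "2017-03-14",
}

record = {
    "loan_amount": int(raw["loan_amount"]),        # integer
    "share_funded": float(raw["share_funded"]),    # float
    "sector": raw["sector"],                       # string
    "is_funded": raw["is_funded"] == "True",       # boolean
    "posted_date": datetime.strptime(raw["posted_date"], "%Y-%m-%d"),  # datetime
}

for name, value in record.items():
    print(name, type(value).__name__, value)
```

Once the date is a real datetime, questions such as "which year was this loan posted?" become a simple attribute lookup (`record["posted_date"].year`) instead of string slicing.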


Data type

30 of 64

Data cleaning quiz!

As you explore the data, some questions arise…

31 of 64

Question #1

Question

Answer

There is an observation from the KIVA loan dataset that says a loan was fully funded in year 1804, but Kiva wasn’t even founded then. What do I do?

32 of 64

Question #1

Question

Answer

There is an observation that says this loan was fully funded in year 1804, but Kiva wasn’t even founded then. What do I do?

Consult the data documentation. If no explanation exists, remove this observation.

Data Cleaning

Time series

This question illustrates that you should always validate the time range. Check the minimum and maximum dates in your dataset.
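A time range check can be as small as the sketch below. The funded dates are made up, and the founding year of 2005 is an assumption used for illustration; confirm the valid range against the data documentation.

```python
from datetime import date

KIVA_FOUNDED_YEAR = 2005  # assumption for illustration; check the documentation

# Hypothetical "fully funded" dates, including one impossible value.
funded_dates = [date(2012, 3, 14), date(1804, 7, 2), date(2015, 11, 30)]

earliest, latest = min(funded_dates), max(funded_dates)
suspicious = [d for d in funded_dates if d.year < KIVA_FOUNDED_YEAR]

print("Earliest:", earliest, "Latest:", latest)
print("Dates before the founding year:", suspicious)
```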

33 of 64

Question #2

Question

Answer

There is an observation that states a person’s birthday is 12/1/80 but the “age” variable is missing. What do I do?

34 of 64

Question #2

Data Cleaning

This question illustrates how we may be able to leverage other fields to make an educated guess about the missing age.

Missing Data

Question

Answer

There is an observation that states a person’s birthday is 12/1/80 but the “age” variable is missing. What do I do?

We have the input (year, month and day) needed to calculate age. We can define a function that will transform this input into the age of each loan recipient.
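Such a function might look like the sketch below. It assumes the birthday is written as month/day/two-digit year, and the name `age_on` and the reference dates are hypothetical. In practice the reference date would be something like the date the loan was posted.

```python
from datetime import date, datetime

def age_on(birth_date, on_date):
    """Return the age in whole years on `on_date`."""
    years = on_date.year - birth_date.year
    # Subtract one year if the birthday has not happened yet in that year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Assumes month/day/year order; "80" is read as 1980.
birth = datetime.strptime("12/1/80", "%m/%d/%y").date()

print(age_on(birth, date(2020, 11, 30)))  # the day before the 40th birthday
print(age_on(birth, date(2020, 12, 1)))   # on the 40th birthday
```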

35 of 64

Question #3

As you explore the data, some questions arise…

Question

Answer

The variable “amount_funded” has values of both “N/A” and “0”. What do I do?

36 of 64

Question #3

Question

Answer

The variable “amount_funded” has values of both “N/A” and “0”. What do I do?

Check the documentation to see whether there is a material difference between “N/A” and “0”.
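Until the documentation settles the question, it is safer to keep the two values distinct than to merge them. A minimal sketch, assuming the raw values arrive as strings and using a hypothetical `parse_amount` helper:

```python
def parse_amount(raw):
    """Return None for "N/A"; otherwise return the number, so 0 stays 0."""
    return None if raw == "N/A" else float(raw)

# Hypothetical raw values for amount_funded.
raw_values = ["0", "N/A", "250", "N/A", "0"]
parsed = [parse_amount(value) for value in raw_values]

n_missing = sum(value is None for value in parsed)
n_zero = sum(value == 0 for value in parsed)

print("Missing:", n_missing, "Zero:", n_zero)
```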

37 of 64

Question #4

Question

Answer

I’m not sure what currency the variable “amount_funded” is reported in. What do I do?

38 of 64

Question #4

Question

Answer

I’m not sure what currency the variable “amount_funded” is reported in. What do I do?

Check the documentation and other variables, then convert to the appropriate currency.

39 of 64

A final note…

Data Cleaning

Note that our examples were all very specific - you may or may not encounter these exact examples in the wild. This is because data cleaning is very often idiosyncratic and cannot be adequately completed by following a predetermined set of steps - you must use common sense!

Next we turn to exploratory analysis, for which we often have to transform our data.

Data transformation

40 of 64

Exploratory Analysis

41 of 64

The goal of exploratory analysis is to better understand your data.

Research Question

Exploratory Analysis

Data Validation + Cleaning

Exploratory Analysis

Exploratory analysis can reveal data limitations, what features are important, and inform what methods you use in answering your research question.

This is an indispensable first step in any data analysis!

42 of 64

Let’s explore our data!

Once we have done some initial validation, we explore the data to see what models are suitable and what patterns we can identify.

The process varies depending on the data, your style, and time constraints, but typically exploration includes:

  • Histogram
  • Scatter plots
  • Correlation tables
  • Box plots
  • Summary statistics
    • Mean, median, frequency

Exploratory Analysis

Scatter plots

Correlation

Box plots

Summary statistics

Histogram

43 of 64

Histograms tell us about the distribution of the feature.

Exploratory Analysis

Histogram

A histogram shows the frequency distribution of a continuous feature.

Here, we have height data of a group of people. We see that most of the people in the group are between 149 and 159 cm tall.
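The counting behind a histogram can be sketched with the standard library alone. The heights and the 10 cm bin width below are hypothetical and do not reproduce the slide's chart.

```python
from collections import Counter

# Hypothetical heights in centimetres.
heights_cm = [148, 151, 153, 155, 156, 158, 159, 162, 165, 171]

bin_width = 10
# Map each height to the lower edge of its bin, then count heights per bin.
bins = Counter((h // bin_width) * bin_width for h in heights_cm)

for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1} cm: {'#' * bins[start]}")
```

Each row of `#` marks plays the role of one histogram bar: the longer the row, the more people fall in that height range.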

44 of 64

Scatter plots provide insight about the relationship between two features.

Scatter plots visualize relationships between any two features as points on a graph. They are a useful first step to exploring a research question.

Here, we can already see a positive relationship between amount funded and amount requested. What can we conclude?

Exploratory Analysis

Scatter Plots

45 of 64

Scatter plots provide important information about the relationship between two features.

Scatter plots are an indispensable first step to exploring a research question.

Here, we can already see a positive relationship between amount funded and amount requested for a KIVA loan. What can we conclude?

It looks like there is a strong relationship between what loan amount is requested and what is funded.

Exploratory Analysis

Scatter Plots

46 of 64

Correlation is a useful measure of the strength and direction of a linear relationship between two variables. It ranges from -1.00 to 1.00.
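The slide does not say which correlation measure it means; the sketch below assumes the common Pearson correlation coefficient and uses made-up amounts.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical loan data.
requested = [200, 400, 600, 800, 1000]
funded = [180, 400, 550, 800, 900]

print(round(pearson(requested, funded), 2))
```

Values near 1.00 mean the two features rise together, values near -1.00 mean one falls as the other rises, and values near 0.00 mean there is little linear relationship.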

Go further with this fun game.

[Figure: example scatter plots with correlations of 1.00, 0.88, 0.60, 0.00, -0.55, -0.78, and -1.00]

Exploratory Analysis

Correlation

47 of 64

Correlation does not equal causation

Let’s say you are an executive at a company. You’ve gathered the following data:

X = $ spent on advertising

Y = Sales

Based on the graph and positive correlation, you’d be tempted to say $ spent on advertising caused an increase in sales. But hang on - it’s also possible that an increase in sales (and thus, profit) would lead to an increase in $ spent on advertising! Correlation between x and y does not mean x causes y; it could mean that y causes x!

Exploratory Analysis

Correlation

48 of 64

Kiva example: Correlation does not equal causation

Correlation: 0.96

If you wanted to request a loan through Kiva, and were presented with this graph only, you might conclude that it is a good idea to request $1 million.

Exploratory Analysis

Correlation

49 of 64

Kiva example: Correlation does not equal causation

Conclusions can be invalid even when data is valid!

But common sense tells us that this conclusion doesn’t make a lot of sense. Just because you request a lot doesn’t mean you will be funded a lot!

Exploratory Analysis

Correlation

50 of 64

Mean, median, and frequency are useful summary statistics that tell you what is in your data.
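A quick sketch of all three, using the standard library. The loan amounts and sectors are hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical loans: amounts in dollars and the sector of each loan.
loan_amounts = [200, 300, 300, 500, 1200]
sectors = ["Agriculture", "Retail", "Agriculture", "Food", "Agriculture"]

average_amount = mean(loan_amounts)
median_amount = median(loan_amounts)
sector_frequency = Counter(sectors)

print("Mean:", average_amount)
print("Median:", median_amount)
print("Frequency:", sector_frequency.most_common())
```

The mean is pulled upward by the single large loan while the median is not, which is why it helps to look at both.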

Exploratory Analysis

Summary statistics

51 of 64

Image source: University of Florida, Quantitative introduction to the boxplot

Exploratory Analysis

Boxplots

Boxplots are a useful visual depiction of certain summary statistics.
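The statistics a boxplot usually draws are the minimum, first quartile, median, third quartile, and maximum. The sketch below computes them for made-up loan amounts; note that quartile values depend on the interpolation method, and `statistics.quantiles` uses its default "exclusive" method here.

```python
from statistics import quantiles

# Hypothetical loan amounts in dollars.
loan_amounts = [200, 250, 300, 400, 500, 650, 800, 1000, 2500]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(loan_amounts, n=4)
minimum, maximum = min(loan_amounts), max(loan_amounts)

print("Min:", minimum)
print("Q1:", q1, "Median:", q2, "Q3:", q3)
print("Max:", maximum)
```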

52 of 64

Forming a

research question

53 of 64

Recall: We may have a question in mind before we look at the data, but our exploration of the data often develops or refines our research question.

Research Question

Exploratory Analysis

Data Validation + Cleaning

What comes first, the chicken or the egg?

54 of 64

How do you define the research question?

  • We ask a question we expect data to answer. What comes first, the data or the question?

START: Research question

Do I have data that may answer my question?

  • No: Gather or find more data
  • Yes: Is the data labelled?
    • Yes: Supervised learning methods
    • No: Unsupervised learning methods

55 of 64

How do you define the research question?

  • We ask a question we expect data to answer. What comes first, the data or the question?

START: Research question

Do I have data that may answer my question?

  • No: Gather or find more data
  • Yes: Is the data labelled? (Does it contain the outcome feature Y?)
    • Yes: Supervised learning methods
    • No: Unsupervised learning methods

Requires data processing and validation

We’ll cover both supervised and unsupervised methods in this course!
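The decision flow on this slide can be written as a small function. This is only a sketch of the flowchart's logic; the function and parameter names are hypothetical.

```python
def next_step(has_data, is_labelled):
    """Follow the flowchart: first check for data, then check for labels."""
    if not has_data:
        return "Gather or find more data"
    if is_labelled:
        # Labelled data contains the outcome feature Y.
        return "Supervised learning methods"
    return "Unsupervised learning methods"

print(next_step(has_data=False, is_labelled=False))
print(next_step(has_data=True, is_labelled=True))
print(next_step(has_data=True, is_labelled=False))
```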

56 of 64

Given the KIVA data below, we may find a few questions interesting.

Research Question

Loan amount requested by a Kiva borrower in Kenya

Town Kiva borrower resides in.

One possible research question we might be interested in exploring is: Does the loan amount requested vary by town?

57 of 64

How does loan amount requested vary by town?

Research Question

This is a reasonable research question: we would expect the amount to vary because the cost of materials and services varies from region to region.

For example, we would expect the cost of living in a rural area to be lower than in a city.
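A first look at this question could be the average amount requested per town. The sketch below uses made-up loans and placeholder town names, not real KIVA data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (town, loan amount requested) pairs.
loans = [
    ("Town A", 300), ("Town B", 900), ("Town A", 500),
    ("Town B", 1100), ("Town C", 400),
]

amounts_by_town = defaultdict(list)
for town, amount in loans:
    amounts_by_town[town].append(amount)

average_by_town = {town: mean(amounts) for town, amounts in amounts_by_town.items()}

for town, average in sorted(average_by_town.items()):
    print(town, average)
```

Differences between the averages suggest the question is worth pursuing, although a town with only one or two loans tells us very little on its own.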

58 of 64

Looking ahead:

Modeling

59 of 64

Now that we have our research question, we are able to start modeling!

Research Question

Exploratory Analysis

Modeling Phase

Performance

Data Cleaning

We are here!

Learning Methodology

Task

60 of 64

Task

What is the problem we want our model to solve?

Performance Measure

Quantitative measure we use to evaluate the model’s performance.

Learning Methodology

ML algorithms can be supervised or unsupervised. This determines the learning methodology.

All models have 3 key components: a task, a performance measure and a learning methodology.

Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning

Modeling Phase

61 of 64

We will go over the machine learning task and learning methodology in the next lesson.

62 of 64

You covered this today:

  • What is machine learning?
  • How do you define a research question?
  • What are observations?
  • What are features?
  • What are outcome variables?
  • Introduction to KIVA data

63 of 64

You are on fire! Go straight to the next module here.

Need to slow down and digest? Take a minute to write us an email about what you thought about the course. All feedback small or large welcome!

Email: sara@deltanalytics.org

64 of 64

Find out more about Delta’s machine learning for good mission here.