1 of 19

Machine Learning With Datasets

Ai4Ga

Unit 3.4


2 of 19

What are datasets?

A dataset is a collection of data. In the case of tabular data, a dataset corresponds to one or more tables, where every column of a table represents a particular feature and every row corresponds to one example in the dataset.
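As a minimal sketch of this idea, a small table can be written in Python as a list of rows. The column names and values below are made up for illustration and are not taken from the AI Lab Jeans dataset.

```python
# A tiny, made-up tabular dataset (hypothetical values, for illustration only).
# Each dictionary is one row (an example); each key is one column (a feature).
jeans = [
    {"brand": "A", "style": "skinny", "price": 40.0},
    {"brand": "B", "style": "bootcut", "price": 55.0},
    {"brand": "A", "style": "straight", "price": 45.0},
]

# The features are the column names shared by every row.
features = list(jeans[0].keys())

# The number of examples is the number of rows.
num_examples = len(jeans)

print("Features:", features)
print("Number of examples:", num_examples)
```

Adding a new example means adding a row; adding a new feature means adding a key to every row.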


Jeans dataset from AI Lab.

3 of 19

Where do datasets come from?

Scientists, companies, engineers, and data scientists

  • collect data about phenomena and problems they are interested in understanding

For example, data can come from

  • Sensors in a car
  • Surveys
  • Video or images of endangered animals or plants
  • Number of products, sneakers, or songs sold

Data can come from almost anywhere and take any form.


Fun fact: Data is plural.

In Latin, data is the plural of datum. In specialized scientific fields, it is also treated as a plural in English. Therefore, we say "data were collected and classified" and not "data was collected and classified."

Phenomenon (singular), phenomena (plural)

An observable fact or event that occurs in a natural or designed system, especially one whose cause or explanation is in question.

4 of 19

Orientation to AI Lab & Its ML Pipeline


Use this link to access AI Lab

5 of 19

Step 1: Selecting a Dataset

AI Lab comes with many different datasets already preloaded into the tool. For our practice activity, select the Jeans Measurements dataset.


6 of 19

Identifying a Question

After we identify our dataset, we need to review its features to figure out what questions we can answer with it.


Jeans dataset from AI Lab.

7 of 19

Our QUESTION:

What features best predict the price of jeans?

8 of 19

Selecting a feature we would like to predict


9 of 19

Label of the Feature we want to predict: price in dollars


10 of 19

Labels of the features we want to train our predictor model with: Let’s try Brand, Style, and Mens or Womens
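The choice we just made can be sketched in Python: one column is the label to predict, and the chosen features are the inputs. This is only an illustration. The column names are hypothetical stand-ins for the ones shown in AI Lab, and the row is made up.

```python
# One made-up row (hypothetical column names and values, not real AI Lab data).
row = {
    "brand": "A",
    "style": "skinny",
    "mens_or_womens": "womens",
    "price": 40.0,
}

# The label is the column we want the model to predict.
label_name = "price"

# The features are the columns the model is allowed to look at.
feature_names = ["brand", "style", "mens_or_womens"]

# Separate the row into model inputs and the value to predict.
inputs = {name: row[name] for name in feature_names}
target = row[label_name]

print("Inputs:", inputs)
print("Target:", target)
```

Note that the label is left out of the inputs, so the model has to work out the price from the other columns.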


11 of 19

Let’s train our model


12 of 19

How is our model trained?

Step 1 - Our dataset is first split into two sets by AI Lab:

  • Training set (90% of our data)
  • Test set (10% of our data)

This is a random process, so your results may vary depending on which data were included in the training set or the test set.

Step 2 - An ML algorithm is used to find patterns in the training data. These patterns are statistical relationships, called correlations, between the value to be predicted and the features selected for training.

Step 3 - After the training is complete and the model is created, it needs to be tested for accuracy using the test set.
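The random 90/10 split in Step 1 can be sketched in Python as follows. This is an illustration under assumptions: AI Lab's actual code is not shown in these slides, the function name `split_dataset` is hypothetical, and the rows are stand-in numbers rather than real jeans data.

```python
import random

def split_dataset(rows, test_fraction=0.10, seed=None):
    """Randomly split rows into a training set and a test set (hypothetical helper)."""
    rng = random.Random(seed)      # a fixed seed makes the shuffle repeatable
    shuffled = list(rows)          # copy so the original order is untouched
    rng.shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    test_set = shuffled[:test_size]
    training_set = shuffled[test_size:]
    return training_set, test_set

rows = list(range(100))            # stand-in for 100 examples
training_set, test_set = split_dataset(rows, seed=1)
print(len(training_set), "training examples;", len(test_set), "test examples")
```

Every example lands in exactly one of the two sets. Changing the seed changes which examples end up in the test set, which is why results can vary from run to run.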


13 of 19

Testing the Model - Press Continue to skip this animation!


14 of 19

Evaluating the Accuracy of the Model

This selection of features had an accuracy of 75% on the test. Let’s see how well it did.


15 of 19

Explore Correct Predictions

Notice that a correct prediction doesn’t have to match exactly. A prediction counts as correct if it falls within a range of +/- 5%.
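One way to read this rule is sketched below in Python. It assumes the +/- 5% range is measured relative to the actual price, which the slides do not state. The function names and the prices are made up for illustration.

```python
def is_correct(predicted, actual, tolerance=0.05):
    """Return True when the prediction is within +/- tolerance of the actual value."""
    # Assumption: the range is a fraction of the actual value.
    return abs(predicted - actual) <= tolerance * abs(actual)

def percent_correct(predictions, actuals, tolerance=0.05):
    """Percentage of predictions that fall inside the tolerance range."""
    correct = sum(is_correct(p, a, tolerance) for p, a in zip(predictions, actuals))
    return 100 * correct / len(actuals)

# Made-up prices in dollars (not from the Jeans dataset).
actual_prices = [40.0, 50.0, 60.0, 80.0]
predicted_prices = [41.0, 49.0, 75.0, 80.0]

print(percent_correct(predicted_prices, actual_prices))  # 75.0
```

Here three of the four predictions are within 5% of the actual price, so the sketch reports 75 percent correct. The prediction of 75 dollars for jeans that cost 60 dollars is well outside the range.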


16 of 19

Explore Incorrect Predictions

Wow! Look at how different these predictions are. They are off by way more than +/- 5%.


17 of 19

Note - Your results may be different from the slides, even though you picked the same features and dataset. Remember, the AI trains with 90% of the data and tests with the remaining 10%. Which examples end up in each set can change the result.
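The effect described in this note can be sketched in Python by repeating the whole process with different random splits. Everything here is an assumption made for illustration: the data are made up, the "model" simply predicts the average price of each brand (AI Lab's real algorithm is not described in these slides), and the +/- 5% range is taken relative to the actual price.

```python
import random

def accuracy_for_split(rows, seed, test_fraction=0.10, tolerance=0.05):
    """Split, 'train' a brand-average model, and return percent correct on the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    test_set, training_set = shuffled[:test_size], shuffled[test_size:]

    # "Training": remember the average price of each brand in the training set.
    totals = {}
    for brand, price in training_set:
        count, total = totals.get(brand, (0, 0.0))
        totals[brand] = (count + 1, total + price)
    averages = {brand: total / count for brand, (count, total) in totals.items()}
    overall = sum(price for _, price in training_set) / len(training_set)

    # "Testing": a prediction is correct if it is within +/- tolerance of the actual price.
    correct = 0
    for brand, price in test_set:
        predicted = averages.get(brand, overall)  # fall back if a brand was never seen
        if abs(predicted - price) <= tolerance * price:
            correct += 1
    return 100 * correct / len(test_set)

# Made-up data: 50 (brand, price) pairs, not the real Jeans dataset.
data_rng = random.Random(0)
rows = [(brand, base + data_rng.uniform(-6, 6))
        for brand, base in [("A", 40.0), ("B", 60.0)]
        for _ in range(25)]

for seed in range(3):
    print("split", seed, "->", accuracy_for_split(rows, seed), "percent correct")
```

The same data and the same features go in each time. Only the random split changes, and that alone can be enough to move the percent correct up or down.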


18 of 19

Now you try!

Click Try Again in the bottom left corner.

Keep the prediction set to the price of jeans, and select two or three features you think will give the most accurate results. Place your answers in the table below.


             Feature for category 1    Category 2    Category 3    Percent Correct
Attempt 1    Brand                                                 75
Attempt 2
Attempt 3
Attempt 4
Attempt 5

19 of 19

A solution to the Jeans dataset

  • Try these features: Brand, Cotton Contents, Mens or Womens
