1 of 12

Intro Project Walkthrough

From Dataset to Website

2 of 12

Follow along:

https://bit.ly/ds-walk

3 of 12

Overview

  • Selecting a Dataset
  • Project Outline
  • Data Cleaning
  • Visualization
  • Statistical Methods
  • Making your project into a website using GitHub and Netlify

** We have done full workshops on each of these bullet points. The point of this workshop is to combine them all into one cohesive project!

4 of 12

Dataset Selection

  • kaggle.com
    • A very large catalog of datasets on a wide range of topics
    • Browse projects made with datasets you can use
  • Google Dataset Search
    • Makes datasets easily searchable and citable
  • Government Agencies and Websites
  • …and more!

5 of 12

Our Dataset

  • Palmer Penguins
    • Classic dataset
    • Sourced from kaggle.com (or just library(palmerpenguins) in R)

6 of 12

Project Outline

Link

7 of 12

Data Cleaning

Palmer Penguins is already a very clean dataset (apart from a handful of rows with missing values), so we won’t have to worry much about data cleaning for our purposes.

If you want to learn more about cleaning your dataset, check out our resources here.
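
Even a clean dataset can contain a few missing values. As a minimal sketch of the most basic cleaning step, the Python snippet below drops rows with a missing field. The rows are made up for illustration (the field names mirror the `species` and `bill_depth_mm` columns, but the values are not real records), and `drop_incomplete` is a hypothetical helper, not part of any library.

```python
# Hypothetical rows for illustration only; not real Palmer Penguins records.
rows = [
    {"species": "Adelie", "bill_depth_mm": 18.7},
    {"species": "Adelie", "bill_depth_mm": None},   # missing measurement
    {"species": "Gentoo", "bill_depth_mm": 14.3},
    {"species": None, "bill_depth_mm": 17.9},       # missing label
]


def drop_incomplete(records, required):
    """Keep only the records where every required field is present."""
    return [r for r in records if all(r.get(k) is not None for k in required)]


clean = drop_incomplete(rows, required=("species", "bill_depth_mm"))
print(f"kept {len(clean)} of {len(rows)} rows")
```

Dropping rows is the simplest option; whether it is appropriate depends on how much data is missing and why.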

8 of 12

Visualization

  • Good for formulating a question about the dataset.
  • Read more about visualization:
    • https://bit.ly/dsc-gm3
  • Tableau Workshop:

9 of 12

Statistical Methods

This is where we get to apply our statistical knowledge to answer a question about our dataset.

  • Hypothesis Testing: ANOVA, Student’s t-test, … (120B/C, 122)
    • Pros: Works with most tabular data (data matrix), gives a clear reject / fail-to-reject decision at a chosen significance level
    • Cons: Limited flexibility in the kinds of questions you can ask.
  • Time-series methods: Survival Analysis, Autoregressive Methods (175, 174)
    • Pros: Can model and predict how data change over time.
    • Cons:
  • Machine Learning Methods: Regression, Tree-based methods, Neural Nets (126, 131)
    • Pros: Good for modeling a large variety of datasets.
    • Cons: Sometimes difficult to interpret.

10 of 12

Our Statistical Methodology: ANOVA

We are going to use a one-way Analysis of Variance (ANOVA) to test whether mean Bill Depth differs significantly across Species.

PSTAT 122 extensively covers ANOVAs.
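
To make the idea concrete, here is a minimal sketch of the one-way ANOVA F statistic computed by hand in Python. The bill depths are made-up numbers, not the real measurements, and `one_way_anova_f` is a hypothetical helper written for this illustration. In the workshop itself you would normally rely on a statistics package instead.

```python
from statistics import mean

# Hypothetical bill depths (mm) per species; made-up numbers, not the real data.
groups = {
    "Adelie": [18.0, 19.0, 18.5, 17.5],
    "Chinstrap": [18.5, 19.5, 18.0, 18.0],
    "Gentoo": [15.0, 14.5, 15.5, 15.0],
}


def one_way_anova_f(samples):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in samples for x in g]
    grand_mean = mean(all_values)
    k, n = len(samples), len(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in samples for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_anova_f(list(groups.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F means the variation between species means is large relative to the variation within each species. To get a p-value, compare F against an F distribution with the same two degrees of freedom.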

11 of 12

Presenting your Project

Often a website is the preferred method for displaying your project.

While dynamic website hosting can be difficult and expensive, static website hosting is often free! A static website simply serves fixed files, such as an HTML document (like the one you can generate with R Markdown or Quarto).

Popular options for static website hosting include Netlify and GitHub Pages (github.io websites).

We are going to use Netlify for our website.
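
To show what a static site amounts to, here is a minimal sketch in Python that writes a one-page site as a folder containing `index.html`. The folder name `site`, the page text, and the `build_site` helper are all hypothetical placeholders. In practice, R Markdown or Quarto would generate the HTML for you.

```python
from html import escape
from pathlib import Path


def build_site(out_dir, title, paragraphs):
    """Write a single-page static site (index.html) into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Escape the text so characters like < and & display correctly in HTML.
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    page = f"""<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>{escape(title)}</title>
  </head>
  <body>
    <h1>{escape(title)}</h1>
{body}
  </body>
</html>
"""
    index = out / "index.html"
    index.write_text(page, encoding="utf-8")
    return index


# Hypothetical folder name and text; replace with your own project content.
index_path = build_site(
    "site",
    "Palmer Penguins: Species and Bill Depth",
    ["Does species play a significant role in bill depth? We test this with a one-way ANOVA."],
)
print(index_path)
```

You can preview the result locally with `python -m http.server --directory site`. A static host serves the contents of a folder like this one.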

12 of 12

Thank you!!

These slides will be made available at dscollab.github.io.