1 of 11

AI Club’s

Project Workshop

  • Build your own ML project from scratch
  • Add to your portfolio
  • Compete for prizes

2 of 11

Week 2: Problem & Dataset Selection

  • Definition of Done
    • Problem identified and defined
    • Dataset selected and accessible
  • Extra Credit
    • Explore your data
      • What ranges are your features in?
      • How many labels do you have to predict?
      • Is what you’re trying to predict equally represented in the data?
        • Do you expect it to be?
      • Are there any features you don’t think you need?
    • Preliminary set of models identified
    • Train a model!
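
The "Explore your data" questions above can be answered with a few lines of Python. The following is a minimal sketch using only the standard library; the toy rows, the feature names, and the helper functions `feature_ranges` and `label_balance` are hypothetical and only stand in for your own dataset. With a real CSV file, a library such as pandas would usually be the more convenient choice.

```python
from collections import Counter

# Hypothetical toy dataset: each row is (features, label).
rows = [
    ({"age": 63, "cholesterol": 233}, "disease"),
    ({"age": 37, "cholesterol": 250}, "healthy"),
    ({"age": 41, "cholesterol": 204}, "healthy"),
    ({"age": 56, "cholesterol": 236}, "healthy"),
]


def feature_ranges(rows):
    """Return {feature_name: (min, max)} across all rows."""
    ranges = {}
    for features, _label in rows:
        for name, value in features.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def label_balance(rows):
    """Return {label: fraction of rows carrying that label}."""
    counts = Counter(label for _features, label in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


ranges = feature_ranges(rows)    # What ranges are your features in?
balance = label_balance(rows)    # Is each label equally represented?

print("Feature ranges:", ranges)
print("Number of labels:", len(balance))
print("Label balance:", balance)
```

If one label makes up most of the rows, as in this toy example, a model that always predicts that label can look accurate while learning nothing, which is why the balance question is worth asking early.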

3 of 11

What Makes a Good Problem?

  • Clean, available datasets
    • There is no model without data to learn from
  • Interesting to you
  • Some level of interactivity
  • Classification problem?
    • Best suited to beginners, strongest support from us

4 of 11

Limitations of Machine Learning

  • Supervised ML Problem Types
    • Classification: Is this email spam or not spam?
    • Regression: What will the house price be?
  • How to Frame a Problem for ML
    • Predict [specific outcome] based on [available data]
    • Bad: Buy and sell stocks/crypto
    • Good: Predict whether Apple stock will gain or lose value today based on yesterday’s change
  • Guiding Questions
    • Is your prediction target specific?
    • Do you have data to learn from?
    • Can you quantitatively measure success?
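
To make the "Predict [specific outcome] based on [available data]" template concrete, here is a rough sketch of how the Apple example could be turned into features, labels, and a measurable score. The prices are made up, the helper names are hypothetical, and the "repeat yesterday's direction" rule is only a naive baseline to measure against, not a recommended trading strategy.

```python
# Hypothetical closing prices for five consecutive trading days.
closes = [100.0, 102.0, 101.0, 103.0, 104.0]


def daily_changes(closes):
    """Day-over-day change in closing price."""
    return [today - prev for prev, today in zip(closes, closes[1:])]


def make_examples(closes):
    """Pair yesterday's change (feature) with whether today gained (label)."""
    changes = daily_changes(closes)
    return [
        (yesterday, 1 if today > 0 else 0)
        for yesterday, today in zip(changes, changes[1:])
    ]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


examples = make_examples(closes)
labels = [label for _change, label in examples]

# Naive baseline: predict a gain whenever yesterday was a gain.
baseline = [1 if change > 0 else 0 for change, _label in examples]
score = accuracy(baseline, labels)

print("Examples (yesterday's change, gained today?):", examples)
print("Baseline accuracy:", score)
```

This covers all three guiding questions: the target is specific (gain or no gain), the data exists (historical prices), and success is a number (accuracy) that any trained model should beat.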

5 of 11

Project Ideas & Examples 1

  • Pet Breed Classification
    • Problem: Identify cat and dog breeds from photos
    • Dataset: Oxford-IIIT Pet Dataset
    • Models: CNN, ResNet
  • Heart Disease Detection
    • Problem: Predict heart disease risk from medical data
    • Dataset: Cleveland Heart Disease Dataset
    • Models: Random Forest, Logistic Regression, SVM
  • Network Security
    • Problem: Detect malicious network connections
    • Dataset: KDD Cup 99, NSL-KDD
    • Models: Decision Trees, Neural Networks, Ensemble

6 of 11

Project Ideas & Examples 2

  • Music Genre Classification
    • Problem: Classify songs by genre from audio
    • Dataset: GTZAN
    • Models: SVM, Random Forest, Deep Learning
  • Stock Price Prediction
    • Problem: Predict stock movements from historical data
    • Dataset: Yahoo Finance, Alpha Vantage
    • Models: LSTM, Linear Regression
  • House Price Prediction
    • Problem: Predict home prices from property features
    • Dataset: Boston Housing, Ames Housing Dataset
    • Models: Linear Regression, XGBoost, Random Forest

7 of 11

Dataset Checklist

  • Sufficient size (1000+ samples preferred)
  • Accessible and downloadable
  • Reasonably clean (some messiness is okay!)
    • Missing values
    • Bad headers for columns
  • Legal to use
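
Parts of this checklist can be automated. The sketch below, using only the standard library, counts rows, missing values, and suspicious column headers in a CSV file. The inline CSV text, the `check_dataset` helper, and the rule that blank or "Unnamed" headers count as bad are all assumptions made for illustration; the legal check still has to be done by reading the dataset's license.

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded file.
raw = """age,chol,Unnamed: 2,target
63,233,,1
37,,x,0
41,204,,0
"""


def check_dataset(text, min_rows=1000):
    """Summarize size, missing values, and suspicious headers of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    headers = reader.fieldnames or []

    # A cell counts as missing if it is absent or blank.
    missing = sum(
        1
        for row in rows
        for header in headers
        if not (row.get(header) or "").strip()
    )

    # Heuristic (an assumption): blank or "Unnamed" headers need fixing.
    bad_headers = [
        h for h in headers if not h.strip() or h.startswith("Unnamed")
    ]

    return {
        "rows": len(rows),
        "enough_rows": len(rows) >= min_rows,
        "missing_values": missing,
        "bad_headers": bad_headers,
    }


report = check_dataset(raw)
print(report)
```

A few missing values or bad headers are not a reason to reject a dataset; the report simply tells you how much cleaning to expect.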

8 of 11

Where to Find Great Datasets

  • General Recommendations
  • Domain-Specific Sources
    • Government: data.gov, census.gov
    • Finance: Yahoo Finance, Alpha Vantage
    • Images: ImageNet, COCO, OpenImages
    • Text: Common Crawl, Project Gutenberg

9 of 11

Example Project

  • Kellen will now show off Week 2’s example project!

10 of 11

Jupyter Notebook Demo

  • There may have been some confusion about the tutorials from last meeting
  • Let’s go over what a Jupyter Notebook is and how to use one

11 of 11

Let’s Build Something Amazing!

  • TODOs
    • Form teams
    • Complete tutorials on unfamiliar tools
    • Choose your problem
    • Find, download, and verify your dataset
    • Tutorials Finished!
  • Resources
    • Project idea slides
    • Dataset recommendations
    • Tutorials on our website; scan the QR code!
  • Questions? Stuck?
    • Raise your hand! We’re here to help