Machine Learning HW2

MLTAs

ntueemlta2021@gmail.com

Outline

  • HW2 - Income 50K prediction
    • Dataset and Tasks Description
    • Provided Feature Format
    • Sample Submission
  • Kaggle
  • Grading / Assignment Regulation

Dataset and task introduction

  • Dataset : Adult Data Set

Reference : https://archive.ics.uci.edu/ml/datasets/Adult

Please download the data from here (you only need X_train, Y_train, and X_test).

  • Task : Binary Classification
    • Logistic regression, Probabilistic generative model

Determine whether a person makes over 50K a year.

Data Attribute Information

train.csv, test.csv:

age, workclass, fnlwgt, education, education-num, marital-status, occupation,
relationship, race, sex, capital-gain, capital-loss, hours-per-week,
native-country, whether the person makes over 50K a year

  • For more details, see the Kaggle Description page

Provided Feature Format

X_train, Y_train, X_test: (please download the data here)

  1. Discrete features in train.csv are one-hot encoded in X_train (work_class, education, ...)
  2. Continuous features in train.csv are kept unchanged in X_train (age, capital_gain, ...)
  3. X_train, X_test: each row is one 106-dim feature vector representing one sample
  4. Y_train: label = 0 means "<= 50K"; label = 1 means "> 50K"
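
To illustrate items 1 and 2, here is a minimal sketch of how a discrete column can be one-hot encoded and placed next to a continuous column. The toy values and the helper name `one_hot` are assumptions for illustration only; the provided X_train is already encoded, and its exact column order may differ.

```python
import numpy as np

def one_hot(column):
    """Return (categories, matrix) with matrix[i, j] = 1 iff column[i] == categories[j]."""
    # np.unique returns the sorted categories and, for each entry, its category index.
    categories, inverse = np.unique(column, return_inverse=True)
    matrix = np.zeros((len(column), len(categories)), dtype=int)
    matrix[np.arange(len(column)), inverse] = 1
    return categories, matrix

# Toy data (hypothetical rows, not taken from the real dataset).
workclass = np.array(["Private", "State-gov", "Private", "Self-emp-inc"])
age = np.array([39, 50, 38, 52])

categories, encoded = one_hot(workclass)
# Continuous feature stays as-is; the discrete feature becomes one column per category.
features = np.column_stack([age, encoded])
```

Each row of `encoded` contains exactly one 1, so the number of feature columns grows with the number of distinct categories, which is how 14 raw attributes can expand to a 106-dim vector.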

Sample Submission

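Predict labels for the 16281 samples in the test set.

  1. The upload format is csv
  2. The first line must be id,label; predictions start from the second line
  3. Each line contains the id and the predicted label, separated by a comma
  4. Evaluation: Accuracy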
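
The format above can be sketched as follows. This is a minimal example, not the official sample code: the helper names `write_submission` and `accuracy` are hypothetical, and starting ids at 1 is an assumption (check the sample submission on Kaggle for the actual first id).

```python
import csv
import io
import numpy as np

def write_submission(file_obj, labels, first_id=1):
    """Write an 'id,label' header followed by one prediction per line."""
    # first_id=1 is an assumption; adjust it to match the sample submission.
    writer = csv.writer(file_obj)
    writer.writerow(["id", "label"])
    for offset, label in enumerate(labels):
        writer.writerow([first_id + offset, int(label)])

def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_true)))

# Demo on an in-memory buffer so nothing is written to disk.
buffer = io.StringIO()
write_submission(buffer, [0, 1, 1])
submission_text = buffer.getvalue()

toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

To write a real file, open it with `open("submission.csv", "w", newline="")` and pass the file object to `write_submission`.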

Kaggle Info & Deadline

  • Link: https://www.kaggle.com/t/93e214f8b5b64978a9e03c923dfd3e8f
  • sample code
  • Individual work; no teams
  • Team Name:
    • Enrolled students: studentID_anyname (e.g., b09901666_)
    • Auditors: 旁聽_anyname
  • Maximum Daily Submission: 5 times
  • Kaggle Deadline: 10/28/2021 23:59:59 (GMT+8)
  • Ceiba Deadline: 10/30/2021 23:59:59 (GMT+8)
  • The 16281 test samples are split into two parts: 8140 public and 8141 private
  • The leaderboard shows the public score. Before the Kaggle deadline, you may select 2 submissions to be used for the private score.

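Grading Criteria - Kaggle (5% + Bonus 1%)

  • Kaggle Deadline: 10/28/2021 23:59:59 (GMT+8)
  • Kaggle Score Point - 4%
    • Based on the public/private scoreboard scores as of 10/28/2021 23:59:59:
      • Beat the simple baseline on the public leaderboard: 1%
      • Beat the strong baseline on the public leaderboard: 1%
      • Beat the simple baseline on the private leaderboard: 1%
      • Beat the strong baseline on the private leaderboard: 1%
    • All of the above are awarded only if your results pass the reproduce check
  • Bonus - 1%
    • (1.0%) Rank in the top five on the private leaderboard and submit slides describing your implementation. You must also record a short explanatory video (under three minutes) as a simple presentation, which the TAs will share with the class for reference.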

Grading Criteria - Report (5%)

Assignment Regulation

  1. Implement logistic regression with hand-written gradient descent
  2. Implement the probabilistic generative model by hand
  3. Only Python 3.7 is available!
  4. hw2_logistic.ipynb and hw2_generative.ipynb may use the following packages:
    1. numpy == 1.19.5
    2. scipy == 1.4.1
    3. pandas == 1.1.5
    4. Python standard library
  5. hw2_best.ipynb has no restriction on method; the following packages are allowed (note the version restrictions):
    • pytorch == 1.9.0 (PyTorch tutorial 1, PyTorch tutorial 2)
    • tensorflow == 2.6.0
    • keras == 2.6.0
    • scikit-learn == 0.22.2
    • xgboost, AdaBoostClassifier, and ExtraTreesClassifier may not be used
  6. If you need any other package, email the TAs as early as possible and explain why.
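
For item 1, a minimal sketch of logistic regression trained with plain batch gradient descent, using only numpy. The function names, learning rate, iteration count, and toy data are all assumptions for illustration; this is not the official sample code, and real features would likely need normalization first.

```python
import numpy as np

def sigmoid(z):
    # Clip the input so np.exp does not overflow for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logistic(X, y, lr=0.1, n_iters=1000):
    """Minimize the mean cross-entropy loss with batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)               # predicted P(y = 1 | x)
        error = p - y                        # gradient of the loss w.r.t. the logit
        w -= lr * (X.T @ error) / n_samples  # gradient step on the weights
        b -= lr * error.mean()               # gradient step on the bias
    return w, b

def predict_logistic(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Toy data: two well-separated Gaussian blobs (hypothetical, not the Adult data).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

w, b = train_logistic(X_toy, y_toy)
train_acc = float((predict_logistic(X_toy, w, b) == y_toy).mean())
```

The update uses the fact that the gradient of the cross-entropy loss with respect to the weights is X^T (p - y) / n, so no automatic differentiation is needed.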

Ceiba Submissions

Your Ceiba submission must contain at least the following 4 files (the format must match exactly):

1. hw2_logistic.ipynb : hand-written "logistic regression" using gradient descent

2. hw2_generative.ipynb : hand-written "probabilistic generative model"

3. hw2_best.ipynb : reproduces the highest-scoring submission you selected on Kaggle

4. report.pdf : please refer to the report template

Do NOT upload the dataset!
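
For hw2_generative.ipynb, one common formulation is two Gaussian class-conditional densities with a shared covariance matrix, which yields a linear decision boundary. The sketch below assumes that formulation; the function names and toy data are hypothetical, and the course may expect a different variant.

```python
import numpy as np

def fit_generative(X, y):
    """Fit class means and a shared covariance; return (w, b) with P(y=1|x) = sigmoid(w.x + b)."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Maximum-likelihood covariance of each class, then a count-weighted average.
    cov0 = (X0 - mu0).T @ (X0 - mu0) / n0
    cov1 = (X1 - mu1).T @ (X1 - mu1) / n1
    shared_cov = (n0 * cov0 + n1 * cov1) / (n0 + n1)
    # The pseudo-inverse guards against a singular covariance (e.g. constant one-hot columns).
    inv = np.linalg.pinv(shared_cov)
    w = inv @ (mu1 - mu0)
    b = -0.5 * mu1 @ inv @ mu1 + 0.5 * mu0 @ inv @ mu0 + np.log(n1 / n0)
    return w, b

def predict_generative(X, w, b):
    # sigmoid(z) >= 0.5 exactly when z >= 0, so the sigmoid itself is not needed here.
    return (X @ w + b >= 0).astype(int)

# Toy data: two Gaussian blobs (hypothetical, not the Adult data).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (60, 2)), rng.normal(2, 1, (40, 2))])
y_toy = np.array([0] * 60 + [1] * 40)

w, b = fit_generative(X_toy, y_toy)
train_acc = float((predict_generative(X_toy, w, b) == y_toy).mean())
```

Unlike logistic regression, the parameters here come from closed-form estimates, so there is no learning rate or iteration count to tune.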

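Report Format

  • Restrictions
    • The file name must be report.pdf !!!
    • Write the report in Chinese (non-native Chinese speakers may write in English)
    • State your department and year, student ID, and name, and answer the questions following the report template; do not change the order of the question numbers
    • If you discussed with other students in the course, you must list your collaborators (name and student ID) before the question number
  • Report template link
    • Link: Link
  • The deadline is the same as the Ceiba Deadline: 10/30/2021 23:59:59 (GMT+8)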

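Other Policy

  • Lateness
    • Submitting to Ceiba one day late (any fraction of a day counts as a full day) multiplies your total hw2 score by 0.7
    • Submitting only the code or only the report late is not accepted
    • Submissions more than one day late are not accepted; if you have special circumstances, contact the TAs as soon as possible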

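Handin Format

  • Kaggle deadline: 10/28/2021 23:59:59 (GMT+8)
  • Ceiba code & report deadline: 10/30/2021 23:59:59 (GMT+8)
  • Compress your code and report into a zip file named studentID_hw2.zip and upload it to Ceiba. The zip must contain the code and report.pdf (the report includes the math problems).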
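
A minimal sketch of building the hand-in archive with the Python standard library. The helper name `make_handin_zip` is hypothetical, placing the files at the top level of the zip is an assumption (the slides do not say whether a folder is expected), and the demo uses empty placeholder files in a temporary directory.

```python
import os
import tempfile
import zipfile

REQUIRED_FILES = ["hw2_logistic.ipynb", "hw2_generative.ipynb", "hw2_best.ipynb", "report.pdf"]

def make_handin_zip(student_id, file_paths, out_dir="."):
    """Create <student_id>_hw2.zip in out_dir containing the given files."""
    zip_path = os.path.join(out_dir, f"{student_id}_hw2.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            # Assumption: store each file at the top level of the archive.
            zf.write(path, arcname=os.path.basename(path))
    return zip_path

# Demo with empty placeholder files; everything is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in REQUIRED_FILES:
        path = os.path.join(tmp, name)
        open(path, "w").close()
        paths.append(path)
    zip_path = make_handin_zip("b09901666", paths, out_dir=tmp)
    zip_name = os.path.basename(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        names_in_zip = sorted(zf.namelist())
```

Listing the archive contents before uploading is a quick way to confirm that report.pdf is included and that no dataset file slipped in.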

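Other Policy

  • Cheating
    • Copying code or reports (including from students who took the course previously)
    • Registering for the competition with multiple Kaggle accounts
    • Accessing the ground-truth labels of the testing data in any form during training
    • Uploading results from previous Kaggle competitions
    • The professor and TAs reserve the right to ask students to come to the office and explain their coding assignment; please act with integrity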

Machine Learning Pre-test

Pre-test questionnaire: please help by filling it out

TA Hour

  • 10/22, 10/29 (Fri) @ BL B1 系K (department study room)
  • 18:00 ~ 19:00