FB Post Prediction

Group 1

109502535, CSIE 3A, 湯騏蔚

Goal

Use a Decision Tree Classifier
to predict the number of comments on posts from the Facebook fan page 靠北中央

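Specification: features & target

target: number of comments (comment_num)
features:

Content length (content_length)
Post date (posted_date)
Whether the content contains a link (url)
Whether the content contains certain keywords (key_word_0, key_word_1, …)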

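Crawler: Python Selenium

To guard against a failure partway through the crawl, a CSV file is saved after every 100 posts; a separate .py script merges them all at the end
Regular expressions extract the needed information
Initial data processing: the datetime module normalizes post dates to a single format
An unexpected error: memory error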
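The date normalization step might look roughly like the sketch below. The raw formats it handles (an absolute date such as "2022年12月3日", a date without a year, and relative ages such as "3天" or "5小時") are assumptions for illustration; the slides do not say which formats the page actually shows, and normalize_date is a hypothetical name.

```python
import re
from datetime import datetime, timedelta

# Assumed raw formats; the real page may display dates differently.
ABSOLUTE = re.compile(r"(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日")
RELATIVE = re.compile(r"(\d+)\s*(天|小時)")


def normalize_date(raw: str, now: datetime) -> str:
    """Return the post date as YYYY-MM-DD, or raise ValueError."""
    m = ABSOLUTE.search(raw)
    if m:
        # A missing year is assumed to mean the current year.
        year = int(m.group(1)) if m.group(1) else now.year
        parsed = datetime(year, int(m.group(2)), int(m.group(3)))
        return parsed.strftime("%Y-%m-%d")
    m = RELATIVE.search(raw)
    if m:
        amount = int(m.group(1))
        if m.group(2) == "天":
            delta = timedelta(days=amount)
        else:
            delta = timedelta(hours=amount)
        return (now - delta).strftime("%Y-%m-%d")
    raise ValueError(f"unrecognized date text: {raw!r}")
```

Passing `now` explicitly keeps the function deterministic, which makes it easier to check against saved crawl output.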

In the end, 1081 posts were crawled successfully
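A minimal sketch of the save-every-100-posts and merge-afterwards pattern, using only the standard library. The column names, file naming scheme, and function names are assumptions for illustration, not the project's actual code; `posts` stands in for whatever the Selenium loop yields.

```python
import csv
import glob
import os

FIELDS = ["content", "posted_date", "comment_num"]  # assumed columns


def save_chunk(rows, out_dir, index):
    """Write one chunk of posts to its own numbered CSV file."""
    path = os.path.join(out_dir, f"posts_{index:04d}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


def crawl_with_checkpoints(posts, out_dir, chunk_size=100):
    """Consume an iterable of post dicts, saving every chunk_size posts."""
    buffer, index = [], 0
    for post in posts:
        buffer.append(post)
        if len(buffer) == chunk_size:
            save_chunk(buffer, out_dir, index)
            buffer, index = [], index + 1
    if buffer:  # final partial chunk
        save_chunk(buffer, out_dir, index)


def merge_chunks(out_dir, merged_path):
    """Concatenate all chunk files into one CSV; return the row count."""
    rows = []
    for path in sorted(glob.glob(os.path.join(out_dir, "posts_*.csv"))):
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    with open(merged_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

If the crawl crashes, at most the posts in the current unsaved chunk are lost, and the merge step can run on whatever chunk files exist.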

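Data Processing

The data_process function handles all of the data preprocessing

comment_num: handled by the comment_num_process function; split into two classes, popular and unpopular, with the threshold set by a hyperparameter called comment_num_popular
posted_date: handled by the posted_date_process function; split into six classes: within one week, within one month, within three months, within half a year, within one year, and older
content_length: handled by the content_length_process function; split into 11 classes: 0~100, 100~200, …, 900~1000, and 1000~
url: handled by the url_process function; split into two classes, present or absent
Data split: sklearn's train_test_split divides the data into train_data and test_data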
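The binning rules above could be sketched as follows. The function names come from the slide, but the bodies are assumptions: the slides do not state whether the threshold is inclusive, how many days count as a month, which bin a boundary length such as 100 falls into, or how a link is detected.

```python
from datetime import date


def comment_num_process(comment_num: int, comment_num_popular: int) -> str:
    # Assumption: a post exactly at the threshold counts as popular.
    return "popular" if comment_num >= comment_num_popular else "unpopular"


# Assumed day limits for the six age classes.
DATE_BINS = [
    (7, "within_week"),
    (30, "within_month"),
    (90, "within_3_months"),
    (180, "within_half_year"),
    (365, "within_year"),
]


def posted_date_process(posted: date, today: date) -> str:
    age_days = (today - posted).days
    for limit, label in DATE_BINS:
        if age_days <= limit:
            return label
    return "older"


def content_length_process(length: int) -> int:
    # Assumption: half-open bins [0, 100), [100, 200), ..., and 1000+ as bin 10.
    return min(length // 100, 10)


def url_process(content: str) -> int:
    # Assumption: a link is anything containing an http(s) scheme.
    return int("http://" in content or "https://" in content)
```

Turning every feature into a small set of categories keeps the tree's splits easy to read, at the cost of discarding ordering detail inside each bin.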

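Data Processing (continued)

key_word_n:

Handled by the key_word_process function
The jieba module segments train_data into words; the keywords are the words that occur more than key_word_consider_freq times in popular posts and never appear in unpopular posts
Both train and test are then checked for whether each post contains those words
Each key_word_n is an independent feature

comment_num is extracted as y_train and y_test; the remaining columns become x_train and x_test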
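The keyword selection rule can be sketched on already-tokenized posts. This sketch assumes segmentation happened upstream (for example with jieba.lcut) and that "occurrences" means total token count rather than number of posts; select_key_words and key_word_features are hypothetical names.

```python
from collections import Counter


def select_key_words(popular_docs, unpopular_docs, key_word_consider_freq):
    """Pick words frequent in popular posts and absent from unpopular ones.

    Each doc is a list of tokens from the training set only.
    """
    popular_counts = Counter(tok for doc in popular_docs for tok in doc)
    unpopular_vocab = {tok for doc in unpopular_docs for tok in doc}
    return sorted(
        word
        for word, count in popular_counts.items()
        if count > key_word_consider_freq and word not in unpopular_vocab
    )


def key_word_features(doc_tokens, key_words):
    """Build one binary key_word_n feature per selected keyword."""
    present = set(doc_tokens)
    return {f"key_word_{i}": int(w in present) for i, w in enumerate(key_words)}
```

Selecting keywords from the training set only, then applying the same list to the test set, avoids leaking test labels into the features.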

Model Training

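Building the Decision Tree

The build_tree function handles this, including pre-pruning, post-pruning, and printing the results

sklearn's DecisionTreeClassifier is used directly

pre_pruning: a for loop tries different max_depth values, computes train accuracy, test accuracy, and area under the ROC curve (AUC), and keeps the run with the highest AUC.
post_pruning: DecisionTreeClassifier's cost_complexity_pruning_path returns the effective alpha values along the pruning path; a for loop then tries each as the ccp_alpha parameter, computes train accuracy, test accuracy, and AUC, and keeps the run with the highest AUC.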
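A sketch of the two pruning loops, assuming binary labels encoded as 0/1 with the positive class in the second predict_proba column. The helper names, the depth range, and the fixed random_state are assumptions; only the sklearn calls themselves come from the slide.

```python
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier


def evaluate(clf, x_train, y_train, x_test, y_test):
    """Fit clf and return (train accuracy, test accuracy, test AUC)."""
    clf.fit(x_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    return clf.score(x_train, y_train), clf.score(x_test, y_test), auc


def pre_pruning(x_train, y_train, x_test, y_test, depths=range(1, 11)):
    """Try each max_depth and keep the run with the highest AUC."""
    best = None
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        train_acc, test_acc, auc = evaluate(clf, x_train, y_train, x_test, y_test)
        if best is None or auc > best["auc"]:
            best = {"max_depth": depth, "train_acc": train_acc,
                    "test_acc": test_acc, "auc": auc, "model": clf}
    return best


def post_pruning(x_train, y_train, x_test, y_test):
    """Try each alpha on the pruning path and keep the highest-AUC run."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        x_train, y_train
    )
    best = None
    for alpha in path.ccp_alphas:
        alpha = max(float(alpha), 0.0)  # guard against tiny negative values
        clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        train_acc, test_acc, auc = evaluate(clf, x_train, y_train, x_test, y_test)
        if best is None or auc > best["auc"]:
            best = {"ccp_alpha": alpha, "train_acc": train_acc,
                    "test_acc": test_acc, "auc": auc, "model": clf}
    return best
```

Selecting on test-set AUC mirrors the slide; a separate validation split would give a less optimistic estimate of the chosen model.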

Plotting the ROC curve

The plot_roc function simply combines the classifier's predict_proba with sklearn's roc_curve, auc, and RocCurveDisplay, plus matplotlib, to draw the ROC curve and save it as roc curve.png
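One way plot_roc could be put together from those pieces is sketched below. The signature, the headless Agg backend, and the assumption that labels are 0/1 with the positive class in the second predict_proba column are illustrative choices, not the project's confirmed code.

```python
import matplotlib

matplotlib.use("Agg")  # render to file without a display (assumption)
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, auc, roc_curve


def plot_roc(clf, x_test, y_test, path="roc curve.png"):
    """Draw the ROC curve of a fitted classifier and save it to path."""
    scores = clf.predict_proba(x_test)[:, 1]  # positive-class probability
    fpr, tpr, _ = roc_curve(y_test, scores)
    roc_auc = auc(fpr, tpr)
    display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
    display.plot()
    display.figure_.savefig(path)
    plt.close(display.figure_)
    return roc_auc
```

Returning the AUC alongside saving the figure lets the caller print the same number that appears in the plot legend.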

Conclusion

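The results fell short of expectations. Possible reasons:

The data set is too small
The features were not considered thoroughly enough
NLP is hard: training a good model takes experience, and the choice of features matters a great deal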