KDD-Cup-2019: Los Bocadillos

Practical Course “Big Data Science”

Alessandro Volpicella, Ane Berasategi, Denny Steigmeier, Frederik Ludwigs, Lynn Nguyen (Group “Los Bocadillos”)

Lehrstuhl für Datenbanksysteme und Data Mining, Ludwig-Maximilians-Universität München
{al.volpicella, an.berasategi, de.steigmeier, fr.ludwigs, th.nguyen}@campus.lmu.de

Abstract

Our group participated in the KDD-Cup 2019, where one of the competition tasks was to predict the most appropriate transport mode for a given user query. To this end, we analyzed the input data set provided by Baidu, engineered features that we considered insightful and relevant to the task, and applied machine learning approaches such as learning to rank and multiclass classification. The best-performing model was LightGBM Multiclass with an F1 score of 69.42%. Additionally, we addressed the second task of the competition, an open research or application challenge, with the aim to (i) enable selective advertising (Baidu’s core business), (ii) allow targeted marketing, and (iii) increase profit by raising the usability and range of applications of the provided data set. In our research we used clustering methods to derive user behaviour and increase the accuracy of the predicted transportation mode. OPTICS was the best-performing clustering method with an F1 score of 52.18% for multiclass predictions.

KDD-Cup 2019

Data Preprocessing

Evaluation

Task 2

Learning-To-Rank

Multiclass Classification

Results

Idea: Find connection between query and user features in order to

enable selective advertising (Baidu’s core business)
allow targeted marketing
increase profit by using the whole potential of the dataset

Assumption:

Users with similar application behaviour have the same general preferences

Approach:

1. Cluster available PID
2. Train multiclass model to predict user groups based on query features

Clustering

K-Means: Run the algorithm for a range of cluster counts while monitoring the silhouette score to assess consistency within clusters → four cluster sizes (4, 8, 16, 36)
K-Modes: Similar to K-Means, but minimizes the sum of within-cluster distances using the Hamming distance instead of the Euclidean distance
Hierarchical clustering (agglomerative): No knowledge of the number of clusters required in advance
OPTICS: Builds on the basic idea of DBSCAN but addresses one of its major weaknesses → detecting meaningful clusters in data of varying density
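
The difference between K-Means and K-Modes can be illustrated with a minimal sketch of one K-Modes iteration. The profile vectors and the function names (`hamming`, `mode_of`, `assign`) are hypothetical and chosen for illustration only; this is not the implementation used in the project.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(rows):
    """Column-wise most frequent value: the K-Modes analogue of a centroid."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def assign(rows, modes):
    """Assign every row to the index of its nearest mode under Hamming distance."""
    return [min(range(len(modes)), key=lambda k: hamming(r, modes[k])) for r in rows]

# Hypothetical one-hot profile vectors (not taken from the competition data).
profiles = [
    (1, 0, 0, 1), (1, 0, 0, 0), (1, 0, 1, 1),
    (0, 1, 1, 0), (0, 1, 1, 1), (0, 1, 0, 0),
]

# One iteration: assign points to the initial modes, then recompute the modes.
modes = [profiles[0], profiles[3]]
labels = assign(profiles, modes)
new_modes = [
    mode_of([p for p, l in zip(profiles, labels) if l == k])
    for k in range(len(modes))
]
```

Repeating the assign and update steps until the labels stop changing gives the usual K-Modes loop; K-Means follows the same pattern with the Euclidean distance and the column-wise mean.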

Annual Data Mining Conference
2019: Data Competition in cooperation with Baidu (Chinese search engine)

Task 1: Predict the most appropriate transport mode for a user query

Task 2: Open research/application challenge

Stacking

Provide columns indicating the availability for each query
Choose the best available transport mode
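
A minimal sketch of the selection step, assuming the classifier returns one score per transport mode and the availability columns are 0/1 flags. The function name `best_available_mode` and the example values are hypothetical.

```python
def best_available_mode(scores, available):
    """Return the index of the highest-scoring mode among those marked available."""
    candidates = [m for m, ok in enumerate(available) if ok]
    if not candidates:
        return None  # no mode is marked available for this query
    return max(candidates, key=lambda m: scores[m])

# Hypothetical class scores and availability flags for one query.
scores = [0.05, 0.40, 0.35, 0.20]
available = [1, 0, 1, 1]

# Mode 1 has the highest raw score but is unavailable, so mode 2 is chosen.
choice = best_available_mode(scores, available)
```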

Raw Data

Queries Records: Session ID, Profile ID, Request Time, Origin/Destination Coordinates,

Display Records: Several plan suggestions with Transport Type, Price, Estimated Travel Time, Distance

User Attributes: Profile Features one-hot encoded

Click Records: Chosen Transport Mode, Click Time

Rank the available transport modes
Choose the one with the highest predicted score

Boost predictions by using ensemble method stacking
Combine multiclass and ranking learners into a meta learner
Idea is to not look at original feature space directly
Use predicted scores from baseline models instead
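
The stacking idea can be sketched as follows. The two base learners are hypothetical stand-ins that return fixed scores, and the meta learner is a toy weighted sum with made-up weights; in practice the meta learner would be fitted on out-of-fold predictions of the base models.

```python
def meta_features(query, base_models):
    """Concatenate the per-mode scores of every base model into one feature row."""
    row = []
    for model in base_models:
        row.extend(model(query))
    return row

# Hypothetical stand-ins for fitted base learners (three transport modes).
def multiclass_scores(query):
    return [0.2, 0.5, 0.3]

def ranking_scores(query):
    return [-0.1, 1.4, 0.9]

def toy_meta_learner(row, weights, n_modes):
    """Toy meta learner: weighted sum of base scores per mode, then argmax."""
    combined = [
        sum(w * row[j * n_modes + m] for j, w in enumerate(weights))
        for m in range(n_modes)
    ]
    return max(range(n_modes), key=lambda m: combined[m])

# The meta learner sees only the base-model scores, not the original features.
row = meta_features({"sid": 1}, [multiclass_scores, ranking_scores])
pick = toy_meta_learner(row, weights=[1.0, 0.5], n_modes=3)
```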

External Data

District File: 16 District Names and Coordinates of corresponding district (no public data on districts)
Subway File: 250 Subway Names and Coordinates
Weather File: Weather Type, Min/Max Temperature, Wind Speed

Preprocessed Data

Profile Features: Profile ID, 65 Profile Features, Clustered Features
Weather Features: Weather Type, Maximum Temperature, Minimum Temperature, Wind
Plan Features: Session-Specific Transport Mode Features (availability of modes; mode-specific costs, distances, and times; mean time, distance, and cost; mode with highest/lowest cost)
Temporal Features: Request Time, Click Time, Plan Time, Month, Day, Day Time, National Holiday
Spatial Features: Origin/Destination Coordinates, Distance to Origin/Destination, Distance to Nearest Subway Station, District of Origin/Destination
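
As an illustration of the spatial features, the distance to the nearest subway station could be computed as in the sketch below. It assumes coordinates are given as (longitude, latitude) in degrees and uses the haversine great-circle distance; the distance function actually used in the project is not stated, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_station_km(point, stations):
    """Distance from a (lon, lat) point to the closest station in the list."""
    return min(haversine_km(point[0], point[1], s[0], s[1]) for s in stations)

# Hypothetical origin and station coordinates, for illustration only.
origin = (0.0, 0.0)
stations = [(0.0, 1.0), (3.0, 0.0)]
dist_to_subway = nearest_station_km(origin, stations)
```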

Baidu’s Test Set: Submit predictions to the platform for results
5-Fold CV: Summarize skill of model using mean evaluation score
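
The 5-fold procedure can be sketched as below. The helper names and the `fit_and_score` callback are hypothetical; the sketch uses contiguous folds without shuffling, which is an assumption and not necessarily how the folds were built in the project.

```python
def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mean(n, fit_and_score, k=5):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        scores.append(fit_and_score(train_idx, val_idx))
    return sum(scores) / k

# Dummy callback standing in for "fit a model and return its evaluation score".
mean_score = cross_val_mean(10, lambda train, val: len(train) / 10)
```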

Task 2

Use cluster results for multiclass prediction
Evaluate by F1 score
Best performance: OPTICS with 52.18 %
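
For reference, a multiclass F1 score can be computed as in the following sketch. The poster does not state which averaging was used; the support-weighted average shown here is an assumption, and the labels are made up.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, treating that class as the positive one."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by class frequency in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(n * f1_for_class(y_true, y_pred, c) for c, n in support.items()) / total

# Hypothetical labels: class 1 has F1 = 2/3, class 2 has F1 = 0.8.
y_true = [1, 1, 2, 2]
y_pred = [1, 2, 2, 2]
score = weighted_f1(y_true, y_pred)
```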

Task 1

Best performance: LightGBM Multiclass
Final Rank: 149 / 1,696
Best F1 Score: 70.43 % vs Our F1 Score: 69.42 %

District Shapes by Clustered Points

Fitted models: LightGBM Lambdarank, TF-Ranking

Fitted models: KNN, LightGBM Multiclass, Random Forest, MLP, XGBoost

Layout with features for one SID

Layout with features for two SID