KDD-Cup-2019: Los Bocadillos
Practical Course “Big Data Science”
Alessandro Volpicella, Ane Berasategi, Denny Steigmeier, Frederik Ludwigs, Lynn Nguyen (Group “Los Bocadillos”)
Lehrstuhl für Datenbanksysteme und Data Mining, Ludwig-Maximilians-Universität München
{al.volpicella, an.berasategi, de.steigmeier, fr.ludwigs, th.nguyen}@campus.lmu.de
Abstract
Our group participated in the KDD Cup 2019, where one of the competition tasks was to predict the most appropriate transport mode for a given user query. For this purpose we analyzed the input data set provided by Baidu, created features that we considered insightful and relevant to the task, and applied machine learning approaches such as learning to rank and multiclass classification. The best-performing model was LightGBM Multiclass with an F1 score of 69.42%. Additionally, we addressed the second task of the competition, an open research or application challenge, with the aim to (i) enable selective advertising (Baidu’s core business), (ii) allow targeted marketing, and (iii) increase profit by raising the usability and range of applications of the provided data set. In our research we used clustering methods to derive user behaviour and increase the accuracy of the predicted transportation mode. OPTICS was the best-performing clustering method with an F1 score of 52.18% for multiclass predictions.
Multiclass Classification
Idea: Find connections between query and user features in order to
- enable selective advertising (Baidu’s core business)
- allow targeted marketing
- increase profit by using the full potential of the dataset
Assumption:
Users with similar application behaviour have the same general preferences
Approach:
1. Cluster available PID
2. Train multiclass model to predict user groups based on query features
Clustering
- K-Means: Run the algorithm exploratively for numerous cluster counts while monitoring the silhouette score to assess consistency within clusters → four cluster counts selected (4, 8, 16, 36)
- K-Modes: Similar to K-Means, but minimizes the sum of within-cluster distances using the Hamming distance instead of the Euclidean distance
- Hierarchical clustering (agglomerative): No prior knowledge of the number of clusters required
- OPTICS: Builds on the basic idea of DBSCAN but addresses one of its major weaknesses → detecting meaningful clusters in data of varying density
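The silhouette score and the two distance functions mentioned above can be illustrated with a minimal pure-Python sketch. The function names and the toy data are assumptions made for illustration only; they are not taken from the project code, which may well have relied on a library implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance, as used by K-Means on numeric features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Hamming distance, as used by K-Modes: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))

def silhouette_score(points, labels, dist):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: silhouette defined as 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy numeric data (hypothetical): two well-separated groups.
numeric_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette_score(numeric_points, [0, 0, 1, 1], euclidean)
bad_score = silhouette_score(numeric_points, [0, 1, 0, 1], euclidean)

# Toy one-hot profiles (hypothetical), compared with the Hamming distance.
profiles = [(1, 0, 0, 1), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 1, 0)]
profile_score = silhouette_score(profiles, [0, 0, 1, 1], hamming)
```

A score close to 1 indicates compact, well-separated clusters, while a negative score suggests that points sit closer to a foreign cluster than to their own.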
- Annual Data Mining Conference
- 2019: Data Competition in cooperation with Baidu (Chinese search engine)
Task 1: Predict the most appropriate transport mode for a user query
Task 2: Open research/application challenge
- Provide columns indicating the availability of each transport mode for each query
- Choose the best available transport mode
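The two steps above can be sketched as masking the predicted class scores with the availability columns before taking the maximum. The function name, the score vector, and the availability flags are hypothetical and only serve as an illustration.

```python
def best_available_mode(scores, available):
    """Return the index of the highest-scoring mode among the available ones.

    scores:    one predicted score per transport mode (hypothetical values)
    available: one boolean per transport mode, True if the mode was offered
    Returns None if no mode is available for the query.
    """
    candidates = [mode for mode, offered in enumerate(available) if offered]
    if not candidates:
        return None
    return max(candidates, key=lambda mode: scores[mode])

# Mode 1 has the highest score but was not offered for this query,
# so the best available mode is mode 2.
example_scores = [0.1, 0.5, 0.3, 0.1]
example_available = [True, False, True, True]
chosen_mode = best_available_mode(example_scores, example_available)
```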
Raw Data
- Queries Records: Session ID, Profile ID, Request Time, Origin/Destination Coordinates
- Display Records: Several plan suggestions with Transport Type, Price, Estimated Travel Time, Distance
- User Attributes: Profile Features one-hot encoded
- Click Records: Chosen Transport Mode, Click Time
- Rank the available transport modes
- Choose the one with the highest predicted score
- Boost predictions with the ensemble method stacking
- Combine multiclass and ranking learners into a meta learner
- Idea: do not look at the original feature space directly
- Use the predicted scores of the baseline models as input instead
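The stacking idea above can be sketched by building out-of-fold scores from the base models and using them as the feature space of the meta learner. Everything here is an assumption for illustration: the base learner is a deliberately simple stand-in for the fitted multiclass and ranking models, and the binary target and the fold scheme are toy choices.

```python
def fold_indices(n_rows, k):
    """Split row indices into k interleaved folds."""
    return [list(range(start, n_rows, k)) for start in range(k)]

def out_of_fold_scores(fit, X, y, k=3):
    """Score every row with a model that never saw that row during training."""
    scores = [None] * len(X)
    for fold in fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            scores[i] = predict(X[i])
    return scores

def make_rate_learner(column):
    """Hypothetical base learner: target rate per value of one feature column."""
    def fit(X_train, y_train):
        counts = {}
        for row, target in zip(X_train, y_train):
            hits, total = counts.get(row[column], (0, 0))
            counts[row[column]] = (hits + target, total + 1)
        prior = sum(y_train) / len(y_train)

        def predict(row):
            hits, total = counts.get(row[column], (0, 0))
            return hits / total if total else prior  # unseen value: fall back to prior
        return predict
    return fit

# Toy data (hypothetical): two categorical features, binary target.
X = [(0, 1), (0, 0), (1, 1), (1, 0), (0, 1), (1, 0)]
y = [0, 0, 1, 1, 0, 1]

base_learners = [make_rate_learner(0), make_rate_learner(1)]
score_columns = [out_of_fold_scores(fit, X, y, k=3) for fit in base_learners]

# One row per sample, one column per base model: the input of the meta learner.
meta_features = [list(row) for row in zip(*score_columns)]
```

Any classifier can then be trained on `meta_features` and `y`. Using out-of-fold scores keeps the meta learner from seeing predictions that a base model made on its own training rows.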
External Data
- District File: 16 District Names and Coordinates of corresponding district (no public data on districts)
- Subway File: 250 Subway Names and Coordinates
- Weather File: Weather Type, Min/Max Temperature, Wind Speed
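One possible use of the subway coordinates is a distance-to-nearest-station feature. The sketch below uses the haversine great-circle distance; the function names, the (longitude, latitude) ordering, and the sample coordinates are assumptions for illustration and do not come from the data files.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_station_km(point, stations):
    """Distance from a (lon, lat) point to the closest (lon, lat) station."""
    lon, lat = point
    return min(haversine_km(lon, lat, s_lon, s_lat) for s_lon, s_lat in stations)

# Hypothetical coordinates, chosen only to make the result easy to check.
origin = (116.0, 39.0)
stations = [(116.0, 40.0), (116.0, 39.5)]
distance_to_subway = nearest_station_km(origin, stations)
```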
Preprocessed Data
- Profile Features: Profile ID, 65 Profile Features, Clustered Features
- Weather Features: Weather Type, Maximum Temperature, Minimum Temperature, Wind
- Plan Features: Session-Specific Transport Mode Features (availability of modes; mode-specific costs, distances, and times; mean time, distance, and cost; mode with highest/lowest cost)
- Temporal Features: Request Time, Click Time, Plan Time, Month, Day, Day Time, National Holiday
- Spatial Features: Origin/Destination Coordinates, Distance to Origin/Destination, Distance to Nearest Subway Station, District of Origin/Destination
- Baidu’s Test Set: Submit to the platform for results
- 5-Fold CV: Summarize the skill of the model using the mean evaluation score
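The 5-fold procedure can be sketched as follows. The majority-class model and the accuracy metric are hypothetical placeholders that keep the example self-contained; they stand in for the fitted models and the competition metric.

```python
import random
from collections import Counter
from statistics import mean

def k_fold_splits(n_rows, k=5, seed=0):
    """Yield (training indices, validation indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def cross_validate(fit, score, X, y, k=5):
    """Return the mean validation score and the individual fold scores."""
    fold_scores = []
    for train_idx, val_idx in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = [model(X[i]) for i in val_idx]
        fold_scores.append(score([y[i] for i in val_idx], predictions))
    return mean(fold_scores), fold_scores

def fit_majority(X_train, y_train):
    """Placeholder model: always predicts the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda row: majority

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (hypothetical): 20 rows, 15 of them with label 1.
X = list(range(20))
y = [1] * 15 + [0] * 5
mean_score, fold_scores = cross_validate(fit_majority, accuracy, X, y, k=5)
```

Reporting the mean over folds, rather than a single split, reduces the dependence of the estimate on one particular partition of the data.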
Task 2
- Use cluster results for multiclass prediction
- Evaluate by F1 score
- Best performance: OPTICS with 52.18 %
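For reference, a multiclass F1 score can be computed as sketched below. The averaging scheme is an assumption: the poster does not state it, and this sketch weights the per-class F1 by class support. The labels are toy values.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 score of every label that occurs in the true or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * count for label, count in support.items()) / len(y_true)

# Toy labels (hypothetical): three classes, two misclassified rows.
true_modes = [1, 1, 2, 2, 2, 3]
predicted_modes = [1, 2, 2, 2, 3, 3]
example_f1 = weighted_f1(true_modes, predicted_modes)
```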
Task 1
- Best performance: LightGBM Multiclass
- Final Rank: 149 / 1,696
- Best F1 Score: 70.43 % vs Our F1 Score: 69.42 %
District Shapes by Clustered Points
Fitted models: LightGBM Lambdarank, TF-Ranking
Fitted models: KNN, LightGBM Multiclass, Random Forest, MLP, XGBoost
Layout with features for one SID
Layout with features for two SID