1 of 53

Instructions

Please make a copy of the template slide and paste it as a new last page of this document, then fill in the details, following our example of Project 0.
Project number:

Please take the next integer, so that project numbers are 1,2,3,...

Project name:

Please write the name of the project in UPPERCASE
For projects using data provided by companies please use company name, followed by ‘-’ and then anything you like in capital letters or numbers (for instance, DEEPMIND-AGI)
For project topics taken from Social Impact Data Hack, please start with SIDH-## where ## is the number of the topic, for instance SIDH-01
For project topics taken from Kaggle, please start with KAGGLE-

Datasets:

Please include all datasets that you plan to use together with their sizes and links (unless the dataset is private, then just describe its origin)

Goals: Please state 1-3 goals or questions

2 of 53

List of Projects

for UT Data Mining course MTAT.03.183

2017 Fall

3 of 53

Pitch structure

Hello/Project title/Your name
Data Mining questions/goals that you plan to answer
Business model (why this is going to be useful)
Data that you are going to use
Technology that you are going to use
Team
Thank you!

2 minutes per team + 2 minutes QA from instructor & audience

4 of 53

Project #<number>: PROJECT-NAME

Dataset 1 (SIZE GB): description of dataset 1 (please specify the origin and refer to link below, if publicly available)

Dataset 2 (SIZE KB): description of dataset 2

...

Goal 1: Description of the first goal

Goal 2: Description of the second goal

...

Any more comments or thoughts you want to highlight here.

Links:
[1] URL1
[2] URL2
...

Project title that can even be as long as two lines of text

TEAM:
Name of member 1

Name of member 2

Name of member 3

Project repository: <put github or bitbucket URL here>

5 of 53

Project #0: DEEPMIND-AGI

Dataset 1 (123 GB): private dataset with all machine learning models ever trained, donated by Deepmind (we already have it in our storage)

Dataset 2 (25 KB): open dataset with 10 full self-play games of AlphaGo [1] (to be downloaded)

Goal 1: Develop a new algorithm by making the KNN algorithm deeper

Goal 2: Make our own version of AlphaGo by training a K-nearest neighbour algorithm using Dataset 2

Goal 3: Use transfer learning to make our AlphaGo model mimic all models of Dataset 1, in order to achieve an artificial general intelligence (AGI)

We are very sure that this will work out as we are using the latest smartphone to make calculations. Also, we are ready to invest in buying another smartphone to exploit the wisdom of crowds.

Links:
[1] https://deepmind.com/research/alphago/alphago-vs-alphago-self-play-games/

Building artificial general intelligence using deep K-nearest neighbour algorithm

TEAM:
Meelis Kull

Dmytro Fishman

Mari-Liis Allikivi

Project repository: https://github.com/zaf/agi

6 of 53

Project #1: KAGGLE-ENVISION

Dataset 1 (~40 MB): dataset provided by Recruit Holdings on the Kaggle competition page, available at bit.ly/rrvfdata

...

Goal 1: Predict how many future visitors a restaurant will receive.

Goal 2: Identify which other real-time data sources (such as tweets, Facebook posts or weather forecasts) could make the prediction more accurate and relevant, and employ one of them.

...

Any more comments or thoughts you want to highlight here. [not yet!]

Links:
[1] http://bit.ly/kagglerrvf
...

Restaurant Visitor Forecasting

TEAM:
Novin Shahroudi

Ian Mackerracher

Project repository: https://bitbucket.org/novinsha/rrvf

7 of 53

Project #2: PBGB-POLIS

Dataset 1 (5,78 MB): Dataset on criminal offence cases against property in public space (2016-2017) [1];

Dataset 2 (19,6 MB): Dataset on criminal offence cases against property in public space (2011-2015) [2];

Goal 1: Identify areas with the highest crime rate.

Goal 2: Classify crimes with their severity level and identify where and at what time the most severe crimes happen.

Goal 3: Inform law enforcement where to show more presence in order to reduce the crime rate.

Links:
[1] https://opendata.smit.ee/ppa/csv/liiklusjarelevalve_1.csv
[2] https://opendata.smit.ee/ppa/csv/liiklusjarelevalve_2.csv

Analyzing criminal offence cases against property in public space

TEAM:
Mart Simisker

Leiger Virro

Karl-Martin Uiga

Project repository: https://github.com/martinuiga/ut_dataminingPosterProject

8 of 53

Project #3: POLITICAL-PARTIES

Dataset 1 (94 KB): Expenses of each political party (link 1)

Dataset 2 (4 MB): Incomes of each political party (link 1)

Dataset 3 (Needs to be built): Monthly survey of the popularity of each political party (link 2)

Goal 1: Visualize and analyze the incomes and the expenses of each political party

Goal 2: Analyze the impact of expenses on the popularity of each political party

Goal 3: Analyze the impact of the popularity of each political party on its income

Goal 4: Find the best strategy to do a good political campaign

Links:
[1] https://drive.google.com/drive/folders/0B7EGDc-g2xscc2E1dS1kQUtKMW8

[2] http://www.emor.ee/erakondade-toetus/

Analyzing the spending of political parties and their popularity

TEAM:
Laura Ruusmann

Flavien Reymond

Project repository: https://bitbucket.org/flavienreymond/dataminingproject/overview

9 of 53

Project #4: SPOTIFY

Project repository: https://bitbucket.org/anastassiaIv/dataminingproject2017

Dataset 1 (43 MB): Spotify's Worldwide Daily Song Ranking (link 1)

Dataset 2 (183 MB): Every song you have heard (almost)! (link 2)

Goal 1: Find the most frequent words used in the most-listened songs.

Goal 2: How does the repetition of lyrics affect a song’s ranking?

Goal 3: Predict whether a song will be listened to a lot based on its lyrics.

Links:
[1] https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking

[2] https://www.kaggle.com/artimous/every-song-you-have-heard-almost

Team:

Anastassia Ivanova

Kevin Ree

Kelian Kaio

Spotify’s recommendation system based on lyrics

10 of 53

Project #5: NMT

Dataset 1 (MANY GB): Subtitle corpora for English, Estonian, Finnish and many others from OpenSubs2018 [1]

Dataset 2 (MANY MB): Subtitle corpora alignments for Estonian-English, Finnish-English and many others from OpenSubs2018

Goal 1: Describe the relative effect of parent language choice on the child language

Goal 2: Describe the relative effect of parent language corpus size on the child language

Based on last year’s article [2]; the project builds on work done over the past 2 weeks.

Links:
[1] http://opus.nlpl.eu/OpenSubtitles2018.php

[2] https://aclweb.org/anthology/D16-1163

Investigating the effects of parent language choice and corpus size on training NMT models for low-resource languages

TEAM:
Natia Doliashvili

Kaur Karus

Project repository: https://github.com/kaurix02/NLPproject

11 of 53

Project #6: SIDH-15

Dataset 1 (6 MB): VanEssen.zip - XLS files from sensors produced by VanEssen. [1]

Dataset 2 (35 MB): Keller.zip - XLS files from sensors produced by Keller. [2]

Goal 1: Unify the data format across the different sensors

Goal 2: Detect outliers and replace them with sensible values

Goal 3: Create a website that visualizes the data

Goal 4: Detect if water levels have increased after drainage elimination in swamps

Links:
[1] https://drive.google.com/file/d/0B54kyLYxGC6DaF8wNUl3NWFuazg/view?usp=sharing
[2] https://drive.google.com/file/d/0B54kyLYxGC6DNHpXQ0s4ZTFrOFk/view?usp=sharing
...

RMK WATER LEVELS

TEAM:
Vello Vaherpuu

Madis Martin Lutter

Project repository: https://bitbucket.org/rmkdm/dm-rmk

12 of 53

Project #7: CLASSIFYING ICEBERGS IN STATOIL’S KAGGLE COMPETITION

train.json.7z (42.85 MB): This is the dataset dedicated to training the model as part of the competition. It has 5 fields: id, band_1, band_2, inc_angle, is_iceberg. Band_1 and band_2 are flattened 75x75 pixel images of differing polarizations. This dataset and the one below are both available on the competition’s data page: https://www.kaggle.com/c/statoil-iceberg-classifier-challenge/data

test.json.7z (245.22 MB): Pretty much the same as above, but with an unspecified is_iceberg field. The model’s classification results on this data are used to rank the model on the leaderboard.

Goal 1: Get into the top 80% on the public leaderboard for lowest log loss on the test data.

Goal 2: If Goal 1 fails, then make a public kernel detailing our approach.
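
For reference, the log loss named in Goal 1 can be computed as in the minimal sketch below. The function name, the clipping constant and the sample labels and probabilities are illustrative assumptions, not taken from the competition materials.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels (1 = iceberg) and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
print(round(log_loss(labels, probs), 4))
```

Lower values are better; a confident wrong prediction is penalised far more heavily than an uncertain one.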

Links:
[1] https://www.kaggle.com/c/statoil-iceberg-classifier-challenge

TEAM:
Radita Liem

Theodore Heiser

Project repository: https://github.com/raymerta/statoil-kaggle

13 of 53

Project #8: WORLD HAPPINESS ANALYSIS

Dataset 1 (61.74 KB): World Happiness scores by country (2015, 2016, 2017)

Dataset 2 (138.4 KB): Alcohol consumption

Dataset 3 (67.3 KB): Suicide rate by country

Dataset 4 (88.5 KB): Tobacco consumption

Goal 1: Find the relationship between income and happiness

Goal 2: Find the relationship between government corruption and happiness

Goal 3: Analyze the impact of drug consumption on happiness

Goal 4: Analyze the relationship between suicide rate and happiness

Some of the goals may be modified or changed in the future if we find some interesting datasets or relationships in the data available

Links:
[1] https://www.kaggle.com/unsdsn/world-happiness/downloads/world-happiness-report.zip
[2] https://www.kaggle.com/START-UMD/gtd/downloads/globalterrorismdb_0617dist.csv

[3] http://apps.who.int/gho/data/node.sdg.3-4-data?lang=en

TEAM:
ISMAIL GUL

MARCELO SURRIABRE

NURLAN KERIMOV

Project repository: https://bitbucket.org/marcelout/worldhappinessproject

14 of 53

Project #9: KAGGLE - US PERMANENT VISA APPLICATIONS

Dataset 1 (69.79 MB): us_perm_visas.csv - The dataset consists of 154 features holding data on the visa application (decision, date), the employer (city, postal code, title, job posting history, etc.), the offered job (title, offered salary) and the employee (education, citizenship, visa history). The dataset covers the years 2012-2017.

...

Goal 1: Predict visa decisions for a new employer based on employer, employee and wage data

Goal 2: Use the prediction results to inform people in advance about their likely visa decisions

...

Links:
[1] https://www.kaggle.com/jboysen/us-perm-visas
...

TEAM:
ELDAR HASANOV

DENIZALP KAPISIZ

Project repository: https://bitbucket.org/eldarhasanov/uspermanentvisaapplications

15 of 53

Project #10: Kaggle: HOUSE PREDICTION

Dataset 1 (449 KB): public dataset with 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa

Dataset 2 (13.5 KB): description of dataset 1

Goal 1: Find the relationship between the features of a house and its price

Goal 2: Develop a new regression model on dataset 1 to predict the final price of each home

Goal 3: Get the RMSE below 0.11979

Links:
[1] https://www.kaggle.com/c/house-prices-advanced-regression-techniques

[2] https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/train.csv

[3] https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/data_description.txt

Building regression model that predicts the final price of each home

TEAM:
Tural Ismayilov

Mansur Alizada

Polad Mahmudov

Project repository: https://bitbucket.org/garabagh/dm

16 of 53

Project #11: KAGGLE COMPETITION - CORPORACION FAVORITA GROCERY STORE PREDICTION

Train dataset (4.65 GB): training dataset containing the target variable together with the date, store number and item number of each order

transaction dataset (1.48 MB): number of transactions by day

oil dataset (21 KB): oil prices by day (economy of this country depends on oil)

holiday events dataset (21 KB): holidays in this country

Goal 1: implement feature engineering and try different models to make prediction

Links:
https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data - all datasets are available here

TEAM:
Vladislav Fediukov

Alina Vorontseva

Anton Potapchuk

Project repository: https://github.com/cherrybonch/kaggle_FavoritaGrocery

17 of 53

Project #12: PARTIES AND TAXIS

Dataset 1 (53 MB): Parties in New York

Dataset 2 (1.91 GB): New York City Taxi

Goal 1: Build a model of police calls depending on location

Goal 2: Build a model of taxi pick-up times depending on location

Goal 3: Draw a map of party locations and taxi pick-up locations

Links:
[1] https://www.kaggle.com/somesnm/partynyc/data
[2] https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm/data

Finding relations between party locations and taxi pick up locations

TEAM:
Maksym Melnyk

Evgen Dorodnikov

Project repository: https://bitbucket.org/utdatamining/parties-taxis

18 of 53

Project #13: STAY SAFE WHEN DRIVING IN LONDON

Dataset 1 (375 MB): Road Safety Data

Dataset 2 (11.6 KB): CCTV Traffic Enforcement - camera locations

Dataset 3 (2.3 MB): Road Casualties

Goal 1: Identify the most dangerous and accident-prone sections of roads

Goal 2: Establish a relationship between the presence of surveillance cameras and the number of accidents

Goal 3: Detect the most common accident scenarios

Links:
[1] https://data.gov.uk/dataset/road-accidents-safety-data
[2] https://data.gov.uk/dataset/cctv-traffic-enforcement-camera-locations
[3] https://data.gov.uk/dataset/gb-road-casualty-statistics-2008

TEAM:
Yevheniia Kryvenko

Oleksandra Tkalich

Project repository: https://github.com/oleksandratk/LondonCarAccidents

19 of 53

Project #14: MONEYTALKS
Make Students Rich Again.

107 Datasets (each ~80 KB, ~1300 entries, 7 features): each dataset captures the historical price movement (over 5 years) of one of the 107 stocks composing the NASDAQ-100 index.

Goal 1: Identify patterns in the stocks’ price movements.

Goal 2: Derive general advice for investors.

Goal 3: Analyse the relation between the opening price and closing price of stocks, in terms of % change.

Goal 4: Train and evaluate a classifier able to provide BUY or SELL suggestions for investors.

Links:
[1] https://www.investing.com/
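
The percentage change between opening and closing price mentioned in Goal 3 could be computed as in the sketch below. The function name and the sample prices are hypothetical; the actual column names in the downloaded files are not stated on the slide.

```python
def pct_change(open_price, close_price):
    """Percentage change from the opening price to the closing price."""
    return (close_price - open_price) / open_price * 100.0

# Hypothetical (open, close) pairs for three trading days of one stock.
days = [(100.0, 105.0), (105.0, 102.9), (102.9, 102.9)]
changes = [pct_change(o, c) for o, c in days]
print([round(x, 2) for x in changes])
```

A positive value means the stock closed above its opening price on that day, a negative value means it closed below.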

Team:
Adriano Augusto

Grace Achenyo Okolo

Project repository: https://github.com/nemo-91/moneytalks

20 of 53

Project #15: Ntertane App

Dataset 1 (~1 GB): A large collection of data capturing the available music and user interactions with the Ntertane app

Goal 1: Identify and visualize the geographical distribution of the app’s users

Goal 2: Search for patterns amongst listeners

Goal 3: Create a predictive model that suggests a listener’s next song in a playlist

Links: (Dataset is not publicly available)
[1] http://ntertane.com
...

Building a song prediction model for Ntertane App

TEAM:
Clive Tinashe Mawoko

Ojiambo Ivan

Alli Abdulateef Olamide

Project repository: https://bitbucket.org/ntertane/datamining

21 of 53

Project #16: CLIENT IS KING-COOP

Dataset 1 (5 GB): COOP Tartu 1 year sales data (> 50 *10^6 rows), data is private

Dataset 2 (35 MB): Product descriptions from COOP (114426 rows), data is private

Goal 1: Identify the groups of clients who have similar shopping patterns: collaborative filtering and non-negative matrix factorization

Goal 2: Describe and visualise the features of detected client groups

Goal 3: Build a recommender system based on these groups

Goal 4: Compare the 2 approaches

The goal is to enhance the customer experience and ensure sustainable revenue growth for the stakeholders by building a customer recommendation system.

Recommendation system for COOP Tartu

TEAM:
Ahto Salumets

Enn Pokk

Liis Kolberg

Project repository: https://github.com/liiskolb/dm_project2017

22 of 53

Project #17: VIDEO & BOARD GAMES

Dataset 1 (1581 KB): Video game sales with ratings

Dataset 2 (1973 KB): IGN ratings

Dataset 3 (143.6 MB): Board games

Goal 1: Suggest game genre with the best statistical chance to succeed

Goal 2: Find similarities or differences between trends of video and board games

Links:

[1] https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings/data

[2] https://www.kaggle.com/egrinstein/20-years-of-games/data

[3] https://www.kaggle.com/gabrio/board-games-dataset/data

Analyse video game ratings and sales and how they compare to board game popularity

TEAM:
Janar Ojalaid

Kaspar Hollo

Marek Pagel

Project repository: https://bitbucket.org/janaroj/data-mining-project/overview

23 of 53

Project #18: FITLAP-2

Dataset 1 (320 MB): Fitlap user data

Goal 1: Data cleansing and understanding.

Goal 2: Find patterns in user behaviour linked to quitting the usage of Fitlap.

Goal 3: Implement a classifier using machine learning techniques in order to predict if a user will quit Fitlap in the future.

Goal 4: Evaluation and visualisation of the obtained results.

Links:
[1] fitlap.ee/

When do people quit Fitlap and why do they quit?

TEAM:
Moritz Hilscher

Hendrik Rätz

Project repository: private (Fitlap user data is confidential)

24 of 53

Project #19: RateChain-2

market_price.csv (3.65 GB): Data collected from different brokers about car rentals, including car class, broker name, dates of car use, total price and other features.

rate_quote.csv (18.6 GB): data about quotes: who made each request and what was requested.

reservations_view.csv (0.03 GB): reservations data.

Goal 1: Improve accuracy for car rental offers for better price comparison

Goal 2: Investigate data and answer the questions from the RateChain company.

Some of them are:

· Which is the “correct” ACRISS code for a car model based on the collected data? Does it vary by country?

· Which car models can be considered alternatives to each other from a pricing perspective?

· How do manual and automatic gearboxes affect the rental price by car class?

· What is the price difference between car classes?

Links:
[1] http://rate-chain.com/
[2] http://acriss.org/car-codes.asp

Market price analysis to compare car rentals

TEAM:
Xatia Kilanava

Giorgi Sheklashvili

Oleksandr Shvechykov

Project repository: https://github.com/LexSwed/DataMiningProject

25 of 53

Project #20: RATECHAIN-1

rate_quote.csv (23.44 GB): Data on price quotes, together with the input parameters for which each price quote was generated (e.g. age, source country, start date, end date, …). The data is from the Iceland region.

reservations_view.csv (31.9 MB): Actual reservations made based on price quotes. Includes info on when reservation was made, for which time interval, which car, location et cetera.

Goal 1: Find answers to the questions posed by the client RateChain

Ex1. What number of requests could be cached in a certain time period?
Ex2. How many days in advance are customers checking prices?

Goal 2: Try to create a model using the rate-quotes-to-reservation ratio to derive historical demand based on reservation history
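
As a rough illustration of the ratio in Goal 2, the sketch below counts quotes and reservations per group and divides one by the other. Grouping by month, the function name and the sample keys are assumptions made for illustration; the slide does not specify how rate_quote.csv and reservations_view.csv are keyed.

```python
from collections import Counter

def quote_to_reservation_ratio(quote_keys, reservation_keys):
    """Number of quotes per reservation for each group that has reservations."""
    quotes = Counter(quote_keys)
    reservations = Counter(reservation_keys)
    # Groups without any reservation are skipped to avoid division by zero.
    return {k: quotes[k] / reservations[k] for k in quotes if reservations[k] > 0}

# Hypothetical grouping key: pick-up month of each quote / reservation.
quote_months = ["2017-06"] * 4 + ["2017-07"] * 3
reservation_months = ["2017-06", "2017-06"]
print(quote_to_reservation_ratio(quote_months, reservation_months))
```

With such a ratio per group, historical demand could be approximated by multiplying reservation counts by the ratio for periods where quote logs are missing.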

Links:
[1] http://rate-chain.com/

Car rental demand detection based on price requests from online channels

TEAM:
Pirge Kaasik

Joonas Puura

Project repository: https://github.com/Abercus/dmproj2017

26 of 53

Project #21: FITLAP-1

Dataset 1 (328 MB): Fitlap user data.

Goal 1: Data processing and understanding.

Goal 2: Divide people into groups by their weight loss.

Goal 3: Apply sequential pattern mining and find useful information.

Goal 4: Analysis and visualization of the obtained results.

Fitlap user data is private.

Links:
[1] fitlap.ee/

What are the habit patterns and parameters of those who lose the most weight with Fitlap?

TEAM:
Olha Kaminska

Marharyta Dekret

Viacheslav Komisarenko

Project repository: https://github.com/anitera/fitlap-1

27 of 53

Project #22: Airbnb New User Booking

Dataset 1 (64 MB): List of users along with their demographics, web session records, and some summary statistics.

Goal 1: Find interesting patterns

Goal 2: Predict where a new user will book their first travel experience.

Goal 3: Find the best fitting model

Links:
[1] https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

[2] https://www.airbnb.com/
...

TEAM:
Gunay Abdullayeva

Frozan Maqsoodi

Aytaj Aghabayli

Project repository: https://github.com/AbGunay/Data-Mining-Project

28 of 53

Project #23: KDD-IDS

Project repo: https://github.com/prabhant/KDD-IDS

Datasets: All datasets listed on http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html; we might gather some new data for testing purposes

Goal 1: Classify network connections as bad (attack) or good (normal) using ML

Goal 2: Detect various attacks and their connection with new attacks

Notes: New data can be used for more testing purposes

Links:

[1] http://kdd.ics.uci.edu/databases/kddcup99/task.html

TEAM:

Prabhant Singh

Mahir Gulzar

29 of 53

Project #25: FITLAP-3

Dataset 1 (??? GB): private dataset provided by Fitlap (waiting for legal clearance)

Goal 1: Find out how different personas use Fitlap and what the differences are.

Goal 2: Predict who will comply with the diet and who needs additional motivation.

Goal 3: Find out which individuals are likely to quit before one month.

Will it Fit? Different personas on Fitlap.ee

TEAM:
Nele Taba

Vladislav Stafinjak

Madis Vasser

Project repository: Private due to NDA / https://bitbucket.org/VladisSt/datamining

30 of 53

Project #26: SIDH-05

Dataset 1 (792.35 KB): Data taken from https://www.rahandusministeerium.ee/et/riigi-personalipoliitika/palgapoliitika; we took the 2015 dataset. The data is in .xlsx format. The dataset category is „riigi ametiasutused“ (state agencies).

Dataset 2 (733.05 KB): Data taken from https://www.rahandusministeerium.ee/et/riigi-personalipoliitika/palgapoliitika; we took the 2016 dataset. The data is in .xlsx format. The dataset category is „riigi ametiasutused“ (state agencies).

Dataset 3 (571.95 KB): Data taken from https://www.rahandusministeerium.ee/et/riigi-personalipoliitika/palgapoliitika; we took the 2017 dataset. The data is in .xlsx format. The dataset category is „riigi ametiasutused“ (state agencies).

Dataset 4 (4.50 KB): Dataset with the title ‘Keskmine brutokuupalk, 2004–2016’ about average salaries of state employees throughout the years is available at https://www.stat.ee/stat-keskmine-brutokuupalk

Goal 1: Predict how competitive the Estonian state employees’ salary budgets will be with the local employment market in the upcoming years.

Goal 2: How do changes in the salaries of state officials affect the local employment market?

Links:
[1] https://bitbucket.org/kadri_oluwagbemi/project-salaries-of-estonian-public-officials

Salaries of Estonian public officials

TEAM:
Ardi Aasmaa

Madis Harjo

Kadri Oluwagbemi

Project repository: https://bitbucket.org/kadri_oluwagbemi/project-salaries-of-estonian-public-officials/wiki/Home

31 of 53

Project #27: MERCARI PRICE SUGGESTION CHALLENGE

Dataset 1 (73 MB): Training data of items sold by Mercari, data contains item descriptions and other attributes such as category and brand name. Actual origin is unknown but presumably it’s a small subset of item listings in Mercari.

Dataset 2 (34 MB): Test set which contains unlabelled items that have all the same attributes except for the price of the item.

Goal 1: Build a model that suggests item prices.

Goal 2: Achieve root mean squared logarithmic error of 0.5 or less to be in the top 500 of the contest.
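
The root mean squared logarithmic error targeted in Goal 2 can be sketched as follows. The function name and the sample prices are illustrative assumptions; only the metric itself comes from the slide.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; log1p keeps zero prices valid."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical actual and suggested item prices.
actual = [9.99, 335.0, 20.0]
predicted = [12.0, 250.0, 20.0]
print(round(rmsle(actual, predicted), 4))
```

Because the error is taken on the log scale, being off by a factor of two costs the same for a cheap item as for an expensive one.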

Main challenge will be to clean the data from typos and train a good model on very biased data.

Links:
[1] https://www.kaggle.com/c/mercari-price-suggestion-challenge

Can you guess which one costs 9.99$ and which one is 335$?

TEAM:
Sander-Sebastian Värv

Markus Loide

Martin Liivak

Project repository: https://github.com/Sebastianvarv/MercariPriceSuggestion

32 of 53

Project #28: MeelisOrNotMeelis

Dataset 1 (< 0.1 GB, i.e. under 100 MB): approx. 2500 110x110 pixel images of Meelis Kull’s face taken from lecture video stream screenshots. OpenCV is used for face detection from the screenshots.

Dataset 2 (same size): approx. 2500 110x110 pixel images of somebody else’s face.

Goal 1: Detect Meelis Kull from a webcam video stream. In a 30 fps video, at most one frame per second may be a false negative. No false positives allowed!

OpenCV and its Haar-like feature model are used for face detection. The model will be trained on the rectangular images of detected faces.

Detecting Meelis from a webcam stream.

TEAM:
Jaan Tohver

Andres Matsin

Project repository: https://bitbucket.org/jaantohver/meelis_or_not_meelis

33 of 53

Project #29: Movie-Stars

Dataset 1 (900 MB): The Movies Dataset (https://www.kaggle.com/rounakbanik/the-movies-dataset)

Dataset 2 (? KB): We will create additional rating data from http://www.imdb.com

Goal 1: Examine main properties which result in good ratings

Goal 2: Build a recommender system based on user data

Our dataset is an ensemble of data collected from TMDB, imdb and GroupLens.

Links:
[1] https://www.themoviedb.org/?language=en
[2] http://www.imdb.com

[3] https://grouplens.org

What it takes to get good movie ratings, and a recommendation system

TEAM:
Andreas Baum

Marielle Egert

Project repository: https://github.com/mariellee/datamining2017

34 of 53

Project #30: KAGGLE - CATERPILLAR TUBE PRICING

Dataset 1 (1355 KB): tube.csv - This file contains information on tube assemblies. Tube assemblies are made of multiple parts. The main piece is the tube, which has a specific diameter, wall thickness, length, number of bends and bend radius.

Dataset 2 (1617 KB): train_set.csv - General data about the products, with prices, provided for training

Goal 1: Train a model on the training data and find the most suitable algorithms.

Goal 2: Use the trained model to predict the best possible prices for tubes.

There are more datasets than described here, but these two are the most important; the other 19 are provided to support these two (mostly they describe a variety of tube assemblies).
(total 21 datasets)

Analyzing Caterpillar data to predict the correct prices for tubes

TEAM:
Shalva Kalandarishvili

David Chagiashvili

Viktor Mysko

Project repository: https://bitbucket.org/kalandarishvili/datamining

35 of 53

Project #31: Interactive Terrorism Map

Dataset 1 (29.19MB): Global Terrorism Dataset from Kaggle [1]

Goal 1: To develop a web application with an interactive map of terrorist attacks, which can be filtered by multiple options like year, month, attack type, number of casualties, group of terrorists, etc.

Goal 2: Add a feature of exporting the filtered data to a CSV file

Links:
[1] https://www.kaggle.com/START-UMD/gtd/data

Developing an Interactive Map with multiple filter options

TEAM:
Shaswata Saha

Janno Peterson

Abel Mesfin Cherinet

Project repository: https://github.com/jannopet/Interactive-Map-of-Terrorist-Attacks

36 of 53

Project #32: Age Detection Based on Facial Images

Dataset 1 (3 GB, photos + metadata): We use the WIKI dataset.

Goal 1: Build a system that predicts the age of a person whose photo is given as input

Goal 2: Research LBP, LBP histogram and RGB descriptors for age detection systems.

Links:

[1] https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/

TEAM:
Başar TURGUT

Soltan QARAYEV

Project repository: https://bitbucket.org/soltankara/faceagedetection

37 of 53

Project #33: ELEKTRILEVI-1

Dataset 1 (660 MB): private dataset of substation measurements

Dataset 2 (2.69 MB): private dataset of events that happened in substations

Goal 1: Identify a network status index

Goal 2: Build a model for predicting anomalies in substations

Quality of electricity

TEAM:
Margarit Shmavonyan

Tair Vaher

Tsotne Kekelia

Project repository: https://bitbucket.org/utmastersttm/elektrilevi

38 of 53

Project #34: THRONES MINING

Dataset 1 (8 KB): battles.csv: all the battles from the series

Dataset 2 (38 KB): character-deaths.csv: contains information about dead characters and their deaths: chapter, year, gender, nobility of character.

Dataset 3 (210 KB): character-predictions.csv: has information about all characters that appear in the saga: their names, house, title, gender, family status, popularity, etc.

Goal 1: Provide predictions and analysis of previous seasons’ events that can be used in different fields, e.g. in betting. Specifically, our model should be able to answer the following questions:

which house is currently most likely to occupy the throne;
how the main characters rank by probability of death;
what the bonds between houses are and how reliable they are.

Links:
[1] https://www.kaggle.com/mylesoneill/game-of-thrones/data

Predictions and analysis of the Game of Thrones TV series

TEAM:
Maksym Yerokhin

Roman Ismagilov

Project repository: https://github.com/Pomis/ThronesMining

39 of 53

Project #35: KAGGLE-INSTACART MARKET BASKET ANALYSIS

Dataset 1 (574 MB): Train dataset, which contains information about products and in which order they were purchased.

Dataset 2 (103 MB): Test dataset, which tells which set (train, test, prior) each order belongs to

Dataset 3 (2 MB): Dataset which contains products, aisles and departments.

...

Goal 1: Build a model to predict which previously purchased products will be in a user’s next order.

Goal 2: Achieve a mean F-score higher than 0.3989240 on the first 33% of the test data which was used to produce the public leaderboard in the Kaggle competition.
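
A mean F-score over orders, as targeted in Goal 2, can be sketched as below, assuming it is the per-order F1 averaged over all orders. The function names, the order ids and the product names are hypothetical.

```python
def f1(predicted, actual):
    """F1 score between a predicted and an actual set of products."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)

def mean_f1(pred_by_order, actual_by_order):
    """Average of the per-order F1 scores."""
    scores = [f1(pred_by_order[o], actual_by_order[o]) for o in actual_by_order]
    return sum(scores) / len(scores)

# Hypothetical reordered products for two orders.
predictions = {1: ["milk", "eggs"], 2: ["bread"]}
truth = {1: ["milk", "tea"], 2: ["bread"]}
print(mean_f1(predictions, truth))
```

Averaging per order means that small baskets weigh as much as large ones, so a model cannot score well by doing well only on heavy shoppers.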

...

Links:
https://www.kaggle.com/c/instacart-market-basket-analysis

TEAM:
Leonid Tolli

Lauri Kongas

Hele-Andra Kuulmets

Project repository: https://github.com/Leonidk29/Instacart

40 of 53

Project #36: IMDB Movie Rating Analysis

Dataset 1 (133 KB): Kaggle Imdb dataset from 2006 - 2016

Dataset 2 (198 KB): Kaggle The academy awards database 1927 - 2015

...

Goal 1: Analyze correlations between

Movie genre, description and votes in IMDB
Patterns of actor and director combinations which get higher ratings in IMDB
Movie IMDB rating and chances of getting an Oscar

Goal 2: Build a prediction model for the ratings of new movies based on genre, actors, etc.

Links:
[1] https://www.kaggle.com/PromptCloudHQ/imdb-data/
[2] https://www.kaggle.com/theacademy/academy-awards

TEAM:
Aramais Khachatryan

Aleksandr Tsõganov

Yaroslav Hrushchak

Project repository: https://github.com/aramYnwa/imdbRatingAnalysis.git

41 of 53

Project #37:

Dataset 1 (100 GB): sensor data of Starship robots that travelled in 2017

Dataset 2 (278 MB): GPS data of Starship robots that travelled in 2017 in Tallinn

...

Goal 1: Measure quality of roads and visualize it on a map of Tallinn

Goal 2: Integrate road quality with route optimization

...

Links:
[1] https://www.starship.xyz/
...

Measuring pavement quality in Tallinn

TEAM:
Kevin Kanarbik

Märten Veskimäe

Tõnis Kasekamp

Project repository: https://github.com/Kennu76/DataMiningProject

42 of 53

Project #38:

Dataset 1 (7.43 MB): Run or Walk [1]

Goal 1: Become better acquainted with the data mining processes behind the Samsung S Health system

Goal 2: Recognize the physical activity of a person using sensor data from their mobile phone (accelerometer and gyroscope)

Motivated by the S Health application on my Samsung Note 4 mobile phone, I became interested in learning more about the data processing behind this healthcare application. Going one step further, I am eager to see whether data gathered by mobile sensors can be used to predict our physical activities. In this project, the data are collected by accelerometer and gyroscope sensors. Physical activities refer to activities such as running and walking.

Links:
[1] https://www.kaggle.com/vmalyi/run-or-walk

Your mobile can recognize your physical activity!

TEAM:
Shahla Atapoor

Project repository: https://github.com/atapoor/Data_Mining_2017

43 of 53

Project #39: FITLAP

Dataset 1 (319 MB): Private file with all of the user data

Goal 1: The main goal would be to find out who achieves the goal they set. More specifically:

1. what do they have in common (maybe mostly people who want to maintain the weight not lose it or the other way around);

2. are there any specific meals or ingredients that these people eat;

3. how many people change the goal and achieve it after that.

...

Links:
[1] fitlap.ee/
...

Who achieves the goal - analysis of Fitlap data

TEAM:

Reelika Tõnisson

Rando Tõnisson

Project repository: can’t be added because of the non-disclosure agreement

44 of 53

Project #40: RATECHAIN-1

Majnun Abdurahmanov

Khaled Nimr Charkie

Bejon Sarker

Project repository: https://github.com/delone-lora/CRDD-dm-project-2017 (will make it private later)

rate_quote.csv (21.18 GB): Data about price quotes, including the input parameters used to calculate each quote, such as age, source country and start date.

reservations_view.csv (34.4 MB): Reservations for price quotes. The table includes data such as when the reservation was made and the duration of the reservation.

market_price.csv (3.65 GB): General information about market prices for car rental in different countries

Goal 1: Fulfill business goals of RateChain.

For which time periods are customers looking for rental cars now/yesterday/last week/last month?
How many days/weeks/months in advance are bookings made? Does it vary by season and year?
What is the “rate quotes to reservation” ratio by reseller? Does it vary by season or source country?
And so on.

Goal 2: Create a model that uses the rate quotes to reservation ratio to derive historical demand from the reservation history.

45 of 53

Project #41: HEALTHY HEART

Heart Disease Dataset (1 MB): Dataset containing patient information (such as age, whether they’re smokers etc.) and the presence of heart disease in the patient

...

Goal 1: Identify the biggest risk factors that contribute to heart disease

Goal 2: Test whether the country where people live (some economic or financial factors) influences the rate of heart disease.

Goal 3: Predict the risk for heart disease of an unknown person based on some personal information about the person

Goal 4: Inform (and warn) the general public about the risks in order to try to reduce the heart disease rate in people.

…

Links: [1] http://archive.ics.uci.edu/ml/datasets/Heart+Disease

...

TEAM:

Simona Micevska

Hristijan Sardjoski

Project repository: https://bitbucket.org/hakerchinja/healthy-heart

46 of 53

Project #42: CRIME RESEARCH IN USA

Dataset 1 (1.53 GB): The data was taken from the U.S. Government's open data [1]. It includes details about crime in the City of Chicago from 2001 to present.

Dataset 2 (357 MB): The same data about crimes but for the city of Los Angeles dating back to 2010 [2].

Goal 1: Based on the datasets, we will identify and analyse patterns and trends in crime in the City of Chicago (2001 - 2017) and the City of Los Angeles (2010 - 2017).

TEAM:

Sofiya Demchuk

Dmytro Tkachuk

Raman Shapaval

Project repository: https://github.com/dimatkachuk/DM-project

Identifying, analysing and visualizing crime patterns and trends in the City of Chicago from 2001 and in the City of Los Angeles from 2010 to present.

Goal 2: Using a visualization library and filtering functions, we will determine the most “dangerous” and the safest districts depending on the time of day, period of the year, weekends, etc. We think this can benefit security management in the city, helping to prevent crimes and consequently to lower the crime rate.

Goal 3: The dates of crimes can be studied to see crime trends depending on seasons, celebration days and time of the day.

Goal 4: Based on the crime data, we will try to predict the exact time at which crimes are committed, depending on the type of crime, time and district.

Links: [1] https://catalog.data.gov/dataset?tags=crime

[2] https://catalog.data.gov/dataset/crime-data-from-2010-to-present

47 of 53

Project #44: Dermtest CNN

Dataset (800 MB): images of suspicious skin areas taken with a dermatoscopic camera [1], from Dermtest [2]

Goal 1: Train a convolutional neural network to predict cancer risk from images

Goal 2: Extract embeddings from the NN, cluster images using PCA and t-SNE

Goal 3: Use saliency maps from the NN to identify the most important parts of the skin image

Links:

[1] https://en.wikipedia.org/wiki/Dermatoscopy

[2] https://www.dermtest.com

Predicting skin cancer from images using a convolutional neural network

TEAM:

Maksym Semikin

Martin Valgur

Project repository: https://github.com/msemikin/dermtest

48 of 53

Project #45: Bosch Production Line Performance

Dataset 1 (7.2 GB): Training data from the competition (uncompressed), unlabeled measurements from the manufacturing process

Dataset 2 (7.2 GB): Test data from the competition (uncompressed)

Goal 1: Predict which mechanical components fail quality control on the whole test set

Goal 2: Build an ensemble of multiple models

Goal 3: Achieve a good (top 20% ?) score

Links: [1] https://www.kaggle.com/c/bosch-production-line-performance

Reduce manufacturing failures

TEAM:

Oliver-Matis Lill

Project repository: https://github.com/oml1111/bosch

49 of 53

Project #46: DUNORD

Demand prediction for Liivi 2 Cafeteria

Dataset 1 (1 MB): Dataset on Liivi-2 cafeteria sales from Nov 2015 to Oct 2016

Dataset 2 (1 MB): Dataset on Liivi-2 cafeteria sales from Nov 2016 to Oct 2017

Dataset 3 (2.5 MB): Dataset on Liivi-2 lecture room occupancy from fall 2015 to fall 2017

Goal 1: Predict the sales quantities in the Du Nord cafeteria at J. Liivi 2

Goal 2: Build an automatic method using machine learning models to predict the sales

Goal 3: Analyse correlations between variables from lecture room occupancy data and sales quantities in different categories

TEAM:

Navedanjum Ansari

Saumitra Bagchi

Sriyal Jayasinghe

Project repository: https://bitbucket.org/navedanjum/dunord.git

50 of 53

Project #47: Global Terrorism

Dataset 1 (143.96 GB): The Global Terrorism Database (GTD) is an open-source database including information on terrorist attacks around the world from 1970 through 2016 (with annual updates planned for the future).

Dataset name: globalterrorismdb_0617dist.csv

Link: https://www.kaggle.com/START-UMD/gtd/data

...

Goal 1: Identify the most common attack types and targets, and try to predict future attacks.

Goal 2: Provide information about the most dangerous places at certain times of the year and visualize it.

Goal 3: Identify whether certain nationalities are more likely to be killed or otherwise affected by terrorism.

Analysing global terrorism dataset

TEAM:

Diana Grygorian

Eneko Ruiz de Loizaga

Ibrahim Abdulhamid

Project repository: https://github.com/eruizdeloizaga002/globalterrorism

51 of 53

Project #48: STARSHIP

Dataset 1 (177 GB): sensor data from the delivery robots (timestamp, orientation, readings from the accelerometer, magnetometer, gyroscope, etc.)

Dataset 2 (278 KB): localization data (coordinates of the particular robot with timestamp)

Goal 1: reduce the data size (keep only Tallinn) and build a route map

Goal 2: study pavement quality in Tallinn using the sensor data

Pavement quality can be studied dynamically

Links: [1] https://www.starship.xyz/

Pavement quality measurements in Tallinn

TEAM:

Mikhail Papkov

Elizaveta Korotkova

Project repository: https://github.com/papkov/starship

52 of 53

Project #49: ELEKTRILEVI-2: Cables, transformers, poles

The data for this project is private, so it cannot be shared publicly.

Dataset 1 (3.3 MB): cableReliability.csv - Includes all the relevant data of cables

Dataset 2 (4.2 MB): defectsTrafo.csv - Includes all the relevant data of Transformers

Dataset 3 (372 KB): poleLifeSpan.csv - Includes all the relevant data of Poles

Goal 1: Find the connection between transformer types, their ages and the defects most likely to appear

Goal 2: Predict the age of poles and the best time to replace them

Goal 3: Predict when a new joint is most likely to be installed in a cable

TEAM:

Muhammad Bilal Shahid

Hippolyte Fayol

Bilawal Hussain

Project repository: https://bitbucket.org/bilawal_ut/elektrilevi_cablepoletransformer

53 of 53

Project #50: Predictions and analysis of the biathlon World Cup data

Project repository: https://github.com/annalanevali/dataminingProject.git

Dataset: season 2016/2017 biathlon World Cup data (collected from https://biathlonresults.com/)

Goal 1: analyse the data to see why some athletes are better than others; visualizations

Goal 2: try to find out what the winning strategies are

Goal 3: build a small prediction model

TEAM: Anna Laaneväli