Instructions
List of Projects
for UT Data Mining course MTAT.03.183
2017 Fall
Pitch structure
2 minutes per team + 2 minutes Q&A from instructor & audience
Project #<number>: PROJECT-NAME
Dataset 1 (SIZE GB): description of dataset 1 (please specify the origin and refer to link below, if publicly available)
Dataset 2 (SIZE KB): description of dataset 2
...
Goal 1: Description of the first goal
Goal 2: Description of the second goal
...
Any more comments or thoughts you want to highlight here.
Links:
[1] URL1
[2] URL2
...
Project title that can even be as long as two lines of text
TEAM:
Name of member 1
Name of member 2
Name of member 3
Project repository: <put github or bitbucket URL here>
Project #0: DEEPMIND-AGI
Dataset 1 (123 GB): private dataset with all machine learning models ever trained, donated by Deepmind (we already have it in our storage)
Dataset 2 (25 KB): open dataset with 10 full self-play games of AlphaGo [1] (to be downloaded)
Goal 1: Develop a new algorithm by making the KNN algorithm deeper
Goal 2: Make our own version of AlphaGo by training a K-nearest neighbour algorithm using Dataset 2
Goal 3: Use transfer learning to make our AlphaGo model mimic all models of Dataset 1, in order to achieve an artificial general intelligence (AGI)
We are very sure that this will work out, as we are using the latest smartphone to make calculations. Also, we are ready to invest in buying another smartphone to exploit the wisdom of crowds.
Links:
[1] https://deepmind.com/research/alphago/alphago-vs-alphago-self-play-games/
Building artificial general intelligence using deep K-nearest neighbour algorithm
TEAM:
Meelis Kull
Dmytro Fishman
Mari-Liis Allikivi
Project repository: https://github.com/zaf/agi
Project #1: KAGGLE-ENVISION
Dataset 1 (~40 MB): dataset provided by Recruit Holdings on the Kaggle competition page, available at bit.ly/rrvfdata
...
Goal 1: Predict how many future visitors a restaurant will receive.
Goal 2: Identify other real-time data sources (such as tweets, Facebook posts or weather forecasts) that could make the prediction more accurate and relevant, and employ one of them.
...
Any more comments or thoughts you want to highlight here. [not yet!]
Links:
[1] http://bit.ly/kagglerrvf
...
Restaurant Visitor Forecasting
TEAM:
Novin Shahroudi
Ian Mackerracher
Project repository: https://bitbucket.org/novinsha/rrvf
Project #2: PBGB-POLIS
Dataset 1 (5.78 MB): Dataset on criminal offence cases against property in public space (2016-2017) [1]
Dataset 2 (19.6 MB): Dataset on criminal offence cases against property in public space (2011-2015) [2]
Goal 1: Identify areas with the highest crime rate.
Goal 2: Classify crimes with their severity level and identify where and at what time the most severe crimes happen.
Goal 3: Inform law enforcement where increased presence could reduce the crime rate.
Links:
[1] https://opendata.smit.ee/ppa/csv/liiklusjarelevalve_1.csv
[2] https://opendata.smit.ee/ppa/csv/liiklusjarelevalve_2.csv
Analyzing criminal offence cases against property in public space
TEAM:
Mart Simisker
Leiger Virro
Karl-Martin Uiga
Project repository: https://github.com/martinuiga/ut_dataminingPosterProject
Project #3: POLITICAL-PARTIES
Dataset 1 (94 KB): Expenses of each political party (link 1)
Dataset 2 (4 MB): Incomes of each political party (link 1)
Dataset 3 (Needs to be built): Survey of the popularity of each political party per month (link 2)
Goal 1: Visualize and analyze the incomes and the expenses of each political party
Goal 2: Analyze the impact of expenses on the popularity of each political party
Goal 3: Analyze the impact of the popularity of each political party on its income
Goal 4: Find the best strategy for running a successful political campaign
Links:
[1] https://drive.google.com/drive/folders/0B7EGDc-g2xscc2E1dS1kQUtKMW8
Analyzing the spending of political parties and their popularity
TEAM:
Laura Ruusmann
Flavien Reymond
Project repository: https://bitbucket.org/flavienreymond/dataminingproject/overview
Project #4: SPOTIFY
Project repository: https://bitbucket.org/anastassiaIv/dataminingproject2017
Dataset 1 (43 MB): Spotify's Worldwide Daily Song Ranking (link 1)
Dataset 2 (183 MB): Every song you have heard (almost)! (link 2)
Goal 1: Find the most frequent words used in the most-listened songs.
Goal 2: Determine how the repetition of lyrics affects a song’s ranking.
Goal 3: Predict from its lyrics whether a song will be listened to a lot.
Links:
[1] https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
[2] https://www.kaggle.com/artimous/every-song-you-have-heard-almost
Team:
Anastassia Ivanova
Kevin Ree
Kelian Kaio
Spotify’s recommendation system based on lyrics
Project #5: NMT
Dataset 1 (MANY GB): Subtitle corpora for English, Estonian, Finnish and many others from OpenSubs2018 [1]
Dataset 2 (MANY MB): Subtitle corpora alignments for Estonian-English, Finnish-English and many others from OpenSubs2018
Goal 1: Describe the relative effect of parent language choice on the child language
Goal 2: Describe the relative effect of parent language corpus size on the child language
Based on last year’s article [2]; the project builds on work done over the past 2 weeks.
Links:
[1] http://opus.nlpl.eu/OpenSubtitles2018.php
Investigating the effects of parent language NMT choice and corpus size on training models for low-resource languages
TEAM:
Natia Doliashvili
Kaur Karus
Project repository: https://github.com/kaurix02/NLPproject
Project #6: SIDH-15
Dataset 1 (6 MB): VanEssen.zip - XLS files from sensors produced by VanEssen [1]
Dataset 2 (35 MB): Keller.zip - XLS files from sensors produced by Keller [2]
Goal 1: Unify data format from different sensors
Goal 2: Detect outliers and replace them with sensible values
Goal 3: Create a website that visualizes the data
Goal 4: Detect if water levels have increased after drainage elimination in swamps
Links:
[1] https://drive.google.com/file/d/0B54kyLYxGC6DaF8wNUl3NWFuazg/view?usp=sharing
[2] https://drive.google.com/file/d/0B54kyLYxGC6DNHpXQ0s4ZTFrOFk/view?usp=sharing
...
RMK WATER LEVELS
TEAM:
Vello Vaherpuu
Madis Martin Lutter
Project repository: https://bitbucket.org/rmkdm/dm-rmk
Project #7: CLASSIFYING ICEBERGS IN STATOIL’S KAGGLE COMPETITION
train.json.7z (42.85 MB): This is the dataset dedicated to training the model as part of the competition. It has 5 fields: id, band_1, band_2, inc_angle, is_iceberg. band_1 and band_2 are flattened 75x75-pixel images of differing polarizations. This and the dataset below are both available at the competition’s data page: https://www.kaggle.com/c/statoil-iceberg-classifier-challenge/data
test.json.7z (245.22 MB): Pretty much the same as above, but with an unspecified is_iceberg field. The model’s classification results on this data are used to rank it on the leaderboard.
Goal 1: Get into the top 80% on the public leaderboard for lowest log loss on the test data.
Goal 2: If Goal 1 fails, then make a public kernel detailing our approach.
Links:
[1] https://www.kaggle.com/c/statoil-iceberg-classifier-challenge
TEAM:
Radita Liem
Theodore Heiser
Project repository: https://github.com/raymerta/statoil-kaggle
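Goal 1 of Project #7 is measured by log loss on the test data. As a rough illustration of that metric (not the team's code), the sketch below computes binary log loss in plain Python. The function name, the clipping constant `eps` and the sample labels and probabilities are assumptions made for this example.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) never occurs (eps is an assumed value).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical is_iceberg labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
print(log_loss(labels, probs))
```

Lower values are better; a model that always predicts 0.5 scores ln 2 (about 0.693).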
Project #8: WORLD HAPPINESS ANALYSIS
Dataset 1 (61.74 KB): World Happiness scores by country (2015, 2016, 2017)
Dataset 2 (138.4 KB): Alcohol consumption
Dataset 3 (67.3 KB): Suicide rate by country
Dataset 4 (88.5 KB): Tobacco consumption
Goal 1: Find the relationship between income and happiness
Goal 2: Find the relationship between government corruption and happiness
Goal 3: Analyze the impact of drug consumption on happiness
Goal 4: Analyze the relationship between suicide rate and happiness
Some of the goals may be modified or changed in the future if we find some interesting datasets or relationships in the data available
Links:
[1] https://www.kaggle.com/unsdsn/world-happiness/downloads/world-happiness-report.zip
[2] https://www.kaggle.com/START-UMD/gtd/downloads/globalterrorismdb_0617dist.csv
[3] http://apps.who.int/gho/data/node.sdg.3-4-data?lang=en
TEAM:
ISMAIL GUL
MARCELO SURRIABRE
NURLAN KERIMOV
Project repository: https://bitbucket.org/marcelout/worldhappinessproject
Project #9: KAGGLE - US PERMANENT VISA APPLICATIONS
Dataset 1 (69.79 MB): us_perm_visas.csv - The dataset consists of 154 features holding data on the visa application (decision, date), the employer (city, postal code, title, job posting history, etc.), the offered job (title, offered salary) and the employee (education, citizenship, visa history). The dataset covers the years 2012-2017.
...
Goal 1: Predict visa decisions for a new employer based on employer, employee and wage data
Goal 2: Use the prediction results to inform people in advance about their likely visa decisions
...
Links:
[1] https://www.kaggle.com/jboysen/us-perm-visas
...
TEAM:
ELDAR HASANOV
DENIZALP KAPISIZ
Project repository: https://bitbucket.org/eldarhasanov/uspermanentvisaapplications
Project #10: Kaggle: HOUSE PREDICTION
Dataset 1 (449 KB): public dataset with 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa
Dataset 2 (13.5 KB): description of dataset 1
Goal 1: Find the relationship between the features of a house and its price
Goal 2: Develop a new regression model on dataset 1 to predict the final price of each home
Goal 3: Get the RMSE below 0.11979
Links:
[1] https://www.kaggle.com/c/house-prices-advanced-regression-techniques
[2] https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/train.csv
[3] https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/data_description.txt
Building a regression model that predicts the final price of each home
TEAM:
Tural Ismayilov
Mansur Alizada
Polad Mahmudov
Project repository: https://bitbucket.org/garabagh/dm
Project #11: KAGGLE COMPETITION - CORPORACION FAVORITA GROCERY STORE PREDICTION
Train dataset (4.65 GB): training dataset with the target variable, along with the date, store number and item number of each order
transaction dataset (1.48 MB): number of transactions by day
oil dataset (21 KB): oil prices by day (economy of this country depends on oil)
holiday events dataset (21 KB): holidays in this country
Goal 1: Implement feature engineering and try different models to make predictions
Links:
https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data (all datasets are available here)
TEAM:
Vladislav Fediukov
Alina Vorontseva
Anton Potapchuk
Project repository: https://github.com/cherrybonch/kaggle_FavoritaGrocery
Project #12: PARTIES AND TAXIS
Dataset 1 (53 MB): Parties in New York
Dataset 2 (1.91 GB): New York City Taxi
Goal 1: Build a model of police calls depending on location
Goal 2: Build a model of taxi pickup time depending on location
Goal 3: Draw a map of party locations and taxi pick-up locations
Links:
[1] https://www.kaggle.com/somesnm/partynyc/data
[2] https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm/data
Finding relations between party locations and taxi pick-up locations
TEAM:
Maksym Melnyk
Evgen Dorodnikov
Project repository: https://bitbucket.org/utdatamining/parties-taxis
Project #13: STAY SAFE WHEN DRIVING IN LONDON
Dataset 1 (375 MB): Road Safety Data
Dataset 2 (11.6 KB): CCTV Traffic Enforcement - camera locations
Dataset 3 (2.3 MB): Road Casualties
Goal 1: Identify the most dangerous and accident-prone parts of roads
Goal 2: Establish a relationship between the presence of surveillance cameras and the number of accidents
Goal 3: Detect the most common accident scenarios
Links:
[1] https://data.gov.uk/dataset/road-accidents-safety-data
[2] https://data.gov.uk/dataset/cctv-traffic-enforcement-camera-locations
[3] https://data.gov.uk/dataset/gb-road-casualty-statistics-2008
TEAM:
Yevheniia Kryvenko
Oleksandra Tkalich
Project repository: https://github.com/oleksandratk/LondonCarAccidents
Project #14: MONEYTALKS
Make Students Rich Again.
107 Datasets (each of ~80 KB, ~1300 entries, 7 features): each dataset captures the historical price movement (over 5 years) of one of the 107 stocks composing the NASDAQ-100 index.
Goal 1: Identify patterns in the stocks’ price movements.
Goal 2: Draw general advice for investors.
Goal 3: Analyse the relation between the opening price and the closing price of stocks, in terms of % change.
Goal 4: Train and evaluate a classifier able to provide BUY or SELL suggestions for investors.
Links:
[1] https://www.investing.com/
Team:
Adriano Augusto
Grace Achenyo Okolo
Project repository: https://github.com/nemo-91/moneytalks
Project #15: Ntertane App
Dataset 1 (~1 GB): A large collection of data capturing the available music and user interaction with the Ntertane app
Goal 1: Identify and visualize the geographical distribution of the app’s users
Goal 2: Search for patterns amongst listeners
Goal 3: Create a predictive model that suggests a listener’s next song in a playlist
Links: (Dataset is not publicly available)
[1] http://ntertane.com
...
Building a song prediction model for Ntertane App
TEAM:
Clive Tinashe Mawoko
Ojiambo Ivan
Alli Abdulateef Olamide
Project repository: https://bitbucket.org/ntertane/datamining
Project #16: CLIENT IS KING-COOP
Dataset 1 (5 GB): COOP Tartu 1 year sales data (> 50 *10^6 rows), data is private
Dataset 2 (35 MB): Product descriptions from COOP (114426 rows), data is private
Goal 1: Identify the groups of clients who have similar shopping patterns: collaborative filtering and non-negative matrix factorization
Goal 2: Describe and visualise the features of detected client groups
Goal 3: Build a recommender system based on these groups
Goal 4: Compare the 2 approaches
The goal is to enhance the customer experience and ensure sustainable revenue growth for the stakeholders by building a customer recommendation system.
Recommendation system for COOP Tartu
TEAM:
Ahto Salumets
Enn Pokk
Liis Kolberg
Project repository: https://github.com/liiskolb/dm_project2017
Project #17: VIDEO & BOARD GAMES
Dataset 1 (1581 KB): Video game sales with ratings
Dataset 2 (1973 KB): IGN ratings
Dataset 3 (143.6 MB): Board games
Goal 1: Suggest game genre with the best statistical chance to succeed
Goal 2: Find similarities or differences between trends of video and board games
Links:
[1] https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings/data
[2] https://www.kaggle.com/egrinstein/20-years-of-games/data
[3] https://www.kaggle.com/gabrio/board-games-dataset/data
Analyse video game ratings and sales and how they compare to board game popularity
TEAM:
Janar Ojalaid
Kaspar Hollo
Marek Pagel
Project repository: https://bitbucket.org/janaroj/data-mining-project/overview
Project #18: FITLAP-2
Dataset 1 (320 MB): Fitlap user data
Goal 1: Data cleansing and understanding.
Goal 2: Find patterns in user behaviour linked to quitting the usage of Fitlap.
Goal 3: Implement a classifier using machine learning techniques in order to predict if a user will quit Fitlap in the future.
Goal 4: Evaluation and visualisation of the obtained results.
Links:
[1] fitlap.ee/
When do people quit Fitlap and why do they quit?
TEAM:
Moritz Hilscher
Hendrik Rätz
Project repository: private (Fitlap user data is confidential)
Project #19: RateChain-2
market_price.csv (3.65 GB): Data collected from different brokers about car rentals, including car class, broker name, rental dates, total price and other features.
rate_quote.csv (18.6 GB): Data about quotes: who made each request and what was requested.
reservations_view.csv (0.03 GB): Reservations data.
Goal 1: Improve accuracy for car rental offers for better price comparison
Goal 2: Investigate the data and answer the questions from the RateChain company. Some of them are:
· Which is the “correct” ACRISS code for a car model based on the collected data? Does it vary by country?
· Which car models can be considered alternatives to each other from a pricing perspective?
· How do manual and automatic gearboxes affect the rental price by car class?
· What is the price difference between car classes?
Links:
[1] http://rate-chain.com/
[2] http://acriss.org/car-codes.asp
Market price analysis to compare car rentals
TEAM:
Xatia Kilanava
Giorgi Sheklashvili
Oleksandr Shvechykov
Project repository: https://github.com/LexSwed/DataMiningProject
Project #20: RATECHAIN-1
rate_quote.csv (23.44 GB): Data on price quotes, together with the input parameters for which each price quote was generated (e.g. age, source country, start date, end date, …). Data is from the Iceland region.
reservations_view.csv (31.9 MB): Actual reservations made based on price quotes. Includes info on when reservation was made, for which time interval, which car, location et cetera.
Goal 1: Find answers to the questions posed by the client RateChain
Goal 2: Try to create a model that uses the ratio of rate quotes to reservations to derive historical demand from the reservation history
Links:
[1] http://rate-chain.com/
Car rental demand detection based on price requests from online channels
TEAM:
Pirge Kaasik
Joonas Puura
Project repository: https://github.com/Abercus/dmproj2017
Project #21: FITLAP-1
Dataset 1 (328 MB): Fitlap user data.
Goal 1: Data processing and understanding.
Goal 2: Divide people into groups by their weight loss.
Goal 3: Apply sequential pattern mining and find useful information.
Goal 4: Analysis and visualization of the obtained results.
Fitlap user data is private.
Links:
[1] fitlap.ee/
What are the habit patterns and parameters of those who lose the most weight with Fitlap?
TEAM:
Olha Kaminska
Marharyta Dekret
Viacheslav Komisarenko
Project repository: https://github.com/anitera/fitlap-1
Project #22: Airbnb New User Booking
Dataset 1 (64 MB): List of users along with their demographics, web session records, and some summary statistics.
Goal 1: Find interesting patterns
Goal 2: Predict where a new user will book their first travel experience.
Goal 3: Find the best fitting model
Links:
[1] https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
[2] https://www.airbnb.com/
...
TEAM:
Gunay Abdullayeva
Frozan Maqsoodi
Aytaj Aghabayli
Project repository: https://github.com/AbGunay/Data-Mining-Project
Project #23: KDD-IDS
Project repo: https://github.com/prabhant/KDD-IDS
Datasets: All datasets listed at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (we might also gather some new data for testing purposes)
Goal 1: Classify network connections as bad (attack) or good (normal) using ML
Goal 2: Detect various attacks and their connection with new attacks
Notes: New data can be used for more testing purposes
Links:
[1] http://kdd.ics.uci.edu/databases/kddcup99/task.html
TEAM:
Prabhant Singh
Mahir Gulzar
Project #25: FITLAP-3
Dataset 1 (??? GB): private dataset provided by Fitlap (waiting for legal clearance)
Goal 1: Find out how different personas use Fitlap and what the differences are.
Goal 2: Predict who will comply with the diet and who needs additional motivation.
Goal 3: Find out which individuals are likely to quit before one month.
Will it Fit? Different personas on Fitlap.ee
TEAM:
Nele Taba
Vladislav Stafinjak
Madis Vasser
Project repository: Private due to NDA / https://bitbucket.org/VladisSt/datamining
Project #26: SIDH-05
Dataset 1 (792.35 KB): Data taken from https://www.rahandusministeerium.ee/et/riigi-personalipoliitika/palgapoliitika (2015 dataset, .xlsx format). The dataset category is „riigi ametiasutused“ (state agencies).
Dataset 2 (733.05 KB): Data taken from https://www.rahandusministeerium.ee/et/riigi-personalipoliitika/palgapoliitika (2016 dataset, .xlsx format). The dataset category is „riigi ametiasutused“ (state agencies).
Dataset 3 (571.95 KB): Data taken from https://www.rahandusministeerium.ee/et/riigi-personalipoliitika/palgapoliitika (2017 dataset, .xlsx format). The dataset category is „riigi ametiasutused“ (state agencies).
Dataset 4 (4.50 KB): Dataset titled ‘Keskmine brutokuupalk, 2004–2016’ (average gross monthly salary, 2004–2016), about average salaries of state employees throughout the years, available at https://www.stat.ee/stat-keskmine-brutokuupalk
Goal 1: Predict how competitive the Estonian state employees’ budgets will be with the local employment market in upcoming years.
Goal 2: Determine how changes in the salaries of state officials affect the local employment market.
Links:
[1] https://bitbucket.org/kadri_oluwagbemi/project-salaries-of-estonian-public-officials
Salaries of Estonian public officials
TEAM:
Ardi Aasmaa
Madis Harjo
Kadri Oluwagbemi
Project repository: https://bitbucket.org/kadri_oluwagbemi/project-salaries-of-estonian-public-officials/wiki/Home
Project #27: MERCARI PRICE SUGGESTION CHALLENGE
Dataset 1 (73 MB): Training data of items sold by Mercari, data contains item descriptions and other attributes such as category and brand name. Actual origin is unknown but presumably it’s a small subset of item listings in Mercari.
Dataset 2 (34 MB): Test set which contains unlabelled items that have all the same attributes except for the price of the item.
Goal 1: Build a model that suggests item prices.
Goal 2: Achieve root mean squared logarithmic error of 0.5 or less to be in the top 500 of the contest.
The main challenge will be to clean typos from the data and train a good model on very biased data.
Links:
[1] https://www.kaggle.com/c/mercari-price-suggestion-challenge
Can you guess which one costs 9.99$ and which one is 335$?
TEAM:
Sander-Sebastian Värv
Markus Loide
Martin Liivak
Project repository: https://github.com/Sebastianvarv/MercariPriceSuggestion
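Goal 2 of Project #27 is stated in terms of root mean squared logarithmic error (RMSLE). As a minimal sketch of that metric (not the team's code), the function below computes RMSLE in plain Python. The function name and the sample prices are assumptions made for this example.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error between actual and predicted prices."""
    # log1p(x) = log(1 + x), so a price of 0 is handled without error.
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical actual and suggested prices.
actual_prices = [9.99, 335.0, 20.0]
suggested_prices = [12.0, 300.0, 20.0]
print(rmsle(actual_prices, suggested_prices))
```

Because the error is taken on a log scale, being off by a factor of two costs the same for a cheap item as for an expensive one.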
Project #28: MeelisOrNotMeelis
Dataset 1 (< 0.1 GB, i.e. under 100 MB): approx. 2500 images (110x110 pixels) of Meelis Kull’s face, taken from screenshots of the lecture video stream. OpenCV is used to detect faces in the screenshots.
Dataset 2 (same size): approx. 2500 images (110x110 pixels) of somebody else’s face.
Goal 1: Detect Meelis Kull in a webcam video stream. In 30 fps video, one frame per second may be a false negative. No false positives allowed!
OpenCV and its Haar-like feature model are used for face detection. The model will be trained on the rectangular images of detected faces.
Detecting Meelis from a webcam stream.
TEAM:
Jaan Tohver
Andres Matsin
Project repository: https://bitbucket.org/jaantohver/meelis_or_not_meelis
Project #29: Movie-Stars
Dataset 1 (900 MB): The Movies Dataset (https://www.kaggle.com/rounakbanik/the-movies-dataset)
Dataset 2 (? KB): We will create additional rating data from http://www.imdb.com
Goal 1: Examine main properties which result in good ratings
Goal 2: Build a recommender system based on user data
Our dataset is an ensemble of data collected from TMDB, IMDb and GroupLens.
Links:
[1] https://www.themoviedb.org/?language=en
[2] http://www.imdb.com
What it takes to get good movie ratings, and a recommendation system
TEAM:
Andreas Baum
Marielle Egert
Project repository: https://github.com/mariellee/datamining2017
Project #30: KAGGLE - CATERPILLAR TUBE PRICING
Dataset 1 (1355 KB): tube.csv. This file contains information on tube assemblies, which are made of multiple parts. The main piece is the tube, which has a specific diameter, wall thickness, length, number of bends and bend radius.
Dataset 2 (1617 KB): train_set.csv. General product data with prices, provided for training.
Goal 1: Train models on the training data and find the most suitable algorithms.
Goal 2: Use the trained models to predict the best possible prices for tubes.
There are more datasets than described here (21 in total), but these two are the most important; the other 19 are provided to support them (mostly describing a variety of tube assemblies).
Analyzing Caterpillar data to predict the correct prices for tubes
TEAM:
Shalva Kalandarishvili
David Chagiashvili
Viktor Mysko
Project repository: https://bitbucket.org/kalandarishvili/datamining
Project #31: Interactive Terrorism Map
Dataset 1 (29.19MB): Global Terrorism Dataset from Kaggle [1]
Goal 1: To develop a web application with an interactive map of terrorist attacks, which can be filtered by multiple options like year, month, attack type, number of casualties, group of terrorists, etc.
Goal 2: Add a feature of exporting the filtered data to a CSV file
Links:
[1] https://www.kaggle.com/START-UMD/gtd/data
Developing an Interactive Map with multiple filter options
TEAM:
Shaswata Saha
Janno Peterson
Abel Mesfin Cherinet
Project repository: https://github.com/jannopet/Interactive-Map-of-Terrorist-Attacks
Project #32: Age Detection Based on Facial Images
Dataset 1 (3 GB, photos + metadata): We use the WIKI dataset.
Goal 1: Build a system that predicts the age of a person from an input photo
Goal 2: Research LBP, LBP histogram and RGB descriptors for age detection systems
Links:
TEAM:
Başar TURGUT
Soltan QARAYEV
Project repository: https://bitbucket.org/soltankara/faceagedetection
Project #33: ELEKTRILEVI-1
Dataset 1 (660 MB): private dataset of substation measurements
Dataset 2 (2.69 MB): private dataset of events that happened in substations
Goal 1: Identify network status index
Goal 2: Model for predicting anomalies in substations
Quality of electricity
TEAM:
Margarit Shmavonyan
Tair Vaher
Tsotne Kekelia
Project repository: https://bitbucket.org/utmastersttm/elektrilevi
Project #34: THRONES MINING
Dataset 1 (8 KB): battles.csv: all the battles from the series
Dataset 2 (38 KB): character-deaths.csv: contains information about dead characters and their deaths: chapter, year, gender, nobility of character.
Dataset 3 (210 KB): character-predictions.csv: has information about all characters who appear in the saga: their names, house, title, gender, family status, popularity, etc.
Goal 1: provide predictions and analysis of previous seasons’ events that can be used in different fields, e.g. in betting. Specifically, our model should be able to answer the following questions:
Links:
[1] https://www.kaggle.com/mylesoneill/game-of-thrones/data
Predictions and analysis of the Game of Thrones TV series
TEAM:
Maksym Yerokhin
Roman Ismagilov
Project repository: https://github.com/Pomis/ThronesMining
Project #35: KAGGLE-INSTACART MARKET BASKET ANALYSIS
Dataset 1 (574 MB): Train dataset, which contains information about products and the order in which they were purchased.
Dataset 2 (103 MB): Test dataset, which indicates which set (train, test, prior) each order belongs to
Dataset 3 (2 MB): Dataset which contains products, aisles and departments.
...
Goal 1: Build a model to predict which previously purchased products will be in a user’s next order.
Goal 2: Achieve a mean F-score higher than 0.3989240 on the first 33% of the test data which was used to produce the public leaderboard in the Kaggle competition.
...
Links:
https://www.kaggle.com/c/instacart-market-basket-analysis
TEAM:
Leonid Tolli
Lauri Kongas
Hele-Andra Kuulmets
Project repository: https://github.com/Leonidk29/Instacart
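Goal 2 of Project #35 is stated as a mean F-score. One way to read that metric (a sketch, not the team's code) is to compute F1 between the predicted and actual product sets for each order, then average over orders. The function names, the sample order ids and the convention that two empty sets score 1.0 are assumptions made for this example.

```python
def f1(predicted, actual):
    """F1 score between a predicted and an actual set of product ids."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # assumed convention: an empty prediction for an empty order is correct
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predicted_by_order, actual_by_order):
    """Average the per-order F1 over every order in actual_by_order."""
    scores = [
        f1(predicted_by_order.get(order_id, ()), items)
        for order_id, items in actual_by_order.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical orders: product ids predicted vs. actually reordered.
predicted = {"order_a": [1, 2], "order_b": [5]}
actual = {"order_a": [2, 3], "order_b": [5]}
print(mean_f1(predicted, actual))
```

Averaging per order means that each order counts equally, regardless of how many products it contains.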
Project #36: IMDB Movie Rating Analysis
Dataset 1 (133 KB): Kaggle IMDb dataset, 2006-2016
Dataset 2 (198 KB): Kaggle Academy Awards database, 1927-2015
...
Goal 1: Analyze correlations between
Goal 2: Build a prediction model for the ratings of new movies based on genre, actors, etc.
Links:
[1] https://www.kaggle.com/PromptCloudHQ/imdb-data/
[2] https://www.kaggle.com/theacademy/academy-awards
TEAM:
Aramais Khachatryan
Aleksandr Tsõganov
Yaroslav Hrushchak
Project repository: https://github.com/aramYnwa/imdbRatingAnalysis.git
Project #37:
Dataset 1 (100 GB): sensor data of Starship robots that travelled in 2017
Dataset 2 (278 MB): GPS data of Starship robots that travelled in 2017 in Tallinn
...
Goal 1: Measure quality of roads and visualize it on a map of Tallinn
Goal 2: Integrate road quality with route optimization
...
Links:
[1] https://www.starship.xyz/
...
Measuring pavement quality in Tallinn
TEAM:
Kevin Kanarbik
Märten Veskimäe
Tõnis Kasekamp
Project repository: https://github.com/Kennu76/DataMiningProject
Project #38:
Dataset 1 (7.43 MB): Run or Walk [1]
Goal 1: Become better acquainted with the data mining processes behind the Samsung S Health system
Goal 2: Recognize a person’s physical activity using sensor data from their mobile phone (accelerometer and gyroscope)
Motivated by the S Health application on my Samsung Note 4 mobile phone, I became interested in learning more about the data processing behind this healthcare application. Going one step further, I am eager to see whether data gathered by mobile sensors can be used to predict our physical activities. In this project, the data are collected by accelerometer and gyroscope sensors. Physical activities refer to activities such as running and walking.
Links:
[1] https://www.kaggle.com/vmalyi/run-or-walk
Your mobile can recognize your physical activity!
TEAM:
Shahla Atapoor
Project repository: https://github.com/atapoor/Data_Mining_2017
Project #39: FITLAP
Dataset 1 (319 MB): Private file with all of the user data
Goal 1: The main goal would be to find out who achieves the goal they set. More specifically:
1. what do they have in common (perhaps they are mostly people who want to maintain their weight rather than lose it, or the other way around);
2. are there any specific meals or ingredients that these people eat;
3. how many people change the goal and achieve it after that.
...
Links:
[1] fitlap.ee/
...
Who achieves the goal - analysis of Fitlap data
TEAM:
Reelika Tõnisson
Rando Tõnisson
Project repository: can’t be added because of the non-disclosure agreement
Project #40: RATECHAIN-1
Majnun Abdurahmanov
Khaled Nimr Charkie
Bejon Sarker
Project repository: https://github.com/delone-lora/CRDD-dm-project-2017 (will make it private later)
rate_quote.csv (21.18 GB): Data about price quotes, including the input parameters used to calculate each price quote, such as age, source country, start date and so on.
reservations_view.csv (34.4 MB): Reservations made from price quotes. The table includes data such as when the reservation was made, the duration of the reservation and so on.
market_price.csv (3.65 GB): General information about market prices for car rental in different countries
Goal 1: Fulfill business goals of RateChain.
Goal 2: Create a model that uses the ratio of rate quotes to reservations to derive historical demand from the reservation history
Project #41: HEALTHY HEART
Heart Disease Dataset (1 MB): Dataset containing patient information (such as age, whether they’re smokers etc.) and the presence of heart disease in the patient
...
Goal 1: Identify the biggest risk factors that contribute to heart disease
Goal 2: Test whether the country where people live (some economic or financial factors) influences the rate of heart disease.
Goal 3: Predict the risk for heart disease of an unknown person based on some personal information about the person
Goal 4: Inform (and warn) the general public about the risks in order to try to reduce the heart disease rate in people.
…
Links:
[1] http://archive.ics.uci.edu/ml/datasets/Heart+Disease
...
TEAM:
Simona Micevska
Hristijan Sardjoski
Project repository: https://bitbucket.org/hakerchinja/healthy-heart
Project #42: CRIME RESEARCH IN USA
Dataset 1 (1.53 GB): The data was taken from the U.S. Government’s open data [1]. It includes details about crime in the City of Chicago from 2001 to present.
Dataset 2 (357 MB): The same data about crimes but for the city of Los Angeles dating back to 2010 [2].
Goal 1: Based on the datasets, we will identify and analyze the patterns and trends in crime in the City of Chicago (2001 - 2017) and the City of Los Angeles (2010 - 2017).
TEAM:
Sofiya Demchuk
Dmytro Tkachuk
Raman Shapaval
Project repository: https://github.com/dimatkachuk/DM-project
Identifying, analyzing and visualizing crime patterns and trends in the City of Chicago from 2001 and in the City of Los Angeles from 2010 to present.
Goal 2: Using a visualization library and filtering functions, we will determine the most “dangerous” and the safest districts depending on the time of day, period of the year, weekends, etc. We think this can benefit security management in the city, helping to prevent crimes and consequently lower the crime rate.
Goal 3: The dates of crimes can be studied to see crime trends depending on seasons, celebration days and time of the day.
Goal 4: Based on the crime data, we will try to predict the exact time of crimes depending on the type of crime, time and district.
Links: [1] https://catalog.data.gov/dataset?tags=crime
[2] https://catalog.data.gov/dataset/crime-data-from-2010-to-present
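As a rough illustration of Goals 2 and 3 of Project #42, the sketch below counts incidents per district and hour of day. The record layout (a timestamp string plus a district name) and the sample values are assumptions made up for this example; the real column names in the Chicago and Los Angeles files may differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (timestamp, district). Invented for illustration only.
records = [
    ("2017-01-01 23:15", "District A"),
    ("2017-01-02 23:40", "District A"),
    ("2017-01-02 09:05", "District B"),
    ("2017-01-03 14:30", "District A"),
]

def counts_by_district_and_hour(rows):
    """Count incidents for every (district, hour of day) pair."""
    counts = Counter()
    for timestamp, district in rows:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").hour
        counts[(district, hour)] += 1
    return counts

counts = counts_by_district_and_hour(records)
# The busiest (district, hour) combination in the sample data.
busiest = counts.most_common(1)
print(busiest)
```

The same counting could be keyed on weekday or month instead of hour to look at weekly or seasonal trends.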
Project #44: Dermtest CNN
Dataset (800 MB): dataset of images of suspicious skin areas taken with a dermatoscopic camera [1], from Dermtest [2]
Goal 1: Train a convolutional neural network to predict cancer risk from images
Goal 2: Extract embeddings from the NN, cluster images using PCA and t-SNE
Goal 3: Use saliency maps from the NN to identify the most important parts of the skin image
Links:
[1] https://en.wikipedia.org/wiki/Dermatoscopy
Predicting skin cancer from images using a convolutional neural network
TEAM:
Maksym Semikin
Martin Valgur
Project repository: https://github.com/msemikin/dermtest
Project #45: Bosch Production Line Performance
Dataset 1 (7.2 GB): Training data from the competition (uncompressed), unlabeled measurements from the manufacturing process
Dataset 2 (7.2 GB): Test data from the competition (uncompressed)
Goal 1: Predict which mechanical components fail quality control on the whole test set
Goal 2: Build an ensemble of multiple models
Goal 3: Achieve a good (top 20%?) score
Links:
[1] https://www.kaggle.com/c/bosch-production-line-performance
Reduce manufacturing failures
TEAM:
Oliver-Matis Lill
Project repository: https://github.com/oml1111/bosch
Project #46: DUNORD
Demand prediction for Liivi 2 Cafeteria
Dataset 1 (1 MB): Dataset on Liivi-2 cafeteria sales from Nov 2015 to Oct 2016
Dataset 2 (1 MB): Dataset on Liivi-2 cafeteria sales from Nov 2016 to Oct 2017
Dataset 3 (2.5 MB): Dataset on Liivi-2 lecture room occupancy from fall 2015 to fall 2017
Goal 1: Predict the sales quantities in the Du Nord cafeteria at J. Liivi 2
Goal 2: Build an automatic method using machine learning models to predict the sales
Goal 3: Analyse correlations between variables from lecture room occupancy data and sales quantities in different categories
TEAM:
Navedanjum Ansari
Saumitra Bagchi
Sriyal Jayasinghe
Project repository: https://bitbucket.org/navedanjum/dunord.git
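Goal 3 of Project #46 could start from a plain Pearson correlation between an occupancy variable and a sales variable. The sketch below is a minimal, dependency-free version; the daily numbers and the variable names are hypothetical and are not taken from the Du Nord datasets.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical daily values, invented for illustration only.
occupancy = [120, 340, 280, 90, 400]   # people in lecture rooms
coffee_sales = [30, 85, 70, 25, 100]   # items sold in one category
r = pearson(occupancy, coffee_sales)
print(round(r, 3))
```

A value of r close to 1 or -1 would suggest a strong linear relationship for that sales category; it says nothing about causation.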
Project #47: Global Terrorism
Dataset 1 (143.96 MB): The Global Terrorism Database (GTD) is an open-source database including information on terrorist attacks around the world from 1970 through 2016 (with annual updates planned for the future).
Dataset name: globalterrorismdb_0617dist.csv
Link: https://www.kaggle.com/START-UMD/gtd/data
...
Goal 1: Identify the most common attack types and targets, and try to predict future attacks.
Goal 2: Identify and visualize the most dangerous places at certain times of the year.
Goal 3: Identify whether people of certain nationalities are more likely to be killed or otherwise affected by terrorism.
Analysing global terrorism dataset
TEAM:
Diana Grygorian
Eneko Ruiz de Loizaga
Ibrahim Abdulhamid
Project repository: https://github.com/eruizdeloizaga002/globalterrorism
Project #48: STARSHIP
Dataset 1 (177 GB): sensor data from the delivery robots (timestamp, orientation, readings from the accelerometer, magnetometer, gyroscope, etc.)
Dataset 2 (278 KB): localization data (coordinates of the particular robot with timestamp)
Goal 1: reduce data size (take only Tallinn) and build a route map
Goal 2: study pavement quality in Tallinn using sensor data
Pavement quality can be studied dynamically
Links:
[1] https://www.starship.xyz/
...
Pavement quality measurements in Tallinn
TEAM:
Mikhail Papkov
Elizaveta Korotkova
Project repository: https://github.com/papkov/starship
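The "take only Tallinn" step in Goal 1 of Project #48 can be sketched as a bounding-box filter over the localization rows. The box limits below are a rough assumption chosen for illustration, and the row layout (timestamp, latitude, longitude) is hypothetical; neither comes from the project itself.

```python
# Rough bounding box around Tallinn. The limits are an assumption made
# for this example, not values used by the project.
TALLINN_BOX = {"lat_min": 59.35, "lat_max": 59.50,
               "lon_min": 24.55, "lon_max": 24.93}

def in_box(lat, lon, box=TALLINN_BOX):
    """Return True if the coordinate lies inside the bounding box."""
    return (box["lat_min"] <= lat <= box["lat_max"]
            and box["lon_min"] <= lon <= box["lon_max"])

def filter_to_box(rows, box=TALLINN_BOX):
    """Yield only the (timestamp, lat, lon) rows that fall inside the box."""
    for row in rows:
        _, lat, lon = row
        if in_box(lat, lon, box):
            yield row

# Hypothetical localization rows, invented for illustration only.
rows = [
    (1509530000, 59.437, 24.754),  # central Tallinn
    (1509530060, 58.378, 26.729),  # Tartu, outside the box
    (1509530120, 59.395, 24.671),  # western Tallinn, inside the box
]
tallinn_rows = list(filter_to_box(rows))
print(len(tallinn_rows))
```

Because the filter is a generator, it can be applied while streaming a large file row by row instead of loading all sensor data into memory.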
Project #49: ELEKTRILEVI-2: Cables, transformers, poles
The data for this project is private, so it cannot be shared publicly.
Dataset 1 (3.3 MB): cableReliability.csv - Includes all the relevant data of cables
Dataset 2 (4.2 MB): defectsTrafo.csv - Includes all the relevant data of Transformers
Dataset 3 (372 KB): poleLifeSpan.csv - Includes all the relevant data of Poles
Goal 1: Find the connection between transformer types, their ages and the defects most likely to appear
Goal 2: Predict the age of poles and the best time to replace them
Goal 3: Predict when a new joint is likely to be installed in a cable
TEAM:
Muhammad Bilal Shahid
Hippolyte Fayol
Bilawal Hussain
Project repository: https://bitbucket.org/bilawal_ut/elektrilevi_cablepoletransformer
Project #50: Predictions and analysis of the biathlon World Cup data
Project repository: https://github.com/annalanevali/dataminingProject.git
Dataset: biathlon World Cup data for the 2016/2017 season (collected from https://biathlonresults.com/)
Goal 1: analyse the data to understand why some athletes are better than others, with visualizations
Goal 2: try to find out what the winning strategies are.
Goal 3: build a small prediction model
TEAM:
Anna Laaneväli