ML Methods Dataset 2019

1	Proposer	Dataset name	Training items (approx)	Test items (approx)	# of classes	Feature types	URL/reference	Brief description	Readiness	Votes	Agrawal Abhishek	Ahmadli Aydin	Balhar Jiří	Doubravová Petra	Eliáš Richard	Fischer Claire	Gokirmak Memduh	Henriette Bertrand	Houška Petr	Chalupa Michael	Chembrolu Surya Prakash	Ihnatchenko Bohdan	Jareš Antonín	Karella Tomáš	Kratochvíl Jonáš	Kremel Tomáš	Kumová Věra	Nekvinda Michal	Pilař Tomáš	Pospěch Michal	Procházka Štěpán	Shafiq Chaman	Schmidtová Patrícia	Souček Tomáš	Šerý Martin	Špaček Jan	Teste Alexis	Tryhubyshyn Iryna	Vainer Jan	Vandas Marek	OB	ZZ
2	Ondrej Bojar	ob-SampleData	120k	1k	4	real, integer, categorical	I will bring the dataset	This is just a fake entry. The goal is to predict the color of the Teddy bear based on its measurements and properties (cuddliness etc.)		74	3	3	0	3	0	3	3	3	0	0	3	0	0	3	3	4	3	0	3	3	3	0	3	3	3	3	3	3	3	3	7	0

3	Aydin Ahmadli	Scene recognition	2000	400	2	real, integer	https://www.openml.org/d/312	Contains image characteristics and their classes, with 6 different labels: {Beach, Sunset, FallFoliage, Field, Mountain, Urban}. The problem is binary classification: we have to decide whether an image is 'Urban' or not.	R	0
4	Aydin Ahmadli	Robot Navigation	5400		4	real,integer	https://www.openml.org/d/1526	Given features such as 24 different sensor readings, we have to decide which action will robot take - 4 output classes : {Move-Forward, Slight-Right-Turn, Sharp-Right-Turn, Slight-Left-Turn}	R	0																															0
5	Aydin Ahmadli	ID recognition from Walking	59k		22	real, integer	https://archive.ics.uci.edu/ml/datasets/User+Identification+From+Walking+Activity	Data collected from an Android smartphone positioned in the chest pocket of 22 participants. Input features: {Time-step, x acceleration, y acceleration, z acceleration}. Output classes: 22 user IDs		0																															0
6	Tomáš Kremel	Financial well-being survey	6k		5	integer	https://www.consumerfinance.gov/data-research/financial-well-being-survey-data/	A person’s financial well-being comes from their sense of financial security and freedom of choice—both in the present and when considering the future. The survey dataset includes respondents’ scores, as well as measures of individual and household characteristics.	C	6																	1				1		1	1				1		1
7	Tomáš Kremel	3 million Russian troll tweets	3M			text, time, enum, integer	https://github.com/fivethirtyeight/russian-troll-tweets/	Data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian "troll factory" and a defendant in an indictment filed by the Justice Department in February 2018, as part of special counsel Robert Mueller's Russia investigation.	C	2				1																1											?
8	Tomáš Kremel	Airbags	26k		2	enum, integer, time	https://maths-people.anu.edu.au/~johnm/datasets/airbags/	Did airbags, over 1997-2002 in the US, reduce accident risk?		0
9	Tomáš Karella	Grants	6k per year (2006 - 2019)		2(3)	text, enum, integer, real	https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A1-sada%2Fhttp---opendata.praha.eu-api-3-action-package_show-id-mhmp-granty-2006	Prague City Hall shares data about grant requests. Every request contains information about the applicant, the project, the verdict and the money assigned.	C	0																															0
10	Tomáš Karella	Car accidents	100k per year (2007 - 2019)		4	text, enum, integer, real	https://www.policie.cz/clanek/statistika-nehodovosti-900835.aspx?q=Y2hudW09Mg%3d%3d	The Police of the Czech Republic share information about car accidents, including the severity of the injuries. It would be possible to merge this dataset with the CHMI weather statistics.	C	4																									1	1				1	1
11	Tomáš Karella	DOHMH New York City Restaurant Inspection Results	385k		2	text, enum, integer, real	https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/data	Data describing restaurant inspections in New York. The target value could be the closure of the restaurant; the columns contain info about the location, type of restaurant, etc.	C	0
12	Věra Kumová	College Scorecard	7k		2	enum, int, real	https://collegescorecard.ed.gov/data/	Data about US schools. The target value could be one of the flag variables, e.g. "Flag for women-only college"	P	0
13	Věra Kumová	Women, Business and the Law	187		4	0-1	https://datacatalog.worldbank.org/dataset/women-business-and-law	The study examines 35 questions in 8 areas (about equal opportunities for women) with yes-no answers. The data are originally time series, but a single year could also be used for classification: each row is one country and the target value could be the "Income group" of the country.	R	0
14	Věra Kumová	University students behaviour	2k		2	enum, int, text	https://data.brno.cz/dataset/?id=sociologicky-vyzkum-chovani-studentu-vs	Survey among students in Brno. The target value could be the information, whether a student has additional income for study or not.	C	6											1						1						1		1				1		1
15	Jan Špaček	Dancing	~36k latin, ~33k standard		(see desc)	(see desc)	https://www.dancerank.cz	Results of Czech dancesport competitions, can be used to predict results of future competitions and/or national championships. To coerce the task into the simple ML formulation, we can generate a fixed number of features for each couple (# of competitions, # of finals, average time between competitions etc. before time t) and predict a discrete result (what class will the couple achieve after time t? how long will the couple continue dancing together after time t?). Alternately, similar data is available for international dance competitions (under WDSF).		1																															1
16	Jan Špaček	Maturita	~100		(see desc)	real	Personal communication	Anonymized dataset of students from my high school. Given their results at the admission test (Scio), predict their maturita scores. To pose this as a classification task, we can discretize the scores in various ways (passed? average above 2?)		3																1									1						1
17	Jan Špaček	Lítačka	arbitrary	arbitrary	(see desc)	5 real	http://opendata.praha.eu/dataset/jizdni-rady-pid	Given coordinates of start and target positions in Prague and time of departure, predict how long the trip takes using public transport. The dataset is generated artificially by sampling pairs of stops and times and computing the shortest route in the real timetables for Prague. This can also be posed as a classification task in various ways (is the route faster than 30 minutes? is there a route with no transfer? does the fastest route use the metro? what types of vehicles the fastest route uses?).		1																										1					0
18	Abhishek Agrawal	Formspring data labelled for cyberbullying			2	text	http://www.chatcoder.com/Data/DataReleaseDec2011.rar	The data represented 50 ids from Formspring.me that were crawled in Summer 2010. For each id, the profile information and each post (question and answer) was extracted. Each post was loaded into Amazon's Mechanical Turk and labeled by three workers for cyberbullying content.		2	1						1
19	Abhishek Agrawal	Myspace Group Data labelled for Cyberbullying			2	text	http://www.chatcoder.com/Data/BayzickBullyingData.rar	The folder contains a small subset of data from crawl of myspace groups. The data has been manually labelled for bullying content by 3 independent coders. Each input file was split into a window of 10 posts each. Each window was judged to determine if there was cyberbullying content anywhere in the window. The labels are contained in separate files. For a window to be labelled as containing cyberbullying, at least 2 out of 3 users had to label it as cyberbullying.		0
20	Jan Vainer	trashnet	0	500	6	int	https://github.com/garythung/trashnet	The dataset can be used to learn a classifier to recognize various kinds of trash such as glass or plastic bottles. The usefulness lies in the ability to sort trash automatically without human intervention.	R	2														1														1
21	Jan Vainer	Sound20	20000	4000	19	real	https://github.com/ivclab/Sound20	The dataset contains sample sounds of various animals (insects) and musical instruments. It could be used to classify the animals based on the sounds they make, to distinguish between animal sounds and instruments. The data are in the form of spectrograms of given sounds	R	0
22	Jan Vainer	Savee	1000	200	6	real	http://kahlan.eps.surrey.ac.uk/savee/	Database of audiovisual emotional speech. Emotion classification.	P	2								1																						1
23	Alexis Teste	Parkinson's Disease Classification Data Set	750	750		integer, real	https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification	The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 (65.1±10.9) at the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1±8.9). During the data collection process, the microphone is set to 44.1 KHz and following the physician’s examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.		2						1																							1
24	Alexis Teste	mfeat-factors	2000		10		https://www.openml.org/d/12	One of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.		0
25	Alexis Teste	nursery	10000	3000	5	nominal	https://www.openml.org/d/26	Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used during several years in 1980's when there was excessive enrollment to these schools in Ljubljana, Slovenia, and the rejected applications frequently needed an objective explanation. The final decision depended on three subproblems: occupation of parents and child's nursery, family structure and financial standing, and social and health picture of the family. The model was developed within expert system shell for decision making DEX		0
26	Martin Šerý	Banknote authentication	1000	300	2	real	https://archive.ics.uci.edu/ml/datasets/banknote+authentication	Data were extracted from images that were taken for the evaluation of an authentication procedure for banknotes.		2							1							1
27	Martin Šerý	DOTA 2 game results	7000	300	2	categorical	https://archive.ics.uci.edu/ml/datasets/Dota2+Games+Results	Predict the result of the match based on initial draft of heroes in DOTA 2 game		1		1
28	Martin Šerý	Student performance	500	150	20	categorical	https://archive.ics.uci.edu/ml/datasets/Student+Performance	Student performance analysis - predict final grade based on student grades, demographic, social and school related features.		1						1
29	Iryna Tryhubyshyn	Cyber-Trolls detection	20k		2	text	https://dataturks.com/projects/abhishek.narayanan/Dataset%20for%20Detection%20of%20Cyber-Trolls	The dataset contains tweets that are labeled whether they are aggressive or not	C	3		1													1	1
30	Iryna Tryhubyshyn	Programmer's salary survey	9k			real, integer, categorical	I will translate the dataset into English https://github.com/devua/csv/tree/master/salaries	The dataset contains the results of a survey about the salaries of Ukrainian programmers. Features include job position, programming language, age, city, work experience, company size and type (outsource, product, outstaff), education, English level and so on. Can be either a classification or a regression task.	C	0
31	Iryna Tryhubyshyn	Survey of programmers who emigrated from Ukraine	1700		5	real, integer, categorical	I will translate the dataset into English https://github.com/devua/csv/tree/master/relocation	The dataset contains information about life satisfaction, current salary, country and job position, and reasons for leaving Ukraine. More than 20 questions, mostly multiple-choice. We can try to predict changes in quality of life	C	0
32	Marek Vandas	sentence-language	120k		12	integer, categorical	https://tatoeba.org/eng/downloads	The dataset contains a list of sentences with language labels. The task is to classify each sentence by language.	C	0
33	Marek Vandas	section-category	70k		2-16	real, integer	I will bring the dataset	The dataset contains images of cross sections (the intersection of a 3D rigid beam with a 2D plane). The task is to assign each cross section to one of the categories, which can be seen at https://www.dlubal.com/-/media/Images/website/pages/solutions/online-tools/glossary/000014/01-en.png	P	0
34	Marek Vandas	spoken-command	1500 * # of classes		2-5000	real, integer	https://github.com/JohannesBuchner/spoken-command-recognition	The dataset contains a list of spoken commands and is created artificially by software (variations are based on noise and other sound filters). The task is to recognize the command from the spoken word.	R	0
35	Chembrolu Surya	Crowdsourced mapping dataset	10545	300	6	real	https://archive.ics.uci.edu/ml/datasets/Crowdsourced+Mapping	The dataset contains NDVI (vegetation index) values obtained over a period of 17 months, on 27 different days. The data were collected using satellite imagery and OpenStreetMap, and the task is to classify the land cover based on the NDVI values. There are six different classes: Farm, Forest, Grass, Orchard, Water, Impervious		0
36	Chembrolu Surya	SkyTrax Review Dataset	over 17k	split from training samples	2	real,text,categorical	https://raw.githubusercontent.com/quankiquanki/skytrax-reviews-dataset/master/data/airport.csv	The dataset is obtained by scraping reviews given on skytrax airlines website and reviews are about experiences of different airports by the travellers. The dataset consists of recommended column with either 0 or 1 implying whether the particular airport is recommended or not.		2		1																													1
37	Chembrolu Surya	Thyroid Disease Dataset	7200	split from training samples	3	real,integer	https://sci2s.ugr.es/keel/dataset_smja.php?cod=1179#sub1	The task is to detect thyroid condition of a patient which can be 1 for normal, 2 for hyperthyroid, 3 for hypothyroid		0
38	Štěpán Procházka	underlying-distribution	10k / arbitrarily many	2k / arbitrarily many	5-10	real	artificially generated	The dataset examples will be real valued vectors generated by some distribution (e.g., normal, uniform, etc.), the goal is to classify which distribution generated an example (may be thought of as trying to find out what is a distribution of a random variable from a fixed width sequence of independent trials)	P/C	0
39	Štěpán Procházka	fps-cheater-detection	1K	100	2	real, integer	https://github.com/Nexosis/sampledata/blob/master/csgo-small.csv + some new data if someone knows where to get them	The goal of this task is to distinguish between cheaters and non cheaters in FPS game based on their in-game statistics (accuracy, K/D ratio etc.). The choice of the exact game title may be different, if better data are available.	P	2																			1									1
40	Štěpán Procházka	fake-news-classification	1K+	100+	2	categorical, integer, real	may be scraped from pages similar to this one https://www.politifact.com/personalities/donald-trump/statements/by/	The goal is to tell if the text (FB post, tweet) presents objective information, based on extracted features - length of text, number of emojis, uppercase-to-lowercase character ratio etc. The data may be collected from various sources (public social media accounts etc.)	P	4						1		1			1																1
41	Michal Pospěch	czech-presidental-election	1086	taken from training sample	10	real, integer, categorical	Taken from CVVM	Based on various demographic and socio-economic indicators try to predict who did people vote for in the first round of Czech presidential election 2018	C, processing needed though	14	1			1			1				1				1	1			1		1		1	1		1	1		1		1
42	Michal Pospěch	star-wars	1200~	taken from training sample	3	categorical	https://github.com/fivethirtyeight/data/blob/master/star-wars-survey/StarWars.csv	The goal is to predict who do the respondents think shot first, Han or Greedo.	C	1																			1
43	Michal Pospěch	mlb	170k~	taken from training sample	2	real	https://github.com/fivethirtyeight/data/tree/master/mlb-elo	Predict winner of MLB match based on ratings of teams and their pitchers	C, some processing needed	0
44	Memduh Gokirmak	music genre identification			5-10	real		Identify the genre of a musical audio file		1								1																							0
45	Memduh Gokirmak	document word segmentation			2	real		find the boundaries of words in images of text		0
46	Memduh Gokirmak	text difficulty evaluation			~5	real	MICUSP or something else	assign a difficulty level for readers to a natural language text	P	0
47	Petr Houška	fitts-nail097-houska	2000	taken from training sample	4	real	https://github.com/petrroll/NAIL087-fitts-exp/tree/master/data	The goal is to predict participant id based on fitts' experiment results (speed of click, length, and size)		1																				1
48	Petr Houška	Audit Data Data Set	777	taken from training sample	2	real, integer, cat	https://archive.ics.uci.edu/ml/datasets/Audit+Data#	The goal of the research is to help auditors by building a classification model that can predict fraudulent firms on the basis of present and historical risk factors.		0
49	Petr Houška	Militarized Interstate Disputes v4.2	~2000	taken from training sample	20	categorical, real, dates	http://www.correlatesofwar.org/data-sets/MIDs	The goal is to predict the outcome of a battle/war/skirmish based on predictors from 1993-2010.		2	1													1
50	Petra Doubravová	General mortality	~300000	taken from training sample		real, integer, categorical	https://ec.europa.eu/eurostat/web/health/data/database	May be used to obtain information about mortality in different countries and its causes depending on age, sex and other features; the practical use is in prevention	C	0
51	Petra Doubravová	Sentiment analysis	~400 maybe more, can be extended	taken from training sample	3	text in Czech	my data	Classification of Facebook posts, mostly about a big bank, as negative, positive or neutral, specifically in the Czech language	C	2																											1				1
52	Petra Doubravová									0
53	Abhishek Agrawal	Musk dataset	6598	taken from training sample	2	Integer	https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2)	This dataset describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. However, the 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule. Because bonds can rotate, a single molecule can adopt many different shapes. To generate this data set, all the low-energy conformations of the molecules were generated to produce 6,598 conformations. Then, a feature vector was extracted that describes each conformation. When learning a classifier for this data, the classifier should classify a molecule as "musk" if ANY of its conformations is classified as a musk. A molecule should be classified as "non-musk" if NONE of its conformations is classified as a musk.		0
54	Jonáš Kratochvíl	Movie recommendation	90000	10000	32	real, cat	https://ufal.mff.cuni.cz/courses/npfl054/materials	IMDb movie database dataset	C	0
55	Jonáš Kratochvíl	Human machine dialogue prediction	1000	100	n^3	real, cat		Predicting price range location and type of food based on human machine dialogue	P	3				1											1		1
56	Jonáš Kratochvíl	Signal/noise classification	800000	20000	35	real, cat	http://opendata.cern.ch/record/328	Predict whether a CERN detector senses noise or signal based on various measurements	C	0
57	Ondrej adds other possible sources, please enter your name in the Proposer column if you like the dataset and volunteer to review it. Then please add all the other details here, in the row. Highlighted ones seem very interesting.									0
58							https://linked.opendata.cz/dataset/czso-deaths-by-selected-causes-of-death	Time series of death reasons; some conversion would be needed for classification. To see the data, select "Prejit na datovy zdroj" from the last "Prozkoumat" drop-down menu. The data will be downloaded		0
59							https://linked.opendata.cz/dataset/czso-job-applicants	Unemployed registered people counts by region; but perhaps hard to use for classification		0
60										0
61						Various datasets from	https://data.gov.cz/datov%C3%A9-sady	...many sources from Czech Republic		0
62							http://opendata.praha.eu/dataset	...Prague sources		0
63	Chembrolu Surya						https://www.netmetr.cz/open-data.html	Results of internet speed tests over a longer period of time; the goal could be to predict internet connection type (LAN, G4, ...)		0
64										0
65							https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A1-sada%2Fhttp---opendata.praha.eu-api-3-action-package_show-id-ipr-bonita_klimatu_z_hlediska_prirozene_ventilace_uzemi	This dataset alone is just a map, indicating air bonity (not exactly air quality but the speed of air change, so that immission has perhaps not so bad effects). It would be fabulous to link it with some categorical description of the area (family houses, skyscrapers, ..) and predict air bonity based on "a picture" (i.e. to local observations; not that you would be processing pictures) and altitude		0
66							http://www.geoportalpraha.cz/cs/fulltext_geoportal	Search for 'bonita' to see other possible maps/data. OR do not enter any keyword and only select e.g. 'budovy'		0
67							https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A1-sada%2F3751165	Hundreds of thousands of tenders (verejne zakazky), including evaluation criteria, texts, etc. We could change this into various tasks, e.g. predicting price range based on keywords from the description, predicting relevant evaluation criteria based on keywords from the description or based on who is proposing the tender (i.a. to find e.g. municipalities known for obscure practices etc.)		0
68							https://golemio.cz/cs/oblasti	Various datasets on Prague		0
69	Tomáš Souček	Votes from Czech Parliament	approx. 10-60k (x200)		2		https://www.psp.cz/sqw/hp.sqw?k=1300	Votes of all deputies of the Czech Parliament from 1993 to the present; can be used to predict how one or more deputies would vote given the votes of other deputies	C	4																1				1	1			1