1
Proposer
Dataset name
Training items (approx)
Test items (approx)
# of classes
Feature types
URL/reference
Brief description
Readiness
Votes
Agrawal Abhishek
Ahmadli Aydin
Balhar Jiří
Doubravová Petra
Eliáš Richard
Fischer Claire
Gokirmak Memduh
Henriette Bertrand
Houška Petr
Chalupa Michael
Chembrolu Surya Prakash
Ihnatchenko Bohdan
Jareš Antonín
Karella Tomáš
Kratochvíl Jonáš
Kremel Tomáš
Kumová Věra
Nekvinda Michal
Pilař Tomáš
Pospěch Michal
Procházka Štěpán
Shafiq Chaman
Schmidtová Patrícia
Souček Tomáš
Šerý Martin
Špaček Jan
Teste Alexis
Tryhubyshyn Iryna
Vainer Jan
Vandas Marek
OBZZ
2
Ondrej Bojarob-SampleData120k1k4real, integer, categoricalI will bring the datasetThis is just a fake entry. The goal is to predict the color of the Teddy bear based on its measurements and properties (cuddliness etc.)7433030333003003343033303333333370
3
Aydin AhmadliScene recognition20004002real, integer
https://www.openml.org/d/312
Contains image characteristics and their classes, with 6 different labels: {Beach, Sunset, FallFoliage, Field, Mountain, Urban}. The problem is binary classification: we have to decide whether an image is 'Urban' or not.R0
4
Aydin AhmadliRobot Navigation54004real,integer
https://www.openml.org/d/1526
Given features such as 24 different sensor readings, we have to decide which action the robot will take. There are 4 output classes: {Move-Forward, Slight-Right-Turn, Sharp-Right-Turn, Slight-Left-Turn}R00
5
Aydin AhmadliID recognition from Walking 59k22real,integer
https://archive.ics.uci.edu/ml/datasets/User+Identification+From+Walking+Activity
Data collected from an Android smartphone positioned in the chest pocket of 22 participants. Input features: {Time-step, x acceleration, y acceleration, z acceleration}. Output classes: 22 User ID00
6
Tomáš KremelFinancial well-being survey6k5integer
https://www.consumerfinance.gov/data-research/financial-well-being-survey-data/
A person’s financial well-being comes from their sense of financial security and freedom of choice—both in the present and when considering the future. The survey dataset includes respondents’ scores, as well as measures of individual and household characteristics.C6111111
7
Tomáš Kremel3 million Russian troll tweets3Mtext, time, enum, integer
https://github.com/fivethirtyeight/russian-troll-tweets/
Data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian "troll factory" and a defendant in an indictment filed by the Justice Department in February 2018, as part of special counsel Robert Mueller's Russia investigation.C211?
8
Tomáš KremelAirbags26k2enum, integer, time
https://maths-people.anu.edu.au/~johnm/datasets/airbags/
Did airbags, over 1997-2002 in the US, reduce accident risk?0
9
Tomáš KarellaGrants6k per year (2006 - 2019)2(3)text, enum, integer, realhttps://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A1-sada%2Fhttp---opendata.praha.eu-api-3-action-package_show-id-mhmp-granty-2006Prague City Hall shares data about grant requests. Every request contains information about the applicant, information about the project, the verdict, and the amount of money assigned.C0
motion (SAVEE) database has been recorded as a pre-requisite for the development of an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion. The data were recorded in a visual media lab with high quality audio-visual equipment, processed and labeled. To check the quality of performance, the
0
10
Tomáš KarellaCar accidents
100k per year (2007 - 2019)
4text, enum, integer, realhttps://www.policie.cz/clanek/statistika-nehodovosti-900835.aspx?q=Y2hudW09Mg%3d%3dThe police of the Czech Republic share information about car accidents, including the severity of the injuries. It would be possible to merge this dataset with the CHMI weather statistics.C41111
11
Tomáš KarellaDOHMH New York City Restaurant Inspection Results385k2text, enum, integer, realhttps://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/dataData describing restaurant inspections in New York. The target value could be the closure of the restaurant; the columns contain information about the location, type of restaurant, etc.C0
12
Věra KumováCollege Scorecard7k2enum, int, real
https://collegescorecard.ed.gov/data/
Data about US schools. The target value could be one of the flag variables, e.g. "Flag for women-only college"P0
13
Věra Kumová
Women, Business and the Law
18740-1
https://datacatalog.worldbank.org/dataset/women-business-and-law
The study examines 35 questions in eight areas (about equal opportunities for women) with yes-no answers. The data are originally time series, but a single year could also be used for classification: each row is one country and the target value could be the "Income group" of the country. R0
14
Věra KumováUniversity students behaviour2k2enum, int, text
https://data.brno.cz/dataset/?id=sociologicky-vyzkum-chovani-studentu-vs
Survey among students in Brno. The target value could be whether or not a student has additional income for their studies.C6111111
15
Jan ŠpačekDancing~36k latin, ~33k standard(see desc)(see desc)
https://www.dancerank.cz
Results of Czech dancesport competitions, which can be used to predict results of future competitions and/or national championships. To coerce the task into the simple ML formulation, we can generate a fixed number of features for each couple (# of competitions, # of finals, average time between competitions etc. before time t) and predict a discrete result (what class will the couple achieve after time t? how long will the couple continue dancing together after time t?). Alternatively, similar data is available for international dance competitions (under WDSF).11
16
Jan ŠpačekMaturita~100(see desc)real
Personal communication
Anonymized dataset of students from my high school. Given their results at the admission test (Scio), predict their maturita scores. To pose this as a classification task, we can discretize the scores in various ways (passed? average above 2?)3111
17
Jan ŠpačekLítačkaarbitraryarbitrary(see desc)5 real
http://opendata.praha.eu/dataset/jizdni-rady-pid
Given coordinates of start and target positions in Prague and time of departure, predict how long the trip takes using public transport. The dataset is generated artificially by sampling pairs of stops and times and computing the shortest route in the real timetables for Prague. This can also be posed as a classification task in various ways (is the route faster than 30 minutes? is there a route with no transfer? does the fastest route use the metro? what types of vehicles does the fastest route use?).110
18
Abhishek Agrawal
Formspring data labelled for cyberbullying
2text
http://www.chatcoder.com/Data/DataReleaseDec2011.rar
The data represented 50 ids from Formspring.me that were crawled in Summer 2010. For each id, the profile information and each post (question and answer) was extracted. Each post was loaded into Amazon's Mechanical Turk and labeled by three workers for cyberbullying content.211
19
Abhishek Agrawal
Myspace Group Data labelled for Cyberbullying
2text
http://www.chatcoder.com/Data/BayzickBullyingData.rar
The folder contains a small subset of data from crawl of myspace groups. The data has been manually labelled for bullying content by 3 independent coders. Each input file was split into a window of 10 posts each. Each window was judged to determine if there was cyberbullying content anywhere in the window. The labels are contained in separate files. For a window to be labelled as containing cyberbullying, at least 2 out of 3 users had to label it as cyberbullying.0
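The windowing and labelling rule described in the row above can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the function names are hypothetical, and it assumes the windows are consecutive and non-overlapping, which the description does not state explicitly.

```python
from typing import List, Sequence


def split_into_windows(posts: Sequence[str], size: int = 10) -> List[List[str]]:
    """Split posts into consecutive windows of `size` posts.

    Assumption: windows do not overlap; the last one may be shorter.
    """
    return [list(posts[i:i + size]) for i in range(0, len(posts), size)]


def window_label(coder_votes: Sequence[bool], min_agree: int = 2) -> bool:
    """A window counts as cyberbullying if at least `min_agree` coders said so."""
    return sum(coder_votes) >= min_agree


# Hypothetical data: 25 posts and three coder judgements per window.
posts = [f"post {i}" for i in range(25)]
windows = split_into_windows(posts)
votes_per_window = [(True, True, False), (False, True, False), (True, True, True)]
labels = [window_label(votes) for votes in votes_per_window]
```

With three coders, the threshold of 2 is a simple majority vote.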
20
Jan Vainertrashnet05006int
https://github.com/garythung/trashnet
The dataset can be used to learn a classifier to recognize various kinds of trash such as glass or plastic bottles. The usefulness lies in the ability to sort trash automatically without human intervention. R211
21
Jan VainerSound2020000400019real
https://github.com/ivclab/Sound20
The dataset contains sample sounds of various animals (insects) and musical instruments. It could be used to classify the animals based on the sounds they make, or to distinguish between animal sounds and instruments. The data are in the form of spectrograms of the given sounds.R0
22
Jan VainerSavee10002006real
http://kahlan.eps.surrey.ac.uk/savee/
Database of audiovisual emotional speech. Emotion classification.P211
23
Alexis Teste
Parkinson's Disease Classification Data Set
750750integer, real
https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification
The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 (65.1±10.9) at the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1±8.9). During the data collection process, the microphone is set to 44.1 KHz and following the physician’s examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.
211
24
Alexis Testemfeat-factors200010
https://www.openml.org/d/12
One of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.
0
25
Alexis Testenursery1000030005nominal
https://www.openml.org/d/26
Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used during several years in 1980's when there was excessive enrollment to these schools in Ljubljana, Slovenia, and the rejected applications frequently needed an objective explanation. The final decision depended on three subproblems: occupation of parents and child's nursery, family structure and financial standing, and social and health picture of the family. The model was developed within expert system shell for decision making DEX 0
26
Martin ŠerýBanknote authentication 10003002real
https://archive.ics.uci.edu/ml/datasets/banknote+authentication
Data were extracted from images that were taken for the evaluation of an authentication procedure for banknotes.211
27
Martin ŠerýDOTA 2 game results70003002categorical
https://archive.ics.uci.edu/ml/datasets/Dota2+Games+Results
Predict the result of the match based on the initial draft of heroes in a DOTA 2 game11
28
Martin ŠerýStudent performance50015020categorical
https://archive.ics.uci.edu/ml/datasets/Student+Performance
Student performance analysis - predict final grade based on student grades, demographic, social and school related features.11
29
Iryna TryhubyshynCyber-Trolls detection20k2text
https://dataturks.com/projects/abhishek.narayanan/Dataset%20for%20Detection%20of%20Cyber-Trolls
The dataset contains tweets labeled as aggressive or not.C3111
30
Iryna TryhubyshynProgrammer's salary survey9kreal, integer, categoricalI will translate the dataset into English https://github.com/devua/csv/tree/master/salariesThe dataset contains the results of a survey about the salaries of Ukrainian programmers. Features include job position, programming language, age, city, work experience, company size and type (outsource, product, outstaff), education, English level and so on. Can be either a classification or a regression task. C0
31
Iryna TryhubyshynSurvey of programmers who emigrated from Ukraine17005real, integer, categoricalI will translate the dataset into English https://github.com/devua/csv/tree/master/relocationThe dataset contains information about life satisfaction, current salary, country and job position, and purposes of leaving Ukraine. More than 20 questions, mostly multiple-choice. We can try to predict changes in quality of life.C0
32
Marek Vandassentence-language120k12integer, categorical
https://tatoeba.org/eng/downloads
The dataset contains a list of sentences with language labels. The task is to classify each sentence by its language.C0
33
Marek Vandassection-category70k2-16real, integerI will bring the datasetThe dataset contains images of cross sections (the intersection of a 3D rigid beam with a 2D plane). The task is to assign each cross section to one of the categories, which can be seen at https://www.dlubal.com/-/media/Images/website/pages/solutions/online-tools/glossary/000014/01-en.pngP0
34
Marek Vandasspoken-command1500 * # of classes2-5000real, integer
https://github.com/JohannesBuchner/spoken-command-recognition
The dataset contains a list of spoken commands and is artificially created by software (variations are based on noise and other sound filters). The task is to recognize the command from the spoken word.R0
35
Chembrolu Surya
Crowdsourced mapping dataset
105453006real
https://archive.ics.uci.edu/ml/datasets/Crowdsourced+Mapping
The dataset contains NDVI (vegetation index) values obtained over a period of 17 months, on 27 different days. The data were collected using satellite imagery and OpenStreetMap, and the task is to classify the land cover based on the NDVI values. There are six different classes: Farm, Forest, Grass, Orchard, Water, impervious 0
36
Chembrolu SuryaSkyTrax Review Datasetover 17k
split from training samples
2real,text,categorical
https://raw.githubusercontent.com/quankiquanki/skytrax-reviews-dataset/master/data/airport.csv
The dataset was obtained by scraping reviews from the Skytrax airline review website; the reviews describe travellers' experiences of different airports. The dataset includes a "recommended" column with either 0 or 1, indicating whether the particular airport is recommended or not. 211
37
Chembrolu SuryaThyroid Disease Dataset7200
split from training samples
3real,integer
https://sci2s.ugr.es/keel/dataset_smja.php?cod=1179#sub1
The task is to detect the thyroid condition of a patient, which can be 1 for normal, 2 for hyperthyroid, or 3 for hypothyroid0
38
Štěpán Procházkaunderlying-distribution10k / arbitrarily many2k / arbitrarily many5-10realartificially generatedThe dataset examples will be real-valued vectors generated by some distribution (e.g., normal, uniform, etc.); the goal is to classify which distribution generated an example (this may be thought of as inferring the distribution of a random variable from a fixed-length sequence of independent trials)P/C0
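A generator for such a dataset could look roughly like the sketch below. It is only an illustration under stated assumptions: the three distributions, their parameters, the vector width, and the name `make_dataset` are all hypothetical choices, and the proposal itself mentions 5-10 classes rather than three.

```python
import random

# Hypothetical set of candidate distributions; each entry draws one sample.
GENERATORS = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "uniform": lambda rng: rng.uniform(-1.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
}


def make_dataset(n_examples: int, width: int, seed: int = 0):
    """Return (vector, label) pairs; each vector holds `width` independent
    draws from the single distribution named by its label."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    names = sorted(GENERATORS)
    data = []
    for _ in range(n_examples):
        label = rng.choice(names)
        draw = GENERATORS[label]
        data.append(([draw(rng) for _ in range(width)], label))
    return data


train = make_dataset(n_examples=100, width=20)
```

Because the data are synthetic, training and test sets of any size can be produced by calling the generator with different seeds.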
39
Štěpán Procházkafps-cheater-detection1K1002real, integer
https://github.com/Nexosis/sampledata/blob/master/csgo-small.csv + some new data if someone knows where to get them
The goal of this task is to distinguish between cheaters and non-cheaters in an FPS game based on their in-game statistics (accuracy, K/D ratio etc.). The choice of the exact game title may be different, if better data are available.P211
40
Štěpán Procházkafake-news-classification1K+100+2categorical, integer, real
may be scraped from pages similar to this one https://www.politifact.com/personalities/donald-trump/statements/by/
The goal is to tell if the text (FB post, tweet) presents objective information, based on extracted features - length of text, number of emojis, uppercase-to-lowercase character ratio etc. The data may be collected from various sources (public social media accounts etc.)P41111
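The three example features named in the row above (text length, number of emojis, uppercase-to-lowercase ratio) could be extracted along these lines. This is a rough sketch: `extract_features` and `is_emoji` are hypothetical names, the emoji test is a crude approximation based on two Unicode ranges, and dividing by at least 1 when there are no lowercase letters is an arbitrary choice.

```python
def is_emoji(ch: str) -> bool:
    """Crude emoji check (assumption): pictograph blocks plus misc symbols."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def extract_features(text: str) -> dict:
    """Compute simple surface features of a post or tweet."""
    upper = sum(ch.isupper() for ch in text)
    lower = sum(ch.islower() for ch in text)
    return {
        "length": len(text),
        "emoji_count": sum(is_emoji(ch) for ch in text),
        # Avoid division by zero for texts without lowercase letters.
        "upper_lower_ratio": upper / max(lower, 1),
    }


features = extract_features("WOW so TRUE \U0001F600")
```

The resulting dictionaries can be turned into fixed-length numeric vectors, which matches the "categorical, integer, real" feature types listed for this proposal.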
41
Michal Pospěchczech-presidental-election1086
taken from training sample
10real, integer, categoricalTaken from CVVMBased on various demographic and socio-economic indicators, try to predict who people voted for in the first round of the 2018 Czech presidential election.C, processing needed though1411111111111111
42
Michal Pospěchstar-wars1200~
taken from training sample
3categorical
https://github.com/fivethirtyeight/data/blob/master/star-wars-survey/StarWars.csv
The goal is to predict who the respondents think shot first, Han or Greedo.C11
43
Michal Pospěchmlb170k~
taken from training sample
2real
https://github.com/fivethirtyeight/data/tree/master/mlb-elo
Predict the winner of an MLB game based on the ratings of the teams and their pitchersC, some processing needed 0
44
Memduh Gokirmakmusic genre identification5-10realIdentify the genre of a musical audio file110
45
Memduh Gokirmak
document word segmentation
2realfind the boundaries of words in images of text0
46
Memduh Gokirmaktext difficulty evaluation~5real
MICUSP or something else
assign a difficulty level for readers to a natural language textP0
47
Petr Houškafitts-nail097-houska2000
taken from training sample
4real
https://github.com/petrroll/NAIL087-fitts-exp/tree/master/data
The goal is to predict the participant ID based on Fitts' experiment results (speed of click, length, and size)11
48
Petr HouškaAudit Data Data Set 777
taken from training sample
2real, integer, cat
https://archive.ics.uci.edu/ml/datasets/Audit+Data#
The goal of the research is to help the auditors by building a classification model that can predict the fraudulent firm on the basis of present and historical risk factors.0
49
Petr Houška
Militarized Interstate Disputes v4.2
~2000
taken from training sample
20categorical, real, dates
http://www.correlatesofwar.org/data-sets/MIDs
The goal of the dataset is to predict the outcome of a battle/war/skirmish based on predictors from 1993-2010.211
50
Petra DoubravováGeneral mortality~300000
taken from training sample
real, integer, categorical
https://ec.europa.eu/eurostat/web/health/data/database
May be used to obtain information about mortality in different countries and its causes depending on age, sex and other features; the practical use is in prevention C0
51
Petra DoubravováSentiment analysis
~400 maybe more, can be extended
taken from training sample
3text in Czechmy dataClassification of Facebook posts, mostly on a big bank, as negative, positive, or neutral, specifically in the Czech languageC211
52
Petra Doubravová0
53
Abhishek AgrawalMusk dataset6598
taken from training sample
2Integer
https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2)
This dataset describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. However, the 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule. Because bonds can rotate, a single molecule can adopt many different shapes. To generate this data set, all the low-energy conformations of the molecules were generated to produce 6,598 conformations. Then, a feature vector was extracted that describes each conformation. When learning a classifier for this data, the classifier should classify a molecule as "musk" if ANY of its conformations is classified as a musk. A molecule should be classified as "non-musk" if NONE of its conformations is classified as a musk.
0
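The molecule-level rule in the Musk description (a molecule is "musk" if ANY of its conformations is classified as musk) can be sketched as a small aggregation step on top of any per-conformation classifier. The function name and the molecule identifiers below are hypothetical, and the per-conformation predictions are made-up inputs rather than real results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_by_molecule(
    conformation_predictions: Iterable[Tuple[str, bool]]
) -> Dict[str, bool]:
    """Combine per-conformation predictions into one label per molecule.

    A molecule is musk if any conformation is predicted musk,
    and non-musk only if none of them is.
    """
    by_molecule: Dict[str, bool] = defaultdict(bool)
    for molecule_id, is_musk in conformation_predictions:
        by_molecule[molecule_id] = by_molecule[molecule_id] or is_musk
    return dict(by_molecule)


# Hypothetical per-conformation predictions as (molecule id, predicted musk?).
predictions = [
    ("mol-A", False), ("mol-A", True), ("mol-A", False),
    ("mol-B", False), ("mol-B", False),
]
molecule_labels = aggregate_by_molecule(predictions)
```

One consequence of this rule is that train/test splits are probably best made per molecule rather than per conformation, so that conformations of one molecule do not end up on both sides.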
54
Jonáš KratochvílMovie recommendation900001000032real, cat
https://ufal.mff.cuni.cz/courses/npfl054/materials
IMDb movie database datasetC0
55
Jonáš Kratochvíl
Human machine dialogue prediction
1000100n^3real, catPredicting price range, location, and type of food based on human-machine dialogueP3111
56
Jonáš KratochvílSignal/noise classification8000002000035real, cat
http://opendata.cern.ch/record/328
Predict whether a CERN detector senses noise or signal based on various measurementsC0
57
Ondrej adds other possible sources. Please enter your name in the Proposer column if you like a dataset and volunteer to review it, then add all the other details here, in the row. Highlighted ones seem very interesting.
0
58
https://linked.opendata.cz/dataset/czso-deaths-by-selected-causes-of-death
Time series of causes of death; some conversion would be needed for classification. To see the data, select "Prejit na datovy zdroj" from the last "Prozkoumat" drop-down menu. The data will be downloaded0
59
https://linked.opendata.cz/dataset/czso-job-applicants
Counts of registered unemployed people by region; but perhaps hard to use for classification0
61
Various datasets from
https://data.gov.cz/datov%C3%A9-sady
...many sources from Czech Republic0
62
http://opendata.praha.eu/dataset
...Prague sources0
63
Chembrolu Surya
https://www.netmetr.cz/open-data.html
Results of internet speed tests over a longer period of time; the goal could be to predict internet connection type (LAN, G4, ...)0
65
https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A1-sada%2Fhttp---opendata.praha.eu-api-3-action-package_show-id-ipr-bonita_klimatu_z_hlediska_prirozene_ventilace_uzemi
This dataset alone is just a map, indicating air bonity (not exactly air quality but the speed of air change, so that immission has perhaps not so bad effects). It would be fabulous to link it with some categorical description of the area (family houses, skyscrapers, ..) and predict air bonity based on "a picture" (i.e. to local observations; not that you would be processing pictures) and altitude0
66
http://www.geoportalpraha.cz/cs/fulltext_geoportal
Search for 'bonita' to see other possible maps/data. OR do not enter any keyword and only select e.g. 'budovy'0
67
https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A1-sada%2F3751165
Hundreds of thousands of tenders (verejne zakazky), including evaluation criteria, texts, etc. We could change this into various tasks, e.g. predicting price range based on keywords from the description, predicting relevant evaluation criteria based on keywords from the description or based on who is proposing the tender (i.a. to find e.g. municipalities known for obscure practices etc.)0
68
https://golemio.cz/cs/oblasti
Various datasets on Prague0
69
Tomáš SoučekVotes from Czech Parliamentapprox. 10-60k (x200)2
https://www.psp.cz/sqw/hp.sqw?k=1300
Votes of all deputies of the Czech Parliament from 1993 to the present. Can be used to predict how one or more deputies would vote given the votes of other deputiesC41111