| Name | Source | Tags | Size | Typology | Description | Relevant papers | Comments |
|------|--------|------|------|----------|-------------|-----------------|----------|
| CMU Movie Summary Corpus | http://www.cs.cmu.edu/~ark/personas/ | Text + Graphs + Numerical data | 46 MB (compressed) | Movie metadata, character metadata, and plot summaries, all provided as TXT files. | This dataset contains 42,306 movie plot summaries extracted from Wikipedia, along with aligned metadata extracted from Freebase. | Learning Latent Personas of Film Characters | |
| Beer reviews | https://drive.google.com/drive/folders/1Wz6D2FM25ydFw_-41I9uTwG9uNsN4TCF?usp=sharing | Text + Numerical data | 3.7 GB (compressed) | Metadata about users, beers, and breweries (CSV); user reviews and ratings (TXT). | This dataset consists of beer reviews from two beer-rating websites, BeerAdvocate and RateBeer, covering a period of 17 years, from 2001 to 2017. | Learning attitudes and attributes from multi-aspect reviews | Another relevant paper: When Sheep Shop: Measuring Herding Effects in Product Ratings with Natural Experiments |
| Wikispeedia | https://snap.stanford.edu/data/wikispeedia.html | Text + Graphs + Navigation behavior | 9.5 + 35 + 755 MB (compressed) | Navigation paths and the Wikipedia hyperlink graph, as well as the corresponding Wikipedia articles (full HTML and plaintext only). The data is provided in TXT files. | Wikispeedia is a human-computation game in which users are asked to navigate from a given source article to a given target article by clicking only Wikipedia links. This dataset contains human navigation paths on Wikipedia, collected through Wikispeedia. | Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts | |
| Wikipedia Requests for Adminship | http://snap.stanford.edu/data/wiki-RfA.html | Text + Graphs | 14 MB (compressed) | The data induces a directed, signed network in which nodes represent Wikipedia members and edges represent votes. Each vote is typically accompanied by a short textual comment. The data is provided in TXT format. | For a Wikipedia editor to become an administrator, a request for adminship (RfA) must be submitted, either by the candidate or by another community member. Subsequently, any Wikipedia member may cast a supporting, neutral, or opposing vote. This dataset contains 11,381 users (voters and votees) forming 189,004 distinct voter/votee pairs, for a total of 198,275 votes (larger than the number of distinct pairs because, if the same user ran for adminship several times, the same voter/votee pair may contribute several votes). | Exploiting Social Network Structure for Person-to-Person Sentiment Analysis | |
| YouNiverse | https://zenodo.org/record/4650046 | Time series; Big dataset | 111 GB (compressed, in total) | Metadata, popularity time series, and a co-commenting graph. | YouNiverse comprises metadata from over 136k channels and 72.9M videos published between May 2005 and October 2019, as well as channel-level time-series data with weekly subscriber and view counts. | YouNiverse: Large-Scale Channel and Video Metadata from English-Speaking YouTube | Large!! |
| Coronawiki | https://github.com/epfl-dlab/wiki_pageviews_covid | Time series | ~135 MB (compressed, in total) | interventions.csv: a series of important dates associated with the pandemic. For French, we considered the first dates on which the events happened in France; for English, some events were left unfilled because that Wikipedia language edition is too international; the other languages are each spoken mostly in one country. For descriptions of the mobility and normalcy thresholds, see the paper. Global_Mobility_Report.csv.gz / applemobilitytrends-2020-04-20.csv.gz: mobility data shared by Google and Apple, respectively, used to calculate the mobility and normalcy thresholds. topicslinked.csv.xz: the topics of each Wikipedia page, identified by a name (in the index) and a QID (in the column qid); the other columns are indicator variables that equal True if the page belongs to the corresponding topic and False otherwise. aggregated_timeseries.json: a JSON file with conveniently aggregated time series of Wikipedia pageviews. It contains two keys per Wikipedia language edition; for Italian, for instance, the keys "it" and "it.m". Keys ending in ".m" ("it.m") contain pageviews from mobile devices, while keys without the suffix contain pageviews from desktop devices. The remainder of the data is described in dictionary style, assuming the JSON has been loaded into a Python dictionary named "data"; for simplicity, we use the Italian desktop pageviews. data["it"]["len"]: the number of distinct Wikipedia pages considered for this language edition. data["it"]["sum"]: a time series with the total number of pageviews per day over the study period. data["it"]["covid"]: data related to COVID-related pages on Wikipedia. data["it"]["covid"]["len"]: the number of distinct COVID-related pages considered in this language. data["it"]["covid"]["sum"]: a time series with the total number of pageviews going to COVID-related pages per day over the study period. data["it"]["covid"]["percent"]: a time series with the percentage of pageviews going to COVID-related pages per day over the study period. data["it"]["topics"]: one key per topic, e.g., "STEM.Computing". data["it"]["topics"]["STEM.Computing"]: data related to the "STEM.Computing" topic. data["it"]["topics"]["STEM.Computing"]["len"]: the number of distinct "STEM.Computing"-related pages considered in this language. data["it"]["topics"]["STEM.Computing"]["sum"]: a time series with the total number of pageviews going to "STEM.Computing"-related pages per day over the study period. data["it"]["topics"]["STEM.Computing"]["percent"]: a time series with the percentage of pageviews going to "STEM.Computing"-related pages per day over the study period. | This dataset contains pageview statistics (i.e., Wikipedia access logs) for 12 Wikipedia language editions, as well as mobility reports published by Apple and Google. | Sudden Attention Shifts on Wikipedia During the COVID-19 Crisis | A minimal loading sketch is shown below the table. |
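
A minimal sketch of how the Coronawiki files described above might be loaded, following the file names and the nested layout of aggregated_timeseries.json as listed in the Typology column. The local paths and the use of pandas are illustrative assumptions, not part of the dataset's own documentation.

```python
import json

import pandas as pd

# Per-page topic labels: page name in the index, Wikidata QID in the "qid"
# column, and one boolean indicator column per topic (see Typology above).
# The local path is an assumption; pandas infers xz compression from the suffix.
topics = pd.read_csv("topicslinked.csv.xz", index_col=0)
stem_computing_pages = topics.index[topics["STEM.Computing"]]

# Aggregated pageview time series: two keys per language edition,
# e.g. "it" (Italian, desktop) and "it.m" (Italian, mobile).
with open("aggregated_timeseries.json") as f:
    data = json.load(f)

it_desktop = data["it"]
n_pages = it_desktop["len"]                  # distinct pages considered
total_views = pd.Series(it_desktop["sum"])   # total pageviews per day

# COVID-related pages: daily share of all pageviews.
covid_share = pd.Series(it_desktop["covid"]["percent"])

# Per-topic breakdown, e.g. the "STEM.Computing" topic.
stem_views = pd.Series(it_desktop["topics"]["STEM.Computing"]["sum"])
```

Under these assumptions, total_views and covid_share can then be aligned with the dates in interventions.csv to inspect the attention shifts analyzed in the paper.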