| Name | URL | Notes |
| Google Books nGrams | https://aws.amazon.com/datasets/8172056142375670 | No full text, only ngrams, but topic models still work. |
| Reuters News Corpus | http://about.reuters.com/researchandstandards/corpus/ | Standard NLP dataset; many international stories, with good coverage over time. |
| NLTK Corpora | http://www.nltk.org/nltk_data/ | Large collection of corpora and lexical resources for use with NLTK. |
| Enron Email Dataset | https://www.cs.cmu.edu/~./enron/ | Great for organizations scholars, but some texts are short and interactions are tough to code. |
| | http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-013-final.pdf | |
| Political Blogs 2008 | https://aws.amazon.com/public-data-sets/common-crawl/ | Huge dataset that covers front page text of nearly every site on the internet |
| Freebase Simple Topic Dump | https://aws.amazon.com/datasets/Encyclopedic/8247878934976180 | |
| Wikipedia Wikis | https://aws.amazon.com/datasets/Encyclopedic/4182 | |
| | https://aws.amazon.com/datasets/Encyclopedic/2506 | |
| Amazon Commerce Reviews Set | http://archive.ics.uci.edu/ml/datasets/Amazon+Commerce+reviews+set | |
| NSF Research Award Abstracts | http://archive.ics.uci.edu/ml/datasets/NSF+Research+Award+Abstracts+1990-2003 | |
| 20 Newsgroups | http://scikit-learn.org/stable/datasets/twenty_newsgroups.html | |
| New York Times Annotated Corpus | https://catalog.ldc.upenn.edu/LDC2008T19 | Available for purchase (collective action, etc.) |
| Daily Kos Blog Posts | https://code.google.com/p/graphlabapi/downloads/detail?name=daily_kos.tar.bz2&can=2&q= | |
| PubMed | http://deepdive.stanford.edu/doc/opendata/ | |
| Google Patents | https://cloud.google.com/blog/products/gcp/google-patents-public-datasets-connecting-public-paid-and-private-patent-data | |
| Tweets 2011 | http://trec.nist.gov/data/tweets/ | |
| Irish Discussion Boards | http://www.boards.ie/ | |
| Movie Review Data | http://www.cs.cornell.edu/people/pabo/movie-review-data/ | |
| Yelp Dataset | https://www.yelp.com/academic_dataset | Not available |
| Netflix Dataset | Netflix Prize Data Set - Academic Torrents | |
| Reddit Dataset | http://deepdive.stanford.edu/doc/opendata/ | |
| BMC BioInformatics | http://socialcomputing.asu.edu/datasets/Twitter | |
| D | | |
| Higgs Twitter Dataset | https://snap.stanford.edu/data/higgs-twitter.html | |
| ICWSM (Various) | http://icwsm.org/2013/datasets/datasets/ | |
| Million Song Dataset | http://labrosa.ee.columbia.edu/millionsong/musixmatch | Available as bag-of-words |
| Stanford Network Datasets | https://snap.stanford.edu/data/ | |
| Irvine Network Datasets | https://networkdata.ics.uci.edu/resources.php | |
| EUSpeech | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XPCVEI | |
| Chronicling America: America's Historic Newspapers | http://chroniclingamerica.loc.gov/about/api/ | |
| CourtListener | https://www.courtlistener.com/api/bulk-info/ | |
| TAGS | https://tags.hawksey.info/get-tags/ | Can regularly update a personal archive of tweets; one-time authentication. |
| Data World | https://data.world/ | A social-media-style site where people post datasets and projects; searchable and contains lots of data, though quality probably varies. |
| | https://data.stanford.edu/congress_text | U.S. Congressional Record |
| congressional-record | https://github.com/unitedstates/congressional-record | Build your own Record corpus, 1997 to the most recent available; includes unique IDs for speakers. |
| Data is Plural (various datasets) | https://www.data-is-plural.com/archive/ | |
| Stack Overflow | https://archive.org/details/stackexchange | CC-BY-licensed, very up-to-date copy of the central Q&A website for programmers (and many scientists). |