A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | edition | position | headline | text | links | hattips |
2 | 2015.10.21 | 1 | Every place name in the United States. | Sometimes, bureaucracy creates poetry. Since 1890, the U.S. Board on Geographic Names has been cataloguing, standardizing, and promulgating official names for the places we hike, swim, work, and call home. Along the way, it began publishing the Geographic Names Information System (GNIS), a searchable and downloadable database containing all of its domestic nomenclature. In Alaska alone, the database lists names for 167 dams, 303 post offices, 666 glaciers, 2,704 capes, and 9,575 streams. My favorite: Confusion Creek. [h/t @emilymbadger] | http://geonames.usgs.gov/index.html http://geonames.usgs.gov/domestic/index.html https://www.google.com/maps/place/Confusion+Creek,+Alaska/@68.4510925,-152.0233116,15.94z/data=!4m2!3m1!1s0x50d80cfac6a29911:0xc46bfa2a83d54866 | https://twitter.com/emilymbadger/status/653982851386310656 |
3 | 2015.10.21 | 2 | “There’s finally federal data on low-income college graduation rates—but it’s wrong.” | The Hechinger Report casts doubt on the Pell grant graduation numbers contained in the Department of Education’s recently-released College Scorecard. Why the discrepancy? “[W]hile schools are required by law to provide the graduation rates of Pell recipients to any applicants who ask, a loophole protects them from having to report the same figures to the government.” Oof. | http://hechingerreport.org/theres-finally-federal-data-on-low-income-college-graduation-rates-but-its-wrong/ https://collegescorecard.ed.gov/data/ | |
4 | 2015.10.21 | 3 | What police-related data does your city publish? | The Police Open Data Census, created by Code for America fellows in Indianapolis, is tracking “currently available open datasets about police interactions with citizens in the US,” including officer-involved shootings, use of force, and citizen complaints. The census currently covers 36 police departments. Related: The NYPD says it will start tracking all officer use-of-force incidents — not just gunfire — next year, the New York Times reports. | https://codeforamerica.github.io/PoliceOpenDataCensus/ http://www.nytimes.com/2015/10/01/nyregion/new-york-police-will-document-virtually-all-instances-of-force.html | |
5 | 2015.10.21 | 4 | How often do Wikipedia editors edit? | The Wikimedia Foundation has published a dataset enumerating monthly revision counts for every editor, across all of its wikis. The foundation is asking for help investigating a few perplexing trends. For example: Why has the number of “very active editors” — those with 100+ edits per month — increased while the number of merely “active” editors has plateaued? | https://blog.wikimedia.org/2015/09/25/wikipedia-editor-numbers/ | |
6 | 2015.10.21 | 5 | Four years of rejected license plates. | WNYC, through a freedom-of-information request to the New York DMV, obtained a list of vanity plate approvals and denials from late 2010 to late 2014. Among the denials: “RUBMYDUB,” “S5SS5S5S,” “RFLMAO,” and “CBSNEWS.” (Strangely, “NBC4” was approved. Go figure.) The files and related story were published in August, but the data are timeless. [h/t @veltman] | https://github.com/datanews/license-plates http://www.wnyc.org/story/new-yorkers-vanity-license-plates/ | https://twitter.com/veltman/status/628972777882652672 |
7 | 2015.10.28 | 1 | Data-shaming the robocallers. | If you can’t beat ‘em, post spreadsheets about ‘em. Earlier this month, the Federal Communications Commission started publishing a dataset of complaints against telemarketers and robocalls. The FCC says the file will be updated weekly. It’s already being put to use: A clever programmer has crammed all the offending numbers into a single phone “contact” so that you can block them all at once. [h/t Shale Craig] | https://consumercomplaints.fcc.gov/hc/en-us/articles/205239443-Data-on-Unwanted-Calls https://github.com/shalecraig/telemarketing | https://twitter.com/__shale__/status/657423817623506944 |
8 | 2015.10.28 | 2 | The demographics of traffic stops. | This weekend, the New York Times published a front-page article on “the disproportionate risk of driving while black.” Among other findings: “officers were more likely to conduct [searches] when the driver was black, even though they consistently found drugs, guns or other contraband more often if the driver was white.” The investigation drew on several statewide traffic-stop datasets that track the race and gender of stopped drivers. The “seven states with the most sweeping reporting requirements,” in order of how easy it seems (to me) to get detailed data: Connecticut, North Carolina, Missouri, Nebraska, Maryland, Illinois, and Rhode Island. | http://www.nytimes.com/2015/10/25/us/racial-disparity-traffic-stops-driving-black.html http://ctrp3.ctdata.org/ http://trafficstops.ncdoj.gov/Default.aspx?pageid=2 https://www.ago.mo.gov/home/vehicle-stops-report http://www.ncc.nebraska.gov/statistics/trafficstops/ http://www.goccp.maryland.gov/msac/law-enforcement.php http://www.idot.illinois.gov/transportation-system/local-transportation-partners/law-enforcement/illinois-traffic-stop-study http://www.ri.gov/press/view/23152 | |
9 | 2015.10.28 | 3 | Where do Americans spend their days? | Most population numbers tell you where people live. But legions of Americans commute for work across city, county, and state lines. The Census Bureau’s Commuter-Adjusted Daytime Population Data accounts for these daily migrations. Manhattan’s (non-tourist) population doubles from 1.5 million to 3 million, by far the largest influx by raw numbers. But Lake Buena Vista, Fla., takes the percentage-growth prize. The city’s entire resident population could fit in two sedans, but its “daytime population” includes 33,000 workers — including a not-insubstantial number dressed as Mickey Mouse. [h/t Steven Romalewski] | https://www.census.gov/hhes/commuting/data/daytimepop.html https://en.wikipedia.org/wiki/Lake_Buena_Vista,_Florida | https://twitter.com/SR_spatial/status/656827844128034816 |
10 | 2015.10.28 | 4 | Finally, free access to detailed U.S. import/export data. | Prior to October 15th, the Census Bureau’s USA Trade Online tool cost $300/year. No longer. The newly-free dataset covers more than 17,000 commodities, including a category for “magic tricks, practical joke articles; parts and accessories.” [h/t Noah Veltman] | https://usatrade.census.gov/ http://www.census.gov/newsroom/press-releases/2015/cb15-tps87.html https://en.wikipedia.org/wiki/Harmonized_Tariff_Schedule_for_the_United_States http://www.census.gov/foreign-trade/statistics/graphs/GOTM/201508/index.html | https://twitter.com/veltman |
11 | 2015.10.28 | 5 | Porn. | Sexualitics.org is on a mission: “to contribute to human sexuality understanding through a Big Data approach.” Last year, the site posted detailed metadata on 800,000 adult videos, including titles, descriptions, view counts, and tags. It powers Porngram, an only-kinda-safe-for-work charting tool. | http://sexualitics.org/ http://sexualitics.github.io/ http://porngram.sexualitics.org/ | |
12 | 2015.11.04 | 1 | Maternity leave policies at hundreds of American companies. | The 600+ entries in this searchable, sortable database range from 3M to Amazon to Zynga, and list both paid and unpaid leave. The database, run by the women-in-the-workplace website FairyGodBoss.com, culls from published policies and employee tips. An introductory blog post provides more information. | https://fairygodboss.com/maternity-leave-resource-center https://www.fairygodboss.com/ http://blog.fairygodboss.com/2015/10/21/our-maternity-leave-database-is-here/ | |
13 | 2015.11.04 | 2 | MoMA, mo’ data. | This July, the Museum of Modern Art published a dataset containing 120,000 artworks from its catalog, joining the UK’s Tate, the Smithsonian’s Cooper Hewitt, and other forward-thinking museums. The MoMA data contains the names of the artwork and artist, the dates created and acquired, and the medium — but no images. Related: Artist Jer Thorp encourages you to “perform” the data. Also related: Every museum in the United States. [h/t Nadja Popovich] | https://github.com/MuseumofModernArt/collection https://github.com/tategallery/collection https://github.com/cooperhewitt/collection http://www.penn.museum/collections/data.php https://www.rijksmuseum.nl/en/api https://www.brooklynmuseum.org/opencollection/api/ https://medium.com/@blprnt/a-sort-of-joy-1d9d5ff02ac9 https://www.imls.gov/research-evaluation/data-collection/museum-universe-data-file | https://twitter.com/popovichn |
14 | 2015.11.04 | 3 | All licensed firearm dealers since 2010. | The Bureau of Alcohol, Tobacco, Firearms, and Explosives publishes a searchable and downloadable licensing database. License-holders fall into eleven categories. Among them: run-of-the-mill dealers, ammunition manufacturers, collectors of “curios and relics,” pawnbrokers, and importers of “destructive devices.” The ATF’s website contains monthly and state-by-state archives. [h/t Marc DaCosta] [Correction, 2015-11-04: There are only nine categories of license-holders. The published ATF data includes only eight of them; it does not include "Collector of Curios and Relics." Thanks to @MikeStucka for flagging this mistake.] | https://data.atf.gov/Licensees/Federal-Firearms-Licensee-Listing-2010-to-2015/qg4c-kex6 https://www.atf.gov/firearms/curios-relics https://www.atf.gov/firearms/firearms-guides-importation-verification-firearms-national-firearms-act-definitions-1 https://www.atf.gov/firearms/listing-federal-firearms-licensees-ffls-2015 | https://twitter.com/marc_dacosta |
15 | 2015.11.04 | 4 | One thousand ways to say “dog.” | Trans-New Guinea is the world’s third-largest language family. But it’s also among the poorest-studied. TransNewGuinea.org, an online database launched in 2013, is trying to change that. It now contains more than 1,000 New Guinea languages and lists 145,000 word translations — including 1,065 entries for “dog.” It even has an API. A recent PLOS ONE journal article provides additional background and statistics. [h/t Simon J. Greenhill] | http://transnewguinea.org/ http://transnewguinea.org/language/ http://transnewguinea.org/word/dog http://transnewguinea.org/api/v1/?format=json http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141563 | https://twitter.com/simonjgreenhill |
16 | 2015.11.04 | 5 | When planes attack. | Last May, a Gulfstream G150 taking off from Houston’s Ellington Airport struck an armadillo. The animal’s remains were collected, but were not sent to the Smithsonian Institution for identification. This anecdote comes from a single row in the Federal Aviation Administration’s Wildlife Strike Database, and draws on just seven of the 94 available fields. The database contains more than 168,000 strikes reported since 1990, almost all involving birds. Roughly 10% of the time, the animal's remains are sent to the Smithsonian's Feather Identification Lab. [h/t Dan Vergano] | http://wildlife.faa.gov/database.aspx http://www.faa.gov/airports/airport_safety/wildlife/smithsonian/ | https://twitter.com/dvergano |
17 | 2015.11.11 | 1 | Naughty companies. | Good Jobs First’s Violation Tracker calls itself “the first national search engine on corporate misconduct.” The new database currently contains nearly 100,000 penalties for environmental, health, and safety violations — sourced from 13 U.S. regulatory agencies — since 2010. Search results can be downloaded as CSV files, which contain a few additional fields. (Tip: Search for “*” to get all cases.) The largest single fine? The Department of Justice’s $20.8 billion penalty this year against BP. [h/t Samuel Rubenfeld] | http://www.goodjobsfirst.org/violation-tracker http://www.goodjobsfirst.org/violation-tracker-data-sources http://violationtracker.goodjobsfirst.org/parent/bp | https://twitter.com/srubenfeld/status/658980441387638784 |
18 | 2015.11.11 | 2 | The 139,756 side effects of 1,430 medical drugs. | The Side Effect Resource, a.k.a. SIDER, takes all the fine print from drug labels, and aggregates the information about side effects into a searchable, downloadable database. SIDER got a major upgrade last month, and now contains 40% more drug-effect pairs than before. The website incorporates both generic and brand names, so that searches for “Prozac” and “fluoxetine” bring you to the same page. | http://sideeffects.embl.de/ http://nar.oxfordjournals.org/content/early/2015/10/19/nar.gkv1075.abstract http://sideeffects.embl.de/drugs/3386/ | |
19 | 2015.11.11 | 3 | Albuquerque’s impressive open-data program. | The New Mexico city publishes dozens of regularly-updated, well-documented datasets. Among them: government employee earnings, the number of daily visitors to the city’s swimming pools, real-time bus locations, the geography of police beats, and the city’s complete vendor checkbook. [h/t Tom Johnson, who emailed Data Is Plural to praise how Albuquerque is sharing its data: “I have not found any other city in the world doing so in such detail.”] | http://www.cabq.gov/abq-data | http://online.sfsu.edu/jjohnson/ |
20 | 2015.11.11 | 4 | 1.8 billion pages of books (and booklike things). | Earlier this year, the HathiTrust Research Center released a massive dataset extracted from 4.8 million digitized volumes. For each of its 1.8 billion pages, the dataset includes word frequencies, languages used, and sentence counts, among other features. | https://www.hathitrust.org/htrc-releases-massive-dataset https://portal.htrc.illinois.edu/features | |
21 | 2015.11.11 | 5 | Deadly Prussian horses. | For his 1898 book, The Law of Small Numbers, statistician Ladislaus Bortkiewicz tabulated the number of Prussian cavalrymen killed by horse kicks each year between 1875 and 1894. (In total, 196 suffered that tragic fate.) The dataset is tiny, but boasts an outsized legacy: Bortkiewicz’s lethal horse kicks allegedly helped to popularize the then-obscure Poisson distribution. [h/t Noah Veltman] | https://en.wikipedia.org/wiki/Ladislaus_Bortkiewicz http://www.math.uah.edu/stat/data/HorseKicks.html http://mindyourdecisions.com/blog/2013/06/21/what-do-deaths-from-horse-kicks-have-to-do-with-statistics/ | https://twitter.com/veltman |
22 | 2015.11.18 | 1 | Follow the F-17s. | The Arms Transfer Database tracks the international flow of major weapons — artillery, missiles, military aircraft, tanks, and the like. Maintained by the Stockholm International Peace Research Institute (SIPRI), the database contains documented sales since 1950 and is updated annually. SIPRI provides a download tool, which outputs rich-text files, but it’s also possible to download the data as CSV. [h/t Martín González] | http://www.sipri.org/databases/armstransfers/armstransfers http://www.sipri.org/databases/armstransfers/armstransfers/background#Coverage http://www.sipri.org/databases https://gist.github.com/jsvine/9cb3300588ed402160fe | https://twitter.com/martgnz |
23 | 2015.11.18 | 2 | #campaign. | The 2016 presidential hopefuls have been tweeting, ‘gramming, and ‘booking like a pack of millennials. Fusion collected nearly 70,000 images from the candidates’ social media accounts, then pumped the pictures through an automated tagging system. Now you can search for guns, money, beer and more — or download the raw data for your own analysis. | http://fusion.net/story/229021/2016-presidential-campaign-images/ http://fusion.net/interactive/213317/lose-yourself-in-our-massive-searchable-collection-of-candidates-social-media-photos/#tag=gun&order=date-desc http://fusion.net/interactive/213317/lose-yourself-in-our-massive-searchable-collection-of-candidates-social-media-photos/#tag=cash&order=date-desc http://fusion.net/interactive/213317/lose-yourself-in-our-massive-searchable-collection-of-candidates-social-media-photos/#tag=beer&order=date-desc http://fusion.net/interactive/213317/lose-yourself-in-our-massive-searchable-collection-of-candidates-social-media-photos/ | |
24 | 2015.11.18 | 3 | America’s exonerees. | The National Registry of Exonerations contains “every known exoneration in the United States since 1989—cases in which a person was wrongly convicted of a crime and later cleared of all the charges based on new evidence of innocence.” For each of the 1,702 cases, the registry includes details about the exoneree, the crime, and the factors — such as new DNA evidence — that contributed to the exoneration. [h/t agate] | http://www.law.umich.edu/special/exoneration/Pages/about.aspx http://www.law.umich.edu/special/exoneration/Pages/detaillist.aspx | http://agate.readthedocs.org/en/1.1.0/tutorial.html |
25 | 2015.11.18 | 4 | Health data, unprotected. | Under the HITECH Act of 2009, companies must notify the government of any data breach involving the HIPAA-protected health data of 500 or more people. Summaries of those reports are available at the Department of Health and Human Services’s Breach Portal, which currently contains more than 1,300 incidents. Related: In April, JAMA published an analysis of the breaches. Also related: Forty years of legislative acronyms. [h/t Virginia Hughes] | http://www.hhs.gov/ocr/privacy/hipaa/understanding/summary/ https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf http://jama.jamanetwork.com/article.aspx?articleid=2247135 http://noahveltman.com/acronyms/ | https://twitter.com/virginiahughes |
26 | 2015.11.18 | 5 | Britain’s booze. | What contains 34,052 bottles and is worth an estimated £3 million? The United Kingdom’s official wine cellar, which provides libations for the government’s guests and hosts — and a dram of data for the public. Between April 2014 and March 2015, the cellar’s clients consumed more than 5,500 bottles of wine and liquor. Among them: 205 bottles of Champagne, 51-and-a-half bottles of gin, and one bottle of Château Pichon-Longueville Comtesse de Lalande 1986. [h/t Nadja Popovich] | https://www.gov.uk/government/collections/government-wine-cellar https://www.gov.uk/government/publications/annual-statement-on-the-government-wine-cellar-for-the-financial-year-2014-to-2015 | https://twitter.com/popovichn |
27 | 2015.11.25 | 1 | Complaints against Chicago police. | The newly-launched Citizens Police Data Project has collected more than 56,000 allegations of police misconduct. The data, covering 2002-2008 and 2011-2015, includes demographic information about the complainant and the officer, as well as the type and location of the incident. Click here to download the raw data. Related: The City of Chicago’s wide-ranging data portal includes a spreadsheet of every reported crime in the city since 2001; you can explore neighborhood trends via the Chicago Tribune. [h/t Melissa Segura and Abraham Epton] | http://cpdb.co/landing/ http://cpdb.co/#!/data-tools/bVyoBL/citizens-police-data-project http://j.mp/chicagopolicemisconductdata https://data.cityofchicago.org/ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 http://crime.chicagotribune.com/ | https://twitter.com/melissadsegura https://twitter.com/aepton |
28 | 2015.11.25 | 2 | Refugees in America. | The Department of State publishes demographic reports on refugee arrivals since 2002. The data includes country of origin, resettlement city and state, religion, age, gender, and more. Related: At BuzzFeed, I used the data to chart the past decade of refugee arrivals. Also related: The UN’s refugee data portal. | http://www.wrapsnet.org/Reports/InteractiveReporting/tabid/393/Default.aspx http://www.buzzfeed.com/jsvine/where-us-refugees-come-from-and-go-in-charts http://popstats.unhcr.org/en/overview | |
29 | 2015.11.25 | 3 | 1.7 billion Reddit comments. | You can download every comment posted to Reddit since October 2007 … but you’ll need some patience and a terabyte of storage. If you’re more of the instant-gratification, don’t-have-an-external-hard-drive-lying-around type, you might enjoy FiveThirtyEight’s “How The Internet* Talks,” a sort of Google Ngrams for the Reddit data. [h/t Randall Olson and Ritchie King] | https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ http://projects.fivethirtyeight.com/reddit-ngram/ https://books.google.com/ngrams | https://twitter.com/randal_olson/status/667092303392194560 https://twitter.com/RitchieSKing |
30 | 2015.11.25 | 4 | The most popular government web pages. | The U.S. government has one very large Google Analytics account, and has begun sharing traffic data with the public. Not every federal website is accounted for, but more than 4,000 are. Over the past 90 days, they’ve racked up approximately 1.5 billion visits. The most popular page at the time of this writing? Weather.gov. Bonus: How they built it. [h/t Rebecca Williams] | https://analytics.usa.gov/ http://www.weather.gov/ https://18f.gsa.gov/2015/03/19/how-we-built-analytics-usa-gov/ | https://twitter.com/internetrebecca/status/662102448262238209 |
31 | 2015.11.25 | 5 | A century of pumpkin pie. | In 2011, the New York Public Library launched a crowdsourcing project to transcribe its massive collection of restaurant menus, dating back to the 1850s. So far, volunteers have transcribed more than 1.3 million dishes, their prices, and where on the menu each dish appeared. The library publishes a spreadsheet of all the data, and updates it twice a month. Happy Thanksgiving! | http://menus.nypl.org/ http://menus.nypl.org/data http://menus.nypl.org/search?utf8=%E2%9C%93&query=%22thanksgiving%2Bturkey%22%2BOR%2B%22thanksgiving%2Bmeal%22%2BOR%2B%22thanksgiving%2Bdinner%22 | |
32 | 2015.12.02 | 1 | Historical climate data. | The National Centers for Environmental Information maintains more than 20 petabytes of data, it says. Among the most useful slices is the Global Historical Climatology Network’s data, which aggregates reports on temperature, precipitation, wind, and more from tens of thousands of climate-monitoring stations around the world. One tidbit: January 1995 was Death Valley’s wettest month since at least the 1960s, with a whopping 2.59 inches of precipitation. | https://www.ncei.noaa.gov/ https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn http://journals.ametsoc.org/doi/full/10.1175/JTECH-D-11-00103.1 | |
33 | 2015.12.02 | 2 | Mass shootings in America. | ShootingTracker.com provides datasets listing all U.S. mass shootings — defined as “when four or more people are shot in an event, or related series of events” — since 2013. So far in 2015, mass shootings have killed 447 people and wounded an additional 1,292. | http://shootingtracker.com/wiki/Main_Page | |
34 | 2015.12.02 | 3 | A faster way to download open data. | Socrata’s software powers open-data portals around the world. But downloading large datasets — e.g., this 2.8-gigabyte dataset of NYC parking tickets — from Socrata-powered portals can feel, well, sluggish. One solution: OpenDataCache.com, a free website that provides faster-to-download versions of virtually every dataset from 50+ Socrata portals. Related: Thomas Levine’s detailed analyses of Socrata-powered portals, published in 2013 and 2014. [h/t John Krauss and Steven Romalewski] | https://en.wikipedia.org/wiki/Socrata https://data.cityofnewyork.us/dataset/Parking-Violations-Issued-Fiscal-Year-2015/c284-tqph http://www.opendatacache.com/ https://thomaslevine.com/search/?q=socrata&models=articles.article | https://twitter.com/recessionporn/status/569267639358504960 https://twitter.com/sr_spatial |
35 | 2015.12.02 | 4 | College sports financing. | The Huffington Post and Chronicle of Higher Education teamed up to investigate how colleges bankroll their athletics. (Georgia State, for example, spent more than $100 million subsidizing sports between 2010 and 2014, mostly via student fees.) The report, published last week, draws on five years of revenue/expense reports from 234 Division I public universities. You can download the raw data or explore it online. Related: The Washington Post also tackled this topic — from a slightly different angle — last week, examining the profitability (or lack thereof) of athletic programs at 48 schools. [h/t Shane Shifflett] | http://projects.huffingtonpost.com/projects/ncaa/sports-at-any-cost http://projects.huffingtonpost.com/projects/ncaa/subsidy-scorecards/eastern-kentucky-university http://projects.huffingtonpost.com/ncaa/reporters-note http://projects.huffingtonpost.com/projects/ncaa/subsidy-scorecards http://www.washingtonpost.com/sf/sports/wp/2015/11/23/running-up-the-bills/ | https://twitter.com/shaneshifflett |
36 | 2015.12.02 | 5 | Celebrity faces, annotated. | The CelebA dataset, published in September, contains 200,000+ images of 10,000+ celebrities, each annotated with 40 yes/no variables. Some favorites: “5_o_Clock_Shadow,” “Bags_Under_Eyes,” and “Goatee.” | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | |
37 | 2015.12.09 | 1 | The 2015 Global Open Data Index, released last night. | Open Knowledge International has just published its latest survey of openly available government data. This year’s audit includes 112 countries and territories, up from 97 last year. The survey scores each based on the availability of datasets in 13 key categories (e.g., “election results,” “government spending,” and “pollutant emissions”) and links out to the available datasets. In this year’s survey, Taiwan ranks first, the U.K. second, and Denmark third. The U.S. ranks eighth. | http://index.okfn.org/place/ http://index.okfn.org/methodology/ | |
38 | 2015.12.09 | 2 | More data (and discussion) on mass shootings. | Last week, Data Is Plural highlighted ShootingTracker.com, a source for data on shootings that wounded at least four people. Other resources include the Gun Violence Archive and Mother Jones’ detailed database of mass shootings since 1982. The Mother Jones database takes a narrower approach, focusing on shootings that killed at least four people in a public setting. In a New York Times op-ed, published shortly after last week’s San Bernardino shooting, the editor behind that database argues that broader methodologies don’t distinguish between “a 1 a.m. gang fight” and “the madness that just played out in Southern California.” A Washington Post article weighs the pros and cons of broader and narrower approaches. [h/t Robin Shields + Mark Follman + Christopher Ingraham] | https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-02-edition http://shootingtracker.com/wiki/Main_Page http://www.gunviolencearchive.org/methodology http://www.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data http://www.nytimes.com/2015/12/04/opinion/how-many-mass-shootings-are-there-really.html https://www.washingtonpost.com/news/wonk/wp/2015/12/03/what-makes-a-mass-shooting-in-america/ | https://twitter.com/robinshields https://twitter.com/markfollman/status/672564051889623040 https://twitter.com/_cingraham/status/672576536608907264 |
39 | 2015.12.09 | 3 | Firearm background checks. | Gun dealers use the FBI’s National Instant Criminal Background Check System to determine whether someone is allowed to buy a firearm. There isn’t a one-to-one correlation between these background checks and gun sales, but they’re said to be the best available proxy. The FBI publishes a PDF tallying the monthly number of firearm checks for each state and type. At BuzzFeed News, we’ve parsed that PDF into a CSV/spreadsheet for easier use. | https://www.fbi.gov/about-us/cjis/nics http://www.thetrace.org/2015/11/black-friday-gun-sales-background-checks/ https://www.fbi.gov/about-us/cjis/nics/reports/nics_firearm_checks_-_month_year_by_state_type.pdf https://github.com/BuzzFeedNews/nics-firearm-background-checks | |
40 | 2015.12.09 | 4 | Good FOOD, bad food. | The CDC’s Foodborne Outbreak Online Database (FOOD) contains 18,000+ outbreaks, which resulted in 358,000+ illnesses and 13,000+ hospitalizations, from 1998 through last year. In 2008, a multi-state Salmonella Saintpaul outbreak hospitalized 308 people — the highest count in the database. | http://wwwn.cdc.gov/foodborneoutbreaks/ | |
41 | 2015.12.09 | 5 | Know thy barber. | The Texas Department of Licensing and Regulation maintains a webpage of well-formatted data on state-licensed workers, including tow truck operators, boxing judges, journeyman electricians, elevator inspectors, manicurists, and, yes, barbers. [h/t Ryan Murphy] | http://www.license.state.tx.us/licensesearch/licfile.asp | https://twitter.com/rdmurphy/status/642427166689509376 |
42 | 2015.12.16 | 1 | Policing the police. | The Department of Justice is authorized to investigate police departments that display a “pattern or practice” of civil rights violations. In April, the Marshall Project began publishing a spreadsheet of the DOJ investigations into local law enforcement. The dataset, which is updated regularly, indicates when each case began, when it ended, and what type of agreement (if any) was reached. The latest entry: An investigation into the Chicago Police Department, announced last week. Related: PBS Frontline's interactive map of DOJ investigations. [h/t Tom Meagher] | https://github.com/themarshallproject/doj14141/blob/master/data/doj_data.csv https://github.com/themarshallproject/doj14141#the-department-of-justices-14141-civil-rights-investigations http://apps.frontline.org/fixingtheforce/ | https://twitter.com/ultracasual |
43 | 2015.12.16 | 2 | All the world’s glaciers. | The recently-updated Randolph Glacier Inventory contains spreadsheets and outlines of every known glacier in the world. Of the 212,000+ glaciers inventoried, more than 27,000 are in Alaska. Someone please adopt Deserted Glacier. [h/t Robin Wilson’s stunningly extensive directory of free GIS data] | http://www.glims.org/RGI/rgi50_dl.html https://www.google.com/maps/place/Deserted+Glacier,+Alaska+99686/@60.9786026,-145.6392684,7075m/data=!3m1!1e3!4m2!3m1!1s0x56b6f38f0ce35db9:0x1f9d53f4331c53fc | http://freegisdata.rtwilson.com/ |
44 | 2015.12.16 | 3 | College coaching salaries. | Last week, USA Today released its annual accounting of assistant — yes, assistant — college football coaches’ salaries. At $1.6 million per annum, Auburn’s Will Muschamp leads the pack. More than 371 assistants have salaries of $250,000+. The release complements the publication’s database of head-coaching salaries. Related: Each state’s highest paid public employee, as of 2013-ish. [h/t Steve Berkowitz] | http://sports.usatoday.com/ncaa/salaries/football/assistant http://sports.usatoday.com/ncaa/salaries/football/coach http://deadspin.com/infographic-is-your-states-highest-paid-employee-a-co-489635228 | https://twitter.com/ByBerkowitz/status/674653175119536129 |
45 | 2015.12.16 | 4 | Many pants on fire. | You’ve probably heard of PolitiFact, the Tampa Bay Times project that fact-checks what politicians say. What you might not know: PolitiFact has an API. You can use it to fetch detailed data from the project’s national and state-level editions. Related: “All Politicians Lie. Some Lie More Than Others,” PolitiFact’s top editor writes in the New York Times. | http://www.politifact.com/ http://static.politifact.com/api/v2apidoc.html http://www.nytimes.com/2015/12/13/opinion/campaign-stops/all-politicians-lie-some-lie-more-than-others.html | |
46 | 2015.12.16 | 5 | Every obscenity and death in Quentin Tarantino's movies. | This dataset is fucking amazing. | https://github.com/fivethirtyeight/data/tree/master/tarantino | |
47 | 2015.12.23 | 1 | How America injures itself. | Every year, the U.S. Consumer Product Safety Commission tracks emergency room visits to approximately 100 hospitals. The commission uses the resulting National Electronic Injury Surveillance System data to estimate national injury statistics, but it also publishes anonymized information for each consumer product–related visit, including the associated product code (e.g., 1701: “Artificial Christmas trees”) and a short narrative (“71 YO WM FRACTURED HIP WHEN GOT DIZZY AND FELL TAKING DOWN CHRISTMAS TREE AT HOME”). | http://www.cpsc.gov/en/Research--Statistics/NEISS-Injury-Data/ http://www.cpsc.gov//Global/Neiss_prod/completemanual.pdf | |
48 | 2015.12.23 | 2 | Farm to data-table. | The USDA’s 2012 Census of Agriculture — the most recent vintage available — tallies agricultural activity at the national, state, and county levels. You can download detailed data from the agency’s Quick Stats tool. In 2012, Oregon harvested more Christmas trees than any other state: 6.8 million of them, or 39% of the census total. [Correction, 2015-12-23: The Oregon numbers incorrectly referenced 2007 data. In 2012, Oregon harvested 6.4 million trees, or 37% of the census total. Thanks to @JoeMurph for flagging this mistake.] | http://www.agcensus.usda.gov/Publications/2012/ http://quickstats.nass.usda.gov/?source_desc=CENSUS http://www.agcensus.usda.gov/Publications/2012/Full_Report/Volume_1,_Chapter_2_US_State_Level/st99_2_035_035.pdf | |
49 | 2015.12.23 | 3 | Wikipedia traffic trends. | The Wikimedia Foundation publishes hourly pageview counts for each of its articles. It’s a tremendous amount of data — about 90 megabytes, compressed, per hour. Luckily, there’s also a tool for browsing individual pages’ daily traffic stats. Last Wednesday, the English-language page for "Christmas tree" received 7,822 visits, its highest mark so far this year. | http://dumps.wikimedia.org/other/pagecounts-raw/ http://stats.grok.se/ https://en.wikipedia.org/wiki/Christmas_tree http://stats.grok.se/en/latest90/Christmas_tree | |
50 | 2015.12.23 | 4 | Little’s big tree maps. | The Forest Service has digitized many of the tree species distribution maps from Elbert Little's “Atlas of United States Trees,” first published in the 1970s. Shapefiles and PDFs are available for more than 600 species — including Ilex opaca (American holly) and Pseudotsuga menziesii (Douglas fir). | http://esp.cr.usgs.gov/data/little/ | |
51 | 2015.12.23 | 5 | The emojiverse. | The Unicode Consortium publishes a big ol’ HTML table of every emoji, how they look in various contexts, and when they entered the canon. The “Christmas tree” emoji occupies code point U+1F384, and was introduced in 2010. (“Menorah with nine branches” arrived in 2015.) [h/t Ben Collins] | http://unicode.org/emoji/charts/full-emoji-list.html | https://twitter.com/benlcollins/status/676873468307095552 |
52 | 2015.12.30 | 1 | New Orleans slave sales, 1856–1861. | A new study in the American Economic Review suggests that slaveholders in the South underestimated the odds of “emancipation without compensation.” To reach its conclusions, researchers compiled a dataset of 15,377 slave sales, culled from remarkably detailed official records. Data for each sale includes demographic information about the slaves, seller, and buyer; the price paid; payment method; and researcher notes. | https://www.aeaweb.org/articles.php?doi=10.1257/aer.20131483 | |
53 | 2015.12.30 | 2 | Medicare’s priciest drugs. | Last week, the Centers for Medicare & Medicaid Services published a new drug-spending dataset. It focuses on medications that (a) cost the most, overall; (b) cost the most per patient; or (c) saw the largest price-hike between 2013 and 2014. Vimovo, an arthritis pain reliever, tops the price-hike rankings: Between 2013 and 2014, the average cost per unit increased more than sixfold, from $1.94 to $12.46. [h/t Virginia Hughes] | https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/ | https://twitter.com/virginiahughes |
54 | 2015.12.30 | 3 | Millions of home loans. | Over the weekend, the Seattle Times and BuzzFeed News published an investigation into Clayton Homes, a company that is owned by Warren Buffett's Berkshire Hathaway and that “has grown to dominate virtually every aspect of America’s mobile-home industry.” The investigation draws on data released through the Home Mortgage Disclosure Act. The law requires large lenders to publish details about each of their loans. You can download the raw data from the FFIEC, or slightly user-friendlier versions from the CFPB. [h/t Mike Baker + Dan Wagner] | http://www.buzzfeed.com/danielwagner/warren-buffetts-predatory-lender-charges-minorities-a-lot-mo http://www.consumerfinance.gov/hmda/learn-more https://www.ffiec.gov/hmda/hmdaflat.htm http://www.consumerfinance.gov/hmda/explore | https://twitter.com/bymikebaker https://twitter.com/wagnerreports |
55 | 2015.12.30 | 4 | Every known satellite orbiting Earth. | The Union of Concerned Scientists’s Satellite Database currently contains 1,305 entries and is updated “roughly quarterly.” The longest-orbiting: AMSAT-OSCAR 7, an amateur radio satellite launched in November 1974. Related: The satellites, visualized. [h/t David Yanofsky] | http://www.ucsusa.org/nuclear-weapons/space-weapons/satellite-database.html https://en.wikipedia.org/wiki/AMSAT-OSCAR_7 http://qz.com/296941/interactive-graphic-every-active-satellite-orbiting-earth/ | https://twitter.com/YAN0/status/678953014535716864 |
56 | 2015.12.30 | 5 | Things lost (and not yet found) on the New York subway. | Among them: 37,622 cellphones; 3,604 hats; 1,903 scarves; 1,017 birth certificates; 483 diaries; 115 VHS tapes; 82 violins; 41 GPS navigation systems; and 9 answering machines. At least one of the 2,756 umbrellas is mine. [h/t Mona Chalabi + Allison McCann + Noah Veltman] | http://advisory.mtanyct.info/LPUWebServices/CurrentLostProperty.aspx | http://fivethirtyeight.com/datalab/mta-new-york-lost-and-found-subway-most-common/ https://github.com/atmccann/mta-lost-found/ https://twitter.com/veltman |
57 | 2016.01.06 | 1 | One year of fatal police encounters. | After it became clear that the federal government was doing an awful job of keeping track of how often police kill civilians, two newspapers started counting last year. According to The Guardian’s tally, U.S. police killed 1,136 people in 2015. The Washington Post’s count — which focused on shootings only and didn’t include off-duty officers — counted 984 deaths. Both organizations provide methodologies and downloadable datasets (including demographic and geographic details): Guardian / WaPo. | http://graphics.wsj.com/justifiable-homicides-by-police/ http://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database https://www.washingtonpost.com/graphics/national/police-shootings/ http://www.theguardian.com/us-news/ng-interactive/2015/jun/01/about-the-counted https://www.washingtonpost.com/national/how-the-washington-post-is-examining-police-shootings-in-the-us/2015/06/29/f42c10b2-151b-11e5-9518-f9e0a8959f32_story.html | |
58 | 2016.01.06 | 2 | The World Atlas of Language Structures. | This database compares the phonological, grammatical, and lexical properties of hundreds of languages. One dataset looks at languages’ counting systems. (Many use the decimal system, but Yoruba uses the vigesimal system and Danish uses a hybrid.) Others examine the use of tone, how you say “tea”, and whether there are different words for “finger” and “hand”. [h/t Jacqui Maher] | http://wals.info/feature http://wals.info/feature/131A#2/21.3/132.2 https://en.wikipedia.org/wiki/Vigesimal http://wals.info/feature/13A#2/19.3/152.9 http://wals.info/feature/138A#2/25.5/143.8 http://wals.info/feature/130A#2/14.9/153.6 | https://twitter.com/jacqui/status/661943332877279232 |
59 | 2016.01.06 | 3 | NYC felonies. | The historically opaque New York Police Department has finally started publishing incident-level felony data — something that cities such as Chicago and Boston have done for years. The dataset includes the date, time, and approximate location of each offense. It currently covers the first nine months of 2015 and will (apparently) be updated quarterly. Don’t miss the footnotes in this PDF. Related: Some initial insights. Also related: “Which Cities Share The Most Crime Data?” [h/t Dan Nguyen + Mark Silverberg] | http://gothamist.com/2014/03/21/nypd_transparency.php https://data.cityofnewyork.us/Public-Safety/NYPD-7-Major-Felony-Incidents/hyij-8hr7 http://www.nyc.gov/html/nypd/downloads/pdf/analysis_and_planning/NYPDIncidentLevelDataFootnotes.pdf http://iquantny.tumblr.com/post/136641945194/your-neighborhoods-crime-rank-insights-from-the http://fivethirtyeight.com/features/which-cities-share-the-most-crime-data/ | https://twitter.com/dancow/status/682302336220409856 https://twitter.com/skram/status/682287240148574208 |
60 | 2016.01.06 | 4 | Refugee arrivals along the Western Balkans route. | The UN’s refugee agency is keeping track of daily refugee movements through Greece, Macedonia, Serbia, and farther along into Europe. The downloadable data and interactive map cover migrations since October 2015. | http://data.unhcr.org/mediterranean/country.php?id=502 | |
61 | 2016.01.06 | 5 | The position of Michael Jackson’s white glove in all 10,060 frames of “Billie Jean.” | Crowdsourced from his 1983 “Motown 25” performance. [h/t Nadja Popovich] | http://whiteglovetracking.com/ | https://twitter.com/PopovichN |
62 | 2016.01.13 | 1 | Religion in America. | The 2010 Religious Congregations and Membership Study counts, for more than 200 religious groups, the number of congregations and adherents in each U.S. state and county. In total, the study reported more than 344,000 congregations and more than 150 million adherents — nearly half of the 2010 U.S. population. New counts are published every 10 years. [h/t Julia Silge] | http://www.thearda.com/rcms2010/ | http://juliasilge.com/blog/This-Is-the-Place/ |
63 | 2016.01.13 | 2 | Shifting global borders. | What did the world’s political boundaries look like in 1945? The lines between Swedish counties in 1968? The U.S. states in 1865? Thenmap, an open-source API and mapping tool, answers these questions and more. [h/t Carlos Matallín] | http://www.thenmap.net/ | https://twitter.com/matallo/status/683994848345587716 |
64 | 2016.01.13 | 3 | U.S. foreign assistance. | USAID, the Peace Corps, the U.S. African Development Foundation, and other agencies report data on foreign assistance spending to ForeignAssistance.gov. The full dataset includes detailed information for each grant and contract — and comes with a data dictionary. The website also provides a chart of participating agencies, and an interactive map of the data. | http://beta.foreignassistance.gov/ http://beta.foreignassistance.gov/learn/understanding-the-data http://beta.foreignassistance.gov/explore | |
65 | 2016.01.13 | 4 | Retirees’ language preferences. | Last year, more than 2 million people applied for new Social Security retirement and survivor benefits. When they did, they indicated their preferred language. More than 93% said English, and about 5% of applicants said Spanish — the second most popular choice. Among the 88 other options: 1,616 applicants chose American Sign Language, 32 chose Japanese, nine chose Yiddish, and one chose Swedish. | https://www.ssa.gov/open/data/LEP-Yearly-Spoken-Language-RSI-Claimants.html | |
66 | 2016.01.13 | 5 | State Department per diems. | When State Department employees travel on official business abroad, they can get reimbursed — to a point — for lodging, meals, and things such as laundry. The department publishes monthly spreadsheets of the maximum per diems, which vary by location. The highest right now? The Cayman Islands ($735 per day). The lowest? Antarctica ($0/day) and Iraq ($11/day). | https://aoprals.state.gov/content.asp?content_id=233&menu_id=78 | |
67 | 2016.01.20 | 1 | Flint water samples. | Researchers from Virginia Tech have joined forces with Flint, Mich., residents to sample the city’s lead-tainted water supply. In December, the researchers posted the results of 271 samples, which indicated high levels of lead contamination. The most extreme sample found a lead concentration of 158 parts per billion — 10 times higher than the EPA’s “action level.” Related: The New York Times + The Washington Post have used the data. | http://flintwaterstudy.org/ http://flintwaterstudy.org/2015/12/complete-dataset-lead-results-in-tap-water-for-271-flint-samples/ http://www.cdc.gov/nceh/lead/tips/water.htm http://www.nytimes.com/interactive/2016/01/15/us/flint-lead-water-michigan.html https://www.washingtonpost.com/news/wonk/wp/2016/01/15/this-is-how-toxic-flints-water-really-is/ | |
68 | 2016.01.20 | 2 | The transatlantic slave trade. | Slate Magazine’s “The Atlantic Slave Trade in Two Minutes” — recently named a multimedia finalist for the American Society of Magazine Editors’ annual awards — tracks 20,528 transatlantic voyages over 315 years. The information comes via SlaveVoyages.org, which provides searchable, downloadable records of ships’ and captains’ names, regions where slaves were purchased and sent, and more. | http://www.slate.com/articles/life/the_history_of_american_slavery/2015/06/animated_interactive_of_the_history_of_the_atlantic_slave_trade.html https://twitter.com/ASME1963/status/687734147067031552 http://slavevoyages.org/ http://slavevoyages.org/voyage/search http://slavevoyages.org/voyage/download | |
69 | 2016.01.20 | 3 | Campaign ad purchases. | The FCC requires broadcasters to keep records of “all requests for broadcast time made by or on behalf of a candidate for public office.” With the help of volunteers, Political Ad Sleuth gathers those records and enters them into a searchable, downloadable database. Note: Due, in part, to the difficulty of transcribing the (non-standardized) records, the information in the database is incomplete. | https://www.law.cornell.edu/cfr/text/47/73.1943 http://politicaladsleuth.com/ http://politicaladsleuth.com/political-files/most-recent/ | |
70 | 2016.01.20 | 4 | 568,454 reviews of “fine foods” on Amazon. | In 2013, Stanford University researchers published a paper examining how people’s tastes “change and evolve over time.” They drew, in part, on a dataset containing 13 years of Amazon reviews of gourmet foods. (Note: Not all foods were intended for humans.) The dataset comes in a slightly unconventional format; here’s a Python script to convert it to a TSV file. [h/t Kaggle] | http://snap.stanford.edu/data/web-FineFoods.html http://www.amazon.com/dp/B001E4KFG0 https://gist.github.com/jsvine/57679826ed582a95dd71 | https://www.kaggle.com/snap/amazon-fine-food-reviews |
71 | 2016.01.20 | 5 | One hyper-quantified human. | Last month, Nature Communications published a study of the “long-term neural and physiological phenotyping of a single human.” That human? Study co-author Russell A. Poldrack, “a right-handed Caucasian male, aged 45 years at the onset of the study.” The 18 months of results — tracking brain connections, food consumption, stress levels, and much more — are available to download and explore. [h/t Sune Lehmann] | http://www.nature.com/ncomms/2015/151209/ncomms9885/full/ncomms9885.html http://results.myconnectome.org/ http://results.myconnectome.org/explore | https://twitter.com/suneman/status/686847329543020544 |
72 | 2016.01.27 | 1 | Airplane confidential. | NASA collects aviation safety reports from pilots, technicians, flight attendants, and other personnel. The (anonymized) published data contains text narratives, as well as details about flight conditions and other safety factors. (“Ok, I did it; the dumbest thing I have ever done in my entire life,” one confessional begins.) You can search the database but can only download so many records at a time. And you can request the full database from NASA, but you’ll have to wait. An alternative option: There’s a copy from November on the Internet Archive. [h/t Dave Riordan + Julian Simioni] | http://asrs.arc.nasa.gov/index.html http://asrs.arc.nasa.gov/search/database.html http://asrs.arc.nasa.gov/search/requesting.html https://archive.org/download/asrs-extracted.tar | https://twitter.com/riordan https://github.com/orangejulius/asrs-data |
73 | 2016.01.27 | 2 | Cancer statistics. | Earlier this month, the American Cancer Society launched a new data dashboard. Metrics include estimated new cases, historical survival rates, and more. To download the corresponding spreadsheets, use the “tools” button on each page. [h/t Virginia Hughes] | http://cancerstatisticscenter.cancer.org/ http://cancerstatisticscenter.cancer.org/#/data-analysis/NewCaseEstimates http://cancerstatisticscenter.cancer.org/#/data-analysis/SurvivalByStage | https://twitter.com/virginiahughes |
74 | 2016.01.27 | 3 | Tens of millions of movie ratings. | MovieLens.org is a free, noncommercial movie recommender — sort of like Netflix, minus the ability to watch movies. The service is run by a research lab at the University of Minnesota. The lab publishes several datasets of user ratings and movie info. The largest contains 22 million ratings. Among movies with at least 1,000 ratings, The Shawshank Redemption has received the highest average score (4.44 of 5), while 2007’s Epic Movie has netted the lowest (1.48 of 5). | https://movielens.org/ http://grouplens.org/ http://grouplens.org/datasets/movielens/ http://www.rottentomatoes.com/m/epic_movie/ | |
75 | 2016.01.27 | 4 | Federal employees’ feelings. | Last year, more than 400,000 federal employees took the Office of Personnel Management’s annual survey, which includes questions about satisfaction, leadership, and work schedules. You can download aggregate and raw results. Important note: The survey is voluntary and non-random. | http://www.fedview.opm.gov/2015/ http://www.fedview.opm.gov/2015/Reports/ http://www.fedview.opm.gov/2015/EVSDATA/ | |
76 | 2016.01.27 | 5 | The Survey of Scottish Witchcraft. | The University of Edinburgh hosts an incredibly detailed and deeply documented database of more than 3,000 accused witches in Scotland. The mania reached its quantitative peak in 1662, when, according to the database, 402 people were accused of witchcraft. [h/t Felix Haass] | http://www.shca.ed.ac.uk/Research/witches/ http://webdb.ucs.ed.ac.uk/witches/index.cfm?fuseaction=home.graph2 | https://twitter.com/felixhaass |
77 | 2016.02.03 | 1 | Angry travelers. | The Transportation Security Administration publishes spreadsheets of legal claims against the agency, including the location, circumstances, and outcome of each claim. The most expensive settlement on record appears to involve a vehicle-related personal injury in July 2004, for which the TSA paid $125,000. On the other end of the spectrum: In 2014, a traveler recouped $1.25 for lost food or drink at Hilton Head Island Airport. [h/t Seth Kadish + Lindsey Cook] | http://www.dhs.gov/tsa-claims-data | http://vizual-statistix.tumblr.com/post/138024589666/travelers-make-claims-again-the-transportation https://twitter.com/Lindzcook |
78 | 2016.02.03 | 2 | Famous people on Wikipedia. | Last month, a group of researchers introduced Pantheon 1.0, “a manually verified dataset of globally famous biographies.” It starts with 11,341 Wikipedia biography pages in 25 languages, and adds birthplace, birthdate, gender, occupations, and page views. You can download the data or explore it online. Baffling factoid: As of May 2013, High School Musical star Corbin Bleu had biographies in more language editions than anyone other than Jesus Christ and Barack Obama. Related: A broader-but-shallower dataset of more than 400,000 influential people on the English-language Wikipedia. [h/t Ben Dilday] | http://www.nature.com/articles/sdata201575 http://pantheon.media.mit.edu/about/datasets http://pantheon.media.mit.edu/rankings/cities/all/all/-4000/2010/H15 http://www.buzzfeed.com/josephbernstein/why-the-hell-is-corbin-bleu-such-a-huge-deal-on-wikipedia https://github.com/bdilday/wikipedia_people | https://twitter.com/BenDilday/status/690334614007640065 |
79 | 2016.02.03 | 3 | Zika data. | Fears about the Zika virus — and a possible, but not proven, connection to microcephaly — are growing. Little data on the latest outbreak has been published, but here’s an open guide to what’s available so far, including reported cases of microcephaly in Brazil and the number of suspected Zika samples sent to Colombia’s national institute of health. | http://fivethirtyeight.com/features/zikas-not-a-global-health-emergency-its-potential-consequences-are/ https://github.com/BuzzFeedNews/zika-data | |
80 | 2016.02.03 | 4 | Post-Fukushima radiation. | Next month marks the five-year anniversary of the Fukushima Daiichi disaster, the worst nuclear accident since Chernobyl. Since shortly after the meltdown, volunteers for Safecast have been collecting radiation measurements in Japan and beyond. The results are available to download or to access via API. | http://www.nei.org/News-Media/News/News-Archives/fukushima-chernobyl-and-the-nuclear-event-scale http://blog.safecast.org/history/ http://safecast.org/tilemap/?y=35.2&x=137.9&z=5 http://blog.safecast.org/data/ https://api.safecast.org/en-US/home | |
81 | 2016.02.03 | 5 | Movie chatter. | The Cornell Movie-Dialogs Corpus contains 220,579 “conversational exchanges” between 9,035 characters in 617 movies. Included: “Hello. My name is Inigo Montoya. You killed my father. Prepare to die.” | http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html | |
82 | 2016.02.10 | 1 | Powering America. | Every year, the U.S. Energy Information Administration requires thousands of power plants to report detailed data on fuel consumption and electricity generation. The datasets stretch back more than three decades, to 1989. In 2014, the most recent year available, Arizona’s Palo Verde Nuclear Generating Station generated more electricity — 32 million megawatt hours — than any other power plant in the country. [h/t Marc DaCosta] | http://www.eia.gov/electricity/data/eia923/ https://en.wikipedia.org/wiki/Palo_Verde_Nuclear_Generating_Station | https://twitter.com/marc_dacosta |
83 | 2016.02.10 | 2 | Nature-spotting. | iNaturalist is a sort of social network for nature enthusiasts. Users can post photos and descriptions of birds, fish, bugs, and even mold, which experts can then help to identify. In November, the site recorded its two-millionth observation. You can explore the data via API or, with a free account, use the site’s export tool. [h/t Dan Brady] | http://www.inaturalist.org/pages/about http://www.inaturalist.org/observations/2656845 http://inaturalist.tumblr.com/post/133980888898/2-million-observations http://www.inaturalist.org/pages/api+reference http://www.inaturalist.org/observations/export | http://danjbrady.com/ |
84 | 2016.02.10 | 3 | Organ transplants. | The Organ Procurement and Transplantation Network, a public-private partnership, keeps records of organ donations, transplants, and waiting lists in the United States. The website’s “advanced” data tool lets you generate fairly detailed custom reports. One hitch: The site doesn’t provide an option to download the data. Data Is Plural wrote a small bit of software to fix that. | https://optn.transplant.hrsa.gov/ https://optn.transplant.hrsa.gov/converge/latestData/viewDataReports.asp https://optn.transplant.hrsa.gov/converge/latestData/advancedData.asp https://gist.github.com/jsvine/6ed721172a7f5019332b | |
85 | 2016.02.10 | 4 | More political ads. | The Internet Archive’s Political TV Ad Archive uses audio fingerprinting to identify the campaign ads playing in key primary states. You can search the database, watch the ads, and download the data. The data file contains information about each ad’s sponsor, pro/con-ness, TV network, and time of airing. Previously: Political Ad Sleuth, featured Jan. 20. | http://politicaladarchive.org/ http://politicaladarchive.org/data/ http://politicaladsleuth.com/ http://tinyletter.com/data-is-plural/letters/data-is-plural-2016-01-20-edition | |
86 | 2016.02.10 | 5 | One million songs. | The Million Song Dataset contains metadata and “feature analysis” (e.g., loudness, tempo, and “danceability”) for, you guessed it, one thousand-thousand songs. The full dataset occupies hundreds of gigabytes, but you can also download a 1% sample. [h/t Neal Lathia] | http://labrosa.ee.columbia.edu/millionsong/ http://labrosa.ee.columbia.edu/millionsong/pages/example-track-description http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset | https://twitter.com/neal_lathia |
87 | 2016.02.17 | 1 | The kids are alright. | Every two years since 1991, the CDC has conducted the Youth Risk Behavior Survey, which asks high school students questions about drug use, sex, eating habits, and more. The results are available at the national, state, and district level. Results from the 2015 survey will be published in June, the CDC says. Related: Today’s teens _______ less than you did. | http://www.cdc.gov/healthyyouth/data/yrbs/overview.htm http://www.cdc.gov/healthyyouth/data/yrbs/data.htm http://www.vox.com/a/teens | |
88 | 2016.02.17 | 2 | Word-emotion associations. | Computational linguists at Canada’s National Research Council used Mechanical Turk to crowdsource the emotional associations of 14,182 words. For each word, participants were asked whether it was “positive” and/or “negative”, and whether it was associated with any of eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The resulting Word-Emotion Association Lexicon was first published in 2010. Of the full lexicon, only two words — “treat” and “feeling” — were associated with all eight emotions. [h/t Bipul Mohanto] | http://www.nrc-cnrc.gc.ca/eng/ http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm | http://opendata.stackexchange.com/questions/7008/labeled-sentiment-words-according-to-8-different-human-sentiments |
89 | 2016.02.17 | 3 | The United States of Land. | In 2011, agriculture occupied about 22% of all land in the contiguous U.S., according to the National Land Cover Database. The NLCD classifies every 30-meter-by-30-meter chunk of land into one of 16 categories, including “woody wetlands,” “cultivated crops,” and “developed” land, at different intensities. (Alaska’s unique landscape has earned it a few additional categories, such as “dwarf scrub.”) The database is presented as raster files, so you’ll need some geospatial software to dig in. [h/t Ryan McNeill] | http://www.mrlc.gov/nlcd11_stat.php http://www.mrlc.gov/nlcd2011.php http://www.mrlc.gov/nlcd11_leg.php | https://twitter.com/mcneill_tweets |
90 | 2016.02.17 | 4 | Hundreds of thousands of chess games. | Portable Game Notation, a file format used to describe chess matches, was invented in 1993. Since then, enthusiasts have created PGN files for virtually all top players’ games and every high-level tournament at sites such as PGN Mentor and Chess DB. [h/t Seth Kadish] | https://en.wikipedia.org/wiki/Portable_Game_Notation http://www.pgnmentor.com/files.html#events http://chess-db.com/public/grandmasters.jsp | http://vizual-statistix.tumblr.com/post/78821780083/when-i-started-this-blog-one-of-my-first |
91 | 2016.02.17 | 5 | A quarter-million bugs. | For 18 years, a trap on the roof of the University of Copenhagen’s Zoological Museum lured moths, butterflies, and beetles to their early deaths. Researchers at the university counted and identified more than 250,000 specimens from 1,500+ species. The most common: Yponomeuta evonymella, a moth species also known as the bird-cherry ermine, which got trapped nearly 40,000 times. | http://onlinelibrary.wiley.com/doi/10.1111/1365-2656.12452/full http://datadryad.org/resource/doi:10.5061/dryad.s4945/1 https://en.wikipedia.org/wiki/Bird-cherry_ermine | |
92 | 2016.02.24 | 1 | Supremely useful data. | The Supreme Court Database is exactly what it sounds like — and definitively so. The most recent release covers all SCOTUS cases from 1946 through 2014. For each case, the database contains 247 “pieces of information,” including the source of the case, why the court agreed to hear the case, the legal provisions at play, and how each justice voted. | http://supremecourtdatabase.org/index.php http://supremecourtdatabase.org/data.php http://supremecourtdatabase.org/documentation.php?var=caseSource http://supremecourtdatabase.org/documentation.php?var=jurisdiction http://supremecourtdatabase.org/documentation.php?var=lawType http://supremecourtdatabase.org/documentation.php?var=vote | |
93 | 2016.02.24 | 2 | Armed conflict. | The Uppsala Conflict Data Program maintains several large, interconnected datasets describing decades of war, genocide, and other armed hostilities. Looking for a slightly less depressing experience? Try the UCDP’s dataset of 216 peace agreements signed between 1975 and 2011. [h/t Tony Gray] | http://www.pcr.uu.se/research/ucdp/program_overview/ http://www.pcr.uu.se/research/ucdp/datasets/ http://www.pcr.uu.se/research/ucdp/datasets/ucdp_peace_agreement_dataset/ | https://twitter.com/tgraybam |
94 | 2016.02.24 | 3 | Nuclear capabilities. | The Nuclear Latency Dataset contains “all known uranium enrichment and plutonium reprocessing facilities” built between 1939 and 2012. That amounts to 253 plants around the world, each with information on its construction timeframe, civilian-vs-military purpose, international oversight, and more. [h/t Abraham Epton] | http://www.matthewfuhrmann.com/datasets.html | https://twitter.com/aepton |
95 | 2016.02.24 | 4 | Cruise ship inspections. | The CDC publishes a searchable database of its cruise ship sanitation inspections — but doesn’t provide an option to download the data. Last week, an open-data enthusiast scraped the database and posted CSVs of specific deficiencies and overall inspection scores since 1990. The lowest score: The Nippon Maru’s 38 points (out of 100) in 1998. Related: ProPublica’s “Cruise Control,” a searchable database of health and safety reports. [h/t Mike Stucka + Lena Groeger] | http://wwwn.cdc.gov/InspectionQueryTool/InspectionSearch.aspx http://wwwn.cdc.gov/InspectionQueryTool/InspectionSearch.aspx https://github.com/marks https://github.com/marks/cdc-cruise-ship-inspections http://wwwn.cdc.gov/InspectionQueryTool/InspectionDetailReport.aspx?ColI=MTMzMDc2-8Ref6xawqHU%3d https://projects.propublica.org/cruises/ | https://twitter.com/mikestucka https://twitter.com/lenagroeger |
96 | 2016.02.24 | 5 | Funny ha ha. | Since 1999, Jester has been telling jokes. The website, built by UC Berkeley’s Laboratory for Automation Science and Engineering, asks you to rate its sometimes-humorous offerings, and then uses those answers to guess which of the remaining 100+ jokes you’ll like best. The UC Berkeley team behind the project has released millions of joke ratings from more than 100,000 anonymous users. [h/t Alex Gude] | http://eigentaste.berkeley.edu/ http://eigentaste.berkeley.edu/dataset/ | http://www.lab41.org/nine-datasets-for-investigating-recommender-systems/ |
97 | 2016.03.02 | 1 | American infrastructure. | Last week, the Department of Homeland Security published more than 250 infrastructure-related datasets, which had previously been marked as "For Official Use Only." The release covers a wide range of topics, including datasets on educational facilities, hurricane evacuation routes, poultry slaughterhouses, and sports venues. (According to that dataset, the Indianapolis Motor Speedway holds more people than any other major sports venue, with a listed capacity of 257,325.) [h/t Michael Keller] | https://hifld-dhs-gii.opendata.arcgis.com https://blogs.esri.com/esri/esri-insider/2016/02/24/open-data-for-economic-resiliency/ https://hifld-dhs-gii.opendata.arcgis.com/datasets?group_id=1b542a2d4fda47aea7e52cbc4fe9fd65 https://hifld-dhs-gii.opendata.arcgis.com/datasets/0eab4e109ce2412882db595aa4555759_0 https://hifld-dhs-gii.opendata.arcgis.com/datasets/b6b9cc72fb58476d92056d5c7ed25f8b_0 https://hifld-dhs-gii.opendata.arcgis.com/datasets/85d3d0fc64924edbbd7c62e319d8a791_0 | https://twitter.com/mhkeller |
98 | 2016.03.02 | 2 | British diets. | The UK government has published data on 27 years of food consumption. The National Food Survey datasets are based on “food diaries” recorded by a sample of British families from 1974 to 2000. In addition to tracking food consumption, the data contains details about each household, including whether they kept vegetarian, had a pregnancy, and/or owned a microwave. [h/t Hannah Brooks + Sebastian Gutierrez] | http://britains-diet.labs.theodi.org/ | http://www.datascienceweekly.org/newsletters/data-science-weekly-newsletter-issue-118 |
99 | 2016.03.02 | 3 | Bills, bills, bills. | Congress has finally begun publishing official bulk data on the status of its bills — something open-government advocates had been requesting for more than a decade. The bulk downloads include an XML file for each piece of legislation, with indicators tracking (among other things) committee referrals and actions. Nostalgia: I’m Just A Bill. [h/t Derek Willis] | https://www.govinfo.gov/features/featured-content/bill-status-bulk-data http://fedscoop.com/congress-makes-bill-status-open-to-public https://www.youtube.com/watch?v=tyeJ55o3El0 | https://twitter.com/derekwillis/status/702536223903129600 |
100 | 2016.03.02 | 4 | Provincial populations. | National population data is easy to find. But it’s much harder to find reliable, standardized population figures for finer-grained geographies. To that end, the World Bank has launched a pilot of its Subnational Population Database, which calculates estimates for 75 countries’ major provinces/states/regions. | http://blogs.worldbank.org/opendata/new-time-series-global-subnational-population-estimates-launched http://data.worldbank.org/data-catalog/subnational-population | |
101 | 2016.03.02 | 5 | Lights, camera, permit. | Through a freedom of information request, WNYC obtained four years of New York City film and television permits. The 40,000+ records date from October 2011 to September 2015 and cover several types of permits, including those for scouting, shooting, and red carpet premieres. More: Popular TV shows’ shooting locations, mapped. [h/t John Templon] | https://github.com/datanews/film-permits http://www.wnyc.org/story/tv-shooting-locations-new-york-city/ | https://twitter.com/jtemplon |
102 | 2016.03.09 | 1 | Two thousand billionaires. | Researchers have compiled a multi-decade database of the super-rich. Building off the Forbes World’s Billionaires lists from 1996–2014, scholars at the Peterson Institute for International Economics have added a couple dozen more variables about each billionaire — including whether they were self-made or inherited their wealth. (Roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family, the authors estimate.) | http://www.iie.com/publications/interstitial.cfm?ResearchID=2917 http://www.forbes.com/billionaires/list/ | |
103 | 2016.03.09 | 2 | Legislative linguistics. | The Sunlight Foundation’s Capitol Words project lets you explore the frequency of words and phrases in the Congressional Record since 1996. For example: “weapons of mass destruction”, “war” vs. “peace”, or “Obamacare”. The underlying data is available via an API. | http://capitolwords.org/ http://capitolwords.org/term/weapons_of_mass_destruction/ http://capitolwords.org/?terma=war&termb=peace http://capitolwords.org/term/Obamacare/ http://capitolwords.org/api/1/ | |
104 | 2016.03.09 | 3 | Historical mortgages. | With the help of volunteers, the New York Public Library is transcribing 6,000+ mortgage and bond ledgers from Emigrant Savings Bank, founded in 1850 and the oldest such bank in the city. You can search the transcribed records, or download the (very) raw data. | http://emigrantcity.nypl.org/#/intro http://emigrantcity.nypl.org/#/data/browse http://emigrantcity.nypl.org/#/data/download | |
105 | 2016.03.09 | 4 | Overlapping crosswords. | The cruciverb industry is facing its first major plagiarism scandal, unearthed thanks to a newly-published database of crosswords that are at least 25% similar to previously-published puzzles. | http://fivethirtyeight.com/features/a-plagiarism-scandal-is-unfolding-in-the-crossword-world/ http://xd.saul.pw/xdiffs/ | |
106 | 2016.03.09 | 5 | Baseball, baseball, baseball. | If you’re looking for historical data on baseball teams, players, salaries, or managers, Sean Lahman’s Baseball Archive likely has it. The archive was updated with data from the 2015 season last week. Related: Retrosheet’s game logs — a record of every major league game since 1871. [h/t Joe Murphy] | http://www.seanlahman.com/baseball-archive/statistics/ http://www.retrosheet.org/gamelogs/index.html | https://twitter.com/joemurph |
107 | 2016.03.23 | 1 | Nuclear explosions. | The Oklahoma Geological Survey Observatory’s “Catalog of Nuclear Explosions” contains a “nearly complete” list of such detonations — more than 2,000 of them between 1945 and 2006. The dataset roughly (but not precisely) overlaps with the explosions listed in the Stockholm International Peace Research Institute’s “Nuclear Explosions, 1945–1998” (PDF) report. Both datasets list the date and location of each explosion, the country responsible, the detonation site, and (where known) its explosive yield, among other variables. And both reports use unconventional formatting, so I’ve extracted a couple of CSVs for you. | http://www.okgeosurvey1.gov/level2/nuke.cat.html http://www.iaea.org/inis/collection/NCLCollectionStore/_Public/31/060/31060372.pdf https://github.com/data-is-plural/nuclear-explosions | |
108 | 2016.03.23 | 2 | British property sales. | The UK’s Price Paid Data contains virtually all of the country’s residential property sales, with only a few exceptions. (Sales forced under court order are excluded, for example.) Each row includes the sale price, address, property type, and more. The full, multi-gigabyte dataset covers all sales since 1995, but you can also download files for individual years or the most recent month, or just search the dataset online. Related: Where can you afford to buy a house? [h/t Helena Bengtsson] | https://www.gov.uk/government/collections/price-paid-data https://www.gov.uk/guidance/about-the-price-paid-data https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads http://landregistry.data.gov.uk/app/ppd http://www.theguardian.com/society/ng-interactive/2015/sep/02/unaffordable-country-where-can-you-afford-to-buy-a-house | https://twitter.com/helenabengtsson |
109 | 2016.03.23 | 3 | Rising waters. | The U.S. National Water Level Observation Network tracks water levels at hundreds of tide gauges around the country. The data is available via an API. Related: Water’s Edge, a 2014 Reuters investigation based on the gauge data. Also related: The Advanced Hydrologic Prediction Service’s flood observations and warnings, as structured data. [h/t Ryan McNeill] | http://tidesandcurrents.noaa.gov/nwlon.html https://tidesandcurrents.noaa.gov/stations.html?type=Water+Levels https://tidesandcurrents.noaa.gov/api/ http://www.reuters.com/investigates/special-report/waters-edge-the-crisis-of-rising-sea-levels/ http://water.weather.gov/ahps/ http://water.weather.gov/ahps/download.php | https://twitter.com/mcneill_tweets |
110 | 2016.03.23 | 4 | What kind of economy does your county have? | The USDA Economic Research Service’s County Typology Codes categorize each U.S. county based on (a) its dependence on certain industries and (b) various socio-economic factors. For example, the data classifies 219 counties as “mining-dependent.” [h/t Steven Romalewski] | http://www.ers.usda.gov/data-products/county-typology-codes.aspx | https://twitter.com/SR_spatial |
111 | 2016.03.23 | 5 | Rodents of New York. | NYC’s 311 dataset contains a special category for rat sightings. This slice of data, which is updated daily and stretches back to 2010, contains more than 73,000 rows. One-third of sightings have occurred in Brooklyn. Related: An academic study of NYC rat sightings. Also related: Reply All #56 — “Zardulu”. | https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4157232/ https://gimletmedia.com/episode/zardulu/ | |
112 | 2016.03.30 | 1 | U.S. drone permits. | Want to fly a drone in the United States for non-recreational purposes? You’ll need a “Section 333” exemption from the Federal Aviation Administration, which governs drone activity. The FAA publishes a list of approved exemptions, which Bard College’s Center for the Study of the Drone has converted into a PDF-formatted database. The Verge, in turn, has converted that PDF into an easy-to-use CSV. Related: Last week, the FAA updated its dataset of unmanned aircraft sightings. [h/t Dan Vergano] | https://www.faa.gov/uas/legislative_programs/section_333/333_authorizations/ http://dronecenter.bard.edu/ http://dronecenter.bard.edu/analysis-us-drone-exemptions-14-15-2/ https://github.com/voxmedia/data-projects/tree/master/verge-drones-over-america http://www.faa.gov/uas/law_enforcement/uas_sighting_reports/ | https://twitter.com/dvergano |
113 | 2016.03.30 | 2 | Digital black markets. | Researcher Gwern Branwen has assembled an archive of listings posted to “dark net markets.” Silk Road is the best-known among the group, but the collection covers scores of other markets, including Amazon Dark and FreeBay. The materials gathered from each site are slightly different; many include product advertisements and seller profiles. Warning: Some of the archives contain pictures, which may include offensive or disturbing imagery. And it’s probably wise to heed Gwern’s caveats: The scrapes “are large, complicated, redundant, and highly error-prone. They cannot be taken at face-value.” [h/t Mike Sconzo] | http://www.gwern.net/Black-market%20archives | http://www.secrepo.com/ |
114 | 2016.03.30 | 3 | Titanic passengers. | Based in large part on Encyclopedia Titanica, researchers have compiled a structured dataset of 1,309 passengers on the RMS Titanic’s maiden voyage. (To get the data, download titanic3.csv on this page.) The dataset includes passengers’ names, ages, ticket fare, cabin number, and whether they survived. | http://www.encyclopedia-titanica.org/ http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets | |
115 | 2016.03.30 | 4 | Groceries, quantified. | Open Food Facts is a crowdsourced database of food products’ nutrition data and ingredient lists. (E.g., this kilogram jar of Nutella contains 316 grams of fat.) The entire database can be downloaded in several formats. | http://world.openfoodfacts.org http://world.openfoodfacts.org/product/3017620401473/nutella-1kg-ferrero http://world.openfoodfacts.org/data | |
116 | 2016.03.30 | 5 | America, the varyingly beautiful. | In 1999, the USDA Economic Research Service published a “natural amenities scale,” which rated every county in the contiguous United States based on factors such as landscape variation and January sunniness. Last year, based on the dataset, a Washington Post reporter called Minnesota’s Red Lake County “the absolute worst place to live in America.” Now, he’s moving there. [h/t Jody Avirgan] | http://www.ers.usda.gov/data-products/natural-amenities-scale.aspx https://www.washingtonpost.com/news/wonk/wp/2015/08/17/every-county-in-america-ranked-by-natural-beauty/ https://www.washingtonpost.com/news/wonk/wp/2016/03/08/why-im-moving-to-the-place-i-called-americas-worst-place-to-live/ | http://fivethirtyeight.com/features/he-called-it-americas-worst-place-to-live-now-hes-moving-there/ |
117 | 2016.04.06 | 1 | Global bike-sharing. | The citybik.es API provides access to live data on every bike-sharing station in more than 400 cities around the world. It’s free, and the underlying software is open-source. What data you get per station depends on the city, but typically includes the number of empty slots, number of available bikes, and location information. Looking for bulk data on bike-sharing rides? Many cities — including New York, Chicago, and D.C. — make it available. Related: “A Tale of Twenty-Two Million Citi Bike Rides.” Also related: Three maps illustrating the gender gap in bike-share usage. | http://api.citybik.es/v2/ https://github.com/eskerda/pybikes https://www.citibikenyc.com/system-data https://www.divvybikes.com/data https://www.capitalbikeshare.com/trip-history-data http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/ http://www.buzzfeed.com/jsvine/these-maps-show-a-massive-gender-gap-in-bicycle-riding | |
118 | 2016.04.06 | 2 | Nine years of homelessness estimates. | Every January, at the behest of the U.S. Department of Housing and Urban Development, volunteers across the country attempt to count the homeless in their communities. The result: HUD’s “point in time” estimates, which are currently available for 2007–2015. The most recent estimates found 564,708 homeless people nationwide, with 75,323 of that count (more than 13%) living in New York City. Related: “Why counting America’s homeless is both imperative and imperfect.” Also related: “How Many Street Homeless? NYC’s Tallies Leave the Question Open.” [h/t Tim Henderson + Jonathan Stray] | https://www.hudexchange.info/resource/4832/2015-ahar-part-1-pit-estimates-of-homelessness/ http://fusion.net/story/49980/why-counting-americas-homeless-is-both-imperative-and-imperfect/ http://citylimits.org/2015/10/13/how-many-street-homeless-nycs-tallies-leave-the-question-open/ | https://twitter.com/TimHendersonSL https://twitter.com/jonathanstray |
119 | 2016.04.06 | 3 | Tech’s water cooler. | Hacker News’ official API provides data describing every submission, comment, and user on the community-driven website. You can also analyze the full dataset via Google’s recently-relaunched BigQuery Public Datasets program. [h/t Michael Gardiner] | https://github.com/HackerNews/API https://cloud.google.com/bigquery/public-data/hacker-news | https://twitter.com/MikeARGS |
120 | 2016.04.06 | 4 | John Snow’s data. | When physician John Snow constructed his now-famous dot-map of London’s Broad Street cholera outbreak in the 1850s, the leading geospatial technologies were ink and paper. Academic Robin Wilson has adapted the data for the computer age, converting Snow’s map into several modern GIS formats. Related: Infographics in the Time of Cholera. | https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak#John_Snow_investigation http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/ https://www.propublica.org/nerds/item/infographics-in-the-time-of-cholera | |
121 | 2016.04.06 | 5 | Jon Snow data. | An API Of Ice And Fire lets you fetch data about every book, character, and house in Game of Thrones — including allegiances, family trees, and dates of death. You can also download the data in bulk. Related: Macalester researchers recently published a network analysis (and underlying data) of all characters in A Storm of Swords, the third book in the series. Jon Snow, according to the analysis, was the second-most important character. [h/t Melissa Bierly] | https://anapioficeandfire.com/ https://anapioficeandfire.com/Documentation https://github.com/joakimskoog/AnApiOfIceAndFire/tree/master/AnApiOfIceAndFire.Data.Feeder/Data http://www.macalester.edu/~abeverid/thrones.html | https://blog.modeanalytics.com/analytics-dispatch-017/ |
122 | 2016.04.13 | 1 | Global rainfall. | To create the most detailed measurements of global rainfall ever, researchers at UC Santa Barbara’s Climate Hazards Group harmonize data from satellites and on-the-ground weather stations. The dataset, known as CHIRPS, stretches back more than 30 years and is freely available. Related: Eric Holthaus provides more details and explains why the dataset is so important. [h/t Dave Riordan] | http://chg.geog.ucsb.edu/data/chirps/index.html http://ensia.com/features/this-new-data-set-is-poised-to-revolutionize-climate-adaptation/ | https://twitter.com/riordan/status/717800678085758978 |
123 | 2016.04.13 | 2 | Order in the courts. | CourtListener gathers and publishes bulk data from the Supreme Court, all federal appeals courts, and hundreds of other jurisdictions. The files include opinions, audio from oral arguments, dockets, and citations. It also has an API. (If you register, you can also create and explore networks of citation-linked cases.) [h/t Jeff Grove] | https://www.courtlistener.com/api/bulk-info/ https://www.courtlistener.com/api/rest-info/ https://www.courtlistener.com/visualizations/scotus-mapper/ | http://mckinneylaw.iu.edu/faculty-staff/profile.cfm?Id=14 |
124 | 2016.04.13 | 3 | Health and wealth. | The Health Inequality Project calculates American life expectancies by income, gender, and geography. You can download the data at the national, state, county, and “commuting zone” levels. Where do poor Americans live the longest? New York City, Santa Barbara, and San Jose. [h/t Margot Sanger-Katz] | https://healthinequality.org/ https://healthinequality.org/data/ https://healthinequality.org/rankings/ | http://www.nytimes.com/2016/04/11/upshot/poor-new-yorkers-tend-to-live-longer-than-other-poor-americans.html |
125 | 2016.04.13 | 4 | He said, she said (less). | Over the weekend, Hannah Anderson and Matt Daniels published an interactive analysis of male and female speaking roles in 2,000 movie scripts. Among their findings: 308 scripts gave 90%+ of the film’s dialogue to men, while just 8 scripts did so for women. The duo has also released “as much data as we can share (without getting sued)” on GitHub. | http://polygraph.cool/films/ https://github.com/matthewfdaniels/scripts/ | |
126 | 2016.04.13 | 5 | Plane papers. | The Federal Aviation Administration maintains a database of all non-military aircraft registrations, which includes extensive details about each plane/helicopter/glider/blimp and their owners. Related: “Spies In The Skies.” [h/t Peter Aldhous] | http://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/ http://www.buzzfeed.com/peteraldhous/spies-in-the-skies | https://twitter.com/paldhous |
127 | 2016.04.20 | 1 | Where computers (maybe) are. | An under-scrutinized quirk in a little-known, widely-used database “turned a random Kansas farm into a digital hell.” How? The database contains best-guess geographic coordinates for every IP address on the internet. But for millions of IP addresses, the best guess is just somewhere in the United States. And, until recently, the database translated that vague location into the latitude and longitude of a farm in Potwin, Kansas. (Now it points to a lake.) | http://fusion.net/story/287592/internet-mapping-glitch-kansas-farm/ https://dev.maxmind.com/geoip/geoip2/geolite2/ http://fusion.net/story/290772/ip-mapping-maxmind-new-us-default-location/ | |
128 | 2016.04.20 | 2 | The American consumer. | Last week, the Bureau of Labor Statistics published its midyear update to the Consumer Expenditure Survey. The survey collects data on spending, income, and a handful of characteristics about U.S. consumers. One tidbit: On average, Americans are spending approximately 33% of their income on housing, and a tad less than 1% on alcohol. [h/t Nathan Yau] | http://www.bls.gov/cex/midyear.htm http://www.bls.gov/cex/home.htm | http://flowingdata.com/2015/04/02/how-we-spend-our-money-a-breakdown/ |
129 | 2016.04.20 | 3 | Cricket. | Baseball season is in full-swing, basketball and hockey playoffs have begun, and the NFL draft is nigh. No better time to highlight some cricket data! Cricsheet.org has gathered ball-by-ball data on more than 2,700 matches played since the mid-2000s. Looking for historical data? A new GitHub repository contains stats for more than 40,000 matches going back to 1773 (but mostly since the 1970s), scraped from ESPN Cricinfo. Related: How, statistically, the coin toss affects who’ll win. [h/t Derek Willis] | http://cricsheet.org/ https://github.com/dwillis/toss-up http://www.espncricinfo.com/matches/engine/match/535000.html https://github.com/dwillis/python-espncricinfo http://www.espncricinfo.com/blogs/content/story/997931.html | https://twitter.com/derekwillis/status/720569555119116289 |
130 | 2016.04.20 | 4 | Where clouds congregate, and when. | Researchers have analyzed 15 years of satellite imagery to create a nearly-global dataset of seasonal cloud coverage. The data — available at a kilometer-square resolution — could help scientists monitor and predict changes in ecosystems. [h/t Grant Smith + Joanna Klein] | http://www.earthenv.org/cloud.html | https://twitter.com/grantmeaccess/status/720992509950865412 http://www.nytimes.com/2016/04/05/science/a-cloud-atlas-provides-clues-to-life-on-earth.html |
131 | 2016.04.20 | 5 | License to distill. | The U.S. Alcohol and Tobacco Tax and Trade Bureau publishes a few permit datasets, including this table of 1,900+ businesses licensed to produce and/or bottle liquor. [h/t Maggie Lee] | https://www.ttb.gov/foia/frl.shtml https://www.ttb.gov/foia/xls/frl-spirits-producers-and-bottlers.htm | https://twitter.com/maggie_a_lee |
132 | 2016.04.27 | 1 | FOIA, four ways. | On Saturday, BuzzFeed hosted a FOIA data hackathon. Participants used datasets — from MuckRock, FOIA Machine, FOIA Mapper, and FOIA.gov — to analyze federal, state, and local responsiveness to public records requests. The first three datasets contain details about individual FOIA requests and responses; FOIA.gov provides aggregate internal data from federal agencies. | https://github.com/FOIA-data-hackathon/Planning/wiki https://github.com/FOIA-data-hackathon/Planning/wiki/Datasets https://www.muckrock.com/news/archives/2016/apr/16/join-muckrock-and-buzzfeed-hack-foia-april-23rd/ https://github.com/cirlabs/foiamachine/tree/master/stats https://foiamapper.com/foia-downloads/ http://www.foia.gov/data.html | |
133 | 2016.04.27 | 2 | Particle physics. | Last week, the researchers at CERN’s Compact Muon Solenoid Experiment released more than 300 terabytes of data. The datasets include raw particle-detection data from the Large Hadron Collider, as well as pre-processed datasets the researchers say “can be readily analysed by university or high-school students.” [h/t Dad] | http://cms.web.cern.ch/news/cms-releases-new-batch-research-data-lhc http://opendata.cern.ch/about/CMS | https://www.linkedin.com/in/ed-vine-a480347 |
134 | 2016.04.27 | 3 | Flight delays. | The Bureau of Transportation Statistics requires the nation’s largest airlines to report scheduled and actual timing data for every domestic flight. The corresponding database includes information about delays, cancellations, and diversions, among other fields — and goes back to 1987. In January 2016, departing flights taxied for an average of 16 minutes, a minimum of 1 minute, and a maximum of 2 hours, 38 minutes. Related: “Which Flight Will Get You There Fastest?” [h/t Tom Augspurger] | http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data http://projects.fivethirtyeight.com/flights/ | http://tomaugspurger.github.io/modern-1.html |
135 | 2016.04.27 | 4 | Congressional junkets. | The U.S. House of Representatives requires all staff to reveal all “gift travel” — i.e., “free” trips that the government didn’t pay for. The Office of the Clerk compiles those filings into a database containing each trip’s dates and sponsors. (The Consumer Electronics Show paid for 49 staffers and one congressman to visit the Las Vegas convention in January.) The Senate publishes similar data, except it doesn’t include the sponsor name ... which kind of undermines the entire point. [h/t John Stanton] | http://clerk.house.gov/public_disc/giftTravel.aspx http://www.senate.gov/pagelayout/legislative/g_three_sections_with_teasers/lobbyingdisc.htm#lobbyingdisc=grt | https://twitter.com/dcbigjohn |
136 | 2016.04.27 | 5 | A long time ago, in an API far, far away. | The Star Wars API provides programmatic access to data about every character, species, spaceship, planet, and film in George Lucas’ cinematic universe. You can also download JSON files containing all the data. [h/t Robin Sloan] | https://swapi.co/ https://github.com/phalt/swapi/tree/master/resources/fixtures | https://twitter.com/robinsloan |
137 | 2016.05.04 | 1 | Scientific paper trails. | Sci-Hub bills itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers.” Who’s downloading papers from the site? “Everyone,” Science magazine concluded after analyzing data culled from six months of Sci-Hub server logs. For every download, the dataset identifies the paper downloaded, the date and time, an anonymized version of the downloader’s IP address, and a rough location. [h/t Melissa Bierly + Tom Grahame] | http://sci-hub.cc/ http://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone http://datadryad.org/resource/doi:10.5061/dryad.q447c | https://twitter.com/melissa_bierly https://twitter.com/tfgrahame |
138 | 2016.05.04 | 2 | The Ku Klux Klan, 1915–1940. | Scholars at Virginia Commonwealth University have identified and mapped the locations of 2,000 KKK branches active in the early 20th century. The dataset contains the city, state, earliest-known-date, and sources for each “klavern.” Related: “Active Hate Groups in the United States in 2015,” a report by the Southern Poverty Law Center. [h/t K Reed] | http://labs.library.vcu.edu/klan/ http://scholarscompass.vcu.edu/hist_data/1/ https://www.splcenter.org/fighting-hate/intelligence-report/2016/active-hate-groups-united-states-2015 | https://twitter.com/ternary_logic/status/726464655632269312 |
139 | 2016.05.04 | 3 | Disciplined doctors. | The National Practitioner Data Bank tracks medical malpractice payments, license suspensions, Medicare expulsions, and other lists of penalized physicians. The public use data file includes dozens of details per entry but excludes the part that is almost certainly most important to patients: the doctors' names. Related: “Doctors perform thousands of unnecessary surgeries,” according to a 2013 USA Today investigation that relied partly on the NPDB. | http://www.npdb.hrsa.gov/ http://www.npdb.hrsa.gov/resources/publicData.jsp http://www.usatoday.com/story/news/nation/2013/06/18/unnecessary-surgery-usa-today-investigation/2435009/ | |
140 | 2016.05.04 | 4 | Goooooooooaaaaal. | OpenFootball collects and publishes results and rosters from national and international soccer/football matches, including the Premier League and the World Cup. Related: English soccer/football results, 1871–2014. [h/t Wendy Mak] | http://openfootball.github.io/ https://github.com/jalapic/engsoccerdata | https://twitter.com/wwymak/status/714796436609757184 |
141 | 2016.05.04 | 5 | Grape timing. | Climate scientists have compiled a dataset of grape-harvest dates from 380 European vineyards, across 27 regions, and stretching back 650 years. The earliest data-point refers to a Burgundy harvest in 1354. Related: The original academic paper. [h/t Martín González] | https://www.ncdc.noaa.gov/cdo/f?p=519:1:0::::P1_STUDY_ID:13194 http://www.clim-past.net/8/1403/2012/ | https://twitter.com/martgnz |
142 | 2016.05.11 | 1 | Secret offshore companies. | On Monday, the International Consortium of Investigative Journalists released data on 210,000 companies, trusts, and funds named in the massive Panama Papers leak. The database is searchable online and downloadable as several CSV files. The dataset includes companies’ officers, registered addresses, and middlemen. It supplements a pre-existing cache of 105,000 companies named in ICIJ’s 2013 “Offshore Leaks” investigation. | https://panamapapers.icij.org/blog/20160509-offshore-database-release.html https://offshoreleaks.icij.org/ https://offshoreleaks.icij.org/pages/database https://www.icij.org/offshore | |
143 | 2016.05.11 | 2 | Potentially habitable planets. | Since 2009, NASA’s Kepler spacecraft has been looking for Earth-like exoplanets — i.e., planets outside our solar system. Through the NASA Exoplanet Archive, you can explore, filter, and download databases of “candidate” and “confirmed” exoplanets, including Kepler’s discoveries. [h/t David Kipping] | http://kepler.nasa.gov/ http://exoplanetarchive.ipac.caltech.edu/docs/intro.html http://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=cumulative | https://twitter.com/david_kipping/status/728410422873870336 |
144 | 2016.05.11 | 3 | “Marihuana.” | The Institute for Cannabis (established in 1985 as The Institute for Hemp) has obtained, via FOIA, the U.S. Drug Enforcement Administration’s list of organizations licensed to handle marijuana — or, as the license application form calls it, “marihuana.” Many of the nearly 3,000 licensees are law enforcement organizations, but universities, pharmacies, and hospitals also pepper the list. [h/t Michael Ravnitzky] | http://birrenbach.com/INSTITUTE/ http://birrenbach.com/INSTITUTE/foia/dea/ http://www.deadiversion.usdoj.gov/drugreg/reg_apps/225/225_form.pdf | |
145 | 2016.05.11 | 4 | Obesity over time. | An international network of researchers who study noncommunicable diseases estimates the annual prevalence of obesity and diabetes for approximately 200 countries and territories around the world. The data currently covers 1975–2014 and is based on 2,000+ surveys, according to the group. Related: Bloomberg’s chart and maps of the data. | http://www.ncdrisc.org/ http://www.ncdrisc.org/d-adiposity.html http://www.ncdrisc.org/d-diabetes.html http://www.ncdrisc.org/about-us.html http://www.bloomberg.com/graphics/2016-global-obesity/ | |
146 | 2016.05.11 | 5 | Upward mobility. | In response to a freedom-of-information request, the NYC Department of Buildings provided WNYC with a spreadsheet of 76,088 “registered elevator devices” in the city. Elevators and escalators dominate the list, but you’ll also find dumbwaiters, handicap lifts, and a few other vertical transporters. The spreadsheet includes data on location, speed, maximum capacity, floors served, and more. Related: FiveThirtyEight analyzed the data last week. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle + John Templon] | https://github.com/datanews/elevators http://fivethirtyeight.com/features/new-yorks-elevators-define-the-city/ | https://twitter.com/jtemplon/status/727919536561885186 |
147 | 2016.05.18 | 1 | Drug safety. | To help monitor drug safety, the FDA collects “adverse event” reports submitted by patients, doctors, and manufacturers. You can download the (anonymized) reports from the FDA directly, but that dataset includes duplicate cases, and sometimes calls the same drug by different names. A group of researchers recently announced that they’ve cleaned up the data — removing duplicates and standardizing nomenclature — so that you don’t have to. The resulting dataset covers 4,245 drugs, more than 17,000 types of reactions, and nearly 5 million case reports. Previously: The SIDER database of pharmaceutical side effects, featured Nov. 11, 2015. | http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm http://www.nature.com/articles/sdata201626 http://datadryad.org/resource/doi:10.5061/dryad.8q0s4 http://sideeffects.embl.de/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-11-edition | |
148 | 2016.05.18 | 2 | Fossils. | The Paleobiology Database, run by a non-profit group of researchers, has aggregated data on more than a million fossils from all around the world. You can access the dataset — organized by species, era, and location — via an interactive map, download form, or API. | https://paleobiodb.org https://paleobiodb.org/navigator/ https://paleobiodb.org/cgi-bin/bridge.pl?a=displayDownloadGenerator https://paleobiodb.org/data1.2/occs/list_doc.html | |
149 | 2016.05.18 | 3 | Immigrants, internationally. | The United Nations publishes estimates of the number of foreign-born residents living in every country. The figures cover 1990 to 2015, at five-year intervals. The Vatican (100% foreign-born) and the United Arab Emirates (88%) had the highest proportion of immigrant residents in 2015; the U.S. (46.6 million) boasted the largest total immigrant population. The dataset also includes estimates by age, sex, and country of origin. Previously: Refugees in America, featured Nov. 25, 2015. [h/t Manu Balachandran] | http://www.un.org/en/development/desa/population/migration/data/estimates2/estimates15.shtml http://www.wrapsnet.org/Reports/InteractiveReporting/tabid/393/Default.aspx https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-25-edition | https://www.theatlas.com/charts/4JYKtQedx |
150 | 2016.05.18 | 4 | Tens of millions of parking tickets. | I Quant NY author Ben Wellington recently discovered that New York City had been “ticketing legally parked cars for millions of dollars a year.” To reach that finding, Wellington analyzed three years of parking tickets, amounting to more than 30 million summonses. NYC isn’t alone in providing parking ticket data; Philadelphia, Toronto, Baltimore, Seattle, and others publish similar datasets. | http://iquantny.tumblr.com/ http://iquantny.tumblr.com/post/144197004989/the-nypd-was-systematically-ticketing-legally https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2014-August-/jt7v-77mi https://data.cityofnewyork.us/dataset/Parking-Violations-Issued-Fiscal-Year-2015/c284-tqph https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2016/kiv2-tbus https://www.opendataphilly.org/dataset/parking-violations http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=ca20256c54ea4310VgnVCM1000003dd60f89RCRD https://data.baltimorecity.gov/Transportation/Parking-Citations/n4ma-fj3m https://data.seattle.gov/Public-Safety/Parking-Violations-All/q2m9-8vqf | |
151 | 2016.05.18 | 5 | Musical metadata. | The MusicBrainz database contains metadata on more than one million artists, 16 million recordings, and 900,000 pieces of cover art. You can download the data in bulk or query it via an API. Previously: The smaller-but-more-detailed Million Song Dataset, featured Feb. 10. [h/t Geoff Boeing] | https://musicbrainz.org/ https://musicbrainz.org/statistics https://musicbrainz.org/doc/MusicBrainz_Database/Download http://musicbrainz.org/doc/Development/XML_Web_Service/Version_2 http://labrosa.ee.columbia.edu/millionsong/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-02-10-edition | http://geoffboeing.com/2016/05/analyzing-lastfm-history/ |
152 | 2016.05.25 | 1 | Ain’t no mountain high, ain’t no valley low. | Governments around the world have used “LiDAR” — a laser-powered surveying technology — to build impressively precise elevation maps. In many cases, they’ve also released these topographic datasets to the public. The U.S., for instance, publishes gobs of LiDAR data through the Interagency Elevation Inventory. And you can also find LiDAR datasets for the United Kingdom, Spain, Finland, Slovenia, Denmark, Switzerland, the Netherlands, and New York City. Related: Using LiDAR data to print a 3D map of London. | https://en.wikipedia.org/wiki/Lidar https://coast.noaa.gov/inventory/ https://environmentagency.blog.gov.uk/2015/09/18/laser-surveys-light-up-open-data/ http://pnoa.ign.es/presentacion http://www.maanmittauslaitos.fi/en/professionals/topographic-data/remote-sensing/laser-scanning http://evode.arso.gov.si/indexd022.html?q=node/12 http://rapidlasso.com/2014/05/15/lasmoons-asger-s-petersen/ http://www.swisstopo.admin.ch/internet/swisstopo/en/home/products/height.html http://www.ahn.nl/index.html https://data.cityofnewyork.us/City-Government/1-foot-Digital-Elevation-Model-DEM-/dpc8-z3jc http://www.aeracode.org/2016/5/16/hello-london-rising/ | |
153 | 2016.05.25 | 2 | Risky predictions. | “There’s software used across the country to predict future criminals. And it’s biased against blacks,” a ProPublica analysis has found. The investigation focused on risk assessments and recidivism in Broward County, Florida, and found that black defendants were more likely than white defendants to be mislabeled as “high risk.” The reporters have published their methodology, code, and the underlying data — including two years of Broward County risk assessments — on GitHub. | https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing https://github.com/propublica/compas-analysis | |
154 | 2016.05.25 | 3 | Historical San Francisco rents. | To help understand San Francisco’s soaring real estate prices, Eric Fischer transcribed decades of apartment and house listings in the San Francisco Chronicle. For each year from 1948 through 1979, Fischer jotted down every monthly rent advertised in the paper on the first Sunday in April. (Similar data for 1979 through 2001 is available from San Francisco’s Housing Study DataBook.) The transcriptions are available on GitHub. [h/t Kendall Taggart + Michael Andersen] | https://experimental-geography.blogspot.com/2016/05/employment-construction-and-cost-of-san.html http://sfrb.org/san-francisco-housing-study-databook https://github.com/ericfischer/housing-inventory/ | https://twitter.com/KendallTTaggart/status/732737115205701633 https://medium.com/@andersem/a-guy-just-transcribed-30-years-of-for-rent-ads-heres-what-it-taught-us-about-sf-housing-prices-bd61fd0e4ef9#.spmzmbb6r |
155 | 2016.05.25 | 4 | American soccer salaries. | The Major League Soccer Players Union publishes salary data going back to 2007, and released 2016’s figures last week. (At $7.17 million in total compensation, Orlando City’s Kaká ranks as the league’s highest-paid player.) The MLSPU publishes the data as PDFs; I’ve converted those PDFs into CSVs for you. [h/t Rose Eveleth + John Templon] | https://www.mlsplayers.org/salary_info.html https://github.com/data-is-plural/mls-salaries | https://twitter.com/roseveleth/status/733311382679130112 https://twitter.com/jtemplon |
156 | 2016.05.25 | 5 | Photography biography. | The Photographers’ Identities Catalog aggregates data on more than 110,000 photographers and photo studios throughout history. The information “has been culled from trusted biographical dictionaries, catalogs and databases, and from extensive original research” by the New York Public Library’s photography experts. The catalog — which includes data on gender, geography, range of years active, and more — is available as raw CSVs on GitHub. | http://pic.nypl.org/ https://github.com/NYPL/pic-data | |
157 | 2016.06.01 | 1 | Veterans in America. | In 2014, approximately 22 million U.S. military veterans were still alive, including 1 million who served in World War II, 7.2 million who served during the Vietnam War era, and 3.9 million who have served in post-9/11 wars. Those numbers come from the VA’s National Center for Veterans Analysis and Statistics, which publishes estimates and future-projections of the country’s veteran population. You can explore the data by age, race, ethnicity, gender, military branch, state, county, era of service, and more. (To see the files, click on the “Population Tables” header.) [h/t Charles Worthington] | http://www.va.gov/vetdata/Veteran_Population.asp | http://opendata.stackexchange.com/a/1253 |
158 | 2016.06.01 | 2 | Farmers in Africa. | Between 2002 and 2004, researchers surveyed more than 9,500 farming households in 11 African countries to better understand how climate change might affect agricultural practices. Last month, they published the detailed results and documentation in Scientific Data. The dataset includes responses to questions about plantings, harvests, yields, water sources, animal purchases, taxes paid, and much more. | http://www.nature.com/articles/sdata201620?WT.ec_id=SDATA-201605 | |
159 | 2016.06.01 | 3 | Income inequality, country-by-country. | The United Nations University’s World Income Inequality Database contains historical Gini coefficients for more than 170 countries — in some instances stretching back to the 1930s or ‘40s. The latest version of the database was released in October 2015 and includes key details about each estimate, such as the name of the primary source and the quality of data collection. | https://www.wider.unu.edu/project/wiid-%E2%80%93-world-income-inequality-database https://en.wikipedia.org/wiki/Gini_coefficient https://www.wider.unu.edu/download/WIID3.3 | |
160 | 2016.06.01 | 4 | Speling chellange. | The Scripps National Spelling Bee publishes the competition’s results online, but not in any analysis-friendly format. Thankfully, statistician Christopher Long has scraped and spreadsheet-ified the Scripps results going back to 1996 – including last week’s finals. Related: FiveThirtyEight uses the data to ask, “Where Do Spelling Bee Words Come From?” | http://spellingbee.com/public/results/2016/round_results http://angrystatistician.blogspot.com/ https://github.com/octonion/spelling http://fivethirtyeight.com/features/where-do-spelling-bee-words-come-from/ | |
161 | 2016.06.01 | 5 | The LEGO-verse. | BrickLink is a website for buying and selling LEGOs. It also happens to publish a (nearly?) complete inventory of every LEGO set and piece produced since 1949. Related: LEGO sets have become increasingly violent, according to a recent study. [h/t Lindsey Cook] | https://www.bricklink.com https://www.bricklink.com/catalogDownload.asp http://www.bartneck.de/publications/2016/legoViolence/index.html | http://tinyletter.com/UpDownAllAround/letters/so-random |
162 | 2016.06.08 | 1 | Nuclear accidents. | Researchers in Europe have published a database of 216 nuclear energy accidents — a compendium they say is “twice the size of the previous best data set.” For each accident, the database contains the date, location, description, and four measurements of severity: its ratings on the International Nuclear Event Scale and on the Nuclear Accident Magnitude Scale, the number of fatalities, and total monetary cost. (The three most expensive: Chernobyl, Fukushima, and a 1995 accident at Japan’s Monju Nuclear Power Plant, estimated to have caused $15.5 billion in damages.) [h/t Dad] | https://innovwiki.ethz.ch/index.php/Nuclear_events_database http://onlinelibrary.wiley.com/doi/10.1111/risa.12587/full http://www-ns.iaea.org/tech-areas/emergency/ines.asp http://www.davidsmythe.org/nuclear/accidents.htm https://en.wikipedia.org/wiki/Monju_Nuclear_Power_Plant | https://www.linkedin.com/in/ed-vine-a480347 |
163 | 2016.06.08 | 2 | Government court payouts. | The U.S. government maintains a “judgment fund,” which it uses to pay plaintiffs when federal agencies lose in court (or settle “actual or imminent lawsuits”). The Department of the Treasury, which administers the fund, publishes data on these payouts for each fiscal year going back to FY2006. [h/t CJ Ciaramella] | https://www.fiscal.treasury.gov/fsservices/gov/pmt/jdgFund/judgementFund_home.htm https://jfund.fms.treas.gov/jfradSearchWeb/JFPymtSearchAction.do | http://tinyletter.com/cjciaramella/letters/foia-rundown-elephants |
164 | 2016.06.08 | 3 | Local justice data. | The Sunlight Foundation’s Hall of Justice brings together “nearly 10,000” criminal justice datasets and research documents from across the United States. You can search for topics and filter by geography, publisher, and accessibility (open, open-but-not-machine-readable, restricted access, et cetera). Related: Sunlight’s “lessons learned from a year of opening police data.” [h/t Susie Cambria + Noah Veltman] | http://hallofjustice.sunlightfoundation.com/ http://sunlightfoundation.com/blog/2016/05/04/lessons-learned-from-a-year-of-opening-police-data/ | https://twitter.com/susiecambria https://twitter.com/veltman |
165 | 2016.06.08 | 4 | The Netflix Prize, archived. | In 2006, Netflix launched a $1 million challenge to beat the company’s movie-recommendation algorithm. In 2009, Netflix awarded the prize to a group of AT&T scientists (though ultimately didn’t use the winning algorithm). The challenge, which was open to the public, was based on a dataset of 100 million ratings from 480,000 (anonymized) users, corresponding to more than 17,000 movies between Oct. 1998 and Dec. 2005. The dataset, once hosted at UC Irvine, is currently available through the Internet Archive. Previously: MovieLens, featured Jan. 27. [h/t Brandon Loudermilk] | http://www.netflixprize.com/ https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml https://web.archive.org/web/20090925184737/http://archive.ics.uci.edu/ml/datasets/Netflix+Prize https://archive.org/details/nf_prize_dataset.tar http://grouplens.org/datasets/movielens/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-01-27-edition | http://opendata.stackexchange.com/a/7884 |
166 | 2016.06.08 | 5 | Billboard hits and lyrics. | Statistics grad student Kaylin Walker scraped 50 years of Billboard’s “Year-End Hot 100” rankings and those songs’ lyrics. Related: Walker’s analysis and methodology. [h/t Melissa Bierly] | https://github.com/walkerkq/musiclyrics http://kaylinwalker.com/50-years-of-pop-music/ | https://blog.modeanalytics.com/analytics-dispatch-026/ |
167 | 2016.06.15 | 1 | (Almost) every politician. | On everypolitician.org, you can search and download data on 70,000+ legislators (past and present) from 233 countries. (Among those missing: Cuba, Ethiopia, and Qatar.) The dataset includes each lawmaker’s party affiliation, years served, gender, social media profiles, and more. Related: Every member of the United States Congress since 1789. | http://everypolitician.org/ http://docs.everypolitician.org/repo_structure.html http://everypolitician.org/countries.html https://github.com/unitedstates/congress-legislators | |
168 | 2016.06.15 | 2 | Title IX investigations. | The Chronicle of Higher Education has been tracking federal investigations into sexual assault on college campuses. Recently, The Chronicle added an API, so that developers and data analysts can access the data more easily. Currently, the dataset includes 292 investigations conducted since April 2011 — 49 of which have been resolved. [h/t Jon Davenport] | http://projects.chronicle.com/titleix/ http://projects.chronicle.com/titleix/api/v1/docs/ | https://twitter.com/JonDavenport1/status/741372292710707200 |
169 | 2016.06.15 | 3 | Air quality. | Last month, the World Health Organization released its latest update to the Global Urban Ambient Air Pollution Database, which now covers nearly 3,000 cities in 103 countries. For each city, the dataset includes annual average density of two key categories of particulates (PM2.5 and PM10), as well as details regarding the data collection. According to the organization’s own analysis, “98% of cities in low and middle income countries with more than 100,000 inhabitants do not meet WHO air quality guidelines.” Related: “A New Air Pollution Database Is Good, but Imperfect.” | http://www.who.int/phe/health_topics/outdoorair/databases/cities/en/ http://blogs.scientificamerican.com/guest-blog/a-new-air-pollution-database-is-good-but-imperfect/ | |
170 | 2016.06.15 | 4 | Gone phishing. | PhishTank is a clearinghouse that tracks thieves’ attempts to steal personal information and online credentials. The website also publishes bulk data on all verified phishing attempts — 44,000 and counting. With more than 1,000 phishing attempts recorded against it, PayPal is the single most-targeted website in the database. [h/t Herman Slatman] | https://www.phishtank.com/index.php https://www.phishtank.com/developer_info.php | https://github.com/hslatman/awesome-threat-intelligence |
171 | 2016.06.15 | 5 | Gone fishing. | The National Oceanic and Atmospheric Administration’s Fisheries Statistics Division provides data on seafood caught by U.S. commercial fisheries, sliceable by month, species, and fishing gear. You can learn, for example, that these fisheries caught 88,893,305 pounds of Dungeness crab in 2006 — the highest recorded total since at least 1950. [h/t Gwynn Guilford] | https://www.st.nmfs.noaa.gov/commercial-fisheries/commercial-landings/ | https://www.theatlas.com/charts/4yVqvlTR |
172 | 2016.06.22 | 1 | Nonprofit IRS filings — at long last. | Last week, the Internal Revenue Service released a huge dataset of nonprofits’ annual Form 990 filings, which provide details on program expenses, salaries, and more. More than 60% of Form 990s are filed digitally, according to the IRS. Previously, those forms were only available as images; now the IRS is publishing them as analysis-friendly XML files. (You can also download the data in bulk from the Internet Archive, thanks to Carl Malamud, the public domain advocate who led the fight for 990s-as-XML.) One early observer noted that some of the data was misformatted, and has provided instructions for fixing it. [h/t Andrew Sullivan + Kendall Taggart] | https://www.irs.gov/uac/newsroom/irs-makes-electronically-filed-form-990-data-available-in-new-format https://aws.amazon.com/public-data-sets/irs-990/ https://archive.org/details/IRS990-efile https://twitter.com/licyeus/status/743308612672466944 https://gist.github.com/licyeus/95b99d6feb423ebea604b5f3e2cdf590 | https://twitter.com/licyeus/status/743308612672466944 https://twitter.com/kendallttaggart |
173 | 2016.06.22 | 2 | 6,000 years of urbanization. | Earlier this month, researchers published “the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000,” along with a detailed methodology. The dataset digitizes and geocodes population numbers originally tabulated by historian Tertius Chandler (Four Thousand Years of Urban Growth) and political scientist George Modelski (World Cities: -3,000 to 2,000). Though “far from comprehensive,” the authors say that the dataset is a “first step towards understanding the geographic distribution of urban populations throughout history.” Related: “Watch 6,000 years of urbanization taking over the world.” | http://www.nature.com/articles/sdata201634 http://urban.yale.edu/data http://www.worldcat.org/title/four-thousand-years-of-urban-growth/oclc/59678315 http://www.worldcat.org/title/world-cities-3000-to-2000/oclc/57695214 http://qz.com/706051/706051/ | |
174 | 2016.06.22 | 3 | Bull vs. man. | Next month, thousands of adrenaline junkies will gather in Pamplona for the city’s annual Running of the Bulls. The San Fermin festival, which organizes the spectacle, publishes injury data on its website. (Here’s a shortcut to display every year of data, instead of one year at a time.) Last year, the bulls gored 10 runners and injured another 27. Related: “Your Chances Of Being Gored By A Bull In Pamplona Are Getting Higher.” | http://www.sanfermin.com/index.php/en/encierro/como-correr/cuanta-gente-corre-en-el-encierro-de-sanfermin http://www.sanfermin.com/index.php/en/encierro/que-es http://www.sanfermin.com/index.php/en/encierro/buscador/buscador-encierros http://www.sanfermin.com/old/encierrometro/buscador_encierros.php?lang=eng&buscar=1 http://fivethirtyeight.com/datalab/your-chances-of-being-gored-by-a-bull-in-pamplona-are-getting-higher/ | |
175 | 2016.06.22 | 4 | 2BR with vinyl siding, sweet 2BR with vinyl siding. | The U.S. Census Bureau’s Annual Characteristics of New Housing culls data on features such as square footage, wall material, number of bedrooms, and number of fireplaces. (Air conditioning was present in 93% of new single-family homes built in 2015, up from 49% in 1973.) Related: “Houses Keep Getting Bigger, Even as Families Get Smaller.” [h/t Lindsey Cook] | http://www.census.gov/construction/chars/ http://www.nytimes.com/2016/06/04/upshot/houses-keep-getting-bigger-even-as-families-get-smaller.html | http://tinyletter.com/UpDownAllAround/letters/orlando |
176 | 2016.06.22 | 5 | The Hum. | “Most people find this website because they are searching for the source of an unusual low frequency sound.” The World Hum Database currently includes more than 10,000 reader-submitted reports, including recent submissions that describe the noise as sounding “like a fridge,” “like a train in the distance,” and “like a cicada that never shuts up.” [h/t Susie Cambria] | http://www.thehum.info/ https://www.google.com/fusiontables/DataSource?docid=1EyjVZqUPpoXGQDa_cry9DqFaBVMOgEyq-3qo85bx#rows:id=1 | https://twitter.com/susiecambria |
177 | 2016.07.06 | 1 | Public Policy. | The Correlates of State Policy Project aims to become a “one-stop shop” for data related to public policy in America’s 50 states. So far, the project is tracking 700+ aspects of each state’s laws, budgets, demographics, and more. Among the policy variables: Can pharmacies dispense emergency contraception without a prescription? Does the state ban corporal punishment in schools? and Does the state have an endangered species act? Don’t miss the codebook, which describes the data and sources in greater detail. Related: State and Local Public Policies in the United States, a similar project, for which an update to include 2014 data is “underway.” [h/t Rob Gillezeau] | http://ippsr.msu.edu/public-policy/correlates-state-policy http://www.matthewg.org/correlatesofstatepolicyprojectv1Codebook.pdf http://www.statepolicyindex.com/about/ | https://twitter.com/robgillezeau/status/746080180280537088 |
178 | 2016.07.06 | 2 | Nursing homes. | Last month, German investigative nonprofit Correctiv published a searchable database of 13,000 nursing homes in the country. The data are based on government inspections, and the reporters have published the raw and processed data on GitHub. Related: ProPublica’s searchable database of nursing homes in the United States and Medicare’s nursing home data. [h/t Sandhya Kambhampati] | https://correctiv.org/en/investigations/nursing-homes/guide/ https://correctiv.org/en/investigations/nursing-homes/articles/2016/06/09/nursing-homes-what-we-know-what-we-do-not-know/ https://github.com/correctiv/pflege-notebook http://projects.propublica.org/nursing-homes/ https://data.medicare.gov/data/nursing-home-compare | https://twitter.com/sandhya__k |
179 | 2016.07.06 | 3 | Russian election results. | In a recently-updated paper, three academics say they’ve found “convincing evidence of election fraud” in federal Russian elections since 2004. To support their analyses, the researchers have published the underlying data, which includes polling station data from seven Russian elections (as well as one Polish and one Spanish election, which showed no such signs of fraud). Related: WSJ analysis of Russian parliamentary election “points to widespread fraud” (2012). [h/t Arthur Bashlykov] | http://arxiv.org/abs/1410.6059 https://figshare.com/articles/kobakEtAl_AOAS2016_suppData_zip/3126883 http://www.wsj.com/articles/SB10001424052970203391104577124540544822220 | https://www.linkedin.com/in/arthur-bashlykov-8a3b2b102 |
180 | 2016.07.06 | 4 | NYC property taxes and exemptions. | Property tax data in New York City is technically available to the public, but the city makes it difficult to access. So a pair of civic hackers liberated the data. Now you can download 1.1 million rows of bulk data, which details each property’s type, assessed value, taxes due, owner’s name, and more. You can also download 750,000 rows of tax exemptions and abatements. Related: “A Look at NYC’s $650 Million Property Tax Breaks Related to Religion” | http://chriswhong.com/open-data/liberating-data-from-nyc-property-tax-bills/ http://iquantny.tumblr.com/post/146688053904/payer-or-prayer-a-look-at-nycs-650-million | |
181 | 2016.07.06 | 5 | You shouldn’t point lasers at airplanes. | And yet, people do... by the thousands. In 2005, the Federal Aviation Administration created a system for pilots to report “laser events,” which it says can temporarily blind crewmembers. The administration has published five years of data from the reporting system. In 2014, the most recent year available, pilots reported 3,894 laser beamings. The vast majority involved a green beam, and none were reported to have caused an injury. | http://www.faa.gov/news/press_releases/news_story.cfm?newsId=12765 http://www.faa.gov/about/initiatives/lasers/laws/ | |
182 | 2016.07.13 | 1 | Every United Nations vote, 1946–2014. | This repository contains voting data from each of the UN General Assembly’s first 69 sessions. One spreadsheet summarizes the topic and results of each voted-upon resolution. (The dataset also indicates whether the U.S. State Department identified the vote as “important” — such as those condemning human rights violations in Syria and North Korea — in its annual Voting Practices in the United Nations report.) Another file contains each country’s individual voting decisions. [h/t David Robinson] | https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379 http://www.state.gov/p/io/rls/rpt/index.htm | https://twitter.com/drob/status/751398401867182080 |
183 | 2016.07.13 | 2 | The money bone. | Late last month, the Centers for Medicare and Medicaid Services added data from 2015 to its Open Payments database, which tracks medical companies’ payments to doctors and teaching hospitals. The payments — which include consulting fees, gifts, honoraria, meals, drinks, grants, and more — totaled more than $7.5 billion last year. Related: ProPublica’s Dollars for Docs project, which began tracking medical industry payments in 2010, long before CMS released the OpenPayments database. [h/t Cat Ferguson + Chris Hamby] | https://www.cms.gov/Newsroom/MediaReleaseDatabase/Press-releases/2016-Press-releases-items/2016-06-30.html https://openpaymentsdata.cms.gov/ https://www.cms.gov/OpenPayments/About/Natures-of-Payment.html https://projects.propublica.org/docdollars/ | https://twitter.com/biocuriosity https://twitter.com/ChrisDHamby |
184 | 2016.07.13 | 3 | Hundreds of millions of street addresses. | OpenAddresses.io is an effort to collect the official geocoordinates of all the world’s physical addresses. (These data come from “authoritative” sources, such as city governments. When Google Maps tells you the location of an address, it’s often just a very-educated guess, extrapolated from coarser data.) As of Monday evening, the project had processed 265,078,567 addresses, mostly in North America, Europe, Japan, and Australia. Related: “Open-source geo is really something right now.” | https://openaddresses.io/ https://results.openaddresses.io/ https://trackchanges.postlight.com/open-source-geo-is-really-something-right-now-f8e310c5f57a#.t2f638jnj | |
185 | 2016.07.13 | 4 | Workplace safety. | The U.S. Occupational Safety and Health Administration (OSHA) conducted 86,000 workplace inspections last year. The agency makes its inspection results — including investigations of fatal accidents and severe injuries — available in bulk and via an API. | http://ogesdw.dol.gov/views/data_summary.php http://developer.dol.gov/health-and-safety/dol-osha-enforcement | |
186 | 2016.07.13 | 5 | Catch ‘em all. | Pokéapi is an API “detailing everything about the Pokémon main game series,” including every character, evolution, battle skill, and more. The data is also available as a series of CSVs. Currently, however, the dataset doesn’t include details from the so-hot-right-now Pokémon Go game. | https://pokeapi.co/ https://github.com/phalt/pokeapi/tree/master/data/v2/csv https://en.wikipedia.org/wiki/Pok%C3%A9mon_Go | |
187 | 2016.07.20 | 1 | Coups d'état. | Two political science professors at the University of Kentucky are compiling a dataset of coup attempts. So far, the dataset covers both successful and unsuccessful attempts from 1950 to late 2015. During those 65+ years, coup plotters have been foiled about half the time, with 236 victories and 238 failures. According to the dataset, Bolivia’s top leaders have faced 23 coup attempts, including 11 successful overthrows — more than any other country by either metric. [h/t Arthur Charpentier] | http://www.jonathanmpowell.com/coup-detat-dataset.html | https://twitter.com/freakonometrics |
188 | 2016.07.20 | 2 | Tech support. | StackOverflow is a Q&A site for programmers, and part of the larger StackExchange network of Q&A communities. StackExchange publishes periodic data dumps of the network’s users, questions, answers, votes, and comments. On Monday, the company released “StackLite,” a smaller, easier-to-use slice of the data. (Even so, it contains metadata on more than 15 million questions.) If you don’t want to download anything, you can also explore and analyze the data online. [h/t David Robinson] | http://stackoverflow.com/ http://stackexchange.com/ https://archive.org/details/stackexchange https://github.com/dgrtwo/StackLite https://data.stackexchange.com/ | http://varianceexplained.org/r/stack-lite/ |
189 | 2016.07.20 | 3 | 🔥 🔥 🔥 . | The National Fire Incident Reporting System (NFIRS) is “the world’s largest, national, annual database of fire incident information,” containing about 1 million fires per year, including wildfires, structure fires, vehicle fires, and more. NFIRS data from 2013 (and prior years) are available online from FEMA. Looking for 2014’s data? The government asks you to request it via postal mail; or you could trust the copy a public safety analyst uploaded in March. (See the links at the bottom of that page.) The U.S. Fire Administration, which maintains NFIRS, publishes additional datasets, including a spreadsheet of 27,000+ fire departments and a database of on-duty firefighter fatalities. Also, the U.S. Geological Survey publishes data on current and historical wildfire perimeters. [h/t Nick Penzenstadler + Nadja Popovich] | https://www.usfa.fema.gov/data/nfirs/ https://www.fema.gov/media-library/assets/documents/112009 https://www.linkedin.com/pulse/nfirs-2014-available-dov-chelst https://github.com/dnchelst/NFIRS https://www.usfa.fema.gov/data/statistics/order_download_data.html https://apps.usfa.fema.gov/census-download/main/download https://apps.usfa.fema.gov/firefighter-fatalities/ http://rmgsc.cr.usgs.gov/outgoing/GeoMAC/ | https://twitter.com/npenzenstadler/status/754010911292190720 https://twitter.com/popovichn |
190 | 2016.07.20 | 4 | World heritage sites. | Today, UNESCO’s World Heritage Committee will wrap up its 40th session, during which it has “inscribed” more than 20 new awe-inspiring places around the world. Online, the organization publishes spreadsheets and map files of 1,031 heritage sites it has previously inducted. For each site, the spreadsheet tracks its location, size, date inducted, category (“cultural,” “natural,” or “mixed”), which selection criteria it met, and more. Through 2015, the countries with the largest number of heritage sites were Italy (51), China (48), and Spain (44). | http://whc.unesco.org/en/sessions/40COM/ http://whc.unesco.org/en/newproperties/ http://whc.unesco.org/en/syndication http://whc.unesco.org/en/criteria/ | |
191 | 2016.07.20 | 5 | Paperwork, work, work, work, work, work. | Thanks to the Paperwork Reduction Act, federal agencies must get approval from the Office of Information and Regulatory Affairs for any “information collection” (e.g., a form) that seeks 10 or more responses. You can search all information collections — under review, approved, or rejected — online, or download an XML file of all active collections. | http://www.reginfo.gov/public/do/PRASearch http://www.reginfo.gov/public/do/PRAXML | |
192 | 2016.07.27 | 1 | How we die. | The Global Burden of Disease dataset represents “the largest and most comprehensive effort to date to measure epidemiological levels and trends worldwide,” according to the Institute for Health Metrics and Evaluation, which runs the project. For each disease and each country, the dataset contains estimates of the total deaths, years of life lost, and years lived with disability. The estimates are currently available for 1990, 1995, 2000, 2005, 2010, and 2013. Related: “Where We Live and How We Die: What a year of death looks like around the world.” [h/t Mimi Onuoha + Data & Society] | http://www.healthdata.org/gbd/data http://www.healthdata.org/gbd https://howwegettonext.com/where-we-live-and-how-we-die-36eeb4c256ab#.6g464ysu0 | https://twitter.com/thistimeitsmimi http://us7.campaign-archive1.com/?u=00b33d1beca407762446037f0&id=5cf8e71652&e=3bafe38e66 |
193 | 2016.07.27 | 2 | Public transit. | Transitland and TransitFeeds both aggregate data on routes, stops, and timetables from hundreds of public transit systems — from the Bay Area’s BART, to New York’s MTA, to Milan’s ATM, to Budapest’s BKK. | https://transit.land/ https://transitfeeds.com/ | |
194 | 2016.07.27 | 3 | Public libraries. | The U.S. Institute of Museum and Library Services annually collects responses from 9,000 public library systems. The results, currently available through 2013, include information about the libraries’ collection size, physical footprint, population served, hours, and more. Previously: Every known museum in the United States, featured Nov. 4, 2015. | https://www.imls.gov/research-evaluation/data-collection/public-libraries-united-states-survey/public-libraries-united https://www.imls.gov/research-evaluation/data-collection/museum-universe-data-file https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-04-edition | |
195 | 2016.07.27 | 4 | Walrus hangouts. | The Pacific walrus (Odobenus rosmarus divergens) accounts for the vast majority of walruses on the planet. When they’re not swimming, Pacific walruses like to rest at places called “haulouts.” A new dataset and study include details on 150 current and historic haulouts, the largest of which has been reported to attract more than 100,000 walruses. Miscellany: Three of the study’s authors work for the U.S. Department of the Interior; the fourth works for Russia’s Institute of Biological Problems of the North. [h/t Keith Collins] | http://alaska.usgs.gov/products/data.php?dataid=74 http://www.ibpn.ru/en/ | https://twitter.com/collinskeith |
196 | 2016.07.27 | 5 | Bigfoot sightings. | The Bigfoot Field Researchers Organization dubs itself “the only scientific research organization exploring the bigfoot/sasquatch mystery.” The BFRO collects and vets sighting reports, and publishes them online. (Direct link to KMZ file.) Related: “'Squatch Watch: 92 Years of Bigfoot Sightings in the US and Canada.” [h/t Joshua Stevens + Lynn Cherny] | http://www.bfro.net/ http://www.bfro.net/news/google_earth.asp http://www.bfro.net/app/AllReportsKMZ.aspx http://www.joshuastevens.net/visualization/squatch-watch-92-years-of-bigfoot-sightings-in-us-and-canada/ | https://twitter.com/jscarto/status/743861481998016512 https://twitter.com/arnicas/status/743859743945555968 |
197 | 2016.08.03 | 1 | Vaccination nations. | The World Health Organization publishes a slew of datasets on national vaccination rates and policies. Some facts gleaned from the data: Asked whether they provided routine vaccinations to children at school, just 55% of 191 countries that responded said they did. And: In 2015, Equatorial Guinea reported that only 26% of infants had received a first dose of measles vaccine, a lower rate than any other country’s. [h/t Philip Shemella] | http://www.who.int/immunization/monitoring_surveillance/data/en/ | http://opendata.stackexchange.com/questions/7220/vaccination-policies/9345#9345 |
198 | 2016.08.03 | 2 | Deaths in police custody. | At least 6,913 people died while in the custody of Texas police, jails, and prisons between 2005 and 2015, according to the newly-launched Texas Justice Initiative. The data, gathered through freedom-of-information requests, contains the age, sex, and race/ethnicity of each person who died, as well as the general cause of death and a more detailed summary. Read more at: The Atlantic. Related: California’s Department of Justice publishes similar statistics and raw data. [h/t Melissa Segura + Reade Levinson] | http://texasjusticeinitiative.org/ http://www.theatlantic.com/politics/archive/2016/07/7000-deaths-in-custody-texas/493325/ https://openjustice.doj.ca.gov/death-in-custody/overview https://openjustice.doj.ca.gov/data | https://twitter.com/MelissaDSegura https://twitter.com/readelev |
199 | 2016.08.03 | 3 | Electricity prices. | In May 2016, U.S. residential consumers paid an average of roughly 12.8 cents per kilowatt hour of electricity. The price was lowest in Louisiana (9.28 cents) and Washington state (9.54 cents), and highest in Hawaii (26.87 cents) and Connecticut (21.63 cents). These data-points, and more, are available through the Energy Information Administration’s electric power reports, which are updated monthly. [h/t Jordan Wirfs-Brock] | http://www.eia.gov/electricity/monthly/epm_table_grapher.cfm?t=epmt_5_06_a http://www.eia.gov/electricity/monthly/ | https://github.com/InsideEnergy/24-energy-stories-CAR16 |
200 | 2016.08.03 | 4 | Measuring up. | A group of public health researchers have estimated the average height of adults in 200 countries over the course of a century. Their calculations are based on a re-analysis of 1,472 previous studies, which collectively measured nearly 19 million participants. The resulting dataset contains annual height estimates for both men and women born each year between 1896 and 1996. During that time, South Korean women’s average height increased by approximately 8 inches, the largest gain of any group. These days, the Netherlands boasts the tallest men, and Latvia the tallest women. | https://elifesciences.org/content/5/e13410 http://www.ncdrisc.org/d-height.html | |
201 | 2016.08.03 | 5 | Email like it’s 1993. | The 20 Newsgroups dataset contains 20,000 messages (including some duplicates) sent to 20 Usenet bulletin boards in 1993. Among the groups: alt.atheism, misc.forsale, sci.electronics, talk.politics.guns, and talk.politics.mideast. | http://qwone.com/~jason/20Newsgroups/ https://en.wikipedia.org/wiki/Usenet_newsgroup | |
202 | 2016.08.10 | 1 | Pretrial inmates. | Connecticut has begun publishing a daily census of every inmate held in jail while awaiting trial. Starting July 1, the database contains one row per inmate per day; each row includes basic demographic data (age, gender, race), as well as the inmate’s bond amount, main offense, and jail location. Read more at: The New Haven Independent and TrendCT. Question: This release seems unprecedented; does any other state or country publish such detailed data on pretrial inmates? [h/t Camille Seaberry] | https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciltie/b674-jy6w http://www.newhavenindependent.org/index.php/archives/entry/bail_reform1/ http://trendct.org/2016/07/20/pre-trial-inmates/ | http://www.ctdatahaven.org/staff |
203 | 2016.08.10 | 2 | U.S. slave populations, 1790–1860. | For more than a century, the U.S. Census collected slave population figures. An assistant professor at George Mason University has aggregated that data, and mapped it. He cautions: “Treat the Census numbers skeptically: even in the best of circumstances the Census undercounts the population.” Previously: New Orleans slave sales in the December 30 edition; slave ship voyages in the January 20 edition. | http://lincolnmullen.com/ https://github.com/lmullen/slavery-map/blob/master/census.csv http://lincolnmullen.com/projects/slavery/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-30-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-01-20-edition | |
204 | 2016.08.10 | 3 | Hospital ratings. | The Centers for Medicare & Medicaid Services evaluates hospitals on dozens of measures — relating to safety, timeliness of care, patient satisfaction, and more — and publishes the results online as the “Hospital Compare” dataset. The dataset also includes an overall score, which distills each hospital’s results into a single five-star rating. If you don’t want to download the data, you can explore the results online. [h/t Drew Ivan] | https://www.medicare.gov/hospitalcompare/Data/Data-Updated.html# https://data.medicare.gov/data/archives/hospital-compare https://www.medicare.gov/hospitalcompare/Data/Hospital-overall-ratings-calculation.html https://www.medicare.gov/HospitalCompare/search.html | https://twitter.com/drewivan |
205 | 2016.08.10 | 4 | Rocks. | Macrostrat.org provides data and maps on thousands of geologic formations around the world. The database currently includes 1,474 “regional columns,” 33,903 “rock units,” and 1,750,044 “geologic map polygons.” You can also explore the data through the University of Minnesota’s “Flyover Country” iOS and Android apps. [h/t Grant J. Smith] | https://macrostrat.org/ https://macrostrat.org/#api https://macrostrat.org/burwell/ http://fc.umn.edu/ | https://twitter.com/grantmeaccess/status/720992509950865412 |
206 | 2016.08.10 | 5 | Heartbeats. | PhysioNet has published sound and data files for more than 3,000 heart recordings (a.k.a. phonocardiograms). The files support PhysioNet’s 2016 contest, which seeks algorithms that can detect abnormal heart sounds. [h/t Joe Isaacson] | http://physionet.org/physiobank/database/challenge/2016/ https://en.wikipedia.org/wiki/Phonocardiogram http://physionet.org/challenge/2016/#introduction | https://github.com/jisaacso/DeepHeart |
207 | 2016.08.17 | 1 | Beaches. | The U.S. Environmental Protection Agency’s BEACON system contains data on more than 5,000 public beaches. For each state’s most “significant” beaches, BEACON’s downloadable reports include data on water quality, pollution advisories, closures, and more. Of these highly-visited beaches, the longest — at nearly 24 miles — is the Oregon Dunes National Recreation Area’s South Jetty, also home to “the largest expanse of coastal sand dunes in North America.” | https://watersgeo.epa.gov/beacon2/reports.html http://www.fs.fed.us/visit/destination/oregon-dunes-national-recreation-area-%E2%80%93-south-jetty-area | |
208 | 2016.08.17 | 2 | Internet access. | Through its Form 477 program, the Federal Communications Commission collects detailed data on broadband internet access in the United States. One of the easiest ways to access county-level data is through the agency's Mapping Broadband Health in America project, which overlays internet access data and physical health indicators. The latest tabulations come from 2014. In more than a quarter of counties with at least 1,000 residents that year, broadband reached less than 50% of the population. | https://www.fcc.gov/general/broadband-deployment-data-fcc-form-477 https://www.fcc.gov/health/maps/methodology | |
209 | 2016.08.17 | 3 | Music makers. | The American Society of Composers, Authors and Publishers (ASCAP) boasts a membership of “more than 585,000 US composers, songwriters, lyricists and music publishers of every kind of music.” The organization also maintains a downloadable catalog of the writers and publishers behind nearly 9 million songs. (But the downloaded files lack key details, such as the date the song was published.) | http://www.ascap.com/about https://mobile.ascap.com/aceclient/AceWeb/ | |
210 | 2016.08.17 | 4 | Relative living standards. | The Penn World Table contains GDP estimates, normalized for purchasing power, for 182 countries. These “real GDP” estimates — based on a combination of price surveys and national accounts data — stretch back at least to 1960, and many to 1950. In the most recent year available, 2014, Qatar’s real GDP per capita ranked highest: roughly $144,340 in 2011 U.S. dollars. The Central African Republic’s ranked lowest (~$594), and the United States’ ranked 11th (~$52,292). [h/t Willem Kerstholt] | http://www.rug.nl/research/ggdc/data/pwt/ https://en.wikipedia.org/wiki/National_accounts | https://www.linkedin.com/in/willemkerstholt |
211 | 2016.08.17 | 5 | Hunter-gatherers. | In the 1990s, ethnoarchaeologist Lewis Binford digitized more than 200 variables describing 339 groups of hunter-gatherers, a project his collaborator and widow Amber Johnson continues to maintain. The data come from historical ethnographies of societies, ranging from the Chichimec of the 1570s (in what is now Mexico), to the Dorobo of the 1920s (in what is now Kenya), to the Shompen of the 1980s (in the Nicobar Islands). | https://en.wikipedia.org/wiki/Lewis_Binford http://ajohnson.sites.truman.edu/data-and-program/ | |
212 | 2016.08.24 | 1 | Crime in cities. | The Marshall Project has collected and analyzed four decades of FBI data “on the most serious violent crimes in 68 police jurisdictions.” The FBI data covers 1975 through 2014; the reporters “also obtained data directly from 61 local agencies for 2015 — a period for which the FBI has not yet released its numbers.” Between 2010 and 2015, violent crime increased most in Milwaukee (+11%) and declined most in Prince George’s County, Md. (-22%). | https://github.com/themarshallproject/city-crime https://www.themarshallproject.org/2016/08/18/crime-in-context | |
213 | 2016.08.24 | 2 | One billion Australian healthcare claims. | Australia’s Department of Health has recently released an enormous dataset of Medicare and subsidized-prescription claims. It includes all claims from a random 10% sample of patients, and “contains approximately 1 billion lines of data relating to approximately 3 million Australians.” The Medicare claims go back to 1984, and the prescription claims go back to 2003. [h/t Drew Ivan] | https://data.gov.au/dataset/a8e3c0bc-44ac-4e9a-8b3c-b779438ddb10 | |
214 | 2016.08.24 | 3 | Oil concessions. | The OpenOil project aims to collect and standardize data on oil and gas development contracts around the world. So far, they’ve gathered at least some data from more than 60 countries. They’ve also published a map of oil concessions in the Middle East and Africa. [h/t Michael Gardiner] | http://openoil.net/ http://repository.openoil.net/wiki/Concession_Layer_Methodology http://maps.openoil.net/concessions/ | |
215 | 2016.08.24 | 4 | New York racehorse deaths and injuries. | New York State has tracked every horse injury and death at state race tracks since March 2009. The dataset, which is updated often, also includes a few other types of incidents, such as when a rider falls or a horse loses badly. Related: “Horses’ Deaths at Aqueduct Prompt New Rules.” [h/t Mark Secada] | https://data.ny.gov/Government-Finance/Equine-Death-and-Breakdown/q6ts-kwhk http://www.nytimes.com/2015/01/18/sports/horses-deaths-at-aqueduct-prompt-new-rules.html | |
216 | 2016.08.24 | 5 | German traffic signs. | The German Traffic Sign Recognition Benchmark dataset contains 50,000+ images of 43 kinds of German traffic signs — from the classic “STOP,” to various speed limits, to roundabout indicators. The dataset, published by researchers at Ruhr-Universität Bochum’s Institut für Neuroinformatik, formed the basis of a 2011 machine-learning competition. [h/t Viktor Schepik] | http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset | |
217 | 2016.08.31 | 1 | Family money. | The Panel Study of Income Dynamics is “the longest running longitudinal household survey in the world,” according to its University of Michigan overseers. The study, which began in 1968, has interviewed more than 70,000 people, including four generations of some families. You can access the data for free, but you first need to register for an account and agree to a set of guidelines. An example insight: In 2013 — the most recent year for which data is available — approximately 11% of families said they owned a business in the previous year. [h/t Don Fullerton + Nirupama S. Rao] | http://psidonline.isr.umich.edu/ | http://www.nber.org/papers/w22580 |
218 | 2016.08.31 | 2 | Fatal car crashes. | On Monday, the Department of Transportation released 2015 data from its Fatality Analysis Reporting System. The dataset contains detailed information about every fatal motor-vehicle crash in the U.S., aggregated from a variety of state databases, including police reports, death certificates, and licensing files. In 2015, such crashes led to 35,092 deaths, 7.2% more than in 2014. [h/t Tanya Snyder] | https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze http://www.nhtsa.gov/FARS | https://twitter.com/TSnyderDC/status/770339872792014848 |
219 | 2016.08.31 | 3 | Global agriculture. | EarthStat provides geographic data on harvest regions, yields, and fertilizer use for more than 100 crops. The website also publishes data on pasture land, water depletion, and climatological effects on crop yields. | http://www.earthstat.org/ | |
220 | 2016.08.31 | 4 | The California Database Hunt. | California Senate Bill 272, enacted last year, required every local government agency to publish a “catalog of enterprise systems” — essentially a guide to all the big databases they keep — by July 1 of this year. To find out who complied, a group of data-transparency organizations hosted the California Database Hunt last weekend. Volunteers searched 680 agencies, and published two spreadsheets of their findings: 430 (63%) of local agencies had posted their database catalogs, while 250 had not. [h/t Stephanie M. Lee] | https://leginfo.legislature.ca.gov/faces/billHistoryClient.xhtml?bill_id=201520160SB272 https://www.eff.org/deeplinks/2016/08/transparency-advocates-collect-more-400-database-catalogs | https://twitter.com/stephaniemlee |
221 | 2016.08.31 | 5 | Tennis time. | The 2016 U.S. Open began on Monday. It’s as good an occasion as any to highlight the work of TennisAbstract.com’s Jeff Sackmann, who has published decades of match results and historical rankings from the men’s ATP and women’s WTA tours. Related: How FiveThirtyEight is using the data to forecast this year’s U.S. Open. Also: Prize money for the four Grand Slam tournaments, by gender and over time. And: The Tennis Racket. [h/t Nadja Popovich + John Templon] | http://www.tennisabstract.com/ https://github.com/JeffSackmann https://github.com/JeffSackmann/tennis_atp https://github.com/JeffSackmann/tennis_wta http://fivethirtyeight.com/features/how-were-forecasting-the-2016-us-open/ https://github.com/popovichN/grand-slam-prize-money https://www.buzzfeed.com/heidiblake/the-tennis-racket | https://twitter.com/popovichn https://twitter.com/jtemplon |
222 | 2016.09.07 | 1 | Healthcare spending. | Since 1996, the Medical Expenditure Panel Survey has collected data on “the specific health services that Americans use,” and the “health insurance held by and available to U.S. workers.” In a typical year, the survey collects data from more than 30,000 people from more than 10,000 families. In addition to the raw data files, the Agency for Healthcare Research and Quality, which runs the survey, also provides summary data tables. They show that, for example, in 2013 an estimated 61% of Americans faced expenses for prescription drugs, which cost the median patient about $278 before insurance. [h/t Ricardo Pietrobon] | https://meps.ahrq.gov/mepsweb/ https://meps.ahrq.gov/mepsweb/data_stats/download_data_files.jsp https://meps.ahrq.gov/mepsweb/data_stats/quick_tables.jsp https://meps.ahrq.gov/mepsweb/data_stats/tables_compendia_hh_interactive.jsp?_SERVICE=MEPSSocket0&_PROGRAM=MEPSPGM.TC.SAS&File=HCFY2013&Table=HCFY2013_PLEXP_A&VAR1=AGE&VAR2=SEX&VAR3=RACETH5C&VAR4=INSURCOV&VAR5=POVCAT13&VAR6=REGION&VAR7=HEALTH&VARO1=4+17+44+64&VARO2=1&VARO3=1&VARO4=1&VARO5=1&VARO6=1&VARO7=1&_Debug= | https://twitter.com/rpietro |
223 | 2016.09.07 | 2 | Radio rights. | The Federal Communications Commission decides who can use the nation’s airwaves and how. To date, they’ve issued millions of licenses, including nearly 200,000 last year for broadcast, personal use, law enforcement, and more. Almost exactly six years ago, the FCC launched a consolidated portal that pulls data from its various licensing systems into a single dataset. You can download all 17 million licenses in bulk, search for specific licenses online, or query the dataset’s API. [h/t Marc DaCosta] | http://reboot.fcc.gov/blog?entryId=752037 https://www.fcc.gov/licensing-databases/licensing http://reboot.fcc.gov/license-view/ http://reboot.fcc.gov/license-view/search-results.html https://www.fcc.gov/general/license-view-api | https://twitter.com/marc_dacosta |
224 | 2016.09.07 | 3 | Night lights. | Earlier this summer, a group of researchers published a “new world atlas of artificial night sky brightness,” also known as light pollution. You can download a KMZ version of their atlas and view it in Google Earth. The researchers haven’t made their most detailed, “floating point” dataset available for public download; instead, they ask that you first submit a data-request form. [h/t Matthew Petroff] | http://advances.sciencemag.org/content/2/6/e1600377.full http://pmd.gfz-potsdam.de/contact/showshort.php?id=escidoc:1541893&contactform http://pmd.gfz-potsdam.de/contact/showshort.php?id=escidoc:1541893&contactform | https://mpetroff.net/2016/06/light-pollution-map/ |
225 | 2016.09.07 | 4 | Innovation nation. | Last week, a team of researchers published HistPat, a database containing county-of-residence data for 2.8 million U.S. patents granted between 1836 and 1975. The database covers approximately 83% of all patents granted to U.S. residents during that time, according to the authors. The most frequent home counties for innovation were New York County (422,234 patents); Cook County, Ill. (215,021); and Los Angeles County (90,171). Related: The National Bureau of Economic Research’s dataset of patent citations, 1975-1999. And: “Cancer moonshot” patents, 1976–2016. [h/t Drew Ivan] | http://www.nature.com/articles/sdata201674 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BPC15W http://www.nber.org/patents/ https://developer.uspto.gov/product/cancer-moonshot-patent-data | https://twitter.com/drewivan |
226 | 2016.09.07 | 5 | The federal fleet. | The U.S. General Services Administration publishes an annual dataset about vehicles owned and leased by the federal government. The spreadsheets — which contain details on total inventories, cost, usage, and fuel consumption — go back to fiscal year 2011. In FY 2015, federal vehicles drove 4.8 billion miles, down about 9% from FY 2011. [h/t John Templon] | http://www.gsa.gov/portal/content/102943 | https://twitter.com/jtemplon |
227 | 2016.09.14 | 1 | Minimum wages. | Researchers at the Washington Center for Equitable Growth have compiled a dataset of current and historical minimum wages in America. The federal and state minimum-wage data stretches back to May 1974 — when the federal minimum was $2.00 per hour, or roughly equivalent to $9.76 per hour in today’s dollars — while the data for cities and counties starts in January 2004. [h/t Ben Casselman] | http://equitablegrowth.org/ https://github.com/equitablegrowth/VZ_historicalminwage/releases http://data.bls.gov/cgi-bin/cpicalc.pl?cost1=2&year1=1974&year2=2016 | https://twitter.com/bencasselman/status/773516754354049024 |
228 | 2016.09.14 | 2 | Health habits. | The CDC calls its Behavioral Risk Factor Surveillance System “the largest continuously conducted health survey system in the world.” Every year, the survey asks more than 400,000 American adults about a range of health-related topics, from tobacco to seatbelt use, from alcohol consumption to arthritis, from HIV testing to immunizations. Annual datasets from 1984–2015 are currently available. [h/t Ricardo Pietrobon] | http://www.cdc.gov/brfss/ | https://twitter.com/rpietro |
229 | 2016.09.14 | 3 | Sea ice. | The National Snow and Ice Data Center, based at the University of Colorado, publishes the Sea Ice Index. The data files, which track ice coverage in the Arctic and Antarctic oceans, include daily and monthly measurements from November 1978 to the present. Lately, the extent of sea ice on the Arctic Ocean has been two or more standard deviations below its long-term average, according to the center, while Antarctic sea ice remained at average levels. [h/t Dan Vergano] | https://nsidc.org/about/overview https://nsidc.org/data/seaice_index/ http://nsidc.org/data/docs/noaa/g02135_seaice_index/ http://nsidc.org/arcticseaicenews/2016/09/arctic-sea-ice-nears-its-minimum-extent-for-the-year/ | https://twitter.com/dvergano |
230 | 2016.09.14 | 4 | State prison admissions, by county. | Reporters at the New York Times have assembled a dataset counting the number of inmates each U.S. county sent to state prison in 2006, 2013, and 2014. The reporters derived the numbers from the Bureau of Justice Statistics’ National Corrections Reporting Program, which only certain researchers can access. Related: “This small Indiana county sends more people to prison than San Francisco and Durham, N.C., combined. Why?” | https://github.com/TheUpshot/prison-admissions http://www.icpsr.umich.edu/icpsrweb/NACJD/series/38/studies/36373?archive=NACJD&sortBy=7 http://www.nytimes.com/2016/09/02/upshot/new-geography-of-prisons.html | |
231 | 2016.09.14 | 5 | Captionless cartoons, captioned. | A group of computer scientists and the New Yorker’s cartoon editor walk into a room… and write an academic article titled, “Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest.” The corresponding dataset — available via the “cartoons” link on this page — includes 50 cartoons and nearly 300,000 reader-submitted captions. | http://arxiv.org/abs/1506.08126 http://clair.si.umich.edu/homepage/downloads.html | |
232 | 2016.09.28 | 1 | State-level results. | Perhaps better known for its campaign-finance data, the Federal Election Commission also publishes official state-level results for presidential, House, and Senate elections going back to 1982. The results include all official candidates, and sometimes even write-ins (depending on the state). In the 2008 presidential election, eight Rhode Island voters wrote in “Stephen Colbert,” five scribbled “Joe the Plumber,” and seven chose “Jesus.” | http://www.fec.gov/pubrec/electionresults.shtml | |
233 | 2016.09.28 | 2 | County-level and precinct-level results. | OpenElections, a Knight Foundation–funded project, aims “to create the first free, comprehensive, standardized, linked set of election data for the United States.” They’ve made progress, but are looking for additional volunteers. In the meantime, you can download county-level presidential results from the National Atlas of the United States for 2004, 2008, and 2012 — or all combined. And you can download precinct-level results from 2002 to 2012 from the Harvard Election Data Archive (codebook here). | http://www.openelections.net/ http://openelections.net/about/ http://openelections.net/get-involved/ https://catalog.data.gov/dataset/2004-presidential-general-election-county-results-direct-download https://catalog.data.gov/dataset/2008-presidential-general-election-county-results-direct-download https://catalog.data.gov/dataset/presidential-general-election-results-2012-direct-download https://github.com/helloworlddata/us-presidential-election-county-results https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/21919 http://projects.iq.harvard.edu/eda/ https://dl.dropboxusercontent.com/u/156214/heda_docs.pdf | |
234 | 2016.09.28 | 3 | Ways and means. | The U.S. Election Assistance Commission’s Election Administration and Voting Survey “includes data on the ability of civilian, military and overseas citizens to register to vote and successfully cast a ballot,” as well as an overview of each state’s voting laws and procedures. [h/t Derek Willis] | http://www.eac.gov/ http://www.eac.gov/research/election_administration_and_voting_survey.aspx | https://twitter.com/derekwillis |
235 | 2016.09.28 | 4 | Global elections. | The Constituency-Level Elections Archive, based at the University of Michigan, collects and standardizes results from lower-house legislative elections around the world. (In the U.S., the lower house is the House of Representatives; in the U.K., it’s the House of Commons; in Albania, it’s the Kuvendi i Shqipërisë.) The latest release covers 1,591 elections from 136 countries. [h/t Jeremy Darrington] | http://www.electiondataarchive.org/ http://www.electiondataarchive.org/datacenter.html | http://libguides.princeton.edu/elections/foreign |
236 | 2016.09.28 | 5 | Bush v. Gore v. hanging chads. | After 2000’s contentious election, the National Opinion Research Center — funded by a consortium of news organizations — rigorously reviewed 175,010 Florida ballots that weren’t recognized as “valid” votes for president. In November 2001 the researchers concluded that, even with a full recount of disputed ballots, George W. Bush still would have won the state by 493 votes. The underlying data is available in several formats. | http://www.electionstudies.org/florida2000/sponsors.htm http://www.electionstudies.org/florida2000/index.htm http://www.nytimes.com/2001/11/12/us/examining-vote-overview-study-disputed-florida-ballots-finds-justices-did-not.html http://www.electionstudies.org/florida2000/data/data_files.htm | |
237 | 2016.10.05 | 1 | Global foreign aid. | AidData, an organization based at the College of William & Mary, has compiled a dataset of more than 1.5 million foreign aid projects between 1947 and 2013. Together, the dataset accounts for more than $7 trillion in commitments from 96 donors such as the U.S. government, UNICEF, the Nordic Development Fund, and the World Bank. AidData also publishes geospatial datasets and a data user guide. Previously: ForeignAssistance.gov, featured Jan. 13. [h/t Kedar Pavgi] | http://aiddata.org/about-aiddatas-work http://aiddata.org/country-level-research-datasets http://aiddata.org/subnational-geospatial-research-datasets http://aiddata.org/data-user-guide http://beta.foreignassistance.gov/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-01-13-edition | https://twitter.com/KedarPavgi/status/774595172034371584 |
238 | 2016.10.05 | 2 | Educational attainment. | Researchers at the Vienna-based Wittgenstein Centre for Demography and Global Human Capital have developed a dataset of historical and projected education levels for 171 countries. For five-year age groups in each country, the project estimates the percentage of people in each of several categories of educational attainment — no education, primary education, secondary education, post-secondary education, and a few gradations in between. The dataset is available to browse and download via the Wittgenstein Centre Data Explorer – look for “Educational Attainment Distribution” in the “indicators” dropdown. | http://www.wittgensteincentre.org/en/index.htm https://www.cambridge.org/core/journals/journal-of-demographic-economics/article/a-harmonized-dataset-on-global-educational-attainment-between-1970-and-2060-an-analytical-window-into-recent-trends-and-future-prospects-in-human-capital-development/D5540E2C23E4CB89AF08ECD9379B38FD http://www.oeaw.ac.at/fileadmin/subsites/Institute/VID/dataexplorer/index.html | |
239 | 2016.10.05 | 3 | Student loan default rates. | The federal government publishes default rates for federal student loans, aggregated by school, state, and school type. Last week, it published data covering students whose loans were due for repayment beginning in FY2013. The national default rate for those students as of this August was 11.3%. At certain schools, however, more than a third of students defaulted. More: Some background on the 10 colleges with the highest default rates, by my colleague Molly Hensley-Clancy. | https://studentaid.ed.gov/sa/about/data-center/student/default http://www.ed.gov/news/press-releases/national-student-loan-cohort-default-rate-declines-steadily https://www.buzzfeed.com/mollyhensleyclancy/these-colleges-have-the-worst-student-loan-default-rates https://twitter.com/mollyhc | |
240 | 2016.10.05 | 4 | R&D spending. | The UNESCO Institute for Statistics’ data on national research and development budgets contains estimates of personnel and total spending by field, funding source, and more. You can also explore the data online through a series of interactive graphics. [h/t Rebecca Galloway] | http://data.uis.unesco.org/Index.aspx?DataSetCode=SCN_DS&lang=en http://www.uis.unesco.org/_LAYOUTS/UNESCO/research-and-development-spending/ | https://twitter.com/rtgalloway |
241 | 2016.10.05 | 5 | Highway traffic. | FOIA enthusiast Max Galka received a month of highway traffic data from the U.S. Department of Transportation. The dataset “includes hourly traffic counts for each hour of each day of [November 2015] at approximately 4,000 continuous traffic counting locations nationwide.” In all, the dataset “amounts to a total of 14 million traffic count readings and a total of 6 billion vehicles counted.” | https://twitter.com/galka_max http://metrocosm.com/map-us-traffic/ | |
242 | 2016.10.12 | 1 | Presidential newspaper endorsements. | Noah Veltman has collected all presidential endorsements (and non-endorsements) of 100+ major newspapers from 1980 (Reagan vs. Carter) to 2016. You can view the data as a spreadsheet, or as a formatted table. | https://twitter.com/veltman https://github.com/veltman/endorsements/ http://noahveltman.com/endorsements/ | |
243 | 2016.10.12 | 2 | Forest cover. | The World Bank keeps statistics on total forest coverage per country and worldwide. (Between 1990 and 2015, that worldwide total declined from 41.3 million to 40.0 million square kilometers.) More than 98% of all land area in Suriname was forest in 2015, according to a related dataset — the highest proportion of any country. [h/t Tariq Khokhar + Max Galka] | http://data.worldbank.org/indicator/AG.LND.FRST.K2 http://data.worldbank.org/indicator/AG.LND.FRST.ZS?year_high_desc=true | about:blank https://twitter.com/galka_max/status/781943480708894720 |
244 | 2016.10.12 | 3 | Performances and exhibitions. | The New York Philharmonic’s performance history dataset contains “all known concerts” — more than 20,000 of ‘em — played by the Philharmonic and the groups with which it has merged (e.g., the New York Symphony). Last month, the Museum of Modern Art published a dataset containing “all of the known exhibitions held at the museum from 1929 through 1989” — 1,788 in total. The first featured Cézanne, Gauguin, Seurat, and van Gogh. [h/t Stacy-Marie Ishmael + Miriam Posner + Chad Weinard] | https://github.com/nyphilarchive/PerformanceHistory/ https://github.com/MuseumofModernArt/exhibitions http://www.moma.org/calendar/exhibitions/1767 | https://twitter.com/s_m_i/status/783427305142046720 https://twitter.com/miriamkp/status/783424561773633536 https://twitter.com/caw_/status/776529362439012354 |
245 | 2016.10.12 | 4 | Computer memory prices. | John C. McCallum has collected the advertised prices of computer memory over time. In 1957, one byte of memory cost $392, or the equivalent of $411 million per megabyte; today, one megabyte costs about a third of a cent. [h/t Jorge Luis] | http://www.jcmit.com/index.htm http://www.jcmit.com/memoryprice.htm | https://twitter.com/jorgeluis500 |
246 | 2016.10.12 | 5 | “Murray, I think you dropped New Zealand's mineral exports for '08-'09.” | You don’t have to like Flight of the Conchords to enjoy New Zealand’s national statistics website, though it couldn’t hurt. The country publishes data on a broad range of topics, including abortion, work stoppages, the Māori census, and, of course, exports. In '08 and '09, the country exported NZD $3.5 billion and NZD $2.4 billion, respectively, of “mineral fuels, mineral oils and products of their distillation; bituminous substances; mineral waxes.” [h/t Drew Ivan] | https://www.youtube.com/watch?v=buoztHLk9JQ http://www.stats.govt.nz/ http://www.stats.govt.nz/browse_for_stats/health/abortion.aspx http://www.stats.govt.nz/browse_for_stats/income-and-work/Strikes.aspx http://www.stats.govt.nz/browse_for_stats/people_and_communities/maori.aspx http://nzdotstat.stats.govt.nz/wbos/Index.aspx?DataSetCode=TABLECODE7311 | https://twitter.com/drewivan |
247 | 2016.10.19 | 1 | American manufacturing. | The Census Bureau’s Annual Survey of Manufactures provides state-by-state and industry-by-industry statistics for America’s manufacturing sector. Metrics include the number of employees, annual payroll, “value added,” beginning-of-year inventory, and many more. In 2014, dog and cat food manufacturers employed about 18,000 people nationwide. Related: “Why Are Politicians So Obsessed With Manufacturing?” [h/t Scott Stern + RJ Andrews] | http://www.census.gov/programs-surveys/asm.html http://www.nytimes.com/2016/10/09/magazine/why-are-politicians-so-obsessed-with-manufacturing.html | https://twitter.com/sstern_mit/status/786974010911428608 https://twitter.com/infowetrust/status/786971887653924864 |
248 | 2016.10.19 | 2 | County-level health care. | Each year, the Department of Health and Human Services updates its Area Health Resources Files, a vast suite of local health care data collated from more than 50 sources. Among the topics covered: the number of health care professionals by specialty, various rates of hospital usage, air quality, and demographic profiles. You can download the data, or explore and map it online. [h/t Ricardo Pietrobon] | http://ahrf.hrsa.gov/index.htm http://ahrf.hrsa.gov/download.htm http://ahrf.hrsa.gov/arfdashboard/HRCT.aspx http://ahrf.hrsa.gov/arfdashboard/ArfGeo.aspx | https://twitter.com/rpietro |
249 | 2016.10.19 | 3 | Global financial history. | The Jordà-Schularick-Taylor Macrohistory Database claims to be “the most extensive long-run macro-financial dataset to date.” It contains dozens of variables — GDP per capita, long-term interest rates, and the timing of systemic financial crises, for example — for 17 “advanced economies”. The dataset uses a Creative Commons license and has been extensively documented. | http://www.macrohistory.net/data/ | |
250 | 2016.10.19 | 4 | 8,675 farmers markets. | The Department of Agriculture publishes a spreadsheet of farmers markets in the United States. For each market, the dataset notes its location, hours, and the types of goods available (e.g., vegetables, seafood, flowers, et cetera). [h/t Susie Lu] | https://catalog.data.gov/dataset/farmers-markets-geographic-data | http://www.susielu.com/data-viz/farmers-markets |
251 | 2016.10.19 | 5 | Readers like you. | Today’s newsletter marks the 50th edition of Data Is Plural, as well as its one-year anniversary. To celebrate, I’ve started publishing a spreadsheet that details each edition’s basic stats — total subscribers, the “open rate,” the number of people who chose to unsubscribe, and more. | https://github.com/data-is-plural/newsletter-stats#data-is-plural-newsletter-stats | |
252 | 2016.10.26 | 1 | Congressional Research Service reports, in bulk. | The website EveryCRSReport.com provides unprecedented public access to reports from the Congressional Research Service — essentially the national legislature’s think-tank. The website, which was launched last week by Demand Progress and the Congressional Data Coalition, also lets you download metadata and text for each report. [h/t Daniel Schuman] | https://www.everycrsreport.com/ https://medium.com/@danielschuman/why-i-came-to-believe-crs-reports-should-be-publicly-available-and-built-a-website-to-make-it-77b4b0f6233e https://www.everycrsreport.com/about.html https://www.everycrsreport.com/download.html | https://twitter.com/danielschuman |
253 | 2016.10.26 | 2 | School testing. | The Department of Education’s EDFacts data tracks public grade schools’ participation and proficiency rates on standardized math and reading/language exams. The files provide data on all students who took the tests, broken down by race/ethnicity, sex, disability status, homelessness, and more. A related set of data files, available on the same page, tracks high-school graduation rates. | http://www2.ed.gov/about/inits/ed/edfacts/data-files/index.html | |
254 | 2016.10.26 | 3 | Cities and culture. | The World Cities Culture Forum, a convening of 32 major cities on six continents, has assembled a series of mini-datasets on 70+ “cultural indicators”. Those indicators range from the number of art galleries in each city (Paris had 1,151 in 2012) to the number of international tourists each city sees per year (Istanbul had 11.8 million in 2014) to the value of cinema ticket sales (Shanghai sold $563 million in 2014). Note: The data points draw on various sources — at least one just says “Google” — and aren’t necessarily directly comparable. [h/t Camilo Moreno] | http://www.worldcitiescultureforum.com/ http://www.worldcitiescultureforum.com/data http://www.worldcitiescultureforum.com/data/art-galleries http://www.worldcitiescultureforum.com/data/number-of-international-tourists-per-year http://www.worldcitiescultureforum.com/data/total-value-of-cinema-ticket-sales-per-year-ppp http://www.worldcitiescultureforum.com/data/number-of-comedy-clubs | https://twitter.com/cmorenok |
255 | 2016.10.26 | 4 | Airborne. | OpenFlights.org has collected data on more than 60,000 flight routes, including 915 itineraries departing Atlanta’s Hartsfield–Jackson International Airport. (That airport was recently named the world’s busiest, for the 18th year in a row.) For each route, the dataset indicates the airline, the departing airport, the arriving airport, the number of stops, and what type of plane is typically used. The website also provides datasets on thousands of airports and airlines. Important caveat: “This data is not suitable for navigation.” | http://openflights.org/data.html http://www.usatoday.com/story/travel/flights/todayinthesky/2016/09/12/worlds-busiest-airport-atlanta-takes-title-again/90271028/ | |
256 | 2016.10.26 | 5 | TSA confiscations. | Between October 2014 and September 2015, the U.S. Transportation Security Administration confiscated 22,196 “dangerous” items at airports, including 156 times at New York’s JFK. (Twice there, someone had placed fireworks in checked baggage.) That’s according to data obtained from the government by FOIA enthusiast Max Galka, who has also built an interactive map of the confiscations. | http://metrocosm.com/get-the-data/#tsa https://twitter.com/galka_max http://metrocosm.com/confiscated-items-airport-security/ | |
257 | 2016.11.02 | 1 | Medicare beneficiaries. | The U.S. government’s Medicare Health Outcomes Survey tracks the “physical and mental health and well-being” of Americans covered by Medicare. Each survey, currently available for 1998–2000 to 2012–2014, follows a sample of Medicare beneficiaries for two years, and asks them questions along the lines of, “In the past 12 months, have you had a problem with balance or walking?” The 2012–2014 data includes (at least partial) responses from 296,320 people. [h/t Ricardo Pietrobon] [Update, 2016-11-02: The original link in this item points to an ICPSR page, which provides access only to people at "member institutions." Here's a better link to the data: http://www.hosonline.org/en/data-dissemination/research-data-files/] | http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/23380?classification=ICPSR.IX.&q=&sortBy=7 | https://twitter.com/rpietro |
258 | 2016.11.02 | 2 | Where we live and build. | The European Commission’s Global Human Settlement Layer combines satellite imagery and census data to measure three things: population, building density, and urban/rural classification. The resulting datasets are fairly detailed — they provide population estimates for every 250-meter square in the world, for example — and are available for 1975, 1990, 2000, and 2015. [h/t Alasdair Rae] | http://ghslsys.jrc.ec.europa.eu/index.php http://ghslsys.jrc.ec.europa.eu/datasets.php | http://www.statsmapsnpix.com/2016/10/the-global-human-settlement-layer.html |
259 | 2016.11.02 | 3 | Complaints against NYC police. | Earlier this autumn, New York City began publishing a dataset of official citizen complaints against the city’s police, for every case closed since 2006. For each of the 200,000+ allegations, the main dataset includes various details about the incident — e.g., where it took place, and whether there’s video evidence — but no information about the officer involved. Related: Similar data from Indianapolis, which includes demographic information about the complained-against officers but not their names. Also related: “The local projects that are making police complaint data open and accessible.” Previously: Complaints against Chicago police, featured Nov. 11, 2015. [h/t Eve Ahearn] | http://www1.nyc.gov/site/ccrb/policy/data-transparency-initiative.page https://www.projectcomport.org/department/IMPD/ http://sunlightfoundation.com/blog/2016/10/25/the-local-projects-that-are-making-police-complaint-data-open-and-accessible/ https://cpdb.co/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-25-edition | https://twitter.com/eveahe |
260 | 2016.11.02 | 4 | Millions of Amazon reviews. | Julian McAuley, an assistant professor at UC San Diego, has collected a massive amount of user-generated data from Amazon.com, including 142.8 million reviews and 1.4 million answered Q&As. (As of mid-2014, Sophie la Girafe was the most-reviewed item in the baby category. Backstory here.) Much of the data can be downloaded directly, but the largest files require contacting McAuley for access. [h/t Reddit user samofny] | http://cseweb.ucsd.edu/~jmcauley/ http://jmcauley.ucsd.edu/data/amazon/ http://jmcauley.ucsd.edu/data/amazon/qa/ https://www.amazon.com/dp/B000IDSLOG http://www.slate.com/articles/arts/number_1/2011/03/im_french_chew_on_me.html | https://www.reddit.com/r/datasets/comments/59owtn/amazon_review_data_1428_million_reviews_spanning/ |
261 | 2016.11.02 | 5 | The dangerous dogs of Austin, Texas. | The city publishes a spreadsheet — last updated in May — of local dogs who’ve officially been “declared dangerous.” (“They have attacked in the past. The owner is required to provide $100,000 in financial responsibility. If they attack again the court could order them put to sleep.”) The file currently contains 63 entries, from a Labrador named Charlie to a Blue Lacy named Flint. [h/t Sharon Machlis] | https://data.austintexas.gov/Public-Safety/Declared-Dangerous-Dogs/ykw4-j3aj | https://twitter.com/sharon000 |
262 | 2016.11.16 | 1 | Hate crimes in the United States. | Since the 1990s, the FBI has collected data on hate crimes from local law enforcement agencies. On Monday, the bureau released data for 2015, reporting “5,850 criminal incidents and 6,885 related offenses, as being motivated by bias toward race, ethnicity, ancestry, religion, sexual orientation, disability, gender, and gender identity.” Those numbers are based on reports from 14,997 participating agencies. On the FBI’s website, you can view and download summary tables of the most recent data. You can also download incident-specific data for 1992 through 2014 from the National Archive of Criminal Justice Data. Unfortunately, as ProPublica noted yesterday, the FBI dataset is “deeply flawed”; more than 3,000 law enforcement agencies don’t participate in the program. [h/t John Templon] | https://www.fbi.gov/news/pressrel/press-releases/fbi-releases-2015-hate-crime-statistics https://ucr.fbi.gov/hate-crime/2015/topic-pages/jurisdiction_final https://www.icpsr.umich.edu/icpsrweb/NACJD/series/57/studies?searchIn=TITLE&archive=NACJD&q=%22Hate+Crime+Data%22&sortBy=7 https://www.propublica.org/article/hate-crimes-are-up-but-the-government-isnt-keeping-good-track-of-them | https://twitter.com/jtemplon |
263 | 2016.11.16 | 2 | Fake news on Facebook. | Last month, colleagues at BuzzFeed News and I analyzed and fact-checked 1,000+ posts from hyperpartisan Facebook pages, and found a disturbingly high rate of fake news. Here’s the data. Facebook CEO Mark Zuckerberg has dismissed the possibility that fake news influenced the election, calling it a “pretty crazy idea”. Meanwhile, renegade Facebook employees have now formed an unofficial task force to battle fake news on the platform. | https://www.buzzfeed.com/craigsilverman/partisan-fb-pages-analysis https://github.com/BuzzFeedNews/2016-10-facebook-fact-check https://www.buzzfeed.com/stephaniemlee/zuckerberg-techonomy-fake-news-election https://www.buzzfeed.com/sheerafrenkel/renegade-facebook-employees-form-task-force-to-battle-fake-n | |
264 | 2016.11.16 | 3 | Election Day on “the front page of the internet.” | Jason Baumgartner — a.k.a. Stuck_In_the_Matrix — has collected and published every submission and comment posted to Reddit from November 8th through November 10th. For each of the nearly 8 million comments, the dataset includes the message, the author, the subreddit it was posted to, the comment thread’s ID, and more. Previously: 1.7 billion Reddit comments, featured Nov. 25, 2015. | https://www.reddit.com/user/Stuck_In_the_Matrix https://www.reddit.com/r/datasets/comments/5ch2bq/reddit_raw_election_data_comments_and_submissions/ https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-25-edition | |
265 | 2016.11.16 | 4 | The most important entries on Wikipedia. | Germany-based researcher Andreas Thalhammer has applied PageRank — the algorithm at the heart of Google’s origin story — to the world of Wikipedia. The result: the DBpedia PageRank dataset, which estimates the importance of each page based on the other pages that link to it. You can download the data directly, or query it online. (According to the metric, Aristotle, Plato, and Karl Marx are history’s three most Wiki-central philosophers.) | https://twitter.com/thalhamm https://en.wikipedia.org/wiki/PageRank http://people.aifb.kit.edu/ath/#DBpedia_PageRank http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&qtxt=PREFIX+rdf%3A%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+vrank%3A%3Chttp%3A%2F%2Fpurl.org%2Fvoc%2Fvrank%23%3E%0D%0APREFIX+dbo%3A%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0A%0D%0ASELECT+%3Fs+%3Fv+%0D%0AFROM+%3Chttp%3A%2F%2Fdbpedia.org%3E+%0D%0AFROM+%3Chttp%3A%2F%2Fpeople.aifb.kit.edu%2Fath%2F%23DBpedia_PageRank%3E+%0D%0AWHERE+{%0D%0A%3Fs+rdf%3Atype+dbo%3AUniversity.%0D%0A%3Fs+vrank%3AhasRank%2Fvrank%3ArankValue+%3Fv.%0D%0A}%0D%0AORDER+BY+DESC%28%3Fv%29+LIMIT+50%0D%0A&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=PREFIX+rdf%3A%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+vrank%3A%3Chttp%3A%2F%2Fpurl.org%2Fvoc%2Fvrank%23%3E%0D%0APREFIX+dbo%3A%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0A%0D%0ASELECT+%3Fs+%3Fv+%0D%0AFROM+%3Chttp%3A%2F%2Fdbpedia.org%3E+%0D%0AFROM+%3Chttp%3A%2F%2Fpeople.aifb.kit.edu%2Fath%2F%23DBpedia_PageRank%3E+%0D%0AWHERE+%7B%0D%0A%3Fs+rdf%3Atype+dbo%3APhilosopher.%0D%0A%3Fs+vrank%3AhasRank%2Fvrank%3ArankValue+%3Fv.%0D%0A%7D%0D%0AORDER+BY+DESC%28%3Fv%29+LIMIT+50%0D%0A&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on | |
266 | 2016.11.16 | 5 | Every street tree in NYC. | Earlier this month, New York City published the results of its decennial tree count. You can explore a map of every street tree in NYC — nearly 700,000 of ‘em — or download the corresponding dataset, which contains info on each tree’s species, circumference, health status, and other observations. (Note: That dataset appears to contain about one-third fewer trees than the map’s count, for reasons I can’t quite figure out.) Results of the 1995 and 2005 tree censuses are also available. | https://www.nycgovparks.org/trees/treescount https://tree-map.nycgovparks.org/ https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Blockface-Data/ju3b-rwpy https://data.cityofnewyork.us/browse?q=Street%20Tree%20Census | |
267 | 2016.11.30 | 1 | What kills us. | The CDC’s Underlying Cause of Death database provides county-level mortality statistics based on death certificates of U.S. residents for each year from 1999 to 2014. The tool lets you group the data by geography, demographics, place of death (e.g., inpatient hospital, hospice, home, etc.), and other variables. In 2014, for example, about 40,000 residents died of pancreatic cancer — with the highest rates coming in America’s most-rural counties (~15.6 deaths per 100,000 residents) and the lowest rates in the country’s most-urban counties (~11.3 per 100,000). The CDC’s “compressed mortality” datasets contain slightly less detail, but go all the way back to 1968. [h/t Drew Ivan] | https://wonder.cdc.gov/ucd-icd10.html https://wonder.cdc.gov/wonder/help/ucd.html https://wonder.cdc.gov/mortSQL.html | https://twitter.com/drewivan |
268 | 2016.11.30 | 2 | Gunshot detections. | Earlier this month, Forbes published an examination of ShotSpotter, a company that uses networks of outdoor microphones to detect and locate gunshot-like sounds. Forbes found that ShotSpotter has produced “few tangible results.” “In some cities, ShotSpotter hasn’t had the effect city officials and residents had hoped for. While officers are responding to more illegal gunfire, they rarely catch the shooter.” To support its findings, Forbes has published the ShotSpotter data it received from police departments in seven cities: Brockton, Mass.; East Palo Alto, Calif.; Kansas City, Mo.; Milwaukee, Wis.; Omaha, Neb.; San Francisco, Calif.; and Wilmington, N.C. The data varies somewhat for each city, but typically includes the date, time, location, and outcome of each gunshot alert. [h/t Matt Drange] | http://www.forbes.com/sites/mattdrange/2016/11/17/shotspotter-struggles-to-prove-impact-as-silicon-valley-answer-to-gun-violence/ http://www.forbes.com/sites/mattdrange/2016/11/17/shotspotter-alerts-police-to-lots-of-gunfire-but-produces-few-tangible-results | https://twitter.com/mattdrange |
269 | 2016.11.30 | 3 | Comparing election forecasts. | This year, I decided to grade a bunch of prominent election forecasts for BuzzFeed News. Now that Michigan has finally been called, I’ve published the results. I’ve also published the underlying data and code on GitHub, including state-level predictions from all nine forecasters in the analysis. | https://www.buzzfeed.com/jsvine/grading-the-2016-election-forecasts https://www.buzzfeed.com/jsvine/2016-election-forecast-grades https://github.com/BuzzFeedNews/2016-11-grading-the-election-forecasts | |
270 | 2016.11.30 | 4 | Five years of Facebook posts from 15 news sites. | Data analyst Patrick Martinchek has published a dataset of all Facebook posts from “15 of the top mainstream media sources” — a group that includes The New York Times, The Wall Street Journal, NPR, Fox News, and other familiar sources — from January 2012 through Nov. 8, 2016. Related: “What I Discovered About Trump and Clinton From Analyzing 4 Million Facebook Posts.” | https://www.facebook.com/Patrick.Martinchek https://data.world/martinchek/2012-2016-facebook-posts https://shift.newco.co/what-i-discovered-about-trump-and-clinton-from-analyzing-4-million-facebook-posts-922a4381fd2f | |
271 | 2016.11.30 | 5 | I’ll take “Datasets” for $200. | A few years ago, Reddit user trexmatt uploaded 216,930 Jeopardy! trivia-tidbits, scraped from j-archive.com, “the nearly comprehensive online Jeopardy! archive maintained by obsessive fans.” Each entry lists the question, answer, category, value, round, show number, and show air-date. | https://www.reddit.com/user/trexmatt/ https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/ http://www.j-archive.com/ http://www.slate.com/articles/arts/culturebox/2011/02/this_fanmaintained_episode_database_helps_contestants_prepare_for_jeopardy.html | |
272 | 2016.12.07 | 1 | Pipelines. | The U.S. Energy Information Administration publishes a bunch of geographic data, including shapefiles mapping the country’s crude oil, petroleum product, hydrocarbon gas liquid, and natural gas pipelines. (They were last updated five months ago.) Additionally, the Pipeline and Hazardous Materials Safety Administration keeps track of “significant incidents” — for example, those that caused a serious injury or $50,000 in damage. Related: “Six maps that show the anatomy of America’s vast infrastructure.” Also related: ProPublica’s Pipeline Safety Tracker, covering 1986–2012. | https://www.eia.gov/maps/layer_info-m.php http://www.phmsa.dot.gov/pipeline/library/data-stats/flagged-data-files https://www.washingtonpost.com/graphics/national/maps-of-american-infrastrucure/ https://projects.propublica.org/pipelines/ | |
273 | 2016.12.07 | 2 | Solar panels. | The Open PV Project is a “community driven, comprehensive database” of solar panel installations in the U.S., ranging from home installations to utility-scale projects. The database, run by the Department of Energy, contains more than 1 million installations — with a total capacity of 16,000+ megawatts — and tracks their locations, sizes, costs, installers, and other variables. [h/t Dad] | https://openpv.nrel.gov/index | https://www.linkedin.com/in/ed-vine-a480347 |
274 | 2016.12.07 | 3 | Chicago cab rides. | Last month, Chicago’s city government published data on more than 100 million local taxi rides taken in the city since 2013. (The city gathers the data through “periodic reporting by two major payment processors believed to cover most taxis in Chicago.”) The dataset contains each ride’s start/end times, pickup/dropoff location (based on Chicago’s “community areas”), distance, cost, payment type, and taxi company. Related: “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance,” which contains pointers to similar data for New York City. [h/t Dan Nguyen] | http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/ https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/ | https://twitter.com/dancow/status/803471830707093504 |
275 | 2016.12.07 | 4 | STEM surveys. | The IPUMS Higher Ed portal provides data from three “leading surveys for studying the science and engineering (STEM) workforce in the United States.” The surveys currently cover 1993 through 2013 and include questions about educational choices, demographics, employment outcomes, and more. Requires a free account. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] | https://highered.ipums.org/highered/ | |
276 | 2016.12.07 | 5 | Classical music, annotated. | “MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition.” [h/t Lon Riesberg] | http://homes.cs.washington.edu/~thickstn/musicnet.html | http://dataelixir.com/issues/108#resources |
277 | 2016.12.14 | 1 | Medicare drug costs. | The federal government has released data on Medicare’s prescription drug spending from 2011 to 2015. Previously, Medicare had only published data on the most expensive drugs; the new release includes data on all drugs used by at least 11 Medicare patients in a given year. Caveat: Medicare “is prohibited from publicly disclosing drug-specific information on manufacturer rebates,” so the “spending metrics do not reflect any manufacturers’ rebates or other price concessions.” [h/t Charles Ornstein] | https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/2015MedicareData.html | https://twitter.com/charlesornstein/status/806587500307353601 |
278 | 2016.12.14 | 2 | Breached accounts. | Troy Hunt runs HaveIBeenPwned.com, a service that lets you see whether your email address has been included in any major data breaches. Last week, Hunt published an anonymized dataset based on the breaches he’s collected. (That post provides a torrent file for the dataset; you can also download the data here.) Unlike the HaveIBeenPwned website, the dataset doesn’t include information about specific accounts; instead it counts the number of email addresses that have been compromised on particular combinations of websites. For example, 14.6 million email addresses appeared in both the LinkedIn and Dropbox breaches. (You can read more about each breach here.) | https://haveibeenpwned.com/ https://www.troyhunt.com/heres-1-4-billion-records-from-have-i-been-pwned-for-you-to-analyse/ https://github.com/data-is-plural/haveibeenpwned-account-combinations https://haveibeenpwned.com/PwnedWebsites | |
279 | 2016.12.14 | 3 | Water world. | The European Commission and Google engineers have mapped surface water – including lakes, rivers, reservoirs, oceans, and more – on every 30-meter-by-30-meter square on Earth between 1984 and 2015. During that time, “permanent surface water has disappeared from an area of almost 90,000 square kilometres, roughly equivalent to that of Lake Superior, though new permanent bodies of surface water covering 184,000 square kilometres have formed elsewhere.” The data, based on the U.S. government’s Landsat satellite images, are available to download and explore online. Related: “Mapping Three Decades of Global Water Change,” published by The New York Times, based on this dataset. | http://www.nature.com/nature/journal/vaop/ncurrent/full/nature20584.html https://landsat.usgs.gov/ https://global-surface-water.appspot.com/download https://global-surface-water.appspot.com/ http://www.nytimes.com/interactive/2016/12/09/science/mapping-three-decades-of-global-water-change.html | |
280 | 2016.12.14 | 4 | Crowdfunding. | A Lithuania-based web-scraping company has been collecting data on Kickstarter projects and Indiegogo campaigns every month. The datasets include (among other things) each project’s number of backers, amount pledged, and category. You can also explore the data online. [h/t Vincent Granville] | https://webrobots.io/about-us/ https://webrobots.io/kickstarter-datasets/ https://webrobots.io/indiegogo-dataset/ http://crowdfunding.webrobots.io/ | http://www.datasciencecentral.com/forum/topics/more-free-data-sets |
281 | 2016.12.14 | 5 | Energy use at 10 Downing Street. | UK-based CarbonCulture helps organizations measure and publish their buildings’ energy and water use in near-realtime. Among the first users: 10 Downing Street, the Tate Modern, and University College London. For each building, you can download yearly datasets, which are broken down into 30-minute intervals. [h/t Max Roser] | https://platform.carbonculture.net/places/10-downing-street/9/ https://platform.carbonculture.net/about/ https://platform.carbonculture.net/places/10-downing-street/9/ https://platform.carbonculture.net/places/tate-modern/8962/ https://platform.carbonculture.net/communities/ucl/30/ | https://twitter.com/MaxCRoser/status/802553471618732033 |
282 | 2016.12.21 | 1 | The Affordable Care Act, quantified. | Last week, the U.S. Department of Health and Human Services released a dataset of state-level Obamacare metrics. The dataset is divided into five main categories: coverage gains, employer coverage, individual market coverage, Medicaid, and Medicare. Between 2010 and 2015, the proportion of Nevadans without health insurance dropped from 22.6% to 12.3% — the largest percentage-point decrease of any state. (In 2015, an estimated 17.1% of Texans still didn’t have health insurance, the highest rate of any state that year.) The metrics come from various sources, including the Census, academic studies, and the department’s own estimates. [h/t Nadja Popovich] | https://aspe.hhs.gov/compilation-state-data-affordable-care-act | https://twitter.com/PopovichN |
283 | 2016.12.21 | 2 | Petroleum rig counts. | Since the 1940s, oilfield services corporation Baker Hughes and its predecessor companies have been publishing “rig counts” — the number of rigs actively drilling for oil and/or gas in various parts of the world. These days, the company updates its North America numbers every week and its international counts every month. As of December 16, they counted 637 rigs in — and offshore of — the United States, nearly half of them in Texas. [h/t Jordan Wirfs-Brock] | http://phx.corporate-ir.net/phoenix.zhtml?c=79687&p=irol-rigcountsoverview http://phx.corporate-ir.net/phoenix.zhtml?c=79687&p=irol-reportsother http://phx.corporate-ir.net/phoenix.zhtml?c=79687&p=irol-rigcountsintl | https://github.com/InsideEnergy/24-energy-stories-CAR16 |
284 | 2016.12.21 | 3 | The birds and the bees (and more). | The U.S. Geological Survey’s BISON service brings together “species occurrence” data from hundreds of sources. The service, whose name stands for “Biodiversity Information Serving our Nation,” currently contains 262 million records, each of which refers to the observation of “an organism at a particular time in a particular place.” Most of the observations are based on direct sightings; others use fossils, written records, or other sources. The data aren’t available for bulk download, but can be accessed via BISON’s free API. [h/t Clare Malone] | https://bison.usgs.gov/ | http://fivethirtyeight.com/features/how-trumps-white-house-could-mess-with-government-data/ |
285 | 2016.12.21 | 4 | Two planes too close. | The FAA’s Near Midair Collision System keeps track of incidents where two planes flew uncomfortably close to each other. The system, which is based on reports from pilots and flight crew members, contains more than 7,500 incidents dating back to 1987. The FAA received 305 of these reports for the first 10 months of 2016, including 35 classified as “critical.” | http://www.asias.faa.gov/pls/apex/f?p=100:33:0::NO::: http://www.asias.faa.gov/pls/apex/f?p=100:35:0::NO::P35_REGION_VAR:1 | |
286 | 2016.12.21 | 5 | The geography of language on Twitter. | Last week, Quartz published an addictive tool that lets you map word usage on Twitter, by U.S. county. It’s based on an academic analysis of 890 million geocoded tweets uttered between October 2013 and November 2014. Data and details available here. | http://qz.com/862325/the-great-american-word-mapper/ https://sites.google.com/site/wordmapperinfo/ | |
287 | 2017.01.11 | 1 | What Facebook knows about us. | In September, ProPublica published a Chrome extension that showed readers what Facebook said it knew about them — and then asked readers to share that data. In the following months, readers unearthed more than 52,000 of the “unique interest categories” that Facebook uses for advertising, such as “yoga,” “beer,” and “Scent of a Woman (1992 film).” But ProPublica’s reporters also found that Facebook doesn’t tell users about the “far more sensitive” data it buys about their offline lives, which can include “their income, the types of restaurants they frequent and even how many credit cards are in their wallets.” To support these findings, ProPublica published two key datasets: the crowdsourced “interest categories” and the list of categories that Facebook allows advertisers to target. | https://www.propublica.org/article/breaking-the-black-box-what-facebook-knows-about-you https://www.propublica.org/article/facebook-doesnt-tell-users-everything-it-really-knows-about-them https://www.propublica.org/datastore/dataset/facebook-ad-categories | |
288 | 2017.01.11 | 2 | Getting warmer. | Scientists expect that, when the final numbers come in, 2016 will have been Earth’s hottest year on record. The National Oceanic and Atmospheric Administration publishes monthly data on “temperature anomalies” — how much hotter or cooler a month was than the 20th century average. (November 2016, the most recent month available, was 0.73° Celsius warmer than the average November.) You can grab the data for the entire globe, by hemisphere, or by continent; for the land and ocean combined, or separately; and going all the way back to 1880. Related: My colleague Peter Aldhous demonstrates how he charted this data using R. Also: NOAA released its 2016 U.S. “State of the Climate” report on Monday. | https://www.buzzfeed.com/peteraldhous/another-hottest-year https://www.ncdc.noaa.gov/cag/time-series/global https://buzzfeednews.github.io/2016-12-warmest-year/ https://www.ncdc.noaa.gov/sotc/national/201613 | |
289 | 2017.01.11 | 3 | Four wars’ bombing missions. | Years ago, Lt. Col. Jenns Robertson began entering information into “a simple Excel spreadsheet that eventually matured into the largest compilation of releasable U.S. air operations data in existence.” Last month, the Department of Defense published a “beta” version of this data, known as Theater History of Operations Reports (THOR). Currently, THOR’s data covers bombing operations from World War I, World War II, the Korean War, and the Vietnam War. For each bombing, the reports include data about the aircraft, munitions, targets, results, and more. | https://www.data.mil/s/v2/data-stories-an-overview-of-thor/a100cd16-c2a7-453b-8ea6-45947c1bbc51/ | |
290 | 2017.01.11 | 4 | So many satellites. | CelesTrak’s T.S. Kelso has been obsessively transcribing NORAD’s “resident space object” data for decades. Among his offerings: the SATCAT satellite catalog, which provides data on all known satellites launched since 1957 — more than 41,900 of ’em. Kelso also provides a SATCAT Boxscore, which is like a baseball box score ... but for satellites. The U.S., it turns out, is responsible for almost exactly one-third of the 1,590 satellites classified as “active.” Previously: The Union of Concerned Scientists’ satellite database, featured Dec. 30, 2015. [h/t Noah Veltman] | https://celestrak.com/ https://celestrak.com/webmaster.asp https://celestrak.com/NORAD/documentation/ https://celestrak.com/satcat/search.asp https://celestrak.com/satcat/boxscore.asp http://www.ucsusa.org/nuclear-weapons/space-weapons/satellite-database https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-30-edition | http://noahveltman.com/ |
291 | 2017.01.11 | 5 | Where Waldo is. | In 2015, computer scientist Randy Olson tried computing “the optimal search strategy for finding Waldo” in the seven original Where’s Waldo? books. In doing so, he transcribed a 2013 Slate chart of Waldo’s locations (itself transcribed from those seven original books). The resulting dataset contains 68 rows — one for each Waldo — and four columns: book, page, x coordinate, and y coordinate. | http://www.randalolson.com/2015/02/03/heres-waldo-computing-the-optimal-search-strategy-for-finding-waldo/ http://www.slate.com/articles/arts/culturebox/2013/11/where_s_waldo_a_new_strategy_for_locating_the_missing_man_in_martin_hanford.html http://www.randalolson.com/wp-content/uploads/wheres-waldo-locations.csv | |
292 | 2017.01.18 | 1 | TrumpWorld. | At BuzzFeed News, a few colleagues and I spent the past two months compiling a big database of organizations and people connected to President-elect Trump, his family, advisers, and Cabinet picks. On Sunday, we published what we’ve found so far — connections between more than 1,500 organizations and people altogether. Still, there are certainly things we’ve missed. So you can download and search the data, but you can also help us expand it. See something we’ve overlooked? Let us know! | https://www.buzzfeed.com/johntemplon/help-us-map-trumpworld | |
293 | 2017.01.18 | 2 | Food stamp foods. | Late last year, the USDA published a study that used “point-of-sale transaction data from a leading grocery retailer to examine the food choices” of households receiving Supplemental Nutrition Assistance Program (SNAP) benefits. In an appendix, the report ranks the total spending on major commodities by SNAP households and non-SNAP households. Soft drinks, “fluid milk products,” and ground beef were the top three commodities purchased by SNAP households. Milk, soft drinks, and cheese were the top three for non-SNAP households. That information is presented as a PDF table, but I’ve converted it to a spreadsheet-friendly text file for you. [h/t Reddit user "junglejuicy"] | https://www.fns.usda.gov/snap/foods-typically-purchased-supplemental-nutrition-assistance-program-snap-households https://github.com/data-is-plural/usda-snap-spending-study https://www.reddit.com/r/datasets/comments/5o249x/foods_typically_purchased_by_supplemental/ | |
294 | 2017.01.18 | 3 | Online and offline prices. | Between December 2014 and March 2016, Alberto Cavallo — co-founder of MIT’s Billion Prices Project — sent 323 crowdsourced workers to collect product prices from 56 large retailers in 10 countries. Then, he found the prices for the same products on the retailers’ websites. The results, which contain tens of thousands of observations, are available as several Excel spreadsheets. (Caveat: The dataset’s “Terms of Use” rules stipulate that the information is “EXCLUSIVELY FOR USE IN ACADEMIC RESEARCH AND PUBLICATIONS”.) Related: Cavallo summarized his findings in a paper published recently by the American Economic Review. | http://www.mit.edu/~afc/ http://bpp.mit.edu/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FXXOUHF https://www.aeaweb.org/articles?id=10.1257/aer.20160542 | |
295 | 2017.01.18 | 4 | German rail. | State-owned Deutsche Bahn AG is Europe’s largest railway company by revenue, serving 12 million train and bus passengers each day. It also happens to publish a bunch of open data, including datasets on its routes, stations, platforms, and cargo facilities. [h/t Martin Bergmann] | http://www.railway-technology.com/features/featureengines-of-trade-the-ten-biggest-rail-companies-by-revenue-4943955/ http://data.deutschebahn.com/dataset?groups=datasets http://data.deutschebahn.com/dataset/data-streckennetz http://data.deutschebahn.com/dataset/data-stationsdaten http://data.deutschebahn.com/dataset/data-bahnsteig-regio http://data.deutschebahn.com/dataset/betriebsstellen-gueterverkehr | https://www.linkedin.com/in/bergma |
296 | 2017.01.18 | 5 | Dot-gov domains. | The General Services Administration recently updated its list of known .gov domains. It currently includes more than 1,300 federal domains — from aapi.gov to youthrules.gov — and more than 4,300 domains registered by state, local, and native sovereign agencies. | https://github.com/GSA/data/tree/gh-pages/dotgov-domains http://aapi.gov http://www.youthrules.gov/ | |
297 | 2017.01.25 | 1 | Colleges and economic mobility. | A team of economists studying “the equality of opportunity” has published new research identifying which colleges “help the most children climb the income ladder.” For their analysis, the researchers combined federal tax records and data from the Department of Education. California State University–Los Angeles was one of the greatest engines of mobility; nearly 1 in 10 students enrolled there began in the bottom 20% of income but reached the top 20% by their early thirties. You can download the findings, which include similar statistics for more than 2,000 schools, as a series of spreadsheets. Related: “Some Colleges Have More Students From the Top 1 Percent Than the Bottom 60. Find Yours,” from the New York Times. | http://www.equality-of-opportunity.org/team/ http://www.equality-of-opportunity.org/college/ http://www.equality-of-opportunity.org/data/ https://www.nytimes.com/interactive/2017/01/18/upshot/some-colleges-have-more-students-from-the-top-1-percent-than-the-bottom-60.html | |
298 | 2017.01.25 | 2 | Three centuries of UK macroeconomic data. | The Bank of England publishes a spreadsheet of historical economic data going back, in some cases, to the late 1600s. The country’s GDP in 1700 was £11.7 billion in 2013 prices. That’s about 1/157th the size of the UK’s GDP in 2015. And in November 1694, monthly short-term interest rates were roughly 6%. [h/t Ian Greenleigh] | http://www.bankofengland.co.uk/research/Pages/onebank/threecenturies.aspx | https://data.world/ian/3-centuries-of-uk-economy-data |
299 | 2017.01.25 | 3 | TV talk. | The GDELT Project and the Internet Archive have collaborated to make the latter's Television News Archive more powerfully searchable. Their new tool, announced in December, lets you search across “more than 5.7 billion words from over 150 distinct stations spanning July 2009 to present” at a sentence-by-sentence level. The results are downloadable as CSV or JSON files. Previously: The Political TV Ad Archive (Feb. 10, 2016). | http://gdeltproject.org/ https://archive.org/details/tv http://television.gdeltproject.org/cgi-bin/iatv_ftxtsearch/iatv_ftxtsearch https://blog.archive.org/2016/12/20/new-research-tool-for-visualizing-two-million-hours-of-television-news/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-02-10-edition | |
300 | 2017.01.25 | 4 | European trees. | EU-Forest is a new dataset that, according to its authors, “extends by almost one order of magnitude the publicly available information on European tree species distribution.” The new project merges and harmonizes data from 21 national forest surveys and two related databases. In all, EU-Forest includes more than 580,000 observations of more than 200 species in 1km-by-1km square plots of land, and is available in both tabular and geospatial file formats. Previously: American tree maps (Dec. 23, 2015) and NYC street trees (Nov. 16, 2016). | https://figshare.com/collections/A_high-resolution_pan-European_tree_occurrence_dataset/3288407 http://www.nature.com/articles/sdata2016123 https://figshare.com/articles/Tree_occurrences_at_species_level/3497885 https://figshare.com/articles/Occurrences_location_shapefile/3497891 https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-23-edition http://tinyletter.com/data-is-plural/letters/data-is-plural-2016-11-16-edition | |
301 | 2017.01.25 | 5 | Standard mugshots. | The National Institute of Standards and Technology publishes Special Database 18 “for use in development and testing of automated mugshot identification systems.” The dataset contains 3,248 mugshot photos portraying 1,573 different people (mostly men), and includes each arrestee’s age and gender. [h/t Noah Veltman] | https://www.nist.gov/srd/nist-special-database-18 | http://noahveltman.com/ |
302 | 2017.02.08 | 1 | Metro/subway ridership. | Two weeks ago, Bloomberg News reporters requested entrance and exit data from Washington, DC’s Metrorail system for three days: Jan. 20, 2009 (Obama's first inauguration), Jan. 20, 2017 (Trump's inauguration), and Jan. 21, 2017 (the Women's March). A week later, they received the data — but as PDFs, which they turned into structured data and published this week. Related: NYC’s MTA publishes detailed turnstile-by-turnstile data, and Chicago publishes daily “L” ridership data for each station going back to 2001. Plus: “Second Avenue Subway Relieves Crowding on Neighboring Lines,” which uses the NYC data. | https://github.com/bizweekgraphics/wmata-ridership-data http://web.mta.info/developers/turnstile.html https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f http://www.nytimes.com/2017/02/01/nyregion/second-avenue-subway-relieves-crowding-on-neighboring-lines.html | |
303 | 2017.02.08 | 2 | International house prices since 1975. | The International House Price Database combines and standardizes house price indices from 23 countries — mostly in Europe and North America, but also including South Africa, Australia, New Zealand, Japan, South Korea, and Israel. The dataset, published by the Federal Reserve Bank of Dallas, is deeply documented and updated quarterly. Previously: Historical San Francisco rents (May 25, 2016) and the U.S. Census Bureau’s Annual Characteristics of New Housing (June 22, 2016). | https://www.dallasfed.org/institute/houseprice/ https://www.dallasfed.org/institute/houseprice#tab3 https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-05-25-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-22-edition | |
304 | 2017.02.08 | 3 | Nobel Prizes. | The prestigious Scandinavian awards have an API. The official documentation explains it succinctly: “The data is free to use and contains information about who has been awarded the Nobel Prize, when, in what prize category and the motivation, as well as basic information about the Nobel Laureates such as birth data and the affiliation at the time of the award. The data is regularly updated as the information on Nobelprize.org is updated, including at the time of announcements of new Laureates.” Related: “These Nobel Prize Winners Show Why Immigration Is So Important For American Science,” by my colleague Peter Aldhous. Plus: The R code supporting Peter's analysis. | https://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer/ https://nobelprize.readme.io/v1.0 https://www.buzzfeed.com/peteraldhous/immigration-and-science http://www.peteraldhous.com/ https://buzzfeednews.github.io/2017-01-immigration-and-science/ | |
305 | 2017.02.08 | 4 | Recipe ingredients. | For their 2011 paper, “Flavor network and the principles of food pairing,” four scientists analyzed 56,498 recipes downloaded from three websites — allrecipes.com, epicurious.com, and menupan.com. To support their findings, the authors published two datasets. One names the cuisine and ingredients for each recipe. The other dataset counts how often any two ingredients appeared in the same recipe. (Parmesan cheese and beef appeared together 93 times; starfruit and Algerian geranium oil just once.) Related: “food2vec – Augmented cooking with machine intelligence,” published last month. [h/t Rob Barry] | http://www.nature.com/articles/srep00196 http://www.nature.com/articles/srep00196#supplementary-information https://jaan.io/food2vec-augmented-cooking-machine-intelligence/ | http://rob-barry.com/ |
306 | 2017.02.08 | 5 | Life expectancies. | The World Health Organization publishes life expectancy estimates for 194 countries, for each year between 2000 and 2015. Related: “One Dataset, Visualized 25 Ways.” Previously: American life expectancies by city (April 13, 2016). | http://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/ http://flowingdata.com/2017/01/24/one-dataset-visualized-25-ways/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-04-13-edition | |
307 | 2017.02.15 | 1 | Volunteer weather reports. | The National Weather Service’s Cooperative Observer Program (COOP) is a 127-year-old network of volunteer weather observers. “More than 8,700 volunteers take observations on farms, in urban and suburban areas, National Parks, seashores, and mountaintops,” according to the NWS. Want to become a volunteer? Because the program is so old, “many areas already have the necessary stations operating,” but “about 200 observers resign each year, about 4 per state.” While you’re waiting, you can download the COOP data from Iowa State University. [h/t Bill Frischling] | http://www.nws.noaa.gov/om/coop/ http://www.nws.noaa.gov/om/coop/what-is-coop.html http://www.nws.noaa.gov/om/coop/become.htm https://mesonet.agron.iastate.edu/COOP/ | https://twitter.com/billfrisch |
308 | 2017.02.15 | 2 | Museum-worthy images. | Last week, the Metropolitan Museum of Art made 375,000 images free to use, remix, and share under a Creative Commons Zero license. The museum also publishes bulk metadata on more than 420,000 pieces of art; that file indicates whether a given artwork is in the public domain, and hence whether the images fall under the new license. You can also search the images here. Other museums providing open-access imagery include the National Gallery of Art, the Getty, and Amsterdam’s Rijksmuseum. Previously: Mo’ museum metadata (Nov. 4, 2015). [h/t Joshua Barone + Sarah Bond] | http://www.metmuseum.org/blogs/digital-underground/2017/open-access-at-the-met https://github.com/metmuseum/openaccess http://www.metmuseum.org/art/collection#!?perPage=20&showOnly=withImage%7Copenaccess&sortBy=Relevance&sortOrder=asc&offset=0&pageSize=0 https://images.nga.gov/en/page/show_home_page.html http://search.getty.edu/gateway/search?q=&cat=highlight&f=%22Open+Content+Images%22&rows=10&srt=a&dir=s&pg=1 https://www.rijksmuseum.nl/en/api https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-04-edition | https://www.nytimes.com/2017/02/07/arts/design/met-museum-makes-375000-images-available-for-free.html http://www.forbes.com/sites/drsarahbond/2017/02/08/the-met-museum-just-made-375000-images-open-access-but-here-are-a-few-more-museums-that-are-oa/ |
309 | 2017.02.15 | 3 | Clinical trials. | The Clinical Trials Transformation Initiative — a public-private partnership of more than 80 organizations — upgraded its clinical trials database late last month. The relational database, called the Aggregate Analysis of ClinicalTrials.gov (AACT), contains “all information (protocol and result data elements) about every study registered” through that titular government website. The AACT data is well-documented and accessible both via download and remote database connection. ClinicalTrials.gov also publishes the underlying data itself, but as one big XML file. | https://www.ctti-clinicaltrials.org/ https://www.ctti-clinicaltrials.org/news/upgraded-aact-database-offers-improved-functionality-analyzing-clinicaltrialsgov-data http://aact.ctti-clinicaltrials.org/learn_more http://aact.ctti-clinicaltrials.org/download http://aact.ctti-clinicaltrials.org/connect https://clinicaltrials.gov/ct2/resources/download | |
310 | 2017.02.15 | 4 | Mood swings. | From the Journal of Open Psychology Data: “We present a dataset of a single (N=1) participant diagnosed with major depressive disorder, who completed 1,478 measurements over the course of 239 consecutive days in 2012 and 2013.” The “participant” happens to be one of the study’s authors — Peter C. Groot, a researcher at Maastricht University Medical Centre. Each day, he recorded the degree to which “I feel relaxed,” “I feel lonely,” “I worry,” and responses to dozens of other prompts. [h/t Sacha Epskamp] | http://openpsychologydata.metajnl.com/articles/10.5334/jopd.29/ https://osf.io/j4fg8/ | https://twitter.com/SachaEpskamp/status/830762054399168512 |
311 | 2017.02.15 | 5 | Student athletes. | The NCAA publishes data on its student athletes’ academic progress and graduation rates. The numbers are aggregated by school and sport — from baseball, to women’s bowling, to mixed rifle. [h/t Albert Bowden] | https://www.icpsr.umich.edu/icpsrweb/content/NCAA/data.html | http://opendata.stackexchange.com/a/10527 |
312 | 2017.02.22 | 1 | Subsidized housing. | Earlier this month, the Department of Housing and Urban Development released its “Picture of Subsidized Households” report for 2016. The dataset describes the living conditions, demographics, and finances of families receiving subsidies via the agency’s various programs — including public housing, Section 8 vouchers, and several others. The figures are provided for the entire U.S., by state, metro area, housing agency, city, county, Census tract, and even by housing development. HUD provides a data dictionary explaining each field, as well as a tool to query the data without downloading the entire dataset. [h/t Pat Smith] | https://twitter.com/HUDUSERnews/status/830145945987858436 https://www.huduser.gov/portal/datasets/assthsg.html https://www.huduser.gov/portal/datasets/pictures/dictionary_2016.pdf | https://twitter.com/cityresearch |
313 | 2017.02.22 | 2 | Nearby stars and potential exoplanets. | Last week, a team of researchers released a dataset containing “60,949 Doppler velocity measurements covering 1,624 stars taken over 20 years” from the Keck Observatory in Hawaii. The authors have already used the dataset to identify more than 100 exoplanets — i.e., planets outside our solar system. Now, they’re hoping that the public and other researchers will use their data to help discover even more. Previously: The NASA Exoplanet Archive (May 11, 2016). [h/t Arthur Bashlykov] | http://home.dtm.ciw.edu/ebps/data/ http://news.mit.edu/2017/dataset-nearby-stars-available-public-exoplanets-0213 https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-05-11-edition | https://www.linkedin.com/in/arthur-bashlykov-8a3b2b102 |
314 | 2017.02.22 | 3 | Local UV exposure. | The National Cancer Institute has published ultraviolet radiation exposure estimates for every county in the continental United States. The estimates, based on a peer-reviewed methodology and 30 years of data from the National Solar Radiation Data Base, can also be explored using the institute’s mapping tool. Luna County, New Mexico, had the highest estimated UV exposure, at 5,723 watt-hours per square meter; Clallam County, Washington, had the lowest, at 3,012 Wh/m². [h/t J. Albert Bowden II] | https://gis.cancer.gov/tools/uv-exposure/ https://www.researchgate.net/profile/Zaria_Tatalovich/publication/228942287_A_comparison_of_thiessen-polygon_kriging_and_spline_models_of_UV_exposure/links/56605c3b08aebae678aa0abf.pdf http://rredc.nrel.gov/solar/old_data/nsrdb/ https://gis.cancer.gov/geoviewer/app/ | https://data.world/albert/us-county-level-uv-exposure |
315 | 2017.02.22 | 4 | NBA refereeing. | Since March 2015, the National Basketball Association has issued post-game reports reviewing referees’ calls during the final two minutes of neck-and-neck games. The NBA publishes those reports as PDFs; journalist Russell Goldenberg has been converting them to spreadsheet-friendly CSVs. Goldenberg is also analyzing and visualizing the data — updated daily — to show, for example, which players are benefitting most from incorrect and missed calls. (Answer so far: the Wizards’ Marcin Gortat and the Nets’ Brook Lopez.) | http://official.nba.com/nba-last-two-minute-reports-archive/ https://github.com/polygraph-cool/last-two-minute-report/tree/master/output https://pudding.cool/2017/02/two-minute-report/ | |
316 | 2017.02.22 | 5 | Pick a card, any card. | When researchers asked 1,354 people to name or visualize a playing card, 1 in 6 of them first chose the Ace of Spades. Here’s the data, which includes each participant’s three card choices, age, and gender. | http://www.psychologyofmagic.org/research/cards/paper.html https://osf.io/534g2/ | |
317 | 2017.03.01 | 1 | Words kids learn. | Wordbank is an “open database of children's vocabulary development.” So far, the Stanford-hosted project has gathered data from more than 71,000 standardized and anonymized vocabulary questionnaires across 23 languages. You could spend hours exploring the data online, charting how quickly children learn individual words, how quickly the same word (e.g., “grandma,” “abuela,” “ба́бушка”) is learned in different languages, and connections between words. You can download the data for each word or for each child’s vocabulary. Bonus: Wordbank has an R package and a GitHub repository. [h/t Hacker News user "Jasamba"] | http://wordbank.stanford.edu/ http://wordbank.stanford.edu/analyses?name=item_trajectories http://wordbank.stanford.edu/analyses?name=uni_lemmas http://wordbank.stanford.edu/analyses?name=networks http://wordbank.stanford.edu/analyses?name=item_data http://wordbank.stanford.edu/analyses?name=instrument_data http://langcog.github.io/wordbankr/ https://github.com/langcog/wordbank | https://news.ycombinator.com/item?id=13726395 |
318 | 2017.03.01 | 2 | Police officers as immigration enforcers. | In an early executive order, Donald Trump instructed the Department of Homeland Security to expand its use of Section 287(g) of the Immigration and Nationality Act, which allows the federal government to deputize local law enforcement agencies in its search for undocumented immigrants. In response to FOIA requests, DHS has previously released data on the local agencies that participate in the 287(g) program. The Marshall Project has collated the DHS data, which includes the number of immigrants deported, for 2006 to 2013 (the most recent year available). During that timespan, “more than 175,000 people nationwide were deported under the program,” Anna Flagg writes. “More than 30,000 of them came from Maricopa County, Ariz., the most from any single jurisdiction.” [h/t Tom Meagher] | https://www.whitehouse.gov/the-press-office/2017/01/25/presidential-executive-order-enhancing-public-safety-interior-united https://www.uscis.gov/ilink/docView/SLB/HTML/SLB/0-0-0-1/0-0-0-29/0-0-0-9505.html https://github.com/themarshallproject/ICE287g-removals https://www.themarshallproject.org/2017/02/20/the-opposite-of-sanctuary | http://www.tommeagher.com/about.html |
319 | 2017.03.01 | 3 | Vehicle specs. | The National Highway Traffic Safety Administration provides an impressively rich API detailing every manufacturer, make, and model in its database. The API can translate cars’ Vehicle Identification Numbers into the nitty-gritty details that those VINs encode, including the plant where the vehicle was manufactured, number of doors, engine measurements, fuel type, and more. [h/t Justin Myers] | https://vpic.nhtsa.dot.gov/ | http://www.justinmyers.net/ |
320 | 2017.03.01 | 4 | A decade-plus of Seattle library checkouts. | Last month, the Seattle Public Library released a dataset tracking the total number of checkouts for each title by year and month from April 2005 to December 2016 (so far). The dataset isn’t limited to physical books; it also includes e-books, magazines, CDs, DVDs, and more. Last year, the three most popular physical books were Paula Hawkins’s The Girl on the Train (2,355 checkouts), Lauren Groff’s Fates and Furies (2,151 checkouts), and Ta-Nehisi Coates’s Between the World and Me (2,134 checkouts). | https://shelftalkblog.wordpress.com/2017/02/14/for-the-love-of-data-an-open-data-release/ https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6/data | |
321 | 2017.03.01 | 5 | Gator hunting. | Florida’s Fish and Wildlife Conservation Commission publishes data from its statewide recreational alligator hunt. For each alligator harvested between 2000 and 2015, the dataset includes the date, the hunting area, and the length of the carcass. (Legal hunting tools include crossbows, harpoons, spearguns, fishing poles, snatch hooks, and bang sticks — but not rifles, pistols, or other guns.) [h/t Christopher Groskopf + Neil Bedi + Eric Sagara] | http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/ http://myfwc.com/wildlifehabitats/managed/alligator/harvest/ http://myfwc.com/media/3791759/alligator-hunting-guide.pdf | https://github.com/onyxfish/nicar-2017-agate https://twitter.com/esagara |
322 | 2017.03.08 | 1 | The federal checkbook. | From Treasury.io: “Every day at 4pm, the United States Treasury publishes data tables summarizing the cash spending, deposits, and borrowing of the federal government.” Those data tables “catalog all the money taken in that day from taxes, the programs, and how much debt the government took out.” On Monday, for instance, the government spent $481 million on the Postal Service. One hitch: The Treasury’s data tables are (subjectively) ugly and (objectively) spreadsheet-unfriendly. So Treasury.io — an open-source civic project — continuously converts the files into good ol’ tabular data. You can download individual tables as CSVs, get the whole dataset as a big SQLite database, or query the API. There’s also a data dictionary and a Twitter bot. | http://treasury.io/ https://www.fms.treas.gov/fmsweb/viewDTSFiles?dir=w&fname=17030600.txt https://github.com/csvsoundsystem/federal-treasury-api https://github.com/csvsoundsystem/federal-treasury-api/wiki/Treasury.io-Data-Dictionary https://twitter.com/treasuryio | |
323 | 2017.03.08 | 2 | Historical Bitcoin prices. | The Bitcoin exchange rate hit an all-time high last week, at more than $1,290 per bitcoin. That’s according to CoinDesk’s Bitcoin Price Index, an average rate derived from several major exchanges. You can download daily and hourly data for the index and its components. [h/t Jan Doggen] | http://www.coindesk.com/price/ http://www.coindesk.com/price/bitcoin-price-index/ | http://opendata.stackexchange.com/a/6891 |
324 | 2017.03.08 | 3 | Drug patents and exclusivity. | The FDA’s “Orange Book” lists approved drugs, their associated patents, and government-granted exclusivity rights. The Orange Book is available as a 1,400-page PDF, but you can also download the key data as structured text files. The files are updated monthly. Related: “Drugs For Rare Diseases Have Become Uncommonly Rich Monopolies,” published by Kaiser Health News and NPR in January. Question for readers: The Orange Book data comes as tilde-delimited files, the first I’ve ever seen. Do you have ~any other examples~? [h/t Sydney Lupkin] | https://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm https://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM071436.pdf https://www.fda.gov/Drugs/InformationOnDrugs/ucm129689.htm http://www.npr.org/sections/health-shots/2017/01/17/509506836/drugs-for-rare-diseases-have-become-uncommonly-rich-monopolies | https://twitter.com/slupkin |
325 | 2017.03.08 | 4 | Speaking roles in 2016’s blockbusters. | Researcher Amber Thomas has parsed the transcripts of last year’s 10 highest-grossing films. The resulting data files indicate each character’s number of turns speaking, number of words spoken, and gender. Previously: Dialogue from 2,000 movies, by gender (April 13, 2016). | https://proquestionasker.github.io/ https://proquestionasker.github.io/projects/MovieDialogue/ https://github.com/ProQuestionAsker/2016MovieDialogue https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-04-13-edition | |
326 | 2017.03.08 | 5 | Pictures of food. | A trio of European researchers has published a dataset containing 101,000 photos of food — 1,000 images each from 101 food categories, all downloaded from foodspotting.com. The categories include apple pie, escargots, onion rings, paella, bibimbap, prime rib, and more. [h/t Reddit user cavedave] | https://www.vision.ee.ethz.ch/datasets_extra/food-101/ https://www.foodspotting.com | https://www.reddit.com/r/datasets/comments/5v436t/food_101_pictures_of_food_dataset/ |
327 | 2017.03.15 | 1 | Who’s visited the U.S. on visas, and how. | Donald Trump’s new travel ban is scheduled to take effect at 12:01am Eastern tonight. The State Department doesn’t publish realtime visa data, but it does publish historical data, including the number of non-immigrant visas issued each fiscal year between 1997 and 2016, by nationality and visa type. (For example, the government issued 226 “fiancé(e)” K-1 visas to Syrian nationals in fiscal year 2016.) The agency also reports how many visas of each type it refused each year, as well as refusal rates by nationality. [h/t Thomas Kasang] [Update, 2017-12-12: The State Department link appears no longer to be working; here's a copy from the Wayback Machine: http://web.archive.org/web/20171201161048/https://travel.state.gov/content/visas/en/law-and-policy/statistics/non-immigrant-visas.html ] | https://travel.state.gov/content/visas/en/law-and-policy/statistics/non-immigrant-visas.html http://web.archive.org/web/20171201161048/https://travel.state.gov/content/visas/en/law-and-policy/statistics/non-immigrant-visas.html | https://github.com/axibase/atsd-use-cases/blob/master/USVisaRefusal/README.md |
328 | 2017.03.15 | 2 | Sounds of YouTube. | Last week, a research team at Google published AudioSet, a dataset of “2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.” The clips have been classified into hundreds of categories, including “plucked string instrument,” “computer keyboard,” “chuckle, chortle,” “snoring,” and “fowl.” [h/t Suman Deb Roy] | https://research.google.com/audioset/ https://research.google.com/audioset/dataset/plucked_string_instrument.html https://research.google.com/audioset/dataset/computer_keyboard.html https://research.google.com/audioset//dataset/chuckle_chortle.html https://research.google.com/audioset/dataset/snoring.html https://research.google.com/audioset/dataset/fowl.html | https://twitter.com/_RoySD/status/840227343142670336 |
329 | 2017.03.15 | 3 | Many millions of mortgages. | Freddie Mac — the government-sponsored, publicly traded company also known as the Federal Home Loan Mortgage Corporation — publishes data on 23 million single-family home mortgages it has originated or guaranteed since 1999. The dataset includes the loan amount and interest rate, the borrower’s credit score, the property type (e.g., condo, co-op, manufactured housing), metro area, first payment month, whether the borrower is a first-time homebuyer, and lots more. Freddie Mac requests that you register before downloading the data, but you can also access the files directly. Don’t miss the terms and conditions, which prohibit republishing the files. Previously: Data on millions more loans from the Home Mortgage Disclosure Act (Dec. 30, 2015). | http://www.freddiemac.com/news/finance/sf_loanlevel_dataset.html https://www.reddit.com/r/datasets/comments/5x0tws/freddie_mac_fixedrate_mortgage_dataset_from/ https://freddiemac.embs.com/FLoan/HistoricalDataTerms.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-30-edition | |
330 | 2017.03.15 | 4 | Chicago traffic camera violations. | The Windy City publishes two datasets on traffic violations. One tallies the daily number of speeding violations in each Children’s Safety Zone; the other, red-light violations at each camera-surveilled intersection. Both go back to July 2014. The city also publishes a spreadsheet of city-towed vehicles. Related: The Chicago Tribune’s long-running investigation into the city’s traffic camera troubles. [h/t Jacob Sheff] | https://data.cityofchicago.org/Transportation/Speed-Camera-Violations/hhkd-xvj4/ https://data.cityofchicago.org/Transportation/Red-Light-Camera-Violations/spqx-js37 https://data.cityofchicago.org/Transportation/Towed-Vehicles/ygr5-vcbg http://www.chicagotribune.com/news/watchdog/redlight/ | https://www.datazar.com/file/f0c0d92c9-1ae3-468d-ac50-dfc82c32b30c |
331 | 2017.03.15 | 5 | Nearly every proposed amendment to the Constitution. | To prepare for an exhibition last year, the National Archives and Records Administration created a dataset of more than 11,000 constitutional amendment proposals introduced in Congress between 1787 and 2014. [h/t Justin Lewis] | https://www.archives.gov/open/dataset-amendments.html | https://www.datazar.com/file/f5d4b5cb5-8a4e-4ff0-905e-fe2f1acbd5e0 |
332 | 2017.03.22 | 1 | Real-time air quality. | The team at Berkeley Earth has released the data files behind their real-time global air quality map. The map and data track measurements of pollution particles smaller than 2.5 microns in diameter. “Under typical conditions,” the Berkeley Earth team writes, this particulate matter “is the most damaging form of air pollution likely to be present, contributing to heart disease, stroke, lung cancer, respiratory infections, and other diseases.” Previously: The World Health Organization’s Global Urban Ambient Air Pollution Database (June 15, 2016). | http://berkeleyearth.org/about/ http://berkeleyearth.org/air-quality-real-time-maps-data-download/ http://berkeleyearth.org/air-quality-real-time-map/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-15-edition | |
333 | 2017.03.22 | 2 | Previous federal budget proposals. | To accompany its 2016 and 2017 budget proposals, the Obama administration published machine-readable copies on GitHub. Each proposal’s data are divided into three CSV files: for budget authority, outlays, and receipts. The accompanying user guide explains the data sources and structure. Sample tidbit: The White House expected the Department of Homeland Security to pull in $712 million in excise taxes from the Oil Spill Liability Trust Fund in 2017. [h/t Dan Nguyen] | https://github.com/WhiteHouse/budgetdata/tree/2017 https://github.com/WhiteHouse/budgetdata/blob/2017/USER_GUIDE.md https://github.com/WhiteHouse/budgetdata/blob/2017/data/receipts.csv https://www.uscg.mil/npfc/About_NPFC/osltf.asp | https://www.reddit.com/r/datasets/comments/5zzpli/previous_2_years_of_white_house_budgets_available/ |
334 | 2017.03.22 | 3 | Indian state elections. | Five states in India, representing nearly 250 million residents — Punjab, Uttar Pradesh, Uttarakhand, Goa, and Manipur — have already held legislative assembly elections this year. India’s Election Commission publishes these results, but only as webpages. A couple of Hyderabad-based developers have scraped the website, and published CSVs of the data on GitHub. Previously: Data Is Plural’s election edition (Sept. 28, 2016). | http://eciresults.nic.in/ https://github.com/Vizbi/state-elections https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-09-28-edition | |
335 | 2017.03.22 | 4 | Construction spending. | The Census’ Value of Construction Put in Place Survey “provides monthly estimates of the total dollar value of construction work done in the U.S.” For instance, construction spending in 2016 totaled approximately $1.1 trillion, $89 billion of which went to education-related construction. The survey has been collected monthly since 1964; historical data files are available going back to 1993. [h/t Kevin Gilmore] | https://www.census.gov/construction/c30/c30index.html https://www.census.gov/construction/c30/historical_data.html | https://www.datazar.com/file/f18c892eb-b940-4177-88b8-cbd92e9ae5f5 |
336 | 2017.03.22 | 5 | Gone fishing. | NOAA Fisheries’ Greater Atlantic Region publishes spreadsheets of the federal permits it awards to fishing vessels, operators, and dealers. For each vessel, the data includes the boat’s name, owner, principal port city, length, horsepower, and categories of fish permitted. The agency’s Southeast Regional Office also publishes lists of its permits — for shark dealers, domestic swordfish dealers, spiny lobster tailing, and more — but as HTML tables with no CSV-export option. [h/t J. Albert Bowden II] | https://www.greateratlantic.fisheries.noaa.gov/aps/permits/data/index.html http://sero.nmfs.noaa.gov/operations_management_information_services/constituency_services_branch/freedom_of_information_act/common_foia/index.html http://sero.nmfs.noaa.gov/operations_management_information_services/constituency_services_branch/freedom_of_information_act/common_foia/SK.htm http://sero.nmfs.noaa.gov/operations_management_information_services/constituency_services_branch/freedom_of_information_act/common_foia/SD.htm http://sero.nmfs.noaa.gov/operations_management_information_services/constituency_services_branch/freedom_of_information_act/common_foia/LT.htm | https://data.world/albert/permits-vessels-ifq-foias |
337 | 2017.03.29 | 1 | Military spending. | The Stockholm International Peace Research Institute’s Military Expenditure Database is based on official reports, International Monetary Fund yearbooks, newspaper articles, and other sources. It covers most major countries since the 1950s and more than 100 countries since 1988. The dataset also quantifies military spending on a per-capita basis, as a share of the country’s GDP, and as a proportion of total government spending. Also: The Defense Manpower Data Center publishes spreadsheets detailing the number of active and reserve U.S. personnel stationed in each state, territory, and foreign country. Previously: SIPRI’s database of international arms transfers (Nov. 18, 2015). [h/t K.K. Rebecca Lai, Troy Griggs, Max Fisher and Audrey Carlsen] | https://www.sipri.org/databases/milex https://www.sipri.org/databases/milex/sources-and-methods https://www.dmdc.osd.mil/appj/dwp/index.jsp https://www.dmdc.osd.mil/appj/dwp/dwp_reports.jsp https://www.sipri.org/databases/armstransfers | https://www.nytimes.com/interactive/2017/03/22/us/is-americas-military-big-enough.html |
338 | 2017.03.29 | 2 | Food surveillance. | Late last year, the FDA began publishing a dataset of “adverse events” that have been reported to its Center for Food Safety and Applied Nutrition. The database currently covers January 2004 through December 2016, and includes reports of (suspected) bad reactions to foods, dietary supplements, and cosmetics. For instance, the first row names a particular brand of chocolate chips as the potential culprit in the hospitalization of a two-year-old girl, whose symptoms included a rash, swelling face, cough, and difficulty breathing. Previously: FDA adverse event data for pharmaceutical drugs (May 18, 2016). [h/t Sheila Hagar + Drew Ivan] | https://blogs.fda.gov/fdavoice/index.php/2016/12/why-fda-is-making-data-extracted-from-reports-of-adverse-events-for-foods-and-cosmetics-available-to-the-public/ https://www.fda.gov/Food/ComplianceEnforcement/ucm494015.htm https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-05-18-edition | https://twitter.com/ubsheilahagar https://twitter.com/drewivan |
339 | 2017.03.29 | 3 | Failed banks. | The Federal Deposit Insurance Corporation publishes a spreadsheet of failed banks for which the agency has been appointed as a receiver — some 550 banks since October 2000. It also provides short descriptions of each bank failure. The most recent: Proficio Bank of Cottonwood Heights, Utah, which closed on March 3. More on the FDIC’s receivership program here. | https://www.fdic.gov/bank/individual/failed/banklist.html https://www.fdic.gov/bank/historical/bank/ https://www.fdic.gov/about/strategic/strategic/receivership.html | |
340 | 2017.03.29 | 4 | Real-time air quality, part II. | After last week’s item on Berkeley Earth’s real-time air quality data, reader Olaf Veerman pointed me to OpenAQ. The open-source project currently gathers pollution data from nearly 5,500 locations in 47 countries, aggregated “from real-time government and research grade sources.” You can download the data via OpenAQ’s API. [h/t Olaf Veerman] | http://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-22-edition http://berkeleyearth.org/air-quality-real-time-maps-data-download/ https://twitter.com/oBirdman https://openaq.org https://github.com/openaq https://openaq.org/#/countries https://docs.openaq.org/ | https://twitter.com/oBirdman |
341 | 2017.03.29 | 5 | 100 million domain names. | The anonymously-published DNS Census 2013 “is an attempt to provide a public dataset of registered domains and DNS records” — essentially the Internet’s phone book. The dataset, which has also been uploaded to the Internet Archive, includes 2.7 billion Domain Name System records and 106,928,034 distinct domains, organized by extension (e.g., .com, .info, .edu). RIP, certificationcommissionforhealthcareinformationtechnology.biz. [h/t Andrew Ferlitsch] | https://dnscensus2013.neocities.org/ https://archive.org/details/DNSCensus2013 https://archive.org/download/DNSCensus2013/2nd-level-domains/ | http://opendata.stackexchange.com/a/2122 |
342 | 2017.04.05 | 1 | 3D NYC. | In 2014, the NYC Department of Information Technology & Telecommunications conducted a massive aerial survey of the city. Then, they converted the images and data they collected into a three-dimensional model of every building in all five boroughs. Related: In December, The New York Times used the data to map the city’s shadows. Also related: Berlin, the Hague, and Lyon offer digital 3D models of their cities, too. Previously: LiDAR-powered elevation data from around the world (May 25, 2016). [h/t Dan Nguyen] | http://www1.nyc.gov/site/doitt/initiatives/3d-building.page https://www.nytimes.com/interactive/2016/12/21/upshot/Mapping-the-Shadows-of-New-York-City.html http://www.businesslocationcenter.de/en/downloadportal https://data.overheid.nl/data/dataset/3d-model-den-haag/resource/2191118b-5ccc-436b-a5f8-eca12f8f8281 https://data.grandlyon.com/search/?Q=maquettes+textur%25C3%25A9es https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-05-25-edition | https://www.reddit.com/r/datasets/comments/62mcpy/3d_geospatial_data_for_new_york_city_buildings/ |
343 | 2017.04.05 | 2 | Cherry blossoms. | Yasuyuki Aono, an associate professor at Osaka Prefecture University, has collected the historical flowering dates of Kyoto’s Prunus jamasakura cherry trees going all the way back to the 9th century. The dataset is based on “many diaries and chronicles written by Emperors, aristocrats, [governors] and monks,” Aono writes. The dates are those “on which cherry blossom viewing parties had been held or full flowerings had been observed.” Over the past century, Kyoto’s cherry trees have been blooming earlier and earlier. Related: @bbgblossoms, a Twitter bot that tracks the status of the Brooklyn Botanic Garden’s 152 cherry trees. [h/t Eric Steig] | http://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/ https://twitter.com/hausfath/status/848939887839526912 https://twitter.com/bbgblossoms https://www.bbg.org/collections/cherries | https://twitter.com/ericsteig/status/848656113201315840 |
344 | 2017.04.05 | 3 | Government bond ownership. | Bruegel, “a European think tank that specialises in economics,” publishes a quarterly-updated dataset quantifying sovereign bond holdings for 12 countries: Belgium, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Portugal, Spain, the U.K., and the United States. For each country, the dataset tells you what proportion of the federal government’s bonds are held by each of five types of owners: the country’s central bank, other public institutions, domestic banks, other domestic investors, and foreign investors. [h/t @CoolDatasets] | http://bruegel.org/ http://bruegel.org/publications/datasets/sovereign-bond-holdings/ | https://twitter.com/CoolDatasets/status/839851026949812224 |
345 | 2017.04.05 | 4 | Science grants. | The National Science Foundation publishes data on all of the grants the agency has awarded since the 1970s (and some earlier ones, too). Each grant is represented as an XML file, which contains information about the project, the awardee, and the NSF division that awarded the grant. [h/t France A. Córdova] | https://www.nsf.gov/awardsearch/download.jsp | http://opendata.stackexchange.com/a/10945 |
346 | 2017.04.05 | 5 | Avian invasions. | In a peer-reviewed paper published last week, a trio of University College London researchers describe their Global Avian Invasions Atlas. The dataset includes information on “971 species, introduced to 230 countries and administrative areas across all eight biogeographical realms, spanning the period 6000 BCE – AD 2014.” | http://www.nature.com/articles/sdata201741 https://figshare.com/articles/Data_from_The_Global_Avian_Invasions_Atlas_-_A_database_of_alien_bird_distributions_worldwide/4234850 | |
347 | 2017.04.12 | 1 | Plum presidential appointments. | Every four years, Congress publishes United States Government Policy and Supporting Positions, better known as the Plum Book. The 2016 version, which is available as both PDF and Excel files, identifies more than 8,000 executive and legislative branch jobs subject to “noncompetitive appointment.” Those positions include 1,710 presidential appointments, which are as wide-ranging as the ambassadorship to Afghanistan and the directorship of the Occupational Safety and Health Administration’s Whistleblower Protection Program. Related: For positions requiring its confirmation, the Senate publishes XML files of pending, confirmed, and withdrawn nominees. | https://www.gpo.gov/fdsys/pkg/GPO-PLUMBOOK-2016/content-detail.html https://www.senate.gov/legislative/nominations.htm | |
348 | 2017.04.12 | 2 | Miles per gallon. | The Environmental Protection Agency publishes fuel efficiency data on all the car models it has tested, going back to the 1980s… minus all the Volkswagen, Audi, and Porsche diesels caught cheating. The data typically includes three estimates: for city driving, highway driving, and a city-highway combination. | https://www.fueleconomy.gov/feg/download.shtml | |
349 | 2017.04.12 | 3 | Pirated papers. | Sci-Hub, which describes itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers,” recently released a list of the 62,835,101 academic papers it has collected. That dataset identifies each paper only by its DOI — a short, unique ID. Helpfully, graduate student Bastian Greshake has extracted the journal name, publisher, and publication year from those DOIs. Greshake has also combined that data with six months of Sci-Hub download data (previously featured in DIP 2016.05.04), and analyzed the datasets together. Among his findings: Both are “largely made up of recently published articles, with users disproportionately favoring newer articles and 35% of downloaded articles being published after 2013.” | https://sci-hub.cc/ https://figshare.com/articles/List_of_DOIs_of_papers_collected_by_SciHub/4765477 http://www.apastyle.org/learn/faqs/what-is-doi.aspx http://ruleofthirds.de/ https://zenodo.org/record/472493 http://datadryad.org/resource/doi:10.5061/dryad.q447c https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-05-04-edition http://biorxiv.org/content/early/2017/04/10/124495 | |
350 | 2017.04.12 | 4 | International aid for maternal and child health. | Researchers at the World Health Organization have assembled a dataset of international aid — both from official government assistance and private grants — devoted to reproductive, maternal, newborn, and child health from 2003 to 2013. The dataset, which the researchers described in a recent academic article, draws on 2.1 million records, and is based largely on the OECD’s Creditor Reporting System. Related: Earlier this month, the U.S. State Department cut all its funding for the UN's family planning agency; it was the agency’s third-largest donor. | http://datacompass.lshtm.ac.uk/320/ http://www.nature.com/articles/sdata201738 http://stats.oecd.org/Index.aspx?datasetcode=CRS1 https://www.buzzfeed.com/jinamoore/the-us-wont-give-any-more-money-to-the-un-population-fund | |
351 | 2017.04.12 | 5 | One million comic book panels. | Comic books make use of white space — or gutters — to propel the story forward, relying on readers’ intuitive ability to fill in the gaps between panels. To see whether computers could learn to make the same inferences, a group of computer scientists built a giant corpus of public-domain comics and tried training a series of neural networks on it. (Spoiler: Humans are much better at this.) The underlying dataset contains 1.2 million panels from nearly 200,000 scanned pages of nearly 4,000 books in the Digital Comic Museum, all published during the 1938–1954 “Golden Age” of American comics. It also contains 2.5 million chunks of text extracted from the comics’ speech balloons, thought bubbles, and narration boxes. [h/t Robin Sloan] | https://arxiv.org/abs/1611.05118 https://obj.umiacs.umd.edu/comics/index.html https://digitalcomicmuseum.com/ | https://www.robinsloan.com/ |
352 | 2017.04.19 | 1 | UK, US, Rx. | The UK’s National Health Service publishes monthly data on drugs prescribed in England through the country’s single-payer health care system. (Drugs prescribed in Scotland, Wales, or Northern Ireland aren’t included.) For each prescriber-and-drug combination, the dataset includes the quantity and cost of prescriptions for each month since August 2010. The US publishes similar data about prescriptions issued through Medicare, but only on an annual basis and currently only covering 2013 and 2014. Related: ProPublica’s Prescriber Checkup, which uses the Medicare data to examine doctors’ prescribing patterns. Previously: A decade-plus of Australian prescription data (DIP 2016.08.24). [h/t Adam Crahen] | https://data.gov.uk/dataset/prescribing-by-gp-practice-presentation-level https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html https://projects.propublica.org/checkup/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-08-24-edition | https://twitter.com/acrahen/status/853487201837101056 |
353 | 2017.04.19 | 2 | Vaccination rates by state. | The CDC’s National Center for Immunization and Respiratory Diseases collects and publishes state-by-state vaccination rates for infants, kindergartners, teens, and adults — plus, flu vaccination rates for several age groups. Each dataset includes several years’ worth of data, with many going back to 2008 or 2009. Related: “California Shows The Rest Of The Country How To Boost Kindergarten Vaccination Rates,” by my colleague Peter Aldhous, with additional county-level data from the Golden State. Previously: International vaccination rates and policies (DIP 2016.08.03). | https://www.cdc.gov/vaccines/imz-managers/coverage/childvaxview/data-reports/index.html https://www.cdc.gov/vaccines/imz-managers/coverage/schoolvaxview/data-reports/index.html https://www.cdc.gov/vaccines/imz-managers/coverage/teenvaxview/data-reports/index.html https://www.cdc.gov/vaccines/imz-managers/coverage/adultvaxview/data-reports/index.html https://www.cdc.gov/flu/fluvaxview/interactive.htm https://www.buzzfeed.com/peteraldhous/record-vaccination-in-california http://www.peteraldhous.com/ https://www.cdph.ca.gov/programs/immunize/Pages/ImmunizationLevels.aspx https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-08-03-edition | |
354 | 2017.04.19 | 3 | Tropical cyclones. | Through its International Best Track Archive for Climate Stewardship project, the National Oceanic and Atmospheric Administration publishes what it calls “the most complete global set of historical tropical cyclones available.” For each tropical cyclone — a category that includes typhoons, hurricanes, tropical depressions, and more — the dataset includes its position, wind speed, central pressure, and classification at six-hour intervals. The dataset is updated annually and includes some historical cyclones from as early as 1842. [h/t Daniel Miller] | https://www.ncdc.noaa.gov/ibtracs/ https://www.ncdc.noaa.gov/ibtracs/index.php?name=ibtracs-data | https://opendata.stackexchange.com/a/10994 |
355 | 2017.04.19 | 4 | Where plants grow best. | The USDA’s Plant Hardiness Zone Map “is the standard by which gardeners and growers can determine which plants are most likely to thrive at a location.” The USDA and Oregon State, which have jointly developed the map, previously sold access to the underlying data through a vendor. But after the vendor shut down earlier this year, OSU began publishing the data free of charge (though with some licensing restrictions). The dataset is available as detailed shapefiles and as ZIP code–based spreadsheets. [h/t Waldo Jaquith + Lynn Cherny] | http://planthardiness.ars.usda.gov/PHZMWeb/Default.aspx http://web.archive.org/web/20151001025259/http://climatesource.com/cgi-bin/csshop/scan/st=db/co=yes/sf=category/se=phz_us_phz/op=eq/va=banner_text=US%20%28grids%20%26%20shapefiles%29.html http://www.prism.oregonstate.edu/projects/plant_hardiness_zones.php | https://twitter.com/waldojaquith/status/851995078067453952 https://twitter.com/arnicas/status/852198803310616576 |
356 | 2017.04.19 | 5 | Spelling self-corrections. | For a 2012 academic paper, researchers captured the keystrokes of paid volunteers as they typed descriptions of images. Whenever a participant used the backspace key to correct a word, the researchers added it to a dataset of self-corrections. Each of the 44,000 lines in the English-language version of the dataset contains the original mistake and the correction. The most common change was in → on. Other common fixes included waling → walking and pople → people. [h/t Seth Stephens-Davidowitz] | http://dl.acm.org/citation.cfm?id=2390665.2390749 https://www.microsoft.com/en-us/download/details.aspx?id=52418 | http://sethsd.com/everybodylies/ |
357 | 2017.04.26 | 1 | National park visitors. | The U.S. National Park Service publishes a ton of data about visitors to its parks, historic sites, memorials, preserves, and more. Among them: Visitors per park (annually since 1904, and monthly since 1979), overnight stays by type of lodging (tents, RVs, backcountry, etc.), and traffic. Related: “The National Parks Have Never Been More Popular” (FiveThirtyEight, 2016). [h/t Jack King] | https://irma.nps.gov/Stats/Reports/National https://fivethirtyeight.com/features/the-national-parks-have-never-been-more-popular/ | https://data.world/inform8n/us-national-parks-visitation-1904-2016-with-boundaries |
358 | 2017.04.26 | 2 | Word frequencies. | You’re probably familiar with the Google Books Ngram Viewer, which lets you chart word and phrase frequencies over time. Google publishes the underlying data but those files can (depending on your tools and goals) be cumbersomely large. Here’s an alternative: DIP reader (and former colleague) Chris Wilson has condensed the overall frequencies for 87,000 words — those found in the CMU Pronouncing Dictionary — into a svelte, four-megabyte file. Related: BYU’s advanced interface to the Google Books data. Also related: “The Pitfalls of Using Google Ngram to Study Language” (Wired, 2015). And also: “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them” (The Atlantic, 2017). | https://books.google.com/ngrams/graph?content=data+is%2Cdata+are&year_start=1800&year_end=2012&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20is%3B%2Cc0%3B.t1%3B%2Cdata%20are%3B%2Cc0 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html http://mechanicalscribe.com/notes/google-ngrams-for-cmu-pronunciation-dictionary/ http://www.speech.cs.cmu.edu/cgi-bin/cmudict https://github.com/mechanicalscribe/cmu_tf_idf http://googlebooks.byu.edu/x.asp https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/ https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/ | |
359 | 2017.04.26 | 3 | Women’s empowerment in India. | For each of India’s 36 states and Union Territories, the country’s latest National Family Health Survey includes 114 metrics, such as the percentages of “households using iodized salt” and “men who have comprehensive knowledge of HIV/AIDS.” Unfortunately, the government publishes the reports only as PDFs. But the Hindustan Times has extracted the data for the survey’s eight “women’s empowerment and gender based violence” metrics, including the percentages of “ever-married women who have ever experienced spousal violence” and “women having a bank or savings account that they themselves use.” They’ve published that data as a spreadsheet and used it to construct an interactive Women Empowerment Index. [h/t Gurman Bhatia] | http://rchiips.org/nfhs/factsheet_NFHS-4.shtml https://github.com/HindustanTimesLabs/women-empowerment-index https://docs.google.com/spreadsheets/d/179onU4jvFPqhLlM-7ZJu5xv0LQLOsQq-TrMJNFQkSjI/edit#gid=1937549234 http://www.hindustantimes.com/interactives/women-empowerment-index/ | http://www.gurmanbhatia.com/ |
360 | 2017.04.26 | 4 | Marriage and divorce, pregnancy and infertility in the U.S. | The CDC has been running its National Survey of Family Growth since 1973. For the first three decades, it surveyed only women ages 15-44. Starting in 2002, it began also surveying men. The latest survey was conducted in 2013-15, when it collected data from 10,205 residents about sexual activity and contraception, pregnancy and infertility, marriage and divorce, adoption, parenting, and more. [h/t Allen B. Downey] | https://www.cdc.gov/nchs/nsfg/index.htm https://www.cdc.gov/nchs/nsfg/nsfg_2013_2015_puf.htm | http://greenteapress.com/thinkstats/html/thinkstats002.html#htoc5 |
361 | 2017.04.26 | 5 | This must be the r/place. | For April Fools, Reddit launched a million-pixel canvas called “r/place.” Users could place a single-pixel tile, in one of 16 colors, anywhere on the canvas — but only every five minutes. By the end of r/place’s 72-hour lifetime, Redditors had placed 16.5 million tiles on the canvas, likely making it “the largest collaborative art project in history.” Last week, Reddit published the entire history of the canvas as structured data. [h/t Felipe Hoffa] | https://www.reddit.com/r/place/ https://www.reddit.com/r/redditdata/comments/6640ru/place_datasets_april_fools_2017/ | https://twitter.com/felipehoffa/status/854395005028454401 |
362 | 2017.05.10 | 1 | The border fence. | There’s about 700 miles of official fencing between the U.S. and Mexico, covering about one-third of the full border. The Department of Homeland Security doesn’t provide structured spatial data about the fence’s path. But, thanks to a Texas law professor’s FOIA and some serious elbow grease, reporters at Reveal have created “the most detailed border fence map publicly available.” For each segment of fence, Reveal’s dataset includes the fence type (i.e., pedestrian, vehicle, or unknown), the government’s name for the segment, and the project through which the segment was built. | https://law.utexas.edu/humanrights/borderwall/maps/background-maps.html http://cironline.org/blog/post/surprising-tools-cir-used-map-us-mexico-border-fence-6255 https://www.revealnews.org/article/the-wall-building-a-continuous-u-s-mexico-barrier-would-be-a-tall-order/ https://github.com/cirlabs/border_fence_map | |
363 | 2017.05.10 | 2 | Insurance premiums and payouts. | Last month, ProPublica and Consumer Reports published an analysis of car insurance costs in four states, finding that “some major insurers charge minority neighborhoods as much as 30 percent more than other areas with similar accident costs.” The reporters also published a detailed methodology and dataset supporting their findings. The dataset contains company-by-company insurance premiums for a (hypothetical) college-educated, excellent-credit, accident-free 30-year-old woman in each of 6,261 ZIP codes in the four states — California, Texas, Missouri, and Illinois. The dataset also includes several years of average (per-car) insurance payouts for each ZIP code, which the reporters obtained from state insurance commissioners. Related: The insurance industry's rebuttal and ProPublica's counter-rebuttal. | https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-methodology https://projects.propublica.org/graphics/carinsurance#src-line http://www.insurancejournal.com/news/national/2017/04/05/447012.htm https://www.propublica.org/article/the-car-insurance-industry-attacks-our-story-our-response | |
364 | 2017.05.10 | 3 | Three million grocery orders. | Groceries-on-demand startup Instacart has released a dataset containing 3 million orders from 200,000 (anonymized) users. “For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order,” the company’s head of data science writes. “We also provide the week and hour of day the order was placed, and a relative measure of time between orders.” Here’s the data dictionary. | https://www.instacart.com/datasets/grocery-shopping-2017 https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2 https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b | |
365 | 2017.05.10 | 4 | What do you do with a PhD in science? | The National Science Foundation’s Survey of Doctorate Recipients “is a longitudinal biennial survey conducted since 1973 that provides demographic and career history information about individuals with a research doctoral degree in a science, engineering, or health (SEH) field from a U.S. academic institution.” You can download aggregated data and detailed survey responses going back to 1993. The next release is scheduled for this month. Related: The NSF has published an interactive graphic of the data. [h/t Peter Aldhous] | https://www.nsf.gov/statistics/srvydoctoratework/ https://www.nsf.gov/statistics/srvydoctoratework/#tabs-2 https://sestat.nsf.gov/datadownload/ https://www.nsf.gov/statistics/next-releases.cfm#survey5 https://www.nsf.gov/nsb/sei/infographic2/ | http://www.peteraldhous.com/ |
366 | 2017.05.10 | 5 | *Such* an important dataset. | Grad students in Princeton’s computer science department have published a dataset they call Self-Annotated Reddit Corpus, or “SARC” for short. “The corpus has 1.3 million sarcastic statements — 10 times more than any previous dataset,” the authors write, and takes advantage of Reddit users’ habit of tagging sarcastic comments with an “/s”. Related: A dataset of sarcastic Amazon reviews. [h/t Carlos Somohano + Reddit user cavedave] | http://nlp.cs.princeton.edu/SARC/ https://arxiv.org/abs/1704.05579 https://github.com/ef2020/SarcasmAmazonReviewsCorpus/wiki | https://www.getrevue.co/profile/datamachina/issues/data-machina-issue-114-55313 https://www.reddit.com/r/datasets/comments/68s1c9/a_collection_of_sarcastic_and_regular_amazon/ |
367 | 2017.05.17 | 1 | North Korean missile tests. | The James Martin Center for Nonproliferation Studies publishes what it calls “the first database to record flight tests of all missiles launched by North Korea capable of delivering a payload of at least 500 kilograms a distance of at least 300 kilometers.” The database currently contains 107 missile tests — from North Korea’s first, launched in April 1984, to its latest, launched Sunday morning. For each test, the data includes the missile’s launch site, highest altitude, distance travelled, landing location, success/failure, and other details. [h/t Ian Greenleigh] | http://www.nonproliferation.org/ http://www.nti.org/analysis/articles/cns-north-korea-missile-test-database/ https://www.buzzfeed.com/gracewyler/north-korea-reportedly-launches-ballistic-missile | https://data.world/ian/the-cns-north-korea-missile-test-database |
368 | 2017.05.17 | 2 | Global food prices. | The UN World Food Programme’s vulnerability analysis group collects and publishes food price data for more than 1,000 towns and cities in more than 70 countries. The dataset, which goes back more than a decade, covers basic staples, such as wheat, rice, milk, oil, and more. It’s updated monthly and feeds into (among other things) the UNWFP’s price-spike indicators. Related: The Humanitarian Data Exchange, which hosts the dataset for the UN. Also: The Economist’s Big Mac Index. [h/t Andrew McCartney] | http://vam.wfp.org/ https://data.humdata.org/dataset/wfp-food-prices http://foodprices.vam.wfp.org/ALPS-at-a-glance.aspx https://data.humdata.org/ http://www.economist.com/content/big-mac-index | |
369 | 2017.05.17 | 3 | Rising seas. | How might rising sea levels affect coastal flooding? A new-ish NOAA Technical Report, published in January, combines historical data on global sea levels with “regional factors contributing to sea level change for the entire U.S. coastline.” The result: Localized projections under six sea-level rise scenarios, ranging from “low” to “extreme.” You can download the data (at the bottom of this page) or explore it on a map. Related: Climate Central describes what NOAA’s “extreme” scenario could mean for America (including more maps and calculations). Previously: Tide gauge data (DIP 2016.03.23) and sea ice measurements (DIP 2016.09.14). [h/t Susie Cambria] | http://www.noaa.gov/media-release/new-regional-sea-level-scenarios-help-communities-prepare-for-risks https://sealevel.nasa.gov/understanding-sea-level/key-indicators/global-mean-sea-level https://scenarios.globalchange.gov/sea-level-rise https://coast.noaa.gov/slr/ http://www.climatecentral.org/news/extreme-sea-level-rise-stakes-for-america-21387 https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-03-23-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-09-14-edition | https://about.me/susiecambria |
370 | 2017.05.17 | 4 | “The watch list Chicago police fought to keep secret.” | The Chicago Sun-Times has obtained and published an August 2016 copy of the Chicago Police Department’s “Strategic Subject List,” a database that scores nearly 400,000 (unnamed) people on a scale from 10 to 500, based on an algorithm that attempts to estimate their risk of being involved in gun violence (either as a shooter or a victim). The database includes demographic, geographic, criminal history, and other information about the people it ranks. “But the database doesn’t indicate — and the police won’t say — how much weight is given to each factor in computing the scores, which are produced using an algorithm developed at the Illinois Institute of Technology,” according to the Sun-Times. | http://chicago.suntimes.com/politics/what-gets-people-on-watch-list-chicago-police-fought-to-keep-secret-watchdogs/ | |
371 | 2017.05.17 | 5 | Story arcs. | “The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia.” The plots describe movies, books, plays, TV series, TV episodes, video games, and other stories — essentially, anything that has a Wikipedia article with the word “plot” in one of its subheadings. Related: “Examining the arc of 100,000 stories: a tidy analysis” and “Gender and verbs across 100,000 stories: a tidy analysis,” two blog posts by David Robinson that use the data. | https://github.com/markriedl/WikiPlots http://varianceexplained.org/r/tidytext-plots/ http://varianceexplained.org/r/tidytext-gender-plots/ http://varianceexplained.org/about/ | |
372 | 2017.05.24 | 1 | America’s card catalog. | Last week, the Library of Congress released its largest dataset ever: nearly 25 million records for books, maps, manuscripts and other items in its online catalog. For each item, the data includes standardized bibliographic information, such as the title, author, publication date, and genre. (The dataset represents the online catalog as it was in 2013; more recent data will cost you.) Related: A bit of background about the library’s MARC (Machine Readable Cataloging Records) data format. | https://www.loc.gov/item/prn-17-068/ http://www.loc.gov/cds/PDFdownloads/mds2016.pdf https://opensource.com/article/17/4/bit-about-marc-handlers | |
373 | 2017.05.24 | 2 | Domestic radicalization. | The Profiles of Individual Radicalization in the United States (PIRUS) database “contains deidentified individual-level information on the backgrounds, attributes, and radicalization processes of nearly 1,500 violent and non-violent extremists who adhere to far right, far left, Islamist, or single issue ideologies in the United States” — including the Ku Klux Klan, the Taliban, and the Animal Liberation Front, among others. The dataset covers 1948 through 2013 and was released earlier this year by a team at the University of Maryland. [h/t Lorand Bodo] | http://www.start.umd.edu/data-tools/profiles-individual-radicalization-united-states-pirus http://www.start.umd.edu/news/profiles-individual-radicalization-united-states-pirus-data-now-available | https://twitter.com/LorandBodo/status/864186557242249216 |
374 | 2017.05.24 | 3 | Ransomware payments. | When the malware program known as “WannaCry” hit hundreds of thousands of computers earlier this month, it demanded that the computers’ owners pay $300 in Bitcoin — or lose all of their data. Keith Collins at Quartz has been using Blockchain’s API to track Bitcoin payments to the three digital wallets that the hackers designated to receive the ransoms. He’s published the data and is also using it to power a Twitter bot. Related: “Victims of the WannaCry ransomware attacks have stopped paying up” and “Inside the digital heist that terrorized the world—and only made $100k,” both by Collins. Previously: Historical Bitcoin prices (DIP 2017.03.08). | http://keithcollins.github.io/ https://blockchain.info/api https://github.com/keithcollins/actual_ransom https://twitter.com/actual_ransom https://qz.com/986094/wannacry-ransomware-attacks-victims-have-stopped-paying-the-ransom/ https://qz.com/985093/inside-the-digital-heist-that-terrorized-the-world-and-made-less-than-100k/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-08-edition | |
375 | 2017.05.24 | 4 | Fifty million doodles. | Google is clever: It created a drawing game, got 15 million people to play it, and then turned those doodles into a public dataset of people drawing. You can download the raw data, or just browse the doodles online. | https://quickdraw.withgoogle.com/data https://github.com/googlecreativelab/quickdraw-dataset https://quickdraw.withgoogle.com/data | |
376 | 2017.05.24 | 5 | 🎶 Two thousand cans of craft beer on the wall 🎶. | The website CraftCans.com publishes a database of 2,000+ canned beers. For each beer, the database lists its name, style, brewery, size, alcohol level, and bitterness. The website doesn’t provide a direct download, but — as Jean-Nicholas Hould points out — you can basically just copy-paste the website’s data into your favorite spreadsheet program. Or, if you want something slightly cleaner, you can use this script. Related: This data-profiling tutorial by Hould, which uses the data. Also related: RateBeer.com’s API, but you’ll need to request a developer key to use it. Plus: This interactive graphic, which uses the RateBeer data to explore America’s microbrew epicenters. And also: Official brewery production stats from the U.S. Alcohol and Tobacco Tax and Trade Bureau. [h/t Daniel Brady] | http://www.craftcans.com/ http://craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text http://www.jeannicholashould.com/python-web-scraping-tutorial-for-craft-beers.html https://gist.github.com/jsvine/c537ac9509e7d0ed713cced4992faf39 http://www.jeannicholashould.com/profiling-a-dataset-of-craft-beers.html https://www.ratebeer.com/json/ratebeerapi.asp https://pudding.cool/2017/04/beer/ https://www.ttb.gov/beer/beer-stats.shtml | http://danjbrady.com/ |
377 | 2017.05.31 | 1 | Government payrolls. | Last week at BuzzFeed News, we shared a vast trove of federal payroll data. Those records — provided by the Office of Personnel Management through the Freedom of Information Act — cover more than 40 years and millions of employees. The dataset includes salaries, titles, job types, and demographic variables. In many-but-not-all cases (per OPM’s data release policies), it also includes names. Previously, federal payroll data had been searchable online, but very little was available in downloadable, analysis-friendly formats. Also: Many states – including New York, California, Florida, New Jersey, Minnesota, Arkansas, South Carolina, and Washington – proactively make payroll data available for download. (Some cities, such as Chicago, do, too.) | https://www.buzzfeed.com/jsvine/sharing-hundreds-of-millions-of-federal-payroll-records http://php.app.com/agent/federalemployees/search https://www.fedsdatacenter.com/federal-pay-rates/ https://data.ny.gov/browse?tags=salaries%2Fpayroll&utf8=%E2%9C%93 http://publicpay.ca.gov/ http://salaries.myflorida.com/ http://www.yourmoney.nj.gov/transparency/payroll/ https://mn.gov/mmb/transparency-mn/payrolldata.jsp https://www.ark.org/dfa/transparency/employee_compensation.php http://www.admin.sc.gov/accountability-portal/state-salaries http://fiscal.wa.gov/salaries.aspx http://fiscal.wa.gov/salaries.aspx https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w | |
378 | 2017.05.31 | 2 | Government lobbying. | U.S. lobbyists must notify Congress within 45 days of being retained by new clients. Every quarter after that, they’re required to file activity reports that detail the agencies they lobbied, the topics they covered, and the income they earned. Bulk downloads of both types of reports are available as XML files from the House (going back to 2004) and from the Senate (since 1999). Although they receive the same filings, each chamber “follows different data-cleaning, processing, and editing procedures before storing the data,” according to this recent GAO report. | http://disclosures.house.gov/ld/ldsearch.aspx https://www.senate.gov/legislative/Public_Disclosure/LDA_reports.htm http://www.gao.gov/products/GAO-16-320 | |
379 | 2017.05.31 | 3 | State gun laws. | A team of researchers at the Boston University School of Public Health has collected data on the presence/absence of 133 different types of firearm laws in each U.S. state, for each year between 1991 and 2016. The legal provisions are grouped into 14 categories, such as background checks, “Stand Your Ground” laws, and child access prevention. You can download a spreadsheet of the data, and also browse state-by-state summaries. Previously: The Correlates of State Policy Project (DIP 2016.07.06). | https://www.statefirearmlaws.org/about.html https://www.statefirearmlaws.org/ https://www.statefirearmlaws.org/categories.html https://www.statefirearmlaws.org/table.html https://www.statefirearmlaws.org/state-by-state.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-06-edition | |
380 | 2017.05.31 | 4 | Industrial sector data. | Aswath Damodaran — a professor of finance at NYU’s business school — maintains a trove of data on per-sector financials, including effective tax rates, return on equity, and working capital ratios by industry. For most datasets, Damodaran publishes both current and historical versions. [h/t Tim McGovern] | http://pages.stern.nyu.edu/~adamodar/New_Home_Page/ http://pages.stern.nyu.edu/~adamodar/New_Home_Page/data.html http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/taxrate.htm http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/roe.html http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/wcdata.html http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datacurrent.html http://pages.stern.nyu.edu/~adamodar/New_Home_Page/dataarchived.html | https://twitter.com/herdingbats |
381 | 2017.05.31 | 5 | NYC doggies. | You might have seen New York City’s bubble map of dog names. It turns out that the underlying dataset — which includes the name, gender, age as of 2015, breed, and borough of more than 110,000 dogs — is available on GitHub. You can also download slightly older, but more detailed data from WNYC’s Dogs of NYC project. That data includes each dog’s coat colors, whether it had been spayed/neutered, and its ZIP code. Related: Similar pet license data from Tacoma, Wash., and Edmonton, Canada. [h/t Alex P. Miller + Dan Nguyen] | http://a816-dohbesp.nyc.gov/IndicatorPublic/dognames/ https://github.com/Kaz-A/dog_names/ https://fusiontables.google.com/data?docid=1pKcxc8kzJbBVzLu_kgzoAMzqYhZyUhtScXjB0BQ#rows:id=1 https://project.wnyc.org/dogs-of-nyc/ https://data.cityoftacoma.org/Neighborhoods/Current-Pet-License-City-of-Tacoma-Fircrest/qnnn-t9wt https://data.edmonton.ca/Community-Services/Pet-Licenses-by-Neighbourhood/5squ-mg4w | https://twitter.com/alexpmil/status/861703366203801600 http://danwin.com/ |
382 | 2017.06.07 | 1 | Millions of scientists, and their migrations. | ORCID is a nonprofit organization that provides unique identifiers for researchers — mostly scientists so far — to make it easier to distinguish between them. It has issued more than 3 million IDs so far, and provides annual bulk downloads of all researchers’ public profiles. In many cases, the researchers have supplied their education and employment histories. That enabled Science magazine to analyze the migrations of more than 110,000 researchers who’ve listed multiple countries in these public CVs. (The data and code underlying the analysis are also available to download.) [h/t Shaun Coffey] | https://orcid.org/ https://orcid.org/content/orcid-public-data-file-use-policy http://www.sciencemag.org/news/2017/05/vast-set-public-cvs-reveals-world-s-most-migratory-scientists http://datadryad.org/resource/doi:10.5061/dryad.48s16 | https://twitter.com/ShaunCoffey/status/865880767015956480 |
383 | 2017.06.07 | 2 | Trump’s pre-presidency flights. | Before Donald Trump began flying on Air Force One, he rode a fleet of private aircraft. Reporters at Bloomberg used the Freedom of Information Act to obtain flight records for three major components of that fleet — a “Boeing 757 with gold-plated seatbelt buckles, known as Trump Force One during the campaign; a Cessna 750 Citation X jet; and a Sikorsky helicopter”. For each of the more than 1,500 flights taken between August 2010 and November 2016, the dataset contains the date, time, and airport of both the departure and arrival. Trump wasn’t necessarily aboard each of those flights; the dataset does not contain passenger information. Related: Bloomberg’s analysis/maps of the data. Also related: The Washington Post used the data to estimate the flights’ CO2 emissions. | https://github.com/BloombergGraphics/2017-trump-flight-data https://www.bloomberg.com/news/features/2017-06-01/this-is-where-trump-traveled-before-becoming-president https://www.washingtonpost.com/news/politics/wp/2017/06/06/trumps-campaign-planes-alone-had-the-carbon-footprint-of-500-americans-for-a-year/ | |
384 | 2017.06.07 | 3 | Severe workplace injuries. | In January 2015, the Occupational Safety and Health Administration began requiring U.S. employers to report “all severe work-related injuries, defined as an amputation, in-patient hospitalization, or loss of an eye.” You can download a spreadsheet of these injuries — some 20,000 in 2015 and 2016 combined. It contains the injury dates, descriptions, and outcomes, as well as the employers’ names and locations. Previously: OSHA’s more detailed (but slightly more cumbersome) inspection data and API (DIP 2016.07.13). [Clarification, 2017-06-07/2017-06-14: The dataset reflects “federal OSHA states only.” It excludes “injuries in state plans,” which cover private sector employees in 21 states.] | https://www.osha.gov/severeinjury/index.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-13-edition | |
385 | 2017.06.07 | 4 | Annotated Reddit conversations. | Researchers at Google took a semi-random sample of 9,473 Reddit threads, containing 116,347 comments in total. Then, they paid people to categorize each comment by its “discourse act” — e.g., whether it was a question, answer, announcement, agreement, humor, et cetera. The result is Coarse Discourse, “a dataset for understanding online discussions.” [h/t Roberto Bayardo] | https://github.com/google-research-datasets/coarse-discourse | https://twitter.com/roberto_bayardo/status/864591636097110017 |
386 | 2017.06.07 | 5 | E. coli at Ocean Beach. | The San Francisco Public Utilities Commission’s Beach Water Quality Monitoring Program measures bacteria levels at fifteen locations on the city’s shoreline. You can download the measurements by clicking the “raw data” link below this map. The data powers the (unsurprisingly) unofficial @BeachPooBot account on Twitter. [h/t Reddit user cavedave] | http://sfwater.org/index.aspx?page=87 http://sfwater.org/cfapps/lims/beachmain1.cfm https://github.com/John-Brandon/Beach_Poo_Bot https://twitter.com/BeachPooBot | https://www.reddit.com/r/datasets/comments/5vk72n/san_francisco_shit_in_the_water_data/ |
387 | 2017.06.14 | 1 | Supreme Court transcripts. | Oyez.org bills itself as, among other things, “a complete and authoritative source for all of the [Supreme] Court’s audio since the installation of a recording system in October 1955.” The site has an API and releases all its material — including timestamped transcripts of oral arguments — under a Creative Commons license. At least two GitHub repositories have aggregated the transcripts and make them easy to bulk-download. For each segment of audio, the transcripts list the start/end time, the speaker, and the text. Related: PuppyJusticeAutomated, a YouTube channel that (a) must be seen to be understood and (b) uses the Oyez API. Previously: CourtListener (DIP 2016.04.13) and The Supreme Court Database (DIP 2016.02.24). [h/t Walker Boyle + Reddit user 21cannons] | https://www.oyez.org/ https://www.oyez.org/about https://api.oyez.org https://www.oyez.org/license https://github.com/walkerdb/supreme_court_transcripts/ https://github.com/free-law-coalition/oyez-scotus https://www.youtube.com/c/PuppyJusticeAutomated https://github.com/ALSchwalm/PuppyJusticeAutomated https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-04-13-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-02-23-edition | https://github.com/walkerdb https://www.reddit.com/r/datasets/comments/6epse3/all_existing_supreme_court_oral_argument/ |
388 | 2017.06.14 | 2 | Federal corporate prosecutions. | Last week, the University of Virginia School of Law launched an expanded version of its Corporate Prosecution Registry. The revamped database includes “detailed information about every federal organizational prosecution since 2001, as well as deferred and non-prosecution agreements with organizations since 1990” — more than 3,000 cases so far. Previously: Good Jobs First’s Violation Tracker (DIP 2015.11.11). [h/t Tom Jackman] | http://content.law.virginia.edu/news/201706/go-resource-researching-corporate-prosecution-just-got-more-powerful http://lib.law.virginia.edu/Garrett/corporate-prosecution-registry/index.html http://lib.law.virginia.edu/Garrett/corporate-prosecution-registry/about.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-11-edition | https://www.washingtonpost.com/news/true-crime/wp/2017/06/05/new-database-of-rarely-tracked-corporate-crime-prosecutions-launches-today/ |
389 | 2017.06.14 | 3 | Business owners. | The Census Bureau’s Survey of Business Owners and Self-Employed Persons “provides the only comprehensive, regularly collected source of information on selected economic and demographic characteristics for businesses and business owners by gender, ethnicity, race, and veteran status.” The most recent data comes from 2012. The survey has been conducted every five years since 1972, but data from before 1992 is “available only in printed form.” Related: “30% Of The Black-Owned Businesses In New York Disappeared In 5 Years,” by my colleague Cora Lewis. | https://www.census.gov/programs-surveys/sbo/about.html https://www.census.gov/programs-surveys/sbo/data.html https://www.buzzfeed.com/coralewis/in-5-years-new-york-lost-30-of-its-black-owned-businesses https://twitter.com/cora | |
390 | 2017.06.14 | 4 | Antibiotic resistance. | ResistoMap is an interactive visualization of antibiotic drug resistance, based on more than 1,500 bacteria genome samples from people’s intestinal tracts. The data behind the visualization is available to download. It’s partly based on two prior datasets: McMaster University’s Comprehensive Antibiotic Resistance Database (“a bioinformatic database of resistance genes, their products and associated phenotypes”) and the University of Gothenburg’s BacMet (“an easy-to-use bioinformatics resource of antibacterial biocide- and metal-resistance genes”). [h/t Carlos Somohano] | http://resistomap.rcpcm.org/ https://figshare.com/s/081a528b7ad55725a2ae https://card.mcmaster.ca/ http://bacmet.biomedicine.gu.se/index.html | https://www.getrevue.co/profile/datamachina/issues/data-machina-issue-119-61138 |
391 | 2017.06.14 | 5 | L.A. pot dispensaries. | The Los Angeles City Controller has released a map of the city’s openly-operating medical marijuana businesses. You can access a spreadsheet of the 191 dispensaries that comply with Proposition D, which the city passed in 2013. Additionally, you can find hundreds of (active and inactive) dispensaries by filtering the city’s business registrations to those whose primary NAICS category is listed as “medical marijuana collective.” [h/t Zack Quaintance] | http://www.lacontroller.org/mjrelease http://lacontroller.maps.arcgis.com/apps/Cascade/index.html?appid=8737a0a6f867495d93b6ba484eaf8cbc https://controllerdata.lacity.org/Revenue/191-Prop-D-Compliant-Medical-Marijuana-Businesses/vva2-przx/ https://ballotpedia.org/City_of_Los_Angeles_Medical_Marijuana_Dispensaries,_Measures_D,_E_and_F_(May_2013) | http://www.govtech.com/civic/Whats-New-in-Civic-Tech-App-Foster-Support-for-Kentucky-Fiber-Initiative.html |
392 | 2017.06.21 | 1 | 130 million traffic stops. | “Police pull over more than 50,000 drivers on a typical day, more than 20 million motorists every year. Yet the most common police interaction — the traffic stop — has not been tracked, at least not in any systematic way,” according to the Stanford Open Policing Project. To that end, the group has been collecting and standardizing traffic-stop data from state police agencies across America. Its first data release, published Monday, contains 130 million records from 31 states. The records vary by agency, but the most-complete states include the date, time, location, reason, and outcome of each stop; the driver’s race, gender, and age; whether a search was conducted; and whether the search found contraband. Related: The project’s findings so far. Previously: Raw traffic stop data from a smaller number of states (DIP 2015.10.28). | https://openpolicing.stanford.edu/ https://openpolicing.stanford.edu/data/ https://openpolicing.stanford.edu/findings/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-10-28-edition | |
393 | 2017.06.21 | 2 | Famine warnings. | “Created by USAID in 1985 to help decision-makers plan for humanitarian crises,” the Famine Early Warning Systems Network (FEWS NET) “provides evidence-based analysis on some 34 countries.” As part of its work, FEWS NET publishes geospatial shapefiles that score each country’s “most likely food security outcome” on a standardized scale: Minimal, Stressed, Crisis, Emergency, and Famine. Previously: Global food prices (DIP 2017.05.17). [h/t Melissa Segura] | https://www.fews.net/ https://www.fews.net/shapefiles http://www.fews.net/IPC https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-17-edition | https://twitter.com/melissadsegura |
394 | 2017.06.21 | 3 | Two million open-source projects, and their dependencies. | Libraries.io monitors “over 2.4m unique open source projects, 25m repositories and 85m interdependencies between them.” Last week, the site released its first bulk dataset, which describes each project’s metadata, published versions, and dependencies on other software libraries. [h/t Nadia Eghbal] | https://libraries.io https://medium.com/@BenJam/libraries-io-releases-data-on-over-25m-software-repositories-ab1db665826e https://libraries.io/data | https://twitter.com/nayafia/status/876795645847248896 |
395 | 2017.06.21 | 4 | Political party manifestoes. | The Manifesto Project has collected and coded more than 4,000 electoral manifestoes from more than 1,000 political parties in more than 50 countries between 1945 and 2015. For each manifesto, the project’s dataset indicates whether the document expresses support for/against dozens of policies and attitudes, including “market regulation,” a “national way of life”, “environmental protection,” and “anti-imperialism.” You can also browse the manifestoes online. Caveat: The dataset is subject to a somewhat restrictive usage policy. [h/t The Quartz Directory of Essential Data] | https://manifesto-project.wzb.eu/ https://manifesto-project.wzb.eu/datasets https://visuals.manifesto-project.wzb.eu/mpdb-shiny/cmp_dashboard_corpus/ https://manifesto-project.wzb.eu/information/terms_of_use | https://docs.google.com/spreadsheets/d/1hU7Snj4KZ-ppyy388l-sV4I26n4yGVb8xYnygPOS-5k/edit#gid=1436509184 |
396 | 2017.06.21 | 5 | Prisoners’ tattoos. | The Florida Department of Corrections’ public database contains a table describing current and released inmates’ tattoos. That data includes each tattoo’s location (e.g., “right arm,” “stomach,” “face”) and description (“cross,” “tribal,” and “skull” being the most common). Helpful: Dan Nguyen’s guide to converting the database into SQLite and CSV files. Related: Recent analyses by The Economist and by The Palm Beach Post. | http://www.dc.state.fl.us/pub/obis_request.html https://gist.github.com/dannguyen/c6bcc9884c25cf68f3550560ccae5ca8 http://www.economist.com/news/christmas-specials/21712032-what-can-be-learned-prisoners-tattoos-statistical-analysis-art http://www.mypalmbeachpost.com/news/body-art-000-florida-prison-inmates-runs-from-freaky-kinky/HXpaJsmobtJCGPP0vI9WlI/ | |
397 | 2017.06.28 | 1 | Infectious diseases in Europe. | The European Centre for Disease Prevention and Control’s Surveillance Atlas of Infectious Diseases lets you browse, map, and download data on the historical incidence of several dozen diseases — from anthrax to Zika — in each of the European Economic Area’s countries. Related: Keila Guimarães’s recent investigation into penicillin shortages, which uses the Centre’s data on syphilis cases. | http://atlas.ecdc.europa.eu/public/index.aspx?Instance=GeneralAtlas https://qz.com/984705/syphilis-is-on-the-rise-because-penicillin-isnt-profitable/ | |
398 | 2017.06.28 | 2 | People’s genes. | OpenSNP is a website that lets people publish the results of their genetic tests (such as those sold by 23andMe, deCODEme, and FamilyTreeDNA), “find others with similar genetic variations, [get] the latest primary literature on their variations, and help scientists find new associations.” Since 2012, users have uploaded more than 3,000 sets of genetic variants, which you can download individually or in bulk or access via OpenSNP’s API. Users can also list various personal traits, such as eye color, height, coffee consumption, and lactose intolerance. Useful primer: SNP stands for “single nucleotide polymorphism,” the NIH explains. They’re “the most common type of genetic variation”; each one “represents a difference in a single DNA building block, called a nucleotide.” | https://opensnp.org/ https://opensnp.org/statistics https://opensnp.org/genotypes https://github.com/openSNP/snpr/wiki/JSON-API https://opensnp.org/phenotypes https://ghr.nlm.nih.gov/primer/genomicresearch/snp | |
399 | 2017.06.28 | 3 | Real estate inventories. | The National Association of Realtors publishes monthly real estate inventory data “at the national level, the 500 largest metropolitan areas, the 1,000 largest counties, and over 15,000 zip codes.” The data, based on the realtors’ multiple listing services, goes back five years and “tracks key market metrics including list prices, days on market, and total active inventory.” As of early June, six counties — Manhattan, plus five in California — had median listing prices above $1 million. Previously: The Census Bureau’s Annual Characteristics of New Housing (DIP 2016.06.22), international house prices (DIP 2017.02.08), millions of mortgages (DIP 2015.12.30), and millions more mortgages (DIP 2017.03.15). [h/t Reddit user bbekks] | http://research.realtor.com/data/inventory-trends/ http://www.realtor.com/advice/buy/what-is-the-mls-multiple-listing-service/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-22-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-02-08-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-30-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-15-edition | https://www.reddit.com/r/datasets/comments/6flgrf/request_im_using_the_full_historical_inventory/ |
400 | 2017.06.28 | 4 | Every federally tax-exempt nonprofit. | The Internal Revenue Service publishes a file listing all “organizations eligible to receive tax-deductible charitable contributions” — currently more than 1 million charities, private foundations, and other groups. (Not all nonprofits apply for, or receive, tax-exempt status from the IRS; but all tax-exempt organizations are nonprofits.) Previously: Annual IRS 990 filings, in bulk (DIP 2016.06.22). [h/t Norbert Krupa + Derek Willis] | https://apps.irs.gov/app/eos/forwardToDeductStatusSearchHelp.do https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-22-edition | https://opendata.stackexchange.com/questions/16/is-there-a-complete-list-of-all-us-tax-exempt-nonprofits-in-machine-readable-for |
401 | 2017.06.28 | 5 | 140-character politics. | The recently-launched Tweets Of Congress is collecting and publishing daily archives of tweets by congressional representatives, caucuses, and committees. Meanwhile, the Trump Twitter Archive has collected more than 30,000 of @realDonaldTrump’s tweets, which you can search and download. | https://alexlitel.github.io/congresstweets/ http://www.trumptwitterarchive.com/about http://www.trumptwitterarchive.com/archive https://github.com/bpb27/trump_tweet_data_archive | |
402 | 2017.07.19 | 1 | Solar eclipses. | You’ve probably seen The Washington Post’s solar eclipse graphics from last Monday. The stellar maps are largely based on an online tool that uses data from NASA's Five Millennium Canon of Solar Eclipses. The tool can (among other things) generate maps and KMZ files describing the paths of the 11,898 solar eclipses Earth will have experienced between 2000 BCE and 3000 CE. Helpful: NASA’s key to understanding the data terminology. | https://www.washingtonpost.com/graphics/national/eclipse/ http://xjubier.free.fr/en/site_pages/solar_eclipses/5MCSE/xSE_Five_Millennium_Canon.html https://eclipse.gsfc.nasa.gov/SEpubs/5MCSE.html https://eclipse.gsfc.nasa.gov/SEcat5/SEcatkey.html | |
403 | 2017.07.19 | 2 | Dublin in detail. | Last week, a team at NYU announced “the world’s densest urban aerial laser scanning (LiDAR) dataset” — a 1.4-billion-point description of Dublin’s city center. They write: “At over 300 points per square meter, this is more than 30 times denser than typical LiDAR data and is an order of magnitude denser than any other aerial LiDAR dataset.” The researchers collected the topographical data during a series of criss-crossing flyovers on March 26, 2015. They’ve also published a short, illustrative video. Previously: LiDAR datasets (DIP 2016.05.25) and 3D models (DIP 2017.04.05) of cities and countries around the world. [h/t Darrell Etherington] | http://cusp.nyu.edu/press-release/nyu-center-urban-science-progress-professor-releases-worlds-densest-urban-aerial-laser-scanning-dataset/ https://geo.nyu.edu/catalog/nyu_2451_38684 https://www.youtube.com/watch?v=qEi2Wo7Bcuk https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-05-25-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-04-05-edition | https://techcrunch.com/2017/07/12/nyu-releases-the-largest-lidar-dataset-ever-to-help-urban-development/ |
404 | 2017.07.19 | 3 | UN General Debate speeches. | Each September, the United Nations gathers for its annual General Assembly. Among the activities: the General Debate, a series of speeches delivered by the UN’s nearly 200 member states. The statements provide “an invaluable and, largely untapped, source of information on governments’ policy preferences across a wide range of issues over time,” write a trio of researchers who, earlier this year, published the UN General Debate Corpus — a dataset containing the transcripts of 7,701 speeches from 1970 to 2016. The researchers have also published an online tool for exploring and visualizing the dataset. Previously: UN General Assembly votes since 1946 (DIP 2016.07.13). [h/t Ronny Patz] | https://gadebate.un.org/en http://www.smikhaylov.net/wp-content/uploads/2017/04/UNGDC_RAP_Final.pdf https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0TJX8Y http://ungd.smikhaylov.net/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-13-edition | https://twitter.com/ronpatz/status/883052266567208960 |
405 | 2017.07.19 | 4 | Global economic forecasts. | The International Monetary Fund’s World Economic Outlook Database contains the fund’s projections for future “national accounts, inflation, unemployment rates, balance of payments, fiscal indicators, trade for countries and country groups” and commodity prices. (They predict that farm-bred Norwegian salmon will cost $6.79/kg in 2022.) The database also contains historical observations for many of the economic indicators back to 1980. [h/t David Mihalyi] | https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx | https://twitter.com/davidmihalyi/status/869562784186544128 |
406 | 2017.07.19 | 5 | Old Faithful et al. | The National Park Service and Geyser Observation and Study Association have been using water-temperature sensors to track the eruption times of dozens of geysers in Yellowstone — Old Faithful, of course, but also Beehive, Little Squirt, and Narcissus. GeyserTimes.org combines this data with historical logbooks and observations from “geyser gazers” to form what it describes as “the most comprehensive database of geyser eruption and observation data on the internet.” | http://www.gosa.org/about.aspx http://geyserstudy.org/electronicsummary.aspx http://www.geyserstudy.org/geyser.aspx?pGeyserNo=OLDFAITHFUL http://www.geyserstudy.org/geyser.aspx?pGeyserNo=BEEHIVE http://www.gosa.org/geyser.aspx?pGeyserNo=LITTLESQUIRT http://geyserstudy.org/geyser.aspx?pGeyserNo=NARCISSUS http://geysertimes.org/data.php http://www.geyserstudy.org/ofvclogs.aspx http://geysertimes.org/about.php | |
407 | 2017.07.26 | 1 | Interned Japanese Americans. | The Densho Digital Repository is an archive of oral histories, photographs, newspaper clippings, and other primary sources relating to the internment of Japanese Americans during World War II. Among the materials: several datasets listing people sent to the internment camps, based on official government records. The largest dataset contains more than 100,000 entries and includes details such as each internee’s “relocation” site, arrival date, hometown, birth year, time spent in Japan, marital status, religion, educational degrees, occupation, and military service. The National Archives hosts the raw data, as well as its documentation. | https://ddr.densho.org/ https://ddr.densho.org/names/ https://catalog.archives.gov/id/1264228 | |
408 | 2017.07.26 | 2 | Trump’s visits to Trump properties. | NBC News has been tracking the president’s visits to his own luxury properties. For each day since Trump took office, the data — available to download at the bottom of the page — tells you which properties he visited and whether any were golf courses. Since February, Trump has visited his properties roughly 10 days a month, including 25 trips to Mar-a-Lago and 42 trips to his golf courses. Related: A similar tracker from The New York Times. [h/t Rachel Schallom] | http://www.nbcnews.com/politics/donald-trump/how-much-time-trump-spending-trump-properties-n753366 https://www.nytimes.com/interactive/2017/04/05/us/politics/tracking-trumps-visits-to-his-branded-properties.html | http://tinyletter.com/best-of-interactives/letters/best-in-visual-storytelling-32 |
409 | 2017.07.26 | 3 | Women running for the U.S. House. | As the basis for his recent study, “Is Running Enough? Reconsidering the Conventional Wisdom about Women Candidates” (paywalled, but a draft is freely available), PhD candidate Peter Bucchianeri compiled a dataset of female candidates in House primary elections from 1972 to 2010. The spreadsheet covers 1,242 candidacies, and includes each candidate’s party, votes garnered in the primary and general elections, the seat’s incumbency status, the district’s demographics, and more. | https://link.springer.com/article/10.1007/s11109-017-9407-7 https://www.dropbox.com/s/l9p6mt9a4vgipxf/Bucchianeri%20-%20Is%20Running%20Enough.pdf?dl=0 http://www.peterbucchianeri.com/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CFPBRI | |
410 | 2017.07.26 | 4 | The Enron emails. | During the course of its Enron investigation, the Federal Energy Regulatory Commission obtained the emails of approximately 150 (mostly high-ranking) Enron staff. You can find versions of the dataset — cleaned, deduplicated, and restructured in various ways — hosted by Carnegie Mellon, UC Berkeley, and Duke Law. Related: “What the Enron Emails Say About Us,” published by The New Yorker last week. Nathan Heller writes: The Enron archive “remains one of the country’s largest private e-mail corpora turned public. Its lasting value is less as an account of Enron’s daywork than as a social and linguistic data pool, a record of the way we write online when we’re not preening for the public eye.” | https://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp https://www.cs.cmu.edu/~./enron/ http://bailando.sims.berkeley.edu/enron_email.html http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set/ http://www.newyorker.com/magazine/2017/07/24/what-the-enron-e-mails-say-about-us | |
411 | 2017.07.26 | 5 | Data podcast data. | Data Stories is a podcast about data visualization, hosted by Enrico Bertini and Moritz Stefaner. To celebrate their recently-published 100th episode, the hosts released a spreadsheet detailing the date, title, number and genders of guests, length, and timestamped subchapters of each episode so far. Related: Christian Laesser’s visualization of the data. [h/t Benjamin Cooley] | http://datastori.es/ http://datastori.es/100-data-stories-100/ http://projects.datavis.club/100-data-stories/ | https://medium.com/towards-data-science/data-curious-17-07-2017-a-roundup-of-data-stories-datasets-and-visualizations-from-last-week-2a6766ac54d6 |
412 | 2017.08.02 | 1 | Data from the search for MH370. | After Malaysia Airlines flight MH370 disappeared in March 2014, the Australian government undertook an enormous seafloor-mapping operation in search of the lost Boeing 777. Last month, it released data from the first phase of the project, which collected 278,000 square kilometers of bathymetry (i.e., seafloor topography) measurements. “In general, the world's deep oceans have had little investigation,” the government explains in an interactive map. “Only 10 to 15 percent of the ocean has been mapped with the sonar technology similar to that used in the search for MH370.” As a result, the MH370 search area “is now among the most thoroughly mapped regions of the deep ocean on the planet.” [h/t Soh Kam Yung] | http://marine.projects.ga.gov.au/mh370-phase-one-data-release.html https://geoscience-au.maps.arcgis.com/apps/Cascade/index.html?appid=038a72439bfa4d28b3dde81cc6ff3214 | https://news.ycombinator.com/item?id=14802105 |
413 | 2017.08.02 | 2 | European Union lobbying. | The EU publishes a searchable database of people and organizations registered to lobby the European Parliament and the European Commission. The website LobbyFacts.eu takes that data and makes it available via an API. LobbyFacts also scrapes the European Commission’s disclosed lobbying meetings, which you can download here (warning: 10-megabyte direct download). Related: You can also explore the lobbyists and meetings via IntegrityWatch.eu, which uses LobbyFacts’ data. Previously: U.S. government lobbyists (DIP 2017.05.31). [h/t Enigma Public + Xavier Dutoit] | http://ec.europa.eu/transparencyregister/public/homePage.do?redir=false&locale=en https://lobbyfacts.eu/ http://api.lobbyfacts.eu/docs/api https://lobbyfacts.eu/transparency_meetings http://www.integritywatch.eu/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-31-edition | http://mailchi.mp/2ad52d3e7df1/between-two-rows-july-2017?e=64b85561c3 https://github.com/tttp/doi/blob/master/data/fetch.sh |
414 | 2017.08.02 | 3 | SAT, ACT, and AP scores. | The California Department of Education publishes aggregate scores on these high-school tests for each county, district, and school going back to the late 1990s. One hitch: For more than two months, the 2016 AP data “contained 350,000 more tests than had actually been taken,” according to inewsource.org’s Megan Wood, who spotted the discrepancies (and others) and got the department to fix them. Similar datasets are available from other states, including Texas, Florida, and Pennsylvania. Bonus: inewsource.org has also published easy-to-search tables of the California AP, SAT, and ACT scores. | http://www.cde.ca.gov/ds/sp/ai/ http://inewsource.org/2017/06/13/state-admits-posting-faulty-data/ http://tea.texas.gov/acctres/sat_act_index.html http://www.fldoe.org/accountability/accountability-reporting/act-sat-ap-data/index.stml http://www.education.pa.gov/K-12/Assessment%20and%20Accountability/Pages/SAT-and-ACT.aspx#tab-1 http://data.inewsource.org/interactives/california-ap-scores-2011-2016/ http://data.inewsource.org/interactives/california-sat-scores-2011-2016/ http://data.inewsource.org/interactives/california-act-scores-2011-2016/ | |
415 | 2017.08.02 | 4 | Individual library checkouts. | The Seattle Public Library publishes a dataset of every checkout of every physical item (e.g., paperback books and DVDs, but not e-books) since April 2005. It currently contains more than 90 million rows. Previously: The library’s monthly checkout counts, by title (DIP 2017.03.01). [h/t David Christensen] | https://data.seattle.gov/dataset/Checkouts-by-Title-Physical-Items-/3h5r-qv5w https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-01-edition | https://www.linkedin.com/in/davidrchristensen/ |
416 | 2017.08.02 | 5 | Harmony lovers. | The New York Philharmonic has published three spreadsheets listing its subscribers — including where they sat, how much they paid, and where they had their tickets sent — for a slew of orchestral seasons between 1883 and the late 1990s. The earliest data includes names, too. (“Miss A. Brown” of 715 Fifth Avenue seems to have been a big fan, having subscribed to 26 seats for the 1890-91 season.) Previously: The Philharmonic’s performance history (DIP 2016.10.12). [h/t Rachel Shorey] | http://archives.nyphil.org/index.php/open-data https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-10-12-edition | https://twitter.com/rachel_shorey/status/889944977282928641 |
417 | 2017.08.09 | 1 | Nutrition facts. | The USDA National Nutrient Database for Standard Reference is the primary source for most of the food nutrition facts you see in America. The database assesses more than 8,000 foods, from abiyuch to zwieback, and provides the average nutrient levels per 100 grams — e.g., protein, carbohydrates, vitamin D, caffeine, lycopene, and water. North of the border, you can find the (bilingual) Canadian Nutrient File. It’s based on the USDA data, but excludes stateside foods “known not to be on the Canadian market”, adds some foods (such as poutine and ptarmigan), and makes adjustments based on “Canadian levels of fortification and regulatory standards.” The United Kingdom has its own nutrient file, as do many other countries. [h/t Reddit user Alacritous] | https://www.ars.usda.gov/northeast-area/beltsville-md/beltsville-human-nutrition-research-center/nutrient-data-laboratory/docs/usda-national-nutrient-database-for-standard-reference/ https://ndb.nal.usda.gov/ndb/foods/show/2428 https://ndb.nal.usda.gov/ndb/foods/show/469 https://www.canada.ca/en/health-canada/services/food-nutrition/healthy-eating/nutrient-data/canadian-nutrient-file-2015-download-files.html https://www.canada.ca/en/health-canada/services/food-nutrition/healthy-eating/nutrient-data/canadian-nutrient-file-compilation-canadian-food-composition-data-users-guide-2010.html https://food-nutrition.canada.ca/cnf-fce/serving-portion.do?id=6772 https://food-nutrition.canada.ca/cnf-fce/serving-portion.do?id=5933 https://www.gov.uk/government/publications/composition-of-foods-integrated-dataset-cofid http://www.fao.org/infoods/infoods/tables-and-databases/en/ | https://www.reddit.com/r/datasets/comments/6g56sl/canadian_nutrient_file_food_database_containing/ |
418 | 2017.08.09 | 2 | Interstate commodity flows. | The federally funded Freight Analysis Framework “integrates data from a variety of sources to create a comprehensive picture of freight movement among states and major metropolitan areas by all modes of transportation.” For each year between 2012 and 2015, the database “provides estimates for tonnage (in thousand tons) and value (in million dollars) by regions of origin and destination, commodity type, and mode.” Last week, Axios published an interactive map of the state-to-state flows for each commodity group, as well as some helpful caveats and “head-scratchers.” [h/t Chris Canipe] | http://faf.ornl.gov/fafweb/ https://www.axios.com/the-flow-of-goods-2463665414.html | https://twitter.com/ccanipe/status/892724588576206848 |
419 | 2017.08.09 | 3 | Pidgin and creole languages. | The Atlas of Pidgin and Creole Language Structures contains data on 76 languages, such as Trinidad English Creole, Afrikaans, Guadeloupean Creole, and Singapore Bazaar Malay. For each language, the dataset includes information about 130 “structural features,” example sentences, and more. Previously: The World Atlas of Language Structures (DIP 2016.01.06) and a database of the Trans-New Guinea language family (DIP 2015.11.04). [h/t Rachael Tatman] | http://apics-online.info/ http://apics-online.info/download http://apics-online.info/parameters http://apics-online.info/sentences https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-01-06-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-11-04-edition | https://www.kaggle.com/rtatman/atlas-of-pidgin-and-creole-language-structures |
420 | 2017.08.09 | 4 | A century of UK coastal flooding. | Earlier this year, the researchers behind SurgeWatch.org published an updated version of their database of UK coastal floods. They combined tidal gauge data with reports from scientific journals, newspapers, and social media to identify 329 “coastal flooding events” that occurred between 1915 and 2016. For each event, the dataset includes the date, region, and severity level, which ranges from 1 (“nuisance”) to 6 (“disaster,” applied to only one event — the North Sea flood of 1953). | https://www.surgewatch.org/ https://www.bodc.ac.uk/data/published_data_library/catalogue/10.5285/481720c2-35bd-6c10-e053-6c86abc06bb3/ https://www.nature.com/articles/sdata2017100 https://en.wikipedia.org/wiki/North_Sea_flood_of_1953 | |
421 | 2017.08.09 | 5 | Talent agencies. | California’s Department of Industrial Relations publishes a dataset of all licensed talent agencies, with each agency’s name, address, license number, workers’ comp insurer, and bond issuer. Florida publishes something similar. Previously: Texas’s licensed professionals (DIP 2015.12.09). | https://www.dir.ca.gov/databases/dlselr/talag.html http://www.myfloridalicense.com/dbpr/sto/file_download/public-records-talent.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-09-edition | |
422 | 2017.08.16 | 1 | Historic newspapers. | Chronicling America — a project run by the Library of Congress and the National Endowment for the Humanities — provides information about more than 150,000 historic newspapers and access to digitized pages from many of them. Its API lets you search the database and doesn’t require registration; its bulk data includes text from more than 12 million pages. For instance, here’s the Omaha Daily Bee’s front page on April 7, 1917, the day after the U.S. entered World War I. [h/t Ed Summers] | http://chroniclingamerica.loc.gov/about/ http://chroniclingamerica.loc.gov/search/titles/results/ http://chroniclingamerica.loc.gov/about/api/ http://chroniclingamerica.loc.gov/ocr/ http://chroniclingamerica.loc.gov/lccn/sn99021999/1917-04-07/ed-1/seq-1/ | https://github.com/toddmotto/public-apis/commit/f8c7ea6e91e08b5a3f60a8c8a09f7f57c591a01f |
423 | 2017.08.16 | 2 | Terrorism prosecutions. | “The U.S. government has prosecuted 808 people for terrorism since the 9/11 attacks. Most of them never even got close to committing an act of violence.” Those are the findings of The Intercept’s Trial and Terror database, first published in April and most recently updated last week. The underlying data — available on GitHub — contains each defendant’s name and demographic details, as well as each case’s description, status, charges, charge date, conviction date (if convicted), jurisdiction, and more. | https://trial-and-terror.theintercept.com/ https://github.com/firstlookmedia/trial-and-terror-data | |
424 | 2017.08.16 | 3 | Brain scans. | The Open Access Series of Imaging Studies (OASIS) project is “aimed at making MRI data sets of the brain freely available to the scientific community,” with the goal of “[facilitating] future discoveries in basic and clinical neuroscience.” So far, the project has published two collections: a cross-sectional dataset of scans from 416 people, ages 18 to 96; and a longitudinal dataset, based on 150 people aged 60 to 96, each of whom was scanned at least two different times. [h/t Andrew Beam] | http://www.oasis-brains.org/ | https://github.com/beamandrew/medical-data |
425 | 2017.08.16 | 4 | The U.S. petroleum supply and exports. | The Energy Information Administration’s Petroleum Supply Monthly contains detailed data about how the United States obtains crude oil and petroleum products, and where that supply goes. In May, for instance, the U.S. refined nearly 314 million barrels of “finished motor gasoline” and exported 18.6 million barrels of it. | https://www.eia.gov/petroleum/supply/monthly/ | |
426 | 2017.08.16 | 5 | Prime psychology. | Robin Sloan, author of Mr. Penumbra’s 24‑Hour Bookstore, has a new book coming out next month — one that he believes “is the first novel in English to feature, as a main supporting character, a possibly-sentient sourdough starter.” To dole out advance copies of the book, Sloan conducted the following contest: Try to choose the smallest prime number that nobody else will pick. Now he’s posted the results — a CSV listing the number of contestants who chose each prime number. (Seventeen was the most popular number among the contest’s 1,354 entries; the smallest unique prime was 409.) | https://www.robinsloan.com/books/penumbra/ https://www.robinsloan.com/books/sourdough/ https://github.com/robinsloan/penumbra-primes | |
427 | 2017.08.23 | 1 | A century of malarial mosquitoes. | A team of researchers has compiled “the largest ever geo-coded database of anophelines in Africa.” (Anophelines are the only kind of mosquito that transmits malaria.) The database covers 1898 to 2016 and includes more than 13,400 observations of mosquitoes in specific locations. For each observation, the dataset lists the country, administrative region(s), and latitude/longitude, as well as the time period, the species identified, the sampling method, and the source of the information. [h/t Michael Chew] | https://wellcomeopenresearch.org/articles/2-57/ https://www.cdc.gov/malaria/about/biology/mosquitoes/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NQ6CUN | https://twitter.com/MichaelWKChew/status/898116844858552320 |
428 | 2017.08.23 | 2 | Who really controls UK companies? | Last year, the British government began requiring companies to identify all the people who exert power over them. The resulting “People with Significant Control” database contains each person’s name, country of residence, nationality, and “nature of control” — e.g., ownership of large numbers of shares, voting rights, or the ability to appoint/remove directors. [h/t Enigma Public] | https://www.gov.uk/government/news/people-with-significant-control-companies-house-register-goes-live http://download.companieshouse.gov.uk/en_pscdata.html | http://mailchi.mp/6dc38c39abc3/between-two-rows-august-2017 |
429 | 2017.08.23 | 3 | Carbon-conscious energy policies. | The Database of State Incentives for Renewables & Efficiency “is the most comprehensive source of information on incentives and policies that support renewables and energy efficiency in the United States.” The database, which was founded in 1995 and is funded by the Department of Energy, includes tax rebates, solar energy buybacks, building standards, and more. You can download the data in several formats, or browse and search it online. [h/t Carol Brotman White] | http://www.dsireusa.org/ http://www.dsireusa.org/resources/data-and-tools/ http://programs.dsireusa.org/system/program | https://www.eia.gov/todayinenergy/detail.php?id=32332 |
430 | 2017.08.23 | 4 | NEH grants and grant-evaluators. | The congressionally-established National Endowment for the Humanities publishes a dataset of all of the grants it has awarded since the late 1960s. On the same page, you can download a file describing the organization’s 25,000+ “evaluators” — “knowledgeable persons outside NEH who are asked for their judgments about the quality and significance” of proposed projects. [h/t Brett Bobley + Max Kemman] | https://securegrants.neh.gov/open/data/ | https://twitter.com/brettbobley/status/895994169403080705 https://twitter.com/MaxKemman/status/895995120444596225 |
431 | 2017.08.23 | 5 | The World Color Survey. | In the 1970s, a team of linguistic investigators canvassed the globe, armed with boxes of color chips. They sought out a couple dozen native speakers of 110 unwritten languages, and asked: What do you call these colors? The results are available online. Related: This Vox video provides context. | http://www1.icsi.berkeley.edu/wcs/ http://www1.icsi.berkeley.edu/wcs/data.html https://www.vox.com/videos/2017/5/16/15646500/color-pattern-language | |
432 | 2017.08.30 | 1 | Flood maps. | FEMA’s Flood Map Service Center publishes geospatial files that detail the agency’s flood risk assessments — both current and historical. The maps include flood zones, levee locations, “base flood elevations,” and more. Helpful: FEMA’s technical documentation. Related: “Why Houston Isn’t Ready for Harvey,” published last week by ProPublica and The Texas Tribune; and “Hell and High Water,” the reporting team’s deep dive on Houston last year. Previously: The most comprehensive global dataset of cyclone paths (DIP 2017.04.19). | https://msc.fema.gov/portal/advanceSearch https://www.fema.gov/media-library-data/886edbc98e2229a90d0593d5e46ddac9/Flood+Insurance+Rate+Map+Database+Technical+Reference.pdf https://projects.propublica.org/graphics/harvey https://projects.propublica.org/houston/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-04-19-edition | |
433 | 2017.08.30 | 2 | Redlining. | The Mapping Inequality project has digitized more than 150 of the “security maps” produced by the Home Owners' Loan Corporation between 1935 and 1940. Together, the maps “offer a view of Depression-era America as developers, realtors, tax assessors, and surveyors saw it — a set of interlocking color-lines, racial groups, and environmental risks.” To download the data for a given map, click on the cloud icon in the top-right corner. Related: A new research paper, by economists at the Federal Reserve Bank of Chicago, uses the data to quantify redlining’s lasting effects. Also related: The New York Times’ summary of the data and research. [h/t Kendall Taggart] | https://dsl.richmond.edu/panorama/redlining/ https://www.chicagofed.org/publications/working-papers/2017/wp2017-12 https://www.nytimes.com/2017/08/24/upshot/how-redlinings-racist-effects-lasted-for-decades.html | https://twitter.com/KendallTTaggart |
434 | 2017.08.30 | 3 | Home price indices. | The Federal Reserve Bank of St. Louis publishes S&P/Case-Shiller Home Price Index data, which measures changes in average home prices over time. The monthly-updated datasets — copyrighted, but free to download — are available at a national and metro-area level, and go back several decades. | https://fred.stlouisfed.org/release?rid=199 | |
435 | 2017.08.30 | 4 | Website logos. | Favicons are the little square icons in your browser’s tabs, placed there by the websites you’ve loaded. Two recent projects attempted to collect these markers from the web’s million most-trafficked domains. One, by programmer Colin Morris, collected 360,000 favicons in July 2016. The second, by researchers at ETH Zurich, collected 548,000 favicons in April 2017. Semi-related: Morris’s “Finding bad flamingo drawings with recurrent neural networks”; the analysis uses Google’s 50-million-doodles data, featured in DIP 2017.05.24. | https://archive.org/details/favicons_201708 https://data.vision.ee.ethz.ch/cvl/lld/ https://colinmorris.github.io/blog/bad_flamingos https://quickdraw.withgoogle.com/data https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-24-edition | |
436 | 2017.08.30 | 5 | Game of Thrones characters, judged. | Earlier this month, The New York Times asked readers to rate 50 of the show’s most recognizable characters along two dimensions: good ↔ evil, and ugly ↔ beautiful. They’ve received 190,000+ submissions. The results are accessible as two JSON files: one for the averages and another for the distributions. | https://www.nytimes.com/interactive/2017/08/09/upshot/game-of-thrones-chart.html https://int.nyt.com/newsgraphics/2017/2017-07-17-got-matrix/mean.json https://int.nyt.com/newsgraphics/2017/2017-07-17-got-matrix/contours.json | |
437 | 2017.09.13 | 1 | Global flooding. | The Dartmouth Flood Observatory’s Global Archive of Large Flood Events contains data about 4,500+ floods, dating back to 1985. It’s updated often, and is available in Excel, XML, HTML, and geospatial formats. The variables include each flood’s location, timespan, severity, main cause, and estimated impact. The organization also publishes detailed maps of the “maximum observed flooding” for specific disasters, such as for Hurricane Harvey and for Hurricane Irma. Related: A Science Magazine mini-profile of the DFO and its founder. Previously: U.S. tide gauges and flood observations (DIP 2016.03.23), UK coastal flooding (DIP 2017.08.09), and FEMA flood risk maps (DIP 2017.08.30). | http://floodobservatory.colorado.edu/index.html http://floodobservatory.colorado.edu/Archives/index.html http://floodobservatory.colorado.edu/Archives/ArchiveNotes.html http://floodobservatory.colorado.edu/Events/2017USA4510/2017USA4510.html http://floodobservatory.colorado.edu/Events/2017USA4516/2017USA4516.html http://www.sciencemag.org/news/2017/08/colorado-global-flood-observatory-keeps-close-watch-harvey-s-torrents https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-03-23-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-08-09-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-08-30-edition | |
438 | 2017.09.13 | 2 | House price indices, part two. | Two weeks ago, DIP featured Case-Shiller’s home price index data. There are, in fact, several other prominent (and downloadable) house price indices, including the Federal Housing Finance Agency’s House Price Index, the National Association of Realtors’ indices, and Zillow’s Home Value Index. Helpful: This guide to various home price indices and how they’re constructed, by Jed Kolko, formerly Trulia’s chief economist. Related: This critique of Case-Shiller’s approach, also by Kolko. | http://tinyletter.com/data-is-plural/letters/data-is-plural-2017-08-30-edition https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index.aspx https://www.nar.realtor/research-and-statistics/housing-statistics https://www.zillow.com/research/data/ http://www.calculatedriskblog.com/2012/05/kolko-dissecting-house-price-indices.html http://jedkolko.com/ http://www.calculatedriskblog.com/2014/08/kolko-lets-improve-not-ignore-seasonal.html | |
439 | 2017.09.13 | 3 | Trump, McConnell, Schumer, Ryan, and Pelosi on TV. | The Internet Archive has pumped footage from CNN, Fox News, MSNBC, and the BBC through software trained to recognize the faces of Donald Trump and majority/minority leaders of the U.S. House and Senate. The result: Face-O-Matic, a dataset released to the public last week. For each face the software found, the dataset includes the network, program, date, time, duration, and a link to the footage on the TV News Archive. Since mid-July, Face-O-Matic has logged more than 50,000 sightings. [h/t Nancy Watzman] | https://archive.org/details/faceomatic http://blog.archive.org/2017/09/06/face-o-matic-data-show-trump-dominates/ https://archive.org/details/tv | https://twitter.com/nwatzman |
440 | 2017.09.13 | 4 | SEC server logs. | When companies file reports to the U.S. Securities and Exchange Commission, they do so through the SEC’s EDGAR system. The SEC makes those filings available online, and it uses EDGAR’s server logs to analyze web traffic to the site. The SEC’s EDGAR Log File Data Set contains a set of CSVs — one for each day between February 14, 2003 and December 31, 2016 — extracted from those server logs. For each document visited, the data includes the visitor’s unique-but-obfuscated IP address, the date and time of the visit, the IDs of the document and associated company, and some information about the visitor’s browser. [h/t Brian C. Keegan] | https://www.sec.gov/edgar.shtml https://www.sec.gov/data/edgar-log-file-data-set.html | https://www.brianckeegan.com/about/ |
441 | 2017.09.13 | 5 | It wood be hard to ignore this dataset. | The “robust and curated” Global Wood Density Database contains more than 16,000 entries, culled from scientific literature, websites, and unpublished scholarship. The densest so far is a Caesalpinia sclerocarpa from Mexico, weighing in at 1.39 grams per cubic centimeter. Related: The TRY database of “curated plant traits” (free registration required). [h/t Amy Zanne] | http://wooddensity.univ-tlse3.fr/ http://datadryad.org/handle/10255/dryad.235 https://www.desertmuseum.org/programs/alamos_trees_caescl.php https://www.try-db.org/TryWeb/Database.php | https://twitter.com/AmyZanne/status/901057652024893440 |
442 | 2017.09.20 | 1 | Broadband access and cost. | The U.S. Federal Communications Commission publishes a ton of data on the “wireline” telecommunications industry, including several datasets about broadband internet access. Among them: the places where providers offer service, subscriptions per 1,000 households in each Census tract, and a survey of plans available in urban areas. You can also find a spreadsheet of payphones-by-state at the bottom of that landing page. (As of last March, there were only 113 payphones left in North Dakota, down from 705 in 2008.) Related: “Signs of Digital Distress,” a new Brookings Institution report, with findings and maps based on the broadband subscription data. | https://www.fcc.gov/general/iatd-data-statistical-reports https://www.fcc.gov/general/broadband-deployment-data-fcc-form-477 https://www.fcc.gov/general/form-477-census-tract-data-internet-access-services https://www.fcc.gov/general/urban-rate-survey-data-resources https://www.brookings.edu/research/signs-of-digital-distress-mapping-broadband-availability/ | |
443 | 2017.09.20 | 2 | Post-disaster aerial imagery. | After major natural disasters, NOAA’s National Geodetic Survey routinely collects detailed aerial photos of the affected areas. For each disaster — including Hurricane Harvey, Hurricane Irma, and a couple dozen others — you can download the full set of (georeferenced) images, by date and survey flight. [h/t David Yanofsky] | https://storms.ngs.noaa.gov/ | http://yanofsky.info/ |
444 | 2017.09.20 | 3 | Voters’ attitudes and choices, over time. | The Democracy Fund Voter Study Group, “a research collaboration comprised of nearly two dozen analysts and scholars from across the political spectrum,” has published the participant-level data from its 2016 VOTER survey. It’s a “unique longitudinal data set” that represents the “political attitudes, values, and affinities” of 8,000 American adults who were interviewed first in December 2011, then again before and after the 2012 election, and again in December 2016. [h/t Jenny Listman] | https://www.voterstudygroup.org/newsroom/press-release-democracy-fund-voter-study-group-to-release-full-longitudinal-dataset https://www.voterstudygroup.org/publications/2016-elections/data | https://twitter.com/jblistman |
445 | 2017.09.20 | 4 | Trump Organization domain registrations. | Earlier this year, Politico reporters scoured the internet’s WHOIS records for domains registered to the Trump Organization. They found thousands, including TrumpRussia.com, No2Trump.com, Trumpublican.net, and ImBeingSuedByTheDonald.com. (Most, including those, just send readers to a generic “domain parking” landing page.) Politico has open-sourced the article’s components, including a JSON file containing 1,267 of the domains, which includes each domain’s owner, creation date, last-updated date, and expiration date. [h/t Tyler Fisher] | https://whois.icann.org/en/about-whois http://www.politico.com/interactives/2017/trump-organization-business-domain-names-vegas-moscow/ https://github.com/The-Politico/interactive_trump-urls https://raw.githubusercontent.com/The-Politico/interactive_trump-urls/master/dist/data/cards.json | http://tylerjfisher.com/ |
446 | 2017.09.20 | 5 | xkcd. | The popular “webcomic of romance, sarcasm, math, and language” provides an interface for grabbing data about each comic strip, including the title, image file, date of publication, easter-egg-y “alt” text, and transcript. [h/t Karl L. Hughes] | https://xkcd.com/ https://xkcd.com/json.html https://www.explainxkcd.com/wiki/index.php/title_text | https://github.com/toddmotto/public-apis/commit/73d22681b02e76b15fa9e910b0deab154ad19fca |
447 | 2017.09.27 | 1 | Easier-to-use crime data. | Earlier this month, the FBI and 18F released the first iteration of their Crime Data Explorer, a website that simplifies access to the FBI’s Uniform Crime Reporting program. You can download bulk data on individual incidents, state and national trends, hate crimes, arrests, assaults on officers, police employees, human trafficking, and cargo theft. You can also access the data via an API. Caution: The FBI’s data collection program is voluntary; not all law enforcement agencies participate. (In fact, more than 3,000 agencies don’t submit hate crime data.) [h/t Nick Wright] | https://18f.gsa.gov/2017/09/07/opening-the-nations-crime-data/ https://crime-data-explorer.fr.cloud.gov/ https://ucr.fbi.gov/ucr-program-data-collections https://crime-data-explorer.fr.cloud.gov/downloads-and-docs https://crime-data-explorer.fr.cloud.gov/api https://ucr.fbi.gov/nibrs/2015/tables/data-tables https://www.propublica.org/article/hate-crimes-are-up-but-the-government-isnt-keeping-good-track-of-them | http://nikolaswright.com/ |
448 | 2017.09.27 | 2 | Chyrons. | The TV News Archive’s new “Third Eye” project is extracting chyrons — those placards of text at the bottom of news broadcasts, also known as “lower thirds” — from four major cable networks: BBC News, CNN, Fox News, and MSNBC. The resulting database contains every chyron that Third Eye’s optical character recognition (OCR) software has extracted since late August. Related: This Washington Post piece analyzing cable news’ chyrons during James Comey’s congressional testimony, and this explanation of how they did it. [h/t Nancy Watzman] | http://blog.archive.org/2017/09/21/tv-news-chyron-data/ https://archive.org/services/third-eye.php https://www.washingtonpost.com/graphics/2017/politics/comey-hearing-chyrons/ https://source.opennews.org/articles/how-we-tracked-cable-news-chyrons/ | https://twitter.com/nwatzman |
449 | 2017.09.27 | 3 | Every building, river, and green space in Great Britain. | The UK’s Ordnance Survey makes detailed digital maps of Great Britain. Their free offerings include all of the island’s roads, rivers, green spaces, and place names. The Survey’s “open map” includes buildings, railways, electricity transmission lines, and other features. Related: Want only the buildings? The University of Sheffield’s Alasdair Rae has you covered. [h/t Robyn Inglis] | https://www.ordnancesurvey.co.uk/business-and-government/products/opendata-products.html https://www.ordnancesurvey.co.uk/business-and-government/products/os-open-roads.html https://www.ordnancesurvey.co.uk/business-and-government/products/os-open-rivers.html https://www.ordnancesurvey.co.uk/business-and-government/products/os-open-greenspace.html https://www.ordnancesurvey.co.uk/business-and-government/products/os-open-names.html https://www.ordnancesurvey.co.uk/business-and-government/products/os-open-map-local.html http://www.statsmapsnpix.com/2017/09/buildings-of-great-britain.html | https://twitter.com/rhinglis/status/910794000516427776 |
450 | 2017.09.27 | 4 | NYC streets: the good, the bad, and the closed. | New York City’s Department of Transportation publishes a bunch of data, including its own assessments of each street segment’s quality on a 1-to-10 scale. It also publishes spreadsheets of all construction-related street closures, by intersection and by block, updated daily. [h/t Christian Moscardi] | http://www.nyc.gov/html/dot/html/about/datafeeds.shtml https://data.cityofnewyork.us/Transportation/Street-Pavement-Rating/2cav-chmn https://data.cityofnewyork.us/Transportation/Street-Closures-due-to-construction-activities-by-/478a-yykk https://data.cityofnewyork.us/Transportation/Street-Closures-due-to-construction-activities-by-/i6b5-j7bu | https://twitter.com/c_moscardi |
451 | 2017.09.27 | 5 | Your job, in numbers. | For each of 966 occupations, the Department of Labor’s O*NET database quantifies the types of knowledge, skills, abilities, education, and training required, tasks involved, tools used, and other job-related parameters. Related: The Upshot uses the data to ask (and answer), “What Is Your Opposite Job?” | https://www.onetcenter.org/database.html?p=3 https://www.nytimes.com/interactive/2017/08/08/upshot/what-is-your-opposite-job.html | |
452 | 2017.10.04 | 1 | Four decades of U.S. air quality. | The Environmental Protection Agency collects air quality samples from thousands of monitoring stations across the country. The resulting datasets, which go back to the 1980s, are available as daily files, annual files, and via an API. The monitored pollutants include ozone, carbon monoxide, sulfur dioxide, nitrogen dioxide, particulate matter, volatile organic compounds, and more. You can also download daily Air Quality Index ratings and information about each monitoring station. Previously: Global air pollution datasets from Berkeley Earth (DIP 2017.03.22) and from the World Health Organization (DIP 2016.06.15). [h/t Swier Heeres] | https://www.epa.gov/outdoor-air-quality-data/download-daily-data https://aqs.epa.gov/aqsweb/airdata/download_files.html https://aqs.epa.gov/aqsweb/documents/data_mart_welcome.html https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI https://aqs.epa.gov/aqsweb/airdata/download_files.html#Meta https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-22-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-15-edition | https://opendata.stackexchange.com/questions/11750/air-quality-in-all-cities-in-the-usa/11751#11751 |
453 | 2017.10.04 | 2 | Chest x-rays. | Last week, the National Institutes of Health released a dataset containing more than 100,000 anonymized chest x-rays, from 30,000 patients, “including many with advanced lung disease.” For each image, the associated metadata includes the patient’s age, gender, and diagnosis labels. (The dataset’s authors used natural language processing to extract those labels from radiological reports; they estimate that fewer than 10% of the labels are incorrect.) Related: Andrew L. Beam’s list of medical datasets for machine learning. [h/t Chris Hamby] | https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community https://nihcc.app.box.com/v/ChestXray-NIHCC https://nihcc.app.box.com/v/ChestXray-NIHCC/file/220660789610 https://github.com/beamandrew/medical-data | https://twitter.com/ChrisDHamby |
454 | 2017.10.04 | 3 | Media coverage. | Media Cloud, a collaboration between MIT and Harvard–based researchers, describes itself as “an open-source platform for studying media ecosystems.” The project lets you track topics and keywords across thousands of sources — including mainstream news publications in the U.S. and many other countries — at both a story and sentence level. You can access Media Cloud’s data via its dashboard or its API. Both require (free) registration. Related: “The Media Really Has Neglected Puerto Rico,” by Dhrumil Mehta at FiveThirtyEight; the analysis uses data from Media Cloud, the TV News Archive, and Google Trends. Also related: The geometry of hurricane coverage, as told through the front pages of The New York Times and Washington Post. | https://mediacloud.org/ https://dashboard.mediacloud.org/ https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md https://fivethirtyeight.com/features/the-media-really-has-neglected-puerto-rico/ https://archive.org/details/tv https://www.google.com/trends/ http://www.thefunctionalart.com/2017/09/low-tech-visualization-how-much-space.html | |
455 | 2017.10.04 | 4 | Privately owned public spaces. | In certain cities, private developers can earn zoning concessions by converting sections of their properties into plazas, atriums, mini-parks, and other open-to-the-public spaces. You can download datasets of these “privately owned public spaces” in San Francisco, Seattle, New York City, and — thanks to a recent collaboration between Guardian Cities and a local community group — London. Related: A guide to NYC’s POPS. [h/t Reddit user seeriktus + Ed Vine] | https://en.wikipedia.org/wiki/Privately_owned_public_space https://data.sfgov.org/Culture-and-Recreation/Privately-Owned-Public-Open-Spaces/65ik-7wqd https://data.seattle.gov/Community/Privately-Owned-Public-Spaces-Map/52gz-md6f https://nycopendata.socrata.com/Housing-Development/Privately-Owned-Public-Spaces/fum3-ejky https://www.theguardian.com/cities/2017/jul/24/pseudo-public-space-explore-data-what-missing https://data.london.gov.uk/dataset/privately-owned-public-spaces https://apops.mas.org/find-a-pops/ | https://www.reddit.com/r/datasets/comments/72j3vw/dataset_london_privatelyowned_public_spaces_gigl/ https://www.linkedin.com/in/ed-vine-a480347/ |
456 | 2017.10.04 | 5 | These American Voices. | For a new interactive essay at The Pudding, Ash Ngu analyzed the gender composition of This American Life episodes. To support the findings, Ngu has published the underlying data, extracted from the show’s transcripts. Among the data extracted: the number of words spoken by each person in each act of each episode. | http://stanford.edu/~ashngu/cgi-bin/ https://pudding.cool/2017/09/this-american-life/ https://docs.google.com/spreadsheets/d/1KpGZzeBawsGsiYHhFgCkHFSImFlS2sdWFI4pnpUWdLQ/edit#gid=0 | |
457 | 2017.10.11 | 1 | Wildfires. | “Monitoring Trends in Burn Severity (MTBS) is an interagency program whose goal is to consistently map the burn severity and extent of large fires across all lands of the United States”; the most recent release contains more than 20,000 fires from 1984 to 2015. You can explore the data online, or download it in bulk. For more recent data, see GeoMAC, which aims to map all current wildfires; NOAA’s Hazard Mapping System, which uses satellites to detect fire locations and smoke plumes; and NASA’s MODIS and VIIRS datasets, which provide satellite-based detections for the entire globe. Previously: National Fire Incident Reporting System, which also includes structure fires and vehicle fires (DIP 2016.07.20). [h/t Max Joseph] | https://www.mtbs.gov/project-overview https://www.mtbs.gov/articles/announcement/data-release-may-1-2017 https://www.mtbs.gov/viewer/index.html https://www.mtbs.gov/direct-download https://www.geomac.gov/ http://www.ospo.noaa.gov/Products/land/hms.html https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/active-fire-data https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-20-edition | https://github.com/mbjoseph/mtbs-data |
458 | 2017.10.11 | 2 | Political crowd estimates. | The Crowd Counting Consortium, launched earlier this year, is a volunteer effort to “[collect] publicly available data on political crowds reported in the United States, including marches, protests, strikes, demonstrations, riots, and other actions.” The team publishes monthly spreadsheets that list each crowd’s date, location, type, and cause (e.g., “Oppose removal of confederate statue”); high and low size estimates; the number of reported arrests and injuries; links to sources; and additional details. Related: The project’s main coordinators have been summarizing their findings on the Washington Post’s Monkey Cage blog. [h/t Amanda L. James] | https://sites.google.com/view/crowdcountingconsortium/about https://sites.google.com/view/crowdcountingconsortium/view-download-the-data?authuser=0 https://www.washingtonpost.com/news/monkey-cage/wp/2017/09/25/charlottesville-and-its-aftermath-brought-out-many-protesters-in-august-but-still-more-were-against-trump-and-his-policies/ | http://www.amandalynnjames.com/ |
459 | 2017.10.11 | 3 | Commercial vehicle safety. | The Federal Motor Carrier Safety Administration helps to regulate the United States’ large trucks and passenger buses. The datasets available through its Safety Measurement System include a census of all regulated carriers, the results of safety inspections, and reported crashes. The crash files list the number of injuries and fatalities; the weather, light, and road conditions; the involved vehicle’s VIN and license plate number; and more. [h/t Dan Brady] | https://www.fmcsa.dot.gov/ https://ai.fmcsa.dot.gov/SMS/Tools/Downloads.aspx | http://danjbrady.com/ |
460 | 2017.10.11 | 4 | San Francisco Bay water. | The U.S. Geological Survey has been measuring water quality in the San Francisco Bay for nearly 50 years. The agency recently published 210,826 of these measurements, collected from dozens of monitoring stations between April 1969 and December 2015. (It’s “one of the longest records of water-quality measurements in a North American estuary,” according to a recent academic article describing the data.) Each row specifies the measurement’s date, station, depth, temperature, and salinity; many rows include levels of chlorophyll, oxygen, nitrate, ammonium, and other matter. | https://www.sciencebase.gov/catalog/item/5841f97ee4b04fc80e518d9f https://www.nature.com/articles/sdata201798 | |
461 | 2017.10.11 | 5 | Humans in motion. | Carnegie Mellon’s Motion Capture Database provides data files and videos representing humans performing various activities: shaking hands, drinking soda, exchanging “angry hand gestures,” doing cartwheels, mopping floors, laughing, chicken-dancing, and oh-so-much more. [h/t John Emerson] | http://mocap.cs.cmu.edu/ | https://backspace.com |
462 | 2017.10.18 | 1 | Puerto Rico’s recovery. | Since shortly after Hurricane Maria hit Puerto Rico, the territory’s government has been publishing a dashboard of recovery statistics. The website tracks a couple dozen metrics, including the percent of homes with electricity, number of people in shelters, and the number of open hospitals. For several of the main metrics, researcher Michael A. Johansson has been scraping daily figures from the dashboard and publishing them as a CSV file. Related: The Washington Post has been charting the recovery, and published a deep dive into the island’s ongoing power outages. | http://status.pr/ https://github.com/majohansson/maria-puerto-rico https://github.com/majohansson/maria-puerto-rico/blob/master/data/StatusPR.csv https://www.washingtonpost.com/news/politics/wp/2017/10/06/fema-buried-updates-on-puerto-rico-here-they-are/?utm_term=.85824229c3b4 https://www.washingtonpost.com/graphics/2017/national/puerto-rico-hurricane-recovery/ | |
463 | 2017.10.18 | 2 | Subnational conflicts. | University of Michigan–based researchers have created “a repository of micro-level, subnational event data on armed conflict and political violence around the world.” The project, dubbed xSub, standardizes information from 21 data sources, and includes conflicts in 139 countries between 1942 and 2016. For each administrative boundary (e.g., country, province, district) and data source, xSub’s data counts the number of violent incidents by year, month, week, or day. The numbers are also broken down by the sides involved, who initiated the conflict, and what types of force were used. [h/t Andy Halterman] | http://cross-sub.org/about/our-team http://cross-sub.org/ http://www.cross-sub.org/data | https://twitter.com/ahalterman/status/906563742879674368 |
464 | 2017.10.18 | 3 | Patents and trademarks. | The U.S. Patent and Trademark Office publishes a huge amount of bulk data, including detailed XML files that contain information about millions of patent/trademark applications, assignments, trials, and appeals. The agency also publishes a collection of “research datasets”, which distill those bulk XML files into easier-to-use tabular data. [h/t Rachael Tatman] | https://bulkdata.uspto.gov/ https://www.uspto.gov/learning-and-resources/ip-policy/economic-research/research-datasets | https://www.kaggle.com/rtatman/trademark-application |
465 | 2017.10.18 | 4 | Sister, Sister. | In the wake of the Second Vatican Council in the 1960s, Sister Marie Augusta Neal conducted an enormous opinion survey of Catholic “women religious.” More than 130,000 sisters responded to the 649 multiple-choice-question survey — the results of which the University of Notre Dame recently cleaned up and made available online. [h/t Kevin Schlottmann] | https://news.nd.edu/news/digital-preservation-at-notre-dame-breathes-new-life-into-1967-sisters-survey/ https://curate.nd.edu/show/0r967368551 | https://twitter.com/archivistkevin |
466 | 2017.10.18 | 5 | Get the idea? | ConceptNet “is a freely-available semantic network, designed to help computers understand the meanings of words that people use.” It defines approximately 28 million “statements,” i.e., relationships between various things. For instance, ConceptNet indicates that a newsletter is a type of “report”, and that a computer can be used to “send email”. You can download the entire dataset, or access it via an API. | http://conceptnet.io/ https://github.com/commonsense/conceptnet5/wiki/FAQ https://github.com/commonsense/conceptnet5/wiki/Relations http://conceptnet.io/c/en/newsletter http://conceptnet.io/c/en/computer https://github.com/commonsense/conceptnet5/wiki/Downloads https://github.com/commonsense/conceptnet5/wiki/API | |
467 | 2017.11.01 | 1 | Federal court cases. | The U.S. Federal Judicial Center’s “Integrated Data Base” contains a longitudinal record of all federal criminal, civil, and appellate court cases going back to the 1970s, as well as bankruptcy cases going back to late 2007. Each dataset contains dozens of detailed fields — including each case’s jurisdiction, name, docket number, relevant legal statutes, and more — accompanied by explanatory codebooks. You can download single-year snapshots and cumulative files, or interactively select specific slices of data to export. Related: “How the Bankruptcy System Is Failing Black Americans,” an investigation by ProPublica that used the IDB’s data on bankruptcy cases for its analysis. | https://www.fjc.gov/research/idb https://features.propublica.org/bankruptcy-inequality/bankruptcy-failing-black-americans-debt-chapter-13/ https://www.propublica.org/datastore/dataset/national-bankruptcy-chapter-7-13 https://projects.propublica.org/graphics/bankruptcy-data-analysis | |
468 | 2017.11.01 | 2 | High-profile sexual assault timelines. | Rebecca Zisser and Lazaro Gamio at Axios have compiled a timeline of alleged sexual assaults by Harvey Weinstein, Bill O’Reilly, Roger Ailes, Donald Trump, and Bill Cosby. For each of the 140+ cases recorded as of Oct. 20, the timeline indicates the year of the assault, the year the victim came forward (if they did), and the year of any legal settlement (if there was one). The underlying data is available as a spreadsheet. [h/t Mike Allen] | https://docs.google.com/spreadsheets/d/10CWJHTzvGtkQgyz5bdkolz1KeZLqq7sYqNA3zQaPlYo/view#gid=1175970372 | https://www.axios.com/axios-am-2498785314.html |
469 | 2017.11.01 | 3 | Deepwater Horizon’s effects. | For years, the National Oceanic & Atmospheric Administration has been working to assess the damage done to natural resources by the April 2010 Deepwater Horizon explosion and oil spill. As part of that effort, they’ve collected and compiled several dozen related datasets, including toxicity studies, plankton samples, necropsies of stranded turtles, dolphin health assessments, and a “backyard boater” survey. [h/t Sebastian Kraus] | https://www.diver.orr.noaa.gov/deepwater-horizon-nrda-data | https://www.mcc-berlin.net/en/about/team/kraus-sebastian.html |
470 | 2017.11.01 | 4 | County-level cardiovascular deaths. | Researchers at the University of Washington’s Institute for Health Metrics and Evaluation have estimated cardiovascular mortality rates for each U.S. county, for every year between 1980 and 2014. The findings, based on 32 million de-identified death records, population data from the Census, and other sources, are also broken down by particular disease (e.g., aortic aneurysm, ischemic stroke, etc.) and gender. Related: The researchers’ JAMA article describing their methodology and findings. Previously: The Global Burden of Disease dataset, published by the same institute (DIP 2016.07.27). [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] | http://ghdx.healthdata.org/record/united-states-cardiovascular-disease-mortality-rates-county-1980-2014 https://jamanetwork.com/journals/jama/fullarticle/2626571 https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-27-edition | |
471 | 2017.11.01 | 5 | Chimp personalities. | “Jane Goodall drew the attention of a global audience with vivid depictions of the personalities of eastern chimpanzees (Pan troglodytes schweinfurthii) at Gombe National Park, yet only one attempt [in 1973] has been made to quantify these personality traits systematically,” writes a team of researchers in the latest issue of Scientific Data. To remedy the situation, the researchers paid field observers to score 128 Gombe chimpanzees on 24 personality traits — “dominant,” “excitable,” “helpful,” “sensitive,” and more — on a seven-point scale. | https://www.nature.com/articles/sdata2017146 https://osf.io/s7d9d/ | |
472 | 2017.11.08 | 1 | Gun origins. | The Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF) helps trace guns — such as those recovered at crime scenes by law enforcement agencies — back to their original manufacturers, wholesale distributors, dealers, and purchasers. Each year, ATF publishes a range of datasets based on these gun traces. The datasets for 2016 provide state-by-state tallies of gun caliber, state of original purchase, possessors’ age, associated crime, and more. Related: “Gun Laws Stop At State Lines, But Guns Don’t,” from FiveThirtyEight, using the data. Also related: “How a Gun Trace Works,” from The Trace. Previously: Firearm background checks (DIP 2015.12.09), which my colleague Peter Aldhous analyzed last week, finding that gun sales did not spike after the Las Vegas shooting. | https://www.atf.gov/firearms/national-tracing-center https://www.atf.gov/resource-center/data-statistics https://www.atf.gov/resource-center/firearms-trace-data-2016 https://fivethirtyeight.com/features/gun-laws-stop-at-state-lines-but-guns-dont/ https://www.thetrace.org/2016/07/how-a-gun-trace-works-atf-ffl/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-09-edition https://www.buzzfeed.com/peteraldhous/gun-sales-after-vegas-shooting | |
473 | 2017.11.08 | 2 | Silicon Valley diversity. | Reporters at the Center for Investigative Reporting asked 200+ of the largest Silicon Valley tech companies for their official diversity data. Specifically, the reporters requested each company’s latest EEO-1, the detailed demographic report that every large U.S. employer must submit to the federal government. Only 23 companies shared their data. For those that did, their numbers are now available as a tidy spreadsheet. [h/t Sophie Chou] | https://www.revealnews.org/article/hidden-figures-how-silicon-valley-keeps-diversity-data-secret/ https://www.revealnews.org/article/how-we-analyzed-silicon-valley-tech-companies-diversity-data/ https://apps.revealnews.org/silicon-valley-diversity-list/ https://github.com/cirlabs/Silicon-Valley-Diversity-Data | http://sophiechou.com/ |
474 | 2017.11.08 | 3 | Rent-to-own prices. | As part of NerdWallet’s recent investigation into Rent-A-Center, “the nation’s largest rent-to-own company,” reporters compiled pricing data for 39 consumer products on rentacenter.com. For each product, the dataset lists the various Rent-A-Center costs (e.g., installment fees for weekly/monthly payment plans, cash prices, et cetera) in each of 48 states and D.C. — plus prices for the same product at standard online retailers. Related: NerdWallet’s analysis of the data. | https://www.nerdwallet.com/blog/rentacenter/ https://www.nerdwallet.com/blog/finance/rent-a-center-methodology/ https://www.nerdwallet.com/blog/finance/rent-a-center-prices/ | |
475 | 2017.11.08 | 4 | Indian movie theaters. | Over at BuzzFeed India, Harsha Devulapalli and Janak Jain have crowned Hyderabad the best city in India for going to the movies, based on their analysis of nearly 600 theaters in eight major cities. The underlying dataset lists each theater’s location, name, average ticket price (where available), number of screens, and number of seats. | https://www.buzzfeed.com/harshadevulapalli/whats-the-best-city-in-india-to-watch-a-movie https://github.com/HarshaDevulapalli/indian-movie-theatres | |
476 | 2017.11.08 | 5 | The friends of Friends. | A few years ago, economist Alex Albright and a friend transcribed the plotline-sharing dynamics of Friends’ six friends, across all 236 episodes. In the very first episode (“The One Where Monica Gets a Roommate”), Monica and Rachel each have their own plotline; Rachel and Ross share a plotline; and Chandler, Joey, and Ross share another plotline. Related: Albright’s analysis of the data. | https://github.com/apalbright/Friends https://thelittledataset.com/2015/01/20/the-one-with-all-the-quantifiable-friendships/ | |
477 | 2017.11.29 | 1 | Protests and political violence in Africa and Asia. | The Armed Conflict Location & Event Data Project (ACLED) records the locations, dates, actors, and outcomes of “all reported political violence and protest events in over 60 developing countries in Africa and Asia.” The Africa datasets currently go back to 1997 and cover more than 50 countries. The Asia datasets currently only go back to 2015, but ACLED’s website says it’s planning to add data soon going back to 2010. Both of the datasets are extensively documented, as is the methodology. [h/t Lari McEdward] | https://www.acleddata.com/about-acled/ https://www.acleddata.com/data/ https://www.acleddata.com/asia-data/ https://www.acleddata.com/methodology/ | https://twitter.com/LariMcEdward |
478 | 2017.11.29 | 2 | Stolen guns. | Missing Pieces is “a yearlong investigation by The Trace and more than a dozen NBC TV stations [that has] identified more than 23,000 stolen firearms recovered by police between 2010 and 2016 — the vast majority connected with crimes.” To support the investigation, the reporters obtained more than 800,000 records of stolen and recovered guns, which they’ve standardized into a single CSV file and supplemented with a data dictionary. The dataset “contains nearly complete stolen-gun records for the states of California and Florida, both of which have centralized collections of gun-theft data,” as well as records from nearly 300 other agencies across the country. Previously: The ATF’s gun trace statistics (DIP 2017.11.08) and firearm background checks (DIP 2015.12.09). [h/t Sarah Ryley] | https://www.thetrace.org/features/stolen-guns-violent-crime-america/ https://www.thetrace.org/missing-pieces-data/ https://storage.googleapis.com/missing-pieces/missing_pieces_data_dict_11-20-2017.pdf https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-11-08-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2015-12-09-edition | https://twitter.com/MissRyley/status/932677322834153472 |
479 | 2017.11.29 | 3 | (Some) White House visitor logs. | ProPublica has published a searchable and downloadable dataset of visitor logs and meeting calendars from five White House agencies: the Office of Management and Budget, the Office of the U.S. Trade Representative, the Office of National Drug Control Policy, the Office of Science and Technology Policy, and the Council on Environmental Quality. ProPublica received the underlying documents from Property of the People, a transparency group that sued the Trump administration to release the records under the Freedom of Information Act. (The administration has not released the White House’s main visitor logs.) Related: Politico has manually compiled a searchable database it calls “The Unauthorized White House Visitor Logs”, based on thousands of known visits, meetings, phone calls, and other presidential interactions. Also related: The Obama administration’s White House visitor logs. | https://projects.propublica.org/graphics/wh-complex https://www.propublica.org/datastore/dataset/trump-administration-white-house-complex-visitor-records https://www.whitehouse.gov/administration/eop https://projects.propublica.org/graphics/wh-complex#methodology https://twitter.com/PropOTP https://www.politico.com/interactives/databases/trump-white-house-visitor-logs-and-records/index.html https://obamawhitehouse.archives.gov/goodgovernment/tools/visitor-records | |
480 | 2017.11.29 | 4 | California elections and campaign finance. | Since 2014, the California Civic Data Coalition has been working to improve access to CAL-ACCESS, “the jumbled, dirty and difficult government database that tracks campaign finance and lobbying activity in California politics.” Their cleaned-up datasets are updated often and include formats suitable for beginners, “database junkies,” and masochists. Last month, the organization released data files cataloging every state ballot measure and candidate for public office since 2000. [h/t Zack Quaintance] | https://www.californiacivicdata.org/about/ http://cal-access.ss.ca.gov/ https://calaccess.californiacivicdata.org/downloads/latest/ https://www.californiacivicdata.org/2017/10/31/processed-files/ | http://www.govtech.com/civic/Whats-New-in-Civic-Tech-New-York-City-Announces-Crowdfunding-Program-for-Women-Entrepreneurs.html |
481 | 2017.11.29 | 5 | Folktales. | The Aarne-Thompson-Uther Classification of Folk Tales organizes (mostly Indo-European) folktales into groups and hierarchies. As Atlas Obscura’s Cara Giaimo puts it, the ATU is “like the Dewey Decimal System, but with more ogres.” The ATU doesn’t publish any downloadable versions of its data, but researchers studying the “ancient roots” of such stories have built a data-matrix that denotes the presence/absence of the 275 ATU “tales of magic” across 50 Indo-European-speaking populations. [h/t Andrew McCartney] | http://www.mftd.org/index.php?action=atu http://www.mftd.org/index.php?action=browse&act=select&fld=langname https://www.atlasobscura.com/articles/aarne-thompson-uther-tale-type-index-fables-fairy-tales http://rsos.royalsocietypublishing.org/content/3/1/150645 http://rsos.royalsocietypublishing.org/content/3/1/150645.figures-only http://www.mftd.org/index.php?action=atu&act=range&id=300-749 | http://people.virginia.edu/~acm9q/ |
482 | 2017.12.06 | 1 | Two decades of workplace sexual harassment complaints. | My colleague Lam Thuy Vo obtained an anonymized dataset listing all 170,000+ sexual harassment claims submitted to the U.S. Equal Employment Opportunity Commission between October 1995 and September 2016. For each claim, the dataset indicates the date the complaint was filed, the complainant’s gender, and the general category of employer. Additional fields — available for most claims, but not all — indicate the complainant’s birthdate, race, and national origin, as well as the employer’s industry and approximate number of workers. Related: Lam’s story and interactive graphics, which place the data in context. | https://github.com/BuzzFeedNews/2017-12-eeoc-harassment-charges/ https://www.buzzfeed.com/lamvo/eeoc-sexual-harassment-data | |
483 | 2017.12.06 | 2 | Financial consumer complaints. | The Consumer Financial Protection Bureau’s consumer complaint database can be searched online, accessed via an API, and downloaded in bulk. The 915,000+ complaints the Bureau has received have been categorized into 18 financial product groups (e.g., mortgages, debt collection, student loans, cryptocurrency) and more than 160 kinds of issues (e.g., billing disputes, communication tactics, privacy). The agency says they “don’t verify all the facts alleged in these complaints,” but that they “take steps to confirm a commercial relationship between the consumer and the company.” [h/t Dan Brady] | https://www.consumerfinance.gov/data-research/consumer-complaints/ https://www.consumerfinance.gov/data-research/consumer-complaints/search/?from=0&searchField=all&searchText=&size=25&sort=created_date_desc https://dev.socrata.com/foundry/data.consumerfinance.gov/jhzv-w97w https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data https://www.consumerfinance.gov/data-research/consumer-complaints/ | http://www.danjbrady.com/ |
484 | 2017.12.06 | 3 | The StudentLife Study. | Back in 2013, four dozen Dartmouth College students agreed to let a custom smartphone app surveil them for the StudentLife Study. During the 10 weeks of the spring academic term, the app collected data on the students’ physical activity, GPS coordinates, eating schedule, sleep habits, phone usage, and more. The study combined all that information with a slew of other data, including the students’ class deadlines, academic performance, and their responses to surveys about stress, depression, personality, and sleep quality. The study’s public (and anonymized) dataset clocks in at 53 gigabytes. Related: “Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities,” a recently-released academic paper that uses the StudentLife dataset. [h/t Konrad Kording] | http://studentlife.cs.dartmouth.edu/ http://studentlife.cs.dartmouth.edu/dataset.html https://arxiv.org/abs/1711.06350 | https://twitter.com/KordingLab/status/936620930222247936 |
485 | 2017.12.06 | 4 | 5,000+ Brazilian news outlets. | Atlas da Notícia is a Brazilian project that aims to collect data on all local and regional news outlets in the country. Last month, the project released its first batch of data, which identified 5,354 newspapers and online publications in a total of 1,125 municipalities. The raw dataset is currently only available in Portuguese, but the aggregate tables have been translated into English. [h/t Sérgio Spagnuolo] | https://www.atlas.jor.br/en.html https://github.com/voltdatalab/Atlas-Analytics https://docs.google.com/spreadsheets/d/1SudAc6RAQuYu4bWj_gJnuGrRmJXTP_TYMdx7huHkrbA/edit#gid=0 | https://twitter.com/sergiospagnuolo |
486 | 2017.12.06 | 5 | One family’s spending. | An anonymous married couple has decided “to be completely open about [their] finances so that people can see what an actual family’s budget looks like.” In addition to blogging about their financial habits, they’ve also published a spreadsheet of “(almost) every dollar” they spent between December 2015 and November 2017. For each transaction, the dataset provides the date, dollar amount, category (e.g., “Groceries”), and meta-category (e.g., “Food”). | https://ourfamilyandfinances.blogspot.com/p/about-us.html https://ourfamilyandfinances.blogspot.com https://www.reddit.com/r/datasets/comments/7ei2ma/2_years_of_my_spending_history/ | |
487 | 2017.12.13 | 1 | Government-sponsored cyberattacks. | Last month, the Council on Foreign Relations launched the Cyber Operations Tracker, a database of “publicly known state-sponsored cyber incidents that have occurred since 2005.” The 191 attacks in the database so far have been sponsored by 16 different countries, with China, Russia, and Iran being the most cited. For each incident, the dataset also includes the type of attack (e.g., espionage, data destruction), its name (e.g., “Stuxnet”), a description, the date it occurred, its victims, and the type of response, if any. | https://www.cfr.org/blog/tracking-state-sponsored-cyber-operations https://www.cfr.org/interactive/cyber-operations | |
488 | 2017.12.13 | 2 | Fatal and nonfatal officer-involved shootings. | For an investigation published Monday, Vice News spent “nine months collecting data on both fatal and nonfatal police shootings from the 50 largest local police departments in the United States.” They’ve published raw and standardized data on every shooting, plus the code they used to analyze it. [h/t Allison McCann] | https://news.vice.com/story/shot-by-cops https://news.vice.com/story/nonfatal-police-shootings-data https://github.com/vicenews/shot-by-cops/ | https://twitter.com/atmccann/status/940239584616697856 |
489 | 2017.12.13 | 3 | Sports team ratings. | For several years now, the folks at FiveThirtyEight have been quantifying professional sports teams’ current and historical strength, mostly using Elo rating systems. Their global club soccer ratings go back to 2016, their basketball ratings go back to 1946, their American football ratings go back to 1920, and their baseball ratings go back to 1871. For each of those, the entire histories of match-by-match ratings are available as CSV files. [h/t Jay Boice] | https://en.wikipedia.org/wiki/Elo_rating_system https://github.com/fivethirtyeight/data/tree/master/soccer-spi https://github.com/fivethirtyeight/data/tree/master/nba-carmelo https://github.com/fivethirtyeight/data/tree/master/nfl-elo https://github.com/fivethirtyeight/data/tree/master/mlb-elo | https://github.com/fivethirtyeight/data/issues/91#issuecomment-349672900 |
490 | 2017.12.13 | 4 | State lawmakers’ financial disclosures. | For a recent investigation into state legislators’ financial interests, the Center for Public Integrity “analyzed disclosure reports from 6,933 lawmakers holding office in 2015 from the 47 states that required them.” You can search through the disclosures and download the data. For each of the 11,000+ disclosed interests, the dataset includes the lawmaker’s state, legislative body, and district; the name and industry of the financial interest; and a link to the lawmaker’s personal disclosure form. [h/t The Nerds at INN Labs] | https://www.publicintegrity.org/2017/12/06/21297/conflicted-interests-state-lawmakers-often-blur-line-between-publics-business-and https://apps.publicintegrity.org/disclosure/ https://github.com/PublicI/state-lawmakers-disclosures | http://mailchi.mp/inn/rudolph-the-red-nosed-robot |
491 | 2017.12.13 | 5 | Secure seeds. | In far northern Norway, the Svalbard Global Seed Vault safekeeps hundreds of millions of seeds, helping to back up the world’s biodiversity. Data on the vault’s deposits, which often contain hundreds of seeds apiece, are available to search and to download. [h/t Enigma Public] | http://www.seedvault.no/ http://www.economist.com/node/21549931 https://www.nordgen.org/sgsv/index.php | https://us5.campaign-archive.com/?u=04aa10cf99e0998bd8e69a109&id=e28bb68e14 |
492 | 2017.12.27 | 1 | Historical credit ratings. | The SEC requires Moody’s, Standard & Poor’s, and other “nationally recognized statistical rating organizations” to report their rating assignments and changes (e.g., upgrades, downgrades, withdrawals) going back to 2010. The agencies publish the reports as XBRL-formatted files, and update them monthly. But “because most researchers are unfamiliar with XBRL and cannot easily locate the history files, this valuable resource has seen limited use,” according to the Center for Municipal Finance’s RatingsHistory.info, which now provides the reports as easier-to-use CSVs. [h/t data.world] | https://www.sec.gov/structureddata/rocr-publication-guide.html http://www.municipalfinance.org/ http://ratingshistory.info/ | https://data.world/muni-finance/credit-ratings-history-data |
493 | 2017.12.27 | 2 | Marine traffic. | Ships use the internationally-standardized automatic identification systems (AIS) to broadcast their name, speed, direction, and other details. With a bit of radio hardware and software, anyone can collect the signals emitted by nearby vessels. AISHub aggregates AIS data from hundreds of volunteer signal-collectors around the world, and makes that data available via an API and online maps. The Finnish Transport Agency also provides an API of data collected by its AIS stations on the Baltic Sea and other local waters; Denmark’s government publishes free historical data of maritime traffic on Danish waters; and the Coast Guard publishes historical AIS data for U.S. coastal waters (currently only for 2009–2014). [h/t Topi Tjukanov + Miska Knapek] | https://www.navcen.uscg.gov/?pageName=AISmain http://www.aishub.net/ http://www.aishub.net/stations http://www.aishub.net/api https://meri.digitraffic.fi/api/v1/metadata/documentation/swagger-ui.html https://www.dma.dk/SikkerhedTilSoes/Sejladsinformation/AIS/Sider/default.aspx https://marinecadastre.gov/ais/ | https://twitter.com/tjukanov/status/932698453347635202 https://twitter.com/miskaknapek/status/939154263430836225 |
494 | 2017.12.27 | 3 | A better mammography database. | The Digital Database for Screening Mammography was first released two decades ago, in 1997. It contains data and images from 2,620 mammographies — a mix of normal, benign, and malignant cases. In a Scientific Data article published last week, a team of Stanford University researchers describe a series of improvements they’ve made to the original database; their Curated Breast Imaging Subset of DDSM has modernized the database’s image formatting, added detailed “region-of-interest” annotations, and converted the metadata into CSV files. | http://marathon.csee.usf.edu/Mammography/Database.html https://www.nature.com/articles/sdata2017177 https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM#9a3e6931566743438ac1bb86921a522d | |
495 | 2017.12.27 | 4 | Drug-free school zones in Tennessee. | As part of a recent investigation, reporters at Reason Magazine used public records law to obtain geospatial data on each of Tennessee's 8,544 drug-free zones. In addition to the geographic boundaries, the shapefile also includes each zone’s name and type (school, childcare, park, or library). [h/t CJ Ciaramella] | http://reason.com/archives/2017/12/18/the-myth-of-the-playground-pus https://github.com/cjciaramella/tn-drug-free-school-zones | https://twitter.com/cjciaramella/status/942779337950015490 |
496 | 2017.12.27 | 5 | Italian for watermelon. | Through a series of surveys, L'Atlante della Lingua Italiana QUOTidiana has been asking Italian speakers what words they use to describe various everyday things. The results for each question can be browsed as maps, or downloaded as XML files. When shown a picture of a watermelon, most respondents wrote “anguria,” but others responded with “cocomero,” “melone,” “citrone,” or “zipangulu.” [h/t Giuseppe Sollazzo] | https://www.atlante-aliquot.de/index.php https://www.atlante-aliquot.de/primo_turno.php https://www.atlante-aliquot.de/primo_turno_anguria.php | http://mailchi.mp/586b767e6d30/preview-222-in-other-news-3576601?e=6c87ff0227 |
497 | 2018.01.03 | 1 | Taxes filed. | The IRS publishes a ton of tax statistics. One of the most interesting portions: data aggregated from individual income tax returns (i.e., Form 1040s), which the IRS provides at the state, county, and ZIP code level. Those datasets’ 100+ fields include details that range from the basic (e.g., the number of tax filings and total income reported) to the more obscure (e.g., the number of returns that included “educator expenses” and the total amount of overpayments refunded). [h/t Cecilia Reyes] | https://www.irs.gov/statistics https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-return-form-1040-statistics https://www.irs.gov/statistics/soi-tax-stats-historic-table-2 https://www.irs.gov/statistics/soi-tax-stats-county-data https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi | https://twitter.com/kcecireyes |
498 | 2018.01.03 | 2 | Distance learning. | The Open University Learning Analytics dataset features demographic information about 28,000+ students who, in 2013 and 2014, enrolled in any of seven particular distance learning courses at the UK’s Open University; their final results (distinction, pass, fail, or withdrawn); 173,000+ graded assignments; and 10+ million rows describing each student’s interactions with the courses’ “virtual learning environments.” Useful: The researchers’ academic article describing the dataset. | https://analyse.kmi.open.ac.uk/open_dataset http://www.open.ac.uk/ https://www.nature.com/articles/sdata2017171 | |
499 | 2018.01.03 | 3 | Animals on the move. | Movebank is “a free, online database of animal tracking data hosted by the Max Planck Institute for Ornithology.” On the site’s data map, you can display the animal tracks from particular studies — for instance, the migrations of more than a dozen turkey vultures. Contributing researchers can decide whether to share the underlying data; not all do. (Here’s the data for those vultures, plus six buffalo in Kruger National Park, and seven Venezuelan oilbirds.) [h/t Hari Karthic] | https://www.movebank.org/ https://www.movebank.org/panel_embedded_movebank_webapp https://www.movebank.org/panel_embedded_movebank_webapp?gwt_fragment=page%253Dsearch_map_linked%252CindividualIds%253D17002752*%252B17002744*%252B17002737*%252B17002751*%252B17002732*%252B17002753*%252B17002743*%252B17002742*%252B17002748*%252B17002749*%252B17002739*%252B17002738*%252B17002740*%252B17002741*%252B17002745*%252B17002746*%252B17002747*%252B17002754*%252B17002750*%252Clat%253D10.423457393079747%252Clon%253D-86.25215000000173%252Cz%253D3 https://www.datarepository.movebank.org/handle/10255/move.363 https://www.datarepository.movebank.org/handle/10255/move.609 https://www.datarepository.movebank.org/handle/10255/move.269 | https://github.com/toddmotto/public-apis/commit/d601d012e401d229ed226a805f289a1a7aa97b96 |
500 | 2018.01.03 | 4 | Building permits. | The Census Bureau’s Building Permits Survey collects data from thousands of municipalities every month. For each municipality, metro area, and state, the datasets provide the number of permits issued for new residential housing, the number of housing units authorized, and the total estimated value of the new construction. Previously: The Census Bureau’s Annual Characteristics of New Housing survey (DIP 2016.06.22). [h/t Susie Cambria + Issi Romem] | https://www.census.gov/construction/bps/definitions/ https://www.census.gov/construction/bps/how_the_data_are_collected/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-22-edition | https://twitter.com/susiecambria https://www.buildzoom.com/blog/paying-for-dirt-where-have-home-values-detached-from-construction-costs |
501 | 2018.01.03 | 5 | Knotted string. | “The Khipu Database Project began in the fall of 2002, with the goal of collecting all known information about khipu” — the knotted string textiles used for recordkeeping in the Inca Empire — “into one centralized repository.” The project’s datasets include detailed structural data about hundreds of khipu, as well as an inventory of all known specimens. Related: The College Student Who Decoded the Data Hidden in Inca Knots. | http://khipukamayuq.fas.harvard.edu/ProjectDescription.html http://khipukamayuq.fas.harvard.edu/DataTables.html https://www.atlasobscura.com/articles/khipus-inca-empire-harvard-university-colonialism | |
502 | 2018.01.10 | 1 | Offshore drilling. | The Bureau of Ocean Energy Management and the Bureau of Safety and Environmental Enforcement — two of the agencies that replaced the troubled U.S. Minerals Management Service in the wake of the Deepwater Horizon spill — publish a few dozen bulk datasets related to their oversight of offshore drilling operations. Among them: lease owners, production metrics, company details, pipeline permits and locations, incident investigations, and platform structures. Related: “American Idle: Decommissioning costs sink offshore drillers into latest crisis,” a 2017 Debtwire investigation that used the platform data. [h/t Alex Plough] | https://www.boem.gov/ https://www.bsee.gov/ https://www.data.bsee.gov/Main/RawData.aspx https://www.data.bsee.gov/Leasing/LeaseOwner/Default.aspx https://www.data.bsee.gov/Production/ProductionData/Default.aspx https://www.data.bsee.gov/Company/CompanyDetail/Default.aspx https://www.data.bsee.gov/Pipeline/PipelinePermits/Default.aspx https://www.data.bsee.gov/Pipeline/PipelineLocation/Default.aspx https://www.data.bsee.gov/Other/DataTables/IncidentInvestigations.aspx https://www.data.bsee.gov/Platform/PlatformStructures/Default.aspx http://investigations.debtwire.com/american-idle-decommissioning-costs-sink-offshore-drillers-into-latest-crisis/ | https://twitter.com/newshack |
503 | 2018.01.10 | 2 | Local health metrics. | The CDC’s 500 Cities Project provides “city and census tract-level data, obtained using small area estimation methods, for 27 chronic disease measures for the 500 largest American cities.” The metrics range from cancer prevalence to binge drinking to dental health to undersleeping. The latest data release was published in December and covers more than 28,000 Census tracts. [h/t Kate Rabinowitz] | https://www.cdc.gov/500cities/index.htm https://www.cdc.gov/500cities/about.htm https://chronicdata.cdc.gov/500-Cities/500-Cities-Local-Data-for-Better-Health-2017-relea/6vp6-wxuq | https://www.datalensdc.com/healthWealthGap.html |
504 | 2018.01.10 | 3 | 130+ years of prosecutor politicians. | With the help of research assistants, legal historian Jed Shugerman has compiled a “tentative database” of prosecutor politicians — presidents, Supreme Court justices, circuit court justices, governors, state attorneys general, and senators who served as prosecutors earlier in their careers. Shugerman’s spreadsheet goes back to 1880 and lists the dates served in office, political party, other offices held, and “relevant prosecutorial background” for each politician. [h/t Geoff Hing] | https://www.fordham.edu/info/23180/jed_shugerman https://shugerblog.com/2017/07/07/the-rise-of-the-prosecutor-politicians-database-of-prosecutorial-experience-for-justices-circuit-judges-governors-ags-and-senators-1880-2017/ | https://twitter.com/geoffhing |
505 | 2018.01.10 | 4 | Snow depth. | The National Water and Climate Center maintains a series of interactive snow maps. Their snow depth map is based on data from nearly one thousand monitoring stations around the country — mostly in western states, but also a handful in the Southwest, Northeast, and Midwest. To download data from a map, click on “Selected Stations” in the top-left corner, and then click “Export Data as CSV.” [h/t Charlie Loyd's collection of "near-realtime Earth observation resources" + Noah Veltman] | https://www.wcc.nrcs.usda.gov/ https://www.wcc.nrcs.usda.gov/snow/snow_map.html https://www.wcc.nrcs.usda.gov/webmap/#version=80.1&elements=W,R&networks=!&states=!&counties=!&hucs=&minElevation=&maxElevation=&elementSelectType=all&activeOnly=true&activeForecastPointsOnly=false&hucLabels=false&hucParameterLabels=false&stationLabels=&overlays=&hucOverlays=&mode=data&openSections=dataElement,parameter,date,basin,elements,location,networks&controlsOpen=true&popup=&popupMulti=&base=esriNgwm&displayType=station&basinType=6&dataElement=SNWD&parameter=OBS&frequency=DAILY&duration=I&customDuration=&dayPart=E&year=2018&month=1&day=7&monthPart=E&forecastPubMonth=1&forecastPubDay=1&forecastExceedance=50&seqColor=1&divColor=3&scaleType=D&scaleMin=&scaleMax=&referencePeriodType=POR&referenceBegin=1981&referenceEnd=2010&minimumYears=20&hucAssociations=true&lat=51.51&lon=-100.11&zoom=4.0 | https://planet.parts/ https://noahveltman.com/ |
506 | 2018.01.10 | 5 | San Diego burritos. | Scott Cole is a neuroscience PhD student at UC San Diego who, in his spare time, is leading a project to rate the region’s burritos on a 10-dimensional scale. | https://srcole.github.io/about/ https://srcole.github.io/100burritos/ | |
507 | 2018.01.17 | 1 | Three million NYC marriage licenses, reclaimed. | Reclaim The Records launched in 2015 and became a 501(c)(3) non-profit last year. Its mission: To “identify important genealogical records sets that ought to be in the public domain but which are being wrongly restricted by government archives, libraries, and agencies.” The organization files freedom-of-information requests and lawsuits to get the data, and “then we digitize everything we win and put it all online for free, without any paywalls or usage restrictions, so that it can never be locked up again.” Most of the records they’ve received so far have arrived as PDFs or microfilm. But a 2016 court settlement with the NYC City Clerk’s Office netted the group — and the public — a dataset of 3 million NYC marriage licenses from 1950 to 1995. | https://www.reclaimtherecords.org/ https://www.reclaimtherecords.org/records-request/2/ https://www.nycmarriageindex.com/ | |
508 | 2018.01.17 | 2 | Three decades of immigration policies. | The Immigration Policies in Comparison (IMPIC) project has quantified the immigration regulations of 33 OECD countries between 1980 and 2010. The project, led by political sociologist Marc Helbling, dives deeply into the regulations related to four policy areas: labor migration, family reunification, asylum/refugees, and “co-ethnics.” You can find the dataset’s detailed codebook and methodology in this PDF. Related: Helbling's summary of the project’s goals, approach, and initial findings (Migration Data Portal). [h/t David Brady] | http://www.impic-project.eu/ http://www.impic-project.eu/data/ http://www.impic-project.eu/people/ https://bibliothek.wzb.eu/pdf/2016/vi16-201.pdf http://migrationdataportal.org/blog/impic-new-and-more-comprehensive-way-measure-immigration-policies | https://twitter.com/DaveBrady72/status/943872675943821312 |
509 | 2018.01.17 | 3 | London air pollution. | The London Air Quality Network, run by researchers at King's College London, gathers data on levels of nitrogen dioxide, ozone, fine particulate matter, and other pollutants from more than 100 monitoring sites. You can download the data as CSV files (for up to six metric and site combinations at a time) or fetch JSON and XML data from the site’s API. Related: “London air pollution live data – where will be first to break legal limits in 2018?” (The Guardian). Previously: Air quality data from the EPA (DIP 2017.10.04), OpenAQ (DIP 2017.03.29), Berkeley Earth (DIP 2017.03.22), and the World Health Organization (DIP 2016.06.15). [h/t Gavin Freeguard] | https://www.londonair.org.uk/LondonAir/Default.aspx https://www.londonair.org.uk/london/asp/datadownload.asp https://www.londonair.org.uk/LondonAir/API/ https://www.theguardian.com/environment/ng-interactive/2018/jan/01/london-air-pollution-live-data-where-will-be-first-to-break-legal-limits-in-2018 https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-10-04-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-29-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-03-22-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-06-15-edition | https://us9.campaign-archive.com/?u=9fac8728699163e1b6adbdbeb&id=e96c45d25e |
510 | 2018.01.17 | 4 | Psychometric tests. | The Open Source Psychometrics Project “provides a collection of interactive personality tests with detailed results that can be taken for personal entertainment or to learn more about personality assessment.” You can download results from more than 30 such tests, including the Big Five Personality Test, the Kentucky Inventory of Mindfulness Skills, and Bob Altemeyer's Right-wing Authoritarianism Scale. Related: “Most Personality Quizzes Are Junk Science. I Found One That Isn’t” (FiveThirtyEight). [h/t Chris Zioutas] | https://openpsychometrics.org/ https://openpsychometrics.org/_rawdata/ https://fivethirtyeight.com/features/most-personality-quizzes-are-junk-science-i-found-one-that-isnt/ | https://www.datacircle.io/metric/raw-data-from-online-personality-tests/54a06c7e-0171-4e0c-42f7-08d5495b5656/ |
511 | 2018.01.17 | 5 | The Ghibliverse. | The unofficial Studio Ghibli API contains structured information about the famed Japanese animation studio’s films (e.g., Princess Mononoke and Spirited Away), plus the characters, locations, and vehicles featured in them. You can also download a single file containing all the data. | https://ghibliapi.herokuapp.com/ https://github.com/janaipakos/ghibliapi/blob/master/data.json | |
512 | 2018.01.24 | 1 | Global trade dynamics. | The Atlas of Economic Complexity has collected decades of import/export data from the United Nations Comtrade database, and then applied “a unique method to clean the data to account for inconsistent reporting practices.” You can download the raw data, learn more about the cleaning process in the FAQ, explore current and historical trade flows, and browse the Atlas’s rankings of countries by “economic complexity.” Related: The researchers have also created regionally-detailed economic atlases of Mexico and Colombia. [h/t Annie White] | http://atlas.cid.harvard.edu/ https://comtrade.un.org/ https://intl-atlas-downloads.s3.amazonaws.com/index.html http://atlas.cid.harvard.edu/learn/faq http://atlas.cid.harvard.edu/explore/ http://atlas.cid.harvard.edu/rankings/ http://complejidad.datos.gob.mx/ http://datlascolombia.com/ | https://twitter.com/anniewhite |
513 | 2018.01.24 | 2 | Financial well-being. | The Consumer Financial Protection Bureau’s National Financial Well-Being Survey collected more than 6,000 responses to the agency’s 10-question Financial Well-Being Scale, plus additional demographic and financial information. The survey results, which were collected in late 2016, come with a detailed methodology and data dictionary. Plus: You can take the questionnaire yourself, anonymously. [h/t Amy Cesal] | https://www.consumerfinance.gov/data-research/financial-well-being-survey-data/ https://www.consumerfinance.gov/consumer-tools/financial-well-being/ https://s3.amazonaws.com/files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-user-guide.pdf https://www.consumerfinance.gov/consumer-tools/financial-well-being/ | https://twitter.com/amycesal |
514 | 2018.01.24 | 3 | How long does it take to get to the nearest city? | A team led by researchers at the University of Oxford’s Malaria Atlas Project have estimated the time it would take (as of 2015) to get from any square kilometer in the world to the nearest city of 50,000+ people. The analysis, which improves upon a similar effort from 15 years earlier, benefits from “the first-ever, global-scale synthesis of two leading roads datasets – Open Street Map (OSM) data and distance-to-roads data derived from the Google roads database.” You can download the data as a GeoTIFF, or explore the map online. [h/t Data & Eggs] | https://map.ox.ac.uk/research-project/accessibility_to_cities/ http://forobs.jrc.ec.europa.eu/products/gam/index.php https://roadlessforest.eu/map.html | http://thedataface.com/data-and-eggs/volume-36 |
515 | 2018.01.24 | 4 | Duke’s Greeks. | Using a range of public sources, The Duke Chronicle collected data on all 1,739 students listed in the Class of 2018’s “Freshman Picture Book” — including their hometowns, details about their high schools, whether they won a merit scholarship, and whether they play on a sports team — in order to analyze “trends between those who do and don't join Greek life at Duke.” Related: “Is Greek life at Duke as homogenous as you think?,” the first story in the Chronicle’s multipart series based on the data. [h/t Gautam Hathi] | http://www.dukechronicle.com/article/2018/01/editors-note-regarding-recent-article-on-greek-life https://github.com/Chrissymbeck/Greek-Life-Demographics http://www.dukechronicle.com/article/2018/01/is-greek-life-at-duke-as-homogenous-as-you-think | https://twitter.com/gautamhathi/status/954853781589581826 |
516 | 2018.01.24 | 5 | Ramen ratings. | Hans Lienesch calls himself The Ramen Rater, and (as his website’s banner declares) he’s been “Celebrating the Instant Noodle for 15 Years.” Over that time, he’s amassed a spreadsheet of more than 2,600 ratings. [h/t dreyco] | https://www.theramenrater.com/about-2-2/about-2/ https://www.theramenrater.com/resources-2/the-list/ | https://www.reddit.com/r/datasets/comments/7phv58/2682_instant_noodle_ratings/ |
517 | 2018.02.07 | 1 | Hollywood’s romantic age gaps. | Lynn Fisher’s Hollywood Age Gap collects data on silver screen love interests — more than 880 so far, from more than 630 movies — and then calculates the difference in those actors’ ages. The largest gap so far is the 52-year age difference in Harold and Maude. The movie with the most pairings is Love Actually, with seven. You can download the data as JSON and CSV files from the project’s GitHub page. [h/t Julia Smith] | https://lynnandtonic.com/ https://hollywoodagegap.com/ https://github.com/lynnandtonic/hollywood-age-gap | https://us1.campaign-archive.com/?u=81670c9d1b5fbeba1c29f2865&id=3a8cef6b3c |
518 | 2018.02.07 | 2 | Terrorism incidents. | The Global Terrorism Database, run by a University of Maryland–based consortium, is an “open-source database” of more than 170,000 terrorist events. The database, which currently covers 1970 through 2016, is well-documented and includes information about the attackers, locations, weapons, victims, and more. Note: To download the data, you first need to accept an end-user license agreement. Previously: Profiles of Individual Radicalization in the United States, from the same consortium (DIP 2017.05.24). [h/t Brian C. Keegan] | http://www.start.umd.edu/gtd/ https://www.start.umd.edu/gtd/using-gtd/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-24-edition | http://www.brianckeegan.com/about/ |
519 | 2018.02.07 | 3 | Invasive species. | The Global Register of Introduced and Invasive Species combines data and observations from thousands of sources to create a standardized database of such species in more than 200 countries. The register can be explored by kingdom (plants, animals, fungi, etc.), ecosystem, and country. Each slice of data can be downloaded as a CSV. Related: In a Scientific Data paper published last month, the researchers behind the effort described their methodology in detail. | http://www.griis.org/ https://www.nature.com/articles/sdata2017202 | |
520 | 2018.02.07 | 4 | Federal Reserve forecasts. | Before each meeting of the Federal Open Market Committee, the Federal Reserve’s research staff prepares a set of economic projections known as the Greenbook. Those forecasts are kept secret for five years, and then released to the public. The Philadelphia Fed’s archive of public Greenbook data dates back to 1966, and contains both PDFs and structured data files. | https://en.wikipedia.org/wiki/Greenbook https://www.philadelphiafed.org/research-and-data/real-time-center/greenbook-data/ | |
521 | 2018.02.07 | 5 | Donkey Kong. | KongTrackr hosts detailed stats about specific games played on the beloved arcade fixture, with a focus on record-setting scores. The website’s database, which can be downloaded as a single JSON file, currently includes 1,715 games by 450 players. Related: KongTrackr played a role in some recent high-score commotion. Also related: KongTrackr says its site is “heavily influenced” by this database of StarCraft 2 results. | http://kongtrackr.herokuapp.com http://kongtrackr.herokuapp.com/#/exportDatabase https://arstechnica.com/gaming/2018/02/donkey-kong-scoreboard-strips-billy-mitchells-high-score-claims/ http://aligulac.com/ | |
522 | 2018.02.14 | 1 | Nepal, post-earthquake. | In April 2015, the Gorkha Earthquake killed more than 8,000 people in Nepal, and destroyed hundreds of thousands of homes. In early 2016, a team led by the not-for-profit Kathmandu Living Labs, in collaboration with Nepal’s government, undertook “a massive household survey using mobile technology to assess building damage in the earthquake-affected districts.” The responses to that survey are now available at the 2015 Nepal Earthquake Open Data Portal; you can explore the data online or download it in bulk. In all, the datasets include details on millions of individuals, plus information about each surveyed household and building. [h/t Reddit user “phishfart”] | http://www.kathmandulivinglabs.org/ http://admin.myrepublica.com/the-week/story/43132/banking-on-data.html https://opendata.klldev.org/#/about https://opendata.klldev.org/#/explore https://opendata.klldev.org/#/download | https://www.reddit.com/r/datasets/comments/7v4qrt/2015_nepal_earthquake_open_data_portal/ |
523 | 2018.02.14 | 2 | Historical congressional results, historical boundaries. | Through the Constituency-Level Elections Archive (DIP 2016.09.28) and other sources, you can get historical election results for the U.S. Congress. And through the work of Jeffrey B. Lewis et al., you can get data describing the historical boundaries of each congressional district. In a Scientific Data article published last year, quantitative geographer Levi John Wolf presented a dataset that brings the two types of information together, so that all congressional election results from 1896 to 2014 are “explicitly linked to the geospatial data about the districts themselves.” | http://www.electiondataarchive.org/index.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-09-28-edition http://cdmaps.polisci.ucla.edu/ https://www.nature.com/articles/sdata2017108 http://ljwolf.org/about https://osf.io/mjvkb/ | |
524 | 2018.02.14 | 3 | Human speech. | Common Voice is a Mozilla-led project that aims “to make voice recognition technology easily accessible to everyone.” To that end, the project asks visitors to record themselves speaking specific sentences, and to validate the recordings of other users. The whole dataset is available to download and currently clocks in at 12 gigabytes, compressed. (Bonus: That download page also links to other freely available voice datasets.) Related: The project’s FAQ. | https://voice.mozilla.org/ https://voice.mozilla.org/data https://voice.mozilla.org/faq | |
525 | 2018.02.14 | 4 | UK fire stats. | The United Kingdom’s Home Office publishes dozens of fire-safety related datasets, including aggregate statistics on response times, smoke alarms, and fire department staffing; incident-level data on appliance fires, vehicle fires, and fatalities; and much more. Of the 100,000+ domestic appliance fires reported over a six-year span, 52% were believed to have been caused by a “cooker incl. oven,” 11% by a “grill/toaster,” 2% by dishwashers, and just over 1% by deep-fat fryers. Semi-related: Jamie Oliver’s Bad Cheese Idea Is Still Starting Toaster Fires. [h/t Owen Boswarva] | https://www.gov.uk/government/statistical-data-sets/fire-statistics-data-tables http://nymag.com/selectall/2015/11/jamie-olivers-bad-idea-is-still-starting-fires.html | https://twitter.com/owenboswarva/status/961884995957731328 |
526 | 2018.02.14 | 5 | Imported bats. | Via a Freedom of Information Act request to the Fish and Wildlife Service, Newsweek reporter Kristin Hugo obtained a spreadsheet listing all imports of bats — vampire, fruit, yellow-shouldered, leaf-nosed, and more — to the United States between January 2016 and October 2017. | http://www.strangebio.com/post/170249749601/conservationists-to-americans-please-stop-buying https://docs.google.com/spreadsheets/d/1gtfdz6cqoenEsuauOPDy8mKJFernuy2CFde89ukPNtE/edit | |
527 | 2018.02.21 | 1 | Rohingya refugees. | The Humanitarian Data Exchange has collated dozens of datasets related to the Rohingya refugee crisis. Among them: the geographic boundaries of Rohingya refugee settlements in Bangladesh, the numbers of refugees living in those settlements, and the infrastructure available there. | https://data.humdata.org/event/rohingya-displacement https://data.humdata.org/dataset/outline-of-camps-sites-of-rohingya-refugees-in-cox-s-bazar-bangladesh https://data.humdata.org/dataset/site-location-of-rohingya-refugees-in-cox-s-bazar https://data.humdata.org/dataset/cox-s-bazar-refugee-settlement-infrastructure | |
528 | 2018.02.21 | 2 | U.S. Treasury sanctions. | Through its Office of Foreign Assets Control, the Treasury publishes several datasets that describe the people and companies subject to U.S. economic sanctions. The two main listings are the Specially Designated Nationals and Blocked Persons (“SDN”) and the Consolidated Sanctions List. Those contain only currently-sanctioned entities, but the Treasury also publishes (semi-structured) documents describing historical additions and removals. Related: Enigma Public’s Sanctions Tracker. [h/t Jennifer Roscoe] | https://www.treasury.gov/resource-center/sanctions/Pages/default.aspx https://www.treasury.gov/resource-center/sanctions/SDN-List/Pages/Other-OFAC-Sanctions-Lists.aspx https://www.treasury.gov/resource-center/sanctions/SDN-List/Pages/consolidated.aspx https://labs.enigma.com/sanctions-tracker/ | https://twitter.com/jenniferroscoe |
529 | 2018.02.21 | 3 | Happy moments. | HappyDB is “a corpus of 100,000 crowd-sourced happy moments.” An example: “My son gave me a big hug in the morning when I woke him up.” The researchers, who recently described their efforts in an academic paper, collected the sentiments from Mechanical Turk workers, who also supplied basic demographic information, such as age, gender, and whether they have children. [h/t Marcel Weiher] | https://rit-public.github.io/HappyDB/ https://arxiv.org/abs/1801.07746 | https://news.ycombinator.com/item?id=16381964 |
530 | 2018.02.21 | 4 | Seven years of GitHub activity. | The GitHub Archive is an effort to record the popular code-sharing website’s public timeline, “archive it, and make it easily accessible for further analysis.” The dataset, which includes more than 20 types of events and often contains more than 1 million events per day, goes back to February 2011. Related: Structured data representing the “commit histories” of two dozen popular open-source projects, including Rust, Pandas, Redis, and Bitcoin. | https://www.githubarchive.org/ http://developer.github.com/v3/activity/events/types/ https://github.com/gitential/datasets | |
531 | 2018.02.21 | 5 | Public commemorations. | The Open Plaques project is dedicated to “documenting the historical links between people and places as recorded by commemorative plaques.” The latest data dump contains nearly 40,000 plaques — the vast majority in the U.S., U.K., and Germany. OpenBenches, meanwhile, has collected similar data for 4,300+ memorial benches. [h/t Jason Norwood-Young] | http://openplaques.org/ http://openplaques.org/data https://openbenches.org/ https://github.com/edent/openbenches | https://us8.campaign-archive.com/?u=11977a67604b965526b63ee6e&id=50af5bb880 |
532 | 2018.02.28 | 1 | Peace agreements. | The PA-X Peace Agreements Database contains structured information about 1,500+ “formal, publicly-available documents” that address “conflict with a view to ending it.” The database covers more than 140 peace processes between 1990 and 2015, and each agreement has been coded for more than 200 variables — for instance, whether the agreement contains provisions about religious groups. [h/t Melissa Terras] | https://www.peaceagreements.org/ https://www.peaceagreements.org/files/PA_X_codebook_Version1_Feb_20_20.pdf | https://twitter.com/melissaterras/status/967344418146734080 |
533 | 2018.02.28 | 2 | Historical battles. | Political scientist Jeffrey Arnold has converted the U.S. Army Concepts Analysis Agency (CAA) Database of Battles from a series of Lotus 1-2-3 worksheets into tidier, easier-to-use CSV files. The dataset includes details of 660 battles — associated with several dozen wars — between 1600 and the mid/late-1900s. The fields indicate each battle’s “name, date, and location; the strengths and losses on each side; identification of the victor; temporal duration of the battle,” and more. | http://www.jrnold.me/ https://github.com/jrnold/CDB90/blob/master/src-data/M000121/README.TXT https://github.com/jrnold/CDB90 | |
534 | 2018.02.28 | 3 | American radiation. | The Environmental Protection Agency’s RadNet system “monitors the nation's air, precipitation and drinking water for radiation.” The radiation measurements, collected from 130+ stations in all 50 states plus the District of Columbia and Puerto Rico, are available on a “near-real-time” basis. Related: Randall Munroe’s radiation dose chart. Previously: SafeCast (DIP 2016.02.03). [h/t Stanislav Kralin] | https://www.epa.gov/radnet https://www.epa.gov/radnet/near-real-time-and-laboratory-data-state https://xkcd.com/radiation/ https://blog.safecast.org/about/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-02-03-edition | https://opendata.stackexchange.com/questions/10678/us-radiation-measurements |
535 | 2018.02.28 | 4 | Fellow mammals. | The American Society of Mammalogists’ Mammal Diversity Database “is home base for tracking the latest taxonomic changes to species and higher groups of mammals.” Currently, it contains more than 1,300 genera and 6,000 total species. Fun facts: The impala is the only member of the genus Aepyceros, and the name “Schmidly's deer mouse” can refer to either of two species in two entirely different genera. [h/t Himanshu Goenka] | https://mammaldiversity.org/ https://en.wikipedia.org/wiki/Impala | http://www.ibtimes.com/mammal-biodiversity-20-larger-previously-thought-new-database-find-2650075 |
536 | 2018.02.28 | 5 | Rick and Morty. | The (unofficial) Rick and Morty API provides data on 390+ characters, 60+ locations, and all 31 episodes of the science-fictional animated series. | https://rickandmortyapi.com/ https://rickandmortyapi.com/documentation#character https://rickandmortyapi.com/documentation#location https://rickandmortyapi.com/documentation#episode https://en.wikipedia.org/wiki/Rick_and_Morty | |
537 | 2018.03.07 | 1 | The grid. | The U.S. Energy Information Administration publishes near-real-time data on the Lower 48’s electrical grid. The datasets include net electricity generation, flows in and out of the country’s various “balancing authorities,” regional demand, and forecasts of demand. You can explore the data online, access it through the EIA’s API, or download it in bulk. Helpful: The EIA’s guide to the data and “known issues”. | https://www.eia.gov/realtime_grid/ https://www.eia.gov/opendata/qb.php?category=2123635 https://www.eia.gov/opendata/bulkfiles.php https://www.eia.gov/realtime_grid/docs/UserGuideAndKnownIssues.pdf | |
538 | 2018.03.07 | 2 | Pan-African surveys. | Afrobarometer “is a pan-African, non-partisan research network that conducts public attitude surveys on democracy, governance, economic conditions, and related issues in more than 35 countries in Africa.” You can download data from the first six rounds of surveys, conducted between 1999 and 2015. You can also read the detailed questionnaires and explore the results online. Note: To download the data, you’ll need to create a (free) account on the website. [h/t Jeffrey Arnold] | http://www.afrobarometer.org/about http://www.afrobarometer.org/data/merged-data http://www.afrobarometer.org/surveys-and-methods/questionnaires http://afrobarometer.org/online-data-analysis/getting-started | https://github.com/jrnold/afrobarometer |
539 | 2018.03.07 | 3 | More brain scans. | Last year, the Stanford Center for Reproducible Neuroscience launched OpenNeuro, “a free and open platform for analyzing and sharing neuroimaging data.” (It’s the successor to the center’s earlier initiative, OpenfMRI.) You can, for instance, download scans of brains that were watching a particular episode of The Twilight Zone. Related: The Brain Imaging Data Structure, “a simple and intuitive way to organize and describe your neuroimaging and behavioral data.” Previously: The Open Access Series of Imaging Studies (DIP 2017.08.16). [h/t Laura Noren and Brad Stenger] | http://reproducibility.stanford.edu/about-us/ https://openneuro.org/faq http://reproducibility.stanford.edu/openfmri-becomes-openneuro/ https://openfmri.org/ https://openneuro.org/datasets/ds001145/versions/00001 http://bids.neuroimaging.io/ http://www.oasis-brains.org/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-08-16-edition | https://cds.nyu.edu/newsletter/ |
540 | 2018.03.07 | 4 | The Gray Lady of 19th century Havana. | The University of Miami Libraries has digitized 53,000+ pages of La Gaceta de La Habana, “the paper of record during the Spanish colonial occupation of Cuba in the nineteenth century.” The digitized editions span 33 of the years between 1849 and 1897. Previously: Historical U.S. newspapers (DIP 2017.08.16). [h/t Mike Stucka + Heather Froehlich] | https://github.com/UMiamiLibraries/collections-as-data/tree/master/LaGaceta http://merrick.library.miami.edu/cubanHeritage/cubanlaw/lagaceta.php https://chroniclingamerica.loc.gov/about/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-08-16-edition | https://twitter.com/MikeStucka https://twitter.com/heatherfro/status/969658558206873600 |
541 | 2018.03.07 | 5 | Powerlifting. | OpenPowerlifting.org “aims to create a permanent, accurate, convenient, accessible, open archive of the world's powerlifting data. In support of this mission, all of the OpenPowerlifting data and code is available for download in useful formats.” So far, that includes 400,000+ performances at 9,000+ competitions in dozens of countries. [h/t u/cavedave] | http://www.openpowerlifting.org/data.html | https://www.reddit.com/r/datasets/comments/7uqn4i/powerlifting_data/ |
542 | 2018.04.04 | 1 | The Western hemisphere. | The GOES-16 satellite was launched into orbit in November 2016, and it’s been collecting near-realtime images and data ever since. (GOES stands for “Geostationary Operational Environmental Satellite.”) It collects data on 16 different spectral bands, and it can capture a full image of the Western Hemisphere every 15 minutes, plus “an image of the Continental U.S. every five minutes, and two smaller, more detailed images of areas where storm activity is present, every 60 seconds.” You can browse the images and data online, and also download them as NetCDF files. Related: Washington Post graphics reporter John Muyskens’ list of GOES-16 resources and usage examples. [h/t John Muyskens] | https://en.wikipedia.org/wiki/GOES-16 https://www.goes-r.gov/spacesegment/abi.html https://www.goes-r.gov/multimedia/dataAndImageryImages.html http://edc.occ-data.org/goes16/ https://github.com/jmuyskens/nicar18-data-blitz-goes-16 | https://twitter.com/JohnMuyskens/status/971797053910155265 |
543 | 2018.04.04 | 2 | Gender pay gaps in Great Britain. | The UK government has begun requiring all companies with at least 250 employees in Great Britain (i.e., England, Scotland, and Wales) to report the pay differences between their male and female workers. Today is the official deadline to submit the reports; as of last night, more than 8,800 employers had done so. The reports include the percentage gaps in hourly earnings, differences in bonus pay, and the proportions of male and female employees in each pay quartile. You can search the data online and also download it as a CSV. Related: The Guardian’s series of reports on the data. [h/t Peter Yeung] | https://www.gov.uk/guidance/gender-pay-gap-reporting-overview https://www.gov.uk/guidance/gender-pay-gap-reporting-overview#data-you-must-publish-and-report https://gender-pay-gap.service.gov.uk/Viewing/search-results https://gender-pay-gap.service.gov.uk/Viewing/download https://www.theguardian.com/society/equal-pay | https://us16.campaign-archive.com/?u=088b912cf6976d4efabca7bbc&id=b0accc7150 |
544 | 2018.04.04 | 3 | North Korea negotiations and provocations. | The Center for Strategic and International Studies’ Beyond Parallel project publishes several databases related to North Korean international relations — including 200+ negotiations between the U.S. and DPRK since 1990, and several hundred military provocations since 1958. Related: Los Angeles Times correspondent Matt Stiles’ visual explorations of the provocations data. Previously: The James Martin Center for Nonproliferation Studies’ North Korea Missile Test Database (DIP 2017.05.17). [h/t Matt Stiles] | https://beyondparallel.csis.org/about/ https://beyondparallel.csis.org/databases/ http://thedailyviz.com/tag/provocations/ http://www.nti.org/analysis/articles/cns-north-korea-missile-test-database/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-17-edition | https://twitter.com/stiles/status/971800447815176192 |
545 | 2018.04.04 | 4 | Russian presidential voting. | Software engineer Michael Penkov has scraped the official, polling station–level results for Russia’s recent presidential election, and made the data available as a single JSON file. He’s also published an introductory Python notebook, which explains the data structure and provides English translations for the Russian field names. | http://michael.penkov.id.au/about/ https://github.com/mpenkov/prezident2018 https://github.com/mpenkov/prezident2018/blob/master/scrapyproject/results.json.gz https://github.com/mpenkov/prezident2018/blob/master/Introduction.ipynb | |
546 | 2018.04.04 | 5 | Even more dog (and cat) names. | Last year, Data Is Plural pointed readers to dog registration data for NYC, Tacoma, and Edmonton. It turns out that the government of Zurich also publishes local dog registrations, including each canine’s name, gender, and birth year. And the Sunshine Coast Council, in Australia, publishes a spreadsheet of both dogs and cats, their primary breeds and colors, and whether they’ve been spayed/neutered. [h/t Open Data Institute] | https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-31-edition https://www.europeandataportal.eu/data/en/dataset/https-data-stadt-zuerich-ch-dataset-pd_stapo_hundenamen https://data.sunshinecoast.qld.gov.au/Administration/Registered-Animals/7f87-i6kx/data | https://theodi.org/article/the-open-data-olympics-seven-weird-and-wonderful-open-datasets |
547 | 2018.04.18 | 1 | Evictions. | A team led by Princeton sociologist and Evicted author Matthew Desmond has compiled the United States’ first-ever national-scale, publicly-available database of eviction metrics. Desmond’s Eviction Lab has collected more than 80 million records from cities, counties, and states across the country, and used them to calculate the number of evictions and eviction filings in each place. (Short methodology here; longer methodology here.) You can download the aggregate data in bulk (after supplying your email address) and explore it through an interactive map. Related: “In 83 Million Eviction Records, a Sweeping and Intimate New Look at Housing in America” (The New York Times), which includes additional background and graphics. | https://www.penguinrandomhouse.com/books/247816/evicted-by-matthew-desmond/9780553447453/ https://evictionlab.org/map/ https://evictionlab.org/about/ https://evictionlab.org/methods/ https://evictionlab.org/docs/Eviction%20Lab%20-Methodology%20Report%20v.1.0.0.pdf https://evictionlab.org/get-the-data/ https://evictionlab.org/map/ https://www.nytimes.com/interactive/2018/04/07/upshot/millions-of-eviction-records-a-sweeping-new-look-at-housing-in-america.html | |
548 | 2018.04.18 | 2 | Executive orders. | The U.S. Office of the Federal Register publishes structured data on every presidential executive order since 1994. For each of the 886 entries, the dataset provides the order’s title, the date it was signed, the president who signed it, and where to find it in the Federal Register. [h/t u/cavedave] | https://www.federalregister.gov/executive-orders | https://www.reddit.com/r/datasets/comments/86cbdb/us_president_executive_orders/ |
549 | 2018.04.18 | 3 | State campaign finance laws. | The nonpartisan Campaign Finance Institute has launched a database of current and historical state campaign finance laws. The information goes back to 1996 and describes each state’s contribution limits, various kinds of prohibitions, disclosure rules, and more. You can download the full dataset or explore it online. [h/t Rachel Shorey] | http://cfinst.org/about.aspx http://cfinst.org/State/LawsDatabase.aspx http://cfinst.org/State/LawsDatabase_Download.aspx https://cfinst.github.io/ | https://twitter.com/rachel_shorey/status/981203851243122689 |
550 | 2018.04.18 | 4 | Academic parental leave policies. | Researchers at the University of Colorado at Boulder and the Santa Fe Institute have compiled a dataset of 200+ universities’ parental leave policies. For each institution, the dataset indicates the amount of paid leave granted to/taken by both women and men, and what type of leave it is (e.g., relief from teaching, from all duties, et cetera). [h/t Sam Way] | https://aaronclauset.github.io/parental-leave/ | https://twitter.com/samfway/status/984199839473807362 |
551 | 2018.04.18 | 5 | Miscellany. | The University of Florida’s Larry Winner has collected hundreds of “miscellaneous” datasets, many from niche academic studies. A few highlights: “Antiseptic as Treatment for Amputation – Upper Limb” (from an 1870 study), “Sex, Lies, and Religiosity” (1971), and “Reading Times by E-Reader Device and Lighting Conditions” (2013). [h/t Charles Minshew] | http://users.stat.ufl.edu/~winner/ http://users.stat.ufl.edu/~winner/datasets.html | https://twitter.com/charlesminshew |
552 | 2018.04.25 | 1 | School shootings. | Over the past year, reporters at the Washington Post “attempted to identify every act of gunfire at a primary or secondary school during school hours since the Columbine High massacre on April 20, 1999.” Using a range of sources, the reporters “reviewed more than 1,000 alleged incidents, but counted only those that happened on campuses immediately before, during or just after classes.” The resulting database, published last week, currently contains more than 200 incidents and can be downloaded as a CSV. For each shooting, the database includes details about the location, timing, circumstances, shooter, casualties, and the school’s students. [h/t The INN Nerds] | https://www.washingtonpost.com/graphics/2018/local/school-shootings-database/ https://github.com/washingtonpost/data-school-shootings | https://mailchi.mp/inn/were-having-a-texas-tea-party |
553 | 2018.04.25 | 2 | America’s roads. | The federal Highway Performance Monitoring System “includes inventory information for all of the Nation's public roads as certified by the States’ Governors annually.” And it’s not just highways: “All roads open to public travel are reported in HPMS regardless of ownership, including Federal, State, county, city, and privately owned roads such as toll facilities.” Shapefiles representing the HPMS data are available for 2011–2015. For each segment of road, the dataset indicates the average daily traffic, number of turn lanes, surface type, and dozens of other variables. Related: America’s Quietest Routes, which uses the data. | https://www.fhwa.dot.gov/policyinformation/hpms/fieldmanual/page01.cfm https://www.fhwa.dot.gov/policyinformation/hpms/shapefiles.cfm https://www.fhwa.dot.gov/policyinformation/hpms/fieldmanual/page00.cfm https://www.geotab.com/americas-quietest-routes/ | |
554 | 2018.04.25 | 3 | Wind turbines. | Lawrence Berkeley National Laboratory, the U.S. Geological Survey, and the American Wind Energy Association have partnered to publish the U.S. Wind Turbine Database. The dataset, which the government says will be “continuously updated,” currently contains 57,636 turbines and includes each turbine’s location, development project, manufacturer, model, height, rotor diameter, and other characteristics. You can download the data in several formats, and also explore it on an interactive map. [h/t Ed Vine] | https://eerscmap.usgs.gov/uswtdb/ https://eerscmap.usgs.gov/uswtdb/data/ https://eerscmap.usgs.gov/uswtdb/viewer/ | https://www.linkedin.com/in/ed-vine-a480347 |
555 | 2018.04.25 | 4 | A decade of New York Times front-page stories. | For her 2013 book, Making the News: Politics, the Media, and Agenda Setting, UC Davis professor Amber E. Boydstun oversaw the compilation of a dataset of every front-page article in the New York Times from 1996 to 2006. Each of the 31,034 articles has been categorized by topic, according to a detailed codebook, and given a short summary. Related: The Comparative Agendas Project's list of datasets that use its topic-classification system, including Boydstun’s data. Also related: The NYT’s APIs. [h/t Cornelius Puschmann] | http://press.uchicago.edu/ucp/books/book/chicago/M/bo16382220.html http://www.amber-boydstun.com/supplementary-information-for-making-the-news.html http://www.amber-boydstun.com/uploads/1/0/6/5/106535199/nyt_front_page_policy_agendas_codebook.pdf https://www.comparativeagendas.net/pages/About https://www.comparativeagendas.net/datasets_codebooks https://developer.nytimes.com/ | https://twitter.com/cbpuschmann/status/972233548433383425 |
556 | 2018.04.25 | 5 | Cooks in the kitchen. | Computer-vision researchers convinced 32 participants (of 10 nationalities, living in 4 cities) to record everything they did in their kitchens for three days using a head-mounted camera. Later, the participants narrated what they had been doing. Taken together, the EPIC-Kitchens dataset includes 55 hours of video, nearly 40,000 narration segments, and more. [h/t Duncan Geere] | https://arxiv.org/abs/1804.02748 https://epic-kitchens.github.io/2018 | https://tinyletter.com/duncangeere |
557 | 2018.05.09 | 1 | Clinical trials. | OpenTrials, a collaboration between Open Knowledge International and Oxford University’s Ben Goldacre, “aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally.” The project’s “public beta” brings together data from several of the world’s largest clinical trial registries — including the United States’ ClinicalTrials.gov, the European Union Clinical Trials Register, and the WHO’s International Clinical Trials Registry Platform — and other related sources. You can explore the data through an online search tool, monthly bulk exports, and an API. | https://opentrials.net/ https://okfn.org/ http://www.badscience.net/about-dr-ben-goldacre/ https://clinicaltrials.gov/ https://www.clinicaltrialsregister.eu/ctr-search/search http://www.who.int/ictrp/en/ https://explorer.opentrials.net/ https://explorer.opentrials.net/data https://opentrials.net/2016/10/18/opentrials-api-information/ | |
558 | 2018.05.09 | 2 | Medical examiner reports. | Cook County, Illinois, publishes data on all deaths reported to its medical examiner — 20,000+ deaths since August 2014, and updated daily. (FYI: “Not all deaths that occur in Cook County are reported to the Medical Examiner or fall under the jurisdiction of the Medical Examiner.”) Connecticut’s Office of the Chief Medical Examiner has published data on all accidental drug deaths reported between 2012 and 2017. The Dallas Morning News’ Dana Amihere obtained autopsy data from the Dallas County medical examiner's office, and NJ Advance Media’s Stephen Stirling obtained data on “all cases referred to the NJ Medical Examiner system from 1996 to 2016.” [Correction, 2018-05-09: The original version of this item misspelled Stephen Stirling's name. Data Is Plural regrets the error.] | https://datacatalog.cookcountyil.gov/Public-Safety/Medical-Examiner-Case-Archive/cjeq-bs86 https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-2012-2017/rybz-nyjw https://github.com/write-this-way/dallas-co-autopsies https://data.world/stevestirling/n-j-medical-examiner-data | |
559 | 2018.05.09 | 3 | Himalayan expeditions. | The Himalayan Database tracks “all expeditions that have climbed in the Nepalese Himalaya.” The hyper-detailed database “is based on the expedition archives of Elizabeth Hawley, a longtime journalist based in Kathmandu, and it is supplemented by information gathered from books, alpine journals and correspondence with Himalayan climbers.” The database — long accessible only on CD, for a fee — is now available to download for free. (The main download is provided as a Microsoft Visual FoxPro database, but the .DBF files within it can be opened using other software, including LibreOffice.) Related: Yuichiro Miura, the oldest person to reach the summit of Mount Everest. [h/t Jacob Bradburn] | http://himalayandatabase.com/index.html http://himalayandatabase.com/downloads.html https://en.wikipedia.org/wiki/Yuichiro_Miura | https://twitter.com/JacobBradburnIO |
560 | 2018.05.09 | 4 | 1.7 billion Milky Way stars. | The European Space Agency’s Gaia spacecraft “has produced the richest star catalogue to date, including high-precision measurements of nearly 1.7 billion stars and revealing previously unseen details of our home Galaxy.” Those measurements, released last month, are available to download. They’ve also been used to create a high-resolution image of all observed stars and to expand the ESA’s interactive space map. Related: This Vox article provides some more context. [h/t u/Kopachris] | http://www.esa.int/Our_Activities/Space_Science/Gaia/Gaia_creates_richest_star_map_of_our_Galaxy_and_beyond https://www.cosmos.esa.int/web/gaia/dr2 https://gea.esac.esa.int/archive/ https://www.esa.int/spaceinimages/Images/2018/04/Gaia_s_sky_in_colour2 http://sky.esa.int/ https://www.vox.com/science-and-health/2018/4/26/17281640/gaia-3d-map-milky-way-sky-map | https://www.reddit.com/r/datasets/comments/8ev47w/astrometric_data_for_17_billion_stars_in_the/ |
561 | 2018.05.09 | 5 | Sidewalk grates. | You know those metallic grates embedded into city sidewalks? D.C.’s Office of the Chief Technology Officer has identified 10,000+ of them in the District. Also: 89,727 curb segments. [h/t Sunlight Open Cities] | http://opendata.dc.gov/datasets/sidewalk-grates http://opendata.dc.gov/datasets/curbs?geometry=-77.04,38.897,-77.024,38.9 | https://twitter.com/SunlightCities/status/992422225667031040 |
562 | 2018.05.23 | 1 | Ebola. | Caitlin Rivers, a computational epidemiologist at the Johns Hopkins Center for Health Security, has started compiling data tracking the current Ebola outbreak in the Democratic Republic of Congo. So far, the datasets are based on case counts and other information from the DRC’s Ministry of Health and the World Health Organization. A series of “data interpretation notes” accompanies each dataset. (Rivers administered a similar data repository during the 2014 Ebola outbreak.) Related: “Most Maps of the New Ebola Outbreak Are Wrong,” by Ed Yong. | http://www.caitlinrivers.com/ https://github.com/cmrivers/ebola_drc https://github.com/cmrivers/ebola https://www.theatlantic.com/health/archive/2018/05/most-maps-of-the-new-ebola-outbreak-are-wrong/560777/ | |
563 | 2018.05.23 | 2 | Public transit, curated. | As a way to “lower the barrier” for analyzing public transportation data, researchers at Finland’s Aalto University have published “a curated collection of [now more than] 25 cities' public transport networks in multiple easy-to-use formats including network edge lists, temporal network event lists, SQLite databases, GeoJSON files, and the GTFS data format.” On the project’s website, you can browse, visualize, and download each city’s data. (The cities are mostly in Europe and Australia, but also include Detroit, Winnipeg, and Antofagasta, Chile.) Previously: TransitLand and TransitFeeds (DIP 2016.07.27). [h/t NYU Data Science Community Newsletter] | https://www.nature.com/articles/sdata201889 http://transportnetworks.cs.aalto.fi/ https://transit.land/ https://transitfeeds.com/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-27-edition | https://cds.nyu.edu/newsletter/ |
564 | 2018.05.23 | 3 | Political resistance campaigns. | The Nonviolent and Violent Campaigns and Outcomes (NAVCO) Data Project, based at the University of Denver, “catalogues major nonviolent and violent resistance campaigns around the globe from 1900-2013.” The project’s initial dataset explored the general characteristics of hundreds of campaigns; follow-up datasets have examined the annual activity and tactics of smaller subsets. Each dataset comes with a detailed codebook. Note: Free registration is required to download the most recent datasets. [h/t Peace Science Digest] | https://www.du.edu/korbel/sie/research/chenow_navco_data.html | https://twitter.com/PeaceSciDigest/status/988888349594173440 |
565 | 2018.05.23 | 4 | Wind. | Earlier this month, the Department of Energy’s National Renewable Energy Laboratory made a big new slice of its Wind Integration National Dataset available online. The latest version provides API access to 50 terabytes of wind-related measurements — about 10% of the full database. It includes “barometric pressure, wind speed and direction, relative humidity, temperature, and air density data” between 2007 and 2013, from nearly 5 million locations in/near the continental United States. The NREL has also published an animated map of the data. Note: Free registration is required to access the API. Previously: Wind turbines (DIP 2018.04.25). [h/t Michael McLaughlin] | https://www.nrel.gov/news/press/2018/nrel-releases-major-update-to-wind-energy-dataset.html https://www.nrel.gov/grid/wind-toolkit.html https://github.com/NREL/hsds-examples https://nrel.github.io/hsds-viz/ https://eerscmap.usgs.gov/uswtdb/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-04-25-edition | http://www.datainnovation.org/2018/05/tracking-wind-energy-in-the-united-states/ |
566 | 2018.05.23 | 5 | What Wikipedians cite. | The Wikimedia Foundation has published a dataset listing each clearly-cited source (e.g., a book with an ISBN, a scholarly article with a DOI, etc.) on each page of each of Wikipedia’s 298 language editions (15,693,732 source-page combinations in all). Related: “The Most-Cited Authors on Wikipedia Had No Idea,” by Louise Matsakis. [h/t Ted Lawless] | https://medium.com/freely-sharing-the-sum-of-all-knowledge/what-are-the-ten-most-cited-sources-on-wikipedia-lets-ask-the-data-34071478785a https://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540 https://www.wired.com/story/wikipedia-most-cited-authors-no-idea/ | https://twitter.com/tedlawless/status/981960557807980544 |
567 | 2018.06.06 | 1 | Global gas and oil infrastructure. | The Department of Energy’s National Energy Technology Laboratory has published what it says is the “first-ever database inventory of oil and natural gas infrastructure information from the top hydrocarbon-producing and consuming countries in the world.” The database contains tons of geospatial information and “identifies more than 4.8 million individual features like wells, pipelines, and ports from more than 380 datasets in 194 countries. It includes information about the type, age, status, and owner/operator of infrastructure features.” Helpful: The authors’ (detailed) methodology paper. [h/t Michael McLaughlin] | https://www.energy.gov/fe/articles/netl-led-team-creates-first-ever-international-database-use-preventing-natural-gas https://edx.netl.doe.gov/dataset/global-oil-gas-features-database https://edx.netl.doe.gov/dataset/development-of-an-open-global-oil-and-gas-infrastructure-inventory-and-geodatabase | http://www.datainnovation.org/2018/05/cataloging-the-global-energy-infrastructure-to-prevent-oil-and-gas-leaks/ |
568 | 2018.06.06 | 2 | Volcanoes and eruptions. | The Smithsonian Institution’s Global Volcanism Program maintains a database of more than 12,000 volcanoes and 11,000 eruptions — dating from 10450 BCE to the present year. You can search the data online, and then download the results as a spreadsheet. Related: “Here's every volcano that has erupted since Krakatoa.” [h/t Duncan Geere + Rachel Schallom + Lazaro Gamio] | https://volcano.si.edu/ https://volcano.si.edu/gvp_votw.cfm https://volcano.si.edu/search_volcano.cfm https://volcano.si.edu/search_eruption.cfm https://www.axios.com/chart-every-volcano-that-erupted-since-krakatoa-467da621-41ba-4efc-99c6-34ff3cb27709.html | https://tinyletter.com/duncangeere/letters/s03e19-blocks https://us16.campaign-archive.com/?u=5c12dabe1e59a9fbde1174b8c&id=84a7784f6b https://twitter.com/LazaroGamio/status/1002273071854743553 |
569 | 2018.06.06 | 3 | Retracted medical papers. | PubMed, the National Library of Medicine’s search engine for biomedical and life-sciences literature, lets you search for retracted publications; just add "retracted publication"[PTYP] to your query. For instance, here are retracted articles that were originally published in 2016. Using the “Send to” link at the top-right of the query pages, you can download all the results. Data scientist Neil Saunders has gathered this data and condensed it into an interactive, graphical report. (Clicking on the axis labels takes you to the relevant PubMed search.) Related: The code behind Saunders’ report. [h/t u/cavedave] | https://www.ncbi.nlm.nih.gov/pubmedhealth/PMHT0027066/ https://www.ncbi.nlm.nih.gov/pubmed/?term=%22retracted%20publication%22[PTYP]%20AND%202016[CRDT] https://nsaunders.wordpress.com/ https://neilfws.github.io/PubMed/pmretract/pmretract.html https://github.com/neilfws/PubMed/tree/master/retractions | https://www.reddit.com/r/datasets/comments/8m8pem/pubmed_retractions_report/ |
570 | 2018.06.06 | 4 | Cenotes. | The Mexican state of Yucatán publishes a dataset listing the names and locations of cenotes, the region’s famous water-filled sinkholes. Related: Other datasets from the Programa de Ordenamiento Ecológico Territorial del Estado de Yucatán. [h/t Forest Gregg] | http://bitacoraordenamiento.yucatan.gob.mx/documentos/detalles.php?IdArchivo=1058 https://en.wikipedia.org/wiki/Cenote http://bitacoraordenamiento.yucatan.gob.mx/galeria/index.php | https://opendata.stackexchange.com/questions/12864/mexican-sinkhole-locations/12866#12866 |
571 | 2018.06.06 | 5 | NICAR lightning talks. | Ever since 2010, the National Institute for Computer-Assisted Reporting (NICAR) annual conference has featured a session of five-minute “lightning talks,” selected by popular vote. NICARian Christine Zhang has compiled a spreadsheet of all 309 lightning talk proposals, the proposed presenters, their professional affiliations, how many votes each proposal received, and more. Related: “Nine Years of NICAR Lightning Talks (and Cats),” Zhang’s analysis of the data. Also related: The code behind Zhang’s analysis. | https://ire.org/nicar/ https://twitter.com/christinezhang https://docs.google.com/spreadsheets/d/1mRdBPRHPJHUlbK-kpGIx31Y4o73SXHz_899SQx1rvEY/edit#gid=614018567 https://source.opennews.org/articles/nine-years-nicar-lightning-talks-and-cats/ https://github.com/underthecurve/lightning-talks-analysis | |
572 | 2018.07.04 | 1 | The National Transit Database. | Every year, hundreds of U.S. transit systems — from the Pomona Valley Transportation Authority’s Claremont Dial-a-Ride to the Metropolitan Transportation Authority’s New York City Transit — submit detailed metrics to the congressionally-established National Transit Database. The NTD's datasets cover a broad set of topics, including “agency funding sources, inventories of vehicles and maintenance facilities, safety event reports, measures of transit service provided and consumed, and data on transit employees.” The NTD also provides a glossary, data collection manuals, and the underlying forms. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] | http://www.pvtrans.org/asp/site/claremont/ https://www.transit.dot.gov/ntd https://www.transit.dot.gov/ntd/ntd-data https://www.transit.dot.gov/ntd/national-transit-database-ntd-glossary https://www.transit.dot.gov/ntd/manuals https://www.transit.dot.gov/ntd/ntd-reporting-system-forms | |
573 | 2018.07.04 | 2 | Landslides. | The Cooperative Open Online Landslide Repository (COOLR) is a recently-launched NASA project that “seeks to cultivate an open platform where scientists and citizen scientists around the world can share landslide reports to guide awareness of landslide hazards for improving scientific modeling and emergency response.” The repository has been seeded with the agency’s Global Landslide Catalog, which it says is already “the largest openly available global database of rainfall-triggered mass movements known to date.” You can explore the COOLR data on an interactive map or download the data in several formats. | https://science.gsfc.nasa.gov/600/citizen-science/landslides/about.html http://blogs.discovermagazine.com/citizen-science-salon/2018/03/23/help-nasa-build-the-largest-open-landslide-catalog-with-landslide-reporter/ https://data.nasa.gov/Earth-Science/Global-Landslide-Catalog/h9d8-neg4/data https://science.gsfc.nasa.gov/600/citizen-science/landslides/resources.html#GLC https://maps.nccs.nasa.gov/arcgis/apps/webappviewer/index.html?id=824ea5864ec8423fb985b33ee6bc05b7 https://maps.nccs.nasa.gov/arcgis/apps/MapAndAppGallery/index.html?appid=574f26408683485799d02e857e5d9521 | |
574 | 2018.07.04 | 3 | Regional Medicare usage. | The U.S. Centers for Medicare & Medicaid Services publishes a series of “geographic variation” spreadsheets, which cover hundreds of metrics — such as kidney dialysis usage, the total cost of medical tests, and hospital readmission rates — related to Medicare beneficiaries’ healthcare in each state, county, and “hospital referral region.” [h/t Drew Ivan] | https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Geographic-Variation/GV_PUF.html | https://twitter.com/drewivan |
575 | 2018.07.04 | 4 | Anthony Bourdain’s travels. | Christine Zhang has compiled a CSV of 400+ locations featured in Anthony Bourdain’s No Reservations, The Layover, and Parts Unknown shows. The spreadsheet-as-remembrance includes each location’s name, country, latitude/longitude, plus the relevant episode’s show, season, number, and title. | https://twitter.com/christinezhang/status/1005974583206449152 https://github.com/underthecurve/bourdain-travel-places/blob/master/bourdain_travel_places.csv | |
576 | 2018.07.04 | 5 | Cars at auction. | Kansas City publishes a dataset of cars for sale at its monthly auction. As of yesterday, the dataset contained 482 cars. For each car, the variables include the make, model, year, VIN, reason for being auctioned — e.g., “abandoned,” “stolen,” “illegally parked” — and other details. | https://data.kcmo.org/Traffic/Kansas-City-Monthly-Car-Auction/32xf-gvw8 | |
577 | 2018.07.18 | 1 | SCOTUS extracurriculars. | Since July 2014, ScotusMap.com has been tracking the U.S. Supreme Court justices’ public events — “whether the Supreme Court is in session or on summer recess, the justices keep busy with writers’ conferences, state bar luncheons, award ceremonies, and more.” The map’s database now contains more than 700 entries, and even includes events attended by the retired justices. Bonus: The creators of ScotusMap recently launched ScotusWat.ch, a website (with downloadable data) that “tracks the public statements made by United States senators about how they plan to vote on the Supreme Court nominee, Brett Kavanaugh, and tallies them into a likely vote count.” [h/t Jay Pinho + Victoria Kwan] | http://www.scotusmap.com/posts/1 http://www.scotusmap.com/ http://www.scotuswat.ch/ | https://twitter.com/jaypinho https://twitter.com/victoriakwan_ |
578 | 2018.07.18 | 2 | 34,361 European migration deaths. | The Amsterdam-based activist group UNITED for Intercultural Action has, since the early 1990s, been collecting information about the deaths of Europe’s refugee-seekers. The organization's volunteers “update the data annually, spending six months at a time verifying reports, categorising deaths and entering them into the database,” according to The Guardian's story about the endeavor and its findings. “When the project began, they received physical clippings from a network of groups around Europe. Now, the data is collected from email submissions and Google Alerts in a number of languages.” The story features a PDF-listing of the deaths, including the date the migrants were found dead, names and countries of origin (where known), and the causes of death. The Italian civic-data organization OnData has converted the PDF to a spreadsheet. [h/t Giuseppe Sollazzo] | http://www.unitedagainstracism.org/ https://www.theguardian.com/world/2018/jun/20/the-list-europe-migrant-bodycount http://ondata.it/ https://github.com/ondata/the-list | https://us5.campaign-archive.com/?u=77ecabbd32e97a6caa9d7d40b&id=715ce882b7 |
579 | 2018.07.18 | 3 | Building footprints. | Microsoft’s Bing Maps team has published an open dataset describing the outlines of nearly 125 million buildings in the United States. To build the dataset, the team trained neural networks to detect buildings’ footprints in satellite images. | https://blogs.bing.com/maps/2018-06/microsoft-releases-125-million-building-footprints-in-the-us-as-open-data https://github.com/Microsoft/USBuildingFootprints | |
580 | 2018.07.18 | 4 | Makeup shades. | In a recent essay at The Pudding, Jason Li, Amber Thomas, and Divya Manian explored the shades of foundation offered by best-selling makeup brands in the U.S., Nigeria, India, and Japan. They also published the underlying data — color values for more than 600 shades from 36 different brands. | https://pudding.cool/2018/06/makeup-shades/ https://github.com/the-pudding/data/tree/master/makeup-shades | |
581 | 2018.07.18 | 5 | FOIA’ed FBI files. | “When somebody's obituary appears in the New York Times, FOIA The Dead sends an automated request to the FBI for their (newly-available) records.” So far, the project has obtained and published FBI’s files on 54 people. The site’s data includes each person’s name, a short description, a link to the relevant obituary, a link to the received records, and the number of pages obtained. [h/t Noah Veltman] | https://foiathedead.org/about/ https://foiathedead.org/ https://foiathedead.org/entries.json | https://noahveltman.com/ |
582 | 2018.08.01 | 1 | The death penalty. | Law professor Brandon L. Garrett has led an effort to compile data on every death sentence in the U.S. since the early 1990s. Garrett’s “End of its Rope” database currently includes more than 4,900 sentencings, and specifies each defendant’s name, race, and gender; the state, county, and year of the sentence; whether it was a resentencing; and whether the defendant has been executed. You can download the data, browse it online, and explore it via an interactive map. | http://www.brandonlgarrett.com/ http://endofitsrope.com/using-the-database/ http://endofitsrope.com/data-and-documents/ http://endofitsrope.com/database/ http://endofitsrope.com/ | |
583 | 2018.08.01 | 2 | Global tax revenues. | The Organisation for Economic Co-operation and Development (OECD) has launched a database “providing detailed and comparable tax revenue information for 80 countries around the world.” The Global Revenue Statistics Database, “which will expand to cover more than 90 countries by the end of 2018,” breaks tax revenues into dozens of categories and subcategories — such as sales taxes, taxes on capital gains, and taxes on exports. Related: The OECD’s interactive charts of the data. | http://www.oecd.org/tax/tax-policy/oecd-launches-largest-source-of-comparable-tax-revenue-data.htm https://stats.oecd.org/Index.aspx?DataSetCode=RS_GBL http://www.oecd.org/tax/tax-policy/global-revenue-statistics-database.htm | |
584 | 2018.08.01 | 3 | Consumer-product chemicals. | Researchers at the Environmental Protection Agency have created a new dataset of “reported and predicted information on more than 75,000 chemicals and more than 15,000 consumer products.” The Chemicals and Products Database, as they’ve named it, is an “aggregation of publicly available data on chemical-use categorization, consumer product composition [...], and functional use of chemicals”, and uses “a consistent scheme for categorizing products and chemicals.” You can download the data via the EPA’s Chemistry Dashboard. | https://www.nature.com/articles/sdata2018125 https://comptox.epa.gov/dashboard/downloads | |
585 | 2018.08.01 | 4 | Development projects and outcomes. | Earlier this year, Johns Hopkins professor Dan Honig released the Project Performance Database, which tracks the outcome ratings of international development projects (typically conducted by auditors on a four- or six-point scale). “The PPD is, at present, the world's largest” such database and “contains over 14,000 unique projects from eight agencies,” including the World Bank, the Asian Development Bank, and others. [h/t Paddy Carter] | http://danhonig.info/ https://dataverse.harvard.edu/dataverse/PPD | https://twitter.com/CarterPaddy/status/964413653377155072 |
586 | 2018.08.01 | 5 | Public restrooms. | There are many official datasets of public toilets, including those in New York City parks, Vancouver, Seattle parks, many UK cities, Australia, and New Zealand. [h/t Jens von Bergmann] | https://data.cityofnewyork.us/Recreation/Directory-Of-Toilets-In-Public-Parks/hjae-yuav https://data.vancouver.ca/datacatalogue/public-washrooms.htm https://data.seattle.gov/Parks-and-Recreation/Seattle-Parks-and-Recreation-GIS-Map-Layer-Shapefi/dfsk-abyq https://data.gov.uk/search?q=toilets https://data.gov.au/dataset/national-public-toilet-map https://catalogue.data.govt.nz/dataset/public-toilets2 | https://twitter.com/vb_jens/status/1020555383910223873 |
587 | 2018.08.08 | 1 | Militarized disputes. | The Militarized Interstate Dispute datasets provide details about more than 2,200 instances between 1816 and 2010 where a government “threatened, displayed, or used force against another” — including each dispute’s timing, participants, death count, result, and more. A supplementary database tracks the disputes’ locations. The datasets are part of the Correlates of War project, which was founded in 1963 and which strives for “the systematic accumulation of scientific knowledge about war.” [h/t Erik Beuck] | http://cow.dss.ucdavis.edu/data-sets/MIDs http://correlatesofwar.org/data-sets/MIDLOC http://cow.dss.ucdavis.edu/ http://cow.dss.ucdavis.edu/history | https://twitter.com/CharlesBeuck/status/1022935812147699713 |
588 | 2018.08.08 | 2 | Spending at Trump properties. | ProPublica is tracking the money that political campaigns and government agencies have reported spending at Donald Trump’s hotels, golf clubs, and restaurants. You can download the data, which includes the spender, property, date, amount, and listed purpose for each payment. From ProPublica’s notes: “Federal government spending is incomplete because many government agencies have actively fought requests to disclose spending at Trump properties. The data we have so far was released, in part, after lawsuits.” | https://projects.propublica.org/paying-the-president/ https://www.propublica.org/datastore/dataset/spending-at-trump-properties | |
589 | 2018.08.08 | 3 | Overlooked computer scientists. | Researchers at Primer, a machine learning and natural language processing startup, have released a dataset describing more than 36,000 notable computer scientists, “only 15%” of whom have Wikipedia biographies. The researchers trained their algorithms on a corpus of existing Wikipedia articles, Wikidata entries, news articles, and the Semantic Scholar Open Research Corpus. (The last of these contains data on more than 39 million research papers in computer science, neuroscience, and biomedical science.) The results include each computer scientist’s name, basic metadata, academic papers, and snippets of news articles mentioning them. Related: “Using Artificial Intelligence to Fix Wikipedia's Gender Problem” (Wired). [h/t Sara Blask] | https://blog.primer.ai/technology/2018/08/03/Quicksilver.html https://github.com/PrimerAI/primer_quicksilver http://labs.semanticscholar.org/corpus/ https://www.wired.com/story/using-artificial-intelligence-to-fix-wikipedias-gender-problem/ | https://twitter.com/sarablask |
590 | 2018.08.08 | 4 | Death and taxes in the Garden State. | The nonprofit organization Reclaim The Records recently obtained New Jersey’s death index, and has made it available to search and download. The records include structured data for 1,275,833 deaths in the state between 2001 and 2017, plus digitized images of the death index for 1901-1903, 1920-1929, and 1949-2000. The structured data contains each person’s name, date of birth, date of death, and death certificate number — plus, for the most recent records, the locations of birth and death. Also: NJ Advance Media has published data on 17 years of drug overdose deaths from the state’s Office of the State Medical Examiner, and property tax rolls for “all 2.3 million taxable parcels of land” in 2017. (Free registration required to download the files.) [h/t Benjamin Cooley + Martin Burch] | https://www.reclaimtherecords.org/about/ https://www.reclaimtherecords.org/records-request/21/ https://www.newjerseydeathindex.com/ https://data.world/njdotcom/nj-statewide-overdose-deaths-1999-to-2016 https://data.world/njdotcom/nj-property-tax-rolls-2017 | https://medium.com/towards-data-science/data-curious-27-08-2017-a-roundup-of-data-stories-datasets-and-visualizations-from-last-week-4c2a1c10b068 https://twitter.com/seecmb/status/998642204825587713 |
591 | 2018.08.08 | 5 | More street trees. | London, Belfast, Vancouver, Washington (D.C.), Philadelphia, Boston, Cambridge (Mass.), Madison, Providence, San Francisco, Oakland, and Berkeley are among the many cities that publish data cataloguing the trees that line their streets. Previously: NYC’s street trees (DIP 2016.11.16). [h/t Jens von Bergmann + Sunlight Open Cities + u/willwardo] | https://data.london.gov.uk/dataset/local-authority-maintained-trees https://www.opendatani.gov.uk/dataset/belfast-trees https://data.vancouver.ca/datacatalogue/streetTrees.htm http://dcgis.maps.arcgis.com/home/item.html?id=fea6079cf9bc4310a8b6c94f8c2bf1da https://www.opendataphilly.org/dataset/philadelphia-street-tree-inventory https://data.boston.gov/dataset/trees https://www.cambridgema.gov/GIS/gisdatadictionary/Environmental/ENVIRONMENTAL_StreetTrees https://data-cityofmadison.opendata.arcgis.com/datasets/street-trees https://data.providenceri.gov/Neighborhoods/Providence-Tree-Inventory/uv9w-h8i4/data https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq https://data.oaklandnet.com/Environmental/Oakland-Street-Trees/4jcx-enxf https://data.cityofberkeley.info/Natural-Resources/City-Trees/9t35-jmin/data https://www.nycgovparks.org/trees/treescount https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-11-16-edition | https://twitter.com/vb_jens/status/909600422213455872 https://twitter.com/SunlightCities/status/992429816237510660 https://www.reddit.com/r/datasets/comments/8vsf32/a_treasure_trove_of_geospatial_datasets_for_the/ |
592 | 2018.08.15 | 1 | Peer-to-peer loans. | The Lending Club, which matches borrowers with investors, publishes a dataset of all loans issued through its platform since 2007. The dataset’s many fields include each loan’s amount, term, interest rate, grade, status, and purpose (as a category, and often also a fuller description), as well as the borrower’s employer, home ownership status, and annual income. You can also download all declined loans, i.e., those “that did not meet Lending Club's credit underwriting policy.” [h/t Charlie Stanton] | https://www.lendingclub.com/info/download-data.action | |
593 | 2018.08.15 | 2 | Rocket launches. | SpaceX’s API provides data on the company’s rockets, launchpads, launches, and more. It also will tell you the current orbital position of the car SpaceX launched into space. [h/t Mike Allred] | https://github.com/r-spacex/SpaceX-API https://github.com/r-spacex/SpaceX-API/blob/master/docs/rocket.md https://github.com/r-spacex/SpaceX-API/blob/master/docs/launchpad.md https://github.com/r-spacex/SpaceX-API/blob/master/docs/launches.md https://github.com/r-spacex/SpaceX-API/blob/master/docs/home.md https://github.com/r-spacex/SpaceX-API/blob/master/docs/roadster.md https://en.wikipedia.org/wiki/Elon_Musk%27s_Tesla_Roadster | https://github.com/toddmotto/public-apis/commit/d684931d8e984ea878c6c5413fd606a51685e774 |
594 | 2018.08.15 | 3 | Sneaker factories. | Nike, Inc.’s manufacturing map displays 618 factories and material suppliers that the company uses to manufacture its products (as of May 2018). You can export the entire dataset, or browse and filter the data online. For each of the factories, the information includes the factory’s name, address, product type, number of workers, percentage of workers who are female, and more. [h/t Marc DaCosta] | http://manufacturingmap.nikeinc.com/ | http://marcdacosta.com/ |
595 | 2018.08.15 | 4 | Retail stores and products. | Best Buy’s API and Walmart’s API both let you search their products and stores. Both also require (free) registration to obtain an API key. In 2016, Best Buy also published bulk data describing its products and stores. [h/t Dan Nguyen + Dave Machado] | https://bestbuyapis.github.io/api-documentation/ https://developer.walmartlabs.com/docs https://github.com/BestBuyAPIs/open-data-set | https://www.reddit.com/r/datasets/comments/5m9z0c/best_buys_developer_api_extremely_welldocumented/ https://github.com/toddmotto/public-apis/commit/fa2dc74ccd4e91c1b8c90f3918c83a3a274aa734 |
596 | 2018.08.15 | 5 | Financial statements. | The SEC’s Office of Structured Disclosure publishes data extracted from corporations’ public financial statements. That dataset contains the numbers listed in each company’s primary financial statements — balance sheets, cash flows, et cetera. An even more detailed version of the dataset includes plain-text notes from the filings, plus numbers from a broader array of forms. Both datasets are updated quarterly and go back to 2009. | https://www.sec.gov/structureddata https://www.sec.gov/dera/data/financial-statement-data-sets.html https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html | |
597 | 2018.08.22 | 1 | Political ads online. | Google recently launched a database of political ads “that have appeared on Google and partner properties.” The searchable and downloadable dataset indicates the organization that paid for each advertisement, approximately how much they spent, how long the ad ran, what demographics were used for targeting, and roughly how many people it reached. A few months ago, Facebook launched a similar initiative, but you need to be logged in to view it and you can’t download the data. You can, however, get Facebook political-advertising data from at least two sources: A repository of 267,000 ads scraped from Facebook’s official archive by NYU researchers, and ProPublica’s ongoing, detailed database of ads and targeting parameters gathered through their Political Ad Collector. [h/t Sheera Frenkel] | https://transparencyreport.google.com/political-ads/library https://www.facebook.com/ads/archive/?active_status=all&ad_type=political_and_issue_ads&country=US https://github.com/online-pol-ads/FBPoliticalAds https://www.propublica.org/datastore/dataset/political-advertisements-from-facebook https://projects.propublica.org/facebook-ads/ | https://www.nytimes.com/2018/07/17/technology/political-ads-facebook-trump.html |
598 | 2018.08.22 | 2 | Healthcare service in Africa. | The African Economic Research Consortium, African Development Bank, and the World Bank have partnered to create the Service Delivery Indicators program, “a new Africa-wide initiative” that dispatches teams of surveyors “to gauge the quality of service delivery in basic health services” across the continent. The initiative’s de-identified data contains results for nine countries so far, including assessments of facility infrastructure, worker absenteeism, and patient case simulations. [h/t Matthew Collin] | https://worldbank.github.io/SDI-Health/ https://github.com/worldbank/SDI-Health | https://twitter.com/aidthoughts |
599 | 2018.08.22 | 3 | Natural disaster satellite imagery. | DigitalGlobe’s open data program publishes georeferenced satellite imagery from before and after major natural disasters. The archive currently includes a couple dozen events, including recent flooding in Kerala and California’s Carr Wildfire and Mendocino Complex Fire. Previously: NOAA's emergency response aerial imagery (DIP 2017.09.20). [h/t Laura Noren and Brad Stenger] | https://www.digitalglobe.com/opendata https://www.digitalglobe.com/opendata/all-events http://blog.digitalglobe.com/news/open-data-for-flooding-in-kerala-state-india/ http://blog.digitalglobe.com/news/open-data-for-the-carr-wildfire/ http://blog.digitalglobe.com/news/open-data-for-the-mendocino-complex-fire/ https://storms.ngs.noaa.gov/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-09-20-edition | https://cds.nyu.edu/newsletter/ |
600 | 2018.08.22 | 4 | Half a century of opinions. | The University of North Carolina’s Louis Harris Data Center serves as “the national depository for publicly available survey data collected by Louis Harris and Associates, Inc.” The online depository contains more than 1,000 Harris polls, some from as early as 1958. In total, they include “160,000 questions asked of more than 1,200,000 respondents.” [h/t Xan Gregg] | https://dataverse.unc.edu/dataverse/harris | https://twitter.com/xangregg |
601 | 2018.08.22 | 5 | Jeans pockets. | Jan Diehm and Amber Thomas measured the pockets of 80 pairs of jeans — four pairs each from 20 brands, half marketed to men and the other half to women. Their findings “confirmed what every woman already knows to be true: women’s pockets are ridiculous.” In fact, “on average, the pockets in women’s jeans are 48% shorter and 6.5% narrower than men’s pockets.” For each pair of jeans, the duo’s underlying dataset contains the front and back pocket dimensions, material composition, retail price, and more. | https://pudding.cool/2018/08/pockets/ https://github.com/the-pudding/data/tree/master/pockets | |
602 | 2018.09.12 | 1 | Power outages. | Utility companies are required to report major power outages and other “electric disturbance events” to the Department of Energy within a business day (or, depending on the type of event, sooner) of the incident. The federal agency then aggregates the reports into annual summary datasets. For each event, the data includes the time it began and was resolved, the geographic areas it affected, the type of incident, and the estimated number of customers affected. [h/t Jordan Wirfs-Brock] | https://www.oe.netl.doe.gov/oe417.aspx https://www.oe.netl.doe.gov/OE417_annual_summary.aspx | http://insideenergy.org/2014/08/18/data-explore-15-years-of-power-outages/ |
603 | 2018.09.12 | 2 | Deaths in Puerto Rico. | Last month, Puerto Rico’s government began publishing a dataset of all deaths registered in the U.S. territory from January 2017, updated weekly. For each death, the information includes the year and month of the death; the type and causes of death; the deceased’s age, sex, marital status, occupation, place of birth and residence, and more. Related: “More Than 2,000 Puerto Ricans Applied For Funeral Assistance After Hurricane Maria. FEMA Approved Just 75.” [h/t Giancarlo Gonzalez] | https://datos.estadisticas.pr/dataset/defunciones https://www.buzzfeednews.com/article/nidhiprakash/puerto-rico-hurricane-funeral-assistance-fema | https://twitter.com/giangonz/status/1035956181230071809 |
604 | 2018.09.12 | 3 | Parking tickets in Chicago. | “For the first time, the city’s database, which tracks more than 28 million parking and vehicle compliance tickets, is easily available to the public,” according to ProPublica Illinois, which has published the two-gigabyte dataset in collaboration with WBEZ. The dataset, which covers January 2007 to mid-May 2018, “includes information on when, where, and by whom tickets were issued; de-identified license plates; vehicle make; registration zip code; the violation for which the vehicle was cited; the payment status and more.” | https://www.propublica.org/nerds/download-chicago-parking-ticket-data https://www.propublica.org/datastore/dataset/chicago-parking-ticket-data | |
605 | 2018.09.12 | 4 | Neighborhood boundaries. | Zillow has created a dataset outlining the boundaries of more than 17,000 neighborhoods in the United States’ largest cities, spanning 49 states (all but Wyoming) plus D.C. and Puerto Rico. Related: OpenStreetMap, which is API-queryable, has a “neighbourhood” tag type. [h/t Volodymyr Kupriyanov] | https://www.zillow.com/howto/api/neighborhood-boundaries.htm https://wiki.openstreetmap.org/wiki/Databases_and_data_access_APIs https://wiki.openstreetmap.org/wiki/Tag:place%3Dneighbourhood | https://twitter.com/v_kupriyanov/status/977108035003969536 |
606 | 2018.09.12 | 5 | Breweries. | Open Brewery DB is a searchable database of more than 8,000 breweries in the United States (although “future plans are to import world-wide data”). The site provides an API, which lets you query by name, location, and type — microbrewery, regional brewery, brewpub, and so on. Previously: Official brewery statistics (DIP 2017.05.24). [h/t Chris Mears] | https://www.openbrewerydb.org/ https://www.ttb.gov/beer/beer-stats.shtml https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-24-edition | https://github.com/toddmotto/public-apis/commit/5777659c8e8a43437179c1cbc985a7a06d997c51 |
607 | 2018.09.19 | 1 | Parties and parliaments. | ParlGov, “a data infrastructure for political science,” has collected detailed information on 1,500+ political parties, the results of 900+ elections, and the formation of 1,400+ parliamentary cabinets. The 37 countries it covers include every member of the European Union plus certain non-EU members of the OECD (such as Israel, Turkey, and Canada — but not the United States). The datasets are available in several formats, can be explored online, and come with extensive documentation. [h/t Jovi Juan] | http://www.parlgov.org/ http://www.parlgov.org/data/table/view_party/ http://www.parlgov.org/data/table/view_election/ http://www.parlgov.org/data/table/view_cabinet/ http://www.parlgov.org/explore/ | https://twitter.com/DaoOfJ |
608 | 2018.09.19 | 2 | Home energy consumption. | For many decades, the Department of Energy’s Residential Energy Consumption Survey has been asking people about their homes’ energy-related characteristics (e.g., number of bedrooms and roofing materials) and energy-consuming appliances (e.g., television size and dishwasher use). Then, the agency cross-references those answers with billing data collected “directly from energy suppliers under a mandatory authority granted by Congress.” The survey has been conducted 14 times since 1978; survey microdata is available for the eight most recent iterations. | https://www.eia.gov/consumption/residential/ https://www.eia.gov/consumption/residential/data/2015/index.php?view=microdata | |
609 | 2018.09.19 | 3 | Open-access scholarship. | Unpaywall has collected data on millions of open-access scholarly articles, plus many more paywalled articles. You can download the full dataset, or submit specific Digital Object Identifiers to the website’s API or online form. For each article, you can learn whether it’s openly accessible, whether the journal that published it is open-access, and additional details about the article itself. [h/t @authcontroller] | http://unpaywall.org/ http://unpaywall.org/products/snapshot https://en.wikipedia.org/wiki/Digital_object_identifier http://unpaywall.org/products/api http://unpaywall.org/products/simple-query-tool | https://twitter.com/authcontroller/status/1029919541860560896 |
610 | 2018.09.19 | 4 | U.S. citizens’ deaths overseas. | The U.S. Department of State publishes, “to the maximum extent practicable,” a database of “each United States citizen who dies in a foreign country from a non-natural cause.” The database currently contains 13,045 deaths, starting in October 2002, and is updated every six months. For each incident, the database provides the date, city, and cause of death. [h/t Jacquelyn Elias] | https://travel.state.gov/content/travel/en/international-travel/while-abroad/death-abroad1/death-statistics.html | https://twitter.com/JacquieEli/status/1039884713933119493 |
611 | 2018.09.19 | 5 | New York real estate brokers. | New York State’s Department of State publishes a structured listing of all real estate brokers, salespeople, and offices currently licensed by the agency. Roughly half of the 160,000 licensees are registered to business addresses in New York City. The ZIP code with the largest raw number of active licenses is 10022, a chunk of Midtown East that includes (among other things) the Waldorf Astoria and Trump Tower. | https://data.ny.gov/Economic-Development/Active-Real-Estate-Salespersons-and-Brokers/yg7h-zjbf | |
612 | 2018.09.26 | 1 | Urban archaeology. | When Amsterdam began excavating parts of the Amstel River in 2003 to construct a new metro line, the city gave archaeologists access to two large sections of the riverbed. Over time, these archaeologists unearthed “a deluge of finds, some 700,000 in all: a vast array of objects, some broken, some whole, all jumbled together.” To showcase the work, the city has published Below the Surface, a website that lets you explore 20,000 of the objects online, download detailed data on more than 130,000 of the artifacts, read the backstory, and watch a documentary about it. Among the discoveries: Thousands of tobacco pipes, hundreds of teapots, dozens of gin bottles, and one “miniature wind mill.” [h/t Adam J Calhoun + Manoj Mallela] | https://belowthesurface.amsterdam/en/pagina/de-opgravingen-0 https://belowthesurface.amsterdam/en https://belowthesurface.amsterdam/en/vondsten https://belowthesurface.amsterdam/en/pagina/publicaties-en-datasets https://belowthesurface.amsterdam/en/pagina/de-opgravingen-index https://vimeo.com/274460486 | https://twitter.com/neuroecology/status/1012336008459882497 https://twitter.com/mallelatweets/status/1012495477403803648 |
613 | 2018.09.26 | 2 | Local lobbying. | Some cities — including San Francisco, Los Angeles, and Austin — provide downloadable databases of lobbyists who’ve officially registered to influence their administrations. Chicago has gone one step further, publishing data on lobbyists’ compensation, expenditures, gifts, and more. Previously: Lobbying data from the U.S. House, U.S. Senate, and European Union (DIP 2017.05.31 + DIP 2017.08.02). [h/t Alisha Green and Laurenellen McCann] | https://sfethics.org/disclosures/lobbyist-disclosure https://data.lacity.org/A-Well-Run-City/Registered-City-Lobbyists/j4zm-9kqu https://data.austintexas.gov/City-Government/Lobbyists-Master-List-of-Lobbyists-Oracle-View-/96z6-upac https://digital.cityofchicago.org/index.php/improved-lobbyist-data/ https://data.cityofchicago.org/Ethics/Lobbyist-Data-Compensation/dw2f-w78u https://data.cityofchicago.org/Ethics/Lobbyist-Data-Expenditures-Large/xika-473c https://data.cityofchicago.org/Ethics/Lobbyist-Data-Gifts/5d79-9xqr http://disclosures.house.gov/ld/ldsearch.aspx https://www.senate.gov/legislative/Public_Disclosure/LDA_reports.htm http://ec.europa.eu/transparencyregister/public/homePage.do?redir=false&locale=en https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-05-31-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-08-02-edition | https://sunlightfoundation.com/2013/04/04/the-landscape-of-municipal-lobbying-data/ |
614 | 2018.09.26 | 3 | Family life. | The National Survey of Family Growth, run by the U.S. Centers for Disease Control and Prevention, “gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health.” Versions of the survey have been conducted nine times, dating back to 1973. The most recent results come from interviews of more than 10,205 people between September 2013 and September 2015. Related: The Pudding’s Amber Thomas used the data to explore trends in birth control. Bonus: Thomas also published the code and data behind her analysis. [h/t Giuseppe Sollazzo] | https://www.cdc.gov/nchs/nsfg/about_nsfg.htm https://amber.rbind.io/ https://pudding.cool/2018/07/birth_control/ https://github.com/the-pudding/data/tree/master/birth-control | https://us5.campaign-archive.com/?u=77ecabbd32e97a6caa9d7d40b&id=4553ae1b07 |
615 | 2018.09.26 | 4 | Social assistance programs. | The Social Assistance, Politics and Institutions database, developed at a United Nations University research center, “provides a synthesis of longitudinal and harmonized comparable information on social assistance programmes in developing countries, covering the period 2000-2015.” For each program, such as Brazil’s “Bolsa Familia,” the database describes its basic characteristics, budget and financing, and population coverage. [h/t Erik Gahner Larsen] | https://www.wider.unu.edu/project/sapi-social-assistance-politics-and-institutions-database | https://github.com/erikgahner/PolData |
616 | 2018.09.26 | 5 | Avocado prices. | The Hass Avocado Board publishes weekly data on the retail volume and average price of Hass avocados sold in the United States, based on information collected “directly from retailers’ cash registers.” The data is available at the national and city level going back to 2015, and distinguishes between conventional and organic avocados of various sizes. Related: Justin Kiggins has aggregated the historical spreadsheets for 2015 through March 2018 into a single file. | http://www.hassavocadoboard.com/retail/volume-and-price-data http://www.justinkiggins.com/ https://www.kaggle.com/neuromusic/avocado-prices | |
617 | 2018.10.03 | 1 | Critical habitats. | The U.S. Fish & Wildlife Service publishes a database outlining the critical habitats for more than 700 threatened and endangered species. For each habitat, the dataset provides its geographic boundary lines, the species’ name and type, the size of the habitat, the date it was declared critical, and more. Related: Other geospatial datasets from the USFWS, including those on the Coastal Barrier Resources System and migratory bird populations. | https://ecos.fws.gov/ecp/report/table/critical-habitat.html https://www.fws.gov/gis/data/national/ https://catalog.data.gov/dataset/boundaries-of-the-john-h-chafee-coastal-barrier-resources-system https://migbirdapps.fws.gov/ | |
618 | 2018.10.03 | 2 | European electoral polling. | PollOfPolls.eu aggregates political polls from 30 European countries. The Vienna-based initiative has, for instance, collected and standardized more than 1,000 individual polls on British parliament since 2014, and 60 on the Bavarian state elections. You can download each set of standardized data as either JSON or CSV. [h/t Jovi Juan] | https://pollofpolls.eu/ | https://twitter.com/DaoOfJ |
619 | 2018.10.03 | 3 | Car traffic. | The UK Department for Transport’s traffic counts calculate the average daily number of vehicles “for every junction-to-junction link on the 'A' road and motorway network in Great Britain.” Likewise, California publishes the average daily traffic, peak hourly traffic, truck traffic, and ramp traffic for each of its state highways. Previously: U.S. interstate highway traffic (DIP 2016.10.05) and public roads (DIP 2018.04.25). [h/t Dave Fisher-Hickey + u/ron_leflore] | https://www.dft.gov.uk/traffic-counts/index.php http://www.dot.ca.gov/trafficops/census/ http://metrocosm.com/map-us-traffic/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-10-05-edition https://www.fhwa.dot.gov/policyinformation/hpms/fieldmanual/page01.cfm https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-04-25-edition | https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales https://www.reddit.com/r/datasets/comments/9j4y09/can_anyone_provide_me_with_real_time_traffic_data/e6orv72/ |
620 | 2018.10.03 | 4 | Bike traffic. | A slew of cities have installed devices to count bicycles that pass through major routes. At least several publish hourly or daily tallies: London, Ottawa, Edinburgh, Seattle, Cambridge, Mass., and the Washington, DC area. New York City provides daily counter-tallies for its East River bridges, but currently only as PDFs. Related: "[Transport for London]’s cycle counter data: initial thoughts" and “What we can learn from Seattle’s bike-counter data.” [h/t Giuseppe Sollazzo] | http://cycling.data.tfl.gov.uk/ http://data.ottawa.ca/dataset/bicycle-trip-counters-automated https://data.edinburghopendata.info/dataset/bike-counter-data-set-cluster http://www.seattle.gov/transportation/projects-and-programs/programs/bike-program/bike-counters https://data.cambridgema.gov/dataset/Eco-Totem-Broadway-Bicycle-Count/q8v9-mcfg http://counters.bikearlington.com/ http://www.nyc.gov/html/dot/html/bicyclists/bike-counts.shtml https://lastnotlost.wordpress.com/2018/09/12/counterdata/ https://www.seattletimes.com/seattle-news/transportation/what-we-can-learn-from-seattles-bike-counter-data/ | https://us5.campaign-archive.com/?u=77ecabbd32e97a6caa9d7d40b&id=e072aa9980 |
621 | 2018.10.03 | 5 | Florida’s billboards. | The Florida Department of Transportation publishes its inventory of active permits for billboards and other “outdoor advertising.” For each permit, the dataset provides details about the permit-holder and the structure itself — such as its location, height, whether it’s in a city, and more. [h/t Caitlin Ostroff] | http://fdotewp1.dot.state.fl.us/rightofway/DownloadData.aspx | https://twitter.com/ceostroff |
622 | 2018.10.10 | 1 | Public health policy. | LawAtlas.org publishes interactive maps that detail state and federal regulations on dozens of public health–related topics. Among them: e-cigarettes, HIV criminalization, fair housing, syringe distribution, and cell phone use while driving. (You can, for instance, use the e-cigarette map to identify all states where vaping is allowed in hotel rooms but prohibited in public parks.) You can download the underlying data, plus documentation about how the laws were categorized. Bonus: The website, run by Temple University’s Center for Public Health Law Research, will also teach you how to map laws yourself. Previously: The Correlates of State Policy Project (DIP 2016.07.06). | http://lawatlas.org/ http://lawatlas.org/topics http://lawatlas.org/datasets/electronic-nicotine-delivery-systems http://lawatlas.org/datasets/hiv-criminalization-statutes http://lawatlas.org/datasets/state-fair-housing-protections-1498143743 http://lawatlas.org/datasets/syringe-policies-laws-regulating-non-retail-distribution-of-drug-parapherna http://lawatlas.org/datasets/distracted-driving-1470663668 http://lawatlas.org/page/lawatlas-learning-library http://ippsr.msu.edu/public-policy/correlates-state-policy https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-07-06-edition | |
623 | 2018.10.10 | 2 | English health indicators. | England’s public health department generates quantitative “profiles” of the country’s well-being. The metrics include rates of HPV vaccination, dementia, exercise, diabetes, and much more. The results can be downloaded directly, and also accessed via an API. [h/t Sharon Machlis] | https://fingertips.phe.org.uk/ https://fingertips.phe.org.uk/profile/child-health-profiles/data#page/1/gid/1938133237 https://fingertips.phe.org.uk/profile-group/mental-health/profile/dementia/data#page/0 https://fingertips.phe.org.uk/profile/physical-activity/data#page/0 https://fingertips.phe.org.uk/profile/diabetes-ft/data#page/0 https://fingertips.phe.org.uk/api | https://twitter.com/sharon000 |
624 | 2018.10.10 | 3 | Public holidays. | Nager.Date calculates the timing — past, present, and future — of public holidays for more than 90 countries. The holidays can be browsed online, accessed via an API, or downloaded as CSVs (one per country per year). Now you know: Today is Cuba’s Día de la Independencia and Suriname’s Day of the Maroons. [h/t Tino Hager] | https://date.nager.at/ https://date.nager.at/Home/Countries https://date.nager.at/Home/Api https://date.nager.at/PublicHoliday/Country/CU/2018 https://date.nager.at/PublicHoliday/Country/SR/2018 | https://github.com/toddmotto/public-apis/commit/e2b1439b77c59e198db04566139cd3783fc41960 |
625 | 2018.10.10 | 4 | Medical marijuana in the Nutmeg State. | Connecticut’s Department of Consumer Protection has released a dataset listing all branded medical marijuana products registered with the state. For each of the nearly 4,000 products so far, the dataset describes the producer, brand name, form of dosage, and chemical potencies — plus links to images of each product and label. [h/t Kristin Hussey] | https://data.ct.gov/Health-and-Human-Services/Medical-Marijuana-Brand-Registry/egd5-wb6r/data | https://twitter.com/kristinhussey1 |
626 | 2018.10.10 | 5 | Parking meters in the five boroughs. | New York City provides the latitude, longitude, ID number, and current status — active, inactive, retired, planned, and removed — of more than 14,700 parking meters. [h/t Zack Quaintance] | https://data.cityofnewyork.us/Transportation/Parking-Meters-GPS-Coordinates-and-Status/5jsj-cq4s | http://www.govtech.com/civic/Whats-New-in-Civic-Tech-New-York-City-Releases-Its-Annual-Data-Report.html |
627 | 2018.10.31 | 1 | Bridges. | The Federal Highway Administration’s National Bridge Inventory contains detailed data on more than 600,000 “highway bridges” in the United States. The inventory goes back to 1992 and contains scores of fields, including the bridge’s age, condition, design, and materials. Now you know: Texas has the most highway bridges in the inventory, with more than 53,800. Bonus: You can also search the bridges via the unofficial BridgeReports.com. Related: The code the Baltimore Sun used to answer the question, “How safe are Maryland's bridges?” [h/t Christine Zhang] | https://www.fhwa.dot.gov/bridge/nbi.cfm https://www.fhwa.dot.gov/bridge/nbi/ascii.cfm https://bridgereports.com/ https://github.com/baltimore-sun-data/bridge-data http://www.baltimoresun.com/news/maryland/bs-md-bridge-collapse-maryland-20180815-story.html | https://twitter.com/christinezhang |
628 | 2018.10.31 | 2 | Foreign influence campaigns on Twitter. | Earlier this month, Twitter released data on the public activity of “3,841 accounts affiliated with the [Internet Research Agency], originating in Russia, and 770 other accounts, potentially originating in Iran.” Together, the datasets “include more than 10 million Tweets and more than 2 million images, GIFs, videos, and Periscope broadcasts.” Related: My colleague Peter Aldhous used this data — combined with data on 3 million “Russian troll tweets” released this summer by Clemson University researchers and FiveThirtyEight — to examine the Internet Research Agency’s traction before and after the 2016 election. Bonus: Peter’s code. | https://blog.twitter.com/official/en_us/topics/company/2018/enabling-further-research-of-information-operations-on-twitter.html https://about.twitter.com/en_us/values/elections-integrity.html#data https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/ https://www.buzzfeednews.com/article/peteraldhous/russia-online-trolls-viral-strategy https://buzzfeednews.github.io/2018-10-russian-troll-tweets/ | |
629 | 2018.10.31 | 3 | Electric utilities. | The U.S. Energy Information Administration uses Form EIA-861 to collect annual data from thousands of electric utilities about their sales, revenue, peak loads, customer counts, energy efficiency savings, and more. More than 3,400 utilities submitted the form (or its shorter cousin, EIA-861S) for 2017, and the data go back to 1990. [h/t Jordan Wirfs-Brock] | https://www.eia.gov/electricity/data/eia861/ | http://www.jordanwb.com/ |
630 | 2018.10.31 | 4 | Coal cleanup funds. | What happens when coal mines shut down? Money for their cleanup is supposed to be ensured by a system of bonds. But when Climate Home News’ Mark Olalde investigated these remediation funds, he found “a system incapable of dealing with large-scale bankruptcies, amid a declining industry, which severely threatens the environment and future of coal-mining communities across the country.” You can download the data behind Olalde’s findings — including bond databases covering the “23 states that produce 99% of US coal,” obtained via public records requests. [h/t Megan Darby] | http://www.climatechangenews.com/2018/03/15/investigated-coal-industrys-clean-funds/ http://www.climatechangenews.com/2018/03/14/us-coal-hasnt-set-aside-enough-money-clean-mines/ http://www.climatechangenews.com/2018/03/15/us-coal-mines-clean-up-bonds-database/ | https://twitter.com/climatemegan |
631 | 2018.10.31 | 5 | Probably-fake political committees. | When the Federal Election Commission receives a registration form that contains “questionable information” from a candidate or committee, the agency asks for additional information. If the FEC doesn’t get a proper response, it adds the registration to its dataset of “unverified filers”. Among the 500+ registrations currently on the list: “VoldemortCantStopTheVote.org”, “Department of Treasury,” “Wookie PAC,” and “Al Pacino.” [h/t Chris Zubak-Skees] | https://www.fec.gov/data/advanced/?tab=filings | https://twitter.com/zubakskees |
632 | 2018.11.07 | 1 | Court decisions. | The Caselaw Access Project aims “to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.” Currently, the project provides an API for fetching data on more than 6 million cases published between 1658 and 2018 — though public access is limited to downloading 500 cases per day. You can also download bulk data for all cases in Illinois and Arkansas, but getting bulk data for other states currently requires a research agreement. [h/t Caitlin Ostroff] | https://case.law/about/ https://case.law/api/ https://case.law/about/#usage https://case.law/bulk/ | https://twitter.com/ceostroff/status/1056983328711196673 |
633 | 2018.11.07 | 2 | Foreign gifts to U.S. universities. | The Department of Education requires U.S. universities to report all major gifts from (and contracts with) foreign entities. The agency’s database of these gifts and contracts currently covers 2012 to mid-2018, and includes 18,000+ entries from more than 150 schools. Related: In the wake of Jamal Khashoggi’s murder, the AP’s Collin Binkley and Chad Day used the data to examine colleges’ financial ties to Saudi Arabia. [h/t Meghan Hoyer] | https://studentaid.ed.gov/sa/about/data-center/school/foreign-gifts https://catalog.data.gov/dataset/foreign-gifts-and-contracts-report-2011 https://www.apnews.com/4d56411af6a8490e8030eacab4401571 | https://twitter.com/MeghanHoyer |
634 | 2018.11.07 | 3 | European protests, 1980 to 1995. | A team led by University of Kansas professor Ron Francisco has collected and codified data on protests, strikes, and other “coercive acts” in dozens of European countries during the late 20th century. There’s a row for each day of each protest, and each row specifies the issue at stake, the organizers, their target, the type of action, and the location — as well as the number of protesters, arrests, injuries, and deaths. [h/t Alexandre Léchenet] | http://web.ku.edu/~ronfrand/index.html http://web.ku.edu/~ronfrand/data/index.html | http://alphoenix.net/ |
635 | 2018.11.07 | 4 | To swerve or not to swerve. | A recent study revealed the results of “the Moral Machine, an online experimental platform designed to explore the moral dilemmas faced by autonomous vehicles.” The experiment asked participants to decide whether a self-driving car — faced with two deadly options — should stay on course (killing one group of pedestrians) or swerve (killing another). The project “gathered 40 million decisions in ten languages from millions of people in 233 countries and territories,” and a dataset containing every decision is available to download. Read more: “Should a self-driving car kill the baby or the grandma? Depends on where you’re from.” [h/t Walt Hickey] | https://www.nature.com/articles/s41586-018-0637-6 https://osf.io/3hvt2/?view_only=4bb49492edee4a8eb1758552a362a2cf https://www.technologyreview.com/s/612341/a-global-ethics-study-aims-to-help-ai-solve-the-self-driving-trolley-problem/ | https://numlock.substack.com/p/numlock-news-october-26-2018 |
636 | 2018.11.07 | 5 | All things Star Trek. | STAPI bills itself as “the first public Star Trek API.” It provides access to structured data not only about the fictional universe (e.g., 6,364 characters, 1,215 spacecraft, and 155 conflicts) but also its intersection with reality (e.g., 5,302 performers, 731 television episodes, 76 soundtracks). [h/t Cezary Kluczyński] | http://stapi.co/ | https://github.com/toddmotto/public-apis/commit/ca9f9d7869351a1653793deabd7c971e063d608d |
637 | 2018.11.28 | 1 | How high? | The German Aerospace Center is publishing global elevation data derived from its TanDEM-X satellite mission. For five years, two satellites orbited Earth together in a formation that allowed their radars to “'see' the same land area, but from slightly different perspectives” and to calculate elevations based on those differences. Although the most detailed versions of the data are “subject to restrictions due to the potential for commercial exploitation, and thus requires a scientific proposal,” the least detailed version (which still clocks in at more than 90 gigabytes) can be downloaded for free. [h/t Matt Brealey] | https://www.dlr.de/dlr/en/desktopdefault.aspx/tabid-10081/151_read-30139/year-all/#/gallery/32238 https://geoservice.dlr.de/web/dataguide/tdm90/ | https://twitter.com/badgrenola/status/1049698621413892096 |
638 | 2018.11.28 | 2 | Word associations. | The Small World of Words project “is a large-scale scientific study that aims to build a mental dictionary or lexicon in the major languages of the world.” The experiment has asked hundreds of thousands of participants to list their immediate associations with various words (such as “telephone,” “journalist,” and “yoga”). In all, the project has collected more than 15 million responses. You can download the data, examine the project’s analysis pipeline, and explore the responses online. [h/t Lewis Mitchell] | https://smallworldofwords.org/en/project https://smallworldofwords.org/en/project/stats https://smallworldofwords.org/en/project/research https://github.com/SimonDeDeyne/SWOWEN-2018 https://smallworldofwords.org/en/project/explore | https://twitter.com/lewis_math/status/1057894391707131904 |
639 | 2018.11.28 | 3 | International labor treaties. | Bilateral labor agreements regulate the migration of workers between two countries, and the Bilateral Labor Agreements Dataset aims to catalog as many of these treaties as it can. So far the University of Chicago Law School professors and researchers running the initiative have identified 582 treaties signed between 1945 and 2015. “However, this list is almost certainly underinclusive,” they write. “Many BLAs are not deposited in the major international treaty databases and they often do not receive much, if any, publicity.” [h/t Adam Chilton] | https://www.law.uchicago.edu/bilateral-labor-agreements-dataset | https://twitter.com/adamschilton/status/905943960882962432 |
640 | 2018.11.28 | 4 | Cattle, buffaloes, horses, sheep, goats, pigs, chickens, and ducks. | Last month, an international team of researchers published the third major version of their Gridded Livestock of the World dataset, which estimates the global distribution of cattle, buffaloes, horses, sheep, goats, pigs, chickens and ducks. The new dataset is based on 2010 statistics and provides estimates at “a spatial resolution of 0.083333 decimal degrees (approximately 10 km at the equator).” | https://www.nature.com/articles/sdata2018227#ref7 http://www.fao.org/livestock-systems/global-distributions/en/ https://dataverse.harvard.edu/dataverse/glw_3 | |
641 | 2018.11.28 | 5 | Dog bites. | New York City’s Department of Health publishes a dataset of 8,000+ reported instances of dogs biting humans, mostly from 2015 through 2017. The agency collects the reports “to determine if the biting dog is healthy ten days after the person was bitten in order to avoid having the person bitten receive unnecessary rabies shots.” [h/t Justin Baker] | https://data.cityofnewyork.us/Health/DOHMH-Dog-Bite-Data/rsgh-akpg | https://twitter.com/AskJustinBaker/status/1030548250623987712 |
642 | 2018.12.05 | 1 | Novel-writing, recorded. | In 2014, author C. M. Taylor began writing a new novel, this time with a twist: He would write the entire story on a laptop intentionally infected with spyware. With the help of the British Library, a program recorded every keystroke Taylor typed and took screenshots every few seconds. The novel, Staying On, was published in October; soon after, Taylor and the library made the spyware recordings available to download. [h/t Dan Hett] | https://en.wikipedia.org/wiki/C._M._Taylor https://blogs.bl.uk/english-and-drama/2018/11/c-m-taylor-on-keystroke-logging-project-with-british-library.html https://cmtaylorstory.com/portfolio/staying-on/ https://twitter.com/CMTaylorStory/status/1067066295471038464 https://data.bl.uk/cmtaylorkeylogging/ | https://twitter.com/danhett/status/1067359176131821568 |
643 | 2018.12.05 | 2 | Body camera usage. | The New Orleans Police Department’s “Body Worn Camera Metadata” contains the dates, times, durations, and locations for 2.7 million body camera recordings, going back to 2014. Related: The agency publishes similar data for 1.5 million in-car camera recordings. [h/t Alexandre Léchenet] | https://data.nola.gov/Public-Safety-and-Preparedness/NOPD-Body-Worn-Camera-Metadata/qarb-kkbj https://data.nola.gov/Public-Safety-and-Preparedness/NOPD-In-Car-Camera-Metadata/md3v-ph3u | http://lepanierasalade.fr/ |
644 | 2018.12.05 | 3 | UK grants. | The British nonprofit 360Giving helps grantmakers “to publish their grants data in an open, standardised way and helps people to understand and use the data.” Through its GrantNav platform, you can search across more than 300,000 grants — totalling more than £25 billion — given by scores of funders to nearly 180,000 recipients. You can download the results of each search, as well as the underlying datasets. [h/t Enigma Public] | http://www.threesixtygiving.org/ http://grantnav.threesixtygiving.org/ http://grantnav.threesixtygiving.org/datasets/ | https://us5.campaign-archive.com/?u=04aa10cf99e0998bd8e69a109&id=4a67186166 |
645 | 2018.12.05 | 4 | Subtitle word frequencies. | SUBTLEXus is a dataset of word frequencies in American English, derived from the subtitles for 8,388 films. The dataset, which covers more than 74,000 words, includes each word’s total frequency, the number of films in which the word appeared, and several other metrics. Bonus: Similar datasets are also available for Chinese and Dutch. [h/t The Language Goldmine] | https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/overview.htm https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/overview.htm http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl | http://languagegoldmine.com/ |
646 | 2018.12.05 | 5 | Snow plows. | Last month, I Quant NY’s Ben Wellington analyzed New York City’s raw snow plow data, “which had only been viewed 41 times before apparently.” The 250 million–row dataset is, as Wellington notes, “stored in an odd format” — snapshots that indicate, every 15 minutes, the last time each of the city’s street segments was plowed. Related: ClearStreets provides historical data from the City of Chicago’s Plow Tracker; Iowa Department of Transportation also publishes a live plow tracker; Syracuse and Pittsburgh have published historical snow plow data. | http://iquantny.tumblr.com/about http://iquantny.tumblr.com/post/180300705249/data-shows-no-increase-in-nyc-plowing-as-storm https://data.cityofnewyork.us/City-Government/DSNY-PlowNYC-Data/rmhc-afj9 http://clearstreets.org/data http://www.cityofchicago.org/city/en/depts/mayor/iframe/plow_tracker.html http://data.iowadot.gov/datasets/20a0c10c06a54240b5f2893e0187e22c_0 http://data.syrgov.net/datasets?t=snowplow https://data.wprdc.org/dataset/snow-plow-activity-2015-2016 | |
647 | 2018.12.12 | 1 | Bike and pedestrian safety. | A growing number of cities publish detailed data on bicyclist and pedestrian injuries involving cars, including New York City, Chicago, Boston, Seattle, St. Paul, Minn., Chapel Hill, N.C., Tempe, Ariz., Toronto, and London — many through the cities’ “Vision Zero” street-safety initiatives. (Some of the datasets also include car-on-car collisions.) Related: “The most dangerous intersections in Seattle for bicyclists and pedestrians.” [h/t Rachel Schallom + Jeff Asher] | http://www.nyc.gov/html/dot/html/about/vz_datafeeds.shtml https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if https://data.boston.gov/dataset/vision-zero-crash-records https://data.seattle.gov/Transportation/Collisions/vac5-r8kk https://information.stpaul.gov/Public-Safety/Pedestrian-And-Bike-Crash-Data-Dataset/bw92-5h94 https://www.chapelhillopendata.org/explore/?sort=modified&q=crashes https://data.tempe.gov/dataset/high-severity-traffic-crashes-1-08 http://opendata-torontops.opendata.arcgis.com/datasets/55d5b9f7af7d4710bc98743b2c005f02_0 https://tfl.gov.uk/corporate/publications-and-reports/road-safety https://visionzeronetwork.org/resources/vision-zero-cities/ https://www.seattletimes.com/seattle-news/transportation/the-most-dangerous-intersections-in-seattle-for-bicyclists-and-pedestrians/ | http://www.rachelschallom.com/ https://twitter.com/Crimealytics/status/862709032515194880 |
648 | 2018.12.12 | 2 | Computer vulnerabilities. | Common Vulnerabilities and Exposures is a downloadable list of more than 110,000 “publicly known cybersecurity vulnerabilities.” Each vulnerability is assigned a unique identifier (e.g., CVE-2014-0160) and given a description. The National Institute of Standards and Technology’s National Vulnerability Database takes the list and adds more information for each entry, “such as fix information, severity scores, and impact ratings.” That database is available in a variety of bulk downloads and data feeds; you can also search it online. [h/t GitHub user "nanoseconds"] | https://cve.mitre.org/ https://cve.mitre.org/data/downloads/index.html https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0160 https://nvd.nist.gov/ https://cve.mitre.org/about/cve_and_nvd_relationship.html https://nvd.nist.gov/vuln/data-feeds https://nvd.nist.gov/vuln/search | https://github.com/toddmotto/public-apis/commit/5bed75a1ea9fb3df6cc03101bc0441acd00f0273 |
649 | 2018.12.12 | 3 | Sunniness. | The National Renewable Energy Laboratory’s solar datasets measure the average annual and monthly “total solar resource” for the United States, broken down by state, county, ZIP code, and roughly-10-square-kilometer chunks of the country. Bonus: More sun-radiation datasets via this Open Data Stack Exchange answer. [h/t Joe Hourclé] | https://www.nrel.gov/gis/data-solar.html https://opendata.stackexchange.com/questions/1064/api-for-sun-radiation-illuminance-data/1065#1065 | https://opendata.stackexchange.com/users/263/joe |
650 | 2018.12.12 | 4 | German political speeches. | Academic researcher Adrien Barbaresi has compiled a corpus of thousands of speeches from the German Presidency, Presidency of the Bundestag, Chancellery, and Ministry of Foreign Affairs. The corpus, now in its third version, was first released in 2011. [h/t Adrien Barbaresi] | http://adrien.barbaresi.eu/ http://adrien.barbaresi.eu/corpora/speeches/ | https://github.com/awesomedata/apd-core/blob/4955130a8898b4d5ace0e699bc1506b3be75b659/core/NaturalLanguage/German-Political-Speeches-Corpus.yml |
651 | 2018.12.12 | 5 | Boy bands. | The Pudding’s Internet Boy Band Database is “an audio-visual history of every boy band to chart on the Billboard Hot 100 since 1980.” You can download the underlying data, which is stored in two files: boys.csv and bands.csv. | https://pudding.cool/2018/11/boy-bands/ https://github.com/the-pudding/data/tree/master/boybands | |
652 | 2018.12.19 | 1 | Life expectancy by Census tract. | The CDC’s Small-area Life Expectancy Estimates Project calculates how long someone, born in a given Census tract in 2010–15, might expect to live. The estimates are based on a combination of death records, Census population data, and statistical modeling. Related: “Map: What story does your neighborhood’s life expectancy tell?” (Quartz). Previously: Life expectancy by income, gender, and city (DIP 2016.04.13), and by country (DIP 2017.02.08). [h/t Dan Kopf] | https://www.cdc.gov/nchs/nvss/usaleep/usaleep.html https://qz.com/1462111/map-what-story-does-your-neighborhoods-life-expectancy-tell/ https://healthinequality.org/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-02-08-edition https://www.who.int/gho/mortality_burden_disease/life_tables/situation_trends/en/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-02-08-edition | https://twitter.com/dkopf/status/1073270528964608000 |
653 | 2018.12.19 | 2 | Global debt. | The International Monetary Fund’s Global Debt Database brings together “total gross debt” numbers for 190 countries, for the years 1950 to 2017. The database features a detailed methodology and includes indicators of government, household, and corporate debt. | https://www.imf.org/external/datamapper/datasets/GDD https://www.imf.org/en/Publications/WP/Issues/2018/05/14/Global-Debt-Database-Methodology-and-Sources-45838 | |
654 | 2018.12.19 | 3 | Philly property transactions. | Philadelphia’s Department of Records has begun publishing a dataset of all real estate transfers recorded since late 1999. The 3.7 million records include deeds, mortgages, condo declarations, and a few other types of documents. The deed data includes each property’s fair market value, address, grantor and grantee names, various taxes, and more. Bonus: An interactive visualization of the data. Previously: UK property sales (DIP 2016.03.23). [h/t Michael McLaughlin] | https://www.opendataphilly.org/dataset/real-estate-transfers https://data.phila.gov/visualizations/real-estate-transfers https://www.gov.uk/government/collections/price-paid-data https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-03-23-edition | https://www.datainnovation.org/2018/11/tracking-property-transactions-in-philadelphia/ |
655 | 2018.12.19 | 4 | How much alcohol? | Open Units is a dataset detailing the total amount of alcohol in 1,000+ beer and cider offerings, “based on information made public by drinks manufacturers, distributors and retailers.” For instance, a 355-mL bottle of Sierra Nevada Pale Ale contains 20 mL of alcohol, the same as a pint of Bud Light. [h/t Giuseppe Sollazzo] | https://www.getthedata.com/open-units | https://us5.campaign-archive.com/?u=77ecabbd32e97a6caa9d7d40b&id=ac77ad17f9 |
656 | 2018.12.19 | 5 | A lot of insect eggs. | A team of evolutionary biologists has compiled a dataset describing the size and shape of eggs laid by more than 6,700 insect species. You can explore and download the underlying data, which is based on measurements from 1,756 published sources. [h/t Cassandra Extavour] | https://www.biorxiv.org/content/early/2018/11/19/471953 https://shchurch.github.io/dataviz/index.html https://github.com/shchurch/insect_egg_database_viz/tree/master/data https://www.biorxiv.org/content/biorxiv/suppl/2018/11/19/471953.DC1/471953-1.pdf | https://twitter.com/redmakeda/status/1064708154276089856 |
657 | 2019.01.02 | 1 | Local incarceration, 1970–2015. | The Vera Institute of Justice’s recently-expanded Incarceration Trends project combines data from a range of government reports — such as the Census of Jails and the National Corrections Reporting Program — into a single, longitudinal, well-documented dataset. For each county and year, the dataset tallies the number of people admitted to jails and prisons, the average daily incarcerated jail and prison population, and other related details. Many of the counts are also broken down by race, ethnicity, and sex. Bonus: The institute’s interactive map of the data. [h/t Chris Henrichson + Sam Petulla] | https://www.vera.org/blog/expanding-our-knowledge-on-local-incarceration-trends https://github.com/vera-institute/incarceration_trends http://trends.vera.org/incarceration-rates | https://twitter.com/chenrichson/status/1073642842566918145 https://twitter.com/spetulla |
658 | 2019.01.02 | 2 | Hourly pedestrians. | Melbourne, Australia, has placed dozens of pedestrian-counting sensors across the city, and publishes a dataset of the hourly observations going back to 2009. Now you know: Among the 2.5 million entries so far, the highest count has been the 12,289 pedestrians at the Bourke Street pedestrian bridge between 6pm and 7pm on Friday, October 26, 2018. Bonus: Melbourne’s interactive map of the data. Related: Pedestrian counts from the Brooklyn Bridge and Somerville, Massachusetts. | https://data.melbourne.vic.gov.au/Transport-Movement/Pedestrian-volume-updated-monthly-/b2ak-trbp https://mapio.net/pic/p-36548659/ http://www.pedestrian.melbourne.vic.gov.au/ https://data.cityofnewyork.us/Transportation/Brooklyn-Bridge-Automated-Pedestrian-Counts-Demons/6fi9-q3ta https://data.somervillema.gov/dataset/Bicycle-Pedestrian-Counts/qu9x-4xq5 | |
659 | 2019.01.02 | 3 | Nighttime brightness in Niger and Nigeria. | A pair of researchers have used satellite imagery to quantify nighttime lights in five urban areas in Niger and Nigeria — Agadez, Katsina, Maradi, Niamey, and Zinder. Describing their findings in a recent issue of Scientific Data, the researchers write, “Our data showed 1) urban illumination fluctuated seasonally, 2) corresponding population fluctuations were sufficient to drive seasonal measles outbreaks, and 3) overlooking these fluctuations during vaccination activities resulted in below-target coverage levels, incapable of halting transmission of the virus.” | https://scholarsphere.psu.edu/concern/generic_works/cjs956g06f https://www.nature.com/articles/sdata2018256 | |
660 | 2019.01.02 | 4 | Ocean noises. | The UK Marine Noise Registry tracks “human activities in UK seas that produce loud, low to medium frequency (10Hz – 10kHz) impulsive noise” — including pile-driving, explosives, military sonar, and “acoustic deterrent devices.” For each of the UK’s oil and gas licensing blocks, the registry’s published data counts the number of days that a given type of impulsive noise was generated. Related: Owen Boswarva has built an interactive map of the data. [h/t Giuseppe Sollazzo] | http://jncc.defra.gov.uk/page-7070 https://www.ogauthority.co.uk/data-centre/data-downloads-and-publications/licence-data/ https://data.gov.uk/search?q=%22Marine+Noise+Registry%22&filters%5Bpublisher%5D=Joint+Nature+Conservation+Committee&filters%5Btopic%5D=&filters%5Bformat%5D=&sort=recent https://www.owenboswarva.com/ https://www.datadaptive.com/mnr/ | https://us5.campaign-archive.com/?u=77ecabbd32e97a6caa9d7d40b&id=f6b38561d7 |
661 | 2019.01.02 | 5 | A highly-measured beach. | The Narrabeen-Collaroy Beach Survey Program has been measuring a major stretch of the Sydney shore every month since April 1976. You can explore the data online and (free registration required) download it. [h/t Robbi Bishop-Taylor + Mitchell Harley] | http://narrabeen.wrl.unsw.edu.au/ http://narrabeen.wrl.unsw.edu.au/explore_data/ http://narrabeen.wrl.unsw.edu.au/download/narrabeen/ | https://twitter.com/EarthObserved/status/1070866071165448193 https://twitter.com/DocHarleyMD/status/1070870424869777408 |
662 | 2019.01.09 | 1 | The kids these days — and four decades ago. | Monitoring the Future surveys approximately 50,000 eighth-, tenth-, and twelfth-grade students in the U.S. each year. The project, which is funded by the National Institute on Drug Abuse, has been running since 1975. Although the project is best known for its detailed drug-use questions, the surveys also ask questions related to education, labor, sex, race, politics, happiness, and other topics. Public-use versions of the data are available through the National Addiction & HIV Data Archive Program (free registration required). [h/t Dan Kopf] | http://www.monitoringthefuture.org/ http://www.monitoringthefuture.org/purpose.html https://www.icpsr.umich.edu/icpsrweb/NAHDAP/series/35?start=0&SERIESQ=35&ARCHIVE=NAHDAP&sort=DATEUPDATED%20desc&rows=50 | https://twitter.com/dkopf |
663 | 2019.01.09 | 2 | Sound effects. | Last spring, the BBC published an archive of 16,000+ sound effects, licensed “for personal, educational or research purposes.” Each audio file is accompanied by a description, categorization, and its length. For instance, the first sound effect on the archive’s page is a 194-second clip described as “two-stroke petrol engine driving small elevator, start, run, stop,” and categorized as “Engines: Petrol.” Not documented, but useful: You can download a CSV of the metadata. Highlight: The one-two punch of “several men snoring, hilariously” and “several men snoring, less hilariously.” [h/t Amy King] | http://bbcsfx.acropolis.org.uk/ http://bbcsfx.acropolis.org.uk/assets/BBCSoundEffects.csv http://bbcsfx.acropolis.org.uk/?q=hilariously | https://twitter.com/sephiramy/status/1080887893265068032 |
664 | 2019.01.09 | 3 | Fiscal crises. | Researchers at the International Monetary Fund have built a historical database of fiscal crises, defined as “periods of extreme fiscal distress, when governments have not been able to contain large fiscal imbalances leading to the adoption of extreme measures (e.g., debt default and monetization of the deficit).” The researchers, building off of previous work, have “expand[ed] the country coverage to 188 countries, over 1970-2015, more than double the size of the sample relative to many other studies,” and identified 436 distinct episodes of fiscal crisis. [h/t David Tercero Lucas] | https://www.imf.org/en/Publications/WP/Issues/2017/04/03/Fiscal-Crises-44795 | https://twitter.com/David_III_L/status/1079827591685652480 |
665 | 2019.01.09 | 4 | Dual citizenship policies. | If you decide to acquire a new citizenship, do you get to keep your previous one? Are you allowed to renounce it? The Maastricht Center for Citizenship, Migration and Development’s Global Expatriate Dual Citizenship Dataset tracks how 200 countries have, each year since 1960, treated this situation. The extensive documentation provides links to the relevant laws, and descriptions of how each country’s rules have changed. [h/t Sam Petulla] | https://macimide.maastrichtuniversity.nl/dual-cit-database/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TTMZ08 | https://twitter.com/spetulla |
666 | 2019.01.09 | 5 | From Spanskgrøn to Østerland. | Norse World is an “online, open access searchable index and mapping of the foreign place names found in medieval East Norse texts.” Through the project’s interactive map, you can search and download the data. | https://www.uu.se/en/research/infrastructure/norseworld/ https://norseworld.nordiska.uu.se/index.php https://www.uu.se/en/research/infrastructure/norseworld/using-norse-world/search-and-filters https://www.uu.se/en/research/infrastructure/norseworld/using-norse-world/exporting-the-data | |
667 | 2019.01.16 | 1 | Journalists killed, imprisoned, and missing. | The Committee to Protect Journalists maintains a database of journalists who’ve been killed for reasons related to their work. The database goes back to 1992 and contains more than 1,300 entries, with details about the journalists, the circumstances of their deaths, and whether perpetrators have been convicted. More recently, the organization has also begun publishing data on journalists who’ve been imprisoned or gone missing. [h/t Giuseppe Sollazzo] | https://cpj.org/data/killed https://cpj.org/data/methodology https://cpj.org/data/imprisoned https://cpj.org/data/missing | https://mailchi.mp/0c00c1b1d808/preview-222-in-other-news-3678081?e=6c87ff0227 |
668 | 2019.01.16 | 2 | Congressional district demographics. | The Census Bureau’s My Congressional District tool lets you browse (and download) demographic, socioeconomic, and business data corresponding to each of the country’s 435 congressional districts. Political scientist Ella Foster-Molina has compiled a historical dataset containing similar information for 1972 to 2014; it also contains details about each district’s representatives — such as their personal characteristics, the committees they served on, and the number of bills they sponsored. [h/t Josh McCrain + Derek Willis] | https://www.census.gov/mycd/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CI2EPI | https://twitter.com/joshmccrain/status/1082321114708275200 https://twitter.com/derekwillis/status/1082302252965117952 |
669 | 2019.01.16 | 3 | Political party data, linked. | Party Facts is a “collaborative data collection” that links various political-party datasets together. The project has two main tables. One contains basic information about 4,100+ political parties in more than 200 countries, including each party’s mother-tongue name and English translation, year founded, and Wikipedia page. The second table cross-references each party with its unique identifier in 26 external datasets, such as ParlGov (DIP 2018.09.19), The Manifesto Project (DIP 2017.06.21), and the Constituency-Level Elections Archive (DIP 2016.09.28). [h/t Matt Grossmann + Erik Gahner] | https://partyfacts.herokuapp.com/ https://partyfacts.herokuapp.com/download/ https://partyfacts.herokuapp.com/data/ http://www.parlgov.org/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-09-19-edition https://manifesto-project.wzb.eu/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-06-21-edition http://www.electiondataarchive.org/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-09-28-edition | https://twitter.com/MattGrossmann/status/1083560158431723520 https://github.com/erikgahner/PolData/commit/3963599404680a5305a66daeda01bc52972118f9 |
670 | 2019.01.16 | 4 | Old shipping logs. | In previous centuries, maritime officers kept “detailed log books of the ships’ activities and management,” including observations of the wind and weather. The Climatological Database for the World's Oceans 1750-1850 has digitized a quarter-million entries from such logbooks, originally written in Dutch, English, French, and Spanish, and published them as detailed, structured data. Helpful: Steven Ottens has converted the project’s fixed-width files into tab-delimited data. [h/t Robi Sen + Roger Davies + Topi Tjukanov] | https://webs.ucm.es/info/cliwoc/object.htm https://webs.ucm.es/info/cliwoc/ http://projects.knmi.nl/cliwoc/ https://stvno.github.io/page/cliwoc/ | https://twitter.com/robi_sen/status/1049016327996755968 https://twitter.com/rogercdavies/status/1048926264575369218 https://twitter.com/tjukanov/status/1048498066230312960 |
671 | 2019.01.16 | 5 | Moooooooooo. | The U.S. Department of Agriculture’s Dairy Data Set contains annual tabulations of production, sales, imports, exports, consumption, and other economic aspects of “the U.S. dairy situation.” As seen in: “Nobody Is Moving Our Cheese: American Surplus Reaches Record High” (NPR). | https://www.ers.usda.gov/data-products/dairy-data/ https://www.npr.org/2019/01/09/683339929/nobody-is-moving-our-cheese-american-surplus-reaches-record-high | |
672 | 2019.01.30 | 1 | Fatal and non-fatal gun crime. | On Thursday, Sarah Ryley, Sean Campbell, and I published a deeply-reported investigation into U.S. cities’ failure to solve shootings — a year-long collaboration between The Trace and BuzzFeed News. To reach our quantitative findings, we analyzed (and standardized) three major FBI datasets, internal data from 22 police departments, and a database of Baltimore victims and suspects. Data, code, and methodologies for the analyses are available on GitHub. Related: Last year, The Washington Post published Murder with Impunity, a series examining unsolved homicides; their data, on 52,000+ homicides in 50 cities, is also available on GitHub. | https://www.buzzfeednews.com/article/sarahryley/police-unsolved-shootings https://www.thetrace.org/ https://www.buzzfeednews.com/ https://www.buzzfeednews.com/article/sarahryley/5-things-to-know-about-cities-failure-to-arrest-shooters https://github.com/the-trace-and-buzzfeed-news/federal-crime-data-analysis https://github.com/the-trace-and-buzzfeed-news/local-police-data-analysis https://github.com/the-trace-and-buzzfeed-news/baltimore-shootings-analysis https://github.com/the-trace-and-buzzfeed-news/introduction https://www.washingtonpost.com/graphics/2018/investigations/where-murders-go-unsolved/?utm_term=.a53db2e96521 https://github.com/washingtonpost/data-homicides | |
673 | 2019.01.30 | 2 | Hourly rainfall. | Since 1997, the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) algorithm has used satellite imagery to estimate rainfall rates around the world. The system’s hourly, daily, monthly, and annual estimates can now be explored online and downloaded. | https://www.nature.com/articles/sdata2018296?WT.ec_id=SDATA-201901 http://chrs.web.uci.edu/SP_activities00.php https://chrsdata.eng.uci.edu/ | |
674 | 2019.01.30 | 3 | Ethnonationalism. | Christina Isabel Zuber and Edina Szöcsik’s Ethnonationalism in Party Competition dataset compiles ratings for more than 200 political parties in 22 European countries. Experts rated the parties twice — first in 2011, and then again in 2017 — on a range of factors, such as the centrality of ethnonationalism to the parties’ platforms, and their positions on territorial autonomy for minorities. (Dataset access requires providing a name and email address.) [h/t Erik Gahner] | http://christinazuber.com/data/ | https://github.com/erikgahner/PolData |
675 | 2019.01.30 | 4 | Uncertain spellings. | “Funemployed programmer” Colin Morris looked for all the times where commenters on Reddit added “(sp?)”, or a related annotation, to their remarks. E.g., “SF is putting on quite a show, especially Kapernick (sp?).” Morris then compiled a dataset of the words that preceded those annotations, accompanied by examples of their usage. [h/t Rich Posert] | http://colinmorris.github.io/about/ https://www.reddit.com/r/CFB/comments/16h0ih/our_most_comments_in_a_game_thread_record_didnt/c7vyofl/ https://github.com/colinmorris/reddit-dubious-spelling | https://twitter.com/PosertInLab/status/1085362032583442432 |
676 | 2019.01.30 | 5 | Twin City radio spins. | “Shane Nackerud needed to know: Does 89.3 the Current play the Replacements every day?” To figure it out, the University of Minnesota librarian extracted track listings from 1.1 million @currentplaylist tweets from 2009 through 2018. He’s also published the total play counts by artist and the raw data. [h/t Kent Gerber + Amy Riegelman] | http://www.citypages.com/music/what-songs-artists-has-the-current-played-most-since-2009-this-u-of-m-librarian-crunched-the-numbers/504392941 https://twitter.com/currentplaylist https://docs.google.com/spreadsheets/d/1ByxdKfjDQ7RtSvLRaufHPa7hYSLXfzUO9QaFgVf_fJc/edit#gid=1779654962 https://drive.google.com/file/d/1XuWC9oTuQkwYlP6R6ljsOIpvMWfC-meZ/view | https://twitter.com/ktkgerber/status/1085944034240225281 https://twitter.com/amylibrarian/status/1085668015008608256 |
677 | 2019.02.06 | 1 | Cities’ CO2 emissions. | An international team of researchers has created a dataset of 343 cities’ CO2 emissions. The researchers aggregated and standardized the emissions data — largely self-reported — from three sources: the Carbon Disclosure Project, the Bonn Center for Local Climate Action and Reporting, and a new project at Peking University. The dataset includes cities large and small, from Lagos and Shanghai to Kadıovacık, Turkey (pop. 216) and Brisbane, California (pop. ~4,700). In addition to emissions, the dataset also provides contextual information about the cities, such as average household sizes and gasoline prices. | https://www.nature.com/articles/sdata2018280?WT.ec_id=SDATA-201901 https://doi.pangaea.de/10.1594/PANGAEA.884141 https://data.cdp.net/Emissions/2016-Citywide-GHG-Emissions/dfed-thx7 https://carbonn.org/ | |
678 | 2019.02.06 | 2 | Cabinet turnover. | For a recent analysis of Trump administration turnover, FiveThirtyEight compiled a dataset of the last seven presidents’ cabinets — covering the 24 positions included in Donald Trump’s cabinet. (As author Nathaniel Rakich notes, “Not every president designates the same positions to be in the Cabinet.”) The dataset includes each cabinet member’s position, start date, departure date, and total days in office. | https://fivethirtyeight.com/features/two-years-in-turnover-in-trumps-cabinet-is-still-historically-high/ https://github.com/fivethirtyeight/data/tree/master/cabinet-turnover | |
679 | 2019.02.06 | 3 | Oklahoma prisoners. | In the course of investigating why Oklahoma’s female incarceration rate is so high, The Frontier and the Center for Investigative Reporting obtained “a decade’s worth of state prison data never before analyzed by the state itself.” The data includes information about each prisoner, their prison sentences, and their entries and exits from Department of Corrections supervision. [h/t Dan Nguyen] | https://www.revealnews.org/article/let-down-and-locked-up-why-oklahomas-female-incarceration-is-so-high/ https://www.readfrontier.org/ https://www.revealnews.org/ https://www.revealnews.org/article/before-you-dive-into-oklahomas-prison-data-read-reveals-tips/ | https://www.reddit.com/r/datasets/comments/ajqp57/oklahoma_prisoners_2017_280k_records/ |
680 | 2019.02.06 | 4 | DC taxi rides. | The District of Columbia’s taxi trip data covers 2015–17 and includes each trip’s pickup and dropoff location, mileage, total fare, tip amount, and other details. Previously: Chicago and NYC taxi rides (DIP 2016.12.07). [h/t Richard Sigman] | http://opendata.dc.gov/datasets?q=taxicabs https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-12-07-edition | |
681 | 2019.02.06 | 5 | Advice sought. | To support its data-driven feature, “30 Years of American Anxieties,” The Pudding gathered 20,000 questions posed to legendary advice columnist Dear Abby. | https://pudding.cool/2018/11/dearabby/ https://github.com/the-pudding/data/tree/master/dearabby | |
682 | 2019.02.13 | 1 | The Census, but for forests. | The U.S. Forest Service’s Forest Inventory and Analysis program tracks “trends in forest area and location; in the species, size, and health of trees; in total tree growth, mortality, and removals by harvest; in wood production and utilization rates by various products; and in forest land ownership.” It also “serves as perhaps the largest publicly available” dataset of “downed and dead wood.” The inventory is available to download and comes with user guides. | https://www.fia.fs.fed.us/ https://www.nature.com/articles/sdata2018303?WT.ec_id=SDATA-201901 https://apps.fs.usda.gov/fia/datamart/datamart.html https://www.fia.fs.fed.us/library/database-documentation/index.php | |
683 | 2019.02.13 | 2 | Rebel groups and natural resources. | The Resources and Conflict Project’s Rebel Contraband Dataset “measures if and how rebel groups earn income from the exploitation of natural resources or criminal activities.” The dataset spans 1990–2015, covers more than 70 countries, and specifies dozens of types of resources — such as oil, cannabis, gold, tea, and timber. [h/t Erik Gahner] | http://civilwardynamics.org/data/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/COQ65B | https://github.com/erikgahner/PolData |
684 | 2019.02.13 | 3 | German companies. | The Open Knowledge Foundation Deutschland and OpenCorporates have partnered to make Germany’s official business register available to download in bulk. The dataset contains basic information about more than 5 million German companies, and more than 4 million associated officers. Note: Although the dataset’s landing page is written in German, its documentation is available in English. Related: Joachim Gassen’s initial analysis of the companies’ locations, using R. [h/t Sharon Machlis] | https://okfn.de/en/ https://opencorporates.com/ https://okfn.de/blog/2019/02/finally-open-company-data/ https://blog.opencorporates.com/2019/02/06/german-company-data-now-available-for-download-via-open-knowledge-deutschland/ https://offeneregister.de/ https://offeneregister.de/daten/ https://joachim-gassen.github.io/2019/02/where-the-german-companies-are/ | https://twitter.com/sharon000 |
685 | 2019.02.13 | 4 | Two decades of tobacco (and e-cigarette) laws. | The CDC’s State Tobacco Activities Tracking and Evaluation system tracks “current and historical state-level legislative data on tobacco [and now also e-cigarette] use prevention and control policies.” The system’s datasets provide quarterly snapshots — going back to 1995 — of rules concerning taxes, youth access, licensing, fire safety, and more. | https://www.cdc.gov/statesystem/index.html https://chronicdata.cdc.gov/browse?limitTo=datasets&sortBy=alpha&tags=legislation&utf8=%E2%9C%93 | |
686 | 2019.02.13 | 5 | Obstacle courses. | Drawing upon a fan wiki, Matt Laessig has created a spreadsheet of all 889 obstacles in the first 10 seasons of American Ninja Warrior. (Free registration required to download.) [h/t Ilan Brat] | https://sasukepedia.fandom.com/wiki/List_of_American_Ninja_Warrior_obstacles https://twitter.com/MattLaessig https://data.world/ninja/anw-obstacle-history | https://www.linkedin.com/in/ilanbrat |
687 | 2019.02.20 | 1 | Power plants. | The Global Power Plant Database, published by the World Resources Institute, “is a comprehensive, open source database of power plants around the world” and contains “information on plant capacity, generation, ownership, and fuel type.” The current edition, released in June 2018, covers 28,600+ power plants in 164 countries — including more than 1,000 each in Brazil, Canada, China, Great Britain, France, and the United States. Previously: U.S. power plants (DIP 2016.02.10). [h/t Kelly Rose + Paul Deane] | http://datasets.wri.org/dataset/globalpowerplantdatabase https://www.wri.org/our-work https://www.eia.gov/electricity/data/eia923/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-02-10-edition | https://www.linkedin.com/in/kelly-rose-44237bb6/ https://www.linkedin.com/feed/update/urn:li:activity:6499389420544892928 |
688 | 2019.02.20 | 2 | The Oscars. | The Academy of Motion Picture Arts and Sciences website hosts two searchable databases related to their annual awards show: one of nominees and winners, and another of acceptance speeches. The Academy doesn’t provide direct downloads, but many folks have created structured datasets from the records. For instance: Statistics professor Adam B. Kashlak has built a dataset that combines speech word-counts, Best Picture winners’ budgets, and total broadcast length. And: Alex Albright’s analysis from a few years ago, “I’d Like to Thank the Academy… for making this data available,” is based on her dataset of all speeches from the 2010–14 broadcasts. [h/t Jay Arthur] | http://awardsdatabase.oscars.org/ http://aaspeechesdb.oscars.org/ https://sites.ualberta.ca/~kashlak/ https://sites.ualberta.ca/~kashlak/kashCodeData.html https://thelittledataset.com/about/ https://thelittledataset.com/2015/02/19/id-like-to-thank-the-academy-for-making-this-data-available/ https://github.com/apalbright/Oscars/blob/master/raw_data/oscars_10-14.csv | https://www.qimacros.com/about-knowware/jay-arthur-tqm-lean-six-sigma/ |
689 | 2019.02.20 | 3 | EU-funded projects in the UK. | MyEU.uk’s interactive map lets you search and explore tens of thousands of European Union–funded projects in the United Kingdom, aggregated from a range of official sources. The initiative, which opposes Brexit, has published its data-collection and data-processing code as well as a spreadsheet of all projects it has identified. [h/t Jovi Juan] | https://www.myeu.uk/ https://www.myeu.uk/about/ https://github.com/TechForUK/my_eu https://docs.google.com/spreadsheets/d/1doQnfcwxIBdTM1mecBevEt_4Vgih06fy_lWSZl0Hwac/edit#gid=0 | https://twitter.com/daoofj |
690 | 2019.02.20 | 4 | Electronic search warrants. | Thanks to a 2015 state bill, when California law enforcement agencies obtain search warrants for digital communications (or are granted access to such information in an emergency), they must notify the people whose information they targeted. The state’s Department of Justice publishes data about these notifications, including the agency name, the grounds for the warrant, the nature of the investigation, the companies searched (e.g., AT&T, Verizon, Google, Facebook), and more. As seen in: “San Bernardino County Sheriff's electronic surveillance use — already highest in state — continues to surge” (Palm Springs Desert Sun, Jan. 2019). | https://www.lawfareblog.com/so-whats-california-electronic-communications-privacy-act https://openjustice.doj.ca.gov/data https://www.desertsun.com/story/news/crime_courts/2019/01/10/san-bernardino-county-sheriffs-department-searches-electronic-property-up/2542376002/ | |
691 | 2019.02.20 | 5 | Golfing discs. | The Professional Disc Golf Association publishes a spreadsheet of flying objects officially approved for use in competition. [h/t Ryan Maus] [Note, 2019-02-20: Original item included incorrect link, now fixed.] | https://www.pdga.com/introduction https://www.pdga.com/documents/pdga-approved-discs | https://twitter.com/RPMaus |
692 | 2019.03.06 | 1 | Last words. | The Texas Department of Criminal Justice publishes a list of each death row inmate executed since 1982 — the year the state resumed capital punishment. In addition to providing basic demographic information, the listing also links to transcriptions of the inmates’ final statements. And although the state doesn’t provide the statements as structured data, Zi Chong Kao has created a spreadsheet of them (plus additional details extracted from the state’s website) for his interactive tutorial, Select Star SQL. Related: “‘Love’ Is the Most Common Word in Death Row Last Statements” (Will Young, Oct. 2018). [h/t Noah Veltman] | https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html https://en.wikipedia.org/wiki/Lists_of_people_executed_in_Texas https://selectstarsql.com/frontmatter.html#dataset https://selectstarsql.com/ https://medium.com/s/story/love-is-the-most-common-word-in-death-row-last-statements-f15ab0e8ad16 | https://noahveltman.com/ |
693 | 2019.03.06 | 2 | Crops. | The U.S. Department of Agriculture’s CropScape website provides interactive access to the agency’s Cropland Data Layer — “a raster, geo-referenced, crop-specific land cover data layer created annually for the continental United States using moderate resolution satellite imagery and extensive agricultural ground truth.” You can use CropScape to filter the data’s acreage estimates (for more than a hundred different crops) by state, county, or custom-drawn geographies — or download the complete data in bulk. [h/t Katie McGaughey] | https://nassgeodata.gmu.edu/CropScape/ https://www.nass.usda.gov/Research_and_Science/Cropland/SARS1a.php https://www.nass.usda.gov/Research_and_Science/Cropland/sarsfaqs2.php https://www.nass.usda.gov/Research_and_Science/Cropland/Release/ | https://twitter.com/ktmcgaughey/status/1095479304505475073 |
694 | 2019.03.06 | 3 | International students. | The UNESCO Institute for Statistics compiles data on “internationally mobile” university students, including annual numbers of students by country of origin and country of study. Related: UNESCO's interactive map of student flows. [h/t Francisco Marmolejo] | http://uis.unesco.org/ http://uis.unesco.org/en/glossary-term/international-or-internationally-mobile-students http://data.uis.unesco.org/index.aspx?queryid=171 http://uis.unesco.org/en/uis-student-flow | https://twitter.com/fmarmole/status/761920615842328576 |
695 | 2019.03.06 | 4 | School dress codes. | For a recent article in The Pudding, Amber Thomas and two data assistants “recorded every rule listed in each dress code” at 481 public high schools in 36 states, plus “the words used in the dress code’s rationale, as well as any listed sanctions for breaking the dress code.” The 15,000+ rules and 1,470 sanctions are available to download. | https://pudding.cool/2019/02/dress-code-sexualization/ https://github.com/the-pudding/data/tree/master/dress_codes | |
696 | 2019.03.06 | 5 | Bird eggs. | A few years ago, a team of scientists examined the shapes of 49,000 bird eggs belonging to 1,400 different species. You can download their calculations of each species’ average egg length, asymmetry, and ellipticity, which formed the basis of a graphics-forward article in Science Magazine. [h/t Sophie Warnes] | http://science.sciencemag.org/content/356/6344/1249 http://science.sciencemag.org/content/suppl/2017/06/21/356.6344.1249.DC1 https://vis.sciencemag.org/eggs/ | https://www.getrevue.co/profile/FairWarning/issues/fair-warning-dialects-egg-shapes-and-the-race-to-2020-160100 |
697 | 2019.03.13 | 1 | Employment discrimination cases. | “Thousands of people report workplace discrimination to the government each year. Employers are rarely held accountable,” according to an investigation by the Center for Public Integrity. Reporters Maryam Jameel and Joe Yerardi “analyzed eight years of complaint data — through fiscal 2017 — from the [U.S. Equal Employment Opportunity Commission] as well as its state and local counterparts, reviewed hundreds of court cases and interviewed dozens of people who filed complaints.” The data (on more than 3.7 million allegations and their outcomes) and code are available online. Related: A visual exploration of the data. Previously: Two decades of workplace sexual harassment complaints (DIP 2017.12.06). [h/t Reddit user "cavedave" + Giuseppe Sollazzo] | https://publicintegrity.org/workers-rights/workplace-inequities/injustice-at-work/workplace-discrimination-cases/ https://twitter.com/mrym_jml https://twitter.com/joeyerardi https://github.com/PublicI/employment-discrimination https://www.washingtonpost.com/graphics/2019/business/discrimination-complaint-outcomes/ https://github.com/BuzzFeedNews/2017-12-eeoc-harassment-charges/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-12-06-edition | https://www.reddit.com/r/datasets/comments/avsf9z/workplace_discrimination_is_illegal_here_is_the/ https://mailchi.mp/e5d976d0dfe8/preview-222-in-other-news-3696061 |
698 | 2019.03.13 | 2 | U.S. wildfire costs. | Stanford University’s Big Local News project has compiled data from 100,000+ daily situation reports (known as “SIT-209”s) filed by federal firefighting authorities, detailing their efforts to suppress large wildfires. The dataset covers 2014 to 2017, and includes 240+ variables from each report, including estimated costs, damaged/destroyed buildings, injuries, fatalities, and more. Related: Eric Sagara’s quick introduction to the dataset. | https://twitter.com/BigLocalNews https://searchworks.stanford.edu/view/xj043rd8767 https://fam.nwcg.gov/fam-web/ https://twitter.com/esagara https://drive.google.com/file/d/1BMUcXKaLUI4kSqj0cDNO7Lhih0B_x9XU/view | |
699 | 2019.03.13 | 3 | Democratic endorsements. | FiveThirtyEight is tracking who’s endorsing whom to be the Democrats’ 2020 presidential nominee. The site has published a methodology describing its approach, plus the underlying data, which includes each endorser’s name, state, relevant position, and other details. (According to the site’s formula, Sens. Cory Booker and Kamala Harris are currently leading, although almost entirely based on home-state endorsements.) | https://projects.fivethirtyeight.com/2020-endorsements/democratic-primary/ https://fivethirtyeight.com/methodology/how-our-presidential-endorsement-tracker-works/ https://github.com/fivethirtyeight/data/tree/master/endorsements | |
700 | 2019.03.13 | 4 | 50,000 therapists. | The magazine Psychology Today hosts paid listings for therapists, who advertise their services to prospective patients. Andrew Thompson has created a dataset of the 50,000+ U.S. listings (as of October 2018), with each therapist’s name, city, specialties, and subject areas. | https://www.psychologytoday.com/us/therapists http://andrewsthompson.co/ https://components.one/datasets/therapists-by-metropolitan-regions/ | |
701 | 2019.03.13 | 5 | Zoo animal lifespans. | Researchers based at Chicago’s Lincoln Park Zoo have published “life expectancy estimates for hundreds of vertebrate species based on carefully vetted studbook data from North American zoos and aquariums.” Their dataset includes “sex-specific median life expectancies as well as sample size and 95% confidence limits for each estimate.” | https://www.nature.com/articles/sdata201919 https://figshare.com/articles/AZA_MLE_Jul2018_csv/7539968 | |
702 | 2019.03.20 | 1 | Africapolis. | “Produced by the OECD Sahel and West Africa Club, Africapolis.org is the only comprehensive and standardised geospatial database on cities and urbanisation dynamics in Africa. Combining demographic sources, satellite and aerial imagery and other cartographic sources, it is designed to enable comparative and long-term analyses of urban dynamics - covering 7,500 agglomerations in 50 countries.” You can download the data — which includes historical populations, urbanization metrics, and geospatial outlines — and also explore it online. [h/t Rafael Prieto Curiel] | http://www.oecd.org/swac/ http://www.africapolis.org/ http://africapolis.org/data http://africapolis.org/explore | https://twitter.com/rafaelprietoc/status/1105526120647135233 |
703 | 2019.03.20 | 2 | The Book of the States. | The Council of State Governments’ annual Book of the States compiles 50-state reference tables on a range of topics, including elections, finances, courts, and more. It has been published since 1935, and the tables for the past decade-plus are available as spreadsheets. Now you know: The chief justice of the California Supreme Court makes $256,059 per year — the highest compensation for any state judge, and nearly double that of New Mexico’s top judge, according to 2018’s Table 5.4. [h/t Cezary Podkul] | https://www.csg.org/ http://knowledgecenter.csg.org/kc/category/content-type/content-type/book-states http://knowledgecenter.csg.org/kc/content/book-states-2018-chapter-6-elections http://knowledgecenter.csg.org/kc/content/book-states-2018-chapter-7-state-finance http://knowledgecenter.csg.org/kc/content/book-states-2018-chapter-5-state-judicial-branch http://knowledgecenter.csg.org/kc/category/content-type/bos-archive http://knowledgecenter.csg.org/kc/content/book-states-2018-chapter-5-state-judicial-branch | https://twitter.com/Cezary/status/1104106392300875776 |
704 | 2019.03.20 | 3 | Metro-area segregation. | “[W]hy are so many cities and metropolitan areas still split along racial lines? And what is the role of local government in reinforcing those divides? To answer those questions, Governing conducted a six-month investigation of black-white segregation in the small cities of downstate Illinois.” As part of the investigation, the magazine calculated (and published) school and residential segregation metrics for hundreds of U.S. metropolitan areas, based on the latest Department of Education and Census Bureau data. Related: “The Most Diverse Cities Are Often The Most Segregated” (FiveThirtyEight, 2015). [h/t Mike Maciag] | https://www.governing.com/topics/public-justice-safety/gov-segregation-series.html https://www.governing.com/gov-data/school-segregation-dissimilarity-index-for-metro-areas.html https://www.governing.com/gov-data/residential-racial-segregation-metro-areas.html https://www.governing.com/gov-data/segregation-report-methodology.html https://fivethirtyeight.com/features/the-most-diverse-cities-are-often-the-most-segregated/ | https://twitter.com/mikemaciag |
705 | 2019.03.20 | 4 | Internet scans. | Security firm Rapid7’s Project Sonar “conducts internet-wide surveys across more than 70 different services and protocols to gain insights into global exposure to common vulnerabilities.” Much of the data (on DNS responses, SSL certificates, and more) can be bulk-downloaded through the company’s open data portal without an account, and historical data and the most-current data are available with a free account. Related: Project Sonar: An Underrated Source of Internet-wide Data (Patrik Hudak). Also: Rapid7’s guide to using their open data API with R. [h/t Sharon Machlis] | https://www.rapid7.com/research/project-sonar/ https://opendata.rapid7.com/sonar.fdns_v2/ https://opendata.rapid7.com/sonar.ssl/ https://opendata.rapid7.com/ https://0xpatrik.com/project-sonar-guide/ https://blog.rapid7.com/2019/02/13/level-up-your-internet-intelligence-using-the-rapid7-open-data-api-and-r/ | http://www.machlis.com/ |
706 | 2019.03.20 | 5 | Rooftop water tanks. | New York City requires the owners of buildings with rooftop water tanks to get the vessels inspected annually for things like sediment, bacteria, and dead bugs. The city publishes a dataset of the owner-reported results, based on 15,000 inspections, mostly from 2015–17. Unfortunately: “A review of city records indicates that most building owners still do not inspect and clean their tanks” ... and the “city can’t even say with certainty how many there are or where they are located” ... and in “almost every case the [bacteriological] tests are conducted only after the tanks have been disinfected.” [h/t Zack Quaintance] | https://data.cityofnewyork.us/Health/Rooftop-Drinking-Water-Tank-Inspection-Results/gjm4-k24g https://www.cityandstateny.com/articles/policy/energy-environment/new-york-city-water-tank-hazards.html | http://www.govtech.com/civic/Whats-New-in-Civic-Tech-New-York-City-Releases-Its-Annual-Data-Report.html |
707 | 2019.03.27 | 1 | Special investigations and charges. | FiveThirtyEight has compiled a dataset of all U.S. special counsel, independent counsel, and special prosecutor investigations since 1973 — and the people charged in them. Related: FiveThirtyEight’s visual comparison of the Mueller probe to other investigations. Bonus: FiveThirtyEight’s Amelia Thomson-DeVeaux has also been tracking major lawsuits related to President Trump and his administration; that dataset currently contains 45 civil cases and 6 criminal cases. | https://github.com/fivethirtyeight/data/tree/master/russia-investigation https://projects.fivethirtyeight.com/russia-investigation/ http://ameliatd.com/about https://fivethirtyeight.com/features/what-trumps-legal-battles-tell-us-about-presidential-power/ https://github.com/fivethirtyeight/data/tree/master/trump-lawsuits | |
708 | 2019.03.27 | 2 | Spring firsts. | Phenology (literally: “the science of appearance”) is the location-and-species-specific study of recurring plant and animal phenomena, such as the annual arrivals and departures of migratory birds. The USA National Phenology Network collects observational data from thousands of citizen scientists, professional researchers, NGOs, and other groups; assesses the data’s quality; and makes it available to explore and download. Previously: The flowering dates of Kyoto’s Prunus jamasakura cherry trees going back to the 9th century (DIP 2017.04.05). [h/t Greta Kaul] | https://www.usanpn.org/home https://www.usanpn.org/data/quality https://www.usanpn.org/data https://www.usanpn.org/data/observational http://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-04-05-edition | https://twitter.com/gretakaul/status/1103324884363628544 |
709 | 2019.03.27 | 3 | Rebel groups. | “The Foundations of Rebel Group Emergence (FORGE) Dataset examines the roots of rebellion by considering the characteristics and activities of the ‘parent’ organizations from which rebel groups emerged,” plus details such as “the organization's ‘birthdate’ and founding location, initial goals, ideology, and ethnic/religious foundations.” The new dataset, developed by the University of Arizona’s Jessica Maves Braithwaite and the University of Maryland’s Kathleen Gallagher Cunningham, contains 430 rebel groups active between 1946 and 2011. [h/t Jori Breslawski + Michael Poznansky] | https://www.jessicamaves.com/forge.html https://www.jessicamaves.com/ http://www.kathleengallaghercunningham.com/ | https://twitter.com/BreslawskiJori/status/1108700123763224576 https://twitter.com/m_poznansky/status/1108384572033642498 |
710 | 2019.03.27 | 4 | Antarctic infrastructure. | University of Tasmania Ph.D. candidate Shaun T. Brooks has created a geospatial dataset of “all buildings and disturbance detected across Antarctica, manually digitised from Google Earth images.” The dataset includes research stations, lighthouses, weather stations, historic sites, and more. [h/t Jasmine Lee] | https://twitter.com/shauntbrooks https://data.aad.gov.au/metadata/records/AAS_5134_Antarctic_Disturbance_Footprint | https://twitter.com/JaszzyJas/status/1102692597384937472 |
711 | 2019.03.27 | 5 | Uber for X. | From Alexis C. Madrigal, writing at The Atlantic: “Now, a decade since Uber blazed the trail, and half that since the craze faded, we built a spreadsheet of 105 Uber-for-X companies founded in the United States, representing $7.4 billion in venture-capital investment. We culled from lists, dug in Crunchbase, and pulled from old news coverage. It’s not a comprehensive list, but it is a large sample of the hopes and dreams of the entrepreneurs of the time.” | https://www.theatlantic.com/technology/archive/2019/03/what-happened-uber-x-companies/584236/ https://docs.google.com/spreadsheets/d/1qPcpQ9rk08JhEApPSr2jSfJtSWa8RH0ANPibtWuRnh0/edit?usp=sharing https://jungleworks.com/11-uber-for-x-startups-that-failed-are-you-making-the-same-mistakes/ https://www.quora.com/Uber-for-X-What-startups-are-working-on-Uber-for-X https://www.producthunt.com/e/uber-for-x https://news.crunchbase.com/news/upcounsel-raises-12m-series-b-connect-lawyers-businesses/ https://www.wired.com/2015/10/why-homejoy-failed/ | |
712 | 2019.04.03 | 1 | Yemen air strikes. | To mark the four-year anniversary of the Saudi-led bombing campaign in Yemen, the Yemen Data Project last week released civilian casualty estimates for the entire air war. The project’s researchers collect and cross-reference data from a range of sources, including news reports, social media, video footage, local authorities, and NGOs; their published data contains dates, locations, and casualty estimates for more than 19,000 air raids. As seen in: “Saudi Strikes, American Bombs, Yemeni Suffering: How Saudi Arabia’s war tactics have fueled Yemen’s humanitarian crisis” (New York Times, December 2018). [h/t Andrea Carboni] | http://yemendataproject.org/ https://twitter.com/YemenData/status/1110285476244520960 http://yemendataproject.org/methodology-1.html http://yemendataproject.org/data.html https://www.nytimes.com/interactive/2018/12/27/world/middleeast/saudi-arabia-war-tactics-yemen-humanitarian-crisis.html | https://twitter.com/a_carboni/status/1110296341652144128 |
713 | 2019.04.03 | 2 | Teacher supply. | The UNESCO Institute for Statistics collects country-level data on the number of teachers, teacher-to-student ratios, and related figures. You can download the data or explore it in UNESCO’s eAtlas of Teachers or their interactive visualization of teacher supply in Asia. | http://data.uis.unesco.org/index.aspx?queryid=180 https://tellmaps.com/uis/teachers/#!/tellmap/873758989 http://uis.unesco.org/misc/uis/teachers.html | |
714 | 2019.04.03 | 3 | Moralizing gods. | To test the “moralizing gods” hypothesis (which posits that “belief in morally concerned supernatural agents culturally evolved to facilitate cooperation among strangers in large-scale societies”), the authors of a recent paper in Nature “coded records from 414 societies that span the past 10,000 years from 30 regions around the world, using 51 measures of social complexity and 4 measures of supernatural enforcement of morality.” The dataset is available to download. Findings: “Our analyses not only confirm the association between moralizing gods and social complexity, but also reveal that moralizing gods follow — rather than precede — large increases in social complexity.” [h/t Juan Moreno-Cruz + Peter Irvine] | http://seshatdatabank.info/nature-paper-on-moralizing-gods/ https://www.nature.com/articles/s41586-019-1043-4 http://seshatdatabank.info/datasets/ | https://twitter.com/jmorenocruz/status/1111720070068080640 https://twitter.com/peteirvine/status/1111362490229764097 |
715 | 2019.04.03 | 4 | Mid-Atlantic shorelines. | The Virginia Institute of Marine Science at The College of William & Mary maintains shoreline inventories for Virginia, Maryland, and parts of Delaware and North Carolina. The datasets include geospatial information about land use, vegetation, different types of structures (e.g., jetties, bulkheads, docks, boathouses), and more. [h/t Susie Cambria] | https://www.vims.edu/ https://www.vims.edu/ccrm/research/inventory/index.php | https://twitter.com/susiecambria |
716 | 2019.04.03 | 5 | The Index Thomisticus. | “In 1949, an Italian Jesuit priest named Roberto Busa presented a pitch to Thomas J. Watson, of I.B.M.,” according to a New Yorker article principally about the Enron email archive. “Busa was trained in philosophy, and had just published his thesis on St. Thomas Aquinas, the Catholic theologian with a famously unmanageable œuvre.” Watson agreed to help, “and, for the next thirty years, Busa encoded sixty-five thousand pages of Thomist text so that it could be word-searched, cross-referenced, and what we now call hyperlinked.” The Index Thomisticus became “the first corpus to be primed for digital scholarship,” and is available online to search and download. | https://www.newyorker.com/magazine/2017/07/24/what-the-enron-e-mails-say-about-us https://tinyletter.com/data-is-plural/letters/data-is-plural-2017-07-26-edition http://www.corpusthomisticum.org/it/index.age https://itreebank.marginalia.it/view/download.php | |
717 | 2019.04.10 | 1 | Political conditions. | “The Rulers, Elections, and Irregular Governance (REIGN) dataset describes political conditions in every country each and every month. These conditions include the tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups, coup attempts and other violent conflicts.” The latest dataset covers 200 countries, from 1950 to the present, and includes dozens of variables for each monthly snapshot. [h/t Erik Gahner] | https://oefresearch.org/datasets/reign https://oefdatascience.github.io/REIGN.github.io/ | https://github.com/erikgahner/PolData |
718 | 2019.04.10 | 2 | Protests in autocracies. | Political science professor Nils B. Weidmann and collaborators have taken tens of thousands of reports — published by the AP, AFP, and BBC Monitoring — of political protests in autocratic countries and have turned them into structured data. The resulting Mass Mobilization in Autocracies Database is available to download (free registration required), and comes with documentation and code examples. The database currently covers 2003–15, with data for 2016–17 in the works. | https://twitter.com/nils_weidmann https://mmadatabase.org/about/ https://mmadatabase.org/ https://mmadatabase.org/get/ https://mmadatabase.org/about/documentation/ https://mmadatabase.org/use/code-examples/ | |
719 | 2019.04.10 | 3 | FiveThirtyEight checks its work. | From Nate Silver: “we’ve been publishing forecasts for more than a decade now, and although we’ve sometimes tried to do an after-action report following a big election or sporting event, this is the first time we’ve studied all of our forecast models in a comprehensive way.” You can now explore and download thousands of FiveThirtyEight’s predictions about sports and politics (and their outcomes). [h/t Gavin Freeguard] | https://fivethirtyeight.com/features/when-we-say-70-percent-it-really-means-70-percent/ https://projects.fivethirtyeight.com/checking-our-work/ https://github.com/fivethirtyeight/checking-our-work-data | https://mailchi.mp/1f25aba9f45f/warning-graphic-content-5-april-2019 |
720 | 2019.04.10 | 4 | Public pension plans. | Boston College’s Center for Retirement Research compiles detailed financial data on state and local public pension plans. The database covers fiscal years 2001–18 and includes 180 public pension plans, which together “account for 95 percent of state/local pension assets and members in the US.” [h/t Cezary Podkul] | https://crr.bc.edu/ https://publicplansdata.org/public-plans-database/ | https://twitter.com/Cezary/status/1104106397015265280 |
721 | 2019.04.10 | 5 | Bird-building collisions. | To study the relationship between artificial light and “flight calling” among nocturnally-migrating species, a team of researchers examined 70,000 instances of birds colliding with buildings in Chicago. [h/t Ben Winger] | https://royalsocietypublishing.org/doi/10.1098/rspb.2019.0364 https://datadryad.org/resource/doi:10.5061/dryad.8rr0498 | https://twitter.com/winger_ben/status/1113393643883245568 |
722 | 2019.04.17 | 1 | Medical device safety. | The International Consortium of Investigative Journalists, along with media partners in dozens of countries, has been compiling a cross-border database of medical-device safety alerts. The alerts include recalls as well as less-urgent notifications published by health authorities and manufacturers. You can download the public database, which so far includes 90,000+ notices for devices in 18 countries. The records include the date and type of notice; a device identifier; the reason for the alert; a classification of its severity; and more. Related: The Implant Files, an investigative series by the consortium, based on the data. | https://www.icij.org/investigations/implant-files/about-the-implant-files-investigation/ https://medicaldevices.icij.org/ https://medicaldevices.icij.org/p/download https://www.icij.org/investigations/implant-files/ | |
723 | 2019.04.17 | 2 | SCOTUS confirmation transcripts. | The R Street Institute has converted the last five decades of successful Supreme Court confirmation hearings into a spreadsheet, with one row for each statement, question, and answer. The 15 transcripts begin with William Rehnquist’s 1971 hearing and end with Neil Gorsuch’s in 2017. (Robert Bork’s failed nomination is excluded, and Brett Kavanaugh’s 2018 transcript is not yet available.) [h/t Zachary Agatstein + Alex Spurrier] | https://www.rstreet.org/about-r-street/ https://www.rstreet.org/2019/04/04/supreme-court-confirmation-hearing-transcripts-as-data/ | https://twitter.com/callonzach/status/1113880923609673728 https://twitter.com/alspur/status/1114155897146814466 |
724 | 2019.04.17 | 3 | Scientific publishing, linked. | The Microsoft Academic Knowledge Graph, published under an Open Data Attributions license, describes 8+ billion relationships between scientific papers, their authors, affiliated institutions, conferences, journals, fields of study, and more. The data can be downloaded and also queried online through a SPARQL interface. [h/t Michael Färber] | http://ma-graph.org/ http://ma-graph.org/schema-linked-dataset-descriptions/ http://ma-graph.org/rdf-dumps/ http://ma-graph.org/sparql-endpoint/ | https://github.com/awesomedata/apd-core/commit/43e18bcf425c3d5c837957c959b3ef5cb04688f8 |
725 | 2019.04.17 | 4 | 18th-century coroner inquests. | The London Lives initiative “makes available, in a fully digitised and searchable form, a wide range of primary sources about eighteenth-century London, with a particular focus on plebeian Londoners.” As part of the project, digital historian Sharon Howard has compiled a dataset of 2,894 Westminster coroners’ inquests from 1760 to 1799. The fields include the date of death, the name of the deceased, the cause of death, the coroner’s verdict, and more. Bonus: A recent Twitter thread from Howard highlighting more datasets. | https://www.londonlives.org/static/Project.jsp http://sharonhoward.org/ https://github.com/sharonhoward/londonlives/tree/master/coroners_inquests https://twitter.com/sharon_howard/status/1117430102088921088 | |
726 | 2019.04.17 | 5 | Double rainbows. | The question: How many bags of Skittles must you open before finding two identical color-distributions? The answer: “82 days, 13 boxes, 468 packs, and 27,740 individual Skittles later [...]”. The data: available on GitHub. [h/t u/cavedave] | https://possiblywrong.wordpress.com/2019/01/09/identical-packs-of-skittles/ https://possiblywrong.wordpress.com/2019/04/06/follow-up-i-found-two-identical-packs-of-skittles-among-468-packs-with-a-total-of-27740-skittles/ https://github.com/possibly-wrong/skittles | https://www.reddit.com/r/datasets/comments/bau3fy/dataset_of_skittles_pack_color_counts_with_a_pair/ |
727 | 2019.04.24 | 1 | Democracy. | Varieties of Democracy bills itself as “a new approach to conceptualizing and measuring democracy” — one that “reflects the complexity of the concept of democracy as a system of rule that goes beyond the simple presence of elections.” The project scores countries annually on five high-level aspects of democracy, which are further broken down (by thousands of country-experts, based on a detailed codebook) into hundreds of more granular “indicators,” such as how often the government publicly attacks the judiciary, the extent to which authorities respect religious freedom, and the proportion of journalists who are women. Version 9 of the dataset, released earlier this month, covers 1789 to 2018 and includes 202 countries. [h/t John Polga-Hecimovich] | https://www.v-dem.net/en/ https://www.v-dem.net/en/reference/version-9-apr-2019/ https://www.v-dem.net/en/data/data-version-9/ | https://twitter.com/jpolga/status/1115260559665049600 |
728 | 2019.04.24 | 2 | World leaders. | The Archigos dataset provides historical data on the leaders of nearly 200 countries between 1875 and 2015. The dataset — a collaboration between political scientists Hein Goemans, Kristian Skrede Gleditsch, and Giacomo Chiozza — includes basic demographic information, plus categorizations of how each leader came to power, how they lost it, and their post-office fate. Now you know: No UK prime minister has died in office since 1865; José María Velasco Ibarra became president of Ecuador five separate times, and was removed by coup four times; Tunisian president Beji Caid Essebsi is 92 years old. [h/t Jeffrey Sachs] | http://www.ksgleditsch.com/archigos.html http://www.rochester.edu/college/faculty/hgoemans/ http://ksgleditsch.com/ http://www.chiozza.org/ https://en.wikipedia.org/wiki/Records_of_Prime_Ministers_of_the_United_Kingdom#Died_in_office https://en.wikipedia.org/wiki/Jos%C3%A9_Mar%C3%ADa_Velasco_Ibarra https://en.wikipedia.org/wiki/Beji_Caid_Essebsi | https://twitter.com/JeffreyASachs/status/1117484417776148480 |
729 | 2019.04.24 | 3 | Ride-hailing. | Chicago has become the first city to publish detailed data from ride-hailing services, such as Uber and Lyft. Last week, officials released three datasets — on (anonymized) drivers, vehicles, and trips. The driver and vehicle datasets cover early 2015 through December 2018. The trip dataset covers only November and December 2018; even so, it includes more than 17 million rides. For each ride, the records contain the rough pickup and dropoff location, duration, the approximate fare and tip, and more. [h/t Sharon Machlis + Dan Nguyen + Karl Sluis + Michael A. Rice] | https://chicago.curbed.com/2019/4/15/18311340/uber-lyft-chicago-data-fares-drivers https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Drivers/j6wf-834c https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Vehicles/bc6b-sq4u https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p | http://www.machlis.com/ https://twitter.com/dancow/status/1118312979756453889 https://twitter.com/karlsluis |
730 | 2019.04.24 | 4 | Software development time estimates. | Derek M. Jones analyzes software-engineering data. Recently, he convinced a small software company to release a dataset documenting its internal time estimates, spanning 10 years, 20 projects, and 10,000+ tasks. For each task, the dataset indicates the number of hours it was predicted to take, how long it actually took, the (anonymized) developers it was assigned to, and more. [h/t Erik Bern] | https://github.com/Derek-Jones/ESEUR-code-data https://github.com/Derek-Jones/SiP_dataset | https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html |
731 | 2019.04.24 | 5 | Hunger Games survival. | “In a Cox proportional hazards model, which covariates are associated with the odds (or hazard ratios) being ever in your favor?” To find out, Brett Keller created a spreadsheet of all 24 tributes in the 74th Hunger Games, including the districts from which they hailed, their ages, and how many days they survived. | http://www.bdkeller.com/writing/hunger-games-survival-analysis https://docs.google.com/spreadsheets/d/1qXSvoXJxKeX2mjjCVFloM1ZwrefiTtZNgrssgfbWoTI/edit#gid=0 | |
732 | 2019.05.01 | 1 | State-owned oil companies. | The browseable and downloadable National Oil Company Database, a project of the Natural Resource Governance Institute, pulls together official data on nearly 100 metrics concerning 71 oil/gas companies owned by 61 countries. For instance: Petróleos de Venezuela, S.A., reported transferring roughly $5.5 billion to its government in 2016, down from nearly $28 billion in 2013; Saudi Aramco produces the equivalent of 13 million barrels of oil daily; and in 2017, Russia’s Rosneft generated approximately $283,000 in revenue per employee. [h/t Rachel Ziemba] | https://www.nationaloilcompanydata.org/ https://resourcegovernance.org/ | https://twitter.com/reziemba/status/1121392285634179072 |
733 | 2019.05.01 | 2 | Decertified police officers. | USA Today has collaborated with more than 100 of its affiliated newsrooms and the Invisible Institute to gather police disciplinary records “from thousands of state agencies, prosecutors and local police departments” around the country, creating “the biggest collection of police misconduct records” ever assembled. They’re starting to make the records public, beginning with a database of 30,000+ officers who’ve had their certifications revoked. The database lists each officer’s name, state, agency, and year decertified. It includes records from 44 states, but you won’t find Massachusetts in it, for instance, because the state doesn’t license police officers. And although there are a handful of records from New York state, none regard NYPD officers; that’s in part because the country’s largest police force keeps its misconduct cases secret. (Last year, colleagues at BuzzFeed News published a database of 1,800 NYPD officers accused of misconduct, based on some of those secret records, obtained from a source who requested anonymity.) | https://www.usatoday.com/in-depth/news/investigations/2019/04/24/usa-today-revealing-misconduct-records-police-cops/3223984002/ https://www.usatoday.com/in-depth/news/investigations/2019/04/24/biggest-collection-police-accountability-records-ever-assembled/2299127002/ https://twitter.com/TWallack/status/1121375082474029056 https://www.buzzfeednews.com/article/kendalltaggart/secret-nypd-files-hundreds-of-officers-committed-serious https://www.buzzfeednews.com/article/kendalltaggart/nypd-police-misconduct-database | |
734 | 2019.05.01 | 3 | Nobel laureates’ papers. | A team of researchers has compiled the publication histories of 545 Nobel laureates — 92% of the prize-winners in physics, chemistry, and physiology-or-medicine between 1900 and 2016. The researchers say they spent more than 1,000 hours collecting and validating the data, drawing on the Nobel website, laureates’ personal pages, Wikipedia entries, and the Microsoft Academic Graph (featured in DIP earlier this month). | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6NJ5RN https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fprojects%2Fmag%2F https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-04-17-edition | |
735 | 2019.05.01 | 4 | Piano performances. | The MAESTRO dataset gathers recordings from nine years of the International Piano-e-Competition, where “virtuoso pianists perform on Yamaha Disklaviers which, in addition to being concert-quality acoustic grand pianos, utilize an integrated high-precision MIDI capture and playback system.” The MIDI data “includes key strike velocities and sustain pedal positions”; additional metadata contains each performance’s year, composer, and title. Related: OpenAI’s music-composing MuseNet neural network, trained in part on the MAESTRO data. | https://magenta.tensorflow.org/datasets/maestro http://piano-e-competition.com/ https://openai.com/blog/musenet/ | |
736 | 2019.05.01 | 5 | Fortnite. | Through an unofficial API, you can access data on the latest items, weapons, challenges, and other aspects of the global video game phenomenon. | https://fortniteapi.com https://www.polygon.com/fortnite-battle-royale/2018/3/30/17177068/why-is-fortnite-popular | |
737 | 2019.05.08 | 1 | Food, globally. | The United Nations’ FAOSTAT provides dozens of country-by-country datasets on agriculture. The datasets include crop and livestock production, imports and exports, fertilizer usage, emissions, and more. Many go back to 1961. (In that year, Afghanistan harvested about 32,000 metric tons of apricots.) Related: Researchers have previously used this data to trace the “increasing homogeneity in global food supplies” over time. Also related: National Geographic’s visualization of that research. [h/t David Svab] | http://www.fao.org/faostat/en/#home http://www.fao.org/faostat/en/#data https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HYOWIC https://www.pnas.org/content/111/11/4001 https://www.nationalgeographic.com/foodfeatures/diet-similarity/ | https://twitter.com/DavidSvab/status/1122977607677427712 |
738 | 2019.05.08 | 2 | Canadian candidates. | University of Montreal PhD candidate Semra Sevi has compiled data on all Canadian federal candidates from 1867 to 2017. The dataset lists each candidate’s gender, occupation, incumbency status, party affiliations, birth year, and electoral results. The tens of thousands of candidates have represented roughly 140 parties. Among them: Canada’s Work Less Party, which has fielded one lone federal candidate, who in 2008 received 1% of Vancouver East’s votes. [h/t Éric Grenier + Peter Loewen] | https://semrasevi.com/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ABFNSQ https://en.wikipedia.org/wiki/Work_Less_Party | https://twitter.com/EricGrenierCBC/status/1122898072261005312 https://twitter.com/PeejLoewen/status/1122894599238774784 |
739 | 2019.05.08 | 3 | Visual questions from blind people. | A decade ago, researchers built VizWiz, a smartphone app that allowed blind users to take photos and ask questions about them. For instance: “What color is this?” or “When is the expiration date?” Now 20,000 VizWiz images and questions, plus 200,000 answers, are available to download — part of a contest to develop algorithms for visual question-answering. Related: Be My Eyes, an app that lets you volunteer your visual assistance through a video call. | http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.222.9697 http://vizwiz.org/data/ https://www.bemyeyes.com/ | |
740 | 2019.05.08 | 4 | Windy City murals. | Last month, Chicago officials launched a public mural registry. So far, the database includes more than 140 pieces, credited to more than 100 artists. About half of the entries specify the mural’s medium (e.g., paint, spray, mosaic) and nearly all indicate the mural’s location and installation year. | https://www.chicago.gov/city/en/depts/dca/supp_info/mural_registry.html https://data.cityofchicago.org/Historic-Preservation/Mural-Registry/we8h-apcf | |
741 | 2019.05.08 | 5 | Southpaws. | Using data scraped from BoxRec.com and UFCStats.com, Thomas Richardson analyzed “over 13,800 professional boxers and mixed martial artists of varying abilities” and found “robust evidence that left-handed fighters have greater fighting success.” | https://osf.io/x3unr/ http://boxrec.com/ http://ufcstats.com/statistics/events/completed https://twitter.com/Richie_Research/status/1119714989235945472 https://www.biorxiv.org/content/10.1101/555912v3 | |
742 | 2019.05.15 | 1 | U.S. executions. | The Death Penalty Information Center maintains a database of all executions in the United States since 1976. (There have been 1,495 so far.) The database tracks the date, method, county, and state of each execution; the name, age, sex, and race of the person executed; and the race and sex of the victims they were convicted of killing. Related: The Marshall Project’s The Next to Die. Previously: Death sentences (DIP 2018.08.01) and executed prisoners' last words (DIP 2019.03.06). | https://deathpenaltyinfo.org/ https://deathpenaltyinfo.org/views-executions https://www.themarshallproject.org/next-to-die https://endofitsrope.com/using-the-database/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-08-01-edition https://selectstarsql.com/frontmatter.html#dataset https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-03-06-edition | |
743 | 2019.05.15 | 2 | Populism. | Team Populism is an initiative that “brings together renowned scholars from Europe and the Americas to study the causes and consequences” of the titular political style. The collaboration has published several datasets, including one that scores the populist rhetoric of 40 countries’ leaders between 2000 and 2018 — a project commissioned by The Guardian, which has visualized the findings and described the methodology. [h/t Erik Gahner Larsen] | http://populism.byu.edu/ http://populism.byu.edu/Pages/Data https://www.theguardian.com/world/ng-interactive/2019/mar/06/revealed-the-rise-and-rise-of-populist-rhetoric https://www.theguardian.com/world/2019/mar/06/how-we-combed-leaders-speeches-to-gauge-populist-rise | https://github.com/erikgahner/PolData#additional-overviews-of-datasets |
744 | 2019.05.15 | 3 | Books in translation. | Publishers Weekly’s Translation Database tracks books of fiction and poetry that have been translated into English and published in the United States. The database, which contains more than 7,200 entries since 2008, includes the books’ original languages and countries of publication, the authors’ and translators’ names and genders, the publishers’ names, publication years, prices, and ISBNs. Related: “Will Translated Fiction Ever Really Break Through?” a recent Vulture article by Chad Post, who created the database. | https://www.publishersweekly.com/pw/translation/home/index.html https://www.vulture.com/2019/05/translated-fiction-has-been-growing-or-has-it.html https://twitter.com/chadwpost | |
745 | 2019.05.15 | 4 | Social animals. | A team of biologists has compiled and standardized data on 790+ animal social networks, covering more than 45 species on six continents. The Animal Social Network Repository features networks of wild and captive mammals, reptiles, fish, birds, and insects; the connective data-tissue includes dominance relationships, group memberships, grooming behaviors, and several other types of interactions. | https://www.nature.com/articles/s41597-019-0056-z https://github.com/bansallab/asnr https://bansallab.github.io/asnr/ | |
746 | 2019.05.15 | 5 | Speedcubing. | The World Cube Association “governs competitions for mechanical puzzles that are operated by twisting groups of pieces,” the most famous of which is the Rubik’s Cube. The association also publishes a database of all competitions, competitors, results, rankings, and more. Related: “Children of the Cube,” by the New York Times’ John Branch. [h/t Michael Höhle + u/cavedave] | https://www.worldcubeassociation.org/about https://en.wikipedia.org/wiki/Rubik%27s_Cube https://www.worldcubeassociation.org/results/misc/export.html https://www.nytimes.com/2018/08/15/sports/cubing-usa-nationals-max-park.html | http://staff.math.su.se/hoehle/blog/2019/05/06/wcamining.html https://www.reddit.com/r/datasets/comments/blamnh/mining_the_world_rubiks_cubing_association/ |
747 | 2019.05.22 | 1 | Internet speeds. | The Measurement Lab describes itself as “the largest open source Internet measurement effort in the world.” Volunteers run the lab’s tests on their own devices, measuring their internet connection’s speed, latency, and other characteristics. The lab then publishes the data it collects, both as raw output and as BigQuery tables. It also offers a tool for charting internet speeds by location and ISP, based on 240+ million tests generated from 87,000+ cities; you can access the data underlying any chart, and also download the same aggregations directly. [h/t Georgia Bullen] | https://www.measurementlab.net/ https://www.measurementlab.net/faq/ https://www.measurementlab.net/tests/ https://www.measurementlab.net/data/ https://www.measurementlab.net/data/docs/gcs/ https://www.measurementlab.net/data/docs/bq/quickstart/ https://viz.measurementlab.net/ https://viz.measurementlab.net/data | https://georgiabullen.com/ |
748 | 2019.05.22 | 2 | City finances. | The Fiscally Standardized Cities database “makes it possible to compare local government finances for 150 of the largest U.S. cities across more than 120 categories of revenues, expenditures, debt, and assets.” The database, developed by Adam Langley at the Lincoln Institute of Land Policy, covers the years 1977 to 2016 and takes into account the ways in which finances and responsibilities overlap between cities, counties, school districts, and other local governments. [h/t Cezary Podkul] | https://www.lincolninst.edu/research-data/data-toolkits/fiscally-standardized-cities https://www.lincolninst.edu/research-data/data-toolkits/fiscally-standardized-cities/list-150-fiscs https://www.lincolninst.edu/about-lincoln-institute/people/adam-h-langley https://www.lincolninst.edu/ | https://twitter.com/Cezary/status/1104106397690556421 |
749 | 2019.05.22 | 3 | The Supreme Court of Canada’s interveners. | At Canada’s highest court, “interveners” are the rough equivalent of amicus brief filers in U.S. Supreme Court cases. Sancho McCann, a student at the University of British Columbia’s law school, has created a dataset of the past ten years of interveners and has analyzed it. For each of the 665 cases from 2009 to 2018, the dataset includes the case name, the previous court, a couple of case classifications, and the names of the interveners (if any). | https://sanchom.github.io/ https://github.com/sanchom/scc_stats https://docs.google.com/spreadsheets/d/1zNUIDaw4Fd8H_zr-dZsIs8Si_8QlqPY6nIQGR8UzoVY/edit#gid=716506492 https://sanchom.github.io/interveners-2009-2018.html | |
750 | 2019.05.22 | 4 | Primates. | A team of researchers at the Universidad Nacional Autónoma de México have aggregated the observations of 1,216 studies into a database describing 504 primate species. The traits in the database include body mass, habitat, type of diet, conservation status, and more. | https://en.wikipedia.org/wiki/National_Autonomous_University_of_Mexico https://www.nature.com/articles/s41597-019-0059-9 https://zenodo.org/record/2600338 | |
751 | 2019.05.22 | 5 | From Abdul-Aziz to Young-Malcolm. | The Pudding’s Jan Diehm has identified and analyzed decades of hyphenated last names in seven North American sports leagues: the MLB, NBA, NFL, NHL, MLS, WNBA, and NWSL. The code and data are available to download. Now you know: Two ambi-hyphenates — Pierre-Luc Letourneau-Leblond and Jean-Luc Grand-Pierre — have played in the NHL (and none in any of the other leagues). | https://twitter.com/jadiehm https://pudding.cool/2019/05/hyphens/ https://github.com/the-pudding/hyphenated-names | |
752 | 2019.05.29 | 1 | Education data, unified. | “Every year, the federal government releases large amounts of data on US schools, districts, and colleges. But this information is scattered across multiple datasets, and changes in data structure make it hard to measure change.” The Urban Institute’s Education Data Explorer aims to fix that by pulling together the Department of Education’s Common Core of Data, Civil Rights Data Collection, Integrated Postsecondary Education Data System, and College Scorecard, plus the Census Bureau’s Small Area Income and Poverty Estimates. You can download custom queries, access the data via an API, or download bulk files for all elementary and secondary schools, school districts, and colleges. [h/t Daniel Wood] | https://educationdata.urban.org/data-explorer/about/ https://www.urban.org/ https://educationdata.urban.org/data-explorer/ https://nces.ed.gov/ccd/ https://ocrdata.ed.gov/ https://nces.ed.gov/ipeds/ https://collegescorecard.ed.gov/ https://www.census.gov/programs-surveys/saipe.html https://educationdata.urban.org/documentation/ https://educationdata.urban.org/documentation/schools.html https://educationdata.urban.org/documentation/school-districts.html https://educationdata.urban.org/documentation/colleges.html | https://twitter.com/DanielPWWood |
753 | 2019.05.29 | 2 | Las Calles de las Mujeres. | GeoChicas, an initiative to close the gender gap in the OpenStreetMap community, has built an interactive map and dataset that shows which streets in Latin America and Spain are named after women (and the much larger number named after men). So far, they’ve mapped 11 cities in 8 countries, including Barcelona, Havana, Mexico City, and Buenos Aires. | https://geochicas.org/ https://wiki.openstreetmap.org/wiki/GeoChicas https://geochicasosm.github.io/lascallesdelasmujeres/ https://github.com/geochicasosm/lascallesdelasmujeres | |
754 | 2019.05.29 | 3 | Serbian anti-corruption proceedings. | Postupci Protiv Funkcionera “is a unique database made by the Center for Investigative Reporting of Serbia, which gives citizens the opportunity to get information in one place about the processes conducted by the Serbian Anti-Corruption Agency against public officials in the period from 2010 to November 2018.” The database contains information on nearly 2,800 proceedings against more than 1,700 officials, and can be downloaded as an RDS file (and opened in R). Kudos: The project has been shortlisted for the 2019 Data Journalism Awards. (Full shortlist here.) | https://funkcioneri.cins.rs/ https://github.com/CINSerbia/cins_funkcioneri https://github.com/CINSerbia/cins_funkcioneri/tree/master/app/data https://mgimond.github.io/ES218/Week02b.html#reading_from_a_r_data_file https://datajournalismawards.org/projects/database-on-proceedings-against-public-officials/ https://datajournalismawards.org/2019-shortlist/ | |
755 | 2019.05.29 | 4 | Lone Star land use. | The Texas General Land Office’s geospatial data offerings include beach access points, shoreline environmental sensitivity ratings, offshore oil structures, oil and gas leases, and more. Related: “Relinquishing Riches: Auctions vs Informal Negotiations in Texas Oil and Gas Leasing,” an NBER working paper by economists Thomas R. Covert and Richard L. Sweeney; code and data available on GitHub. | http://www.glo.texas.gov/index.html http://www.glo.texas.gov/land/land-management/gis/ https://www.nber.org/papers/w25712 https://home.uchicago.edu/~tcovert/ http://www.richard-sweeney.com/ https://github.com/rlsweeney/public_cs_texas | |
756 | 2019.05.29 | 5 | From !!! to The Zutons. | Duncan Geere’s 00s Indie Band Database quantifies 130+ acts from the early-millennium’s indie music scenes. In addition to basic facts, the database also includes several subjective scales: “Guitars to Synths,” “Artsy to Populist,” “Loudness,” and “Coolness.” | https://www.duncangeere.com/ https://www.duncangeere.com/00sindiebanddatabase/ | |
757 | 2019.06.05 | 1 | Freedom lawsuits in early America. | O Say Can You See, a project partially funded by the National Endowment for the Humanities, “documents the challenge to slavery and the quest for freedom in early Washington, D.C., by collecting, digitizing, making accessible, and analyzing freedom suits filed between 1800 and 1862, as well as tracing the multigenerational family networks they reveal.” The project provides several ways to access the data and documents; it covers more than 500 lawsuits, nearly 5,000 people, and tens of thousands of relationships. You can also explore the cases, people, and families online. [h/t Jan Willem Tulp] | http://earlywashingtondc.org/ http://earlywashingtondc.org/about/data http://earlywashingtondc.org/about http://earlywashingtondc.org/cases http://earlywashingtondc.org/people http://earlywashingtondc.org/families | https://twitter.com/janwillemtulp |
758 | 2019.06.05 | 2 | Global voter turnout. | The International Institute for Democracy and Electoral Assistance’s Voter Turnout Database tracks the number of registered voters, total voter turnout, voting-age population, and associated metrics for elections in more than 200 countries, some going as far back as 1945. Related: The European Parliament’s election results website provides charts and bulk downloads. Also related: “What’s going on with abstention in Europe?,” a recent article by Lorenzo Ferrari and Jacopo Ottaviani. [h/t Gianna Grün + Giuseppe Sollazzo] | https://www.idea.int/ https://www.idea.int/data-tools/data/voter-turnout https://election-results.eu/ https://www.europeandatajournalism.eu/eng/News/Data-news/What-s-going-on-with-abstention-in-Europe https://twitter.com/lorferr https://twitter.com/JacopoOttaviani | https://twitter.com/giannagruen/status/1132965118264913920 https://mailchi.mp/7959f80f0f06/preview-222-in-other-news-3716641 |
759 | 2019.06.05 | 3 | Chicago eviction trends. | The Chicago-focused Lawyers’ Committee for Better Housing has built a database of evictions in the city from 2010 to 2017. It aggregates nearly 300,000 evictions to the ward, community area, and Census tract level, and contains metrics on case types, outcomes, legal representation, and more. There’s a user guide, bulk download, and methodology. Previously: The Eviction Lab, an effort to collect eviction data for the entire country (DIP 2018.04.18). [h/t Maya Dukmasova] | https://www.lcbh.org/ https://eviction.lcbh.org/ https://eviction.lcbh.org/data/user-guide https://eviction.lcbh.org/data/download https://eviction.lcbh.org/data/methodology https://evictionlab.org/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-04-18-edition | https://www.chicagoreader.com/chicago/tenant-attorneys-eviction-court/Content?oid=70321474 |
760 | 2019.06.05 | 4 | Language learning. | In a study published last year (preprint PDF here), three Boston-area professors analyzed data from more than 600,000 people who took an online English grammar quiz. In addition to the participants’ answers, the dataset includes their native languages, the age they began learning English, the countries they’ve lived in, gender, age, and more. Related: Scott Chacon's analysis of the data, and what it might mean for older learners. [h/t George McIntire] | https://www.sciencedirect.com/science/article/pii/S0010027718300994 http://l3atbc-public.s3.amazonaws.com/pub_pdfs/JK_Hartshorne_JB_Tenenbaum_S_Pinker_2018.pdf https://osf.io/pyb8s/ http://web.archive.org/web/20180217125721/http://archive.gameswithwords.org/WhichEnglish/ https://osf.io/pyb8s/wiki/home/ https://medium.com/@chacon/mit-scientists-prove-adults-learn-language-to-fluency-nearly-as-well-as-children-1de888d1d45f | https://twitter.com/GeorgeMcInt |
761 | 2019.06.05 | 5 | Thirsty appliances. | “The BLOND dataset was collected at a typical office building in Germany, with the main occupants being academic institutes and their researchers.” BLOND’s several dozen terabytes of data provide “long-term continuous measurements of voltage and current waveforms” for 74 appliances in the office over several months, including a bunch of computers, a printer, a paper shredder, a space heater, and an electric toothbrush. | https://www.nature.com/articles/sdata201848 https://mediatum.ub.tum.de/1375836 https://www.nature.com/articles/sdata201848/tables/2 | |
762 | 2019.06.12 | 1 | Ebola in the DRC. | The Humanitarian Data Exchange has been tracking cases and deaths in the North Kivu Ebola outbreak. The numbers come from the Democratic Republic of the Congo’s health ministry and distinguish between suspected, probable, and confirmed cases; they are available at both the national level and disaggregated into the ministry’s 25 currently-affected health zones. Related: “Ebola cases pass 2,000 as crisis escalates” (Nature). Also related: The World Health Organization’s weekly situation reports. Previously: Data from the 2014 Ebola outbreak (DIP 2018.05.23). [h/t Sam Phinizy] | https://data.humdata.org/dataset/ebola-cases-and-deaths-drc-north-kivu https://www.nature.com/articles/d41586-019-01735-0 https://www.who.int/ebola/situation-reports/drc-2018/en/ https://github.com/cmrivers/ebola https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-05-23-edition | https://news.ycombinator.com/item?id=20105201 |
763 | 2019.06.12 | 2 | ICE solitary confinement. | The International Consortium of Investigative Journalists and partners have obtained records that detail 8,000+ instances, between 2012 and 2017, in which U.S. Immigration and Customs Enforcement detention centers placed detainees in solitary confinement. For each confinement, the records indicate the detainee’s citizenship, detention facility, dates of confinement, and the stated reasons for it. Note: “ICE said it does not keep records of every solitary confinement placement. Instead it tracks only those cases where detainees were held in isolation for more than 14 days, and where immigrants with a ‘special vulnerability’ were placed in isolation.” [h/t Jason Norwood-Young] | https://www.icij.org/investigations/solitary-voices/about-the-solitary-voices-investigation/ https://www.icij.org/investigations/solitary-voices/thousands-of-immigrants-suffer-in-us-solitary-confinement/ | https://mailchi.mp/37c8b38125c3/naked-data-216-eu-elections-election-rigging-extracting-diamonds-erudite-insights-and-excellent-envisions |
764 | 2019.06.12 | 3 | Economic mobility. | Opportunity Insights, a research and policy institute that uses data analysis to examine economic mobility in the United States, publishes dozens of datasets stemming from its studies, often accompanied by code to replicate the findings. Related: “The radical plan to change how Harvard teaches economics,” a recent profile of Raj Chetty, who co-leads the institute. Bonus: The lecture materials for Chetty’s popular new class, “Using Big Data to Solve Economic and Social Problems.” [h/t Michael A. Rice] | https://opportunityinsights.org/ https://opportunityinsights.org/data/ https://www.vox.com/the-highlight/2019/5/14/18520783/harvard-economics-chetty https://opportunityinsights.org/course/ | |
765 | 2019.06.12 | 4 | Three centuries of taxation. | For 220 countries between the 1750s and 2018, the Tax Introduction Dataset tracks “the year of the first permanent introduction at the national level of government of six major taxes, as well as on the top statutory tax rate for that year.” The six taxes are those on personal income, corporate income, inheritance, and general sales, plus VATs and compulsory social security contributions. [h/t Philipp Heimberger + Laura Seelkopf] | http://tid.seelkopf.eu/ | https://twitter.com/heimbergecon/status/1133663993028128769 https://twitter.com/LauraSeelkopf/status/1132921720237625346 |
766 | 2019.06.12 | 5 | National parks. | The U.S. Department of the Interior publishes data describing the boundaries of all 420 units of the National Park System. In addition to the 61 officially-designated national parks, the boundaries include the country’s national preserves, national seashores, and 30 other types of special places. | https://irma.nps.gov/DataStore/Reference/Profile/2225713 | |
767 | 2019.06.19 | 1 | Drug prices. | The Centers for Medicare and Medicaid Services’ National Average Drug Acquisition Cost dataset indicates how much U.S. pharmacies have to pay, on average, to obtain thousands of prescription and over-the-counter drugs. The dataset contains millions of rows — one for each National Drug Code in the survey, for each week since 2013 — but you can also download smaller, weekly slices. The agency also publishes a dataset of changes in these average costs. Previously: Total and average costs for Medicare Part B and Part D prescriptions (DIP 2016.12.14). [h/t data.world] | https://data.medicaid.gov/Drug-Pricing-and-Payment/NADAC-National-Average-Drug-Acquisition-Cost-/a4y5-998d https://www.medicaid.gov/medicaid/prescription-drugs/pharmacy-pricing/index.html https://data.medicaid.gov/Drug-Pricing-and-Payment/NADAC-Comparison/6gk3-9bxc https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/MedicarePartB.html https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/MedicarePartD.html https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-12-14-edition | https://page.data.world/data-digest-veteran-congresswomen-congressional-social-media-medicaid-drug-costs |
768 | 2019.06.19 | 2 | Discographies. | Discogs, a user-contributed music database and marketplace, publishes “monthly data dumps” listing the millions of artists, labels, and releases in its system. Additional types of data (e.g., user reviews) are available through Discogs’ API. [h/t Jan Willem Tulp] | https://www.discogs.com/ https://support.discogs.com/hc/en-us/articles/360008545114-Overview-Of-How-Discogs-Is-Built https://data.discogs.com/ https://www.discogs.com/about https://www.discogs.com/developers/ | https://twitter.com/JanWillemTulp/status/1138473798313881600 |
769 | 2019.06.19 | 3 | Plant extinctions. | “Most people can name a mammal or bird that has become extinct in recent centuries, but few can name a recently extinct plant.” That’s from a new academic paper that presents “a comprehensive, global analysis of modern extinction in plants.” The paper itself is paywalled, but the dataset — of 571 extinct seed plants, plus other species that have been rediscovered or reclassified — is available to download. Related: World’s largest plant survey reveals alarming extinction rate, a summary of the findings. [h/t Joseph Stirt] | https://www.nature.com/articles/s41559-019-0906-2 https://www.nature.com/articles/s41559-019-0906-2#Sec4 https://www.nature.com/articles/d41586-019-01810-6 | https://news.ycombinator.com/item?id=20150192 |
770 | 2019.06.19 | 4 | European monarchs. | Developer Michael Zemel has built an interactive timeline of 282 European kings, queens, emperors, and other monarchs. For each, the data includes his or her name, religion, period of reign, reason for losing power, wars involved in, relationships, and notable events. Zemel has also published a detailed writeup about his inspiration and process, plus the underlying data and code. [h/t Giuseppe Sollazzo + Sophie Warnes] | https://thebackend.dev/ https://thebackend.dev/monarchs/ https://thebackend.dev/building-monarchs https://github.com/mzemel/monarchs | http://www.puntofisso.net/ https://www.getrevue.co/profile/FairWarning/issues/fair-warning-census-google-results-and-eurovision-182227 |
771 | 2019.06.19 | 5 | An obviously perfect dataset. | MUStARD is a corpus of 690 text and video clips “for research in automated sarcasm discovery.” The dataset’s 690 examples — half involving sarcasm, half not — come from Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. Related: Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper), the researchers’ introduction to the dataset. | https://github.com/soujanyaporia/MUStARD https://www.youtube.com/watch?v=JcOfFeKXcd4 https://arxiv.org/abs/1906.01815 | |
772 | 2019.06.26 | 1 | Space imagery. | You can browse NASA’s Image and Video Library online; you can also access it via NASA’s API. Through that interface, you can search by caption, keyword, location, photographer, year created, and other fields; in return, you get structured data on each media file. The library was launched two years ago, bringing together more than 140,000 images, videos, and audio files that had previously been spread across dozens of separate collections. [h/t Seth Donoughe] | https://images.nasa.gov/ https://api.nasa.gov/api.html#Images https://www.nasa.gov/press-release/nasa-unveils-new-searchable-video-audio-and-imagery-library-for-the-public | https://www.sethdonoughe.com/ |
773 | 2019.06.26 | 2 | Supreme Court v. Congress. | The Judicial Review of Congress dataset, compiled by Princeton politics professor Keith E. Whittington, “catalogs all the cases in which the U.S. Supreme Court has substantively reviewed the constitutionality of a provision or application of a federal law.” The dataset currently covers 1,308 cases, stretching from the high court’s founding through its 2017 term. For each case, it specifies the statute being reviewed, how long the statute had been in effect, the main constitutional issues at hand, the outcome, and more. [h/t Sheldon Gilbert] | https://scholar.princeton.edu/kewhitt/judicial-review-congress-database https://scholar.princeton.edu/kewhitt | https://twitter.com/sheldongilbert/status/1133778083021041666 |
774 | 2019.06.26 | 3 | Art-world salaries. | This is the spreadsheet that “broke the art world’s culture of silence.” In just a few weeks, Michelle Millar Fisher and anonymous colleagues have collected more than 2,600 self-reported salaries from their fellow curators, managers, interns, and other art-world employees. Related: “It took us three minutes to build this spreadsheet,” the organizers have written in The Art Newspaper. “It is not a perfect survey tool, nor was it ever intended to be. While we’ll work with statistics professionals to review and glean meaningful facts [...] Its primary goal is to catalyse us all into action.” [h/t u/cavedave] | https://docs.google.com/spreadsheets/d/14_cn3afoas7NhKvHWaFKqQGkaZS5rvL6DFxzGqXQa6o/edit#gid=0 https://frieze.com/article/how-google-spreadsheet-broke-art-worlds-culture-silence https://twitter.com/michellemfisher https://twitter.com/AMTransparency https://www.theartnewspaper.com/comment/missions-statements-and-paychecks-let-s-put-our-money-where-our-mouths-are | https://www.reddit.com/r/datasets/comments/by95n7/how_a_google_spreadsheet_broke_the_art_worlds/ |
775 | 2019.06.26 | 4 | UK post-graduation earnings. | The United Kingdom’s Department for Education publishes data on its university graduates’ annual earnings 1, 3, 5, and 10 years after graduation, broken down by school attended, subject studied, and demographic characteristics. [h/t Tera Allas] | https://www.gov.uk/government/collections/statistics-higher-education-graduate-employment-and-earnings | https://twitter.com/TeraPauliina |
776 | 2019.06.26 | 5 | The State Of The State Of The States. | FiveThirtyEight has collected the text of all 50 state governors’ 2019 annual addresses, and has analyzed the most common words and phrases used by Republican and Democratic governors. | https://github.com/fivethirtyeight/data/tree/master/state-of-the-state https://fivethirtyeight.com/features/what-americas-governors-are-talking-about/ | |
777 | 2019.07.03 | 1 | Wiretaps. | The Administrative Office of the United States Courts posts its annual “wiretap reports”, which provide details on the wiretaps that state and federal judges have authorized. Last week, the agency published its 2018 report; the supplementary data includes each wiretap’s jurisdiction, authorizing judge, date of authorization, type of intercept, number of communications intercepted, total cost, and more. [h/t Chris Zubak-Skees + Steven Rich] | https://www.uscourts.gov/statistics-reports/analysis-reports/wiretap-reports https://www.uscourts.gov/news/2019/06/28/2018-wiretap-report-orders-and-convictions-fall https://www.uscourts.gov/statistics-reports/wiretap-report-2018 | https://twitter.com/zubakskees https://twitter.com/dataeditor/status/1145720404574638081 |
778 | 2019.07.03 | 2 | Venmo transactions. | Dan Salmon, a grad student who specializes in information security, has published data on more than 7 million Venmo transactions, which he downloaded from the mobile payment platform’s public API. “I am releasing this dataset,” he writes, “in order to bring attention to Venmo users that all of this data is publicly available for anyone to grab without even an API key.” Practical: How to make your Venmo transactions private. Related: Salmon explains more, in Wired. Also: In 2018, Hang Do Thi Duc analyzed 200 million public Venmo transactions to show how revealing they could be. [h/t Álex Barredo] | https://danthesalmon.com/about/ https://github.com/sa7mon/venmo-data https://publicbydefault.fyi/#venmo https://www.wired.com/story/i-scraped-millions-of-venmo-payments-your-data-is-at-risk/ https://22-8miles.com/about/ https://publicbydefault.fyi/ | https://twitter.com/somospostpc/status/1140895108944076800 |
779 | 2019.07.03 | 3 | Territorial disputes. | The Issue Correlates of War project, which started in 1997 with a focus on territorial disputes, gathers “systematic data on contentious issues in world politics.” In addition to its two centuries of territorial claims, the project has also catalogued disputes over rivers, maritime zones, and ethnic groups, and compiled supplementary datasets on colonial history, historical country names, and more. | http://www.paulhensel.org/icow.html http://www.paulhensel.org/icowterr.html http://www.paulhensel.org/icowriver.html http://www.paulhensel.org/icowmar.html http://www.paulhensel.org/icowiden.html http://www.paulhensel.org/icowcol.html http://www.paulhensel.org/icownames.html | |
780 | 2019.07.03 | 4 | Mangroves. | Global Mangrove Watch uses satellite data to track the global extent of those coastal intertidal forests; the project’s seven snapshots span 1996 to 2016. Note: To download the data, you’ll need to provide a few details and agree to certain terms and conditions. [h/t Dan Friess] | https://www.eorc.jaxa.jp/ALOS/en/kyoto/mangrovewatch.htm https://oceanservice.noaa.gov/facts/mangroves.html http://data.unep-wcmc.org/datasets/45 https://www.unep-wcmc.org/policies/general-data-license-excluding-wdpa#data_policy | https://twitter.com/danfriess/status/1139812925319733248 |
781 | 2019.07.03 | 5 | Annotated pizzas. | “In this paper, we aim to teach a machine how to make a pizza,” writes a team of computer scientists from MIT and the Qatar Computing Research Institute. One of the key ingredients: 9,213 photos of pizza, with their lists of toppings annotated by Amazon Mechanical Turk workers. [h/t Kristin Houser + Center for Data Innovation] | http://pizzagan.csail.mit.edu/ http://pizzagan.csail.mit.edu/#Dataset | https://futurism.com/the-byte/mit-pizza-ai https://mailchi.mp/datainnovation/new-in-datawhat-the-evidence-shows-about-the-impact-of-the-gdpr-after-one-year |
782 | 2019.07.10 | 1 | Flood insurance. | Last month, the US Federal Emergency Management Agency released two major datasets from its National Flood Insurance Program: more than 47 million insurance policies and more than 2 million insurance claims. The latter includes details on each claim’s property, flood zone, amount paid, and more. Both datasets have been partially redacted to remove personally-identifiable information. [h/t Anna Weber] | https://www.fema.gov/news-release/2019/06/11/fema-publishes-nfip-claims-and-policy-data https://www.fema.gov/national-flood-insurance-program https://www.fema.gov/media-library/assets/documents/180376 https://www.fema.gov/media-library/assets/documents/180374 | https://twitter.com/aweberNRDC/status/1139240770194612225 |
783 | 2019.07.10 | 2 | Internet censorship tests. | The Open Observatory of Network Interference, run by the Tor Project, “collects and processes network measurements with the aim of detecting network anomalies, such as censorship, surveillance and traffic manipulation.” You can volunteer to run OONI’s tests from your computer or phone; so far, “millions of network measurements have been collected from more than 200 countries since 2012.” You can explore that data online, download it in bulk, and access it via an API. Related: OONI’s blog, which includes reports on some of its findings. [h/t John Emerson] | https://explorer.ooni.io/about/ https://www.torproject.org/ https://ooni.torproject.org/nettest/ https://explorer.ooni.io/world/ https://ooni.torproject.org/post/mining-ooni-data/ https://api.ooni.io/ https://ooni.torproject.org/post/ | https://backspace.com/ |
784 | 2019.07.10 | 3 | North American ecoregions. | In order to develop its maps of North American ecoregions, the US Environmental Protection Agency consulted with other federal agencies and state agencies, plus the governments of Canada and Mexico. Each “ecoregion” is an area with “similarity in the mosaic of biotic, abiotic, terrestrial, and aquatic ecosystem components with humans being considered as part of the biota.” The maps are available both as PDFs and as geospatial data files, at four levels of increasing specificity. [h/t Brandyn Friedly] | https://www.epa.gov/eco-research/ecoregions | https://twitter.com/brandynfriedly/status/1142917979736350721 |
785 | 2019.07.10 | 4 | California parks and wilderness. | With more than 15,000 “super units,” and an even larger number of subdivisions within them, the California Protected Areas Database is “the authoritative GIS database of parks and open space in California.” It’s one of the two main databases that the California Natural Resources Agency publishes regarding protected lands; the other, the California Conservation Easement Database, tracks restricted-use private land. [h/t @cartonaut] | https://www.calands.org/cpad/ https://data.cnra.ca.gov/dataset/california-protected-areas-database-2019a https://data.cnra.ca.gov/organization/protected-areas-gis-data http://resources.ca.gov/ https://data.cnra.ca.gov/dataset/california-conservation-easement-database-2018 https://www.calands.org/cced/ | https://twitter.com/cartonaut/status/1145838466770464768 |
786 | 2019.07.10 | 5 | Ballparks. | James Fee has compiled a dataset of more than 400 baseball stadiums from more than 40 leagues around the world; each stadium’s information includes its name, team(s), league(s), and geographic coordinates. | http://spatiallyadjusted.com/about/ https://github.com/cageyjames/GeoJSON-Ballparks/ | |
787 | 2019.07.17 | 1 | Four decades of wildlife trade. | The CITES Trade Database, named after the Convention on International Trade in Endangered Species of Wild Fauna and Flora, contains information about more than 20 million shipments of wildlife (e.g., live tapirs, sturgeon eggs, wolf skulls) and wildlife products (e.g., venus flytrap extract) since 1975. The database is maintained by a UN agency and includes the year of the shipment; the scientific name of the plant or animal; the type and quantity of the particular thing being traded; their purpose and source; and the country of origin, export, and import. Related: Citesdb, an R package for analyzing the database. | https://trade.cites.org/ https://www.unep-wcmc.org/ https://ropensci.github.io/citesdb/ | |
788 | 2019.07.17 | 2 | Two decades of UN Security Council debates. | A group of researchers have collected, parsed, and added metadata to all UN Security Council debates from 1995 through 2017. The dataset includes more than 65,000 speeches (with information about each speaker), extracted from nearly 5,000 meeting transcripts. Related: The authors describe their methodology. [h/t Ronny Patz] | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KGVSYH https://arxiv.org/abs/1906.10969 | https://twitter.com/ronpatz/status/1144223706702630913 |
789 | 2019.07.17 | 3 | International arbitration. | The PluriCourts Investment Treaty Arbitration Database (PITAD) provides “a comprehensive, regularly-updated and networked overview of all-known investment arbitration cases.” You can download the 1,400+ cases or explore them online, searching by case, arbitrator, investor, or country. Note: PITAD says its data are “strictly for academic use.” Related: My former colleague Chris Hamby’s “The Court That Rules the World” series — “an exposé of a dispute-settlement process used by multinational corporations to undermine domestic regulations and gut environmental laws at the expense of poorer nations,” as the Pulitzer committee put it. [h/t Joel Dahlquist Cullborg] | https://pitad.org/ https://pitad.org/index#crud/flat_files/list https://twitter.com/ChrisDHamby https://www.buzzfeednews.com/article/chrishamby/super-court https://www.pulitzer.org/finalists/chris-hamby-buzzfeed-news | https://twitter.com/joeldahlquist/status/1090940676362125312 |
790 | 2019.07.17 | 4 | Foreign lobbyists. | The United States’ Foreign Agents Registration Act requires lobbyists who represent foreign governments to file paperwork with the Department of Justice. The database has long been available to browse online; last month, however, the agency added three new features: full-text search, an API, and bulk downloads. [h/t Lachlan Markay + Jack Corrigan + u/surlyq] | https://www.justice.gov/nsd-fara https://efile.fara.gov/ords/f?p=1381:1:12056132996267::::: https://twitter.com/lachlan/status/1143160656792887298 https://efile.fara.gov/ords/f?p=1235:10 https://efile.fara.gov/ords/f?p=107:1:::::: https://efile.fara.gov/ords/f?p=API:BULKDATA | https://twitter.com/lachlan/status/1143160656792887298 https://www.nextgov.com/analytics-data/2019/07/justice-department-launches-api-foreign-lobbyist-data/158138/ https://www.reddit.com/r/datasets/comments/c8bowc/justice_department_launches_api_for_foreign/ |
791 | 2019.07.17 | 5 | Inter- and intra-national boundaries. | The Database of Global Administrative Areas aims “to map the administrative areas of all countries, at all levels of sub-division.” With 386,735 divisions and counting, “this is a never ending project, but we are happy to share what we have.” Note: “commercial use is not allowed without prior permission.” | https://gadm.org/ https://gadm.org/about.html https://gadm.org/data.html | |
792 | 2019.07.24 | 1 | Bodies of water. | The Water Observatory “provides reliable and timely information about surface water levels of water bodies across the globe.” The locations are based on NASA’s Global Reservoir and Dam Database and the World Wildlife Fund’s Global Lakes and Wetlands Database. Concerned about the accuracy of the boundaries in those databases, the researchers instead treated them as a “collection of potentially interesting water bodies” and then “extracted their polygons from the OpenStreetMap.” Of the 40,000 bodies of water they extracted, they’ve published water level data for roughly 7,000 through the project’s interactive dashboard and API. [h/t Emma Vitz] | https://www.blue-dot-observatory.com/aboutwaterobservatory https://sedac.ciesin.columbia.edu/data/set/grand-v1-dams-rev01 https://www.worldwildlife.org/pages/global-lakes-and-wetlands-database https://www.openstreetmap.org/ https://water.blue-dot-observatory.com https://forum.sentinel-hub.com/t/water-observatory-backend-example/859/2 | https://twitter.com/EmmaVitz/status/1132466425157709825 |
793 | 2019.07.24 | 2 | Hydro, streams, and rivers. | As part of Oak Ridge National Laboratory’s efforts to evaluate America’s hydropower resources, researchers there have developed a system (and corresponding dataset) for classifying all 2.6 million streams in the Lower 48 by size, hydrology, gradient, temperature, and “valley confinement.” Elsewhere, other researchers have assessed the “connectivity status of 12 million kilometres of rivers globally” and have identified “those that remain free-flowing in their entire length”; you can download that data and also explore it online. | https://www.ornl.gov/ https://hydrosource.ornl.gov/ https://www.nature.com/articles/sdata201917 https://springernature.figshare.com/collections/A_Stream_Classification_System_for_the_Conterminous_United_States/4233740 https://hydrosource.ornl.gov/environmental-information/us-stream-classification-system https://www.nature.com/articles/s41586-019-1111-9 https://figshare.com/articles/Mapping_the_world_s_free-flowing_rivers_data_set_and_technical_documentation/7688801 http://hydrolab.io/ffr/# | |
794 | 2019.07.24 | 3 | The height of the frozen world. | ICESat-2, launched by NASA in September 2018, “is measuring the height of a changing Earth one laser pulse at a time, 10,000 laser pulses per second”; the satellite “allow[s] scientists to monitor the elevation of ice sheets, glaciers, sea ice, and more—all in unprecedented detail.” Its datasets are available to download. [h/t Michael McLaughlin] | https://nsidc.org/data/icesat-2 https://nsidc.org/data/icesat-2/data-sets | https://www.datainnovation.org/2019/06/tracking-the-height-of-glaciers/ |
795 | 2019.07.24 | 4 | Drought conditions. | The Standardised Precipitation-Evapotranspiration Index is a metric, calculated from climatic data, that “can be used for determining the onset, duration and magnitude of drought conditions with respect to normal conditions.” The project, based at the Spanish National Research Council, provides both a “near real-time” global drought monitor and a historical database. | http://spei.csic.es/index.html http://spei.csic.es/map/maps.html http://spei.csic.es/database.html | |
796 | 2019.07.24 | 5 | Welsh shipping crews. | “The Merchant Shipping Act 1835 required all British registered ships of 80 tons or more employed in the coastal trade or fisheries to carry crew agreements and accounts, often referred to as crew lists.” The lists include crew members’ ages, places of birth, previous vessels, and more. Thanks to the National Library of Wales Volunteering Programme, thousands of crew lists from the Welsh port of Aberystwyth, from 1856 to 1914, have been transcribed. [h/t u/cavedave] | https://www.library.wales/about-nlw/work-with-us/volunteer/ https://www.library.wales/collections/activities/research/nlw-data/aberystwyth-shipping-records-dataset/ | https://www.reddit.com/r/datasets/comments/c07970/aberystwyth_shipping_records/ |
797 | 2019.07.31 | 1 | Foreign military trainings. | For nearly two decades, the US Department of Defense has released detailed tables on the foreign military units it has trained. For each training, the information describes the units trained, number of trainees, course name, start and end dates, location, cost, and more. Unfortunately, the government publishes these records only as PDFs. To make the data more accessible, Security Force Monitor, a project of the Columbia Law School Human Rights Institute, has converted the PDFs into an open, queryable database. An associated GitHub repository contains an extensive methodology, the extraction code, and the raw data. [h/t Jamon Van Den Hoek] | https://2009-2017.state.gov/t/pm/rls/rpt/fmtrpt/index.htm https://www.state.gov/foreign-military-training-and-dod-engagement-activities-of-interest/ https://securityforcemonitor.org/about/ http://www.law.columbia.edu/human-rights-institute https://securityforcemonitor.org/2019/07/18/unlocking-the-department-of-states-foreign-military-training-data-for-good-this-time/ https://trainingdata.securityforcemonitor.org/ https://github.com/security-force-monitor/fmtrpt_data | https://www.conflict-ecology.org/ |
798 | 2019.07.31 | 2 | Talk radio transcripts. | A team of researchers at the MIT Media Lab has built a corpus of machine-generated transcriptions from 284,000 hours of talk radio. The transcripts capture approximately 2.8 billion words from 50 semi-randomly selected stations, and include metadata, such as the program name, the speaker’s (guessed) gender, and whether the speaker seemed to be in the studio or on the phone. [h/t Lynn Cherny] | https://arxiv.org/abs/1907.07073 https://github.com/social-machines/RadioTalk | https://pinboard.in/u:arnicas/t:datasets/ |
799 | 2019.07.31 | 3 | Patent geography. | Researchers at two Swiss universities have created a dataset of inventors’ and applicants’ locations listed in 18.8 million patents filed between 1980 and 2014. The locations, which span 46 countries, are specified both by their geographic coordinates and by their administrative areas (e.g., city, state, country). [h/t Gaétan de Rassenfosse] | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3425764 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OTTBDX | https://people.epfl.ch/gaetan.derassenfosse?lang=en |
800 | 2019.07.31 | 4 | UK ministerial resignations. | The UK Institute for Government has been updating a spreadsheet of ministers who’ve resigned since 1979, the post each one held, the reasons for resignation, and the prime minister in charge at the time. The spreadsheet, which so far contains 151 resignations through last week, includes a few methodological notes embedded as comments in the header row. [h/t Gavin Freeguard] | https://docs.google.com/spreadsheets/d/1gVHNx4kzXd947AFfQGiJg5zJrdNXrM81t2OC8UJFnw8/edit | https://twitter.com/GavinFreeguard/status/1151835392922062848 |
801 | 2019.07.31 | 5 | Soviet space dogs. | Duncan Geere has compiled a database of the 48 dogs who participated in the USSR’s space program in the 1950s and 1960s. The information, which also includes details about the canines’ 42 flights, is based on Olesya Turkina’s book, Soviet Space Dogs. | https://www.duncangeere.com/ https://airtable.com/universe/expG3z2CFykG1dZsp/sovet-space-dogs http://fuel-design.com/publishing/soviet-space-dogs/ | |
802 | 2019.08.07 | 1 | Opioid distribution. | The Washington Post and the Charleston Gazette-Mail recently won a year-long legal battle to obtain a large slice of the Drug Enforcement Administration’s data on opioid shipments. (The data had previously been provided to plaintiffs in a federal lawsuit, but a judge had sealed the records from public access.) The Post has begun publishing its findings, as well as a cleaned-up version of the dataset that focuses on “shipments of oxycodone and hydrocodone pills to chain pharmacies, retail pharmacies and practitioners” between 2006 and 2012. The raw, unsealed dataset is also available. Related: A 500-row subset, so you can see what the data looks like before downloading the large files. | https://www.washingtonpost.com/health/how-an-epic-legal-battle-brought-a-secret-drug-database-to-light/2019/08/02/3bc594ce-b3d4-11e9-951e-de024209545d_story.html?noredirect=on&utm_term=.e32623ec4459 https://www.washingtonpost.com/investigations/76-billion-opioid-pills-newly-released-federal-data-unmasks-the-epidemic/2019/07/16/5f29fd62-a73e-11e9-86dd-d7f0e60391e9_story.html https://www.washingtonpost.com/graphics/2019/investigations/dea-pain-pill-database/ https://www.washingtonpost.com/national/2019/07/18/how-download-use-dea-pain-pills-database/ https://twitter.com/dataeditor/status/1151905391678230528 https://github.com/r4dat/ARCOS_OPIOIDS_WashPo | |
803 | 2019.08.07 | 2 | Federal judges. | The government-run Federal Judicial Center publishes a daily-updated “biographical directory” of all judges who’ve served on federal courts — the Supreme Court, appellate courts, district courts, the bygone circuit courts, plus a few others. The directory is presented as structured data, and includes information on the judges’ demographics, educations, professional careers, nominations and more. Related: The University of South Carolina’s Judicial Research Initiative also maintains historical datasets of district and appellate court judges; they contain many of the same variables plus some extras, such as religion and estimated net worth. [h/t Dan Nguyen + Sergio Galletta, Elliott Ash, and Daniel L. Chen] | https://www.fjc.gov/history/judges/biographical-directory-article-iii-federal-judges-export http://artsandsciences.sc.edu/poli/juri/index.htm http://artsandsciences.sc.edu/poli/juri/attributes.htm | https://www.reddit.com/r/datasets/comments/cl4ogt/the_us_judicial_branch_maintains_a_spreadsheet_of/ https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3415393 |
804 | 2019.08.07 | 3 | Hospitals, from Angola to Zimbabwe. | An international team of researchers has compiled a “comprehensive spatial inventory” of nearly 100,000 public health facilities in sub-Saharan Africa. The dataset includes facilities in 50 countries and lists each facility’s name, country, administrative region, type, ownership, and coordinates. [h/t Karen Grepin] | https://www.nature.com/articles/s41597-019-0142-2 https://springernature.figshare.com/articles/Public_health_facilities_in_sub_Saharan_Africa/7725374/1 | https://twitter.com/KarenGrepin/status/1154675390411149315 |
805 | 2019.08.07 | 4 | Antarctic icebergs. | Brigham Young University’s Antarctic Iceberg Tracking Database provides surveillance on hundreds of floating hunks of ice, past and present. The records cover 1978 plus 1992 through mid-2019; a subset of the database lists 117 icebergs’ daily position, estimated size, and rotation angle. [h/t Robin Hawkes] | https://www.scp.byu.edu/data/iceberg/ | https://www.getrevue.co/profile/maps/issues/spatial-awareness-7-maps-spatial-newsletter-by-robin-hawkes-190525 |
806 | 2019.08.07 | 5 | State liquor prices. | About a third of US states hold a monopoly on the local sale of hard liquor. Some of them — including Virginia, Alabama, Michigan, Utah, and North Carolina — let you download their price lists as spreadsheets. [h/t Christopher Ingraham] | https://en.wikipedia.org/wiki/Alcoholic_beverage_control_state https://www.abc.virginia.gov/products/products-faqs/product-downloads https://alabcboard.gov/QPL https://www.michigan.gov/lara/0,4601,7-154-89334_10570_14173---,00.html https://webapps2.abc.utah.gov/Production/OnlinePriceList/DisplayDivCategory.aspx https://abc.nc.gov/Pricing/PriceList | https://twitter.com/_cingraham/status/449353017733943296 |
807 | 2019.08.14 | 1 | 140 years of London theatre. | The London Stage Database “is the latest in a long line of projects that aim to capture and present the rich array of information available on the theatrical culture of London, from the reopening of the public playhouses following the English civil wars in 1660 to the end of the eighteenth century.” The database contains information on more than 50,000 events, which you can search online and download in bulk; the entries are often supplemented with detailed notes and cast lists. The site also offers a user guide and a detailed explanation of the data’s provenance. (“We hope that visitors to the site will find this frank acknowledgment and foregrounding of the dataset’s history and limitations refreshing rather than frustrating.”) [h/t Ula Klein] | https://londonstagedatabase.usu.edu/ https://londonstagedatabase.usu.edu/search.php https://londonstagedatabase.usu.edu/data.php https://londonstagedatabase.usu.edu/guide.php https://londonstagedatabase.usu.edu/about.php | https://twitter.com/KleinUla/status/1149815316358340616 |
808 | 2019.08.14 | 2 | Airports and runways. | OurAirports, a community-assisted project that began in 2007, provides bulk data detailing 55,000+ airports and 41,000+ runways, plus listings of airport radio frequencies and global navigation aids. In addition to standard airports, the records include 23 balloonports, 1,000+ seaplane bases, and 11,000+ heliports. Related: “How we created a map of the global architecture of airport runways, which turned out to be a wind map.” [h/t Robin Hawkes] | http://ourairports.com/ http://ourairports.com/data/ https://towardsdatascience.com/trails-of-wind-39967f07a67f | https://www.getrevue.co/profile/maps/issues/spatial-awareness-5-maps-spatial-newsletter-by-robin-hawkes-188158 |
809 | 2019.08.14 | 3 | Black tech conferences. | ThePLUG, a news site that reports on the black innovation economy, has been collecting data on conferences for black tech professionals. The dataset currently contains 33 events in more than a dozen cities, and lists their costs, year started, contact information, sponsors, and more. [h/t Sherrell Dorsey] | https://tpinsights.com/ https://tpinsights.com/2018/09/28/black-tech-conferences-offer-a-lifeline-in-a-predominantly-white-industry/ https://airtable.com/shr5YDyyXCC2HQF6E | https://www.sherrelldorsey.com/ |
810 | 2019.08.14 | 4 | European electricity. | The Open Power System Data platform has aggregated energy data from across Europe into a series of standardized datasets, including electricity consumption, power plants, and generation capacity. The project has also published an “IT philosophy,” a guide for new users, and a detailed listing of primary sources. | https://open-power-system-data.org/ https://data.open-power-system-data.org/ https://data.open-power-system-data.org/time_series/ https://data.open-power-system-data.org/national_generation_capacity/2019-02-22 https://data.open-power-system-data.org/national_generation_capacity/2019-02-22 https://open-power-system-data.org/it https://open-power-system-data.org/step-by-step https://open-power-system-data.org/data-sources | |
811 | 2019.08.14 | 5 | TED talks. | Katherine M. Kinnaird and John Laudun — professors whose research includes cultural analytics and computational folklore studies — have created a dataset of 2,656 TED talks, with metadata and transcripts, and have published a detailed description of the project. [h/t Lynn Cherny] | http://katherinemkinnaird.net/ http://johnlaudun.org/ https://github.com/kinnaird-laudun/data/tree/master/Release_v0 https://culturalanalytics.org/2019/07/ted-talks-as-data/ | https://pinboard.in/u:arnicas/t:datasets/ |
812 | 2019.08.21 | 1 | Oil and gas. | The Joint Organisations Data Initiative (JODI) coordinates the collection, standardization, and publication of oil and gas data from around the world; the 100+ countries that participate represent the vast majority of global production. The oil data goes back to 2002; the gas data goes back to 2009. Both datasets are updated monthly and track a range of subproducts (e.g., crude oil, diesel, jet fuel) and flows (e.g., imports, exports, production) for each country. Previously: Global oil and gas infrastructure (DIP 2018.06.06) and state-owned oil companies (DIP 2019.05.01). | https://www.jodidata.org/about-jodi/history.aspx https://www.jodidata.org/about-jodi/jodi-world-databases.aspx https://www.jodidata.org/gas/database/data-downloads.aspx https://edx.netl.doe.gov/dataset/global-oil-gas-features-database https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-06-06-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-05-01-edition https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-05-01-edition | |
813 | 2019.08.21 | 2 | Historical terrorist groups. | Joshua Tschantret, a political science Ph.D. candidate at the University of Iowa, has compiled a dataset of 260+ terrorist groups formed between 1860 and 1969. For the purposes of the dataset, “terrorist groups are operationally defined as politically-motivated non-state actors using bombings or assassinations,” Tschantret writes in an introductory article (PDF). About one-third of the groups in the dataset operated in the US, Russia, or China; the rest are spread across dozens of other countries. Related: Additional documentation (PDF). Good to know: On Twitter, Tschantret explains why the Black Panthers are included. [h/t Carla Martinez Machain] | https://jtschantret.com/ https://jtschantret.com/data/ https://jtschantret.files.wordpress.com/2019/08/the-old-terrorism-a-dataset-1860-1969.pdf https://jtschantret.files.wordpress.com/2019/08/online-appendix.pdf https://twitter.com/jtschantret/status/1161171198841151489 | https://twitter.com/carlammm/status/1161262430422470657 |
814 | 2019.08.21 | 3 | A decade of TV news words. | The TV-NGRAM project pulls 14 TV stations’ data from the Television News Archive and calculates how often each word (and two-word combination) was said during each 30-minute window. Most of the stations’ counts go back 9 or 10 years, and all are updated daily. | https://blog.gdeltproject.org/announcing-the-television-news-ngram-datasets-tv-ngram/ https://archive.org/details/tv | |
815 | 2019.08.21 | 4 | A century of UK general elections. | On Monday, the British government published a dataset of voting results, by party and parliamentary constituency, for every UK general election since 1918 — merging modern data with a handful of historical sources. | https://researchbriefings.parliament.uk/ResearchBriefing/Summary/CBP-8647 | |
816 | 2019.08.21 | 5 | Confidence. | The Confidence Database is aggregating data from behavioral studies that have asked participants how confident they were in their own assessments. As of its launch earlier this month, the database contains 145 datasets, 8,700 participants, and 4 million individual observations. [h/t Audrey Mazancieux + Doby Rahnev] | https://osf.io/s46pr/ https://psyarxiv.com/h8tju | https://twitter.com/MazancieuxA/status/1159477377950461952 https://twitter.com/DobyRahnev/status/1159461364815056897 |
817 | 2019.08.28 | 1 | Multinational corporations. | The OECD’s ADIMA database tracks multinational corporations — Walmart, Toyota, Nestle, etc. — and their subsidiaries. It currently includes economic statistics about each of the world’s 100 largest multinationals, the names and locations of 26,000 subsidiaries, and information about nearly 20,000 of their websites. The OECD says it plans to expand the number of companies in the future. Now you know: In 2016, the companies in the dataset “generated nearly $10 trillion in revenues (almost 20% of global GDP), earned $730 billion in profits and paid $185 billion in taxes,” according to the OECD. | http://www.oecd.org/sdd/its/measuring-multinational-enterprises.htm http://www.oecd.org/sdd/its/statistical-insights-the-adima-database-on-multinational-enterprises.htm | |
818 | 2019.08.28 | 2 | Citations and self-citations. | A team led by meta-research pioneer John Ioannidis has developed a dataset of citation metrics for science’s 100,000 most-cited authors. The dataset includes each author’s name, institutional affiliation, number of publications, total citations, “h-index,” and more. For each citation metric, there’s a second version that excludes self-citations. Related: “Hundreds of extreme self-citing scientists revealed in new database” (Nature). | https://profiles.stanford.edu/john-ioannidis https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000384 https://data.mendeley.com/datasets/btchxktzyw/1 https://www.nature.com/articles/d41586-019-02479-7 | |
819 | 2019.08.28 | 3 | Congressional whip counts. | Government professor C. Lawrence Evans’ dataset of US House "whip counts" describes more than 650 of the informal polls conducted by party leadership — covering 1955–86 for Democrats and 1975–80 for Republicans, on topics as varied as dairy prices, Alaskan statehood, voting rights, and Vietnam. It also indicates how each party member responded. [h/t Neil Malhotra + Janet Box-Steffensmeier] | https://wmpeople.wm.edu/site/page/clevan/home https://wmpeople.wm.edu/site/page/clevan/congressionalwhipcountdatabase | https://twitter.com/namalhotra/status/1165700263824375809 https://twitter.com/jboxstef/status/1138153812248616965 |
820 | 2019.08.28 | 4 | German federal judges. | Legal scholar and open-data enthusiast Hanjo Hamann has digitized seventy years of rosters from Germany’s seven federal courts, extracted structured data about the judges, and linked them to their Wikidata IDs. Related: Hamann’s detailed description of the dataset’s historical context and its construction. [h/t Erik Gahner Larsen] | https://www.coll.mpg.de/hanjo-hamann http://www.richter-im-internet.de https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12230 | https://github.com/erikgahner/PolData |
821 | 2019.08.28 | 5 | Movie shots. | James E. Cutting, a Cornell University psychology professor, has compiled several datasets on the structure of popular films, including one that indicates the length of each shot in 220 movies from 1915 to 2015. [h/t Igor Schwarzmann + Noah Brier] | http://people.psych.cornell.edu/~jec7/data.htm | https://twitter.com/zeigor https://whyisthisinteresting.substack.com/p/why-is-this-interesting-the-hostile |
822 | 2019.09.04 | 1 | Malaria geography. | The University of Oxford’s Malaria Atlas Project collects, models, and publishes a range of datasets related to the mosquito-borne disease, including localized incidence rates. You can explore and download the data, layer by layer, through the project’s interactive map. [h/t Clara Burgert-Brucker] | https://map.ox.ac.uk/ https://map.ox.ac.uk/data-directory/ https://map.ox.ac.uk/malaria-burden-data-download/ https://map.ox.ac.uk/faq/can-access-gis-data-map/ https://map.ox.ac.uk/explorer/ | https://twitter.com/crburgert/status/1165782515728179202 |
823 | 2019.09.04 | 2 | CAR refugees. | The Central African Republic’s ongoing civil war has forced more than 600,000 people to flee the country. The violence has also internally displaced another 600,000 people, a phenomenon that the UN's Humanitarian Data Exchange has been tracking. In addition to counts of internally displaced people by locality, the UN’s datasets include a listing of refugee sites and the country's road network. Related: A multimedia presentation of one family's 600-kilometer journey in search of safety. [h/t Becky Band Jain] | https://www.cfr.org/interactive/global-conflict-tracker/conflict/violence-central-african-republic https://data2.unhcr.org/en/situations/car https://data.humdata.org/dataset/car-baseline-assessment-data-iom-dtm https://data.humdata.org/dataset/car-shapefile-idp-sites https://data.humdata.org/dataset/car-roads-and-paths-shapefile https://data.humdata.org/visualization/a-journey-of-600km-car/ | https://twitter.com/bexband |
824 | 2019.09.04 | 3 | Publicly funded patents. | The 3PFL dataset — Patents and Publications with a Public-Funding Linkage — lists more than 13,000 US patents that have acknowledged federal funding. The dataset, accompanied by a detailed methodology, also links the patents to details about the funding, as well as to scientific publications that stemmed from it. Previously: Patent geography (DIP 2019.07.31). [h/t Gaétan de Rassenfosse] | https://zenodo.org/record/3369582 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0218927 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OTTBDX https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-07-31-edition | https://people.epfl.ch/gaetan.derassenfosse?lang=en |
825 | 2019.09.04 | 4 | Drama. | The Drama Corpora Project has collected and processed more than 800 plays in German, Greek, Spanish, Russian, Latin, and English. For each play, the project provides a structured-data version of the text, a network diagram, speech distribution metrics, plus several other files and features. [h/t Lynn Cherny] | https://dracor.org/ https://en.wikipedia.org/wiki/Text_Encoding_Initiative | https://pinboard.in/u:arnicas/t:datasets/ |
826 | 2019.09.04 | 5 | Rah, rah, rah! Fight, fight, fight! | FiveThirtyEight has built a dataset of 65 college football fight songs, which contains each song’s name, authors, year written, tempo, duration, and whether it includes various tropes, such as spelling out words or mentioning the school’s colors. Related: FiveThirtyEight’s “Guide To The Exuberant Nonsense Of College Fight Songs,” where you can listen to the songs, read the lyrics, and explore an interactive chart of tempo versus duration. | https://github.com/fivethirtyeight/data/tree/master/fight-songs https://projects.fivethirtyeight.com/college-fight-song-lyrics/ | |
827 | 2019.09.11 | 1 | Protected lands. | The UN’s World Database on Protected Areas is, it says, “the most up to date and complete source of information on protected areas, updated monthly with submissions from governments, non-governmental organizations, landowners and communities.” It contains structured, geospatial information on more than 245,000 nature reserves, national parks, wildlife sanctuaries, and other kinds of conservation sites. The project provides bulk downloads, an interactive map, country-level statistics, and an API. Previously: The California Protected Areas Database (DIP 2019.07.10). [h/t Giuseppe Sollazzo] | https://www.protectedplanet.net/ https://www.protectedplanet.net/c/about https://www.protectedplanet.net/c/unep-regions https://api.protectedplanet.net/ https://www.calands.org/cpad/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-07-10-edition-1 | https://mailchi.mp/3eeacdf7fd0a/preview-222-in-other-news-3740637 |
828 | 2019.09.11 | 2 | City street speeds and travel times. | Uber Movement, from the titular ride-hailing company, “shares anonymized data aggregated from over ten billion trips to help urban planning around the world.” Online, you can explore street speeds and estimated travel times for dozens of cities. To download data from the website, Uber requires you to provide your name, email address, and purpose. But they also provide a command-line tool that lets you download street-speed data without any registration. [h/t Michael A. Rice] | https://movement.uber.com/ https://www.npmjs.com/package/movement-data-toolkit | |
829 | 2019.09.11 | 3 | London bike infrastructure. | Transport for London has launched its Cycling Infrastructure Database, which “contains the location of more than 240,000 pieces of cycling infrastructure in London, including places to park and the location of cycle lanes.” The new information can be found among the agency’s broader collection of cycling data; look for the “CyclingInfrastructure” folder. [h/t Jolyon Whaymand] | https://tfl-newsroom.prgloo.com/news/tfl-press-release-worlds-largest-cycling-database-set-to-make-cycling-in-the-capital-easier https://cycling.data.tfl.gov.uk/ | https://twitter.com/joejolyon/status/1156884350094499841 |
830 | 2019.09.11 | 4 | Deaths on the job. | Since 1992, the US Bureau of Labor Statistics has collected data on work-related deaths through its Census of Fatal Occupational Injuries. The results are presented as various cross-tabulations — by industry, demographic, circumstances, and more. Related: The agency also publishes data on non-fatal injuries and illnesses. [h/t Elissa Philip Gentry and W. Kip Viscusi] | https://www.bls.gov/iif/oshcfoi1.htm https://www.bls.gov/iif/soii-data.htm | https://academic.oup.com/aler/advance-article-abstract/doi/10.1093/aler/ahz007/5531642 |
831 | 2019.09.11 | 5 | Bug fixes. | Researchers at Brazil’s Federal University of Ceará have published a new dataset “composed of more than 70,000 bug-fix reports from 10 years of bug-fixing activity of 55 projects from the Apache Software Foundation.” | https://dl.acm.org/citation.cfm?id=3345639 https://figshare.com/articles/Replication_Package_-_PROMISE_19/8852084 | |
832 | 2019.09.18 | 1 | Amazonian deforestation. | Since 1988, Brazil’s PRODES project has been using satellite imagery to track clear-cutting in the country’s Amazon basin. The government’s TerraBrasilis web portal provides an interactive map and downloads of the data. Global Forest Watch also provides a dataset of PRODES-detected deforestation, from 2001 to 2015. [h/t Giuseppe Sollazzo] | http://www.obt.inpe.br/OBT/assuntos/programas/amazonia/prodes https://en.wikipedia.org/wiki/Amaz%C3%B4nia_Legal http://terrabrasilis.dpi.inpe.br/en/home-page/ http://terrabrasilis.dpi.inpe.br/app/map/deforestation?hl=en-us http://terrabrasilis.dpi.inpe.br/en/download-2/ http://data.globalforestwatch.org/datasets/4160f715e12d46a98c989bdbe7e5f4d6_1 | https://mailchi.mp/3eeacdf7fd0a/preview-222-in-other-news-3740637 |
833 | 2019.09.18 | 2 | State immigration laws. | Political science professor Jamie Monogan has compiled a dataset of more than 2,700 immigration laws passed by US state legislatures from 2005 to 2016. The dataset summarizes the laws and also categorizes them by subject, scope, and whether they appear to be welcoming or hostile to immigrants. [h/t Jason Anastasopoulos] | https://spia.uga.edu/faculty-member/jamie-monogan/ https://onlinelibrary.wiley.com/doi/abs/10.1111/psj.12359 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/F8YTX2 | https://twitter.com/jlanastas/status/1172327364056846336 |
834 | 2019.09.18 | 3 | How states relate. | The State Networks dataset gathers comparative and relationship metrics for every combination of the 50 US states, plus the District of Columbia. Among the metrics: the number of flights between each state-pair, migration in either direction, and total value of goods imported. The comparisons also include state-to-state differences in demographics, ideology, and GDP. [h/t Matt Grossmann] | https://ippsr.msu.edu/public-policy/state-networks | https://twitter.com/MattGrossmann/status/1171418973063303168 |
835 | 2019.09.18 | 4 | Interconnecting roads. | Urban planning professor Geoff Boeing’s US street network data represents America’s roads as a network graph, where each intersection (and dead-end) is a node, and each street segment is an edge between two of those nodes. The project’s data repository contains these networks for each city, county, Census tract, and more. You might remember: Boeing’s urban street orientation charts. [h/t Robin Hawkes] | https://geoffboeing.com/ https://geoffboeing.com/2019/03/us-street-network-models-measures/ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CUWWYJ https://geoffboeing.com/2019/09/urban-street-network-orientation/ | https://www.getrevue.co/profile/maps/issues/spatial-awareness-6-maps-spatial-newsletter-by-robin-hawkes-189338 |
836 | 2019.09.18 | 5 | Dark-web screenshots. | CIRCL, Luxembourg’s computer security incident response team, has published a dataset of 37,500 .onion website screenshots, a subset of which have been categorized by topic (e.g., “drugs-narcotics”, “extremism”, “finance”) and/or purpose (e.g., “forum”, “file-sharing”, “scam”). [h/t Alexandre Dulaunoy] | https://www.circl.lu/ https://www.circl.lu/opendata/circl-ail-dataset-01/ | https://news.ycombinator.com/item?id=20406155 |
837 | 2019.09.25 | 1 | District court decisions. | The Carp-Manning U.S. District Court Database provides “data on 110,000+ decisions by federal district court judges handed down from 1927 to 2012.” It includes details of each case (such as the issue area and jurisdiction), each judge (year appointed, gender, race, and political party), and whether the decision was “liberal” or “conservative.” Previously: Federal judges’ data-biographies (DIP 2019.08.07). [h/t Scott Hofer and Jason Casellas] | https://www.umassd.edu/cas/polisci/resources/us-district-court-database/ https://www.fjc.gov/history/judges/biographical-directory-article-iii-federal-judges-export https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-08-07-edition | https://journals.sagepub.com/doi/abs/10.1177/1532673X19867052?journalCode=aprb |
838 | 2019.09.25 | 2 | India, geo-linked. | The Socioeconomic High-resolution Rural-Urban Geographic Platform for India is “an open access repository currently comprising dozens of datasets covering India’s 500,000 villages and 8,000 towns.” To make the datasets work well together, the project uses a common set of IDs for each town, village, and constituency. The downloadable files (free registration required) include data from censuses, elections, road construction, and more. Related: Introductory tweets from co-creator Paul Novosad. | http://www.devdatalab.org/shrug http://www.devdatalab.org/shrug_download/ https://twitter.com/paulnovosad/status/1169364171781287936 | |
839 | 2019.09.25 | 3 | Colonies. | Political science professor Jack Paine has compiled a dataset of 144 territories colonized by Britain, France, Portugal, Spain, the Netherlands, Belgium, Italy, the United States, Australia, South Africa, and New Zealand during the 16th through 20th centuries. The dataset includes the year of colonization, year of independence, and various metrics related to the colonies’ legislature and suffrage. | http://www.jackpaine.com/ https://www.cambridge.org/core/journals/world-politics/article/democratic-contradictions-in-european-settler-colonies/5B39382138A2D57175F6BE2A2ABF1CB4 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LU8IDT | |
840 | 2019.09.25 | 4 | Kiwi and Canadian building outlines. | Microsoft has released a dataset describing the geometric footprints of 12 million buildings in Canada, as detected by a neural network analyzing satellite imagery. (Last year, the company published similar data for the United States.) And the government of New Zealand has published building-outline data for most of the country. It’s based on aerial imagery, “using a combination of automated and manual processes,” and comes with detailed documentation. [h/t Michael McLaughlin + Robin Hawkes] | https://github.com/Microsoft/CanadianBuildingFootprints https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-07-18-edition https://github.com/Microsoft/USBuildingFootprints https://medium.com/on-location/an-open-building-outlines-dataset-for-new-zealand-eef8b558ef7a https://www.linz.govt.nz/land/maps/aerial-imagery-and-orthophotography https://nz-buildings.readthedocs.io/en/latest/introduction.html | https://www.datainnovation.org/2019/03/mapping-the-footprints-of-buildings-in-canada/ https://www.getrevue.co/profile/maps/issues/spatial-awareness-10-maps-spatial-newsletter-by-robin-hawkes-196361 |
841 | 2019.09.25 | 5 | UK noise pollution. | In 2012, the British government collected noise-level data from across the country and throughout London, including average daytime and nighttime loudness. [h/t Giuseppe Sollazzo] [Update, 2019-09-25: Data collected in 2017 can be found here: https://www.gov.uk/government/publications/strategic-noise-mapping-2019] | https://www.gov.uk/government/publications/open-data-strategic-noise-mapping https://data.london.gov.uk/dataset/noise-pollution-in-london | https://mailchi.mp/3eeacdf7fd0a/preview-222-in-other-news-3740637 |
842 | 2019.10.02 | 1 | Wealth, asset, and debt distributions. | The Federal Reserve’s Enhanced Financial Accounts datasets are supplements to the central bank’s more-aggregated Financial Accounts of the United States statistics. Among them: wealth, asset, and debt distributions by percentile and demographic. Also: College savings plans by state and the banking industry’s balance sheet. [h/t u/loopback2019] | https://www.federalreserve.gov/releases/efa/enhanced-financial-accounts.htm https://www.federalreserve.gov/releases/Z1/ https://www.federalreserve.gov/releases/efa/efa-distributional-financial-accounts.htm https://www.federalreserve.gov/releases/efa/efa-project-section-529-college-plans.htm https://www.federalreserve.gov/releases/efa/efa-project-consolidated-balance-sheet.htm | https://www.reddit.com/r/datasets/comments/b761im/federal_reserves_new_time_series_on_wealth/ |
843 | 2019.10.02 | 2 | Snapchat political ads. | Snapchat has released detailed data about every political ad purchased on its platform in 2018 and 2019. For each ad, the information includes its targeting parameters (age, gender, location, interests, internet service provider, device operating system, and more), the dates it ran, the amount spent, number of impressions, and a link to the ad itself. Snap says this year’s data will be updated daily and that new ads will appear within 24 hours of first delivery. [h/t Erik Gahner Larsen] | https://businesshelp.snapchat.com/en-US/article/political-ads-library https://www.snap.com/en-US/political-ads/ | https://github.com/erikgahner/PolData |
844 | 2019.10.02 | 3 | Datos cubanos. | Official, reliable data on Cuba is hard to come by. So Cuban journalist Barbara Maseda’s Proyecto Inventario has been collecting and publishing datasets relevant to the island nation — including by documenting the country’s legislators, blackouts, and non-agricultural cooperatives. Related: Júlio Lubianco’s profile of Maseda and the project. | https://twitter.com/barbaramaseda https://proyectoinventario.org/ https://proyectoinventario.org/parlamento-cuba-ix-legislatura-anpp/ https://proyectoinventario.org/apagones-programados-cuba-reportes-ciudadanos/ https://proyectoinventario.org/registro-cooperativas-no-agropecuarias-cuba/ https://knightcenter.utexas.edu/blog/00-21226-cuban-journalist-uses-creativity-dig-information-and-maintain-database-other-media | |
845 | 2019.10.02 | 4 | Bird sounds. | Machine learning scientist Agnieszka Mikołajczyk has been gathering useful resources for identifying birds by sound. The resources include about a dozen datasets of audio recordings, many of which are immediately downloadable. | https://github.com/AgaMiko https://github.com/AgaMiko/Bird-recognition-review | |
846 | 2019.10.02 | 5 | Designer data. | Nearly 10,000 people took the American Institute of Graphic Arts’ “Design Census” earlier this year. The (non-scientific but detailed) results are out, and the raw data are (prominently) available to download. [h/t Robin Sloan] | https://designcensus.org/ | https://www.robinsloan.com/ |
847 | 2019.10.09 | 1 | Financial access. | Last week, the International Monetary Fund released the results of its 10th annual Financial Access Survey. It’s a “supply-side” dataset; its country-level metrics include, for instance, the number of automated teller machines (mainland China has the most, with more than 1 million) and active mobile banking accounts (Pakistan and Bangladesh are tops). Many of the metrics are also disaggregated by gender. | https://www.imf.org/en/News/Articles/2019/09/27/pr19359-imf-releases-the-2019-financial-access-survey-results https://data.imf.org/FAS | |
848 | 2019.10.09 | 2 | The rules for making rules. | The Parliamentary Rules Database traces “the formal rules of procedure for various parliaments over time.” Currently, the database covers two parliaments — the UK House of Commons and the Irish Dáil. The House of Commons info includes more than 137,000 “standing orders,” going all the way back to 1811. [h/t Erik Gahner Larsen] | https://parlrulesdata.org/ https://www.parliament.uk/site-information/glossary/standing-orders/ | https://github.com/erikgahner/PolData |
849 | 2019.10.09 | 3 | Electricity in rural India. | Last month, Smart Power India and the Initiative for Sustainable Energy Policy published Rural Electricity Demand in India, a new survey dataset that “covers 10,000 households and 2,000 rural enterprises across 200 villages in Bihar, Uttar Pradesh, Odisha, and Rajasthan.” Respondents were asked, among other things, how many hours per day they get electricity, whether they have solar panels, and the price they pay for kerosene. [h/t Hisham Zerriffi + Johannes Urpelainen] | http://www.smartpowerindia.org/ https://sais-isep.org/ https://dataverse.harvard.edu/dataverse/REDI | https://twitter.com/hishamzerriffi/status/1175544673571524610 https://twitter.com/jurpelai/status/1175050741427257345 |
850 | 2019.10.09 | 4 | Web encryption. | Programmer Lee Butterman has built a dataset of the SSL encryption connections associated with 350 million web domains. For each connection, the dataset indicates the SSL certificate’s issuer, cryptographic algorithms used, and other details. [h/t Jason Norwood-Young] | https://www.leebutterman.com/ https://www.leebutterman.com/2019/08/01/handshaking-the-web-dataset-of-350-million-ssl-connections.html https://www.leebutterman.com/2019/08/05/analyzing-hundreds-of-millions-of-ssl-connections.html | https://mailchi.mp/c1861f935ce7/naked-data-232-stripy-teslas-twitter-twerps-and-thomas-cooked |
851 | 2019.10.09 | 5 | Walter White and company. | Web developer Tim Biles has created an unofficial Breaking Bad API. It provides structured data on every character, episode, and death in the TV series, plus selected quotations. | https://timbilestim.netlify.com https://breakingbadapi.com/documentation | |
852 | 2019.10.16 | 1 | Presidential popularity. | The Executive Approval Project uses international polling data to measure public support for presidents, prime ministers, and other political executives in 50 countries. For most of the countries, the database goes back to the 1990s; for some, it goes even further. Access requires providing a name, affiliation, and email address — plus agreeing to receive updates. [h/t Erik Gahner Larsen] | http://www.executiveapproval.org/ | https://github.com/erikgahner/PolData/ |
853 | 2019.10.16 | 2 | California power outages. | Last week, Pacific Gas and Electric began cutting power to hundreds of thousands of Californians — a precaution to keep the company’s aging infrastructure from sparking wildfires. Simon Willison has been scraping PG&E’s outage website every 10 minutes, and pushing the results into a database you can query and download. [h/t Lam Thuy Vo] | https://www.latimes.com/california/story/2019-10-10/pg-e-california-power-outages-grid-climate-change https://simonwillison.net/ https://simonwillison.net/2019/Oct/10/pge-outages/ http://critweb-outage.pgealerts.com/?WT.mc_id=Vanity_pge-outages https://pge-outages.simonwillison.net/pge-outages | https://lamthuyvo.com |
854 | 2019.10.16 | 3 | Mobile broadband prices. | Since 2017, the Alliance for Affordable Internet has been collecting country-level prices for mobile data. The most recent data covers 99 low- and middle-income countries for the second quarter of 2019. The rates are based on “the cheapest plan(s) providing at least 1GB of broadband data over a 30-day period from the largest mobile network operator in each country.” [h/t Teddy Woodhouse] | https://a4ai.org/who-we-are/ https://a4ai.org/mobile-broadband-pricing-data-historical/ https://a4ai.org/extra/mobile_broadband_pricing_usd-2019Q2 | https://twitter.com/TeddyWoodhouse |
855 | 2019.10.16 | 4 | Dating. | Stanford’s How Couples Meet and Stay Together study, which receives funding from both the university and the National Science Foundation, has been asking American adults about dating since 2009. A new-and-updated version of the survey includes questions related to dating apps. [h/t u/morningshower] | https://data.stanford.edu/hcmst https://data.stanford.edu/hcmst2017 | https://www.reddit.com/r/datasets/comments/dhxiva/how_couples_meet_and_stay_together_dataset_in_csv/ |
856 | 2019.10.16 | 5 | 🌴🌴🌴🌴🌴. | A team of researchers has “derived measurements of essential functional traits” for more than 2,500 species of palm plants — including but not limited to palm trees. Their PalmTraits database, based on published studies and preserved specimens, includes variables such as maximum height, fruit shape, and whether the fruit color is “conspicuous.” | https://www.nature.com/articles/s41597-019-0189-0 https://datadryad.org/stash/dataset/doi:10.5061/dryad.ts45225 | |
857 | 2019.10.23 | 1 | Autocracies. | The Autocratic Ruling Parties Dataset bills itself as “the first comprehensive data set on the founding origins, modes of gaining and losing power, ruling tenures, and other characteristics of autocratic ruling parties.” The dataset, created by political science professor Michael K. Miller, covers nearly 500 parties in more than 150 countries between 1940 and 2015. [h/t u/smurfyjenkins] | https://journals.sagepub.com/doi/full/10.1177/0022002719876000 https://sites.google.com/site/mkmtwo/data https://sites.google.com/site/mkmtwo/home | https://www.reddit.com/r/datasets/comments/dgve6r/new_dataset_on_all_autocratic_ruling_parties/ |
858 | 2019.10.23 | 2 | Euro-bank speeches. | The European Central Bank has begun publishing a spreadsheet of all executive board members’ speeches since the late 1990s. The dataset contains each speech’s date, speaker(s), title, subtitle, and text; the ECB says it will be updated every two months. [h/t Volker Nitsch + Peter Tillmann] | https://www.ecb.europa.eu/press/key/html/downloads.en.html | https://twitter.com/nitschv/status/1184048455741857793 https://twitter.com/peterhtillmann/status/1183833071050727430 |
859 | 2019.10.23 | 3 | Biomedical citations. | The National Institutes of Health’s new Open Citation Collection brings together 420 million academic citations in biomedical literature. The data — the most comprehensive available for biomedicine — now underpins the NIH’s iCite platform, where you can explore citation statistics online. The citations are also available as a bulk download and via an API. [h/t Travis Hoppe] | https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000385 https://icite.od.nih.gov/ https://doi.org/10.35092/yhjc.c.4586573 https://icite.od.nih.gov/api | https://twitter.com/metasemantic/status/1182358118564450304 |
860 | 2019.10.23 | 4 | Adoptable dogs. | The Pudding’s Amber Thomas used PetFinder’s API to collect detailed data on all adoptable dogs at shelters and rescue organizations on a single day in September. Related: Thomas's story for The Pudding, which uses the data to examine state-to-state relocations. | https://pudding.cool/author/amber-thomas/ https://www.petfinder.com/developers/api-docs https://github.com/the-pudding/data/tree/master/dog-shelters https://pudding.cool/2019/10/shelters/ | |
861 | 2019.10.23 | 5 | Charts and minds. | The Data Visualization Society has published the results of its annual community survey, which received 1,359 responses from data visualization practitioners. The public data contains answers to 50 questions on topics such as compensation, tools, community, and more. [h/t Amy Cesal] | https://www.datavisualizationsociety.com/ https://github.com/data-visualization-society/data_visualization_survey https://docs.google.com/forms/d/e/1FAIpQLSe-CMHrt1gYOxO9Tdk6cjyLIpJnAJMDjS94OBDHi6YU2g3sBA/viewform | https://www.amycesal.com/ |
862 | 2019.11.13 | 1 | Millions of violent crimes. | The Trace has posted raw data on 4.3 million murders, nonfatal shootings, assaults, robberies, and rapes, obtained from 56 police and sheriff’s departments in the United States. Related: Sarah Ryley's introductory Twitter thread. Also related: The Trace and BuzzFeed News’ investigative reporting on cities’ failure to arrest shooters, for which Sarah, Sean Campbell, and I used many of these datasets. | https://www.thetrace.org/ https://www.thetrace.org/violent-crime-data/ https://twitter.com/missryley/status/1190361721526992896 https://www.buzzfeednews.com/article/sarahryley/police-unsolved-shootings https://www.buzzfeednews.com/article/sarahryley/5-things-to-know-about-cities-failure-to-arrest-shooters https://github.com/the-trace-and-buzzfeed-news/introduction | |
863 | 2019.11.13 | 2 | Online news language, updated live. | The GDELT Project’s Web News Ngram dataset keeps track of the frequency of individual words and two-word combinations in online news around the world. The dataset incorporates news sources in 142 languages and provides overall word counts for every 15-minute window since January 1, 2019. An additional dataset tracks phrasings used in 10 character-based languages. Previously: GDELT’s similar dataset for television news (DIP 2019.08.21). [h/t Kalev Leetaru] | https://www.gdeltproject.org/ https://blog.gdeltproject.org/announcing-the-web-news-ngram-datasets-web-ngram/ https://blog.gdeltproject.org/the-languages-of-the-new-web-news-ngram-datasets-web-ngram/ https://blog.gdeltproject.org/announcing-the-web-ngram-character-ngram-datasets/ https://blog.gdeltproject.org/announcing-the-television-news-ngram-datasets-tv-ngram/ https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-08-21-edition | https://www.kalevleetaru.com/ |
864 | 2019.11.13 |