Data Is Plural — Structured Archive
2015.10.211Every place name in the United States.Sometimes, bureaucracy creates poetry. Since 1890, the U.S. Board on Geographic Names has been cataloguing, standardizing, and promulgating official names for the places we hike, swim, work, and call home. Along the way, it began publishing Geographic Names Information System (GNIS), a searchable and downloadable database containing all of its domestic nomenclature. In Alaska alone, the database lists names for 167 dams, 303 post offices, 666 glaciers, 2,704 capes, and 9,575 streams. My favorite: Confusion Creek. [h/t @emilymbadger]
2015.10.212“There’s finally federal data on low-income college graduation rates—but it’s wrong.”The Hechinger Report casts doubt on the Pell grant graduation numbers contained in the Department of Education’s recently-released College Scorecard. Why the discrepancy? “[W]hile schools are required by law to provide the graduation rates of Pell recipients to any applicants who ask, a loophole protects them from having to report the same figures to the government.” Oof.
2015.10.213What police-related data does your city publish?The Police Open Data Census, created by Code for America fellows in Indianapolis, is tracking “currently available open datasets about police interactions with citizens in the US,” including officer-involved shootings, use of force, and citizen complaints. The census currently covers 36 police departments. Related: The NYPD says it will start tracking all officer use-of-force incidents — not just gunfire — next year, the New York Times reports.
2015.10.214How often do Wikipedia editors edit?The Wikimedia Foundation has published a dataset enumerating monthly revision counts for every editor, across all of its wikis. The foundation is asking for help investigating a few perplexing trends. For example: Why has the number of “very active editors” — those with 100+ edits per month — increased while the number of merely “active” editors has plateaued?
2015.10.215Four years of rejected license plates.WNYC, through a freedom-of-information request to the New York DMV, obtained a list of vanity plate approvals and denials from late 2010 to late 2014. Among the denials: “RUBMYDUB,” “S5SS5S5S,” “RFLMAO,” and “CBSNEWS.” (Strangely, “NBC4” was approved. Go figure.) The files and related story were published in August, but the data are timeless. [h/t @veltman]
2015.10.281Data-shaming the robocallers.If you can’t beat ‘em, post spreadsheets about ‘em. Earlier this month, the Federal Communications Commission started publishing a dataset of complaints against telemarketers and robocalls. The FCC says the file will be updated weekly. It’s already being put to use: A clever programmer has crammed all the offending numbers into a single phone “contact” so that you can block them all at once. [h/t Shale Craig]
2015.10.282The demographics of traffic stops.This weekend, the New York Times published a front-page article on “the disproportionate risk of driving while black.” Among other findings: “officers were more likely to conduct [searches] when the driver was black, even though they consistently found drugs, guns or other contraband more often if the driver was white.” The investigation drew on several statewide traffic-stop datasets that track the race and gender of stopped drivers. The “seven states with the most sweeping reporting requirements,” in order of how easy it seems (to me) to get detailed data: Connecticut, North Carolina, Missouri, Nebraska, Maryland, Illinois, and Rhode Island.
2015.10.283Where do Americans spend their days?Most population numbers tell you where people live. But legions of Americans commute for work across city, county, and state lines. The Census Bureau’s Commuter-Adjusted Daytime Population Data accounts for these daily migrations. Manhattan’s (non-tourist) population doubles from 1.5 million to 3 million, by far the largest influx by raw numbers. But Lake Buena Vista, Fla., takes the percentage-growth prize. The city’s entire resident population could fit in two sedans, but its “daytime population” includes 33,000 workers — including a not-insubstantial number dressed as Mickey Mouse. [h/t Steven Romalewski]
2015.10.284Finally, free access to detailed U.S. import/export data.Prior to October 15th, the Census Bureau’s USA Trade Online tool cost $300/year. No longer. The newly-free dataset covers more than 17,000 commodities, including a category for “magic tricks, practical joke articles; parts and accessories.” [h/t Noah Veltman]
11 is on a mission: “to contribute to human sexuality understanding through a Big Data approach.” Last year, the site posted detailed metadata on 800,000 adult videos, including titles, descriptions, view counts, and tags. It powers Porngram, an only-kinda-safe-for-work charting tool.
2015.11.041Maternity leave policies at hundreds of American companies.The 600+ entries in this searchable, sortable database range from 3M to Amazon to Zynga, and list both paid and unpaid leave. The database, run by the women-in-the-workplace website, culls from published policies and employee tips. An introductory blog post provides more information.
2015.11.042MoMA, mo’ data.This July, the Museum of Modern Art published a dataset containing 120,000 artworks from its catalog, joining the UK’s Tate, the Smithsonian’s Cooper Hewitt, and other forward-thinking museums. The MoMA data contains the names of the artwork and artist, the dates created and acquired, and the medium — but no images. Related: Artist Jer Thorp encourages you to “perform” the data. Also related: Every museum in the United States. [h/t Nadja Popovich]
2015.11.043All licensed firearm dealers since 2010.The Bureau of Alcohol, Tobacco, Firearms, and Explosives publishes a searchable and downloadable licensing database. License-holders fall into eleven categories. Among them: run-of-the-mill dealers, ammunition manufacturers, collectors of “curios and relics,” pawnbrokers, and importers of “destructive devices.” The ATF’s website contains monthly and state-by-state archives. [h/t Marc DaCosta] [Correction, 2015-11-04: There are only nine categories of license-holders. The published ATF data includes only eight of them; it does not include "Collector of Curios and Relics." Thanks to @MikeStucka for flagging this mistake.]
2015.11.044One thousand ways to say “dog.” Trans-New Guinea is the world’s third-largest language family. But it’s also among the poorest-studied., an online database launched in 2013, is trying to change that. It now contains more than 1,000 New Guinea languages and lists 145,000 word translations — including 1,065 entries for “dog.” It even has an API. A recent PLOS ONE journal article provides additional background and statistics. [h/t Simon J. Greenhill]
2015.11.045When planes attack.Last May, a Gulfstream G150 taking off from Houston’s Ellington Airport struck an armadillo. The animal’s remains were collected, but were not sent to the Smithsonian Institution for identification. This anecdote comes from a single row in the Federal Aviation Administration’s Wildlife Strike Database, and draws on just seven of the 94 available fields. The database contains more than 168,000 strikes reported since 1990, almost all involving birds. Roughly 10% of the time, the animal's remains are sent to the Smithsonian's Feather Identification Lab. [h/t Dan Vergano]
2015.11.111Naughty companies.Good Jobs First’s Violation Tracker calls itself “the first national search engine on corporate misconduct.” The new database currently contains nearly 100,000 penalties for environmental, health, and safety violations — sourced from 13 U.S. regulatory agencies — since 2010. Search results can be downloaded as CSV files, which contain a few additional fields. (Tip: Search for “*” to get all cases.) The largest single fine? The Department of Justice’s $20.8 billion penalty this year against BP. [h/t Samuel Rubenfeld]
2015.11.112The 139,756 side effects of 1,430 medical drugs.The Side Effect Resource, a.k.a. SIDER, takes all the fine print from drug labels, and aggregates the information about side effects into a searchable, downloadable database. SIDER got a major upgrade last month, and now contains 40% more drug-effect pairs than before. The website incorporates both generic and brand names, so that searches for “Prozac” and “fluoxetine” bring you to the same page.
2015.11.113Albuquerque’s impressive open-data program.The New Mexico city publishes dozens of regularly-updated, well-documented datasets. Among them: government employee earnings, the number of daily visitors to the city’s swimming pools, real-time bus locations, the geography of police beats, and the city’s complete vendor checkbook. [h/t Tom Johnson, who emailed Data Is Plural to praise how Albuquerque is sharing its data: “I have not found any other city in the world doing so in such detail.”]
2015.11.1141.8 billion pages of books (and booklike things).Earlier this year, the HathiTrust Research Center released a massive dataset extracted from 4.8 million digitized volumes. For each of its 1.8 billion pages, the dataset includes word frequencies, languages used, and sentence counts, among other features.
2015.11.115Deadly Prussian horses.For his 1898 book, The Law of Small Numbers, statistician Ladislaus Bortkiewicz tabulated the number of Prussian cavalrymen killed by horse kicks each year between 1875 and 1894. (In total, 196 suffered that tragic fate.) The dataset is tiny, but boasts an outsized legacy: Bortkiewicz’s lethal horse kicks allegedly helped to popularize the then-obscure Poisson distribution. [h/t Noah Veltman]
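A minimal sketch of how the horse-kick tally maps onto a Poisson model. The 196 deaths and the 1875 to 1894 span come from the entry above. The figure of 14 army corps is an assumption not stated here, and the names `poisson_pmf` and `expected_counts` are hypothetical helpers for illustration only.

```python
import math

# Figures taken from the entry: 196 deaths across the 20 years 1875-1894.
TOTAL_DEATHS = 196
YEARS = 20
# Assumption (not stated in the entry): the tally covers 14 army corps.
CORPS = 14


def poisson_pmf(k, rate):
    """Probability of exactly k events when events occur at the given mean rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def expected_counts(rate, units, max_k=4):
    """Expected number of units (corps-years) that see 0..max_k deaths."""
    return {k: units * poisson_pmf(k, rate) for k in range(max_k + 1)}


# Mean deaths per corps per year under the assumptions above.
rate = TOTAL_DEATHS / (YEARS * CORPS)

if __name__ == "__main__":
    for k, n in expected_counts(rate, YEARS * CORPS).items():
        print(f"{k} deaths: about {n:.1f} corps-years")
```

Comparing these expected counts with the observed number of corps-years at each death count is the usual way to judge how well the Poisson model fits.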
2015.11.181Follow the F-17s.The Arms Transfer Database tracks the international flow of major weapons — artillery, missiles, military aircraft, tanks, and the like. Maintained by the Stockholm International Peace Research Institute (SIPRI), the database contains documented sales since 1950 and is updated annually. SIPRI provides a download tool, which outputs rich-text files, but it’s also possible to download the data as CSV. [h/t Martín González]
2015.11.182#campaign.The 2016 presidential hopefuls have been tweeting, ‘gramming, and ‘booking like a pack of millennials. Fusion collected nearly 70,000 images from the candidates’ social media accounts, then pumped the pictures through an automated tagging system. Now you can search for guns, money, beer and more — or download the raw data for your own analysis.
2015.11.183America’s exonerees.The National Registry of Exonerations contains “every known exoneration in the United States since 1989—cases in which a person was wrongly convicted of a crime and later cleared of all the charges based on new evidence of innocence.” For each of the 1,702 cases, the registry includes details about the exoneree, the crime, and the factors — such as new DNA evidence — that contributed to the exoneration. [h/t agate]
2015.11.184Health data, unprotected.Under the HITECH Act of 2009, companies must notify the government of any data breach involving the HIPAA-protected health data of 500 or more people. Summaries of those reports are available at the Department of Health and Human Services’s Breach Portal, which currently contains more than 1,300 incidents. Related: In April, JAMA published an analysis of the breaches. Also related: Forty years of legislative acronyms. [h/t Virginia Hughes]
2015.11.185Britain’s booze.What contains 34,052 bottles and is worth an estimated £3 million? The United Kingdom’s official wine cellar, which provides libations for the government’s guests and hosts — and a dram of data for the public. Between April 2014 and March 2015, the cellar’s clients consumed more than 5,500 bottles of wine and liquor. Among them: 205 bottles of Champagne, 51-and-a-half bottles of gin, and one bottle of Château Pichon-Longueville Comtesse de Lalande 1986. [h/t Nadja Popovich]
2015.11.251Complaints against Chicago police.The newly-launched Citizens Police Data Project has collected more than 56,000 allegations of police misconduct. The data, covering 2002-2008 and 2011-2015, includes demographic information about the complainant and the officer, as well as the type and location of the incident. Click here to download the raw data. Related: The City of Chicago’s wide-ranging data portal includes a spreadsheet of every reported crime in the city since 2001; you can explore neighborhood trends via the Chicago Tribune. [h/t Melissa Segura and Abraham Epton]
2015.11.252Refugees in America.The Department of State publishes demographic reports on refugee arrivals since 2002. The data includes country of origin, resettlement city and state, religion, age, gender, and more. Related: At BuzzFeed, I used the data to chart the past decade of refugee arrivals. Also related: The UN’s refugee data portal.
2015.11.2531.7 billion Reddit comments.You can download every comment posted to Reddit since October 2007 … but you’ll need some patience and a terabyte of storage. If you’re more of the instant-gratification, don’t-have-an-external-hard-drive-lying-around type, you might enjoy FiveThirtyEight’s “How The Internet* Talks,” a sort of Google Ngrams for the Reddit data. [h/t Randall Olson and Ritchie King]
2015.11.254The most popular government web pages.The U.S. government has one very large Google Analytics account, and has begun sharing traffic data with the public. Not every federal website is accounted for, but more than 4,000 are. Over the past 90 days, they’ve racked up approximately 1.5 billion visits. The most popular page at the time of this writing? Bonus: How they built it. [h/t Rebecca Williams]
2015.11.255A century of pumpkin pie.In 2011, the New York Public Library launched a crowdsourcing project to transcribe its massive collection of restaurant menus, dating back to the 1850s. So far, volunteers have transcribed more than 1.3 million dishes, their prices, and where on the menu each dish appeared. The library publishes a spreadsheet of all the data, and updates it twice a month. Happy Thanksgiving!
2015.12.021Historical climate data.The National Centers for Environmental Information maintains more than 20 petabytes of data, it says. Among the most useful slices is the Global Historical Climatology Network’s data, which aggregates reports on temperature, precipitation, wind, and more from tens of thousands of climate-monitoring stations around the world. One tidbit: January 1995 was Death Valley’s wettest month since at least the 1960s, with a whopping 2.59 inches of precipitation.
2015.12.022Mass shootings in provides datasets listing all U.S. mass shootings — defined as “when four or more people are shot in an event, or related series of events” — since 2013. So far in 2015, mass shootings have killed 447 people and wounded an additional 1,292.
2015.12.023A faster way to download open data.Socrata’s software powers open-data portals around the world. But downloading large datasets — e.g., this 2.8-gigabyte dataset of NYC parking tickets — from Socrata-powered portals can feel, well, sluggish. One solution:, a free website that provides faster-to-download versions of virtually every dataset from 50+ Socrata portals. Related: Thomas Levine’s detailed analyses of Socrata-powered portals, published in 2013 and 2014. [h/t John Krauss and Steven Romalewski]
2015.12.024College sports financing.The Huffington Post and Chronicle of Higher Education teamed up to investigate how colleges bankroll their athletics. (Georgia State, for example, spent more than $100 million subsidizing sports between 2010 and 2014, mostly via student fees.) The report, published last week, draws on five years of revenue/expense reports from 234 Division I public universities. You can download the raw data or explore it online. Related: The Washington Post also tackled this topic — from a slightly different angle — last week, examining the profitability (or lack thereof) of athletic programs at 48 schools. [h/t Shane Shifflett]
2015.12.025Celebrity faces, annotated.The CelebA dataset, published in September, contains 200,000+ images of 10,000+ celebrities, each annotated with 40 yes/no variables. Some favorites: “5_o_Clock_Shadow,” “Bags_Under_Eyes,” and “Goatee.”
2015.12.091The 2015 Global Open Data Index, released last night.Open Knowledge International has just published its latest survey of openly available government data. This year’s audit includes 112 countries and territories, up from 97 last year. The survey scores each based on the availability of datasets in 13 key categories (e.g., “election results,” “government spending,” and “pollutant emissions”) and links out to the available datasets. In this year’s survey, Taiwan ranks first, the U.K. second, and Denmark third. The U.S. ranks eighth.
2015.12.092More data (and discussion) on mass shootings.Last week, Data Is Plural highlighted, a source for data on shootings that wounded at least four people. Other resources include the Gun Violence Archive and Mother Jones’ detailed database of mass shootings since 1982. The Mother Jones database takes a narrower approach, focusing on shootings that killed at least four people in a public setting. In a New York Times op-ed, published shortly after last week’s San Bernardino shooting, the editor behind that database argues that broader methodologies don’t distinguish between “a 1 a.m. gang fight” and “the madness that just played out in Southern California.” A Washington Post article weighs the pros and cons of broader and narrower approaches. [h/t Robin Shields + Mark Follman + Christopher Ingraham]
2015.12.093Firearm background checks.Gun dealers use the FBI’s National Instant Criminal Background Check System to determine whether someone is allowed to buy a firearm. There isn’t a one-to-one correlation between these background checks and gun sales, but they’re said to be the best available proxy. The FBI publishes a PDF tallying the monthly number of firearm checks for each state and type. At BuzzFeed News, we’ve parsed that PDF into a CSV/spreadsheet for easier use.
2015.12.094Good FOOD, bad food.The CDC’s Foodborne Outbreak Online Database (FOOD) contains 18,000+ outbreaks, which resulted in 358,000+ illnesses and 13,000+ hospitalizations, from 1998 through last year. In 2008, a multi-state Salmonella Saintpaul outbreak hospitalized 308 people — the highest count in the database.
2015.12.095Know thy barber.The Texas Department of Licensing and Regulation maintains a webpage of well-formatted data on state-licensed workers, including tow truck operators, boxing judges, journeyman electricians, elevator inspectors, manicurists, and, yes, barbers. [h/t Ryan Murphy]
2015.12.161Policing the police.The Department of Justice is authorized to investigate police departments that display a “pattern or practice” of civil rights violations. In April, the Marshall Project began publishing a spreadsheet of the DOJ investigations into local law enforcement. The dataset, which is updated regularly, indicates when each case began, when it ended, and what type of agreement (if any) was reached. The latest entry: An investigation into the Chicago Police Department, announced last week. Related: PBS Frontline's interactive map of DOJ investigations. [h/t Tom Meagher]
2015.12.162All the world’s glaciers.The recently-updated Randolph Glacier Inventory contains spreadsheets and outlines of every known glacier in the world. Of the 212,000+ glaciers inventoried, more than 27,000 are in Alaska. Someone please adopt Deserted Glacier. [h/t Robin Wilson’s stunningly extensive directory of free GIS data]
2015.12.163College coaching salaries.Last week, USA Today released its annual accounting of assistant — yes, assistant — college football coaches’ salaries. At $1.6 million per annum, Auburn’s Will Muschamp leads the pack. More than 371 assistants have salaries of $250,000+. The release complements the publication’s database of head-coaching salaries. Related: Each state’s highest paid public employee, as of 2013-ish. [h/t Steve Berkowitz]
2015.12.164Many pants on fire.You’ve probably heard of PolitiFact, the Tampa Bay Times project that fact-checks what politicians say. What you might not know: PolitiFact has an API. You can use it to fetch detailed data from the project’s national and state-level editions. Related: “All Politicians Lie. Some Lie More Than Others,” PolitiFact’s top editor writes in the New York Times.
2015.12.165Every obscenity and death in Quentin Tarantino's movies.This dataset is fucking amazing.
2015.12.231How America injures itself.Every year, the U.S. Consumer Product Safety Commission tracks emergency room visits to approximately 100 hospitals. The commission uses the resulting National Electronic Injury Surveillance System data to estimate national injury statistics, but it also publishes anonymized information for each consumer product–related visit, including the associated product code (e.g., 1701: “Artificial Christmas trees”) and a short narrative (“71 YO WM FRACTURED HIP WHEN GOT DIZZY AND FELL TAKING DOWN CHRISTMAS TREE AT HOME”).
2015.12.232Farm to data-table.The USDA’s 2012 Census of Agriculture — the most recent vintage available — tallies agricultural activity at the national, state, and county levels. You can download detailed data from the agency’s Quick Stats tool. In 2012, Oregon harvested more Christmas trees than any other state: 6.8 million of them, or 39% of the census total. [Correction, 2015-12-23: The Oregon numbers incorrectly referenced 2007 data. In 2012, Oregon harvested 6.4 million trees, or 37% of the census total. Thanks to @JoeMurph for flagging this mistake.]
2015.12.233Wikipedia traffic trends.The Wikimedia Foundation publishes hourly pageview counts for each of its articles. It’s a tremendous amount of data — about 90 megabytes, compressed, per hour. Luckily, there’s also a tool for browsing individual pages’ daily traffic stats. Last Wednesday, the English-language page for "Christmas tree" received 7,822 visits, its highest mark so far this year.
2015.12.234Little’s big tree maps.The Forest Service has digitized many of the tree species distribution maps from Elbert Little's “Atlas of United States Trees,” first published in the 1970s. Shapefiles and PDFs are available for more than 600 species — including Ilex opaca (American holly) and Pseudotsuga menziesii (Douglas fir).
2015.12.235The emojiverse.The Unicode Consortium publishes a big ol’ HTML table of every emoji, how they look in various contexts, and when they entered the canon. The “Christmas tree” emoji occupies code point U+1F384, and was introduced in 2010. (“Menorah with nine branches” arrived in 2015.) [h/t Ben Collins]
2015.12.301New Orleans slave sales, 1856–1861.A new study in the American Economic Review suggests that slaveholders in the South underestimated the odds of “emancipation without compensation.” To reach its conclusions, researchers compiled a dataset of 15,377 slave sales, culled from remarkably detailed official records. Data for each sale includes demographic information about the slaves, seller, and buyer; the price paid; payment method; and researcher notes.
2015.12.302Medicare’s priciest drugs.Last week, the Centers for Medicare & Medicaid Services published a new drug-spending dataset. It focuses on medications that (a) cost the most, overall; (b) cost the most per patient; or (c) saw the largest price-hike between 2013 and 2014. Vimovo, an arthritis pain reliever, tops the price-hike rankings: Between 2013 and 2014, the average cost per unit increased more than sixfold, from $1.94 to $12.46. [h/t Virginia Hughes]
2015.12.303Millions of home loans.Over the weekend, the Seattle Times and BuzzFeed News published an investigation into Clayton Homes, a company that is owned by Warren Buffett's Berkshire Hathaway and that “has grown to dominate virtually every aspect of America’s mobile-home industry.” The investigation draws on data released through the Home Mortgage Disclosure Act. The law requires large lenders to publish details about each of their loans. You can download the raw data from the FFIEC, or slightly user-friendlier versions from the CFPB. [h/t Mike Baker + Dan Wagner]
2015.12.304Every known satellite orbiting Earth.The Union of Concerned Scientists’s Satellite Database currently contains 1,305 entries and is updated “roughly quarterly.” The longest-orbiting: AMSAT-OSCAR 7, an amateur radio satellite launched in November 1974. Related: The satellites, visualized. [h/t David Yanofsky]
2015.12.305Things lost (and not yet found) on the New York subway.Among them: 37,622 cellphones; 3,604 hats; 1,903 scarves; 1,017 birth certificates; 483 diaries; 115 VHS tapes; 82 violins; 41 GPS navigation systems; and 9 answering machines. At least one of the 2,756 umbrellas is mine. [h/t Mona Chalabi + Allison McCann + Noah Veltman]
2016.01.061One year of fatal police encounters.After it became clear that the federal government was doing an awful job of keeping track of how often police kill civilians, two newspapers started counting last year. According to The Guardian’s tally, U.S. police killed 1,136 people in 2015. The Washington Post’s count — which focused on shootings only and didn’t include off-duty officers — counted 984 deaths. Both organizations provide methodologies and downloadable datasets (including demographic and geographic details): Guardian / WaPo.
2016.01.062The World Atlas of Language Structures.This database compares the phonological, grammatical, and lexical properties of hundreds of languages. One dataset looks at languages’ counting systems. (Many use the decimal system, but Yoruba uses the vigesimal system and Danish uses a hybrid.) Others examine the use of tone, how you say “tea”, and whether there are different words for “finger” and “hand”. [h/t Jacqui Maher]
2016.01.063NYC felonies.The historically opaque New York Police Department has finally started publishing incident-level felony data — something that cities such as Chicago and Boston have done for years. The dataset includes the date, time, and approximate location of each offense. It currently covers the first nine months of 2015 and will (apparently) be updated quarterly. Don’t miss the footnotes in this PDF. Related: Some initial insights. Also related: “Which Cities Share The Most Crime Data?” [h/t Dan Nguyen + Mark Silverberg]
2016.01.064Refugee arrivals along the Western Balkans route.The UN’s refugee agency is keeping track of daily refugee movements through Greece, Macedonia, Serbia, and farther along into Europe. The downloadable data and interactive map cover migrations since October 2015.
2016.01.065The position of Michael Jackson’s white glove in all 10,060 frames of “Billie Jean.”Crowdsourced from his 1983 “Motown 25” performance. [h/t Nadja Popovich]
2016.01.131Religion in America.The 2010 Religious Congregations and Membership Study counts, for more than 200 religious groups, the number of congregations and adherents in each U.S. state and county. In total, the study reported more than 344,000 congregations and more than 150 million adherents — nearly half of the 2010 U.S. population. New counts are published every 10 years. [h/t Julia Silge]
2016.01.132Shifting global borders.What did the world’s political boundaries look like in 1945? The lines between Swedish counties in 1968? The U.S. states in 1865? Thenmap, an open-source API and mapping tool, answers these questions and more. [h/t Carlos Matallín]
2016.01.133U.S. foreign assistance.USAID, the Peace Corps, the U.S. African Development Foundation, and other agencies report data on foreign assistance spending to The full dataset includes detailed information for each grant and contract — and comes with a data dictionary. The website also provides a chart of participating agencies, and an interactive map of the data.
2016.01.134Retirees’ language preferences.Last year, more than 2 million people applied for new Social Security retirement and survivor benefits. When they did, they indicated their preferred language. More than 93% said English, and about 5% of applicants said Spanish — the second most popular choice. Among the 88 other options: 1,616 applicants chose American Sign Language, 32 chose Japanese, nine chose Yiddish, and one chose Swedish.
2016.01.135State Department per diems.When State Department employees travel on official business abroad, they can get reimbursed — to a point — for lodging, meals, and things such as laundry. The department publishes monthly spreadsheets of the maximum per diems, which vary by location. The highest right now? The Cayman Islands ($735 per day). The lowest? Antarctica ($0/day) and Iraq ($11/day).
2016.01.201Flint water samples.Researchers from Virginia Tech have joined forces with Flint, Mich., residents to sample the city’s lead-tainted water supply. In December, the researchers posted the results of 271 samples, which indicated high levels of lead contamination. The most extreme sample found a lead concentration of 158 parts per billion — 10 times higher than the EPA’s “action level.” Related: The New York Times + The Washington Post have used the data.
2016.01.202The transatlantic slave trade.Slate Magazine’s “The Atlantic Slave Trade in Two Minutes” — recently named a multimedia finalist for the American Society of Magazine Editors’ annual awards — tracks 20,528 transatlantic voyages over 315 years. The information comes via, which provides searchable, downloadable records of ships’ and captains’ names, regions where slaves were purchased and sent, and more.
2016.01.203Campaign ad purchases.The FCC requires broadcasters to keep records of “all requests for broadcast time made by or on behalf of a candidate for public office.” With the help of volunteers, Political Ad Sleuth gathers those records and enters them into a searchable, downloadable database. Note: Due, in part, to the difficulty of transcribing the (non-standardized) records, the information in the database is incomplete.
2016.01.204568,454 reviews of “fine foods” on Amazon.In 2013, Stanford University researchers published a paper examining how people’s tastes “change and evolve over time.” They drew, in part, on a dataset containing 13 years of Amazon reviews of gourmet foods. (Note: Not all foods were intended for humans.) The dataset comes in a slightly unconventional format; here’s a Python script to convert it to a TSV file. [h/t Kaggle]
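A minimal sketch of the kind of conversion described above, not the script the entry links to. It assumes the reviews arrive as blocks of `key: value` lines separated by blank lines, with keys such as `product/productId` and `review/score`. That layout and the helper names `parse_records` and `to_tsv` are assumptions made for illustration.

```python
import csv
import io


def parse_records(lines):
    """Yield one dict per record from 'key: value' lines split by blank lines."""
    record = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first ': ' only, so colons inside review text survive.
        key, sep, value = line.partition(": ")
        if sep:
            record[key] = value
    if record:
        yield record


def to_tsv(records, fields):
    """Render records as TSV text with a header row, one record per line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for rec in records:
        # Tabs inside values are replaced so they cannot break the columns.
        writer.writerow([rec.get(f, "").replace("\t", " ") for f in fields])
    return out.getvalue()
```

For a file that large, passing the open file object straight to `parse_records` keeps memory use flat, since records are yielded one at a time.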
2016.01.205One hyper-quantified human.Last month, Nature Communications published a study of the “long-term neural and physiological phenotyping of a single human.” That human? Study co-author Russell A. Poldrack, “a right-handed Caucasian male, aged 45 years at the onset of the study.” The 18 months of results — tracking brain connections, food consumption, stress levels, and much more — are available to download and explore. [h/t Sune Lehmann]
2016.01.271Airplane confidential.NASA collects aviation safety reports from pilots, technicians, flight attendants, and other personnel. The (anonymized) published data contains text narratives, as well as details about flight conditions and other safety factors. (“Ok, I did it; the dumbest thing I have ever done in my entire life,” one confessional begins.) You can search the database but can only download so many records at a time. And you can request the full database from NASA, but you’ll have to wait. An alternative option: There’s a copy from November on the Internet Archive. [h/t Dave Riordan + Julian Simioni]
2016.01.272Cancer statistics.Earlier this month, the American Cancer Society launched a new data dashboard. Metrics include estimated new cases, historical survival rates, and more. To download the corresponding spreadsheets, use the “tools” button on each page. [h/t Virginia Hughes]
2016.01.273Tens of millions of movie ratings.MovieLens is a free, noncommercial movie recommender — sort of like Netflix, minus the ability to watch movies. The service is run by a research lab at the University of Minnesota. The lab publishes several datasets of user ratings and movie info. The largest contains 22 million ratings. Among movies with at least 1,000 ratings, The Shawshank Redemption has received the highest average score (4.44 of 5), while 2007’s Epic Movie has netted the lowest (1.48 of 5).
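The highest-and-lowest comparison above depends on a minimum-ratings cutoff. A minimal Python sketch of that calculation follows, assuming a ratings CSV with `movieId` and `rating` columns. The column names and the tiny sample are assumptions for illustration, not taken from the published files.

```python
import csv
import io
from collections import defaultdict


def rank_by_average(rows, min_ratings):
    """Return (movieId, average rating) pairs, best first, for movies
    that have at least `min_ratings` ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        movie = row["movieId"]
        totals[movie] += float(row["rating"])
        counts[movie] += 1
    averages = {
        movie: totals[movie] / counts[movie]
        for movie in counts
        if counts[movie] >= min_ratings
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data; movie 30 has a single rating and is filtered out.
sample = io.StringIO(
    "userId,movieId,rating\n"
    "1,10,5.0\n"
    "2,10,4.0\n"
    "1,20,1.0\n"
    "2,20,2.0\n"
    "3,30,5.0\n"
)
ranked = rank_by_average(csv.DictReader(sample), min_ratings=2)
print(ranked)
```

Without the cutoff, a movie with one perfect rating would top the list, which is why the threshold matters. On the full dataset, `min_ratings=1000` would correspond to the comparison described above.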
2016.01.274Federal employees’ feelings.Last year, more than 400,000 federal employees took the Office of Personnel Management’s annual survey, which includes questions about satisfaction, leadership, and work schedules. You can download aggregate and raw results. Important note: The survey is voluntary and non-random.
2016.01.275The Survey of Scottish Witchcraft.The University of Edinburgh hosts an incredibly detailed and deeply documented database of more than 3,000 accused witches in Scotland. The mania reached its quantitative peak in 1662, when, according to the database, 402 people were accused of witchcraft. [h/t Felix Haass]
2016.02.031Angry travelers.The Transportation Security Administration publishes spreadsheets of legal claims against the agency, including the location, circumstances, and outcome of each claim. The most expensive settlement on record appears to involve a vehicle-related personal injury in July 2004, for which the TSA paid $125,000. On the other end of the spectrum: In 2014, a traveler recouped $1.25 for lost food or drink at Hilton Head Island Airport. [h/t Seth Kadish + Lindsey Cook]
2016.02.032Famous people on Wikipedia.Last month, a group of researchers introduced Pantheon 1.0, “a manually verified dataset of globally famous biographies.” It starts with 11,341 Wikipedia biography pages in 25 languages, and adds birthplace, birthdate, gender, occupations, and page views. You can download the data or explore it online. Baffling factoid: As of May 2013, High School Musical star Corbin Bleu had biographies in more language editions than anyone other than Jesus Christ and Barack Obama. Related: A broader-but-shallower dataset of more than 400,000 influential people on the English-language Wikipedia. [h/t Ben Dilday]
2016.02.033Zika data.Fears about the Zika virus — and a possible, but not proven, connection to microcephaly — are growing. Little data on the latest outbreak has been published, but here’s an open guide to what’s available so far, including reported cases of microcephaly in Brazil and the number of suspected Zika samples sent to Colombia’s national institute of health.
2016.02.034Post-Fukushima radiation.Next month marks the five-year anniversary of the Fukushima Daiichi disaster, the worst nuclear accident since Chernobyl. Since shortly after the meltdown, volunteers for Safecast have been collecting radiation measurements in Japan and beyond. The results are available to download or to access via API.
2016.02.035Movie chatter.The Cornell Movie-Dialogs Corpus contains 220,579 “conversational exchanges” between 9,035 characters in 617 movies. Included: “Hello. My name is Inigo Montoya. You killed my father. Prepare to die.”
2016.02.101Powering America.Every year, the U.S. Energy Information Administration requires thousands of power plants to report detailed data on fuel consumption and electricity generation. The datasets stretch back more than four decades, to 1970. In 2014, the most recent year available, Arizona’s Palo Verde Nuclear Generating Station generated more electricity — 32 million megawatt hours — than any other power plant in the country. [h/t Marc DaCosta]
2016.02.102Nature-spotting.iNaturalist is a sort of social network for nature enthusiasts. Users can post photos and descriptions of birds, fish, bugs, and even mold, which experts can then help to identify. In November, the site recorded its two-millionth observation. You can explore the data via API or, with a free account, use the site’s export tool. [h/t Dan Brady]
2016.02.103Organ transplants.The Organ Procurement and Transplantation Network, a public-private partnership, keeps records of organ donations, transplants, and waiting lists in the United States. The website’s “advanced” data tool lets you generate fairly detailed custom reports. One hitch: The site doesn’t provide an option to download the data. Data Is Plural wrote a small bit of software to fix that.
2016.02.104More political ads.The Internet Archive’s Political TV Ad Archive uses audio fingerprinting to identify the campaign ads playing in key primary states. You can search the database, watch the ads, and download the data. The data file contains information about each ad’s sponsor, pro/con-ness, TV network, and time of airing. Previously: Political Ad Sleuth, featured Jan. 20.
2016.02.105One million songs.The Million Song Dataset contains metadata and “feature analysis” (e.g., loudness, tempo, and “danceability”) for, you guessed it, one thousand-thousand songs. The full dataset occupies hundreds of gigabytes, but you can also download a 1% sample. [h/t Neal Lathia]
2016.02.171The kids are alright.Every two years since 1991, the CDC has conducted the Youth Risk Behavior Survey, which asks high school students questions about drug use, sex, eating habits, and more. The results are available at the national, state, and district level. Results from the 2015 survey will be published in June, the CDC says. Related: Today’s teens _______ less than you did.
2016.02.172Word-emotion associations.Computational linguists at Canada’s National Research Council used Mechanical Turk to crowdsource the emotional associations of 14,182 words. For each word, participants were asked whether it was “positive” and/or “negative”, and whether it was associated with any of eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The resulting Word-Emotion Association Lexicon was first published in 2010. Of the full lexicon, only two words — “treat” and “feeling” — were associated with all eight emotions. [h/t Bipul Mohanto]
2016.02.173The United States of Land.In 2011, agriculture occupied about 22% of all land in the contiguous U.S., according to the National Land Cover Database. The NLCD classifies every 30-meter-by-30-meter chunk of land into one of 16 categories, including “woody wetlands,” “cultivated crops,” and “developed” land, at different intensities. (Alaska’s unique landscape has earned it a few additional categories, such as “dwarf scrub.”) The database is presented as raster files, so you’ll need some geospatial software to dig in. [h/t Ryan McNeill]
2016.02.174Hundreds of thousands of chess games.Portable Game Notation, a file format used to describe chess matches, was invented in 1993. Since then, enthusiasts have created PGN files for virtually all top players’ games and every high-level tournament at sites such as PGN Mentor and Chess DB. [h/t Seth Kadish]
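PGN files are plain text: each game opens with bracketed tag pairs such as `[Result "0-1"]`, followed by the moves. The Python sketch below shows one way to split a file into games and pull out the tags. It is a simplified reader, not a full PGN parser: it ignores escaped quotes, comments, and variations, and the sample games are made up.

```python
import re

# Matches a tag pair line such as: [White "Player A"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def read_games(text):
    """Return a list of (tags, movetext) tuples, one per game."""
    games = []
    tags, moves = {}, []
    for line in text.splitlines():
        match = TAG_RE.match(line)
        if match:
            if moves:
                # A tag line that follows movetext starts a new game.
                games.append((tags, " ".join(moves)))
                tags, moves = {}, []
            tags[match.group(1)] = match.group(2)
        elif line.strip():
            moves.append(line.strip())
    if tags or moves:
        games.append((tags, " ".join(moves)))
    return games


# Hypothetical two-game sample for illustration.
sample = '''[Event "Example"]
[White "Player A"]
[Black "Player B"]
[Result "0-1"]

1. f3 e5 2. g4 Qh4# 0-1

[Event "Example 2"]
[Result "1/2-1/2"]

1. e4 e5 1/2-1/2
'''

games = read_games(sample)
for tags, movetext in games:
    print(tags.get("Event"), tags.get("Result"), movetext)
```

For serious analysis, a dedicated chess library is likely a better fit; the sketch is only meant to show how little structure the format requires.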
2016.02.175A quarter-million bugs.For 18 years, a trap on the roof of the University of Copenhagen’s Zoological Museum lured moths, butterflies, and beetles to their early deaths. Researchers at the university counted and identified more than 250,000 specimens from 1,500+ species. The most common: Yponomeuta evonymella, a moth species also known as the bird-cherry ermine, which got trapped nearly 40,000 times.
2016.02.241Supremely useful data.The Supreme Court Database is exactly what it sounds like — and definitively so. The most recent release covers all SCOTUS cases from 1946 through 2014. For each case, the database contains 247 “pieces of information,” including the source of the case, why the court agreed to hear the case, the legal provisions at play, and how each justice voted.
2016.02.242Armed conflict.The Uppsala Conflict Data Program maintains several large, interconnected datasets describing decades of war, genocide, and other armed hostilities. Looking for a slightly less depressing experience? Try the UCDP’s dataset of 216 peace agreements signed between 1975 and 2011. [h/t Tony Gray]
2016.02.243Nuclear capabilities.The Nuclear Latency Dataset contains “all known uranium enrichment and plutonium reprocessing facilities” built between 1939 and 2012. That amounts to 253 plants around the world, each with information on its construction timeframe, civilian-vs-military purpose, international oversight, and more. [h/t Abraham Epton]
2016.02.244Cruise ship inspections.The CDC publishes a searchable database of its cruise ship sanitation inspections — but doesn’t provide an option to download the data. Last week, an open-data enthusiast scraped the database and posted CSVs of specific deficiencies and overall inspection scores since 1990. The lowest score: The Nippon Maru’s 38 points (out of 100) in 1998. Related: ProPublica’s “Cruise Control,” a searchable database of health and safety reports. [h/t Mike Stucka + Lena Groeger]
2016.02.245Funny ha ha.Since 1999, Jester has been telling jokes. The website, built by UC Berkeley’s Laboratory for Automation Science and Engineering, asks you to rate its sometimes-humorous offerings, and then uses those answers to guess which of the remaining 100+ jokes you’ll like best. The UC Berkeley team behind the project has released millions of joke ratings from more than 100,000 anonymous users. [h/t Alex Gude]
2016.03.021American infrastructure.Last week, the Department of Homeland Security published more than 250 infrastructure-related datasets, which had previously been marked as “For Official Use Only.” The release covers a wide range of topics, including datasets on educational facilities, hurricane evacuation routes, poultry slaughterhouses, and sports venues. (According to that dataset, the Indianapolis Motor Speedway holds more people than any other major sports venue, with a listed capacity of 257,325.) [h/t Michael Keller]
2016.03.022British diets.The UK government has published data on 27 years of food consumption. The National Food Survey datasets are based on “food diaries” recorded by a sample of British families from 1974 to 2000. In addition to tracking food consumption, the data contains details about each household, including whether they kept vegetarian, had a pregnancy, and/or owned a microwave. [h/t Hannah Brooks + Sebastian Gutierrez]
2016.03.023Bills, bills, bills.Congress has finally begun publishing official bulk data on the status of its bills — something open-government advocates had been requesting for more than a decade. The bulk downloads include an XML file for each piece of legislation, with indicators tracking (among other things) committee referrals and actions. Nostalgia: I’m Just A Bill. [h/t Derek Willis]
2016.03.024Provincial populations.National population data is easy to find. But it’s much harder to find reliable, standardized population figures for finer-grained geographies. To that end, the World Bank has launched a pilot of its Subnational Population Database, which calculates estimates for 75 countries’ major provinces/states/regions.