Data Rescue Project
Updated 2025-02-20
These suggestions come from various sources, including IASSIST, RDAP, Data Curation Network, BlueSky, LinkedIn, and others.
WE HAVE A WEBSITE!
About Data Rescue Project
The Data Rescue Tracker: https://www.datarescueproject.org/data-rescue-tracker/
We will be phasing out the use of this Google Doc (it has been fun). Updates will be on our website moving forward. Feel free to subscribe to get those posts.
For more current information on Current Efforts go to: https://www.datarescueproject.org/current-efforts/
WE ARE NO LONGER USING THIS GOOGLE DOC.
The information below is no longer updated.
PLEASE REFER TO OUR WEBSITE FOR CURRENT INFORMATION.
Ways to get involved:
_________________________________________
Larger and Established Data / Website Efforts
- The main coordinated effort to archive websites
- Datasets have been more of a challenge, especially data embedded in databases.
- #SafeguardingResearch is in contact with them to mirror data on servers outside US jurisdiction
- Overview of ICPSR's data rescue activities to date:
- Downloaded ~2,800 files from various sources requested by researchers; all the files ICPSR collected will soon be available via a Dropbox link.
- Examining CDC data dump from archive.org to assess what might be missing.
- Ideally will also be a resource for those looking for data to see what is/isn’t available.
- ICPSR staff and allies are generating metadata for each of the datasets we have so that we can make them available through an existing archive at ICPSR (DataLumos, openICPSR, or the Resource Center for Minority Data, depending on our timeline and some technical issues we’re working out)
- ICPSR DataLumos - They have older versions of a lot of major data, including a recent addition from the CDC.
- They have data and have been working on cataloging efforts
- Notification went out yesterday that they will share more soon.
- Generalist repository available to help with data publication, storage, and preservation.
- Joint initiative of the Sabin Center for Climate Change Law and the Climate Science Legal Defense Fund.
- Tracks government attempts to restrict or prohibit scientific research, education or discussion, or the publication or use of scientific information.
- Generalist repository for archiving, sharing, and storing all types of research outputs, not limited to preprints or only data.
- OSF is available as an option for pre-prints of articles if, for some reason, they cannot be posted on official sources.
- Many universities also have institutional repositories where research (articles, data, dissertations, etc.) from that institution can be posted. They also have preservation mandates. An example is Penn’s ScholarlyCommons.
- Has NOAA data pulled during the 2017 data rescue.
- A volunteer has pointed out that “key equity data” is missing from the Dept of Energy. Says they were able to find it on this site. Includes additional data from DOE.
- Note that OEDI is run by NREL, which is part of DOE; it is not an independent organization. Several energy efficiency consultants have received tips that NREL has been ordered to start taking datasets offline. It is not clear whether OEDI data will be removed.
- Roper Center has collected over 50,000 files (datasets and documentation) from 22 federal survey projects. Efforts to this point have been focused on acquiring the files and ensuring backup copies are preserved on multiple servers.
- The information system PANGAEA is operated as an Open Access library aimed at archiving, publishing and distributing georeferenced data from earth system research. PANGAEA is ready to rescue data - just contact us.
Data Rescue Events
- Healthy Regions Policy Lab at UIUC
Smaller/Ad Hoc Rescue Efforts / Data Archiving Activists
- Mirrored and archived public data on a locally hosted Git server
- Includes retrieved data sets from CDC, NIH, and NOAA
- A special archive created on the Internet Archive (IA) of all CDC datasets publicly available as of January 28, 2025
- Uploaded by DataHoarders (we think)
- Includes CDC's Social Vulnerability Index data.
- Most of what's being placed here is data focusing on health and the environment.
- DataRefuge (from 2017) is open for deposits of individual datasets or collections (it’s a subcollection within CAFE)
- Please note: if you need support for TBs of data, reach out to sbarbosa@g.harvard.edu
- NOAA (fisheries) is working with Harvard Dataverse to upload over 4TB of data and likely support the efforts via CAFE
- Based in the EU, USA, and globally. Update: got access to 1-2 PB of storage (and more on the way) and people willing to seed
- Currently, we’ve got around 1TB of data backed up
- Including >100,000 PDFs from academia.edu (“transgender”, “Queer Studies”, “intersex”, “nonbinary” etc. - see the forum for the full list)
- 350GB web archive of CDC, including all 30,000 files from archive.cdc.gov, and much more
- “We're working on providing a central index of archives, with metadata about who archived what, when, to be disseminated widely alongside torrent files and act as both a central point of coordination for archivers to assess what new work is needed, and a mass distribution channel.”
- Possible contact at CERN; will update ASAP
- A Reddit community that is coordinating efforts to rescue data.
- Index of resources and archives related to data hoarding, web archival, and self-hosting.
- They run a distributed crawler. Anyone can install it to help contribute.
- US Federal Data page
- Data is uploaded to Archive.org by volunteers
- Note: It looks like the project may have stalled in September 2024. Send info if you know more about them.
- Run by BigLocalNews and MuckRock, which are good groups to follow.
Tools for Data Rescues
- Provides key insights for curating data and the types of questions that need to be asked.
- Checklist to assist with curating data rescue efforts.
- According to an email: has archived 8TB+ of government sites, some from the End-of-Term-Archive seed list, some from EDGI Slack requests, and many sites independently
- According to an email: has also archived government datasets from data.gov, CIBP, USCIS, NOAA, NASA, NSIDC, and more
- Provides a list of tools for web harvesting, etc.
- Another curated list of web archiving tools
- This is the workflow from the original data rescue/DataRefuge project in 2017.
- Many of the tools are no longer working, but the workflow is still useful. UW used this to create their workflow above.
- The challenge with the original project was where to store the large amounts of data captured and how to make them discoverable.
- Part of this effort is also housed in the Harvard Dataverse Repository and can be opened for more data deposits
- There is a CKAN instance with some of the 2017 data.
- Tool created by Jerome Paulos to show side-by-side changes in government websites.
- This is a Reddit post, but it lists instructions for how to archive and the tools needed to contribute, so it is categorized here.
Existing Alternative Data Sources
Thanks to Brianne Dosch for suggesting the section and some of the bullets.
- PolicyMap – offers a free tier that can be used to view basic information down to the tract level, but more detailed data and functionality require a subscription; available at some universities
- GEM - contains 46 demographic, energy, health, and housing indicators down to the tract level, with free access for community-based organizations; others require a subscription.
- FRED - They have some demographic data as well; free and open source
- Census Reporter – a free, open-source platform focused on making American Community Survey (ACS) data more accessible, including the recent upload of the 2022 1-Year ACS data
- Esri – for mapping users, the GIS vendor publishes several U.S. Census Bureau data sets, including the ACS, through its ArcGIS Online Platform
- IPUMS – Even when the government operates normally, many analysts turn to Minnesota Population Center products to access ACS, Current Population Survey microdata, and Decennial Census data
- Social Explorer - historical Census data and more; available at some universities
- SimplyAnalytics - has internally processed American Community Surveys; available at some universities
- American College of Obstetricians and Gynecologists - Hosting copies of immunization schedules and contraceptive use guidance from the CDC
Economic Indicators
- This tool aggregated data from many sources. It still seems able to categorize disadvantaged communities (by environmental and economic standards), as well as other critical data designations that are increasingly hard to access
- This resource specifically provides data on work, housing, and community resources for households below the ALICE threshold (Asset Limited, Income Constrained, Employed). The data is provided by the U.S. Census Bureau’s Public Use Microdata Sample (PUMS, 202!)
- A data and policy tool that provides a detailed report card on racial and economic equity. It can provide a holistic Racial Equity Index snapshot of communities. The Atlas draws its data from a unique regional equity indicators database developed and maintained by two private institutions: PolicyLink and the USC Equity Research Institute (ERI).
- Provides researchers, media, and the public with easily accessible, up-to-date, and comprehensive historical data on the American labor force. It is compiled from Economic Policy Institute analysis of government data sources. Use it to research wages, inequality, and other economic indicators over time and among demographic groups.
Public Health
- County Health Rankings & Roadmaps (CHR&R)
- A program of the University of Wisconsin’s Population Health Institute, this data tool aims to highlight the symbiotic nature of health and equity by factoring physical environment, social and economic indicators, clinical care, and health behaviors into health outcomes.
- From NYU Langone Health, this platform provides 40+ measures of health and factors affecting health across five areas (Health Behaviors, Social and Economic Factors, Physical Environment, Health Outcomes, and Clinical Care) for 970+ cities across the U.S.
Library Guides to Data Rescues
Articles on current efforts
Articles for context