Bountiful Harvest: Curation, Collection, and Use of Web Archives
Instructors:
Lori Donovan
Maria Praetzellis
Internet Archive
ARA 2017
August 29, 2017, Manchester, England
AGENDA
J. Howard Miller, We Can Do It! (1942)
National Museum of American History
INTRODUCTIONS
Name, organization, and experience and/or interest in web archiving
THE WAYBACK MACHINE
Online: https://archive.org/web/
The largest publicly available web archive in existence.
> 300 Billion Web pages
> 100 million websites
> 150 languages
~ 1 billion URLs added per week
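For a quick sense of what is in the Wayback Machine, its public Availability API can be queried for the closest archived snapshot of any URL. A minimal sketch in Python (endpoint and response shape as publicly documented; error handling omitted):

```python
import requests

# Query the Wayback Machine Availability API for the snapshot
# closest to a given timestamp (YYYYMMDDhhmmss).
resp = requests.get("https://archive.org/wayback/available",
                    params={"url": "example.com", "timestamp": "20170829"})
snapshot = resp.json().get("archived_snapshots", {}).get("closest")
if snapshot:
    print(snapshot["url"], snapshot["timestamp"])  # a web.archive.org replay URL
```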
WEB ARCHIVE COMMUNITY
@KristaOldham 6 May 2015
WEB ARCHIVING
What is a web archive?
A collection of archived URLs grouped by theme, event, subject area, or web address.
A web archive captures as much of the original resource as possible and documents its change over time. The priority is to recreate the same experience a user would have had if they had visited the live site on the day it was archived.
THE LIFESPAN OF A WEBSITE
How long does a website last?
In general, a typical web page can be expected to last ~90-100 days before changing, moving, or disappearing completely.
WHY WEB ARCHIVE?
U.S. WEB ARCHIVING STATISTICS
National Digital Stewardship Alliance (NDSA) 2016 Survey (PDF)
USE CASES
Create a thematic/topical web archive on a specific subject or event
> Often related to traditional collecting activity around the same topical focus
> Capture spontaneous events
> Document different perspectives and social commentaries
Fulfill a mandate to capture/preserve evolving web history
> Construct a historical record of an institution or individual’s web/social media presence
> Support an electronic records system to meet records retention requirements
> Collect publications/documents that are no longer in print form
Closure crawls
> Document a public institution’s presence on the web before it changes or closes
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Content includes:
> Full-text archives of official documents
> Original video and audio recordings of key regional leaders
> Thousands of annual and "state of the nation" reports
> Collections on Latin American elections and political parties
Use Case:
Archive government documents from 18 different countries in Latin America
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Honduras presidential website, 2008 (before coup)
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Honduras presidential website, 2009 (during coup)
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Honduras presidential website, 2010 (after coup)
BODLEIAN LIBRARY
Use Cases:
> Archive Oxford University’s web presence
> Collect focused thematic/topical subjects designated by librarians, university staff, students, etc.
BODLEIAN LIBRARY
OXFORD UNIVERSITY
BODLEIAN LIBRARY
INTERNATIONAL
BODLEIAN LIBRARY
SCIENCE, MEDICINE AND TECHNOLOGY
LIBRARY AND ARCHIVES CANADA
Content includes:
> Canadian government websites, hosted locally
> Collaborative collections, curated with other Canadian organizations
Use Case:
Archive Canadian government web content, and other content important to Canadian history
LIBRARY AND ARCHIVES CANADA
TRUTH AND RECONCILIATION COMMISSION (TRC)
LIBRARY AND ARCHIVES CANADA
GOVERNMENT OF CANADA WEB ARCHIVE
QUESTIONS?
STARTING A COLLECTION
Collection: A group of URLs curated around a common theme, topic, or domain.
Ask Yourself…
> What is the topic of this collection?
> Which websites should I archive as part of this collection?
New York Public Library
COLLECTIONS START WITH SEEDS
Seed: The starting point URL for the crawler. The crawler will follow linked pages from the seed URL and archive them if they are “in scope.”
Document: Any file with a distinct URL: HTML, image, PDF, video, etc.
University of Kentucky
CRAWLERS AND SPIDERS AND ROBOTS, OH MY!
Crawlers are pieces of software that visit websites and index the information included therein.
Archive-It crawls the web and archives copies of the information and files displayed on target websites.
Département évangélique français d’action apostolique (Défap)
HOW THE CRAWLER WORKS
1. Starts with seed URL(s)
2. Checks if those URLs are reachable, and archives them
3. Looks for embedded content – what does it need to render the page? CSS, Javascript, Images, etc...
4. Looks for links to other pages
5. Checks if those pages are “in scope” and archives them
The crawler will continue until it cannot locate any more links that are in scope or it hits a limit set for the crawl (time, data, or document limits).
Département évangélique français d’action apostolique (Défap)
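As an illustration of those five steps, here is a heavily simplified, hypothetical crawler loop in Python. It is a sketch only: real crawlers such as Heritrix add politeness delays, deduplication, embedded-resource fetching, and WARC writing, and the is_in_scope rule below is an assumption made for the example.

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def is_in_scope(url, seed):
    # Toy scope rule: stay within the seed's prefix (real scoping is richer).
    return url.startswith(seed)

def crawl(seed, doc_limit=100):
    queue, seen = deque([seed]), {seed}     # 1. start with the seed URL
    while queue and len(seen) <= doc_limit:  # stop at a document limit
        url = queue.popleft()
        resp = requests.get(url, timeout=10)  # 2. check reachability, fetch/archive
        if "html" not in resp.headers.get("Content-Type", ""):
            continue  # 3. non-HTML documents have no links to extract
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):           # 4. find links
            link = urljoin(url, a["href"])
            if is_in_scope(link, seed) and link not in seen:  # 5. scope check
                seen.add(link)
                queue.append(link)

crawl("https://example.com/")
```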
CRAWL SCOPE
How does the crawler know which links to archive and which to ignore?
> The seeds you add to your collection determine the “scope” of your crawls.
> How you format your seed URLs can have an impact on the “scope” of your crawl.
ARCHIVE-IT CRAWLING SCOPE
Seed URLs can limit the crawl to a single directory of a site.
> Example: www.archive.org/about/
> A / at the end of your URL can have a big effect on scope
> Parts of the site not included in your seed directory will NOT be archived
Example seed: www.archive.org/about/
> Link: www.archive.org/webarchive.html is NOT in scope
Example seed: www.archive.org/about
> Link: www.archive.org/webarchive.html is in scope
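A toy illustration of why that trailing slash matters: treating a seed ending in / as a directory prefix restricts the crawl, while a bare path scopes to the whole host. This mimics the behavior described above; Archive-It's actual scoping rules are more sophisticated.

```python
def in_scope(link, seed):
    # If the seed ends with "/", treat it as a directory and require the
    # link to live inside it; otherwise scope to the whole host.
    if seed.endswith("/"):
        return link.startswith(seed)
    host = seed.split("/")[2]  # e.g. "www.archive.org"
    return link.split("/")[2] == host

print(in_scope("http://www.archive.org/webarchive.html",
               "http://www.archive.org/about/"))  # False: outside the directory
print(in_scope("http://www.archive.org/webarchive.html",
               "http://www.archive.org/about"))   # True: same host
```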
ROBOTS.TXT BLOCKS
By default, our crawler respects all robots.txt files. Partners can check post-crawl reports for blocked seeds, hosts, or documents.
If your site is blocked...
> Contact the site owner and ask if they will unblock your crawler specifically – archive.org_bot.
> Some institutions choose to use a tool to ignore robots.txt blocks in specific cases
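You can check what a site's robots.txt allows for a given user agent with Python's standard library; a minimal sketch (the site URL here is just an example):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file
# Would the Internet Archive's crawler (user agent "archive.org_bot") be allowed?
print(rp.can_fetch("archive.org_bot", "https://www.example.com/some/page.html"))
```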
WEB ARCHIVING LIFE CYCLE
https://archive-it.org/static/files/archiveit_life_cycle_model.pdf
QUESTIONS TO ASK
Vision and Objectives:
Resources and workflow:
Access/Use/Reuse:
MORE QUESTIONS
Appraisal and Selection:
Scoping and Data Capture:
PART II AGENDA
TOOLS & STANDARDS
CHALLENGES
NEW TECHNOLOGIES
RESEARCH & ACCESS
WEB ARCHIVING TOOLS AT INTERNET ARCHIVE
Heritrix: Web crawler – crawls and captures web pages
Umbra: Assists the crawler to access social media and other sites in the same way a browser would
Wayback: Access tool for rendering and viewing pages – surf the web as it was
ElasticSearch & SOLR: Full-text search indexing engine & metadata search software
Brozzler: Browser + crawler = Brozzler!
WARC (Web ARChive) format: The ISO standard for storing web archives
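To show what WARC looks like in practice, here is a small sketch using the open-source warcio library to write a single response record. The payload and URL are stand-ins, not a real capture:

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders("200 OK",
                                    [("Content-Type", "text/html")],
                                    protocol="HTTP/1.1")
    # A WARC "response" record: the URL plus the HTTP headers and body as captured
    record = writer.create_warc_record("http://example.com/", "response",
                                       payload=BytesIO(b"<html>hello</html>"),
                                       http_headers=http_headers)
    writer.write_record(record)
```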
CHALLENGES: CONTENT
> Javascript – some implementations can be difficult to capture and display.
> Videos – can capture most videos, but some proprietary formats can be difficult.
> Social Media – always improving tools for archiving Facebook, Twitter, Flickr, Instagram, and more.
CHALLENGES: CONTENT
What happens when content is behind a form, search box, or password?
> Alternatives may include a sitemap or direct links to content
> Password-protected sites – new feature currently in Beta
Minnesota Historical Society
ACCESS: WHAT MAKES A SITE ARCHIVABLE
> Make links transparent
> Be careful with robots directives
> Return reliable response codes
Many more suggestions: https://library.stanford.edu/projects/web-archiving/archivability
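One rough way to gauge “link transparency” on your own site is to compare plain a@href links against javascript: or fragment-only links, which crawlers often cannot follow. A hypothetical check, not a substitute for the full archivability guidelines above:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com/").text
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
opaque = [h for h in links if h.startswith(("javascript:", "#"))]
print(f"{len(links)} links total, {len(opaque)} not crawler-followable")
```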
CHALLENGES: CONTENT
Chart: Types of content provoking concern over capacity to archive. From Web Archiving in the United States: A 2016 Survey, a report from the National Digital Stewardship Alliance.
ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE
Heritrix (2006 - 2014) → Heritrix + Umbra (2014 - Present) → Brozzler (Future?)
Heritrix:
> Traditional web crawler
> Scoping, capture, deduplication, and WARC creation in one process
> Less adept at triggering and capturing client-side scripts and Javascript
ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE
Heritrix (2006 - 2014) → Heritrix + Umbra (2014 - Present) → Brozzler (Future?)
Heritrix + Umbra:
> Runs alongside Heritrix
> Mimics the way a browser would access a page
> Executes client-side scripts so that previously undetectable URLs can be accessed by Heritrix
> Clicks or hovers to execute Javascript
> Allows for dynamic scrolling
ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE
Heritrix (2006 - 2014) → Heritrix + Umbra (2014 - Present) → Brozzler (Future?)
Brozzler:
> Captures HTTP traffic as it is loaded
> Uses a real browser to fetch pages and embedded URLs, and to extract links
> Works with youtube-dl to improve media capture
BROZZLER
“browser” + “crawler” = BROZZLER
> Runs on an instance of the Chromium browser
> Opens the page in the browser, takes a screenshot, and sends the traffic to warcprox, where it is written as a WARC file
> Runs Javascript behaviors and finds a@href outlinks
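To make the browser-based approach concrete, here is a minimal sketch of loading a page in a real browser, taking a screenshot, and extracting a@href outlinks. It uses Selenium for illustration only; this is not Brozzler's actual code, and it assumes Chrome/Chromium and chromedriver are installed:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run without a visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/")   # load the page in a real browser
driver.save_screenshot("page.png")   # capture a screenshot of the rendered page
# Collect a@href outlinks after client-side scripts have executed
outlinks = {a.get_attribute("href")
            for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")}
driver.quit()
print(sorted(outlinks))
```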
WEB ARCHIVING TOOLS & SERVICES
For a full list: http://netpreserve.org/web-archiving/tools-and-software/
WEB ARCHIVING TOOLS: HTTRACK & WGET
WGET
“Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.”
Uses Heritrix 3.2.0 and OpenWayback 2.3.0
Developed by Mat Kelly at the Web Science and Digital Libraries research group at Old Dominion University
WEB ARCHIVING TOOLS: WAIL
WEB ARCHIVING TOOLS: WEBRECORDER
SOCIAL MEDIA WEB ARCHIVING TOOLS & SERVICES
ACCESS: OLDWEB.TODAY
ACCESS: MEMENTO/TIME TRAVEL SERVICE
A Chrome plug-in allows you to navigate between the present web and the web of the past
archive.today, Archive-It, Arquivo.pt: the Portuguese Web Archive, Bibliotheca Alexandrina Web Archive, DBpedia archive, DBpedia Triple Pattern Fragments archive, Canadian Government Web Archive, Croatian Web Archive, Estonian Web Archive, Icelandic web archive, Internet Archive, Library of Congress Web Archive, NARA Web Archive, National Library of Ireland Web Archive, perma.cc, PRONI Web Archive, Slovenian Web Archive, Stanford Web Archive, UK Government Web Archive, UK Parliament's Web Archive, UK Web Archive, Web Archive Singapore, WebCite, Bayerische Staatsbibliothek
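The Memento protocol (RFC 7089) behind the Time Travel service is ordinary HTTP content negotiation in the datetime dimension: send an Accept-Datetime header to a TimeGate and it redirects to the closest memento across the archives listed above. A minimal sketch against the public Time Travel aggregator (endpoint as publicly documented):

```python
import requests

# Ask the Time Travel TimeGate for the memento of a URL closest to a date.
resp = requests.get(
    "http://timetravel.mementoweb.org/timegate/http://example.com/",
    headers={"Accept-Datetime": "Tue, 29 Aug 2017 00:00:00 GMT"},
    allow_redirects=False)
print(resp.status_code)               # typically 302
print(resp.headers.get("Location"))  # URL of the closest archived memento
```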
RESEARCH SERVICES & WEB ARCHIVE DATASETS
TYPOLOGY OF RESEARCHER INTERESTS
RESEARCH SERVICES & WEB ARCHIVE DATASETS
Web Archive Datasets
RESEARCH SERVICES & WEB ARCHIVE DATASETS
LESSONS LEARNED SUPPORTING RESEARCHERS
RESEARCH SERVICES & WEB ARCHIVE DATASETS
STRATEGIC APPROACHES TO SUPPORTING RESEARCHERS
RECAP AND REVIEW
READING LIST
Davis, Corey. "Archiving the Web: A Case Study from the University of Victoria." Code4Lib Journal, Issue 26 (2014). http://journal.code4lib.org/articles/10015
Keeping Collections: More Podcast Less Process, Episode 007: "The Web Archivists Are Present." http://keepingcollections.org/more-podcast-less-process-episode-007/
D-Lib Magazine, Special Issue on Web Archives, March/April 2012. http://www.dlib.org/dlib/march12/03contents.html
Web Archiving in the United States: A 2016 Survey. National Digital Stewardship Alliance (NDSA), 2017. http://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf
Summers, Ed. "The Web as a Preservation Medium." inkdroid, 2013. http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/
Pennock, Maureen. Web-Archiving (DPC Technology Watch Report 13-01). Digital Preservation Coalition, March 2013. http://dx.doi.org/10.7207/twr13-01
Taylor, Nicholas. "Anatomy of a Web Archive." The Signal: Digital Preservation, Library of Congress, 2013. http://blogs.loc.gov/digitalpreservation/2013/11/anatomy-of-a-web-archive/
ADDITIONAL RESOURCES: support.archive-it.org
Thank you!
Lori Donovan, Senior Program Manager, Web Archiving | lori@archive.org
Maria Praetzellis, Program Manager, Web Archiving | maria@archive.org
Internet Archive & Archive-It | @internetarchive & @archiveitorg
ARCHIVE-IT DEMO
Login details: