Bountiful Harvest: Curation, Collection, and Use of Web Archives
Instructors:
Lori Donovan
Maria Praetzellis
Internet Archive
ARA 2017
August 29, 2017, Manchester, England
AGENDA
J. Howard Miller, We Can Do It! (1942)
National Museum of American History
INTRODUCTIONS
Name, organization, and experience and/or interest in web archiving
THE WAYBACK MACHINE
Online: https://archive.org/web/
The largest publicly available web archive in existence.
> 300 Billion Web pages
> 100 million websites
> 150 languages
~ 1 billion URLs added per week
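For a quick sense of what is in the Wayback Machine, its public Availability API can be queried for the closest archived snapshot of any URL. A minimal sketch in Python (endpoint and response shape as publicly documented; error handling omitted):

```python
import requests

# Query the Wayback Machine Availability API for the snapshot
# closest to a given timestamp (YYYYMMDDhhmmss).
resp = requests.get("https://archive.org/wayback/available",
                    params={"url": "example.com", "timestamp": "20170829"})
snapshot = resp.json().get("archived_snapshots", {}).get("closest")
if snapshot:
    print(snapshot["url"], snapshot["timestamp"])  # a web.archive.org replay URL
```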
WEB ARCHIVE COMMUNITY
@KristaOldham 6 May 2015
WEB ARCHIVING
What is a web archive?
A collection of archived URLs grouped by theme, event, subject area, or web address.
A web archive captures as much of the original resource as possible and documents its change over time. The priority is to recreate the same experience a user would have had if they had visited the live site on the day it was archived.
THE LIFESPAN OF A WEBSITE
How long does a website last?
In general, a typical web page can be expected to last ~90-100 days before changing, moving, or disappearing completely.
WHY WEB ARCHIVE?
U.S. WEB ARCHIVING STATISTICS
National Digital Stewardship Alliance (NDSA) 2016 Survey (PDF)
USE CASES
Create a thematic/topical web archive on a specific subject or event
> Often related to traditional collecting activity around the same topical focus
> Capture spontaneous events
> Document different perspectives and social commentaries
Fulfill a mandate to capture/preserve evolving web history
> Construct a historical record of an institution or individual’s web/social media presence
> Support an electronic records system to meet records retention requirements
> Collect publications/documents that are no longer in print form
Closure crawls
> Document a public institution’s presence on the web before it changes or closes
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Content includes:
> Full-text archives of official documents
> Original video and audio recordings of key regional leaders
> Thousands of annual and "state of the nation" reports
> Collections on Latin American elections and political parties
Use Case:
Archive government documents from 18 different countries in Latin America
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Honduras presidential website, 2008 (before coup)
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Honduras presidential website, 2009 (during coup)
UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE
Honduras presidential website, 2010 (after coup)
BODLEIAN LIBRARY
Use Cases:
> Archive Oxford University’s web presence
> Collect focused thematic/topical subjects designated by librarians, university staff, students, etc.
BODLEIAN LIBRARY
OXFORD UNIVERSITY
BODLEIAN LIBRARY
INTERNATIONAL
BODLEIAN LIBRARY
SCIENCE, MEDICINE AND TECHNOLOGY
LIBRARY AND ARCHIVES CANADA
Content includes:
> Canadian government websites, hosted locally
> Collaborative collections, curated with other Canadian organizations
Use Case:
Archive Canadian government web content, and other content important to Canadian history
LIBRARY AND ARCHIVES CANADA
TRUTH AND RECONCILIATION COMMISSION (TRC)
LIBRARY AND ARCHIVES CANADA
GOVERNMENT OF CANADA WEB ARCHIVE
QUESTIONS?
STARTING A COLLECTION
Collection: A group of URLs curated around a common theme, topic, or domain.
Ask Yourself…
> What is the topic of this collection?
> Which websites should I archive as part of this collection?
New York Public Library
COLLECTIONS START WITH SEEDS
Seed: The starting point URL for the crawler. The crawler will follow linked pages from the seed URL and archive them if they are “in scope.”
Document: Any file with a distinct URL: HTML, image, PDF, video, etc.
University of Kentucky
CRAWLERS AND SPIDERS AND ROBOTS, OH MY!
Crawlers are pieces of software that visit websites and index the information included therein.
Archive-It crawls the web and archives copies of the information and files displayed on target websites.
Département évangélique français d’action apostolique (Défap)
HOW THE CRAWLER WORKS
1. Starts with seed URL(s)
2. Checks if those URLs are reachable, and archives them
3. Looks for embedded content – what does it need to render the page? CSS, Javascript, Images, etc...
4. Looks for links to other pages
5. Checks if those pages are “in scope” and archives them
The crawler will continue until it cannot locate any more links that are in scope or it hits a limit set for the crawl (time, data, or document limits).
Département évangélique français d’action apostolique (Défap)
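As an illustration of those five steps, here is a heavily simplified, hypothetical crawler loop in Python. It is a sketch only: real crawlers such as Heritrix add politeness delays, deduplication, embedded-resource fetching, and WARC writing, and the is_in_scope rule below is an assumption made for the example.

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def is_in_scope(url, seed):
    # Toy scope rule: stay within the seed's prefix (real scoping is richer).
    return url.startswith(seed)

def crawl(seed, doc_limit=100):
    queue, seen = deque([seed]), {seed}     # 1. start with the seed URL
    while queue and len(seen) <= doc_limit:  # stop at a document limit
        url = queue.popleft()
        resp = requests.get(url, timeout=10)  # 2. check reachability, fetch/archive
        if "html" not in resp.headers.get("Content-Type", ""):
            continue  # 3. non-HTML documents have no links to extract
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):           # 4. find links
            link = urljoin(url, a["href"])
            if is_in_scope(link, seed) and link not in seen:  # 5. scope check
                seen.add(link)
                queue.append(link)

crawl("https://example.com/")
```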
CRAWL SCOPE
How does the crawler know which links to archive and which to ignore?
> The seeds you add to your collection determine the “scope” of your crawls.
> How you format your seed URLs can have an impact on the “scope” of your crawl.
ARCHIVE-IT CRAWLING SCOPE
Seed URLs can limit the crawl to a single directory of a site.
> Example: www.archive.org/about/
> A / at the end of your URL can have a big effect on scope
> Parts of the site not included in your seed directory will NOT be archived
Example seed: www.archive.org/about/
> Link: www.archive.org/webarchive.html is NOT in scope
Example seed: www.archive.org/about
> Link: www.archive.org/webarchive.html is in scope
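A toy illustration of why that trailing slash matters: treating a seed ending in / as a directory prefix restricts the crawl, while a bare path scopes to the whole host. This mimics the behavior described above; Archive-It's actual scoping rules are more sophisticated.

```python
def in_scope(link, seed):
    # If the seed ends with "/", treat it as a directory and require the
    # link to live inside it; otherwise scope to the whole host.
    if seed.endswith("/"):
        return link.startswith(seed)
    host = seed.split("/")[2]  # e.g. "www.archive.org"
    return link.split("/")[2] == host

print(in_scope("http://www.archive.org/webarchive.html",
               "http://www.archive.org/about/"))  # False: outside the directory
print(in_scope("http://www.archive.org/webarchive.html",
               "http://www.archive.org/about"))   # True: same host
```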
ROBOTS.TXT BLOCKS
By default, our crawler respects all robots.txt files. Partners can check post-crawl reports for blocked seeds, hosts, or documents.
If your site is blocked...
> Contact the site owner and ask if they will unblock your crawler specifically – archive.org_bot.
> Some institutions choose to use a tool to ignore robots.txt blocks in specific cases
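You can check what a site's robots.txt allows for a given user agent with Python's standard library; a minimal sketch (the site URL here is just an example):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file
# Would the Internet Archive's crawler (user agent "archive.org_bot") be allowed?
print(rp.can_fetch("archive.org_bot", "https://www.example.com/some/page.html"))
```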
WEB ARCHIVING LIFE CYCLE
https://archive-it.org/static/files/archiveit_life_cycle_model.pdf
QUESTIONS TO ASK
Vision and Objectives:
Resources and workflow:
Access/Use/Reuse:
MORE QUESTIONS
Appraisal and Selection:
Scoping and Data Capture:
PART II AGENDA
TOOLS & STANDARDS
CHALLENGES
NEW TECHNOLOGIES
RESEARCH & ACCESS
WEB ARCHIVING TOOLS AT INTERNET ARCHIVE
Heritrix: Web crawler – crawls and captures web pages
Umbra: Assists the crawler to access social media and other sites in the same way a browser would
Wayback: Access tool for rendering and viewing pages – surf the web as it was
ElasticSearch & SOLR: Full-text search indexing engine & metadata search software
Brozzler: Browser + crawler = Brozzler!
WARC (Web ARChive) format: The ISO standard for storing web archives
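To show what WARC looks like in practice, here is a small sketch using the open-source warcio library to write a single response record. The payload and URL are stand-ins, not a real capture:

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders("200 OK",
                                    [("Content-Type", "text/html")],
                                    protocol="HTTP/1.1")
    # A WARC "response" record: the URL plus the HTTP headers and body as captured
    record = writer.create_warc_record("http://example.com/", "response",
                                       payload=BytesIO(b"<html>hello</html>"),
                                       http_headers=http_headers)
    writer.write_record(record)
```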
CHALLENGES: CONTENT
> Javascript – some implementations can be difficult to capture and display.
> Videos – can capture most videos, but some proprietary formats can be difficult.
> Social Media – always improving tools for archiving Facebook, Twitter, Flickr, Instagram, and more.
CHALLENGES: CONTENT
What happens when content is behind a form, search box, or password?
> Alternatives may include a sitemap or direct links to content
> Password-protected sites – new feature currently in Beta
Minnesota Historical Society
ACCESS: WHAT MAKES A SITE ARCHIVABLE
> Make links transparent
> Be careful with robots directives
> Return reliable response codes
Many more suggestions: https://library.stanford.edu/projects/web-archiving/archivability
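One rough way to gauge “link transparency” on your own site is to compare plain a@href links against javascript: or fragment-only links, which crawlers often cannot follow. A hypothetical check, not a substitute for the full archivability guidelines above:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com/").text
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
opaque = [h for h in links if h.startswith(("javascript:", "#"))]
print(f"{len(links)} links total, {len(opaque)} not crawler-followable")
```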
CHALLENGES: CONTENT
Chart: Types of content provoking concern over capacity to archive. From Web Archiving in the United States: A 2016 Survey, a report from the National Digital Stewardship Alliance.
ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE
Heritrix (2006 - 2014) → Heritrix + Umbra (2014 - Present) → Brozzler (Future?)
Heritrix:
> Traditional web crawler
> Scoping, capture, deduplication, and WARC creation in one process
> Less adept at triggering and capturing client-side scripts and Javascript
ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE
Heritrix (2006 - 2014) → Heritrix + Umbra (2014 - Present) → Brozzler (Future?)
Heritrix + Umbra:
> Runs alongside Heritrix
> Mimics the way a browser would access a page
> Executes client-side scripts so that previously undetectable URLs can be accessed by Heritrix
> Clicks or hovers to execute Javascript
> Allows for dynamic scrolling
ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE
Heritrix (2006 - 2014) → Heritrix + Umbra (2014 - Present) → Brozzler (Future?)
Brozzler:
> Captures HTTP traffic as it is loaded
> Uses a real browser to fetch pages and embedded URLs, and to extract links
> Works with youtube-dl to improve media capture
BROZZLER
“browser” + “crawler” = BROZZLER
> Runs on an instance of the Chromium browser
> Opens the page in the browser, takes a screenshot, and sends the traffic to warcprox, where it is written as a WARC file
> Runs Javascript behaviors and finds a@href outlinks
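To make the browser-based approach concrete, here is a minimal sketch of loading a page in a real browser, taking a screenshot, and extracting a@href outlinks. It uses Selenium for illustration only; this is not Brozzler's actual code, and it assumes Chrome/Chromium and chromedriver are installed:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run without a visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/")   # load the page in a real browser
driver.save_screenshot("page.png")   # capture a screenshot of the rendered page
# Collect a@href outlinks after client-side scripts have executed
outlinks = {a.get_attribute("href")
            for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")}
driver.quit()
print(sorted(outlinks))
```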
WEB ARCHIVING TOOLS & SERVICES
For a full list: http://netpreserve.org/web-archiving/tools-and-software/
WEB ARCHIVING TOOLS: HTTRACK & WGET
WGET
“Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.”
Uses Heritrix 3.2.0 and OpenWayback 2.3.0
Developed by Mat Kelly at the Web Science and Digital Libraries research group at Old Dominion University
WEB ARCHIVING TOOLS: WAIL
WEB ARCHIVING TOOLS: WEBRECORDER
SOCIAL MEDIA WEB ARCHIVING TOOLS & SERVICES
ACCESS: OLDWEB.TODAY
ACCESS: MEMENTO/TIME TRAVEL SERVICE
A Chrome plug-in allows you to navigate between the present web and the web of the past
archive.today, Archive-It, Arquivo.pt: the Portuguese Web Archive, Bibliotheca Alexandrina Web Archive, DBpedia archive, DBpedia Triple Pattern Fragments archive, Canadian Government Web Archive, Croatian Web Archive, Estonian Web Archive, Icelandic web archive, Internet Archive, Library of Congress Web Archive, NARA Web Archive, National Library of Ireland Web Archive, perma.cc, PRONI Web Archive, Slovenian Web Archive, Stanford Web Archive, UK Government Web Archive, UK Parliament's Web Archive, UK Web Archive, Web Archive Singapore, WebCite, Bayerische Staatsbibliothek
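The Memento protocol (RFC 7089) behind the Time Travel service is ordinary HTTP content negotiation in the datetime dimension: send an Accept-Datetime header to a TimeGate and it redirects to the closest memento across the archives listed above. A minimal sketch against the public Time Travel aggregator (endpoint as publicly documented):

```python
import requests

# Ask the Time Travel TimeGate for the memento of a URL closest to a date.
resp = requests.get(
    "http://timetravel.mementoweb.org/timegate/http://example.com/",
    headers={"Accept-Datetime": "Tue, 29 Aug 2017 00:00:00 GMT"},
    allow_redirects=False)
print(resp.status_code)               # typically 302
print(resp.headers.get("Location"))  # URL of the closest archived memento
```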
RESEARCH SERVICES & WEB ARCHIVE DATASETS
TYPOLOGY OF RESEARCHER INTERESTS
RESEARCH SERVICES & WEB ARCHIVE DATASETS
Web Archive Datasets
RESEARCH SERVICES & WEB ARCHIVE DATASETS
LESSONS LEARNED SUPPORTING RESEARCHERS
RESEARCH SERVICES & WEB ARCHIVE DATASETS
STRATEGIC APPROACHES TO SUPPORTING RESEARCHERS
RECAP AND REVIEW
READING LIST
Davis, Corey. "Archiving the Web: A Case Study from the University of Victoria." Code4Lib Journal, Issue 26 (2014). http://journal.code4lib.org/articles/10015
Keeping Collections: More Podcast Less Process, Episode 007: "The Web Archivists Are Present." http://keepingcollections.org/more-podcast-less-process-episode-007/
D-Lib Magazine, Special Issue on Web Archives, March/April 2012. http://www.dlib.org/dlib/march12/03contents.html
Web Archiving in the United States: A 2016 Survey. National Digital Stewardship Alliance (NDSA), 2017. http://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf
Summers, Ed. "The Web as a Preservation Medium." inkdroid, 2013. http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/
Pennock, Maureen. Web-Archiving (DPC Technology Watch Report 13-01). Digital Preservation Coalition, March 2013. http://dx.doi.org/10.7207/twr13-01
Taylor, Nicholas. "Anatomy of a Web Archive." The Signal: Digital Preservation, Library of Congress, 2013. http://blogs.loc.gov/digitalpreservation/2013/11/anatomy-of-a-web-archive/
ADDITIONAL RESOURCES: support.archive-it.org
Thank you!
Lori Donovan, Senior Program Manager, Web Archiving | lori@archive.org
Maria Praetzellis, Program Manager, Web Archiving | maria@archive.org
Internet Archive & Archive-It | @internetarchive & @archiveitorg
ARCHIVE-IT DEMO
Login details: