1 of 63

Bountiful Harvest: Curation, Collection, and Use of Web Archives

Instructors:

Lori Donovan

Maria Praetzellis

Internet Archive

ARA 2017

August 29, 2017 Manchester, England

2 of 63

AGENDA

J. Howard Miller, We Can Do It! (1942)

National Museum of American History

  • Definitions and Community
    • The what and why of web archives
    • Current challenges and initiatives
  • Collection
    • Selection and acquisition
    • Description and scoping
  • Manage
    • Challenges & Opportunities
    • Tools & Services
  • Use
    • Access & Research
  • Demonstration
  • Create your own collection

3 of 63

INTRODUCTIONS

Name, organization, and experience and/or interest in web archiving

4 of 63

INTRODUCTIONS

5 of 63

  • We are a non-profit Digital Library & Archive founded in 1996
  • 30+ PB of unique data: 10 PB web, ~14M texts, 3.6M videos, 3.6M audio recordings, 190K software titles, etc.
  • Developed: Open source web archiving tools, formats and standards
  • Engineers, librarians/archivists, program staff

6 of 63

THE WAYBACK MACHINE

Online: https://archive.org/web/

The largest publicly available web archive in existence.

> 300 Billion Web pages

> 100 million websites

> 150 languages

~ 1 billion URLs added per week

7 of 63

WEB ARCHIVE COMMUNITY

@KristaOldham 6 May 2015

8 of 63

WEB ARCHIVING

What is a web archive? A collection of archived URLs grouped by theme, event, subject area, or web address.

A web archive contains as much of the original resource as possible and documents its change over time. The priority is to recreate the experience a user would have had visiting the live site on the day it was archived.

9 of 63

THE LIFESPAN OF A WEBSITE

How long does a website last?

In general, a typical web page can be expected to last ~90-100 days before changing, moving, or disappearing completely.

  • Of 582 Occupy Movement websites archived in 2012, only 41% were still live on the web as of April 2014.
  • In 2013, our colleagues at Old Dominion University determined that over 10% of content posted to social media platforms is lost after one year.
  • In 2014, a UCLA study determined that seven in ten scholarly articles with hyperlinked citations suffer from reference rot.

10 of 63

WHY WEB ARCHIVE?

  • Institutional History: Maintain a record of your institution’s web presence over time.

  • Responsibility: Preserve course information, special exhibit information, policies, and organizational reports; many such documents now exist only in digital form.

  • Research: Many libraries are seen as authorities on a particular subject, topic or person, and collect web-based information to augment other holdings.

11 of 63

U.S. WEB ARCHIVING STATISTICS

National Digital Stewardship Alliance (NDSA) 2016 Survey (PDF)

  • 94% of respondents use an external web archiving service like Archive-It
  • 71% of organizations devote one-half FTE or less to web archiving
  • 60% started programs between 2011 and 2015
  • 60% rely on other organizations’ or community-generated policies in the creation of their own
  • Principal concerns include the ability to archive social media (70%), video (69%), and databases (62%)

12 of 63

USE CASES

Create a thematic/topical web archive on a specific subject or event

> Often related to traditional collecting activity around the same topical focus

> Capture spontaneous events

> Document different perspectives and social commentaries

Fulfill a mandate to capture/preserve evolving web history

> Construct a historical record of an institution or individual’s web/social media presence

> Support an electronic records system to meet records retention requirements

> Collect publications/documents that are no longer in print form

Closure crawls

> Document a public institution’s presence on the web before it changes or closes

13 of 63

UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS COLLECTION

Content includes:

> Full-text archives of official documents

> Original video and audio recordings of key regional leaders

> Thousands of annual and "state of the nation" reports

> Collections of Latin American elections and political parties

Use Case:

Archive government documents from 18 different countries in Latin America

14 of 63

UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE

Honduras presidential website, 2008 (before coup)

15 of 63

UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE

Honduras presidential website, 2009 (during coup)

16 of 63

UNIVERSITY OF TEXAS AT AUSTIN: LATIN AMERICAN GOVERNMENT DOCUMENTS ARCHIVE

Honduras presidential website, 2010 (after coup)

17 of 63

BODLEIAN LIBRARY

> Collect focused thematic/topical subjects designated by librarians, university staff, students, etc.

Use Cases:

> Archive Oxford University’s web presence

18 of 63

BODLEIAN LIBRARY

OXFORD UNIVERSITY

19 of 63

BODLEIAN LIBRARY

INTERNATIONAL

20 of 63

BODLEIAN LIBRARY

SCIENCE, MEDICINE AND TECHNOLOGY

21 of 63

LIBRARY AND ARCHIVES CANADA

Content includes:

> Canadian government websites, hosted locally

> Collaborative collections, curated with other Canadian organizations

Use Case:

Archive Canadian government web content, and other content important to Canadian history

22 of 63

LIBRARY AND ARCHIVES CANADA

TRUTH AND RECONCILIATION COMMISSION (TRC)

23 of 63

LIBRARY AND ARCHIVES CANADA

GOVERNMENT OF CANADA WEB ARCHIVE

24 of 63

QUESTIONS?

25 of 63

STARTING A COLLECTION

Collection: A group of URLs curated around a common theme, topic, or domain.

Ask Yourself…

> What is the topic of this collection?

> Which websites should I archive as part of this collection?

New York Public Library

26 of 63

COLLECTIONS START WITH SEEDS

Seed: The starting point URL for the crawler. The crawler will follow linked pages from the seed URL and archive them if they are “in scope.”

Document: Any file with a distinct URL: HTML, image, PDF, video, etc.

University of Kentucky

27 of 63

CRAWLERS AND SPIDERS AND ROBOTS, OH MY!

Crawlers are pieces of software that visit websites and index the information included therein.

Archive-It crawls the web and archives copies of the information and files displayed on target websites.

Département évangélique français d’action apostolique (Défap)

28 of 63

HOW THE CRAWLER WORKS

1. Starts with seed URL(s)

2. Checks if those URLs are reachable, and archives them

3. Looks for embedded content – what does it need to render the page? CSS, JavaScript, images, etc.

4. Looks for links to other pages

5. Checks if those pages are “in scope” and archives them

The crawler will continue until it cannot locate any more links that are in scope or it hits a limit set for the crawl (time, data, or document limits).
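In code terms, the loop above looks roughly like the following sketch (Python, using the common requests and beautifulsoup4 libraries). It is illustrative only, not Heritrix itself; the in_scope callable stands in for the scoping rules discussed on the next slides, and a real crawler also handles politeness delays, retries, deduplication, and WARC output.

    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def crawl(seeds, in_scope, max_docs=100):
        queue, seen, archived = list(seeds), set(seeds), []
        while queue and len(archived) < max_docs:     # stop at the crawl limit
            url = queue.pop(0)                        # 1. start with seed URL(s)
            try:
                resp = requests.get(url, timeout=30)  # 2. check reachability and fetch
            except requests.RequestException:
                continue                              # unreachable: skip it
            archived.append((url, resp.content))      # archive the response
            soup = BeautifulSoup(resp.text, "html.parser")
            # 3 & 4. find embedded content (CSS, scripts, images) and links
            for tag, attr in (("link", "href"), ("script", "src"),
                              ("img", "src"), ("a", "href")):
                for el in soup.find_all(tag):
                    link = urljoin(url, el.get(attr) or "")
                    # 5. archive only links that are in scope and unseen
                    if link and link not in seen and in_scope(link):
                        seen.add(link)
                        queue.append(link)
        return archived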

Département évangélique français d’action apostolique (Défap)

29 of 63

CRAWL SCOPE

How does the crawler know which links to archive and which to ignore?

> The seeds you add to your collection will determine the “scope” of your crawls.

> How you format your seed URLs can have an impact on the “scope” of your crawl.

30 of 63

ARCHIVE-IT CRAWLING SCOPE

Seed URLs can limit the crawl to a single directory of a site.

> example: www.archive.org/about/

> A trailing / at the end of your URL can have a big effect on scope

> Parts of the site not included in your seed directory will NOT be archived

Example seed: www.archive.org/about/

> Link: www.archive.org/webarchive.html is NOT in scope

Example seed: www.archive.org/about

> Link: www.archive.org/webarchive.html is in scope
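A simplified illustration of that trailing-slash rule in Python. This is a sketch of the behavior described above, not Archive-It's exact scoping logic; URL schemes are added so the host can be parsed.

    def in_scope(seed, link):
        if seed.endswith("/"):
            # Directory seed: only URLs under that directory are in scope.
            return link.startswith(seed)
        # No trailing slash: scope widens to the seed's whole host.
        host = seed.split("/")[2]            # e.g. "www.archive.org"
        return link.split("/")[2] == host

    # Mirrors the two examples above: False, then True.
    in_scope("http://www.archive.org/about/", "http://www.archive.org/webarchive.html")
    in_scope("http://www.archive.org/about", "http://www.archive.org/webarchive.html")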

31 of 63

ROBOTS.TXT BLOCKS

By default, our crawler respects all robots.txt files. Partners can check post-crawl reports for blocked seeds, hosts, or documents.

If your site is blocked...

> Contact the site owner and ask if they will unblock your crawler specifically – archive.org_bot.

> Some institutions choose to use a tool to ignore robots.txt blocks in specific cases
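For reference, here is how a crawler can consult a site's robots.txt before fetching, using only Python's standard library. The user agent string is the archive.org_bot named above; the site URL is a placeholder.

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
    rp.read()   # fetch and parse the site's robots.txt
    if rp.can_fetch("archive.org_bot", "https://www.example.com/about/"):
        print("allowed -- crawl it")
    else:
        print("blocked -- check post-crawl reports and contact the site owner")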

32 of 63

WEB ARCHIVING LIFE CYCLE

  • Highlights the policies and workflows of six partner institutions: Columbia University, the University of Alberta, the Montana State Library, the State Library of North Carolina, the North Carolina State Archives, and Creighton University

  • Covers issues including:
    • Policy
    • Vision and Objectives
    • Workflows
    • Access
    • Preservation

https://archive-it.org/static/files/archiveit_life_cycle_model.pdf

33 of 63

QUESTIONS TO ASK

Vision and Objectives:

  • What are your overarching web archiving goals?

Resources and workflow:

  • What resources will help you achieve these goals?
    • Staff
    • Technical/Infrastructure
    • External organizations?

Access/Use/Reuse:

  • Who are your target users?
  • What levels of access/research use would you like to facilitate?

34 of 63

MORE QUESTIONS

Appraisal and Selection:

  • What topics, events or domains would your organization be interested in archiving?
  • What websites/portions of websites would help you document your collections?
  • Who else is collecting this type of content?
  • How would you make selections? Internally? Nomination?

Scoping and Data Capture:

  • How much of your target sites would you capture?
  • How often would you capture this content?

35 of 63

36 of 63

PART II AGENDA

TOOLS & STANDARDS

CHALLENGES

NEW TECHNOLOGIES

RESEARCH & ACCESS

37 of 63

WEB ARCHIVING TOOLS AT INTERNET ARCHIVE

Heritrix
Web crawler – crawls and captures web pages

Umbra
Assists the crawler to access social media and other sites in the same way a browser would

Wayback
Access tool for rendering and viewing pages – surf the web as it was

ElasticSearch & SOLR
Full-text search indexing engine & metadata search software

Brozzler
Browser + crawler = Brozzler!

WARC
ISO standard for storing web archives

38 of 63

WARC (Web ARChive) Format

    • ISO 28500:2009
    • A container file that combines multiple digital resources into an aggregate archival file, together with related information
    • Written by crawlers
    • Stores concatenated raw content
    • Designed for long-term storage and preservation
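As a quick illustration of what lives inside the container, here is a minimal sketch that reads a WARC file with the open source warcio library (pip install warcio); the filename is a placeholder.

    from warcio.archiveiterator import ArchiveIterator

    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            # "response" records hold the raw captured HTTP responses;
            # other record types include requests, metadata, and revisits.
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"),
                      record.http_headers.get_statuscode())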

39 of 63

CHALLENGES: CONTENT

> JavaScript – some implementations can be difficult to capture and display.

> Videos – most can be captured, but some proprietary formats can be difficult.

> Social Media – tools for archiving Facebook, Twitter, Flickr, Instagram, and more are always improving.

40 of 63

CHALLENGES: CONTENT

What happens when content is behind a form, search box, or password?

> Alternatives may include a sitemap or direct links to content

> Password-protected sites – new feature currently in beta

Minnesota Historical Society

41 of 63

ACCESS: WHAT MAKES A SITE ARCHIVABLE

Make links transparent

Be careful with robots directives

Return reliable response codes

42 of 63

CHALLENGES: CONTENT

Types of content provoking concern over capacity to archive (chart).

From Web Archiving in the United States: A 2016 Survey, a report from the National Digital Stewardship Alliance

43 of 63

ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE

Crawling technology: Heritrix (2006 - 2014) | Heritrix + Umbra (2014 - Present) | Brozzler (Future?)

Heritrix (2006 - 2014):

> Traditional web crawler

> Scoping, capture, deduplication, and WARC creation in one process

> Less adept at triggering and capturing client-side scripts and JavaScript

44 of 63

ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE

Crawling technology: Heritrix (2006 - 2014) | Heritrix + Umbra (2014 - Present) | Brozzler (Future?)

Heritrix + Umbra (2014 - Present):

> Runs alongside Heritrix

> Mimics the way a browser would access a page

> Executes client-side scripts so previously undetectable URLs can be accessed by Heritrix

> Clicks or hovers to execute JavaScript

> Allows for dynamic scrolling

45 of 63

ARCHIVE-IT CRAWLING TECHNOLOGY TIMELINE

Crawling technology: Heritrix (2006 - 2014) | Heritrix + Umbra (2014 - Present) | Brozzler (Future?)

Brozzler (Future?):

> Captures HTTP traffic as it is loaded

> Uses a real browser to fetch pages and embedded URLs, and to extract links

> Works with youtube-dl to improve media capture

46 of 63

BROZZLER

“browser” + “crawler” = BROZZLER

Runs on an instance of the Chromium browser

Opens the page in the browser, takes a screenshot, and sends it to warcprox, where it is written to a WARC file

Runs a JavaScript behavior and finds <a href> outlinks
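Conceptually, a single Brozzler page visit follows the steps above. The sketch below is purely illustrative; the helper names are hypothetical stand-ins, not Brozzler's real API, and the actual WARC writing happens in warcprox, which the browser uses as an HTTP proxy.

    # Hypothetical stand-in functions -- NOT Brozzler's actual API.
    def brozzle_page(browser, url):
        browser.navigate(url)                # open the page in headless Chromium
        screenshot = browser.screenshot()    # take a screenshot of the page
        send_to_warcprox(url, screenshot)    # the proxy writes it into a WARC file
        browser.run_behavior("scroll.js")    # run a JavaScript behavior (scroll, click)
        return browser.extract_links("a[href]")   # collect <a href> outlinks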

47 of 63

WEB ARCHIVING TOOLS & SERVICES

  • Services & Tools
    • Archive-It
    • Internet Memory Foundation
    • Commercial: Hanzo, PageFreezer, MirrorWeb
    • Tools: Webrecorder, WAIL
    • API-based: twarc, Social Feed Manager
  • Access:
    • Oldweb.today, Memento, Webrecorder

For a full list: http://netpreserve.org/web-archiving/tools-and-software/

48 of 63

WEB ARCHIVING TOOLS: HTTRACK & WGET

HTTrack

  • Downloads a site from the Internet to a local directory
  • http://www.httrack.com/

WGET

  • Terminal tool
  • Allows you to download directories via the command line
  • http://ftp.gnu.org/gnu/wget/
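A sketch of a typical wget invocation, wrapped in Python for consistency with the other examples here. The flags shown are standard wget options; --warc-file (wget 1.14+) writes a WARC alongside the local mirror. The target URL is a placeholder.

    import subprocess

    subprocess.run([
        "wget",
        "--mirror",             # recursive download with timestamping
        "--page-requisites",    # also fetch CSS, images, and scripts
        "--convert-links",      # rewrite links so the mirror browses locally
        "--warc-file=example",  # additionally write example.warc.gz
        "https://www.example.com/",
    ], check=True)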

49 of 63

WEB ARCHIVING TOOLS: WAIL

“Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.”

Uses Heritrix 3.2.0 and OpenWayback 2.3.0

Developed by Mat Kelly at the Web Science and Digital Libraries research group at Old Dominion University

50 of 63

WEB ARCHIVING TOOLS: WEBRECORDER

Available at webrecorder.io/

Developed by Ilya Kreymer, and a project of Rhizome

Focus on dynamic web content such as embedded video and complex JavaScript

Provides for both capture and access

51 of 63

SOCIAL MEDIA WEB ARCHIVING TOOLS & SERVICES

Social Feed Manager

  • Developed at George Washington University Libraries
  • Collects Twitter data in bulk using the Twitter API
  • Open source; available on GitHub

twarc

  • Developed by Ed Summers at the Maryland Institute for Technology in the Humanities
  • A command line tool for archiving Twitter JSON data
  • Also uses the Twitter API
  • Useful for running searches on terms to collect all tweets mentioning a keyword
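A minimal sketch of such a keyword collection with twarc's Python API (pip install twarc). The credential strings are placeholders for your own Twitter API keys, and the search term is just an example.

    import json
    from twarc import Twarc

    t = Twarc(consumer_key="...", consumer_secret="...",
              access_token="...", access_token_secret="...")

    with open("keyword_tweets.jsonl", "w") as out:
        for tweet in t.search("#webarchiving"):    # recent tweets matching the term
            out.write(json.dumps(tweet) + "\n")    # one JSON tweet per line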

52 of 63

ACCESS: OLDWEB.TODAY

  • Browser emulator to access publicly available archived sites using virtual versions of old browsers
  • Focus is on playing back the site as it would have been originally experienced
  • Developed by Ilya Kreymer at Rhizome

53 of 63

ACCESS: MEMENTO/TIME TRAVEL SERVICE

54 of 63

RESEARCH SERVICES & WEB ARCHIVE DATASETS

TYPOLOGY OF RESEARCHER INTERESTS

  • Documentary: Evidentiary, attestation, legal discovery/claims
  • Social/Political Scientists: Communications, politics/government, social anthropology
  • Web Science: Technology systems and protocols
  • (Digital) Humanities: Historians and humanities disciplines, networks, collection building
  • Computer Science: Information retrieval, data processing and indexing, infrastructure and tools
  • Data Analysts: Mining/training, language processing, trend analysis

55 of 63

RESEARCH SERVICES & WEB ARCHIVE DATASETS

Web Archive Datasets

56 of 63

RESEARCH SERVICES & WEB ARCHIVE DATASETS

LESSONS LEARNED SUPPORTING RESEARCHERS

  • Researchers don’t always know what they want
  • Researchers default to wanting access to raw/all data
  • Researchers will have varying levels of technical resources or support
  • Address issues of technical proficiency, non-archive technical support, and methodological questions upfront
  • Researchers will require reference resources to explain and contextualize web archive tools and processes
  • More data doesn’t equal better analysis

57 of 63

RESEARCH SERVICES & WEB ARCHIVE DATASETS

STRATEGIC APPROACHES TO SUPPORTING RESEARCHERS

  • Focus on derivation, portability, and access
  • Focus on scalable partnerships & decentralization
  • Research support expectations often don’t match available resources or services
  • Research methodologies (conceptual, practical, technical) often don’t match available data, collecting practices, and tools
  • Service models are essential, though they have yet to emerge for most data-driven library, archive, and museum research

58 of 63

RECAP AND REVIEW

  • Definitions and Community
    • We defined terms, practices, and the current landscape
  • Collection
    • We discussed the whys and hows of creating web archives
  • Manage
    • We outlined management, tools, and services
  • Use
    • We looked at formats, archival replay, and data mining
  • Now We’ll Demo It All!
  • Then It’s Your Turn To Archive!

59 of 63

READING LIST


Davis, Corey. “Archiving the Web: A Case Study from the University of Victoria.” Code4Lib Journal, Issue 26 (2014). http://journal.code4lib.org/articles/10015

Keeping Collections: More Podcast Less Process, Episode 007: “The Web Archivists Are Present.” http://keepingcollections.org/more-podcast-less-process-episode-007/

D-Lib Magazine, Special Issue on Web Archives, March/April 2012. http://www.dlib.org/dlib/march12/03contents.html

National Digital Stewardship Alliance. Web Archiving in the United States: A 2016 Survey. 2017. http://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf

Summers, Ed. “The Web as a Preservation Medium.” inkdroid, 2013. http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/

Pennock, Maureen. Web-Archiving (DPC Technology Watch Report 13-01). Digital Preservation Coalition, 2013. http://dx.doi.org/10.7207/twr13-01

Taylor, Nicholas. “Anatomy of a Web Archive.” The Signal: Digital Preservation, Library of Congress, 2013. http://blogs.loc.gov/digitalpreservation/2013/11/anatomy-of-a-web-archive/

60 of 63

ADDITIONAL RESOURCES: support.archive-it.org

61 of 63

62 of 63

Thank you!

Lori Donovan, Senior Program Manager, Web Archiving | lori@archive.org

Maria Praetzellis, Program Manager, Web Archiving | maria@archive.org

Internet Archive & Archive-It | @internetarchive & @archiveitorg

63 of 63

ARCHIVE-IT DEMO

Archive-It: https://archive-it.org/

Click “Login” in the upper right

Login details:

  • Username: araworkshop
  • Password: ara2017