1 of 13

Web Archiving Life Cycle Model

Emily Collier

University of Kentucky Special Collections Research Center

2 of 13

Web Archiving Life Cycle Model

Archive-It’s Life Cycle Model

Bragg, Molly; Hanna, Kristine. (2013). The Web Archiving Life-Cycle Model. The Internet Archive. 30.

3 of 13

Policy: Always on the Board

The Web Archiving Life Cycle model begins with institutional policy. This stage of the cycle is continuously developing and interacting with every other aspect of web archiving, so unlike other areas of the model, policy stays on the board at every play.

  • New policy specific to web archiving OR
  • Adaptation of web archiving policy into existing archival policy

4 of 13

The Outer Circle:

Vision and Objectives

  • Why are we archiving the web?
  • Institutional mandate: to archive publications that are only available in online format for legal purposes
  • Archival mission: to archive a full representation of institutional business and operations
  • As an extension of current collection development policies
  • To record and preserve cultural heritage

5 of 13

The Outer Circle:

Resources and Workflow

  • The day-to-day resources needed to create and maintain web archives
  • Primary Considerations:
    • Number of Staff
    • Amount of Time
  • Types of individuals:
    • Librarian
    • Archivist
    • Metadata Librarian
    • Webmaster
    • Student Worker

6 of 13

The Outer Circle:

Access/Use/Reuse

  • Open access
    • Search box on local website
    • Institutional portal/landing page
    • Wayback software
    • Self-directed (reference Archive-It)
  • Restricted access
    • Unavailable until certain time
    • Restricted by IP address
  • Discovery tools
  • Finding aids/catalogs
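
The two restricted-access options above (a time embargo and an IP restriction) can be sketched as a single check. This is a minimal illustration, not how any particular access platform implements it; the function name `access_allowed`, its parameters, and the example network ranges are all assumptions made for the sketch.

```python
import ipaddress
from datetime import date


def access_allowed(client_ip, today, embargo_until=None, allowed_networks=None):
    """Return True if a patron may view an archived capture.

    embargo_until: hypothetical date before which nothing is served.
    allowed_networks: hypothetical list of CIDR strings, such as a reading room subnet.
    """
    # Embargo: content is unavailable until a certain date.
    if embargo_until is not None and today < embargo_until:
        return False
    # IP restriction: only listed networks may view the content.
    if allowed_networks:
        address = ipaddress.ip_address(client_ip)
        return any(address in ipaddress.ip_network(net) for net in allowed_networks)
    # No restriction configured: open access.
    return True
```

For example, `access_allowed("192.0.2.17", date(2024, 1, 1), allowed_networks=["192.0.2.0/24"])` returns True, while the same call with `embargo_until=date(2030, 1, 1)` returns False.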

7 of 13

The Outer Circle:

Preservation

  • An evolving issue
    • Changing formats/compatibility
    • Expired formats
    • Development of digital repositories
  • Local vs. External
  • Best practices:
    • Redundancy
    • Transparency
    • Integrity checks
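
Redundancy and integrity checks are often combined by comparing each stored copy against a recorded checksum. The sketch below assumes a simple manifest of file names and SHA-256 values; the names `sha256_of` and `verify_copies` and the manifest layout are hypothetical, not part of any specific repository system.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(manifest, copies):
    """Check every redundant copy against the manifest.

    manifest: {file name: expected SHA-256 hex digest} (assumed layout).
    copies: list of directories that each hold a full copy.
    Returns a list of (directory, file name, problem) tuples.
    """
    problems = []
    for directory in copies:
        for name, expected in manifest.items():
            candidate = Path(directory) / name
            if not candidate.is_file():
                problems.append((str(directory), name, "missing"))
            elif sha256_of(candidate) != expected:
                problems.append((str(directory), name, "checksum mismatch"))
    return problems
```

An empty result means every copy matched; keeping the results of each run supports the transparency practice listed above.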

8 of 13

The Outer Circle:

Risk Management

  • Managing the level of copyright risk
  • Whether and how to seek permissions:
    • Robots.txt protocol
    • “Fair Use”
  • Policy on content removal upon complaint or issue:
    • Immediately remove content
    • Argue right to content

9 of 13

The Middle Circle:

Metadata and Description

  • Like policy, metadata and description is ongoing, as is standard for web archiving
  • Level of metadata description (2013 Archive-It study):
    • 90% generate collection-level metadata
    • 60% generate seed-level metadata
    • 15% generate document-level metadata
    • 60% generate both collection- and seed-level metadata
  • Prepare manually vs. scraping from the website
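
As an illustration of scraping descriptive metadata from the website rather than preparing it manually, the sketch below pulls the page title and named `<meta>` tags out of a seed's HTML using the standard library. The class and function names are hypothetical, and scraped values would normally still be reviewed before they are used as seed-level description.

```python
from html.parser import HTMLParser


class SeedMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            # Store meta names in lower case, e.g. "description", "keywords".
            self.metadata[attributes["name"].lower()] = attributes["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.metadata["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def scrape_seed_metadata(html_text):
    """Return a dict of scraped metadata for one page of HTML."""
    parser = SeedMetadataParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata
```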

10 of 13

The Inner Circle:

Appraisal and Selection

  • Which specific websites should be captured and which seeds should be crawled? Who decides?
    • Archives team
    • Subject liaisons
  • State: State agencies and websites
  • University: web presence and creation of collections based on themes
  • Social media can be difficult to capture. Is it worth it?

11 of 13

The Inner Circle:

Scoping

  • How much of each website should be archived?
    • One page vs. Nested pages
      • i.e., only the front page, or pages from the entire site
    • Embedded links
    • Third-party content
    • Specific format only (e.g., PDF)
  • Captures can be limited by:
    • Crawl Time
    • URL expressions
    • Data limits
  • Scoping limits can create complexities and cause issues, particularly with pages that rely on difficult-to-capture technologies such as JavaScript or Flash
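
The scoping choices above (same-site vs. third-party content, URL expressions, specific formats) can be sketched as a small rule check. This is an assumption-laden illustration rather than the rule syntax of any crawler; `in_scope` and its parameters are hypothetical names, and the patterns are ordinary regular expressions.

```python
import re
from urllib.parse import urlsplit


def in_scope(url, seed_host, include_patterns=(), exclude_patterns=()):
    """Decide whether a discovered URL should be captured.

    seed_host: host of the seed; other hosts count as third-party content.
    include_patterns: if given, the URL must match at least one (e.g. PDFs only).
    exclude_patterns: any match removes the URL from scope.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host != seed_host.lower():
        return False
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    if include_patterns:
        return any(re.search(pattern, url) for pattern in include_patterns)
    return True
```

For example, passing `include_patterns=[r"\.pdf$"]` limits a capture to PDF documents on the seed's own host. Limits on crawl time and data volume are usually set in the crawler itself and are not modeled here.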

12 of 13

The Inner Circle:

Data Capture

  • Involves the specifics of the crawls, including:
    • Restrictions
    • Frequency/Scheduling
  • Complications:
    • Sites can be bigger than expected
    • Crawl “traps”
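
A crawl trap is a part of a site that keeps generating new URLs, so a crawl can run indefinitely. One rough way to spot likely traps is to flag URLs with very deep paths or with the same path segment repeated many times. The heuristic below is a sketch only; the function name and both thresholds are assumptions, and it will not catch every kind of trap (endless calendar pages, for instance).

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_crawl_trap(url, max_depth=12, repeat_threshold=3):
    """Flag URLs whose paths are unusually deep or repeat the same segment."""
    segments = [part for part in urlsplit(url).path.split("/") if part]
    # Very deep paths are a common symptom of endlessly generated links.
    if len(segments) > max_depth:
        return True
    # The same segment appearing again and again suggests a loop.
    return any(count >= repeat_threshold for count in Counter(segments).values())
```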

13 of 13

The Inner Circle:

Quality Assurance/Analysis

  • Review crawls through reports at seed and host levels
  • Issues can be resolved by:
    • Conferring with webmasters
    • Running test crawls
    • Running patch crawls
    • Archive-It proxy mode for viewing sites offline
    • Enabling the Wayback QA (Quality Assurance) tool
    • Modifying scopes
  • Open source software is available from developers to handle complex issues or enhance use.
    • These tools should be checked for support, maintenance, and longevity before being adopted
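
Reviewing a crawl at the host level amounts to grouping the captured URLs by host and counting what succeeded and what failed. The sketch below assumes a simple list of `(url, HTTP status, size in bytes)` records; that record shape and the function name `summarize_by_host` are hypothetical and do not reflect the format of Archive-It's own reports.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_by_host(records):
    """Group crawl records by host and total URLs, errors, and bytes.

    records: iterable of (url, http_status, size_in_bytes) tuples (assumed shape).
    """
    summary = defaultdict(lambda: {"urls": 0, "errors": 0, "bytes": 0})
    for url, status, size in records:
        host = urlsplit(url).hostname or "(unknown)"
        row = summary[host]
        row["urls"] += 1
        row["bytes"] += size
        # Treat 4xx and 5xx responses as capture problems worth reviewing.
        if status >= 400:
            row["errors"] += 1
    return dict(summary)
```

Hosts with a high share of errors are natural candidates for a test crawl, a patch crawl, or a scope change.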