1 of 11

Web Archiving

Goals and Fundamentals

Emily Collier

University of Kentucky Special Collections Research Center

2 of 11

Why Web Archiving: The Loss of the Hard Copy

  • Many types of records that used to be delivered as a hard copy are now produced and maintained exclusively online.
  • Online records present a new problem: they are regularly altered, deleted, or lost (e.g., a deleted Twitter post or a revamped website).

3 of 11

Why Web Archiving: To Fully Represent Institutions

  • Many university functions that are difficult to document through traditional accessions (such as socialization or the promotion of a distinct culture) are readily apparent in websites, particularly social media.
  • Websites help confer credentials (from the recruitment of students through their graduation), convey knowledge, foster socialization, conduct research, sustain the institution, provide public services, and promote a distinctive culture. (Barry et al.)
  • Archives fail in their mission to fully represent an institution if these types of records are lost.

4 of 11

The Goals of Web Archiving

To create, preserve, and make accessible the transient records stored as various web documents that become subject to loss through the nature of online materials: link rot, change in URL, removal of or updates to uploaded files, etc.

To do this efficiently, with a high level of automation that reduces the resources required for processing (low resources/labor, high output), as a means of alleviating or preventing archival backlog. “Archivists spend too much time doing things they don’t need to be doing, or at least don’t need to be doing all the time.” (Greene and Meissner, p. 209)

To integrate these records into a system that makes them easily accessible.

5 of 11

University of Kentucky Web Archive Goal

To preserve ‘external’ digital content related to the University of Kentucky published exclusively on web-based platforms: degree requirements, syllabi, course catalogs, newsletters, etc.

This content represents essential functions of UK operations and documents the rich social and academic life of the University. Without the preservation of these web-based resources, valuable records are lost.

To use Archive-It effectively as a web crawler (record collector) and then integrate the archived sites into ArchivesSpace, a system that makes them accessible to researchers.

6 of 11

Unique Problems

7 of 11

Communication Infrastructure

“Since archival versions of websites typically result in limited functionality, especially where hyperlinks are concerned, communication must exist between web developers and administration.” (Barry et al., p. 9)

Communication lines must exist to ensure that web capture crawls perform correctly and that rights/restrictions are preserved. robots.txt exclusions can halt crawls, and crawls can fall into repetitive “traps.”
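
As a rough illustration of both problems, the sketch below checks seed URLs against a robots.txt file and flags URLs whose paths repeat the same segment many times, a common symptom of a crawler trap. The robots.txt content, the example.edu URLs, the user-agent string, and the repeat threshold are all assumptions made for illustration; they are not Archive-It settings or UK site data.

```python
from collections import Counter
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site's file will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /private/
"""


def blocked_seeds(robots_txt, seeds, user_agent="archive.org_bot"):
    """Return the seeds that the given robots.txt rules would exclude.

    The user_agent default is an assumed crawler name, used only as an example.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in seeds if not parser.can_fetch(user_agent, url)]


def looks_like_trap(url, max_repeats=3):
    """Heuristic: flag a URL if any path segment occurs max_repeats or more times."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(count >= max_repeats for count in Counter(segments).values())


# Hypothetical seed list.
seeds = [
    "http://example.edu/about/",
    "http://example.edu/calendar/2018/01/",
]
print(blocked_seeds(SAMPLE_ROBOTS_TXT, seeds))
print(looks_like_trap("http://example.edu/a/b/a/b/a/b/"))
```

A check like this could be run before a crawl is scheduled, so that blocked seeds are raised with the site's web developers rather than discovered in a failed capture.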

8 of 11

Breaking Traditional Practices

“The archival profession has been unwilling or unable to change its processing practices in response to increased acquisitions” (Greene and Meissner, p. 211)

Traditional approaches may present different hurdles with web-based records: metadata for web documents are typically not included in standards or institutional guidelines. Institutions need to be inclusive of these new types of resources and modify their practices.

9 of 11

Mixed Collections: Analog + Digital

“There are also discrepancies between physical holdings and representations on the web.” (Barry et al., p. 2) The loss of either physical or web documents creates gaps in representation.

Online records are often considered part of larger, pre-existing collections, but their formats, preservation concerns, and methods of access differ from those of the analog material. Integration with EAD finding aids can increase access.
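
One way to picture that integration is a digital archival object (`<dao>`) element in an EAD 2002 finding aid that points at the archived site. The sketch below builds such an element with Python's standard library. The collection ID 0000, the title, and the use of the Archive-It Wayback calendar URL pattern are assumptions for illustration, not the markup ArchivesSpace or UK actually produces.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def wayback_url(collection_id, seed_url):
    """Build a link to the capture calendar for a seed (assumed URL pattern)."""
    return f"https://wayback.archive-it.org/{collection_id}/*/{seed_url}"


def make_dao(collection_id, seed_url, title):
    """Return an EAD-style <dao> element linking to the archived seed."""
    dao = ET.Element("dao", {
        f"{{{XLINK_NS}}}href": wayback_url(collection_id, seed_url),
        f"{{{XLINK_NS}}}title": title,
    })
    desc = ET.SubElement(dao, "daodesc")
    ET.SubElement(desc, "p").text = title
    return dao


# "0000" is a placeholder collection ID, not a real Archive-It collection.
dao = make_dao("0000", "http://library.law.uky.edu/home", "Law Library website")
dao_xml = ET.tostring(dao, encoding="unicode")
print(dao_xml)
```

Placing an element like this inside the relevant component of an existing finding aid keeps the web captures alongside the analog description instead of in a separate silo.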

10 of 11

Unique Problems: Seed URLs and Relationships

Many institutional sites may have more than one relevant “seed” URL, or multiple URLs that contain relevant documents for a specific website. (http://law.uky.edu/about-us/law-library vs. http://library.law.uky.edu/home)

Many archive collections can relate to different groups or crawls of one specific URL. This creates a many-to-one relationship in which one “seed” could be considered to belong to multiple collections. Creating multiple seeds from one main seed (e.g., department seeds from a main university site) isn’t practical because the structure of linked pages creates meaning and context.
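
The relationship described above can be modeled as a set of (collection, seed) links that is indexed in both directions. The sketch below is a minimal illustration: the two URLs come from the example above, while the collection names are hypothetical.

```python
from collections import defaultdict

# (collection, seed) pairs. Collection names are hypothetical examples.
LINKS = [
    ("Law Library records", "http://law.uky.edu/about-us/law-library"),
    ("Law Library records", "http://library.law.uky.edu/home"),
    ("College of Law records", "http://law.uky.edu/about-us/law-library"),
]


def index_links(links):
    """Index the links both ways: seeds per collection, collections per seed."""
    seeds_by_collection = defaultdict(set)
    collections_by_seed = defaultdict(set)
    for collection, seed in links:
        seeds_by_collection[collection].add(seed)
        collections_by_seed[seed].add(collection)
    return dict(seeds_by_collection), dict(collections_by_seed)


seeds_by_collection, collections_by_seed = index_links(LINKS)
for seed, collections in sorted(collections_by_seed.items()):
    print(seed, "->", sorted(collections))
```

Keeping the links as a separate list means a seed is crawled once but can be described under every collection it belongs to, without splitting the site into artificial sub-seeds.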

11 of 11

Selected Readings

    • Shallcross. (2001). “On the Development of the University of Michigan Web Archives.” 29 pp.
    • Barry et al. (2008). “Survey of the University of Pittsburgh and Association of American Universities’ Websites and Physical Holdings.” 12 pp.
    • Greene & Meissner. (2005). “More Product, Less Process: Revamping Traditional Archival Processing.” 56 pp.
    • Roe. (2005). “Arranging and Describing Archives and Manuscripts.” Chapters 1, 2, and 4.
    • Weidman. (2016). “A Sustainable, Large-Scale, Minimal Approach to Accessing Web Archives.”
    • Work, L.; Sullivan, L. (2018). Archive-It Partner Presentations.
    • Lepore, Jill. (Jan. 26, 2015). “The Cobweb: Can the Internet be archived?” The New Yorker.