Web Archiving
Goals and Fundamentals
Emily Collier
University of Kentucky Special Collections Research Center
Why Web Archiving: The Loss of the Hard Copy
Why Web Archiving: To Fully Represent Institutions
The Goals of Web Archiving
To create, preserve, and make accessible the transient records stored as various web documents that become subject to loss through the nature of online materials: link rot, change in URL, removal of or updates to uploaded files, etc.
To do this efficiently with a high level of automation to reduce the amount of required resources for processing (Low resources/labor, high output), as a means of alleviating or preventing archival backlog. “Archivists spend too much time doing things they don’t need to be doing, or at least don’t need to be doing all the time.” (Greene and Meissner, p. 209)
To integrate these records into a system that makes them easily accessible.
University of Kentucky Web Archive Goal
To preserve ‘external’ digital content related to the University of Kentucky published exclusively on web-based platforms: degree requirements, syllabi, course catalogs, newsletters, etc.
This content represents essential functions of UK operations and documents the rich social and academic life of the University. Without the preservation of these web-based resources, valuable records are lost.
To effectively use Archive-It as a web crawler (record collector) and then integrate archived sites into a system that makes them accessible for researchers: ArchivesSpace
Unique Problems
Communication Infrastructure
“Since archival versions of websites typically result in limited functionality, especially where hyperlinks are concerned, communication must exist between web developers and administration.” (Barry et al. p. 9)
Communication lines must exist to ensure that web capture crawls perform correctly and rights/restrictions are preserved. Exclusions in a site's robots.txt file can block crawls, and crawls can fall into repetitive “traps” (such as endlessly generated calendar or session URLs).
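As an illustration of the two crawl problems above, the following minimal Python sketch checks a URL against robots.txt rules and flags URLs that look like crawler traps. It is not how Archive-It itself works. The robots.txt content, the user-agent name `example-archive-bot`, the example.edu URLs, and the trap thresholds are all hypothetical values chosen for the example.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def looks_like_trap(url: str, max_depth: int = 8, max_repeats: int = 2) -> bool:
    """Heuristic only: flag very deep paths or paths that repeat a segment.

    The thresholds are assumptions for illustration, not crawler defaults.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(segments.count(s) > max_repeats for s in set(segments))


if __name__ == "__main__":
    agent = "example-archive-bot"  # hypothetical crawler name
    print(is_allowed(ROBOTS_TXT, agent, "http://example.edu/catalog/"))
    print(is_allowed(ROBOTS_TXT, agent, "http://example.edu/private/syllabus.pdf"))
    print(looks_like_trap("http://example.edu/calendar/next/next/next/next"))
```

A check like this can give archivists a concrete list of blocked or suspicious URLs to raise with web developers before a crawl is scheduled.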
Breaking Traditional Practices
“The archival profession has been unwilling or unable to change its processing practices in response to increased acquisitions” (Greene and Meissner, p. 211)
Traditional approaches may present different hurdles with web-based records: metadata for web documents is typically not covered by existing standards or institutional guidelines. Institutions need to be inclusive of these new types of resources and modify their practices.
Mixed Collections: Analog + Digital
“There are also discrepancies between physical holdings and representations on the web.” (Barry et al. p. 2) The loss of either physical or web documents creates gaps in representation.
Online records are often considered part of larger, pre-existing collections, but formats, preservation concerns and methods of access are unique. Integration with EAD finding aids can increase access.
Unique Problems: Seed URLs and Relationships
Many institutional sites have more than one relevant “seed” URL, or multiple URLs that contain relevant documents for a specific website (for example, http://law.uky.edu/about-us/law-library vs. http://library.law.uky.edu/home).
Many archive collections can relate to different groups or crawls of one specific URL. This creates a many-to-one relationship in which one “seed” could be considered as belonging to multiple collections. Creating multiple seeds from one main seed (i.e., department seeds from a main university site) isn’t practical because the structure of linked pages creates meaning and context.
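The seed-to-collection relationship described above can be modeled as a simple two-way lookup. The sketch below is one possible way to record it, and it does not reflect how Archive-It or ArchivesSpace store this data. The `SeedRegistry` class and the collection names are hypothetical; the seed URLs are the two law library examples given earlier.

```python
from collections import defaultdict


class SeedRegistry:
    """Tracks which collections each seed URL belongs to, and vice versa."""

    def __init__(self) -> None:
        self._collections_by_seed = defaultdict(set)
        self._seeds_by_collection = defaultdict(set)

    def link(self, seed_url: str, collection: str) -> None:
        """Record that seed_url is part of collection (idempotent)."""
        self._collections_by_seed[seed_url].add(collection)
        self._seeds_by_collection[collection].add(seed_url)

    def collections_for(self, seed_url: str) -> list[str]:
        """All collections that a single seed belongs to, sorted by name."""
        return sorted(self._collections_by_seed.get(seed_url, set()))

    def seeds_for(self, collection: str) -> list[str]:
        """All seeds crawled for a single collection, sorted by URL."""
        return sorted(self._seeds_by_collection.get(collection, set()))


if __name__ == "__main__":
    registry = SeedRegistry()
    # Collection names are hypothetical placeholders.
    registry.link("http://law.uky.edu/about-us/law-library", "College of Law records")
    registry.link("http://library.law.uky.edu/home", "College of Law records")
    registry.link("http://library.law.uky.edu/home", "UK Libraries records")

    print(registry.collections_for("http://library.law.uky.edu/home"))
    print(registry.seeds_for("College of Law records"))
```

Keeping the seed whole and linking it to several collections preserves the context carried by the site's own link structure, instead of splitting one site into artificial department-level seeds.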
Selected Readings