1 of 16

Web Archives as Big Data: Building Tools and Community to Support Access and Usability with Archives Unleashed

Samantha Fritz, MLIS

Project Manager, Archives Unleashed

Samantha.fritz@uwaterloo.ca | @SamVFritz

IFLA WLIC 2021, 17-19 August 2021

2 of 16

  • Term popularized in 2005

  • Reflective of technological developments of 20th / 21st century
  • WWW impacted information production, interaction, and preservation
  • The volume of data and the speed at which it is generated are important characteristics of big data (A Short History of Big Data, 2019)

Big Data

3 of 16

Big Data (noun)

Data whose scale, diversity and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden meaning from it.

(Noam Slonim et al., 2012)

4 of 16

Web Archiving

  • Web archives (W/ARCs) are Big Data
  • Web Archiving preserves vulnerable cultural information via born-digital artifacts.
  • WARC is an ISO standard file format (ISO 28500) for saving web data and metadata
  • Preservation efforts began in 1996 and have grown into an international community
  • WARCs are an important data source for studying the recent past (Milligan, 2019)
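To make the WARC format concrete, here is a minimal sketch of how a record is laid out: a version line, named header fields, a blank line, and then a content block whose size is given by Content-Length. The sample record, the `read_records` helper, and the example URL are all invented for illustration. Real WARC files are usually gzip-compressed and far larger, so actual work would normally rely on an established WARC library rather than a hand-rolled parser like this one.

```python
import io

# A tiny hand-written WARC record for illustration only (not from a real crawl).
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2021-08-17T00:00:00Z\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)


def read_records(stream):
    """Yield (headers, content) pairs from an uncompressed WARC byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        content = stream.read(int(headers["Content-Length"]))
        yield headers, content


# Two copies of the sample record stand in for a (very small) web archive.
records = list(read_records(io.BytesIO(SAMPLE_RECORD * 2)))
for headers, content in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(content))
```

Each record carries both the captured web data (the HTTP response) and metadata about the capture (type, target URI, date), which is what makes WARC files usable as a research data source.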

5 of 16

“Libraries have long been in the business of preserving documentary heritage”

- Ben White, 2012

Libraries & (W/ARC) Data

6 of 16

Libraries & (W/ARC) Data

  • Fundamental processes apply equally to print and analogue materials and to data collections (CARL-ABRC, 2014)
  • Web archiving has increasingly become part of research agendas for national libraries and archives around the world.

Small sample of national, university/college, and public libraries conducting web archiving

7 of 16

Challenges with Web Archives

Despite the volume of data captured, web archives have not become a dominant resource for researchers

8 of 16

Challenges with Web Archives

  1. Visibility
    • Silo effect of collection – curation at the institutional/collection level
  2. Lag of Analytical Tools
    • Collection/preservation practices excel
    • Development of analytics tools and infrastructure has lagged
  3. Tools are too Technical
    • Large-scale analysis requires technical knowledge that tends to be out of reach for most scholars

9 of 16

Solutions

with the Archives Unleashed Project

  • Tool Building
  • Community Engagement
  • Collaborative Partnerships

Est. 2017

10 of 16

  • Lag and lack of analytical tools hinder access and use
  • The Archives Unleashed Project (2017-2020) developed open-source, scalable, and user-friendl(ier) tools
  • Options for scalable analysis of W/ARC files

1. Tools: Scalable and User-Friendly

Archives Unleashed Toolkit

Archives Unleashed Cloud

11 of 16

  • Resources developed to support, encourage, and instil confidence in scholars
  • Toolkit Documentation - a cookbook approach, with pre-built scripts (or recipes) that users can plug in to address common analytic tasks.
  • Learning Resources - step-by-step instructions for guiding researchers to explore derivatives with external tools

2. Learning Resources: Inspire Confidence & Use

Archives Unleashed Toolkit User Documentation: https://aut.docs.archivesunleashed.org
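As a rough illustration of the kind of task a cookbook recipe addresses, the sketch below counts how often each domain appears in a small derivative file. It is plain Python and does not use the Archives Unleashed Toolkit API. The CSV layout, the column names `crawl_date` and `url`, the sample rows, and the `domain_frequency` helper are all assumptions made for this example.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse

# Hypothetical derivative: a CSV of captured pages. Column names are assumptions.
SAMPLE_DERIVATIVE = """crawl_date,url
20190301,http://www.example.com/a
20190301,http://www.example.com/b
20190302,https://news.example.org/story
"""


def domain_frequency(csv_text):
    """Count captured pages per domain, most common first."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        host = urlparse(row["url"]).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one domain
        counts[host] += 1
    return counts.most_common()


top_domains = domain_frequency(SAMPLE_DERIVATIVE)
print(top_domains)
```

A domain count like this is a typical first step for getting a sense of what a collection contains before moving on to text or network analysis with external tools.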

12 of 16

  • Projects can’t live in silos, they need community
  • Main community building & engagement activity: series of datathon events
  • Participants: interdisciplinary group of librarians, researchers, technologists
  • Collaboration over two days to gain hands-on experience with web archive data.
  • Build user community, foster sense of belonging
  • Datathon alumni became ambassadors within broader communities

3. Build and Engage Community

Archives Unleashed Washington, DC Datathon

Gelman Library, George Washington University, 2019. Photo by Samantha Fritz

13 of 16

Formalized collaboration with the Internet Archive’s Archive-It (2020-2023) to integrate services

Worked with scholars in several disciplines: digital humanities, social sciences, medicine, journalism, and political science

Proactively developed collaborative relationships with several stakeholder groups

4. Collaboration Expands Visibility, Access, and Use

Diagram labels: Research Communities; Academic Libraries +; Archive-It Collaboration

Collaborated with library institutions in North America to make scholarly derivatives openly available.

14 of 16

Conclusions

  • Reflecting on Archives Unleashed as a use case, we can draw on soft skills found within the LIS profession to face challenges of data visibility, use, and access

  • Community Stewards
    • Foster, support, engage communities
    • Create conversational space to address user needs
    • Sense of belonging creates ambassadors, which increases data visibility
  • Collaborators and Teachers
    • Build collaborative relationships to instill confidence in data use
    • Support data literacy & skills development
  • Resourceful Problem Solvers
    • Assess the tool landscape and identify resources to share that fill gaps and meet data access challenges

15 of 16

Sources


16 of 16

Thank You!

Samantha Fritz, MLIS

Project Manager, Archives Unleashed

https://archivesunleashed.org/

Samantha.fritz@uwaterloo.ca @SamVFritz