1 of 19

Build: The Archives Research Compute Hub from Idea to Platform

Ian Milligan, Jefferson Bailey, Nick Ruest, Helge Holzmann, Samantha Fritz, Kody Willis

Web Archiving Conference, 2022

2 of 19

The Web

  • is a reflection of society
  • remains an untapped resource for research [1]
  • has been archived since the mid-1990s, producing an abundance of material
  • provides a new context for research data in the form of web archives

Citation:

1. Schroeder, R., & Brügger, N. (2017). Introduction: The web as history. In R. Schroeder & N. Brügger (Eds.), The Web as History: Using Web Archives to Understand the Past and the Present (pp. 1–20). UCL Press. https://doi.org/10.2307/j.ctt1mtz55k.6

3 of 19

The Challenge

The current state of analytics tools and community infrastructure, together with inaccessible web archival interfaces, presents high barriers to conducting research with web archives at scale.

4 of 19

Archives Unleashed I (2017-2020)

  • Recognizes the critical role of web archives for scholars studying the 1990s onward
  • Developed the Archives Unleashed Cloud
    • An interface to sync Archive-It collections, analyze them, generate scholarly derivatives, and work with them
  • Standalone system that demonstrated how a web browser interface could power the underlying Apache Spark-based Archives Unleashed Toolkit

5 of 19

Archives Unleashed I (2017-2020)

  • Refined the Archives Unleashed Toolkit
    • An Apache Spark library for working with and analyzing W/ARC files
  • User documentation offers dozens of pre-built scripts to explore web archives and extract information
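
As a rough illustration of the kind of information such scripts extract, the sketch below counts how often each domain appears in a list of captured URLs. It is plain Python rather than the Toolkit's Spark API, and the function name `domain_frequency` and the sample URLs are assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_frequency(urls):
    """Count captures per host, ignoring a leading 'www.' (hypothetical helper)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host is None:
            continue  # skip strings that are not usable URLs
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


# Sample captured URLs, invented for illustration.
captures = [
    "https://www.example.org/about",
    "http://example.org/news/2021",
    "https://archive.example.com/item/1",
    "not a url",
]

for domain, n in domain_frequency(captures).most_common():
    print(domain, n)
```

At collection scale the same grouping would be done with a distributed job rather than an in-memory counter; the sketch only shows the shape of the result.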

6 of 19

Archives Unleashed II (2020-2023)

Project Priorities

ARCH (Archives Research Compute Hub): Merge Archives Unleashed with the Internet Archive Archive-It Platform to create an end-to-end solution to collect and study web archives.

Cohort Program: Foster and support a research community of practice by offering opportunities to engage with web archive research.

7 of 19

Introducing the ARCH Platform

  • Develop a scalable analysis platform within the Archive-It environment

  • ARCH allows users to delve into the rich data within web archival collections for further research

8 of 19

Introducing the ARCH Platform

Features:

  • Interactive, familiar environment for current Archive-It subscribers
  • Addresses first steps in analysis
  • Generate and download over a dozen datasets
  • In-browser visualizations and data previews that present a glimpse into collection content
  • Located in the Internet Archive data center, ARCH has quick access to the petabytes of content collected

9 of 19

Switching to Live Demo here

10 of 19

Building for Scalable Analysis

11 of 19

First Steps

Ideation: Identifying existing Archives Unleashed and Archive-It services – overlaps and differences?

Creation: From a half-dozen paper drawings to an interactive prototype (using MockPlus), sketching wireframes along the way

Iteration: Showing teams storyboards and thinking about how to create an intuitive and friendly workflow

12 of 19

User Experience Testing

User experience (UX) testing seeks to understand the impressions, experiences, and feelings a user expresses while interacting with a product prototype.

  • Brings the creators and developers into closer alignment with their end-users
  • ARCH UX Goal: understand research behaviours and the user journey while assessing what works well, what challenges arise, and what needs aren’t being met
  • Conducted a multi-stage user testing process to continually assess user sentiment and the impact of functionality and interface improvements
    • Concept Design Interviews (2021)
    • Multiple rounds of UX testing (2021-2022)
    • Focused interviews with cohort researchers (2022)

13 of 19

User Experience Testing

  • Spectrum of confidence: Data scientists confident in their ability to analyze data, collectors less so.
  • Better integration into analysis pipelines (e.g., command-line download)
  • Enhanced Analysis to include additional datasets that respond to research needs
  • Increased alignment with accessibility standards
  • Improved workflow and navigation by providing way-finding support and prompts
  • Clearer language for our educational users and additional documentation and learning resources

14 of 19

Connection and Integration

  • Front-end development
  • Connecting back-end processes
  • Integration of product into the production environment
  • Technical Choices
    • Scalatra
    • Apache Spark
    • HDFS
    • Sparkling
    • Archives Unleashed Toolkit

15 of 19

Continual Improvement

  • ARCH in beta, with pilot users from the Cohort Program
  • Monitoring
  • Final year developments
    • Pre-filtering (user-defined queries)
    • Thoughtful access and use for non-AI subscribers

16 of 19

Lessons Learned

17 of 19

Reflecting on Lessons Learned

Lesson 1

If you build it, they won’t come. You need to actively work to create an environment where users feel comfortable.

Lesson 2

Work to meet your users. This doesn’t necessarily mean that you will make all of them happy, but it does mean you need to listen and be responsive through UX testing and outreach.

Lesson 3

Be ready for the unexpected! If something happens 1 time in 1,000,000, you’ll run into it dozens of times in your WARCs. So plan for error handling and continual improvement.
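
To make Lesson 3 concrete, here is a minimal sketch in Python. It first works out how often a one-in-a-million event is expected across a collection, then shows a processing loop that records a failure and moves on instead of aborting. The collection size, the tab-separated record layout, and the names `expected_occurrences`, `parse_record`, and `process_all` are hypothetical stand-ins, not part of ARCH or the Archives Unleashed Toolkit.

```python
def expected_occurrences(odds, record_count):
    """Expected hits for a 1-in-`odds` event across `record_count` records."""
    return record_count / odds


def parse_record(raw):
    """Hypothetical parser: splits 'url<TAB>status' and rejects malformed input."""
    url, status = raw.split("\t")  # raises ValueError unless exactly two fields
    return url, int(status)        # raises ValueError on a non-numeric status


def process_all(raw_records):
    """Parse every record, collecting failures instead of stopping at the first."""
    parsed, failures = [], []
    for index, raw in enumerate(raw_records):
        try:
            parsed.append(parse_record(raw))
        except ValueError as err:
            # Keep the position and message so the bad record can be inspected later.
            failures.append((index, str(err)))
    return parsed, failures


# Assumed collection size of 50 million records, chosen only for illustration.
print(expected_occurrences(1_000_000, 50_000_000))

records = ["http://example.org/\t200", "garbled line", "http://example.org/a\tOK"]
parsed, failures = process_all(records)
print(len(parsed), "parsed;", len(failures), "failed")
```

Collecting failures alongside results keeps a long-running job alive and leaves a list of problem records to feed back into later improvements.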

18 of 19

Acknowledgements of Institutional Support

19 of 19

Thanks!

Any questions?

Connect with our project team:
