1 of 27

Hydra in a Box:

Building A Next-Generation Platform for Digital Collections

Hannah Frost, Stanford University

Gretchen Gueguen, DPLA

Mark A. Matienzo, DPLA

DPLAFest 2016 — April 14, 2016

2 of 27

Project Overview

A Time for Change
The Vision
Project Partners
Project Goals
Timeline

3 of 27

A Time for Change

Conversations between Stanford University, DPLA, and DuraSpace informed project design
Current digital collections platforms originate in an earlier phase of the web, which explains their current limitations
Infrastructure needs in the DPLA Hub network

Legacy systems unable to leverage modern affordances of the web
Lack of scalable and sustainable aggregation workflows
Lack of support for linked data and metadata enrichment
Perceived lack of “obvious choices” for replacement systems for digital collections

The project originated in conversations in the summer of 2014 between DPLA, DuraSpace, and Stanford about the state of repositories and digital collections. Through these discussions, we realized that the management and publication of digital collections for libraries, cultural heritage, and memory institutions were at a point where large-scale transition was possible. Many digital repository systems currently in use were conceived with the main objective of putting collections online, during an earlier phase of the web in which unifying collections at a national scale was neither feasible nor a priority, integration with other web-based services was novel, and tablets and mobile devices were considerably rarer. Digital content, the curation workflows applied to it, and mechanisms for publishing content to the web have all become far more sophisticated since the advent of early-stage systems. The web has demonstrated its capacity for decentralized discovery, sharing, and reuse of resources across the network. What better way to leverage this possibility than to collaborate in developing a new, best-of-breed system?

DPLA hubs, as well as much of the library community in general, lament having to shoehorn the needed functionality into older systems to make this happen. Legacy systems are unable to take advantage of contemporary web affordances to describe, transform, preserve, and serve digital objects to audiences. For instance, they are often not based on an architecture that can natively support linked data, which is rapidly becoming critical for the cultural heritage sector to manage relationships between digital collections and their real-world context. Conventions and standards such as the International Image Interoperability Framework (IIIF) have emerged and can support many core interoperability needs for digital collections, but because they are new, existing systems implement them only in limited ways unless a significant amount of integration work is done. DPLA has also realized that these systems lack rigorous and scalable ways to export their metadata into aggregation systems. This holds equally true for other external applications with which digital collections systems must interoperate, including discovery environments, metadata enrichment systems, content management systems, and crowdsourcing platforms. Legacy systems are also missing reliable and scalable mechanisms to import or synchronize improvements, such as geocoded place names, back from entities such as DPLA.

While some institutions have chosen to undertake the development work necessary to address these gaps, many institutions have discovered aging software is also increasingly hard to maintain given older code bases and architectural assumptions, and painfully difficult to extend, integrate, and build new services upon. DPLA’s hubs and similar institutions with major repository needs are using aging tools for the evolving needs of stewarding their digital assets, and for the entirely new job of making digital cultural heritage materials first-class citizens of today’s web. Overall, these gaps make management of digital collections and their metadata both more complicated and more expensive for organizations like DPLA, as well as for all organizations seeking to share and add value to their content.

In a survey of DPLA hubs conducted late last year, nearly half of the respondents noted they or their partners are considering implementing a new digital collections management system in the near future. Several hubs also identified the importance of providing tools that facilitate easy integration of metadata enriched by DPLA back into their systems, both at the hub and partner level. Numerous hubs highlighted that they or their partners were actively seeking to move away from CONTENTdm, but as yet there were no “obvious choices” in terms of replacement systems. One hub noted that their possible move away from CONTENTdm was complicated by their reluctance to invest development time and resources to install, configure, and migrate to an immature open source solution. In short, there is a dissatisfaction with the current state of digital collection storage and display, and a great willingness, albeit tempered by the hurdles of any locally created solution, to move to a new environment.

4 of 27

The Vision

A product and service that is easy to use, easy to integrate, and that will:
Reduce barriers (including cost) to DPLA contribution
Allow digital collections to be not just on the web, but of the web
Expand and diversify both the DPLA and Hydra communities

Clearly, cultural heritage institutions are actively searching for new solutions that are easy to use, easy to integrate, and offer best-of-breed technologies and methods matching current web environments and evolving curation workflows. Widespread adoption of new and emerging standards across the DPLA network by current and potential hubs, and other content partners, will greatly reduce the technical and economic barriers to making their cultural heritage resources easily repurposable, aiding aggregation and discovery both within and outside the context of DPLA. Availability of a robust, turnkey open source solution for digital collections management will not only make it significantly easier for organizations to share their content with aggregators like DPLA, but will also allow digital collections to be not just on the web, but of the web. In other words, improving and modernizing core infrastructure for digital collections will allow the cultural heritage sector to build on advances in technology not only in libraries, archives, and museums, but across the web at large. By forming a tripartite partnership to address these increasingly pressing needs, DPLA, DuraSpace and Stanford intend to produce a turnkey Hydra-based solution for the management of digital collections that can be widely and easily adopted by institutions nationwide.

We envision that Hydra in a Box will allow institutions with aging installations of traditional software to have access to a best-of-breed replacement, one that is not only free and open source but also supported by a diverse, active community. In addition, the project will provide an easier technological pathway for institutions to join DPLA; the turnkey solution will be recommended to new service hubs, as well as established ones—the state and regional digital libraries that power DPLA’s network. Because this project brings together a number of promising next-generation software initiatives, it can serve as a magnet for additional development around advanced services for digital collections and metadata management. By providing hosting options, this project will enable a simple way for institutions without significant technical staffing or infrastructure to benefit from Hydra and to participate in DPLA. A hosted model also benefits the Hydra project not only through its expansion of the framework’s user base, but through the potential to broaden the types of interests and capacity represented within the Hydra community.

5 of 27

Project Partners

DPLA: Experience with metadata aggregation; Network of hubs & partners - many are looking for a modern solution to manage and provide access to digital content; commitment to open access, and encouraging access and reuse of cultural heritage content; goodwill within community

Stanford: Institution with a repository and firsthand repository needs; motivation to accelerate mainline Hydra development to meet Stanford needs; longterm experience with, and deep commitment to, Hydra (both the tech and community)

DuraSpace: The success of Fedora is linked to the success of Hydra; help open source software communities be sustainable; six years experience building and supporting scalable cloud-based, hosted applications; bring groups together to achieve common goals

IMLS: Funder of the project, through the National Leadership Grants program.

Hydra community: Over 60 partners and known users. 48 licensed contributing institutions, 180 licensed individual contributors. Not a directed project. Investment in a framework, not an application; contributions back to core code base; Investment in a community, not a vendor: Contributions back to community: training, documentation, modeling, sharing best practices, outreach. Travel & Face time commitments.

6 of 27

Project Goals

Development of turnkey, Hydra-based application that leverages and improves on core components
Development/integration of metadata aggregation & enrichment tools
Connect components with DPLA hubs, current Hydra partners, and prospective Hydra adopters
Work toward a hosted service

The grant project has four goals: to produce a polished, feature-complete, easy-to-install, turnkey, Hydra-based application for next-generation digital asset management; to improve and generalize DPLA’s metadata harvesting and enrichment tools into more reusable components, lowering the bar for DPLA contribution for users of the turnkey system and other Hydra applications; to connect these key infrastructural pieces with DPLA hubs, current Hydra partners, and prospective Hydra adopters, creating a vibrant, participatory community of adopters and contributors; and to work toward a hosted service, offering a cloud-based version of the application for use across multiple domains.

These goals strongly support the fundamental objectives of developing a national digital platform. The project will equip libraries, archives, museums and cultural heritage institutions with advanced digital repository capabilities; enable much more effective publication, aggregation, discovery, and reuse of digital content via DPLA; help form a global network of interoperable digital content; and forge a robust community of common practice among institutions charged with stewarding and serving their digital assets.

7 of 27

Timeline

May 2015-November 2017 (30 months)
Design process: May 2015-March 2016
Development: March 2016-November 2017
Service development and community engagement: throughout project

8 of 27

Design Phase

Discovery Phase (Fall 2015)

Literature review and product/service analysis
Surveys, interviews, and focus groups
Community outreach

Information Architecture (Winter 2016)

User requirements and personas
Requirements - functional & technical
Models and wireframes

Visual Design (Spring 2016)

Since its conception, the project has taken an intentional, informed approach to designing the Hydra in a Box product. We set out to understand as much as possible about the current digital repository landscape, and the needs of its anticipated users. To accomplish these goals, we completed several distinct threads of work. We have undertaken an environmental scan, including a literature review and analyses of several leading existing repository products and services, to better understand the current state of digital repository systems.

To gain more knowledge about the institutions that might be interested in adopting Hydra-in-a-Box, including the types of content they manage, the pain points of their current systems, and the features they’d most like to see in a new repository system, we also conducted both a quantitatively-oriented web-based community survey and a set of qualitative individual interviews and focus groups.

Additionally, we have had continual engagement with the cultural heritage community through the DPLA network, Hydra community, Digital Library Federation membership, DuraSpace membership, and other professional communities.

The Discovery phase is followed by the Information Architecture and Visual Design phases, where the team analyzes and processes this information to develop requirements and user personas, models and wireframes, all leading to the visual design.

9 of 27

Design Phase


10 of 27

Key Areas of Progress

Design, Requirements and Specifications team:

Community survey insights
Analysis of user interviews, focus groups
Content types requirements for data modeling

Much of the work that we’ve undertaken thus far has been done under the auspices of one of the project’s subteams, the Design, Requirements and Specifications team.

The team is myself, Gary Geisler, Audrey Altman, and Gretchen Gueguen.

The majority of this work relates to the detailed identification and refinement of the requirements for the products we’ll be developing. There are a few different areas where we’ve made significant progress so far.

Survey analysis - leading to key insights

We’ve also been going a bit deeper into the analysis of feedback and potential requirements through the interviews and focus groups. These have allowed us to have sustained conversations with potential adopters and stakeholders, and to get a more complete understanding of how digital collections are managed in different institutions and why.

Through this process, we’ve also begun to develop a set of defined content types that will serve as potential targets for development.

I’m going to briefly discuss these areas of progress, followed by an overview of our development plans and work.

11 of 27

Community Survey

256 complete responses
311 repositories
Mostly small, US academic libraries

12 of 27

Survey Insights

Expectations of our project
Satisfaction levels

Users of hosted services tend to be more satisfied than users of local deployments

Strengths and weaknesses of existing repository options
53% plan to migrate to another system

Most to a Fedora-based solution
Rest are “not sure” what’s next

13 of 27

User Interviews

Completed 21 individual or small-group interviews and 4 focus groups

55 individuals in total
46 institutions in the US and Canada
29 hours of recorded content

Interviews held either in-person or through videoconference; focus groups held in-person
Coded and analyzed process to further identify potential requirements

14 of 27

Content Analysis Visualizations

15 of 27

Interviewee’s Notable Quote

“... How many of these different systems do you need? You can have your digital collections with images and documents, you can have your IR, you can have your digital preservation system, and you can add Omeka on top of that to do exhibits. It's just too much to have four or five different systems.”

16 of 27

Content Type Analysis


17 of 27

Early Technical Exploration

Deploying to the Cloud

Leverage services for institutions without local infrastructure

Simplifying installation and configuration

Users should not need to be technical to set up and maintain an instance

Determining a starting point for application development

Build on existing community-based work if possible
Sufia 7.0 - actively under development

What is different -- new areas to explore -- about Hydra-in-a-Box?

Sufia 6.0 is the most widely implemented Hydra app. It was developed originally at Penn State to serve as a repository front-end, focused especially on supporting the management and sharing of institutional repository collections.

Since 2012, over 30 institutions have contributed to the code

Healthy list of features and functionality, including: multi-file or folder upload, flexible user- and group-based access controls, user dashboard for managing collections, forms for batch-editing of metadata, Google Scholar support, faceted search and browse, full text indexing and search, and much more.

One year ago we saw the emergence of the PCDM data model -- motivated by the introduction of Fedora 4 and its support for RDF. The Hydra community saw the need for a data model that supported RDF best practices, encouraged interoperability and adoption through broad use case support,

Sufia 7.0 will be the first Hydra app that supports Fedora 4 and PCDM. Rose to the top as a clear starting point for Hydra-in-a-Box whose goal is to

18 of 27

Repository Development

Assembled an all-star technical team

10 Engineers: software development, data modeling, development operations
Contributions from other institutions (Penn State, maybe others)
Led by Michael Giarlo

First work cycle: March - June 2016

Series of one-week sprints
Recorded demos of iterative progress, available to the public

First milestone: Deploy our application based on Sufia 7 to the cloud

Priority content types
Configuration UI
Administrative dashboard
Batch import

Assembled an all-star team, many seasoned veterans

of the Hydra dev community (Stanford)
of data modeling and metadata aggregation (DPLA)
of working with the cloud (DuraSpace)

We are organizing our work in one-week sprints, and will have open demos on Friday mornings (Pacific time) for those who are interested in keeping up with our progress. We also will be recording these demos. Links to the live demos and recordings will be shared on community mailing lists.

Our first milestone is June 1st, 2016 -- in time for a demonstration of Hydra-in-a-Box at Open Repositories 2016 -- by which time we aim to deploy an application based on Sufia 7 to the cloud (Amazon Web Services), which:

* supports the management of top-priority content types: image-based works and generic multi-file works (think IR-like use cases)

* offers a user interface for dynamic configuration of the application (without having to touch Ruby or YAML files)

* provides an administrative dashboard with reporting widgets

* allows for batch import of content and metadata

19 of 27

Follow our progress

20 of 27

Aggregator Needs

More flexible mapping standard than XSLT
Ability to harvest from multiple sources
Reconciliation services that utilize linked data
Enhanced quality control tools
Ability to normalize and create consistencies in data values
Easily get data in and out
Robust enough to handle multiple feeds and multiple sources
Processes to move data from one repository to another resemble aggregation workflows

As was mentioned, several of the interviews focused specifically on metadata aggregation, and several others addressed it among more general repository needs.

The key findings in this area were that:

Aggregators wanted tools that used more flexible mapping standards than XSLT, which is complex but limiting

They wanted the ability to easily harvest from multiple types of sources (OAI feeds, spreadsheets, APIs, etc.) using the same harvesting and mapping tools

They wanted to be able to reconcile that data with vocabularies from linked data endpoints and to be able to begin integrating URIs with their data

They needed more and better quality control tools

These include an increased ability to normalize and create consistency in their data.

Aggregators need to be able to easily get data in and out of their systems and aggregation tools

And they need the tools to be robust enough to handle multiple feeds and multiple sources at once.

Lastly, we found that many of the processes people described to us when talking about managing metadata in more than one repository, or doing some type of bulk ingest of metadata, resembled aggregation workflows. There was a need to gather metadata, map it to a new standard, and then send it somewhere else.

Because of this overlap we are investigating ways the two products can work with and complement each other without needing to be inter-dependent.
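To make two of the needs above concrete (normalizing data values, and reconciling them against a vocabulary so that URIs can be attached), here is a minimal Python sketch. It is an illustration only, not part of the project's tooling: the vocabulary, its example.org URIs, and the function names are all hypothetical, and a real reconciliation service would query a linked data endpoint rather than a local dictionary.

```python
import re

# Hypothetical local vocabulary: normalized label -> URI (placeholder URIs).
VOCAB = {
    "boston": "http://example.org/place/boston",
    "new york": "http://example.org/place/new-york",
}


def normalize(value):
    """Collapse runs of whitespace, drop trailing punctuation, and lowercase."""
    value = re.sub(r"\s+", " ", value).strip()
    return value.rstrip(".,;:").lower()


def reconcile(values, vocab=VOCAB):
    """Split raw values into those matched to a URI and those left for QA."""
    matched, unmatched = {}, []
    for raw in values:
        key = normalize(raw)
        if key in vocab:
            matched[raw] = vocab[key]
        else:
            unmatched.append(raw)
    return matched, unmatched


matched, unmatched = reconcile(["  Boston. ", "New  York", "Bostn"])
```

Values that fail to match are returned rather than discarded, so they can feed the kind of quality control review the interviewees asked for.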

21 of 27

DPLA’s Aggregation System, Heiðrún

Three Main Functions

Harvest

Source agnostic

Map

Mapping DSL expressed in Ruby
Maps to RDF triples

Enrich

Modular enrichments written to normalize and enhance data

22 of 27

[Diagram: Aggregator. Processing steps: harvest, map, enrich. Stores: original record data store, partner data store, Marmotta. Profiles: institution profile, mapping, enrichment profile. User interface: dashboard, QA. Outputs: staging, production.]

This is a high-level diagram of the system. It starts by performing a harvest activity, which uses a profile for the data source that includes harvest parameters such as the source feed or, for OAI, the sets to harvest.

The original record is stored at this point, and will stay stored in its original state until a new harvest activity deletes or overwrites it.

Next, the mapping is run on the records. Mappings are distinct for each institution, allowing for flexibility. At this point a new RDF record based on the DPLA MAP is created and stored in Marmotta, our RDF triplestore.

Finally, the mapped records are enriched according to a profile that invokes the various enrichment modules we wish to run. These enrichments update the mapped records, so in the end we only have stored the mapped and enriched record and the original record.

There are a series of user interface elements that allow for the running of processes, QA, and finally indexing to DPLA staging and production environments.
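The flow just described can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Heiðrún's actual code (whose mapping DSL is written in Ruby): the class, method, and field names are hypothetical, plain dictionaries stand in for the original record store and for Marmotta, and the target property names are only examples.

```python
from dataclasses import dataclass, field


@dataclass
class Aggregator:
    """Toy model of the harvest -> map -> enrich flow (all names hypothetical)."""

    original_store: dict = field(default_factory=dict)  # records as harvested
    mapped_store: dict = field(default_factory=dict)    # stands in for the triplestore

    def harvest(self, records):
        # A new harvest overwrites any earlier copy of the same record.
        for rec in records:
            self.original_store[rec["id"]] = dict(rec)

    def map_records(self, mapping):
        # `mapping` is per-institution: source field -> target property.
        for rec_id, rec in self.original_store.items():
            self.mapped_store[rec_id] = {
                target: rec[source]
                for source, target in mapping.items()
                if source in rec
            }

    def enrich(self, enrichments):
        # Enrichments update the mapped record; the original is left untouched.
        for rec_id, rec in list(self.mapped_store.items()):
            for enrichment in enrichments:
                rec = enrichment(rec)
            self.mapped_store[rec_id] = rec


def strip_whitespace(record):
    """Example enrichment module: trim stray whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


agg = Aggregator()
agg.harvest([{"id": "rec1", "title": "  Map of Boston ", "place": "Boston"}])
agg.map_records({"title": "dcterms:title", "place": "dcterms:spatial"})
agg.enrich([strip_whitespace])
```

After the run, only two versions of each record exist, as in the description: the untouched original and the mapped, enriched record.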

23 of 27

[Diagram: Aggregator, variant with export. Processing steps: harvest, map, enrich. Stores: original record data store, original data store, Marmotta, exported data store. Profiles: institution profile, mapping, enrichment profile. User interface: dashboard, QA.]

24 of 27

Roadmap

Completing requirements now
April - July

Design remaining infrastructure
Develop user interface requirements further

August - November

Develop dashboard tools
Analyze convergence points with repository
Plan for improvements to QA interface
Begin User Testing

November - March 2017

Develop QA improvements
Refine interfaces and infrastructure
Implement job scheduling

The development plan for the aggregator is separate from the development plan for the repository, but somewhat concurrent. We are currently completing the requirements and developing things like personas and wireframes for the aggregator.

From the spring through the summer we will be working on the remaining components of the infrastructure that have not been developed and finalizing requirements and design.

Late summer and fall will include developing the dashboard that was shown in the previous models, but not yet implemented, as well as increasing the QA tools. We will seriously analyze where the aggregator tools and the repository tools can build off of one another during this time. We will also begin user testing of the prototype.

And in late fall through the following spring we hope to continue to work on those QA tools and implement refinements based on the user testing as well as implementing job scheduling tools that will allow the dashboard user to schedule processes for future or recurring dates.

25 of 27

Developing a Hosted Service

Project partners collaborating to develop requirements for a cloud-hosted service based on the repository product under development
Market research underway, starting with analysis of information discovered during the design phase
Evaluating tiered service models depending on needs of potential adopters
Significant technical work to focus on developing a shared, maintainable, and scalable service

26 of 27

More Information

Visit our website and blog: http://hydrainabox.org/

Follow us on Twitter: @HydraInABox

Public information list

hybox-info@googlegroups.com

Contact us

hybox-contact@googlegroups.com

27 of 27

Thank You!

Hannah Frost
hfrost@stanford.edu
@feefifofannah

Gretchen Gueguen
gretchen@dp.la
@G_AmSpinnrade

Mark A. Matienzo
mark@dp.la
@anarchivist

http://bit.ly/dplafest-hybox