1 of 22

Case Study: Oregon Digital Migration from CONTENTdm to Hydra for digital collections

Hydra Connect 2016, Boston Public Library

Julia Simic & Linda Sato, University of Oregon Libraries

Ryan Wick & Margaret Mellinger, Oregon State University Libraries

2 of 22

3 of 22

From CONTENTdm to Hydra

4 of 22

Hydra Prototype

  • All descriptive metadata fields were in RDF, but every value was a string
  • Collection (set) landing pages and additional information pages
  • Items could belong to multiple Collections (sets)
  • Zoomable image viewer using IIP server and OpenSeadragon UI
  • PDF/Document viewer using Internet Archive BookReader
  • At the time, most in the community were not building repositories with many different item types or linked data.

5 of 22

After the Prototype

  • Decided to use deep RDF (URIs) where possible
  • Needed to write RDF code that didn’t exist elsewhere, such as fetch and label handling; some of it became the ActiveTriples gem
  • Set up a separate server for derivatives, ingest, fetch and index jobs
  • Needed to have CONTENTdm item short URLs and viewer URLs redirected
    • CONTENTdm URL/ID is stored with each new item; a Rake task can export all of them and generate mapping files
    • Had some issues with map size until we used nginx
  • Audio player - HTML5
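The redirect bullets above can be sketched as follows. This is a minimal illustration, not the project's Rake task: the row layout, the `/cdm/ref/collection/...` source path, the `/catalog/...` target path, and the `$new_location` variable name are all assumptions made for the example. It renders exported old/new pairs as an nginx `map` block.

```python
def nginx_map_lines(pairs, variable="$new_location"):
    """Render (old_path, new_path) pairs as an nginx `map` block."""
    lines = [f"map $request_uri {variable} {{", '    default "";']
    for old_path, new_path in sorted(pairs):
        # Quote both sides so characters such as ":" are taken literally.
        lines.append(f'    "{old_path}" "{new_path}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical export rows: (CONTENTdm collection alias, item number, new item ID).
rows = [
    ("examplecoll", 12, "oregondigital:df715547z"),
]

# Both URL shapes below are assumptions for illustration only.
pairs = [
    (f"/cdm/ref/collection/{alias}/id/{number}", f"/catalog/{new_id}")
    for alias, number, new_id in rows
]

print(nginx_map_lines(pairs))
```

A server block could then redirect whenever the mapped variable is non-empty. With tens of thousands of entries, the size of the map becomes a configuration concern in its own right.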

6 of 22

Migration Workflows

7 of 22

Examine CDM desc.all files → Regularize predicates/field names → Find or publish LD predicates → Create YAML mapping files

Examine object data → Object normalization → Find or publish LD object data → Create equivalents lists

cdm2bag ingest

230,000 assets in 68 collections

8 of 22

Examine CDM desc.all files

Regularize predicates/field names

Find or publish LD predicates

  • Exported CONTENTdm metadata by collection
  • Examined field names and content to determine needed predicates
  • Regularized and mapped field names to existing LD predicates
  • Published new predicates in opaquenamespace.org
    • Originally JSON-LD files on GitHub
    • Now published using the Controlled Vocabulary Manager
  • Deleted fields that were used only internally or were specific to CDM
  • Determined if predicates should require string or URI objects within Oregon Digital
  • Created mapping files (YAML)
  • Documented in Metadata Dictionary

Create YAML mapping files
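A minimal sketch of what a field-to-predicate mapping might look like, assuming a much simpler shape than the real YAML files. The mapping is written as a Python dict to stay self-contained. The field names come from a sample record, and the `"object"` key (string or URI) and the `None` convention for dropped fields are hypothetical.

```python
# Hypothetical mapping: CONTENTdm field name -> predicate and required object type.
# None marks a field that is CDM-internal and should be dropped.
FIELD_MAP = {
    "Title": {"predicate": "http://purl.org/dc/terms/title", "object": "string"},
    "LC Subject": {"predicate": "http://purl.org/dc/terms/subject", "object": "uri"},
    "CONTENTdm number": None,
}


def map_record(record, field_map=FIELD_MAP):
    """Return (mapped triples, unmapped field names) for one exported record."""
    mapped, unmapped = [], []
    for field, value in record.items():
        if field not in field_map:
            unmapped.append(field)  # needs a predicate before ingest
            continue
        rule = field_map[field]
        if rule is None or not value.strip():
            continue  # dropped field or empty value
        mapped.append((rule["predicate"], value.strip(), rule["object"]))
    return mapped, unmapped


record = {
    "Title": "Mika-in-a-bag",
    "LC Subject": "American shorthair cat",
    "CONTENTdm number": "42",
    "Notes": "not yet mapped",
}
mapped, unmapped = map_record(record)
print(mapped)
print(unmapped)
```

Reporting unmapped fields instead of silently skipping them makes it easier to spot field names that still need regularizing.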

9 of 22

Examine object data

Object normalization

Find or publish LD object data

Create equivalents lists

  • Examined all Object data per collection
  • Fixed typos, delimiter errors, etc. using scripts and by hand
    • Non-UTF-8 diacritic problems
  • Looked for usable LD vocabularies
    • Used a combination of scraping and hand gathering
  • Defined predicates to validate only chosen vocabs
  • Published new vocabularies in opaquenamespace.org (JSON-LD→CVM)
  • Created lists of old entries and new URI values
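The equivalents lists and vocabulary checks above can be illustrated with a short sketch. It assumes that old entries are matched after Unicode and whitespace normalization and case folding, which is a choice made for this example. The `example.org` URIs and the names `EQUIVALENTS` and `ALLOWED_PREFIXES` are hypothetical.

```python
import unicodedata

# Hypothetical equivalents list: normalized old entry -> new URI value.
EQUIVALENTS = {
    "american shorthair cat": "http://example.org/vocab/subject/american-shorthair-cat",
}

# Hypothetical rule: this predicate accepts URIs from these vocabularies only.
ALLOWED_PREFIXES = ("http://example.org/vocab/subject/",)


def normalize(value):
    """Compose diacritics (NFC), collapse whitespace, and fold case."""
    composed = unicodedata.normalize("NFC", value)
    return " ".join(composed.split()).casefold()


def to_uri(value, equivalents=EQUIVALENTS, allowed=ALLOWED_PREFIXES):
    """Return the URI for an old entry, or None if it still needs review."""
    uri = equivalents.get(normalize(value))
    if uri is None:
        return None
    if not uri.startswith(allowed):
        raise ValueError(f"{uri} is not from an accepted vocabulary")
    return uri


print(to_uri("  American  Shorthair cat "))
print(to_uri("Tabby cat"))
```

Returning `None` for unknown entries keeps the hand-review queue separate from entries that map to a URI from the wrong vocabulary, which raise an error instead.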

10 of 22

cdm2bag ingest

  • The cdm2bag Ruby script processed desc.all files, mapped fields, and output bags
  • Replaced object data using a combination of mapping methods and lists
    • BagIt utilized to output one asset + metadata per bag
  • Bad mappings, bad predicates, or invalid objects stopped the ingest
    • Problems with deep RDF predicates
    • Problems with unresolvable object URIs
  • Some high resolution files were missing and/or never existed
  • CDM full-resolution links were corrupted when assets were moved on servers and the links were not updated
  • Processing derivatives was very slow, especially for large PDFs and TIFs
    • Considered pre-processing derivatives
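To illustrate the "one asset + metadata per bag" layout, here is a minimal sketch that writes a bag by hand using only the standard library. It is not the cdm2bag script, which was written in Ruby. The file names, the choice of SHA-256, and the BagIt 1.0 declaration are assumptions for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def write_bag(bag_dir, asset_name, asset_bytes, metadata_name, metadata_text):
    """Write a minimal bag: data/ payload, bagit.txt, and a SHA-256 manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    payload = {
        asset_name: asset_bytes,
        metadata_name: metadata_text.encode("utf-8"),
    }
    manifest = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        # Manifest line: checksum, then the path relative to the bag root.
        manifest.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}")

    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest) + "\n", encoding="utf-8"
    )
    return bag


# Hypothetical asset and metadata file names.
with tempfile.TemporaryDirectory() as tmp:
    bag = write_bag(Path(tmp) / "mika6", "mika6.jpg", b"fake image bytes",
                    "metadata.nt", "# descriptive metadata would go here\n")
    print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()))
```

Keeping each bag to a single asset means a failed ingest can be retried per item rather than per collection.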

11 of 22

Post-Migration Cleanup

12 of 22

Post-Migration Cleanup

13 of 22

Post-Migration Cleanup

14 of 22

Bad Assets

15 of 22

Final Review: Image filename problems

SQKH_00002_KV

16 of 22

Final Review: Image problems

17 of 22

Mountain West Digital Library

Correctly formatted dates

2015

2015-02

2015-02-26

any of the above as a range: 2015/2016

or as a series: [1801, 1926]

What we found:

ca. 1915

1877, c. 1885

c. 1900-1909

1913?

Decmeber 29, 1955

Summer 1957

1948, dismantled 1983

27-Aug-80

February 1975

1911-11 - 1912-08

1937 and later

1883, hospital opened

Kenneth Gunn

1925, 1928, completed 1948
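A check for the accepted date formats listed above could be sketched as below. This is an illustration based only on the patterns shown on this slide (year, year-month, full date, a `/` range, and a bracketed series). The function names are hypothetical and the aggregator's actual rules may be stricter.

```python
import re
from datetime import date

# YYYY, YYYY-MM, or YYYY-MM-DD
_SINGLE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def _valid_single(text):
    """True if text is a real calendar year, year-month, or full date."""
    match = _SINGLE.fullmatch(text)
    if not match:
        return False
    year, month, day = match.groups()
    try:
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False  # e.g. month 13 or February 30
    return True


def is_valid_date(value):
    """Accept a single date, a start/end range, or a [a, b, ...] series."""
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        parts = [part.strip() for part in value[1:-1].split(",")]
    elif "/" in value:
        parts = value.split("/")
        if len(parts) != 2:
            return False
    else:
        parts = [value]
    return all(_valid_single(part) for part in parts)


for sample in ["2015-02-26", "2015/2016", "[1801, 1926]",
               "ca. 1915", "27-Aug-80", "1911-11 - 1912-08"]:
    print(f"{sample!r}: {is_valid_date(sample)}")
```

Running a check like this over every date field before ingest gives a list of values that need hand review.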

18 of 22

Lessons Learned

  • Metadata cleanup and review require significant resources; have metadata people closely involved with developers from the beginning
  • Leadership needs to have all impacted staff on board
  • Ensure that original media files are named correctly and accessible
  • Linked data encourages better quality but can be time-consuming when the metadata is poor; tools have improved over the years

19 of 22

Lessons Learned continued

  • Being involved in the Hydra community is very important: share work and solve problems together
  • Devote development resources to migration tooling and QA
  • Compound objects are complicated

20 of 22

Resources

21 of 22

Oregon Digital Hydra Team, Past & Present

Oregon State University

Evviva Weinraub, Trey Pendragon, Tom Johnson, Ryan Wick, Mike Eaton, Brandon Straley, Greg Luis Ramírez, Josh Gum, Ryan Ordway, Maura Valentino, Brian Davis, Erin Clark, Michael Boock, Margaret Mellinger, Chris Petersen, Trevor Sandgathe, Hui Zhang, Susan McEvoy, Helena Bales

University of Oregon

Karen Estlund, Sheila Rabun, David McCallum, Julia Simic, Jeremy Echols, Duncan Barth, Sarah Seymore, Linda Sato, Kate Jones

22 of 22

Questions?

Title: Mika-in-a-bag
LC Subject: American shorthair cat
Photographer: Rabun, Sheila
Condition Of Source: adorable
Identifier: mika6
Type: Image
Format: image/jpeg
Has Version: Sheila's cat
Rights: Rights Reserved - Restricted Access

oregondigital:df715547z