1 of 40

Mother spreadsheet, father template:

Unconventional ways for massive XML/RDF/structured data generation from flat sources

OR 2018, Bozeman, Montana, USA

Diego Pino Navarro

Metropolitan New York Library Council

@dpinonavarro

2 of 40

Refreshing some concepts

  • Hierarchical data/metadata
    • Deeply nested metadata XML Schemas like MODS (Poor man's RDF)
    • Deeply nested meta-meta-data schemas like METS and FOXML (Matryoshka Syndrome)
    • Tree-shaped data, like acyclic graphs, RDF, GraphQL
    • JSON/JSON-LD documents like IIIF Manifests
    • HTML
    • Query languages (any)
  • Templating Engines (or processors)
    • Software that mixes a data input with a template to generate an output document (a minimal sketch follows this list)
  • Tabular data
    • Two-dimensional data structure (matrix) with columns and rows.
    • A single row = many key/value pairs. Many rows.
  • Spreadsheet processor
    • Software that allows you to manipulate Tabular data
  • Fedora Commons
    • Popular Digital Objects Repository
  • Islandora
    • Drupal based Repository ecosystem that uses Fedora Commons as a backend
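
A minimal sketch of how the templating and tabular ideas above combine (the column names and values here are invented for illustration): a templating engine takes one tabular row, exposed as key/value pairs, plus a template, and renders an output document. In Twig:

    {# Hypothetical row: data = { title: 'Letters from Bozeman', year: '1889' } #}
    <record>
      <title>{{ data.title|trim }}</title>
      <dateIssued>{{ data.year }}</dateIssued>
    </record>
    {# Renders roughly as:
       <record><title>Letters from Bozeman</title><dateIssued>1889</dateIssued></record>
    #}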

3 of 40

What is this about?

This presentation is about:

  • Metadata and data, fields and values. Tables and Graphs. Tools and Skills
  • Understanding human limitations
  • Reusing what people know to help them reach a certain unknown
  • Finding a common denominator / place that feels like home in your Metadata ingest workflow

4 of 40

What is this also about?

It is about being simple, being pragmatic and getting the job done

It is about being happy, or at least, less miserable while generating good, quality Metadata

It is about helping repositories stay sustainable by keeping a constant flow of ingests happening

5 of 40

It is about ingesting Digital Objects, images, metadata and data into repositories without losing any of it

6 of 40

The shape of {meta}data

7 of 40

Axiom 0 wonder-metadatian

To be able to generate quality {meta}data to today's standards, you need to manage many of the following technologies/skills/tools:

JSON-LD/JSON, Regular expressions, OpenRefine, Bash, at least one Repo flavour, SPARQL, SQL, Character Encodings, XML, XSLT 1.0 and 2.0, MODS 3.5/3.6, DC, QDC, OAI-PMH, Python, RDF/RDFa/n-turtle, WIKIDATA, MARC21, IMAGEMAGICK, UNIX, ZIP formats, Spreadsheet Processors, Solr

Feel free to add more to this list: DP.LA requirements, Citation formats, Color spaces? TECHMD? Provenance?

8 of 40

Axiom 1 repo-splaining

Repositories have a strong opinion about how your {meta}data should look (and they did NOT ask you if you were OK with it)

9 of 40

Axiom 2 “I say tomato, you say tomatillo”

Repositories expose {meta}data using predefined, well-known, well-accepted formats and schemas. These normally match external interaction requirements and demand specialized knowledge from you to be able to interact, e.g. XML/Turtle with MODS, METS, DC, EDM.

I say expose because luckily most humans don’t need to interact with internal/storage representations

10 of 40

Axiom 3 “The meat grinder”

Ingesting {meta}data in a repository involves pain. To avoid chronic pain and premature aging, we try to do this either:

  1. Using the native Repository-provided tools
  2. Preparing everything in a format that is natively understood by the system

“Ingesting” is used loosely here; it could also be smashing, squishing, mixing, blending, cutting or baking

11 of 40

Axiom 4 “The rented DeLorean”

Not all {meta}data is created contemporaneously with your shiny new repository. We deal with legacy or out-of-our-control data all the time

Also, not all archiving, describing and cataloging happens with a repo in mind, even when one is already in place

12 of 40

Theorem 1 “Honey, I Shrunk the Digital Objects”

Based on all the previous Axioms we can safely say that {meta}data workflows are driven by end formats, APIs and after-end requirements. Data destruction, reduction and compression happen all the time just to get things done. All this is a multi-step, multi-tool, messy operation that demands a lot from you, with no Ctrl+Z.

Compression is used here in a semantic sense, not zip/unzip related

13 of 40

Theorem 2 “Golden Hammer”

Based on no Axiom really, but on daily experience:

Once you learn to use a Hammer, you will try to “hammer” your way through life as much as possible. This happens with technologies like XSLT or standards like MARC21

Normally you choose a few trustworthy Skills from Axiom 0 and apply them everywhere

14 of 40

Case Study

Ingesting Digital Objects to DCMNY.org

Islandora 7.x / Fedora 3.8.x

The Need:

Create a Simple and efficient Ingest Workflow that integrates:

  1. Metadata Creators
  2. Our particular Mixed-Source Collections
  3. DP.LA metadata guidelines

15 of 40

To understand the problem we need to understand the final shape of our Objects

16 of 40

The Fedora/Islandora Digital Object

(da shape)

A shallow aggregation of Metadata and Resources

17 of 40

The Fedora/Islandora Digital Object Hierarchy (tree)

An RDF Graph of related DOs (Digital Objects).

Relationships are given by CMODELs

18 of 40

Islandora’s basic ingest Options

  1. Single Object (hand crafted): using XML and multi-step forms via the UI
  2. Multiple objects (sausage factory): using Islandora Batch and Content Model (CModel) specific modules. Specific structure and pre-built XML documents
  3. Each CModel can eventually have a different requirement

It is not a lack of DEV effort. It is just complex.

19 of 40

Islandora’s Ingest Problem

Ingesting content is complex and slow

Preparing {meta}data and Hierarchy

  • Time consuming
  • Prone to human error
  • QA difficult once in ingest format (Single MODS XML + Master Binary)
  • Workflow difficult to document and recreate.
  • Either too static (all looks the same) or a lot of tooling for each use case
  • Steep learning curve
  • Difficult to delegate parts of this to third parties (all or nothing)

20 of 40

Our first attempt

(Golden Hammer)

21 of 40

Things we got right

Spreadsheet!

  1. Could be given to creators
  2. QA was relatively simple

22 of 40

Things we did as well as we could

  1. OpenRefine is complex
  2. We were stuck with fixed naming conventions
  3. We had to process GB-sized XML files in memory via XSLT
  4. We had to split the XML and name each file to match its binary (XML == Binary)
  5. We could only ingest a single CMODEL at a time
  6. No control over the rest of the datastreams
  7. We either had upload limits or gave users ssh access

23 of 40

We needed something better

We wanted:

Full control over how Objects are built

Full control over how Objects are related to each other

Lightning-fast XML generation

Flexible Binary Sources

Replicable Workflow

UI all over the place

Integrated Workflow

Smart Islandora Integration

Inline QA and Data shaping

Flexibility

Weekends

24 of 40

Islandora Multi Importer

Was brewed

(IMI)

Deny Theorem 1

Embrace Theorem 2

Safely integrate our Data creators

Ingest tons of material

https://github.com/mnylc/islandora_multi_importer

Demo: https://vimeo.com/273217707

25 of 40

IMI uses Spreadsheets as input

But we went further:

Any tabulated format works: Excel, CSV, TSV, Google Spreadsheets via API access. You name it. (Really, there are not many more.)

There are no header-row naming restrictions; headers just need to match what our templates use.
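
As a small, hedged sketch (the header names below are invented), whatever you call a column is exactly what the template references, either with dot notation or with Twig's attribute() function for names that are not plain identifiers:

    {# Hypothetical header row: title, date_of_creation, personal_name_author #}
    <title>{{ data.title|trim }}</title>
    <dateCreated>{{ attribute(data, 'date_of_creation')|trim }}</dateCreated>
    {% for author in attribute(data, 'personal_name_author')|split(';') %}
    <author>{{ author|trim }}</author>
    {% endfor %}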

26 of 40

27 of 40

IMI generates XML via Twig templates

  • Websites serve 1000s of users in real time by rendering web pages on demand
  • HTML looks pretty similar to XML…
  • Symfony/Silex/Drupal 8 were using Twig already
  • Low footprint, compiled and cacheable!
  • A pseudo-language built to deal with data, similar to PHP
  • Can be tested outside Islandora. Can be stored and validated
  • Can be shared and augmented
  • Templates are easy to read, share and to understand

28 of 40

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="fn"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="2.0" exclude-result-prefixes="xs fn">
  <xsl:output indent="yes" encoding="US-ASCII" />
  <xsl:param name="pathToCSV" select="'file:///c:/csv.csv'" />
  <xsl:function name="fn:getTokens" as="xs:string+">
    <xsl:param name="str" as="xs:string" />
    <xsl:analyze-string select="concat($str, ',')" regex='(("[^"]*")+|[^,]*),'>
      <xsl:matching-substring>
        <xsl:sequence select='replace(regex-group(1), "^""|""$|("")""", "$1")' />
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>
  <xsl:template match="/" name="main">
    <xsl:choose>
      <xsl:when test="unparsed-text-available($pathToCSV)">
        <xsl:variable name="csv" select="unparsed-text($pathToCSV)" />
        <xsl:variable name="lines" select="tokenize($csv, ' ')" as="xs:string+" />
        <xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+" />
        <root>
          <xsl:for-each select="$lines[position() &gt; 1]">
            <row>
              <xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+" />
              <xsl:for-each select="$elemNames">
                <xsl:variable name="pos" select="position()" />
                <elem name="{.}">
                  <xsl:value-of select="$lineItems[$pos]" />
                </elem>
              </xsl:for-each>
            </row>
          </xsl:for-each>
        </root>
...

CSV to XML via XSLT

Dissecting

  • Requires XSLT 2.0
  • Requires a full path to a CSV
  • Requires that headers in CSV match XML elements we want
  • Assumes space separation
  • Uses a regex
  • Weirdly, no nested XML output
  • Pretty complicated to read
  • Output schema not clear
  • gosh

29 of 40

<?xml version="1.0" encoding="UTF-8"?>

{% block content %}

{% autoescape false %}

<mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd" >

<titleInfo>

<title>{{ data.title|trim|escape }}</title>

</titleInfo>

{% if attribute(data, 'personal_name_author') %}

{% for name in attribute(data, 'personal_name_author')|split(';') %}

<name type="personal">

<namePart>{{ name|trim|escape }}</namePart>

<role>

<roleTerm authority="marcrelator" type="text">author</roleTerm>

</role>

</name>

{% endfor %}

{% endif %}

....

</mods>

{% endautoescape %}

{% endblock %}

Twig template that generates MODS based on a single spreadsheet row of data

(you can see the MODS there, right?)

30 of 40

31 of 40

{% if attribute(data, 'personal_name_byrole') %}
{% for name_byrole in attribute(data, 'personal_name_byrole')|split(';') %}
{% set name = name_byrole|split('|') %}
<name type="personal">
  <namePart>{{ name[0]|trim }}</namePart>
  {% if name[1] is not empty %}
  <role>
    <roleTerm authority="marcrelator" type="text">{{ name[1]|trim }}</roleTerm>
  </role>
  {% endif %}
</name>
{% endfor %}
{% endif %}

Twig snippet that generates nested XML from a single cell

personal_name_byrole

Islandora dude|creator; Samvera friend |collaborator

32 of 40

<name type="personal">

<namePart>Islandora dude</namePart>

<role>

<roleTerm authority="marcrelator" type="text">creator</roleTerm>

</role>

</name>

<name type="personal">

<namePart>Islandora dude</namePart>

<role>

<roleTerm authority="marcrelator" type="text">collaborator</roleTerm>

</role>

</name>

<name type="personal">

<namePart>Dspace expert</namePart>

</name>

XML output for the previous input plus extra data in the same field

personal_name_byrole

Islandora dude |creator; Samvera friend |collaborator; Dspace expert

33 of 40

IMI is smart

  • Integrates with Islandora Batch system and creates “to be born” objects
  • Understands Islandora existing CMODEL blueprint objects.
  • Allows any existing CMODELs (even future ones)
  • Allows Complex Hierarchies to be created
  • Can read Binary Sources from ZIP files, remote HTTP, other streamwrappers (tmp://), local server paths, Amazon S3, Dropbox.
  • Is able to fall back in case of failure and try alternative solutions
  • Cleans up after itself
  • Has full UI
  • Has a Twig template editor and internal storage
  • Can be extended
  • Has a knot breaker
  • Discards invalid Objects
  • Is open source and maintained

34 of 40

IMI helps with housekeeping

  • IMI can also update existing Objects
  • Full control of what gets updated and how RDF is built
  • IMI can transmute Content Models: today an Image, tomorrow a Postcard

35 of 40

36 of 40

Why Use Multi-importer (What others say about it)

  • UI driven integrated workflow for ingest and update
  • Metadata Cleanup: export your MODS metadata as CSV via Solr, clean it up, then update the objects' MODS datastreams by recreating them with Twig (see the sketch after this list)
  • To ingest different content types at the same time including hierarchies, like collections inside collections with compounds and books, etc
  • To avoid having to follow strict naming conventions and folder structure dictated by many Islandora batch ingest processes
  • Selectively choose which derivatives you want to create and upload
  • To avoid the OpenRefine/XSLT approach to creating MODS from CSVs
  • To take advantage of the Twig Templating system for creating MODS from CSVs
  • To preview the MODS output easily
  • Supports integration with Google Spreadsheets, Zip/Local/Amazon storage and Complex storage needs via hooks.
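
As a hedged sketch of the Metadata Cleanup round trip mentioned above (the column name and cell value are invented), the same split pattern shown earlier can rebuild repeated MODS elements from a cleaned, multi-valued cell:

    {# Hypothetical cleaned cell, column subject_topics: "Montana; Libraries; Open Repositories" #}
    {% if attribute(data, 'subject_topics') %}
    {% for topic in attribute(data, 'subject_topics')|split(';') %}
    <subject>
      <topic>{{ topic|trim|escape }}</topic>
    </subject>
    {% endfor %}
    {% endif %}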

37 of 40

Some happy IMI users

  • http://dcmny.org (that's us) 35000+ objects so far
  • https://digitallibrary.sdsu.edu (San Diego State Library) 5000+
  • University of Toronto SC
  • Grinnell College
  • New York Historical Society 70,000+
  • ICG
  • Born Digital
  • ...

38 of 40

Next Steps

  • Integrate a full blown Twig parser (with pretty colors)
  • Add extensibility to accept other API inputs like JSON. Those can be provided by other modules and be extended by third parties
  • Build a full FLAT 2 Hierarchical back-2 Flat Workflow (Solr on one end)
  • Save Data provenance and Workflow in the repository. You might want to reingest again in 5 years
  • Use the same Concept and Approach to build a non-Islandora-specific App, e.g. migrate via UI from ContentDM to Fedora 5 (the API). Why not?
  • Write more documentation, free workshops and Twig tutorials
  • Modify our IIIF Development to generate manifests totally based on Twig templates
  • Make IMI Drupal 8 compliant
  • Get more use cases

39 of 40

DEMO time!

40 of 40

Discussion / Q & A

Special thanks to Mark McFate, Patrick Dunlavey, Kim Pham, Nat Keeran and to our Metro team Anne Karle-Zenith, Karen Hwang and Chris Stanton