1 of 40

Mother spreadsheet, father template:

Unconventional ways for massive XML/RDF/structured data generation from flat sources

OR 2018, Bozeman, Montana, USA

Diego Pino Navarro

Metropolitan New York Library Council

@dpinonavarro

2 of 40

Refreshing some concepts

  • Hierarchical data/metadata
    • Deeply nested metadata XML Schemas like MODS (Poor man's RDF)
    • Deeply nested meta-meta-data schemas like METS and FOXML (Matryoshka Syndrome)
    • Tree-shaped data, like acyclic graphs, RDF, GraphQL
    • JSON/JSON-LD documents like IIIF Manifests
    • HTML
    • Query languages (any)
  • Templating Engines (or processors)
    • Software that mixes a data input with a template to generate an output document (a minimal sketch follows this list)
  • Tabular data
    • Two-dimensional data structure (matrix) with columns and rows.
    • A single row = many key/value pairs. Many rows.
  • Spreadsheet processor
    • Software that allows you to manipulate Tabular data
  • Fedora Commons
    • Popular Digital Objects Repository
  • Islandora
    • Drupal based Repository ecosystem that uses Fedora Commons as a backend
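
A minimal sketch of how the templating and tabular ideas above combine (the column names and values here are invented for illustration): a templating engine takes one tabular row, exposed as key/value pairs, plus a template, and renders an output document. In Twig:

    {# Hypothetical row: data = { title: 'Letters from Bozeman', year: '1889' } #}
    <record>
      <title>{{ data.title|trim }}</title>
      <dateIssued>{{ data.year }}</dateIssued>
    </record>
    {# Renders roughly as:
       <record><title>Letters from Bozeman</title><dateIssued>1889</dateIssued></record>
    #}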

3 of 40

What is this about?

This presentation is about:

  • Metadata and data, fields and values. Tables and Graphs. Tools and Skills
  • Understanding human limitations
  • Reusing what people know to help them reach a certain unknown
  • Finding a common denominator / place that feels like home in your Metadata ingest workflow

4 of 40

What is this also about?

It is about being simple, being pragmatic and getting the job done

It is about being happy, or at least, less miserable while generating good, quality Metadata

It is about helping repositories stay sustainable by keeping a constant flow of ingests happening

5 of 40

It is about ingesting Digital Objects, images, metadata and data into repositories without losing any of it

6 of 40

The shape of {meta}data

7 of 40

Axiom 0 wonder-metadatian

To be able to generate quality {meta}data to today's standards, you need to manage many of the following technologies/skills/tools:

JSON-LD/JSON, Regular expressions, OpenRefine, Bash, at least one Repo flavour, SPARQL, SQL, Character Encodings, XML, XSLT 1.0 and 2.0, MODS 3.5/3.6, DC, QDC, OAI-PMH, Python, RDF/RDFa/n-turtle, WIKIDATA, MARC21, IMAGEMAGICK, UNIX, ZIP formats, Spreadsheet Processors, Solr

Feel free to add more to this list: DP.LA requirements, Citation formats, Color spaces? TECHMD? Provenance?

8 of 40

Axiom 1 repo-splaining

Repositories have a strong opinion about how your {meta}data should look (and they did NOT ask you if you were OK with it)

9 of 40

Axiom 2 “I say tomato, you say tomatillo”

Repositories expose {meta}data using predefined, well-known, well-accepted formats and schemas. These normally match external interaction requirements and demand specialized knowledge from you to be able to interact, e.g. XML/Turtle with MODS, METS, DC, EDM.

I say expose because luckily most humans don’t need to interact with internal/storage representations

10 of 40

Axiom 3 “The meat grinder”

Ingesting {meta}data in a repository involves pain. To avoid chronic pain and premature aging, we try to do this either:

  1. Using the native Repository-provided tools
  2. Preparing everything in a format that is natively understood by the system

“Ingesting” is used loosely here; it could also be smashing, squishing, mixing, blending, cutting or baking

11 of 40

Axiom 4 “The rented DeLorean”

Not all {meta}data is created contemporaneously with your shiny new repository. We deal with legacy or out-of-our-control data all the time

Also, not all archiving, describing and cataloging happens with a repo in mind, even when one is already in place

12 of 40

Theorem 1 “Honey, I Shrunk the Digital Objects”

Based on all the previous Axioms we can safely say that {meta}data workflows are driven by end formats, APIs and after-end requirements. Data destruction, reduction and compression happen all the time just to get things done. All this is a multi-step, multi-tool, messy operation that demands a lot from you, with no Ctrl+Z.

Compression is used here in a semantic sense, not zip/unzip related

13 of 40

Theorem 2 “Golden Hammer”

Based on no Axiom really, but on daily experience:

Once you learn to use a Hammer, you will try to “hammer” your way through life as much as possible. This happens with technologies like XSLT or standards like MARC21

Normally you choose a few trustworthy Skills from Axiom 0 and apply them everywhere

14 of 40

Case Study

Ingesting Digital Objects to DCMNY.org

Islandora 7.x / Fedora 3.8.x

The Need:

Create a Simple and efficient Ingest Workflow that integrates:

  1. Metadata Creators
  2. Our particular Mixed-Source Collections
  3. DP.LA metadata guidelines

15 of 40

To understand the problem we need to understand the final shape of our Objects

16 of 40

The Fedora/Islandora Digital Object

(da shape)

A shallow aggregation of Metadata and Resources

17 of 40

The Fedora/Islandora Digital Object Hierarchy (tree)

An RDF Graph of related DOs (Digital Objects).

Relationships are given by CMODELs

18 of 40

Islandora’s basic ingest Options

  1. Single Object (hand crafted): using XML and multi-step forms via the UI
  2. Multiple objects (sausage factory): using Islandora Batch and Content Model (CModel) specific modules. Specific structure and pre-built XML documents
  3. Each CModel can eventually have a different requirement

It is not a lack of DEV effort. It is just complex.

19 of 40

Islandora’s Ingest Problem

Ingesting content is complex and slow

Preparing {meta}data and Hierarchy

  • Time consuming
  • Prone to human error
  • QA difficult once in ingest format (Single MODS XML + Master Binary)
  • Workflow difficult to document and recreate.
  • Either too static (all looks the same) or a lot of tooling for each use case
  • Steep learning curve
  • Difficult to delegate parts of this to third parties (all or nothing)

20 of 40

Our first attempt

(Golden Hammer)

21 of 40

Things we got right

Spreadsheet!

  1. Could be given to creators
  2. QA was relatively simple

22 of 40

Things we did as well as we could

  1. OpenRefine is complex
  2. We were stuck with fixed naming conventions
  3. We had to process GB-sized XML files in memory via XSLT
  4. We had to split the XML and name each file to match its binary (XML == Binary)
  5. We could only ingest a single CMODEL at a time
  6. No control over the rest of the datastreams
  7. We either had upload limits or gave users ssh access

23 of 40

We needed something better

We wanted:

Full control over how Objects are built

Full control over how Objects are related to each other

Lightning-fast XML generation

Flexible Binary Sources

Replicable Workflow

UI all over the place

Integrated Workflow

Smart Islandora Integration

Inline QA and Data shaping

Flexibility

Weekends

24 of 40

Islandora Multi Importer

Was brewed

(IMI)

Deny Theorem 1

Embrace Theorem 2

Safely integrate our Data creators

Ingest tons of material

https://github.com/mnylc/islandora_multi_importer

Demo: https://vimeo.com/273217707

25 of 40

IMI uses Spreadsheets as input

But we went further:

Any tabulated format works: Excel, CSV, TSV, Google Spreadsheets via API access. You name it. (Really, there are not many more.)

There are no header-row naming restrictions; headers just need to match what our templates use.
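
As a small, hedged sketch (the header names below are invented), whatever you call a column is exactly what the template references, either with dot notation or with Twig's attribute() function for names that are not plain identifiers:

    {# Hypothetical header row: title, date_of_creation, personal_name_author #}
    <title>{{ data.title|trim }}</title>
    <dateCreated>{{ attribute(data, 'date_of_creation')|trim }}</dateCreated>
    {% for author in attribute(data, 'personal_name_author')|split(';') %}
    <author>{{ author|trim }}</author>
    {% endfor %}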

26 of 40

27 of 40

IMI generates XML via Twig templates

  • Websites serve 1000s of users in real time by rendering web pages on demand
  • HTML looks pretty similar to XML…
  • Symfony/Silex/Drupal 8 were using Twig already
  • Low footprint, compiled and cacheable!
  • A pseudo-language built to deal with data, similar to PHP
  • Can be tested outside Islandora. Can be stored and validated
  • Can be shared and augmented
  • Templates are easy to read, share and to understand

28 of 40

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="fn"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="2.0" exclude-result-prefixes="xs fn">
  <xsl:output indent="yes" encoding="US-ASCII" />
  <xsl:param name="pathToCSV" select="'file:///c:/csv.csv'" />
  <xsl:function name="fn:getTokens" as="xs:string+">
    <xsl:param name="str" as="xs:string" />
    <xsl:analyze-string select="concat($str, ',')" regex='(("[^"]*")+|[^,]*),'>
      <xsl:matching-substring>
        <xsl:sequence select='replace(regex-group(1), "^""|""$|("")""", "$1")' />
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>
  <xsl:template match="/" name="main">
    <xsl:choose>
      <xsl:when test="unparsed-text-available($pathToCSV)">
        <xsl:variable name="csv" select="unparsed-text($pathToCSV)" />
        <xsl:variable name="lines" select="tokenize($csv, ' ')" as="xs:string+" />
        <xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+" />
        <root>
          <xsl:for-each select="$lines[position() &gt; 1]">
            <row>
              <xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+" />
              <xsl:for-each select="$elemNames">
                <xsl:variable name="pos" select="position()" />
                <elem name="{.}">
                  <xsl:value-of select="$lineItems[$pos]" />
                </elem>
              </xsl:for-each>
            </row>
          </xsl:for-each>
        </root>
...

CSV to XML via XSLT

Dissecting

  • Requires XSLT 2.0
  • Requires a full path to a CSV
  • Requires that headers in CSV match XML elements we want
  • Assumes space separation
  • Uses a regex
  • Weirdly, no nested XML output
  • Pretty complicated to read
  • Output schema not clear
  • gosh

29 of 40

<?xml version="1.0" encoding="UTF-8"?>

{% block content %}

{% autoescape false %}

<mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd" >

<titleInfo>

<title>{{ data.title|trim|escape }}</title>

</titleInfo>

{% if attribute(data, 'personal_name_author') %}

{% for name in attribute(data, 'personal_name_author')|split(';') %}

<name type="personal">

<namePart>{{ name|trim|escape }}</namePart>

<role>

<roleTerm authority="marcrelator" type="text">author</roleTerm>

</role>

</name>

{% endfor %}

{% endif %}

....

</mods>

{% endautoescape %}

{% endblock %}

Twig template that generates MODS based on a single spreadsheet row of data

(you can see the MODS there, right?)

30 of 40

31 of 40

{% if attribute(data, 'personal_name_byrole') %}
{% for name_byrole in attribute(data, 'personal_name_byrole')|split(';') %}
{% set name = name_byrole|split('|') %}
<name type="personal">
  <namePart>{{ name[0]|trim }}</namePart>
  {% if name[1] is not empty %}
  <role>
    <roleTerm authority="marcrelator" type="text">{{ name[1]|trim }}</roleTerm>
  </role>
  {% endif %}
</name>
{% endfor %}
{% endif %}

Twig snippet that generates nested XML from a single cell

personal_name_byrole

Islandora dude|creator; Samvera friend |collaborator

32 of 40

<name type="personal">

<namePart>Islandora dude</namePart>

<role>

<roleTerm authority="marcrelator" type="text">creator</roleTerm>

</role>

</name>

<name type="personal">

<namePart>Islandora dude</namePart>

<role>

<roleTerm authority="marcrelator" type="text">collaborator</roleTerm>

</role>

</name>

<name type="personal">

<namePart>Dspace expert</namePart>

</name>

XML output for the previous input plus extra data in the same field

personal_name_byrole

Islandora dude |creator; Samvera friend |collaborator; Dspace expert

33 of 40

IMI is smart

  • Integrates with Islandora Batch system and creates “to be born” objects
  • Understands Islandora existing CMODEL blueprint objects.
  • Allows any existing CMODELs (even future ones)
  • Allows Complex Hierarchies to be created
  • Can read Binary Sources from ZIP files, remote HTTP, other streamwrappers (tmp://), local server paths, Amazon S3, Dropbox.
  • Is able to fall back in case of failure and try alternative solutions
  • Cleans up after itself
  • Has full UI
  • Has a Twig template editor and internal storage
  • Can be extended
  • Has a knot breaker
  • Discards invalid Objects
  • Is open source and maintained

34 of 40

IMI helps with housekeeping

  • IMI can also update existing Objects
  • Full control of what gets updated and how RDF is built
  • IMI can transmute Content Models: today an Image, tomorrow a Postcard

35 of 40

36 of 40

Why Use Multi-importer (What others say about it)

  • UI driven integrated workflow for ingest and update
  • Metadata Cleanup: export your MODS metadata as CSV via Solr, clean it up, then update the objects' MODS datastreams by recreating them with Twig (see the sketch after this list)
  • To ingest different content types at the same time including hierarchies, like collections inside collections with compounds and books, etc
  • To avoid having to follow strict naming conventions and folder structure dictated by many Islandora batch ingest processes
  • Selectively choose which derivatives you want to create and upload
  • To avoid the OpenRefine/XSLT approach to creating MODS from CSVs
  • To take advantage of the Twig Templating system for creating MODS from CSVs
  • To preview the MODS output easily
  • Supports integration with Google Spreadsheets, Zip/Local/Amazon storage and Complex storage needs via hooks.
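
As a hedged sketch of the Metadata Cleanup round trip mentioned above (the column name and cell value are invented), the same split pattern shown earlier can rebuild repeated MODS elements from a cleaned, multi-valued cell:

    {# Hypothetical cleaned cell, column subject_topics: "Montana; Libraries; Open Repositories" #}
    {% if attribute(data, 'subject_topics') %}
    {% for topic in attribute(data, 'subject_topics')|split(';') %}
    <subject>
      <topic>{{ topic|trim|escape }}</topic>
    </subject>
    {% endfor %}
    {% endif %}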

37 of 40

Some happy IMI users

  • http://dcmny.org (that's us) 35000+ objects so far
  • https://digitallibrary.sdsu.edu (San Diego State Library) 5000+
  • University of Toronto SC
  • Grinnell College
  • New York Historical Society 70,000+
  • ICG
  • Born Digital
  • ...

38 of 40

Next Steps

  • Integrate a full blown Twig parser (with pretty colors)
  • Add extensibility to accept other API inputs like JSON. Those can be provided by other modules and be extended by third parties
  • Build a full FLAT 2 Hierarchical back-2 Flat Workflow (Solr on one end)
  • Save Data provenance and Workflow in the repository. You might want to reingest again in 5 years
  • Use the same Concept and Approach to build a non-Islandora-specific App, e.g. migrate via UI from ContentDM to Fedora 5 (the API). Why not?
  • Write more documentation, free workshops and Twig tutorials
  • Modify our IIIF Development to generate manifests totally based on Twig templates
  • Make IMI Drupal 8 compliant
  • Get more use cases

39 of 40

DEMO time!

40 of 40

Discussion / Q & A

Special thanks to Mark McFate, Patrick Dunlavey, Kim Pham, Nat Keeran and to our Metro team Anne Karle-Zenith, Karen Hwang and Chris Stanton