Mother spreadsheet, father template:
Unconventional ways for massive XML/RDF/structured data generation from flat sources
Slides at: https://tinyurl.com/OR2018momanddad
Refreshing some concepts
What is this about?
This presentation is about:
What is this also about?
It is about being simple, being pragmatic and getting the job done
It is about being happy, or at least, less miserable while generating good, quality Metadata
It is about helping repositories be sustainable by enabling a constant flow of ingests
It is about ingesting Digital Objects, images, metadata, data into repositories without losing it
The shape of {meta}data
Axiom 0 “wonder-metadatian”
To be able to generate quality {meta}data for today’s standards you need to manage many of the following technologies/skills/tools
JSON-LD/JSON
Regular expressions
OpenRefine
Bash
At least one Repo flavour
SPARQL
SQL
Character Encodings
XML
XSLT 1.0 and 2.0
MODS 3.5 MODS 3.6
DC
QDC
OAI-PMH
Python
RDF/RDFa/n-turtle
WIKIDATA
MARC21
IMAGEMAGICK
UNIX
ZIP formats
Spreadsheet Processors
Solr
Feel free to add more to this list: DP.LA requirements, Citation formats, Color spaces? TECHMD? Provenance?
Axiom 1 “repo-splaining”
Repositories have a strong opinion about what your {meta}data should look like (and they did NOT ask you if you were OK with it)
Axiom 2 “I say tomato, you say tomatillo”
Repositories expose {meta}data using predefined, well-known, well-accepted formats and schemas. Those normally match external interaction requirements and demand specialized knowledge from you to interact with them, e.g. XML/Turtle: MODS, METS, DC, EDM
I say expose because luckily most humans don’t need to interact with internal/storage representations
Axiom 3 “The meat grinder”
Ingesting {meta}data in a repository involves pain. To avoid chronic pain and premature aging, we try to do this either
“Ingesting” is used loosely here; it could also be smashing, squishing, mixing, blending, cutting or baking
Axiom 4 “The rented delorean”
Not all {meta}data is created contemporary to your shiny new repository. We deal with legacy or out-of-our-control data all the time
Also, not all archiving, describing, cataloging happens with a repo in mind even when already in place
Theorem 1 “Honey, I Shrunk the Digital Objects”
Based on all the previous Axioms we can safely say that {meta}data workflows are driven by end formats, APIs and after-end requirements. Data destruction, reduction and compression happens all the time just to get things done. All this is a multi step, multi tool messy operation that demands a lot from you with no Ctrl+Z.
Compression is used here in a semantic sense, not zip/unzip related
Theorem 2 “Golden Hammer”
Based on no Axiom really, but on daily experience:
Once you learn to use a hammer, you will try to “hammer” your way through life as much as possible. This happens with technology like XSLT or standards like MARC21
Normally you choose a few trustworthy Skills from Axiom 0 and apply them everywhere
Case Study
Ingesting Digital Objects to DCMNY.org
Islandora 7.x / Fedora 3.8x
The Need:
Create a Simple and efficient Ingest Workflow that integrates:
To understand the problem we need to understand the final shape of our Objects
The Fedora/Islandora Digital Object
(da shape)
A shallow aggregation of Metadata and Resources
The Fedora/Islandora Digital Object Hierarchy (tree)
An RDF graph of related Digital Objects.
Relationships are given by CMODELS
Islandora’s basic ingest Options
It is not lack of DEV effort. It is just complex.
Islandora’s Ingest Problem
Ingesting content is complex and slow
Preparing {meta}data and Hierarchy
Our first attempt
(Golden Hammer)
Things we got right
Spreadsheet!
Things we did as good as we could
Openrefine tutorial: https://www.youtube.com/watch?v=yceTHecXd9g
We needed something better
We wanted:
Full control over how Objects are built
Full control over how Objects are related to each other
Lightning-fast XML generation
Flexible Binary Sources
Replicable Workflow
UI all over the place
Integrated Workflow
Smart Islandora Integration
Inline QA and Data shaping
Flexibility
Weekends
Islandora Multi Importer
Was brewed
(IMI)
Deny Theorem 1
Embrace Theorem 2
Integrate safely our Data creators
Ingest tons of material
IMI uses Spreadsheets as input
But we went further:
Any tabulated format works: Excel, CSV, TSV, Google Spreadsheets via API access. You name it… (really, there aren't many more)
There are no header row naming restrictions; they just need to match our template’s use.
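That header-matching rule can be sketched in plain Python (a stdlib sketch, not IMI's actual code): whatever names appear in the first row become the keys each data row is addressed by, with no fixed vocabulary. The `rows_as_dicts` helper and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical two-row CSV: free-form header names, one data row.
SAMPLE = "title,personal_name_author\nMy Thesis,Doe; Jane\n"

def rows_as_dicts(text):
    """Read tabulated data; keys come straight from the header row,
    so the template decides what the column names mean."""
    return list(csv.DictReader(io.StringIO(text)))

rows = rows_as_dicts(SAMPLE)
print(rows[0]["title"])  # -> My Thesis
```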
IMI generates XML via Twig templates
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="fn"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
version="2.0" exclude-result-prefixes="xs fn">
<xsl:output indent="yes" encoding="US-ASCII" />
<xsl:param name="pathToCSV" select="'file:///c:/csv.csv'" />
<xsl:function name="fn:getTokens" as="xs:string+">
<xsl:param name="str" as="xs:string" />
<xsl:analyze-string select="concat($str, ',')" regex='(("[^"]*")+|[^,]*),'>
<xsl:matching-substring>
<xsl:sequence select='replace(regex-group(1), "^""|""$|("")""", "$1")' />
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:function>
<xsl:template match="/" name="main">
<xsl:choose>
<xsl:when test="unparsed-text-available($pathToCSV)">
<xsl:variable name="csv" select="unparsed-text($pathToCSV)" />
<xsl:variable name="lines" select="tokenize($csv, '\r?\n')" as="xs:string+" />
<xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+" />
<root>
<xsl:for-each select="$lines[position() > 1]">
<row>
<xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+" />
<xsl:for-each select="$elemNames">
<xsl:variable name="pos" select="position()" />
<elem name="{.}">
<xsl:value-of select="$lineItems[$pos]" />
</elem>
</xsl:for-each>
</row>
</xsl:for-each>
</root>
...
CSV to XML via XSLT
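The same CSV-to-generic-XML step can be sketched in a few lines of Python (a stdlib sketch, not part of the presented stack): the first row supplies element names and every later row becomes a `<row>` of `<elem name="...">` children, mirroring the stylesheet above.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text):
    """Mirror the XSLT: header row -> element names,
    each data row -> <row> with one <elem> per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    root = ET.Element("root")
    for line in reader:
        row = ET.SubElement(root, "row")
        for name, value in zip(header, line):
            elem = ET.SubElement(row, "elem", name=name)
            elem.text = value
    return ET.tostring(root, encoding="unicode")

print(csv_to_xml("a,b\n1,2\n"))
# -> <root><row><elem name="a">1</elem><elem name="b">2</elem></row></root>
```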
Dissecting
<?xml version="1.0" encoding="UTF-8"?>
{% block content %}
{% autoescape false %}
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd" >
<titleInfo>
<title>{{ data.title|trim|escape }}</title>
</titleInfo>
{% if attribute(data, 'personal_name_author') %}
{% for name in attribute(data, 'personal_name_author')|split(';') %}
<name type="personal">
<namePart>{{ name|trim|escape }}</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
{% endfor %}
{% endif %}
....
</mods>
{% endautoescape %}
{% endblock %}
Twig template that generates MODS based on a single spreadsheet row of data
(you can see the MODS there right?)
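What the Twig loop above does for one row can be sketched in plain Python (a hypothetical helper, not IMI's code): split a `;`-separated cell and emit one MODS `<name>` element per person, with a fixed `author` role.

```python
from xml.sax.saxutils import escape

def names_to_mods(cell):
    """Split a ';'-separated cell into one <name> element per person,
    like the Twig {% for %} loop over personal_name_author."""
    people = [p.strip() for p in cell.split(";") if p.strip()]
    out = []
    for person in people:
        out.append('<name type="personal"><namePart>%s</namePart>'
                   '<role><roleTerm authority="marcrelator" type="text">'
                   'author</roleTerm></role></name>' % escape(person))
    return "\n".join(out)

print(names_to_mods("Doe, Jane; Roe, Ron"))
```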
{% if attribute(data, 'personal_name_byrole') %}
{% for name_byrole in attribute(data, 'personal_name_byrole')|split(';') %}
{% set name = name_byrole|split ('|') %}
<name type="personal">
<namePart>{{ name[0]|trim }}</namePart>
{% if name[1] is not empty %}
<role>
<roleTerm authority="marcrelator" type="text">{{ name[1]|trim }}</roleTerm>
</role>
{% endif %}
</name>
{% endfor %}
{% endif %}
Twig snippet that generates nested XML from a single cell
personal_name_byrole |
Islandora dude|creator; Samvera friend |collaborator |
<name type="personal">
<namePart>Islandora dude</namePart>
<role>
<roleTerm authority="marcrelator" type="text">creator</roleTerm>
</role>
</name>
<name type="personal">
<namePart>Samvera friend</namePart>
<role>
<roleTerm authority="marcrelator" type="text">collaborator</roleTerm>
</role>
</name>
<name type="personal">
<namePart>Dspace expert</namePart>
</name>
XML output for the previous example, plus extra data in the same field
personal_name_byrole |
Islandora dude |creator; Samvera friend |collaborator; Dspace expert |
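The two-level split the `personal_name_byrole` template performs can be sketched in Python (a hypothetical `parse_byrole` helper, not IMI's code): `;` separates people, `|` separates a name from its optional role, and an empty role is simply omitted.

```python
def parse_byrole(cell):
    """';' splits people, '|' splits name from optional role,
    mirroring the nested split in the Twig snippet."""
    people = []
    for chunk in cell.split(";"):
        bits = chunk.split("|")
        name = bits[0].strip()
        role = bits[1].strip() if len(bits) > 1 and bits[1].strip() else None
        if name:
            people.append((name, role))
    return people

print(parse_byrole(
    "Islandora dude |creator; Samvera friend |collaborator; Dspace expert |"))
# -> [('Islandora dude', 'creator'), ('Samvera friend', 'collaborator'),
#     ('Dspace expert', None)]
```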
IMI is smart
IMI helps with housekeeping
Why Use Multi-importer (What others say about it)
Some happy IMI users
Next Steps
DEMO time!
Discussion / Q & A
Special thanks to Mark McFate, Patrick Dunlavey, Kim Pham, Nat Keeran and to our Metro team Anne Karle-Zenith, Karen Hwang and Chris Stanton