1 of 32

Introduction to CLAW

And Linked Data

2 of 32

@dannyLamb

3 of 32

What is CLAW?

So, we’re going to start at the ground floor today. The responses we received from the poll Kim issued have shown that the majority of folks today have little to no experience with CLAW or Islandora. So for those of you who are more in the know, please bear with us, I promise there’s interesting stuff ahead for you too.

So, what is CLAW? At its heart, it’s the same as Islandora 7.x. It’s the combination of Drupal

and Fedora.

And we do this because the whole is more than the sum of its parts. Together they make a better system for managing a repository.

Now that’s an easy reduction to make, but in the end, it’s not just Drupal + Fedora = Islandora. There’s obviously a lot more going on. There’s Solr. There’s the triplestore. There’s microservices for derivative generation and little pieces of middleware to glue it all together. And we’ll cover all of that soon enough. But for now just think “Islandora lets you manipulate things in Fedora via Drupal”.

4 of 32

Strategic Goals

5 of 32

High Level

Easy to install
Easy to scale
Easy to sustain / maintain
Provide a better experience for repository admins
Provide a better experience for developers and sysadmins
Increase and promote inter-community collaboration

From an ultra high level, independent of any individual feature, here’s what we’re trying to achieve with CLAW. This will explain the rationale behind why we’re doing what we’re doing.

Easy to install. This is first on the list for a reason. Installation has always been a hurdle. It’s more than just Drupal + Fedora = Islandora, right? All those pieces and independent systems that we’re gluing together all have to get installed and configured, and quite frankly, it’s a nightmare to do it manually. Barrier to entry numero uno is installation. If we want people to adopt Islandora, we can’t give them something their sysadmin spends two weeks on and says “I give up. Too much. This is crazy.”

Easy to scale. This is another big hurdle. The amount of data that people are collecting is exponentially growing, day in and day out. If we want to be in this for the long haul, we have to be able to handle lots and lots of data. This means we’ve got to be able to scale. An all in one box will work great for a smaller institution, and you don’t have to scale if you don’t want to. But if we want to be used by larger institutions, we have to solve the scaling problem. We have to be able to handle more data, more page hits, and more active users. Period.

Easy to sustain and maintain. We have a good sized community. The Islandora community is growing. Interest is great. But at the end of the day you have to be aware that people can only volunteer so much time. Bosses can’t loan me developers for months on end. People can’t just toss away their personal lives and hack on Islandora indefinitely. So we have to respect and understand the limits of how open source software works. So at all costs, we have to look at what we’re doing and say “Do we really need to do this ourselves?” Has someone else already solved this problem? Can we use it? Shrinking the codebase and having less code to maintain means we can manage the code with less time.

Provide a better experience for repository admins. We want a better and more responsive site, first and foremost. And we’d love for it to look great, too. But just as importantly, we need to give more control to the repository admin through the UI. This is Islandora’s thing and we want to make it better.

Provide a better experience for developers and sysadmins. This is the toughest one. It falls in line with our goal of sustainability, but it does go beyond that. Islandora is a turn-key system if you want, but many people want more and you shouldn’t have to suffer in order to make it happen. Knowledge should not be limited or restricted to a chosen few. If you want to do it yourself it needs to be possible. You should be able to train an employee without years of onboarding. And if you don’t want to do it in-house, you should have a _choice_ of vendors. And those vendors need to have an easier time finding and training talent as well. Making it easier to develop with and for Islandora benefits everyone, even if you’re not a developer. And it also fosters innovation, some of which we’re hoping will find its way back to the Foundation.

Increase and promote inter-community collaboration. We’re doing this for two reasons. One is that we use other people’s software all the time, and generally don’t contribute back to it. How can we ask people to contribute to Islandora if we’re not doing the same ourselves? Besides, we use this software all the time, so scratching someone else’s back is beneficial because you never know when you might need a hand with something. I know that sounds political, and it kinda is, but it’s a very important part of open source software. The other big reason for this goal is that it will expose Islandora to other, potentially larger, communities. So we hope it will help expand adoption.

Now those are some lofty goals, and I’m happy to report that we’ve pretty much nailed it. I’ll go over each one in turn.

6 of 32

Installation

https://github.com/Islandora-Devops/claw-playbook

Installation. I’m very proud of this. The Islandora _community_ (not me, not a vendor, the community) has developed a completely free installer, that will give you either a development VM using

Vagrant

or a production ready machine. It’s not just for dev environments, you can put it on a cloud VM or something in house. UTSC has been instrumental in making that happen, btw. We’re not teasing you with a VM, showing you what it can do, and then expecting you to go through and install it all by hand, ok? It is flexible, it is customizable, and built on top of

Ansible which is a fantastic devops tool. Your sysadmins will thank you. Installation is at its simplest, a one line command. For real production you’ve got some extra configuration to set, but it’s still about as straightforward as we can possibly make installing a complicated repository system like Islandora. We’ve even gone as far as breaking up the install so it is modular. Meaning that you can install bits and pieces on different machines, and you’re in complete control of the granularity.

This has been the focus of the last few community sprints and I feel it’s been a huge success, and is something that sets us ahead of the pack.

We currently support Ubuntu 16.04, and are rapidly approaching full support for CentOS 7. So we’re hitting the two major Linux OSes out there, and targeting LTSs.

7 of 32

Scaling

By Stephen Edmonds from Melbourne, Australia (First quick test of DIY light box) [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

So historically, this has been a big problem for Islandora. It’s up there with installation. The problem with scaling is that it demands a lot more to make existing software scalable. Installable? Sure, it’s a pain but you can provide it without having to redesign the software. Scalable? That’s a ‘back to the drawing board’ type of situation. Which is what we did. Has it taken a while? Yes. But it’s been worth it.

To scale the software, letting you break it apart into different pieces and ramp them up independently of one another based on your needs... that demands _highly decoupled_ software. And every step of the way, when presented with a problem, we have generally accepted that it’s better to sort things out and decouple them now, because it probably won’t happen later. It’s just the reality we’re living in. And because we’ve had the courage, as a community, to take that approach, all the rest of the strategic goals have been easier to meet, too. All the goodies that we’re getting now, we’re kind of getting for free because we’ve been patient and decoupled the code as best we can.

8 of 32

The Fundamental Difference

The main distinction between CLAW and all other Fedora based systems is the role that Fedora plays. In CLAW, Fedora is the repository, not the database driving the web application that is administering the repository.

Because we’ve decoupled so well, CLAW now stands apart from the rest of the crowd.

What this means is that instead of letting Fedora get in the middle of every single page request, we still talk to Fedora as we’re doing business, but we don’t let it get in the way. We’re not trying to wrap Fedora with Drupal, mimicking all of its features. We’re letting them sit side by side and work together to each do what they do best. And this is very important because now that Fedora’s an LDP server, it can be used very differently. Instead of only being accessed through Drupal, you can expose your Fedora within your organization or even outside your firewall and be “firmly of the web”.
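
As a rough illustration of what talking to an LDP server looks like, here is a sketch in Python that builds (but does not send) the kind of request LDP uses to create a resource: a POST to a container with an RDF body. The base URL, the slug, and the title are placeholders made up for this example, not values from a real CLAW install.

```python
import urllib.request

# Assumption: a hypothetical Fedora REST endpoint; substitute your own.
CONTAINER = "http://localhost:8080/fcrepo/rest/"

# A one-triple Turtle document; <> means "the resource being created".
TURTLE = '<> <http://purl.org/dc/terms/title> "A Scanned Book" .\n'


def build_create_request(container: str, slug: str, turtle: str) -> urllib.request.Request:
    """Build an LDP-style POST asking the container to mint a new resource."""
    return urllib.request.Request(
        container,
        data=turtle.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/turtle",  # the body is RDF serialized as Turtle
            "Slug": slug,                   # a naming hint the server may ignore
        },
    )


request = build_create_request(CONTAINER, "my-book", TURTLE)
```

Nothing here is Drupal-specific, which is the point: any HTTP client inside or outside your organization can talk to the repository this way.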

Either way, when a visitor hits your Drupal site, they’re not requesting things directly from your institution’s repository. They’re getting copies that are going to be delivered more quickly by a different system. And when you’re editing information in your repository, you’re editing a working copy which eventually makes its way to your repository.

9 of 32

Microservices

https://github.com/Islandora-CLAW/Crayfish

Ok, so there’s that. But then there’s other scalability issues beyond Fedora. Like what happens when everything is on one box and you’re processing thousands and thousands of images with OCR? You’re hammering that box. Crushing it. You need to be running Tesseract somewhere else, not on the public facing server that’s handling web traffic. Hence microservices.

Anything that could potentially take a long time, or eat up a bunch of resources, gets wrapped up as a tiny web app. Why? Web servers are easy to scale and everyone understands HTTP. There’s already some out there, like the FITS web service. POST a file to it, get some FITS. Need some more FITS action? Spin up more FITS servers. Don’t want any? Don’t run any! Repositories often need a glut of resources during large batch processes, but then once that’s done resources can be de-allocated and the overall system can run with less.
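
To make the "tiny web app" idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the pattern, not Crayfish code: `word_count` is a hypothetical stand-in for an expensive job like OCR, and the handler and function names are assumptions of this example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def word_count(data: bytes) -> int:
    """Hypothetical stand-in for an expensive job such as OCR."""
    return len(data.split())


class DerivativeHandler(BaseHTTPRequestHandler):
    """Accepts a POSTed file body and returns the result as plain text."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        result = str(word_count(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def post_file(url: str, data: bytes) -> str:
    """Client side: POST bytes to a service and return the response text."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, data=data, method="POST")
    with opener.open(request) as response:
        return response.read().decode("utf-8")


def demo() -> str:
    """Start the service on a free local port, call it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), DerivativeHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return post_file(f"http://127.0.0.1:{port}/", b"a scanned page of text")
    finally:
        server.shutdown()
        server.server_close()
```

Because the service is just HTTP, "spin up more servers" means running more copies of this process behind a load balancer, and "don’t want any" means not starting it at all.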

And we’re making many microservices ourselves. They’re available on GitHub. The project name is Crayfish. It has this super cute logo.

And there’s some overlap with another one of our strategic goals, here. Inter-community collaboration. Most (not all, but the ones where it makes sense) of our microservices in Crayfish are API-X compatible. Which means that our microservices will work for anyone that has a Fedora, not just those running Islandora. Which is pretty cool.

10 of 32

Sustainability and Maintenance

Lewis Hine [Public domain], via Wikimedia Commons

So this is a pretty big deal. The foundation has two employees. One works on CLAW a lot, but one person is not enough to maintain and sustain this codebase. And 7.x is still there, which has a large amount of technical debt. So if this is gonna work, it has to be lean and it has to have tests. And we’ve very purposefully shed a huge amount of the codebase to achieve this.

Now we definitely DO NOT have all the same features/functionality as 7.x, but for those that we do, we do it with a fraction of the code. Every step of the way we evaluate existing solutions to see if there’s already something out there we can use. And if there is, we use it. And only as a last resort do we write something ourselves, but if we do, we provide tests. Even the install code has tests. That’s how serious we are about it. You never know what you’re gonna break when you do something, so having those tests is the first line of defense for an understaffed crew that’s maintaining code.

11 of 32

User Experience

Drupal-y

Control display through UI
Control forms through UI
Model content through UI

Contrib modules!
Themes
Views
Inline Forms
Context

User experience is, generally speaking, that of Drupal’s. And although Drupal is not without its UX critics, it’s still a solid improvement.

You can control how an object is displayed through the UI.

You can control the form for how you upload and edit an object through the UI.

You can model content and control metadata through the UI.

And it’s infinitely customizable. Now that everything is Drupal-y,

we’ve unlocked Drupal’s contributed modules. So now things you find on Drupal.org will actually work with your content. Viewers and slideshows and

_themes_ and javascript, all this stuff is gonna work for you now. You’re not limited to only the modules that we provide. And this is how we manage to keep the codebase so lean. We actively leverage contributed modules and encourage you to do the same.

We’ve unlocked views, which let you query for resources and format the results all through a user interface. This is a Drupal staple that is very powerful and versatile. Views are a fantastic tool to have at your disposal.

We utilize the inline forms contrib module, which lets you embed forms within forms so you can edit multiple resources at the same time.

We also take advantage of the Context module, which lets you tailor the functionality of your site based on criteria that you define. Want different collections to have different themes? No problem. Need different users to see different forms? No problem. We even use context as a way of exposing Drupal hooks and alters through the UI. So things that normally would have required a programmer now have their own forms.

Context gives you such fine grained control, you can even use it to limit what gets published to Fedora or indexed in a triplestore. It is incredibly powerful.

12 of 32

Developer/Sysadmin Experience

Very Drupal-y when in Drupal

Object oriented
Dependency Injection
Plugin based
Contrib modules!!!

Great for front-end developers
Silex microservices
Apache Camel Middleware

Java
Insanely active community

Ansible

For developers, when in Drupal, things are also very Drupal-y.

Everything is object oriented.

We make extensive use of Dependency Injection.

We make everything as plugins whenever possible, which is a standardized way of making things swappable.

And of course, contrib modules. Many many many problems have already been solved for you by the Drupal community.

It’s also definitely better for themers and front-end developers. More themes are available. You can use views to do a ton of things. And Twig templating is definitely a step up from the Drupal 7 theming experience.

Microservices are small, simple, and written in Silex, which is a microframework built out of the Symfony components. So it’s not _exactly_ the same, but it’s very similar to Drupal development in a lot of respects. Microservices are small, simple, and quite frankly, fun bits of code to program.

And for the stuff that has to run forever, the infinite loops of listening for messages on a queue and responding, those types of things… we use Apache Camel. Which at first seems arcane, but

It’s Java, which is fast, stable, and has fantastic libraries out there.

And it has an insanely active community. It is impressive, to say the least. Subscribing to the Apache Camel mailing list is like turning on the fire hose. It will flood your inbox!
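
If the "infinite loop listening on a queue" pattern is unfamiliar, here is a small Python sketch of the idea. It is not Camel and not how our middleware is written (that is Java): an in-process `queue.Queue` stands in for a real message broker, and the event strings are made up for the example.

```python
import queue
import threading

STOP = object()  # sentinel telling the listener to shut down


def listen(messages: queue.Queue, handle) -> None:
    """Block on the queue indefinitely, handing each message to `handle`."""
    while True:
        message = messages.get()  # waits until something arrives
        if message is STOP:
            break
        handle(message)


def run_demo() -> list:
    """Publish a few hypothetical events and collect what the listener saw."""
    messages = queue.Queue()
    handled = []
    worker = threading.Thread(target=listen, args=(messages, handled.append))
    worker.start()
    for event in ("created node/1", "updated node/1", "deleted node/2"):
        messages.put(event)
    messages.put(STOP)
    worker.join()
    return handled
```

The publisher never waits on the listener, which is what lets slow work like indexing or derivative generation happen without holding up the web request that triggered it.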

And for the sysadmins out there, we use Ansible! We’ve settled on a sweet devops tool, and we share the code! And although we’re using it for installations, it’s possible to use Ansible to maintain and update environments as well.

13 of 32

Collaboration

Acropolis Museum [CC BY-SA 2.5 (https://creativecommons.org/licenses/by-sa/2.5)], via Wikimedia Commons

We want to work more with other communities, because it generally is good for both parties to be involved and engaged. I’ve already mentioned that our microservices are made available for Fedora users via API-X, but another important thing to consider is our engagement with the Drupal community. We’re on the cutting edge for a lot of stuff that I’m positive would be useful to the greater Drupal community. We’re doing linked data (if you want to), and we’re also managing “media” (to put it in Drupal terminology) and batch processing large amounts of Media and doing things with it like making lower quality access copies en masse and in such a way that doesn’t tank your Drupal server. And these can be good doors for us to open with Drupal because we can sort of take this “mainstream”.

I know there’s some connotations to that, but I promise it won’t make what we’re doing any “lamer”, k? There’s still plenty of library-specific stuff that we’re doing and will do, I’m not suggesting that we leave the space, so to speak. But ultimately we’re better off being part of the Drupal community than continuing to hang out on the sidelines.

14 of 32

Features

15 of 32

Something that’s important to understand, is that since our Drupal and our Fedora are sitting side by side, and we’re not trying to bring all of the features of one into the other, we can identify where the overlap in functionality is and use that to our advantage. We can isolate where we have to write custom code by picking the pieces that are closest to each other before gluing them together. That’s the intersection in this Venn diagram. BUT, and this is a big but, users are not limited to only those features. The _UNION_ of Drupal and Fedora is available to users. You can use the whole thing. Previously we’ve been trapped in the purple part of the diagram, and now we’ve got the freedom to use so many more tools that are available.

I’ll also say that we haven’t fully filled in the purple part yet. There’s still some things hanging that we’ll get around to once the API specification is finalized, like versions. We’re specifically waiting for that to happen because the ModeShape implementation of Fedora has a funny way of doing versions that doesn’t line up with how Drupal works nor what the specification is requiring. So there’s still some plates in the air there.

16 of 32

MVP

https://islandora-claw.github.io/CLAW/mvp/mvp_doc/

17 of 32

MVP Features

Content modeled in Drupal as Entities using PCDM 1.0
REST API exposed for Drupal Entities
Support for collections, images, books, and pages
The ability to control metadata mappings between Drupal and RDF
Provide RDF based default descriptive metadata profile in Drupal
The ability to export/import JSON-LD
Automatic backup of Drupal content in Fedora 4
Ability to restore/bootstrap a Drupal site from a properly structured Fedora 4 repository
The ability to index and search resources with Apache Solr
The ability to restrict access to collections and/or individual resources across all representations (Drupal, Fedora, Solr, etc…)
Asynchronous derivative generation
Vagrant environment for development purposes, which will serve as a starting point for more complicated, distributed installs

18 of 32

MVP Features

Content modeled in Drupal as Entities using PCDM 1.0
REST API exposed for Drupal Entities
Support for collections, images, books, and pages (SOON)
The ability to control metadata mappings between Drupal and RDF
Provide RDF based default descriptive metadata profile in Drupal (ONGOING)
The ability to export/import JSON-LD
Automatic backup of Drupal content in Fedora 4
Ability to restore/bootstrap a Drupal site from a properly structured Fedora 4 repository
The ability to index and search resources with Apache Solr
The ability to restrict access to collections and/or individual resources across all representations (Drupal, Fedora, Solr, etc…)
Asynchronous derivative generation (CLOSE)
Vagrant environment for development purposes, which will serve as a starting point for more complicated, distributed installs

And so here’s what we’ve managed to do.

Books and pages are still left, although we’ve found we’re already making assumptions here. Like that a book is a set of scanned page images, when really a book could just as easily be a pdf or .epub or something, right? So some form of that may come a lot sooner than the paged content part of it.

The default descriptive metadata profile is a very slippery eel. We’re sorting out a lot of kinks when it comes to managing and administering your metadata profile through Drupal. And there’s definitely progress on the horizon there. Again, another UTSC innovation. But RDF based metadata profiles are heavy. And complex. Like, very very complex. So we’re most likely going to start by implementing something super basic, like dcterms, and see what issues we run into before attempting something more complicated like EDM or BIBFRAME. And I’ll just stop now to say, I could stay right here, and we could talk about this for an hour no problem. So if I run short we can revisit this to burn whatever time is left :D

Also, turns out exporting JSON-LD from a Drupal entity is easy. Importing… not so much. There’s just a ton of Drupal specific stuff you’ll need to sprinkle into your metadata to make it happen, and we’ll have to step up our level of invasiveness into your metadata to make it happen. If we have time, we can circle back to this when talking about RDF and linked data.
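
To give a feel for why the export direction is the easy one, here is a hedged Python sketch that turns a flat set of fields into a JSON-LD document using dcterms. The function name, the example URI, and the field values are all assumptions of this example; the actual JSON-LD that CLAW emits is not shown in this talk and will look different.

```python
import json

DCTERMS = "http://purl.org/dc/terms/"


def entity_to_jsonld(uri: str, fields: dict) -> str:
    """Serialize simple field/value pairs as JSON-LD with dcterms predicates."""
    doc = {
        "@context": {"dcterms": DCTERMS},  # lets us write dcterms:title as a key
        "@id": uri,                        # the subject of every statement
    }
    for name, value in fields.items():
        doc[f"dcterms:{name}"] = value
    return json.dumps(doc, indent=2)


exported = entity_to_jsonld(
    "http://example.org/node/1",
    {"title": "A Scanned Book", "creator": "Unknown"},
)
```

Going the other way means deciding which entity type, bundle, and field each predicate maps to, and none of that information lives in plain dcterms, which is where the Drupal specific additions come in.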

So, indexing a Drupal from a Fedora? That’s still a huge one. And we’ve taken a lot of precautions to keep it within the realm of possibility. But it’s gonna take a big effort, and right now the RDF and Derivative stuff are just higher priority. Flat out. It’s a super cool feature, but we’ll have to hunker down and keep our noses to the grindstone for a while to make it happen. And it just can’t be prioritized over the other features that are left.

Derivatives… we’re close. I know I’ve been saying that for a while, but we’re making sure this is rock solid and really configurable for users. We’re almost at the point where all of this can be managed through a UI in a super flexible way. I’ve had to go back to the drawing board several times on this, and switch from Rules to Context, and have learned a lot of hard Drupal lessons along the way.

Also, I’d like to say that we’re going above and beyond on our installation goals, as we’re targeting multiple OSes and can spin up production environments, too.

19 of 32

Small But Strong

By Steve Jurvetson (https://www.flickr.com/photos/jurvetson/70704300) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

20 of 32

But wait, there’s more!!!

There’s a lot of things that aren’t on that list we got for free by doing things the Drupal way. For example, by writing our code the Drupal way we wound up with a user interface for bulk operations. With views, you can query your content however you like. And all our operations are formatted as Actions, so when you combine the two (like say in Drupal’s main content admin page), you get something like this:

It’s maybe a little hard to see but you can search your content by type, and then apply actions to some or all of the results. Even the most basic operations like indexing metadata in Fedora or the Triplestore are available. Need to re-index a handful of items that you can identify by a query? Well, here ya go. For free.
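
Stripped of the UI, the combination is "a query that selects items, plus an action applied to each result". The Python sketch below shows only that shape; it is not Drupal’s Views or Action API, and the item fields and the function name are invented for the example.

```python
from typing import Callable, Iterable, List


def bulk_apply(
    items: Iterable[dict],
    matches: Callable[[dict], bool],
    action: Callable[[dict], None],
) -> List[int]:
    """Run `action` on every item selected by `matches`; return the ids touched."""
    touched = []
    for item in items:
        if matches(item):
            action(item)
            touched.append(item["id"])
    return touched


# Hypothetical content, standing in for the results of a view.
content = [
    {"id": 1, "type": "image"},
    {"id": 2, "type": "collection"},
    {"id": 3, "type": "image"},
]

reindexed = []  # stand-in for "index this item in Fedora or the triplestore"
touched = bulk_apply(
    content,
    matches=lambda item: item["type"] == "image",
    action=lambda item: reindexed.append(item["id"]),
)
```

Since the query and the action are independent, any new action automatically works with every query you can build, which is why this came for free.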

And FYI this approach was proposed by Diego Pino to Kim Pham when investigating Rules and VBO integration. We’re using neither of those right now, but we still got this! So thanks to both of you for coming up with this.

21 of 32

But wait, there’s more!!!

22 of 32

But wait, there’s IIIF?

23 of 32

Linked Data

24 of 32

What Does Wikipedia Say?

Linked Data is “a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers.”

25 of 32

TBL FTW

Paul Clarke [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

26 of 32

Linked Data Note of 2006

Use URIs to name (identify) things.
Use HTTP URIs so that these things can be looked up (interpreted, "dereferenced").
Provide useful information about what a name identifies when it's looked up, using open standards such as RDF, SPARQL, etc.
Refer to other things using their HTTP URI-based names when publishing data on the Web.
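
The second and third rules are what people mean by "dereferencing": you ask an HTTP URI for machine-readable data instead of a web page. As a sketch, assuming a hypothetical URI under example.org, this Python builds (but does not send) such a request by asking for JSON-LD in the `Accept` header.

```python
import urllib.request


def build_rdf_request(
    uri: str, media_type: str = "application/ld+json"
) -> urllib.request.Request:
    """Build a GET that asks for an RDF serialization instead of HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


# Hypothetical URI naming a thing; a browser would ask for text/html instead.
request = build_rdf_request("http://example.org/node/1")
```

The same URI can serve HTML to a person and RDF to a machine; which one you get depends on what the client says it accepts, provided the server supports that format.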

27 of 32

Vocabulary

URI
HTTP URI
RDF

Triples
Subject - Predicate - Object
Multiple serializations

SPARQL
Semantics

Ontologies

Dcterms
Bibframe
EDM
OWL
Schema.org

So that’s twice a bunch of words have popped up that you may not be familiar with, so I’ll go through them quickly.

URI: Uniform Resource Identifier. It’s a unique id for something.

HTTP URI: That’s a link! It’s a URI you can visit in a browser.

RDF: Resource Description Framework. It’s a way of describing data in triples. Triples are of the form Subject, Predicate, Object. If you ever see or hear SPO, that’s what that means. And the thing about RDF is that there’s a lot of flavours of it, so to speak. There’s an XML format, there’s a JSON format (which we use exclusively) or you can see RDF just as lists of Triples. Or in a more concise and human readable “Turtle” format.
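
Here is a small Python sketch of a triple and its "list of triples" serialization (N-Triples). The example URI and title are made up, and the escaping is deliberately simplified to backslashes and double quotes, so treat it as an illustration rather than a complete serializer.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # a URI naming the thing being described
    predicate: str                # a URI naming the relationship
    obj: str                      # either another URI or a literal value
    obj_is_literal: bool = False  # True when obj is plain text, not a URI


def to_ntriples(triple: Triple) -> str:
    """Render one triple as a single N-Triples line (simplified escaping)."""
    if triple.obj_is_literal:
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


title = Triple(
    "http://example.org/node/1",
    "http://purl.org/dc/terms/title",
    "A Scanned Book",
    obj_is_literal=True,
)
line = to_ntriples(title)
```

Read aloud, that line says "node/1 has the title A Scanned Book". XML, JSON-LD, and Turtle are just different ways of writing down the same three parts.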

SPARQL is a query language for RDF. It’s not SQL, and is used for different purposes. It’s not overly difficult to learn, but most of the material online is pretty dry and academic.
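
For a less dry way in: the core of a SPARQL query is a triple pattern where some positions are fixed and others are variables. The Python sketch below mimics that with `None` acting as a variable. It is a toy over an in-memory list with made-up data, not a SPARQL engine.

```python
TITLE = "http://purl.org/dc/terms/title"
CREATOR = "http://purl.org/dc/terms/creator"

# Hypothetical data: (subject, predicate, object) tuples.
triples = [
    ("http://example.org/node/1", TITLE, "A Scanned Book"),
    ("http://example.org/node/1", CREATOR, "Unknown"),
    ("http://example.org/node/2", TITLE, "A Photograph"),
]


def match(data, subject=None, predicate=None, obj=None):
    """Return triples fitting the pattern; None behaves like a SPARQL variable."""
    return [
        (s, p, o)
        for (s, p, o) in data
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Roughly: SELECT ?s ?title WHERE { ?s <http://purl.org/dc/terms/title> ?title }
titles = match(triples, predicate=TITLE)
```

Real SPARQL joins several of these patterns together on shared variables, but each individual pattern works like the one above.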

And then there’s semantics. And I think this is where people often get turned off. When you describe things using RDF, if you “make up a new word” so to speak, you have to define it in an

“ontology”. Ontologies are documents providing the context or implications of what you’re saying with RDF. That means there’s more to it than just raw data. These relationships we’re describing _mean something_, and because of that, there can be differences of opinion as to what things on the web in RDF _mean_. And sometimes folks just like to have philosophical arguments about the meaning of things, and if you get wrapped up in one, it’s a little scary to speak up if you’re uninitiated. So I’m treading lightly here. But if you’re curious, there’s lots of ontologies out there to choose from.

28 of 32

So WHY DO ALL THIS?

The purpose of Linked Data is not to expose data and relationships to humans, we have HTML for that. RDF is useful for machines, and linked data is an attempt to standardize how we expose RDF to machines so that they can do more intelligent things.

29 of 32

LDP - LINKED DATA PLATFORM

30 of 32

SO WHAT DOES CLAW DO WITH LINKED DATA?

It’s not what CLAW does with Linked Data. CLAW helps you publish your linked data. You can create it, edit it, delete it, etc… But other than that, what you do with it is up to you and how far you’re willing to go.

31 of 32

Rambling RDF

32 of 32

Fin