DHDC.Exercise

NYPL Menus

Data Curation Exercise

These activities give you hands-on experience with some of the tasks that data curators might need to do in order to add value to a dataset and maintain its usefulness over time.

Significant Properties

You are taking over responsibility for the data from the New York Public Library’s “What’s on the Menu?” crowdsourcing project. Your partners in IT are setting up a web crawler to capture as much of the existing state of the project website as possible, but it’s not clear how much they’ll be able to capture. The only data you can be sure of having access to are the downloadable data files. The first task is to understand what information exists in the downloaded files and to compare this to the online application. Explore the downloaded files and the online site and take notes, trying to capture the significant properties of the data.

Navigate to http://menus.nypl.org
Identify the components of this object that we might want to maintain as a dataset
Make a table of these components. For each component:

Assign one of five categories: content, context, appearance, behavior, and structure
For more info on the categories, refer to: http://www.jisc.ac.uk/whatwedo/programmes/preservation/2008sigprops
Rate, on a scale of 1-10, how significant this component is to preserve

Download and unzip the latest data package from the data page

Take approximately 15 minutes to complete this part of the exercise. Compare your inventory with people near you. After 15 minutes, pretend that you no longer have access to the menus.nypl.org site. Any other domain is fair game except that one.

Open the various files you downloaded
Compare what’s available in the data files against your inventory. Can you find all the components you identified? Where something is obviously different (for example, in appearance), note what the different versions communicate
You can use other domains to look for information you might need (Hint: Try NYPL Digital Collections, Digital Gallery and other research resources provided by the library)
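
One way to begin comparing the data files against your inventory is to list the column headers of each file. The following is a minimal Python sketch, not part of the original exercise. It assumes the data package unzips to a folder of UTF-8 CSV files; the function name `inventory_columns` is a placeholder invented for this example.

```python
import csv
from pathlib import Path


def inventory_columns(folder):
    """Map each CSV file name in `folder` to its list of column headers."""
    inventory = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # utf-8-sig reads plain UTF-8 and also drops a byte-order mark if present
        with path.open(newline="", encoding="utf-8-sig") as handle:
            reader = csv.reader(handle)
            # The first row is assumed to hold the headers; empty files give []
            inventory[path.name] = next(reader, [])
    return inventory
```

Calling `inventory_columns("menus_data")`, where `menus_data` stands in for wherever you unzipped the package, returns a dictionary you can print and check against the components in your table.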

Islandora

Interacting with Repository Systems

Depositing Items

Congratulations! Your institution/project/band of friends has gained access to a “digital repository platform” into which you can deposit your research data. This exercise is meant to give you hands-on experience with the specifics of how data gets deposited into a system designed for its curation and preservation. The point here is not to train you in the operation of a particular platform or to give you the expectation that you will need to carry out these specific tasks yourself on your own data. However, if you’re going to deposit your data, SOMEONE will need to negotiate a set of tasks similar to these.

So, how does data deposit work? And what implications do these procedures have for your data? We’ll walk through the deposit process in Islandora to explore this question.

The DHDC demonstration repository: http://mith.clients.discoverygarden.ca/

Islandora documentation wiki: https://wiki.duraspace.org/display/ISLANDORA712/Islandora

Task: Deposit different kinds of items into the repository (you can use your own data or items from publicly accessible repositories; see below)

1. Start the deposit workflow by clicking “Submit to a collection”

2. This will take you to a screen with the existing collections that come with the repository out of the box (these are organized by content type).

3. Select a collection (don’t worry if there are no items in this view yet), then click on the manage tab

4. In the resulting pop-up, click on the link to add an item

5. When asked to upload a MARCXML file, skip this upload step and simply click “Next”

6. Fill in some metadata. Be sure to follow the links to vocabularies suggested for supplying values for specific fields

7. When you’re done, click through, upload a file, and finish the process by clicking “Ingest”

Try this with a few different types of objects. After the ingest process is done:

Some Sources of Objects:

1. Rijksmuseum: https://www.rijksmuseum.nl/en

2. Internet Archive (good for books): http://archive.org/

3. DPLA: http://dp.la/

Proto-Personas

This is a quick brainstorming exercise designed to help teams put themselves into the perspective of potential users through the creation of “personas” that represent archetypes of “people” that data owners believe might use their data. Real user personas are often extensively researched and developed in great detail. These will be more proto-personas.

In small teams, create diagrams of some proto-personas that represent people who might use the NYPL menus data:

In the top left quadrant: sketch the user, give him/her a name, and annotate your sketch with demographic characteristics
In the bottom left quadrant: list demographic information about this person
In the top right quadrant: list some behaviors and beliefs of this person
In the bottom right quadrant: make a list of this person’s needs and goals

The idea of proto-personas is drawn from: http://uxmag.com/articles/using-proto-personas-for-executive-alignment

Digital Humanities Data Curation

Using Open Refine

1. Choose ‘Create Project’ from the menu on the right

2. Make sure the data file we provided has been unzipped into a folder somewhere on your machine

3. It’s best to upload files one at a time. We recommend starting with ‘Menu.csv’

4. Open Refine will parse the file and ask you to choose some options

The most important of these are:

Select ‘utf-8’ as the character encoding
Make sure that ‘commas (CSV)’ is selected as the column separator

The display of the data should “look right” in the preview—if something’s not right, check the way that headings are being handled (the select boxes on the right).

5. Give your project a name and click the ‘Create Project’ button at the top right.

6. Select the downward pointing arrow next to the name of a column

7. The first thing to do is some basic normalization. Go to Edit cells > Common transforms. You might want to try:

Trim leading and trailing whitespace
Collapse consecutive whitespace
Pick a capitalization scheme

8. Next, we’ll work on faceting and clustering.

9. Choose text facet from the drop-down menu


10. Start by creating a text facet on the ‘sponsor’ column

11. Inspect the values that appear in the box on the right (try sorting by ‘count’)

12. When you’ve looked through the values a bit, click the cluster button—a menu will appear.

13. You can supply the preferred value for a cluster of similar items in the text box, then select the check box under ‘Merge?’

14. Do a few clusters then click the ‘Merge Selected & Re-Cluster’ button
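
To see roughly what the common transforms and the clustering step are doing, here is a small Python sketch. It is an illustration only: `fingerprint` is a simplified approximation of a key-collision approach to clustering, not Open Refine’s actual implementation, and all function names and sample values are invented for this example.

```python
import re
import string
from collections import Counter, defaultdict


def normalize_cell(value):
    """Trim leading/trailing whitespace and collapse internal runs to one space."""
    return re.sub(r"\s+", " ", value.strip())


def fingerprint(value):
    """Build a simplified clustering key: lowercase, no punctuation, sorted unique tokens."""
    cleaned = normalize_cell(value).lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster_values(values):
    """Group values that share a fingerprint; return {preferred value: [variants]}."""
    groups = defaultdict(Counter)
    for value in values:
        groups[fingerprint(value)][normalize_cell(value)] += 1
    clusters = {}
    for counts in groups.values():
        if len(counts) > 1:  # only report groups that actually contain variants
            # The most frequent spelling is proposed as the preferred value
            preferred = counts.most_common(1)[0][0]
            clusters[preferred] = sorted(counts)
    return clusters
```

For example, `cluster_values(["Hotel Astor", "HOTEL ASTOR", " Astor  Hotel "])` groups all three spellings under a single preferred value, much as supplying a value and ticking ‘Merge?’ does in the clustering menu. In Open Refine you still review each proposed cluster yourself, since values that look alike are not always the same sponsor.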

For more, see the Open Refine lesson at Programming Historian.


Digital Repository Interface Critique

(inspired by Shannon Mattern’s “Digital Archives” studio assignment: http://www.wordsinspace.net/wordpress/2014/01/22/interface-critique-revisited-thinking-about-archival-interfaces/)

The goal of this exercise is to understand how “digital repositories” shape and participate in doing the work of data curation. We’ll do this by exploring the backend or administrative interfaces of a commonly-used digital repository platform.

As you explore the site, consider these questions:

How does the site structure the user’s experience and navigation?
Take a little time (working in groups) to try to deposit a couple of different types of items*
How does the site contextualize digital objects? Does it provide links or specific kinds of interactions?
What are the hierarchies for information? What’s on the surface? What information/functionality do you find if you dig deeper?
Where are the seams between different parts? What components can you identify?
What specific actions can you figure out how to accomplish in a short amount of time? This is one way of answering the question: What does this do?
What does the system assume you know? What does it assume you want to do?
What’s missing? What doesn’t it do?

As we mentioned, software is just one part of the system that makes up a “digital repository”. There are standards that specify what “good” or “trusted” repositories need to do. One of these is the Trustworthy Repositories Audit & Certification (TRAC) checklist.

Take a few minutes to browse the checklist (starts on pg. 9) and compare with the list of activities or functions you identified above.

* See separate instructions.