1 of 38

Introduction to Data Curation in the Humanities:

Managing Tabular Data

Tierney Gleason

Reference & Digital Humanities Librarian

tgleason11@fordham.edu

Updated December 16, 2022

2 of 38

What is the point of this workshop?

  • To understand the Research Data Life Cycle and learn practices to document and preserve DH scholarship.

3 of 38

Proper understanding of digital scholarship requires an acknowledgement of its entropic nature; the absence of forward planning implies a misunderstanding of the object being produced at a fundamental – perhaps ontological – level.

James Smithies, Carina Westling, Anna-Maria Sichani, Pam Mellen, and Arianna Ciula. "Managing 100 Digital Humanities Projects: Digital Scholarship & Archiving in King's Digital Lab." DHQ: Digital Humanities Quarterly 13, no. 1 (2019).

4 of 38

What are data in the humanities?

Image by Leimenide via Flickr Commons under a CC BY-SA 2.0 license https://creativecommons.org/licenses/by-sa/2.0/

5 of 38

The NEH defines humanities data as “materials generated or collected during the course of conducting research.”

Examples could include:

  • text or images
  • citations
  • software code
  • algorithms
  • digital tools
  • project documentation
  • databases
  • geospatial coordinates
  • reports
  • articles & more

6 of 38

Humanities data per the DH Curation Guide:

  • Digital scholarly editions (TEI)
  • Text corpora
  • Marked-up text (XML, TEI)
  • Digital collections (text, images, metadata, etc.)
  • Data paired with analysis or annotations (data visualizations such as maps, graphs, timelines, etc.)
  • Finding aids, bibliographies, and other “information maps.”1

1 Julia Flanders and Trevor Muñoz, "Introduction to Humanities Data Curation." DH Curation Guide. Accessed September 23, 2019. https://guide.dhcuration.org/contents/intro/.

7 of 38

Digital vs. Data Curation

Digital Curation → Involves maintaining, preserving and adding value to digital research data throughout its lifecycle.

Data Curation → Expands upon the idea of digital curation to include “capturing and preserving not only the data itself, but information about the methods by which it was produced.”

- Julia Flanders and Trevor Muñoz via the DH Curation Guide

8 of 38

Data curation in the humanities combines:

  • Library science
  • Records management
  • Archival description
  • Computer science
  • Humanities disciplines2

2 Julia Flanders and Trevor Muñoz, "Introduction to Humanities Data Curation." DH Curation Guide. Accessed September 23, 2019. https://guide.dhcuration.org/contents/intro/.

9 of 38

The end goal of data curation:

  • To produce data that is machine-readable & interoperable for preservation & reuse.

  • To document sources and decisions that shape data in order to “retain provenance and complex layers of meaning.”3

3 Trevor Muñoz and Allen Renear, "Issues in Humanities Data Curation." Center for Informatics Research in Science and Scholarship (CIRSS), University of Illinois, Urbana-Champaign, 2011. Accessed cirss.ischool.illinois.edu/paloalto/whitepaper/premeeting/.

10 of 38

Data curation & interpretation example

→ A transcription of the census records for everyone living at St. John’s College, as recorded in the Seventh Census of the United States, 1850

1850 United States Federal Census. Census Place: West Farms, Westchester, New York; Roll: M432_615; Pages: 288A – 290A. Seventh Census of the United States, 1850; National Archives Microfilm Publication; Records of the Bureau of the Census, Record Group 29; National Archives, Washington, D.C.

11 of 38

Example cont’d

  • Looking at the 1850 census scan, there are many different decisions to be made about arranging the data in a spreadsheet.
  • If another source is used to interpret handwriting or add detail to the census data, this process should be made visible and documented as part of overarching citation practices.

Images: Courtesy of Fordham University Libraries – Archives and Special Collections via the Internet Archive

12 of 38

Getting Started with Data Curation

Image by cuatrok77 via Flickr Commons under a CC BY-SA 2.0 license https://creativecommons.org/licenses/by-sa/2.0/

13 of 38

Copyright & Data Curation

As you mix original research with existing sources when curating data, be mindful of copyright considerations:

  • Examine the terms of any Creative Commons license attached to your sources.

14 of 38

Keep a Project Notebook

  • A dedicated place to record data sources, adjustments to source data, and decisions about structuring data during the active phases of project development.

  • Project notebooks become project documentation or content for a Humanities Data Curation Record.

15 of 38

Tips for Structuring Tabular Data

  • To be machine-readable, your spreadsheet should have a single header row that labels every column.
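A minimal sketch of why a single header row matters, using Python's built-in csv module. The column names and values below are made up for illustration and do not come from the workshop data.

```python
import csv
import io

# Hypothetical table: one header row, then one record per row.
text = "id,title,year\n01,Hamlet,1850\n02,Macbeth,1851\n"

# DictReader treats the first row as the header and labels every value with it.
reader = csv.DictReader(io.StringIO(text))
records = list(reader)

print(reader.fieldnames)      # ['id', 'title', 'year']
print(records[0]["title"])    # Hamlet
```

With two header rows, the second one would be read as a data record, which is why the table should carry only one.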

16 of 38

Assigning identifiers

  • Assign an identifier to each record (row) in your spreadsheet.

  • To keep records in order when sorted, format IDs as text and/or pad them with leading zeros:

For records 1-10 → 01-10

For records 1-100 → 001-100
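The sequencing above can be sketched in Python. The function name make_ids is an assumption for illustration; it pads each ID to the width of the largest record number.

```python
def make_ids(record_count):
    """Return zero-padded text IDs wide enough for the largest record number."""
    width = len(str(record_count))
    return [str(number).zfill(width) for number in range(1, record_count + 1)]

ids = make_ids(100)
print(ids[0], ids[9], ids[99])   # 001 010 100

# Because every ID has the same width, sorting as text keeps numeric order.
print(sorted(ids) == ids)        # True
```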

17 of 38

Structuring header row of columns

  • For names with multiple words, use underscores rather than spaces:

  • Another option is to use camelCase:
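Both naming styles can be produced from a plain-language label. This is a rough sketch; the function names and the sample label "Place of Birth" are assumptions chosen for illustration.

```python
import re

def to_snake_case(label):
    """Join the words of a label with underscores, all lowercase."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "_".join(word.lower() for word in words)

def to_camel_case(label):
    """Join the words of a label with no separator, capitalizing all but the first."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        return ""
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])

print(to_snake_case("Place of Birth"))   # place_of_birth
print(to_camel_case("Place of Birth"))   # placeOfBirth
```

Whichever style you choose, use it for every column header in the project.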

18 of 38

Organizing Cells

  • Enter 1 item per cell.

  • Consider adding additional columns to accommodate more data. For example:
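As one possible illustration, the census place from the 1850 example holds three items in one value. The sketch below splits it into three columns; the column names town, county, and state are assumptions.

```python
def split_place(place):
    """Split 'town, county, state' into one item per column."""
    town, county, state = [part.strip() for part in place.split(",")]
    return {"town": town, "county": county, "state": state}

row = split_place("West Farms, Westchester, New York")
print(row)
# {'town': 'West Farms', 'county': 'Westchester', 'state': 'New York'}
```

Record a split like this in your project notebook so the change to the source data stays visible.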

19 of 38

Develop a controlled vocabulary

  • Use controlled vocabularies to organize your data into categories.

  • Use data validation tools to manage data entry integrity for controlled vocabularies.

Consistency is key when structuring data.
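Spreadsheet programs offer their own data validation tools. The same check can also be sketched in Python; the vocabulary, column name, and rows below are hypothetical.

```python
# Hypothetical controlled vocabulary for an "occupation" column.
OCCUPATIONS = {"student", "professor", "laborer"}

def find_invalid(rows, column, vocabulary):
    """Return (row_number, value) for every value that is not in the vocabulary."""
    return [
        (number, row[column])
        for number, row in enumerate(rows, start=1)
        if row[column] not in vocabulary
    ]

rows = [
    {"id": "01", "occupation": "student"},
    {"id": "02", "occupation": "Student"},    # inconsistent capitalization
    {"id": "03", "occupation": "profesor"},   # typo
]

problems = find_invalid(rows, "occupation", OCCUPATIONS)
print(problems)   # [(2, 'Student'), (3, 'profesor')]
```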

Image via Openclipart.org

20 of 38

Avoid punctuation & symbols

  • Refrain from using punctuation or symbols in datasets.

# $ / : , * @ % [ } ! ?

  • If you need to use data with punctuation and/or symbols, make sure that text is encased in double quotes:
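A short sketch of quoting with Python's built-in csv module, which by default (csv.QUOTE_MINIMAL) wraps a value in double quotes when it contains a comma. The column name census_place is an assumption; the value comes from the census example.

```python
import csv
import io

rows = [
    ["id", "census_place"],
    ["01", "West Farms, Westchester, New York"],
]

# Write the rows to an in-memory file; the writer adds quotes where needed.
buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)
text = buffer.getvalue()
print(text)
# id,census_place
# 01,"West Farms, Westchester, New York"

# Reading it back keeps the quoted value in a single cell.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed == rows)   # True
```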

21 of 38

Formatting numbers & dates

  • Format numbers as text to prevent automatic calculations by Excel or Google Sheets.

  • Format dates using the ISO 8601 standard:

YYYY-MM-DD → 2019-09-24
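Dates entered in another style can be converted to YYYY-MM-DD. This is a sketch under the assumption that the source dates are written as month/day/year; the function name to_iso_date is illustrative.

```python
from datetime import datetime

def to_iso_date(text, source_format):
    """Parse a date written in source_format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()

dates = [
    to_iso_date("9/24/2019", "%m/%d/%Y"),
    to_iso_date("12/16/2022", "%m/%d/%Y"),
    to_iso_date("7/5/2017", "%m/%d/%Y"),
]
print(dates)           # ['2019-09-24', '2022-12-16', '2017-07-05']

# ISO 8601 dates stored as text sort in chronological order.
print(sorted(dates))   # ['2017-07-05', '2019-09-24', '2022-12-16']
```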

22 of 38

Avoid color coding

  • Most software and programming languages cannot use cell color to query or automatically sort the information in your data set, and color is lost when data is saved as .csv.

  • Save color coding for data visualization.

23 of 38

Managing blank cells

  • Manage blank cells consistently: make clear whether data is missing or a cell was left blank intentionally.

  • The meaning of blank cells can be explained in data documentation.

  • Blank cells (intentional or due to incomplete data) can be marked with chosen values like “NA”, “null”, or “blank.”
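One way to apply a chosen value consistently, sketched in Python. The marker "NA", the function name, and the sample rows are assumptions for illustration.

```python
MISSING = "NA"   # the value chosen to mark blank cells

def mark_blanks(rows, marker=MISSING):
    """Replace empty or whitespace-only cells with the chosen marker."""
    return [
        {
            key: (value if value is not None and value.strip() else marker)
            for key, value in row.items()
        }
        for row in rows
    ]

rows = [
    {"id": "01", "occupation": "student"},
    {"id": "02", "occupation": ""},
    {"id": "03", "occupation": "   "},
]

cleaned = mark_blanks(rows)
print([row["occupation"] for row in cleaned])   # ['student', 'NA', 'NA']
```

If intentional blanks and missing data need to be told apart, use two different markers and explain both in your data documentation.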

24 of 38

UTF-8 Encoding

  • UTF-8 encoding protects characters that use accents or diacritical marks in your data set.

  • In Excel, choose the "CSV UTF-8 (Comma delimited)" file format when saving to keep UTF-8 encoding.

  • Google Sheets should have automatic UTF-8 encoding.
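If you read or write .csv files with code, name the encoding explicitly. A minimal Python sketch, assuming a temporary file called names.csv; the sample name is taken from the citations in this workshop.

```python
import csv
import os
import tempfile

rows = [["id", "name"], ["01", "Trevor Muñoz"]]
path = os.path.join(tempfile.mkdtemp(), "names.csv")

# Write the file as UTF-8 so the accented character is stored correctly.
with open(path, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Read it back with the same encoding.
with open(path, encoding="utf-8", newline="") as handle:
    restored = list(csv.reader(handle))

with open(path, "rb") as handle:
    raw_bytes = handle.read()

print(restored[1][1])                      # Trevor Muñoz
print("ñ".encode("utf-8") in raw_bytes)    # True
```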

25 of 38

One table per sheet

  • Before analyzing your data with software or digital tools, make sure data is organized with 1 table per sheet.

  • Many programs and digital tools read only one tab at a time, and a .csv file can hold only one table.

  • Separate multiple tabs into separate spreadsheets.

26 of 38

File naming conventions

  • Use words in file names that describe your project.

  • Limit file names to 32 characters or fewer with no spaces, symbols, or special characters.

  • Use underscores _ or camelCase (NYCtheatresHarrison_20190924)

  • Be consistent if naming files in a related series or group.
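These conventions can be checked with a short script. The sketch below assumes the 32-character limit applies to the name without its extension; the function name and messages are illustrative.

```python
import re
from pathlib import Path

MAX_LENGTH = 32
ALLOWED = re.compile(r"[A-Za-z0-9_]+")   # letters, digits, and underscores only

def check_file_name(file_name):
    """Return a list of problems found in a file name (empty if none)."""
    stem = Path(file_name).stem   # the name without its extension
    problems = []
    if len(stem) > MAX_LENGTH:
        problems.append("longer than 32 characters")
    if not ALLOWED.fullmatch(stem):
        problems.append("contains spaces, symbols, or special characters")
    return problems

print(check_file_name("NYCtheatresHarrison_20190924.csv"))
# []
print(check_file_name("NYC theatres (Harrison).csv"))
# ['contains spaces, symbols, or special characters']
```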

27 of 38

Versioning & write-protection

  • Mark versions of your files by name:

NYCtheatresHarrison_20190924v5.csv

  • Write-protect files (for example, by setting them to read-only) to ensure data integrity as files are opened, closed, and edited.
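A rough sketch of both ideas in Python: building a versioned file name like the one above and removing write permission from a file. The function names are assumptions, and your operating system's file properties dialog can set read-only status as well.

```python
import os
import stat
import tempfile

def versioned_name(base, date, version, extension=".csv"):
    """Build a file name such as NYCtheatresHarrison_20190924v5.csv."""
    return f"{base}_{date}v{version}{extension}"

def write_protect(path):
    """Remove write permission so the file cannot be changed by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

name = versioned_name("NYCtheatresHarrison", "20190924", 5)
print(name)   # NYCtheatresHarrison_20190924v5.csv

# Demonstrate on a temporary file.
path = os.path.join(tempfile.mkdtemp(), name)
with open(path, "w", encoding="utf-8") as handle:
    handle.write("id,title\n")
write_protect(path)
```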

28 of 38

Data storage

Ensure research data is backed up in more than one place:

    • A networked hard drive

    • Cloud or repository

    • Removable media stored offsite (drive or CD)
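Backups are only useful if the copies match the original. The sketch below copies a file to more than one place and compares checksums; the folder names are placeholders created in a temporary directory, not real storage locations.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def back_up(source, backup_dirs):
    """Copy source into each directory and confirm every copy matches."""
    copies = []
    for directory in backup_dirs:
        Path(directory).mkdir(parents=True, exist_ok=True)
        target = Path(directory) / Path(source).name
        shutil.copy2(source, target)
        if sha256_of(target) != sha256_of(source):
            raise OSError(f"Backup does not match the original: {target}")
        copies.append(target)
    return copies

# Demonstrate with placeholder folders standing in for real storage.
workspace = Path(tempfile.mkdtemp())
source = workspace / "data.csv"
source.write_text("id,title\n01,Hamlet\n", encoding="utf-8")

copies = back_up(source, [workspace / "network_drive", workspace / "cloud"])
print(len(copies))   # 2
```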

29 of 38

Preparing Humanities Data for Publication

Image by cuatrok77 via Flickr Commons under a CC BY-SA 2.0 license https://creativecommons.org/licenses/by-sa/2.0/

30 of 38

Preparing files for publication

  • Working files of data should be different from files submitted for publication.

  • Save data for publication in interoperable formats like .csv rather than proprietary formats like Excel.

  • Publish raw data only; strip dataset of calculations or macros before uploading to a repository.

31 of 38

Data documentation for repositories

  • Project documentation describes data, how it was produced, licensing for reuse, etc.

Examples of data documentation found in repositories:

  • README file

  • Codebook

32 of 38

Humanities Data Curation Record (HDCR)

  • Developed by librarians Thomas Padilla and Brandon Locke while at Michigan State University Libraries.

  • Designed as a format for a README file to describe and contextualize humanities data sets.

  • Check out the HDCR GitHub page: https://github.com/datapraxis/hdcr

33 of 38

HDCR supports reuse & reproducibility by:

  • Documenting data sources

  • Describing data types, formats, and/or features

  • Providing an assessment of data quality

  • Explaining methods & tools used to transform or gain insights from data4

4 Thomas Padilla and Brandon Locke, "Humanities Data Curation Record." GitHub. Last modified July 5, 2017. Accessed September 23, 2019. https://github.com/datapraxis/hdcr.

34 of 38

HDCRs aim to help researchers:

  • Understand how data are organized

  • Access the methods and tools used to support analysis

  • Understand data “cleaning” and transformation processes

  • Identify data source(s)5

  • HDCR Example

5 Thomas Padilla and Brandon Locke, "Humanities Data Curation Record." GitHub. Last modified July 5, 2017. Accessed September 23, 2019. https://github.com/datapraxis/hdcr.

35 of 38

Use Project Notebook Content for an HDCR

  • Project Summary
  • Data Quality
  • Type, Format, Extent, Size
  • Filenaming conventions
  • Modifications
  • Methods & Tools
  • Sources
  • Reuse license6

What would you add to the HDCR template?

6 Thomas Padilla and Brandon Locke, "Humanities Data Curation Record." GitHub. Last modified July 5, 2017. Accessed September 23, 2019. https://github.com/datapraxis/hdcr.

36 of 38

Publishing research data

  • Upload your data set and supporting documentation to a digital repository that offers a DOI (Digital Object Identifier).

  • Apply for an ORCID (Open Researcher and Contributor ID) to track your contributions to the open research community.

37 of 38

Recommended data repositories

  • Humanities Commons

  • Harvard Dataverse

  • Zenodo

  • figshare

38 of 38

We hope this workshop was helpful!

  • Check out the Digital Humanities guide for additional tips & updates.

  • Contact tgleason11@fordham.edu for questions or to schedule a research consultation.

Image via Openclipart.org