1 of 14

(Meta)data Analysis for DEI

The University of Alabama

Brian Clark & Catherine Smith

#coreforum2022

2 of 14


10/15/2022


3 of 14

Impetus & Overview of Methodology

  • Presuppositions: 
    • Inadequately described resources are effectively lost in a library's collection
    • There are issues with LCSH that can impede subject access, particularly for resources related to historically marginalized communities and identities
  • Developed a method for evaluating subject headings across a library's collection using sample data:
    • 468,100 MARC records
    • Limited to records with an LCC number (050) and at least one LCSH (650 _0)
    • Compiled a list of LCSH relating to 1) African-Americans; 2) LGBTQIA+; 3) Women; 4) Indigenous peoples

4 of 14

Selected LCSH

5 of 14

Diagram of Workflow

  • Python
    • MARC records
    • Read into PyMARC
    • Meets criteria for analysis?
      • No: discard
      • Yes: Write to CSV
    • CSV file
  • R
    • Read into R
    • Clean and normalize data
    • Sort/group records by subject headings
    • Produce visualizations of data

6 of 14

Python

  • Input: File of MARC records
  • Output: CSV file (one row per record) containing the record ID, the first 050, the year of publication, and all LCSH as a JSON array, with ‘NULL’ in place of any field not present in the record

  • Data types:
    • String
    • List
    • Integer
    • PyMARC record/field
  • Libraries
    • csv
    • re
    • json
    • pymarc
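
The input/output contract on this slide could be sketched roughly as follows. This is a minimal illustration, not the presenters' script: the function names (`first_or_null`, `make_row`, `iter_rows`, `write_csv`), the column names, and taking the year from 008 positions 07-10 are all assumptions. The only pymarc calls used are `MARCReader`, `Record.get_fields`, `Field.value()`, and `Field.indicator2`.

```python
import csv
import json


def first_or_null(values):
    """Return the first item of a list, or 'NULL' if the list is empty."""
    return values[0] if values else "NULL"


def make_row(record_id, lcc_values, year, lcsh_values):
    """Build one CSV row: record ID, first 050, year, LCSH as a JSON array."""
    lcsh_cell = json.dumps(lcsh_values) if lcsh_values else "NULL"
    return [record_id or "NULL", first_or_null(lcc_values), year or "NULL", lcsh_cell]


def iter_rows(marc_path):
    """Yield one row per MARC record (requires the third-party pymarc package)."""
    import pymarc  # imported here so the helpers above work without it

    with open(marc_path, "rb") as fh:
        for record in pymarc.MARCReader(fh):
            if record is None:  # pymarc yields None for records it cannot parse
                continue
            ids = [f.data for f in record.get_fields("001")]
            lcc = [f.value() for f in record.get_fields("050")]
            # Assumption: publication year is read from 008 positions 07-10.
            years = [f.data[7:11] for f in record.get_fields("008")]
            lcsh = [str(f) for f in record.get_fields("650") if f.indicator2 == "0"]
            yield make_row(first_or_null(ids), lcc, first_or_null(years), lcsh)


def write_csv(marc_path, csv_path):
    """Write all rows to a CSV file under an (assumed) header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["record_id", "lcc", "year", "lcsh"])
        writer.writerows(iter_rows(marc_path))
```

Storing the headings as a JSON array inside a single CSV cell keeps the file at one row per record while still allowing any number of headings per record.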

7 of 14

Python: PyMARC

  • PyMARC is a Python 3 library for working with bibliographic data encoded in MARC21.
  • “It provides an API for reading, writing and modifying MARC records. It was mostly designed to be an emergency eject seat, for getting your data assets out of MARC and into some kind of saner representation. However over the years it has been used to create and modify MARC records, since despite repeated calls for it to die as a format, MARC seems to be living quite happily as a zombie.”

8 of 14

Python: 650 field decision tree

  • 650 in record?
    • No: Append ‘NULL’ to output row
    • Yes: ind 2 = 0?
      • No: Ignore
      • Yes: Append to LCSH list
  • Return list
  • Count items in LCSH list
  • Contains items?
    • No: Append ‘NULL’ to output row
    • Yes: Append to output row as JSON

650 fields in record:

["=650  \0$aWomen$zAsia$xHistory.$0http://id.loc.gov/authorities/subjects/sh85147274",
 "=650  \0$aWomen$zAsia$xSocial conditions.$0http://id.loc.gov/authorities/subjects/sh85147274",
 "=650  \0$aFeminism$zAsia.$0http://id.loc.gov/authorities/subjects/sh85047741",
 "=650  \4$aFeminism$zAsia.",
 "=650  \4$aWomen$zAsia$xHistory.",
 "=650  \4$aWomen$zAsia$xSocial conditions."]

LCSH list:

["$aWomen$zAsia$xHistory.",
 "$aWomen$zAsia$xSocial conditions.",
 "$aFeminism$zAsia."]
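
The decision tree can be sketched as a small function over the text form of the fields shown in the example. This is an illustrative sketch rather than the presenters' code: the name `lcsh_cell` and the regular expressions are assumptions, and dropping the `$0` authority URI is inferred from the example output on this slide.

```python
import json
import re

# Matches mnemonic-style lines such as "=650  \0$aWomen$zAsia."
# Groups: indicator 1, indicator 2, subfield data. A backslash stands for a blank.
FIELD_650 = re.compile(r"^=650\s+(.)(.)(\$.*)$")


def lcsh_cell(field_lines):
    """Apply the decision tree to a record's fields; return a JSON array or 'NULL'."""
    lcsh = []
    for line in field_lines:
        match = FIELD_650.match(line)
        if match is None:          # not a 650 field: skip
            continue
        if match.group(2) != "0":  # second indicator is not 0: ignore
            continue
        # Drop the $0 authority URI so only the heading text remains.
        heading = re.sub(r"\$0[^$]*", "", match.group(3))
        lcsh.append(heading)
    # An empty list means no LCSH were found, so the cell becomes 'NULL'.
    return json.dumps(lcsh) if lcsh else "NULL"
```

Called on the six fields in the example, the function keeps only the three headings whose second indicator is 0; a record with no such fields yields the string 'NULL'.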

9 of 14

R

  • R (programming language) 
    • Popular language for statistical computing and creating graphical representations of data
    • Effective tool for cleaning "messy" data
  • RStudio
    • Open source software for interacting with R code in a graphical user interface (GUI)

10 of 14

R

  • Tidyverse
    • A collection of packages designed for common data science tasks
  • Regular Expressions (RegEx)
    • Useful for parsing complicated text strings and identifying patterns
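
As an illustration of the kind of pattern matching meant here, the sketch below splits a heading string into its subfields. It is written in Python to match the first half of the workflow; the same pattern could be used from R. The names `split_heading` and `display_form` are hypothetical, and stripping the trailing period is a simplification that would also affect headings ending in an abbreviation.

```python
import re

# One subfield: a "$" delimiter, a one-character code, then text up to the next "$".
SUBFIELD = re.compile(r"\$(\w)([^$]*)")


def split_heading(heading):
    """Split '$aWomen$zAsia$xHistory.' into (code, text) pairs."""
    return [(code, text.strip().rstrip(".")) for code, text in SUBFIELD.findall(heading)]


def display_form(heading):
    """Join subfield texts with the conventional double-dash separator."""
    return "--".join(text for _, text in split_heading(heading))
```

For example, `display_form("$aWomen$zAsia$xHistory.")` gives `Women--Asia--History`.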

11 of 14

R

  • Clean
    • Remove NULL entries for LCC and LCSH
    • Remove symbols from LCC
  • Organize
    • Format LCSH from JSON body to individual strings
    • Group/split LCSH by record
  • Analyze
    • Identify records with selected subject headings
    • Create graphs to show significant relationships

Utility vs. Exploration
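
The presenters carried out the clean, organize, and analyze steps in R. Purely as an illustration of the logic, a Python analogue might look like the sketch below. The column names, the `SELECTED` set, the choice of which characters count as "symbols" in an LCC number, and matching on the `$a` term alone are all assumptions.

```python
import csv
import io
import json
import re

# Hypothetical selection list; the deck's actual list of LCSH is not reproduced here.
SELECTED = {"Women", "Feminism"}


def clean_lcc(lcc):
    """Keep only letters, digits, periods and spaces in an LCC number."""
    return re.sub(r"[^A-Za-z0-9. ]", "", lcc).strip()


def load_rows(csv_text):
    """Clean and organize: drop NULL rows, tidy LCC, decode the LCSH JSON array."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["lcc"] == "NULL" or row["lcsh"] == "NULL":
            continue
        rows.append({
            "record_id": row["record_id"],
            "lcc": clean_lcc(row["lcc"]),
            "lcsh": json.loads(row["lcsh"]),
        })
    return rows


def main_term(heading):
    """Return the $a portion of a heading such as '$aWomen$zAsia$xHistory.'"""
    match = re.match(r"\$a([^$]*)", heading)
    return match.group(1).strip().rstrip(".") if match else ""


def has_selected(record):
    """Analyze: does any heading on this record start with a selected term?"""
    return any(main_term(h) in SELECTED for h in record["lcsh"])
```

Records for which `has_selected` returns true form the subset that would then be grouped and graphed.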

12 of 14

Sample Findings

13 of 14

Sample Findings

Co-occurrences of LCSH w/ "Indigenous peoples"-related LCSH

Co-occurrences of LCSH w/ "Women"-related LCSH
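
One simple way to tabulate co-occurrences like those charted above is to count, for every record carrying at least one selected heading, the other headings on that record. The sketch below is an assumption about how such counts could be produced, not the presenters' R code; `cooccurrences` is a hypothetical name and the headings are assumed to be already normalized to plain strings.

```python
from collections import Counter


def cooccurrences(records, selected):
    """Count headings that appear on the same record as any selected heading."""
    counts = Counter()
    for headings in records:
        present = set(headings)
        if present & selected:
            # Count each non-selected heading once per record.
            counts.update(present - selected)
    return counts
```

Calling `cooccurrences(records, selected).most_common(10)` would then give the ten headings most often assigned alongside the selected group.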

14 of 14

Questions?

B. Clark & C. Smith (2022) "Prioritizing the People: Developing a Method for Evaluating a Collection’s Description of Diverse Populations," Cataloging & Classification Quarterly, DOI: 10.1080/01639374.2022.2090042 

GitHub repo: https://github.com/bpclark2/Core-Forum-2022