1 of 14

(Meta)data Analysis for DEI

The University of Alabama

Brian Clark & Catherine Smith

#coreforum2022

2 of 14


10/15/2022


3 of 14

Impetus & Overview of Methodology

Presuppositions:

Inadequately described resources are effectively lost in a library's collection
There are issues with LCSH that can impede subject access, particularly for resources related to historically marginalized communities and identities

Developed a method for evaluating subject headings across a library's collection using sample data:

468,100 MARC records
Limited to records with an LCC number (050) and at least one LCSH (650 _0)
Compiled a list of LCSH relating to 1) African-Americans; 2) LGBTQIA+; 3) Women; 4) Indigenous peoples

4 of 14

Selected LCSH

5 of 14

Diagram of Workflow

Python:

MARC records
Read into PyMARC
Meets criteria for analysis?
  Yes: Write to CSV
  No: discard

CSV file

R:

Read into R
Clean and normalize data
Sort/group records by subject headings
Produce visualizations of data

6 of 14

Python

Input: File of MARC records
Output: CSV file with one row per record, containing the record ID, the first 050, the year of publication, and all LCSH as a JSON array; a field that is not present in the record is written as ‘NULL’
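
The output format described above can be sketched with the standard library alone. This is a minimal illustration, not the presenters' script: the function name `build_row`, the column order, and the sample values (including the record IDs and the class number) are assumptions made for the example.

```python
import csv
import io
import json

NULL = 'NULL'  # placeholder named on the slide for a missing field


def build_row(record_id, first_050, year, lcsh_list):
    """Return one CSV row: ID, first 050, year, and LCSH as a JSON array.

    build_row and its argument names are hypothetical; any missing value
    is replaced by the 'NULL' placeholder.
    """
    return [
        record_id if record_id else NULL,
        first_050 if first_050 else NULL,
        str(year) if year else NULL,
        json.dumps(lcsh_list) if lcsh_list else NULL,
    ]


# Write two made-up rows to an in-memory CSV to show the shape of the output.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(build_row('rec001', 'HQ1726', 1998, ['$aWomen$zAsia$xHistory.']))
writer.writerow(build_row('rec002', None, None, []))
print(buffer.getvalue())
```

Storing the headings as a single JSON cell keeps one row per record even though a record can carry any number of 650 fields.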

Data types:

String
List
Integer
PyMARC record/field

Libraries

csv
re
json
pymarc

Data types demo:

Create string

Print string

String variable

Print string type

String slicing

Concatenate strings

Create empty list

Append string to list

Print list

Print first item of list

Append integer

Print list

Print list type

Print items types

Write list to csv

# Strings
sample_string = 'string'
print(sample_string)
print(type(sample_string))
print(sample_string[3:])
print(sample_string + ' other string')

# Lists
my_list = []
my_list.append(sample_string)
print(my_list)
print(type(my_list))
print(type(my_list[0]))
my_list.append('other string')

# Write the list to a CSV file as a single row
import csv
file_out = open('file_out.csv', 'w', encoding='utf-8', newline='')
csv_writer = csv.writer(file_out, delimiter=',')
csv_writer.writerow(my_list)
file_out.close()

7 of 14

Python: PyMARC

PyMARC is a Python 3 library for working with bibliographic data encoded in MARC21.
“It provides an API for reading, writing and modifying MARC records. It was mostly designed to be an emergency eject seat, for getting your data assets out of MARC and into some kind of saner representation. However over the years it has been used to create and modify MARC records, since despite repeated calls for it to die as a format, MARC seems to be living quite happily as a zombie.”
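
As a rough sketch of how PyMARC might be used to apply the selection criteria from the methodology (an 050 field and at least one 650 with second indicator 0), consider the following. `MARCReader`, `get_fields`, and `indicator2` are part of pymarc; the helper names `lcsh_fields`, `meets_criteria`, and `read_matching_records`, and the file path passed in, are hypothetical. The pymarc import sits inside the reader function so the helpers can be tried without the library installed.

```python
def lcsh_fields(record):
    """Return the 650 fields whose second indicator is '0' (LCSH)."""
    return [f for f in record.get_fields('650') if f.indicator2 == '0']


def meets_criteria(record):
    """True if the record has an 050 and at least one LCSH 650 field."""
    return bool(record.get_fields('050')) and bool(lcsh_fields(record))


def read_matching_records(path):
    """Yield records from a binary MARC file that meet the criteria."""
    from pymarc import MARCReader  # third-party: pip install pymarc

    with open(path, 'rb') as handle:
        for record in MARCReader(handle):
            # MARCReader yields None for a record it could not parse.
            if record is not None and meets_criteria(record):
                yield record
```

A call such as `for record in read_matching_records('records.mrc'): ...` would then iterate only over records that pass the filter; the file name is a placeholder.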

8 of 14

Python: 650 field decision tree

650 in record?
  No: Append ‘NULL’ to output row
  Yes: for each 650, ind 2 = 0?
    Yes: Append to LCSH list
    No: Ignore

Return list

Count items in LCSH list

Contains items?
  Yes: Append to output row as JSON
  No: Append ‘NULL’ to output row

Example 650 fields in a record:

[“=650 \0$aWomen$zAsia$xHistory.$0http://id.loc.gov/authorities/subjects/sh85147274”,

“=650 \0$aWomen$zAsia$xSocial conditions.$0http://id.loc.gov/authorities/subjects/sh85147274”,

“=650 \0$aFeminism$zAsia.$0http://id.loc.gov/authorities/subjects/sh85047741”,

“=650 \4$aFeminism$zAsia.”,

“=650 \4$aWomen$zAsia$xHistory.”,

“=650 \4$aWomen$zAsia$xSocial conditions.”]

Resulting LCSH list:

[“$aWomen$zAsia$xHistory.”,

“$aWomen$zAsia$xSocial conditions.”,

“$aFeminism$zAsia.”]
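
The decision tree can be illustrated on the text form of the fields shown in the example. This is a sketch under assumptions, not the presenters' code: it works on strings rather than PyMARC field objects, the function names are hypothetical, and dropping the `$0` URI is inferred from the example output.

```python
import json
import re

# Tag 650, two indicator characters, then the subfield string.
FIELD_650 = re.compile(r'^=650\s+(.)(.)(\$.*)$')


def lcsh_from_lines(lines):
    """Collect subfield strings from 650 lines whose second indicator is 0."""
    headings = []
    for line in lines:
        match = FIELD_650.match(line)
        if match is None:
            continue  # not a 650 field
        if match.group(2) != '0':
            continue  # second indicator is not 0: ignore
        # Assumption: the $0 authority URI is dropped, as in the example.
        headings.append(re.sub(r'\$0[^$]*', '', match.group(3)))
    return headings


def lcsh_cell(lines):
    """Return the JSON array for the output row, or 'NULL' if it is empty."""
    headings = lcsh_from_lines(lines)
    return json.dumps(headings) if headings else 'NULL'


sample = [
    r'=650 \0$aWomen$zAsia$xHistory.$0http://id.loc.gov/authorities/subjects/sh85147274',
    r'=650 \0$aFeminism$zAsia.$0http://id.loc.gov/authorities/subjects/sh85047741',
    r'=650 \4$aFeminism$zAsia.',
]
print(lcsh_from_lines(sample))
print(lcsh_cell([r'=650 \4$aFeminism$zAsia.']))
```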

9 of 14

R

R (programming language)

Popular language for statistical computing and creating graphical representations of data
Effective tool for cleaning "messy" data

RStudio

Open source software for interacting with R code in a graphical user interface (GUI)

10 of 14

R

Tidyverse

A collection of packages designed for common data science tasks

Regular Expressions (RegEx)

Useful for parsing complicated text strings and identifying patterns
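
To give a sense of the kind of pattern matching involved, here is a small sketch that splits a heading string into its subfields. It is written in Python to match the earlier demo rather than in R, and the names `SUBFIELD`, `split_heading`, and `main_term` are hypothetical.

```python
import re

# A subfield is '$', a one-character code, then text up to the next '$'.
SUBFIELD = re.compile(r'\$(\w)([^$]+)')


def split_heading(heading):
    """Return (code, value) pairs for each subfield in a heading string."""
    return [(code, value.strip()) for code, value in SUBFIELD.findall(heading)]


def main_term(heading):
    """Return the $a value without a trailing period, or None if absent."""
    for code, value in split_heading(heading):
        if code == 'a':
            return value.rstrip('.')
    return None


print(split_heading('$aWomen$zAsia$xSocial conditions.'))
print(main_term('$aFeminism$zAsia.'))
```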

11 of 14

R

Remove NULL entries for LCC and LCSH
Remove symbols from LCC

Clean

Format LCSH from JSON body to individual strings
Group/split LCSH by record

Organize

Identify records with selected subject headings
Create graphs to show significant relationships

Analyze
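
The three stages above were carried out in R; the sketch below mirrors them in Python only to make the steps concrete. Everything in it is an assumption for illustration: the sample rows, the choice of which characters count as "symbols" in the LCC, the `SELECTED` set standing in for the compiled heading list, and counting by the leading letters of the class number.

```python
import json
import re
from collections import Counter

# Made-up rows in the layout: record ID, LCC, year, LCSH as JSON (or 'NULL').
rows = [
    ['rec001', '[HQ1726 .W66]', '1998',
     json.dumps(['$aWomen$zAsia$xHistory.', '$aFeminism$zAsia.'])],
    ['rec002', 'NULL', '2001', json.dumps(['$aFeminism$zAsia.'])],
    ['rec003', 'E99.C5', '1985', 'NULL'],
    ['rec004', 'E98.W8', '2010', json.dumps(['$aIndigenous women$zNorth America.'])],
    ['rec005', 'HQ1236', '2015', json.dumps(['$aFeminism$zAsia.'])],
]
SELECTED = {'Women', 'Indigenous women'}  # hypothetical selected headings


def clean(rows):
    """Drop rows with a NULL LCC or LCSH and strip symbols from the LCC."""
    cleaned = []
    for rec_id, lcc, year, lcsh_json in rows:
        if lcc == 'NULL' or lcsh_json == 'NULL':
            continue
        # Assumption: keep only letters, digits, periods, and spaces.
        cleaned.append((rec_id, re.sub(r'[^A-Za-z0-9. ]', '', lcc).strip(), year, lcsh_json))
    return cleaned


def organize(cleaned):
    """Map each record ID to its LCC and its list of heading strings."""
    return {rec_id: (lcc, json.loads(lcsh_json)) for rec_id, lcc, _year, lcsh_json in cleaned}


def analyze(records, selected):
    """Count records per LCC class letters that carry a selected heading."""
    counts = Counter()
    for lcc, headings in records.values():
        terms = set()
        for heading in headings:
            match = re.match(r'\$a([^$]+)', heading)
            if match:
                terms.add(match.group(1).strip().rstrip('.'))
        if terms & selected:
            letters = re.match(r'[A-Za-z]+', lcc)
            counts[letters.group(0).upper() if letters else 'unknown'] += 1
    return counts


cleaned = clean(rows)
counts = analyze(organize(cleaned), SELECTED)
print(counts)
```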

Utility vs. Exploration

12 of 14

Sample Findings

13 of 14

Sample Findings

Co-occurrences of LCSH with "Indigenous peoples"-related LCSH

Co-occurrences of LCSH with "Women"-related LCSH
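
One simple way to count co-occurrences of this kind is sketched below. It is not the presenters' R code and does not reproduce their findings: the function name `cooccurrences` and the sample records are made up for illustration.

```python
from collections import Counter


def cooccurrences(records, target_terms):
    """Count headings that appear in the same record as any target term."""
    counts = Counter()
    for headings in records:
        unique = set(headings)
        if unique & target_terms:
            # Count every other heading on the record once.
            counts.update(unique - target_terms)
    return counts


# Made-up records, each a list of heading terms.
records = [
    ['Women', 'Feminism', 'Social conditions'],
    ['Women', 'Feminism'],
    ['Feminism', 'Asia'],
    ['Indigenous peoples', 'Land tenure'],
]
women = cooccurrences(records, {'Women'})
print(women.most_common())
```

Each heading is counted at most once per record, so a heading repeated within one record does not inflate the total.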

14 of 14

Questions?

B. Clark & C. Smith (2022) "Prioritizing the People: Developing a Method for Evaluating a Collection’s Description of Diverse Populations," Cataloging & Classification Quarterly, DOI: 10.1080/01639374.2022.2090042

GitHub repo: https://github.com/bpclark2/Core-Forum-2022