1 of 65

Workshopping Queeries: Linked Data Vocabularies and Ethical Cataloging

Clair Kronk — UC Department of Biomedical Informatics

Brian M. Watson — UBC School of Library, Archival, & Information Studies

2 of 65

Where are we going?

Kronk & Watson | @brimwats @ld4conference

1. Introduction & Intersextions.

2. Cataloging & Discontent

3. Linking what?

4. Homosaurus: A Dino Of A Project

5. Quee-rying Further: Ontology-Based Applications

6. The GSSO and the HomoIT

7. Natural Language Processing

3 of 65

2. CATALOGING & DISCONTENT

1. Introduction & Intersextions.

3. Linking what?

4. Homosaurus: A Dino Of A Project

5. Quee-rying Further: Ontology-Based Applications

6. The GSSO and the HomoIT

7. Natural Language Processing

4 of 65

Cataloging, Interrupted

[Catalogers] critically read subject headings for bias, arguing, often successfully, for changing subject headings to ameliorate bias and altering classification structures to “fix” the ideological stories told by the classification scheme.

Drabinski, Queering the Catalog

5 of 65

Representation is Fluid

When classifications are created, they inherently reflect the predominant biases of society. To categorize something is to define what it is not, yet what something is or is not is subject to change.

Vaughan, The Language of Cataloguing: Deconstructing and Decolonizing Systems of Organization in Libraries.

6 of 65

7 of 65

Rather than taking these identities as stable and fixed, queer theory sees these identities as shifting and contextual…

A queer approach to classification and cataloging suggests no easy solutions. In defining the problems [as] queer, the solutions themselves must be queer.

Drabinski, Queering the Catalog

[Diagram labels: Society; LCSH/DDC; Cataloger; Emergent Identities; Catalog Changes]

8 of 65

9 of 65

SOLUTIONS?

10 of 65

Solution 1: Drabinski

Proposal:

  • “A queer approach to the problem of library classification and cataloging demands that these… offensive subject divisions and subject language remain uncorrected… [and librarians] teach students how to read what they discover.”

Critique:

  • Moves the burden from systems to librarians, increasing librarians' workload, emotional labor, and more.
  • Would require a deeply radical change from the way things are, which is not necessarily a reason not to.
  • Making no changes means individuals still encounter deeply offensive language, which is likely to compound the issue.

Drabinski, Queering the Catalog

11 of 65

Solution 2: Adler, et al.

  • Proposal: Folksonomies
  • Represent concepts left out of controlled vocabularies
  • Democratic, allow for shifts and changes in vocabularies, and allow spaces for the “long tail.”
  • Library catalogs may be the perfect environment to introduce a “hybrid metadata ecology” combining controlled vocabularies, classifications, and folksonomies

Adler, Transcending Library Catalogs

12 of 65

13 of 65

!?!?!

14 of 65

Critique: Keilty, et al.

Tagging and folksonomies

  • are not free of [oppressive] forces.
  • are not entirely democratic, actually.
  • Terms used in non-normative sexual subcultures do not operate strictly in a top-down or bottom-up fashion.

Keilty, Sexual Boundaries and Subcultural Discipline

Keilty, Tagging and Sexual Boundaries

15 of 65

3. LINKING WHAT?

1. Introduction & Intersextions.

2. Cataloging & Discontent

4. Homosaurus: A Dino Of A Project

5. Quee-rying Further: Ontology-Based Applications

6. The GSSO and the HomoIT

7. Natural Language Processing

16 of 65

17 of 65

Subject Predicate Object

18 of 65

Brian’s

presenting at

a LD4 2020 Conference

19 of 65

Brian’s → Brian M. Watson, born May 25th, 1989, in Manchester, New Hampshire, unceded Abenaki and Pennacook land, identified by the United States Government

SSN ###-##-####

presenting at

a LD4 2020 Conference.

20 of 65

Brian’s → Brian M. Watson, born May 25th, 1989, in Manchester, New Hampshire, unceded Abenaki and Pennacook land, identified by the United States Government

SSN ###-##-####

presenting → appearing nervously & formally before other people in order to show them slides and talk to them

at a LD4 2020 Conference.

21 of 65

Brian’s → Brian M. Watson, born May 25th, 1989, in Manchester, New Hampshire, unceded Abenaki and Pennacook land, identified by the United States Government

SSN ###-##-####

presenting → appearing nervously & formally before other people in order to show them slides and talk to them

A DLBB. At the 2020 LD4 Conference on Linked Data in Libraries, which was supposed to take place at Texas A&M but, due to the novel coronavirus (COVID-19), is now taking place on the teleconferencing platform Zoom from 10 AM, identified on Twitter by the hashtag #ld4conference.

22 of 65

23 of 65

@brimwats

IsPresenting

@ld4conference #ld4conference

24 of 65

@brimwats

IsPresenting

@ld4conference #ld4conference

Subject Predicate Object
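
The subject, predicate, object pattern on this slide can be sketched in a few lines of Python. This is only an illustration: the `Triple` type and `describe` helper are hypothetical names, and plain strings stand in for the URIs a real linked data store would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One linked data statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Values copied from the slide; real systems would use full URIs here.
triple = Triple("@brimwats", "IsPresenting", "@ld4conference")


def describe(t: Triple) -> str:
    """Render a triple as a readable arrow expression."""
    return f"{t.subject} --{t.predicate}--> {t.obj}"


print(describe(triple))
```

Keeping the three parts as named fields makes it harder to mix up which string plays which role.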

25 of 65

4. HOMOSAURUS: A DINO OF A PROJECT

1. Introduction & Intersextions.

2. Cataloging & Discontent

3. Linking what?

5. Quee-rying Further: Ontology-Based Applications

6. The GSSO and the HomoIT

7. Natural Language Processing

26 of 65

How do we go

from this

27 of 65

How do we go

from this

to this?

28 of 65

29 of 65

30 of 65

Digital Transgender Archive

  • An online hub for digitized historical materials, born-digital materials, and information on archival holdings throughout the world.
  • “We treat transgender as a practice rather than an identity category in order to bring together a trans-historical and trans-cultural collection of materials related to trans-ing gender.”
  • Total: ~10,000 items up to the year 2000.

31 of 65

32 of 65

33 of 65

34 of 65

35 of 65

36 of 65

37 of 65

38 of 65

39 of 65

Ontology-Based Applications: Further Workshopping

Clair Kronk

University of Cincinnati

Department of Biomedical Informatics

40 of 65

5. MOVING FROM LINKED DATA TO ONTOLOGIES

1. Introduction & Intersextions.

2. Cataloging & Discontent

3. Linking what?

4. Homosaurus: A Dino Of A Project

6. The GSSO and the HomoIT

7. Natural Language Processing

41 of 65

Organizing Linked Data

  • What happens when Linked Data becomes Big Data?
  • How do we organize triples?
    • Formats like RDF, RIF, JSON-LD, and OWL
  • Leveraging computer readability and human readability
  • Structuring linked data: how do data relate to one another? What’s the “bigger” picture?
  • Why use ontologies to organize?
    • Easier to integrate “separate” linked data? Authors vs. creators
    • Hierarchical structures make searching large databases easy
    • Formats can be used on any computer, via any database platform, and any programming language
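
As one possible illustration of the formats listed above, the sketch below writes a single statement as a JSON-LD document and reads it back as a triple. The `example.org` URIs and the `isPresenting` term are placeholders invented for this example, and `to_triples` handles only this flat shape; it is not a general JSON-LD processor.

```python
import json

# A flat JSON-LD document. All URIs are hypothetical placeholders.
doc = {
    "@context": {
        # Map the short key to a full predicate URI; its values are URIs too.
        "isPresenting": {
            "@id": "https://example.org/vocab/isPresenting",
            "@type": "@id",
        },
    },
    "@id": "https://example.org/people/brimwats",
    "isPresenting": "https://example.org/events/ld4conference",
}


def to_triples(document):
    """Expand one flat document into (subject, predicate, object) tuples."""
    context = document["@context"]
    subject = document["@id"]
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        triples.append((subject, context[key]["@id"], value))
    return triples


print(json.dumps(doc, indent=2))  # human-readable and machine-readable
print(to_triples(doc))
```

Because the document is ordinary JSON, any language with a JSON parser can read it, which is the portability point made in the last bullet.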

42 of 65

  • Vocabulary
  • Controlled Vocabulary
  • Taxonomy
  • Thesaurus
  • Ontology

43 of 65

44 of 65

45 of 65

46 of 65

47 of 65

48 of 65

Why do we develop ontologies outside of organizing linked data?

  • To share common knowledge
  • To enable reuse of knowledge
  • To make domain assumptions explicit
  • To separate domain knowledge from operational knowledge
  • To analyze domain knowledge

49 of 65

Why an ontology specifically for gender, sex, and sexual orientation data?

  • Ontologies control the biomedical world and research
  • There are no standardized ways to capture sexual orientation and gender identity which integrate multiple perspectives
  • A controlled vocabulary can’t capture context or usage
  • Ontologies can be easily updated and expanded

50 of 65

6. THE GSSO & HOMOSAURUS

1. Introduction & Intersextions.

2. Cataloging & Discontent

3. Linking what?

4. Homosaurus: A Dino Of A Project

5. Quee-rying Further: Ontology-Based Applications

7. Natural Language Processing

51 of 65

The Gender, Sex, and Sexual Orientation (GSSO) Ontology

  • Version 2 released June 2020
  • Over 11,000 linked terms and over 14,000 external database mappings
  • Over 70% of terms unavailable in the over 800 ontologies in the NCBO BioPortal
  • Descriptors and definitions for over 200 slang terms, 190 pronouns, and 200 culturally-specific gender identities

52 of 65

The GSSO and the Homosaurus

  • Version 2 of the Homosaurus has a complete mapping to version 2 of the GSSO
  • Mappings are done manually to ensure compatibility
  • Indexing with Homosaurus means indexing with the GSSO as well! And all other connected ontologies!

intersex

Human Readable Text

http://homosaurus.org/v2/intersex

Computer Readable ID

53 of 65

The GSSO and the Homosaurus

intersex

Label

http://homosaurus.org/v2/intersex

Homosaurus ID

http://purl.bioontology.org/ontology/GSSO/000109

GSSO ID

ATC (Anatomical Therapeutic Chemical Classification)

BFO (Basic Formal Ontology)

ChEBI (Chemical Entities of Biological Interest)

DDC (Dewey Decimal Classification)

DO (Disease Ontology)

DSM (Diagnostic and Statistical Manual of Mental Disorders)

EFO (Experimental Factor Ontology)

FMA (Foundational Model of Anatomy)

GO (Gene Ontology)

HPO (Human Phenotype Ontology)

ICD (International Classification of Diseases)

LCC (Library of Congress Classification)

LCSH (Library of Congress Subject Headings)

MeSH (Medical Subject Headings)

NCIT (National Cancer Institute Thesaurus)

Wikipedia

… and more!
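
A minimal sketch of how one concept record might tie the label, Homosaurus ID, and GSSO ID together. The two URIs are the ones shown on the slide; the `Concept` class, the `external` field, and `gsso_for` are hypothetical names chosen for illustration, and the external mappings are left empty because the slide lists vocabularies, not identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str            # human-readable text
    homosaurus_id: str    # computer-readable Homosaurus URI
    gsso_id: str          # computer-readable GSSO URI
    # Mappings to other vocabularies (LCSH, MeSH, ...), keyed by abbreviation.
    # Left empty: the slide does not give the individual identifiers.
    external: dict = field(default_factory=dict)


intersex = Concept(
    label="intersex",
    homosaurus_id="http://homosaurus.org/v2/intersex",
    gsso_id="http://purl.bioontology.org/ontology/GSSO/000109",
)

# Index by Homosaurus ID so an item indexed with Homosaurus resolves to GSSO.
by_homosaurus = {intersex.homosaurus_id: intersex}


def gsso_for(homosaurus_id):
    """Return the mapped GSSO ID, or None when no mapping is known."""
    concept = by_homosaurus.get(homosaurus_id)
    return concept.gsso_id if concept else None


print(gsso_for("http://homosaurus.org/v2/intersex"))
```

A lookup like this is what lets a record indexed once with a Homosaurus term be found through the GSSO and, from there, through any vocabulary stored in `external`.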

54 of 65

55 of 65

7. NATURAL LANGUAGE PROCESSING

1. Introduction & Intersextions.

2. Cataloging & Discontent

3. Linking what?

4. Homosaurus: A Dino Of A Project

5. Quee-rying Further: Ontology-Based Applications

6. The GSSO and the HomoIT

56 of 65

Introduction to NLP: Why Use NLP?

  • Summarize long documents
  • Identify the author of an unknown document
  • Discover plagiarism
  • Examine sentiment trends
  • Correct spelling and grammar
  • Create chatbots
  • Autocomplete

57 of 65

Introduction to NLP: Modeling

VECTORIZER
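
The slide names only a "VECTORIZER" stage and does not say which kind was used. As an assumption for illustration, the sketch below shows the simplest common variant, a bag-of-words count vectorizer, written with the standard library; the function names and sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocabulary(docs):
    """Assign each distinct token a column index, in alphabetical order."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(doc, vocab):
    """Turn one document into a list of token counts, one slot per column."""
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


docs = ["Linked data links data", "Ontologies organize linked data"]
vocab = fit_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of numbers, which is the form a statistical model can consume.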

58 of 65

Automatically Annotating Gender, Sex, and Sexual Orientation Data

  • Named entity recognition (NER) using the GSSO
  • Pick a concept like “transgender”
  • Should it include results for “transsexual”? Should it include results for “transvestite”?
  • Smart date-specific searching and options to “map” forward or backward in time

Text

Parser

GSSO Searching w/ Selected Options

Annotations

Insights

synonyms

instances

locations

related terms

time-based info
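
The annotation step above can be sketched as a dictionary lookup with a user-selected option. This is not the GSSO tooling itself: `LEXICON` is a hypothetical two-level stand-in for the ontology, and filing "transsexual" and "transvestite" under a "related" key is an assumption made only so the option has something to toggle.

```python
import re

# Hypothetical miniature lexicon standing in for the ontology.
LEXICON = {
    "transgender": {
        "label": ["transgender"],
        # Grouping assumed for illustration; the slide leaves this open.
        "related": ["transsexual", "transvestite"],
    },
}


def annotate(text, concept, include_related=False):
    """Return (start, end, matched_text, concept) for each hit in the text."""
    entry = LEXICON[concept]
    terms = list(entry["label"])
    if include_related:
        terms += entry["related"]
    # Longest terms first so a longer term wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0), concept)
            for m in pattern.finditer(text)]


sample = "A transgender author wrote about the word transsexual."
print(annotate(sample, "transgender"))
print(annotate(sample, "transgender", include_related=True))
```

Leaving `include_related` off by default keeps the decision about which terms count with the person searching, which is the question the slide poses.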

59 of 65

Using Linked Data and Ontologies for Searching

  • Terms change over time and shift meaning linguistically
  • “sexual inversion” versus “gay” and “transgender”
  • Should terms appear or not?
    • Recency bias, etc.
  • Backward and forward approximate searching compatibilities

transsexual

transvestite

crossdresser

transgender

sexual invert

gay person
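
One way to picture backward and forward approximate searching is as a set of directed links between older and newer terms. The sketch below uses only the pairing named on the slide; treating it as a "forward" link, and the names `FORWARD`, `build_backward`, and `expand_query`, are assumptions for illustration. The links are approximate and say nothing about the terms being equivalent.

```python
# Directed links from an older term to later terms (approximate, not equal).
FORWARD = {
    "sexual inversion": ["gay", "transgender"],
}


def build_backward(forward):
    """Invert the forward links so newer terms point back to older ones."""
    backward = {}
    for old, news in forward.items():
        for new in news:
            backward.setdefault(new, []).append(old)
    return backward


BACKWARD = build_backward(FORWARD)


def expand_query(term, direction=None):
    """Return the search terms to use; expansion happens only on request."""
    terms = [term]
    if direction == "forward":
        terms += FORWARD.get(term, [])
    elif direction == "backward":
        terms += BACKWARD.get(term, [])
    return terms


print(expand_query("sexual inversion", "forward"))
print(expand_query("gay", "backward"))
```

Expansion is opt-in here because whether a historical or contemporary term should appear in results is exactly the judgment the slide leaves open.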

60 of 65

A Case Study: Digitalizing Archival Collections

  • Make sure PDF is searchable or source has text data
  • Load file into program
  • Automatically create an annotation report
  • Helps make archival collections more accessible for larger research projects, systematic reviews, literature reviews, etc.
  • GSSO annotations take seconds to produce
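
The workflow above might look roughly like the following sketch. It assumes the text has already been extracted from the PDFs, uses made-up file names and sample text, and substitutes a plain term list for the GSSO; it is not the program described in the talk.

```python
import re
from collections import Counter


def annotation_report(documents, terms):
    """Count term hits per document.

    documents maps a file name to its already-extracted text.
    """
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return {
        name: Counter(m.group(0).lower() for m in pattern.finditer(text))
        for name, text in documents.items()
    }


def format_report(report):
    """Render the counts as tab-separated lines: file, term, hits."""
    lines = []
    for name in sorted(report):
        for term, n in sorted(report[name].items()):
            lines.append(f"{name}\t{term}\t{n}")
    return "\n".join(lines)


# Hypothetical file names and text, for illustration only.
documents = {
    "newsletter_1.txt": "Transgender history. A transgender archive.",
    "newsletter_2.txt": "No matching terms here.",
}
report = annotation_report(documents, ["transgender", "intersex"])
print(format_report(report))
```

A tab-separated report like this can be opened in a spreadsheet, which suits the systematic and literature reviews mentioned above.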

61 of 65

A Case Study: Digitalizing Archival Collections

62 of 65

A Case Study: Digitalizing Archival Collections

63 of 65

Using Linked Data and Ontologies to Provide Insights

  • How is a particular group represented over time?
  • Could the current identification method be biased?
  • Can we construct more complete timelines, chronologies, or bibliographies?
  • How can we more easily access historical and contemporary resources?

64 of 65

A Case Study: The Electronic Health Record (EHR)

  • Gender, sex, and sexual orientation data is important to collect in the medical field
  • Medical professionals rarely collect it
  • However, it may be indicated in notes
  • The current gold standard for identification is ICD codes
  • In one dataset of emergency notes:
    • GSSO found 100% of transgender patients; ICD codes found 46%
    • Doctors indicated the correct assigned sex at birth 54% of the time; correct pronouns were used 38% of the time
    • Diagnosis-based statistics calculated via ICD are therefore extremely unreliable

65 of 65

CONCLUSION & QUESTIONS TIME