1 of 19

CCDH Pilot Demonstration

CRDC All Hands

2021-09-20

These slides: bit.ly/ccdh-pilot

CENTER for CANCER DATA HARMONIZATION

2 of 19

Outline

01 GOALS

Highlight the value of harmonization and demonstrate the use of CRDCH.

02 PROCESS

Harmonize data models and value sets for demonstrator queries.

03 OUTCOMES

Exemplar data package, workflow diagram, and documentation.

3 of 19

The role of CCDH in the wider CRDC environment

4 of 19

Understanding the Problem

Each node models things differently

For example:

  • No direct link from Sample-to-Diagnosis in one model
  • Need to “remodel” Sample-to-Case and Diagnosis-to-Case to align with Sample-to-Diagnosis

Each node uses different values

For example, one node encodes race like this:

  • not reported
  • white
  • american indian or alaska native
  • black or african american

While another does it like this:

  • not allowed to collect
  • unknown
  • white
  • native hawaiian or other pacific islander
  • american indian or alaska native
  • asian
  • other
  • black or african american

The DST metadata WG aimed to identify the most needed elements for harmonization.
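A minimal sketch of the value mismatch described on this slide, using only the race values listed above. The names `node_a` and `node_b` are placeholders, since the slide does not say which nodes the two lists come from.

```python
# Race values exactly as listed on the slide; node_a / node_b are placeholder names.
node_a = {
    "not reported",
    "white",
    "american indian or alaska native",
    "black or african american",
}
node_b = {
    "not allowed to collect",
    "unknown",
    "white",
    "native hawaiian or other pacific islander",
    "american indian or alaska native",
    "asian",
    "other",
    "black or african american",
}

shared = sorted(node_a & node_b)  # values both nodes can express as-is
only_a = sorted(node_a - node_b)  # values that need a mapping decision
only_b = sorted(node_b - node_a)

print("shared:", shared)
print("only in node_a:", only_a)
print("only in node_b:", only_b)
```

Only three of the values line up directly. The remaining ones (for example "not reported" versus "unknown" or "not allowed to collect") are the cases where a harmonized value set has to make an explicit mapping decision.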

5 of 19

PILOT DEMONSTRATION

Illustrate what CRDCH-based harmonization looks like and the value it affords, by providing a set of concrete and well-documented examples of harmonized data records, framed around two demonstrator use cases.

Goals

6 of 19

Pilot Workflow

STEP 1

Define demonstrator use cases and queries

STEP 2

Select 'high value' elements to include in the data examples

STEP 3

Identify source data from different CRDC Nodes that meet query criteria

STEP 4

Manually transform/harmonize source data, and validate against CRDCH JSON Schema

STEP 5

Document common modeling and transform patterns in the data examples

7 of 19

Step 1: Define Demonstrator Use Cases

STEP 1

D1: Cross-Node Specimen Data Retrieval

Show harmonized data records returned by a specimen query across three CRDC Nodes (GDC, PDC, ICDC)

Use Case: Researcher looking for specimens to compare molecular characteristics of primary Osteosarcoma tumors vs. lung metastases.

Query: "Find frozen Osteosarcoma tumor specimens >30 mg that are lung metastases from patients with Stage 4 disease"

D2: Aggregation of Case Data Across Nodes

Show data for the same case across three CRDC Nodes (GDC, PDC, IDC) merged into a single harmonized record

Use Case: Researcher looking to assemble a cohort for a retrospective analysis, using Case data distributed across different systems.

Query: "Find patients in the TCGA-OV project who are 50 years of age or older and who have Stage IIIC ovarian cancer"

8 of 19

Step 2: Select 'High Value' Elements to Include

STEP 2

Prioritize data elements that:

  • Are directly targeted by the demonstrator queries (or of interest for the analysis use case)
  • Were included in the DST Report
  • Illustrate common CRDC-H modeling features and transformation patterns

Included Elements from GDC.Sample

Data Element

Query Target

DST Report

Illustrative or of Interest

sample_id

x

sample_submitter_id

x

project_id

x

case_id

x

case_submitter_id

x

sample_type

x

tissue_type

x

tumor_code

x

x

tumor_descriptor

x

current_weight

x

days_to_collection

x*

initial_weight

x

biospecimen_anatomic_site

x

x

time_between_excision_and_freezing

x

preservation_method

x

freezing_method

x

* Asterisk indicates that this element is not in the current DST report, but will be included once mappings to node attributes are updated

9 of 19

Step 3: Identify or Generate Source Data Records

STEP 3

Different approaches used for the two Demonstrators:

  • Demonstrator 1: Generated aggregate Specimen records from GDC, PDC, and ICDC data (assembled real values from different records into an aggregate record)
  • Demonstrator 2: Identified existing Case records in GDC, PDC, and IDC data that represent the same real-world patient/subject

GDC.Sample

PDC.Sample

ICDC.Sample

Demonstrator 1 ‘Aggregate’ Sample Records from GDC, PDC, and ICDC. Full source data documented here.
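One way to picture the Demonstrator 1 approach is a merge that takes, for each field, the first non-empty value found across several source records. This is only a sketch: the helper name `aggregate` and the sample values are made up for illustration, and only the field names come from the GDC.Sample element list.

```python
from typing import Any, Dict, List, Optional

def aggregate(records: List[Dict[str, Optional[Any]]]) -> Dict[str, Any]:
    """Build one record by taking, per field, the first non-empty value found."""
    merged: Dict[str, Any] = {}
    for record in records:
        for field, value in record.items():
            if value is not None and field not in merged:
                merged[field] = value
    return merged

# Illustrative values only; these are not actual GDC records.
partial_records = [
    {"tumor_descriptor": "Metastatic", "current_weight": None, "preservation_method": None},
    {"tumor_descriptor": None, "current_weight": 45, "preservation_method": "Frozen"},
]

aggregate_record = aggregate(partial_records)
print(aggregate_record)
```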

10 of 19

Step 4: Transform / Harmonize Source Data

STEP 4

Manually generated YAML (with JSON translation available)

  • A schema-aware editor provided real-time validation
  • Inline comments explain key features of the model
  • Examples accessible for review and feedback here
  • Hand-rolled examples will provide a gold standard for testing automated transform tooling

Excerpt from the harmonized GDC Specimen example showing how specimen weight is represented
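As a rough sketch of how a hand-rolled example could serve as a gold standard for transform tooling, the function below wraps a GDC `current_weight` value in the quantity_measure structure used in the harmonized GDC Specimen example and compares the result to a hand-written entry. The names `concept` and `harmonize_weight` are hypothetical, and treating the source weight as milligrams is an assumption; the codes and labels are the ones used in the harmonized example.

```python
NCIT = "http://ncithesaurus.nci.nih.gov"

def concept(code: str, label: str) -> dict:
    """Build a CodeableConcept-shaped dict holding a single NCIt coding."""
    return {"coding": [{"code": code, "label": label, "system": NCIT}]}

def harmonize_weight(current_weight: float) -> dict:
    """Wrap a weight (assumed to be in milligrams) as one quantity_measure entry."""
    return {
        "observation_type": concept("C25208", "Weight"),
        "value_quantity": {
            "value_decimal": current_weight,
            "unit": concept("C28253", "Milligram"),
        },
    }

# Hand-written "gold standard" entry, written out in full.
gold = {
    "observation_type": {
        "coding": [{"code": "C25208", "label": "Weight", "system": NCIT}]
    },
    "value_quantity": {
        "value_decimal": 45,
        "unit": {
            "coding": [{"code": "C28253", "label": "Milligram", "system": NCIT}]
        },
    },
}

# The automated output should reproduce the hand-written record exactly.
assert harmonize_weight(45) == gold
```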

11 of 19

Step 5: Document Modeling and Transform Patterns

STEP 5

Element-by-element annotation of each transformation performed on source data

  • Transforms of two elements from a GDC.Sample record are shown above. Complete documentation of modeling and transform patterns is available here.

12 of 19

Harmonization Allows Cross-Node Querying

"Find frozen Osteosarcoma specimens >30 mg that are lung metastases from patients with Stage 4 disease"

Source GDC Data

  Attribute | Value
  tumor_code | Osteosarcoma (OS)
  tumor_descriptor | Metastatic
  current_weight | 45
  biospecimen_anatomic_site | Lung
  preservation_method | Frozen
  ajcc_pathologic_stage | Stage IVA

Source ICDC Data

  Attribute | Value
  specific_sample_pathology | Osteosarcoma
  tumor_sample_origin | Metastatic
  n/a (not captured) | n/a
  sample_site | Lung
  sample_preservation | Snap Frozen
  stage_of_disease | IV

CRDCH Data

  Attribute | GDC | ICDC
  Specimen.specific_sample_pathology | C9145 (Osteosarcoma) | C9145 (Osteosarcoma)
  Specimen.tumor_status_at_collection | C3261 (Metastatic) | C3261 (Metastatic)
  Specimen.quantity_measure | 45 milligrams | n/a
  SpecimenCreationActivity.collection_site | C12468 (Lung) | C12468 (Lung)
  SpecimenCreationActivity.activity_type | C178955 (Freezing) | C63521 (Quick Freeze)
  Diagnosis.stage | C27979 (Stage IVA) | C27971 (Stage IV)

Value Harmonization:

  • Attributes and data structure are standardized in the CRDC-H model (e.g., tumor_descriptor, tumor_sample_origin -> tumor_status_at_collection)
  • Coded values are harmonized through mapping to ontology/terminology-based value sets (e.g., C9145 - Osteosarcoma)
  • Hierarchical relationships in source ontologies/terminologies allow harmonization/retrieval over values specified at different granularity (e.g., Freezing > Quick Freeze, Stage IV > Stage IVA)
  • Quantitative values are standardized with an explicit declaration of units (e.g., 45 milligrams)
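The hierarchy point can be illustrated with a small sketch. It assumes a child-to-parent lookup limited to the two relationships named above (Quick Freeze under Freezing, Stage IVA under Stage IV); a real system would read the hierarchy from the terminology itself. The `PARENT` table and the `is_a` helper are hypothetical names, and labels are used in place of codes to keep the example short.

```python
from typing import Dict, Optional

# Child -> parent links, limited to the two examples named above.
PARENT: Dict[str, str] = {
    "Quick Freeze": "Freezing",
    "Stage IVA": "Stage IV",
}

def is_a(term: str, ancestor: str) -> bool:
    """True if term equals ancestor or sits below it in the hierarchy."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False

# Harmonized values for the GDC and ICDC specimens in the CRDCH Data table.
specimens = {
    "GDC": {"method": "Freezing", "stage": "Stage IVA"},
    "ICDC": {"method": "Quick Freeze", "stage": "Stage IV"},
}

# "Frozen ... Stage 4": match the query terms or anything more specific.
hits = [
    node
    for node, s in specimens.items()
    if is_a(s["method"], "Freezing") and is_a(s["stage"], "Stage IV")
]
print(hits)
```

Both specimens are returned even though each node recorded one of the two values at a finer granularity than the other.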

13 of 19

Harmonization Allows Cross-Node Querying

"Find frozen Osteosarcoma specimens >30 mg that are lung metastases from patients with Stage 4 disease"

Harmonized GDC Sample Record

# type = Specimen
id: "f2f:05f1574e-2a28-50bc-bdc1-e4c6dee92fd1"
specific_tissue_pathology:
  coding:
    - code: C9145
      label: Osteosarcoma
      system: http://ncithesaurus.nci.nih.gov
tumor_status_at_collection: # type = CodeableConcept
  coding: # type = Coding
    - code: C3261
      label: Metastatic Neoplasm
      system: http://ncithesaurus.nci.nih.gov
quantity_measure: # type = QuantityMeasureObservation
  - observation_type: # type = CodeableConcept
      coding: # type = Coding
        - code: C25208
          label: Weight
          system: http://ncithesaurus.nci.nih.gov
    value_quantity: # type = Quantity
      value_decimal: 45
      unit: # type = CodeableConcept
        coding: # type = Coding
          - code: C28253
            label: Milligram
            system: http://ncithesaurus.nci.nih.gov
creation_activity: # type = SpecimenCreationActivity
  collection_site: # type = BodySite
    site: # type = CodeableConcept
      coding: # type = Coding
        - code: C12468
          label: Lung
          system: http://ncithesaurus.nci.nih.gov
processing_activity: # type = SpecimenProcessingActivity
  - activity_type: # type = CodeableConcept
      coding: # type = Coding
        - code: C48160
          label: Freezing
          system: http://ncithesaurus.nci.nih.gov

Harmonized ICDC Sample Record

# type = Specimen
id: "f2f:ecc2bf49-204d-11e9-b7f8-0a80fada099c"
specific_tissue_pathology:
  coding:
    - code: C9145
      label: Osteosarcoma
      system: http://ncithesaurus.nci.nih.gov
tumor_status_at_collection: # type = CodeableConcept
  coding: # type = Coding
    - code: C3261
      label: Metastatic Neoplasm
      system: http://ncithesaurus.nci.nih.gov
# Sample weight is not captured by ICDC
creation_activity: # type = SpecimenCreationActivity
  collection_site: # type = BodySite
    site: # type = CodeableConcept
      coding: # type = Coding
        - code: C12468
          label: Lung
          system: http://ncithesaurus.nci.nih.gov
processing_activity: # type = SpecimenProcessingActivity
  - activity_type: # type = CodeableConcept
      coding: # type = Coding
        - code: C63521
          label: Quick Freeze
          system: http://ncithesaurus.nci.nih.gov
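To show how the ">30 mg" part of the query might run against these two records, here is a small sketch that walks the quantity_measure structure. The records are trimmed copies of the examples above, written as Python dictionaries; the helper name `weight_in_mg` and the three-way match / no match / unknown outcome are assumptions for illustration, not part of the CRDCH model.

```python
from typing import Optional

def weight_in_mg(specimen: dict) -> Optional[float]:
    """Return the specimen weight in milligrams, or None if it was not recorded."""
    for obs in specimen.get("quantity_measure", []):
        kind = obs["observation_type"]["coding"][0]["code"]
        unit = obs["value_quantity"]["unit"]["coding"][0]["code"]
        # C25208 = Weight, C28253 = Milligram (codes as used in the records above)
        if kind == "C25208" and unit == "C28253":
            return obs["value_quantity"]["value_decimal"]
    return None

# Trimmed copies of the two harmonized records (only the fields used here).
gdc = {
    "id": "f2f:05f1574e-2a28-50bc-bdc1-e4c6dee92fd1",
    "quantity_measure": [
        {
            "observation_type": {"coding": [{"code": "C25208", "label": "Weight"}]},
            "value_quantity": {
                "value_decimal": 45,
                "unit": {"coding": [{"code": "C28253", "label": "Milligram"}]},
            },
        }
    ],
}
icdc = {"id": "f2f:ecc2bf49-204d-11e9-b7f8-0a80fada099c"}  # weight not captured

results = {}
for name, record in [("GDC", gdc), ("ICDC", icdc)]:
    mg = weight_in_mg(record)
    results[name] = "unknown" if mg is None else ("match" if mg > 30 else "no match")
print(results)
```

Keeping "unknown" separate from "no match" makes it visible that the ICDC specimen cannot be evaluated on weight at all, rather than silently excluding it.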

Structural Harmonization:

14 of 19

Canonical Terminology Recommendations

  • Our criteria for target terminologies used for harmonization included:
    • Standards-based
    • Accessible to the research community
    • Well-maintained

  • In some cases, we are putting forward multiple alternative recommendations:
    • Clinical standards used throughout the cancer community for interoperability
    • More computable and granular ontologies that are used more commonly in basic research and that may be more suitable for analytics

  • We prioritized:
    • Rich content
    • A computable set of identifiers as a minimal criterion
    • A robust user community

  • We anticipate community engagement to achieve consensus

15 of 19

Terminological Recommendations

Demonstrator 1

  CRDCH Field Name | Recommended Terminologies
  Specimen.general_tissue_morphology | NCIT
  Specimen.source_material_type | Mixed bag of concepts, recommend refactor
  Specimen.specific_tissue_morphology | ICD-O-3, MONDO (NCIT). Likely needs further requirements analysis
  Specimen.tumor_status_at_collection | NCIT
  SpecimenCreationActivity.collection_site[BodySite.site] | Uberon, SNOMED
  SpecimenProcessingActivity.method_type | Ontology for Biomedical Investigations (OBI), NCIT
  Diagnosis.condition | ICD-10-CM, MONDO (NCIT). Likely needs further requirements analysis
  Diagnosis.stage[CancerStageObservationSet.observation[CancerStageObservation.valueEnum]] | Harmonize to AJCC, use NCIT for AJCC stage terms
  Diagnosis.stage[CancerStageObservationSet.observation[CancerStageObservation.method_type]] | NCIT
  Diagnosis.primary_site | Uberon, SNOMED
  Quantity.unit (used in many CRDC-H elements) | UOM (Units of Measurement Ontology)

Demonstrator 2

  CRDCH Field Name | Recommended Terminologies
  Subject.sex | HL7 Gender Harmony Project
  Subject.race | OMB / USCDI
  Subject.vital_status | LOINC
  Subject.ethnicity | OMB / USCDI
  Diagnosis.morphology | ICD-O-3, MONDO (NCIT). Likely needs further requirements analysis
  Diagnosis.primary_site | Uberon
  Diagnosis.stage[CancerStageObservationSet.observation[CancerStageObservation.valueEnum]] | Harmonize to AJCC, use NCIT codes
  Diagnosis.condition | ICD-10-CM, MONDO (NCIT). Likely needs further requirements analysis

16 of 19

Lessons Learned

Data Examples

Terminologies

  • Overall data sparseness presented challenges in locating queries and populating example data that would effectively exercise the CRDCH model
  • Inconsistent representation of fundamental entities like Program and Project (e.g., TCGA, CPTAC) in different data repositories presents harmonization challenges
  • Establishing CRDC requirements for data provenance is important to help guide modeling work
  • Binding of a field (attribute) to a terminology (ontology) requires domain expertise to make fit-for-purpose recommendations
  • Requirements for how the harmonized data will be used are needed to make fit-for-purpose terminology recommendations based on computability or interoperability considerations
  • When there are no terminological bindings on an attribute (e.g., they are strings), permissible values are likely to be inconsistent and require a lot of manual clean-up
  • While informative, this exercise was performed prior to the anticipated development of CRDCH and as such, not all the modeling and strategies for harmonization were in place

17 of 19

Discussion Questions for Breakouts

  • What are the most salient requirements across the CRDC for using harmonized data for search and analytics?
  • Where will harmonized data be housed and who will generate it?
  • What terminologies would be most useful in different search and analysis contexts?
  • What specific attributes of data transformation and validation tools would be most useful to the CRDC community?

18 of 19

Pilot Deliverables

  1. Source Data Examples: Simple spreadsheet-based representations for easy human viewing and comparison (link)
  2. Harmonized Data Examples: YAML representations of Specimen, Subject, Diagnosis, and Research Project records created for Demonstrators 1 and 2 (link)
  3. Element Transform Documentation: Describes common data transform patterns and features of the CRDC-H model as implemented in the data examples (link)
  4. Recommendations for Canonical Terminologies: Candidates for harmonization of value sets bound to elements included in the Demonstrators (link)
  5. Harmonized Value Set Example: LinkML specification of a harmonized value set / enumeration (link)
  6. Current version of the CRDCH Schema (v1.1), against which the demonstrator data examples were validated (link, link)
  7. Retrospective: Lessons learned during the development of this pilot (link; 9/24)

19 of 19

THANKS

CRDC Data Nodes and Collaborators: CDA, CDS, ICDC, IDC, GDC, PDC, DCF.

Center for Biomedical Informatics & Information Technology: Gilberto Fragoso, Robinette Renner, Sherri de Coronado.

Frederick National Laboratory for Cancer Research: Todd Pihl, Mark Jensen, Resham Kulkarni.

Samvit Solutions: Smita Hastak, Wendy Ver Hoef.

CENTER for CANCER DATA HARMONIZATION

CCDH Team:

JHU: Christopher Chute (PI), Davera Gabriel, Dazhi Jiao, Harold Solbrig, Joe Flack, Tricia Francis

LBL: Chris Mungall (PI), Mark Miller, Nomi Harris, Sujay Patil, William Duncan

RENCI: James Balhoff (PI), Gaurav Vaidya

U Chicago: Samuel Volchenboum (PI), Brian Furner, Debra Venckus, Jooho Lee, Kathryn Blumhardt

U Colorado: Melissa Haendel (PI), Anne Thessen, Julie McMurry, Matthew Brush, Monica Munoz-Torres, Nicole Vasilevsky, Shahim Essaid
