This is the first draft of the Hodgkin Lymphoma (HL) data dictionary. It was created using the CRFs for AHOD0831, AHOD0431, AHOD0031, AHOD1331, and AHOD1221, as well as CRFs from St. Jude's studies. All decisions made in creating this draft are preliminary, flexible, and subject to change based on the needs of the HL investigators. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
To create this draft, the elements of the CRFs were transferred into an Excel sheet and aligned so that similar elements sat next to each other. The elements were then compared to identify areas of similarity and difference between the CRFs. Elements that either spanned all CRFs or seemed feasible to report in a common data dictionary were selected for inclusion and transferred into a draft data dictionary, using a structure and format that aligns with the PCDC data model. Elements that were not selected were usually omitted because the information they contain is captured elsewhere.
The PCDC used the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) and the Minimal Common Oncology Data Elements (mCODE) model as frameworks to build the PCDC data model. The model is organized into domains (i.e. data tables) composed of variables, each of which is defined by a name, data type, description, and permissible values. All variables and permissible values were standardized using National Cancer Institute (NCI) terminologies to ensure data replicability and interoperability. This work was overseen by an international group of pediatric oncologists, statisticians, and data standards experts.
Tier 1: Essential
Tier 2: Good to have
Tier 3: Nice to have, but not necessary
Tier 4: If we have it, we’ll keep it/Defer
Tier 5: Don’t care
In alignment with the SDTM, the data dictionary is structured into tables, where each table captures a “concept.” A concept could be as specific as “Labs” or as broad as “Disease Characteristics.” This format allows the data dictionary to capture concepts that are longitudinal (i.e. that occur many times). For example, the “Labs” table captures every time a lab test, such as RBC, is reported, and the “Vitals” table captures every time a vitals measurement is reported.
A concise data structure was chosen for the tables to ensure the model is flexible enough to represent additional data sources. For example, “findings”-type data (e.g., labs, vital signs, EKG, echocardiogram results) use a test/result structure, with supplemental metadata represented in additional variables in each domain. This means that instead of having a variable for every lab test, the dictionary has one variable whose permissible values include all lab terms and another variable that holds the lab test result. This structure makes it easy to incorporate additional lab terms that retrospective or future trials might include.
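The test/result structure can be illustrated with a minimal Python sketch. The field names (subject_id, age_at_lab, lab_test, lab_result) and all values are hypothetical and chosen only for illustration; they are not actual PCDC variable names.

```python
from dataclasses import dataclass


@dataclass
class LabRecord:
    subject_id: str    # hypothetical identifier
    age_at_lab: int    # hypothetical: age when the lab was reported
    lab_test: str      # one of the permissible lab terms
    lab_result: float  # result for that test


# Wide layout: one column per lab test (the layout the dictionary avoids).
wide_row = {"subject_id": "S1", "age_at_lab": 4000, "RBC": 4.5, "WBC": 6.1}


def to_long(row, id_fields=("subject_id", "age_at_lab")):
    """Turn one wide row into one LabRecord per lab test."""
    ids = {k: row[k] for k in id_fields}
    return [
        LabRecord(lab_test=name, lab_result=value, **ids)
        for name, value in row.items()
        if name not in id_fields
    ]


long_rows = to_long(wide_row)
```

Adding a new lab test in this layout means adding a new permissible value for lab_test, not a new column.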
The data dictionary format for reporting longitudinal data utilizes an “Age at X” structure. This should be thought of as equivalent to a “Date at X” structure, with the only difference being that it allows for the data to be anonymized according to HIPAA policies. This transformation from “Date at X” to “Age at X” can be done simply in a statistical software package, after data has already been collected in the usual way on the CRFs.
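As a rough sketch of the “Date at X” to “Age at X” transformation, the Python below subtracts a date of birth from an event date. Expressing age in days is an assumption made for this example, and the function name and dates are made up.

```python
from datetime import date


def age_in_days(birth_date: date, event_date: date) -> int:
    """Return the age in days at an event, so the date itself need not be shared."""
    return (event_date - birth_date).days


# Made-up dates for illustration only.
birth = date(2010, 1, 1)
diagnosis = date(2020, 1, 1)
age_at_diagnosis = age_in_days(birth, diagnosis)
```

The same calculation can be applied to every date field collected on the CRFs before the data are shared.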
The data dictionary links “concepts” to the age (i.e. the date) at which they occurred whenever possible. This is preferred to linking them to the Reporting Period (i.e. Disease Phase or Course) in which they occurred, because different contributor groups define Reporting Periods differently and these definitions are not standardized. This is especially true of Reporting Period definitions across different countries. Ages (i.e. dates), by contrast, allow “concepts” to be linked to the time they occurred in a standardized way. When actual ages (i.e. dates) are not available for a specific concept, the PCDC has decided not to impute them, so the age will be missing and the relevant Reporting Period can be used instead.
Different “concepts” in different tables can be linked by the ages (i.e. dates) at which they occurred. For example, if you know the age (i.e. date) at which a response assessment was made, you can find the age (i.e. date) of the nearest MRD assessment in order to associate the results of that MRD assessment with the response assessment. A similar process can be used if only the Disease Phase or Course is available.
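One way to picture this nearest-age linkage is the Python sketch below. The record layout, field names, and values are hypothetical, and the records are assumed to have been filtered to a single subject already.

```python
def nearest_by_age(target_age, records, age_key="age_at_assessment"):
    """Return the record whose age is closest to target_age, or None if none qualify."""
    # Records with a missing age cannot be linked by age, so skip them.
    candidates = [r for r in records if r.get(age_key) is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda r: abs(r[age_key] - target_age))


# Made-up records for one subject.
mrd_assessments = [
    {"age_at_assessment": 4010, "result": "Negative"},
    {"age_at_assessment": 4100, "result": "Positive"},
    {"age_at_assessment": None, "result": "Negative"},
]
response = {"age_at_assessment": 4020, "response": "Complete Response"}

linked = nearest_by_age(response["age_at_assessment"], mrd_assessments)
```

In practice an analyst might also cap the allowed gap between the two ages; that threshold would be a study-specific choice and is not defined here.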
The PCDC data dictionary has developed the concept of "Disease Phase" to streamline data representation. The Disease Phases include Initial Diagnosis, Relapse, Progression, Refractory, etc. These Disease Phases are codified in a "Disease Phase Timing" table, which indicates the start and end "Age" (i.e. date) of each Disease Phase. For example, the "Age at Disease Phase" for the Initial Diagnosis Disease Phase captures the age at diagnosis, and the "Age at Disease Phase" for the Relapse 1 Disease Phase captures the age at the first relapse. There is also a variable to indicate the ordinal numbering of the Disease Phase within its subgroup (e.g. Relapse 1, Relapse 2, Relapse 3, and Progression 1, Progression 2, Progression 3).
The PCDC data dictionary uses a similar strategy to report the timing of protocol treatment "Courses." The "Course Timing" table includes a variable with options for each course (e.g. Induction, Intensification, Consolidation), a variable to indicate the ordinal numbering of courses within a subgroup (e.g. Induction 1, Induction 2), and the start and end "Age" of each course. When applicable, the start and end ages of each cycle within a course are included.
The data dictionary uses the "Disease Phase" concept to frame the timing of additional elements. For example, instead of having separate variables for “Staging at Initial Diagnosis,” “Staging at Relapse 1,” and “Staging at Relapse 2,” the dictionary has one table, called “Disease Phase Timing.” This table has variables that indicate the “Disease Phase” and “Disease Phase Number.” This structure reduces the number of redundant variables needed to report similar concepts, allowing for more compact storage of data. The "Disease Phase" and “Disease Phase Number” variables can then be linked to other tables that contain these variables to indicate the timing of the relevant disease phase.
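The linkage through “Disease Phase” and “Disease Phase Number” can be sketched in Python as follows. The staging rows, the field names, the stage values, and the choice to number Initial Diagnosis as 1 are all assumptions made for illustration.

```python
# Made-up timing rows: one per disease phase per subject.
disease_phase_timing = [
    {"subject_id": "S1", "disease_phase": "Initial Diagnosis",
     "disease_phase_number": 1, "age_at_disease_phase": 4000},
    {"subject_id": "S1", "disease_phase": "Relapse",
     "disease_phase_number": 1, "age_at_disease_phase": 4700},
]

# Made-up staging rows that carry the same phase variables.
staging = [
    {"subject_id": "S1", "disease_phase": "Initial Diagnosis",
     "disease_phase_number": 1, "stage": "IIA"},
    {"subject_id": "S1", "disease_phase": "Relapse",
     "disease_phase_number": 1, "stage": "IIIB"},
]

KEY = ("subject_id", "disease_phase", "disease_phase_number")


def add_phase_age(rows, timing):
    """Attach the disease phase age to each row by matching on the shared key."""
    lookup = {tuple(t[k] for k in KEY): t["age_at_disease_phase"] for t in timing}
    return [
        {**r, "age_at_disease_phase": lookup.get(tuple(r[k] for k in KEY))}
        for r in rows
    ]


staged = add_phase_age(staging, disease_phase_timing)
```

Rows with no matching phase keep a missing age (None), which mirrors the decision not to impute dates.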
On a similar note, the "Course Timing" table can also be used to frame the timing of additional elements. For instance, if you want to know a subject's weight during Induction 1, you can look up the start and end ages of Induction 1 in the "Course Timing" table and then select the "Vitals" records whose ages fall within that range.
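A minimal Python sketch of this lookup is shown here. The field names (age_at_course_start, age_at_vitals, and so on) and all values are hypothetical, and treating the course window as inclusive at both ends is an assumption.

```python
# Made-up course windows for one subject.
course_timing = [
    {"subject_id": "S1", "course": "Induction", "course_number": 1,
     "age_at_course_start": 4000, "age_at_course_end": 4028},
    {"subject_id": "S1", "course": "Induction", "course_number": 2,
     "age_at_course_start": 4029, "age_at_course_end": 4056},
]

# Made-up vitals in the test/result layout.
vitals = [
    {"subject_id": "S1", "age_at_vitals": 4005, "vital": "Weight", "result": 41.0},
    {"subject_id": "S1", "age_at_vitals": 4040, "vital": "Weight", "result": 40.2},
]


def vitals_in_course(vitals, course_timing, course, course_number):
    """Return vitals rows whose age falls inside the named course window (inclusive)."""
    matches = []
    for c in course_timing:
        if c["course"] != course or c["course_number"] != course_number:
            continue
        for v in vitals:
            same_subject = v["subject_id"] == c["subject_id"]
            in_window = (
                c["age_at_course_start"] <= v["age_at_vitals"] <= c["age_at_course_end"]
            )
            if same_subject and in_window:
                matches.append(v)
    return matches


induction_1_weights = vitals_in_course(vitals, course_timing, "Induction", 1)
```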
The PCDC has decided that it is not currently within the scope of the data dictionary to determine how certain elements should be defined. For example, the data dictionary does not define what a “complete response” is. Instead, the data dictionary reports data as they are reported to the PCDC. It also reports the raw data that were used to make the reported determination. Using AML as an example, the data dictionary reports MRD results, which are used to make response assessments. To continue the example, if a subject is listed as having a “complete response” in the data commons, it only means that the contributing group reported them as having a complete response. A researcher could then go back to the protocol for the trial the subject was on to see how that group defined complete response. The researcher could also look at the raw variables used to make the response assessment and make their own assessment (for example, a researcher could look at the subject’s MRD results and decide for themselves whether or not the subject had a complete response, regardless of what the data contributor reported).
The PCDC data dictionary is structured so that it can report the "total dose" of each chemo agent a subject received over a flexible period of time. That period of time is defined by a "total dose start age" and a "total dose end age." Those start and end ages could correspond to the start and end ages of a Disease Phase, of a course, or even of a cycle. This structure allows different data contributors to report the total dose over the most granular time period they have available. Of note, "total dose" can be thought of as analogous to "cumulative dose." However, the data dictionary uses the term "total dose" to reflect the fact that the dose is summed only over the particular time period and is not a true cumulative measurement. The Total Dose table includes both the "actual total dose" (what was actually given) and the "intended total dose" (the total dose by intention to treat), since some groups collect the actual values while others only collect intent-to-treat values.
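To make the windowed nature of "total dose" concrete, here is a small Python sketch that sums doses between a start age and an end age. The record layout, field names, and dose values are made up, units are deliberately left unspecified, and the inclusive window is an assumption.

```python
def total_dose(administrations, agent, start_age, end_age):
    """Sum doses of one agent given between start_age and end_age (inclusive)."""
    return sum(
        a["dose"]
        for a in administrations
        if a["agent"] == agent and start_age <= a["age_at_dose"] <= end_age
    )


# Made-up administrations for one subject.
administrations = [
    {"agent": "Doxorubicin", "age_at_dose": 4000, "dose": 25.0},
    {"agent": "Doxorubicin", "age_at_dose": 4014, "dose": 25.0},
    {"agent": "Doxorubicin", "age_at_dose": 4100, "dose": 25.0},
    {"agent": "Vincristine", "age_at_dose": 4000, "dose": 1.5},
]

# Total over one hypothetical course window versus a wider window.
course_total = total_dose(administrations, "Doxorubicin", 4000, 4028)
wider_total = total_dose(administrations, "Doxorubicin", 4000, 4200)
```

The two totals differ because each is summed only over its own window, which is why the dictionary avoids calling either one a cumulative dose.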