
Online documentation

This vignette displays pre-computed results. Run the targets pipeline locally for interactive analysis.

Overview

See the Glossary for term definitions and the units reference used throughout this project.

  • Describes all variables in the MMRF CoMMpass dataset from the GDC
  • CoMMpass: longitudinal observational study of ~1,000 newly diagnosed multiple myeloma patients
  • Variables span clinical demographics, biospecimen metadata, and RNA-seq gene expression

Data sources: see the Data Sources section near the end of this page.

Complete Data Dictionary

Searchable table of all variables available in the CoMMpass dataset with types, categories, and descriptions.

CoMMpass Data Dictionary (38 variables). Click column headers to sort; use filters to search. GDC links open the official GDC data dictionary page. Source: GDC clinical + biospecimen + RNA-seq metadata. See glossary for term definitions and units-table for measurement conventions.
variable category data_type units description typical_range gdc_link
1 submitter_id clinical character NA Patient identifier assigned by the submitting institution (MMRF) MMRF_0001 to MMRF_2149 https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
2 project_id clinical character NA GDC project identifier MMRF-COMMPASS https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
3 age_at_diagnosis clinical integer days Age at primary diagnosis in DAYS (divide by 365.25 for years). GDC stores all ages in days. 10000-35000 (approx 27-96 years) https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
4 gender clinical character NA Patient sex/gender female, male, not reported https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
5 race clinical character NA Patient race category per NIH guidelines white, black or african american, asian, not reported, other https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
6 ethnicity clinical character NA Patient ethnicity per NIH guidelines not hispanic or latino, hispanic or latino, not reported https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
7 vital_status clinical character NA Patient vital status at last follow-up Alive, Dead, Not Reported https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
8 days_to_death clinical integer days Number of days from diagnosis to death. NA if patient is alive. 0-5000+ https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
9 days_to_last_follow_up clinical integer days Number of days from diagnosis to last follow-up. NA if patient is deceased. 0-5000+ https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
10 primary_diagnosis clinical character NA ICD-O-3 morphology code description for the primary diagnosis Plasma cell myeloma, Myeloma NOS https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
11 disease_type clinical character NA Type of disease studied Multiple Myeloma https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
12 site_of_resection_or_biopsy clinical character NA Anatomic site of tissue sample collection Bone marrow, Blood https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
13 tissue_or_organ_of_origin clinical character NA Anatomic site of the disease origin Bone marrow https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
14 year_of_diagnosis clinical integer year Calendar year of primary diagnosis 2005-2020 https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
15 classification_of_tumor clinical character NA Tumor classification primary, recurrence, metastasis, not reported https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
16 prior_malignancy clinical character NA Whether patient had a prior malignancy yes, no, not reported https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
17 prior_treatment clinical character NA Whether patient received prior treatment yes, no, not reported https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
18 ajcc_staging_system_edition clinical character NA AJCC staging edition used various editions https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
19 days_to_last_known_disease_status clinical integer days Days from diagnosis to last disease status assessment 0-5000+ https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=clinical
20 sample_submitter_id biospecimen character NA Sample identifier assigned by submitting institution MMRF_0001_1_BM, etc. https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
21 sample_id biospecimen character NA GDC-assigned UUID for the sample UUID format https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
22 sample_type biospecimen character NA Type of sample collected Primary Blood Derived Cancer - Bone Marrow, etc. https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
23 sample_type_id biospecimen character NA Numeric code for sample type 01, 09, 10, etc. https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
24 tissue_type biospecimen character NA Whether tissue is tumor or normal Tumor, Normal https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
25 preservation_method biospecimen character NA Method used to preserve the sample FFPE, Frozen, etc. https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
26 composition biospecimen character NA Sample composition category Bone Marrow Components, Blood Derived https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=sample
27 gene_id rnaseq_counts character NA Ensembl gene identifier (ENSG with version) ENSG00000000003.15 https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
28 unstranded rnaseq_counts integer raw counts Unstranded read counts from STAR aligner. Use for DESeq2/edgeR analysis. 0-1000000+ https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
29 stranded_first rnaseq_counts integer raw counts First-strand read counts (dUTP protocol) 0-1000000+ https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
30 stranded_second rnaseq_counts integer raw counts Second-strand read counts 0-1000000+ https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
31 tpm_unstranded rnaseq_counts numeric TPM Transcripts Per Million (unstranded). Normalized for gene length and sequencing depth. 0-100000+ https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
32 fpkm_unstranded rnaseq_counts numeric FPKM Fragments Per Kilobase of transcript per Million mapped reads (unstranded). 0-100000+ https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
33 fpkm_uq_unstranded rnaseq_counts numeric FPKM-UQ Upper quartile normalized FPKM (unstranded). 0-100000+ https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
34 gene_name rnaseq_metadata character NA HGNC gene symbol TP53, KRAS, MYC, etc. https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
35 gene_type rnaseq_metadata character NA Biotype from GENCODE annotation protein_coding, lncRNA, miRNA, processed_pseudogene, etc. https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
36 barcode rnaseq_metadata character NA TCGA-style barcode for the sample MMRF-COMMPASS-XXXX-TBM-… https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
37 patient rnaseq_metadata character NA Patient identifier extracted from barcode MMRF-COMMPASS-XXXX https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
38 sample_type rnaseq_metadata character NA Sample type from barcode decoding Primary Blood Derived Cancer - Bone Marrow https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

Clinical Data

Sample Rows

First 5 rows of key clinical variables. submitter_id = MMRF patient ID; age_at_diagnosis = days; iss_stage = International Staging System (I/II/III). Full dataset: 995 patients x 88 variables.
submitter_id age_at_diagnosis gender race vital_status days_to_death days_to_last_follow_up iss_stage
MMRF_2240 24502 male white Alive NA 832 III
MMRF_1038 25255 male white Dead 1753 1753 III
MMRF_1054 21310 female white Alive NA 1828 II
MMRF_2174 31241 male white Dead 425 425 Unknown
MMRF_2168 18590 male black or african american Alive NA 877 I

Summary

  • Total patients: 995
  • Variables: 88
  • Data completeness: 95.8%

Column Structure

Structure of all 88 clinical variables for 995 patients. Time variables are in DAYS (GDC convention). Columns reordered: IDs > demographics > time > vital status > disease > other. See glossary for units reference.
Column Type Non_NA Pct_Complete N_Unique Example
1 project character 995 100% 1 MMRF-COMMPASS
2 submitter_id character 995 100% 995 MMRF_1016, MMRF_1020, MMRF_1021
3 submitter_id character 995 100% 995 MMRF_1016, MMRF_1020, MMRF_1021
4 submitter_sample_ids character 995 100% 995 MMRF_1016_1_BM_CD138pos,MMRF_1016_1_PB_WBC, MMRF_1020_3_BM_CD138pos,MMRF_1020_3_PB_Whole, MMRF_1021_1_BM_CD138pos,MMRF_1021_2_PB_CD3pos
5 submitter_id character 995 100% 995 MMRF_10161, MMRF_10201, MMRF_10211
6 age_at_diagnosis_days character 995 100% 927 NA, 17034, 18560
7 race character 995 100% 6 white, not reported, black or african american
8 gender character 995 100% 2 male, female
9 ethnicity character 995 100% 3 not hispanic or latino, not reported, hispanic or latino
10 age_at_index_days integer 995 100% 59 min=27 med=63 max=89
11 days_to_last_known_disease_status_days character 995 100% 693 1, 997, 617
12 days_to_best_overall_response character 995 100% 1 NA
13 days_to_diagnosis character 995 100% 1 NA
14 days_to_last_follow_up_days character 995 100% 693 1, 997, 617
15 year_of_diagnosis character 995 100% 1 NA
16 days_to_recurrence character 995 100% 1 NA
17 days_to_birth integer 953 96% 926 min=-32800 med=-23200 max=-10100
18 year_of_birth logical 0 0% 0 (all NA)
19 days_to_death_days integer 191 19% 176 min=8 med=475 max=1750
20 year_of_death logical 0 0% 0 (all NA)
21 last_known_disease_status character 995 100% 1 Unknown tumor status
22 cause_of_death character 188 19% 2 Cancer Related, Not Cancer Related
23 vital_status character 995 100% 2 Alive, Dead
24 primary_site character 995 100% 1 Hematopoietic and reticuloendothelial systems
25 irs_stage character 995 100% 1 NA
26 iss_stage character 995 100% 4 II, I, III
27 ajcc_pathologic_stage character 995 100% 1 NA
28 ann_arbor_clinical_stage character 995 100% 1 NA
29 enneking_msts_stage character 995 100% 1 NA
30 inrg_stage character 995 100% 1 NA
31 tissue_or_organ_of_origin character 995 100% 1 Bone marrow
32 cog_liver_stage character 995 100% 1 NA
33 inpc_grade character 995 100% 1 NA
34 wilms_tumor_histologic_subtype character 995 100% 1 NA
35 classification_of_tumor character 995 100% 1 NA
36 cog_renal_stage character 995 100% 1 NA
37 figo_stage character 995 100% 1 NA
38 inss_stage character 995 100% 1 NA
39 tumor_confined_to_organ_of_origin character 995 100% 1 NA
40 primary_diagnosis character 995 100% 1 Multiple myeloma
41 ajcc_clinical_stage character 995 100% 1 NA
42 metastasis_at_diagnosis character 995 100% 1 NA
43 enneking_msts_tumor_site character 995 100% 1 NA
44 ann_arbor_pathologic_stage character 995 100% 1 NA
45 method_of_diagnosis character 995 100% 1 NA
46 diagnosis_id character 995 100% 995 0073d350-15ca-4e89-9326-d95543cf778c, 00a8b803-c44b-41ba-9d44-79c1b87a74aa, 00fe922e-5ecb-4936-a91d-6680d3f393c6
47 site_of_resection_or_biopsy character 995 100% 1 Bone marrow
48 first_symptom_prior_to_diagnosis character 995 100% 1 NA
49 tumor_grade character 995 100% 1 Unknown
50 enneking_msts_grade character 995 100% 1 NA
51 disease_type character 995 100% 1 Plasma Cell Tumors
52 created_datetime character 995 100% 1 2018-07-10T14:08:13.021252-05:00
53 enneking_msts_metastasis character 995 100% 1 NA
54 esophageal_columnar_dysplasia_degree character 995 100% 1 NA
55 child_pugh_classification character 995 100% 1 NA
56 state character 995 100% 1 released
57 prior_treatment character 995 100% 1 NA
58 cog_rhabdomyosarcoma_risk_group character 995 100% 1 NA
59 ajcc_pathologic_t character 995 100% 1 NA
60 morphology character 995 100% 1 9732/3
61 ajcc_pathologic_n character 995 100% 1 NA
62 ajcc_pathologic_m character 995 100% 1 NA
63 irs_group character 995 100% 1 NA
64 medulloblastoma_molecular_classification character 995 100% 1 NA
65 residual_disease character 995 100% 1 NA
66 ann_arbor_b_symptoms character 995 100% 1 NA
67 icd_10_code character 995 100% 1 NA
68 synchronous_malignancy character 995 100% 1 NA
69 burkitt_lymphoma_clinical_variant character 995 100% 1 NA
70 supratentorial_localization character 995 100% 1 NA
71 ishak_fibrosis_score character 995 100% 1 NA
72 goblet_cells_columnar_mucosa_present character 995 100% 1 NA
73 laterality character 995 100% 1 NA
74 cog_neuroblastoma_risk_group character 995 100% 1 NA
75 updated_datetime character 995 100% 2 2019-06-24T08:07:19.797044-05:00, 2019-08-21T12:47:37.999949-05:00
76 prior_malignancy character 995 100% 1 NA
77 best_overall_response character 995 100% 1 NA
78 ann_arbor_extranodal_involvement character 995 100% 1 NA
79 mitosis_karyorrhexis_index character 995 100% 1 NA
80 ajcc_staging_system_edition character 995 100% 1 NA
81 esophageal_columnar_metaplasia_present character 995 100% 1 NA
82 ajcc_clinical_m character 995 100% 1 NA
83 ajcc_clinical_n character 995 100% 1 NA
84 ajcc_clinical_t character 995 100% 1 NA
85 inpc_histologic_group character 995 100% 1 NA
86 gastric_esophageal_junction_involvement character 995 100% 1 NA
87 progression_or_recurrence character 995 100% 1 unknown
88 demographic_id character 995 100% 995 0059d727-d660-43b7-bfc5-e8e374f813bf, 00aa6c25-74bf-4a10-b015-3873314906b5, 00af936c-0b23-4c49-94ac-254f89600cf0
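
The Non_NA, Pct_Complete, and N_Unique columns above can be reproduced with a few lines of base R. The sketch below is not the pipeline's own code: `summarise_columns()` and its `na_strings` argument are hypothetical names, and the toy data frame is made up for illustration. It also assumes that columns reported as 100% complete with NA as their only example value hold the literal string "NA" rather than a true missing value, which is one plausible reading of the table and not a confirmed fact.

```r
# Hypothetical helper: per-column completeness summary in base R.
summarise_columns <- function(df, na_strings = c("NA", "")) {
  rows <- lapply(names(df), function(col) {
    x <- df[[col]]
    # Assumption: literal "NA" strings should count as missing
    if (is.character(x)) x[x %in% na_strings] <- NA
    non_na <- sum(!is.na(x))
    data.frame(
      Column = col,
      Type = class(x)[1],
      Non_NA = non_na,
      Pct_Complete = round(100 * non_na / length(x)),
      N_Unique = length(unique(x[!is.na(x)])),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data (not real CoMMpass records)
toy <- data.frame(
  submitter_id = c("MMRF_1016", "MMRF_1020", "MMRF_1021"),
  irs_stage = c("NA", "NA", "NA"),
  days_to_death = c(NA, 475L, NA),
  stringsAsFactors = FALSE
)
col_summary <- summarise_columns(toy)
print(col_summary)
```

With this treatment, a column such as irs_stage would be reported as 0% complete instead of 100%, which may be closer to what an analyst expects.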

Biospecimen Data

  • Tissue samples collected from patients

  • Includes sample type, tissue type, and preservation method

  • Total records: 2,119

  • Variables: 31

Biospecimen data columns for 2,119 records (31 variables). Column = field name from the GDC biospecimen endpoint; Type = data type in R; Non_NA = non-missing count; Pct_Complete = data completeness; N_Unique = number of distinct values. Source: GDC biospecimen API. See the clinical table for patient-level demographics and exploratory-analysis for the sample-type distribution.
Column Type Non_NA Pct_Complete N_Unique
distributor_reference logical 0 0% 0
tumor_descriptor character 2119 100% 3
sample_id character 2119 100% 2119
diagnosis_pathologically_confirmed logical 0 0% 0
sample_type character 2119 100% 5
created_datetime character 2119 100% 1
distance_normal_to_tumor logical 0 0% 0
time_between_excision_and_freezing logical 0 0% 0
growth_rate logical 0 0% 0
updated_datetime character 2119 100% 1
days_to_collection logical 0 0% 0
method_of_sample_procurement logical 0 0% 0
state character 2119 100% 1
initial_weight logical 0 0% 0
preservation_method character 2119 100% 1
intermediate_dimension logical 0 0% 0
passage_count logical 0 0% 0
time_between_clamping_and_freezing logical 0 0% 0
freezing_method logical 0 0% 0
pathology_report_uuid logical 0 0% 0
submitter_id character 2119 100% 2119
tumor_code_id logical 0 0% 0
shortest_dimension logical 0 0% 0
biospecimen_anatomic_site logical 0 0% 0
specimen_type character 2119 100% 2
biospecimen_laterality logical 0 0% 0
days_to_sample_procurement integer 2119 100% 223
longest_dimension logical 0 0% 0
current_weight logical 0 0% 0
catalog_reference logical 0 0% 0
tissue_type character 2119 100% 2

RNA-seq Data

  • RNA-seq gene expression generated using the STAR aligner
  • Quantified against GENCODE annotations
  • Available in multiple quantification types (raw counts, TPM, FPKM)
  • See units reference for quantification units
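
To make the relationship between the quantification types concrete, here is a minimal sketch using the textbook definitions of TPM and FPKM. The counts and gene lengths are made-up numbers, and the GDC pipeline may differ in details (for example, which genes contribute to the sequencing-depth denominator), so treat this as an illustration of the units and not a reimplementation of the GDC workflow.

```r
# Made-up raw counts and gene lengths for three genes in one sample
counts <- c(100, 200, 300)
length_bp <- c(1000, 2000, 3000)
length_kb <- length_bp / 1000

# TPM: normalise by gene length first, then scale so the sample sums to 1e6
rate <- counts / length_kb
tpm <- rate / sum(rate) * 1e6

# FPKM: normalise by gene length and by total counts in millions
fpkm <- counts / (length_kb * sum(counts) / 1e6)

# TPM can also be recovered by rescaling FPKM to sum to one million
tpm_from_fpkm <- fpkm / sum(fpkm) * 1e6

print(data.frame(counts, length_kb, tpm, fpkm))
```

Because TPM always sums to one million within a sample, it is usually the easier unit for comparing relative abundance across samples; raw counts remain the input expected by count-based tools such as DESeq2 and edgeR.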

Dimensions

  • Samples: 100
  • Genes: 60,660
  • Assays: unstranded, stranded_first, stranded_second, tpm_unstrand, fpkm_unstrand, fpkm_uq_unstrand
Gene biotypes from GENCODE annotations. 60,660 total gene entries. protein_coding genes are the primary analysis target.
Biotype Count Percentage
1 protein_coding 19962 32.9%
2 lncRNA 16901 27.9%
3 processed_pseudogene 10167 16.8%
4 unprocessed_pseudogene 2614 4.3%
5 misc_RNA 2212 3.6%
6 snRNA 1901 3.1%
7 miRNA 1881 3.1%
8 TEC 1057 1.7%
9 snoRNA 943 1.6%
10 transcribed_unprocessed_pseudogene 939 1.5%
11 transcribed_processed_pseudogene 500 0.8%
12 rRNA_pseudogene 497 0.8%
13 IG_V_pseudogene 187 0.3%
14 IG_V_gene 145 0.2%
15 transcribed_unitary_pseudogene 138 0.2%
16 TR_V_gene 106 0.2%
17 unitary_pseudogene 98 0.2%
18 TR_J_gene 79 0.1%
19 scaRNA 49 0.1%
20 polymorphic_pseudogene 48 0.1%
21 rRNA 47 0.1%
22 IG_D_gene 37 0.1%
23 TR_V_pseudogene 33 0.1%
24 Mt_tRNA 22 0.0%
25 IG_J_gene 18 0.0%
26 pseudogene 18 0.0%
27 IG_C_gene 14 0.0%
28 IG_C_pseudogene 9 0.0%
29 ribozyme 8 0.0%
30 TR_C_gene 6 0.0%
31 sRNA 5 0.0%
32 TR_D_gene 4 0.0%
33 TR_J_pseudogene 4 0.0%
34 IG_J_pseudogene 3 0.0%
35 Mt_rRNA 2 0.0%
36 translated_processed_pseudogene 2 0.0%
37 IG_pseudogene 1 0.0%
38 scRNA 1 0.0%
39 translated_unprocessed_pseudogene 1 0.0%
40 vault_RNA 1 0.0%
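
A biotype table like the one above can be built from gene-level annotation with `table()`. The sketch below uses a four-row toy data frame: only the first gene_id is taken from the data dictionary, and the remaining identifiers and all biotype assignments are placeholders for illustration. It also shows two common follow-up steps, stripping the version suffix from Ensembl IDs and keeping only protein_coding genes.

```r
# Toy gene annotation (placeholder values, not real annotation records)
genes <- data.frame(
  gene_id = c("ENSG00000000003.15", "ENSG00000000001.1",
              "ENSG00000000002.2", "ENSG00000000004.3"),
  gene_type = c("protein_coding", "protein_coding", "lncRNA",
                "processed_pseudogene"),
  stringsAsFactors = FALSE
)

# Count and percentage per biotype, most frequent first
biotype_counts <- sort(table(genes$gene_type), decreasing = TRUE)
biotype_tbl <- data.frame(
  Biotype = names(biotype_counts),
  Count = as.integer(biotype_counts),
  Percentage = round(100 * as.integer(biotype_counts) / nrow(genes), 1),
  stringsAsFactors = FALSE
)

# Strip the version suffix (".15") so IDs match unversioned resources
genes$gene_id_base <- sub("\\.[0-9]+$", "", genes$gene_id)

# Restrict to the primary analysis target
coding <- genes[genes$gene_type == "protein_coding", ]
print(biotype_tbl)
```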
RNA-seq sample metadata columns from the SummarizedExperiment colData (42 columns, 100 samples). Column = metadata field name; Type = R class; N_Unique = number of distinct values. Source: GDC mRNA-seq pipeline via TCGAbiolinks. See data-dictionary clinical table for patient-level variables.
Column Type N_Unique
barcode character 100
sample character 100
patient character 98
sample_submitter_id character 100
tumor_descriptor character 2
sample_id character 100
sample_type character 3
state character 1
preservation_method character 1
submitter_id character 98
specimen_type character 2
days_to_sample_procurement integer 47
tissue_type character 1
iss_stage character 4
treatments list 98
tissue_or_organ_of_origin character 1
age_at_diagnosis integer 92
days_to_last_known_disease_status numeric 93
morphology character 1
last_known_disease_status character 1
primary_diagnosis character 1
diagnosis_id character 98
site_of_resection_or_biopsy character 1
tumor_grade character 1
progression_or_recurrence character 1
cause_of_death character 2
race character 5
gender character 2
ethnicity character 3
vital_status character 2
age_at_index integer 38
days_to_birth integer 92
demographic_id character 98
days_to_death integer 27
bcr_patient_barcode character 100
primary_site list 1
dbgap_accession_number character 1
project_id character 1
disease_type list 1
name name character 1
releasable logical 1
released logical 1
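
The table reports 100 samples but only 98 distinct patients, so some patients contribute more than one sample. A minimal base R sketch for spotting them follows. It assumes sample identifiers follow the pattern seen in the submitter_sample_ids examples earlier on this page (patient ID, then visit, tissue, and cell fraction separated by underscores); the regular expression is an illustration and has not been checked against every identifier in the dataset.

```r
# Sample identifiers copied from the submitter_sample_ids examples
sample_ids <- c(
  "MMRF_1016_1_BM_CD138pos",
  "MMRF_1016_1_PB_WBC",
  "MMRF_1020_3_BM_CD138pos",
  "MMRF_1021_1_BM_CD138pos"
)

# Assumption: the patient ID is the leading "MMRF_<digits>" part
patient <- sub("^(MMRF_[0-9]+)_.*$", "\\1", sample_ids)

# Count samples per patient and list patients with more than one
n_per_patient <- table(patient)
multi_sample_patients <- names(n_per_patient)[n_per_patient > 1]
print(multi_sample_patients)
```

Patients with several samples usually need explicit handling (for example, keeping one sample per patient) before any patient-level analysis such as survival modelling.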

Treatment Data

  • Treatment records extracted from GDC API (7,184 records, 994 patients)
  • Each row = one drug administration for one patient
  • Includes therapeutic agent, line of therapy, and timing

Sample Rows

First 5 treatment records from GDC API. Each row = one drug administration for one patient. regimen_or_line_of_therapy = treatment line (First, Second, etc). Full dataset: 7184 records, 994 patients.
public_id therapeutic_agents regimen_or_line_of_therapy days_to_treatment_start days_to_treatment_end
MMRF_2240 Dexamethasone Second line of therapy 849 NA
MMRF_2240 Melphalan First line of therapy 733 733
MMRF_2240 Dexamethasone First line of therapy 1 91
MMRF_2240 Cyclophosphamide First line of therapy 1 68
MMRF_2240 Lenalidomide First line of therapy 91 202
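
Because each row is a single drug administration, a regimen has to be assembled by grouping rows per patient and line of therapy. The sketch below does this in base R using the five sample rows shown above; the object name `regimens` and the " + " separator are arbitrary choices for illustration, not pipeline conventions.

```r
# The five sample treatment rows from the table above
treatments <- data.frame(
  public_id = rep("MMRF_2240", 5),
  therapeutic_agents = c("Dexamethasone", "Melphalan", "Dexamethasone",
                         "Cyclophosphamide", "Lenalidomide"),
  regimen_or_line_of_therapy = c("Second line of therapy",
                                 rep("First line of therapy", 4)),
  days_to_treatment_start = c(849, 733, 1, 1, 91),
  stringsAsFactors = FALSE
)

# Collapse the distinct agents per patient and line into one regimen string
regimens <- aggregate(
  therapeutic_agents ~ public_id + regimen_or_line_of_therapy,
  data = treatments,
  FUN = function(x) paste(sort(unique(x)), collapse = " + ")
)
print(regimens)
```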

Column Name Mappings

GDC uses standardized column names. Common aliases used in analysis:

GDC column name mappings to common aliases used in analysis code (7 mappings). GDC_Column = official GDC field name; Common_Alias = shorthand used in pipeline code. Key: age_at_diagnosis is in DAYS (not years); vital_status determines whether days_to_death or days_to_last_follow_up is used for OS time. See data-dictionary for full variable definitions.
GDC_Column Common_Alias Notes
1 submitter_id patient_id Unique patient identifier
2 age_at_diagnosis age_days In days, not years (divide by 365.25)
3 vital_status status Alive/Dead
4 days_to_death os_time (if dead) Overall survival time
5 days_to_last_follow_up os_time (if alive) Censoring time
6 gender sex male/female
7 primary_diagnosis diagnosis ICD-O-3 morphology
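
The os_time and status aliases can be derived as sketched below. This is a minimal base R illustration built from three of the clinical sample rows shown earlier; it is not the pipeline's own implementation, and `os_years` is an extra column added here for convenience.

```r
# Three rows taken from the clinical sample rows on this page
clinical <- data.frame(
  submitter_id = c("MMRF_2240", "MMRF_1038", "MMRF_1054"),
  vital_status = c("Alive", "Dead", "Alive"),
  days_to_death = c(NA, 1753, NA),
  days_to_last_follow_up = c(832, 1753, 1828),
  stringsAsFactors = FALSE
)

# Event indicator: 1 = death observed, 0 = censored at last follow-up
clinical$status <- as.integer(clinical$vital_status == "Dead")

# OS time in days: days_to_death if dead, otherwise days_to_last_follow_up
clinical$os_time <- ifelse(
  clinical$vital_status == "Dead",
  clinical$days_to_death,
  clinical$days_to_last_follow_up
)

# Optional conversion from days to years
clinical$os_years <- clinical$os_time / 365.25
print(clinical[, c("submitter_id", "status", "os_time")])
```

The resulting os_time and status columns are in the shape expected by standard survival tooling, for example as the two arguments of a time-to-event object.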

Units Reference

For the complete units reference table (age/time in days, expression quantification types), see the Glossary units reference.

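Key rule: GDC stores age_at_diagnosis and every days_to_* variable in DAYS; divide by 365.25 to get years. This is the single most common source of errors in GDC analyses. One exception is visible in the column structure table: age_at_index ranges from 27 to 89, so it is in years despite the _days suffix on its column name.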
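
A minimal sketch of the conversion follows, using the example values listed for age_at_diagnosis_days in the column structure table. Since that column is reported as character, the sketch coerces it to numeric first; the plausibility bounds (18 to 120 years) are an arbitrary illustrative guard and not a rule from the GDC.

```r
# Example values from the column structure table (stored as character)
age_days_raw <- c("NA", "17034", "18560")

# Coerce to numeric; the literal "NA" string becomes a true missing value
age_days <- suppressWarnings(as.numeric(age_days_raw))

# Convert days to years
age_years <- age_days / 365.25

# Illustrative guard: adult ages outside 18 to 120 suggest a units mix-up
plausible <- is.na(age_years) | (age_years > 18 & age_years < 120)
stopifnot(all(plausible))
print(round(age_years, 1))
```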

For exploratory analysis using these variables, see the EDA vignette.

R Code Examples

Loading Data

How to load CoMMpass data from parquet files and query with DuckDB.

safe_tar_read("code_dd_load_data")

library(coMMpass)

# Load clinical data from parquet
clinical <- query_commpass_parquet("clinical")

# Load with DuckDB lazy evaluation
con <- DBI::dbConnect(duckdb::duckdb())
clinical_tbl <- get_commpass_tbl("clinical", con = con)
result <- clinical_tbl |>
  dplyr::filter(gender == "female") |>
  dplyr::select(submitter_id, age_at_diagnosis, vital_status) |>
  dplyr::mutate(age_years = age_at_diagnosis / 365.25) |>
  dplyr::collect()
DBI::dbDisconnect(con, shutdown = TRUE)

Exploring the Dictionary

Programmatic access to variable documentation and metadata.

safe_tar_read("code_dd_explore_dict")

dd <- get_commpass_data_dictionary()

# Find all clinical variables
dplyr::filter(dd, category == "clinical")

# Get detailed docs for a variable
docs <- get_variable_docs("age_at_diagnosis")
cat(docs$usage_notes)

Data Sources

Results in this vignette are derived from the MMRF CoMMpass study (MMRF-COMMPASS, ~1,143 patients), downloaded via TCGAbiolinks. The pipeline runs with a configurable sample_limit (default 200; CI uses 20).

For full citations, data access tiers, and the distinction between pipeline data and synthetic test data, see the Data Sources vignette.

Recent Changes

Recent project commits with lines added, files changed, and change categories.

Last 20 project commits with change statistics. date = commit date; type = change category derived from the conventional-commit prefix (feat/fix/docs/ci/refactor/test/chore); n_files = number of files modified; lines_added/lines_removed = lines added/removed. Source: git log --numstat. See changes-by-type table for aggregate breakdown.
date type summary n_files lines_added lines_removed file_categories
2026-03-14 Bug Fix fix(pipeline): Fix 11 NULL targets — DE condition, ID matching, consensus type 41 146 47 Other, R Source
2026-03-14 Bug Fix fix(cachix): Remove --watch-mode auto flag (already default) 1 1 1 Other
2026-03-14 Bug Fix fix(pipeline): Fix 3 NULL-target bugs, auto-generate package.nix (#93) 87 235 80 Config, Docs, Other, R Source
2026-03-14 Bug Fix fix(nix): Fix cachix signing key, rebuild Bioconductor-dependent targets 2 0 0 Other
2026-03-14 New Feature feat(captions): Add dynamic captions to 34 table/plot targets 22 579 89 Other, R Source
2026-03-14 Bug Fix fix(vignettes): Enforce zero-computation rule — 22 violations → 0 32 360 764 Other, R Source, Vignettes
2026-03-13 Bug Fix fix(vignettes): Convert kable RDS to data.frames, fix telemetry eval guards 18 8 2 Other, Vignettes
2026-03-13 Bug Fix fix(ci): Save data frames (not DT widgets) to RDS for Nix portability 3 0 0 Other
2026-03-13 Bug Fix fix(vignettes): Use Quarto #| eval syntax for pkgdown-banner chunks 11 44 11 Vignettes
2026-03-13 Refactoring refactor(targets): Move Bioconductor packages to per-target declarations 11 35 17 Other, R Source, Vignettes
2026-03-13 New Feature feat(vignettes): Add code provenance, kable→DT conversion, caption compliance 35 1004 437 CI/CD, Other, R Source, Vignettes
2026-03-13 Bug Fix fix(vignettes): Skip NULL RDS in safe_tar_read, return invisible(NULL) 11 22 22 Vignettes
2026-03-13 Bug Fix fix(glossary): Prevent double DT::datatable() wrapping in glossary-table chunk 1 3 1 Vignettes
2026-03-13 CI/CD ci: Show quarto errors with quiet=FALSE, render individual vignettes in diagnostic 1 20 6 CI/CD
2026-03-13 CI/CD ci: Add verbose quarto error diagnostics on build failure 1 14 1 CI/CD
2026-03-13 Bug Fix fix(vignettes): Strip Nix paths from DT widgets, auto-wrap data frames 25 66 28 CI/CD, Other, Vignettes
2026-03-13 CI/CD ci: Add diagnostic quarto render step to debug build failure 1 17 0 CI/CD
2026-03-13 Bug Fix fix(vignettes): Revert safe_tar_read placeholder, guard gene-report 11 12 56 Vignettes
2026-03-13 Maintenance chore: Export vig_count_distribution_plot as ggplot RDS (513KB) 1 0 0 Other
2026-03-13 Bug Fix fix(vignettes): Enable code eval in CI with RDS fallback 80 113 74 CI/CD, Other, R Source, Vignettes

Reproducibility

Session Info
sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.2         cli_3.6.5           knitr_1.51         
#>  [4] rlang_1.1.7         xfun_0.57           otel_0.2.0         
#>  [7] processx_3.8.6      targets_1.12.0      jsonlite_2.0.0     
#> [10] data.table_1.18.2.1 glue_1.8.0          prettyunits_1.2.0  
#> [13] backports_1.5.0     htmltools_0.5.9     ps_1.9.1           
#> [16] rmarkdown_2.30      tibble_3.3.1        evaluate_1.0.5     
#> [19] base64url_1.4       fastmap_1.2.0       yaml_2.3.12        
#> [22] lifecycle_1.0.5     compiler_4.5.3      codetools_0.2-20   
#> [25] igraph_2.2.2        pkgconfig_2.0.3     digest_0.6.39      
#> [28] R6_2.6.1            tidyselect_1.2.1    pillar_1.11.1      
#> [31] callr_3.7.6         magrittr_2.0.4      withr_3.0.2        
#> [34] tools_4.5.3         secretbase_1.2.0