I would like to work out the percentage of patients with a particular mutation, using the "data_mutations_extended.txt" file. The guide states:

"The unique, anonymized patient identifier for the GENIE project. Conforms to the following the convention: GENIE-CENTER-1234. The first component is the string, 'GENIE'; the second component is the Center abbreviation. The third component is an anonymized unique identifier for the patient."

If I split the identifier on "-", I do not get the expected format. I've replaced instances of "MSK-P-" with "MSKP-", but there are still Tumor_Sample_Barcode values with more "-"s than expected. How do I extract just the participant ID from this variably formatted column?

Some examples:

"GENIE-JHU-00006-00185" = 00006 is the patient ID
"GENIE-MSK-P-0048298-T02-IM6" = 0048298 is presumably the patient ID
"GENIE-GRCC-06d79b10-metastasis-a" = 06d79b10 is presumably the patient ID

Created by yesitsjess
Hi @yesitsjess,

Thanks for your interest in the GENIE data. We recommend using the clinical sample file `data_clinical_sample.txt` from each release to map each SAMPLE_ID to its PATIENT_ID. Please let us know if you have any other questions.

Best,
Tom
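
For anyone landing here later, a minimal sketch of the lookup approach, rather than parsing the barcode by hand. It assumes that Tumor_Sample_Barcode in the mutation file matches SAMPLE_ID in the clinical sample file, that the gene column is named Hugo_Symbol, and that any metadata lines at the top of either file start with "#". The helper names and the toy rows below are made up for illustration; check the column names and ID formats against your own release.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts, skipping '#' metadata lines."""
    lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))


def percent_patients_with_mutation(clinical_rows, mutation_rows, gene):
    """Percentage of patients with at least one mutation in `gene`."""
    # Look up each sample's patient instead of splitting the barcode on "-".
    sample_to_patient = {r["SAMPLE_ID"]: r["PATIENT_ID"] for r in clinical_rows}
    all_patients = set(sample_to_patient.values())
    # A set counts each patient once, even with several mutated samples.
    mutated_patients = {
        sample_to_patient[r["Tumor_Sample_Barcode"]]
        for r in mutation_rows
        if r["Hugo_Symbol"] == gene and r["Tumor_Sample_Barcode"] in sample_to_patient
    }
    return 100.0 * len(mutated_patients) / len(all_patients)


# Toy stand-ins for data_clinical_sample.txt and data_mutations_extended.txt.
# With real files, pass open(path, newline="") to read_tsv instead.
clinical_text = (
    "#Patient Identifier\tSample Identifier\n"
    "PATIENT_ID\tSAMPLE_ID\n"
    "GENIE-JHU-00006\tGENIE-JHU-00006-00185\n"
    "GENIE-JHU-00007\tGENIE-JHU-00007-00200\n"
    "GENIE-MSK-P-0048298\tGENIE-MSK-P-0048298-T01-IM6\n"
    "GENIE-MSK-P-0048298\tGENIE-MSK-P-0048298-T02-IM6\n"
    "GENIE-GRCC-06d79b10\tGENIE-GRCC-06d79b10-metastasis-a\n"
)
mutations_text = (
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "KRAS\tGENIE-JHU-00006-00185\n"
    "KRAS\tGENIE-MSK-P-0048298-T01-IM6\n"
    "KRAS\tGENIE-MSK-P-0048298-T02-IM6\n"
    "TP53\tGENIE-GRCC-06d79b10-metastasis-a\n"
)

clinical = read_tsv(io.StringIO(clinical_text))
mutations = read_tsv(io.StringIO(mutations_text))

print(percent_patients_with_mutation(clinical, mutations, "KRAS"))  # 50.0
```

The denominator here is every patient in the clinical sample file. Depending on your question, you may want to restrict it to patients whose samples were sequenced on a panel that covers the gene of interest.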
