I would like to work out the % of patients with a particular mutation. I'm looking at "data_mutations_extended.txt" data. The guide states:
"The unique, anonymized patient identifier for the GENIE project. Conforms to the following the convention: GENIE-CENTER-1234. The first component is the string, 'GENIE'; the second component is the Center abbreviation. The third component is an anonymized unique identifier for the patient."
If I split the identifier by "-", I do not get the expected format. I've replaced instances of "MSK-P-" with "MSKP-", but there are still Tumor_Sample_Barcode values that contain more "-"s than expected. How do I extract just the participant ID out of this variably formatted column? Some examples:
"GENIE-JHU-00006-00185" = 00006 is patient ID
"GENIE-MSK-P-0048298-T02-IM6" = 0048298 is presumably the patient ID
"GENIE-GRCC-06d79b10-metastasis-a" = 06d79b10 is presumably the patient ID
Created by yesitsjess

Hi @yesitsjess,
Thanks for your interest in the GENIE data. Rather than parsing the barcode, we recommend using the clinical sample file `data_clinical_sample.txt` from each release to map each SAMPLE_ID to its PATIENT_ID. Please let us know if you have any other questions.
Best,
Tom
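
For anyone landing on this thread later, the lookup suggested above can be sketched roughly as follows. This is a minimal illustration, not an official GENIE script. It assumes that Tumor_Sample_Barcode in the mutation file matches SAMPLE_ID in the clinical sample file, that the clinical file may start with "#" comment lines, and that the gene column is named Hugo_Symbol. The helper names `read_tsv` and `percent_patients_with_mutation`, the patient IDs, and the rows in the demo data are all made up for illustration.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))


def percent_patients_with_mutation(clinical_rows, mutation_rows, gene):
    """Percentage of patients with at least one mutation in `gene`."""
    # Map each sample to its patient via the clinical sample file,
    # instead of splitting the barcode on "-".
    sample_to_patient = {r["SAMPLE_ID"]: r["PATIENT_ID"] for r in clinical_rows}
    all_patients = set(sample_to_patient.values())
    if not all_patients:
        return 0.0
    # Collect patients (not samples), so multiple samples count once.
    mutated_patients = {
        sample_to_patient[r["Tumor_Sample_Barcode"]]
        for r in mutation_rows
        if r["Hugo_Symbol"] == gene
        and r["Tumor_Sample_Barcode"] in sample_to_patient
    }
    return 100.0 * len(mutated_patients) / len(all_patients)


# Illustrative stand-ins for data_clinical_sample.txt and
# data_mutations_extended.txt (made-up rows, assumed column layout).
clinical_text = (
    "#Patient Identifier\tSample Identifier\n"
    "PATIENT_ID\tSAMPLE_ID\n"
    "GENIE-JHU-00006\tGENIE-JHU-00006-00185\n"
    "GENIE-JHU-00007\tGENIE-JHU-00007-00190\n"
    "GENIE-MSK-P-0048298\tGENIE-MSK-P-0048298-T01-IM6\n"
    "GENIE-MSK-P-0048298\tGENIE-MSK-P-0048298-T02-IM6\n"
    "GENIE-GRCC-06d79b10\tGENIE-GRCC-06d79b10-metastasis-a\n"
)
mutation_text = (
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "KRAS\tGENIE-JHU-00006-00185\n"
    "KRAS\tGENIE-MSK-P-0048298-T01-IM6\n"
    "KRAS\tGENIE-MSK-P-0048298-T02-IM6\n"
    "TP53\tGENIE-GRCC-06d79b10-metastasis-a\n"
)

clinical_rows = read_tsv(io.StringIO(clinical_text))
mutation_rows = read_tsv(io.StringIO(mutation_text))
print(percent_patients_with_mutation(clinical_rows, mutation_rows, "KRAS"))
```

With real data, the two `io.StringIO` objects would be replaced by opened file handles for the release files. The denominator here is every patient in the clinical sample file; if only some samples were sequenced for the gene of interest, the denominator would need to be restricted accordingly.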