Extracting Participant ID from Tumor_Sample_Barcode

I would like to work out the percentage of patients with a particular mutation. I'm looking at the "data_mutations_extended.txt" data.

The guide states: "The unique, anonymized patient identifier for the GENIE project. Conforms to the following the convention: GENIE-CENTER-1234. The first component is the string, "GENIE"; the second component is the Center abbreviation. The third component is an anonymized unique identifier for the patient."

If I split the identifier by "-", I do not get the expected format. I've replaced instances of "MSK-P-" with "MSKP-", but there are still Tumor_Sample_Barcode values with more "-"s than expected. How do I extract just the participant ID from this variably formatted column?

Some examples:
"GENIE-JHU-00006-00185" = 00006 is the patient ID
"GENIE-MSK-P-0048298-T02-IM6" = 0048298 is presumably the patient ID
"GENIE-GRCC-06d79b10-metastasis-a" = 06d79b10 is presumably the patient ID

Created by yesitsjess
Hi @yesitsjess , Thanks for your interest in the GENIE data. We recommend using the clinical sample file `data_clinical_sample.txt` from each release to map each SAMPLE_ID to its PATIENT_ID. Please let us know if you have any other questions. Best, Tom
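
The lookup suggested above could be sketched roughly as follows, in place of splitting the barcode on "-". This is a minimal sketch under a few assumptions that are not stated in the thread: that both files are tab-delimited, that comment lines in the clinical file start with "#", that Tumor_Sample_Barcode in the mutation file holds the same values as SAMPLE_ID in the clinical file, and that the gene column is named Hugo_Symbol. The function names and the toy rows are hypothetical (the toy rows reuse the barcodes from the question, plus one made-up second sample); check the column names against your release.

```python
import csv
import io


def load_sample_to_patient(lines):
    """Build a SAMPLE_ID -> PATIENT_ID lookup from clinical sample lines."""
    # Assumption: metadata/comment lines start with "#"; skip them.
    rows = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["SAMPLE_ID"]: row["PATIENT_ID"] for row in reader}


def percent_patients_mutated(mutation_lines, sample_to_patient, gene):
    """Percent of patients with at least one mutation record in `gene`."""
    rows = (line for line in mutation_lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    mutated = set()
    for row in reader:
        if row["Hugo_Symbol"] != gene:
            continue
        # Look the barcode up instead of parsing it.
        patient = sample_to_patient.get(row["Tumor_Sample_Barcode"])
        if patient is not None:
            mutated.add(patient)  # a set counts each patient once
    # Assumption: the denominator is every patient in the clinical file.
    total = len(set(sample_to_patient.values()))
    return 100.0 * len(mutated) / total if total else 0.0


# Toy data for illustration only (hypothetical rows, not real GENIE records).
clinical_text = (
    "#Patient identifier\tSample identifier\n"
    "PATIENT_ID\tSAMPLE_ID\n"
    "GENIE-JHU-00006\tGENIE-JHU-00006-00185\n"
    "GENIE-MSK-P-0048298\tGENIE-MSK-P-0048298-T02-IM6\n"
    "GENIE-MSK-P-0048298\tGENIE-MSK-P-0048298-T01-IM6\n"
    "GENIE-GRCC-06d79b10\tGENIE-GRCC-06d79b10-metastasis-a\n"
)
mutation_text = (
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "KRAS\tGENIE-MSK-P-0048298-T02-IM6\n"
    "KRAS\tGENIE-MSK-P-0048298-T01-IM6\n"
    "TP53\tGENIE-JHU-00006-00185\n"
)

sample_to_patient = load_sample_to_patient(io.StringIO(clinical_text))
kras_percent = percent_patients_mutated(
    io.StringIO(mutation_text), sample_to_patient, "KRAS"
)
print(f"{kras_percent:.1f}% of patients have a KRAS mutation")
```

With the real files, the same functions should accept open file handles, for example `open("data_clinical_sample.txt", newline="")` in place of the `io.StringIO` objects. Counting patients through a set means someone with several mutated samples is only counted once. Whether every patient in the clinical file belongs in the denominator depends on your analysis, so treat that choice as a placeholder.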
