I'm trying to find the correct 'patient' identifiers for the DFCI data set. If I use the column names from the data file, my submission is rejected with 'Must predict all patients in the goldstandard'. I think that's because the DFCI names are wrong: it looks like a leading X has been prepended to the column names by make.names. Where should I pull the correct patient identifiers from? Should 'RNASeq_geneLevelExpFileSamplId' in 'sc2_Validation_ClinAnnotations.csv' match the column names in the data file, with 'Patient' then being the correct Patient ID? Or should I just strip the X from the submission names? Thank you, -S

Created by Sam Danziger sdanziger
Hi All, The simulated clinical annotation files *might* be of some help here. The patient ids there are not real, but they do conform to the general form of the true ids: ids that start with a numeral will start with a numeral, those that contain a dash will contain a dash, etc. Some languages, such as R (though not only R), will do frustrating things like adding an X in front of a leading numeral or converting a "-" or a "_" to a ".". It may be useful to make a fake csv file with some of the simulated ids as headers and see what your code does to them when they get read in. I hope this helps, Mike
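For anyone who wants to try the fake-csv check outside R, here is a minimal Python sketch of the idea. The ids are made up to have the shapes Mike describes (leading digit, dash, underscore), and `headers_after_round_trip` is a hypothetical helper, not part of any challenge tooling. It only uses the standard `csv` module; swap in whatever reader your pipeline actually uses.

```python
import csv
import io

# Made-up ids (NOT real patient ids) with a leading digit, a dash, and underscores.
fake_ids = ["123-ABC", "9_XY_01", "MM-0042"]

def headers_after_round_trip(ids):
    """Write ids as a CSV header row, read it back, and return what the reader saw."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(["gene"] + ids)
    buffer.seek(0)
    # Drop the first column name ("gene") and keep the sample headers.
    return next(csv.reader(buffer))[1:]

seen = headers_after_round_trip(fake_ids)

# Any (original, as_read) pair listed here means the reader rewrote that header.
changed = [(a, b) for a, b in zip(fake_ids, seen) if a != b]
print(changed)
```

An empty list means the reader left the headers alone. If the same check done with your real loading code reports changes, that is where the ids are being mangled.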
Dear Ryu, Do you have a particular submissionId that I could look at? Basically, if you predict on all of the patients in the clinical files associated with each subchallenge, you won't get this issue. Just make sure you also read in both the `Study` and `Patient` columns, and write these out as the `study` and `patient` columns of your prediction file. Best, Tom
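A quick way to check this before submitting is to compare the (study, patient) pairs in the two files. The sketch below is an illustration only: the rows are made up, the `score` column is a hypothetical stand-in for whatever else the prediction file has to contain, and `pairs` is a helper introduced here. Only the `Study`/`Patient` and `study`/`patient` column names come from the post above.

```python
import csv
import io

def pairs(csv_text, study_col, patient_col):
    """Return the set of (study, patient) pairs found in a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row[study_col], row[patient_col]) for row in reader}

# Made-up clinical annotation rows (not real ids).
clinical = "Study,Patient\nDFCI,123-ABC\nDFCI,MM-0042\n"

# Made-up prediction rows; one patient id still carries the R-style mangling.
prediction = "study,patient,score\nDFCI,X123.ABC,0.7\nDFCI,MM-0042,0.2\n"

expected = pairs(clinical, "Study", "Patient")
predicted = pairs(prediction, "study", "patient")

# Pairs the clinical file lists that the prediction file does not cover.
missing = sorted(expected - predicted)
print(missing)
```

With real files, read the text from disk instead of the inline strings. A non-empty `missing` list is the likely cause of the 'Must predict all patients' error.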
Any hints for us non-R people? What exactly should we do to handle the naming properly? Thanks,
Dear Ryu and Sam, Looking at your prediction files, it appears you are reading in the expression file in R without setting `check.names=FALSE`. Please strip the leading X from the `patient` column and it should work. Best, Tom
We are also stuck in a similar situation. Our code is missing all the DFCI samples for some reason.
Mike, My most recent submission is 9633902. Could you give me some hint as to what I should be doing to get past the error message 'Must predict all patients in the goldstandard, and must match the correct study (study + patient).'? Is my study variable wrong? Are my patient names wrong? Is it something simple? Are the names totally mangled? submission name: syn10524034 submission ID: 9633902 Thank you, -Sam
Dear Sam, You should be pulling the correct patient identifiers from `RNASeq_geneLevelExpFileSamplId` in `sc2_Validation_ClinAnnotations.csv`. When you read in the gene level expression file in R, you can set `check.names=FALSE` so that the column headers match what is in the clinical annotation file. Or you could strip the X from the submission names. Best, Tom
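If the names have already been mangled, one option is to map them back using the annotation ids as the reference, rather than blindly stripping every leading X. The following is a rough sketch under stated assumptions: the ids are made up, `strip_r_prefix` and `restore_ids` are hypothetical helpers, and the matching rule (treat every non-alphanumeric character as equivalent) only approximates what R does to headers.

```python
import re

def strip_r_prefix(name):
    """Drop a leading 'X' only when a digit follows, the case R adds it for."""
    return re.sub(r"^X(?=\d)", "", name)

def restore_ids(mangled_names, true_ids):
    """Map each mangled header to the matching true id, or None if nothing matches."""
    def key(s):
        # Assumption: '-', '_' and '.' are interchangeable after mangling,
        # so collapse every non-alphanumeric character to '.' before comparing.
        # Two true ids that differ only in punctuation would collide here.
        return re.sub(r"[^A-Za-z0-9]", ".", s)

    lookup = {key(t): t for t in true_ids}
    restored = {}
    for name in mangled_names:
        # Try the name as-is first, so ids that genuinely start with 'X' survive.
        candidates = (name, strip_r_prefix(name))
        restored[name] = next(
            (lookup[key(c)] for c in candidates if key(c) in lookup), None
        )
    return restored

# Made-up ids standing in for the annotation column and the mangled headers.
true_ids = ["123-ABC", "MM-0042", "X77"]
mangled = ["X123.ABC", "MM.0042", "X77", "unknown.sample"]

restored = restore_ids(mangled, true_ids)
print(restored)
```

Any header that maps to `None` has no counterpart in the annotation ids and is worth inspecting by hand before submitting.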
Dear Sam, Please include the submission ID so we can look into the submission if necessary. Thanks

Challenge 2: Patient IDs for samples in DFCI data set?