Hi
I understand that the algorithms should be robust to a few missing values like missing demographics but I assume that for challenge 2 in validation all patients will have expression at least. Is this correct? I also assume that the same set of genes will be available in training and validation.
The same for challenge 1?
thanks
DreamAnon
Created by exquirentibus veritatem exquirentibus
I have confirmed that is only the case for MMRF. Thank you for bringing this to our attention.
Mike
Hi dreamAnon,
Are you looking in syn9926877? It should not, but I will look into this to determine what happened and ensure it will not be an issue for the validation.
I noticed that in the training data in the MMRF dataset there were some patients that had a gene expression filename in the clinical file, but that gene expression file did not contain expression for that patient
specifically
"MMRF_1079_1_BM"
"MMRF_1805_1_BM"
"MMRF_1988_1_BM"
"MMRF_2507_1_BM"
are not in MMRF_CoMMpass_IA9_E74GTF_Salmon_Gene_TPM.txt
Am I understanding correctly that this will not happen in validation?
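For anyone who wants to run the same check on their own copy, here is a minimal sketch. It assumes the expression file is a tab-delimited matrix whose first column holds gene IDs and whose remaining header entries are sample IDs; that layout, the function names, and the `DEMO_*` sample IDs are illustrative assumptions, not something confirmed by the organizers.

```python
import csv
import io


def expression_samples(handle, delimiter="\t"):
    """Return the sample IDs found in the header row of an expression matrix.

    Assumption: column 0 is the gene ID, every other column is one sample.
    """
    header = next(csv.reader(handle, delimiter=delimiter))
    return set(header[1:])


def missing_samples(clinical_ids, handle):
    """List the clinical sample IDs that have no column in the matrix."""
    available = expression_samples(handle)
    return [sample for sample in clinical_ids if sample not in available]


# Tiny in-memory stand-in for the real file; DEMO_* IDs are made up.
demo_matrix = io.StringIO(
    "GENE_ID\tDEMO_0001_1_BM\tDEMO_0002_1_BM\n"
    "ENSG00000000003\t1.5\t0.0\n"
)
demo_clinical_ids = ["DEMO_0001_1_BM", "MMRF_1079_1_BM", "DEMO_0002_1_BM"]

missing = missing_samples(demo_clinical_ids, demo_matrix)
print(missing)  # ['MMRF_1079_1_BM']
```

With the real data you would pass an open file handle for the TPM file instead of `demo_matrix`, and the sample IDs taken from the clinical file instead of `demo_clinical_ids`.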
Good.
thanks for the clarification
DA
Dear dreamAnon,
Both challenges will have missing data in these columns. For example:
* In Challenge question 1, DFCI validation data will not have WES-based mutect or Strelka VCFs but **will** have RNA-seq-based mutect calls. No validation sample should have NAs in all three data types.
* In Challenge question 2, DFCI samples will not have microarray data columns filled but **will** have RNA-seq data columns filled, while the Hose dataset will have RNA-seq columns with NA and microarray columns filled.
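One way a submission could cope with this is to pick, per sample, whichever expression column is actually filled. The sketch below assumes each clinical row is available as a dictionary and that missing entries appear as `NA` or an empty string; the column names `rnaseq_file` and `microarray_file`, the sample names, and the file names are hypothetical placeholders, not the real clinical-file headers.

```python
NA_VALUES = {"", "NA", None}


def pick_expression_source(row, preferred=("rnaseq_file", "microarray_file")):
    """Return (column, value) for the first filled column, or None if all are NA.

    The column names in `preferred` are hypothetical; substitute the real
    headers from the clinical file.
    """
    for column in preferred:
        value = row.get(column)
        if value not in NA_VALUES:
            return column, value
    return None


# Made-up rows mimicking the two situations described above, plus a
# defensive all-NA case.
rows = [
    {"sample": "dfci_demo_1", "rnaseq_file": "demo_rnaseq.txt", "microarray_file": "NA"},
    {"sample": "hose_demo_1", "rnaseq_file": "NA", "microarray_file": "demo_microarray.txt"},
    {"sample": "empty_demo_1", "rnaseq_file": "NA", "microarray_file": "NA"},
]

choices = {row["sample"]: pick_expression_source(row) for row in rows}
for sample, choice in choices.items():
    print(sample, choice)
```

Returning `None` instead of raising lets the caller decide how to handle a sample with no usable expression data, for example by falling back to a clinical-only prediction.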