In the Wiki, the output file should have 4 columns (dataset.name, sample.id, cell.type, prediction), but the example on GitHub is much simpler: https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge-Workflow/blob/master/example_files/predictions.csv. So which one should we use?

Created by Dani Livne zurkin
Hi @zurkin, Either the GEO annotations or the original publication that generated the data will indicate whether the data are on a log or linear scale. This can also be checked by looking at the values. Log values will generally be "small" (e.g., less than 10). For example, if the data are from GEO, they will be annotated with a data.processing column that includes information such as the normalization approach used. As one example, if the normalization is RMA, the data are likely in log2 format: RMA by default outputs in log2 and, unless the data contributors took the anti-log of the data before submitting, the data will remain in log2 scale. Brian
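The "look at the values" check could be sketched roughly as below. This is only a heuristic under stated assumptions, not the organizers' procedure: `guess_scale` is a hypothetical helper, and the cutoff of 50 on the maximum value is an arbitrary choice, deliberately looser than the "less than 10" rule of thumb because log2 intensities can exceed 10. The GEO annotation or the publication remains the authoritative source.

```python
import math


def guess_scale(values, log_max=50.0):
    """Guess whether expression values look log-transformed or linear.

    Heuristic only: log-scale data rarely contain large values, while
    linear intensities, counts, FPKMs, or TPMs usually do.
    `log_max` is an assumed cutoff, not a value from the challenge wiki.
    """
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        raise ValueError("no finite expression values to inspect")
    return "log-like" if max(finite) < log_max else "linear-like"


if __name__ == "__main__":
    print(guess_scale([2.1, 7.5, 13.2]))        # small values
    print(guess_scale([0.0, 150.0, 20000.0]))   # large values
```

A result of "log-like" does not say which base was used (log2 vs log10), so the annotation still has to be read to fill in the scale parameter.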
Thank you, but I still don't understand how you decide on the 'scale' parameter. Given a new GSE dataset, what is the process for deciding on this parameter?
Hi @zurkin, Microarray expression data are commonly represented as log2 values, and some normalization approaches output directly on a log scale (e.g., RMA). Linear refers to the original scale of the intensities (or the anti-log of log values). In the context of RNA-seq data, linear could be counts, FPKMs, or TPMs, for example. When we use publicly available data (e.g., from GEO), we provide the data in the original scale submitted by the authors. So, no, we are not applying any log/exponentiation transformations ourselves. If you do so, please note that linear does not necessarily imply that all values are positive. Sometimes linear values may be shifted and/or batch corrected, which can lead to negative values. Best, Brian
Would you please provide more details on the scale parameter? From the Wiki: "scale The scale of the expression data (e.g., Log2, Log10, Linear)". How is this scale deduced from the input sample (the GSEXX sample)? Are you applying a log transform to the GSE data? Was it applied before you downloaded the data?
The most recent prediction file you made looks like:

| dataset.name | sample.id | cell.type | prediction |
|---|---|---|---|
| ds2 | Sample_1 | B.cells | 0.030671346927550403 |
| ds2 | Sample_2 | B.cells | 0.04365072967001203 |
| ds2 | Sample_3 | B.cells | 0.046015139054162324 |
| ds2 | Sample_1 | CD4.T.cells | 0.10605049787657665 |
| ds2 | Sample_2 | CD4.T.cells | 0.1403652825623914 |
| ds2 | Sample_3 | CD4.T.cells | 0.1290456058979449 |
| ds2 | Sample_1 | CD8.T.cells | 0.17961414911217238 |
| ds2 | Sample_2 | CD8.T.cells | 0.26130829070406925 |
| ds2 | Sample_3 | CD8.T.cells | 0.22795497032685091 |
| ds2 | Sample_1 | NK.cells | 0.6783032479586784 |
| ds2 | Sample_2 | NK.cells | 0.5651775230225564 |
| ds2 | Sample_3 | NK.cells | 0.590418775025309 |

The error message:

Prediction file has missing predictions: [CD4.T.cells;ds1;Sample_1, CD8.T.cells;ds1;Sample_1, NK.cells;ds1;Sample_1, B.cells;ds1;Sample_1, monocytic.lineage;ds1;Sample_1, neutrophils;ds1;Sample_1, endothelial.cells;ds1;Sample_1, fibroblasts;ds1;Sample_1, CD4.T.cells;ds1;Sample_2, CD8.T.cells;ds1;Sample_2, NK.cells;ds1;Sample_2, B.cells;ds1;Sample_2, monocytic.lineage;ds1;Sample_2, neutrophils;ds1;Sample_2, endothelial.cells;ds1;Sample_2, fibroblasts;ds1;Sample_2, monocytic.lineage;ds2;Sample_1, neutrophils;ds2;Sample_1, endothelial.cells;ds2;Sample_1, fibroblasts;ds2;Sample_1, monocytic.lineage;ds2;Sample_2, neutrophils;ds2;Sample_2, endothelial.cells;ds2;Sample_2, fibroblasts;ds2;Sample_2, monocytic.lineage;ds2;Sample_3, neutrophils;ds2;Sample_3, endothelial.cells;ds2;Sample_3, fibroblasts;ds2;Sample_3].

Your prediction file is missing all of the dataset 1 predictions and about half of the dataset 2 predictions.
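A local check along the following lines might catch this before submitting. It is a minimal sketch, not the challenge's actual validator: `missing_predictions` and its arguments are hypothetical, the file is assumed to be comma-separated with the four header names from the wiki, and the cell-type list is simply the eight types that appear in the error message above. It reports missing entries in the same `cell.type;dataset.name;sample.id` form.

```python
import csv
import io

# The eight cell types named in the error message above.
CELL_TYPES = [
    "CD4.T.cells", "CD8.T.cells", "NK.cells", "B.cells",
    "monocytic.lineage", "neutrophils", "endothelial.cells", "fibroblasts",
]


def missing_predictions(prediction_csv_text, samples_by_dataset,
                        cell_types=CELL_TYPES):
    """List every expected (cell type, dataset, sample) row that is absent.

    `samples_by_dataset` maps each dataset name to its sample IDs; the
    caller is assumed to build it from the challenge input files.
    """
    reader = csv.DictReader(io.StringIO(prediction_csv_text))
    present = {
        (row["cell.type"], row["dataset.name"], row["sample.id"])
        for row in reader
    }
    missing = []
    for dataset, samples in samples_by_dataset.items():
        for sample in samples:
            for cell in cell_types:
                key = (cell, dataset, sample)
                if key not in present:
                    missing.append(";".join(key))
    return missing


if __name__ == "__main__":
    text = (
        "dataset.name,sample.id,cell.type,prediction\n"
        "ds2,Sample_1,B.cells,0.03\n"
    )
    for entry in missing_predictions(text, {"ds2": ["Sample_1"]}):
        print(entry)
```

An empty list would mean every dataset, sample, and cell type combination has a row; it says nothing about whether the prediction values themselves are valid.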
I am not sure if this is the reason. The output.csv as I printed in the logs is exactly as required in the Wiki:

| | dataset.name | sample.id | cell.type | prediction |
|---|---|---|---|---|
| 0 | ds1 | Sample_1 | B.cells | 0.340249 |
| 1 | ds1 | Sample_2 | B.cells | 0.340337 |
| 2 | ds1 | Sample_1 | CD4.T.cells | 0.394372 |
| 3 | ds1 | Sample_2 | CD4.T.cells | 0.400493 |
| 4 | ds1 | Sample_1 | CD8.T.cells | 0.233511 |
| 5 | ds1 | Sample_2 | CD8.T.cells | 0.250012 |
| 6 | ds1 | Sample_1 | NK.cells | 0.017449 |
| 7 | ds1 | Sample_2 | NK.cells | 0.022712 |

But, looking at your logs, it complains about missing: CD4.T.cells;ds1;Sample_1. The order is *swapped* (cell type before dataset name). Can you please check?
Please see the Output file section [here](https://www.synapse.org/#!Synapse:syn15589870/wiki/592699). It looks like the issue is the dataset.name and sample.id columns. The names in the dataset.name column must be exact matches to the names in [input.csv](https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge-Workflow/blob/master/example_files/fast_lane_dir/input.csv). The names in the sample.id column must be exact matches to the column names in the [input files](https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge-Workflow/blob/master/example_files/fast_lane_dir/ds1.csv).

For example:

| dataset.name | sample.id | cell.type | prediction |
|---|---|---|---|
| dataset_B | Sample1 | B.cells | 0 |

should look like:

| dataset.name | sample.id | cell.type | prediction |
|---|---|---|---|
| ds2 | Sample_1 | B.cells | 0 |
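One way to catch name mismatches like `dataset_B` or `Sample1` locally is sketched below. Both helpers are hypothetical, and the layout of the expression files is an assumption made for illustration: the first header column is taken to hold the gene identifier and every remaining header column to be a sample ID. Please check that against the actual input files before relying on it.

```python
import csv
import io


def sample_ids_from_expression_csv(text):
    """Return the sample IDs from an expression file's header row.

    Assumes (not confirmed by the wiki) that the first column is the
    gene identifier and all other header columns are sample IDs.
    """
    header = next(csv.reader(io.StringIO(text)))
    return header[1:]


def unknown_names(prediction_rows, samples_by_dataset):
    """Report dataset.name / sample.id values that are not exact matches."""
    problems = []
    for row in prediction_rows:
        dataset, sample = row["dataset.name"], row["sample.id"]
        if dataset not in samples_by_dataset:
            problems.append(f"unknown dataset.name: {dataset}")
        elif sample not in samples_by_dataset[dataset]:
            problems.append(f"unknown sample.id for {dataset}: {sample}")
    return problems


if __name__ == "__main__":
    # Made-up expression file content, for illustration only.
    expression = "Gene,Sample_1,Sample_2\nGENE_A,5.1,4.8\n"
    allowed = {"ds2": sample_ids_from_expression_csv(expression)}
    rows = [
        {"dataset.name": "dataset_B", "sample.id": "Sample1"},
        {"dataset.name": "ds2", "sample.id": "Sample1"},
        {"dataset.name": "ds2", "sample.id": "Sample_1"},
    ]
    for problem in unknown_names(rows, allowed):
        print(problem)
```

The comparison is deliberately an exact string match, since the names must match exactly; no case folding or whitespace trimming is applied.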
I keep getting a "submission is invalid" message:

Your submission is invalid, below are the reason(s): Prediction file has missing predictions: [CD4.T.cells;ds1;Sample_1, CD8....

I checked locally and I have all these cell types in /output/output.csv:

| dataset.name | sample.id | cell.type | prediction |
|---|---|---|---|
| dataset_B | Sample1 | B.cells | 0 |
| dataset_B | Sample2 | B.cells | 0 |
| dataset_B | Sample1 | CD4.T.cells | 0.689872095 |
| dataset_B | Sample2 | CD4.T.cells | 0.709728409 |
| dataset_B | Sample1 | CD8.T.cells | 0.296780039 |
| dataset_B | Sample2 | CD8.T.cells | 0.303110634 |
| dataset_B | Sample1 | NK.cells | 0 |
| dataset_B | Sample2 | NK.cells | 0 |

Can you please assist?
@zurkin The cancer type will be a column in the input.csv file. Please see the updated wiki for more information: https://www.synapse.org/#!Synapse:syn15589870/wiki/592699
Hi zurkin, I think what you are asking is how will your docker image know what sub-challenge it is currently making a prediction for? The answer is it won't. You'll need to have two different versions, that you will submit to the appropriate submission queue. We don't have a list of the exact values for platform type just yet. At some point before the rounds start we will make the input.csv files available.
- How can I tell from input.csv which sub-challenge it is (are there 8 or 16 cell types to predict)?
- What are the valid values for 'platform' (and the related mapping to either Microarray or RNAseq)?
I'll have to get back to you on that regarding the validation phase. (Or more likely Brian when he gets back next week). It's likely it won't be a parameter, but we will tell you beforehand. However, for the leaderboard phase the data is from various non-cancer datasets, so there will be no tumor type.
A somewhat related question: in the webinar it was mentioned that the kind of tumor (BRCA or COA) will be given as an input parameter. But I can't see it in https://www.synapse.org/#!Synapse:syn15589870/wiki/592699.
Use the version in the wiki. The GitHub example is outdated and will be replaced very soon. -Andrew
