Hi!
After taking a look at the example datasets, I realized that each one has around 12,000 features/transcripts, which is nowhere near the Kallisto output with the GRCh38 Ensembl annotation, which has around 57,000 features/transcripts.
If a feature that is exceptionally important to a given prediction model is missing from the datasets, that might strongly affect the model's predictions.
Taking this issue into consideration, are the genes in the example datasets the only features given, or will there be more? If the latter, will they vary across datasets or remain constant?
Thank you!
Martin
Created by Martin Guerrero martinguerrero89
Hi @jdroz
Hopefully you saw that Andrew posted these here:
https://www.synapse.org/#!Synapse:syn15589870/discussion/threadId=5915
Brian
Hi Andrew,
Could we have a list of the exact features available for the different microarrays and for your version of the GRCh38 reference? I'd like to have it as Ensembl genes, but other participants might want other formats. Many machine learning techniques do not deal well with missing data (at least not without considerable tweaking).
I understand that this is asking a lot, but I think it would be good to have it for the next round. If that is not possible, I still think you should at least provide the list of features available for the validation phase.
Thank you!
Jean-Marie
Hi Andrew,
Yeah, I was talking about those examples, and you made your point clear! Thank you for the information!
Hi Martin,
When you refer to the example datasets, do you mean the ones here:
https://github.com/Sage-Bionetworks/Tumor-Deconvolution-Challenge-Workflow/tree/master/example_files/fast_lane_dir
These and the leaderboard datasets are mostly publicly available microarray experiments, and you are right that they won't have the full range of GRCh38 Ensembl features.
So to answer your question, for the leaderboard datasets it depends entirely on the microarray experiment, but ~12k features won't be an outlier. The features will also vary across datasets.
The final scoring phase will be different, however, as we are generating our own RNA-seq data, and each sample will have the full range of Ensembl features.
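Since the features vary across the leaderboard datasets, a model may need to check which of its expected features each dataset actually contains. Below is a minimal sketch of that check in plain Python. The gene IDs, dataset names, and helper functions are all hypothetical placeholders, not part of the challenge workflow, and the expression values are made up for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical representation of one sample: gene ID -> expression value.
Expression = Dict[str, float]


def shared_features(datasets: Dict[str, Expression]) -> List[str]:
    """Return the sorted gene IDs that are present in every dataset."""
    gene_sets = [set(expr) for expr in datasets.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


def missing_features(expr: Expression, reference: List[str]) -> List[str]:
    """Return the reference gene IDs that the sample does not contain."""
    return [gene for gene in reference if gene not in expr]


def align_to_reference(
    expr: Expression,
    reference: List[str],
    fill: Optional[float] = None,
) -> List[Optional[float]]:
    """Reorder a sample to the reference gene order, filling absent genes."""
    return [expr.get(gene, fill) for gene in reference]


# Made-up toy data standing in for two microarray datasets.
datasets = {
    "array_1": {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 3.0},
    "array_2": {"GENE_B": 5.0, "GENE_C": 6.0, "GENE_D": 7.0},
}

# Hypothetical list of features that a trained model expects, in order.
reference = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

shared = shared_features(datasets)
absent = missing_features(datasets["array_2"], reference)
aligned = align_to_reference(datasets["array_2"], reference)
```

Here `shared` holds only the genes common to both toy datasets, `absent` lists what the model would be missing on `array_2`, and `aligned` marks those gaps with `None` so that an imputation step (or a model that tolerates missing values) can handle them explicitly.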