Dear Organizers, I have just come across a shortage of predictors in the Docker repo, all of which are of course necessary to make the predictions. Upon closer inspection, I noticed that some test set SNP variants are not included in the train set, and likewise some train set SNP variants are not included in the test set. I specifically looked at `var_name`. Is this assertion correct? If so, how can these markers be adequately modelled? Best, Francisco

Created by Francisco de Abreu e Lima monogenea
Hi Jacob, Many thanks for your quick response. I will try to find some workarounds, thanks for the suggestions. Best, Francisco
Hi Francisco, It is not expected that every variant is in every dataset. It is possible that every gene is represented in the training dataset, but that is not guaranteed either. The natural approach, one-hot encoding of "var_name" (which is just shorthand for _), is tricky, since it's a very sparse space. This is an inherent problem of mutation datasets. Some ideas you might consider for inspiration: 1) discard any mutations you didn't see in the training dataset (this is a quick way to make the leaderboard dataset match your model); 2) key on mutated genes, rather than individual mutations (i.e., Hugo_Symbol instead of var_name). Best, Jacob
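For anyone reading along, here is a rough sketch of how the two ideas above might look in plain Python. It is only an illustration under assumptions: the rows are made-up toy data, the `sample` column and the helper names `vocabulary` and `encode` are hypothetical, and only `var_name` and `Hugo_Symbol` come from the thread. The point is that the feature columns are fixed from the training data, so anything unseen in the test data is simply dropped.

```python
# Toy stand-ins for the train and test dnaseq.csv rows (all values invented).
train_rows = [
    {"sample": "s1", "Hugo_Symbol": "GENE_A", "var_name": "GENE_A_v1"},
    {"sample": "s1", "Hugo_Symbol": "GENE_B", "var_name": "GENE_B_v1"},
    {"sample": "s2", "Hugo_Symbol": "GENE_A", "var_name": "GENE_A_v2"},
]
test_rows = [
    {"sample": "t1", "Hugo_Symbol": "GENE_A", "var_name": "GENE_A_v1"},
    {"sample": "t1", "Hugo_Symbol": "GENE_B", "var_name": "GENE_B_v9"},  # unseen variant, seen gene
    {"sample": "t2", "Hugo_Symbol": "GENE_C", "var_name": "GENE_C_v1"},  # unseen gene
]


def vocabulary(rows, key):
    """Sorted list of the distinct values of `key`; call this on training rows only."""
    return sorted({row[key] for row in rows})


def encode(rows, key, vocab):
    """Binary indicator vector per sample over a fixed vocabulary.

    Values of `key` that are not in `vocab` are silently discarded, so the
    output always has exactly len(vocab) columns.
    """
    index = {name: i for i, name in enumerate(vocab)}
    features = {}
    for row in rows:
        vec = features.setdefault(row["sample"], [0] * len(vocab))
        col = index.get(row[key])
        if col is not None:
            vec[col] = 1
    return features


# Idea 1: one column per variant seen in training; unseen test variants are dropped.
variant_vocab = vocabulary(train_rows, "var_name")
test_by_variant = encode(test_rows, "var_name", variant_vocab)

# Idea 2: one column per mutated gene seen in training (a much denser space).
gene_vocab = vocabulary(train_rows, "Hugo_Symbol")
test_by_gene = encode(test_rows, "Hugo_Symbol", gene_vocab)
```

In this toy case the gene-level encoding still captures the unseen variant in `GENE_B` for sample `t1`, whereas the variant-level encoding loses it. A sample whose mutations are all unseen (like `t2`) ends up as an all-zero row in both encodings.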

dnaseq.csv in test set lists different variants?