Dear organizers,

After going through the description and the often contradictory statements in the discussion forum, I think I am just another confused user. I understand that the goal is to see whether the same bin locations of the 1297 cells that are by default derived from 84 genes (by comparing dge_binarized against binarized_bdtnp) can be achieved with similar accuracy using fewer genes. That is clear. What is not clear to me is which data one cannot use to choose, say, 20 out of the 84 genes, and how you will ensure that people are not simply overfitting the data. Do you think the official description could be improved to clarify these issues?

Below are two distinct scenarios:

A) Submissions 1, 2, and 3 are derived from the same method, but their predictions are obtained by re-substitution, LOOCV, and, say, 3-fold cross-validation, respectively. Are you going to favor one over the others just because of how the predictions were estimated?

B) Method 1 fits the bin locations derived from comparing dge_binarized against binarized_bdtnp with all 84 genes, based on 20 genes identified using only the information in dge_binarized and the names of the 84 insitu genes. Assuming this works, would this be a valid submission?

Thank you,

Created by Adi Tarca bcbuprb
Hi Adi, no problem. Not having a blind GS has brought some issues, but we think the interest of the problem outweighs this. What you are saying sounds right; of course, if you overfit in 1), then it is not that interesting. Thanks, Pablo
Pablo, thanks. So you trust that people will do their best not to overfit. Sounds fine. Regarding B, let's be even more specific. 1) To identify the 20 genes, can one use the full dge matrix and the geometry bin IDs (or xyz coordinates) as derived from comparing dge_binarized against binarized_bdtnp with all 84 genes? 2) Once the 20 genes are selected, can one make predictions using the full dge matrix and possibly the columns of binarized_bdtnp for only the 20 genes in question? Thanks for your patience. Adi
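For readers following the thread, step 2 above (predicting bin locations from only the selected columns) can be sketched roughly as follows. This is a toy illustration, not the challenge's official procedure: the Matthews correlation coefficient is assumed here as the agreement score, the function names `mcc_scores` and `map_cells_to_bins` are hypothetical, and the small arrays only stand in for dge_binarized (assumed laid out as cells x genes) and binarized_bdtnp (assumed bins x genes).

```python
import numpy as np

def mcc_scores(cell, bins):
    """Matthews correlation between one binarized cell and every bin.

    cell: shape (n_genes,), bins: shape (n_bins, n_genes), both 0/1.
    Returns an array of shape (n_bins,).
    """
    cell = cell.astype(bool)
    bins = bins.astype(bool)
    tp = (bins & cell).sum(axis=1)    # gene ON in both cell and bin
    tn = (~bins & ~cell).sum(axis=1)  # gene OFF in both
    fp = (~bins & cell).sum(axis=1)   # ON in cell, OFF in bin
    fn = (bins & ~cell).sum(axis=1)   # OFF in cell, ON in bin
    num = tp * tn - fp * fn
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Score 0 where the denominator vanishes (a constant pattern).
    return np.divide(num, denom, out=np.zeros_like(denom), where=denom > 0)

def map_cells_to_bins(cells, bins, gene_idx):
    """Assign each cell to its best-scoring bin using only gene_idx columns."""
    sub_cells = cells[:, gene_idx]
    sub_bins = bins[:, gene_idx]
    return np.array([int(np.argmax(mcc_scores(c, sub_bins))) for c in sub_cells])

# Toy stand-in for binarized_bdtnp: 4 bins x 6 genes (invented values).
toy_bdtnp = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 1],
])
# Toy stand-in for dge_binarized: 4 noise-free cells copied from known bins.
true_bins = [2, 0, 3, 1]
toy_dge = toy_bdtnp[true_bins]

full = map_cells_to_bins(toy_dge, toy_bdtnp, [0, 1, 2, 3, 4, 5])
subset = map_cells_to_bins(toy_dge, toy_bdtnp, [0, 1, 2])
agreement = float(np.mean(full == subset))  # fraction of cells placed identically
print(full, subset, agreement)
```

In this toy case the subset only uses the reference columns for the chosen genes when predicting, which is the situation step 2 asks about; whether the full-gene mapping may be used as a target when choosing the genes is the separate question in step 1.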
Hi Adi, of course it will be difficult for us to prevent overfitting given that the GS is revealed, but if everybody has a perfect score it will be obvious. There are people who honestly want to solve a problem and people who only want to win (those two groups can be easily separated); the problem is with the people in the middle. We hope overfitting is avoided and models are chosen for how generalizable they are. Again, the point of the challenge is selecting the best subset of genes; that at least should be clear. Regarding A), we won't consider the cross-validation approach. Regarding B), it seems to me that when you say "comparing dge_binarized against binarized_bdtnp with all 84 genes" you are actually using in situ info from the 84 genes, which you can't. If what you mean is using the cell locations and "20 genes identified only using information in dge_binarized and the names of the 84 insitu genes", then it seems OK to me. P
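As a side note on scenario A, the gap between re-substitution and leave-one-out estimates can be shown on made-up data. The sketch below uses a 1-nearest-neighbour classifier purely as a stand-in (it is not a method from the challenge), and the helper names and the six toy points are hypothetical.

```python
import numpy as np

def nearest_label(x_train, y_train, x_query):
    """1-nearest-neighbour prediction for a single 1-D query point."""
    return y_train[int(np.argmin(np.abs(x_train - x_query)))]

def resubstitution_accuracy(x, y):
    """Fit on all points, then predict those same points."""
    preds = np.array([nearest_label(x, y, xi) for xi in x])
    return float(np.mean(preds == y))

def loocv_accuracy(x, y):
    """Predict each point from a model that never saw that point."""
    preds = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i  # hold out sample i
        preds.append(nearest_label(x[keep], y[keep], x[i]))
    return float(np.mean(np.array(preds) == y))

# Invented toy data: six points on a line with partly noisy labels.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 0, 1, 1])

resub = resubstitution_accuracy(x, y)
loo = loocv_accuracy(x, y)
print(resub, loo)
```

Under re-substitution every point is its own nearest neighbour, so the score looks perfect regardless of how well the model generalizes; the leave-one-out score is lower because each prediction comes from data that excludes the point being predicted.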

still confused