My previous question remains unanswered, so I have to ask again: is it allowed to build different models for different validation sets in one sub-challenge?

Created by Yi Cui (cuiyi)
Dear Yuanfang, The M2Gen dataset is small. The leaderboard rounds tend to have around 23 samples from M2Gen, but as you have noted there are duplicates due to sampling with replacement. Additionally, there are samples that do not progress prior to 18 months (or the other time cuts used for the iAUC) and are therefore censored. For example, in a previous leaderboard round there were just 8 unique, uncensored samples from M2Gen. In this scenario, 1 or 2 poorly predicted samples could really throw off any given metric, especially if those samples are duplicated. We have seen odd behavior for study-wise metrics in M2Gen before (like [here](https://www.synapse.org/#!Synapse:syn6187098/discussion/threadId=2665)). We do not expect to see this behavior in the final round, where there will be more samples and no sampling with replacement.
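A minimal, purely illustrative sketch (not the official scoring code, and with made-up numbers) of why a rank-based survival metric computed on roughly 23 bootstrapped samples with heavy censoring can swing noticeably between draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_concordance(time, event, risk):
    """Naive concordance over comparable pairs: a higher risk score should
    go with an earlier observed event."""
    num, den = 0, 0
    for i in range(len(time)):
        for j in range(len(time)):
            if event[i] == 1 and time[i] < time[j]:   # comparable pair
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den if den else np.nan

# Toy "M2Gen-like" cohort: ~23 samples, most censored before progression.
time  = rng.uniform(3, 36, size=23)                   # months to progression/censoring
event = (rng.uniform(size=23) < 0.35).astype(int)     # only a handful of observed events
risk  = rng.normal(size=23)                           # one fixed, arbitrary model

scores = []
for _ in range(20):                                   # leaderboard-style bootstrap draws
    idx = rng.integers(0, 23, size=23)                # sampling with replacement -> duplicates
    scores.append(simple_concordance(time[idx], event[idx], risk[idx]))

print("concordance range across draws:", np.nanmin(scores), "to", np.nanmax(scores))
```

With so few unique, uncensored samples per draw, the same fixed predictions can score very differently from round to round, which is consistent with the 0.1 to 0.15 swings reported elsewhere in this thread.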
Looking back at this thread: this round I flipped the M2Gen predictions for sub2, since it has robustly scored in the 0.1x range in previous rounds. Somehow my flipped version is still below 0.5. @Michael.Mason, what could be the possible reasons, in terms of scoring, that cause both the original and the flipped predictions to score below 0.5?
Thank you all for this discussion. My concern is resolved. Cheers
Yes, it is sampled with replacement. Anyone who has submitted to round 2 will have seen that the performance on each dataset can be 0.1 to 0.15 off even with the same model, and the final round will be completely different and will be like a lottery. Based on my past experience, under such circumstances it is much more useful to pray than to improve the model; I have already given up on improving the model. But M2Gen in sub2 is still worth reversing the predictions for, because that is the only dataset that consistently gets a score below 0.2.
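For concreteness, "reversing the predictions" for one dataset just means inverting the ranking of its risk scores before submitting. A minimal sketch, assuming a simple per-patient prediction table; the column and study names are placeholders, not the challenge's actual file format:

```python
import pandas as pd

# Hypothetical prediction table (placeholder column and study names).
preds = pd.DataFrame({
    "patient": ["p1", "p2", "p3", "p4"],
    "study":   ["M2Gen", "M2Gen", "other_study", "other_study"],
    "predictionscore": [0.20, 0.70, 0.40, 0.90],
})

# Invert the ranking for M2Gen only. Because iAUC is rank-based, a reversed
# ranking should land near 1 minus the original score, although ties,
# censoring, and bootstrap duplicates can push it away from exactly that.
flip = preds["study"] == "M2Gen"
preds.loc[flip, "predictionscore"] = 1.0 - preds.loc[flip, "predictionscore"]
print(preds)
```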
Thanks for addressing this while I am traveling, Fadi. Dear Yi and Fadi, the only thing I'll add is that the final validation round dataset is also much larger than the leaderboard rounds', so we do not expect this to happen. Additionally, only one submission is allowed in the final round, so it would be very risky to take the overfitting approach without a stronger sense of how one will perform on the validation dataset. Also, the M2Gen dataset is the smallest in our validation, so it would not be weighted heavily. Its small size is also most likely why it is exhibiting this strange iAUC behavior for one or two submissions. Apologies for the delay in response, Mike
Dear Yi, Thanks for clarifying. If I understood the approach you describe accurately, what you are describing is essentially tailored overfitting to each validation dataset. So, instead of focusing on deriving a single classifier that will generalize to any future dataset (and, hence, be applicable to other real-world cohorts), the strategy you describe aims to tailor-fit a classifier to each separate cohort. While that strategy may theoretically work for the leaderboard rounds, because one can see a classifier's performance result and reverse/adjust the scores, please keep in mind that you won't have access to the performance estimates from the final validation round. Further, the validation sets for the leaderboards are **sampled with replacement** from the validation datasets, and final submissions in the validation round are scored on the remaining data ([see slide 37 in the first challenge webinar](https://www.synapse.org/#!Synapse:syn10306311)). So that strategy is ill-advised for this challenge. Hope this is helpful; I'll give @Michael.Mason a chance to add his two cents if I missed anything. Best, Fadi
Hi Fadi, Thank you for the reply. What I mean is the second scenario you mentioned, i.e., a separate model for each validation set within one sub-challenge (I've modified the original thread to reflect this). Based on your reply, it seems acceptable. However, this raises some concerns. In particular, on the Challenge 2 leaderboard the iAUC for M2Gen from the clinical variables alone is 0.08, so if we reverse that score the iAUC will be as high as 0.92. Now, if our purpose is just to get as high a weighted iAUC as possible, the simple way is to use one model for the other 3 datasets and use the reversed clinical score as a separate model for M2Gen only. Or, if that approach seems too ridiculous, how about building a model for M2Gen using expression data only, without clinical data, since the leaderboard clearly indicates the clinical variables have an adverse effect on this dataset, while for the other 3 datasets we build another model that includes the clinical variables? To me, either of the approaches above smacks of cherry-picking, but they do not seem to break the rules, so clarification is really appreciated.
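To make the incentive explicit, here is a back-of-the-envelope sketch. The per-study sample sizes, the other iAUC values, and the assumption that the overall score is a size-weighted average of per-study iAUCs are illustrative guesses, not the challenge's actual weighting scheme; only the 0.08 for M2Gen comes from the leaderboard figure above:

```python
# Hypothetical per-study sizes and iAUCs (only M2Gen's 0.08 is from the thread).
sizes = {"study_A": 200, "study_B": 150, "study_C": 120, "M2Gen": 25}
iauc  = {"study_A": 0.62, "study_B": 0.60, "study_C": 0.58, "M2Gen": 0.08}

def weighted_iauc(scores, weights):
    return sum(weights[s] * scores[s] for s in weights) / sum(weights.values())

print("as submitted:      ", round(weighted_iauc(iauc, sizes), 3))
iauc["M2Gen"] = 1 - iauc["M2Gen"]   # reverse only the M2Gen clinical score: 0.08 -> 0.92
print("with M2Gen flipped:", round(weighted_iauc(iauc, sizes), 3))
```

Under these made-up weights the flip only moves the overall average by a few hundredths, since M2Gen is the smallest study, but it is still a free gain, which is the crux of the question.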
Hi Yi, Not sure I understand your question exactly; can you elaborate? If you are thinking about building different models for different challenges (e.g., challenge 1 vs. challenge 2 vs. challenge 3), then this is perfectly fine. If your question is regarding a separate model for each of the validation datasets within each challenge (e.g., 3 or 4 models for challenge 2), then you still have to have an automated way to decide which model to apply to each dataset. As long as your approach doesn't require access to the outcome data in the validation dataset, it should be OK. @Michael.Mason, can you confirm? Yi, can you please clarify which case you are thinking of (different models for different challenges, or different models within a challenge)? Best, Fadi
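One way to read the "automated way to decide which model to apply" requirement is a dispatcher keyed on the study label that ships with the input data, never on the outcome. A minimal sketch with placeholder model objects and column names (none of this is a challenge API):

```python
import pandas as pd

class DummyModel:
    """Stand-in for a model fitted ahead of time on training data."""
    def __init__(self, name, score):
        self.name, self.score = name, score
    def predict(self, block):
        return pd.Series(self.score, index=block.index)

expression_only    = DummyModel("expression_only", 0.9)           # e.g. reserved for M2Gen
clinical_plus_expr = DummyModel("clinical_plus_expression", 0.5)  # default for other studies

def predict_all(df):
    per_study = {"M2Gen": expression_only}                 # special cases
    parts = []
    for study, block in df.groupby("study"):
        model = per_study.get(study, clinical_plus_expr)   # chosen from the study label only
        parts.append(model.predict(block))
    return pd.concat(parts).sort_index()

validation = pd.DataFrame({"study": ["M2Gen", "other_study", "M2Gen"]})
print(predict_all(validation))
```

The key point is that the branch depends only on metadata available at prediction time, not on the validation outcomes.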

Clarification of rule about building multiple models