See below a couple of thoughts to start a discussion on how overfitting could be mitigated for the final submissions. We realize that it could be difficult to avoid overfitting, as the results are somewhat noisy (see the results on [Random predictions](syn6156761/wiki/405291)) and only a limited number of submissions can be made.
* The simplest approach to making the final submission would be to choose, for each network, the submission that achieved the best score. However, this approach is prone to overfitting.
* One strategy to avoid overfitting would be to check whether variations of the best submission still perform well (see the sketch after this list). For example, suppose module size 10 achieved the best score but module sizes 8 and 12 performed poorly, while module sizes 20, 22, and 24 all performed well. In this case, module size 22 could be a better choice even if its score is lower than that of module size 10, especially if the difference in score is comparable to the variation observed in the [Random predictions](syn6156761/wiki/405291).
* We could provide some additional results, e.g., the NS scores at 1% FDR. This could help choose between several submissions with similar scores in the leaderboards.
Let us know what you think and if you have other ideas to mitigate overfitting.
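To make the second bullet concrete, here is a minimal sketch of how such a robustness check could look. This is not part of the challenge code; the function name, the example scores, and the noise estimate are all invented for illustration, and the "neighborhood" is simply the adjacent parameter values in the sorted list.

```python
# Hypothetical sketch of the "robust neighborhood" heuristic described above.
# Scores and noise level are made-up numbers; in practice they would come from
# the leaderboard and the Random predictions page.

def pick_robust_parameter(scores, window=1):
    """Return the parameter whose neighborhood of scores is best on average.

    scores : dict mapping parameter value (e.g., module size) -> leaderboard score
    window : number of neighboring parameter values to include on each side
    """
    params = sorted(scores)
    best_param, best_avg = None, float("-inf")
    for i, p in enumerate(params):
        neighborhood = params[max(0, i - window): i + window + 1]
        avg = sum(scores[q] for q in neighborhood) / len(neighborhood)
        if avg > best_avg:
            best_param, best_avg = p, avg
    return best_param, best_avg


if __name__ == "__main__":
    # Example mirroring the bullet above: module size 10 has the single best
    # score, but its neighbors do poorly; sizes 20-24 are consistently good.
    leaderboard = {8: 0.31, 10: 0.45, 12: 0.30, 20: 0.41, 22: 0.42, 24: 0.40}
    noise = 0.05  # rough spread assumed for random predictions

    param, avg = pick_robust_parameter(leaderboard)
    best_single = max(leaderboard, key=leaderboard.get)
    print(f"single best: module size {best_single} (score {leaderboard[best_single]:.2f})")
    print(f"robust pick: module size {param} (neighborhood mean {avg:.2f})")
    # If the single-best score beats the robust pick by less than the noise level,
    # the robust pick is arguably the safer final submission.
    if leaderboard[best_single] - avg < noise:
        print("difference is within noise; prefer the robust pick")
```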
Created by Daniel Marbach

Yuanfang, I'm afraid we can't do more submissions than planned, especially not as we went back to just using the VitalIT cluster because of the small differences in results discussed in the other thread.
--daniel

Sorry, I forgot to answer the question about the module identities. Yes, we do not provide the identities of the significant modules. The motivation of the challenge was to evaluate "classic" unsupervised module / community detection methods. We were also considering a challenge with training GWAS data given as input; I think that would be very interesting for a future edition, but it also poses many difficulties for anonymization, or we would need access to unpublished GWAS data. I didn't anticipate that teams would attempt supervised approaches with only 20 submissions; I'm looking forward to comparing the different strategies.
i don't think we can in any way figure out which modules work, unless we try them one by one...
in most problems both the predictions and the gold standard change, and it is common to see a model overfit like crazy.
i once had 0.5X on the leaderboard and 0.3X on the final test; that was a real-world challenge years ago. another time i was #1 on the leaderboard and around #10 on the final test. and then this year a project scored 0.75 in my cross-validation and 0 on the independent test set.
and there is no way to prevent it other than the hunch accumulated from many failed projects. i found that small datasets inevitably have this problem; maybe even the top-performing method was just overfitting the small final test set.
but in this case, only the gold standard changes; we rarely get the chance to see such a situation. actually i can't wait to see whether there will still be overfitting in this case, and how much.

Yuanfang and Dmitry,
Any hints about supervised learning here? I did not get it. Last time I was wondering if we could get any information about which modules are significant, but Daniel didn't reply.

i agree, dmitry.
i am very sure the optimal solution to this problem, as in every other DREAM challenge, is still obtained by supervised learning.
but i am too lazy to implement it, plus there are only 5 submissions left to experiment with. would it be possible to give a bonus of 5 submissions to every team?

> I agree, compared to other challenges overfitting is not a big issue as there is no supervised learning here. Still, it could be difficult to choose between several predictions with similar scores. We could provide NS scores at 2.5% and 10% FDR.
It's interesting that you mentioned supervised learning here, Daniel. As a matter of fact, we have been exploring the setup of this problem as a supervised learning problem, but we were running out of time. I believe it can be done very nicely.
Best,
Dmitry

> does this mean we can have NS scores at 2.5% and 10% FDR for all our submissions during Rounds 2-4?
Yes, as we didn't get additional feedback I think we'll go with 10%, 5% (already reported), 2.5% and 1% FDR.
We'll report these results for all submissions (Rounds 2-4) by the end of the week. It won't be done in the leaderboards; we'll just share a table with the results for all teams as a text file, which will be updated periodically as new submissions come in.
Best, Daniel
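As a side note on the scores discussed above: the NS score at a given FDR is, as we understand it, the number of module-trait associations that remain significant after multiple-testing correction at that FDR level. The sketch below is not the challenge's scoring code (which may use a different correction, e.g., permutation-based); it only illustrates, with invented p-values and the standard Benjamini-Hochberg procedure, how the count of significant hits shrinks as the FDR threshold tightens from 10% to 1%.

```python
# Illustrative sketch only: counts how many p-values pass the Benjamini-Hochberg
# procedure at several FDR thresholds. NOT the challenge's scoring code;
# the p-values below are invented for demonstration.

def benjamini_hochberg_count(p_values, fdr):
    """Return the number of discoveries at the given FDR (Benjamini-Hochberg)."""
    m = len(p_values)
    ranked = sorted(p_values)
    # Find the largest k such that p_(k) <= (k / m) * fdr;
    # all hypotheses with rank <= k are rejected.
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= i / m * fdr:
            k = i
    return k


if __name__ == "__main__":
    # Hypothetical module-trait association p-values for one submission.
    pvals = [1e-6, 5e-5, 3e-4, 0.002, 0.004, 0.01, 0.02, 0.03, 0.2, 0.5]
    for fdr in (0.10, 0.05, 0.025, 0.01):
        ns = benjamini_hochberg_count(pvals, fdr)
        print(f"NS at {fdr:.1%} FDR: {ns}")
```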
leaderboards of round 4 and final test subchallenge 2 are set up incorrectly. does this mean we can have NS scores at 2.5% and 10% FDR for all our submissions during Rounds 2-4?

I agree, compared to other challenges overfitting is not a big issue as there is no supervised learning here. Still, it could be difficult to choose between several predictions with similar scores. We could provide NS scores at 2.5% and 10% FDR.
--daniel
but i was thinking this dataset overfits by at most 20-30%. it won't be like the 70-80% in the AZ or RA challenges, because it is actually the same prediction, just evaluated on different large sets.