Here are some tips and ideas I wanted to share; feel free to add yours in this thread!
1. **Submit early to avoid delays**
Leaderboard submissions get evaluated in the order that they are received. Scores will be updated continuously in the leaderboard. If you submit early on in the submission period, you get results back sooner and have more time to prepare for the next round. If many teams submit on the last day, it may take several days to complete the scoring, leaving little time to prepare for the next round.
2. **Sub-challenge 2: use results from sub-challenge 1 to weight networks**
Not all networks are equally informative for discovering disease modules. One idea is to use the per-network scores from the sub-challenge 1 leaderboard as prior information in sub-challenge 2 (see the sketch at the end of this list).
3. **Find the best settings for each network**
You are not required to use the same parameters for each network; performance can likely be increased by optimizing the settings of your method for each network individually (in particular the number/size of modules, see [Preliminary Results](syn6156761/wiki/400653)). However, there is also a danger of over-fitting to the GWAS set used for the leaderboard. I'll add some comments on over-fitting next week.
4. **Do not get discouraged by a low rank on the leaderboard**
- Other teams may be **over-fitting** the leaderboard GWAS set, the ranking on the final GWAS set could be different.
- We could decide to **exclude some networks** for the final evaluation if they don't show consistent signal / if they just add noise (e.g., we'll sub-sample the GWASs to see if team rankings are robust on a given network). This could potentially be the case for the cancer and homology networks (networks 5 and 6).
- The difference in scores **may not be statistically significant** (e.g., if we sub-sample the GWAS set).
- We will do **additional analyses**, such as overlap with known pathways from databases.
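For tip 2, here is a rough sketch of what this could look like (the file names, the tab-separated edge-list format, and the example scores are placeholders for illustration, not challenge specifications):

```python
# Minimal sketch: merge the challenge networks into one weighted network, scaling
# each network's edges by a weight derived from its sub-challenge 1 score.
# File names, scores, and the edge-list format are assumptions for illustration.
import csv
from collections import defaultdict

# hypothetical leaderboard scores (e.g., NS values) for networks 1-6
network_scores = {1: 21, 2: 15, 3: 18, 4: 12, 5: 4, 6: 3}
total = float(sum(network_scores.values()))
weights = {net: score / total for net, score in network_scores.items()}

merged = defaultdict(float)
for net, w in weights.items():
    # assumed format: gene_a <tab> gene_b <tab> edge_weight
    with open(f"network_{net}.txt") as fh:
        for gene_a, gene_b, edge_w in csv.reader(fh, delimiter="\t"):
            key = tuple(sorted((gene_a, gene_b)))
            merged[key] += w * float(edge_w)

with open("combined_network.txt", "w") as out:
    for (gene_a, gene_b), edge_w in sorted(merged.items()):
        out.write(f"{gene_a}\t{gene_b}\t{edge_w:.4f}\n")
```

This is only one way to use the sub-challenge 1 scores; down-weighting modules found in low-scoring networks, rather than merging edges, would be another option.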
I agree with Yuanfang that discussing tie-breaking now is not very useful. Ties rarely happen; if one does occur, we can discuss at that point which submission is better :) The reported numbers of significant modules are at 5% FDR using the Benjamini-Hochberg method. We also used Bonferroni correction in our analysis of baseline methods -- the numbers were lower, but the ranking of methods remained relatively stable. Thanks for the input; we'll consider all these ideas and arguments when we analyze the final results.
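For illustration, a small sketch comparing the two corrections on a single vector of module p-values (the p-values below are fabricated; the actual scoring pipeline may differ in its details):

```python
# Sketch: count significant modules at 5% FDR (Benjamini-Hochberg) versus Bonferroni
# on the same vector of module p-values. The p-values here are fabricated.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# e.g. 300 modules: a handful of real signals mixed with uniform noise
pvals = np.concatenate([rng.uniform(1e-8, 1e-4, size=15),
                        rng.uniform(0.0, 1.0, size=285)])

bh_reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
bonf_reject, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

print("significant at 5% FDR (BH):", bh_reject.sum())
print("significant with Bonferroni:", bonf_reject.sum())
```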
--daniel re: agitter: "Assessing submissions at multiple FDR cutoffs should help distinguish between these two hypothetical submissions, right? The specific method used to control the FDR could end up being important during the evaluation. Can you share that information?"
-- i cannot agree, because 30/30 will still have 30 at FDR 10%, 50% or 100%, but 30/3000 will probably have hundreds at the higher cutoffs and 3000 at FDR 100%. one can only fix one FDR and do trial and error with that FDR.
i don't think we should spend too much time discussing tied conditions. there are rarely real tied situations; most so-called 'ties' are man-made.
if it really turns out to be exactly the same (which is such a low-probability event), i am fine with either displaying by the last name of the submitter as in the london olympics, or displaying by the name of the country/institution as in the rio olympics.
> Regarding the background, I think it should be all genes of the given network (not just those included in the modules), as that's the "universe" where the genes for the modules were selected from.
I completely agree that the set of all genes in the network is the appropriate background.
> But yes, in our scoring they would be tied and we would likely conclude that they have complementary strengths and weaknesses ;)
Assessing submissions at multiple FDR cutoffs should help distinguish between these two hypothetical submissions, right? The specific method used to control the FDR could end up being important during the evaluation. Can you share that information?
Dmitry and Yuanfang, thanks for the feedback.
1. I agree with your comments on networks 5 and 6. They won't be excluded if some teams do well on them.
2. I agree that comparison to random predictions is crucial. We already did an analysis with randomly generated modules to check that there is no systematic bias for some module size / number of modules predicted. (Results showed that indeed very few random modules have significant enrichment, regardless of their size/number.) We will definitely include a comparison to random modules in our analysis; I will post exactly how we plan to do this later on to get your feedback. (We'll only do it systematically for the final predictions, otherwise we waste too much computation power.) Regarding the background, I think it should be all genes of the given network (not just those included in the modules), as that's the "universe" the genes for the modules were selected from. Moreover, it ensures that all predictions are compared against the same background. (It's the same problem we had in the scoring script.)
3. The discussion of whether the total number of predicted modules should be factored in keeps coming up, also internally :) First, the total number of predicted modules is already factored in because we correct for multiple testing. We will definitely also analyze the number of predicted modules and their size (e.g., for teams that provide a ranking / confidence score for modules we can look at the precision as we go down the list of modules), but as the main scoring metric we think the total number of modules that show significant enrichment is most relevant.
The goal of the challenge (and of module identification in general, e.g., methods such as weighted co-expression network analysis) is to decompose the entire network into modules, not just to predict a few modules that are relevant to a given disease. The latter would be impossible in this challenge because we don't include disease-specific input data (e.g., genetic data), and participants don't even know which traits and diseases we evaluate against. So it's impossible to be specific: only a small subset of all genes is associated with a given disease, and only a few modules/pathways are expected to be relevant for it. I.e., in this setting it is expected that only a small fraction of modules are hits, and that's not a problem.
We found that metrics based on precision favor large modules. Imagine we don't set an upper limit on module size and precision is the metric: then a winning strategy would be to just split the network into two clusters, each containing several thousand genes. If one cluster has a small but significant enrichment for disease genes, you'll have 50% precision. More generally, if you predict a few very large modules, you can easily get good precision, but it's not useful (actually, in this case the individual modules are not specific, as they will include many genes that are not disease-relevant). In your example, "if you have a group hitting 30 out of 30 and another group hitting 30 out of 6000", the 30 out of 6000 modules would have to be small (as modules are non-overlapping) and show strong enrichment to survive multiple testing, while the 30 out of 30 modules would likely be large and have weaker enrichment. But yes, in our scoring they would be tied and we would likely conclude that they have complementary strengths and weaknesses ;)
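To make the size argument concrete, here is a toy calculation (invented numbers, using a simple hypergeometric enrichment test as a stand-in for the actual scoring):

```python
# Toy calculation: a very large module needs only mild relative enrichment to reach a
# small p-value, while a small module needs strong enrichment. All numbers invented.
from scipy.stats import hypergeom

M = 12000   # genes in the network ("universe")
n = 200     # GWAS-implicated genes within that universe

def enrichment_pval(module_size, hits):
    # P(X >= hits) when drawing module_size genes at random from the universe
    return hypergeom.sf(hits - 1, M, n, module_size)

# huge module, barely enriched: ~83 hits expected by chance, 110 observed
print("large module:", enrichment_pval(5000, 110))
# small module, strongly enriched: ~0.3 hits expected by chance, 6 observed
print("small module:", enrichment_pval(20, 6))
```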
Another way to see it is to consider curated pathway libraries such as KEGG. If we test KEGG gene sets (instead of network modules) for GWAS enrichment, only a few show significant enrichment, so KEGG would have very low precision. But that doesn't make much sense: KEGG is simply comprehensive (like our network modules), and we are only looking at relatively few traits and diseases, so it is expected that only a small fraction are hits.
Let me know if this clarifies our reasoning behind the choice of the main scoring metric; it's important that we explain this clearly, otherwise it will be the reviewers who bring it up in the end ;)
Best
Daniel
Hi Emilie
- NS is the number of modules that show significant enrichment in at least one GWAS dataset (the "disease modules").
- N is the total number of valid modules (size 3...100) in the submission
- No, we do not release the IDs of the modules that showed significant enrichment. The reason is that the same module may show enrichment for multiple GWASs -- some genes are associated with diverse traits. If we told you the IDs of the disease modules, instead of developing better module identification methods to decompose a network, you could just start refining individual modules, or keep all disease modules fixed and try new decompositions only for the rest of the network.
Best, Daniel
re: if you have a group hitting 30 out of 30 and another group hitting 30 out of 6000, can we clearly say that they have equally accurate methods?
-- they will be tied in that case, i believe. but i have never seen the exact same score before; the odds of that are like seeing an owl at noon. i am not worried about tied scores. it's not going to happen.
see round one: there is a clear top scorer.
i am more worried about the statistical test that is used to claim ties, which can go way off and be subjective.
what if they say that phil's submission and my submission are not statistically different, even if it is 15% different?
Hi Yuanfang,
>_i cannot agree on normalizing by the total number, dmitry; even though that immediately brings my team to number 1.... because then i will only predict one cluster, and make sure it is a positive one. then i get a score of 100%? i think the total number makes more sense. so no matter what, it is limited by #total genes/3._
Yep, that's why I mentioned that the direct normalization would not work. On the other hand, if you have a group hitting 30 out of 30 and another group hitting 30 out of 6000, can we clearly say that they have equally accurate methods?
>_you don't even need to release the random network. you can just shuffle gene ids on submitted prediction files, and you can immediately tell the random expectations, for each submission._
Well, both of these random sampling strategies will work; however, they are not likely to provide the same background distribution, if by reshuffling you mean reshuffling across all genes (including the ones that are not in the modules). Which one is the "harder to beat" sampling is not immediately clear to me.
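To spell out the two options, here is a quick sketch (the file names and the one-module-per-line, tab-separated format are assumptions on my part):

```python
# Sketch of the two randomizations applied to a submitted prediction file
# (assumed format: one module per line, tab-separated gene IDs). File names are
# placeholders; only the two choices of gene pool matter here.
import random

def load_modules(path):
    with open(path) as fh:
        return [line.strip().split("\t") for line in fh if line.strip()]

def randomize_modules(modules, gene_pool, seed=0):
    # keep module sizes fixed; relabel member genes at random from the given pool
    rng = random.Random(seed)
    genes = sorted({g for m in modules for g in m})
    relabel = dict(zip(genes, rng.sample(gene_pool, len(genes))))
    return [[relabel[g] for g in m] for m in modules]

modules = load_modules("team_prediction.txt")
all_network_genes = [line.strip() for line in open("network_genes.txt") if line.strip()]
module_genes = sorted({g for m in modules for g in m})

null_whole_network = randomize_modules(modules, all_network_genes)  # background: all genes
null_module_genes = randomize_modules(modules, module_genes)        # background: module genes only
```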
Best,
Dmitry
Wanted to confirm the rank of the present submissions, as the list is sorted by date.
What do the N and NS values mean?
Is the N value used in the ranking?
Is the 28 NS value presently the best/top-ranking submission?
Can we request the IDs of our modules which were identified and contributed to the score of that network?
actually i have been stupid: you don't even need to release the random network. you can just shuffle gene ids on the submitted prediction files, and you can immediately tell the random expectations for each submission. i cannot agree on normalizing by the total number, dmitry; even though that immediately brings my team to number 1.... because then i will only predict one cluster, and make sure it is a positive one. then i get a score of 100%? i think the total number makes more sense. so no matter what, it is limited by #total genes/3.
---
one can just shuffle the gene ids and map back to the original network, a process that completely maintains the original network structure, whatever it is. then in the final submission one submits a prediction set on the real networks and a prediction set on the fake networks. you can then 1) immediately identify information-free networks, and 2) identify, as part of the post-challenge analysis, how many clusters were identified just because the reported numbers are high or low, and how many are real.
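for illustration, a minimal sketch of that shuffling (the tab-separated edge-list format and the file names are just my assumption):

```python
# sketch: make a "fake" network by randomly permuting gene ids while keeping the
# topology and edge weights exactly as in the original. assumed format:
# gene_a <tab> gene_b <tab> weight (extra columns are carried through unchanged).
import random

def permute_gene_labels(in_path, out_path, seed=42):
    edges = [line.rstrip("\n").split("\t") for line in open(in_path) if line.strip()]
    genes = sorted({g for e in edges for g in e[:2]})
    shuffled = genes[:]
    random.Random(seed).shuffle(shuffled)
    relabel = dict(zip(genes, shuffled))   # random bijection gene -> gene
    with open(out_path, "w") as out:
        for gene_a, gene_b, *rest in edges:
            out.write("\t".join([relabel[gene_a], relabel[gene_b], *rest]) + "\n")

permute_gene_labels("network_1.txt", "network_1_shuffled.txt")
```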
--
i think it is the responsibility of the organizers to write this code, daniel. but i am willing to write it as well, since it will be only 20-30 lines.
Hi Yuanfang,
Yes, thank you for reminding me about that post. I did remember it, but my thought was that perhaps one should revisit that idea because the numbers are so close to each other (especially in sub-challenge 2).
I really like your idea about releasing a randomized network! That would be great! One thing to keep in mind when generating those random networks is that pretty much all of them are scale-free, so perhaps one would need to use a scale-free network generator.
Best,
Dmitry
daniel, correct me if i am wrong, but point 1 of dmitry has been discussed before on the forum (in the post by phil and you): the metrics are designed to reward granular modules since they are easier to follow up on experimentally and more likely to be biologically meaningful.
i agree with dmitry that it is a pretty low-probability event that all 25 teams fail to detect something. maybe the logic of my last post has some problem.
this can easily be tested by releasing a randomized network, i.e. shuffle the scores of each network and let teams predict with exactly the same algorithm; if a randomized network also results in a similar number of modules, that means the network is random. (update: randomize the nodes, not the scores, because it is not a complete network)
Hi Daniel,
Thank you, folks, for doing an excellent job in spite of so many logistical challenges! I wanted to bring in a couple of things to consider/discuss:
- Related to your over-fitting note: have you thought about factoring the total number of modules into the ranking? In other words, an ideal submission would include only those modules that contribute to the NS value, and the greater the number of submitted modules, the lower the score should be. Obviously, a naive measure such as NS/Total_sum_of_modules is unlikely to work, but still...
- I think the exclusion of networks 5 and 6 should be statistically justified. In other words, the teams' predictions should be statistically indistinguishable from predictions made by chance. If predictions are better than chance but still poor, it could be a good idea to leave them in and study the reasons for such poor performance (I do have some ideas for the homology network's poor performance, for instance). Furthermore, if one or two teams manage to be better than everyone else on those networks, then we should definitely keep those networks as part of the competition. That would mean there is at least one team that found a better solution when tackling those difficult networks, and this team should be rewarded for that, not penalized.
Anyway, just my two cents.
Best,
Dmitry
hi, daniel,
i do not agree that networks 5 and 6 are purely noise, although removing them will actually bring a huge advantage to my team. but i don't think it will be fair to those who are good at these networks.
i reach this conclusion because one can calculate, across teams, the correlation between performance on network 1 (which i am sure is meaningful) and performance on each of the other networks:
- network 2: 0.523792779
- network 3: 0.175055284
- network 4: 0.38337321
- network 5: 0.485310729
- network 6: 0.398627068
then one can easily see that this correlation in performance is not random.
of course, there is a significant correlation between the total number of clusters and the score across teams: between 0.15 and 0.65 depending on the network, i.e. the higher the number, the better the performance. however, this factor can be removed by calculating partial correlations: the correlation between network 6 and the total number is 0.19, and between network 5 and the total is only 0.13, but the performance correlation between network 6 and network 1 is 0.40, and between network 5 and network 1 is 0.49. that means there is still a good correlation of performance across methods independent of the number of clusters, which means networks 5 and 6 are not random.
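for reference, a small sketch of the partial-correlation computation (the per-team numbers below are invented; only the procedure matters):

```python
# sketch: partial correlation between per-team scores on two networks, controlling for
# the total number of predicted clusters. the per-team numbers below are invented.
import numpy as np

def partial_corr(x, y, z):
    # correlate the residuals of x and y after regressing each on z (with intercept)
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

net1_score = [28, 21, 19, 15, 12, 9, 7, 5]              # score on network 1, per team
net5_score = [10, 8, 9, 5, 4, 3, 2, 1]                  # score on network 5, per team
n_clusters = [350, 500, 280, 420, 300, 260, 240, 200]   # total predicted clusters, per team

print("raw correlation:    ", np.corrcoef(net1_score, net5_score)[0, 1])
print("partial correlation:", partial_corr(net1_score, net5_score, n_clusters))
```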
thanks a bunch,
yuanfang