Dear all
It's a pleasure to announce the best performers. The main scoring metric was the NS score on the hold-out GWAS set (the test set), as planned. In addition, we also conducted a bootstrap analysis to evaluate the robustness of the ranking and looked at different FDR cutoffs. The quality of the write-ups was also appreciated. Detailed results will be posted on the wiki later today.
**SUB-CHALLENGE 1**
**Best performer**
* **Team Tusk**: Jake Crawford, Junyuan Lin, Xiaozhe Hu, Benjamin Hescott, Donna Slonim, Lenore Cowen
* Tufts University, MA, USA
* Write-up: [A Double Spectral Approach to DREAM 11 Subchallenge 3](syn7349492/wiki/407359)
**Runner-up**
* **Team Aleph**: Sergio Gómez, Manlio De Domenico, Alex Arenas
* Universitat Rovira i Virgili, Tarragona, Spain
* Write-up: [Disease Module Identification by Adjusting Resolution in Community Detection Algorithms](syn7352969/wiki/407384)
Team Tusk achieved the highest overall NS score at every FDR cutoff that we tested (10%, 5%, 2.5%, and 1%; at 5% there was a tie with team Aleph). No other team consistently ranked at the top across all FDR cutoffs. Moreover, their method also achieved the highest score on the leaderboard GWAS set at 5%, 2.5%, and 1% FDR, and it was among the top teams in sub-challenge 2. While team Aleph tied at the 5% FDR cutoff, its performance was not as robust across the different FDR cutoffs and on the leaderboard GWAS set.
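For readers who want to reproduce this kind of comparison, here is a minimal sketch of how significant modules could be counted at a given FDR cutoff, assuming a table of per-module, per-GWAS enrichment p-values and a Benjamini-Hochberg correction pooled over all tests. This is an illustration only, not the official scoring code.

```python
# Illustrative sketch: the DataFrame layout, the pooled Benjamini-Hochberg
# correction, and the "significant for at least one GWAS" rule are assumptions,
# not the organizers' scoring pipeline.
import numpy as np
import pandas as pd

def count_significant_modules(pvals: pd.DataFrame, fdr: float = 0.05) -> int:
    """pvals: rows = modules, columns = GWAS traits, entries = enrichment p-values."""
    flat = np.sort(pvals.values.ravel())
    m = flat.size
    below = flat <= (np.arange(1, m + 1) / m) * fdr  # BH condition p_(k) <= (k/m) * q
    if not below.any():
        return 0
    threshold = flat[below].max()
    significant = pvals.values <= threshold
    # a module counts once, no matter how many traits it is enriched for
    return int(significant.any(axis=1).sum())

# e.g. compare submissions at several cutoffs:
# for q in (0.10, 0.05, 0.025, 0.01):
#     print(q, count_significant_modules(pvals, fdr=q))
```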
**SUB-CHALLENGE 2**
**We have decided to declare sub-challenge 2 vacant because the baseline was not outperformed. That is, we consider the problem of effectively integrating multiple networks to more reliably identify disease modules as not yet solved, and there is no official winner.**
As the baseline for the multi-network predictions in sub-challenge 2, we considered the single-network module predictions from sub-challenge 1. Some single-network predictions from sub-challenge 1 achieve a similar or even higher score than the integrated predictions when evaluated in sub-challenge 2. As was previously discussed in the forum, it turned out to be very difficult to effectively leverage multiple networks to improve modules. Indeed, the team with the best performance in sub-challenge 2 only used the two protein interaction networks. We think these are very interesting observations that will be of great interest to discuss further at the conference and in the paper.
We congratulate team Tsurumi-Ono for obtaining the highest score at 5%, 2.5% and 1% FDR in sub-challenge 2:
* **Team Tsurumi-Ono**: Artem Lysenko, Piotr J. Kamola, Keith A. Boroevich, Tatsuhiko Tsunoda
* RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
* [Write-up](syn7209602/wiki/405135)
Detailed results will be posted on the wiki later today. Teams interested in presenting a pre-accepted poster at the conference, please send me your poster abstract by tomorrow (Friday) at the latest.
Best wishes
Daniel & Sarvenaz
on behalf of the Challenge Organizers
Hi everybody
* We decided to mention 7 teams as runners-up in sub-challenge 1. These are the teams that either ranked 2nd, or tied with the team that ranked 2nd, at any of the considered FDR cutoffs (see [Best performers](syn6156761/wiki/407453) for details). Congratulations to the runners-up!
* Anaïs, thanks for sharing your insights, we agree with the main conclusions. However, we stand by our decision that there is no best performer in sub-challenge 2, as the baseline was not outperformed. We definitely do _not_ think sub-challenge 2 was a failure: much was learned and is still to be learned from these results. However, we still think the baseline makes sense given the scoring metric of the challenge. While the NS score by itself is not sufficient to show the potential advantages of multi-network approaches, for the purpose of deciding on the best performers, this is what we have to look at. Predictions obtained from a single network and from multiple networks are directly comparable; indeed, the former can be seen as a multi-network prediction where the weights of all but one network are set to 0. As more data is used for the multi-network predictions, it makes sense that the baseline is the single-network predictions (and not the other way around).
* CuriourGeorge, we agree it's good to recognize more teams, as we now do in sub-challenge 1. Regarding sub-challenge 2, for all but a few teams it wouldn't make a difference whether or not we name best performers. For young researchers working in module identification, the true reward is being part of this community, which I hope will advance their research in one way or another and will ultimately lead to publications.
* Yuanfang, we subsampled 76 GWAS (73%) because that was the number of GWAS in the leaderboard set. We also did subsampling with 50%, which gave the same results (a minimal sketch of this subsampling analysis is shown after this list).
* Let's start a [new thread](syn6156761/discussion/threadId=1111), long threads don't always seem to load correctly on Synapse.
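Below is a hedged sketch of the subsampling analysis mentioned above: repeatedly draw a fraction of the hold-out GWAS without replacement, recompute each team's score on the subsample, and record the resulting ranking. The data layout (a dict of per-team module x GWAS p-value tables), the 73% fraction, and the use of a scoring function such as the `count_significant_modules` helper sketched earlier are illustrative assumptions, not the organizers' actual code.

```python
# Sketch only: team_pvals maps team name -> module x GWAS p-value DataFrame,
# and score_fn is any scoring function with the signature score_fn(pvals, fdr).
import numpy as np
import pandas as pd

def subsample_rankings(team_pvals, score_fn, fraction=0.73,
                       n_iter=1000, fdr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    gwas = next(iter(team_pvals.values())).columns
    k = int(round(fraction * len(gwas)))
    rows = []
    for _ in range(n_iter):
        chosen = rng.choice(gwas, size=k, replace=False)  # subsample without replacement
        scores = {team: score_fn(pvals[chosen], fdr)
                  for team, pvals in team_pvals.items()}
        rows.append(pd.Series(scores).rank(ascending=False, method="min"))
    return pd.DataFrame(rows)  # one row of team ranks per subsample

# e.g. ranks = subsample_rankings(team_pvals, count_significant_modules)
```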
i think there is going to be a terrible time explaining why a 0.75 subset was selected instead of the standard 0.632 bootstrap. because even in the standard 0.632 bootstrap, about 30% of the examples are repeated, that's the whole point of estimating variance. obviously, if you use a 1.0 subset, all bayes factors would be infinity. I agree with Anaïs.
Yesterday I used **conserved module** in the previous post, but I did not mean a real conserved module across the six networks, because there is not even a single edge shared across all six networks! That is why some teams only merged several of the networks (e.g. the dense ones) to get modules. I personally doubt how many conserved/overlapping modules could be found across the six networks. If there are any, the number should be smaller than that from any single network.
Of course, I understand the motivation for setting the baseline of sub-challenge 2 like that, to prevent someone from just using a single network. That's why I wanted to know the difference between the modules identified from a merged network and from a single network: on the one hand (and most importantly) to verify the effectiveness of integrating different networks, and on the other hand, to find out whether the solution comes from a single network.
Dear Daniel,
Indeed, I agree with your point, the nature of the modules cannot be taken into account at this step, it will be something we'll work on from now on! However, in my opinion, the 2 sub-challenges are very different, with different questions/objectives, and I'm not convinced that sub-challenge 2 is a failure. And I'm still not sure about using sub-challenge 1 as the baseline for sub-challenge 2 either. It could have been done the other way around, for instance with the best algorithm on a merged network, as done by some teams in sub-challenge 2, as the baseline for sub-challenge 1. In this case, maybe sub-challenge 1 would have failed for the majority of networks?
In our experience on simulated random networks (with SBM), considering many networks is an advantage for community identification (Didier et al. 2015). Of course, real networks are different from simulated data. And in the case of real biological networks, we indeed found in previous attempts that the multiple-network approach was not always better than the single "best" network, the one that fits the evaluation data best. We were evaluating with enrichments in Gene Ontology Biological Process, and the communities from a single KEGG network were always better than the communities from a multiplex network KEGG + PPI + Co-expr .... This is because the KEGG network is less "large-scale unbiased" and fits GO Biological Process annotations better. Our conclusion was that comparing single versus multiplex communities was like comparing oranges and carrots. So we chose instead to compare different combinations of multiple networks. And I thought it was also this idea here with the DREAM challenge, with the division into 2 sub-challenges. Otherwise it would maybe have been easier to make only one challenge and allow teams to combine networks?
In conclusion, I think that sub-challenge 2 isn't a failure, as we learned that 1) selecting the network datasets that fit the question asked, as done by the top performer, 2) summing networks with a pertinent combination of edge weights, as done by ShanHeLab, and 3) adapting network measures to multiple networks, such as the multiplex modularity we used, are all approaches that allow the information from all six networks to be used to identify communities.
Sorry for the long post !
Best, and have a nice week-end,
Anaïs
Thanks everybody for your input, we'll discuss these points at our next call.
Anaïs, I agree with your points and I also hope that we can find some advantage of the integrated predictions. However, at this point we do not yet have evidence that the integrated modules are somehow better or complementary. (The same is true for the individual methods, maybe the method that ranked 10th discovered some very interesting modules.) Indeed, we had to "artificially" protect the multi-network submissions by prohibiting teams from submitting single-network predictions in sub-challenge 2. If we hadn't done that, it's highly likely that a single-network prediction would have won.
--daniel
i think what i propose is quite reasonable. one winner as it is, and some runner-ups for those who are obviously statistically indistinguishable.
for my own team at least it will be very helpful. obviously i am not in need of any winning title or runner-up position. this one is probably the worst one i ever did. but for the high school student who worked with me, a runner-up would be good enough for him to apply for the Intel talent search etc., maybe even as a finalist. but in the end, if we get nothing just because of a small margin, **even less than half of the random noise**, it is really disappointing.
Hi, Daniel,
I support Yuanfang's suggestion about recognizing more than one team for Sub 1 and Sub 2. Although the result of Sub 2 is disappointing in some ways, it involved intense team work over several months. The efforts were probably supported by other funding organizations or resources. Not declaring any winner in Sub 2 is not very rewarding and may not be the best way to encourage future participation by young scientists, given the limited resources available.
Thanks for the consideration,
Ke
Dear Daniel and challengers,
I also have some comments about sub-challenge 2. I am not sure of the relevance of using the results of sub-challenge 1 as the baseline (if I understood the process correctly), rather than a randomization as in sub-challenge 1. I agree with you that it is indeed disappointing not to heavily improve the number of identified significant modules with a multiple-network approach as compared to the best single network. But I think some points need to be taken into account:
- Not only the number but also the nature of the modules. Indeed, the top performer in sub-challenge 2 identifies 21 modules, whereas the top performer in sub-challenge 1 identifies 20 modules in network 1_ppi. But we don't know whether these are the same modules or the same diseases. Maybe sub-challenge 2 allows associating less-studied diseases to modules? This will be the interesting part of the work from now on. I think it was also the meaning of @dxl466's comment.
- We do not expect the multiple-network approach to identify something like 60 modules (the sum of the modules identified in the single networks). Indeed, each biological network is a measure of the real (unknown) functional interactions, with its own bias. We expect the networks to provide complementary but also overlapping information. In my opinion, obtaining 20 or 21 modules with the multiple-network approach illustrates that it's working, given the huge addition of noise, even if there is obviously a huge path for improvement.
Best, and many thanks for this interesting challenge and associated discussions !
Anaïs
> but the second one is true as well, as in different java versions, the only difference is the order of shuffled genes, and the result can be as much as 1-2 difference for a network.

Perhaps this is worth examining once the evaluation code is shared.
>Did you shuffle the input lines for your algorithm or the lines in the submitted files?
>In the first case, I would say that as the predicted module depends on the algorithm you used, intuition tells me that it is OK to have variation in the result.
mine is the first one. i feel it is still the same algorithm, it is just that when called by igraph it reads in a different order. we had a huge jump from the leaderboard phase, like an increase of 20 modules (but we did drop 5 in SC2). we reshuffled all the input for the ones we did badly on in the final submission, because our leaderboard scores were so hopeless. so we thought we should bet on better luck with the final reshuffle.
>On the other hand, if you submitted exactly the same modules just in different order or shuffled the genes in the modules, and got the mentioned variation, this IMHO would pretty much reduce the credibility of the scores, as I have raised this concern on the forum before.
but the second one is true as well: across different java versions, we were told the only difference is the order of the shuffled genes, and the result can differ by as much as 1-2 for a network. I think that is what you reported. but overall, i find it kind of unexpected. the whole sorting business was resolved in the 80s.
However, there is no problem with the credibility of the scores. ALL challenges depend at least half on luck. That's why I always enter all of them and treat each as a sub-challenge, so at least some will work out.
Dear all,
First of all, I would like to thank the organizers for the effort they put in organizing this challenge.
Second, I would like to reflect on the comments of @gyuanfan:
What do you mean by
> In most leaderboard tests we didn't do anything but reshuffle the input lines and see how much noise there is with the exact same algorithm; my gut feeling is that the S.D. for each network is probably 3.
Did you shuffle the input lines for your algorithm or the lines in the submitted files?
In the first case, I would say that as the predicted module depends on the algorithm you used, intuition tells me that it is OK to have variation in the result.
On the other hand, if you submitted exactly the same modules just in a different order or shuffled the genes in the modules, and got the mentioned variation, this IMHO would pretty much reduce the credibility of the scores, as I have raised this concern on the forum before.
Hi @dxl466
I agree that the most interesting part begins now!
> Does it make sense? A conserved module across six different networks (can be viewed as layers) is supposed to have more meaning than a module identified in one single network.
> I assume this is the motivation of sub-challenge 2. Have you checked the physical meaning of these disease modules, compared with those in single networks?
We have not yet done further analyses besides the GWAS pathway enrichment that is the NS score. It could be that additional analysis will show that the integrated modules are somehow better than the modules from single networks, but we didn't yet see evidence for this in the pathway enrichment score.
It's also worth noting that most of the teams with high scores in sub-challenge 2 simply merged some of the networks and then applied standard module identification methods. That does not necessarily give conserved modules.
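As an illustration of that strategy, here is a hedged sketch of merging networks by summing min-max-normalized edge weights and then running a standard single-network method. The file names, the normalization, and the choice of Louvain are assumptions for illustration, not any particular team's pipeline.

```python
# Sketch only: assumes whitespace-separated edge lists of the form
# "gene_a gene_b weight"; the file names below are hypothetical.
from collections import defaultdict
import igraph as ig

def merge_edge_lists(paths):
    merged = defaultdict(float)
    for path in paths:
        with open(path) as fh:
            edges = [line.split()[:3] for line in fh if line.strip()]
        weights = [float(w) for _, _, w in edges]
        lo, hi = min(weights), max(weights)
        for (a, b, _), w in zip(edges, weights):
            # normalize each network to [0, 1] before summing, so no single
            # network dominates just because of its weight scale
            merged[tuple(sorted((a, b)))] += (w - lo) / ((hi - lo) or 1.0)
    return merged

merged = merge_edge_lists(["1_ppi.txt", "2_ppi.txt"])
g = ig.Graph.TupleList(((a, b, w) for (a, b), w in merged.items()), weights=True)
communities = g.community_multilevel(weights="weight")  # Louvain on the merged graph
```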
But I hope that now in the collaborative phase maybe teams will further improve on these results.
Cheers, Daniel
Hi, Daniel,
I was mostly joking for other challenges in previous posts, please ignore.
but this is the SERIOUS part:
I think you did a really informative analysis prior to the final test set release: that the number of significant modules expected by chance for each network is about 2. I think the randomness for SC1 is at least 6. And I do think the top 5 are obviously a separate group. (I mean **the sole winner is the winner no matter what**, and the winner should take all, that is what I believe; all competitions are 50% technique + 50% luck).
Furthermore, as you might remember, across different JAVA versions the only difference is the order of the genes being fed in, not even the modules, and the difference in NS can be as large as 2 for a single network. As such, reshuffling the gene input, or changing to a different JAVA version, may make a different team number 1.
In most leaderboard tests we didn't do anything but reshuffle the input lines and see how much noise there is with the exact same algorithm; my gut feeling is that the S.D. for each network is probably 3.
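For context, a minimal sketch of this kind of shuffle test: permute only the order of the input edge lines, rerun the exact same clustering, and look at the spread of the results. The file name, the igraph Louvain call used as a stand-in clustering method, the 3-100 gene size range, and the use of the module count (rather than the full NS scoring, which would require the organizers' GWAS pipeline) are assumptions for illustration.

```python
# Sketch only: only the line order of the input changes between runs.
import random
import statistics
import igraph as ig

def modules_from_lines(lines, min_size=3, max_size=100):
    edges = [(a, b, float(w)) for a, b, w in (l.split()[:3] for l in lines)]
    g = ig.Graph.TupleList(edges, weights=True)
    clustering = g.community_multilevel(weights="weight")  # Louvain
    return [list(c) for c in clustering if min_size <= len(c) <= max_size]

with open("1_ppi.txt") as fh:                      # hypothetical file name
    lines = [l for l in fh if l.strip()]

counts = []
for seed in range(20):
    random.Random(seed).shuffle(lines)             # reshuffle the input lines
    counts.append(len(modules_from_lines(lines)))
print(statistics.mean(counts), statistics.pstdev(counts))
```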
thanks,
yuanfang
Hi Yuanfang
I see your point. Our discussion was mostly focused on whether we should have one or multiple best performers, and on deciding the results of sub-challenge 2. We will thus re-discuss our decision on the runners-up at our conference call on Thursday.
Note that the other challenges from this or previous years have nothing to do with our decision; I didn't even know how many best performers they had.
Best, Daniel
The challenge is over but the really interesting things are just beginning, provided that Daniel makes the GWAS platform public later on. I believe the physical meaning of the modules is as important as the algorithmic performance. The other question is "what kind of modules tend to have biological relevance, do they share some topological features? And what is the maximal/reasonable expectation for finding disease modules using only topological/structural information?" Looking forward to the follow-up analysis.
i found that this year's challenges so far all have only one best performer; I will be able to see whether this holds true for SMC-HET and ENCODE tomorrow, I guess. **I echo this decision actually, which can avoid tons of problems later on.** I wish this had been done last year.
but you can have more runner-ups, at least the ones that are within half an S.D. of the random errors. otherwise 70 entries and only 1 winner and 1 runner-up. it hurts the feelings of the rest of us so much..... i think a good ratio would be the top 10%, with some kind of honorable mention. i.e. top 4 in SC1, top 3 in SC2. because within the top 10%, it is a matter of luck. i felt my luck was completely screwed up by the disturbance of leftover issues from other challenges right before the deadline. as you are still formulating the results page, maybe this is something that can be discussed.
Hi @daniel.marbach
========================
As baseline for the multi-network predictions in sub-challenge 2, we considered single-network module predictions from sub-challenge 1.
========================
Does it make sense? A conserved module across six different networks (can be viewed as layers) is supposed to have more meaning than a module identified in one single network.
I assume this is the motivation of sub-challenge 2. Have you checked the physical meaning of these disease modules, compared with those in single networks?
we only chose the largest modules with different cutoffs, like 30, 40, 50 genes.
i think there are only 600 that fall within 50-100, that is why. that's probably why we got only 3 in network 3, we only submitted like 30 modules.
no filter of edge weight. too lazy to implement that.
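A hedged sketch of that selection step, with the upper bound on module size assumed to be 100 genes; this is illustrative only, not the team's actual code.

```python
# Sketch only: given the modules from a clustering run, keep the largest ones
# by applying a minimum size cutoff (e.g. 30, 40 or 50 genes) and an assumed
# maximum module size of 100 genes.
def select_largest_modules(modules, min_size=50, max_size=100):
    kept = [m for m in modules if min_size <= len(m) <= max_size]
    return sorted(kept, key=len, reverse=True)

# e.g. build candidate submissions with min_size = 30, 40 and 50, then compare
# them on the leaderboard before choosing the final cutoff.
```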
Hi @gyuanfan,
You mentioned in your write-up that you only tested Louvain. How did you select the final submissions from within only several hundred modules? No filtering on network edges? hmm....
me and causality and DMIS should have been given a runner-up. we are clearly a group.
when i was at the top of the leaderboard, you know i had SEVEN co-winners...... have I already complained about this point like 100 times now?
you know that the random network deviation is like 12 in SC1. and reshuffling the genes in the same module as input can change the significance count by 2 (when you change from java 7 to java 8 or something)....
Hi,
We are updating the final-round scores in the wiki tables for both sub-challenges, and the analysis of the results will be posted afterwards.
Sarvenaz
Hi Daniel,
First, thank you and all the Challenge Organizers for your great efforts in organizing this challenge.
In the wiki and discussion forum, you mentioned that the pre-publication data and scoring script will be shared with participants who made final submissions. Since the challenge is over now, could you release the final results of each team, the network metadata, the GWAS data, and the scoring script so that we can run it ourselves and improve our algorithms?
Thank you very much.
Best,
Tianle
Hi Daniel,
Will the detailed results and baseline results and methods be posted today?
Kind regards,
Suhas
hi, I have not seen any scores for my results, which still show "EVALUATION IN PROGRESS". I wonder if that means mine could not be scored?