Dear all
We apologize again for the delay with the leaderboard phase. The reason is that scoring for this challenge is far from trivial. While in typical challenges predictions can be compared directly to held-out data, here evaluation is based on GWAS pathway analysis, which is computationally very demanding. For subchallenge 1, the evaluation involves testing the modules from each of the 6 networks against 76 GWAS datasets (the leaderboard set; since each GWAS shows enrichment in only a few modules, we need many GWAS datasets to get a robust ranking of methods). This makes 6*76 = 456 runs per submission. Each run takes 10 minutes on average, so the total CPU time to score a single submission is around 80 hours.
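For reference, the back-of-the-envelope arithmetic behind these numbers (a sketch only; the 10-minute figure is an average, and actual runtimes vary from submission to submission):

```python
# Rough cost of scoring one subchallenge 1 submission (numbers from the text above).
networks = 6            # challenge networks
gwas_datasets = 76      # leaderboard GWAS set
minutes_per_run = 10    # average runtime of one pathway-analysis run

runs = networks * gwas_datasets          # 456 runs per submission
cpu_hours = runs * minutes_per_run / 60  # 76 hours, i.e. roughly 80 CPU hours
print(runs, round(cpu_hours, 1))
```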
Our challenge currently has 20 teams and 200 registered participants, so we realized that this would overwhelm our local computing infrastructure, and we needed to move our evaluation procedure to the high-performance computing infrastructure VITAL-IT (www.vital-it.ch). This has now been done and tested. However, since it is difficult to estimate how many submissions we will get, we decided to open a test leaderboard for each subchallenge this week, where each team can make one "free" submission, i.e., a submission that will not count towards the total number of submissions on the real leaderboard. We will then open the real leaderboard next week.
Depending on the number of actively participating teams, the total number of submissions on the real leaderboard may be as few as five if we have 200 active participants (we will do our best to enable more than five; this week's test will show whether it can be done).
There will be only one leaderboard phase (not two as initially announced). Details will be announced next week.
We are updating the wiki with additional information today.
Best
Daniel & Sarvenaz
Hi Daniel, looks like I totally missed the boat with test submissions; I realized just now that a leaderboard was already in place. I was just wondering whether filing submissions in any of the submission periods is mandatory, or whether it is OK to start submitting later on, e.g. in the second half of September? Thanks!
--Matt

Thank you for correcting the deadline.
-SungjoonPark

I corrected it to midnight Eastern Time (ET).
--daniel

Dear Daniel Marbach,
Just to be sure, I'm a bit confused by the "any time zone" you mentioned above.
I think there should be an exact time at which submission closes. So, could I ask what the exact closing time of the test leaderboard is?
Thank you,
Sungjoon Park

The test leaderboard round closes this Wednesday, August 24 (midnight, Eastern Time). Information on leaderboards will always be posted here:
https://www.synapse.org/#!Synapse:syn6156761/wiki/400649
--daniel

Do you know what date the test leaderboard will close? I want to make sure I get a test submission in on time, but it could be improved with a few more days of work.

Yes, about the sanity check: I think it should ideally be done right when the files are submitted in Synapse, with errors shown directly in a popup dialog and no entry created in the leaderboard for invalid submissions. I'll suggest this to the Synapse team. But for now your script is a good solution.
--daniel

thanks daniel.
yes, that is what i used too, mcl. see, old chinese saying: smart people always think the same!
yes, that makes much more sense. there shouldn't be so many significant modules. thanks for the explanation.
also, it's my pleasure to provide the sanity check code. i'm also doing myself a favor. i felt i was going to throw up when i tried to find my entry among so many invalid entries...
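For teams that want to run a similar check locally before submitting, here is a minimal sketch of what such a sanity check could look like. The column layout (module ID, a score column, then gene IDs, tab-separated), the 3-100 genes-per-module limit, and the no-overlap rule are assumptions for illustration only; the format specified on the challenge wiki is authoritative, and this is not the script referred to above.

```python
# Minimal pre-submission sanity check (sketch; the format details are assumptions).
import sys

def check_submission(path, min_genes=3, max_genes=100):
    """Validate a module file, assuming one module per tab-separated line:
    module ID, a score column, then the gene IDs."""
    seen_genes = set()
    errors = []
    with open(path) as fh:
        for line_no, line in enumerate(fh, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 + min_genes:
                errors.append(f"line {line_no}: too few columns")
                continue
            genes = fields[2:]
            if not (min_genes <= len(genes) <= max_genes):
                errors.append(f"line {line_no}: module has {len(genes)} genes")
            overlap = seen_genes.intersection(genes)
            if overlap:  # assuming modules are required to be non-overlapping
                errors.append(f"line {line_no}: genes in more than one module: {sorted(overlap)[:5]}")
            seen_genes.update(genes)
    return errors

if __name__ == "__main__":
    problems = check_submission(sys.argv[1])
    print("\n".join(problems) if problems else "No problems found.")
```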
@gyuanfan
Sorry, I almost forgot to reply to your question about Fig. 3. I think you misunderstood something. Are you referring to Fig. 3b? I guess so, because for 100 predicted modules (second-to-last row) the median is close to 75 for the first network. However, that panel shows the size distribution of the disease modules only, while Fig. 3a shows the size distribution of all 100 modules. As you can see in Fig. 2, only about 5 of the 100 predicted modules were disease modules for this network, i.e. the boxplot shows the sizes of these 5 disease modules.
Sarvenaz ran these methods; I'll ask her if she can share the code. Any standard community detection method can also work as a good baseline, e.g. she also tried Markov Clustering (MCL, http://micans.org/mcl/).
--daniel
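As a rough illustration of such a baseline (not Sarvenaz's code), the sketch below clusters a network with a standard community-detection routine from networkx; the edge-list path, the weight column, and the 3-100 module-size filter are assumptions for illustration, not the official spec. With MCL itself, the equivalent would be something like `mcl network_1.txt --abc -I 2.0 -o modules.txt` on the same edge list.

```python
# A simple community-detection baseline (sketch; file names and formats are assumed).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Assumed input: tab-separated edge list "gene_a <tab> gene_b <tab> weight"
G = nx.read_weighted_edgelist("network_1.txt", delimiter="\t")

# Any standard method will do as a baseline; modularity clustering is used here,
# MCL (http://micans.org/mcl/) would be another option.
communities = greedy_modularity_communities(G, weight="weight")

# Write one module per line, keeping only modules in a plausible size range
# (the 3-100 limit and the output layout are assumptions, not the official spec).
with open("modules_network_1.txt", "w") as out:
    module_id = 1
    for genes in communities:
        if 3 <= len(genes) <= 100:
            out.write(f"{module_id}\t1.0\t" + "\t".join(sorted(genes)) + "\n")
            module_id += 1
```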
Hi Daniel, thanks for your reply. There are some existing tools to test gene sets / pathways for enrichment in GWAS data, like DAVID, Enrichr, and the Pascal tool you mentioned. But I think the huge advantage of the platform you constructed is that you have collected over 200 GWAS datasets. So I wish you success with your grant.

thanks for the information, daniel. i sincerely wish you success with your grant. i remember a friend of mine telling me he saw a grant proposal for running competitions. it is very challenging now, since some people on the panel question why we need so many competitions...
i have a question about your preliminary data, figure 3. you can discover 75 clusters of about 100 genes each, that is 7500 genes covered. i don't think that is realistic. because in the human genome, after removing the thousands of essential genes which should never occur, there are probably only 12000-15000 reasonably real genes left, so how can over half of them be significantly involved in diseases? that's not realistic to me.
or could you please share the code you used to generate these clusters? that would be a nice baseline.
Hi @dxl466
Thanks for your interest. Yes, we actually submitted a grant to create a web platform for this purpose.
If you just want to test gene sets / pathways for enrichment in GWAS data, you can already do this with the [Pascal tool](syn6156761/wiki/401425).
We will also release the scripts and data used in this challenge after it is finished.
Best, Daniel
Hi Yuanfang
Thanks for starting right away ;)
Our aim is to allow the maximum number of submissions that is computationally feasible given our HPC resources. It roughly works as you say: if we have half the number of participants, we can allow twice as many submissions. However, runtime can also vary a lot depending on the size and number of modules in the submissions, so we don't want to promise anything before we finish this test leaderboard.
--daniel
Thanks for your effort. Would it be possible to provide such a GWAS platform even outside of the challenge, to test a list of official gene symbols? You certainly know how important that would be for people with no bioscience background.

also, your queue isn't open.... i was going to submit some random file from another challenge just to see if you are able to recognize it, but there is no queue.

hi, daniel. thanks so much for working this out. i was thinking this must be really hard.
you won't have 200 teams. based on my 10 previous participations in DREAM, the number of submitting teams is roughly #registrations/12 for challenges that do not require cloud computing. does that mean we will be given 12*5 = 60 submissions instead?
we will submit some today to test it out. is there any submission format released?
thanks so much.
yuanfang