Dear participants

Unfortunately, we found that due to a bug in the Pascal tool, all UCSC protein-coding genes were used as the background set to compute enrichment in Leaderboard Round 1. As discussed previously, we thought we were using as background set for a given module prediction only the genes of that network (for sub-challenge 2 (SC2), the union of all network genes). However, we found that there was an error in how Pascal handled these custom background gene sets, which resulted in the whole loaded gene annotation being used as background.

At first we hoped that we could continue the challenge with the UCSC protein-coding genes as background; after all, it would be the same for all teams. However, it turns out that genes in our networks have slightly but significantly higher GWAS gene scores (disease association) than all UCSC protein-coding genes. This is problematic because it leads to many significant modules even in random predictions and it favors larger modules, as can be seen in the [**plots posted here**](syn6156761/wiki/405291).

**We understand that it must be frustrating for you that we again have to change an aspect of the scoring; we sincerely apologize for the inconvenience.**

We plan to open Round 2 tomorrow evening. We will slightly extend the following rounds and increase the submission limit so that **each team will have at least 20 submissions in total over Rounds 2-4**.

**Some more details for those interested**

Over the weekend we did several tests following up on some surprising observations from Round 1 (a relatively high score for a random prediction, and generally increased scores compared to the test round discussed [in this thread](syn6156761/discussion/threadId=753)). First, we suspected a possible bug in our scoring scripts, so I re-implemented the script independently from Sarvenaz' version and we confirmed that we got identical results. Second, we submitted predictions from SC1 in SC2 (after re-mapping the gene IDs) and found that the scores were exactly identical. This was not expected, because we had specified different background sets of genes, which should have led to slightly different results. This led us to discover that the option of the Pascal tool that we used to specify the background set of genes did in fact not work, and all UCSC protein-coding genes were used as background. We apologize again for this mistake. It explains why scores increased in Round 1 compared to the Test Round, and why random predictions of large modules could get relatively high scores.

**Overview of background genes used in each round**
* **Test Round**: The union of all genes in a given network module prediction (submission file) was used as background. While this option makes sense when testing a pathway database such as KEGG for GWAS enrichment, it was problematic for the challenge because submissions that don't include all genes of a network (e.g., because of modules outside the valid size range) would not be scored using the same background as other submissions.
* **Round 1**: Due to a bug, all UCSC protein-coding genes were used as background.
* **Rounds 2-4**: The genes of a given network are used as background (the union of all network genes in SC2).
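As a purely conceptual illustration of why the choice of background matters: Pascal aggregates GWAS gene scores rather than testing a simple overlap, so the sketch below is not the challenge scoring, and all numbers in it are hypothetical. It only shows that the same module looks more enriched when tested against a broader background with a lower rate of disease-associated genes.

```python
# Conceptual sketch (NOT the Pascal scoring): how the background set changes a
# simple hypergeometric enrichment p-value. All numbers below are hypothetical.
from scipy.stats import hypergeom

module_size = 50   # genes in a predicted module
module_hits = 10   # module genes that are GWAS-associated

# Background 1: genes of the network the module came from (higher hit rate)
network_genes, network_hits = 12000, 1500

# Background 2: all UCSC protein-coding genes (lower hit rate)
ucsc_genes, ucsc_hits = 20000, 1800

# P(X >= module_hits) under each background
p_network = hypergeom.sf(module_hits - 1, network_genes, network_hits, module_size)
p_ucsc = hypergeom.sf(module_hits - 1, ucsc_genes, ucsc_hits, module_size)

print(f"p-value vs network background: {p_network:.4f}")
print(f"p-value vs UCSC background:    {p_ucsc:.4f}")
# The same module appears more significant against the broader, lower-rate
# background, which is the direction of the bias described in the announcement.
```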

Created by Daniel Marbach (daniel.marbach)
Nice, it seems your module score works well! We will definitely also look at these scores in the post-challenge analysis of the results. --daniel
Thanks Daniel. I did not select the subset randomly. In general, modules in the former one (Sub1R2_1.zip) have higher module scores. Now it makes sense.
i see. thanks for the explanation and all the work, daniel! you don't need to look at mine. my code and submissions are full of bugs.
PS: I agree this looked surprising, but I am now confident that it's not a bug because the two submissions were scored independently, the modules of the subset have the exact same p-values before correction as the corresponding modules in the full submission, and we even plotted the p-values and confirmed that the majority of the modules with very low p-values are included in the subset, which explains why some borderline significant modules in the subset are not significant anymore in the full submission. Moreover, the same pipeline was independently implemented by Sarvenaz and me and we get the same result (for BH correction we use standard functions in Python and R, respectively). We just looked at Dong Li's submission, but would be happy to also have a closer look at yours if you still think something is not right (please include submission names).
Thanks, got it. Indeed, we checked and the majority of Dong Li's modules with a low p-value are in the subset, which explains the result. I agree this would be very unlikely if the modules of the subset were sampled randomly. But my guess is that the subset corresponds to high-confidence modules, which would explain why they comprise most of the significant modules. So I think your logic is correct, but it's not a Bernoulli trial.

Dong and Yuanfang, could you clarify if you chose the subset randomly or according to some quality score of the modules? Or maybe the modules that are not in the subset correspond to modules of size > 100 that were further subdivided to fit the size range?

--daniel
thanks daniel for the explanation:

this is my logic, can you tell me where i am wrong? when one uses B-H correction, if set A and set B have the same distribution, then the total number of significant results will be equal to setA+setB, instead of <(setA+setB). this is very different from Bonferroni correction.

this is because: suppose a module xi originally ranks 10/1000, so its p-value cutoff is 10/1000 × 0.05 under multiple testing correction. now we have two sets with the same distribution, this xi ranks 20/2000, and its p-value cutoff is 20/2000 × 0.05 under multiple testing correction, which is the SAME as the original.

of course, as you said, without seeing the p-values anything can happen, but it is so unlikely to happen, and i will explain why. taking Dong's example: out of the first 1000 modules, 28 are significant. when adding in the second 1000 modules, only 2 are significant.

this is just a bernoulli trial, right? this is your cdf of observing this result or a more extreme one: 30 × 0.5 × 0.5^29 + (30 × 29/2) × 0.5^2 × 0.5^28 = 0.00000043306499719597. we are saying that this 0.00000043306499719597 event happened to one of the submissions. you have 240 submissions in total, so let's also do the strongest multiple testing correction: that is a 0.000103 event happened in this challenge. don't you think it is weird?

can you tell me where my deduction went wrong? thanks a bunch, yuanfang (update: corrected formula, i think my calculation is correct)
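For reference, the tail probability computed by hand above can be checked directly against the binomial distribution. This is only a sketch of that arithmetic, assuming (as in the argument above) that each of the 30 significant modules is equally likely to fall in either of the two equally sized sets:

```python
# Check of the hand calculation: if 30 significant modules were each equally
# likely to fall in either of two equally sized sets, the chance of seeing at
# most 2 of them in the second set is the lower tail of Binomial(30, 0.5).
from scipy.stats import binom

p_at_most_2 = binom.cdf(2, 30, 0.5)
print(p_at_most_2)  # ~4.3e-07, matching the hand computation up to the k=0 term
```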
Hi

My poor nerves :) But I checked and everything looks correct, and I don't understand why you find these results surprising.
* I checked Dong Li's submissions. Before multiple testing correction, identical modules indeed get identical p-values in the two submissions. So without multiple testing correction, adding more modules could indeed only increase your NS score. However, as we correct for the number of modules, a module that was borderline significant in the smaller set can now be above the threshold in the larger set. To be specific, the module that makes the difference in network 5 has a corrected p-value of 0.057 in the prediction with 382 modules and a corrected p-value of 0.029 in the prediction with 183 modules.
* With the multiple testing correction, adding more modules can either increase or decrease the NS score. The score increases if more modules with a very low p-value are added. The score may decrease if modules with insignificant p-values are added, because the multiple testing burden increases.

Yuanfang, I don't understand your argument. How could you know the expected number of BH-corrected p-values that are significant without knowing their values / distribution? You don't know the actual p-values of the smaller set, nor those of the additional modules in the larger set, so anything is possible, right? Borderline significant p-values can become insignificant when the number of tests is increased.

Best, Daniel
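A minimal sketch of the effect described above, using made-up p-values and the standard BH implementation in statsmodels (not the challenge pipeline): a module that passes BH in a small submission can fail once many insignificant modules are added.

```python
# Made-up p-values: a borderline module is significant under BH at alpha=0.05
# in a small submission but loses significance after 50 insignificant modules
# are added, because the multiple testing burden increases.
import numpy as np
from statsmodels.stats.multitest import multipletests

small = np.array([0.0001, 0.0005, 0.002, 0.012])   # 4 modules, last one borderline
extra = np.linspace(0.2, 1.0, 50)                   # 50 added modules, none significant
large = np.concatenate([small, extra])

rej_small, _, _, _ = multipletests(small, alpha=0.05, method="fdr_bh")
rej_large, _, _, _ = multipletests(large, alpha=0.05, method="fdr_bh")

print("significant in small submission:", rej_small.sum())      # 4
print("significant in large submission:", rej_large[:4].sum())  # 3: the 0.012 module drops out
```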
have to agree:

| Submission Name | N | NS | 1_ppi(NS) | 2_ppi(NS) | 3_signal(NS) | 4_coexpr(NS) | 5_cancer(NS) | 6_homology(NS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sub1R2_1.zip | 1285 | 28 | 8 | 7 | 4 | 3 | 3 | 3 |
| Sub1R2_3.zip | 2322 | 30 | 11 | 7 | 3 | 3 | 2 | 4 |

if this happens to one network of one person, that means this person is unlucky.

if this happens to all people, that means the scoring is wrong. an expected Benjamini-Hochberg correction should have these values:

| Submission Name | N | NS | 1_ppi(NS) | 2_ppi(NS) | 3_signal(NS) | 4_coexpr(NS) | 5_cancer(NS) | 6_homology(NS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sub1R2_1.zip | 1285 | 28 | 8 | 7 | 4 | 3 | 3 | 3 |
| Sub1R2_3.zip | 2322 | 56 | 16 | 14 | 8 | 6 | 6 | 6 |

actually, at least these values, because B-H takes the highest ranking one (the highest P that passes the cutoff). http://www.biostathandbook.com/multiplecomparisons.html no math deduction here, but read this: "For example, if García-Arenzana et al. (2014) had looked at 50 variables instead of 25 and the new 25 tests had the same set of P values as the original 25, they would have 10 significant results under Benjamini-Hochberg with a false discovery rate of 0.25." (up from an original of 5 significant results).

i submitted 4 full sets, and the scores are 18, 21, 25, 27 (same method with slight variations). i submitted a subset, and the score is 34. i submitted a larger subset, and the score is 32. how is that even possible?

the probability that only 2 of the additional 1000 modules pass the cutoff (while the other 1000 have 28) is effectively zero. the probability that there is a bug is 1. thanks

yuanfang
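The handbook property cited above can be reproduced with a quick sketch (made-up p-values, standard BH from statsmodels): duplicating a p-value set doubles the number of BH rejections, which is why the argument expects the count to scale if the added modules had the same p-value distribution.

```python
# Sketch of the cited property: running BH on a p-value set and on the same set
# duplicated doubles the number of rejections (p-values below are made up).
import numpy as np
from statsmodels.stats.multitest import multipletests

p = np.array([0.001, 0.008, 0.02, 0.04, 0.2, 0.5, 0.8])
rej_once, _, _, _ = multipletests(p, alpha=0.05, method="fdr_bh")
rej_twice, _, _, _ = multipletests(np.tile(p, 2), alpha=0.05, method="fdr_bh")

print(rej_once.sum())   # 3 rejections on the original set
print(rej_twice.sum())  # 6 rejections on the duplicated set
```

Of course, as discussed above, this only holds if the added modules really follow the same p-value distribution as the original ones.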
I also believe there is a bug in the scoring system used in Round 2. I have two submissions in subchallenge 1; Sub1R2_1.zip is a subset of Sub1R2_3.zip w.r.t. all six networks. I have double checked this. But the NS scores go like:

| Submission Name | N | NS | 1_ppi(NS) | 2_ppi(NS) | 3_signal(NS) | 4_coexpr(NS) | 5_cancer(NS) | 6_homology(NS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sub1R2_1.zip | 1285 | 28 | 8 | 7 | 4 | 3 | 3 | 3 |
| Sub1R2_3.zip | 2322 | 30 | 11 | 7 | 3 | 3 | 2 | 4 |

which does not make sense. The NS of Sub1R2_3.zip should be no less than that of Sub1R2_1.zip for all networks. One more thing: since NS is the only criterion, is it possible to tell us which modules within a submission are disease modules?
i have very confusing results too. can you please tell me what could be the possible cause?

my betterbug.2.3.zip (which has 14/87) is a subset of betterbug_2.4.zip (7/117), which i generated by running a second time with slightly relaxed criteria for being considered significant, thus trimming off less of the tree.

which means that, even if all the additional 30 modules are wrong, 7 out of the 14 previously significant modules must have original reported p-values falling exactly between 0.003 and 0.008. that probability is about the same as flipping a coin 10 times and getting all heads.
* 0.05 × 14/87 = 0.00804597701149425287 -- old: 14/87 modules have p-values below 0.008
* 0.05 × 7/117 = 0.00299145299145299145 -- new: 7/117 modules have p-values below 0.003

then i worried that some low-probability event had happened, so i did the same experiment on SC1 and observed the same thing, which puts the probability to zero.
Hi Daniel, We have just tested our previous best-scoring modules using the new scoring system, and the results are very confusing. Is it possible at all that the 23 detected modules for the co-expression network, for example, correspond to 8 modules in the new scoring system? The drop is very drastic. Also, it looks like while this result was among the best scores for this network, it is no longer the case. Can it be true, and if so, do you have any explanation w.r.t. the new scoring system? Cheers, Dmitry
i see your point, daniel.

at least we moved on.

i shared the validation code in the hope that at least **this** challenge would successfully finish in time with a lot of participating teams. so no matter what we find later, we stick to what we already have and finish in time, ok? because there is no perfect system, all right?

thanks a ton,

yuanfang
Yuanfang, we're really sorry about the tight schedule. Unfortunately the weekly rounds are necessary, as we observed a peak in submissions on the last day (as expected), which was manageable since teams only had 5 submissions but still took almost two days to score (a power outage we had on the cluster didn't help...). If we had such a peak with up to 20 submissions per team, we fear that it might take over a week to finish the scoring, which would pose a significant problem for the already tight schedule. So we decided to continue with the weekly rounds, which is the only way to ensure that the scoring is more or less evenly distributed. (I realize that, paradoxically, this now prevents teams who would like to make more submissions early on from doing so.) --daniel
**We decided not to re-evaluate predictions**, mainly because it would take too long and the scores from the 1st round are NOT fundamentally wrong; only the background set changed, and many insights gained in Round 1 are still valid. For example:
* If a method/setting performed very poorly in Round 1, it is very unlikely to perform well in Round 2.
* If a team compared two methods/settings at similar module size and one performed much better in Round 1, the same would likely be true in Round 2.

In short, the relative performance of your five submissions would likely be similar in Round 2 (except for submissions with large modules, which may have benefited from the bias in Round 1). We expect that some teams will resubmit the same 5 predictions, while others may make changes. For those who want to resubmit the same predictions, we recommend that you do it as soon as Round 2 opens. Sorry for the inconvenience and thanks for your understanding, Daniel
i suggest moving on instead of re-evaluation. this is why:

1. one can approximately figure out their scores by subtracting the background corresponding to their cluster size, e.g. the top 10 teams in sc1 can now subtract 30, and that brings back a performance ranking quite similar to round 0, i.e. phil's submission should be top again.

2. one may choose to resubmit (i am going to); if we all have the same quota, that is fair to everyone. 20 submissions are already enough to overfit the leaderboard by 2-3 fold. however, please keep the old leaderboard and **open a brand new one**, as it is now very confusing to compare results.

3. this is the last challenge i see in 2016 that has **the potential** of a decent, in-time finish for the dream conference. the encode challenge extended its deadline to next year; the RV challenge is running with a data leak and still waiting for new data to be generated; the SMC-HET challenge may never finish scoring. the dream community needs at least one more challenge to finish in decent shape by sep 30th.

I also suggest to **merge rounds 2-4** and allow a total of 20 submissions, because during the breaks between rounds we cannot get any feedback; if we cannot get feedback but have to squeeze submissions into 3 days, it is equivalent to only submitting 3 times.
Is it possible to re-evaluate all the submissions made in subchallenge 2? It would give us a better idea of how to proceed. Whatever we wanted to try in Round 2 is based on the results that we got from Round 1. If the results are wrong, then there is no point in thinking in that direction. So it would be better if you could postpone Round 2 until the re-evaluation of Round 1 subchallenge 2 is complete. Re-evaluation is better than resubmission. Please let us know.
Dear Daniel Marbach, will the Round 1 predictions be re-evaluated? Best, -SungjoonPark
> **Results for random predictions for SC1 are now [available here](syn6156761/wiki/405291)**
ok... thanks daniel for the explanation.

but i said that a random prediction beating 70% of teams is weird, the organizing team assured me there was no bug, and it turned out to be indeed a bug. now i say that sc2 scores distributing way lower than sc1 is weird, so would you please take a look again, please, daniel? no coincidence can happen at this scale...

i am so frustrated, i think i am going to throw up -- i already called an end to this project today. if there was any angry tone in this post, i say sorry in advance!
> but how does it explain "why random predictions of large modules could get relatively high scores."?

Assume set A are the network genes and set B are the UCSC genes. The gene scores of set A are significantly higher than those of set B. You randomly sample N genes from set A. The larger N is, the more likely it is that the small shift towards larger scores in set A will lead to a significant enrichment. This can be seen in the results for random modules (I'm about to post them).

> i am sure SC2 scores are lower than SC1 scores. otherwise, how can you explain the distribution of SC2 scores is way lower than SC1 by a huge amount?

I think it's because teams tried multi-network strategies that didn't (yet) work well. It's not a coincidence; the results were numerically identical (not just the NS count, but also the enrichment scores). The only difference between SC1 and SC2 is the background, which is how we found out that specifying the background had no effect. I will post results for random predictions both for SC1 (now) and SC2 (soon) so we can compare the expected scores, and also do some more tests with real predictions. --daniel
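This size effect can be reproduced with a toy simulation. The gene scores and the test below are made up (a simple one-sided rank test, not Pascal's enrichment statistic), so this is only a sketch of the mechanism: random modules drawn from a slightly higher-scoring pool become significant more often as module size grows.

```python
# Toy simulation: random modules sampled from a gene pool whose scores are
# slightly shifted upward relative to the background reach significance more
# often as module size increases. Scores and test are hypothetical, not Pascal's.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, 20000)   # stand-in for UCSC gene scores
network = rng.normal(0.2, 1.0, 12000)      # network gene scores, slightly higher

for size in (10, 50, 200, 500):
    hits = 0
    for _ in range(200):
        module = rng.choice(network, size=size, replace=False)
        # one-sided test: are the module's scores higher than the background?
        p = mannwhitneyu(module, background, alternative="greater").pvalue
        hits += p < 0.05
    print(f"module size {size:4d}: {hits}/200 random modules significant")
```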
i appreciate your update. but how does it explain "why random predictions of large modules could get relatively high scores."? i don't get it. i think it only means everyone's score will drop proportionally.

also, are you sure that "we submitted predictions from SC1 in SC2 (after re-mapping the gene IDs) and found that scores were exactly identical."?? i think this is a coincidence; you only submitted one, and it happens to be the same.

i am sure SC2 scores are lower than SC1 scores. otherwise, how can you explain that the distribution of SC2 scores is way lower than SC1 by a huge amount? that everyone just got terribly worse and didn't even know to submit something from SC1 to SC2?

*** IMPORTANT: Change in background set for Round 2