We submitted the same file for one network in subchallenge 1 as before, and received a score one point higher than in the previous round.
I can only imagine two ways this is possible:
a) The scoring is stochastic, and apparently the sampling is not sufficient.
b) The scoring function changed between the two rounds.
Organizers, can you please explain which of the two?
Thank you
Created by Andras Hartmann (ahartmann)
Yes, we will share all data and code with participants with the final submissions.
--daniel
Dear Daniel,
Thank you for your efforts and feedback on this issue.
Are you planning to share the final source code of the scoring in the community phase of the challenge?
I think this would be beneficial for the community, because participants could examine issues like these.
Best, Andras
The mystery has been solved... Upon further investigation, I found that the small difference in some of the scores occurs when changing from Java 8 (on VitalIT) to Java 7 (on our server). When using Java 7 on VitalIT, results again match those on our server exactly, so it's definitely due to the Java version and not a difference in the setup of the pipeline on the two platforms.
After some debugging, I found the reason for the difference. We use the chi2 enrichment method described in the Pascal paper, which ranks the genes by their GWAS scores. Some neighboring genes on the genome are assigned the same SNPs and thus have identical scores. Pascal does not assign the average rank to ties, but takes them in the order in which they appear in the list. It seems something in the implementation of the data structures changed between Java 7 and 8 so that ties now appear in a different order, and may thus be assigned slightly different ranks, leading to slightly different enrichment p-values for some modules.
A few of the GWAS have a relatively small number of SNPs; for these, there can be more genes with tied scores and the effect is thus a bit bigger, which explains why a module can sometimes move above/below the threshold (see my post above).
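To make this concrete, here is a minimal sketch -- not Pascal's actual implementation, with invented gene names and scores -- of how assigning ranks by list position (rather than averaging ranks over ties) makes the result depend on the order in which tied genes happen to be iterated, which is exactly the kind of order that can differ between Java collection implementations:
```
import java.util.*;

// Minimal sketch (not Pascal's actual code): ranks are assigned by position in
// the sorted list, so tied scores keep whatever relative order they arrived in
// instead of receiving the average rank.
public class TieRankDemo {

    static Map<String, Integer> rankByListOrder(LinkedHashMap<String, Double> geneScores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(geneScores.entrySet());
        // Stable sort by score: tied entries keep their incoming iteration order.
        entries.sort(Map.Entry.comparingByValue());
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            ranks.put(entries.get(i).getKey(), i + 1); // 1-based rank, no tie averaging
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Two neighboring genes were assigned the same SNPs => identical GWAS scores.
        LinkedHashMap<String, Double> orderA = new LinkedHashMap<>();
        orderA.put("GENE1", 1e-4);
        orderA.put("GENE2", 1e-4); // tie with GENE1
        orderA.put("GENE3", 3e-2);

        // Same data, but the tied genes are iterated in the opposite order
        // (e.g. because a hash-based collection happens to order them differently).
        LinkedHashMap<String, Double> orderB = new LinkedHashMap<>();
        orderB.put("GENE2", 1e-4);
        orderB.put("GENE1", 1e-4);
        orderB.put("GENE3", 3e-2);

        System.out.println(rankByListOrder(orderA)); // {GENE1=1, GENE2=2, GENE3=3}
        System.out.println(rankByListOrder(orderB)); // {GENE2=1, GENE1=2, GENE3=3}
        // A rank-based chi2 enrichment statistic computed from these ranks can
        // therefore differ slightly, depending on how the ties were ordered.
    }
}
```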
Best, Daniel
thanks, daniel, but may i know the direction of changes? e.g. consistently higher, consistently lower.
because i found my score is slightly higher, by 1 or 2 on average, this round. i did not change anything, other than that my method is not deterministic. i am trying to figure out whether a new non-deterministic partition is really better than an old one, or whether it just looks better because the scoring changed.
thanks
* Sarvenaz rescored 10 submissions (6 networks each). Of the resulting 60 NS scores, 48 are identical, 11 differ by 1, and 1 differs by 2.
* I agree we need to understand the underlying reason even if we stick to one platform, I'm working on it.
* Either way, this should not affect the scoring: a difference in score of 1 or 2 is minor and would not be significant -- the noise in the NS scores is much bigger than that, as the results on random predictions show (NS scores varied by up to 7 for a single network, see [here](syn6156761/wiki/405291)).
```
Yes, we are now doing all the scoring on VitalIT, including the final scoring
```
I understand that you need to stick to one standard, but we don't think that this actually solves the problem of precision.
If the values were different on the two systems, we can suspect that both of them may introduce errors, and unfortunately this error is on the order of magnitude of the p-values themselves, which in fact affects the q-value calculation (see the illustrative sketch below).
Bottom line, we agree with Yuanfang that this can have a radical effect on the scoring and may even completely invalidate it.
Yes, we are now doing all the scoring on VitalIT, including the final scoring.
--daniel
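To illustrate the point above about errors on the order of the p-values affecting the q-value calculation, here is a minimal sketch assuming, purely for illustration, a Benjamini-Hochberg style 5% FDR cutoff over ten hypothetical module p-values (the challenge's actual multiple-testing correction and numbers differ): a tiny shift in a single p-value that sits right at its step threshold changes the number of modules called significant, and hence the score, by one.
```
import java.util.Arrays;

// Minimal sketch, assuming a Benjamini-Hochberg style FDR procedure purely for
// illustration (the challenge's actual correction may differ): a tiny change in
// one p-value near its step threshold flips the number of significant modules.
public class ThresholdDemo {

    // Number of p-values passing the BH step-up procedure at the given FDR level.
    static int countSignificant(double[] pValues, double fdr) {
        double[] p = pValues.clone();
        Arrays.sort(p);
        int n = p.length;
        int passing = 0;
        for (int i = 0; i < n; i++) {
            // BH: find the largest rank i (1-based) such that p_(i) <= (i / n) * fdr.
            if (p[i] <= (i + 1) * fdr / n) {
                passing = i + 1;
            }
        }
        return passing;
    }

    public static void main(String[] args) {
        double fdr = 0.05;
        // Hypothetical module p-values; only the third one differs between "platforms".
        // Its BH threshold is 3/10 * 0.05 = 0.015, so 0.0151 fails while 0.0149 passes.
        double[] platformA = {0.0001, 0.004, 0.0151, 0.20, 0.35, 0.50, 0.60, 0.75, 0.90, 0.95};
        double[] platformB = {0.0001, 0.004, 0.0149, 0.20, 0.35, 0.50, 0.60, 0.75, 0.90, 0.95};

        System.out.println(countSignificant(platformA, fdr)); // 2 significant modules
        System.out.println(countSignificant(platformB, fdr)); // 3 significant modules
    }
}
```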
```
wait, numerical errors occur at 10^-14, not 10^-4.
```
That is correct; however, 10^-14 round-off errors may accumulate over many operations, causing numerical instability.
I am not saying that this is happening, but if this is the cause then the error is definitely in the range of the threshold, and +/- 1 in the score (on each network) may indeed influence the final score.
wait, numerical errors occur at 10^-14, not 10^-4.
One guess you could examine is numerical instability, meaning that you might lose precision because of roundoff errors, and the rounding procedure on the two systems might be different.
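As a generic illustration of this kind of roundoff behavior (unrelated to Pascal's actual computations): the same floating-point sum can differ depending on the order of operations, and individually tiny errors can grow when accumulated over many operations.
```
// Generic illustration of floating-point roundoff (not taken from Pascal):
// results can depend on the order of operations, and tiny per-operation errors
// can accumulate over many operations.
public class RoundoffDemo {
    public static void main(String[] args) {
        // The same three numbers summed in a different order give different doubles.
        double a = (0.1 + 0.2) + 0.3; // 0.6000000000000001
        double b = 0.1 + (0.2 + 0.3); // 0.6
        System.out.println(a == b);   // false
        System.out.println(a - b);    // on the order of 1e-16

        // Accumulation: adding 0.1 a million times does not give exactly 100000.
        double sum = 0.0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += 0.1;
        }
        System.out.println(Math.abs(sum - 100000.0)); // small but clearly nonzero
    }
}
```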
Anyway, can you tell us in advance whether the final evaluation will be on the system you are running the 4th leaderboard on?
Thanks,
Andras
Yes, I guess it's not about different versions of libraries because the difference is substantial for this module. But again, in our tests we got identical NS scores for all networks, and in the two submissions that we rescored the difference was only 1, in both cases for the signaling network. We are now rescoring more submissions to see what's going on.
PS: In the meantime we have stopped scoring on our server as a precaution.
--daniel
but daniel, this is a two-fold difference:
< 0.000795
> 0.000354
that is going to change all rankings, because No. 1 to No. 10 are not two-fold apart right now...
thanks a bunch
yuanfang
Sarvenaz is investigating the issue. The reason is that the first time your submission was scored on the VitalIT cluster and the second time on our server. In our tests, we got identical results on the two platforms. Now that we have looked closer, we found that for the majority of modules the enrichment p-values are exactly identical, but for a few modules the p-values can be slightly different. We still don't understand the reason, but we know that only a few modules are concerned and the differences are very small. We have found a difference of 1 in the NS score of the signaling network for another submission, while all other scores were identical. We are now rescoring more submissions to investigate further.
**From the participants' perspective, the difference seems to be so small that the issue can be ignored.** Note that even random predictions have NS scores fluctuating between 0 and 7 (see Preliminary results). For example, for this network and GWAS, all module p-values were exactly identical except for these five modules:
```
< 1.82299064E-1
> 1.82227394E-1
< 4.53231975E-1
> 4.53384082E-1
< 6.8433909E-1
> 6.84290822E-1
< 9.52527959E-1
> 9.525242E-1
< 9.85340746E-1
> 9.85340265E-1
```
In Andras' prediction for the signaling network, the top ten modules had exactly the same p-values except for three modules (the last one made the difference in the NS score):
```
< 0.000007
> 0.000005
< 0.000126
> 0.000159
< 0.000795
> 0.000354
```
We still don't know why some modules have slightly different scores on the two platforms. We have checked that the same settings are used, and the differences would be much larger if an option had been set differently. Moreover, all the input files and even intermediary result files, such as the gene scores of the fused genes, are exactly identical. I don't know if a different version of a math library used by Pascal on the two platforms could explain the difference.
We'll give an update once we have rescored more submissions.
Best, Daniel
Sorry, I mixed up two columns, it is SIGNALING!
Thanks --coexpression network, as said before--
Could you tell for which network?
It is the coexpression modules between
round 3: 7247318
round 4: 7254800
It was between round 3 and round 4.
Could you specify which two rounds? As you might know, round 1 had a different scoring than subsequent rounds.