We are a bit puzzled by the results of Sub-challenge 3. According to the final results ranking (https://www.synapse.org/#!Synapse:syn18404605/wiki/609127), our team cSysImmunoOnco is first with submission 9710965, but we were not declared winners. Could you help us understand why?
Kind regards,
Federica
Created by Federica Eduati (Fede)
Hi @Michael.Mason,
When the bootstrapping was done, was the same proportion of responders and non-responders kept in each sample? I ask because there are more non-responders than responders, so this may be a situation where stratified sampling should have been used to ensure that each bootstrap sample reflects the original population.
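For concreteness, this is the kind of stratified resampling I have in mind; a minimal sketch with hypothetical names, not the challenge's actual scoring code:

```python
import numpy as np

rng = np.random.default_rng(0)

def plain_bootstrap(labels, n_boot=1000):
    """Ordinary bootstrap: class proportions drift from resample to resample."""
    n = len(labels)
    return [rng.integers(0, n, size=n) for _ in range(n_boot)]

def stratified_bootstrap(labels, n_boot=1000):
    """Resample within responders and non-responders separately, then pool,
    so every bootstrap sample keeps the original class proportions."""
    labels = np.asarray(labels)
    strata = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    return [np.concatenate([rng.choice(s, size=s.size, replace=True) for s in strata])
            for _ in range(n_boot)]

# e.g. labels = np.array([0] * 40 + [1] * 15)  # hypothetical unbalanced cohort
```

With an unbalanced split, the plain version can occasionally draw very few responders, which would inflate the spread of the per-bootstrap metric.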
Thank you,
Jacob
Dear @Michael.Mason,
Many thanks for this discussion.
On the criteria page, the comparison with (any of) the baseline models is stated as a possibility rather than as a predefined rule, and it is not even mentioned on the main challenge description page. The latter just says: "Top teams determined by the primary and tie-breaking metrics described above".
You did an impressive job describing the evaluation procedure, including the details of how the metrics are computed and how ties are handled, in order to make the evaluation transparent. However, applying this additional rule only in the final evaluation, without properly informing the participants, makes the process less transparent and more subjective: the baseline metric was not defined in advance, and the choice of baseline is likely to affect the results. From our point of view, it should have been clear already in the validation phase whether this criterion would be applied, and the corresponding baseline metric should have been reported on the leaderboard so that participants could properly assess their methods; otherwise the purpose of having a leaderboard is lost. The same applies to the final results, of course, as stated by @jacob.pfeil.
We do not want to be misunderstood: we really appreciate all your efforts in organising this very nice challenge, we enjoyed the process, and we are thankful for the opportunity. We would simply like to express our concern about the decision to introduce this criterion only in the final evaluation.
Also thinking about the contribution to the immuno-oncology field, we believe that discarding the top-ranked teams in 2 of the 3 sub-challenges, including from the analysis of the results in the publication, would strongly limit the insights that can be obtained from this challenge to advance the field.
Kind regards,
Federica on behalf of the cSysImmunoOnco team
PS: I am looking forward to seeing the plot for SC3; as @FrancescaF mentioned, it is currently not accessible. I am also trying to understand why our method has such a broad bootstrap distribution. Could it have to do with the fact that the classes are unbalanced? Do you use an ordinary bootstrap or a stratified version?
Dear @Michael.Mason,
thank you very much for the explanations and for sharing the plots.
However, the plot for SC3 seems not to be accessible/visible:
"_Sorry, you do not have sufficient privileges for access. You do not have READ permission for the requested entity, syn25327648_."
Thanks,
Francesca
Thanks, @Michael.Mason. I see the blue diamond is the DSS C-index of OS from the table. How was this determined, since it doesn't look like the mean of this distribution? I'm surprised my method had this much variability. I predicted binary outputs, so I wonder whether the extreme values hurt the performance. Did tryitout get removed from the top performers during the tie-breaking?
Dear @Fede and @jacob.pfeil,
We did not rerun the models repeatedly; instead we drew bootstrap samples from the prediction files. All submitted models were scored on the same N = 1000 bootstrap samples. I have attached two plots (one for SC2 and one for SC3). They show the bootstrap distributions of the primary metric for each submission and for the TMB and PDL1 baselines. Next to each team's name I added the p-value based on the empirical null, followed by the Bayes factor with the TMB baseline as the reference (green denotes baseline models; yellow denotes teams tied with the highest-ranked team that met both the empirical-null criterion and the baseline Bayes-factor criterion). Please let us know if you have more questions.
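(For reference, the resampling step works roughly like the sketch below. It is deliberately simplified: the function and variable names are made up for illustration, and the Spearman correlation is only a stand-in for each sub-challenge's actual primary metric.)

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def bootstrap_distributions(y_true, prediction_files, n_boot=1000):
    """prediction_files: dict mapping submission name -> prediction vector.
    The same bootstrap indices are reused for every submission, so the
    resulting metric distributions are directly comparable."""
    y_true = np.asarray(y_true)
    n = len(y_true)
    shared_idx = [rng.integers(0, n, size=n) for _ in range(n_boot)]
    dists = {}
    for name, pred in prediction_files.items():
        pred = np.asarray(pred)
        # stand-in metric; the real primary metric differs per sub-challenge
        dists[name] = np.array([spearmanr(y_true[idx], pred[idx])[0]
                                for idx in shared_idx])
    return dists
```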
Kind Regards,
Mike
${preview?entityId=syn25327649}
${preview?entityId=syn25327648}
Thanks again, @Michael.Mason and @Fede. I have one more question. I see that the metrics are calculated by resampling the prediction file and **not** by rerunning the Docker container (https://www.synapse.org/#!Synapse:syn18404605/discussion/threadId=7798&replyId=24246). So essentially each model got one chance to make its predictions, and the distributions were generated by resampling those predictions. My question is: did each model predict on the same data, or was the input data also sampled from a larger dataset?
@Michael.Mason,
Thanks, it would be very useful to see the bootstrapped distributions, and also to add the Bayes factor for all submissions, as @jacob.pfeil suggested, since it appears to be a crucial criterion. It would really help us understand our performance.
Kind regards,
Federica
Dear @jacob.pfeil and @Fede,
There are two layers of criteria:
- First: models must outperform a null distribution and must not be tied with either the PDL1 baseline model or the TMB baseline model, as noted [here](https://www.synapse.org/#!Synapse:syn18404605/wiki/607424).
- Second: models that pass those criteria are then assessed for ties, as stated [here](https://www.synapse.org/#!Synapse:syn18404605/wiki/607233).
Unfortunately, your models did not pass the first criterion: they were tied with the TMB baseline (Bayes factor < 3, with the TMB baseline as the reference). I will check with the other challenge organizers to see whether I can share some plots of the bootstrapped distributions to give you a better sense of how your models performed.
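Schematically, the two layers amount to something like the sketch below. It is illustrative only: the names are made up, and it uses a simple win-ratio form of the bootstrap Bayes factor rather than our exact scoring code.

```python
import numpy as np

def bayes_factor(boot_dist, ref_dist):
    """Illustrative win-ratio Bayes factor: bootstrap samples in which the
    submission beats the reference versus those in which it does not."""
    boot_dist, ref_dist = np.asarray(boot_dist), np.asarray(ref_dist)
    wins = np.mean(boot_dist > ref_dist)
    return wins / max(1.0 - wins, 1e-12)

def passes_first_layer(observed_score, boot_dist, null_dist, tmb_dist, pdl1_dist,
                       alpha=0.05, k=3.0):
    """Layer 1: beat the empirical null AND not be tied (BF < k) with either
    the TMB or the PDL1 baseline. Layer 2 (not shown) then breaks ties among
    the submissions that pass, using the tie-breaking metric."""
    p_null = np.mean(np.asarray(null_dist) >= observed_score)  # empirical p-value
    untied_with_baselines = (bayes_factor(boot_dist, tmb_dist) >= k and
                             bayes_factor(boot_dist, pdl1_dist) >= k)
    return p_null < alpha and untied_with_baselines
```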
Kind regards,
Mike
Hi @Michael.Mason,
Building on @Fede's point, it is concerning that the overall top-performing methods are not identified by the challenge criteria. I wonder if it has to do with the bootstrapping parameters. I am not aware of the specific parameters used, but I suspect that if you increased the bootstrap sample size, my method's scores would be more consistent across bootstraps, which could improve the Bayes factor.
Thank you,
Jacob
Dear @Michael.Mason,
Thanks for the clarification. However, the challenge description states:
"Determining Ties
After the Validation Phase round, distributions of participants' primary metric score will be computed via bootstrapping. For each sub-challenge question, a Bayes factor (K) will be computed for each team using the best-performing team as the reference. The Bayes factor will be used to determine a group of statistically tied teams (ex. K < 3 or K < 5) and the tie-breaking metric will then be applied to this group."
So the comparison should have been done with respect to our score, not to the TMB baseline. It seems quite absurd that the team that performed best on the overall dataset is not among the winners.
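To make our reading explicit: taken literally, the quoted rule corresponds to something like the sketch below (names are hypothetical, and the win-ratio style Bayes factor is used purely for illustration).

```python
import numpy as np

def bayes_factor(ref_dist, boot_dist):
    """Illustrative win-ratio Bayes factor of the reference team over another team."""
    ref_dist, boot_dist = np.asarray(ref_dist), np.asarray(boot_dist)
    ref_wins = np.mean(ref_dist > boot_dist)
    return ref_wins / max(1.0 - ref_wins, 1e-12)

def tied_group(boot_dists, point_scores, k=3.0):
    """boot_dists: team -> bootstrap scores; point_scores: team -> score on the
    full dataset. As quoted, the best-performing team is the reference, and
    every team with K < k against that reference is part of the tied group;
    the tie-breaking metric is then applied within this group."""
    best = max(point_scores, key=point_scores.get)
    return [team for team in boot_dists
            if team == best or bayes_factor(boot_dists[best], boot_dists[team]) < k]
```

Under this reading, the reference is the top-ranked submission, not the TMB baseline.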
Kind regards,
Federica
Thank you for the clarification, @Michael.Mason. Would it be possible to update the leaderboard with the Bayes factor, so we can see which models met this criterion?
Dear @Fede,
Something similar happened in Sub-challenge 2, where @jacob.pfeil appears to be in 1st place at first glance. In both of these sub-challenges, the highest-ranked team had much wider distributions of their bootstrapped primary metric, which made their Bayes factor < 3 with the TMB baseline model used as the reference. As a result, these models did not meet [the criteria](https://www.synapse.org/#!Synapse:syn18404605/wiki/607424) for being declared a top-performing model.
Please let us know if you have any more questions.
Kind Regards,
Mike