Dear organizers,
I am wondering whether there could be a mistake in the computation of the Hosmer-Lemeshow p-value on the leaderboard. It does not seem to correlate with Harrell's C-index (poorly performing models sometimes have much lower HL values than top models), and the HL value is often extremely small, which seems hard to believe.
Also, the current top-performing model (submission 9730535) has amazingly good performance, much better than all the other submissions, so I was wondering whether there could be any data leakage (i.e., whether the "Event" and/or "Event_time" columns were not removed from the test set). Anyway, if that is not the case, it is really impressive.
Thanks,
Tristan
Created by TristanF

Hi @TristanF,
Very good observation.
We use exactly the same calculations as shared in this [code](https://www.synapse.org/#!Synapse:syn48276655), and an example calculation on our baseline model (including the baseline model itself) is available [here](https://www.synapse.org/#!Synapse:syn49071631); please see the README.md file for the metric calculation results.
We use the HL p-value as a complementary metric to check whether the model you provide is well calibrated, i.e., whether the differences between observed and expected proportions are significant.
A low p-value indicates lack of fit (the model is not well calibrated). As you may have observed, most of the submitted models are not well calibrated despite impressive C-statistics.
We will have to evaluate this metric, but for now we will use it as a complementary metric, for example in the event of a tie. A small sketch of how such a p-value can arise is shown below.
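For illustration only, here is a minimal sketch of a Hosmer-Lemeshow-style calibration check, assuming binary event labels and predicted event probabilities grouped into risk deciles. This is not the challenge's scoring code (the authoritative implementation is in the link above), and the function name and toy data are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Sketch of the HL chi-square statistic and p-value.

    y_true : array of 0/1 observed events
    y_prob : array of predicted event probabilities
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # Sort subjects by predicted risk and split into roughly equal-sized groups
    order = np.argsort(y_prob)
    groups = np.array_split(order, n_groups)

    stat = 0.0
    for idx in groups:
        obs = y_true[idx].sum()    # observed events in the group
        exp = y_prob[idx].sum()    # expected events in the group
        n = len(idx)
        p_bar = exp / n
        denom = n * p_bar * (1.0 - p_bar)
        if denom > 0:              # guard against degenerate groups
            stat += (obs - exp) ** 2 / denom

    # The classical test uses n_groups - 2 degrees of freedom
    p_value = chi2.sf(stat, df=n_groups - 2)
    return stat, p_value


if __name__ == "__main__":
    # Toy example: a model that discriminates well but is mis-calibrated
    # (predictions are systematically too high) gives a tiny HL p-value.
    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.3, size=2000)
    p = np.clip(0.3 + 0.3 * y + rng.normal(0, 0.05, size=2000), 0.01, 0.99)
    print(hosmer_lemeshow(y, p))
```

This is why a submission can show an impressive C-statistic yet a very small HL p-value: discrimination only requires ranking subjects correctly, while the HL test compares observed and expected event counts within each risk group.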
Regarding your next question, we already mentioned that we did not remove the event and event_time columns from the test set; however, we may hide this information in the scoring set that we will use during the validation phase.
We are very sorry if the instructions were unclear.
We will come back to you soon with the details of the validation phase, so please stay tuned.
Best regards,
Pande