Dear challenge organizer,
I am not sure about the relationship between the primary metric and the basal metric. Could you please confirm whether I understand this statement correctly: "instead of receiving their exact scores, participants will receive an estimate of their score computed by taking the mean of 10 bootstrapped samples"?
For example, in subchallenge 3 I noticed that the **Estimated DSS AUC of BOR** is not calculated directly from the reported values of **Estimated AUC of BOR in Nivo** and **Estimated AUC of BOR in Chemo**.
If I understand correctly, for each submitted prediction, 10 pairs of estimates for nivo_auc and chemo_auc are generated (one pair per bootstrapped sample).
The DSS is then calculated for each pair, and the 10 DSS values are used to form the interval. For each submission, the other three values reported on the leaderboard are either the mean or the median of the 10 estimates.
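To be concrete, here is a minimal sketch of the procedure as I understand it. The `scale()` and `dss()` definitions below are my own assumptions, not the official scoring code, and the data is fake, just to exercise the sketch:

```python
# A minimal sketch of my understanding (NOT the official scoring code).
# Assumptions: scale(x) = 2*(x - 0.5) maps an AUC from [0, 1] to [-1, 1],
# and DSS = scale(auc_nivo)**2 - scale(auc_chemo)**2.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def scale(x):
    return 2.0 * (x - 0.5)

def dss(auc_nivo, auc_chemo):
    return scale(auc_nivo) ** 2 - scale(auc_chemo) ** 2

def bootstrap_scores(y_nivo, p_nivo, y_chemo, p_chemo, n_boot=10):
    """One (nivo_auc, chemo_auc, DSS) triple per bootstrapped sample."""
    rows = []
    for _ in range(n_boot):
        i = rng.integers(0, len(y_nivo), len(y_nivo))    # resample nivo arm
        j = rng.integers(0, len(y_chemo), len(y_chemo))  # resample chemo arm
        a_n = roc_auc_score(y_nivo[i], p_nivo[i])
        a_c = roc_auc_score(y_chemo[j], p_chemo[j])
        rows.append((a_n, a_c, dss(a_n, a_c)))
    return np.array(rows)

# Fake labels/predictions, only to make the sketch runnable.
y_n, y_c = rng.integers(0, 2, 100), rng.integers(0, 2, 100)
p_n, p_c = rng.random(100), rng.random(100)

scores = bootstrap_scores(y_n, p_n, y_c, p_c)
est_nivo, est_chemo, est_dss = scores.mean(axis=0)  # leaderboard estimates
q1, q3 = np.percentile(scores[:, 2], [25, 75])      # reported Q1/Q3 of DSS
```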
Thanks in advance!
Created by Wenyu Wang (@WenyuWang)

Dear @Michael.Mason,
A quick question/comment on the primary metric. According to the formula you have used, primary metric = scale(BM_nivo_SC)^2 - scale(BM_chemo_SC)^2, the DSS penalizes any model that is predictive at all in the chemotherapy arm. However, there are a number of cases among the leaders for sub-challenge 1, as well as the TMB baseline model included below, where the predictive scores point in opposite directions: the model predicts Nivolumab resistance and at the same time predicts chemotherapy sensitivity. In my thinking these scores should be given a boost rather than a penalty.
Clinically this is what would be most useful: defining a subset of patients who respond to treatment A rather than treatment B, and vice versa. If Kaplan-Meier survival curves were drawn based on the model's predictions (i.e., patients grouped into 'predicted responders' vs. 'predicted non-responders', with KM curves plotted separately for the chemo and nivo arms), you would expect the curves to show even greater separation in such cases.
TMB_Baseline_Model (DSS C-index, the primary metric): 0.1386
C-index PFS Nivo: 0.6994
C-index PFS Chemo: 0.4148
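For concreteness, here is the arithmetic with those numbers under an assumed scaling of scale(x) = 2(x - 0.5) (the official scale() may differ); the squaring erases the direction of the chemo signal:

```python
# Illustrative only: assumes scale(x) = 2*(x - 0.5); the official scoring
# code may scale differently.
def scale(x):
    return 2.0 * (x - 0.5)

nivo, chemo = 0.6994, 0.4148
print(round(scale(nivo) ** 2 - scale(chemo) ** 2, 4))  # 0.13, near the reported 0.1386

# A chemo C-index of 0.4148 (predictive, but in the opposite direction) is
# penalized exactly like one of 0.5852 -- squaring drops the sign:
print(round(scale(0.4148) ** 2, 6), round(scale(0.5852) ** 2, 6))  # both 0.029036
```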
Best,
Jacob

Dear @oski,
I have looked into this submission specifically. Essentially, this submission scored consistently around 0.4 (1 - 0.4 = 0.6) in the Nivo arm. However, the model's predictions were inconsistent in the Chemo arm, with bootstrapped samples yielding values near 0.65, 0.45 and 0.55, and one outlier at 0.096. This outlier chemo AUC produced an outlier DSS of -0.60792, which skewed your distribution of bootstrapped DSS values to be non-Gaussian and resulted in your estimated DSS landing just outside your [Q1, Q3].
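A toy illustration of that effect (made-up numbers): a single outlier can pull the bootstrap mean outside [Q1, Q3] even though the other nine values sit comfortably inside it:

```python
# Made-up bootstrapped DSS values with one outlier, showing the mean
# landing below Q1 when the distribution is non-Gaussian.
import numpy as np

boot_dss = np.array([0.05, 0.04, 0.03, 0.02, 0.01, 0.00,
                     -0.01, -0.02, -0.03, -0.608])  # one outlier

mean = boot_dss.mean()                      # -0.0518
q1, q3 = np.percentile(boot_dss, [25, 75])  # about -0.0175 and 0.0275
print(mean, (q1, q3), q1 <= mean <= q3)     # mean falls below Q1 -> False
```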
We are beginning the process of determining top performers now. We will likely use >1K bootstraps, so this situation should be mitigated a bit.
I hope this makes sense.
Regards,
Mike
Dear @Michael.Mason,
Following up on the interpretation of the primary metric, in particular for SC3: we are struggling to understand how the **Estimated DSS AUC of BOR** can fall outside the [Q1, Q3] range, and also why it can be negative even though we predicted nivo better than chemo.
Take submission 9710965 as an example: the AUC_Nivo is 0.4188 (equivalent to 1 - 0.4188 = 0.5812) and the AUC_chemo is 0.4833 (equivalent to 0.5167), while the DSS = -0.0571 with [Q1, Q3] = [-0.032, 0.05]. Could you help us interpret this?
Many thanks,
Oscar

Awesome, thank you!

Hi @adamklie,
The default ranking *should* be by performance on the primary metric, but Synapse tries to remember "where" you were last, meaning that if you sorted on a given column it may sort by that column when you come back. That said, it does look a little inconsistent in this case, so we are looking into it. And yes, -0.01 is better than -0.1: the more negative the DSS, the more it implies the model is better at predicting response to chemo than to nivo.
Kind regards,
Mike

@Michael.Mason,
I guess I was just wondering about this since it is sorted the way I described by default. I wanted to make sure I understand the rankings. To clarify, is a DSS of -0.01 better than a DSS of -0.1?
Dear @adamklie ,
You should be able to sort the leaderboard as you like. If you happened to click on a column and inadvertently sort it by that field, it may appear odd but you can change it to what you prefer.
Kind regards,
Mike

Piggybacking off of this, could someone explain how the leaderboard is currently sorted? It appears to be sorted in descending numerical order for positive-DSS submissions, but then flips to ascending numerical order for negative-DSS submissions. I would have thought it would be descending all the way.

Dear @jtjing,
Unfortunately, I cannot reveal too much about your team's specific score, but you can imagine a scenario where the bootstrapped basal metrics in the nivo arm and/or the chemo arm have high variability and affect the individual DSS calculations, resulting in a subpar primary metric even though the average metric in the nivo or chemo arm looks OK.
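As a toy illustration (made-up numbers, with an assumed scale()/DSS definition, not our actual scoring code): high bootstrap variability in the chemo arm alone can make the mean of the per-sample DSS values much smaller than the DSS implied by the mean AUCs:

```python
# Made-up bootstrap AUCs: the nivo arm is stable (~0.40) while the chemo
# arm swings around 0.50. Assumed: scale(x) = 2*(x - 0.5),
# DSS = scale(auc_nivo)**2 - scale(auc_chemo)**2.
import numpy as np

def scale(x):
    return 2.0 * (x - 0.5)

def dss(a_nivo, a_chemo):
    return scale(a_nivo) ** 2 - scale(a_chemo) ** 2

nivo  = np.array([0.40, 0.39, 0.41, 0.38, 0.42, 0.39, 0.40, 0.41, 0.38, 0.37])
chemo = np.array([0.65, 0.35, 0.60, 0.40, 0.55, 0.45, 0.62, 0.38, 0.58, 0.48])

print(round(dss(nivo.mean(), chemo.mean()), 4))  # 0.044: DSS of the mean AUCs
print(round(dss(nivo, chemo).mean(), 4))         # 0.0028: mean of per-sample DSSs
```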
Does this make sense?

Dear @Michael.Mason,
Thanks very much for the explanation.
We are still puzzled about our submission 9710641: the AUC_nivo is 0.3946 (equivalent to 1 - 0.3946 = 0.6054) and the AUC_chemo is 0.5062, while the DSS = 0.0118. By AUC our performance is quite close to the top 2 at the moment, but the DSS is much lower than expected, even considering the squaring effect. We wonder whether the limited number of bootstrapped samples causes high variance in the reported scores.
Best,
Jing
Dear @WenyuWang ,
Thanks for double-checking this. First, a minor detail: we are not providing a confidence interval but rather the Q1 and Q3 of the estimates, i.e., the bounds of the IQR. The reason some DSS values look marginal is that we take the square after scaling each basal metric, when it might have been more intuitive to take the absolute value after scaling. Squaring any value between -1 and 1 drives it towards 0, causing some decent scores to look marginal when they may not be. This is not an issue for the basal metrics themselves.
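A tiny numeric illustration with hypothetical scaled values:

```python
# Hypothetical scaled basal metrics in (-1, 1): squaring shrinks them toward
# 0, so a decent difference can look marginal on the leaderboard.
scaled_nivo, scaled_chemo = 0.21, 0.02

print(scaled_nivo ** 2 - scaled_chemo ** 2)  # ~0.0437 -> looks marginal
print(abs(scaled_nivo) - abs(scaled_chemo))  # ~0.19   -> looks decent
```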
Hope this helps,
Mike

Hi @vchung,
We are still wondering why some submissions have a much smaller DSS than expected given the mean AUCs.
Take submission 9710641 as an example: the mean AUC in the Chemo arm is quite close to 0.5 and the mean in the Nivo arm is less than 0.4 (equivalently, more than 0.6), yet the mean DSS is marginal.
We are not sure whether this is because only 10 bootstrapped samples are used, or whether there is another specific reason. Would it be possible to get access to the scoring code, so we can make sure we have not misunderstood the scoring?
Thanks!

Hi @WenyuWang,
Your interpretation is correct! For each metric, the final score returned to you is the mean of 10 bootstrapped values.