Dear organizer, Since the AUC has recently been introduced as the secondary metric, I wonder if it is possible to update the current leaderboard with the AUC results? It may help all participants see their leaderboard performance in terms of both the concordance index and the AUC. Thanks! Best, Jing

Created by Jing Tang jtjing
Dear @jtjing,

Great questions. Let's dig into the [computation of cumulative_dynamic_auc from sksurv](https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.metrics.cumulative_dynamic_auc.html#sksurv-metrics-cumulative-dynamic-auc):

$$\widehat{\mathrm{AUC}}(t) =\frac{\sum_{i=1}^n \sum_{j=1}^n I(y_j > t) I(y_i \leq t) \omega_i I(\hat{f}(\mathbf{x}_j) \leq \hat{f}(\mathbf{x}_i))}{(\sum_{i=1}^n I(y_i > t)) (\sum_{i=1}^n I(y_i \leq t) \omega_i)}$$

The inputs to this equation are:
1) $\hat{f}$: a predictive model (in this case, think of a participant's Docker container)
2) $x_i$: the input features of sample $i$ **in the validation set**
3) $y_i$: the ground-truth survival time of sample $i$ **in the validation set**
4) $\omega_i$: the inverse probability of censoring weight for sample $i$

Now, there is nothing contentious about 1, 2, and 3 - you'd expect to find those in any scoring function. I think your concern comes from the ambiguous $\omega_i$, so let's talk about that variable. The first thing I'd like to point out is that the vector of weights, $\omega$, is just a function of the training data on the Synapse website. **It is the same for all participants, regardless of what auxiliary training data they used.** In fact, either of us could have computed all of the weights $\omega$ back in January when the training data was first released.

Regarding your two points:
1) As you said, "a classifier should be evaluated on the same dataset where the classifier predicts". This function does just that - it compares $\hat{f}(x_i)$ to $y_i$, where $x_i$ and $y_i$ are both from the validation set.
2) The scoring harness always uses the same $\omega_i$, even if some teams used other datasets in their solution.

Does that make sense?

Best, Jacob
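For concreteness, here is a minimal numpy sketch of the $\widehat{\mathrm{AUC}}(t)$ estimator above. It is not the actual scoring harness; the function and variable names are illustrative, and it assumes the weights $\omega$ have already been computed from the training data as described above:

```python
import numpy as np

def auc_at_t(y, f_hat, omega, t):
    """Evaluate the AUC(t) formula above on validation survival times y,
    predicted risk scores f_hat (higher = higher risk), and IPCW weights omega."""
    is_case = y <= t      # I(y_i <= t): had the event by time t
    is_control = y > t    # I(y_j > t): still event-free at time t

    numerator = 0.0
    for i in np.where(is_case)[0]:
        # omega_i * (number of controls j with f_hat(x_j) <= f_hat(x_i))
        numerator += omega[i] * np.sum(f_hat[is_control] <= f_hat[i])

    denominator = is_control.sum() * np.sum(omega[is_case])
    return numerator / denominator
```

This only mirrors the formula as written, to show which inputs the estimator touches; for real scoring, use `cumulative_dynamic_auc` itself. Note that the weights come in as a fixed input, so swapping in a different training dataset for your own model does not change them.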
Dear @Jacoberts,

Thanks for the reply! It would be great if we could continue the discussion so that I can get a better understanding of the IPCW-based AUC. I have two follow-up questions:

1) Why can't we use the test dataset to undo censorship? I understand that the 'inverse probability of censorship weight' can also be estimated using the test dataset. In fact, based on H. Hung and C. T. Chiang (2010), a classifier should be evaluated on the same dataset where the classifier predicts (cf. Sections 4 and 5). For our case, using the test dataset to estimate the censoring probability seems more robust, as we would not need to worry about the assumption of homogeneity between the test and training datasets.

2) How can two methods be compared if they used **different training** datasets? For example, suppose team A utilized training dataset 1 (say 100 patients) to build a model and predict the test set, while team B utilized another training dataset 2 (say 200 patients) to build a model to predict the same test set. When calculating the AUC, which training dataset should be used to estimate the censoring probability? I am pretty sure that the IPCW-based AUC will be affected by the choice of training dataset, even when both teams predict identically on the test dataset. This issue is particularly relevant as our team utilized a larger cohort of beatAML patients to train the model. Note that the patients we used are all from beatAML, so they are expected to follow the same censoring distribution.

Best, Jing
Dear @jtjing,

I apologize for the delayed response. I recommend reading the [SAS documentation](https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_rmstreg_details13.htm&docsetVersion=15.1&locale=en) for how they calculate the inverse probability of censorship weight, $\omega_i$.

One way to think about it: we are reframing this problem as binary classification ("Did the patient survive after 1 year?"). We use AUC as the score for the binary classification problem, but to compute the AUC we need to "undo" censorship, in some sense. So we model censorship (we use a Kaplan-Meier estimator) to reweight the inputs. Now, we cannot use the test dataset to both undo censorship _and_ compute the AUC, so we need another dataset. The training dataset is independent, so we pull it in and use it to build the Kaplan-Meier curve of censorship probability (see the sketch after this message).

Note, I do not believe there is any problem with using the test dataset to train your model and in our computation of AUC. In other words, I think the only requirements on the dataset used to build the Kaplan-Meier curves for censorship are:

* independence from the test dataset
* following the same distribution as the test dataset
* all requirements necessary for the Kaplan-Meier model to hold, as laid out in the SAS documentation above

Hope that helps,
Jacob
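To make that reweighting step concrete, here is a minimal sketch (not the official scoring code) of how a censoring Kaplan-Meier curve fit on the training responses yields the weights $\omega_i$. It assumes scikit-survival's `CensoringDistributionEstimator` and uses made-up toy survival times:

```python
import numpy as np
from sksurv.nonparametric import CensoringDistributionEstimator
from sksurv.util import Surv

# Toy training responses (event indicator and survival time in days).
train_event = np.array([True, False, True, True, False, True])
train_time = np.array([120.0, 200.0, 250.0, 400.0, 540.0, 700.0])
y_train = Surv.from_arrays(event=train_event, time=train_time)

# Kaplan-Meier estimate of the censoring distribution G(t),
# fit on the training data only.
censoring = CensoringDistributionEstimator().fit(y_train)

# For a validation sample that had an event at time y_i <= t,
# its weight is omega_i = 1 / G(y_i). No participant's model,
# and no auxiliary training data, enters this computation.
y_i = 250.0
omega_i = 1.0 / censoring.predict_proba(np.array([y_i]))[0]
print(omega_i)
```

Because the curve is fit on the shared training responses released on Synapse, every team's submission is scored with the same set of weights.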
Dear @jtjing, Your questions may be better answered by one of the Challenge Organizers, @Jacoberts. We'll get back to you soon! Verena
Dear Verena,

Thanks for the reply. It seems that the cumulative_dynamic_auc is computed as described in https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.metrics.cumulative_dynamic_auc.html. According to the documentation, the function needs the survival times from the training data to estimate the censoring distribution. I am not sure if I understand it correctly, but why is the training data needed to estimate the censoring distribution?

A related question is that the AUC seems to depend on the choice of training data. I wonder what happens if we utilized more data than what was provided by the organizer, and whether such a 'modified' training dataset affects the AUC result. This might be a concern particularly for our prediction method, which utilizes additional AML patients in the training data.

Thanks for the help.

Best, Jing
@jtjing, Apologies for the delay; here is the evaluation code for SC2 AUC:

```python
import lifelines
import pandas
from sksurv.metrics import cumulative_dynamic_auc
from sksurv.util import Surv


def responseToSurvivalMatrix(response):
    """Converts a response.csv to a survival matrix expected by scikit-survival."""
    return Surv.from_dataframe('vitalStatus', 'overallSurvival', pandas.concat([
        (response.vitalStatus == 'Dead'), response.overallSurvival
    ], axis=1))


# Join the ground-truth leaderboard responses with a team's predictions by lab_id.
predictions = (
    pandas.read_csv('leaderboard_response.csv')
    .set_index('lab_id')
    .join(
        pandas.read_csv('my_predictions.csv')
        .set_index('lab_id')
    ))

DAYS = 365

# The training responses are used to estimate the censoring distribution (IPCW).
trainingdata = pandas.read_csv("training_response.csv")

# Predictions are survival times, so they are negated to give risk scores
# (cumulative_dynamic_auc expects higher values to mean higher risk).
auc = cumulative_dynamic_auc(
    responseToSurvivalMatrix(trainingdata),
    responseToSurvivalMatrix(predictions[['vitalStatus', 'overallSurvival']]),
    -predictions.prediction.to_numpy(),
    [DAYS]
)[0][0]
print(auc)
```

Best, Verena
Dear @v.chung, Thanks for the reply. If so, could you make the AUC evaluation code available? That would help us evaluate our predictions more accurately. Best, Jing
Dear @jtjing, That's a great suggestion! Unfortunately, AUC was not calculated during the Leaderboard Phase dates (1/6/2020 - 3/2/2020) so it will not be available to show on the current [leaderboard page](https://www.synapse.org/#!Synapse:syn20940518/wiki/600158). Best, Verena
