Hi @trberg and challenge teams,
Based on [this thread](https://www.synapse.org/#!Synapse:syn33576900/discussion/threadId=9864), the quantitative evaluation of models is going to be based in part on a random split of the training data. This is a major problem for this particular dataset: evaluation based on such a random split will substantially benefit models that overfit to a strong confounding structure in the data, created by severe case/control imbalances across many of the hospitals. This structure, which models can pick up unintentionally (unless explicitly guarded against), is consistent across a random split, so the top-performing models in terms of AUC will almost surely generalize worse to real-world data outside this challenge than some of the other models.
Below, I provide a detailed explanation of the problem. I also propose a way to overcome it in order to allow an objective evaluation that better reflects how the models will perform if widely used in the clinic. **Putting any weight on team performance on a random split of the training data is unfair towards anyone who made an effort to submit a model that generalizes beyond the training-data distribution (i.e., anyone who tried to meet the official goal of the challenge).**
@trberg, please advise how you are going to handle this.
In more detail:
(1) Looking at the fraction of control individuals (i.e., non-long-COVID cases) in each hospital in the data (based on data_partner_id) reveals that many hospitals have only controls, or almost only controls. For example, here are the top 3 hospitals by number of patients, all of which are 99-100% controls (a sketch for computing these numbers follows the table):
| data_partner_id | # patients | # controls | frac controls |
|---|---|---|---|
| 888 | 11372 | 11316 | 0.9951 |
| 124 | 6988 | 6988 | 1.0000 |
| 850 | 3857 | 3824 | 0.9914 |
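A minimal sketch of one way to reproduce these per-hospital numbers, assuming a pandas DataFrame `train_df` with the `data_partner_id` column and a binary label column that I call `long_covid` here (1 = case, 0 = control; that column name is an assumption, not the official schema):

```python
import pandas as pd

def control_fractions(df: pd.DataFrame) -> pd.DataFrame:
    """Per-hospital patient counts and control fractions (control = long_covid == 0)."""
    summary = (
        df.assign(is_control=lambda d: (d["long_covid"] == 0).astype(int))
          .groupby("data_partner_id")
          .agg(n_patients=("is_control", "size"),
               n_controls=("is_control", "sum"))
    )
    summary["frac_controls"] = summary["n_controls"] / summary["n_patients"]
    return summary.sort_values("n_patients", ascending=False)

# control_fractions(train_df).head(3) reproduces the three rows in the table above.
```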
(2) If all patients from hospital X are controls (non long COVID) then after a random split of the data all patients from hospital X are controls in both the train and test splits. Now, consider the following trivial model: if a patient is from hospital X then predict control, otherwise guess. This model will do better than a pure random model on the test split because it will never make mistakes with patients from hospital X. Yet, clearly, if applied to real-world data outside the competition, this model would be useless. (In fact, it would also be useless for hospital X since in reality there will be long COVID cases in that hospital.)
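To make the trivial model concrete, here is a minimal sketch (the scoring convention and the `hospital_x_id` argument are hypothetical; the point is only that the rule uses no clinical information at all):

```python
import numpy as np

def trivial_hospital_rule(partner_ids: np.ndarray, hospital_x_id, rng=None) -> np.ndarray:
    """Score P(case): 0 for patients from hospital X, a random guess for everyone else."""
    if rng is None:
        rng = np.random.default_rng(0)
    guesses = rng.uniform(size=len(partner_ids))  # uninformative scores for other hospitals
    return np.where(partner_ids == hospital_x_id, 0.0, guesses)
```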
(3) This problem is very severe in the challenge data. In fact, most of the patients in the censored training data come from hospitals that collected only (or almost only) controls: 52% of the patients were collected by hospitals in which more than 99% of the collected patients are controls. Since a random split was used to create the test split, these are also roughly the numbers in the test split. Extending the trivial model for hospital X above to all hospitals in the data (i.e., always predict control or always predict case for a given hospital, depending on that hospital's fraction of controls) achieves an AUC of 0.65 on a random test split, compared with 0.5 for a random model. In other words, there is a strong confounding effect due to the severe case/control hospital imbalances that is systematic in this dataset (but not in the general population) and present in both the train and test splits. As a result, under the current test split the best-scoring models will be those that combine these confounding effects with true risk factors, rather than models that rely only on real risk factors (a sketch of this baseline follows the numbers below):
| Metric | random | always predict control | based on majority in partner id |
|---|---|---|---|
| ROC AUC | 0.494 | 0.500 | **0.652** |
| Average precision (AP) | 0.162 | 0.164 | **0.326** |
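A minimal sketch of a per-hospital-majority baseline of this kind (not necessarily the exact code behind the numbers above; `train_df`/`test_df` stand for the random train/test splits and `long_covid` is again an assumed name for the binary label):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

def partner_majority_scores(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Predict the majority class of each hospital as observed in the train split."""
    majority = (train.groupby("data_partner_id")["long_covid"].mean() >= 0.5).astype(float)
    fallback = float(train["long_covid"].mean() >= 0.5)  # for hospitals unseen in train
    return test["data_partner_id"].map(majority).fillna(fallback)

scores = partner_majority_scores(train_df, test_df)
print("ROC AUC:", roc_auc_score(test_df["long_covid"], scores))
print("AP:     ", average_precision_score(test_df["long_covid"], scores))
```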
(4) This is a major issue because, based on a random split, there will be no obvious way to tell which models boost their AUC through this confounding effect. The reason is that any standard machine learning model (and even more so high-capacity models) can capture the confounding effect by overfitting through combinations of legitimate features, so there is no need to explicitly include a patient's hospital assignment as a feature for the effect to be captured. The confounding effect can be accounted for during model fitting in various ways, but those who did not account for it will greatly benefit from an evaluation based on a randomly split test set. Such models are expected to be among the best in terms of AUC and other metrics on a random test split even if they are among the worst-performing models on data outside this competition. In fact, in the likely case that there are additional confounders specific to cases or controls in particular hospitals, the 0.65 AUC I reported above will likely go up to 0.7-0.8 by relying solely on confounding effects; adding true risk factors on top of these effects is why you will likely see models with AUC 0.9-1 on the random test split but relatively poor performance on out-of-train data.
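One simple diagnostic for this, as a sketch under assumed `X`, `y`, and `groups` (the data_partner_id of each patient) arrays, with an arbitrary classifier standing in for any submitted model: compare cross-validated AUC under a random stratified split with AUC under a grouped split that keeps each hospital entirely in train or in test. A large gap suggests the model leans on the hospital structure rather than on true risk factors.

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold, GroupKFold
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()  # placeholder for any submitted model

# Random split: hospitals are mixed across folds, so the confounding "helps".
auc_random = cross_val_score(
    model, X, y, scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
).mean()

# Grouped split: each data_partner_id is held out entirely, removing the leakage.
auc_grouped = cross_val_score(
    model, X, y, groups=groups, scoring="roc_auc", cv=GroupKFold(n_splits=5)
).mean()

print(f"random-split AUC: {auc_random:.3f}, hospital-held-out AUC: {auc_grouped:.3f}")
```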
(5) In summary, using a random split for testing will reward models that perform well on the test set due to overfitting, and it will penalize models that do not overfit and therefore perform worse on the random test split but can probably do better in real life. To reduce the risk of rewarding models that will eventually perform poorly in the clinic, we must be careful with how the test data is constructed. The ideal test set would be prospective data from hospitals that were not included in the training data. I understand that such data may not be available to you, in which case the second-best test set would be either data from hospitals that were not included in the training data (i.e., even if the data is not prospective) or prospective data (which, from what I understand, you have or will have). The latter option of prospective data is good provided that it does not present the same severe imbalances as the training data: the prospective data should be constructed so that the ratio between cases and controls in every hospital in the test data is 19:81 (or whatever ratio you think corresponds to the prevalence in the population).
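If the held-out-hospitals option is considered, here is a minimal sketch of how such a split could be constructed (assuming a DataFrame `df` with the `data_partner_id` column); a group-wise split guarantees that no hospital contributes patients to both train and test:

```python
from sklearn.model_selection import GroupShuffleSplit

# Hold out ~20% of the data by hospital rather than by patient.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["data_partner_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no hospital appears in both splits.
assert set(train_df["data_partner_id"]).isdisjoint(set(test_df["data_partner_id"]))
```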
Elior Rahmani