Dear Organizers,
I have an observation and a question regarding subchallenge 2. I felt it would be fairest to raise it publicly here on the forum.
Until very recently all submissions scored below 0.6, but then a submission arrived with an extremely high score (>0.9 AUC for both subtasks). As we have a limited amount of data we cannot really trust our local validation, but this was far better than any of our local validation scores, and to be honest it seemed too good to be true. Classifying preterm birth with such accuracy from the transcriptome?
After some investigation, it seems to us that a sampling bias may be the reason behind such high scores.
For example, patients with more measurements are more likely to have sPTD. This might make sense: women with a known higher risk of complications (e.g. >35 years) may visit the doctor more frequently. Also, the gestational age (GA) at the last visit correlates with the class labels.
Currently, our team is in second place on the leaderboard. We did not use any gene expression data for that submission, only the number of measurements for each person and the GA at the last measurement.
We deliberately created that submission to check our theory, which seems to be confirmed. But by training a sufficiently complex model, one could reach the same score without any intention of 'hacking the system', since the GA and all the measurement points are available to the model.
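To make the check concrete, here is a minimal sketch of how the two sampling features (number of measurements and GA at the last measurement) could be derived per patient and scored with AUC. The function names, the `(patient_id, ga_weeks)` input layout, and the toy data are assumptions made for illustration only; this is not our actual submission code and the numbers are not from the challenge data.

```python
from collections import defaultdict


def sampling_features(samples):
    """Map each patient to (number of samples, GA at the last sample).

    `samples` is an iterable of (patient_id, ga_weeks) pairs (assumed layout).
    """
    by_patient = defaultdict(list)
    for patient_id, ga_weeks in samples:
        by_patient[patient_id].append(ga_weeks)
    return {pid: (len(gas), max(gas)) for pid, gas in by_patient.items()}


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data: (patient_id, GA in weeks at blood draw).
toy_samples = [
    ("A", 18.0), ("A", 24.5), ("A", 30.0),
    ("B", 20.0),
    ("C", 19.0), ("C", 27.0), ("C", 31.5),
    ("D", 22.0), ("D", 26.0),
]
toy_labels = {"A": 1, "B": 0, "C": 1, "D": 0}  # 1 = case, 0 = control (made up)

features = sampling_features(toy_samples)
patient_ids = sorted(features)
y = [toy_labels[p] for p in patient_ids]
count_auc = auc([features[p][0] for p in patient_ids], y)
last_ga_auc = auc([features[p][1] for p in patient_ids], y)
print(count_auc, last_ga_auc)
```

If features like these alone give a high AUC on the real training set, the sampling design itself carries label information, independent of any gene expression signal.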
I know that this information will soon fill the leaderboard with 0.9 scores, but the main point of this challenge is scientific progress, not the leaderboard.
I feel this fact may change the objective of subchallenge 2 significantly. So I am turning to the organizers: what do you advise?
Best regards,
Balint Armin Pataki
Created by Balint Armin Pataki patbaa

We did not state explicitly in the challenge rules that gestational age at blood draw cannot be used
in conjunction with gene expression data.
The GA data can be used to define the sequence of expression profiles within each subject, or to
extract temporal gene expression features (e.g. slope in expression).
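
As an illustration of such a temporal feature, the sketch below computes the least-squares slope of one gene's expression against GA within a single subject. The function name and input layout are hypothetical, and the example values are made up; other ways of defining a slope would be equally valid.

```python
def expression_slope(ga_weeks, expression):
    """Least-squares slope of expression versus GA for one subject and one gene.

    Returns None when the slope is undefined (fewer than two samples, or all
    samples taken at the same GA).
    """
    n = len(ga_weeks)
    if n != len(expression):
        raise ValueError("ga_weeks and expression must have the same length")
    if n < 2:
        return None
    mean_ga = sum(ga_weeks) / n
    mean_expr = sum(expression) / n
    sxx = sum((g - mean_ga) ** 2 for g in ga_weeks)
    if sxx == 0:
        return None
    sxy = sum((g - mean_ga) * (e - mean_expr)
              for g, e in zip(ga_weeks, expression))
    return sxy / sxx


# Made-up example: three blood draws from one subject.
slope = expression_slope([20.0, 25.0, 30.0], [1.0, 2.0, 3.0])
print(slope)  # expression units per week of gestation
```

Here GA only orders and spaces the samples; the resulting feature is a property of the expression trajectory rather than the GA value itself.
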
However, using GA explicitly as a predictor should be avoided, since the resulting model may not
generalize to instances where the GA distributions are the same between cases and controls in the test set.
Such scenarios will be considered in a post-challenge phase.

@bcbuprb
I'm still not clear whether we need to build the classification model using only the gene expression data, or whether it is okay to use other features of the training dataset in combination with gene expression.
Can you please clarify this?

Dear Adi,
Thank you very much for the quick and detailed reply. I fully agree that we should build a model that has predictive value in real-life usage; that is why I made this comment. I wanted to point out that someone who uses 100 genes plus the latest GA (which makes sense, as that one is the closest to the birth) might get unrealistically good results because of the presence of GA. Until now it was not clear to me whether using GA was allowed (in real-life usage, that information is available). But if we can use only the gene expression data, this sampling bias will most probably not appear.
Working on such a model... :)
Regards,
Balint

Dear Balint,
Thank you for bringing up the issue of possible sampling bias, and the use of properties of the distribution of gestational age at blood draw as predictors.
I would start by saying that, according to the stated goal of the challenge, teams are expected to use gene expression data to make the predictions, and hence submissions/predictions that are not based on gene expression data will be deemed invalid regardless of the prediction performance metrics.
We do know that gene expression changes with gestational age at blood draw, and this is why we made a reasonable attempt to remove the effect of gestational age from gene expression values in the preprocessing step. This will not deal perfectly with possible confounding between the GA value distribution and clinical groups, but if one uses the entire training set provided to build a model, potential information leak due to GA at sampling should be minimal. This can also be confirmed in a post-challenge analysis where only two time points may be used from both cohorts (similar to the design of the GSE59491 set).
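For readers unfamiliar with the idea, one simple way to remove a GA effect from a gene's expression values is to fit a straight line against GA and keep the residuals. The sketch below shows only that generic idea; the linear model, the function name, and the toy values are assumptions for illustration, and the preprocessing actually applied to the challenge data may differ.

```python
def remove_linear_ga_effect(ga_weeks, expression):
    """Return the residuals of expression after a least-squares linear fit on GA.

    Illustrative assumption: the GA effect is modelled as a straight line.
    """
    n = len(ga_weeks)
    if n == 0 or n != len(expression):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_ga = sum(ga_weeks) / n
    mean_expr = sum(expression) / n
    sxx = sum((g - mean_ga) ** 2 for g in ga_weeks)
    # With no spread in GA there is no trend to remove; only center the values.
    slope = 0.0 if sxx == 0 else sum(
        (g - mean_ga) * (e - mean_expr) for g, e in zip(ga_weeks, expression)
    ) / sxx
    return [e - mean_expr - slope * (g - mean_ga)
            for g, e in zip(ga_weeks, expression)]


# Made-up values for one gene across three samples.
ga = [20.0, 25.0, 30.0]
residuals = remove_linear_ga_effect(ga, [1.0, 2.0, 4.0])
print(residuals)
```

By construction the residuals have zero mean and no linear association with GA, but any difference in the GA distribution between clinical groups is still present in the sampling design itself.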
We and you have devoted resources to organize and participate in this challenge to learn something about preterm birth that is translatable across cohorts and can help make a difference. Therefore, as organizers, we retain the right to award the teams that help us in this goal by using a sound analysis approach.
@james.costello @bcbuprb