Dear DREAM Preterm Birth Prediction Challenge participants,
We are grateful for your effort in addressing the two questions we put forward in this crowdsourcing project, which is based on maternal blood gene expression data: (sub-challenge 1) how to predict gestational age in normal and complicated pregnancies, and (sub-challenge 2) how to identify women at risk of preterm birth.
For sub-challenge 1, the prediction performance obtained on the test set reflected the information embedded in the gene expression data and the teams' skill in leveraging that information to predict gestational age. Therefore, for sub-challenge 1 we could confidently rank and award the teams based on the prediction performance displayed in the leaderboard.
Unfortunately, for sub-challenge 2, the prediction performance displayed in the leaderboard reflects not only information from the gene expression data but also differences between patient groups in the distribution of gestational age (GA) at sampling. Although such differences were expected to be minimal when the full training set is used for model development, they are substantial when only the PRB_HTA subset of the training set is used. Since the use of the full training set was not required by the challenge rules, and would have been difficult to enforce, teams that intentionally or unintentionally relied on differences in the GA distribution had an advantage in prediction performance but did not necessarily help us find a better solution to our research question. This is because, in any practical application, the GA at which the biomarkers are measured will not be informative of the risk of preterm birth.
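To illustrate the concern, here is a minimal simulated sketch, not based on challenge data. It assumes hypothetical uniform distributions for GA at blood draw and a hypothetical shift of four weeks between groups, and it scores patients by GA alone, with no gene expression involved. The function name `auc` and all numeric values are assumptions chosen for illustration.

```python
import random

def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive case
    receives the higher score; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Scenario A (hypothetical): both groups are sampled over the same GA range (weeks).
control_same = [rng.uniform(20, 34) for _ in range(200)]
preterm_same = [rng.uniform(20, 34) for _ in range(200)]

# Scenario B (hypothetical): the preterm group is sampled about 4 weeks earlier.
control_shift = [rng.uniform(24, 38) for _ in range(200)]
preterm_shift = [rng.uniform(20, 34) for _ in range(200)]

# The "model" uses GA only: an earlier blood draw gives a higher risk score.
auc_same = auc([-g for g in preterm_same], [-g for g in control_same])
auc_shift = auc([-g for g in preterm_shift], [-g for g in control_shift])

print(f"AUC from GA alone, same sampling distribution:    {auc_same:.2f}")
print(f"AUC from GA alone, shifted sampling distribution: {auc_shift:.2f}")
```

In this toy setting, GA alone performs close to chance when both groups share a sampling distribution and clearly above chance when the distributions differ, even though GA at sampling carries no biological signal here.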
Therefore, we have decided not to award any team based on the prediction performance displayed in the current leaderboard. Instead, we ask for your support in deriving, in a post-challenge phase, new prediction performance metrics that are both fair to the teams and relevant to the research question.
Briefly, we propose to take the analysis scripts that you are expected to provide (per challenge rules) and to train and test the resulting models under several scenarios in which the training and test sets do not feature differences in the GA sampling distributions.
The detailed steps of this proposal are below:
1. Per initial challenge rules, teams will provide a write-up of their approach and upload the analysis scripts (Deadline Jan 5th 2020, a new write-up submission queue will be created for this sub-challenge). These write-ups will not be made public at this point; they will be shared only with the organizers. The analysis scripts expected from each team will implement either the same approach used to generate the predictions submitted to the leaderboard or a new approach, as long as it is designed to use two or more longitudinal gene expression profiles per patient to make the required class predictions. Teams are free to use gene expression data and the GA at blood draw in any way, as long as at least one and at most 100 genes are involved.
2. Using participant-provided scripts, the organizers will train and test the algorithms on several training and test datasets. Teams may be asked for additional details needed to run their algorithms.
3. A new leaderboard will be disclosed on Feb 5 and the top 3 teams will be invited to present their approach at the RECOMB 2020 meeting in Italy.
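As a rough illustration of what a dataset without GA sampling differences could look like, the sketch below subsamples patients so that each GA bin holds the same number of samples from each class. This is only one possible approach under stated assumptions and not necessarily the procedure the organizers will use. The function name `match_ga_distribution`, the two-week bin width, and the sample tuples are all hypothetical.

```python
import random
from collections import defaultdict

def match_ga_distribution(samples, bin_width=2.0, seed=0):
    """Subsample so every GA bin has equal counts of both classes.

    samples: list of (sample_id, ga_weeks, label) tuples, label in {0, 1}.
    Returns the retained tuples, ordered by GA bin.
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: {0: [], 1: []})
    for sample in samples:
        _, ga_weeks, label = sample
        bins[int(ga_weeks // bin_width)][label].append(sample)

    matched = []
    for key in sorted(bins):
        groups = bins[key]
        # Keep as many samples per class as the smaller class provides.
        k = min(len(groups[0]), len(groups[1]))
        for label in (0, 1):
            matched.extend(rng.sample(groups[label], k))
    return matched

# Hypothetical samples: (id, GA at blood draw in weeks, 1 = preterm, 0 = control).
samples = [
    ("c1", 20.5, 0), ("c2", 21.0, 0), ("c3", 30.2, 0),
    ("p1", 20.1, 1), ("p2", 30.0, 1), ("p3", 31.5, 1), ("p4", 26.0, 1),
]
matched = match_ga_distribution(samples)
print([s[0] for s in matched])
```

Exact matching within bins discards samples from bins where one class is missing, so the bin width trades off how closely the GA distributions agree against how many samples are retained.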
To address any questions that you may have, we plan to organize a second webinar in the next few days.
Additional details regarding the submission of the analysis scripts will be posted on the wiki page.
Thank you all for your participation and continued support,
The DREAM Preterm Birth Challenge Organizers