Hello,
I had a question about the availability of metadata (the Sample_annotation.csv file) for the leaderboard/test datasets: can it be used by the submitted algorithm to perform pre-processing steps on unseen data before making predictions?
Thanks.
Created by TUSHAR PATEL (tpatel)

Hi Tushar,
Thank you for your detailed information. I'd like to provide further clarification:
The leaderboard dataset is a subset of samples from the same studies that generated the training data. Consequently, it shares the same "batches" or sources of variation present in the training data. The test data (protected/private), on the other hand, was generated at a single center using the 850K array on a rather homogeneous population.
For your modeling strategy, if you plan to correct for batch effects, I recommend estimating the batches directly from the data (matrices of beta values). Such an approach could then be directly applied to leaderboard or test beta matrices. For real-world applications, however, a model robust to different preprocessing methods or one that can even work on raw data is preferred in my opinion. See the paper by Lee et al., 2019, for an example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6628997/.
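To illustrate the idea of estimating batches directly from a beta-value matrix, here is a minimal sketch in Python. All function names and parameter choices below are my own illustrative assumptions, not part of the challenge infrastructure: it clusters samples in PCA space as a rough proxy for unknown batch labels, then removes each estimated batch's mean shift. Real pipelines typically use dedicated methods such as SVA or ComBat; the point is only that every step depends on the beta matrix alone, so the same procedure can be reapplied to the leaderboard or test matrices without any metadata.

```python
# Illustrative sketch (assumption, not the challenge's pipeline):
# estimate latent "batches" from a beta matrix (samples x CpGs) and
# apply a simple mean-centering correction per estimated batch.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def estimate_batches(beta, n_components=10, n_batches=3, seed=0):
    """Cluster samples in PCA space as a crude proxy for batch labels."""
    scores = PCA(n_components=n_components, random_state=seed).fit_transform(beta)
    return KMeans(n_clusters=n_batches, n_init=10, random_state=seed).fit_predict(scores)

def center_by_batch(beta, batches):
    """Shift each estimated batch so its mean matches the global mean."""
    out = beta.copy()
    global_mean = beta.mean(axis=0)
    for b in np.unique(batches):
        mask = batches == b
        out[mask] = out[mask] - out[mask].mean(axis=0) + global_mean
    return out

# The same two calls work on training data and on unseen beta matrices.
rng = np.random.default_rng(0)
beta = rng.random((30, 200))            # toy beta-value matrix
batches = estimate_batches(beta)
corrected = center_by_batch(beta, batches)
```

Because nothing here consults sample annotations, the fitted procedure transfers directly to leaderboard or test beta matrices, which is the property recommended above.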
I hope this clarifies your concerns. Please let me know if you have any further questions.
Best regards,
Gaurav
Hi Gaurav,
Thanks for your prompt response and detailed information. To make my question a bit clearer: can the metadata from the leaderboard or test datasets be used to preprocess those datasets before the submitted algorithm makes predictions on them? For example, to correct for batch effects in the test data before using the batch-corrected trained model to make predictions.
Thanks,
Tushar

Hi,
Thank you for your question regarding the availability of the Sample_annotation.csv file for the leaderboard and test datasets. I'm not entirely clear on how the metadata information from the leaderboard or test datasets may be used for pre-processing during training of the models.
However, here are some important details:
Both the leaderboard and test datasets are designed to facilitate an unbiased evaluation of the participants' models. No information, including metadata, from these datasets should be used for training the models. Therefore, we have not provided the sample annotation file for either the leaderboard or the test dataset.
The only information about the leaderboard dataset that has already been shared is that the samples are a subset of publicly available DNA methylation studies submitted to GEO or ArrayExpress. All the leaderboard samples were assayed on the 850K array, thus requiring no imputation.
The test dataset, along with its metadata, was generated by the Pregnancy Research Branch at Detroit Medical Center and Wayne State University. This information is private and cannot be shared. However, I can confirm that the distribution of gestational age in the leaderboard and test datasets is similar.
We hope this clarifies your query. If you have any further questions, please feel free to ask.
Gaurav Bhatti