Dear all, I would like to ask about sub-challenge 1. From Figure 1, it is clear to me that the training, leaderboard, and test data all have the same number of probes (n = 452453). However, the preprocessed training data, i.e. the output of the script Preprocess.R, has only 346407 probes. After transposing the preprocessed training data, its size becomes (1742, 346407). I am using deep learning techniques to address sub-challenge 1, so the number of filtered probes (i.e. features) has to be the same for the leaderboard and test data. Because we don't have access to the metadata, we can't run Preprocess.R on the leaderboard or test data before feeding them to the model to obtain predictions. Moreover, running the same script on those data doesn't guarantee that the number of filtered probes will equal 346407. I am sorry if my question sounds confusing, but could you please shed light on this? Many thanks, Ibrahim

Created by Ibrahim Alsaggaf ialsag01
Dear Gaurav, Many thanks, that was really helpful. Kind regards, Ibrahim
Dear Ibrahim,
Thank you for your question regarding the preprocessing steps. Here are a few clarifications that might help:
1) The preprocess.R script uses the sample annotation file only to filter out probes that are not detected in all the training samples. This step is not required for the leaderboard or test datasets. Therefore, the sample annotation file is not necessary for evaluating your model on the leaderboard or test datasets.
2) All other steps in preprocess.R, such as filtering cross-reactive probes, filtering probes close to SNPs, and BMIQ normalization, do not require any metadata. These preprocessing steps can be applied without additional sample metadata.
3) The 346,407 probes remaining after preprocessing are also present in the leaderboard and test datasets. Your script that applies the model to the test or leaderboard data can start with a simple subsetting step to select these probes from the full dataset.
4) The provided preprocessing script is merely an example of common preprocessing methods applied to Illumina methylation array data. You are encouraged to explore other preprocessing methods or even work directly with the raw data.
I hope this clarifies your concerns. Please feel free to reach out if you have any further questions or need additional assistance.
Best regards,
Gaurav Bhatti
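The subsetting step described in point 3 could look roughly like the sketch below. It is only an illustration, not the organizers' code: the function name `select_probes`, the toy probe IDs, and the dictionary layout (probe ID mapped to one beta value per sample) are all assumptions made for the example. The idea is to save the list of probe IDs kept during training and then pull exactly those probes, in the same order, from any new dataset, so the model always sees the same feature columns.

```python
from typing import Dict, List, Sequence


def select_probes(
    betas: Dict[str, Sequence[float]],
    kept_probes: Sequence[str],
) -> List[List[float]]:
    """Return a samples x probes matrix restricted to kept_probes, in that order.

    betas maps a probe ID to its beta values, one value per sample
    (a hypothetical layout chosen for this sketch).
    """
    # Fail loudly if the new dataset lacks a probe the model was trained on.
    missing = [p for p in kept_probes if p not in betas]
    if missing:
        raise KeyError(
            f"{len(missing)} training probes absent from new data, e.g. {missing[:3]}"
        )
    rows = [betas[p] for p in kept_probes]    # probes x samples
    return [list(col) for col in zip(*rows)]  # transpose to samples x probes


# Toy data: three probes measured on two samples (IDs are made up).
full = {
    "cg0001": [0.10, 0.20],
    "cg0002": [0.30, 0.40],
    "cg0003": [0.50, 0.60],
}
kept = ["cg0003", "cg0001"]  # probe order saved from the training run

features = select_probes(full, kept)
print(features)  # [[0.5, 0.1], [0.6, 0.2]]
```

Indexing by the saved probe list, rather than re-running the detection filter, keeps both the number and the order of features fixed, which is what a network with a fixed input size needs.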

About preprocessing the leader board and test data