Hello, I believe there is an issue with the simulation data (pheno_training.csv and pheno_test.csv). Below is what I get when I plot the histograms of Event_time for the training and test sets, grouped by Event (after removing rows with missing values). As you can see, many patients experienced HF exactly at the end of the 15-year period (left column). There seems to be some confusion between censored observations and HF. Could you please clarify this point?

[Figure: histograms of Event_time for the training and test sets, grouped by Event (syn41687356)]

Created by TristanF
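For anyone who wants to reproduce this check, here is a minimal sketch of the plot described above. It assumes the phenotype files contain Event (1 = HF, 0 = censored) and Event_time columns as described in the post; the bin count and layout are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the phenotype tables named in the post; the `Event` and
# `Event_time` columns are assumed to exist as described above.
train = pd.read_csv("pheno_training.csv")
test = pd.read_csv("pheno_test.csv")

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
for row, (name, df) in enumerate([("training", train), ("test", test)]):
    df = df.dropna(subset=["Event", "Event_time"])  # drop rows with missing values
    for col, event in enumerate([1, 0]):  # left column: HF (Event=1), right: censored
        ax = axes[row, col]
        ax.hist(df.loc[df["Event"] == event, "Event_time"], bins=30)
        ax.set_title(f"{name} set, Event={event}")
        ax.set_xlabel("Event_time (years)")
plt.tight_layout()
plt.show()
```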
Hi @notauser, We of course expect novel insights on using microbiome information for time-to-event research questions in this challenge, and we hope to find the best intersections between biological knowledge and machine learning in solving the task. As you may know, there is a writeup submission, which we will also use when selecting the winner of this challenge, in which participants explain their algorithms. So we really encourage all participants to leverage both machine learning and biological knowledge in this challenge. We just want to emphasize that it is impossible for us at this point to capture all the biological information without compromising private information. However, since we will test your algorithm on real datasets, the biological knowledge you apply when creating features from the microbiome data will hopefully outweigh approaches that depend on machine learning techniques alone. As far as we have experimented, it is quite challenging to reach the best-performing model using machine learning techniques alone (e.g., the Cox regression model or random survival forest that we presented in our baseline). I understand it is probably quite difficult due to the limitations of our synthetic data generation process, but we hope it still captures most of the interactions, as it was generated from the real dataset. Best regards, Pande
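As a concrete reference point for the baseline Pande mentions, a minimal Cox proportional hazards fit could look like the sketch below. This uses the lifelines library as one possible choice (the organizers' actual baseline code may differ), and the covariate names are placeholders, not guaranteed column names.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical minimal baseline in the spirit of the Cox model mentioned above.
# Assumes pheno_training.csv has `Event`, `Event_time`, and covariate columns;
# the covariate names used here are placeholders.
df = pd.read_csv("pheno_training.csv").dropna()
covariates = ["Age", "BMI"]  # placeholder clinical covariates

cph = CoxPHFitter()
cph.fit(df[covariates + ["Event_time", "Event"]],
        duration_col="Event_time", event_col="Event")
cph.print_summary()  # hazard ratios and concordance on the training set
```

A random survival forest baseline would follow the same pattern with, e.g., scikit-survival's RandomSurvivalForest, though that is an assumption about one possible implementation rather than the organizers' code.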
Hi @pande I'm a little worried about this: "Please note that this synthetic data is a placeholder and doesn't capture all biological relationships. It was impossible for us to preserve the relationships, especially in the microbiome data, and still be sure not to compromise private information at this point." Not sure if I interpreted it correctly, please correct me if I'm wrong, but basically this competition might rely more on the machine-learning side of things. As a result, if we spend more time using biological knowledge to create features from the provided microbiome data, this might be ineffective in this synthetic environment, as not all biological relationships may be captured (although it might be useful when testing on real data from the test dataset), especially for the microbiome samples.
Hi @TristanF, The number of patients experiencing HF near the end of the follow-up time is of course larger in the synthetic dataset due to the generation process I described in my earlier reply (keeping the cap at 15 years). For future use, we may need to take this into account and modify our methods. Thanks for pointing this out. We have to acknowledge that our synthetic data generation process does indeed need further improvement. Please note that this synthetic data is a placeholder and doesn't capture all biological relationships. It was impossible for us to preserve the relationships, especially in the microbiome data, and still be sure not to compromise private information at this point. Regarding your question about the validation phase: we will replace the test dataset with a scoring dataset (i.e., add an additional folder of scoring data). We will not merge the train and test data into a larger training set, so the steps will remain similar between the Submission and Validation phases. The difference is that we will score your submission against the scoring dataset for the final scoring. The main reason for splitting the data into 3 categories is to keep the scoring system fair for everyone and to prevent participants from overfitting their models. Please also remember that during the validation phase we only accept 1 final model, and participants are required to submit writeups for us, the organizers, to select the top performer. Best regards, Pande
Hi @pande, thanks for the clarification. I have modified the figure access rights so that everybody can see it. Given how huge the surge is, I find it hard to believe that it is acceptable when compared to the real dataset. I also have a question regarding the folder structure in the validation phase: will you replace the test data with the scoring data and merge the train and test data into a larger training set, or will you just add an additional folder with the scoring data? Thanks, Tristan
Hi @TristanF, Very good observation, thanks for reporting it. It is indeed an artifact of the synthetic data generation. During the generation process, some of the Event_time values exceeded our observation time (15 years), so we capped them at 15 years to stay consistent with the follow-up time of our real data. This process results in a surge of individuals experiencing HF exactly at the 15-year mark. But I can clarify that we also have individuals in our real dataset who experience HF near the end of the follow-up time, although not as many as in the synthetic dataset. Unfortunately, I cannot access your plot. I generated a plot following the same principle (histograms of Event_time for the training and test sets, grouped by Event) and observed that the surge is still acceptable when compared to the real dataset. Best regards, Pande
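To make the artifact concrete, here is a small illustrative sketch. The exponential distribution and its scale are invented for demonstration and are not the actual generation process: drawing event times from any distribution with mass beyond 15 years and then clipping at 15 piles all of that mass onto the boundary, which is exactly the spike visible in the histograms.

```python
import numpy as np

# Illustration of the capping artifact described above. The exponential
# distribution and its scale are arbitrary choices for demonstration only.
rng = np.random.default_rng(0)
raw_times = rng.exponential(scale=10.0, size=10_000)  # synthetic event times (years)

capped = np.minimum(raw_times, 15.0)  # cap at the 15-year follow-up horizon

# Everything that would have happened after year 15 now lands exactly at 15,
# which shows up as a spike in the last histogram bin.
print(f"share of observations at exactly 15 years: {(capped == 15.0).mean():.1%}")
```

In a standard survival setup, observations reaching the end of follow-up without a confirmed event would typically be recorded as censored (Event = 0) at 15 years rather than as events, which is the censoring-vs-HF confusion raised in the original post.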

Issue with the simulation data: surge in heart failures at the end of the 15-year period.