I am not sure how to interpret the following sentence about the data:
"**negative values indicates the occurrence of Heart Failure in participants before the baseline; thus measured during the baseline, please see the Heart Failure status at the baseline in PrevalentHFAIL columns"
92 patients are in this group with negative values, 7 of them have a positive record in the "Event".
How to treat them? Exclude them from the analysis?
What is the Event time then for these 7?
If HF before baseline doesn't interest us, should their Event Time be as long as possible? But then it is clearly an artifact of simulation and would lead to worse model performance.
Will our models be tested on simulated data as well?
Created by Valentyn Bezshapkin crusher083

Hi @Chih-Han,
We are very sorry for the confusion.
To clarify: only **Event** and **Event_time** will be removed from the scoring dataset.
Further information about the final phase will be ready soon.
Thank you.
Best regards,
Pande

Dear @pande and @ecekartal,
I am a little confused about the data format of the hidden scoring dataset.
In the recent thread "Data format of the scoring data set (N=1809)", it is stated that PrevalentCHD and PrevalentHFAIL would not be in the scoring dataset.
https://www.synapse.org/#!Synapse:syn27130803/discussion/threadId=9888
However, in the thread "`PrevalentHFail` variable", it is stated that the proposed model is allowed to include PrevalentHFail, which means the scoring dataset should have the PrevalentHFAIL column.
Could you tell us the data format of the hidden scoring dataset? Thank you very much!
BTW, I agree that PrevalentHFAIL should be removed from the scoring dataset. PrevalentCHD is worth discussing.
Thank you very much
Best regards,
Robin

Hi @crusher083, I have checked these cases in both synthetic and real datasets.
These 7 cases are clearly artifacts of the synthetic data generation process.
In the real datasets, all individuals with PrevalentHFAIL==1 have a negative Event_time (i.e., the HFAIL had already happened before the baseline measurement), and their Event is always marked as 0 (no incident HFAIL was recorded during follow-up for these volunteers), so the event time remains negative.
As @ecekartal mentioned, we usually remove those who have already developed the disease of interest (PrevalentHFAIL) from the analysis, as is commonly done when fitting a Cox model.
We wanted to keep this information in the synthetic datasets, as in the real dataset, in case someone wants to use it in their model.
Ece has clearly answered the other questions. As for the last question: by default, we will not run the models on the synthetic datasets during evaluation.
However, this could be further discussed.
Best regards,
Pande
Hello @crusher083, I can answer most of your questions. @pande, could you check those 7 cases?
92 patients are in this group with negative values > Yes, this is similar in the real FINRISK data as well.
7 of them have a positive record in the "Event". > We will check those cases, but this does indeed look like an artifact of simulation.
How to treat them? Exclude them from the analysis? > That is what we did for the basic models, but some may come up with novel ideas for how to include them.
What is the Event time then for these 7? > I would exclude those 7 individuals, since we don't have such cases in the real dataset.
Will our models be tested on simulated data as well? > By default no, but if necessary, we can discuss this further. As mentioned, the synthetic data is a placeholder and doesn't capture all biological relationships. It was impossible for us to keep the relationships, especially in the microbiome data, and still be sure not to compromise private information. That is why each group/participant has more than one submission. We hope those different submissions will give you a better idea about the real dataset. Hope that helps.
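
For anyone who wants to apply the exclusion discussed in this thread, here is a minimal sketch in plain Python. It assumes that a negative Event_time identifies participants with heart failure before baseline, and that a negative Event_time combined with Event == 1 marks the simulation artifacts. Only the column names Event, Event_time and PrevalentHFAIL come from the thread; the `id` field, the example rows and the `split_cohort` helper are hypothetical and purely illustrative.

```python
# Hypothetical rows: the values and the "id" field are made up for illustration.
rows = [
    {"id": "a", "PrevalentHFAIL": 0, "Event": 0, "Event_time": 14.2},
    {"id": "b", "PrevalentHFAIL": 0, "Event": 1, "Event_time": 3.7},
    {"id": "c", "PrevalentHFAIL": 1, "Event": 0, "Event_time": -2.1},
    {"id": "d", "PrevalentHFAIL": 1, "Event": 1, "Event_time": -0.5},
]


def split_cohort(rows):
    """Split rows into (analysis set, prevalent cases, inconsistent records).

    Assumption: Event_time < 0 means heart failure occurred before baseline.
    Such rows with Event == 1 are treated as simulation artifacts.
    """
    analysis, prevalent, inconsistent = [], [], []
    for row in rows:
        if row["Event_time"] < 0:
            if row["Event"] == 1:
                inconsistent.append(row)  # negative time but a recorded event
            else:
                prevalent.append(row)  # prevalent case, no incident event
        else:
            analysis.append(row)  # eligible for a time-to-event model
    return analysis, prevalent, inconsistent


analysis, prevalent, inconsistent = split_cohort(rows)
print(len(analysis), len(prevalent), len(inconsistent))
```

Only the `analysis` rows would then go into a time-to-event model such as Cox regression. The other two groups are kept separate rather than discarded, in case someone wants to try using the prevalent cases in a different way.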