The study expression data have just been released on GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73072 There are apparently more subject/array data in GEO than in the challenge training dataset. First obvious question: the challenge test data are NOT in the GEO dataset, are they? Because if the late-stage test expression data are available, then the "challenge" of predicting virus response from early time points becomes pretty much obsolete.

Created by Vladimir Morozov vmorozov
One other way to deal with the situation is K-fold cross-validation, where K is a fixed number supplied by the organizers. When you combine the training and testing data, some X% of it is the original training data and some Y% is the original test data. One could use this Y to set the number of folds K, so that every model is always built on roughly X% of the data and tested on the remaining Y%. Also, since this kind of cross-validation is repetitive in nature, there will be at least one model that is built on the originally intended training data and tested on the originally intended test data. The best model could then simply be the one with the highest accuracy on its corresponding Y% held-out data.
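A minimal sketch of this scheme, assuming the combined data are in NumPy arrays `X` (features) and `y` (shedding labels) and that `n_test` is the size of the originally intended test set. The data sizes and the choice of logistic regression are placeholders, not part of the proposal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_total, n_features, n_test = 100, 50, 20          # hypothetical sizes
X = rng.normal(size=(n_total, n_features))         # placeholder expression data
y = rng.integers(0, 2, size=n_total)               # placeholder shedding labels

# Choose K so that each held-out fold is roughly the size of the original
# test set, i.e. the model always trains on ~X% and is scored on ~Y%.
K = max(2, round(n_total / n_test))

cv = StratifiedKFold(n_splits=K, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"K={K}, per-fold accuracies: {np.round(scores, 3)}")
print(f"mean accuracy: {scores.mean():.3f}")
```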
I think the idea of the Gustafsson Lab - Nordling Lab team is very interesting!
Thank you all for your suggestions! I hope you'll all be available, or at least send a representative from your group, for the webinar on Monday. We value your opinions. Webinar registration details are as follows: > We have scheduled a webinar for Monday, July 11th at 12pm Eastern/9am Pacific to discuss the Community Phase of the Challenge and the path forward. >You may register by clicking the Registration URL: https://attendee.gotowebinar.com/register/8625583333628415234.
Considering the data release, we in team Gustafsson Lab - Nordling Lab would like to propose a different angle on the challenge that works even though all data are public. We propose that the challenge should focus on identification of robust biomarkers. One possibility is the following: the challenge could focus on identifying the genes (probes) that provide the most significant separation between the two classes, i.e. patients with and without viral shedding, together with the associated classifier, based on the four different sets of time points defined in the original challenge. The performance measure, i.e. the significance (p-value) of the separation between the two classes, can be calculated objectively using Monte Carlo simulations with random class assignment to all patient samples. The frequency, corrected for multiple hypothesis testing, of observing the same separation using the suggested classifier and number of genes (probes) provides a reliable p-value. If the challenge focuses on robust biomarkers, we believe that the biological and medical value of the challenge is increased and it would be more in accordance with the scope of Science Translational Medicine. This performance evaluation method does not require any hidden test data, so the release of all the data does not hamper a fair evaluation of contributions. It also does not suffer from overfitting, because the actual selection of genes (probes) is not used when calculating the p-value.
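A minimal sketch of the kind of Monte Carlo evaluation described above, assuming an expression matrix `X` (samples x probes) and class labels `y` (shedder vs. non-shedder). The separation statistic used here (cross-validated accuracy of a simple classifier on the top-ranked probes) is only an illustration; the actual statistic, classifier, and number of probes would be fixed by the challenge organizers.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def separation_score(X, y, n_probes=10):
    """Cross-validated accuracy using the n_probes most discriminative probes.
    Feature selection sits inside the pipeline, so it is redone in every fold."""
    clf = make_pipeline(SelectKBest(f_classif, k=n_probes),
                        LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X, y, cv=5).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 500))            # placeholder expression data
y = rng.integers(0, 2, size=80)           # placeholder shedding labels

observed = separation_score(X, y)

# Null distribution: reshuffle the class labels and recompute the statistic.
# 200 permutations keeps the sketch fast; many more would be used in practice.
n_perm = 200
null = np.array([separation_score(X, rng.permutation(y)) for _ in range(n_perm)])

# Empirical p-value: how often random labelling separates at least as well.
p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
print(f"observed separation: {observed:.3f}, permutation p-value: {p_value:.3g}")
```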
The best option would be to generate new test data. We cannot choose the best model based only on leave-one-out cross-validation, because one can easily overfit the training data, especially with the small number of observations and the large number of variables in this challenge. I have painful experience of overfitting from machine learning competitions on kaggle.com, where I chose the best model purely by CV error.
I agree that the best option would be to generate new test data, but if that is not possible due to time constraints, a second option could be to evaluate all the models based on leave-one-out cross-validation accuracy on the combined train and test data. The best-performing models can then be selected. The criteria for choosing the best models could perhaps be extended to include the number of gene expression features used, the time it takes to generate predictions, etc.
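A minimal sketch of ranking candidate models by leave-one-out accuracy on the combined data, as suggested above. The candidate models, data sizes, and data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))            # placeholder combined expression data
y = rng.integers(0, 2, size=60)           # placeholder shedding labels

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "rbf-svm": SVC(),
    "random-forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Leave-one-out: each split holds out a single sample, so the mean score
# over splits is the LOO accuracy.
loo = LeaveOneOut()
for name, model in candidates.items():
    acc = cross_val_score(model, X, y, cv=loo).mean()
    print(f"{name:14s} LOO accuracy = {acc:.3f}")
```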
The data were released by a former post-doc from the group that generated the data in support of his publication. It was an unfortunate miscommunication, and like you, we're disappointed in this turn of events. We are actively exploring options to make the collaborative phase of the challenge worthwhile for those who chose to participate.
Dear Solveig Sieberts and challenge organizers, I'm quite sad to have received your email about the untimely end of this challenge. I had already started to work on the challenge's scientific goals and had just designed my first prediction model. I'm quite surprised that this dataset leak happened. A few questions: how was this dataset leak possible? How did it happen? Who put the dataset on GEO? Was he/she authorized to do that? Were the organizers aware of this possibility? I would like to read more information about this unpleasant episode, to better understand what happened. I'm sure that other challenge participants share this desire. Thanks, best regards -- Davide
Thank you all for bringing this to our attention. Clearly the challenge will need to be altered following this unforeseen data release. We are looking into it - and open to your suggestions. -Lara
Solly, If I may, I would advise you not to escalate this further. I agree with Yuanfang that the challenge in its current form is dead. You can't stop people from using these data. The data can be used "implicitly" for parameter "optimization". Are you going to impose restrictions on model parameters and seeds, allow only "default" parameters, require seed equal to "1", ...? I am deeply sympathetic to the challenge organisers. But it might be a good chance (for responsible stakeholders, e.g. DARPA) to do it right instead of cheaply: generate a truly independent test set. Since the real challenge is to predict from pre-exposure data, only pre-exposure samples would need to be profiled. Vlad
Hi, Solly, I am sorry for any trouble I might have caused. But I win every challenge, and I was going to use THIS ONE to celebrate my 10th winning challenge in DREAM. And now I have to pay for some other person's mistake, since I am no longer qualified and cannot win this one, what a disappointment! We have had intensive collaborations before, Solly. Speaking honestly, do you think I am the one with the worst moral standards? Do you really expect that I am the only one who would download this data? I will remove it when GEO removes it, which doesn't affect you at all. Yuanfang
I ask you to please remove this from your site immediately, Yuanfang!
oops. expecting someone not to look at the answers when they are printed at the back of the page is a pretty tough morality test. at least i failed it. but i guess that saved me some time this summer. suddenly i feel the michigan sky has become so bright. from a quick matching it is indeed a one-to-one array match for about ~12000 genes between the sage data and the GSE data, with correlation over 0.8, except for 3 samples which don't even look human and must have come out of some kind of processing error. i put a matched-ID and matched-gene version on my website http://guanlab.ccmb.med.umich.edu/yuanfang/data_comparison.tar you can take a look at whether my match is correct, and whether it includes more time points (it does include tons more samples).
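For reference, a minimal sketch of this kind of correlation matching between two expression datasets, assuming both are loaded as pandas DataFrames of samples x shared genes; the file names are hypothetical and not the actual files used above.

```python
import numpy as np
import pandas as pd

challenge = pd.read_csv("challenge_expression.csv", index_col=0)   # hypothetical file
geo = pd.read_csv("GSE73072_expression.csv", index_col=0)          # hypothetical file

# Restrict both matrices to the genes they share, in the same column order.
shared = challenge.columns.intersection(geo.columns)
A = challenge[shared].to_numpy()
B = geo[shared].to_numpy()

# Pearson correlation between every challenge sample and every GEO sample,
# computed via z-scored rows.
A_z = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
B_z = (B - B.mean(axis=1, keepdims=True)) / B.std(axis=1, keepdims=True)
corr = A_z @ B_z.T / A.shape[1]

# For each challenge sample, record the best-matching GEO sample and its correlation.
best = corr.argmax(axis=1)
matches = pd.DataFrame({
    "challenge_sample": challenge.index,
    "geo_sample": geo.index[best],
    "correlation": corr[np.arange(len(best)), best],
})
print(matches.head())
print("matches with r > 0.8:", (matches["correlation"] > 0.8).sum())
```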
I'll ask again, please DO NOT DOWNLOAD these data. The Challenge organizers are discussing how to proceed, but one option is to disqualify anyone who has downloaded the data. We apologize. The data were released by someone unrelated to the Challenge, and we are as upset about this as you are.
what the hell! that will make winning the challenge completely meaningless.
That is indeed troubling to hear. I would ask everyone not to download these data, and we'll do what we can to get the data pulled in order to salvage the challenge. Thank you for your cooperation.
