Thank you all for your participation in today's webinar. For those who could not make it, the [recording is available here](https://drive.google.com/open?id=0B4Gply5UVfcjUmNJaWdEbDNqMnM). We'll be taking further comments and discussion this week and devising a specific plan and timeline to be announced next week; please post your comments on this Discussion Forum. Additionally, we have posted a copy of the [DREAM Principles](https://www.synapse.org/#!Synapse:syn5647810/wiki/402109) under which this and all recently launched challenges will operate going forward.

**Q:** What is leave-one-out CV, and is it possible to tell the original significance? (Mohammad Rahman)
**A:** Leave-one-out cross-validation (LOOCV) is a particular form of cross-validation in which a dataset of size N is partitioned into N training/test sets, each composed of N-1 training examples and 1 validation example. This means each example is used exactly once for validation, which guarantees that estimates of risk (e.g., accuracy) based on this procedure are unbiased.

**Q:** Has the deadline for submission changed? (Qian Li)
**A:** Yes. As we define the new submission requirements, we'll make the new timelines available.

**Q:** Can you upload the presentation online? (Ziv Shkedy)
**A:** Yes. We have uploaded the [recording here](https://drive.google.com/open?id=0B4Gply5UVfcjUmNJaWdEbDNqMnM).

**Q:** Is the test data available as a separate download, or has it been combined with the rest of the data? (Francois Collin)
**A:** The test data have been provided as 4 separate downloads corresponding to the 4 original phases.

**Q:** It seems to me that with the GEO release, the data available now would all have been available at phase 4 (since the outcome data wasn't released). Couldn't we just skip phases 1-3 and judge the challenge based on phase 4? (Jacob Silterra)
**A:** Unfortunately, the GEO release included later timepoints as well, so restricting participants to phase 4 data (time <= 36 hours) cannot be guaranteed.

**Q:** Is it possible to submit a report about the analysis as a PDF document in which we present and summarize the main results? (Ziv Shkedy)
**A:** Analyses should be submitted as write-ups in a Synapse Project Wiki, and code must be provided.

**Q:** To Dr. Nordling: When you say randomizing samples, are you suggesting something like a stratified partitioning? (Sajal Kumar)
**A:** Randomizing samples refers to permuting outcome data relative to predictors, though this may be done in a stratified manner (e.g., within study or virus).

**Comment:** I like Torbjorn Nordling's suggestion. It is similar to a recent study in MRI analysis. See http://www.sciencealert.com/a-bug-in-fmri-software-could-invalidate-decades-of-brain-research-scientists-discover (Prasad Chodavarapu)

**Q:** Also, if you have already used all the data to create a model, I don't understand how the Monte Carlo simulation would help improve the model beyond what it already is. (Sajal Kumar)
**A:** Monte Carlo simulation would give the background distribution of scores for each method when no true correlation between outcome and predictors exists. A method which overfits the data should show (approximately) equal overfitting in the true and permuted data, so this provides an unbiased assessment of the model relative to its "null distribution".
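As a rough illustration of the permutation idea discussed in the two answers above (permuting outcomes within each study to build a "null distribution" of scores), here is a minimal sketch. The variable names (`X`, `y`, `study`), the hypothetical data loader, and the choice of `LogisticRegression` as a placeholder predictor are assumptions for the example, not part of the Challenge specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

def permute_within_study(y, study, rng):
    """Permute outcomes relative to predictors, separately within each study."""
    y_perm = y.copy()
    for s in np.unique(study):
        idx = np.where(study == s)[0]
        y_perm[idx] = rng.permutation(y[idx])
    return y_perm

def null_score_distribution(X, y, study, n_perm=100):
    """Monte Carlo null distribution of LOOCV accuracy under permuted outcomes."""
    model = LogisticRegression(max_iter=1000)  # placeholder predictor
    null_scores = []
    for _ in range(n_perm):
        y_perm = permute_within_study(y, study, rng)
        scores = cross_val_score(model, X, y_perm, cv=LeaveOneOut())
        null_scores.append(scores.mean())
    return np.array(null_scores)

# Example usage with stand-ins for expression predictors and outcomes:
# X, y, study = load_challenge_data()   # hypothetical loader
# observed = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut()).mean()
# null = null_score_distribution(X, y, study)
# empirical_p = (np.sum(null >= observed) + 1) / (len(null) + 1)
```

Comparing the observed score to this null distribution gives the kind of method-specific baseline described above; as noted in the next question, repeating this many times can be computationally intensive.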
**Q:** It is a great idea to identify robust biomarkers. As Solly pointed out, repeating the sampling many times could be highly computationally intensive. If these types of assessment criteria are used, I would like to request that the organizers provide computational resources to aid participants. Thank you. (Ka Yee Yeung)
**A:** Your suggestion is noted, and we will look into this possibility.

**Q:** When you say randomize within each study, does that give the expected results, given that within-study samples will already have been used in the initial model? Can you clarify this? (uma)
**A:** We can definitely debate this issue in the Community Forum. I personally believe this is the right way to do it, because we are not interested in the study-specific effects, but only the gene expression predictors.

**Q:** What are your thoughts about building predictive models based on virus type instead of different studies? (Reem Almugbel)
**A:** It is definitely your option to build predictors any way you choose. From the organizers' perspective there is no requirement either way.

**Q:** Is the outcome of the test set released on Sage available? Can you elaborate on it a little bit more? (Mohammad Rahman)
**A:** The outcome data released through this Challenge are not available elsewhere and are subject to the terms of use of the Challenge. In other words, you may not use the outcome data for any purpose other than Challenge participation.

**Q:** This is a general question: why do you propose to use leave-one-out cross-validation and not k-fold cross-validation? (Ziv Shkedy)
**A:** LOOCV is an unbiased estimator and is not dependent on a particular partition of the data. We could equally ask for specific k-fold CV partitions; however, this introduces the potential for errors. LOOCV ensures an apples-to-apples comparison across teams.
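To make the LOOCV procedure referenced above concrete, here is a minimal sketch using scikit-learn. The predictor (`LogisticRegression`) and the array names `X` and `y` are placeholders for this illustration; teams would substitute their own models and the Challenge data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def loocv_predictions(X, y):
    """Leave-one-out CV: fit on N-1 samples, predict the single held-out sample,
    so every sample is used exactly once for validation."""
    model = LogisticRegression(max_iter=1000)  # placeholder predictor
    preds = np.empty_like(y, dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    return preds

# accuracy = np.mean(loocv_predictions(X, y) == y)   # X, y are hypothetical arrays
```

Because there is only one possible leave-one-out partition of a dataset, every team's estimate is computed on exactly the same splits, which is the apples-to-apples property noted in the answer above.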
**Q:** I think it is great that you are now communicating a clear focus on finding biomarkers. We have been looking into previous publications based on the studies, and there are quite a few suggesting biomarkers. Do you have some sort of summary of suggested biomarkers?
**A:** We have not summarized the reported biomarkers from previous works, but it is a good suggestion to compare our biomarkers to those used in previous publications. Since previous publications focus on later-stage biomarkers, we don't necessarily assume that they should be the same.

**Q:** Will benchmark results be available for us to compare against the models we get?
**A:** A leaderboard will be provided for you to be able to test models, but we won't expose the outcome data for the test partition during the initial course of the challenge.

**Suggestion:** Use k-fold cross-validation. Let's say the test data is 10% of the entire combined data; then it would be sensible to use 10-fold cross-validation, because that way there will be at least one model that has the desired partitioning. We can then plot all the models to see which did the best, thus saving ourselves from overfitting.
**A:** I don't fully comprehend how this saves us from overfitting. Could you clarify?

**Q:** To whom do we need to send proposals/suggestions for the analysis?
**A:** Please post to the discussion forum on the Challenge website.

**Q:** In some of the publications on a few of the studies there are outcome data, so it would be great if you could clarify exactly what outcome data have not been published previously. Could you also describe in more detail how the viral shedding was determined?
**A:** We will look into this further, but my understanding is that Viral Shedding and Symptom Score have not been previously published anywhere, and if Symptomatic (binary outcome) has been made available, it would only have been in the previously published studies (DEE1 RSV, DEE2 H3N2, DEE3 H1N1, and Rhinovirus UVA).

**Q:** Our team will have to finish our work around Sept. 10th due to the start of the next semester, so timeline clarifications would be most appreciated.
**A:** Thanks for this comment. We'll keep this in mind as we finalize the analysis plan.

**Q:** Will submitted code and write-ups be available for everyone to download, or only for DREAM organizers/participants?
**A:** As per DREAM rules, code and write-ups need to be made public at the end of a challenge.

**Suggestion:** We could simulate the process of training and test datasets by designating the most recently generated training dataset as "test". Everyone would have to agree not to use the "test" dataset in building their models. While this would simulate the best validation procedure, it definitely relies on an honor code.

**Q:** Can we stress code reproducibility a little bit more?
**A:** Code is required in order to demonstrate the reproducibility of an analysis/predictor; it should reproduce the solution exactly (i.e., it documents package versions and, in the case of stochastic algorithms, random seeds) and should be the same code used to generate the solution (i.e., dummy code which produces the solution but uses a bogus method is not allowed). A minimal sketch of seed-setting and version logging appears at the end of this summary.

**Q:** As the dataset for this challenge has been released, can anyone now use this dataset (especially the normalized dataset released by the DREAM Challenge organizers) for their research paper publication?
**A:** No. Normalized data and outcomes released through this Challenge are available only for use in this Challenge. You may use the GEO version of the data in any way you like.

**Q:** I received an email on 7 July saying that the normalized version of the gene expression data for the test set has been released and is now available in Synapse. Could you please send the link to the datasets? I am not able to access the new data (test set). When I click on Download Data, it shows me the same older version of the data, and the modified date for that dataset is still 17 May.
**A:** The test data are available in the folder called "Original Test Data" (syn6613428).

**Q:** Has the postdoc who released the dataset published any research papers this year using this dataset? Could you please send links to those papers for my literature survey?
**A:** I believe Al Hero's paper is referenced in the recent GEO release.
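As referenced in the code-reproducibility answer above, here is a minimal sketch of what fixing seeds and recording the software environment alongside a submission might look like. The file names and choice of libraries are assumptions for illustration only; they are not an official requirement beyond what is stated in that answer.

```python
# reproduce.py -- illustrative only: fix seeds and record the environment
import json
import random
import sys

import numpy as np
import sklearn

SEED = 20160715  # fixed seed so stochastic steps give identical results on re-run
random.seed(SEED)
np.random.seed(SEED)

def write_environment(path="environment.json"):
    """Record interpreter and package versions next to the submitted results."""
    env = {
        "python": sys.version,
        "numpy": np.__version__,
        "scikit-learn": sklearn.__version__,
        "seed": SEED,
    }
    with open(path, "w") as fh:
        json.dump(env, fh, indent=2)

if __name__ == "__main__":
    write_environment()
    # ...run the same analysis code used to generate the submitted predictions...
```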

Created by Solveig Sieberts (sieberts)
In response to: "Using K-fold cross validation. Let's say the test data is 10% of the entire combined data; then it would be sensible to use 10-fold cross-validation, because that way there will be at least one model that has the desired partitioning. We can then plot all the models to see which did the best, thus saving ourselves from overfitting."

How can it prevent overfitting: While it does not guarantee prevention of overfitting, it does ensure that not all of the data (training + testing) is used for generating the model. To make sure that only a certain percentage of the data is used, one could require submission of the test data used to evaluate the model along with the predictions made; since reproducible code is already being submitted, it would be easy to verify whether the entire dataset was used or not.
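As a rough sketch of the verification idea in this reply (an illustration only, not part of the Challenge rules), a team could record the exact held-out fold indices used for evaluation so reviewers can re-run the split and confirm those samples were not used in training. The 10-fold choice, seed, and file name are assumptions.

```python
import json
import numpy as np
from sklearn.model_selection import KFold

def make_folds(n_samples, n_splits=10, seed=0, path="fold_indices.json"):
    """Create reproducible 10-fold splits and save the held-out indices for review."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = [test_idx.tolist() for _, test_idx in kf.split(np.arange(n_samples))]
    with open(path, "w") as fh:
        json.dump(folds, fh)
    return folds

# folds = make_folds(n_samples=len(y))   # y is a hypothetical outcome vector
# Reviewers can rerun make_folds with the same seed and check the saved indices match.
```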
