Hi, I have some questions regarding the data used for the two submissions in a given round. From the wiki https://www.synapse.org/#!Synapse:syn6187098/wiki/449433:

> "Challenge 1 and 2 questions have leaderboard rounds where participants can submit twice and receive scores (there is no limit to express lane submissions). A subset of the validation data is randomly sampled and set aside for these leaderboard rounds. Each individual round is made of a sample with replacement from this set-aside subset. Leaderboard rounds give participants an idea of their score and allow them to adjust their models for improved accuracy. Final submissions are then scored on the remainder of the samples in an effort to avoid overfitting."

My understanding from this description is that the total validation data is split into three subsets, one for each of the rounds. Then, within a round, you bootstrap (sample with replacement) the corresponding subset twice and use these two bootstraps for the two submissions. Is this correct?
- Does everyone see the same two bootstraps of the data in a given round?
- If you submit twice in a round, how are the two scores combined into that round's score?

Thanks, DA

Created by exquirentibus veritatem
I forgot to provide the link with the details [here](https://www.synapse.org/#!Synapse:syn6187098/wiki/449444).
Thanks!
Hi, here is a point-by-point answer to help with clarity. My answers are in *italics*.

1. If subSetLB (25%) is used for the leaderboard round, and you are sampling it with replacement, are you doing resampling exactly once per challenge round? My understanding is, you generate dataset 1 for stage 1 and dataset 2 for stage 2, all from the same 25% of the data sampled with replacement. You then use these two synthetic datasets to provide two estimates of performance. Every team submitting a result gets scored against the same bootstrap, and that bootstrap only depends on the stage. Right? *This is correct.*
2. Is the bootstrap size appreciably larger than the size of subSetLB, or are you sampling to the cardinality of subSetLB? *We sample N from N with replacement; we are not "oversampling".*
3. subSetVal (75%) is used without any bootstrapping to generate the final scores. *Sort of, see 4 below.*
4. If my understanding of 1-3 is correct, could I ask why single bootstraps are used to generate the two checkpoints? One alternative would have been to generate a large number of bootstraps from different draws from the overall full data set and estimate the mean and variance of every submission from that set of bootstraps at each stage. Again, IFF my understanding is correct, then the current strategy is providing two point estimates of performance on a small fraction of the dataset bootstrapped to 1/4 size, and I think that would answer dreamAnon's question. *We indeed use your alternative approach to get distributions of participants' validation scores. Hundreds or thousands of bootstraps are run and the primary metrics are computed on the validation dataset (subSetVal). These are then used to define statistically tied groups via K, the Bayes factor. A tie-breaking metric is then applied to those teams that are in the tied group.* (See the sketch after this post.)

    *I am trying to avoid the term bootstrapping for the leaderboards, since the sampling with replacement is only done once per LB round. In this challenge the LB round scores are NOT used in any official capacity (final awards/incentives are not based on them). Instead they are used to give participants an idea of how well their method is doing in comparison to other teams. The true bootstrapping is done in the validation phase using subSetVal.*
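To make the validation-phase procedure described above more concrete, here is a minimal sketch: many bootstrap resamplings of subSetVal, a distribution of the primary metric for each team, and a Bayes factor K used to find the group statistically tied with the top performer. The function names, the use of NumPy, the win-ratio form of K, and the threshold of 3 are illustrative assumptions on my part, not the organizers' actual scoring code.

```python
import numpy as np

def bootstrap_scores(y_true, predictions, metric, n_boot=1000, seed=0):
    """Bootstrap the validation set (subSetVal) hundreds/thousands of times
    and compute the primary metric for every team on each resampling.
    Returns the team order and an (n_boot x n_teams) score matrix."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    teams = list(predictions)          # dict: team name -> prediction array
    scores = np.empty((n_boot, len(teams)))
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)
        for j, team in enumerate(teams):
            scores[b, j] = metric(y_true[idx], predictions[team][idx])
    return teams, scores

def tied_with_best(teams, scores, k_threshold=3.0):
    """Treat teams as statistically tied with the top performer when the
    Bayes factor K (approximated here as a bootstrap win ratio) stays below
    the threshold; a tie-breaking metric would then be applied within this
    group. Assumes a higher metric value is better."""
    best = int(np.argmax(scores.mean(axis=0)))
    tied = []
    for j, team in enumerate(teams):
        wins_best = np.sum(scores[:, best] > scores[:, j])
        wins_team = np.sum(scores[:, j] > scores[:, best])
        k = wins_best / max(wins_team, 1)
        if j == best or k < k_threshold:
            tied.append(team)
    return tied
```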
Dear Mike, I'm with dreamAnon - I also don't understand how the bootstrapping works, so let me ask the following clarifying questions:

1. If subSetLB (25%) is used for the leaderboard round, and you are sampling it with replacement, are you doing resampling exactly once per challenge round? My understanding is, you generate dataset 1 for stage 1 and dataset 2 for stage 2, all from the same 25% of the data sampled with replacement. You then use these two synthetic datasets to provide two estimates of performance. Every team submitting a result gets scored against the same bootstrap, and that bootstrap only depends on the stage. Right?
2. Is the bootstrap size appreciably larger than the size of subSetLB, or are you sampling to the cardinality of subSetLB?
3. subSetVal (75%) is used without any bootstrapping to generate the final scores.
4. If my understanding of 1-3 is correct, could I ask why single bootstraps are used to generate the two checkpoints? One alternative would have been to generate a large number of bootstraps from different draws from the overall full data set and estimate the mean and variance of every submission from that set of bootstraps at each stage. Again, IFF my understanding is correct, then the current strategy is providing two point estimates of performance on a small fraction of the dataset bootstrapped to 1/4 size, and I think that would answer dreamAnon's question.
Dear DA, Your understanding is close, but that is not quite how we do it. The two submissions for a given round are run on identical data.

There are really just two subsets of data for a challenge question. One subset is randomly sampled once and set aside for the leaderboard rounds (let's call it subSetLB). The remainder is used for the final validation (subSetfinalVal). The split is roughly 25% to 75%, respectively. subSetLB is then sampled with replacement once for leaderboard round 1 (and both submissions are run on it for a given team/individual). subSetLB is then sampled again with replacement once for leaderboard round 2 (and both submissions are run on it for a given team/individual).

The two submissions are not actually combined for a leaderboard round. The leaderboard rounds have no financial awards associated with them. We simply use the second submission score to rank teams/individuals once the challenge has closed.

I hope this helps, Mike
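As a rough illustration of the scheme described in this reply (one 25%/75% split, then a single sample-with-replacement of subSetLB per leaderboard round, shared by both submissions and all teams), here is a hedged sketch. The helper names (`make_leaderboard_split`, `leaderboard_round_sample`) and the use of NumPy are assumptions for illustration only, not the challenge's actual pipeline.

```python
import numpy as np

def make_leaderboard_split(n_samples, lb_fraction=0.25, seed=0):
    """Split sample indices once: ~25% set aside for the leaderboard rounds
    (subSetLB), the remaining ~75% kept for final validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_lb = int(round(lb_fraction * n_samples))
    return idx[:n_lb], idx[n_lb:]          # subSetLB, subSetfinalVal

def leaderboard_round_sample(subset_lb, seed):
    """Sample N from N with replacement, done ONCE per leaderboard round.
    Both submissions from every team are scored on this same sample."""
    rng = np.random.default_rng(seed)
    return rng.choice(subset_lb, size=len(subset_lb), replace=True)

# Two leaderboard rounds, each with its own single resampling of subSetLB.
sub_lb, sub_val = make_leaderboard_split(n_samples=1000)
round1_idx = leaderboard_round_sample(sub_lb, seed=1)
round2_idx = leaderboard_round_sample(sub_lb, seed=2)
```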

sample selection in submissions and score for round