I have some questions about the optional permutation analysis:
1. Are we supposed to do the permutations using the combined time-point features (up to 0 and up to 24), or can they be single time points? For instance, "up to 0" includes time points -24 and 0 for Rhinovirus UVA. If we combine features from these two time points, then subject 15 will be eliminated because she does not have data for time -24.
2. At some time points there is an imbalance in the proportion of class 0 and class 1 labels. For instance, DEE4X_H1N1 at time point -24 (for SC1) has only a single subject with label 0 and 9 subjects with label 1. If we permute these labels, there are only 10!/9! = 10 distinct permutations. For such class-imbalanced cases we will not be able to generate 10,000 alternatives.
3. Not all permutations will be useful if our goal is to get a background score, because some subjects can be re-assigned to their correct labels during this random label assignment process. It could make sense to compare each candidate permutation with the original true labels and drop it if it resembles the true labels too closely (say, more than 50% agreement). I don't know what an appropriate threshold would be here.
4. How should we compute the p-value using Monte Carlo after generating the random permutations and training models on them? A description or a code sample would be helpful.
Created by Zafer Aydin (zaferaydin)

Correct! I see. When you said predictors, I first thought you meant prediction models, but now I understand that they are the feature vectors. Permuting those vectors is the same as permuting the labels and subject IDs together (i.e., the labels stay tied to the subject IDs). Thanks for the clarification.

In order for us to be able to score the permutations, the outcome data assigned to each SUBJECTID must not change. Thus, the predictors are what must be permuted.
Thus, for SUBJECTID 4046 you will always train with SHEDDING_SC1 = 1, and you will always try to predict SHEDDING_SC1 = 1, but in one permutation you might assign it the predictors (e.g., age, gender, gene expression) originally attributed to SUBJECTID 4035, and in the next permutation the predictors originally attributed to 4048, etc.
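To make that concrete, here is a minimal sketch of one way to build such a permutation, assuming the training data sit in a pandas DataFrame with a SUBJECTID column, the outcome column SHEDDING_SC1, a study identifier column (called STUDYID here for illustration), and the remaining columns holding the predictors. The DataFrame layout and the STUDYID column name are assumptions for illustration, not part of the challenge data dictionary:

```python
import numpy as np
import pandas as pd

def permute_predictors(df, rng, study_col="STUDYID",
                       keep_cols=("SUBJECTID", "SHEDDING_SC1")):
    """Move each subject's entire block of predictors to another subject
    in the same study, leaving SUBJECTID and the outcome untouched."""
    predictor_cols = [c for c in df.columns
                      if c not in set(keep_cols) | {study_col}]
    permuted = df.copy()
    for _, idx in df.groupby(study_col).groups.items():
        idx = np.asarray(idx)
        shuffled = rng.permutation(idx)
        # copy whole predictor rows so the relationships among predictors are kept
        permuted.loc[idx, predictor_cols] = df.loc[shuffled, predictor_cols].to_numpy()
    return permuted

# usage (train_df is the assumed training table):
# rng = np.random.default_rng(0)
# perm_train = permute_predictors(train_df, rng)
```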
And yes, you need to re-train the models with each permutation.

"Inseparable condition" means the case where the positively labeled examples cannot be separated perfectly from the negative ones. I was trying to say that, while the permutations could lower the accuracy on the training data, the accuracy will still tend to be high in general because we learn and predict on the same data.
Now I am confused by this statement: "That is, subject 4046 should remain SHEDDING = 1, but you should permute the entire set of predictors from another subject in the same study." Why do we fix the label of 4046 and permute the others? Also, are you trying to say the following: "we are not going to re-train the models for each permutation, but just permute the predictions obtained from the already trained model using the true labels"?

Correct, you do not need to do LOOCVs on the permuted data. Just fit the data you used to train the model. Yes, there will be a difference in fit; however, comparing to the permuted data is the way to assess the overfit in this case...

Actually, I made a mistake in my earlier instructions: you want to permute the predictors relative to the SUBJECTIDs, but keep the relationship among the predictors. That is, subject 4046 should remain SHEDDING = 1, but you should permute the entire set of predictors from another subject in the same study. Apologies for any confusion this may have caused.
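As a rough illustration of the "just fit the data you used to train the model" step (no LOOCV on the permuted data), here is a minimal sketch; scikit-learn's LogisticRegression is only a placeholder model, not a method prescribed by the challenge:

```python
from sklearn.linear_model import LogisticRegression

def fit_and_predict_train(X_perm, y):
    """Refit on the permuted predictors and predict back on the same
    (training) rows -- no leave-one-out CV is needed for this step."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_perm, y)                      # y (the true outcomes) is never changed
    return model.predict_proba(X_perm)[:, 1]  # predicted probability of SHEDDING = 1
```

Each of the 10,000 permutations would repeat this with a freshly permuted predictor table, collecting the training (and test) predictions to submit for scoring.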
I don't understand your comment about inseparable label conditions. Could you please clarify? I am guessing that we are not going to do LOOCV on the training data while doing the permutation test, right? That is, to compute predictions on the training data, if we train the model on that same data the accuracies will tend to be somewhat higher than with LOOCV, even if some permutations lead to an inseparable label condition.

Zafer-
1. Evaluation metrics are outlined on this page: https://www.synapse.org/#!Synapse:syn5647810/wiki/399118; however, you don't need to score your own models. In fact, you cannot score your own models without the true labels in the test data. You are asked to provide the predicted values (for both the training and test sets) for the 10,000 permutations, and we will score them.
2. This is sometimes called a Monte Carlo permutation test, or simply a permutation test. In general, Monte Carlo refers to random simulation, of which a permutation test is one kind.
3. By my calculations, there are 1.02x10^27 combinations here for SC1, 7.47x10^28 for SC2, and 1.87x10^102 for SC3, excluding the sham samples. An exact test is not computationally feasible for most participants. (A short sketch of how such counts multiply across studies is below.)
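For intuition on where such counts come from: within one study the number of distinct label assignments is the binomial coefficient C(n, k) (n subjects, k of them positive), and independent studies multiply. A minimal sketch follows; apart from DEE4X (3 label-0 vs 9 label-1 subjects, mentioned elsewhere in this thread), the per-study counts are made-up placeholders, not the actual challenge numbers:

```python
from math import comb, prod

# (n_subjects, n_positive) per study -- placeholder values for illustration only
studies = {
    "DEE4X_H1N1": (12, 9),       # 3 label-0 vs 9 label-1 -> C(12, 9) = 220 distinct labelings
    "RHINOVIRUS_UVA": (20, 11),  # hypothetical counts
    "STUDY_3": (17, 8),          # hypothetical counts
}

per_study = {name: comb(n, k) for name, (n, k) in studies.items()}
total = prod(per_study.values())  # distinct labelings across all studies combined
print(per_study, total)
```

With seven studies multiplied together, the number of distinct labelings quickly exceeds anything that can be enumerated, which is why 10,000 random permutations (a Monte Carlo sample of this space) are used instead of an exact test.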
Solly

Thanks for the reply. A few other comments and questions:
1. For the last question, I think we compute the correlation between the permuted labels and the predictions for those labels, right? I just wanted to make sure.
2. I don't see any Monte Carlo steps here, right?
3. There was a recommendation, for small sample sets, to use an exact permutation test instead of a simulated one, but we might still have problems with class-imbalanced data. Anyway, this could still be an option if the total number of permutations is less than 10,000 but not too small.

Zafer-
In response to your questions:
1. Your permutations should permute the outcome relative to SUBJECTID (and thus all predictors). Thus, for one permutation, subject 4046 might have SHEDDING_SC1 = 0, and for that permutation you should build your model using the observed gene expression for the ranges `Up to time 0` and `Up to time 24`. In other words, the outcome is permuted relative to the predictors, but the predictors should never be permuted relative to each other.
2. By my count, DEE4X has 12 subjects in the training data, 3 with label 0 and 9 with label 1. This is what you should be permuting, and while there are not 10,000 combinations within this individual study, there will be plenty of random combinations across the data in the 7 studies.
3. The null hypothesis of a permutation test is that the observed data come from the same distribution as the randomly permuted data, and there is no association. In other words, we assume the observed data are just another permutation, so even permutations that are similar to the observed data are appropriate.
4. A permutation p-value is computed as the proportion, among the permutations plus the true observation, with score >= the true data's score. Thus, if 5 permutations score at least as well as the observed data, the p-value is 6/10,001 (because we also count the observed data itself). A short code sketch follows this list.
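For concreteness, a minimal sketch of that p-value calculation (the scores below are made-up numbers; in the challenge the organizers compute the actual scores from your submitted predictions):

```python
import numpy as np

def permutation_p_value(observed_score, permuted_scores):
    """p = (# of {permutations, observed} with score >= observed) / (N + 1)."""
    permuted_scores = np.asarray(permuted_scores)
    n_at_least = np.sum(permuted_scores >= observed_score) + 1  # +1 counts the observed data
    return n_at_least / (len(permuted_scores) + 1)

# Example with made-up scores: exactly 5 of 10,000 permutations beat the observed score
rng = np.random.default_rng(1)
perm_scores = rng.uniform(0.3, 0.6, size=10_000)
perm_scores[:5] = 0.95
observed = 0.9
print(permutation_p_value(observed, perm_scores))  # 6/10,001 ~ 0.0006
```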
I hope that answers your questions.