Dear organizers,
I have noticed that some proposed methods use PCA computed from the combined train and test sets; I am also worried that other pre-processing routines, such as standardisation, applied in the same way will result in test set leakage into the model. It goes without saying that you can attain smaller RMSEs by doing this. Will you consider test set leakage in the evaluation of the proposed models?
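For concreteness, here is a minimal sketch of the standardisation case (using hypothetical matrices `train` and `test`, with samples in rows and genes in columns, which are not part of the challenge code): the leaky variant estimates the column means and standard deviations on the combined data, whereas the leak-free variant estimates them on the training set only and reuses them for the test set.
```
# Hypothetical matrices `train` and `test`: samples in rows, genes in columns

# Leaky: scaling parameters are estimated on train AND test together,
# so the test samples influence the means/sds applied to the training data
combined     <- rbind(train, test)
combined_std <- scale(combined)

# Leak-free: estimate the parameters on the training set only and reuse them
mu        <- colMeans(train)
sdev      <- apply(train, 2, sd)
train_std <- scale(train, center = mu, scale = sdev)
test_std  <- scale(test,  center = mu, scale = sdev)  # no test information used
```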
Regards,
Francisco
Hi Adi,
Thanks for your response. I fully understand, and honestly prefer the format adopted for this competition. My question is not about how entries have been scored thus far, but rather about what you, the organisers, will ultimately consider an optimal model once the challenge is over - that is what I meant by "evaluation". I think this is an important distinction, as in practice the chosen model(s) will predict GA from "unseen" data in future diagnostic tests.
To illustrate my point, consider the simple case of a 10-times repeated 5-fold cross-validated principal component regression using 10 PCs. In one case we split the partitions AFTER conducting PCA (the leaky case); in the other we split the partitions BEFORE conducting PCA (the non-leaky case), which requires using the PCA `predict` method on the validation fold. Here is the code to simulate this:
```
# Thu Aug 8 18:26:51 2019 ------------------------------
# Load dataset & annotation + response
library(caret)
library(tidyverse)
library(beeswarm)
load("data/HTA20_RMA.RData")
anoSC1 <- read_csv("data/anoSC1_v11_nokey.csv")
RNA <- t(eset_HTA20)
# Define train and test sets
trainY <- dplyr::slice(anoSC1, which(Train == 1)) %>%
  select(SampleID, GA)
trainSet <- RNA[trainY$SampleID, ]

# Repeated 5-fold CV of a 10-PC principal component regression, with PCA
# computed either before (leaky) or after (non-leaky) splitting the folds
simFun <- function(iter, X, Y, leaky){
  sapply(1:iter, function(x){
    idx <- createFolds(Y, k = 5)
    sapply(idx, function(k){
      if(leaky){
        # PCA on ALL samples, so the held-out fold shapes the rotation
        pca <- prcomp(X, center = TRUE, scale. = FALSE)
        x <- pca$x[-k, 1:10]
        y <- Y[-k]
        mod <- lm(y ~ x)
        val <- cbind(rep(1, length(k)), pca$x[k, 1:10])
        pred <- val %*% mod$coefficients
      }else{
        # PCA on the training folds only; project the held-out fold afterwards
        pca <- prcomp(X[-k, ], center = TRUE, scale. = FALSE)
        x <- pca$x[, 1:10]
        y <- Y[-k]
        mod <- lm(y ~ x)
        val <- cbind(rep(1, length(k)),
                     predict(pca, X[k, ])[, 1:10])
        pred <- val %*% mod$coefficients
      }
      RMSE(pred, Y[k])
    })
  })
}

leaky <- simFun(10, trainSet, trainY$GA, leaky = TRUE)
notLeaky <- simFun(10, trainSet, trainY$GA, leaky = FALSE)
beeswarm(list(leaky = as.vector(leaky),
              notLeaky = as.vector(notLeaky)),
         ylab = "RMSE")
```
${imageLink?synapseId=syn20609845&align=None&scale=100&responsive=true&altText=}
This effect will manifest across different models and pre-processing routines. I think it is important for all participants to understand this, because you can certainly shave a few decimals off the RMSE by doing something that is ill-advised. I hope this makes sense!
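As a side note, and purely as a sketch under my own assumptions rather than a prescription: caret can handle the fold-wise pre-processing for you, since any `preProcess` requested in `train()` is re-estimated within each resample. Something along these lines should approximate the non-leaky pipeline above (note that caret's PCA also scales the predictors, unlike the `prcomp` call in my simulation, and the number of components is passed via `preProcOptions`):
```
# Sketch only: let caret re-estimate centering/PCA inside every resample
library(caret)
set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                     preProcOptions = list(pcaComp = 10))
pcrFit <- train(x = trainSet, y = trainY$GA,
                method = "lm",
                preProcess = c("center", "pca"),
                trControl = ctrl)
pcrFit$results$RMSE  # resampled RMSE without the leak
```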
Regards,
Francisco
Hi Francisco,
The only way to have avoided any leakage between the training and test sets would have been to not provide the test data at all and to request prediction models that we would then apply to the test data ourselves to obtain predictions. However, since we did not go that route, we have limited ability to prevent possible information leaks between the training and test sets.