Hi everyone,
Can you recommend some possible workflows for the analysis?
Thanks
Created by Raghav Awasthi
Raghav, here is a simple model in R to get you started. It scores an RMSE of 8.2 on the leaderboard.
It computes a linear regression based on the first 5 principal components of the genetic data:
```
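library(tidyverse)

# Load the expression matrix (eset_HTA20, genes x samples) and the sample annotation
load("HTA20_RMA.RData")
anoSC1_v11_nokey <- read_csv("anoSC1_v11_nokey.csv")

# One row per sample: 6 annotation columns, then expression (assumes matching sample order)
hta.all <- cbind(anoSC1_v11_nokey, t(eset_HTA20))

# PCA on the expression columns only; keep the first 5 components
pca <- prcomp(hta.all[, -c(1:6)])
dat <- cbind(hta.all[, c(1:6)], pca$x[, 1:5])

# Split into training and test samples and drop columns not used as predictors
hta.train <- dat %>% filter(Train == 1) %>% select(-Train, -SampleID, -Platform)
hta.test <- dat %>% filter(Train == 0) %>% select(-Train, -SampleID, -Platform)

# Linear regression (glm defaults to the gaussian family)
model <- glm(GA ~ ., data = hta.train)
pred <- predict(model, hta.test)

# Fill in the submission template and write it out
submit <- read_csv("TeamX_SC1_prediction.csv")
submit$GA <- as.vector(pred)
write.csv(submit, "submission.csv", row.names = FALSE, quote = FALSE)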
```
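If you want to tune the number of components without spending leaderboard submissions, one option is to estimate the RMSE locally with cross-validation. The snippet below is only a rough sketch of that idea, not part of the model above: it runs on synthetic data so that it is self-contained, and the names `expr`, `ga`, and `cv_rmse` are made up for illustration. Note that it refits the PCA inside each training fold, which is an assumption about how you would want to validate rather than what the baseline does.

```r
# Synthetic stand-in for the challenge data (sizes and values are made up).
set.seed(1)
n <- 120; p <- 200
z <- matrix(rnorm(n * 3), nrow = n)                   # 3 hidden factors
loadings <- matrix(rnorm(3 * p), nrow = 3)
expr <- z %*% loadings + matrix(rnorm(n * p), nrow = n)  # samples x genes
colnames(expr) <- paste0("gene", seq_len(p))
ga <- 25 + 3 * z[, 1] - 2 * z[, 2] + rnorm(n)         # fake outcome

# K-fold cross-validated RMSE of a linear model on the first k PCs.
cv_rmse <- function(x, y, k, folds = 5) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
  sq_err <- numeric(nrow(x))
  for (f in seq_len(folds)) {
    held_out <- fold_id == f
    # Fit the PCA on the training fold only, then project the held-out fold.
    pca <- prcomp(x[!held_out, , drop = FALSE])
    train <- data.frame(y = y[!held_out],
                        pca$x[, seq_len(k), drop = FALSE])
    test <- data.frame(
      predict(pca, x[held_out, , drop = FALSE])[, seq_len(k), drop = FALSE]
    )
    fit <- lm(y ~ ., data = train)
    sq_err[held_out] <- (y[held_out] - predict(fit, test))^2
  }
  sqrt(mean(sq_err))
}

# Compare a few candidate numbers of components.
ks <- c(1, 2, 5, 10, 20)
scores <- sapply(ks, function(k) cv_rmse(expr, ga, k))
names(scores) <- paste0("PC1-", ks)
print(round(scores, 2))
```

With the real data you would pass the transposed expression matrix and the GA values of the training samples in place of `expr` and `ga`, and pick the number of components with the lowest cross-validated RMSE.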
Hi Raghav,
You can look up the approaches that worked in other DREAM (https://scholar.google.com/citations?user=uGmMpmoAAAAJ&hl=en) or IMPROVER challenges (https://www.ncbi.nlm.nih.gov/pubmed/?term=IMPROVER+challenge+%5BTitle%5D) that involved transcriptomics data.
In the end, it is a matter of optimizing over a wide search space that includes the type of data you want to work with (gene level, exon/junction level, or isoform level), the feature selection or reduction method, and the type of predictive model.
Hi Raghav,
From my perspective, the choice of workflow primarily depends on which programming language you are willing to use. The datasets can be put to good use with both R and Python. Independently of that, this prediction problem seems straightforward, and I would risk saying one could do well in this challenge without knowing what is being predicted. Equally, I doubt that integrating gene functional annotation in any shape or form will help much. One could hypothetically mine the read data for SNPs, but again I think it would add little to the transcriptomes. Also, the expression data seem to be pretty clean in terms of pre-processing. So I guess many workflows of varying complexity will work, but I am confident the top submissions will be neural networks.
Regards,
Francisco