Though it might not matter now that the GSE73072 data is to be used instead of the original training data, further exploration shows that the training data samples may have been misidentified. Using a more up-to-date microarray CDF that includes a probeset for XIST (Entrez gene 7503), it is possible to identify a sample as either male or female.
The XIST expression values (which will be either very high for females or very low for males) do not correspond to the sex annotated in the ViralChallenge_training_CLINICAL.tsv file. Without access to the original clinical records, it is not possible for me to track the problem down any further. I do note, though, that my original observation of high immune system activation before time 0 still holds even with the GSE73072 data. Given the nature of these problems, I'm afraid that I cannot have much confidence in these data.
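For anyone who wants to repeat the check, here is a minimal sketch of the idea in Python. It is not the exact procedure I ran. It assumes you have already extracted one log2 XIST expression value per sample and the annotated sex per sample into plain dictionaries; the sample IDs, values, and function names below are made up for illustration, and the cutoff is simply the midpoint of the largest gap in the sorted XIST values, which only makes sense if both sexes are present.

```python
def infer_sex_from_xist(xist_by_sample, threshold=None):
    """Return {sample_id: 'F' or 'M'} from per-sample log2 XIST expression.

    If no threshold is given, use the midpoint of the largest gap between
    sorted values. This assumes the data contain both sexes and that XIST
    separates them cleanly (high in females, low in males).
    """
    values = sorted(xist_by_sample.values())
    if threshold is None:
        if len(values) < 2:
            raise ValueError("need at least two samples to pick a threshold")
        # Find the widest gap between neighbouring sorted values.
        gap, i = max((values[k + 1] - values[k], k) for k in range(len(values) - 1))
        threshold = (values[i] + values[i + 1]) / 2
    return {s: ("F" if v > threshold else "M") for s, v in xist_by_sample.items()}


def find_sex_mismatches(xist_by_sample, annotated_sex, threshold=None):
    """Return sorted sample IDs whose inferred sex differs from the annotation."""
    inferred = infer_sex_from_xist(xist_by_sample, threshold)
    return sorted(
        s for s, sex in inferred.items()
        if s in annotated_sex and annotated_sex[s] != sex
    )


# Hypothetical example data, not taken from the challenge files.
xist = {"S1": 11.2, "S2": 3.1, "S3": 10.8, "S4": 3.4, "S5": 11.0}
annotated = {"S1": "F", "S2": "M", "S3": "M", "S4": "M", "S5": "F"}

print(find_sex_mismatches(xist, annotated))  # ['S3']
```

In practice you would fill the two dictionaries from the normalized expression matrix and from ViralChallenge_training_CLINICAL.tsv, and it is worth plotting the XIST values first to confirm that they really do fall into two well-separated groups before trusting an automatic cutoff.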
I have a long backlog of gene expression studies to analyze, particularly in cancer and autoimmune disorders, so I'm not sure that it would be a good use of my time to continue working on this particular challenge. Sorry, but that's the way it is. Good luck, you'll need it :-)
Created by Alan Robinson robin073
Properly identifying samples as male or female matters because males carry additional genes on the Y chromosome, and some of those genes appear to be important in the immune response. Female subjects lack these genes, although their reported expression levels would probably not be exactly zero because of issues such as cross-hybridization. Overall, mislabeled sex could create problems in predictive model building.
This raises another issue: how can we do QC on the lab results so that we can be really sure the samples are what someone says they are? It's easy to get things mixed up when we are dealing with several thousand samples, as in this research effort. It would be helpful if there were a second study at a completely independent institution, but the Duke series of viral infection experiments are the only ones of this type that I can find. In contrast, there are many, many independent studies of cancer gene expression of various sorts, which allows us to replicate the results scientifically and increases our confidence in the overall results.
There is also a broader issue: the science behind gene expression in the immune system is still incomplete, and it is my firm belief that a full picture drawn from many experiments will be necessary to complete the jigsaw puzzle. The immune system is involved in many different diseases, notably cancer. I would like to build a complete, numerically accurate model of the immune system, but this is still a separate ongoing research project for me (possibly involving both oblique factor analysis and principal components as methods of dimensionality reduction). Until I have such a model, I can only do an informal check of the gene expression levels to decide whether the immune system is activated.
Thank you for pointing this out. We performed a gender analysis and found a small number of samples with probable gender issues, though this affected only about 1% of samples. Unfortunately, errors of this type typically occur in every data set, which is part of the fun of working with real data, and in my experience a rate this low is nothing to be alarmed at. I've uploaded a list of problematic samples to Synapse for those who may find it useful: syn6185093.
As for the immune activation you speak of, I don't believe it negates the problem at hand. Regardless of whether subjects start with an activated immune system, we still want to predict whether they will succumb to the virus to which they are exposed. It would, however, be useful to see whether these subjects are less predictable than those with different baseline conditions. Would you be willing to share your criteria for determining which subjects have an activated immune system at baseline?