Hi all, I am confused about the outcome label and where to find it. I think we are trying to rank the sensitivity of cell lines to a given drug using a confidence score between 0 and 1 (like a classification model would give). Where do we find the label of the sensitivity of a cell line to a given drug for the training data? My conceptual idea is input (gene expression changes) -> output (probability of sensitivity). But I can't find a sensitivity label at the cell-line level. My understanding is that the Achilles data is the closest thing to this sensitivity label we are provided, but it is given as the effect of individual gene changes on the viability of a given cell line. I'm sure we could look at how a given gene is affected by a drug in a specific cell line, combine that with the effect that gene has on the cell line (Achilles), and repeat this for all of the genes to create a cumulative score. We could then repeat this for all drugs on all cell lines and normalize the cumulative scores between 0 and 1 to rank them. But this wouldn't be able to consider interactions between genes (that I can think of), because there is no explicit training label of how the drug affected the cell line as a whole (that I can find/understand). So instead of training a statistical model, it seems like this would be more scoring, unless there is a label that shows how each tested drug affects the cell line as a whole. It is also possible that I have missed something or am out of my depth here. I would greatly appreciate any help. Best, James

Created by James Young JamesYoung
This was very helpful. Thank you all for your input and clarification. Best, James
I agree with the previous points. To train a statistical model, you probably need additional perturbation gene expression data (like LINCS-L1000, referenced on the *Data* page of the Wiki) and external AUC/IC50 data (like the ones Robert mentioned). The possibilities with a "mechanistic model" are much less explored, but maybe even more interesting in the biological context. Hope our answers made the problem clearer.
Yeah, I definitely agree with Robert. I think your statement "So instead of training a **statistical model**, it seems like this would be **more scoring**" is key. You can either:
(1) find external AUC/IC50 data and train a **"statistical model"**, or
(2) use the physical meaning of this data itself to score a **"mechanistic model"**.
We are a team of experimental scientists and data scientists and are interested in both approaches. I outline this in a bit more detail in the webinar, but you can imagine:
PANACEA gives data on: drug -> **differential-mRNA**
ACHILLES gives data on: **differential-mRNA** -> cell-death
Together these could be used to infer drug -> cell-death without training on drug -> cell-death data directly.
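As a rough illustration of option (2), here is a minimal sketch of the cumulative score James described, in plain Python. The data layout, the function name `mechanistic_scores`, and the sign convention (a negative gene effect means losing the gene reduces viability) are assumptions made for this example, not part of the challenge data format. It also treats genes as independent and additive, so it cannot capture gene-gene interactions.

```python
def mechanistic_scores(diff_expr, gene_effect):
    """Rank drugs for one cell line without any drug -> cell-death labels.

    diff_expr:   dict drug -> dict gene -> drug-induced expression change
                 (hypothetical layout for PANACEA-style data).
    gene_effect: dict gene -> effect on viability of losing that gene
                 (hypothetical layout for Achilles-style data; negative
                 values mean the cell line depends on the gene).
    Returns a dict drug -> score in [0, 1]; higher means more sensitive.
    """
    if not diff_expr:
        return {}

    # Cumulative score: a drug that lowers expression (negative change) of a
    # gene the cell depends on (negative effect) contributes positively.
    raw = {
        drug: sum(change * gene_effect.get(gene, 0.0)
                  for gene, change in changes.items())
        for drug, changes in diff_expr.items()
    }

    # Min-max normalize to [0, 1] so the scores can be used for ranking.
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {drug: 0.5 for drug in raw}  # no spread: all drugs tie
    return {drug: (value - lo) / (hi - lo) for drug, value in raw.items()}


# Toy example with made-up numbers.
toy_expr = {
    "drug_a": {"G1": -1.0, "G2": 0.5},
    "drug_b": {"G1": 1.0},
    "drug_c": {},
}
toy_effect = {"G1": -1.0, "G2": 0.0}
toy_scores = mechanistic_scores(toy_expr, toy_effect)
```

In the toy example, `drug_a` suppresses the gene the cell line depends on and gets the highest score, while `drug_b` raises it and gets the lowest.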
Hi James, Thanks for your patience and apologies for my delayed response. The actual outcome label for training will need to be generated by each participant, using publicly available databases like the GDSC1000 and CTRP. A great introductory resource is PharmacoDB, which aggregates gene expression and drug sensitivity data from many public resources. In this challenge, we are asking participants to use the post-treatment gene expression data provided as a surrogate outcome measure and predict sensitivity using this data (i.e. by comparing it to public datasets like Achilles or CTRP expression data). In brief, I think that the missing connection in the scenario you laid out is the drug sensitivity data (such as IC50, AUC, etc.) from public databases. Let me know if this doesn't make sense. Tagging @efd2115 and @SzalaiB for their thoughts as well.
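To make "generating the label yourself" concrete, here is a minimal sketch of pairing expression features with an external sensitivity value such as AUC. The `(cell_line, drug)` keys, the function name `attach_labels`, and the toy values are hypothetical; real GDSC/CTRP exports will need their own identifier matching first.

```python
def attach_labels(expression, sensitivity):
    """Pair expression features with externally sourced sensitivity labels.

    expression:  dict (cell_line, drug) -> list of expression features
                 (hypothetical layout for the challenge data).
    sensitivity: dict (cell_line, drug) -> AUC or IC50 from a public
                 database (hypothetical layout).
    Only pairs present in both sources are kept.
    Returns (keys, features, labels) as aligned lists.
    """
    keys = sorted(k for k in expression if k in sensitivity)
    features = [expression[k] for k in keys]
    labels = [sensitivity[k] for k in keys]
    return keys, features, labels


# Toy example with made-up values.
toy_expression = {
    ("CELL_1", "drug_a"): [0.2, -1.3],
    ("CELL_1", "drug_b"): [0.0, 0.4],
    ("CELL_2", "drug_a"): [1.1, 0.9],
}
toy_sensitivity = {
    ("CELL_1", "drug_a"): 0.35,
    ("CELL_2", "drug_a"): 0.80,
}
toy_keys, toy_X, toy_y = attach_labels(toy_expression, toy_sensitivity)
```

Pairs without an external measurement (here `CELL_1` with `drug_b`) are dropped, so the resulting training set can be much smaller than the expression data.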

What is the actual outcome label for training?