For sub-challenge 3, are we allowed to use more than 20 genes when training our model? We understand that only 20 genes will be used to test the 1297 cells, but we would like to use more than 20 genes during training. Is that allowed?

Created by Garam Lee (goeastagent)
Yuanfang, just to be clear: you cannot extract features from the 84 in situs in an exhaustive way. You have to pre-select the 20, 40 or 60 genes and treat the rest as if they did not exist. Sorry for the delayed response.
You cannot correlate to the 84, but you can train against the MCC, which is itself calculated from the 84? Are you sure?
Hi Yuanfang, you can only use the localization of the cells as given by the maximal MCC for training purposes; you cannot correlate 20 genes to the 84, as you would then actually be using all 84... This should be enough to train an algorithm. The 84 genes' in situs were released because they are public anyway...

Hi Barbara, you can use all genes in the RNAseq (as already stated above), but only a subset of the in situs from the 84 genes having spatial information. The RNAseq has no spatial information per se; the in situs have both expression and spatial information. Your second question relates again to the genes from the in situs you use, not the RNAseq. You can imagine several ways to weight depending on the genes you use; for now we will keep the details to ourselves. Thanks, Pablo
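The localization rule Pablo describes (assign each cell the position whose binarized in situ pattern has maximal MCC with the cell) can be sketched as follows. This is a toy illustration, not the organizers' scoring code; the array shapes and names are assumptions.

```python
import numpy as np

def mcc(a, b):
    """Matthews correlation coefficient between two binary vectors."""
    tp = np.sum((a == 1) & (b == 1))
    tn = np.sum((a == 0) & (b == 0))
    fp = np.sum((a == 1) & (b == 0))
    fn = np.sum((a == 0) & (b == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def localize(cell, reference):
    """Return the 1-based index of the reference position with maximal MCC."""
    scores = [mcc(cell, pos) for pos in reference]
    return int(np.argmax(scores)) + 1

# toy example: 4 positions x 5 genes (real reference is 3039 x 84)
reference = np.array([[1, 0, 1, 0, 1],
                      [0, 1, 0, 1, 0],
                      [1, 1, 1, 0, 0],
                      [0, 0, 1, 1, 1]])
cell = np.array([1, 0, 1, 0, 0])
print(localize(cell, reference))  # → 1
```

In the real challenge, `reference` would be the 3039 × 84 `binarized_bdtnp.csv` table and `cell` a binarized expression vector over the in situ genes.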
Sorry, another question just to be sure I have understood. Can we use all the genes in the scRNAseq dataset? We have to exclude those that correspond to the 84−20 mapped in situ genes, I suppose! Second question: in the challenge documentation you wrote that the scoring process will also weight the results based on how many genes (I suppose how many we use among the ~8000 scRNAseq genes) and which genes. What does "which" mean here? How does the scoring weight the type of gene? Thanks!
> the point is to use only 20 genes when training

So the biggest grey zone is how the 20 genes are selected. I am sorry, but I don't understand how this can be policed... If one first selects the 20 genes to maximize the similarity between the 20 and the 84, and then simply says "these are the 20 genes I used", is that allowed or not? From your answer about a potential docker environment, and the fact that the 84 is released in the first stage, it seems that using the 84 in the feature selection stage is allowed... but then wouldn't everyone get a perfect result? And if it is not allowed, why release the 84 gold standard table at all?

We should only see the scRNAseq and the binarized scRNAseq in the docker environment as the only input in the first stage, because you are not supposed to train from the 84 genes... Only then, in the second stage, would you see the 20×3039 table. Either way, I don't see how you can prevent explicit or implicit training from the 84 unless it is in a dockerized environment where only certain data is allowed to be seen and no model/external data is allowed... For example, what about using an external resource to implicitly maximize the similarity between the 20 and the 84? That seems to satisfy all the rules, but it is basically overfitting to the gold standard... by knowing the gold standard...
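To make the grey zone concrete, here is a toy sketch of the kind of selection Yuanfang is worried about: a procedure that scores every in situ gene against the full 84-gene binarized table and keeps the top 20. Because it reads the whole table, under the organizers' later clarification this would NOT be allowed; the scoring heuristic (total absolute gene-gene correlation) is purely illustrative.

```python
import numpy as np

def select_genes(binarized, k=20):
    """Toy (disallowed) selection: rank columns of the full binarized
    in situ table by how strongly each correlates with all the others,
    and keep the top k. Illustrates how the 84 could leak into selection."""
    corr = np.corrcoef(binarized.T)           # gene-by-gene correlation matrix
    scores = np.nansum(np.abs(corr), axis=1)  # total correlation per gene
    return np.argsort(scores)[::-1][:k]       # indices of the k "best" genes

# toy stand-in for the 3039 x 84 binarized table
rng = np.random.default_rng(0)
table = rng.integers(0, 2, size=(200, 84))
picked = select_genes(table, k=20)
```

The point of the sketch is exactly Yuanfang's: nothing in the submitted gene list reveals whether it was produced this way.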
Thanks for the clarification.
It seems clear to me that you would be using the information from the 84 genes in your final step... so NO!
One final question: let's say I make a classifier that converts a given expression vector from 8924 features to a binarized vector of 84 elements, but I only used 60 in situ genes for training (imagine I had a way of doing this). Would I be allowed to calculate the in situ IDs (the 1-3039 labels) by computing the MCC between my binarized 84-element vector and the 'binarized_bdtnp.csv' table, or would I only be allowed to do this final mapping step using the 60 corresponding columns of the binarized_bdtnp.csv table?
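The restricted mapping this question asks about could look like the sketch below: the MCC is computed only over an allowed column subset (standing in for the 60 training genes), so the remaining columns of `binarized_bdtnp.csv` never enter the final step. Names and shapes are illustrative.

```python
import numpy as np

def mcc(a, b):
    # Matthews correlation coefficient for two binary vectors
    a, b = a.astype(bool), b.astype(bool)
    tp = np.sum(a & b); tn = np.sum(~a & ~b)
    fp = np.sum(a & ~b); fn = np.sum(~a & b)
    d = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if d == 0 else (tp * tn - fp * fn) / d

def map_cell(pred, reference, allowed_cols):
    """Assign the 1-based position whose reference row best matches the
    predicted binary vector, using ONLY the allowed columns (e.g. the 60
    in situ genes actually used for training)."""
    scores = [mcc(pred[allowed_cols], row[allowed_cols]) for row in reference]
    return int(np.argmax(scores)) + 1

# toy example: 2 positions x 4 genes, first 3 columns allowed
ref = np.array([[1, 0, 1, 0],
                [0, 1, 0, 1]])
pred = np.array([1, 0, 1, 1])
print(map_cell(pred, ref, np.array([0, 1, 2])))  # → 1
```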
You can use all the RNAseq data you want to localize the cells, but only 20 genes from the in situs, i.e. bdtnp. The 20genes.csv file contains the 20 genes you will use from bdtnp and the 10 predicted possible positions for each of the 1297 cells, each position indicated by a number from 1 to 3039.
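A minimal sketch of writing such a file is below. The exact layout (gene header row, then one row per cell with its 10 positions) is an assumption made for illustration; check the official submission template before relying on it.

```python
import csv
import os
import tempfile

def write_submission(path, genes, predictions):
    """Write a sketch of the sub-challenge 3 submission: the 20 selected
    in situ genes, then one row per cell with its 10 predicted positions
    (each a number from 1 to 3039). Layout is illustrative only."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(genes)  # the 20 selected bdtnp genes
        for cell_id, positions in sorted(predictions.items()):
            assert len(positions) == 10, "exactly 10 positions per cell"
            w.writerow([cell_id] + list(positions))

# toy usage with hypothetical gene names and two cells
genes = ["gene%02d" % i for i in range(20)]
preds = {1: list(range(1, 11)), 2: list(range(11, 21))}
out = os.path.join(tempfile.mkdtemp(), "20genes.csv")
write_submission(out, genes, preds)
```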
I am honestly still a bit confused about this challenge and what we are and are not supposed to use. The goal for sub-challenge 3 is to fill the 20genes.csv file such that table[i,j] is either a 1 or a 0, correct? Is this done using ONLY the dge_raw.txt data, and only the specified 20 genes within it? Presumably our results should match binarized_bdtnp.csv, but are we not allowed to use the binarized_bdtnp.csv file as labels for training?
Pablo, I trust your team can deploy it in time, because you have extensive experience with containers. If we had a docker environment, we could easily check that no model, no binary build and no constant/cutoff is included in the code by running a quick sanity check, and perhaps also, as you suggested, by letting the organizers and the second/third-place teams check the code. It is just my personal opinion that it is important to present solid results instead of hurrying the study out. I think we can randomize the order of the cells/positions as a second layer of security, whether or not it is published. Yuanfang
Hi Yuanfang, we considered dockerizing, but given that all the data is public, it was difficult to do anything similar to what you suggest for the randomization of cells and genes. I like your idea of the two-step submission, but I am not sure we will be able to deploy it in time. Pablo
> this DREAM challenge does not have a blind Gold standard

You could still make it a dockerized environment to ensure trustability. Then only executable code can be submitted in the docker to select features. On the deployment side, you can simply randomize the order of the locations in the location file, anonymize the gene names, and permute the cell and gene orders in both the 84-gene file and the whole ~8000-gene file. This way, you only need to check that nothing but code is submitted in the docker: no text file or other binary encoding file. Otherwise, anyone can simply compute a hash between any pair of genes (or any gene set of their choice) and the localization, and just say "this is the model I built and these are the gene signatures".

At the testing stage, you could do a two-step pass as in the ALS challenge: in the first step, participants output their selected feature list into a temp file, and the temp folder is checked to contain only this 20-gene list. When the program passes the first step, then in the next step the program should only see those 20 genes from the small 84-gene table plus the same 20 in the big ~8000-gene table, but nothing else.
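The de-identification step proposed above (shuffle cell and gene order, rename genes) could be sketched like this. It is a hedged illustration of the idea, not challenge infrastructure; the function and naming scheme are assumptions.

```python
import numpy as np

def anonymize(expr, rng):
    """Shuffle cell (row) and gene (column) order and replace gene names
    with opaque labels, so hard-coded lookups against the public files
    no longer line up. Returns the permutations so organizers can invert."""
    n_cells, n_genes = expr.shape
    cell_perm = rng.permutation(n_cells)
    gene_perm = rng.permutation(n_genes)
    shuffled = expr[cell_perm][:, gene_perm]
    new_names = [f"G{i:05d}" for i in range(n_genes)]
    return shuffled, new_names, cell_perm, gene_perm

# toy usage: 3 cells x 4 genes
X = np.arange(12).reshape(3, 4)
shuffled, names, cell_perm, gene_perm = anonymize(X, np.random.default_rng(0))
```

As Yuanfang notes, this only helps if the permuted copies are the sole data visible inside the container; since the originals are public, renaming alone does not blind the gold standard.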
You can use all the 8000 genes from the RNAseq, but you can only use information from 20 genes from the 84 in situs
Cheating is prevented first of all by trusting the participants, and second by using and testing their code... But you are right, this DREAM challenge does not have a blind gold standard.
@jeriscience OK, got it. One more question: the scRNA-seq data has about 8000 genes, so I'm confused whether we have to use only 20 genes from the scRNA-seq data when testing, even though about 8000 genes were given.
How is cheating prevented in this case then?
Hi, ABSOLUTELY NOT. The point is to use only 20 genes when training; scoring will take into account which 20 genes you use, but will be based on all 84 genes. Thanks for participating, and sorry if this was unclear.

sub-challenge 3: feature genes