The webinar slides are available [here](syn18906690).
We are working to transcribe the questions and answers from the webinar.
Thank you for joining today!
Created by Brian White brian.white
Thanks for the response @brian.white, this answers my question.
Hi @zurkin
I have just answered a related question posted to the discussion, which should answer your question. Briefly, the validation datasets will be ~100 samples. Some of the leaderboard datasets are relatively small: the smallest two have 5 and 13 samples. Please post any follow-ups regarding sample set size to a new forum question or to the one entitled 'How many sample will provide in the validation phase?'
Brian
Hello @tomsnir,
I apologize for the delayed response.
We prefer not to say much more about the "unknown content." You do not need to predict it; i.e., if you provide proportions, they will likely sum to less than 1 for those admixtures with unknown content.
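As a rough illustration of that last point, here is a minimal Python sketch. It is not part of the challenge infrastructure; the function name, the cell type labels, and the numbers are all hypothetical. It only shows that when proportions are predicted for the known cell types alone, whatever is left over can be read as the unassigned (unknown) fraction.

```python
import numpy as np

def unknown_fraction(proportions):
    """Return the fraction left unassigned by predicted cell-type proportions."""
    p = np.asarray(proportions, dtype=float)
    if np.any(p < 0):
        raise ValueError("proportions must be non-negative")
    total = p.sum()
    if total > 1 + 1e-9:
        raise ValueError("proportions sum to more than 1")
    # Clamp at zero to absorb floating-point rounding when the total is ~1.
    return max(0.0, 1.0 - total)

# Hypothetical prediction for one admixture (labels and values are made up).
predicted = {"B.cells": 0.20, "CD4.T.cells": 0.30, "monocytes": 0.25}
residual = unknown_fraction(list(predicted.values()))
print(f"known total = {sum(predicted.values()):.2f}, unassigned = {residual:.2f}")
```

In this sketch nothing is predicted for the unknown content itself; the residual is just what the known cell types do not account for.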
By "publically-available" we meant that the data used for the leaderboard has already been published. In some cases, these datasets may have been used to assess deconvolution methods. The validation data has been generated for this challenge and no method will have been previously run against it.
I'm sorry--I don't follow your last question about LM22 and the GEO datasets. Would you please ask again? I assume you recognize that there could be significant batch effects between a particular GEO dataset and the data used to derive the LM22 signature matrix. Note also that CIBERSORT quantile normalizes data by default. Unfortunately, we can't make any guarantees regarding the quality, purity, or correctness of the cell types allegedly expressed in the GEO datasets. We have attempted to do a reasonable, if very quick, curation of these datasets to get you pointed in the right direction. You will likely want to convince yourself that such data are appropriate for use in training your methods.
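For anyone unfamiliar with quantile normalization, the following is a minimal, generic sketch in Python with NumPy. It is an assumption-laden illustration of the general technique, not CIBERSORT's actual implementation, and the toy matrix is made up. It assumes a genes x samples matrix and breaks ties by sort order rather than averaging them.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile normalize the columns (samples) of a genes x samples matrix."""
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0)                       # per-sample ordering of genes
    sorted_vals = np.take_along_axis(expr, order, axis=0)  # each column sorted ascending
    reference = sorted_vals.mean(axis=1)                   # mean value at each rank
    ranks = np.argsort(order, axis=0)                      # rank of each gene in its sample
    return reference[ranks]                                # replace values by rank means

# Toy matrix: 4 genes (rows) x 3 samples (columns), values are made up.
expr = np.array([[5, 4, 3],
                 [2, 1, 4],
                 [3, 3, 6],
                 [4, 2, 8]])
normalized = quantile_normalize(expr)
print(normalized)
```

After this step every sample has the same distribution of values, which is one reason a dataset processed differently from the signature matrix data may still show batch effects that are worth checking.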
Brian
I would also appreciate an answer to this question.
Also, can we assume that each dataset will contain more than a few samples?
Hello,
I have a couple of questions regarding things that were mentioned in the webinar:
In slide 31, it is said: "Some admixtures will contain 'unknown content'; others will not."
Is there anything you can say about these unknown cell types? Could it be any type of cell, or limited to cells related to those found in PBMCs? How should our model predict these? Should we be able to label them as "unknown", or should our total be less than 100%?
In slide 34, it reads: "Leaderboard phase: methods scored using (publically-available) data with ground truth"
Can you please clarify what publicly-available data means?
Thank you,
Tom.