Dear organisers,
Could we access the actual expression bins (e.g., [13, 13, 14]) that each sequence was assigned, rather than just the average (13.333)? This would allow tackling the challenge as a classification problem, which might be more manageable at first.
Thank you!
Created by Héctor Climente González hclimente

Dear @cdeboer,
Thank you for the clarification. As a matter of fact, we had considered simply rounding the provided values. But we were reluctant, since the decimal part was noisier than we had expected: we anticipated only simple fractions with small denominators (1/2, 2/3, etc.). Hence we wondered whether the underlying variance was larger than we thought, and whether there was a better way to obtain class values (e.g. a median), or a way to know which sequences are reliable (only detected in one bin) and which are not (seen in multiple, very different bins). Some measure of the bin variance might be helpful, if only to understand the data better. However, your explanation makes perfect sense, and we also prefer to focus on the machine learning problem.
Best regards,
Héctor.

Hi @hclimente
We want to avoid having the challenge turn into data processing optimization. I'm sure what we are doing to process the data could be improved, but we want to leave that for another day.
The raw data include sequencing errors, and so may mislead the models via these errors (some of which are corrected when we consolidate sequences into related groups). We observe each sequence as a distribution over the bins (although most are in a single bin). Further, each bin had a different number of cells sorted into it, as well as a different sequencing depth. Accordingly, what you are asking for is actually a table of sequences+bins+counts (i.e. the read counts for each sequence in each bin). We think that providing these data would likely lead people to focus on data processing optimization rather than the machine learning problem, and so we would prefer not to provide them.
If you want class data, your best bet is to round to the nearest integer. You could also exclude any sequences whose values are not integers. The majority of the training values are already integers, from having been observed in only one bin.
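As a rough illustration of the rounding and filtering suggested above, here is a minimal sketch in Python. The function name `to_classes`, the tolerance `tol`, and the example values are assumptions made for illustration; they are not part of the challenge data or tooling.

```python
import numpy as np

def to_classes(expression, tol=1e-6):
    """Round expression values to integer classes and flag integer-valued ones.

    Hypothetical helper: `expression` is any sequence of the provided
    (possibly fractional) expression values.
    """
    expression = np.asarray(expression, dtype=float)
    # Round half up (np.rint would round x.5 to the nearest even integer).
    labels = np.floor(expression + 0.5).astype(int)
    # True where the provided value was already (numerically) an integer.
    is_integer = np.abs(expression - labels) <= tol
    return labels, is_integer

# Made-up example values, not taken from the training data.
values = [13.0, 13.333, 7.0, 10.5]
labels, is_integer = to_classes(values)

# Option 1: keep everything, using the rounded labels.
# Option 2: keep only the sequences whose value was already an integer.
integer_only_labels = labels[is_integer]
```

Note that an integer value does not strictly guarantee a single bin (two bins could average to an integer), so the mask is only a proxy for "observed in one bin".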
Keep in mind that the challenge will be evaluated as a regression problem, and so your predicted classes would have to be converted to a score before evaluation.
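One possible way to turn classifier output back into a score is to take the probability-weighted mean of the bin values. This is only a sketch under assumptions: the function `classes_to_score`, the bin values, and the probabilities below are hypothetical, and taking the most likely class (argmax) would be an equally valid choice.

```python
import numpy as np

def classes_to_score(probabilities, bin_values):
    """Convert per-class probabilities into one continuous score per sequence.

    Hypothetical helper: `probabilities` has shape (n_sequences, n_classes)
    and `bin_values` holds the numeric bin that each class stands for.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    bin_values = np.asarray(bin_values, dtype=float)
    # Expected bin value under the predicted class distribution.
    return probabilities @ bin_values

# Made-up example: three classes standing for bins 12, 13 and 14.
bin_values = [12, 13, 14]
probs = [
    [0.0, 1.0, 0.0],  # confident prediction of bin 13
    [0.0, 0.5, 0.5],  # split between bins 13 and 14
]
scores = classes_to_score(probs, bin_values)
```

The expected value keeps the model's uncertainty in the score, which tends to suit a regression-style evaluation better than a hard class label.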
-carl