Hi,
Could you tell me which metric is used to score performance on isoform-level expression quantification in round 1? Is it Spearman's correlation?
Thanks,
Bo
Created by Bo Li BigBadBo
We're actively discussing the best ways to evaluate expression quantification. Thank you for your ideas. We'll be adding more metrics to the benchmark reports. We've already added code for evaluation based on quantiles ( https://github.com/Sage-Bionetworks/SMC-RNA-Challenge/commit/406ac6d1ca9e54f77a8390d9cb6ab0c33d8a6d3b ), and we'll look at your ideas about FP rate reporting and thresholding. We want to make sure that the challenge places appropriate weight on the ability to report low expression values.
Hi Kyle,
Thank you for your detailed reply and for sharing the evaluation code.
I have looked into the evaluator. It seems that for the log-transformed Pearson metric, you use all known isoforms for evaluation and add 0.01 to each isoform before taking the log. I have a concern about this approach because the non-expressors would dominate the measure.
Here is an example using the ground truth for sim31 (round 2). There are 196,501 isoforms in total, and 157,677 of them are not expressed (TPM = 0) according to the ground truth. This means over 80% of the isoforms are not expressed. Each non-expressed isoform has a log TPM value of -4.6 (log(0.01)), which is of the same magnitude as that of the moderate-to-high expressors (e.g. an isoform of 100 TPM has a log value of log(100 + 0.01) = 4.6). But there are only 165 isoforms with TPM >= 100, which make up 0.08% of all annotated isoforms.
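To make the arithmetic above easy to check, here is a small Python sketch. It only reuses the counts quoted in this post and assumes a natural log with a 0.01 pseudocount; the names `log_tpm` and `PSEUDOCOUNT` are mine for illustration and are not taken from the evaluator.

```python
import math

PSEUDOCOUNT = 0.01  # value added to each TPM before the log, as described above


def log_tpm(tpm, pseudocount=PSEUDOCOUNT):
    """Natural log of a TPM value after adding the pseudocount."""
    return math.log(tpm + pseudocount)


# Counts quoted above for the sim31 (round 2) ground truth.
total_isoforms = 196_501
unexpressed = 157_677      # isoforms with TPM = 0
high_expressors = 165      # isoforms with TPM >= 100

frac_unexpressed = unexpressed / total_isoforms
frac_high = high_expressors / total_isoforms

print(f"log(0 + 0.01)   = {log_tpm(0.0):.1f}")
print(f"log(100 + 0.01) = {log_tpm(100.0):.1f}")
print(f"unexpressed fraction: {frac_unexpressed:.1%}")
print(f"TPM >= 100 fraction:  {frac_high:.2%}")
```

The two log values have nearly the same absolute size, so every one of the 157,677 zero-TPM isoforms sits as far from zero on the log scale as a 100 TPM isoform does.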
In fact, I think the Spearman correlation has the same issue, while the untransformed Pearson correlation is instead biased towards high expressors.
One way to avoid these biases is to partition the ground truth into two parts: isoforms that are expressed and isoforms that are not expressed. Then for expressed isoforms, we can calculate the Spearman, Pearson or log Pearson correlations. For the unexpressed isoforms, we can evaluate each method by its false positive rate (i.e. percentage of nonexpressed isoforms reported as expressed). To determine which isoforms are expressed, we can set a threshold on TPM (e.g. 1 TPM).
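A minimal sketch of this partitioned evaluation, in plain Python, could look like the following. It is only an illustration of the idea, not the challenge's evaluator: the function names, the toy TPM values, and the choice to apply the same 1 TPM threshold to both the ground truth and the predictions are my assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partitioned_scores(truth, predicted, threshold=1.0, pseudocount=0.01):
    """Score expressed and unexpressed isoforms separately.

    Returns (log-Pearson over truly expressed isoforms,
             false positive rate over truly unexpressed isoforms).
    Assumes both groups are non-empty.
    """
    pairs = list(zip(truth, predicted))
    expressed = [(t, p) for t, p in pairs if t >= threshold]
    unexpressed = [(t, p) for t, p in pairs if t < threshold]

    # Correlation on the log scale, restricted to truly expressed isoforms.
    log_truth = [math.log(t + pseudocount) for t, _ in expressed]
    log_pred = [math.log(p + pseudocount) for _, p in expressed]
    log_pearson = pearson(log_truth, log_pred)

    # False positives: truly unexpressed isoforms reported at or above the threshold.
    false_positives = sum(1 for _, p in unexpressed if p >= threshold)
    fp_rate = false_positives / len(unexpressed)
    return log_pearson, fp_rate


# Hypothetical toy TPM values, for illustration only.
truth = [0.0, 0.0, 0.0, 0.0, 2.0, 10.0, 150.0]
predicted = [0.0, 0.3, 1.5, 0.0, 2.5, 8.0, 140.0]

log_pearson, fp_rate = partitioned_scores(truth, predicted)
print(f"log-Pearson on expressed isoforms: {log_pearson:.3f}")
print(f"false positive rate on unexpressed isoforms: {fp_rate:.2f}")
```

The Spearman or untransformed Pearson correlation could be computed on the `expressed` subset in the same way; the point is that the large block of unexpressed isoforms is summarized by one rate and no longer enters the correlation.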
What do you think?
Thanks,
Bo
There was a bug that caused it to not render for a little while. Round 1 stats should be back now: https://www.synapse.org/#!Synapse:syn2813589/wiki/408787
Hi Allison,
I have checked the round 1 leaderboard but I could not find any results for the isoform quantification task there.
Best,
Bo
We have updated the round 1 [leaderboard](https://www.synapse.org/#!Synapse:syn2813589/wiki/408787) to include these metrics. There has been a lot of internal discussion about your question (and your previous question, https://www.synapse.org/#!Synapse:syn2813589/discussion/threadId=643 ). Most genes are expressed at low levels, i.e. above the noise floor but still at low abundance, and part of this challenge is to measure an algorithm's ability to detect and estimate these transcripts accurately. We understand that Spearman correlation may not be an ideal measure, as the ranks among the lowly expressed transcripts may be rather arbitrary. For that reason, we are also implementing a log-transformed Pearson metric and will use that as well to rank algorithms. We'll be adding that code into the evaluator ( https://github.com/Sage-Bionetworks/SMC-RNA-Challenge/pull/36 ). These values will be part of the next leaderboard.
Hi Allison,
Thanks for your quick reply. Is the Spearman's correlation computed over all isoforms, or were some lowly expressed isoforms filtered out?
Thanks,
Bo
Correct, the leaderboard shows the Spearman's correlation for the isoform quantification.
What is the metric used for scoring performance on isoform-level expression quantification in round1?