The leaderboards are displaying recall at FDRs of 5%, 10% and 25%, but the wiki says 10% and 50% will be used. Can you clarify which will be used?

Created by John Reid (Epimetheus)
Thanks, that takes away all my concerns. I just remembered that you can submit your baseline to double-check the final test, so I was worrying about something as unlikely as being hit by a meteorite. Thanks again.
> In your code, you mount chromosome 8 (/mnt/data/leaderboard_labels_w_chr8/). Isn't that what you used for the leaderboard phase? (I was surprised, because the wiki says chr1 and chr21.) Or does the "w" mean "without"? Then in the wiki, under the final test set submission section, it says "Methods will be evaluated, scored and compared using TF ChIP-seq binding sites in the held-out chromosomes (chr1, chr8, chr21) in these true-blind held-out cell types for the specific subset of TFs."

We only use chromosomes 1 and 21 for the leaderboard calculation, but we ask for chromosome 8 so that we can track progress on a blind chromosome. We will only use chromosome 8 for the final scoring.

> I am just worried about the mapping issues. But once you open your final test queue, I will submit one similar to my TAF performance, so you will have about 1 month to double-check, OK?

Sorry, I'm not sure what you mean by the mapping issues.

> I have another question: when two cell lines map to a single TF, what do you do? It was previously missing on the leaderboard for the second one.

We score each independently. The missing leaderboard entry was a bug, which should now be fixed.

Best,
Nathan
I am really confused. In your code, you mount chromosome 8 (/mnt/data/leaderboard_labels_w_chr8/). Isn't that what you used for the leaderboard phase? (I was surprised, because the wiki says chr1 and chr21.) Or does the "w" mean "without"? Then in the wiki, under the final test set submission section, it says "Methods will be evaluated, scored and compared using TF ChIP-seq binding sites in the held-out chromosomes (chr1, chr8, chr21) in these true-blind held-out cell types for the specific subset of TFs."

I am just worried about the mapping issues. But once you open your final test queue, I will submit one similar to my TAF performance, so you will have about 1 month to double-check, OK?

I have another question: when two cell lines map to a single TF, what do you do? It was previously missing on the leaderboard for the second one.
> Does that mean in the final test set you are going to cut the 3 chromosomes out for evaluation, since the rest are not used?

No, we will only use chromosome 8 for the final evaluation.

> (BTW, yes, I did see that your code uses [index], so essentially pasting the two together.)

I'm not sure what you mean by this.

I didn't understand your FDR problem example - could you please include the actual data points and the output from the call to precision_recall_curve?

Thanks,
Nathan
Thanks for your explanation, Nathan and Anshul! May I know how you quote and reply like that? In many cases I also need to do that.

Re: "The predictions need to be submitted in exactly the same order as the reference files - including chromosome order." Does that mean in the final test set you are going to cut the 3 chromosomes out for evaluation, since the rest are not used? (BTW, yes, I did see that your code uses [index], so essentially pasting the two together.)

I printed out my precision vector, and it has these points; that's why I was worried:

[0.0015841 (my baseline precision was 0.00055), 0.00214684, 0.00237417, 0.00237982, 0.0030012, 0.0030581, 0.00309406, 0., 0., 0., ... (a long run of zeros) ..., **1.** (should have been zero)]

Thanks
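For reference, the trailing 1.0 is reproducible on a tiny made-up example (toy data only, nothing from the challenge): scikit-learn's precision_recall_curve always appends a final precision = 1, recall = 0 point that has no corresponding threshold, which is where that last value comes from.

```python
# Tiny repro on made-up data (not the challenge labels): the final entries of
# the arrays returned by precision_recall_curve are precision = 1 and recall = 0
# by construction, with no corresponding threshold, which is where the trailing
# 1.0 in a printed precision vector comes from.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([0, 0, 0, 1, 1])            # the top-scored examples are all negatives
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision)                          # ends in 1.0 by convention, not an observed precision
print(recall)                             # ends in 0.0
print(len(thresholds), len(precision))    # one fewer threshold than precision points
```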
Hi Yuanfang,

To clarify, for the precision/recall calculation there is no interpolation or any complicated computation. Scikit-learn and the R code both simply take the observed unique thresholds in the predictions and compute the precision and recall at each threshold. Then we simply select the recall at an observed prediction threshold that has the closest conservative FDR to the one we want. E.g. if a method has an observed prediction value for which recall is 0.01 at 8% FDR, and the next prediction value gives a recall of 0.02 at 15% FDR, then we will report 0.01 recall at 10% FDR (by selecting the most conservative observed FDR, i.e. 8%). This is generally the standard approach to report recall at specified FDRs and does not involve any complicated calculation the way the area under the curve does. It may be possible to use interpolated recalls at specific FDRs, but that is quite an unconventional approach that hasn't been tested extensively.

As Nathan said, the following scenarios are certainly possible and sensible:
* method1 > method2 in auROC but method1 << method2 in auPRC (e.g. see http://www.nature.com/nmeth/journal/v13/n8/fig_tab/nmeth.3945_F4.html)
* method1 > method2 in auPRC but method1 < method2 in recall at some FDRs (especially lower FDRs) and method1 > method2 at other FDRs. This is possible because the precision-recall tradeoffs are not monotonic like sensitivity/specificity.

For example, in the case that you showed, HINT does better than the baseline at the higher FDR (last column) but worse at lower FDRs. This behavior is not unexpected. E.g.

7106578 Nathan Boley SCORED 0.7426 0.1982 0.0009 0.0016 0.0229
7115061 HINT SCORED 0.8618 0.2596 0.0001 0.0001 0.1257

We certainly want to keep the recall at specific FDRs in the evaluations, as specific FDR guarantees are what we are really interested in when using the classifier as a predictor on the genome. At the same time, the auPRC is a good summary across the entire spectrum of recall and precision. As Nathan said, this is why we will be combining multiple metrics (we'll share this code as well in the next week or so), since none of them is perfect and ideal. Evaluating the statistical significance of differences between methods should eliminate any instability issues. We are running tests on the metric-combining method to see how it behaves, and will share this soon. If you have any suggestions or thoughts, let us know.

Thanks,
Anshul.
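To make the selection rule concrete, here is a small sketch of it (illustrative only; the function name recall_at_fdr is made up and the official score.py may differ in detail): among all observed thresholds whose empirical FDR, i.e. 1 - precision, is at or below the target, report the largest recall.

```python
# A sketch of "recall at a conservative FDR": among the observed thresholds whose
# empirical FDR (= 1 - precision) is at or below the target, report the largest
# recall. Illustrative only -- the official score.py may differ in detail.
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_fdr(y_true, y_score, fdr_cutoff=0.10):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    fdr = 1.0 - precision                  # empirical FDR at each observed threshold
    ok = fdr <= fdr_cutoff                 # thresholds that meet the target FDR conservatively
    return recall[ok].max() if ok.any() else 0.0

# Toy example: only thresholds whose observed FDR is <= 10% contribute,
# matching the "report 0.01 recall at 10% FDR" logic described above.
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
print(recall_at_fdr(y_true, y_score, fdr_cutoff=0.10))
```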
Hi Yuanfang,

> May I ask: when there is a subset of chromosomes, does the order of the chromosomes matter? And how does it match to the selected chromosomes?

The predictions need to be submitted in exactly the same order as the reference files - including chromosome order.

> Also, I still see several cases where the precision@fixed recall is the opposite of what is expected from auROC/auPRC. Do you think it is real,

I suspect that this is due to variance in the precision@fixed recall calculation. For the final combined scoring, we will ignore differences that aren't statistically significant, so situations like this one shouldn't be a problem. That being said, it is possible to have a higher auPRC but lower recall at fixed FDR -- which is why we are including all 4 measures in the final calculation.

> or do you think it is an artifact from some weird calculation in scikit-learn?

AFAICT scikit-learn is doing the correct thing when it comes to calculating the recall/precision values at each threshold. The problem with the auPRC calculation comes from how it calculates the area under the curve.

Best,
Nathan
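A quick way to see the area-calculation point on made-up data (a toy illustration, not the challenge scoring code): in current scikit-learn versions you can compare trapezoidal integration of the same precision/recall points against the step-wise average_precision_score; for spiky precision-recall curves the trapezoidal number typically comes out larger.

```python
# Toy comparison (made-up data, not the challenge scoring code) of two summaries
# of the same precision/recall points: trapezoidal integration linearly
# interpolates between observed points, while average_precision_score uses a
# step-wise sum with no interpolation. For spiky precision-recall curves the
# trapezoidal number typically comes out larger.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

rng = np.random.default_rng(0)
y_true  = (rng.random(2000) < 0.02).astype(int)     # sparse positives, like TF binding labels
y_score = rng.random(2000) + 0.1 * y_true           # weakly informative scores

precision, recall, _ = precision_recall_curve(y_true, y_score)
print("trapezoidal auPRC :", auc(recall, precision))
print("average precision :", average_precision_score(y_true, y_score))
```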
Thanks so much, Nathan! On behalf of all participants, I acknowledge this quick action as representative of the DREAM spirit of open science!

May I ask: when there is a subset of chromosomes, does the order of the chromosomes matter? And how does it match to the selected chromosomes?

Also, I still see several cases where the precision@fixed recall is the opposite of what is expected from auROC/auPRC. Do you think it is real, or do you think it is an artifact from some weird calculation in scikit-learn? E.g.

7106578 Nathan Boley SCORED 0.7426 0.1982 0.0009 0.0016 0.0229
7115061 HINT SCORED 0.8618 0.2596 0.0001 0.0001 0.1257

e.g.

7112610 Alex SCORED 0.6650 0.3193 0.0180 0.0270 0.1352
7115057 HINT SCORED 0.8342 0.1654 0.0101 0.0211 0.0516

e.g.

7106579 Nathan Boley SCORED 0.7357 0.2454 0.0018 0.0073 0.0600
7115043 HINT SCORED 0.9368 0.1465 0.0003 0.0003 0.0006

Just these few to illustrate, but it looks to me like a similar artifact to the auPRC one. (I sincerely apologize to the teams whose performance I have repeatedly posted on the forum, and hope you understand.)

Thanks a bunch,
Thanks for that Nathan!
Hi Epimetheus and Yuanfang, I've posted the scoring code (score.py) in the resources directory. I hope this helps! Please feel free to contact me with any questions. Best, Nathan
That's so nice of you, Anshul! I deeply appreciate it. I also learned it from one of the organizers quite recently; I think it might be entertaining for everyone if we reveal who when we conclude this challenge.
Also, Yuanfang: we will definitely acknowledge your contribution in finding this bug in the scikit-learn code. (I already did so on Twitter; see https://twitter.com/anshul/status/761118080638464000 )

-Anshul.
Thanks Yuanfang. We are going to share the scoring code today or tomorrow which has the empirical combined p-value ranking strategy (which uses bootstraps to get variance estimates). So once we have shared the code, we will discuss this case and any others. Easier to discuss once everyone is looking at the same code. Will get back to you on this soon. -Anshul.
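In the meantime, the bootstrap part of the idea is simple; a bare-bones sketch of a bootstrap variance estimate for a score difference (illustrative only, and not the actual ranking code) looks something like this:

```python
# Bare-bones illustration (not the actual ranking code) of using bootstraps to
# get a variance estimate for the difference in a score between two methods
# evaluated on the same labels.
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_auprc_diff(y_true, scores_a, scores_b, n_boot=100, seed=0):
    """Resample examples with replacement and return the bootstrap
    distribution of auPRC(method A) - auPRC(method B)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if y_true[idx].sum() == 0:        # skip degenerate resamples with no positives
            continue
        diffs.append(average_precision_score(y_true[idx], scores_a[idx])
                     - average_precision_score(y_true[idx], scores_b[idx]))
    return np.array(diffs)

# Toy usage: a difference whose bootstrap distribution straddles zero would be
# treated as not significant, i.e. the two methods tie on that measure.
rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.05).astype(int)
a = rng.random(5000) + 0.6 * y
b = rng.random(5000) + 0.5 * y
d = bootstrap_auprc_diff(y, a, b)
print(d.mean(), np.percentile(d, [2.5, 97.5]))
```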
Anshul, Nathan, and Costello: I think these two entries are enough to show my point about fluctuations. (By the way, both are awesome submissions, beyond what I imagined was possible on this problem, and both have great potential to win! I am just using yours as examples here; I hope neither team minds.)

7114832 ChIP Shape SCORED 0.9369 0.1886 0.0000 0.0000 0.0000
7115059 HINT SCORED 0.8163 0.0954 0.0001 0.0001 0.0001

In this case ChIP Shape actually does much better than HINT, because of a much better overall profile in auROC and auPRC, right? But according to your ranking method, HINT would be ranked higher, because one (or at most two) of its examples happened to score at the top, right? This puts too much chance into the ranking.

Thanks,
Yuanfang
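To spell out the concern with completely made-up numbers (a toy construction, not either team's actual predictions): a method that happens to rank a single positive at the very top gets a nonzero recall at a strict FDR cut-off, while a method that is much better over the whole curve, but whose top few predictions are false positives, gets exactly zero.

```python
# Toy construction (made-up scores, not either team's predictions). Method A is
# clearly better over the whole curve but its top few predictions are false
# positives; method B is essentially random except that one positive lands at
# the very top. A should come out well ahead on auPRC, yet only B should get a
# nonzero recall at a 10% FDR.
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.01).astype(int)                         # ~1% positives

score_a = rng.random(10000) + 0.3 * y                              # decent overall ranking...
score_a[np.flatnonzero(y == 0)[:10]] = 5.0 + 0.1 * np.arange(10)   # ...but the top 10 are false positives

score_b = rng.random(10000) + 0.02 * y                             # essentially random ranking...
score_b[np.flatnonzero(y == 1)[0]] = 10.0                          # ...except one positive at the very top

for name, s in [("A", score_a), ("B", score_b)]:
    precision, recall, _ = precision_recall_curve(y, s)
    ok = precision >= 0.9                                          # precision >= 0.9 means FDR <= 10%
    r10 = recall[ok].max() if ok.any() else 0.0
    print(f"{name}: auPRC = {average_precision_score(y, s):.3f}, recall at 10% FDR = {r10:.4f}")
```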
On behalf of all participants, I thank the organizing team for their quick reaction: Costello, Anshul and Nathan!! When you send out the notice, can you please spell out that I reported the error? I think it is a pretty mentionable contribution to the bioinformatics field, since about 10%-20% of papers are operating on a piece of evaluation code that might exaggerate results by folds, even hundreds of folds, right? Actually, I don't think I have any bigger contribution so far, OK? Thanks a bunch.

Yuanfang
Just a quick note: we are in the process of testing alternatives to scikit-learn, as Anshul has posted in another thread. Once the code is ready we will release it, as we have always planned, and we hope to have it ready early next week. We will make a post to the forum and send an email to all participants when this happens.
I cannot agree with you more, Epim!! Everyone is tuning against the evaluation metrics. I am sure everyone is waiting for the release of the code (other than me, who is willing to be the laboratory mouse...), but in general I think recall at fixed FDRs would add another layer of complexity and huge fluctuation to the measurement of performance. I sincerely hope they use a correct R implementation to integrate it successfully.
No, I'm not using scikit-learn (thanks for highlighting the problems with that one). What I'd like to see is the Python or R code from the organisers that they are going to use to calculate all the meaningful scoring statistics in the challenge. Until that happens, I just feel like I'm chasing my tail trying to evaluate how good my predictions are.
I think 5%, 10% and 25% will have huge fluctuations... You are not using scikit-learn for those anymore, right?
Dear Epimetheus,

Thanks for letting us know. We are updating the leaderboards to display 5%, 10%, 25% and 50% here. We will only use 10% and 50% for evaluation, as the wiki says.

Cheers,
Robert

FDR cut-offs