(just kidding--but i did write this whole message twice; the first time, for some reason, it didn't even successfully post). if i am wrong again i am happy to learn the lesson, and it would be good to have someone help me debug.
all files related to this discussion are in http://guanlab.ccmb.med.umich.edu/yuanfang/test.tar
i used sklearn's built-in auc and auprc, and i have good reason to believe there is a problem.
gold standard: test_gs_tmp.dat
prediction (continuous value): output_tmp.dat
auc: 0.84
auprc: 0.03
(this is mine and every other participant's performance)
then binarize output_tmp.dat at its average value: (aaa.pl --> output_tmp_1.dat)
auc: 0.77
auprc: 0.31
(this is the organizer's performance. i was stunned at your performance yesterday, you know it, since in my view it is impossible to get an auprc of 0.44-something on this problem. most of the time, 'surprisingly, we found that...' == 'we had a bug')
then split output_tmp.dat into thirds: (bbb.pl --> output_tmp_2.dat)
auc: 0.74
auprc: 0.41!!
(btw, for my original file, the top accuracy i got at the lowest recall across different cutoffs was around 0.05....)
as i lose information, my auprc performance goes very, very high. is that normal? then why doesn't everyone just tune their binning to win?
the same thing holds for recall at fixed FDR, if it calls any precision-recall function.
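for concreteness, a minimal sketch of the kind of scikit-learn calls being compared here (assuming each file is a single column; the challenge's actual scoring script may differ):
```
# Minimal sketch, not the official scoring script: assumes test_gs_tmp.dat is a
# single column of 0/1 labels and output_tmp*.dat a single column of scores.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

labels = np.loadtxt('test_gs_tmp.dat')
scores = np.loadtxt('output_tmp.dat')   # or output_tmp_1.dat / output_tmp_2.dat

auroc = roc_auc_score(labels, scores)

# auPRC by linear (trapezoidal) interpolation of the observed PR points.
# When predictions are binned/tied, the ties collapse into a few PR points
# that get joined by straight lines, which is what inflates the area.
precision, recall, _ = precision_recall_curve(labels, scores)
auprc = auc(recall, precision)

print(auroc, auprc)
```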
as always, thanks for your patience with me.
Created by Yuanfang Guan (yuanfang.guan)
Yes. Nathan and Chris are working on updating the auPRC values in the previous submissions to the leaderboard.
@nboley @akundaje :
Will previous submissions be rescored with the new implementation for auPRC? In particular, will the baseline be rescored?
Thanks!
Hi All,
I've updated the scoring method to use the auPRC calculation method implemented in the R package PRROC (described here: http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf). The updated scoring code can be downloaded from the 'Challenge Resources' directory (https://www.synapse.org/#!Synapse:syn7117837).
Thanks again to Yuanfang for identifying this problem.
Best, Nathan
Maximus: You can get it from here http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf
Anshul, sorry, I don't have access to this paper. thanks so much, maximus!
Maximus: I've been playing with the PRROC package, which implements the Davis-Goadrich method. It's doing the right thing. The scikit implementation is wrong, though.
Also, using the trapezoidal method directly on the **observed** (precision, recall) points is not right for the area under the PR curve (see the Davis and Goadrich paper http://dl.acm.org/citation.cfm?id=1143874 ). It basically assumes a linear interpolation. The right way to do this is to perform the non-linear interpolation described in the Davis and Goadrich method, followed by the trapezoidal method on the interpolated curve. In this case, according to PRROC.pdf https://cran.r-project.org/web/packages/PRROC/PRROC.pdf , you should set the first argument scores.class0 to your predictions and weights.class0 to the labels.
```
USAGE
roc.curve( scores.class0, scores.class1=scores.class0, weights.class0=NULL,
weights.class1 = {if(is.null(weights.class0)){NULL}else{1-weights.class0}},
sorted = FALSE, curve = FALSE,
max.compute=F, min.compute=F, rand.compute=F)
Arguments
scores.class0 the classification scores of i) all data points or ii) only the data points belonging
to the positive class.
In the first case, scores.class1 should not be assigned an explicit value, but left
at the default (scores.class1=scores.class0). In addition, weights.class0 needs to
contain the class labels of the data points (1 for positive class, 0 for negative
class) or the soft-labels for the positive class, i.e., the probability for each data
point to belong to the positive class. Accordingly, weights.class1 should be left
at the default value (1-weights.class0).
In the second case, the scores for the negative data points need to be provided
in scores.class1. In this case, weights.class0 and weights.class1 need to be provided
only for soft-labelling and should be of the same length as scores.class0
and scores.class1, respectively.
```
So it gives
```
library(PRROC)

predi <- read.table('~/Downloads/test_1/output_tmp_1.dat')
gs <- read.table('~/Downloads/test_1/test_gs_tmp.dat')

roc <- roc.curve(scores.class0 = predi[, 1], weights.class0 = gs[, 1])
roc
#> ROC curve
#>   Area under curve:
#>   0.6865003
#>   Curve not computed ( can be done by using curve=TRUE )

pr <- pr.curve(scores.class0 = predi[, 1], weights.class0 = gs[, 1])
pr
#> Precision-recall curve
#>   Area under curve (Integral):
#>   0.001900391
#>   Area under curve (Davis & Goadrich):
#>   0.001932103
#>   Curve not computed ( can be done by using curve=TRUE )
```
Doing it another way, using the trapz function from the pracma R package (which applies the trapezoidal rule) and the prediction function from ROCR:
```
library(ROCR)
library(pracma)

# build a ROCR prediction object from the same predictions and labels
pred <- ROCR::prediction(predi$V1, gs$V1)
P <- table(gs)[2]   # number of positives
N <- table(gs)[1]   # number of negatives
thrsh <- pred@cutoffs[[1]]
PL <- pred@n.pos[[1]]
NL <- pred@n.neg[[1]]
tp <- pred@tp[[1]]
fp <- pred@fp[[1]]
fn <- pred@fn[[1]]
tn <- pred@tn[[1]]
prec <- tp / (tp + fp)
tpr <- tp / P
fpr <- fp / N
precision <- tpr * P / (tpr * P + fpr * N); precision[is.nan(precision)] <- 1
recall <- tpr

# calculate the areas with the trapezoidal rule: trapz(x, y)
score <- data.frame(auroc = trapz(fpr, tpr), aupr = trapz(recall, precision))
score
#>       auroc      aupr
#> 1 0.6865003 0.2289518
```
Re: " If you have a fix for the scikit code already let us know as well."
There are two ways to fix this:
1. Sub-sample 1/300 of the negatives in each sample and keep all positives. You have 50 million samples; from statistics, if the sampling is truly random, that size is far more than sufficient to draw a conclusion (the RA challenge is running with 22 samples now, and you would still have 200k). Then you can add a small random value, rand(1)*(smallest prediction difference between two samples); the probability of introducing a numerical mistake is pretty low. (a rough sketch of both fixes is given after item 2 below.)
--I think it is dangerous to do this on the whole set (because 10^7 * 10^7 might just overflow doubles), but you can try.
---
2. Change all values into UN-TIED ranks stored as int32/64, and feed those into scikit, because prediction values don't matter at all in AUC or AUPRC; only ranks matter. As far as interpolation goes this is an approximation, but on data of this size it is more than sufficient; just double-check that it actually takes int32. for example (and see the sketch after this list):
0.99 --> 1
0.99 --> 2
0.99 --> 3
0.98 --> 4
0.98 --> 5
0.97 --> 6
---
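a rough sketch of both fixes in python (illustrative helper functions, not the challenge's scoring code; `labels` and `scores` are assumed to be numpy arrays):
```
# Illustrative sketches of the two workarounds above; names are hypothetical.
import numpy as np
from scipy.stats import rankdata

def subsample_and_jitter(labels, scores, neg_frac=1/300, seed=0):
    """Fix 1: keep all positives plus a random fraction of the negatives,
    then add a random jitter smaller than the smallest score gap so that
    no exact ties remain."""
    rng = np.random.default_rng(seed)
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    keep_neg = rng.choice(neg, size=max(1, int(len(neg) * neg_frac)), replace=False)
    keep = np.concatenate([pos, keep_neg])
    sub_scores = scores[keep].astype(np.float64)
    gaps = np.diff(np.unique(sub_scores))
    eps = gaps.min() if len(gaps) else 1.0
    # jitter < smallest gap: breaks ties without reordering distinct values
    sub_scores = sub_scores + rng.random(len(sub_scores)) * eps
    return labels[keep], sub_scores

def untied_ranks(scores):
    """Fix 2: replace prediction values with un-tied integer ranks (ties broken
    by position). Only the ordering matters for AUC/auPRC, so the ranks can be
    fed straight into scikit-learn in place of the raw scores."""
    return rankdata(scores, method='ordinal').astype(np.int64)
```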
but please share the code with us. thanks
yuanfang
can you please tell me how to run the R code?
it gives me 0.54 auROC, which should be 0.67-something on the 20,000-line dataset i sent yesterday. where did i go wrong? (i don't know R, so this should be my mistake :)
this is how i ran it.
pred<-read.table('output_tmp_1.dat')
gs<-read.table('test_gs_tmp.dat');
roc <- roc.curve(pred[,1], gs[,1]);
roc
Area under curve:
0.5406 (0.68 in sage's version)
pr <- pr.curve(pred[,1], gs[,1]);
Precision-recall curve
Area under curve (Integral):
0.5855755 (0.0019 in sage's version. thanks)
Area under curve (Davis & Goadrich):
0.5855755
also, we must be very careful in using this piece of code. if it is interpolating ranks, that is fine, because int32 can hold more than 50 million, but we need to make sure it is not actually using int16. if it is interpolating values and using float, it will have numerical errors and end up with many tied values, because you have 50 million samples.
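(a quick numpy check of the float concern: float32 cannot distinguish consecutive integers above 2^24, so storing ~50 million un-tied ranks as float32 would itself re-introduce ties)
```
import numpy as np

# float32 has a 24-bit significand, so consecutive integers above 2**24
# (~16.7 million) collapse onto the same value, i.e. new ties appear.
print(np.float32(16_777_217) == np.float32(16_777_216))  # True  (collapsed)
print(np.float64(16_777_217) == np.float64(16_777_216))  # False (still distinct)

ranks = np.arange(2**24, 2**24 + 1000, dtype=np.int64)    # a slice of large ranks
print(len(np.unique(ranks.astype(np.float32))))           # ~500: ties created
print(len(np.unique(ranks.astype(np.float64))))           # 1000: all distinct
```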
-----
i think i might have seen this paper at some point this year.
anshul, i think i didn't explain myself clearly. i mean that after the examples i listed, there are many, many others (not just those 6), but a point is erroneously drawn in the middle.
but anyway. now you know that scikit-learn is wrong, are you shocked? you had a sleepless night, right?
i think it is in general a dangerous habit to assume someone or something has to be correct. what if chip-seq is wrong? is our life over then? it happens often that a whole branch of science, including the brightest people, goes in the wrong direction. don't forget that people around the world studied how to turn mercury into the philosopher's stone for longer than almost any branch of science has existed. i think newton wrote more than 1 million words on that topic.
i will certainly do you the favor of controlling my language (i have already improved a lot, i think). in general, one can avoid seeing that side of me by responding appropriately to my first 3-5 messages.
Hi Yuanfang,
A few things to clarify.
For the balanced case that you posted before
labels predictions
1 1
1 1
1 1
0 1
0 1
0 1
The auPRC is 0.5 even with the correct interpolation explained in the ICML paper. This also happens to be the precision of the predictor with the fixed threshold of 1, i.e. TP/(TP + FP) = 3/6. The R implementation https://cran.r-project.org/web/packages/PRROC/index.html and the implementation provided by the authors of the ICML 2006 paper http://mark.goadrich.com/programs/AUC/ agree on this one. Scikit gives an overinflated value of 0.75.
For the unbalanced case
labels predictions
0 1
0 1
0 1
0 1
0 1
1 1
Now the R implementation and the ICML 2006 paper implementation compute an auPRC of 0.166667 (which is also the precision of the classifier with the fixed threshold of 1 = 1/6). The Scikit implementation gives an auPRC of 0.583333, which is also quite clearly inflated in this case.
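For reference, both toy cases are easy to reproduce with scikit-learn's building blocks (a minimal sketch; the inflated values come from linear/trapezoidal interpolation of the single observed PR point, while recent scikit-learn releases compute average precision with step-wise interpolation and return the 0.5 and 0.1667 values quoted above):
```
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

def trapezoid_auprc(labels, scores):
    """auPRC via linear (trapezoidal) interpolation of the observed PR points,
    i.e. the behaviour criticised in this thread."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)

balanced_labels,   balanced_scores   = [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]
unbalanced_labels, unbalanced_scores = [0, 0, 0, 0, 0, 1], [1, 1, 1, 1, 1, 1]

print(trapezoid_auprc(balanced_labels, balanced_scores))      # 0.75   (inflated)
print(trapezoid_auprc(unbalanced_labels, unbalanced_scores))  # 0.5833 (inflated)

# Step-wise average precision (current scikit-learn behaviour) matches the
# values from PRROC / the ICML 2006 implementation:
print(average_precision_score(balanced_labels, balanced_scores))      # 0.5
print(average_precision_score(unbalanced_labels, unbalanced_scores))  # 0.1667
```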
-Anshul
Hi Yuanfang.
You are right. The scikit implementation of auPRC is indeed incorrect. They seem to make the mistake pointed out in this ICML paper (http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf), which clearly explains why linear interpolation in Precision-Recall space gives inflated results (Fig. 6 in fact exactly illustrates the extreme case of 2 points in Precision-Recall space giving an incorrect auPRC of 0.5 with linear interpolation, when it should in fact be much lower if interpolated correctly). I am very disappointed and annoyed that the primary implementation of a very standard performance measure is wrong in the de facto standard scientific computation package in Python. In fact, I just noticed today while looking into this that someone seems to have made a pull request to fix this exact problem with this function in Dec 2015 https://github.com/scikit-learn/scikit-learn/pull/6065 . But it doesn't seem to have been incorporated yet.
Anyway, we'll fix this problem over the next 2-3 days. If you have a fix for the scikit code already let us know as well.
You win your bet after all ;-). Thanks for bringing this to our attention. We'll report this back to scikit as well. Please report any additional strange behavior you notice.
And if you can do me a personal favor - you raise great points more often than not. I'd really appreciate it if you could keep the conversations friendly and scientific and avoid getting adversarial with personal attacks, since I don't believe it helps in any way. We are interested in setting up a robust challenge for the community and appreciate constructive corrections and suggestions for improvement.
Thanks again,
Anshul.
i think you touched on part of the problem, hongjian, but not the whole of it. even with two points connected and calculated correctly, it wouldn't get to that wrong a value; it is also the ranking problem and the associated false data point i mentioned...
****
i hope the organizing team can answer these three questions:
****
1. why can this scoring be 200 times different from the scoring code used in all other dream challenges that use AUPRC as a measurement, for the same input data?
-- if this is how you operate, then next time i see you report accuracy, i will need to divide by 200 to compare to other methods...
****
2. why does this scoring give a random prediction the highest AUPRC (0.5 for all-1 predictions) compared to all other types of meaningful predictions submitted so far? and why does it become a rule that as your AUROC goes up, your AUPRC and Recall@FDR go down???
*****
3. many of the organizers have organized previous challenges (RA, PC, RV, etc.). since when does AUPRC have 0.5 as random???
If so, why don't you teach me how the whole dream community, involving thousands of researchers now working in the bioinformatics field, failed to generate better-than-random predictions in so many different types of problems?
thanks, i think i would really benefit from this education on how such amazing results can be generated.
yuanfang
I think the problem originates from how you calculate the area. If you use a binary result as the prediction, there are only two points, (0, 1) and (1, 0); using linear interpolation, the auPRC will be 0.5. Without enough points, it is meaningless to talk about area. You can increase auPRC by doing that, and the auc will decrease only a little since the majority of negatives are still right.
nathan,
you surprised me, but i won't blame you.
AUPRC is only 0.5 in scikit-learn (AUROC is always 0.5); for random predictions of any shape it should be the baseline probability. and i think that is where it goes wrong, by connecting a 1 and a close-to-zero value...
or could you tell me why SAGE's implementation is 200 times different on the same input data? to me (and i believe to the whole SAGE/IBM crowd, and hopefully the rest of the participants) those numbers make much more sense; in that implementation it is the baseline (0.0005 in the dataset i linked).
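(as a quick sanity check of the baseline point, on simulated data rather than the challenge files: with random predictions, average precision lands near the positive-class prevalence, not 0.5)
```
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, prevalence = 200_000, 0.0005              # roughly the class balance discussed above
labels = (rng.random(n) < prevalence).astype(int)
scores = rng.random(n)                        # completely random predictions

print(roc_auc_score(labels, scores))            # ~0.5
print(average_precision_score(labels, scores))  # ~0.0005, i.e. the baseline prevalence
```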
thanks,
yuanfang
***
i hope it is clear to everyone that this is not meant to be confrontational. obviously, i could have made use of this and topped the final test with small tweaks, because it would be almost impossible for the other participants to find this loophole in scikit-learn if i didn't post, right? then everyone else would hover around 0.03 while i tune around 0.3.... occasionally i do win this way by using loopholes, maybe once a year, and that is when i am disappointed that the organizers do not take my suggestions seriously.
i posted because i hope this can be done correctly for an important dataset in our community, rather than becoming a game of fitting scoring loopholes (you can be assured i am a world-class expert in nothing else but this one). and let's all focus on what is right rather than who is right, alright?
Hi Yuanfang,
I think that this would make more sense to you if you reviewed the definition of auPRC. I can guarantee that a classifier that predicts all positives will achieve an auPRC of 0.5 on any data set with at least 1 positive.
Best, Nathan
so this is enough to prove it is wrong, i hope.
i used the EXACT same input to calculate the auc and auprc, using scikit-learn and the scoring code used in the RA challenge written by the SAGE people: all files are in http://guanlab.ccmb.med.umich.edu/yuanfang/test_1.tar
i had to cut it to the top 20,000 lines, because their code is very slow.
with no tied values:
scikit-learn:
auc: 0.843932344608
auprc: 0.00172072591633
SAGE:
auc: 0.84393234460771605,
auprc: 0.0017770154872119341
after I binarize, still the ABSOLUTELY EXACT same input:
scikit-learn:
auc: 0.669140754688
auprc: **0.228496149342**
SAGE:
auc: 0.66914074981640992,
auprc: **0.0014404219897782339**
Obviously at least one of them is completely wrong. that's a two-hundred-fold difference... for anyone who is not using her toe to think, obviously scikit-learn is wrong..... how is it even possible that you lose information and get much better performance?
i think maybe that's how these surprising findings get published. i have never published in this field, but i have been following this data closely since it was published, and i believe i have studied it as thoroughly as any other encode person. i really don't think this dataset can achieve that performance....
*****
i can tell you how to fix it so as to keep your scikit-learn speed as well as the right calculation (if there is that need).
i don't understand. can you make a submission that 'classifies every observation as bound', which will 'trivially give you an auPRC of 0.5'? i think if all values are the same it should give an auprc at the baseline, 0.001-something.
how can losing most of the information give a much higher score? also, these scores are impractical.
i think the problem is caused by the original:
0 0.99
0 0.98
0 0.97
1 0.96
1 0.95
1 0.94
becoming tied predictions which are inappropriately sorted:
1 1 (0.96)
1 1 (0.95)
1 1 (0.94, and a point is drawn here)
0 1 (0.99)
0 1 (0.98)
0 1 (0.97)
Hi Yuanfang,
Thanks for your comments.
You're correct - up until your auPRC reaches 0.5, it's trivial to increase your auPRC at the cost of auROC. To see this, note that you can always classify every observation as bound, which will trivially give you an auPRC of 0.5. The problem, of course, is that your auROC will also be 0.5. This tradeoff is why we will use both auROC and auPRC to determine participants' final scores.
Also, good models are able to achieve auPRCs significantly above 0.5 (i.e. 0.7-0.9 depending on the factor) with auROCs > 0.95. At these levels, the structure of the problem makes it such that gains in auPRC usually correspond to gains in auROC.
Best, Nathan
if this is really a scoring bug, would i be given a by-line coauthorship?