I am trying to understand how the scoring numbers work. I have read the technical description and have some understanding of how it works, but I would be grateful if you could give us a simple rule of thumb.
The top submission of round one was:
AUC: 0.678
partial AUC: 0.0406
Specificity At Sensitivity 0.8: 0.4065
You said that all the rest were no better than chance.
To win the competition, what numbers would we need?
For example, would this score, if it were the top one, be a winning submission:
AUC: 0.8
partial AUC: 0.09
Specificity At Sensitivity 0.8: 0.5
The reason for my question, apart from wanting to be clear about this, is that the second best submission was:
AUC: 0.6699
partial AUC: 0.0428
Specificity At Sensitivity 0.8: 0.4476
You can see that the second entry, the one that was no better than chance, has both a higher partial AUC and a higher Specificity At Sensitivity than the best entry.
Would it be true to think of the 'Specificity At Sensitivity' (SAS) as the key indicator?
If so, should we be aiming for 0.5, or, as the 0.8 in its name suggests, should we be aiming to get this to >0.8, and ideally closer to 0.9?
I'm going to be trying a number of variations, and it would be valuable to know this so that I can concentrate on improving the right submissions. I know the metrics are all related, but if getting a high AUC is less important than getting the best SAS, it would be very useful to know that, so I can better judge how well submissions are doing.
Here is how we compute the Bayes Factor.
Let AUC_1 represent the AUC of the best performing submission, AUC_2 represent the AUC of a competing submission, and DeltaAUC = AUC_1 - AUC_2 represent the difference in the AUCs. Note that the observed DeltaAUC is positive since AUC_1 is the top performing model.
What we want to check is whether the top submission is statistically better than the competing submission. Therefore, we want to test the one-sided null hypothesis (AUC_1 is not better than AUC_2) H_0: DeltaAUC <= 0 vs the alternative hypothesis (AUC_1 is better than AUC_2) that H_1: DeltaAUC > 0. And we use the nonparametric bootstrap to do it.
The first step is to estimate the sampling distribution of DeltaAUC statistic using non-parametric bootstrap. We do that using a paired bootstrap approach blocked by subjectId. (Note that we need to re-sample the subjectId blocks, since we cannot assume that the data from the left and right breast of the same subject are independent. It is fine, nonetheless, to assume that the subjects are independent.) We generate B=1000 bootstrapped versions of the data, and for each bootstrap version we estimate the AUC_1, AUC_2, and DeltaAUC value (call this estimate DeltaAUCstar, to differentiate it from the DeltaAUC resulting from the actual test set).
The estimated sampling distribution of DeltaAUC statistic corresponds to the histogram of the 1000 DeltaAUCstar values.
If the top submission is statistically better than the competing one, we would expect DeltaAUCstar values to be positive in most of the B bootstrapped data sets.
One way to test H_0: DeltaAUC <= 0 vs H_1: DeltaAUC > 0 is to check whether a bootstrap percentile confidence interval contains 0. Or (equivalently) we can simply invert a percentile confidence interval to obtain the bootstrap p-value. The bootstrap p-value is given by the number of times that DeltaAUCstar was smaller than or equal to 0, divided by the number of bootstraps. In other words, if the top submission is better than the competing one in a fraction f of the B bootstrapped data sets, then the bootstrap p-value is 1-f. We will use below that f ~ 1-p.
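For intuition, here is a minimal sketch of that subject-blocked, paired bootstrap in Python. The data frame layout, column names, and function name are assumptions for illustration; this is not the organizers' actual code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def bootstrap_delta_auc(df, score1, score2, label="label",
                        subject="subjectId", B=1000, seed=0):
    """Paired bootstrap of DeltaAUC = AUC_1 - AUC_2, resampling whole subjects
    so that the left and right breast of one subject always stay together."""
    rng = np.random.default_rng(seed)
    subjects = df[subject].unique()
    groups = {s: g for s, g in df.groupby(subject)}
    deltas = []
    while len(deltas) < B:
        sampled = rng.choice(subjects, size=len(subjects), replace=True)
        boot = pd.concat([groups[s] for s in sampled], ignore_index=True)
        if boot[label].nunique() < 2:    # need both classes for an AUC; redraw
            continue
        auc1 = roc_auc_score(boot[label], boot[score1])
        auc2 = roc_auc_score(boot[label], boot[score2])
        deltas.append(auc1 - auc2)       # DeltaAUCstar for this resample
    deltas = np.array(deltas)
    p_value = np.mean(deltas <= 0)       # bootstrap p-value for H_0: DeltaAUC <= 0
    f = np.mean(deltas > 0)              # fraction of resamples where the top team wins
    return deltas, p_value, f
```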
Alternatively, we can perform a Bayesian hypothesis test by recalling that the sampling distribution generated by the non-parametric bootstrap closely approximates the posterior distribution of the quantity of interest when we use a non-informative prior (see, for example, Section 8.4 - Relationship Between Bootstrap and Bayesian Inference - in the book The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman). Under this Bayesian interpretation, the posterior odds in favor of the alternative hypothesis are given by the ratio of the posterior probability of H_1 to the posterior probability of H_0 (where the posterior probability of H_1 is simply given by the fraction f of the B bootstrapped data sets where DeltaAUCstar was positive, and the posterior probability of H_0 is given by 1 minus the posterior probability of H_1). Hence, the posterior odds in favor of H_1 are given by f/(1 - f). Now, by definition, the posterior odds are given by the product of the Bayes factor and the prior odds. However, because we are assuming non-informative priors, the prior odds are approximately 1 and the posterior odds approximate the Bayes factor. Therefore BF ~ f/(1-f) ~ (1-p)/p. When p=0.05, BF ~ 19.
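Continuing the hypothetical sketch above, the fraction f converts directly into the approximate Bayes factor:

```python
def approx_bayes_factor(f):
    """Posterior odds for H_1 under a non-informative prior: BF ~ f / (1 - f)."""
    return f / (1.0 - f)   # f = 1 (top team wins every resample) gives an infinite BF

# f = 0.95, i.e. a bootstrap p-value of 0.05, gives BF = 0.95 / 0.05 = 19
print(approx_bayes_factor(0.95))
```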
Hope this clarifies how we compute the Bayes Factor.
Gustavo and Elias Chaibub-Neto
Yuanfang, when saying that the one who invented the BF may not be very smart, I think you may want to clarify which BF definition you are talking about: the one used by the organizers, or the one commonly accepted by the statistics community (i.e. the one defined here: https://en.m.wikipedia.org/wiki/Bayes_factor). The several counter-intuitive examples you gave all apply only to the former, not the latter.
That is why I find it very strange. We cannot use the same terminology to refer to very different things. And what is even stranger is applying the commonly agreed rules (like the BF having to be around 10 to reach statistical significance) to a metric that is very different from the real BF.
> If it is different from the commonly defined BF, please don't call it BF then.
I am not in a position to judge who is right.
But I do think their way is the common way, because the P's are conditional probabilities, not predicted values/probabilities.
But at the same time I think the BF itself is not appropriate for hypothesis testing, as shown by the several counter-intuitive scenarios I mentioned, because clearly a hypothesis-testing method that does not take into account the total number of examples cannot be right.
Also, the predictions are dependent on each other through the original input; I don't believe any method that uses a test assuming independence can be remotely correct.
Clearly, any statistical test that would rank an absolutely better method lower cannot be correct.
And I don't believe Bayes invented it; it doesn't look like something invented by someone very smart.
It is probably one of the 1000 possible statistical tests invented to justify things that are not so justifiable.
Having said that, I don't suggest changing to a different test, because they are all statistics and they all have problems. I am just saying this one is wrong, but I do not suggest attempting to correct it. If it is different from the commonly defined BF, please don't call it BF. Also, don't apply the rules that apply to the real BF (like BF >= 19) here before proving that it makes scientific sense.
So do you mean there are different Bayes factor definitions? This is really beyond me. :)
OK, you may well be correct. But in statistics there is no absolute right or wrong. They did not implement it this way, and to them their method could also be correct. From the leaderboard, I think they have implemented it in the way that I explained.
I am sure the organizers will explain their rationale to us in great detail soon: they are very committed to rigorous statistics.
My calculation is strictly based on Bayes factor's definition. See https://en.m.wikipedia.org/wiki/Bayes_factor
**BF=P(D|M1)/P(D|M2)**
If you think it is wrong, would you be kind enough to point out what's wrong? I don't think this topic will affect any team's core competencies or affect AUC in any way. Thanks!
In any case, the BF won't be infinite unless one model predicts a sample as positive with probability 0 while the sample is indeed positive (or vice versa).
Hahahahaha. Your statistics knowledge is indeed limited, actually even more limited than the organizers'.
I will find a chance to explain it to you in the fall or next year; right now we work for different teams and we must work independently. But here is the short summary:
If there is one example, all people tie (in this case only, because they use AUC).
If there are two examples, any two predictions that produce different rankings will have BF = infinity.
That's why I said that 80 examples with a Bayes factor may be problematic.
But we probably won't be able to do it correctly, and there might not be better ways. The underlying premise of anything needing statistics is that it is not so correct, so you need statistics to justify it.
typo:
If the sample truth is positive, then BF(A over B) = 0.8/**0.4** = 2.
Yuanfang Guan (yuanfang.guan):
> Obviously, if there is only 1 example, bayes factor would be infinite ...
I disagree on this, based on my very limited statistics knowledge. Please correct me if I am wrong.
Say there is a test involving one sample. Model A predicts 80% positive and Model B predicts 40% positive. If the sample truth is positive, then BF(A over B) = 0.8/0.2 = 2. If the sample truth is negative, then BF(A over B) = 0.2/0.6 = 0.33 (or BF(B over A) = 0.6/0.2 = 3). If there are many samples, to calculate the BF you just need to multiply all those individual per-sample BFs together. So there is no easy way to get the BF from the AUC. As the sample number grows, the BF can compound; e.g. if every prediction of model A is strictly better than model B's by 0.1%, with 10000 samples the BF will be 21916.
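For what it's worth, here is a small sketch of the per-sample likelihood-ratio BF described in this post (the function name is hypothetical, and this follows the poster's definition, not the organizers' bootstrap-based BF):

```python
import numpy as np

def per_sample_bayes_factor(p_a, p_b, labels):
    """BF(A over B) as the product of per-sample likelihood ratios.

    p_a, p_b : predicted probabilities of the positive class from models A and B
    labels   : true labels (1 = positive, 0 = negative)
    """
    p_a, p_b, labels = map(np.asarray, (p_a, p_b, labels))
    lik_a = np.where(labels == 1, p_a, 1.0 - p_a)   # likelihood of the truth under A
    lik_b = np.where(labels == 1, p_b, 1.0 - p_b)   # likelihood of the truth under B
    return np.prod(lik_a / lik_b)

# The compounding example above: each per-sample likelihood ratio is 1.001
# (e.g. A predicts 0.5005 vs B's 0.5000 on 10000 positive samples).
print(per_sample_bayes_factor([0.5005] * 10000, [0.5000] * 10000, [1] * 10000))
# ~ 1.001 ** 10000 ~ 21916
```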
If we really want to use BF to define win/tie, we need to do it correctly.
@vacuum
The Bayes factor is a very problematic measurement; it often lets poorer submissions come out with better outcomes. Thus, suppose the top submission is 0.68: a submission of 0.6 (team A) might win over a submission of 0.65 (team B), as long as team A is more orthogonal than team B to the top team. Furthermore, I don't believe their bootstrap can be implemented correctly; sometimes they refer to sub-sampling as bootstrap. Then how much one sub-samples obviously affects the result: if you sub-sample 100%, every Bayes factor is infinite; if you sub-sample 1 example, nothing is significant.
But if you are the top 1, that is unlikely to be changed by any statistical test (although with two metrics it is also possible, if the two metrics are not correlated, which has occurred in a couple of previous challenges). Thus, the so-called co-winners here are not meaningful. But number 1 is always number 1, while the rest of the list will be questionable.
Speaking of this, I am now worried that the result of this challenge will be distorted by the 2nd metric: if the size of the test set is only that of the training set, that means roughly 400*0.2 = 80 truth examples for the 2nd metric are going to determine the winner. Obviously, with only 80 examples in a statistical test, the result is very likely to be distorted and manipulated. Obviously, if there is only 1 example, the Bayes factor would be infinite and favor whoever predicted better on this 1 example.
According to Harvard stats 1111 lecture notes:
3 <= B < 10: Substantial for M1
B >= 10: Strong for M1
https://isites.harvard.edu/fs/docs/icb.topic1383356.files/Lecture%2019%20-%20Bayesian%20Testing%20-%201%20per%20page.pdf
According to wikipedia:
5 to 10: substantial
10 to 15: strong
Also, can someone kindly share the equation to generate the BF from two AUCs (or two pAUCs)? Does it need the sample numbers? Thanks!
Gustavo,
>> (BF needs to be bigger than or equal to 19 to be significant, equivalent to a p value of 0.05).
Can you show the equation for how you get this p-value of 0.05? Thanks!

Hi Peter, let me know if this helps. Let's analyze the numbers you gave for Round 1.
Our primary metric is AUC, therefore we first ranked submissions according to AUC. The 1st ranked team (Team A) had 0.678 and the 2nd team (Team B) had 0.669. We next checked if the two were statistically significantly better than the baseline (random chance in this case). Both teams A and B were better than chance. So we checked if 0.678 is statistically significantly better than 0.669. It wasn't, given that the Bayes Factor was ~2 (the BF needs to be bigger than or equal to 19 to be significant, equivalent to a p-value of 0.05).
Therefore the two teams A and B are better than random and tied in the primary metric.
We then moved to the secondary metric, which is the pAUC. Here the 2nd team (B) had a pAUC of 0.043, better than team A, which got 0.041. Both pAUC scores were better than chance. However, they were tied, as the BF was 1.55 in the comparison of B to A. (Note that 1.55 = 1/0.645, where 0.645 is the value in favor of team A that we wrote in the leaderboard table, as we refer the BF to team A, the first team according to the primary metric.)
Therefore the two teams A and B are better than random and are tied in the secondary metric.
Therefore it's a tie.
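As a rough sketch of the tie-break logic described above (the function and argument names are hypothetical; the Bayes factors would come from the bootstrap procedure explained earlier in the thread):

```python
def decide(bf_auc, bf_pauc, threshold=19.0):
    """Tie-break: AUC first, then pAUC, each requiring BF >= 19 (p ~ 0.05)."""
    if bf_auc >= threshold:
        return "decided on the primary metric (AUC)"
    if bf_pauc >= threshold:
        return "decided on the secondary metric (pAUC)"
    return "tie"

# Round 1 numbers from this post: BF ~ 2 on AUC, BF ~ 1.55 on pAUC
print(decide(2.0, 1.55))   # -> "tie"
```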
**We will not use Specificity at Sensitivity 0.8 as a metric**. However we show it, as we would like to compare the results of the Challenge to the radiologists' specificity of 0.9. We have a ways to go from 0.45.
Let me know if you need further clarifications.