I think the current weighting by square root is wrong; instead, the scores should be weighted by the number of examples. Let me just compare my submission to the current best performer in sc2.

| objectId | team | weighted_avg_iAUC | DFCI_iAUC | Hose_iAUC | M2Gen_iAUC | GSE15695_iAUC |
| --- | --- | --- | --- | --- | --- | --- |
| 9634922 | PrecisionHunter | 0.6669 | 0.7081 | 0.5591 | 0.8097 | 0.6552 |
| 9632309 | Yuanfang Guan | 0.5968 | 0.7371 | 0.6104 | 0.1711 | 0.6617 |

Clearly, in three of the large cohorts that have around 100 examples each, I am performing better, but in the cohort that has merely 15 examples, I am significantly worse; yet my overall score is lower by 0.07. Do you think that is appropriate?

Let us do some simple calculation. Suppose we have a submission that performs at 0.8 in cohort A with 100 examples, and at 0.2 in cohort B with 4 examples. The aggregation by your method would be 0.70 = (0.8×10 + 0.2×2)/12, while my aggregation would be 0.7769 = (0.8×100 + 0.2×4)/104. Clearly, an AUC computed from 4 examples is barely trustworthy or informative in this case; yet in your calculation it drags down the performance of a set of 100 examples by about 0.1. Do you think it is possible that this is right? This is actually comparable to the current leaderboard, where 15 examples can drag 250 or so down by 0.1 (see the sketch below).

Now suppose the metric is RMSE, with A having 100 examples and B having 1 example. The weighted average RMSE by your method would be about 0.75 = (0.8×10 + 0.2×1)/11, while by my method it would be 0.7940 = (0.8×100 + 0.2×1)/101. Clearly my method gives essentially the correct pooled RMSE.

I am not saying that I have to be top or anything. But I think it is obvious that the current calculation will be skewed towards whoever is best in the smallest cohort, as clearly seen on both leaderboards, and that cohort is probably controlled by only 2-5 positive examples. The same will apply to everyone in the final round: performance will be left more to chance.

I welcome comments from other teams on this.
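To make the comparison concrete, here is a minimal Python sketch of the two aggregation rules; it is only my illustration of the arithmetic above, not the actual challenge scoring code.

```python
import math

# Toy sketch: square-root-of-size weighting vs. sample-size weighting,
# using the two-cohort example from the post above.
def weighted_avg(scores, sizes, weight_fn):
    weights = [weight_fn(n) for n in sizes]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

scores = [0.8, 0.2]   # per-cohort AUC in the toy example
sizes = [100, 4]      # cohort sizes

print(weighted_avg(scores, sizes, math.sqrt))    # 0.70  (square-root weighting)
print(weighted_avg(scores, sizes, lambda n: n))  # ~0.777 (sample-size weighting)
```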

Created by Yuanfang Guan yuanfang.guan
@wdong Wei, isn't log(2^(m+n)) = m+n just the total number of examples at each time point? Then I think we all have a consensus: it should be the total number of non-censored examples at a particular time point that is used in weighting. Actually, in the first example I listed, the one who performs best in the smallest cohort (which probably has 1 positive and 1 negative at the 24-month measurement) still ranks at the top. But as you said, I should already be disqualified for giving an opposite prediction on that (likely 2-3 effective example) cohort, so I guess I can live with (m+n). I am fine to go with your other proposal as well, sqrt(mn).
@Michael.Mason Then do sqrt(mn); I think whoever figures that out probably still should just win. Actually, I cannot immediately find a way to get the number of positives from just one submission, even with the cohort-level breakdown, because you do not know the total due to censoring. But yes, you are right, everyone has already got a guess based on previous submissions; the difference is just how accurate that guess is. I think it can be pooled iAUC as primary, and pooled AUC at 12 or 18 months as secondary, where you won't have this problem (see the sketch below). Furthermore, I don't follow your logic, because:

1. Let us say someone finds a way to figure out the exact number of positives in each dataset from just one submission, which would not be easy because you would have to break it down. By that time, he already knows the number of positives from the leaderboard, so it won't be affected by any additional pooled score being provided.
2. Let us say someone could find out the number of positives from just one submission if you provide a score normalized by sample size; then clearly he will still be able to figure it out when you normalize by the square root.
3. Your current argument is: because doing it the right way would leak information, let us go with the wrong way.

@unfashionable `cut -f 1 -d ',' */csv|sort|uniq|wc` returns a cohort-level statistic, the number of patient IDs, which is about the same number as provided in the simulated data.
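For what "pooled AUC at a fixed time point" could look like, here is a hedged sketch; the `cohorts` dictionary and its toy labels and scores are made up for illustration only, and I am assuming `scikit-learn` is acceptable for the AUC computation.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical per-cohort data at one time point: binary outcome among
# non-censored patients plus the submitted risk score. Not a challenge API.
cohorts = {
    "A": {"y": [1, 0, 0, 1, 0], "score": [0.9, 0.2, 0.4, 0.7, 0.1]},
    "B": {"y": [1, 0],          "score": [0.3, 0.8]},
}

y_all = [label for c in cohorts.values() for label in c["y"]]
s_all = [score for c in cohorts.values() for score in c["score"]]

# One AUC over the concatenated cohorts: each cohort contributes in
# proportion to its non-censored sample count, with no explicit weights.
print(roc_auc_score(y_all, s_all))
```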
Dear All, These are excellent points. However, please keep in mind that we cannot use the number of cases for weighting if we are going to provide cohort-level scores. If we do so, it is easy to determine the number of positives in each cohort after a team's first submission and then adjust thresholding to get a better BAC, F1, etc. We could stop providing cohort-level metrics, but I am not sure that would go over well.
Hi @yuanfang.guan, where did you find out the number of examples in the validation set?

> Clearly, in three of the large cohorts that have around 100 examples, I am performing better, but in the cohort that has merely 15 examples, ...
@cuiyi Math is correct. The product $mn$ is the number of cells in the ROC grid, or equivalently the number of possible values of the AUC. The step function decides whether a cell is colored black. A larger number of cells means higher resolution, and thus more information. I don't think this justifies using $mn$ as the weight, but a function of $mn$ indeed seems more reasonable than a function of $m+n$.

From an information-theory point of view, there are $2^{m+n}$ possible predictions one can make on $m+n$ examples, so the amount of information is $\log_2(2^{m+n}) = m+n$. This is why one would weight with $m+n$. But not all predictions are useful, conditioned on the organizers knowing that $m$ are positive and $n$ are negative. For ranking purposes, the useful number of states is $mn$, the number of possible AUC values a sample set admits, so the amount of information becomes $\log(mn)$. If the bigger cohort has about 150 samples, and if the small cohort has few positive samples, $\sqrt{mn}$ and $\log(mn)$ generate similar weights (see the sketch below).

Update: Another way to measure the amount of information is to count the possible ROC curves the sample set admits. The amount of information is then roughly $(m+n) \cdot H(p)$, where $H$ is the binary entropy and $p = m/(m+n)$.
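If it helps to see these side by side, here is a small sketch that prints the candidate weights $m+n$, $\sqrt{mn}$, $\log(mn)$, and $(m+n) \cdot H(p)$; the positive/negative counts are made up, since the real ones are unknown to us.

```python
import math

def binary_entropy(p):
    # H(p) in bits, where p is the fraction of positives
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# (positives, negatives): a large balanced cohort vs. a small unbalanced one.
# These counts are illustrative only.
for m, n in [(75, 75), (2, 13)]:
    p = m / (m + n)
    print(f"m={m:3d} n={n:3d}  m+n={m + n:3d}  "
          f"sqrt(mn)={math.sqrt(m * n):6.1f}  "
          f"log(mn)={math.log(m * n):4.1f}  "
          f"(m+n)*H(p)={(m + n) * binary_entropy(p):6.1f}")
```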
As I think about it now, what you proposed actually makes sense, because a cohort with 1 positive and 99 negatives is much less informative than one with 50 positives and 50 negatives. But somehow, just as an intuitive feeling without deriving the math, I feel the scale of 50×50 is not right; maybe it should be sqrt(mn)? That scale feels more realistic to me. Can you check whether something is missing in your math, such as the dependency between pairs making the effective size smaller? Further, sqrt(mn) maintains the premise that the pooled estimate equals the separated one, while (mn) does not. For iAUC, it can then be done at each time point.
Yes, if we have to choose between sample size and its square root, I would vote for the former.
> even weighting by sample size is still questionable.

With that you implied it is much better than what is being done right now.
You cannot do that because in iAUC, m and n are different at each time point.
I actually have a different opinion from both of these weighting strategies. In my opinion, even weighting by sample size is still questionable. **Rather, each AUC should be weighted by $mn$, where $m$ is the number of positive samples and $n$ is the number of negative samples in that cohort.** This weighting is based on the definition of AUC:

$$AUC = \frac{1}{mn}\sum_{i:y_i=1}\sum_{j:y_j=-1}\mathbb{I}(s_i - s_j),$$

where $m = \#\{i \mid y_i = 1\}$, $n = \#\{j \mid y_j = -1\}$, $\mathbb{I}$ is the step function, and $s$ is your score. I encourage everyone to validate the above equation (see the sketch below). So we see that when we calculate the AUC for a given cohort, there are actually $mn$ step-function terms in the summation, which are then normalized by $mn$. Therefore, if we want to evaluate the averaged AUC across multiple cohorts, the correct way is to weight each AUC by $mn$, rather than by the sample size or its square root.
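For anyone who wants to check the pairwise definition against a standard implementation, here is a small sketch using random data and `scikit-learn`; counting exact ties as 0.5 is my assumption about the intended value of the step function at zero.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50)  # binary labels (1 = positive)
s = rng.random(50)               # submitted scores

pos, neg = s[y == 1], s[y == 0]
m, n = len(pos), len(neg)

# Sum of step functions over all m*n (positive, negative) pairs,
# counting exact ties as 0.5, then normalized by m*n.
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_pairwise = wins / (m * n)

print(auc_pairwise, roc_auc_score(y, s))  # the two numbers should match
```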
According to your logic, Wei, all submissions would be disqualified in sub2, as they all perform at around 0.28 or so in M2Gen.

Let us consider the problem from another angle: the pooled estimate of an average/median or variance etc. is expected to be representative of the data as if they were together. Now suppose we have two cohorts: A, 150 patients performing at 0.8, and B, 15 patients performing at 0.2. Using square-root pooling, you get a score of 0.65; using my pooling, I get a score of 0.745.

Now let us separate the first cohort into 10 cohorts of 15 patients each; their expected score is still 0.8 in each. With my pooling, I still get a score of 0.745. But with square-root pooling, the score changes from 0.65 to 0.745. Clearly, square-root pooling does not uphold the premise that the pooled estimate reflects the data as if they were together; instead, it is affected by how the population was separated (see the sketch below).
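Here is a minimal sketch of the splitting argument, using the same toy numbers; it is only an illustration of the arithmetic above, not the organizers' scoring code.

```python
import math

def agg(scores, sizes, weight_fn):
    # Weighted average of per-cohort scores under a given weighting rule.
    w = [weight_fn(n) for n in sizes]
    return sum(s * wi for s, wi in zip(scores, w)) / sum(w)

# Before: one 150-patient cohort at 0.8 plus one 15-patient cohort at 0.2.
# After: the 150-patient cohort split into ten 15-patient cohorts, same AUCs.
before = ([0.8, 0.2], [150, 15])
after = ([0.8] * 10 + [0.2], [15] * 10 + [15])

for scores, sizes in (before, after):
    print(agg(scores, sizes, lambda n: n),  # ~0.745 in both cases
          agg(scores, sizes, math.sqrt))    # ~0.656 before the split, ~0.745 after
```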
I think it's totally reasonable to use square-root weighting. However, I do think the legitimacy of M2Gen as a test cohort is questionable -- if it has too few positive examples. I think an AUC of 0.17 (or ours, 0.19) is no better a failure than NaN. A submission that fails on 1 out of 4 cohorts should be disqualified. But if a test case causes too many otherwise-working submissions to be disqualified, the test case itself is likely to be problematic.
Dear Yuanfang Guan, You bring up some very interesting points. I will raise this internally with the challenge organizers, but I encourage other teams to respond here.

Concerns with the weighted average