I see the scoring described here: https://www.synapse.org/#!Synapse:syn20545111/wiki/597243 and here https://www.synapse.org/#!Synapse:syn20545111/wiki/597242. The second link contains this phrase: "we will use the Root Mean Square Error (RMSE) as the primary metric and Pearson correlation coefficient as the secondary metric". Please elaborate on how the Pearson correlation coefficient (of what?) would be used in automated scoring as a tiebreaker. Nowhere is [F1 score](https://en.wikipedia.org/wiki/F1_score) mentioned, so I assume we don't have to think in terms of F1. It would be very helpful if the sponsors provided Python code which, given a CSV file of ground truth and a CSV file of predictions, produced the official score as described above. Also please confirm that we will not be evaluated on F1, just on RMSE per case, by which I assume you mean something like ``` np.sqrt(((ground_truth - prediction)**2).mean()) ``` computed either on a case-by-case basis or pooled over all cases. Please be more specific. Also please describe how you will summarize the scores for the 3 challenges on the Leaderboard, i.e. how the individual scores for all cases add up (total sum, mean, something else?) to a final score. Also, will you have one Leaderboard or 3, one for each sub-challenge?
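In case it helps frame the question, here is a minimal sketch (my own guess, not official scoring code; it assumes both CSVs contain only numeric score columns aligned by row and column) of the two readings of "RMSE per case" I can imagine:
```
import numpy as np
import pandas as pd

def rmse_per_case(df_gt: pd.DataFrame, df_pred: pd.DataFrame) -> pd.Series:
    # One RMSE per case (row), averaging squared errors across features.
    return np.sqrt(((df_gt - df_pred) ** 2).mean(axis=1))

def rmse_overall(df_gt: pd.DataFrame, df_pred: pd.DataFrame) -> float:
    # A single RMSE pooled over all cases and all features.
    return float(np.sqrt(((df_gt.values - df_pred.values) ** 2).mean()))
```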

Created by Lars Ericson (@lars.ericson)
@allawayr Thanks. As @dongmeisun suggested, I'll wait for the answers from your biweekly meeting.
@arielis I'm just following up on your request from above and confirming that your submission has been removed from the queue. Cheers, Robert
@dongmeisun I will keep working on models. @arielis I would say the goal is to make a clinically reliable image analysis that works with equal confidence on unusual cases and doesn't produce false positives or false negatives. The training data has some interesting properties with implications for this goal:

* Some features (columns in the ground truth spreadsheet) have missing values in their range. For example, LH_pip_E__2 has values 0, 1, 2, 5 but not 3 and 4. It remains possible that the test or validation images have a set-aside ground truth that includes scores 3 and 4. A regression (linear) rather than one-hot-encoded (categorical) model is required, because a categorical model will never predict levels 3 or 4 when there is no training data for those levels.
* All features have very few positive values. For example, LH mcp E ip has only 8 non-zero occurrences out of 368 examples, or 2%. The most represented feature, LH wrist J mna, has only 90 positive occurrences. (See the sketch after this list.) Imagine trying to train a neural net to translate Pashto to English with only 8 phrase examples. We don't have nearly enough positive training data.
* The deformed foot example above would be very hard to parse into joints with a naive joint localization method. Any method we produce should be as reliable on deformed feet as on normal feet. Given that we have so few positive examples per category, joint localization seems required. Yet there is no labelled training data for joints, so each solver has to do the necessary manual labelling work individually. Joint localization is mandatory with so few positive examples, but it is very unlikely that individual solvers have the resources or inclination to label the whole training set.
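To make the sparsity concrete, here is a small pandas sketch (the file name `training.csv` and the `Patient_ID` column are my assumptions about the ground truth layout) that counts non-zero occurrences and observed values per feature:
```
import pandas as pd

# Assumed file and ID column names; adjust to the actual ground truth CSV.
df = pd.read_csv("training.csv")
score_cols = [c for c in df.columns if c != "Patient_ID"]

# Non-zero (positive) occurrences per feature, rarest first.
nonzero_counts = (df[score_cols] != 0).sum().sort_values()
print(nonzero_counts.head(10))

# Distinct values observed per feature, to spot gaps like {0, 1, 2, 5}.
for col in score_cols[:5]:
    print(col, sorted(df[col].dropna().unique()))
```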
@dongmeisun But the algorithm and the metric are intricately linked. The goal of the model is to minimize the distance, as defined by the metric, between predictions and expected values. If you're measuring the distance differently from the RMSE, then the neural net (or whatever model we use) must be trained to minimize YOUR distance, not the RMSE.
Thanks @arielis and @lars.ericson. Please work on your algorithms. We have a biweekly meeting set up to discuss issues raised by participants. I will bring your discussion here to the meeting (Feb 5th) and get the answers back to you asap.
Thanks. I am even more confused now. I thought that you were using RMSE, with weights for each feature. Now I understand it's not even an RMSE. Are you, or are you not, taking the squared difference between the prediction and the ground truth, or is it a different measure?
@arielis , We will not be giving out the actual weights, but again, we are happy to share the goal of the scoring metric. It is an RMSE measure with weights that balance for the distribution of the scores. As you can see from the scores, they are non-uniform, with many more low scores than high scores. We want to make sure people won't game the scoring by simply submitting models that have no relationship to the images, that is, a model that randomly selects from the distribution of scores in the training data, or a model that submits all 1s or all 0s. The weights are meant to balance across the full range of scores, so that predicting a 0 (true value) is just as important as predicting a 324 (true value).
Thanks. Is there any chance we will know how the weighted RMSE is calculated? ANYTHING would be better than an RMSE with unknown weights - even an unweighted RMSE - as long as we could reproduce the calculation and understand its meaning.
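For instance, even a purely guessed balancing scheme like the following inverse-frequency weighted RMSE (my own assumption, not the official weights or code) would give us something concrete to reproduce on our side:
```
import numpy as np

def balanced_weighted_rmse(y_true, y_pred):
    # Illustrative only: weight each example by the inverse frequency of its
    # true score value, so rare high scores count as much as common zeros.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    values, counts = np.unique(y_true, return_counts=True)
    inv_freq = {v: counts.sum() / c for v, c in zip(values, counts)}
    w = np.array([inv_freq[v] for v in y_true])
    w = w / w.sum()  # normalize the weights to sum to one
    return float(np.sqrt(np.sum(w * (y_true - y_pred) ** 2)))
```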
@arielis , we will remove your submission.
@james.costello I would like to cancel my submission. I don't feel I have enough information on the scoring metric to use one of my attempts.
By the way, if the underlying assumption is that joint localization is somehow easy, then take a look at this foot: ${imageLink?synapseId=syn21552449&align=None&scale=100&responsive=true&altText=deformed foot} and then compare it with a typical foot: ${imageLink?synapseId=syn21552450&align=None&scale=100&responsive=true&altText=normal foot} Given that 95% of feet will be shaped more or less normally, and 5% or fewer look like the first foot, it is safe to assume that any non-neural-net joint localization solution will probably fail on the first foot. That could lead to a clinically significant failure if a doctor passed an automated result on to a patient without looking at it, when that result came from a joint localization method that worked well 95% of the time but not at all on that patient's foot. Sponsors should not assume:
* that joint localization is easy
* that a robust solution can be designed that does not depend on joint localization
* that it is clinically safe to deploy a solution that cannot correctly assign scores to joints on a deformed foot
@jakechen , "Unless our plan is shown to be seriously flawed and unable to pick out true winners who have the highest chance of making clinical impact, we will keep it as is. Thanks." Without any knowledge of the weighting scheme, there's no way for us to tell... And when the winners are selected and the weighting scheme is (hopefully) revealed, wouldn't it be too late?
@jakechen , without knowing in detail how the score is defined (intention) and computed (implementation), it is impossible for me to compute the score for my solution, or to evaluate how it compares with other solvers' solutions on the Leaderboard (which doesn't exist yet, very late into the game for this challenge). It means I have to wait for someone else to submit to the Leaderboard to get one data point on the possible range of scores. Then I have to waste one of my limited submissions to see where I fall in that range. That still doesn't give me any idea of how much my solution needs to improve to beat someone else's score.

Since we are competing for a contingent payment (and a byline in a journal article, which has some intangible value, but you can't buy a cup of coffee with it in the short term), the setup you have defined makes it a pure gamble whether I should put more or less time into this project, and how much effort I should invest to achieve a better score. For example, if solutions without joint localization don't perform well (and I don't know what "perform well" really means without seeing some scores), then I still don't know whether I need to invest the (very significant amount of) time into labeling the joint training data myself (because the sponsors did not) and into defining and training a joint localization model.

By hiding the scoring mechanism you are seriously disincentivizing the extra effort that might be required to actually solve this problem, given the severely limited amount of training data being supplied compared to comparable machine learning challenges. You are asking us to roll the dice and spend significant amounts of our time so that you can see (at no cost to you) whether the amount of data you are supplying is sufficient for the task, which is valuable information in and of itself. Speaking only for myself, I doubt that I will go the extra mile under the circumstances.
@jakechen, There might have been some confusion over the term "baseline". In my suggestion, when I used the term "baseline", I was not referring in any way to your baseline model. I called a "baseline model" a putative (uninformative) prediction that predicts, for all patients, the mean of each feature in the training set. The RMSE of this "baseline prediction" would be the SD of the feature. Hopefully the competitors would do much better than this baseline prediction, but the concept provides a way to standardize the RMSEs so that they are all on the same scale.
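As a quick check of that claim, here is a tiny numpy demonstration (arbitrary toy values) that the RMSE of a mean-only prediction equals the population standard deviation (ddof=0):
```
import numpy as np

y = np.array([0, 0, 1, 2, 5, 0, 3], dtype=float)   # arbitrary toy feature
baseline = np.full_like(y, y.mean())                # predict the mean everywhere

rmse = np.sqrt(np.mean((y - baseline) ** 2))
assert np.isclose(rmse, y.std())                    # population SD (ddof=0)
```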
The purpose of the baseline model is to triage bad models submitted. We don't wish to use our baseline model in any way to guide the innovative development of predictive models of each team. While the proposed suggestion has a lot of merit, it wasn't the plan adopted by the organizers. Unless our plan is shown to be seriously flawed and unable to pick out true winners who have the highest chance of making clinical impact, we will keep it as is. Thanks.
Great, Jim
For what it is worth, here is how I would design a weighted RMSE that would ensure all individual joint scores are well predicted in subchallenges 2 and 3. One can consider a baseline model that returns the same prediction for all patients. Such a model would minimize the RMSE if it returns, for each feature, its mean in the training set. So a straightforward way to properly weight the RMSE would be to take, for each feature, a standardized RMSE defined as the ratio: the RMSE of the feature in the user's prediction / the RMSE of this feature in the baseline model (which is the standard deviation of this feature in the training set). The interpretation of such a standardized RMSE is pretty easy:
- less than 1: better than the naive baseline model (assuming the mean and SD of the features in the test set are close to those in the training set). The closer to 0 you get, the better the model.
- more than 1: worse than the baseline model.

If one wants to ensure that the score equally takes into account the total erosion (resp. narrowing) score and the individual erosion (resp. narrowing) joint scores, then one would take one half of the standardized RMSE of the total score plus one half of the mean of the standardized RMSEs of the individual joint scores. I think such a score would be fair and would serve the purpose of this competition well. I don't think that knowing that the weighting formula gives equal importance to all features will cause competitors to overfit (I can't think of any way to do it). On the contrary, it will encourage participants to pay attention to each joint score and to the overall score equally, which is, I think, what the competition aims for.
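To make the proposal concrete, here is a minimal sketch (not the official metric; the column names, the total-score column `Overall_Total`, and the per-feature training SDs are placeholders I made up):
```
import numpy as np
import pandas as pd

def standardized_rmse(y_true, y_pred, train_sd):
    # RMSE of the prediction divided by the RMSE of the mean-only baseline,
    # i.e. the training-set standard deviation of the feature.
    rmse = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return float(rmse / train_sd)

def subchallenge_score(df_gt, df_pred, train_sd, total_col="Overall_Total"):
    # Half the standardized RMSE of the total score plus half the mean of the
    # standardized RMSEs of the individual joint scores, as proposed above.
    joint_cols = [c for c in df_gt.columns if c != total_col]
    total_part = standardized_rmse(df_gt[total_col], df_pred[total_col],
                                   train_sd[total_col])
    joint_part = np.mean([standardized_rmse(df_gt[c], df_pred[c], train_sd[c])
                          for c in joint_cols])
    return 0.5 * total_part + 0.5 * joint_part
```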
Dear Jim, Thanks for your answer. I do understand the Challenge Organizing Committee's concerns, but I think the decision not to release the weighting function or the weights is unfounded and hurts the purpose of this challenge. There are two possibilities:
1. The metric function and its associated weights have been correctly chosen to ensure that the winning solution will be the closest to the human evaluation performed today and will prove useful in the clinical setting. In that case, knowledge of this metric will help the competitors design solutions that ultimately fulfill this goal, ensure they optimize their solutions to be the most clinically useful, and make this challenge succeed. Since competitors have access to neither the test set of radiographs nor their scoring, there is no way they could overfit their design by knowing this function.
2. The metric function has been chosen in a way that does not properly ensure that the winning solution will be close to human evaluation in all aspects. Knowing how this function has been designed gives a hint about particular aspects which have more weight than others, and whoever gets this knowledge will focus their efforts on optimizing those particular aspects of their solution. Whoever has prior knowledge of the metric (and there must be some, working with any of the team of academics) gets an unfair advantage over other competitors. Moreover, the choice of this biased metric will cause the winning solution not to be the one that is the most clinically useful, and the organizers would discover this problem only at the end of the challenge, when it is too late, instead of benefiting from the wisdom of the crowd, who could spot possible bias in the chosen weights early on and help choose better weights.

Knowing that the score is a weighted RMSE is not very informative. With no knowledge of the weights, participants cannot correlate the weighted RMSE returned on the test set with any measure they can perform on their side. And in my opinion, since they can rely only on the test set measurement, they will be more prone to overfit on the test set. On the other hand, understanding how the score is produced, and comparing it to the scale of a similar score that we could compute on our validation sets, would help us design solutions that do not overfit (by looking at the difference between the score we get from Synapse on the test set and the score we can compute on our validation sets, we can estimate how much we are overfitting, and this can guide us in minimizing overfitting). So I think it is very important that competitors understand how the score is produced, to ensure that our solutions are well designed for the purpose of this challenge and do not suffer from overfitting.
Dear @arielis @lars.ericson @decentmakeover, this thread has raised several issues regarding the design of the Challenge. These issues have been brought to the attention of the Challenge organizing committee, whose responses are captured below.

First, we greatly value the feedback we receive for the Challenges, and are always open to improving the design of a Challenge. Based on early requests for more training data, we are now providing all training radiographs instead of the original 50 images. We have also provided a small video on image segmentation and the SvH scoring (see @allawayr's comment on this [thread](https://www.synapse.org/#!Synapse:syn20545111/discussion/threadId=6498)). We take very seriously the decisions that are made, and we discuss them as an organizing group.

Regarding scoring, two primary concerns have been raised. First, knowledge of the scoring metric is important in considering modeling choices. To this end, we have revealed in this thread (to be followed by an email to all participants and updated wiki pages) that we will be using a weighted RMSE. It is a custom-coded version of RMSE that we have benchmarked for performance. Second, there appears to be a concern that the organizers will not maintain the integrity of the challenge and will change the weighting scheme as we see fit. This is not the case; we have a fixed weighting scheme that is built into an automated scoring harness. When you submit your docker solution, it gets automatically run and scored by fixed code and the score is returned after processing, so there really is no way for us to manipulate the weights under this scheme. Immediately after the challenge closes, we will release the docker scoring container that contains the scoring code. The container is timestamped on Synapse and has a specific SHA digest that references that timestamped container. Participants will be able to inspect and run this code to reproduce the exact same score as was returned to them during the Challenge. Again, we have no intention of misleading teams, but we also feel that providing the full code would give away too much information (see the next item).

With this updated information, if any participants (@arielis) want us to remove their previous submissions from the queue, we are happy to do so. We anticipate opening up the leaderboard very soon, so please let us know.

Finally, concerns have been raised regarding the number of submissions allowed. Given the limited set of data we have, we have to be strategic in how we design the evaluation scheme. In nearly all DREAM Challenges there has been a cap on submissions to prevent overfitting. Given the limited number of radiographs available for validation, capping the number of submissions is a common and reasonable approach to mitigate overfitting. We note that several queries have been made to increase the number of submissions. We are actively considering increasing the number, but there will still be caps on submissions.

The organizing team is a group of academic scientists, which includes practicing rheumatologists. Accordingly, we have attempted to maximize the likelihood that this Challenge will produce an open-source clinical tool to score radiographs. This requires that we make decisions, particularly for the scoring metrics and the number of submissions, that are aimed at incentivizing the development of the best algorithm, not the tweaking of parameters to optimize against the scoring metric. We have seen examples of this in past DREAM Challenges, where the winning solution won the challenge but was not useful in the clinic. We are again grateful for your feedback, and hope to make this a valuable and scientifically impactful Challenge. Kind Regards, Challenge Organizers.
an R function would be fine as well :)
In case it wasn't clear, I wasn't saying to give us the test/validation ground truth CSV files. I was just saying: give us the function which, given a hypothetical ground truth file and a hypothetical prediction file, computes the score. If the score is really just RMSE, then all you have to disclose to us is a function which looks something like this:
```
import numpy as np
import pandas as pd

def score(gt_fn, pred_fn):
    # Assumes both CSVs contain only numeric score columns, aligned by row and column.
    df_gt = pd.read_csv(gt_fn)
    df_pred = pd.read_csv(pred_fn)
    # Per-column RMSE over all cases.
    rmse = np.sqrt(((df_gt.values - df_pred.values) ** 2).mean(axis=0))
    score_sum = rmse[0]
    score_narrow = rmse[1]
    score_damage = rmse[2]
    return score_sum, score_narrow, score_damage
```
If the scoring function looks significantly different from this, then solvers really need to know. Not for overfitting, but to set the training objective function appropriately.
I can understand the concerns of the organizers, who wish the winning solution to give good scores both for the totals and for the individual joints. I have pointed out the problem with the original unweighted RMSE formula myself, showing that it means the score would be primarily driven by the error on the total erosion and narrowing scores, making the errors on individual joint scores negligible. There have been several conflicting answers given by the organizers, which may mean that they are still not sure how to tackle this problem. Adding weights to parts of the score could be one way to approach it; there are others, which I could give away if you ask me to, even though these ideas may be detrimental to the scoring of my own solution :) But the solution to this problem cannot be a formula including *undisclosed* weights! There is no way to understand the scoring scheme or optimize a solution if the participants do not know how to reproduce it. This may also raise suspicion of a rigged competition, even though I believe that the organizers are sincere in their concerns. In any case, as Lars pointed out, I have never seen an undisclosed scoring scheme in any serious machine learning competition.
@lars.ericson, The Kaggle contest cheat was an interesting story! Thanks for bringing it here. I agree with Lars that it is problematic to use an undisclosed metric scheme. What purpose would it serve to hide the metric function? This approach can't prevent fraud from unscrupulous competitors who may have access to part of the test dataset, but it does hurt reliability, prevent scrutiny, and overall hurt the fairness of the competition.
Hi @james.costello I have already submitted a dockerized solution, and I am waiting for it to be scored. If you are changing the scoring metric relative to what is written in the wiki, I would kindly ask you to cancel this submission (or not count its score).
Jim, people will not be able to compare and challenge scores on the Leaderboard, because they won't know how the scores were arrived at. I have been in US Government challenges including IARPA's GFC1, GFC2, Mercury and PINS, DIU's xView1 and xView2, and NIST OpenSAT and OpenCLIR. In every one of them, the scoring formula was unambiguous and published, to the extent that people could calculate their own scores offline and challenge the Leaderboard score. Almost always, such scoring challenges led to the discovery of an acknowledged bug in the official scoring code.

You can't prevent over-optimization by hiding the scoring function or by limiting the number of submissions. We are training on what is, for machine learning purposes, an extremely small sample set with an even smaller number of positive instances per category, to the extent that you can count them on one hand in some cases, and on zero hands in others, for example categories whose observed values are 0, 1, 3 with no 2. Getting good out-of-sample scores of any kind on this challenge will be a miracle, especially if ground-truth categories occur in the test set that had no examples in the training set.

The 3 solvers who have participated in the Discussion board so far, and hence the total of 3 teams that are likely to submit anything, will most likely just be fighting over the best worst score. Limiting the number of submissions to 5 means that they will just be rolling the dice, not really optimizing against each other, because there won't be enough time or submissions to have a real competition in the sense that happens in every other challenge I mentioned above. Hiding the scoring function makes it a total roll of the dice. There are things you should worry about, especially in a clinical use setting, but those worries are not addressed by the decisions you have made. For example: https://www.theregister.co.uk/2020/01/21/ai_kaggle_contest_cheat/
Hi @lars.ericson, we are preparing a larger email to all participants to address a number of questions and concerns, including the scoring metric, so look for that soon, but in short, the scoring metric will be a weighted RMSE. We will not release the scoring code as we don't want to give away the weighting scheme. We have found in the past that teams will often optimize and overfit specifically to the scoring metric, which kind of defeats the ultimate goal of developing a more generalized and hopefully a real-world applicable model. I hope that addresses your question. Cheers, Jim
@james.costello, @allawayr, any guidance? I'm asking for a Python function which takes a ground truth CSV file and a prediction CSV file and calculates the score for the 3 subchallenges, which we can use for offline testing.

RMSE, Pearson Correlation, F1, scoring code