If I've understood correctly, our models should predict just the probabilities for a positive outcome rather than the outcomes themselves, but the F2 metric requires a prediction for the outcomes, implying some choice of the 0-1 decision boundary. It seems reasonable that the cutoff would be selected such that the predicted number of positive outcomes matches (as closely as possible) the mean value of the predicted probabilities. Is that how it will in fact be selected? Also, in evaluating the mean and variance of the partner-specific AUC values, it could happen that there are no positive outcomes for a given data_partner_id, leading to an indeterminate AUC_i. Am I correct in assuming the undefined values, if any, will be excluded from the mean and variance calculations?

Created by Bruce Cragin (@Bcragin)
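For concreteness, here is a minimal sketch of the cutoff rule suggested in the question above, reading "matches the mean value of the predicted probabilities" as matching the expected number of positives (the sum of the probabilities). This only illustrates the suggestion, not the challenge's method; as the reply further down the thread notes, the plan is to use the F2-optimal threshold instead.

```python
import numpy as np

def expected_count_threshold(y_prob):
    """Pick the cutoff at which the number of predicted positives matches the
    expected number of positives implied by the probabilities (their sum)."""
    y_prob = np.asarray(y_prob, dtype=float)
    k = int(round(y_prob.sum()))          # expected number of positive outcomes
    if k <= 0:
        return 1.0                        # predict no positives
    if k >= len(y_prob):
        return 0.0                        # predict all positives
    order = np.sort(y_prob)[::-1]         # probabilities, largest first
    return order[k - 1]                   # k-th largest probability as the cutoff

# Made-up probabilities: their sum is 2.3, so the top 2 are called positive.
probs = [0.9, 0.7, 0.4, 0.2, 0.1]
cutoff = expected_count_threshold(probs)
print(cutoff, sum(p >= cutoff for p in probs))   # 0.7, 2
```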
Hi @evenmm, We don't have a finalized scaling method just yet. I'll be proposing a scaling method to the challenge organizers at our next meeting for approval and will update this thread with the decision and method. Thank you, @trberg
Hi @trberg, Have you decided how the scaling will be carried out, and will there be any update on the final calculation of the score? Best, Even
Hi @trberg, A follow-up to question 1: the scaling of F2, AUPR, and cross-site generalizability. Can you describe how this scaling will be carried out? Best, Even
Hi @Bcragin, Yes, it will include data_partner_ids of new partners that deposited data over the course of this challenge. So there could be new data_partner_ids. I wasn't going to, but you're not the first person to ask this, so I'm thinking I'll create a "blinded" gold standard file that has the covid_index and outpatient_visit_start_date (task 1)/hospitalization_start_date (task 2), but no outcomes column. So in the meantime, use the gold standard files assuming that a test version will exist with no outcome column. Thank you, Tim
Hi @trberg, Will the Test data ever contain data_partner_ids that are not present in the corresponding Training data? Also, will the covid_index dates be provided for the Testing data, or do we need to regenerate them ourselves if we want to use them? Thanks, @Bcragin
Hi @evenmm,
1. Yes, they will be scaled so they have equal contribution. We did want to give small increases in higher AUCs a boost, hence the squaring.
2. We won't be using care_site_id; we'll be using data_partner_id, and all patients should have that variable.
3. Correct, the AUC values will not be weighted by patient number.
Thank you, @trberg
Hi @trberg, Following up with two more questions: When calculating the cross-site generalizability, will patients missing a care_site_id be included and given a dummy care site, or will they be excluded? And just to confirm: the care site AUC values will not be weighted by patient number, so a care site with 10 patients will matter as much as one with 1000 patients? Best, Even
Hi @trberg, The three components of the quantitative metric typically operate on different scales (at least for unbalanced data). With the squaring, the contributions of F2 and AUPR become very small compared to the cross-site generalizability term. Will there be some rescaling of the metrics before calculating the score? Best, Even
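To make the scale concern concrete, here is a tiny worked example with made-up component values; treating the quantitative score as a plain sum of the three squared components is an assumption for illustration only, not something confirmed in this thread.

```python
# Made-up component values for an imbalanced task, only to illustrate the
# scale concern; combining them as a sum of squares is an assumption here,
# not something confirmed in this thread.
f2, aupr = 0.30, 0.25                      # often small when positives are rare
cross_site = 0.80 - 0.01                   # Mean(AUROC) - Variance(AUROC)

components = {"F2^2": f2**2, "AUPR^2": aupr**2, "(Mean-Var)^2": cross_site**2}
total = sum(components.values())
for name, value in components.items():
    print(f"{name}: {value:.4f}  ({100 * value / total:.1f}% of the total)")
# The squared cross-site term (~0.62) accounts for roughly 80% of the total,
# while F2^2 (~0.09) and AUPR^2 (~0.06) contribute comparatively little.
```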
Hi @evenmm, The motivation for that term was to evaluate generalizability across sites. The idea is that if a model's AUROC has a large variance across the different sites, Mean(AUROC) should be penalized; if the variance is low, the penalty is lower. Hi @christophe.lambert, Yes, data_partner_id is the site identifier. We are definitely going to look at calibration, we just decided not to include it in the quantitative metric. We will consider the calibration metrics as part of the qualitative metrics (utility, interpretability). For the post-challenge publication, those calibration metrics will be included. Thank you, Tim
Hi @trberg, Can you confirm that the variable that defines site is data_partner_id from the person table? I am not so familiar with F2, but am concerned that a post hoc search with the known answers would be used to find the optimal threshold -- shouldn't you just pick 0.5 as the threshold to reward well-calibrated models? I'm disappointed that there are no measures of calibration such as the Brier score. Many publications expect a calibration plot for your model. That is, it is important to know the degree to which a model's predicted probability (ranging from 0% to 100%) corresponds to the actually observed incidence of the binary endpoint, which is commonly assessed using calibration curves, calibration slope and intercept, the Brier score, the expected/observed ratio, and the Hosmer-Lemeshow test. Thank you, @christophe.lambert
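For readers unfamiliar with the calibration measures mentioned here, a minimal sketch of two of them (Brier score and a calibration curve) using scikit-learn, with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

# Hypothetical outcomes and predicted probabilities, just to show the calls.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, size=1000), 0.0, 1.0)

# Brier score: mean squared difference between predicted probability and outcome.
print("Brier score:", brier_score_loss(y_true, y_prob))

# Calibration curve: observed event rate per bin of predicted probability,
# i.e. the data behind the calibration plot mentioned above.
frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted ~{pred:.2f} -> observed {obs:.2f}")
```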
Hi @trberg, I struggle to understand the motivation for the last term in the quantitative metric the way it is currently described in the Challenge Instructions: (Mean(AUROC) - Variance(AUROC))^2. Could you elaborate a bit on the motivation for choosing this metric? I would understand if the term were instead: Variance(AUROC) = 1/n sum over sites [ (AUROC_i - Mean(AUROC))^2 ]. All the best, @evenmm
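For concreteness, a small sketch of the cross-site term as described in the question, using made-up per-site AUROC values; this is an illustration, not taken from the official scoring code.

```python
import numpy as np

# Hypothetical per-site AUROC values (one per data_partner_id), for illustration.
site_aucs = np.array([0.82, 0.78, 0.90, 0.74, 0.85])

mean_auc = site_aucs.mean()
var_auc = site_aucs.var()          # 1/n * sum((AUROC_i - mean)^2), as in the question

# The term as currently described in the Challenge Instructions:
challenge_term = (mean_auc - var_auc) ** 2

print(f"Mean(AUROC)         = {mean_auc:.4f}")
print(f"Variance(AUROC)     = {var_auc:.4f}")
print(f"(Mean - Variance)^2 = {challenge_term:.4f}")
```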
Hi @Bcragin, My plan for the F2 metric is to use the F2-optimal threshold: whatever threshold gives the highest F2 score for each model will be used to generate that model's F2 score. And yes, if a site doesn't have any true positives, the score at that site will not be included in the mean and variance calculations. Let me know if you have further questions, @trberg
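A minimal sketch of the two rules described in this reply, assuming hypothetical outcome and prediction columns alongside data_partner_id; this is an illustration, not the official scoring code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score, roc_auc_score

def best_f2(y_true, y_prob):
    """F2 at the best threshold: scan the observed probabilities as candidate
    cutoffs and keep the highest F2 (beta=2 weights recall over precision)."""
    return max(
        fbeta_score(y_true, (y_prob >= t).astype(int), beta=2, zero_division=0)
        for t in np.unique(y_prob)
    )

def site_auc_stats(df):
    """Mean and variance of per-site AUROC, skipping sites whose outcome column
    contains only one class (e.g. no true positives), where AUROC is undefined."""
    aucs = []
    for _, site in df.groupby("data_partner_id"):
        if site["outcome"].nunique() < 2:
            continue                       # undefined AUROC; excluded per this reply
        aucs.append(roc_auc_score(site["outcome"], site["prediction"]))
    aucs = np.array(aucs)
    return aucs.mean(), aucs.var()

# Tiny made-up example: data_partner_id 3 has no positives and is skipped.
df = pd.DataFrame({
    "data_partner_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "outcome":         [1, 0, 0, 1, 1, 0, 0, 0],
    "prediction":      [0.9, 0.2, 0.4, 0.7, 0.8, 0.3, 0.1, 0.2],
})
print("best F2:", best_f2(df["outcome"].to_numpy(), df["prediction"].to_numpy()))
print("mean/var of site AUROCs:", site_auc_stats(df))
```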

Calculating the Quantitative Scoring Metric