Thank you all for participating in this year's BraTS Challenge. This year we are introducing two new performance metrics: the lesion-wise dice score and the lesion-wise Hausdorff distance-95 (HD95). These were developed to evaluate segmentation performance at the lesion level rather than at the whole-study level, so that we can understand how well models detect and segment multiple individual lesions within a single patient. Traditional performance metrics used in prior BraTS challenges are biased toward large lesions, yet in clinical practice detecting distinct small lesions is just as important as detecting large ones. The code used for the performance metrics is available here: https://github.com/rachitsaluja/brats_val_2023

Here is an outline of how we perform this analysis and compute the final ranking:

1. First, we isolate the lesion tissue sub-regions into WT (labels 1, 2 and 3), TC (labels 1 and 3) and ET (label 3).
2. We perform a dilation on the Ground Truth (GT) labels (for WT, TC and ET) to understand the extent of each lesion. This is mainly done so that, when we do a connected component analysis, we don't classify small lesions near an "actual" lesion as new ones. An important thing to note is that the GT labels don't change in the process.
3. We perform connected component analysis on the prediction label and compare it component by component to the GT label.
4. We calculate dice and HD95 scores for each lesion (or component) individually, penalize all the False Positives and False Negatives with a score of 0 for dice and 374 for HD95, and take the mean for the particular CaseID.
5. Each challenge leader has set a volumetric threshold below which participants' models won't be evaluated for those "small/false" lesions:
   - GLI, SSA, PEDS: 3x3x3mm dilation and minimum 50 voxels
   - MEN: 1x1x1mm dilation and minimum 50 voxels
   - MET: 1x1x1mm dilation and minimum 2 voxels
6. Final Ranking Method: The final ranking is based on the lesion-wise dice and lesion-wise HD95 scores for WT, TC and ET. Each team is ranked for N subjects, 3 regions and 2 metrics, which results in N*3*2 individual rankings. The final ranking score (we call it the BraTS score) for each team is then calculated by first averaging all these individual rankings for each patient (i.e., the Cumulative Rank), and then averaging these cumulative ranks across all patients for each participating team.

Let us know if you have any questions.

-- Rachit Saluja, Jeff Rudie and Ujjwal Baid
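For participants who want to sanity-check their models locally, here is a minimal sketch of the lesion-wise dice computation described in steps 1-5 above. It is NOT the official implementation (see the linked repository for the exact code); the function name, connectivity choice, and edge-case handling here are our own simplifications.

```python
# A minimal, illustrative sketch of the lesion-wise dice described above.
# NOT the official implementation -- see the linked repository for that.
import numpy as np
from scipy import ndimage


def lesion_wise_dice(gt, pred, dilation_factor=3, min_voxels=50):
    """Lesion-wise dice for one binary tissue sub-region (e.g. WT)."""
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)

    # Step 2: dilate the GT only to define each lesion's extent; the GT
    # voxels themselves are never changed.
    struct = ndimage.generate_binary_structure(3, 2)
    extent = ndimage.binary_dilation(gt, structure=struct,
                                     iterations=dilation_factor)

    # Connected components on the *dilated* mask, so small islands near
    # an actual lesion merge into that lesion instead of counting as new.
    gt_labels, n_gt = ndimage.label(extent)

    dice_scores = []
    covered = np.zeros(gt.shape, dtype=bool)
    for lesion_id in range(1, n_gt + 1):
        region = gt_labels == lesion_id
        covered |= region                 # predictions here are never FPs
        gt_lesion = gt & region           # original (undilated) GT voxels
        if gt_lesion.sum() < min_voxels:
            continue                      # sub-threshold GT lesion: ignored
        pred_lesion = pred & region
        inter = np.count_nonzero(gt_lesion & pred_lesion)
        denom = gt_lesion.sum() + pred_lesion.sum()
        # A completely missed lesion (false negative) scores 0 here.
        dice_scores.append(2.0 * inter / denom)

    # Step 4: every predicted component outside all GT lesion extents is
    # a false positive and is penalised with a dice of 0.
    _, n_fp = ndimage.label(pred & ~covered)
    dice_scores.extend([0.0] * n_fp)

    # Mean over lesions for this CaseID; 1.0 when there is nothing to
    # find and nothing predicted (a convention -- the official code may
    # handle this case differently).
    return float(np.mean(dice_scores)) if dice_scores else 1.0
```

The lesion-wise HD95 follows the same component matching: matched lesions get a real HD95, and every false positive or false negative is assigned the fixed penalty of 374.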

Created by Rachit Saluja rs2492
Hello @Ayoub_bzr: Thank you for your question. For the GLI, SSA and PED challenges, we perform a dilation of 3x3x3 on the Ground Truth (GT) labels (for WT, TC and ET) to understand the extent of each lesion. This is mainly done so that, when we do a connected component analysis, we don't classify small lesions near an "actual" lesion as new ones, and each lesion is evaluated accordingly. We don't actually change the GT segmentation; the dilation is only used to understand its extent. We don't evaluate GT lesions under 50 voxels. Please let me know if you have further questions.
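To make the role of the dilation concrete, here is a tiny toy example of our own (not taken from the evaluation code): two nearby GT islands count as one lesion once the dilated extent bridges the gap between them.

```python
# Toy illustration (ours, not the official code) of why the GT is dilated
# before connected-component analysis.
import numpy as np
from scipy import ndimage

gt = np.zeros((1, 1, 12), dtype=bool)
gt[0, 0, 2:5] = True   # lesion core
gt[0, 0, 7:9] = True   # small satellite, 2 voxels away

_, n_raw = ndimage.label(gt)
dilated = ndimage.binary_dilation(gt, iterations=3)   # GT itself unchanged
_, n_merged = ndimage.label(dilated)

print(n_raw, n_merged)   # -> 2 1: the satellite is grouped with the core
```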
Hi, can you provide more details about point 5 (3x3x3mm dilation and minimum 50 voxels)?
@vchung , I thought it might be a mistake on your part, but then I looked at the code for calculating the metrics and saw that the number of true positives, false positives and false negatives depends on the ground truth. Since I do not have the ground truth of the validation set, I cannot reproduce the results provided by the platform. So, I think everything should be ok with the online evaluation. Thanks for your time!
@ShadowTwin41 , Can you also share a submission ID where you observed this incongruence between the submission system and your local testing?
@ShadowTwin41 , thank you for pointing it out. Let me look into it.
Hello @rs2492 , I have a question: are you sure that the online evaluation is using the same code as published here: https://github.com/rachitsaluja/brats_val_2023 ? I'm getting different results between the implementation you provided and the online platform. I think the difference is here:

```python
## Dilation and Threshold Parameters
if challenge_name == 'BraTS-GLI':
    dilation_factor = 3
```

If I change it to `dilation_factor = 1`, it works. Thanks for your time!
Hello @Tumor2023: I will tag @neuronflow and @branhongweili to answer this question! They lead the Synthesis challenges and can guide you best!
Hello @rs2492 , can I ask a question regarding the evaluation method for the missing modality synthesis challenge? I would like to inquire about how the SSIM (Structural Similarity Index) metric will be calculated. Considering that the intensity ranges of the different cases in this challenge are quite diverse, I was wondering if we are required to predict the intensity range of the missing modality, or if the SSIM will be calculated after intensity normalization. For instance, should we normalize all images to a range of 0 to 1 before calculating the SSIM? Your clarification on this matter would be greatly appreciated. Thank you for your time and attention to this query.
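To make the question concrete, this is what the min-max-normalization variant being asked about would look like. Whether the organizers actually normalize before computing SSIM is exactly what is being asked, so this is only an illustration, not their pipeline:

```python
# Purely illustrative: SSIM after min-max normalising both volumes to [0, 1].
import numpy as np
from skimage.metrics import structural_similarity

def ssim_after_minmax(real, synth):
    def to_unit(x):
        x = x.astype(np.float64)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    # data_range=1.0 because both inputs now live in [0, 1]; each image
    # side must be >= 7 for the default SSIM window.
    return structural_similarity(to_unit(real), to_unit(synth),
                                 data_range=1.0)
```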
Hello @ShadowTwin41: Great question! If your model predicts a False Positive lesion of any volume, your model is penalized. However, if there is a GT lesion with a volume of <50 voxels and you happen to predict it with your model, those entries will be ignored and you won't be rewarded or penalized for them. The reason is that for challenges such as GLI, MEN, PED and SSA, these small lesions are actually not part of the pathology. For METS there will be a significantly lower threshold volume (estimated at 2 to 5 voxels), since there could be metastatic lesions that small. I hope this answers your question. Please feel free to ask for clarification on anything!
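For clarity, a small sketch (our own naming, not the official code) of how such a volumetric threshold can be applied: GT components below `min_voxels` are excluded from scoring, so predicting or missing them is neither rewarded nor penalized.

```python
# Sketch of the volumetric threshold on GT lesions (illustrative only).
import numpy as np
from scipy import ndimage

def evaluable_gt_lesions(gt, min_voxels=50):   # METS would use ~2-5
    """Return the ids of GT connected components large enough to score."""
    labels, n = ndimage.label(np.asarray(gt, dtype=bool))
    sizes = np.bincount(labels.ravel())[1:]    # voxel count per component
    return [i + 1 for i, size in enumerate(sizes) if size >= min_voxels]
```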
Hello @ysuter: Thank you for your question! You are correct! We are working with the METS team to get that information out to the participants soon. Yes, the performance of all three labels will be weighted equally.
Dear @rs2492 , Also regarding the volume threshold: if my solution predicts a lesion with a volume of less than 50 voxels, is this lesion not considered? And in the ground truth, are lesions with fewer than 50 voxels also ignored?
Dear organizers, Thank you for providing these details! Do you know when the information for the Metastases challenge will be available (dilation and volume threshold)? I expect this to be very relevant given the many small individual lesions in the dataset. Will the performance of all three labels for the Mets challenge be weighted equally for the ranking? Best and thanks, Yannick
Yes, that is correct. I have added that to the post. Thanks for pointing that out.
Dear @rs2492 , You say in point 4: "We calculate dice scores and HD95 scores for each lesion (or component) individually and we penalize all the False Positives and the False Negatives with a 0 score, we take the mean for the particular CaseID." Should the HD95 be the maximum (374) for those cases and the dice score the minimum (0)?
Hello @ShadowTwin41: Thank you for your question. The lesion-wise metrics should be displayed on the leaderboard in the Results page. You can see all the displayed metrics for all the entries. If your entry is not there for a particular challenge you can let me know and we will find out what's going on. https://www.synapse.org/#!Synapse:syn51156910/wiki/622971
Dear @rs2492 , Are these metrics also displayed in the validation phase, or only in the test phase? I have already submitted some entries, but the lesion-wise metrics are not displayed. Thanks for your time, André Ferreira
