Hi data scientist participants! The evaluation scripts can be found in the GitHub repo [here](https://github.com/SurgicalScience/syn-iss-2023/tree/main/eval_scripts). There is a README file with instructions in the repo. Please reach out to us if you have any questions or need any further assistance. Further information on the metrics can be found on Synapse [here](https://www.synapse.org/#!Synapse:syn50908388/wiki/622714). Stay deep in data, and as always, we wish you great metrics!

Created by Kimberly Glock (SuS_Seattle_DS)
@TaiyoIshikawa thanks for reaching out. Class 1 is the shaft, and if that class is not present, you will see those values. What you are seeing stems from the following code found in the task 2 eval script:

```
if pm is not None and np.sum(pm == class_label) > 0:
    precision = calculate_precision(gt == class_label, pm == class_label)
    hd_distance = calculate_hd_skimage(gt == class_label, pm == class_label)
else:
    precision = 0.0
    hd_distance = "nan"
```
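For participants aggregating results, here is a minimal sketch of how cells with "nan" can be skipped when averaging per-class metrics. It assumes the per-class CSVs contain two filename columns followed by metric values, as in the row shown in the message below; that column layout is an assumption, so please check it against the actual script output.

```
import csv

# Average the numeric metric columns of a per-class CSV, skipping cells that
# are "nan" or non-numeric (e.g. a header row). The layout (two filename
# columns followed by metric values) is assumed, not taken from the script.
def average_metrics(csv_path):
    sums, counts = {}, {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            for idx, cell in enumerate(row[2:], start=2):
                if cell.lower() == "nan":
                    continue  # class not present in this image
                try:
                    value = float(cell)
                except ValueError:
                    continue  # non-numeric cell
                sums[idx] = sums.get(idx, 0.0) + value
                counts[idx] = counts.get(idx, 0) + 1
    return {idx: sums[idx] / counts[idx] for idx in sums}

print(average_metrics("class_1_metrics.csv"))
```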
Dear Organizer, I tried to run `evaluate-task2-parts.py` on the sample test data. It worked well for most of the data, but one thing concerns me about `p-161e753dc6fdb8d9e54b.png` in the output CSV named `class_1_metrics.csv`. `p-161e753dc6fdb8d9e54b.png` has only two segmented parts, jaw and wrist, and its row looks like this:

```
p-161e753dc6fdb8d9e54b.png,pred-161e753dc6fdb8d9e54b.png,nan,nan,nan,0.0,nan
```

I guess that class 1 represents the shaft and that this result is caused by there being no shaft region in the image. Is this expected behavior? Best regards, Taiyo Ishikawa
@GeorgiiKos thanks for explaining what you are observing. We have made updates to the evaluation scripts to address these issues and pushed them to the GitHub repository. Please let us know if you face any further issues using the evaluation scripts. Syn-ISS Organizing Team
Dear Organizer, thank you for sharing detailed information about the evaluation. We have been using output from the Docker container for our evaluation as well. To better illustrate our problem, we created copies of the 6 binary masks from `docker/templates/task1-binary/sample-test-data/inputs/`, replaced the prefix `b-` with `pred-`, and then ran the `evaluate-task1-binary.py` script:

```
python eval_scripts/evaluate-task1-binary.py \
    docker/templates/task1-binary/sample-test-data/test.csv \
    docker/templates/task1-binary/sample-test-data/inputs/ \
    docker/templates/task1-binary/sample-test-data/inputs/ \
    docker/templates/task1-binary/sample-test-data/
```

We are therefore comparing identical RGB masks and expect perfect results. The script yields an IoU of 1.0 and a Hausdorff distance of 0.0 for each image, as expected. However, the resulting f-score, precision, and recall are around 0.0039 for each image. We observed the same problem on our own test split. Could you kindly advise if there are any errors in our approach?

Furthermore, we noticed that both scripts skip the first image listed in the `test.csv` files, as the scripts expect a header row that is not present. Is this expected behavior?

I apologize if this message causes any confusion. Sincerely, Georgii
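For completeness, the copy-and-rename step can be done with a short snippet like the following (a sketch; the exact commands we used may have differed):

```
from pathlib import Path
import shutil

# Copy each ground-truth mask b-*.png to a pred-*.png "prediction" in the same
# folder, so the evaluation script compares identical masks.
inputs = Path("docker/templates/task1-binary/sample-test-data/inputs")
for mask in inputs.glob("b-*.png"):
    shutil.copy(mask, mask.with_name("pred-" + mask.name[len("b-"):]))
```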
@GeorgiiKos thanks for the message. We understand what might be happening here. The following note should be useful for all participants.

Participants,

The evaluation scripts have been written with the output of the Docker containers in mind. They expect the format that the Docker containers will produce when we run them on the test datasets. The evaluation flow diagram posted on the [evaluation wiki page](https://www.synapse.org/#!Synapse:syn50908388/wiki/622714) shows how files and scripts will use test data to generate predictions and then evaluate those predictions against ground truth.

We recommend that participants place and call their segmentation models using the Docker templates that have been provided. The [Docker instructions](https://github.com/SurgicalScience/syn-iss-2023/blob/main/docker/Docker-Submission-Instructions.md) will guide you on where the code needs to be added. The Docker template file `segment.py` saves the predicted labels in a 3-channel RGB image using the value [255, 255, 255] for instrument pixels. Based on the testing we did, the evaluation script should work as intended on the Docker container outputs.

To use the Docker templates you do not need to build a Docker container image. You can call the `main.py` script directly and pass the three command line arguments it requires. This also ensures that your model scripts and functions work with the Docker template as you conduct final tweaks and tests; a sketch of the expected output format is shown below.

> Notes:
> 1. An update to the Docker templates is coming soon that will fix the issue related to using the `skimage.color.label2rgb` method. Until then you may encounter issues similar to those mentioned in [this thread](https://www.synapse.org/#!Synapse:syn50908388/discussion/threadId=10491).
> 2. Please use the updated versions of the dataset moving forward for training and testing your models.

Sincerely, Syn-ISS Organizing Team.
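As a reference for the expected output format, here is a minimal sketch of how a prediction could be written as a 3-channel RGB mask with [255, 255, 255] for instrument pixels (assuming NumPy and Pillow; the actual `segment.py` in the template may differ in its details):

```
import numpy as np
from PIL import Image

# Turn a boolean/0-1 prediction mask into a 3-channel RGB image where
# instrument pixels are [255, 255, 255] and background pixels are [0, 0, 0].
def save_prediction(binary_mask: np.ndarray, out_path: str) -> None:
    rgb = np.zeros((*binary_mask.shape, 3), dtype=np.uint8)
    rgb[binary_mask > 0] = [255, 255, 255]
    Image.fromarray(rgb).save(out_path)

# Example: a dummy 480x640 prediction saved as "pred-example.png".
save_prediction(np.zeros((480, 640), dtype=np.uint8), "pred-example.png")
```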
Dear Organizer, Thank you for providing scripts for the evaluation. Unfortunately, we have encountered some problems with the binary evaluation script, as it produced low precision, recall, and f-score on our test split. We assume that the script expects ground-truth masks and prediction masks in RGB format with (255, 255, 255) representing an instrument, because of the `convert_rgb2label` method call during the Hausdorff distance calculation. In our case, the reason for the low scores is presumably `np.sum(gt_mask)` and `np.sum(pred_mask)`, as they sum the pixel values over the three channels of the image. Could you please confirm whether the expected format for the evaluation scripts is RGB masks? I apologize if this message causes any confusion or if our conclusions are incorrect. Thank you for considering our request. Sincerely, Georgii
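To illustrate our suspicion (a small sketch of our own, not code from the evaluation script): with masks stored as 0/255 RGB images, a boolean foreground count and a raw pixel-value sum differ by a factor of 255, so a ratio that mixes the two comes out near the 0.0039 we observe, even for a perfect prediction.

```
import numpy as np

# A toy 0/255 RGB mask where every pixel is "instrument" (shape H x W x 3).
rgb_mask = np.full((4, 4, 3), 255, dtype=np.uint8)

# Counting foreground entries via a boolean test vs. summing raw pixel values:
bool_count = np.sum(rgb_mask > 0)   # 48    (4*4*3 true entries)
raw_sum = np.sum(rgb_mask)          # 12240 (48 * 255)

# A ratio that mixes the two quantities collapses to 1/255.
print(bool_count / raw_sum)         # 0.00392...
```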
