I'd like to ask for more detail on the pilot data. You mention in the wiki that it contains false positive and false negative data.
* Is this to representative proportions?
* what are those proportions?
* is the cancer flag in images_crosswalk_pilot tsv files actual ground truth from proven pathology?
Thank you
Created by Philip Teare phil_teare

Hi David,
I meant "exams" instead of "subjects". Also, the total of 124 exams mentioned in my answer refers to a previous version of the pilot data that I generated. In the current version, there are 111 exams associated with 58 subjects. Counting exam-breasts (a subject with N exams who systematically has both breasts imaged has 2×N exam-breasts), there are 9 TP, 191 TN, 16 FP, 5 FN and 1 missing (not imaged), for a total of 222 exam-breasts (2×111 exams). There are 14 positive exam-breasts (TP + FN) associated with 13 subjects. Once again, these numbers are not representative of the final dataset.

Hi Thomas,
I'm having problems following your numbers on TP, FN, FP and TN.
In the pilot data set I'm seeing 58 subjects in total of whom 13 have cancer according to the labels provided.
Your numbers for TP, FN, ... do not seem to add up to me. Could you please clarify?
Thank you in advance.
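
For readers following the numbers, the exam-breast counts quoted in this thread for the current pilot version can be sanity-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the figures are copied from the reply in this thread, not recomputed from the data files.

```python
# Exam-breast counts quoted in the thread for the current pilot version.
counts = {"TP": 9, "TN": 191, "FP": 16, "FN": 5, "missing": 1}
n_exams = 111

total_exam_breasts = sum(counts.values())
positives = counts["TP"] + counts["FN"]

# Each exam contributes two exam-breasts (left and right).
assert total_exam_breasts == 2 * n_exams
print(total_exam_breasts, positives)
```

The counts are per exam-breast, not per subject, which is why they do not sum to the 58 subjects.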
No, I meant that you need the output of a classifier **and** a ground truth to generate a confusion matrix (TP, TN, FP, FN).
Whether a breast has developed a cancer within 12 months (cancerL/R set to 1, see [Challenge Dictionary](https://www.synapse.org/#!Synapse:syn4224222/files/)) is based on tissue diagnostics.

Do you literally mean 'and' here:
"based on the decision made by the radiologists to recall a breast **and** whether a breast has developed a cancer within 12 months"
i.e. must both of these things be true to be labeled cancer?
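
As an illustration of how the two pieces of information combine, here is a minimal sketch that treats the radiologist's recall decision as the classifier output and the cancer label as the ground truth. The function name and the example pairs are hypothetical; the recall decisions themselves are not part of the pilot data.

```python
from collections import Counter


def confusion_category(recalled: bool, cancer: bool) -> str:
    """Combine a recall decision with the ground truth into TP/FP/FN/TN."""
    if recalled and cancer:
        return "TP"
    if recalled and not cancer:
        return "FP"
    if not recalled and cancer:
        return "FN"
    return "TN"


# Hypothetical (recalled, cancer) pairs, one per exam-breast.
pairs = [(True, True), (True, False), (False, True), (False, False), (False, False)]
tally = Counter(confusion_category(r, c) for r, c in pairs)
print(dict(tally))
```

In this reading, the cancer label alone is the ground truth; the recall decision is only needed to decide which of the four categories an exam-breast falls into.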
Hi Philip,
These values are based on the decision made by the radiologists to recall a breast and whether a breast has developed a cancer within 12 months after a given mammography exam.
We have decided not to provide the radiologists' output for the Leaderboard Phase; however, this information will be available during the Collaborative Phase.
Thanks!

Thank you
```
In the Pilot Set, there are 10 subjects with TP breast cancer, 4 FN, 10 FP and 100 TN
```
Are you able to say which is which? Also does this mean you have marked them falsely, or that mammographers marked them falsely and you have marked them correctly? Or both?
Thank you.

Hi Philip,
> Is this to representative proportions?
No. The goal of the Pilot Set is only to show examples of mammography images since we can't give participants access to the Challenge training set.
> what are those proportions?
In the Pilot Set, there are 10 subjects with TP breast cancer, 4 FN, 10 FP and 100 TN.
> is the cancer flag in images_crosswalk_pilot tsv files actual ground truth from proven pathology?
As mentioned in the [Wiki documentation of the Pilot Data folder](https://www.synapse.org/#!Synapse:syn6174174), the column _cancer_ has been removed and will not be available during the challenge. The same information is available from the columns _cancerL_ and _cancerR_ in the exams metadata file. You can regenerate the _cancer_ label at the image level if your method requires it. _cancerL_ and _cancerR_ are effectively the ground truth. It is important to note that _cancerL_ set to 1, for example, doesn't strictly mean that the left breast of the subject has a (visible) cancer at the time of the exam (see [Challenge Dictionary](https://www.synapse.org/#!Synapse:syn4224222/files/)).
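
One way to regenerate an image-level _cancer_ label is sketched below. Only _cancerL_ and _cancerR_ come from this thread; the join keys `subjectId` and `examIndex`, the `laterality` column with values `"L"`/`"R"`, and the function name are assumptions made for illustration and should be checked against the actual file headers.

```python
def image_level_cancer(images, exams):
    """Attach a per-image 'cancer' label taken from the exam-level labels.

    images: list of dicts with (assumed) keys subjectId, examIndex, laterality.
    exams:  list of dicts with (assumed) keys subjectId, examIndex, plus
            cancerL and cancerR.
    """
    # Index exam-level rows by (subject, exam) for constant-time lookup.
    by_exam = {(e["subjectId"], e["examIndex"]): e for e in exams}
    side_to_column = {"L": "cancerL", "R": "cancerR"}

    labelled = []
    for img in images:
        exam = by_exam[(img["subjectId"], img["examIndex"])]
        column = side_to_column[img["laterality"]]  # KeyError on unknown side
        labelled.append({**img, "cancer": exam[column]})
    return labelled


# Hypothetical rows, for illustration only.
exams = [{"subjectId": "s1", "examIndex": 1, "cancerL": 1, "cancerR": 0}]
images = [
    {"subjectId": "s1", "examIndex": 1, "laterality": "L"},
    {"subjectId": "s1", "examIndex": 1, "laterality": "R"},
]
labelled = image_level_cancer(images, exams)
print([row["cancer"] for row in labelled])
```

Every image of a breast inherits that breast's exam-level label, so a label of 1 carries the same caveat as _cancerL_/_cancerR_: it does not guarantee a visible cancer in that particular image.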