Once again, many thanks for all the work that has gone into getting this challenge up and running.
When analysing the detailed feedback in the discrepancy report, I am coming across a lot of failures that arise from not retaining text.
Consider an entry such as:
index check_passed check_score tag_ds tag_name file_value answer_value action action_text
919 0 0 <(0018,1030)> <5.1 4DCT & ITV FB + 4D + INSP/EXP 20161113> <5.1 4DCT & ITV FB + 4D + INSP/EXP> tcia TCIA-P15-DESC-C
We are being scored zero here because we replaced the entire tag with ANONYMISED, rather than simply removing the date. I would argue that we are following a valid precautionary principle here, and also one that is entirely in line with the DICOM Standard. Note 4 in Section E.3.5 Clean Descriptors Option says this:
_This Option specifies what needs to be removed, not what needs to be retained. Depending on the application, it may be desirable to retain some information, such as technique description, but discard other information, such as diagnosis, for example because it may bias the interpretation in a clinical trial. For example, one approach is to remove all description and comment Attributes except Series Description (0008,103E)_
The approach we adopted was to process all "descriptor" tags from the set marked C in Table E.1-1 and to replace in its entirety any tag that was flagged as "suspicious". It takes a high level of understanding to be sure that the string preceding the "20161113" in the example above is safe. According to https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview, it would seem that TCIA still relies on manual inspection by curators, whereas the goal of this task is (presumably) to produce an automated system.
As a further example of why we think our approach is safer, consider:
2853 0 0 <(0018,1030)> <4.6 COLONOSCOPY (ACRIN) DR.IYER for 9256614630> <4.6 COLONOSCOPY (ACRIN) DR.IYER>
We score zero for not retaining text that appears to contain PHI, namely **"DR.IYER"**, in the "action text" column.
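To make the difference between the two policies concrete, here is a minimal Python sketch. It is not the code used by our submission or by TCIA: the function names, the two regular expressions, and the notion of "suspicious" used here (a YYYYMMDD-like date or a run of seven or more digits) are all assumptions made purely for illustration.

```python
import re

# Assumed heuristic: an 8-digit token that looks like a YYYYMMDD date.
DATE_RE = re.compile(r"\b(19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\b")
# Assumed heuristic: any run of 7 or more digits (possible ID or phone number).
LONG_DIGITS_RE = re.compile(r"\b\d{7,}\b")


def is_suspicious(value: str) -> bool:
    """Flag a descriptor value if either heuristic pattern matches."""
    return bool(DATE_RE.search(value) or LONG_DIGITS_RE.search(value))


def replace_whole(value: str, placeholder: str = "ANONYMISED") -> str:
    """Precautionary policy: replace the entire value when it is flagged."""
    return placeholder if is_suspicious(value) else value


def remove_flagged_tokens(value: str) -> str:
    """Targeted policy: delete only the matching tokens, keep the rest."""
    cleaned = DATE_RE.sub("", value)
    cleaned = LONG_DIGITS_RE.sub("", cleaned)
    # Collapse the whitespace left behind by the removed tokens.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    for text in (
        "5.1 4DCT & ITV FB + 4D + INSP/EXP 20161113",
        "4.6 COLONOSCOPY (ACRIN) DR.IYER for 9256614630",
    ):
        print(repr(replace_whole(text)), "|", repr(remove_flagged_tokens(text)))
```

For the first entry the targeted policy reproduces the expected answer value. For the second it leaves "DR.IYER for" behind, which differs from the expected answer and still contains what looks like a name. A pattern-based remover on its own therefore cannot tell whether the remaining text is safe.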
Many thanks in advance for any thoughts you might have on this issue.
Created by Simon Doran simonjdoran
Hi Michael,
Thanks very much for your response. It was great to have your succinct summary of the purpose of the challenge. That was really helpful.
Best wishes,
Simon
Hi @simonjdoran,
I understand your concern. Let's start by addressing your comment:
"it would seem that TCIA still relies on manual inspection by curators, whereas the goal of this task is (presumably) to produce an automated system."
You are correct. First, TCIA does employ a hybrid approach, using both automated tools and manual curation, with two main goals: 1) 100% HIPAA compliance and 2) preservation of scientifically valuable information. Second, your task for this contest is to attempt to automate this process with these same goals, and submissions are scored according to the standards set by TCIA.
I agree that removing the full text is the safer method, and it is a possible approach defined in the Clean Descriptors profile option, but it does not preserve the scientific value. We're looking for sophisticated algorithms that preserve data that may be of value in the future.
Regarding the "DR.IYER" text: this is an interesting example of the TCIA team whitelisting the authors of the publication for which the original dataset was published. This should have been addressed during the creation of the synthetic dataset, as users of the dataset have no way of knowing this context. The impact of one series on your score should be negligible. Thanks for bringing this to our attention. I'll add it to my list to correct before public release.
Good luck!
Michael
Apologies: the Synapse platform seems to have redacted my post, removing the contents of the "tag_name" and "file_value" fields for some reason, which really doesn't help the clarity of the explanation!
For both of the entries, these were "Protocol Name" and "ANONYMISED" respectively. (Here's hoping that this correction doesn't also get redacted!)