Dear Challenge participants,

Thanks for reporting the various errors that you have found in the annotated files. We have fixed these errors as well as others that follow the same patterns. Unfortunately, different errors continue popping up. We'll continue to update the files as needed, but we are changing this post into a changelog, which we'll update when new files are released. File updates will be done in the Synapse repository as well as in docker. If you copy files to your local environment for development/testing, please update your copies.

As of 5/4/2020 there have been four sets of file updates.

/////////////////////////////////////////////////////////////////
////    File Update 5/4/20
/////////////////////////////////////////////////////////////////

There are some minor updates in three of the annotated leaderboard files.

* In the Annotated-APOLLO-2 file, the "IIIA" entry under the "stage_figo_2014" column had been incorrectly annotated, and the code annotation for the "G3-poorly differentiated" entry under the "tumor_grade" column had a typo.
* In the Annotated-Outcome-Predictors file, three entries under the "T1FLAIR" column had no annotations.
* In the Annotated-REMBRANDT file, the entries under the "OnStudy Therapy Surgery Procedure Title" column had been annotated with "Other Specify"; this has been changed to NOMATCH.

There is also a change to the caDSR-export.tsv file: all records of CDEs that were used in the annotations of the public leaderboard files, and whose permissible values had content added in the last year, have been updated.

/////////////////////////////////////////////////////////////////
////    File Update 4/27/20
/////////////////////////////////////////////////////////////////

Updated annotated ROI-Masks json file. The CDE annotation for "primary_radiation_therapy" was missing the permissible values and the concept annotations.
/////////////////////////////////////////////////////////////////
////    File Update 4/23/20
/////////////////////////////////////////////////////////////////

Annotated leaderboard json files (Apollo2, Rembrandt, ROI-Masks, Outcome-Predictors). Typos in CDE labels and identifiers have been corrected in all the annotated leaderboard files. One CDE used for annotating the Apollo 2 leaderboard file had been retired and is now replaced with a current CDE. Examples:

* CDE label "Birth Year CONFORMING" changed to "Birth Year Number"
* CDE label "Persom Gender Text Type" changed to "Person Gender Text Type"
* Identifier 289696 changed to 2896960
* Identifier 392464 changed to 3392464
* Retired CDE 2390930 replaced with 7050072 (in Apollo2)

A new caDSR dump file, caDSR-export.tsv, has been generated because the previous one was missing some of the CDEs used in annotations. These CDEs had datatypes or statuses that had been excluded from the caDSR export; they have now been cherry-picked and added to the caDSR dump file.

* CDEs with datatype java.lang.String or java.lang.Integer (3 CDEs in all). Their datatypes appear in the caDSR dump file as CHARACTER or NUMBER, respectively.
* A CDE with status DRAFT MOD has been added. Its status appears in the dump file as DRAFT MOD.
* Two CDEs that had been created subsequent to the caDSR export used to generate the dump file have now been added (one is shown above as a replacement for a retired CDE).

/////////////////////////////////////////////////////////////////
////    File Update 4/14/20
/////////////////////////////////////////////////////////////////

caDSR reference dump file: a new caDSR dump file, caDSR-export.tsv, which addresses the issues below, has been generated. Please note that, in comparison to the previous dump, only the processing has changed; the content is the same. The purpose is to eliminate variation between the annotations in the annotated gold standard files and the caDSR data that was used to annotate them.

* Some annotated files were manually annotated with CDEs that had a "DRAFT NEW" status (https://www.synapse.org/#!Synapse:syn18065891/discussion/threadId=6896). The previous caDSR dump file did not include the CDEs with this status. CDEs with the "DRAFT NEW" status are now included.
* A CDE (ID 6385439) contained the pipe character in its permissible value and value text. This was the only CDE we found with this character (which we use as a field delimiter in the dump file), and the pipe is now replaced with a dash character.
* Multiple CDEs had a large number of permissible values, and in these CDEs the PV field was truncated to 32 Kbytes due to post-processing in spreadsheets. The current dump file is minimally post-processed during the addition of the column headers, and the CDE PVs are not truncated. In addition, a number of fields had been enclosed in unnecessary quotes during such post-processing; these extraneous quotes are not present in the current file.

NCI Thesaurus file: a new Thesaurus.tsv has been posted. This file was reposted with minimal processing in order to avoid some of the processing issues encountered in the caDSR reference dump file above (i.e. truncated fields and extraneous enclosing quotes).

* A difference in the filename extensions (.tsv vs .txt) was present between the files posted in the Sage repository and the docker image, which led to one team having issues in submissions. The new Thesaurus.tsv file has now been posted in the Sage repository and the docker images with the same filename.

Annotated leaderboard json files (Apollo2, Rembrandt, ROI-Masks, Outcome-Predictors). Issues in these files due to errors in manual annotation or processing have been fixed and the files reposted. Some of these issues were reported in emails to the organizers; others were found and fixed by searching for the patterns of errors that had been reported.

* Typos in codes were fixed, e.g. a code "G46110" in a PV was changed to "ncit:C46110" (both the 'C' in the code itself and the missing prefix were corrected). For another example, ncit:25331 was changed to ncit:C25331 (the original annotation was missing the 'C' character in the code).
* Prefixes were fixed (they should be as found in the caDSR dump file). For instance, the LOINC code "LA6156-9" in the annotated Rembrandt file was changed to "ncit:LA6156-9" (although this is a LOINC code, the source in the dump file is the NCIt; fixing the origin of this error in the caDSR dump file is outside the scope of the challenge). For another example, ncit:C0439234 was changed to ncim:C0439234. The solvers should use the codes and prefixes as they are found in the caDSR dump file.
* The PVs of some nonenumerated CDEs were incorrectly assigned codes, e.g. "1-999" is a range, not a code. These "values" are now null (without quotes) in the annotated json files.
* There were differences in how nulls appeared in the json files (null vs quoted "null"); the quotes have now been removed from all of these. In cases where nulls are entered as the empty string "", the scoring program will score them as if they were null.
* Some PV entries that were labeled NOMATCH in the gold standard nevertheless had an associated code. These associated codes have been eliminated and the values are now null.
* A number of Permissible Values (PVs) annotated with multiple codes had entries with formats "prefix:code1,code2" and "prefix:code1:code2". These entries have been edited to have the format "prefix:code1 prefix:code2". Please note that the Data Element Concepts annotated with multiple codes retain the previous array format, i.e. ["prefix:code1","prefix:code2"].

We apologize for the various formatting and processing issues that were found in the previous version of the files.
Furthermore, as we have likely missed other trivial or complex issues introduced during manual annotation or processing, we have decided to allow participants to request review of specific results when they feel their result is a better match to the data than the gold standard annotation.

Thank you again for participating!
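
For participants who want to apply the same PV code conventions described in the 4/14/20 update to their own output (unquoted nulls, and multiple codes written as "prefix:code1 prefix:code2"), the following is a minimal Python sketch. It is only an illustration under our reading of the rules above, not the scoring program's code; the function names `normalize_null`, `normalize_multi_code` and `normalize_pv_code` are hypothetical, and only the two legacy multi-code formats mentioned in the changelog are handled.

```python
import re
from typing import Optional


def normalize_null(value: Optional[str]) -> Optional[str]:
    """Map None, the empty string, and the quoted string "null" to None."""
    if value is None:
        return None
    stripped = value.strip()
    if stripped == "" or stripped == "null":
        return None
    return stripped


def normalize_multi_code(value: str) -> str:
    """Rewrite "prefix:code1,code2" or "prefix:code1:code2"
    as "prefix:code1 prefix:code2".

    Assumption: each whitespace-separated token starts with a single
    prefix followed by ':'; everything after it is a list of codes
    separated by ',' or ':'.
    """
    normalized = []
    for token in value.split():
        prefix, sep, rest = token.partition(":")
        if not sep:
            # No prefix found; leave the token untouched.
            normalized.append(token)
            continue
        codes = [code for code in re.split(r"[,:]", rest) if code]
        normalized.extend(f"{prefix}:{code}" for code in codes)
    return " ".join(normalized)


def normalize_pv_code(value: Optional[str]) -> Optional[str]:
    """Apply the null rule first, then the multi-code rule."""
    cleaned = normalize_null(value)
    if cleaned is None:
        return None
    return normalize_multi_code(cleaned)
```

With this sketch, `normalize_pv_code("prefix:code1,code2")` and `normalize_pv_code("prefix:code1:code2")` both give `"prefix:code1 prefix:code2"`, while `""`, `"null"` and `None` all give `None`. Entries that are already in the target format, and single codes such as `ncit:C25331`, pass through unchanged.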

Created by Gilberto Fragoso fragosog

File Update Changelog (was "Updated files for third leaderboard phase")