Should there be any duplicates in the caDSR-export.tsv file? For example, grep 2644755 data/caDSR-export.tsv | wc -l reports two rows with the same CDE_ID of 2644755, but they differ in the number of CLASSIFICATIONS. How should we resolve these duplicates?

Created by PAUL PERRY paulperry
Hi @paulperry,

Yes, there will be entries with the same CDE_ID in the caDSR-export.tsv file. These entries are different versions of the same CDE. A related question came up before (https://www.synapse.org/#!Synapse:syn18065891/discussion/threadId=6541). Here is part of @DeniseWarzel's reply in that thread:

"There can be multiple versions of the same CDE_ID in the dump file, this is because CDEs are unique by their Public ID + Version, and we did not include the Version number in the dump as it is not needed in the annotation."

The classifications could help you, or a curator annotating manually, select which CDE to use in a particular instance, because they reflect usage. However, compared with the question_text and the permissible_values, the classification carries less weight in the manual annotation workflow (https://www.synapse.org/#!Synapse:syn18065891/wiki/600446).

In the case of CDE 2644755, I can't tell by looking at the PVs whether the permissible values differ; one might be a subset of the other, or there may be a partial overlap. Which CDE is best for annotation will then depend on whether the data can only be annotated with the PVs of one CDE. But if all the data is covered by the PVs of both versions of the CDE, then you'd have to use other criteria to select the best one: the cde_short_name, the cde_long_name, and so on down the line.

I can't say whether this will be the case in the challenge, but I'd add that if your solver selects a version of a CDE that you consider better than the manual annotation in the "gold" standard, you can request a review of the result.

Regards,
Gilberto
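For anyone who wants to check the subset / partial-overlap question programmatically, here is a minimal Python sketch. It groups rows by CDE_ID and compares the permissible values of two rows as sets. Several things in it are assumptions and not taken from the actual export: the PERMISSIBLE_VALUES column name, the "|" separator between values, and the sample rows themselves, which are invented for illustration. Adjust them to match the real file.

```python
import csv
import io
from collections import defaultdict


def rows_by_cde_id(tsv_file):
    """Group the rows of a tab-separated export by their CDE_ID column."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        groups[row["CDE_ID"]].append(row)
    return groups


def duplicated_ids(groups):
    """Keep only the CDE_IDs that appear on more than one row."""
    return {cde_id: rows for cde_id, rows in groups.items() if len(rows) > 1}


def pv_set(row, column="PERMISSIBLE_VALUES", sep="|"):
    """Split a row's permissible values into a set.

    The column name and the separator are assumptions; the real export
    may store permissible values differently.
    """
    return {v.strip() for v in row[column].split(sep) if v.strip()}


def compare_pvs(first, second):
    """Describe how two sets of permissible values relate to each other."""
    if first == second:
        return "identical"
    if first < second:
        return "first is a subset of second"
    if first > second:
        return "second is a subset of first"
    if first & second:
        return "partial overlap"
    return "disjoint"


# Invented sample data, NOT the real content of caDSR-export.tsv.
SAMPLE = (
    "CDE_ID\tCLASSIFICATIONS\tPERMISSIBLE_VALUES\n"
    "2644755\tA\tYes|No\n"
    "2644755\tA;B\tYes|No|Unknown\n"
    "1111111\tC\tMale|Female\n"
)

groups = rows_by_cde_id(io.StringIO(SAMPLE))
dups = duplicated_ids(groups)
first_row, second_row = dups["2644755"]
relation = compare_pvs(pv_set(first_row), pv_set(second_row))
print(sorted(dups), relation)

# With the real file, something like this should work in place of SAMPLE:
# with open("data/caDSR-export.tsv", newline="", encoding="utf-8") as f:
#     groups = rows_by_cde_id(f)
```

If one version's PVs are a strict subset of the other's, data values that fall only in the larger set point to that version; if the data fits both, the tie has to be broken on other fields, as described in the reply above.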
