Hi,
I'm observing some bugs in the scoring code WRT the [algorithm described](https://www.synapse.org/#!Synapse:syn18065891/wiki/600432). I was confused by the scores we were getting as we added valueDomain annotations to our submissions, so I created a minimal one-column dataset to investigate. (LMK if you want me to share that with you.) Looking at a dataset with a single column and copying the exact correct annotations from the leaderboard file, I see that if I leave the valueDomain as a blank list [] I score 5.0. If I add observedValue entries to the list, I lose points for every one I get wrong. The amount I lose for each incorrect value seems to be based on the number of correct values, and it is not limited to the 0.5 maximum for the valueDomain section (e.g. if the leaderboard annotation has two values and I enter three incorrect values, I get a score of 4.25, even if I also include the two correct values).
This caused us considerable confusion when evaluating our various builds and definitely affected the version we submitted for round 2.
Created by tconfrey

@tconfrey ,
Thank you again for the good catch! You are correct -- 0 should be returned, not 0.5. Our logic at the time did not take this test case into consideration, but we have implemented a fix for it.
@attila.egyedi ,
Your concerns are understandable! However, adding all synonyms will not result in 0.5, as the score is normalized by the number of predicted value-domain entries, that is, (# of correct predicted matches) / (total # of predicted matches).
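A rough sketch of a score computed that way (illustrative only, not the challenge's actual scoring code; the 0.5 valueDomain cap and all names/values here are assumptions):

```python
# Illustrative sketch only, not the challenge's scoring code.
# Assumes the 0.5 valueDomain maximum is scaled by
# (# correct predicted matches) / (total # predicted matches).
def vd_score(predicted, gold, max_points=0.5):
    """predicted/gold: sets of (observedValue, annotation) pairs."""
    if not predicted:
        # No predictions at all scores 0 when the gold standard expects values.
        return 0.0 if gold else max_points
    correct = len(set(predicted) & set(gold))
    return max_points * correct / len(predicted)

gold = {("Mutant", "PV-1"), ("wt", "PV-2")}                  # made-up values
exact = {("Mutant", "PV-1"), ("wt", "PV-2")}
padded = exact | {("Mutant", f"syn-{i}") for i in range(8)}  # "all synonyms" case

print(vd_score(exact, gold))   # 0.5: both predictions correct
print(vd_score(padded, gold))  # 0.1: padding with synonyms dilutes the score
```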
Thank you both again for your insightful comments! We really appreciate them, and please continue to let us know of any other cases we may have missed.
Best,
Verena

Hello @attila.egyedi, @tconfrey --
I apologize for responding a week late -- thank you for continuously helping us perfect our scoring methods! I have taken both of your inputs into account and will hopefully push all the fixes by EOD tomorrow.
We really appreciate everyone's patience during this time.
Best,
Verena

Hi @v.chung , @fragosog ,
Turns out there's still a significant scoring bug here. I'm getting a 0.5 score on a totally incorrect column prediction when the gold standard has a "NOMATCH". See below for a minimal test case.
Regards,
Tony
```
# Minimal incorrect prediction file:
bash-3.2$ cat output/ROI-Masks-minimal.json
{
  "columns": [
    {
      "columnNumber": 1,
      "headerValue": "TERT-Promoter",
      "results": [
        {
          "resultNumber": 1,
          "result": {
            "dataElement": {
              "name": "Molecular Sequence Analysis Genetic Marker Assessment Status",
              "id": 64579
            },
            "dataElementConcept": {
              "name": "Molecular Sequence Analysis Genetic Marker Assessment",
              "id": 2010547,
              "conceptCodes": [
                "ncit:C25574",
                "ncit:C17565",
                "ncit:C16622",
                "ncit:C20989"
              ]
            },
            "valueDomain": []
          }
        }
      ]
    }
  ]
}
# Minimal gold standard file, column extracted (and renumbered) from annotated golden source
bash-3.2$ cat data/leaderboard_annotated/Annotated-ROI-Masks-minimal.json
{
  "columns": [
    {
      "columnNumber": 1,
      "headerValue": "TERT-Promoter",
      "results": [
        {
          "resultNumber": 1,
          "result": {
            "dataElement": {
              "name": "NOMATCH",
              "id": null
            },
            "dataElementConcept": {
              "name": "NOMATCH",
              "id": null,
              "conceptCodes": []
            },
            "valueDomain": [
              {
                "observedValue": "Mutant",
                "permissibleValue": {
                  "value": "NOMATCH",
                  "conceptCode": null
                }
              },
              {
                "observedValue": "wt",
                "permissibleValue": {
                  "value": "NOMATCH",
                  "conceptCode": null
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
# Obviously we'd expect this to score as 0.0, but it gets 0.5
bash-3.2$ docker run -v /Users/tconfrey/Documents/projects/metadata-automation-challenge/output/ROI-Masks-minimal.json:/submission.json:ro -v /Users/tconfrey/Documents/projects/metadata-automation-challenge/data/leaderboard_annotated/Annotated-ROI-Masks-minimal.json:/goldstandard.json:ro metadata-scoring score-submission /submission.json /goldstandard.json
0.5
```
I am wondering what happens if I submit all the synonyms of all concepts from the Thesaurus in the valueDomain section. That should give me 0.5 points, but I am not sure that would be fair.
Actually, based on the current code, the more I submit in the VD, the fewer deductions I get.
IMHO the VD list should not be allowed to be longer than the actual OV list.
In my opinion, what the scoring code should do for this part is (rough sketch below):
- take all **observedValue**s from the Gold Standard
- look each one up in the submission, then check either the **value** or the **conceptCode** of that valueDomain element
- the submission should not contain the same **observedValue** more than once (so it could effectively be treated as a map)
- then divide the number of correct matches by the number of observedValues in the Gold Standard
Maybe this is over-explaining, or maybe I have misunderstood something, but I think this would work, and I don't believe the current algorithm does it.
Of course, if the Gold Standard OV list is empty and the submission's is not, that should be 0 points; if the submission's is empty as well, that should be max points.
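In pseudo-Python, something like this (just a sketch of the idea, not working scoring code; the field names follow the JSON files above, everything else, including the function name and the 0.5 cap, is assumed):

```python
# Sketch of the proposal above; field names follow the JSON files shown
# earlier in this thread, everything else (names, 0.5 cap) is assumed.
def proposed_vd_score(sub_value_domain, gold_value_domain, max_points=0.5):
    """Score = (# gold observedValues matched by the submission) / (# gold observedValues)."""
    if not gold_value_domain:
        # Gold Standard has no observed values: max points only if the
        # submission is also empty, otherwise 0.
        return max_points if not sub_value_domain else 0.0
    # Treat the submission as a map keyed by observedValue
    # (i.e. the same observedValue may not appear twice).
    sub_by_ov = {e["observedValue"]: e["permissibleValue"] for e in sub_value_domain}
    correct = 0
    for gold_entry in gold_value_domain:
        sub_pv = sub_by_ov.get(gold_entry["observedValue"])
        if sub_pv is None:
            continue
        gold_pv = gold_entry["permissibleValue"]
        # Either the value or the conceptCode has to match.
        if sub_pv["value"] == gold_pv["value"] or (
            sub_pv.get("conceptCode") is not None
            and sub_pv.get("conceptCode") == gold_pv.get("conceptCode")
        ):
            correct += 1
    return max_points * correct / len(gold_value_domain)
```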
Let me know,
Regards,
Attila

@tconfrey ,
The score for the value domain coverage is calculated as the true positive rate of the submitted matches, so missing annotations should not affect this particular score. That being said, there is one exception: the score will be 0 if there are 0 value-domain predictions while the gold standard has more than 0.
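For example (made-up values, and an assumed reading of the behaviour described above), a submission that annotates only a subset of the gold observed values still gets full marks on this component, while an empty prediction against a non-empty gold standard gets 0:

```python
# Made-up values; assumed reading of the behaviour described above.
gold      = {"Mutant": "PV-1", "wt": "PV-2"}   # 2 gold observed values
submitted = {"Mutant": "PV-1"}                 # 1 correct prediction, "wt" missing

correct = sum(1 for ov, v in submitted.items() if gold.get(ov) == v)
print(0.5 * correct / len(submitted))          # 0.5: the missing "wt" is not penalised

empty = {}
print(0.0 if gold and not empty else 0.5)      # 0.0: no predictions while gold has some
```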
Best,
Verena

Thanks Verena, Gilberto, it looks much better now.
I am still seeing one corner-case issue: if I submit fewer annotated Observed Values than the gold standard contains, I don't seem to lose any points. For example, looking at a column that has two entries in the valueDomain array in the gold standard data, if I enter only a single (correct) value in my submission I still get top marks. So I get deductions for incorrect annotations, but not for missing ones.

Hi @tconfrey,
Thank you again for notifying us of this bug! We have implemented a fix and the scoring model has been re-built to include it. Please let us know if the scores received are still not as expected.
Best,
Verena

Thanks Gilberto. The check is in the mail...

Hello @tconfrey ,
I'll send you an email separately. Yes, we'd like to get your test file.
Regards,
Gilberto

Hi @tconfrey ,
I'll alert the team member in charge of the scoring program, in case he hasn't seen your post yet.
Regards,
Gilberto