Hi there,
this is team 243IDA speaking. We have looked into the proposed evaluation scheme (see https://www.synapse.org/#!Synapse:syn20825169/wiki/600404) and believe it has several drawbacks.
(1) Performance should be evaluated in a leave-one-subject-out (LOSO) scheme, so that ML models are tested on data from subjects they have never seen before. You propose evaluating personalized models, but this will only encourage overfitting to the original data and will not allow any conclusions about the generalization capability of the proposed model(s).
(2) MSE: Due to the high class imbalance, MSE is not a reliable metric for judging model performance, because even a model that always predicts the majority class (0) would score well. A short sketch after this list illustrates both points.
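The sketch below is purely illustrative: it uses an invented label distribution, placeholder features, and placeholder subject IDs rather than the challenge data. It shows a leave-one-subject-out split for point (1) and the low plain MSE achieved by a constant majority-class predictor for point (2).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical, heavily imbalanced severity labels (0-4): ~90% are class 0.
y = rng.choice([0, 1, 2, 3, 4], size=1000, p=[0.90, 0.04, 0.03, 0.02, 0.01])
X = rng.normal(size=(1000, 16))            # placeholder sensor features
subjects = rng.integers(0, 10, size=1000)  # placeholder subject IDs

# (1) LOSO: every test fold contains only subjects the model has not seen.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])

# (2) A model that always predicts the majority class (0) still obtains a
# small plain MSE, because the rare classes barely contribute to the average.
y_pred = np.zeros_like(y)
print("plain MSE of a constant 0-predictor:", mean_squared_error(y, y_pred))
```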
Proposal:
(1) The test set should contain unseen subjects for whom meta-information (which can be used to personalize models) is available.
(2) The MSE score should be class-weighted so that the performance of submitted models or predictions can be benchmarked reliably; one possible definition is sketched below.
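The following is only a sketch of such a class-weighted score, assuming integer severity labels 0-4. It is implemented here as a macro-averaged MSE (the per-class MSEs averaged with equal weight), which is one possible choice and not an official challenge metric:

```python
import numpy as np

def macro_averaged_mse(y_true, y_pred, classes=(0, 1, 2, 3, 4)):
    """Average the MSE computed separately within each true class.

    Every class contributes equally, so a constant majority-class
    predictor is no longer rewarded for ignoring the rare classes.
    """
    per_class = []
    for c in classes:
        mask = (y_true == c)
        if mask.any():  # skip classes absent from this split
            per_class.append(np.mean((y_true[mask] - y_pred[mask]) ** 2))
    return float(np.mean(per_class))

# Example: constant 0-predictions are now penalized heavily.
y_true = np.array([0] * 90 + [1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.zeros_like(y_true)
print(macro_averaged_mse(y_true, y_pred))  # (0 + 1 + 4 + 9 + 16) / 5 = 6.0
```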
Cheers,
#243IDA
@Jingan_Qu -
Yes, you may use any real value you can express as float.

@sieberts How about prediction values like -1.6 or 5.8, which are outside the range 0-4?

@Jingan_Qu -
Yes, you can use float values.

@sieberts Actually I have the same concern as @Dactyl, and I also have another question: if MSE is used to evaluate our model, can we include values like 3.4 and 1.8 in our predictions? Thanks!

For anyone else viewing this thread, the evaluation scheme linked to in the original post has been moved here: https://www.synapse.org/#!Synapse:syn20825169/wiki/600897

Thanks for your explanation @sieberts - this is helpful for better understanding the purpose of the challenge.

@Dactyl -
Thank you for your thoughtful comments.
The task for this challenge is to build _personalized_ models for the subjects provided; this was decided after careful discussion with clinicians and evaluation of the available data. The specific models will not be generalizable to other individuals, as you stated, and are not expected to be. While global models may be possible in the future, building generalizable population models will require substantially more data (specifically, more individuals) than are currently available, given the heterogeneity of the disease manifestation. As the symptom labels are self-reported, they are more subjective than clinician-evaluated labels, which further supports the suitability of individualized models in this case.
To your second point, we did explore the use of macro-averaged MSE in our pre-challenge evaluation, but found it highly variable depending on the data split. While we agree with you in principle, in practice there is not enough data in the rare classes, and therefore too much variance in this metric to statistically distinguish between models.
Best,
Solly
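To illustrate the variance concern raised in the reply above, here is a rough, self-contained sketch: it repeatedly draws small test sets from a hypothetical imbalanced label distribution (the class probabilities and the noisy mock model are invented for illustration) and compares how much the plain and macro-averaged MSE fluctuate across splits:

```python
import numpy as np

rng = np.random.default_rng(1)

def macro_mse(y_true, y_pred, classes=(0, 1, 2, 3, 4)):
    # MSE per class, averaged over the classes present in this split.
    return np.mean([np.mean((y_true[y_true == c] - y_pred[y_true == c]) ** 2)
                    for c in classes if np.any(y_true == c)])

plain, macro = [], []
for _ in range(200):  # 200 simulated test splits
    # Hypothetical imbalanced labels: the rarest class appears ~2 times per split.
    y = rng.choice([0, 1, 2, 3, 4], size=200, p=[0.90, 0.04, 0.03, 0.02, 0.01])
    y_hat = np.clip(y + rng.normal(0, 0.7, size=y.size), 0, 4)  # noisy mock model
    plain.append(np.mean((y - y_hat) ** 2))
    macro.append(macro_mse(y, y_hat))

print("plain MSE mean/std:", np.mean(plain), np.std(plain))
print("macro MSE mean/std:", np.mean(macro), np.std(macro))
# With only a handful of rare-class samples per split, the macro-averaged
# score varies considerably more from split to split than the plain MSE.
```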