validation metrics mismatch

I am running a Docker image locally and doing model selection on a train/validation split from synthetic_small. The best model had a good cross-validation AUC on the synthetic data (0.69), so I submitted to the submission track. In the same logs from the remote execution, all models report ~0.50 AUC. The remote run should also be on synthetic_small, so I really don't understand what is happening. Any advice would be greatly appreciated! https://www.synapse.org/#!Synapse:syn21074863

Remote:
{('knn-50', (0,)): {'mean': 0.5056994911493145, 'std': 0.001253499187269469},
 ('nb', (0,)): {'mean': 0.5023483097229078, 'std': 0.0015709190515220683},
 ('rf', (0,)): {'mean': 0.5093644109143289, 'std': 0.0044939723848974045},
 ('xgboost', (0,)): {'mean': 0.5083829686693513, 'std': 0.00692975203598295},
 ('knn-50', (0, 1)): {'mean': 0.4968063084240032, 'std': 0.007022946417330783},
 ('nb', (0, 1)): {'mean': 0.5011018260055085, 'std': 0.008501295315360091},
 ('rf', (0, 1)): {'mean': 0.4999121328359145, 'std': 0.011161201140373528},
 ('xgboost', (0, 1)): {'mean': 0.49920252054128156, 'std': 0.005649276274481058}}

Local:
{('knn-50', (0,)): {'mean': 0.5495818869654114, 'std': 0.006640718489192565},
 ('nb', (0,)): {'mean': 0.5464131861065545, 'std': 0.0011579488407931215},
 ('rf', (0,)): {'mean': 0.5752633573933903, 'std': 0.006774571960443343},
 ('xgboost', (0,)): {'mean': 0.5706863140075102, 'std': 0.0005712915433592758},
 ('knn-50', (0, 1)): {'mean': 0.5404653498725577, 'std': 0.016252211926316762},
 ('nb', (0, 1)): {'mean': 0.5337382602348395, 'std': 0.006756026689029304},
 ('rf', (0, 1)): {'mean': 0.6731838711740076, 'std': 0.003118069627863762},
 ('xgboost', (0, 1)): {'mean': 0.6939327853248356, 'std': 0.004791861678338949}}
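For reference, the model selection is essentially a cross-validated sweep over models and feature groups. Below is a minimal sketch of a loop that would produce dictionaries of this shape; it is illustrative only, and load_features_and_labels plus the feature-group column ranges are placeholders, not the actual pipeline code:

```python
# Illustrative sketch of a model-selection loop producing
# {(model_name, feature_group): {"mean": ..., "std": ...}} dictionaries.
# load_features_and_labels() and the column ranges are placeholders.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = load_features_and_labels("train_small")  # placeholder data loader

models = {
    "knn-50": KNeighborsClassifier(n_neighbors=50),
    "nb": GaussianNB(),
    "rf": RandomForestClassifier(n_estimators=100),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}
feature_groups = {(0,): slice(0, 100), (0, 1): slice(0, 200)}  # placeholder column ranges

results = {}
for model_name, model in models.items():
    for group, cols in feature_groups.items():
        # 5-fold cross-validated AUC for this model on this feature subset
        scores = cross_val_score(model, X[:, cols], y, cv=5, scoring="roc_auc")
        results[(model_name, group)] = {"mean": scores.mean(), "std": scores.std()}
print(results)
```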

Created by Ivan Brugere (@ivanbrugere)
Hi @ivanbrugere, I'm currently having the same problem and am unsure how to proceed. Would you mind sharing what you think caused it in your case?
I don't think the synthetic data was supposed to have any signal (see https://www.synapse.org/#!Synapse:syn18405991/discussion/threadId=6129). I think developing models this way leads to better / more robust models, since it reduces the chance of overfitting to the validation/test dataset.
Hi @trberg, the definition of true positives should be the same, because all of the code that ingests death.csv is identical between the remote and local runs. I am fairly certain the issue is the two different labelsets. My understanding was that the synthetic data doesn't follow the distribution of the real data, but that it would have some signal so that we can evaluate our models locally before submission. Is there no evaluation data we can use to make an educated guess about whether a model may produce a better AUC than our current best? The ef369 labelset has *some* signal for these purposes.
Hi @ivanbrugere, there may be a difference between how the fast lane defines the true positives and how you are defining them locally, which could explain the discrepancy. Are you defining true positives as patients with death dates within 180 days of their last visit, or within 6 months of their last visit? The other thing to keep in mind is that the synthetic data only mimics the real data in form and variable type, not in variable distribution or correlations. I would be surprised if you found any real signal, so it doesn't surprise me that you are getting random results. The main purpose of the synthetic data is to give participants something with which to check that their models will run.
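For concreteness, here is a minimal sketch of how those two definitions can diverge. The file paths and OMOP-style column names (person_id, death_date, visit_end_date) are assumptions for illustration, not necessarily what the evaluation harness uses:

```python
# Sketch: "death within 180 days of last visit" vs. "death within
# 6 calendar months of last visit" can label some patients differently.
# Paths and column names are assumptions, not the official harness.
import pandas as pd

death = pd.read_csv("train/death.csv", parse_dates=["death_date"])
visits = pd.read_csv("train/visit_occurrence.csv", parse_dates=["visit_end_date"])

last_visit = visits.groupby("person_id")["visit_end_date"].max()
merged = death.set_index("person_id").join(last_visit, how="inner")

label_180d = merged["death_date"] <= merged["visit_end_date"] + pd.Timedelta(days=180)
label_6mo = merged["death_date"] <= merged["visit_end_date"] + pd.DateOffset(months=6)

# Deaths falling between 180 days and 6 calendar months after the last
# visit get opposite labels under the two definitions.
print((label_180d != label_6mo).sum(), "patients labeled differently")
```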
More info on this: there is another death.csv that was (quietly?) updated in the current fast lane archive, fast_lane_synthetic_training_data.tar.gz (md5: 737cd6977dc379465a835b56cbaf9391). The updated death.csv has md5 a605a1dc1120c402f9d2c1279a362cee, and its modified date is Oct 21, which is after I started working on the small synthetic dataset. Are the a605a labels incorrect? Vanilla baselines (naive Bayes, random forest) on the ef369 labelset gave some lift over random and an ordering that made sense (e.g., NB < RF < gradient boosting). I have tried both a static and a dynamic model on a605a, and the result is always random.
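For anyone else hitting this, a quick way to check which labelset a local copy corresponds to is to hash it against the two md5s quoted in this thread (a minimal sketch; the local path is whatever you extracted the archive to):

```python
# Check which death.csv labelset a local copy matches, using the md5
# hashes quoted in this thread.
import hashlib

KNOWN_HASHES = {
    "ef3699905d17334dfe4cd0623b35d897": "original labelset (ef369...)",
    "a605a1dc1120c402f9d2c1279a362cee": "updated fast-lane labelset (a605a...)",
}

with open("train_small/death.csv", "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()

print(digest, "->", KNOWN_HASHES.get(digest, "unknown labelset"))
```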
More info: the code should be identical, and the running pipeline shouldn't have any scope to affect this. There may be two issues:

1. different data
2. different library versions

Re: #1, the md5 of my death.csv is:

(base) ibrugere-ltm:ehrdc ibrugere$ md5 train_small/death.csv
MD5 (train_small/death.csv) = ef3699905d17334dfe4cd0623b35d897

I executed:

docker run -v /Users/ibrugere/Desktop/ehrdc/train_small:/train:ro -v /Users/ibrugere/Desktop/ehrdc/scratch:/scratch:rw -v /Users/ibrugere/Desktop/ehrdc/model:/model:rw docker.synapse.org/syn20833371/dates_cv_yearsplit_eval bash /app/train.sh
