I just got the e-mail telling me about this. It says:

> A. 1 dataset with the data from 10 subjects
> B. 10 datasets, each including the data from only one subject (same subjects as in A)
>
> It appears that your submission fails to satisfy at least one of these two requirements:
> - The confidence levels must be identical in A and B.
> - A submission that successfully processes all the subjects in A must be able to successfully process the individual subjects from B.

I understand the point being made, but I don't understand what is meant here by 'datasets'. As I understand it, we have:

/inferenceData/*.dcm
/metadata/images_crosswalk.tsv
/metadata/exams_metadata.tsv (Sub-challenge 2 only)
/output/predictions.tsv

What difference would we see in these files in the cases A and B? As I understand it, you are saying that there could be one instance with:

/inferenceData/
  one.dcm
  two.dcm
  three.dcm
  four.dcm
  five.dcm
  six.dcm
  seven.dcm
  eight.dcm
  nine.dcm
  ten.dcm

or 10 separate instances (environments), each with:

/inferenceData/
  one.dcm

or

/inferenceData/
  two.dcm

and so on. Is that a correct understanding?

Created by Peter Brooks (fustbariclation)
Hello,

I think that we are wasting a little too much energy on this issue in the last few days. From our perspective, two points are important:

1. Reproducibility
2. Common sense

It is clear that the organisers have these two points in mind, and for this reason they gave a quick answer to test reproducibility with common sense: the 0.0000001 tolerance.

Probably a better solution would be to apply a variable tolerance for each participant, which could be (mean of all SC1 scores) * 0.0000001, i.e. a tolerance that is about 6 orders of magnitude smaller than the provided scores; this avoids the trick of score scaling. But even with this change it would probably still be vulnerable to other tricks (remember that we are all smart people, and if we spend enough energy and time we can come up with good tricks :-) ).

In my particular case, our submission applied a regularisation that was not reproducible (the change in scores was on the order of 1e-3), and I am happy that the organisers pointed it out because, since we are not experts in the field, it helped us understand the importance of reproducibility. A minor change was also required.

For the other cases, I think that fighting too hard over the reproducibility of cuDNN is a waste of energy and time and a loss of focus on the main problem: **breast cancer** (although it has been interesting to learn about this cuDNN issue).

Best
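As an illustration only (not the organizers' actual check), a minimal sketch of such a relative tolerance in Python; the tab-separated format and the `confidence` column name are assumptions:

```python
import csv

def read_confidences(path):
    # Read the confidence column from a predictions.tsv-like file
    # (tab-separated, with a header containing a "confidence" column).
    with open(path) as f:
        return [float(row["confidence"]) for row in csv.DictReader(f, delimiter="\t")]

def reproducible(path_a, path_b, rel=1e-7):
    a = read_confidences(path_a)
    b = read_confidences(path_b)
    # The tolerance scales with the mean score, so rescaling every score by a
    # constant factor cannot make the check artificially easier to pass.
    tol = rel * (sum(a) / len(a))
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))
```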
@yuanfang.guan My submission is reproducible and I am NOT using any tricks (like limiting digits). I just proposed that solution because I wanted to help. Reproducibility is possible in the environment (even using cuDNN). Also, please note that this test the organizers are using is important because it enforces that exams are processed in a truly individual way. While reproducibility might be dropped, this criterion has to be enforced. Best, Dezso
@ynikulin I mean the score of each individual example is reproducible.
Thanks for all your replies and suggestions. @Ryu Have you tried to check whether it is reproducible at the level of single examples? In my case, I had perfect reproducibility in terms of AUC even with a non-constant summation order at the level of Python dictionaries; that's why I had not fixed it earlier. Basically, it can be that such small variations (especially if they are zero-mean, which should be the case) are averaged out in integral scores like AUC or cross entropy over 500+ examples. Also, I really don't know the cuDNN implementation details, but according to the guy who seems to know ("Non-determinism now only exists for convolution operations, which rely on NVidia code which we allow to select a non-deterministic summation-order version if it is faster." from [here](https://github.com/Microsoft/CNTK/issues/361)) it should not depend on the training/testing mode, as non-deterministic summation should happen in both cases.
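A quick way to see why tiny zero-mean jitter can leave AUC untouched; this is a toy simulation with random labels and scores, not challenge data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)          # toy binary labels
scores = rng.random(500)                       # toy confidence levels
jitter = rng.normal(0.0, 1e-7, size=500)       # run-to-run noise of ~1e-7

# With 500 scores spread over [0, 1], perturbations of 1e-7 almost never
# swap the rank of two examples, so the AUC typically does not change at all.
print(roc_auc_score(labels, scores), roc_auc_score(labels, scores + jitter))
```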
I suggested dropping the reproducibility requirement altogether, and some people don't agree -- I don't understand why. What I suggest is simply not to apply a double standard. Either everyone needs to be reproducible (without any tricky workaround such as dividing by 1000, or excuses like "not binary reproducible, but actually reproducible"), or nobody needs to be reproducible. The organizers have to decide between the two. It can be either choice, but it cannot apply only to the people who choose not to use the tricky 'reproducible' way. **I did not use dividing. I asked for confirmation that my submission is valid, but I never got one. Then, no matter what my test result is, you have to consider my submission valid.** Clearly, like me, ynikulin also doesn't want to become 'reproducible' through this tricky way; otherwise this problem would be very easy to fix, right? The organizers have already decided to drop binary reproducibility, which must have surprised even themselves, as reproducibility was their flag over the years and is the basis of doing any science. **Then EVERYONE must be allowed to be non-reproducible. Either way, I request 48 additional training hours, as I need to either 1. reference the test set to assemble ranks and exclude outliers, if no reproducibility is needed (clearly, this can be used by others now but not by me), or 2. use this time to make my submission binary reproducible in any scenario.** I think I can fix it either way in 48 hours, but if anyone thinks more time is preferable, then I have nothing to do until July and I am happy to extend an entire phase to fix this.
@ynikulin I used TensorFlow with cuDNN very intensively in the Kaggle lung cancer competition that just finished. I was able to binary-reproduce my predictions on 500+ cases, each with an input of ~500 images of about 500x500. My model ensembles 10+ CNN models, some of them non-trivial multi-task models, computed with a mixture of GTX 1060, 1080 and Titan X GPUs, and a mixture of Ubuntu 16.04 and CentOS 7. I didn't expect that, as I still believe floating-point arithmetic is not associative and special care must be taken for reproducibility, but at least I'm rather confident about the binary reproducibility of TensorFlow (0.12.1, with Python 2.7) and cuDNN across any combination of the GPUs and OSes I have mentioned above. I'm willing to open my code for a stress test if there's any doubt about this. The two posts you mentioned are both about training; my statement above is about prediction only. In my post above I mentioned that we also received the non-reproducibility message. It turned out we do have at least one bug in our Python code that caused the non-reproducibility. The bug could also have, theoretically, very slightly lowered our AUC.
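For anyone who wants to run the same kind of check, a minimal sketch: hash the prediction files from two runs and compare the digests (paths are placeholders):

```python
import hashlib

def file_digest(path):
    # SHA-256 of the raw bytes: identical digests mean bit-for-bit identical files.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(file_digest("run1/predictions.tsv") == file_digest("run2/predictions.tsv"))
```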
Hi Yaroslav, I'm using Caffe + CUDA + cuDNN, and my results turned out to be reproducible according to Thomas. As was previously suggested: if you limit your numbers to 3-4 decimal places, they will be reproducible (0.0645, for example). I don't know how much it would hurt your precision. Best, Dezso
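A sketch of that precision-limiting workaround, assuming a pandas-readable predictions.tsv with a `confidence` column (the column name is an assumption):

```python
import pandas as pd

preds = pd.read_csv("/output/predictions.tsv", sep="\t")
# Keep only 4 decimal places, e.g. 0.0645187 -> 0.0645, so sub-1e-4
# run-to-run noise can no longer change the written values.
preds["confidence"] = preds["confidence"].round(4)
preds.to_csv("/output/predictions.tsv", sep="\t", index=False)
```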
Hello everyone,

Update from my side: after having fixed all the issues with Python dictionaries, I still have results that vary slightly from run to run on the same (!) dataset of one single example. I traced all the atomic probabilities (I call atomic the result of one single push of an image through a particular network, the inference() method), and while the order of averaging was fixed (I sorted the keys of the Python dictionaries), the atomic probabilities still varied slightly (for example 0.0645187 vs 0.0645188) from run to run. I was astonished, as I did believe these atomic probabilities would stay stable.

Next, I googled several discussion threads about the non-determinism of the cuDNN implementation of convolution, for example [this](https://groups.google.com/forum/#!topic/torch7/bkVMKmmwG1A) and [this](https://github.com/Microsoft/CNTK/issues/361). Moreover, this max difference is reported to sometimes be more than 1e-06, see [here](https://groups.google.com/forum/#!topic/torch7/wVFUxpYduNE).

So, what do we do with all this, especially with single-shot variations larger than the established threshold? It should concern all or almost all participants, as cuDNN is a low-level component of most DL frameworks. My apologies if I misunderstood something; I am just trying to figure it out.

Best, Yaroslav
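A minimal sketch of the tracing described above; `inference` stands for a single forward pass of one image through one network and is a placeholder, not actual challenge code:

```python
def atomic_spread(inference, image, n_runs=10):
    # Push the same image through the same network several times and report
    # the spread of the returned probability; anything > 0 exposes
    # run-to-run non-determinism at the single-inference level.
    probs = [inference(image) for _ in range(n_runs)]
    return max(probs) - min(probs)
```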
> So is our approach to reproducing the error correct?

@mimie001 Yes, this is how we test that the predictions of the methods are reproducible and also do not depend on the data from other test subjects.

> If I have received no such letter, does that mean my finished submission has passed all tests?

@riblidezso The predictions from your best submission in Round 3 are reproducible.
@tschaffter If I have received no such letter, does that mean my finished submission has passed all tests? Thanks, Dezso
Is it possible for the organizers to test each submission on a small subset to assess its reproducibility before evaluating it on the entire test set? That would be much better than running a submission for a few days only for it to be declared not "reproducible". @tschaffter, @thomas.yu, @brucehoff
I am just asking for my submissions to be considered valid. What's the difference between non-binary-reproducible and not reproducible at all? **It is clear that by dividing by 10^6 every submission will become reproducible, but not necessarily 'binary reproducible'.** You said your submission is reproducible but just cannot be binary reproducible; do you even believe what you are saying? The reason I ask this way is just that I **don't want to become reproducible through such tricks**; I would rather the organizers just not require reproducibility at all. **Binary reproducibility was required in all previous challenges, from training to prediction in a single bash file.** All your submissions are not reproducible at all, right? So neither are mine. But for the challenges where I have some chance to win, I always make them binary reproducible, otherwise the results are unpublishable. Of course, you managed to obtain this exception because of your exceptional performance and your exceptional ability in many other respects. But then you should allow everyone else to make this exception in a legitimate rather than a fishy way. Now, **nobody** can assure reproducibility in this environment unless we use tricks like the ones I mentioned. Why can't they just lift the reproducibility requirement?
Strange suggestion, as it was already declared by the organizers that binary reproducibility is not required. From my side, I'd also like to know whether my submission with ID 9605324 (I fixed a couple of places where I had a non-deterministic order of addition terms) passes the tolerance test or not - maybe I could fix something before the deadline. Thanks, Yaroslav
I suggest making all non-binary-reproducible submissions eligible for the leaderboard/community phase, but not for the prize. When you publish your papers, you only need the top algorithm to be reproducible; who cares about the rest? There is only one day left, and last night we had a tornado and all servers went down. Even if you tell me now that there is a bug, there is no way that in the next 36 hours I could possibly fix anything.
@mimie001 That's also what I did to test my code. But my submission has not completed yet, so I don't know whether mine will pass the criterion or not. @tschaffter Is there any recommended way to systematically test our code to make sure it is reproducible? My submission will take a few days to execute. What if it fails the reproducibility requirement in the end? That would be a disaster for anyone who has spent months working on a solution. Thanks!
Hi Thomas,

We have received the same mail regarding the mentioned problem for our submission (ID 8384938) and tried to reproduce the error to test whether our last scoring submissions in the validation round are correct, but with no success. So I want to make sure that we understand the problem correctly: I took the first 10 subjects of the pilot crosswalk file and saved them into a new crosswalk file (Dataset A), then I took every single subject and saved each one separately into its own crosswalk file (Datasets B, C, and so on). For example:

| subjectId | examIndex | imageIndex | view | laterality | filename | cancer |
|---|---|---|---|---|---|---|
| 20 | 1 | 1 | CC | R | 000135.dcm | 0 |
| 20 | 1 | 2 | CC | L | 000136.dcm | 0 |
| 20 | 1 | 3 | MLO | L | 000137.dcm | 0 |
| 20 | 1 | 4 | MLO | R | 000138.dcm | 0 |
| 98 | 1 | 1 | CC | R | 100151.dcm | 0 |
| 98 | 1 | 2 | CC | L | 100152.dcm | 1 |
| 98 | 1 | 3 | MLO | L | 100153.dcm | 1 |
| 98 | 1 | 4 | MLO | R | 100154.dcm | 0 |

These would be the two subjects with the IDs 20 and 98, split into two files (lines 1-4 and 5-8). Running our inference script results in identical predictions for each subject in the complete prediction file (Dataset A) and in the files from the single runs (Datasets B, C, ...).

So is our approach to reproducing the error correct? Thanks
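For reference, a sketch of that splitting with pandas; the file location is the one from the challenge setup and the column names are those shown in the table above:

```python
import pandas as pd

crosswalk = pd.read_csv("/metadata/images_crosswalk.tsv", sep="\t")
subjects = crosswalk["subjectId"].unique()[:10]

# Dataset A: the first 10 subjects together.
crosswalk[crosswalk["subjectId"].isin(subjects)].to_csv(
    "crosswalk_A.tsv", sep="\t", index=False)

# Datasets B, C, ...: one crosswalk file per subject.
for sid in subjects:
    crosswalk[crosswalk["subjectId"] == sid].to_csv(
        "crosswalk_subject_{}.tsv".format(sid), sep="\t", index=False)
```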
@tschaffter Can you please quickly test whether my existing 3 submissions to the validation round satisfy these criteria? Also, if a submission is indeed non-reproducible, will we get no score? (Just to be prepared.) Also, are you not worried at all that nobody actually passes the test? If my submission is not reproducible, can you please just retain my score but make it ineligible for the prize? I think I would need a miracle to get any prize, but not getting a score at all is disappointing. Thanks
Hi,

As a reminder, please also check that your method satisfies this very important requirement:

> A submission that successfully processes all the subjects in A must be able to successfully process the individual subjects from B.

While there is no valid reason for a method to fail this requirement, we have observed several methods submitted in Round 3 that do not meet it, that is, the method fails to process datasets that include data from only one subject (sometimes only one exam in SC2). I don't know if this is the cause of the errors, but please review carefully the values that the fields can take in the exams metadata and images crosswalk files. In particular, if you don't specify the types of the columns yourself (int, string, etc.), please make sure that your parser always handles the data it reads correctly. Code is more sensitive to this type of error when there is only one row or a couple of rows.
I thought they said strictly 0.000001; I was just going to divide by 1000 just to make sure everything passes. R uses doubles, so it won't make any difference in performance. A more appropriate test would be: randomly pick 100 individuals (or some pairs) that are closest and contiguous (adjacent in ranking) in prediction values, and retest these individuals; the relative rank of these individuals should not change between the separate and joint runs (ties are fine). This ensures reproducibility of the challenge results. But I doubt there would be a single submission that passes this test, so I suggest not requiring reproducibility at all.
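A sketch of such a rank-stability check; the inputs are dicts mapping subject id to confidence from the joint run and from the per-subject runs (names are illustrative):

```python
def ranks_agree(joint, separate):
    # No pair of subjects may swap order between the two runs; equal scores
    # (ties) are allowed to resolve either way.
    ids = sorted(joint)
    return all(
        (joint[a] - joint[b]) * (separate[a] - separate[b]) >= 0
        for i, a in enumerate(ids) for b in ids[i + 1:])
```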
@yuanfang.guan That doesn't work if the difference is measured in a relative sense.
@thefaculty can just divide all predictions by 10 or 100 (or anyone can divide by 10^6) to make them reproducible without affecting AUC/AUPRC. I think it might be a good idea to lift the reproducibility requirement in this way, as I think less than 5% of the submissions can be strictly reproducible. But it would have been better if this had been announced long ago, so that everyone could assemble ranks instead of absolute values across models.
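A tiny check of the claim that dividing every prediction by the same constant leaves AUC unchanged (AUC depends only on how the scores rank the examples); toy data, not challenge data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=100)
scores = rng.random(100)

# Dividing by a positive constant is a monotone transform: the ranking,
# and therefore the AUC, stays the same.
print(roc_auc_score(labels, scores) == roc_auc_score(labels, scores / 1e6))  # expected: True
```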
Hi @ynikulin Wouldn't it be easy for you to simply limit the precision of your output confidence scores if the difference is as small as 1e-7? The scoring system simply relies on your output file, right?
Hello @tschaffter, When will we know whether our models have passed the new reproducibility requirement with the threshold tolerance? Can you send this info together with the e-mail saying that the inference has completed? What will you do if a submission does not pass but the method satisfies this condition internally? I checked on 6 cases from the pilot data; the max difference was ~1e-07, but who knows if it stays the same when other examples are selected. Best, Yaroslav
Hi Yuanfang,

> there will be no test such as shuffling the image order or visit order, right?

That's correct when testing for the reproducibility of the predictions. While ideally the order of the images should not affect the confidence levels, I understand that teams may have arbitrarily decided to use, for instance, the first CC view when multiple views of this type are provided. Therefore we will not shuffle the order of the images.
Hi Thomas, I just want to make sure: within a patient, there will be no test such as shuffling the image order or visit order, right? Because I don't know how to write code efficient enough to finish in time, I only make predictions based on the last image of a particular breast in the crosswalk file. (No matter what, I won't be able to change this anymore in the next 3 days.) Thanks,
Hi all, You should have received a newsletter a few minutes ago with information regarding the threshold that we will apply. To those who received an email on Friday about the reproducibility test, I just want to clarify that no threshold was applied then, that is, we were comparing the exact predictions that your inference method generated on datasets A and B. Thanks!
@ynikulin You should try your best to reproduce. Can you just fix the addition, if you already know it is an ordering problem? You could simply add the probabilities in a fixed order, unless the actual order depends on other individuals. Otherwise you are putting the organizers in a very difficult situation: if they let you get the cash prize, or any cash prize in a previous round, you will clearly be the first one to win a DREAM competition with a non-reproducible result; if they don't let you win, then the winning submission will have to go to a much worse method. Either situation is very ugly. @smorrell @vacuum The rest of us don't need to worry about reproducibility at all. In previous challenges, they only ensured that the winning submission is binary reproducible from training to test in a single bash script. I am sure over 95% of the other submissions are not reproducible at all.
Hi @tschaffter, This diagnosis (thank you Yaroslav!) also supports applying a threshold, i.e. tolerating differences of, say, < 10^-5. Would that be a reasonable solution which is easy and quick to implement? If so, it would be helpful to rerun that criterion against the test results. Thanks & regards, Stephen
Hello @tschaffter, I did various tests locally on the pilot data (running examples separately and all together) and the biggest difference I observed was in the 7th digit. I also checked my code and have not found an obvious bug. The only explanation I can see right now is the non-associativity of float addition. An example of predicted probabilities:

98 R 0.0739966252004 - one single example in the crosswalk file
98 R 0.0739966265974 - 2 examples in the crosswalk file
98 R 0.0739966308465 - all the pilot data processed together

I believe that different partitions create different probability dictionaries in Python, which I then use to average the scores, and apparently the probabilities for each case can be averaged in a different order => slightly different results. It is possible that I have not reproduced the problem you are mentioning, but I have not found a more serious issue in such a limited time. Could you please give more details about the problem? If the difference is only due to the order-dependent addition, it is alright, I hope? Thanks, Yaroslav
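The summation-order effect suspected here can be seen in isolation with plain Python floats:

```python
# The same three numbers, summed in a different order, give different results:
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False: float addition is not associative
```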
Hi all,

> If you can simulate the Docker locally using the pilot set, you can create different datasets (by manipulating the images_crosswalk.tsv or the exams_metadata.tsv). The score for a patient that appears in several datasets should be the same in all cases, i.e. it does not depend on the particular dataset or the order in which patients are processed.

Yes, that's the correct way to proceed. You can also decompose the dataset into individual datasets on the Express Lane.

> We have already sent two submissions for the validation round which will suffer from this problem. Will they be considered totally invalid for this issue? If so, can you please kill the submissions so we can fix the problem?

@alalbiol You can send a request to cancel your submissions to @brucehoff. Thanks!
Dear all, Apologies for the delay in response. We are working on a response. Best, Tom
This issue seems fairly widespread. It would seem reasonable to apply an error tolerance. Could the organisers please comment on those points? Thanks.
Dear @alalbiol, Thank you for your reply! We tried rearranging the pilot data, but still cannot reproduce the problem. I suspect that our image processing may have some floating-point rounding issue on the training/submission lane machine under very rare conditions, so I am focusing on that for now.
Dear @vacuum, If you can simulate the Docker locally using the pilot set, you can create different datasets (by manipulating the images_crosswalk.tsv or the exams_metadata.tsv). The score for a patient that appears in several datasets should be the same in all cases, i.e. it does not depend on the particular dataset or the order in which patients are processed. Hope this helps.
Dear Organizers, Can you supply more diagnostic information? We have tried to debug this problem by running many tests on the pilot data and analysing the very detailed log we got from the training lane, but we still don't understand where the problem is.
We also got the same message. Floating-point arithmetic is, strictly speaking, not associative, so parallelization might have caused the non-reproducibility. I've participated in multiple competitions, mainly on Kaggle, over the years. While all competitions have reproducibility requirements, I've never really seen convincing indications that such requirements are enforced. I have to say this reproducibility check done by the DREAM organizers is a giant step forward in competition organization. Well done!
Thanks Yaroslav for sharing. In our case, the randomness in our submission to SC1 can be explained by the regularisation. But our submission to SC2 did not use this regularisation, so we cannot (yet) explain what is going wrong, as in your case.
The same for me. Apparently, that's how the Organizers test requirement number 6: "When submitted for evaluation, the methods must process subjects INDIVIDUALLY and generate a confidence level in the interval [0,1] for both breasts. In other words, the confidence levels for a subject should be independent of the other subjects in the evaluation data set." Intrinsically my approach of course satisfies it, so either I have a small bug in my inference script (a standard +-1 for-loop error when processing in batches) or I can even imagine that probabilities being added and averaged in a different order (float addition is not associative) could cause this issue. In any case, I hope that the Organizers will be understanding and not too formal with respect to such problems.
A related question, please: can the organisers supply diagnostic information, e.g. the confidence differences, etc.? This issue could be difficult to fix with only 5 days left - has this test been applied before? Thanks.
We also got the same e-mail. As far as I understand, what you want is that the confidence level of a particular study is DETERMINISTIC, right? So if you change the order in the evaluation dataset, it remains the same.

In our case, we added a regularisation step that adds some randomness to the images, and this is probably the reason for the differences, because we did not reset the random seed for each study.

In Round 2 we did not add this regularisation, so could you please check that our best submissions to Round 2 are OK? (So we can check that we did not add any other source of randomness by mistake.)

We have already sent two submissions for the validation round which will suffer from this problem. Will they be considered totally invalid because of this issue? If so, can you please kill the submissions so we can fix the problem?

One final question: we assume that the atomic unit of images is the patient study, right? If this is correct, we can fix our issue by just resetting the random seed for each study.

Thanks
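A sketch of that per-study seed reset, assuming numpy's global random state drives the regularisation; `inference_fn` and the seeding scheme are illustrative, not the team's actual code:

```python
import hashlib
import numpy as np

def seed_for_subject(subject_id):
    # Derive a deterministic 32-bit seed from the subject id alone, so the
    # randomness no longer depends on which subjects were processed before.
    digest = hashlib.md5(str(subject_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** 32)

def process_study(subject_id, images, inference_fn):
    np.random.seed(seed_for_subject(subject_id))   # reset before any randomized step
    return [inference_fn(img) for img in images]
```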

DM Challenge: Please check that predictions are reproducible