Hello, reading through the challenge rules, I cannot find clear instructions regarding the external data policy. Could this be clarified? Is external data:
- allowed, whatever the dataset?
- allowed, assuming the data is public?
- not allowed?

Thank you!

Created by olivier c (@olivierc)
@bill_lotter @ynikulin A very simple question: did you use external public mammography datasets to achieve your score? Thank you.
So much confusing and misleading information in this thread. I wish I had not entered here.
In case people are using public datasets, I am looking forward to seeing an AUC of 1.0. So, since public data are permitted, here are a couple of papers about classifying whole mammograms: https://arxiv.org/pdf/1612.05968.pdf http://cs.adelaide.edu.au/~carneiro/publications/multimodal.pdf Many of you already know them. Enjoy!

P.S. Don't forget multi-instance learning. Here is a good Python implementation of multi-instance SVM: https://github.com/garydoranjr/misvm
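To make the P.S. concrete, here is a minimal sketch of bag-level training with the misvm library linked above; the toy bags, feature dimension, and labels are invented purely for illustration and stand in for per-image patch features:

```python
import numpy as np
import misvm  # https://github.com/garydoranjr/misvm

# Toy data: each bag is a 2-D array of instance feature vectors
# (e.g., features of candidate patches from one mammogram), and each
# bag-level label is +1 (cancer) or -1 (no cancer).
rng = np.random.RandomState(0)
bags = [rng.randn(rng.randint(3, 10), 16) for _ in range(20)]
labels = np.array([1 if i % 2 == 0 else -1 for i in range(20)])

# MISVM treats a bag as positive if at least one instance is positive,
# which matches the "one lesion makes the exam positive" setting.
classifier = misvm.MISVM(kernel='linear', C=1.0, max_iters=20)
classifier.fit(bags, labels)

# Predictions are real-valued; take the sign for bag-level labels.
predictions = np.sign(classifier.predict(bags))
```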
I am just reading both replies. Are you saying that you used public datasets but no private datasets? Or that you used only the leaderboard dataset? This could be implicitly misleading information that leads one in the wrong direction... I admit that you are very, very smart, but it is still hard to believe that blindly training a model can reach such high accuracy.
@ynikulin Thank you for sharing your opinion. I think there is a bit of confusion in your comments about the difference between transfer learning and model training. Transfer learning is based on the assumption that a model trained to solve a DIFFERENT PROBLEM (an ImageNet model is not trying to discriminate between cancer and non-cancer images) can be appropriately fine-tuned to solve the problem at hand. This is completely different from training a model on the SAME PROBLEM and then fine-tuning it on a dataset of the same kind. That is not transfer learning; there is no transfer of information from one problem to another, since the model is trained on the same stuff.

That said, I don't agree with what you say at the end. It is definitely true that deep learning NEEDS a lot of data to generalize well. So please don't make false statements, because having more data (even if it's only about using more cancer images, since only a handful of them are positive in the training set) can be extremely helpful. I hope that now everything is much clearer.
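To make the distinction concrete, here is a minimal sketch of what transfer learning from ImageNet usually looks like in practice; the choice of Keras/VGG16, the number of frozen layers, and the binary head are illustrative assumptions, not anyone's actual challenge pipeline:

```python
from keras.applications import VGG16
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Start from a model trained on a DIFFERENT problem (ImageNet object
# classification) and adapt it to mammography.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the early layers: their generic edge/texture filters transfer;
# only the later, task-specific layers are retrained.
for layer in base.layers[:-4]:
    layer.trainable = False

# New head for the new problem: cancer vs. no cancer.
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit(...) would then run on the challenge mammograms.
```

Pretraining on another public mammography dataset and then fine-tuning on the challenge data would use exactly the same mechanics, which is why the two cases are hard to separate in the rules.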
Hello to everyone. Let me also present my point of view.

First, I did not use any private data in getting 0.85.

Second, I can understand (and support) the demand to make public any data used in creating a model, simply for the sake of further research and result improvements. If only a (pre-)trained model is released, it fixes the architecture and training procedure forever, without the possibility to play with architectures and training strategies. Please don't forget: first of all, we are trying to solve the problem. At least that's how I see things. So everything that improves the results and will serve the community in the future - on the table, please.

Third, if you want to forbid pretrained models, you are automatically banning any kind of transfer learning. Seriously, you permit a model pretrained on ImageNet (VGG or whatever), but forbid a model pretrained on other public sources of data? It's just illogical and counterproductive.

Finally, I do not understand your astonishment at all - this question was discussed several times already, as @tschaffter pointed out.

P.S. I can confirm that from my side it was also a lot of blood, sweat and tears. Deep learning does not begin to work automatically once you have more data. Not yet.
I can see some fuzziness in the definition of an open model and reproducibility. I'm expressing my personal opinion here: private data shall not be allowed, as it introduces a strong (and unfair) bias into the results. And to be completely explicit: publicly available data shall be allowed.
Yuanfang, that's what the community phase is for, I guess, lol ;) - but thanks! It's been a lot of blood, sweat, and tears, which I'm sure everyone else is going through too.
Congratulations, Bill. Do you have some insight to share? 0.12 is a huge gap; that is obviously not achievable just by tuning parameters...
Congratulations, Bill. Your score sets a high bar and pushes the rest of us to improve. But at this point I think the post referenced in https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=830 should be clarified in terms of what Paolo and Yuanfang are referring to: the fair use of a priori knowledge. From our perspective, **we have no access to hospitals with well-tagged data**, a route that other fair competitors have mentioned and declined. This can be a big issue for us, as some of you have mentioned you could have access to such data. The work necessary to perform well in the next round may not be worth the expected results if this door is left completely open and unclarified...
The following statement has been part of the rules since the beginning of the Challenge.

> Participants are free to use external data other than the Data provided to develop and test algorithms and Entries.

This sentence clearly states that the use of external data is allowed.

> Also this is completely in contradiction with the "reproducibility" required in the wiki (if your model is based on random numbers we should also provide the seed). How can you reproduce the results if someone was using an external dataset?

We are asking participants to make sure that the output generated by their containers is reproducible, i.e. that the same values are generated for a given subject when the container is run multiple times (thus the instructions regarding the random seed). Please also see the response from @Justin.Guinney to the [same question](https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=830&replyId=7773).
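As a concrete illustration of the seed instructions (not an official requirement on any specific library), a container's Python entry point might pin down its randomness like this; which frameworks need seeding depends on your own pipeline:

```python
import os
import random
import numpy as np

SEED = 42  # any fixed value; the point is to record it and reuse it

# Fix every source of randomness the pipeline touches so the container
# produces identical predictions for a given subject on repeated runs.
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

# Deep learning frameworks keep their own generators and need their own
# seeds as well, e.g.:
#   import tensorflow as tf; tf.set_random_seed(SEED)  # TF 1.x
#   import torch; torch.manual_seed(SEED)
```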
For what it's worth, I can tell you that I'm not using any private datasets in getting 0.86. Not saying that in a braggadocious way, I just don't want to be accused of using private data if that's what's being insinuated.
> Also this is completely in contradiction with the "reproducibility" required in the wiki (if your model is based on random numbers we should also provide the seed). How can you reproduce the results if someone was using an external dataset?

Let's give them some time to re-think it. Implicitly, "pre-trained" obviously refers to ImageNet-like models, not to private datasets. I feel Thomas (together with Justin) is playing a word game here. That's not good... especially within this community...

I think I over-reacted when I saw the leaderboard. It is indeed hard to imagine that, just by deep learning on this cloud platform with only 300 training hours and less than 1% of the images, one can achieve such a high score. You can say I am jealous; that I won't deny. But I think, for a fair comparison on this platform without seeing any images at hand, it takes some magic to generate such a gap. Even if you gave me 1000 hours, I could not achieve this performance.

Having a difficult cloud set-up where we cannot see the images, while at the same time allowing external data, basically means that everyone without internal/private data should stop here.

I am very sad. I can't believe I was so stupid as to refuse internal datasets, and I can't believe the organizers could be so stupid as to allow this to happen. I thought one of us would have some brain!!!

I absolutely agree with Kiko.
So **fair use of pre-trained models should not include data** that is not available to everyone. Using data from a home hospital should eliminate a participant, right? As Paolo says, the 1% of images requires a lot of effort to make usable for training a good model. And people with access to well-formed, well-tagged clinical data **could have an unfair advantage**.

I tend to agree with Paolo; that means you don't even need to explain your model...

You have also been saying for 6 months that we have to use the cloud data and cloud hours. In **all challenges**, one has to reproduce the model on the challenge data. I have spent at least 5% of my life span making these models reproducible on the challenge data. And now you tell me we don't have to reproduce them. I am really confused. Perhaps you need to check with your boss on this?

Then I spent at least 100 hours on thresholding the images; since we cannot see them, practically we have to threshold. Now you tell me that we can use internal data. Actually, I was offered internal datasets numerous times, yet I refused them, because I thought that was blatant cheating. Now you just make me think I am really stupid.

Here is the reference: [DREAM10 OFFICIAL CHALLENGE RULES > 5. USE OF OTHER DATA](https://www.synapse.org/#!Synapse:syn4294018/wiki/232126)

Here is the same question asked six months ago: https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=830

> but do we need to disclose the training images? i.e. if I acquire some images from my hospital, do I need to disclose them?

No. Thanks!
But do we need to disclose the training images? I.e., if I acquire some images from my hospital, do I need to disclose them?

I think using external data should be strictly forbidden, and all models must be completely regenerable and reproducible on the cloud platform and on the cloud data only (or, at most, on a well-known outside dataset such as DDSM). This was the requirement in all challenges and should have remained the requirement here. Otherwise, who knows what anyone does? If someone somehow acquires the test set and uses it as training data, you won't even have a way to find out, unless everything is reproducible on the cloud.

You said we are limited in hours (and obviously in the techniques we can use) since we are working on a dataset we cannot see. Obviously, the biggest obstacle is not knowing where I predict wrongly. If outside data is allowed, this should have been spelled out clearly very early on, and obviously everyone's strategy would be very different now.
Hi Olivier,

Someone already asked that question, but I can't find the reference. Yes, you are allowed to use a model that has been pre-trained on either a public or a private dataset. However, you must agree to make the content of your Docker containers (code and pre-trained model) available under an open source license once the challenge ends. Thanks!
