Hello,
I already joined the pre-challenge phase because I wanted to try docker and because I think the format of this competition might be the next level in data-challenges.
So I have already learned a lot. However...
The infrastructure is VERY slow. Much slower than my local PC, Amazon etc. I don't know what those 24 processors are doing but they are not doing it for me. I use battle-tested software that I have run on many different setups, so I'm quite sure it's not my software. Having the data local would allow for much faster experimentation. More experiments = better model.
Now here come my concerns.
**1. 1Gb modelstate.**
This allows someone (not me) to save and download around 50.000 PNGs to their local system. This is not allowed, but there is not much you can do to prevent it.
Having 50.000 files on a local PC gives a HUGE advantage. I'd rather not have this possibility so that at least you are not tempting us to cheat.
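A rough back-of-the-envelope check of that figure, as a minimal sketch. It assumes "1Gb" means 1 GB = 10^9 bytes; the names `MODEL_STATE_BYTES`, `N_IMAGES` and `bytes_per_image` are made up for illustration and are not part of the challenge infrastructure.

```python
# Hypothetical budget check: how small must each file be
# so that N_IMAGES files fit into the allowed model state?
MODEL_STATE_BYTES = 10**9  # assumption: "1Gb" means 1 GB = 10^9 bytes
N_IMAGES = 50_000          # the figure quoted in the post


def bytes_per_image(budget_bytes: int, n_images: int) -> float:
    """Average size each file may have so that n_images fit in budget_bytes."""
    return budget_bytes / n_images


per_image = bytes_per_image(MODEL_STATE_BYTES, N_IMAGES)
print(f"{per_image / 1000:.0f} kB per image")  # prints "20 kB per image"
```

So the 50.000 figure only holds if each image is downscaled or compressed to roughly 20 kB on average.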
**2. External data.**
As I understand it, external data is allowed. There are already many publications from teams getting high accuracy on their internal datasets.
Such a team could use its own dataset to train a great model and make it even better with your data. Of course this is very good for the result of this moonshot.
But for me as a hobbyist there is no point in joining.
Kaggle forces contestants to publish their external data sources to keep an even playing field.
If the external data is private it's not allowed.
**3. Baseline available AFTER competition**
In order to win prizes you must beat a baseline. Reasonable.
However.. I read the baseline will be published AFTER the competition phase.
Again, my faith in humankind is perhaps a bit low. But what is keeping you from adjusting the baseline to the best solution?
You have the source code in docker format sent to you.
**4. Last but not least.. citation.**
I'm not a scientist, so this point is not big for me, and it has been raised before. However, I consider it VERY unfair that the winner of this challenge gets their name in the middle of the author list of the paper.
A (scientist) friend of mine told me that this basically means you don't get credit. Now I ask: who gets the credit then? Am I part of a citation pyramid scheme?
Sorry if I sound arrogant or offensive, but I think these concerns are valid.
Created by Juul de puul (juulepuul)

Thanks @juulepuul. I agree with you that having full access to the data might be more important than skills in machine learning, though I think these papers are very weakly informative on how you would approach the problem posed in this challenge.

I did not have them on hand so I did a quick search.
I'm sure it's not all 100% on the mark but you at least get a transfer learning opportunity from it.
https://breastcancer-news.com/2016/10/21/zebra-medical-vision-says-new-mammography-algorithm-improves-breast-cancer-detection
https://cs.adelaide.edu.au/~neeraj/mass_detection_dicta.pdf
http://cs231n.stanford.edu/reports2016/306_Report.pdf
My point was however that some teams might have external data and many do not have this.
A big advantage is that you can train on your own hardware and have more computation hours.
@juulepuul I have seen several papers solving different sub-problems (breast density assessment, lump classification etc.) but not exactly the same problem. Can you point me to the papers that you have in mind?
On some level, I agree with these remarks. I'm also very interested in the problem posed in this challenge, but I have a strong feeling that, given the time limit per submission and the size of the data set, this competition is going to go in the direction of optimising code to handle large quantities of data rather than scientific innovation.

Hi Juul,
> The infrastructure is VERY slow. Much slower than my local pc, amazon etc. I don't know what those 24 processors are doing but they are not doing it for me.
Another participant recently posted a similar observation. Can you please share all the information that you can provide in this [thread](https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=1207)?
Thanks!
> This is not allowed but there is not much you can do to prevent this.
I have to agree. There is such a fine line between image and model: any image can be encoded into a model. But I think it should be allowed, because the second point is also true. I am also concerned about internal data. We do not have data, and the training images are not released, so anyone with similar internal data would have a huge advantage. And I already see several teams that have their own internal data.

**I am sure a big fraction of the data comes from some consortium spanning multiple centers, where the centers would obviously have similar images, or even exactly the images used here.** **It is quite possible (if not absolutely true) that the data contributors (one level below Group Health) are exactly some of the participating teams.** Otherwise, where do you think these data come from? I think others like us still have a slim chance, as those teams might only have 1/2 or 1/3 of the data through the consortium relationship, because there is always a time delay in synchronization and distribution.

**But if there is no 1GB, then others have ZERO chance; that's the difference between having training data and not having training data.** So I think we should just allow the 1GB and see what we can do. This would be fair for people who have no access to these images.
> However I consider it VERY unfair that the winner of this challenge get his name in the middle of the list of the paper... now I ask.. Who gets the credits then ? Am I part of a citation pyramid scheme ?
As this topic has come up several times, and every time the discussion was confusing and confrontational, I will not discuss it further and instead stand by my original suggestion from June to revise the principles. I trust the organizers will eventually see the problem with this item when the topic comes up again and again, but I just don't know how many times it will take.