Hi,
I've seen a couple of people/threads mentioning that they don't get "timely" feedback even from small jobs, since the cluster is completely occupied.
It is kind of frustrating to submit a job that is pending for half a day, only to find out that it crashed due to a bug...
What about providing a "short queue", where jobs are automatically terminated after, say, 5 minutes? This could be used by those of us who still struggle with the infrastructure (i.e. just trying to get the stuff to run without errors, not actually fitting anything seriously) and would give almost instant feedback, since no long-running jobs would block the whole thing.
It's probably a bit late, given that the first phase starts next week, but still: I would find that very useful...
Created by Michael Strasser (redst4r)

Jobs that are sent to AWS should start INSTANTLY. Whether they wait or not, AWS charges the same amount of money.
>IBM and Amazon should donate more servers for free
That is a pretty high expectation, given how much we need...
I think, to build a good solution to this problem, one needs at least 2\*7\*24\*90 compute hours (obviously not working hours, but the pre-processing/compute simply takes that long). That's the minimum, I think.
It seems the only way is for each participant to pay for much lower-end machines.
I am more worried about the scoring phase.
It is SO UNREALISTIC to run the models, especially a large net, on half a million images individually for each patient (instead of loading the model once and predicting the images one by one). Suppose one has only a single trained model (which is almost impossible) and it takes 3 seconds to initialize TensorFlow, load the model, and make the prediction; that is going to take 17 days to score half a million images. Yes, it's not realistic to expect people to pay, and not fair either: I might be able to rent 100x more computing power than a poor student.
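To make the arithmetic concrete, here is a minimal sketch of the "load once, predict in batches" approach, assuming a Keras/TensorFlow model; the paths, image size, and batch size are hypothetical placeholders, not the challenge's actual layout:

```python
# Minimal sketch, assuming a Keras/TensorFlow setup.
# Paths, image size, and batch size are hypothetical placeholders.
import glob
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

model = load_model("model.h5")  # framework + model initialized once, not once per image

image_paths = sorted(glob.glob("/scoringData/*.png"))
batch_size = 64

for start in range(0, len(image_paths), batch_size):
    paths = image_paths[start:start + batch_size]
    # read and normalize one batch of images
    batch = np.stack([img_to_array(load_img(p, target_size=(224, 224))) / 255.0
                      for p in paths])
    scores = model.predict(batch)  # only a forward pass per batch; no start-up cost
    for p, s in zip(paths, scores):
        print(p, float(s[0]))
```

With the ~3-second start-up paid only once, the per-image cost is just the forward pass, which is what would make scoring half a million images feasible.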
I also understand the limited access to the 640,000 images, and this Docker approach is a good solution.
IBM and Amazon should donate more servers for free, and the organizers should acknowledge and address these capacity and queue problems.
If I were the organizer, I would say: "We have infrastructure problems with respect to capacity, stability, and fair quota enforcement. We promise we will solve them and reset the deadlines once they are solved. The goal is that everyone will have a quota of x hours per challenge phase and be able to run jobs on demand."
ALL of us are happy to pay, I believe. The PROBLEM is that if we pay, we would obviously get the right to transfer the data, since that is what we would be paying Amazon for.
So it can only be paid for by the organizers and then reimbursed by the participants. For example, if I request 2 full-time dedicated nodes to run experiments, the organizers should buy them for me first, and then I pay them back. One machine is far from enough to do any experiments, but I also don't want the competition to run forever, because I am not rich. 2 nodes \* \$7/hr \* 24 h \* 90 days is about \$30,000, which is already a significant budget, probably the maximum I can afford for this project. I think ideally we could pay for much lower-end machines.
If it could run locally, it would be way cheaper: only ~$7,000 for two nodes that we could use for 3-4 years. But I assume the data still cannot be released?

I agree with @sitmo. I'd be totally happy to pay for the test runs out of pocket. Not being able to see the data means that one needs many more test runs to develop a good model.
Or why not just 20 nodes? There are a lot of funds available for this competition, and marketing opportunities for Amazon and IBM.
I'm also happy to pay for a p2.8xlarge instance at Amazon (eight NVIDIA K80 GPUs and half a terabyte of RAM) for just $7/hr, if that's possible. That would allow me to build much better models, which is the whole point of this competition.
As it is, I'm thinking about quitting this competition because the infrastructure is unworkable.
This is a shame, because I have already invested quite some time and I'm passionate about the benefits that will come out of this competition. It's a very unproductive research environment, and my future ranking will mostly depend on infrastructure issues instead of model quality, which is something I'm not willing to risk and invest time in.

This is a good idea and I think we should keep it once the Leaderboard Phase starts. I'll discuss with our IT team to see how we can provide this service.
Thanks! Is there any update on the short queue? Perhaps limit each job to 1 hour and provide about 5 nodes; that would allow at least 5*24 = 120 test runs per day across all participants.
That would be sufficient to allow all participants to debug quickly.
Right now I have to wait 4 days, on average, to fix a single bug.

I am also interested in the idea of a short queue, to work around the infrastructure issues. I have been pushing my job for the last 4 to 5 days, and twice the job status queue refreshed (or there is some bug in the code).
I absolutely agree. It is not half a day; I have to wait 3-4 days just to learn that somehow the memory is totally exhausted, potentially because of another problem in the cluster.
I think the current logic has a potential problem: if there is one participant who decided to run pre-processing for 30 days, then none of us need bother submitting anything. And I am almost sure that the current queue is dead because of **either 1) some schedule/close error, or 2) some participants starting to copy all the images to the pre-processing dir, which probably takes 10 days!! And most likely both.**
For example, if there is ONE participant who, for some very difficult-to-understand reason, decided to augment the images 20x in pre-processing rather than on the fly, then **not only would this participant be screwed, every one of us would be screwed along with him.**
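For contrast, here is a minimal sketch of on-the-fly augmentation, assuming a Keras-style data generator (the directory name and parameters are hypothetical): the augmented variants exist only in memory, one batch at a time, so nothing extra is ever written to the shared pre-processing directory.

```python
# Minimal sketch, assuming Keras; directory and hyperparameters are hypothetical.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations/flips/zooms are applied in memory as each batch is drawn,
# so the 20x augmented copies never touch the disk.
datagen = ImageDataGenerator(rotation_range=15,
                             horizontal_flip=True,
                             zoom_range=0.1)

train_gen = datagen.flow_from_directory("/trainingData",
                                        target_size=(224, 224),
                                        batch_size=32,
                                        class_mode="binary")

# model.fit(train_gen, epochs=10)  # train directly on the generator
```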
I think everyone should have a **dedicated, locked** node just for that person/team. I am absolutely happy to pay for this node in full. **If there is no compute resource, how can we do anything?**
**By having an independent node, at least one would not be screwed by somebody else's mistakes.**
**Currently, there are practically no resources available to me.**
I don't think I am the stupidest one still struggling with the infrastructure; on the contrary, it is likely that a couple of people accidentally blocked the whole queue with a really inefficient program that is going to run forever.