I submitted two jobs on January 15th, with Synapse IDs syn8026571 and syn8026487 (Submission IDs 8026572 and 8026488), to the regular training lane. They stayed in the VALIDATED state for 4 days and then started to run, but after a short while they were terminated with the message "submission exceeded allotted time". This came from the preprocessing script. Is there any time quota for the preprocessing phase? I am surprised that they were terminated early. They were supposed to do preprocessing and model training, but it looks like the preprocessing part was not able to finish.
I submitted syn8026487 (which includes the preprocessing and training image) to the regular training phase again, and this time it waited in the evaluation phase for about 15-20 minutes and then was terminated with the same message. I thought there was no short time quota in the model training lane. As a third attempt I submitted another image that does preprocessing only to the training lane, and it was terminated 5 minutes after I received the "log files available" message. The same script was able to run in the express lane for training without any problems. Please check this and let me know what is going on. Thanks.
Created by Zafer Aydin zaferaydin
vacuum, thanks for your feedback. We are reluctant to increase the express lane time limit (currently 30 minutes, or 1800 sec) because it reduces the responsiveness of the queue.
> we have to write special code to handle that
As described in a recent newsletter: "To check the time limit of your running submission, please use the WALLTIME_MINUTES environment variable." This value is set to 30 in the express lane and 20160 in the leaderboard queue (the latter being an upper limit, as you may have less time remaining in your quota). So you can add logic to your code that reduces the number of images processed based on WALLTIME_MINUTES, and you needn't remove that logic when submitting to the leaderboard.
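A minimal sketch of that kind of logic, assuming only that WALLTIME_MINUTES is readable from the environment as described above. The function names, the per-image cost, and the safety fraction are hypothetical and would need to be measured or chosen for your own pipeline.

```python
import os

EXPRESS_LANE_MINUTES = 30  # express-lane value of WALLTIME_MINUTES quoted above


def walltime_minutes(default=EXPRESS_LANE_MINUTES):
    """Return WALLTIME_MINUTES as a float, or `default` if unset or malformed."""
    raw = os.environ.get("WALLTIME_MINUTES")
    try:
        return float(raw)
    except (TypeError, ValueError):
        return float(default)


def images_to_process(total_images, seconds_per_image, safety_fraction=0.8):
    """Estimate how many images fit into the available wall time.

    seconds_per_image and safety_fraction are hypothetical tuning values:
    the first is your own measured cost per image, the second leaves
    headroom for start-up and for writing results.
    """
    budget_seconds = walltime_minutes() * 60 * safety_fraction
    affordable = int(budget_seconds // seconds_per_image)
    return max(0, min(total_images, affordable))
```

With WALLTIME_MINUTES set to 30 and an assumed cost of 7 seconds per image, `images_to_process(2000, 7)` caps the run at a subset of the images. With the leaderboard value of 20160 the same call returns all 2000, so the same code can be submitted to both lanes unchanged.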
Another approach is to run the code on your own machine before submitting to the leaderboard. This would allow you to control the time limit as you see fit. Multiple cloud providers now offer instances with the Tesla K80 GPU, which allows you to mimic the challenge environment closely.
One problem with the inference express lane is that 1200 seconds is too short to finish processing all images (about 2000 in sc2), so we have to write special code to handle that, and it is therefore not a good sanity test. Also, because many images are not processed, the prediction result from the express lane is way off (below 0.5), which makes it unsuitable as a sanity test too. I think we should either reduce the number of images or increase the time allowed to run in the inference express lane. Thanks!
Dear Zafer,
The objective of the express lanes is to make sure that the format of the prediction file is correct / the training step works properly. I will bring this up with the other challenge organizers, but I don't think we will be increasing the time limit of the express lanes.
Best,
Thomas
I thought we had a 14 day limit for each training image, not overall. It was not clear to me from the description page. Also, a time quota of 14 days is quite restrictive. It would be more useful to train multiple models and compare or combine them. If training a model takes 3 days, then I can train only 4 models per round, which is restrictive. I lost most of my server time trying to run the Tensorflow code provided in the Docker folder, which had multiple bugs. This is not fair. Furthermore, it is not the most effective way of using resources. The distribution of server usage is not uniform across the five weeks. Here are my suggestions or requests:
1. To promote people submitting early, maybe you can apply softer penalties in the first 2-3 weeks and apply the 14 day rule in the last 2-3 weeks.
2. It could be useful to have shorter partition queues within the regular training lane for those jobs that can finish model training within 3 days and have a separate time quota for those.
3. Can you also increase the time limit of the express lane?
I would appreciate it if I could get more time, because I implemented new models which run faster now. Thanks.
> Is there any time quota for preprocessing phase?
Yes: The 14 day limit applies to all phases of model training (preprocessing and training). We will clarify this in the challenge instructions.
> I thought there is no short time quota in the model training lane
Again, the 14 day limit applies to all phases of model training. (Note this quota does not apply to the Express Lane, only to the Leaderboard queue.)
According to our records your first four submissions of Round 2 used the following amounts of server time:
Submission ID|Time used (D:H:M)
---|---
8010083|7:10:22
8010085|4:16:41
8010086|7:0:4
8010094|5:23:25
The total time exceeds the current calendar duration of Round 2 because you submitted multiple jobs in parallel.
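As a quick check of the arithmetic, here is a small sketch that sums the four entries. It assumes each table value is days:hours:minutes (so the first entry is 7 days, 10 hours, 22 minutes) and compares the total with the 14 day limit mentioned above. The `parse_dhm` helper is hypothetical.

```python
from datetime import timedelta

# Values from the table, read as days:hours:minutes (an assumed interpretation).
USED = {
    "8010083": "7:10:22",
    "8010085": "4:16:41",
    "8010086": "7:0:4",
    "8010094": "5:23:25",
}


def parse_dhm(text):
    """Convert a 'days:hours:minutes' string into a timedelta."""
    days, hours, minutes = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes)


total = sum((parse_dhm(value) for value in USED.values()), timedelta())
quota = timedelta(days=14)

print(total)          # 25 days, 2:32:00
print(total > quota)  # True
```

Under that reading the four submissions add up to roughly 25 days of server time, which is only possible within a shorter calendar window because the jobs ran in parallel.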