Hi @thomas.yu, @tschaffter, @brucehoff,
My submission (ID 8496564) for SC2 inference recently failed due to running out of GPU memory after several hours.
I want to ask:
1) Will this be counted toward my three-submission quota? It failed in the middle of the run.
2) I have a similar SC1 inference submission that uses the exact same trained model (but without access to the meta features). It has had no problems and has been running for over five days.
So I am wondering how this GPU memory issue happened. During inference, does each submission have independent access to two GPUs?
BTW, the submission ran without problems on the express lane.
Thanks
Created by Bibo Shi (darrylbobo)

8458995 and 8502236 follow the same pattern as shown above: each uses just one GPU, and that GPU's memory is close to the maximum.
Clearly, having these metrics for every submission would be insightful for participants. I regret that I won't have time to check each submission manually and report on the forum. Time permitting, we could implement an automatic mechanism to generate such reports.
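For participants who want this kind of visibility on their own machines, here is a minimal monitoring sketch. It is a hypothetical stand-in, not the organizers' actual reporting mechanism: it just polls nvidia-smi on an interval and appends per-GPU memory readings to a CSV log (the log file name and 30-second interval are arbitrary choices).

```python
import subprocess
import time

LOG = "gpu_memory_log.csv"  # hypothetical output file

# Poll nvidia-smi every 30 seconds and append per-GPU memory usage.
with open(LOG, "a") as log:
    while True:
        out = subprocess.check_output([
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ]).decode()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for line in out.strip().splitlines():
            log.write(f"{stamp},{line}\n")  # e.g. "...,0, 11830, 12189"
        log.flush()
        time.sleep(30)
```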
Hi @thomas.yu, @brucehoff,
Thanks very much for your answers. The picture is very helpful.
I set it up to use two GPUs, and I will double-check the code to find out why it only uses one (see the sketch after this post).
However, my similar SC1 inference submission did pass that case and was able to keep running.
Could you also plot the memory usage for submissions 8458995 and 8502236, if doing so will not affect their running?
Thanks very much. I will take your suggestion during the final validation phase.
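Without knowing the submission's actual framework, the sketch below assumes a PyTorch pipeline and shows the two things worth checking when only one GPU is used: whether the process actually sees both cards, and whether the model is wrapped so work is split across them. The `nn.Linear` model is a placeholder for the real network.

```python
import torch
import torch.nn as nn

# How many GPUs does the process actually see? If this prints 1,
# something (e.g. CUDA_VISIBLE_DEVICES) is hiding the second card.
print("visible GPUs:", torch.cuda.device_count())

model = nn.Linear(512, 2)  # stand-in for the real network
if torch.cuda.device_count() > 1:
    # Without an explicit wrapper like this, all work stays on cuda:0,
    # which matches the single-GPU pattern in the memory plot.
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()
```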
@darrylbobo Out of curiosity I looked at the record of video memory usage for the submission you mentioned, 8496564.
[Embedded image preview: syn8506532 (GPU memory usage plot)]
Your submission was given exclusive access to two GPUs (shown in blue and green). It looks like it used just one of them (the blue one) and hit the 12 GB maximum (shown in yellow).
I definitely understand the concern about having a software failure in your submission after the close of the validation round (when corrections won't be allowed). An obvious suggestion is to prepare your trainable model as soon as possible and submit it right at the beginning of the validation round. Then, once it has finished training, submit your inference model right away so it can run to completion before the end of the four weeks. That gives you a chance to correct and resubmit should a bug arise in your software.
I hope this helps.

Dear Bibo,
Thanks for your participation. Invalid submissions will not be counted towards your quota. I defer to other challenge organizers for other questions.
Best,
Tom

Sorry, I don't know.
In addition to allowing more submissions in the Final Round, I also want to suggest one physical machine per submission job. My observation is that this could increase overall system throughput by 2 to 4 times without additional resources. We had to spend considerable time speeding up our inference so we would have a large enough performance margin to secure a score; we wish that time could have been spent improving our model's accuracy.
@vacuum
Thanks for the suggestion. I probably need to write some error-protection code (something like the sketch below). BTW, do you happen to know whether a failure in the middle of a run will be counted toward the three-submission quota?
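For what it's worth, one common form of error protection for GPU inference is to catch CUDA out-of-memory errors and retry with smaller chunks instead of letting the whole job die. A minimal sketch, assuming PyTorch (the actual framework used here is unknown):

```python
import torch

def infer_with_oom_fallback(model, batch, min_size=1):
    """Run inference, halving the chunk size on CUDA out-of-memory errors."""
    size = batch.shape[0]
    while True:
        try:
            with torch.no_grad():
                # Run the batch in chunks and gather results on the CPU.
                return torch.cat([model(chunk.cuda()).cpu()
                                  for chunk in batch.split(size)])
        except RuntimeError as e:
            if "out of memory" not in str(e) or size <= min_size:
                raise  # not an OOM, or nothing left to shrink
            torch.cuda.empty_cache()  # release the failed allocation
            size //= 2  # retry with smaller chunks
```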
This is one of the reasons I suggest we should be allowed to submit more than once per sub-challenge in the Final Round.
The inference process is complex and can last for weeks. Over that many days, even a single hardware glitch can be fatal unless one writes error-protection code everywhere, and GPUs (especially the libraries) are known to be less robust than we would wish. We have been bitten by this before.
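One hedge against such glitches, whatever the framework, is to checkpoint per-case results so a crashed job can resume where it left off instead of restarting a weeks-long run. A minimal sketch (the file name and the predict function are hypothetical placeholders):

```python
import os
import pickle

RESULTS = "inference_results.pkl"  # hypothetical checkpoint file

def load_done():
    # Resume from whatever was saved before the crash.
    if os.path.exists(RESULTS):
        with open(RESULTS, "rb") as f:
            return pickle.load(f)
    return {}

def run(cases, predict):
    """cases: iterable of (case_id, data); predict: scoring function."""
    done = load_done()
    for case_id, data in cases:
        if case_id in done:
            continue  # already scored before the last failure
        done[case_id] = predict(data)
        with open(RESULTS, "wb") as f:  # persist after every case
            pickle.dump(done, f)
    return done
```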