Hello @thomas.yu, @tschaffter, @brucehoff
I submitted several containers (the last one had ID 8446855) to the Express Lane Inference and received, for all of them, an e-mail with the "Model exceeded allotted time to run" error.
The last lines of the log file are:
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:19.0)
STDERR: I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:1a.0)
So the model did not even begin to compute the inference (I log during execution); it just waited 30 minutes and was then killed for exceeding the allotted time. Could you please fix the issue? As it stands, I cannot check the correctness of my inference method.
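To illustrate what I mean by logging during execution: below is a minimal sketch of the kind of startup logging I have in mind (not my actual code; it assumes TensorFlow 1.x, and the names and messages are only illustrative). With something like this at the top of the entry point, the submission log should show activity well before any model is built.

```python
# Hypothetical first lines of an inference entry point (TensorFlow 1.x assumed).
import datetime
import sys

from tensorflow.python.client import device_lib


def log(msg):
    # Timestamped and flushed so the message reaches the submission log immediately.
    print("{} {}".format(datetime.datetime.utcnow().isoformat(), msg))
    sys.stdout.flush()


log("inference script started")

# List the devices TensorFlow can see; on the Express Lane this should include the Tesla K80s.
for device in device_lib.list_local_devices():
    log("visible device: {} ({})".format(device.name, device.device_type))

log("building model...")
# ... model construction, weight loading, and the actual inference would follow here ...
```

In my runs, the log stops right after the two "Creating TensorFlow device" lines, which is why I conclude the run never got past device initialization.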
Thanks!
Yaroslav
Hi Yaroslav,
> either it was a bug (which, however, did not show up in the logs) or something changed on the server side
If not a bug, it was likely an outage on the cloud provider's side, as we didn't find anything that we could fix on our end.
Thanks for the update!
Thomas
Update: either it was a bug (which, however, did not show up in the logs) or something changed on the server side, but it works fine now. Thanks, and sorry for the trouble!
Yaroslav "If your submission started (and it did since you received log), it already got exclusive access to a GPU card." - probably, but it did not advance from creating two TensorFlow devices. Again, according to the logs, the model was not even built, and of course was not loaded.
I resubmitted the container (ID 8447797), and the same thing happened: 30 minutes of run time, then the STOPPED_TIME_OUT error. I don't know what else to check in my code, as it was only slightly modified from a working version, and I did not get any error at all - it is just as if nothing happened.

> Yes, my inference method previously took from 8 to 15 minutes, depending on the exact implementation and approach.
Can you resubmit that submission and let us know whether it still completes? If not, there is definitely an issue (yet to be determined) with the GPU servers.
> Could it be that there is a huge waiting queue for the Express Lane servers and submissions can't get access to a GPU during the whole 30 minutes?
If your submission started (and it did, since you received a log), it already got exclusive access to a GPU card.
> I experience the same issue with Express Lane Inference Submissions: Sub-Challenge 2 (ID 8447327).
There may be an issue with AWS (the Express Lane runs on it). We will have a look and keep you updated.
Thanks!

Hi Thomas, thanks for the fast reply.
Yes, my inference method previously took from 8 to 15 minutes, depending on the exact implementation and approach. But the problem is not that it became slower: the execution never reaches the GPU stage at all (as you can see from the logs). Could it be that there is a huge waiting queue for the Express Lane servers and submissions can't get access to a GPU during the whole 30 minutes?
Also, I experience the same issue with Express Lane Inference Submissions: Sub-Challenge 2 (ID 8447327). Again, I expect the submission to finish in no more than 10 minutes, but it did not even get access to a GPU.
Thank you
Yaroslav

Hi Yaroslav,
Has submission 8446855 ever completed within the 30-minute wall time before? If not, can you please rerun your last submission of the same type that ran successfully on the Express Lane and let us know the result?
Thanks!