submission failed on prediction express lane

Hi @tschaffter, @thomas.yu, @brucehoff,
My two recent express lane inference submissions (IDs 8067200 and 8066724) both failed.
Can you help check the cause?
Submission 8067200 is __exactly the same job that successfully passed round 1 on the real SC1 inference queue__.
Submission 8066724 is a new one that revises parts of 8067200 to speed it up.
Both failed after about 20 minutes of running on the SC1 express lane inference queue.
I checked the returned log files, and the only error I can locate is:
STDERR: libdc1394 error: Failed to initialize libdc1394.
But I think that comes from OpenCV, can be ignored, and doesn't affect the results.
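For what it's worth, here is a minimal sanity check (my own sketch; it assumes OpenCV's Python bindings and NumPy are installed in the container) to confirm that image I/O works despite that warning:

```python
# Sanity check: the libdc1394 message comes from OpenCV probing for FireWire
# cameras at load time; it should not affect ordinary image reading/writing.
import numpy as np
import cv2

img = np.zeros((64, 64, 3), dtype=np.uint8)
cv2.imwrite("/tmp/probe.png", img)
assert cv2.imread("/tmp/probe.png") is not None
print("OpenCV image I/O works; the libdc1394 warning appears to be harmless.")
```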
So I am wondering: can you help check what else is going wrong?
Thanks.
Created by Bibo Shi darrylbobo

> LOL, you sound like my mom
:-) I have never said that to any of my four kids in the past 10 years. You have logic, which is difficult to find in most people.
Thanks a ton for your clear explanation.

> must be your side killed it, but didn't send the time-out error
This is what I looked at in Step (2), above. It must be what happened, though it's hard to see how. Nevertheless, we will add some extra "protection" to the pipeline logic.
> You are very smart, Bruce
LOL, you sound like my mom
> Can you increase it to 40 minutes, as everyone has problems finishing in 20 minutes?
Please remember, the idea is not to provide enough time to run all the code you plan to submit for scoring, or all the images in the pilot set, but rather to give you a place to make sure things are 'wired up' correctly with a few minutes of run time. Our aim is to keep the run time short so that the queue is very responsive.
Step (1):
There are six Docker containers for submissions on the express lane inference server that terminated with a 137 error code. Here are their submission IDs, start times, 'last updated' times, and the duration in minutes:
Submission ID|Scoring Started|Scoring Last Updated|Duration (min)
---|---|---|---
8067909|01/24/2017 12:45:55PM|01/24/2017 01:14:03PM|28.1
8067200|01/24/2017 11:43:55AM|01/24/2017 12:09:01PM|25.1
8066724|01/24/2017 11:14:16AM|01/24/2017 11:39:14AM|25
8065122|01/24/2017 02:57:34AM|01/24/2017 03:23:49AM|26.3
8064755|01/24/2017 01:21:59AM|01/24/2017 01:49:10AM|27.2
8064669|01/24/2017 12:36:55AM|01/24/2017 01:03:42AM|26.8
Here's a handful of submissions to the express lane inference queue that finished with an exit code of 0 (no error):
Submission ID|Scoring Started|Scoring Last Updated|Duration (min)
---|---|---|---
8064579|01/23/2017 11:41:29PM|01/23/2017 11:45:26PM|4
8064526|01/23/2017 11:15:54PM|01/23/2017 11:19:49PM|3.9
8064475|01/23/2017 10:56:49PM|01/23/2017 10:58:39PM|1.8
8064420|01/23/2017 10:29:30PM|01/23/2017 10:31:28PM|2
8064337|01/23/2017 10:00:41PM|01/23/2017 10:25:51PM|25.2
8064017|01/23/2017 07:14:24PM|01/23/2017 07:16:19PM|1.9
There's clearly a relation between duration and error code, supporting my hypothesis.
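To make that concrete, here is a throwaway check (my own sketch, not the scoring code) that reproduces the duration column from the timestamps above and flags runs that crossed the 20-minute limit:

```python
# Parse the 'Scoring Started' / 'Scoring Last Updated' timestamps from the
# tables above and flag runs that exceeded the 20-minute express-lane limit.
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S%p"

rows = [
    ("8067200", "01/24/2017 11:43:55AM", "01/24/2017 12:09:01PM", 137),
    ("8064579", "01/23/2017 11:41:29PM", "01/23/2017 11:45:26PM", 0),
]

for sub_id, started, updated, code in rows:
    minutes = (datetime.strptime(updated, FMT)
               - datetime.strptime(started, FMT)).total_seconds() / 60
    print(f"{sub_id}: {minutes:.1f} min, exit {code}, over 20-min limit: {minutes >= 20}")
```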
Step (2):
I inspected the logic of the execution pipeline, and it seems very unlikely that we would stop a container yet report that an error occurred rather than that we had stopped it. Nevertheless, we will add some additional logic to test for this case, 'as insurance.'
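To sketch what that insurance could look like (hypothetical names; the real pipeline code is not shown in this thread): treat a 137 exit that coincides with the time limit as a time-out rather than a scoring error.

```python
# Hypothetical classification step run after a container exits.
TIME_LIMIT_SECONDS = 20 * 60

def classify_exit(exit_code: int, elapsed_seconds: float) -> str:
    if exit_code == 0:
        return "SCORED"
    # 137 = 128 + 9 (SIGKILL), which is what 'docker stop' escalates to after
    # its grace period; at or past the limit, assume it was our own stop.
    if exit_code == 137 and elapsed_seconds >= TIME_LIMIT_SECONDS:
        return "STOPPED_TIME_OUT"
    return "ERROR_ENCOUNTERED_WHILE_SCORING"
```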
That is very likely, and what I thought too: https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=1517: 'update: as I think about it now, it should be a time-out, but it failed to report the time-out. Did you change your code around that? "says exit code is 137, which implies a kill -9 scenario" http://serverfault.com/questions/252758/process-that-sent-sigkill --- must be your side killed it, but didn't send the time-out error'
You are very smart, Bruce. And I find the communication cost with smart people is much lower.
Can you increase it to 40 minutes, as everyone has problems finishing in 20 minutes?
Yuanfang Guan

The Docker container for your submission terminated with a 137 error code. Two other participants (@yuanfang.guan and @wbaddar) also encountered this issue, so it may be a problem with the server rather than with your submission.
My hypothesis is that the reason for the 137 error code is that you hit the 20-minute limit and the system ran 'docker stop'. The reasons I think this is the case are:
(1) In round 1 we did not enforce the 20-minute time limit on express lane inference submissions. You are one of several participants who said 'this submission worked before but now does not,' and this is the one known change between round 1 and round 2;
(2) I found a discussion online which says that the 137 error code can occur when 'docker stop' is run (see the short demonstration after this list): https://github.com/docker/docker/issues/21083
(3) You yourself say, "They both failed after 20 mins of running on the express lane inference SC1," which is the point at which the system runs 'docker stop' to enforce the time limit.
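Here is the demonstration mentioned in (2), outside of Docker; this is my own illustration of the 128 + signal convention, not the challenge infrastructure:

```python
# Kill a child process with SIGKILL (what 'docker stop' falls back to after
# its grace period) and observe the 137 = 128 + 9 exit-code convention.
import signal
import subprocess

proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()
# Python reports the signal as a negative return code; shells and Docker
# report 128 + signal number, i.e. 128 + 9 = 137.
print(proc.returncode, 128 - proc.returncode)  # -> -9 137
```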
There are two steps to test my hypothesis:
(1) Check the run time of all recent submissions that terminated with a 137 error and see if they ran for at least 20 minutes.
(2) Check the system logic and see if it makes sense that it can return 'ERROR_ENCOUNTERED_WHILE_SCORING' rather than 'STOPPED_TIME_OUT' when in fact it should be the latter.
I will report back.