My submission (ID: 8031153) got stuck for about 10 hours without being executed. I tested it on the express lane and it worked fine, so I'm sure it has something to do with the server. The log file didn't even show the basic metadata, which means it never executed the training code. I'm about to cancel the submission and resubmit it.
I'm just wondering whether the server hours are still counted towards the quota or not. Thanks!
Created by Li Shen thefaculty
> Shall we flush the stdout each time after a print statement to get the log file updated promptly?
Certainly. It's not a bad idea to ensure that the output of your code is sent to the system (flushed) so that it's included when the logs are captured and uploaded. Also, you can get a sense of the behavior of the challenge processing pipeline by running your container locally (in detached mode) and, while it's running, using 'docker logs' (https://docs.docker.com/engine/reference/commandline/logs/) to print the output of your submission.
@brucehoff
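For anyone wondering what flushing after each print looks like in practice, here is a minimal Python sketch. The `log` and `train` names are hypothetical and not part of the challenge pipeline; the only real mechanism used is the `flush=True` argument of the built-in `print`.

```python
import sys
import time


def log(message, stream=None):
    """Print a message and flush right away so it is not held in a buffer."""
    # Resolve the stream at call time so redirected output is respected.
    target = sys.stdout if stream is None else stream
    print(message, file=target, flush=True)


def train(num_steps=3, stream=None):
    # Hypothetical stand-in for a real training loop.
    for step in range(1, num_steps + 1):
        time.sleep(0.01)  # pretend to do some work
        log(f"step {step}/{num_steps} done", stream=stream)


if __name__ == "__main__":
    log("starting training")
    train()
```

If changing every print call is inconvenient, running the interpreter as `python -u` or setting the `PYTHONUNBUFFERED` environment variable also disables output buffering.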
===
Just a technical note: Shall we flush the stdout each time after a print statement to get the log file updated promptly? Thx!
Log files are currently updated every 30 minutes, as well as when the submission terminates. https://www.synapse.org/#!Synapse:syn8031830
Dear Bruce, @brucehoff
Thanks for addressing the problem for our team. However, when I checked, the log file still remained unchanged after ~30 minutes, while the submission's time quota entry shows as updating. Is the log still updated automatically as before? Thanks for your time and patience. (submission id: 8031827)
@wangsijia1990: The problem has been addressed and your logs are now uploaded.
Thanks @brucehoff for your answer
> 7988535 finished successfully a few hours ago. According to our records it WAS submitted by you. If this is not the case then someone else may be accessing your account. Please change your password immediately. Regarding the last modified date, please see my earlier comment. We will correct this display issue, but there is no effect on the pipeline operation or quota computation.
I don't think this is accurate. If you look at the log file for that run at https://www.synapse.org/#!Synapse:syn7989180 you'll see that it was last modified on January 5th, and the log itself contains a timestamp towards the top which says it was started on `Wed Jan 4 16:40:36 UTC 2017` and ended on `Thu Jan 5 05:07:20 UTC 2017`.
Not that it matters much though; sure enough looking at the new submission dashboard (thanks for that!) suggests that this job didn't use more time from our quota than it should have.
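As a quick check of the run time implied by those two log timestamps, here is a small Python sketch. The format string is an assumption based on how the timestamps appear in the log; the `elapsed` helper is a hypothetical name.

```python
from datetime import datetime

# Assumed layout of the timestamps printed near the top of the log file.
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def elapsed(start_text, end_text):
    """Return the time between two log timestamps as a timedelta."""
    start = datetime.strptime(start_text, LOG_TIME_FORMAT)
    end = datetime.strptime(end_text, LOG_TIME_FORMAT)
    return end - start


run_time = elapsed("Wed Jan 4 16:40:36 UTC 2017", "Thu Jan 5 05:07:20 UTC 2017")
print(run_time)  # 12:26:44
```

So by the log's own timestamps the job ran for roughly twelve and a half hours.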
> The issue is not that every server is busy but rather that your submission is requesting a preprocessing step which was previously run (over the course of 11 1/2 hours) and cached on a server which is currently processing other submissions. When that server becomes free then your submission will run.
This is very interesting. I had no idea that preprocessing caches were bound to specific servers; I had assumed that the caches were on a shared file system of sorts.
> An alternative is to restart your job from the preprocessing stage on a different server. This is certainly possible but runs the risk of the original server becoming free long before the preprocessing finishes on another one.
Right. Although it sounds like another possibility is that when using preprocessing we don't ever get to use our full quota in this phase because we'll be waiting much more often compared to others; we may be better off not using a separate preprocessing step. :-(
@thomas.yu @brucehoff
I am from the same team as duhao. Could you help check the status of submission 8031827? The log stopped updating two days ago, and we are not sure whether that is because the job has completed, since we haven't received any notification. The last update of the current log looks like the following:
STDERR: I0117 16:10:48.285243 15 solver.cpp:252] Train net output #2: loss3/loss3 = 0.000449469 (* 1 = 0.000449469 loss)
STDERR: I0117 16:10:48.285303 15 sgd_solver.cpp:106] Iteration 3870, lr = 0.005
It has not been updated since.
Thanks,
Sijia
> I'm seeing a job with id 7988535 that's a copy of id 7905295, but that we never submitted. The job is listed as "Completed" while its last modified time has kept updating (though that seems to have stopped now).
7988535 finished successfully a few hours ago. According to our records it WAS submitted by you, @dnouri. If this is not the case then someone else may be accessing your account. Please change your password immediately. Regarding the last modified date, please see my earlier comment. We will correct this display issue, but there is no effect on the pipeline operation or quota computation.
> Meanwhile, we have a job in the queue with id 8038490 that is waiting in the "Validated" state since last night. I assume that this is because the queue is just full, though it's a little unsettling that it shall be full so early in this second phase.
Yes, 8038490 is waiting for a server. The issue is not that every server is busy but rather that your submission is requesting a preprocessing step which was previously run (over the course of 11 1/2 hours) and cached on a server which is currently processing other submissions. When that server becomes free then your submission will run. An alternative is to restart your job from the preprocessing stage on a different server. This is certainly possible but runs the risk of the original server becoming free long before the preprocessing finishes on another one.
> My job's status is actually "in process" but it hasn't done anything.
I see two recent submissions from you: 8031153, which you canceled, and 8032903 which finished successfully after running for about a day. Our system uploaded logs on schedule and the final log file seems to indicate success. So I can't see any problem here. If you feel something's wrong please elaborate.
Wow, there are a lot of questions in this thread. I will see if I can answer:
> My submission (ID: 8031153) got stuck for about 10 hours without being executed.
As @thomas.yu said, if your submission is RECEIVED or VALIDATED but not yet EVALUATION_IN_PROGRESS then there is nothing wrong; the submission is simply waiting for a server to run on. Checking the state of 8031153, it looks like you have elected to cancel the submission.
> my second round completed submissions (finished last week) have shown an updated status-- all on "01/16/2017 08:35:57PM".
> We are coming across the same problem as Li. The submission on time quota shows updating status but the logs are not being modified.
This is an unintended side effect of the newly added display of training time remaining which "updates" submissions (just their metadata) after they are done running. It does NOT affect any time quota computation. We will update the dashboards so that the last updated time stamp shows the end of the training rather than when the submission was last "touched."
Will address the other issues mentioned in this thread ...
I can confirm that I'm seeing similar issues with the job queue.
I'm seeing a job with id 7988535 that's a copy of id 7905295, but that we never submitted. The job is listed as "Completed" while its last modified time has kept updating (though that seems to have stopped now).
Meanwhile, we have a job in the queue with id 8038490 that is waiting in the "Validated" state since last night. I assume that this is because the queue is just full, though it's a little unsettling that it shall be full so early in this second phase.
@thomas.yu
===
My job's status is actually "in process" but it hasn't done anything. This is not normal. The same job finished on the express lane in 4 mins. It should finish in around 400 mins.
I canceled my job and resubmitted it yesterday, but I'm still having the same problem. The log file doesn't show anything meaningful except that TensorFlow has found CUDA and two GPUs. It has been 17 hours...
Dear Organizers, @thomas.yu @tschaffter
We are coming across the same problem as Li. The submission on time quota shows updating status but the logs are not being modified.
Could you help with this? (id: 8031827)
Dear Bibo,
I will confirm with the other challenge organizers if this is an issue.
Best,
Thomas
Dear Li,
If the status is not EVALUATION_IN_PROGRESS then it will not be counted towards your quota. Sometimes the servers are just busy and it takes time for them to begin scoring your submission. Please kindly let me know if this is still an issue.
Best,
Thomas
Seems the server is very busy now. Can you verify this? @thomas.yu
Also, I found a strange thing that all of my second round completed submissions (finished last week) have shown an updated status-- all on "01/16/2017 08:35:57PM".
But, it seems there is no influence on the log and model state files. Can you also help verify this? @thomas.yu
Submission got stuck without any log file update