I haven't been able to try anything for a while.
I have the following problems:
1. I am not sure if I have accidentally exceeded the time limit, since some zero-content jobs are somehow hanging there and listed as running forever.
2. A previously working docker is not working anymore. I am not sure if it is because tons of other empty jobs are hanging there, or because of some configuration change. I am absolutely sure nothing has changed in the docker file.
3. I wish to kill all hanging processes, but cannot access the job status. Now, only one "notifications"
4. Yesterday I tried to kill some hanging jobs, and received a notice that a job was successfully killed about 10 hours after I killed it (plus it is a job that was supposed to have finished days ago).
Can someone help figure out what has happened?
Finally, I think I need an extension of the timeline to figure out what happened. Is everyone else working fine?
Created by Yuanfang Guan yuanfang.guan
@sitmo: Sorry not to have responded earlier. If there is still an issue please post here.
update:
The "stop a running job" dashboard is working again, but the job I posted last night with id=7328151, syn7328055 is still pending. Is it possible to get some feedback on that job?
Thanks!
Hi Bruce,
I have similar issues.
After submitting docker.synapse.org/syn7327563/sitmo0 last night it seems to be stuck in pending mode. However, this morning I can't access the "stop a running job overview" page to see what's going on. It says:
> Too many users
> Sorry, but this application has exceeded its quota of concurrent users. Please try again later.
Are there infrastructure issues you're currently working on that should be communicated? Or is it more likely that my docker is flawed? I see mixed experiences from other users reading these posts.
> the issue being at the level of the agents that distribute the jobs to the machines.
I knew it! The cluster didn't kick in. Was it solved?
Can I buy more hours from IBM? I need a much lower-end machine but more hours.
Thanks a ton.
Dear Yuanfang Guan,
> is it the IBM machines or AWS?
At the present time, Amazon does not offer machines comparable to the high-end GPU servers that IBM is providing for the Challenge with 48 CPU cores (not virtual cores), 500 GB of RAM and 2x NVIDIA Tesla K80 (4x NVIDIA Kepler GK210). Note that the machines are working fine, the issue being at the level of the agents that distribute the jobs to the machines.
Dear Bruce,
I SINCERELY thank you for your feedback.
But the problem is not the display. There is something wrong with how the cluster is set up. I think the cluster didn't kick in at all.
Is it the IBM machines or AWS?
If AWS, this shouldn't have happened at all.
I strongly suggest
1. postponing this challenge for 2 months (i.e. running from Dec 1st to Mar 30th), with these problems sorted out during the two months. I say two months because I think it needs two months.
2. allowing infinite resources in time and parallel computing to participants, i.e. anything beyond 144 hours is paid for by participants, because 144 hours is enough to train nothing.
> I'm seeing the same thing, jobs are still listed as "Running" long after I get the Model Training Completed email.
There were two bugs in the status display causing jobs that were Pending or Completed to show as Running. One has been fixed and we are working on the other. Thanks for your patience.
> Additionally, jobs I clicked Stop Submission on have remained in "Stop Requested" status for more than a day now.
Again, I believe such a job is actually Pending. When it reaches an available server its status will change to cancelled. We could add another process solely to process your cancellation requests on Pending jobs. This would make the system appear more responsive. Thank you for your feedback.
Hi All,
Thanks for bringing up this issue. I have a similar issue with the job status.
1. The stopped or completed jobs are listed as still running.
2. I receive the confirmation email for a submitted job only after the actual job is running, without having any knowledge of the global queue.
3. The log file disappeared from the Synapse project after downloading.
4. Will the log file keep being updated after the 1MB quota, or will it refresh?
I'm seeing the same thing, jobs are still listed as "Running" long after I get the Model Training Completed email. Additionally, jobs I clicked Stop Submission on have remained in "Stop Requested" status for more than a day now.
STDERR: 2016-09-22T03:22:33.884363660Z ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
STDERR: 2016-09-22T03:22:33.884407941Z initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1
STDERR: 2
screen shot:
http://guanlab.ccmb.med.umich.edu/yuanfang/screenshort_for_tschaffter.png
It says 'Please enter valid alternate text for the URL' when I tried to insert an image.
Hi Yuanfang Guan,
Can you please upload a screenshot of your screen? This should help us with debugging.
Thanks for your past and future feedback!
> You should see "YOU DO NOT HAVE ANY RUNNING SUBMISSIONS TO THE SELECTED SUBCHALLENGE." since you have no running jobs.
I see 22 hanging jobs. I wish I could show you my screen. I can send you my password in a separate email. But you have root, so you should be able to log in as anyone, right?
> Can you describe the symptom?
There are so many symptoms that I am not sure which one will be helpful for a diagnosis. But here are the major ones:
1. https://www.synapse.org/#!Synapse:syn7268785
Can you read the log? This docker file used to work. Now it cannot find the GPU. It stops at loading theano.
2. Yesterday, or the day before yesterday, I cannot remember, I looked at the job list.
I was stunned to find that all jobs, including the one from the first day it was open, in which I just ls the dir, were still showing as running.
I thought it was going to take all my time, so I started to stop some. But I submitted once for every line I wrote, so there were about 20 of them. I was not patient enough to click 20 times, so I left about half there.
After 10 hours, I received a notice that one of them was killed. I am not sure what that means. All of them should be done in about 2 minutes, since there is nothing there. I am still trying to figure out what's going on.
3. Now I cannot see the queue anymore.
> I am not sure if I have accidentally exceeded the time limit
At this time we are not enforcing time limits. When time limits are in place (before scoring opens on 10/4) we will inform you of how much time you have consumed/remaining. The clock will 'reset' at the beginning of the first round, on 10/4.
> somehow hanging there and listed as running for ever.
Can you describe the symptom? We show no submissions of yours still running. The last one you submitted has completed with output here:
https://www.synapse.org/#!Synapse:syn7268785
> a previously working docker not working anymore
Can you describe the symptom and include a reference to the log file?
> wish to kill all hanging processes, but cannot access job status.
You should see "YOU DO NOT HAVE ANY RUNNING SUBMISSIONS TO THE SELECTED SUBCHALLENGE." since you have no running jobs.
> Yesterday I tried to kill some hanging jobs, and received a notice that a job was successfully killed about 10 hours after I killed it (plus it is a job that was supposed to have finished days ago).
When you click "STOP" your immediate acknowledgement is the display of "STOP REQUESTED". Once your submission makes it through the queue we mark it as canceled rather than running it and, at that time, send you a notification that it was canceled/skipped. The email notification can come at an arbitrarily long time after you click "STOP". It's not a surprise that you got the message 10 hours later.
Hope these explanations help you understand the system's behavior.
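For anyone trying to picture the STOP behavior described above, here is a minimal sketch in Python. It is an illustration only and not the actual challenge infrastructure: the function name `process_queue`, the status strings, and the one-job-per-tick timing are all hypothetical assumptions. It only shows the idea that a stop request is honored when the submission reaches the front of the queue, not at the moment of the click.

```python
from collections import deque


def process_queue(submissions, stop_requested):
    """Drain a FIFO queue of submission ids, one per tick.

    A stop request does not remove a job from the queue. It is only
    honored when the job reaches the front, which is why the
    "canceled" notification can arrive long after the click.
    """
    queue = deque(submissions)
    events = []
    tick = 0
    while queue:
        job = queue.popleft()
        tick += 1
        # Hypothetical statuses: a flagged job is skipped, not run.
        status = "CANCELED" if job in stop_requested else "COMPLETED"
        events.append((tick, job, status))
    return events


# Ten jobs are queued; the user clicks STOP on the last one right away.
jobs = [f"job{i}" for i in range(1, 11)]
events = process_queue(jobs, stop_requested={"job10"})
print(events[-1])
```

In this toy run the cancellation of `job10` is only recorded at tick 10, after the nine jobs ahead of it have been handled, even though the stop was requested before anything ran.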
haven't been able to try anything, would there be an extension of time line?