Hi,
My express lane submission ID is syn7844638. My train.sh script has two parts: an Rscript that creates a list of files for training, and a python script that trains the model. The submission should progress within seconds from the Rscript to the python script and print a message about using TensorFlow; however, it gets through the Rscript fine but never prints any output from the python script. I can see that the log file is updated periodically (its timestamp keeps increasing), yet nothing is ever added to the file itself. Eventually the submission is cancelled for exceeding the allotted time, without ever getting to the python script. I'm not sure what could be wrong, since it runs fine locally and until recently was running fine on both the express and challenge training lanes. Is this a problem on my end or on the server's end (since there appear to be some problems right now, based on other threads)?
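For context, the python stage is structured roughly as sketched below (a simplified sketch; the file names, the file-list format, and the exact messages are placeholders rather than my real ones):
```python
# Simplified sketch of the python training stage (file names, the file-list
# format, and the exact messages are placeholders, not my real ones).
import sys

import tensorflow as tf

# A message like this is the first thing I expect to see in the log once the
# python stage starts.
print("Using TensorFlow %s" % tf.__version__)
sys.stdout.flush()

def read_training_list(path):
    """Read the list of training files written by the preceding Rscript."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

if __name__ == "__main__":
    files = read_training_list(sys.argv[1])
    print("found %d training files; starting training" % len(files))
    sys.stdout.flush()
    # ... model definition and training follow ...
```
Since that first message never appears in the log, it looks to me like the python script never even starts.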
Thanks for the help!
> At the time that I made this post and replied 4 days ago, the container had not moved on after two days. However, after leaving it there, it finally started running the next stage sometime between today and 4 days ago. It appears that now it might be stuck at epoch 47.
I see what you are saying. If you visit the page for your log file (https://www.synapse.org/#!Synapse:syn7861502) and click on "File History", you can track your model's progress through the growing file size. Your log file has not grown (your model has printed nothing out) in **three days**.
> I'm assuming this behavior is a result of limited computational resources towards the end of this leaderboard phase.
I don't think so: once your job is started, it continues uninterrupted, regardless of other pending or running jobs. While running, it has exclusive access to the resources described elsewhere (22 CPUs, 2 GPUs, 200GB RAM, etc.). The rate of progress should only be a function of the software you submitted to us to run.
It's up to you what to do next. The cutoff for training your model is the end of this week. You could let it keep running and hope for the best. If you prefer, you could cancel the current job and start a different model that has more verbose logs, in an effort to see what's going on (a minimal sketch of what I mean by that is below); with the recent spike in submissions, though, there's no guarantee that a new job would run to completion by the end of the round. Or you could let it continue to run and simultaneously do some sleuthing on your own machine to get a feel for the behavior of your code. Good luck!
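By "more verbose logs" I mean nothing fancy: frequent progress messages that are flushed to stdout immediately, so the log file on Synapse grows even in between epochs. This is a generic sketch, not based on your actual code; the helper name and message text are made up.
```python
# Generic sketch: timestamped progress messages that are flushed immediately,
# so they show up in a redirected log file as soon as they are printed.
# (Running the script with "python -u" has a similar effect.)
import sys
import time

def log(message):
    """Print a timestamped message and flush stdout right away."""
    print("[%s] %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), message))
    sys.stdout.flush()

log("file list loaded; starting python training stage")
log("building model")
```
If the job stalls again, the last line in the log will at least tell you which stage it stalled in.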
Hi Bruce,
Thanks for the reply. At the time that I made this post and replied 4 days ago, the container had not moved on after two days. However, after leaving it there, it finally started running the next stage sometime between today and 4 days ago. It appears that now it might be stuck at epoch 47. I'm assuming this behavior is a result of limited computational resources towards the end of this leaderboard phase.
I guess there is no outstanding problem, other than the fact that my container used to run within 2 days and now, due to stresses on the server, takes much longer because of random pauses, which might not be anything that you can fix.
> I submitted the container to the regular training queue and it has been up for over a day without "moving on" to the next step; ID for that submission is syn7861501.
The submission ID for the file you reference is 7861493. It has been running for more than five days. You can download your log file here: https://www.synapse.org/#!Synapse:syn7861501
The tail end of your log is:
```
Epoch 45/100
Epoch 00044: val_loss did not improve
3392s - loss: 2.6439 - acc: 0.0898 - fbeta_score: 0.0061 - val_loss: 1.2856 - val_acc: 0.0028 - val_fbeta_score: 0.0033
Epoch 46/100
Epoch 00045: val_loss did not improve
3432s - loss: 2.3521 - acc: 0.0916 - fbeta_score: 0.0061 - val_loss: 1.2901 - val_acc: 0.0051 - val_fbeta_score: 0.0067
Epoch 47/100
Epoch 00046: val_loss did not improve
3548s - loss: 2.2408 - acc: 0.0930 - fbeta_score: 0.0051 - val_loss: 1.2893 - val_acc: 0.0019 - val_fbeta_score: 0.0033
```
Not knowing the details of your code, doesn't this show that your code has "moved on" to training? I see no evidence of any problem that we need to address, but if you feel there is an outstanding problem, please respond and explain.
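For reference, and purely as an inference from the log format rather than from your code: lines like `Epoch 00046: val_loss did not improve` are what Keras's `ModelCheckpoint` callback prints when `verbose=1` and `save_best_only=True` are set, and the `3548s - loss: ...` lines match per-epoch output from `model.fit(..., verbose=2)`. A minimal, self-contained sketch that produces the same style of output:
```python
# Minimal sketch (inferred from the log format, not from the submitted code):
# fit(..., verbose=2) plus ModelCheckpoint(verbose=1, save_best_only=True)
# prints an "Epoch N/M" header, a "val_loss did (not) improve" line, and a
# single timing/metrics line per epoch, the same pattern as the log above.
import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense
from keras.models import Sequential

x = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=(200, 1))

model = Sequential([Dense(1, activation="sigmoid", input_shape=(10,))])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

checkpoint = ModelCheckpoint(
    "best_model.h5",      # placeholder path
    monitor="val_loss",
    save_best_only=True,
    verbose=1,            # prints the "val_loss did not improve" messages
)

model.fit(x, y, epochs=5, validation_split=0.25, verbose=2, callbacks=[checkpoint])
```
If those lines keep appearing, however slowly, the model is still training; if they stop entirely, that is when something is actually stuck.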
Hi Thomas,
Thanks for the response. Yes: locally, the submission takes nowhere near 20 minutes to print additional log statements from the second python script; in fact, it prints them immediately, even before the memory-intensive training starts. The problem seems to be that it never runs the second script, although previously my container had no trouble doing this on the challenge server (and it still has no problem doing it locally). I submitted the container to the regular training queue and it has been up for over a day without "moving on" to the next step; the ID for that submission is syn7861501. In the past, with my architecture, training has taken a bit over one day overall, so I am very confused.
Thanks again for your help.
Dear Luli,
Apologies for the late response. Does your submission run in under 20 minutes locally? I took a look at your log files, and it appears your submission was cut off because it exceeded the 20-minute time limit we set on the express lanes. The submission ID in question is 7844632. I will have someone else also take a look at your submission.
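If it helps, a generic way to check this locally (just a sketch, not part of our infrastructure; the "./train.sh" entry point is a placeholder) is to run the submission with a hard 20-minute cap and see whether it finishes:
```python
# Generic sketch: run the training entry point locally with a 20-minute cap,
# roughly mimicking the express-lane time limit. "./train.sh" is a placeholder.
import subprocess

try:
    result = subprocess.run(["bash", "./train.sh"], timeout=20 * 60)
    print("finished within 20 minutes (exit code %d)" % result.returncode)
except subprocess.TimeoutExpired:
    print("still running after 20 minutes; this would be cut off on the express lane")
```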
Best,
Thomas