The log for my training submission (id 7804074, regular lane) has not grown in the last 10+ hours. I have two questions about it:

1. Under ideal conditions, how often is the log updated (and uploaded)?
2. How long should I wait for a log update before considering the submission stuck (and cancelling it manually)?

Created by Yohanes Gultom (yohanesgultom)
@brucehoff noted. I will take a look at it and adjust my code if necessary. Thanks!
> May I know how your system captures the standard output?

Yes, we use the `docker logs` API (https://docs.docker.com/engine/reference/commandline/logs/) and put the content into a file that we return to you. We capture both STDOUT and STDERR; essentially, whatever your model prints to the command line is captured. That is, if you run your container on your local machine with the `-i -t` options (see https://docs.docker.com/engine/tutorials/usingdocker/), the content printed to the command line is what we capture and return to you when we run your code as a container.
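Since `docker logs` only returns what has actually been flushed to STDOUT/STDERR, output buffering can make a running container look silent. Below is a minimal, hypothetical sketch (the `train_one_epoch` function is a placeholder, not part of the challenge code) showing how explicit flushing keeps progress visible in the captured log:

```python
# Hypothetical sketch: when a container runs without a TTY, Python
# block-buffers stdout, so `docker logs` can lag far behind the real
# progress. Flushing each progress line (or running `python -u`) makes
# it show up in the captured log promptly.
import sys
import time

def train_one_epoch(epoch):
    """Placeholder for the real training work."""
    time.sleep(1)

for epoch in range(100):
    train_one_epoch(epoch)
    print("epoch %d finished" % epoch, flush=True)  # reaches `docker logs` right away
    sys.stderr.flush()  # STDERR is captured as well
```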
@thomas.yu thanks for the information. It helps us decide when to cancel a submission job (especially the details about the file versions and history).

> Your code may have stopped producing output while continuing to run, which may be why it seemed stuck?

May I know how your system captures the standard output? Could Python multiprocessing somehow be incompatible with it? I don't see anything else in my code that could cause inconsistent output.
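For what it's worth, `multiprocessing` by itself should not break log capture, since worker processes inherit the parent's STDOUT/STDERR; each worker does, however, keep its own output buffer. A hedged sketch (the `worker` function is only illustrative):

```python
# Hedged sketch: output printed in multiprocessing workers still ends up in
# the container's STDOUT, but each worker buffers independently, so flushing
# inside the worker avoids long silent stretches in the log.
import multiprocessing as mp

def worker(task_id):
    # ... a chunk of preprocessing/training work would go here ...
    print("task %d done" % task_id, flush=True)
    return task_id

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        for result in pool.imap_unordered(worker, range(20)):
            print("collected result %d" % result, flush=True)
```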
Dear Yohanes,

Apologies for the delayed response.

> Under ideal conditions, how often is the log updated (and uploaded)?

We nominally update the logs every 15 minutes. You can see your log file here: https://www.synapse.org/#!Synapse:syn7804080. There are 68 versions, meaning it ran for 17 hours. Training started at 7:30 AM on 11/30, so it ran through 12:30 AM on 12/1. That coincides with the 'last updated' column in the dashboard. Your code may have stopped producing output while continuing to run, which may be why it seemed stuck.

> How long should I wait for a log update before considering the submission stuck (and cancelling it manually)?

You can click the History button for your accumulating log file. If the file size is not increasing, then your model is not producing output, and you are welcome to cancel your submission if that is unexpected.

Best,
Thomas
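If clicking through the UI gets tedious, the version history can also be polled with the Synapse Python client. This is only a sketch, assuming the `modifiedOn`/`contentSize` field names from the Synapse REST `VersionInfo` schema; syn7804080 is the log entity mentioned above:

```python
# Sketch: list the most recent versions of the accumulating log file.
# If contentSize stops growing between versions, the model has stopped
# producing output.
import synapseclient

syn = synapseclient.Synapse()
syn.login()  # uses cached credentials or prompts for them

LOG_ENTITY = "syn7804080"
history = syn.restGET("/entity/%s/version" % LOG_ENTITY)
versions = sorted(history["results"], key=lambda v: v["versionNumber"], reverse=True)
for v in versions[:5]:
    print(v["versionNumber"], v.get("modifiedOn"), v.get("contentSize"))
```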
Quick update: after **repeating the same submission 3 times**, I can finally see the log getting updated (submission id 7806864). Apparently, in my case, the suggested **observation time is around 5 hours** before deciding whether or not to cancel a submission. Hopefully I still have enough time quota for all my experiments :(
I just canceled another submission (id 7805903) because it was stuck (again) for 10+ hours. It worked better last week. Is something wrong?
