Training submission no response or updates

Dear Bruce, @brucehoff Could you please help check on our submission 7835220? The preprocessing has taken much longer than we expected, with no updates in the log. We've tested it both locally and on the express lane and it works fine. Could it be stuck on the server? We greatly appreciate your support and thank you for your time!

Created by Hao Du (@duhao)
> BTW, if I terminate a submission after it completes preprocessing, in the middle of training, could I use this cached preprocessing afterwards?

Yes. But if you terminate a training submission while preprocessing is in progress, the partial result will not be retained.
Dear Bruce, I see, noted with thanks! BTW, if I terminate a submission after it completes preprocessing, in the middle of training, could I use this cached preprocessing afterwards? Thanks!
@duhao: You failed to mention the *three* other submissions you made between 7835220 and 7859614. Here are your submissions:

7835220
```
preprocessing=docker.synapse.org/syn7824993/testbase2@sha256:7afd6232d3b2e4f32bcc561cde90f78ae20a142411e214ea197a2834bfa83380
training=docker.synapse.org/syn7824993/dm-traintest-1@sha256:ade58111e34198de45ed99a8eef13e859b8a5b64d96589170ea61dd6adf7e980
```

7856987
```
training=docker.synapse.org/syn7824993/dm-traintest-1@sha256:e69e9498f16ec8966bdf653d1c42a5c95bd485ca43803b2d62843dfae1820dd2
```

7859568
```
training=docker.synapse.org/syn7824993/dm-traintest-1@sha256:630d76c003d553a9e9bbee7f8b8f019aa542497b4b166d12dd10d89011f26f48
```

7859584
```
preprocessing=docker.synapse.org/syn7824993/testbase2@sha256:6a8364a93c375f84a988cdb5d6792d785c0c5f5f2df1d1ff62bf08f6d6ff0e1c
training=docker.synapse.org/syn7824993/dm-traintest-1@sha256:630d76c003d553a9e9bbee7f8b8f019aa542497b4b166d12dd10d89011f26f48
```

7859614
```
preprocessing=docker.synapse.org/syn7824993/testbase2@sha256:7afd6232d3b2e4f32bcc561cde90f78ae20a142411e214ea197a2834bfa83380
training=docker.synapse.org/syn7824993/dm-traintest-1@sha256:630d76c003d553a9e9bbee7f8b8f019aa542497b4b166d12dd10d89011f26f48
```

When you submitted 7835220 we performed the requested preprocessing. When you submitted 7859584 you requested a different form of preprocessing, so we cleared the cached files and computed the new values as requested. When you then submitted 7859614, referencing the original preprocessing Docker image, we again complied with your request to switch back, and we also moved your submission to a different machine which had no jobs waiting. At this time your preprocessing is in progress. If you want to reuse cached preprocessing data you must consistently specify the same preprocessing container image, not switch back and forth.
Dear Bruce, Thanks for your reply. After our preprocessing step we intended to use the cached preprocessed data to train and tune our model. However, the submission does not seem to be using the cached data; instead it appears to be rewriting and reprocessing it. Could you check this for us? Our submission is 7859614 and should reuse the data in /preprocessedData produced by the preprocessing image at sha256:7afd6232d3b2e4f32bcc561cde90f78ae20a142411e214ea197a2834bfa83380 in submission 7835220. Thanks!
> Could you do us one more favor and check the number of allocated CPU cores?

Your container is run with `--cpuset-cpus=25-46` (see https://docs.docker.com/engine/reference/run/ for details). Further, no other container on the server has access to those 22 CPUs.

> BTW, do we need to delete files in /preprocessedData if we cancel and resubmit a new job?

No. If you submit a second job with a preprocessing image different from that used to create the content of `/preprocessedData`, then we will remove the current preprocessing results and start generating preprocessed data as specified in the second job. In effect, each of your submissions is entirely independent and unaffected by what has come before.
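For reference, here is a minimal R sketch (illustrative only, not part of the challenge harness) that a container could run to confirm the allocation from the inside: on Linux, `/proc/self/status` exposes the affinity mask that `--cpuset-cpus` imposes, and `nproc` honors it.
```
# Illustrative check, assuming a Linux container: read the CPU affinity mask
# that --cpuset-cpus imposes, and the core count the process may actually use.
status <- readLines("/proc/self/status")
cat(grep("^Cpus_allowed_list", status, value = TRUE), "\n")  # e.g. "Cpus_allowed_list: 25-46"
cat("usable cores:", system("nproc", intern = TRUE), "\n")   # nproc respects the cpuset
```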
Hi Bruce, Thanks! When we test locally, our processing time is 0.3 minutes for 500 images using 16 cores, i.e. 0.3 * 16 / 500 * 1000 = 9.6 core-minutes per 1000 images. So the full run should take about 9.6 * 640 / 24 = 256 minutes ≈ 4.3 hours. On the express lane our preprocessing test took 931 seconds with 20 cores, which extrapolates to ~26 hours on 100% of the data. However, our submission has been running for 40+ hours with no updates yet. Could you do us one more favor and check the number of allocated CPU cores? The number of files generated looks like a single-CPU computation, although we did distribute the work across CPU cores. BTW, do we need to delete files in /preprocessedData if we cancel and resubmit a new job? Thanks for your time and support!
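The estimate above, restated as a quick R sanity check (the 640 is read as roughly 640,000 images, as the per-1000-images rate in the post implies):
```
# All numbers are taken from the post itself.
core_min_per_1000 <- 0.3 * 16 / 500 * 1000  # 9.6 core-minutes per 1000 images
total_min <- core_min_per_1000 * 640 / 24   # ~640k images spread over 24 cores
total_min        # 256 minutes
total_min / 60   # ~4.27 hours
```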
> May I ask whether you can check that our /preprocessedData directory is updating or not?

In general you should 'instrument' your preprocessing to check this, i.e. print the progress of your preprocessing to STDOUT. Our job processing pipeline will then return the logs to you periodically, which you can check to monitor progress. For expediency I ran a file count command twice, with a few seconds in between:
```
$ ls -1 | wc -l
92730
$ ls -1 | wc -l
92828
```
This shows that your code is doing *something* in the /preprocessedData directory.
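One way to do that instrumentation, sketched in R with snowfall (as suggested by the logs below; the chunk size, input path, and core count are illustrative assumptions, not the team's actual code):
```
# Illustrative sketch: process images in chunks so the master process can
# print a progress line to STDOUT after each chunk; those lines then show
# up in the logs the pipeline returns.
library(snowfall)
sfInit(parallel = TRUE, cpus = 23)                        # core count is illustrative

files  <- list.files("/trainingData", full.names = TRUE)  # input path is an assumption
chunks <- split(files, ceiling(seq_along(files) / 1000))  # 1000 images per chunk

for (i in seq_along(chunks)) {
  sfLapply(chunks[[i]], function(f) {
    # ... per-image preprocessing, writing results into /preprocessedData ...
  })
  cat(sprintf("[%s] chunk %d/%d done (%d images)\n",
              Sys.time(), i, length(chunks), length(chunks[[i]])))
  flush(stdout())                                         # push the line out promptly
}
sfStop()
```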
Hi Bruce, Many thanks for your assistance. May I ask whether you can check that our /preprocessedData directory is updating or not? It works fine on the express lane, but this run is taking much longer than our estimate. If the directory is not updating we will resubmit. Thanks for your time!
@duhao: Your submission is running and is not 'stuck'. This is the running image:
```
docker.synapse.org/syn7824993/testbase2@sha256:7afd6232d3b2e4f32bcc561cde90f78ae20a142411e214ea197a2834bfa83380
```
It's been running for 39 hours with minimal log output:
```
Loading required package: EBImage
R Version: R version 3.3.1 (2016-06-21)
snowfall 1.84-6.1 initialized (using snow 0.4-2): parallel execution on 23 CPUs.
Warning message:
In searchCommandline(parallel, cpus = cpus, type = type, socketHosts = socketHosts, :
  Unknown option on commandline: --file
Library oro.dicom loaded.
Library oro.dicom loaded in cluster.
Library png loaded.
Library png loaded in cluster.
```
If this is not the behavior you intended, you are free to cancel and send another submission.
