Dear contest organizers,
The log file of my running preprocessing submission (current log file: https://www.synapse.org/#!Synapse:syn8104123) has not changed in the last 22 hours.
This is the same preprocessing submission (preprocess@sha256:e94fc0f87ea17b4334eea76e107ed1c9b7a9774419c2fb4b2772b7bc2771144c) that previously completed in under 2 hours (previous log file: https://www.synapse.org/#!Synapse:syn8077591).
Could someone help me to understand why this job is apparently not running but still consuming training time?
Created by DREAMer
To dig in further, we need to identify the submission ID of interest.
It looks like the original post is referring to 8104037.
The log file (https://www.synapse.org/#!Synapse:syn8104126) has 47 revisions. Since the log files are uploaded every half hour, that means the submission ran for roughly 23 hours. Each revision is 246 bytes, the same size as the first revision.
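For anyone who wants to repeat this check, a quick script along the lines below turns the revision count into an approximate run time. This is only a sketch: the `/entity/{id}/version` REST call is my assumption about how to list a file's revisions, and the half-hour upload interval is the one described above.
```
# Sketch: estimate run time from the number of log-file revisions.
# Assumptions: the /entity/{id}/version REST call lists the file's revisions,
# and the infrastructure re-uploads logs every half hour.
import synapseclient

LOG_ENTITY_ID = "syn8104126"        # the log file discussed above
UPLOAD_INTERVAL_HOURS = 0.5         # logs are uploaded every half hour

syn = synapseclient.Synapse()
syn.login()                         # assumes cached Synapse credentials

versions = syn.restGET("/entity/%s/version?limit=200" % LOG_ENTITY_ID)
n_revisions = len(versions["results"])

print("%d revisions -> about %.1f hours of run time"
      % (n_revisions, n_revisions * UPLOAD_INTERVAL_HOURS))
# With the 47 revisions noted above, this gives roughly 23.5 hours.
```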
The challenge infrastructure will not continue to monitor your running model and upload logs if your model stops. However, it *will* let your model run unproductively, which is what seems to have happened here.
The log file is from the preprocessing phase, so the submission never made it to the training phase. The entire content of your log file is:
```
STDOUT: Resizing and converting 317617 DICOM images to PNG format
```
I realize that so far I have only repeated things that you already know. It would help to know **which** of your several submissions had this problem. Looking at more recent submissions it seems like you may have fixed the problem. Can you confirm this?
[Embedded leaderboard widget listing your submissions to this evaluation queue, with columns for submission ID, status, training quota remaining, daily data quota, log folder, and model state file.]
Dear DREAMer,
I will pass this along to other challenge organizers as I don't think I am able to help you.
Best,
Thomas
Dear Thomas,
Thanks for your reply. All of these submissions (both preprocessing and training) have run correctly on the express lane, and the associated training jobs all produce output immediately (before any significant computation occurs), so log files are always created when the training component begins.
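To make that concrete, each of my training entry points begins with something like the snippet below before any heavy work starts (the message text here is a placeholder, not my container's exact output), which is why a training log file appears as soon as the training component runs:
```
# Placeholder illustration of the "log something immediately" behaviour
# described above; the message text is not my container's actual output.
import sys

print("training started; loading preprocessed PNG images ...")
sys.stdout.flush()                  # ensure the line reaches the log right away

# ... the time-consuming model training would only begin here ...
```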
I have also checked that in each case of stalling, the issue was with the preprocessing phase and not the training phase, as the preprocessing component of the submission was still being executed, and the training component had not yet begun.
Here is one detailed example of a submission stalling in the preprocessing phase (I can share several others as well if they would be helpful in debugging this issue):
Submitted File: https://www.synapse.org/#!Synapse:syn8103589
You'll see from the file that this is a joint preprocessing / training submission with preprocessing target preprocess@sha256:e94fc0f87ea17b4334eea76e107ed1c9b7a9774419c2fb4b2772b7bc2771144c
When this job began, a preprocessing log was created (log file: https://www.synapse.org/#!Synapse:syn8104123), and only a single line of output was written to that preprocessing log.
However, the preprocessing job should then have printed many additional lines of output to the log file as it progressed. These additional lines were produced when the identical preprocessing job was submitted earlier (see the log file https://www.synapse.org/#!Synapse:syn8077591 for comparison) and are also produced whenever this job is run in the express lane. Note that in the previous run of the same preprocessing target (preprocess@sha256:e94fc0f87ea17b4334eea76e107ed1c9b7a9774419c2fb4b2772b7bc2771144c) all lines of output were generated within 2 hours, while the stalled submission in question had not moved beyond the first line of output after 22 hours of execution.
Rather than continuing to execute the preprocessing job or moving on to the training job (which also produces a log file upon execution), the job apparently remained stuck at that first line of output in the preprocessing log for over 22 hours. At that point I terminated the job, because no progress was being made.
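For reference, the preprocessing step is essentially a loop of the shape sketched below (a simplified illustration; the directories, output size, and progress interval are placeholders rather than the exact code in my container). It prints a progress line every few thousand images and flushes stdout, so the half-hourly log uploads should show steady progress when it is running normally:
```
# Simplified sketch of the resize-and-convert loop; paths, output size, and
# progress interval are placeholders, not the exact values in my container.
import glob
import os
import sys

import pydicom                      # reads DICOM pixel data
from PIL import Image               # writes the resized PNGs

DICOM_DIR = "/trainingData"         # placeholder input directory
PNG_DIR = "/preprocessedData/png"   # placeholder output directory
TARGET_SIZE = (1024, 1024)          # placeholder output resolution

paths = sorted(glob.glob(os.path.join(DICOM_DIR, "*.dcm")))
print("Resizing and converting %d DICOM images to PNG format" % len(paths))
sys.stdout.flush()

os.makedirs(PNG_DIR, exist_ok=True)
for i, path in enumerate(paths, start=1):
    arr = pydicom.dcmread(path).pixel_array.astype("float32")
    arr = (255.0 * (arr - arr.min()) / max(float(arr.max() - arr.min()), 1.0)).astype("uint8")
    Image.fromarray(arr).resize(TARGET_SIZE).save(
        os.path.join(PNG_DIR, os.path.basename(path).replace(".dcm", ".png")))
    if i % 10000 == 0:              # periodic progress line
        print("Converted %d of %d images" % (i, len(paths)))
        sys.stdout.flush()
```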
Could you provide any advice for getting my preprocessing jobs to run as expected instead of stalling in this way? I have tried submitting preprocessing jobs independently of the training jobs, but I am still observing the same stalling behavior (see, for example, my latest submission https://www.synapse.org/#!Synapse:syn8115811). This is a slightly modified version of the original preprocessing job; it has been tested on the express lane but is apparently stalling after writing two lines of output (many more lines should be written as the job runs to completion).
Dear DREAMer,
I responded in another thread, but it appears that one submission only has the preprocessing step. The other submission actually has both preprocessing and training, which is why it is taking longer.
Best,
Thomas