I submitted a job to the Express Training queue and received a failure message:
Submission exceeded allotted time.
The submission ID is 8057130.
However, the same processing completed successfully two days ago (ID 8045922).
Could you explain?
Thanks
Ljubomir
Thanks for letting me know. We will investigate.

Thanks for this. In the meantime, here is more info:
- While searching for a workaround, I ran this - or a simpler - workflow at least a half dozen times. It failed every single time.
- The workflow has two major steps, image conversion and LMDB indexing (a rough sketch follows this post). Previously, both completed easily within 20 minutes. As of last night, indexing alone does not complete in 20 minutes, though conversion does.
I conclude that the Express Training queue has slowed down significantly compared to five days ago, and that this is not transient behavior but is completely reproducible.
I have a workaround, so strictly speaking you don't need to investigate further on my behalf. Thank you for offering. I would say, however, that there is an issue in the system, and it would be advantageous to explain it. Most importantly, since we don't know the cause, it may also affect the "full" queues; i.e., all queues might be running slower, not just the express one.
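As a rough illustration of the two-step workflow mentioned above, here is a minimal sketch in Python. The script name, paths, and list file are assumptions for illustration only; the actual submission runs its own shell commands inside the container. The `convert_imageset` flags mirror the Caffe tool output visible in the logs further down the thread.
```
import subprocess

IMAGE_DIR = "/preprocessedData/png/"       # hypothetical output dir of step 1
LIST_FILE = "/preprocessedData/train.txt"  # hypothetical "path label" list file
LMDB_DIR = "/preprocessedData/lmdb"        # LMDB path that appears in the logs

# Step 1: image conversion (stand-in script; the real step resizes and
# converts 3188 DICOM images to PNG).
subprocess.check_call(["./convert_dicom_to_png.sh"])

# Step 2: LMDB indexing with Caffe's convert_imageset tool; --shuffle matches
# "Shuffling data" in the logs, and the 227x227 resize matches the
# (3, 227, 227) shape printed at the end of the first log.
subprocess.check_call([
    "convert_imageset",
    "--shuffle",
    "--resize_height=227",
    "--resize_width=227",
    IMAGE_DIR, LIST_FILE, LMDB_DIR,
])
```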
I see.
Here's the output of the preprocessing log for the first submission, 8045922:
```
STDOUT: Resizing and converting 3188 DICOM images to PNG format
STDOUT: DICOM to PNG conversion completed
STDOUT: Grayscale to RGB conversion complete
STDERR: I0120 10:04:10.634892 24590 convert_imageset.cpp:91] Shuffling data
STDERR: I0120 10:04:11.107224 24590 convert_imageset.cpp:94] A total of 3188 images.
STDERR: I0120 10:04:11.107795 24590 db_lmdb.cpp:35] Opened lmdb /preprocessedData/lmdb
STDERR: I0120 10:04:14.877275 24590 convert_imageset.cpp:160] Processed 1000 files.
STDERR: I0120 10:04:18.520160 24590 convert_imageset.cpp:160] Processed 2000 files.
STDERR: I0120 10:04:22.003229 24590 convert_imageset.cpp:160] Processed 3000 files.
STDERR: I0120 10:04:22.610239 24590 convert_imageset.cpp:166] Processed 3188 files.
STDERR: /usr/lib64/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
STDERR: warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
STDOUT: (3, 227, 227)
STDOUT: Channel 0: 79.4509742087
STDOUT: Channel 1: 39.9673120961
STDOUT: Channel 2: 24.0351331289
STDOUT: Done
```
And here is the output for the second submission, 8057120:
```
STDOUT: Resizing and converting 3188 DICOM images to PNG format
STDOUT: DICOM to PNG conversion completed
STDOUT: Grayscale to RGB conversion complete
STDERR: I0122 22:54:55.946682 24591 convert_imageset.cpp:91] Shuffling data
STDERR: I0122 22:54:56.415037 24591 convert_imageset.cpp:94] A total of 3188 images.
STDERR: I0122 22:54:56.417742 24591 db_lmdb.cpp:35] Opened lmdb /preprocessedData/lmdb
```
The second log is clearly cut off while preprocessing files. The first submission ran to completion in a few minutes (both preprocessing and training), while the second apparently ran for more than 20 minutes and was then cut off.
One possibility is that something else was happening during the second submission that took inordinately long. For example, there are file shares that have to be cleaned up and initialized before each submission runs, and if that took a very long time it might have affected your submission. One way to examine this is to look at how the time stamps in the log file line up with the time stamps in the dashboard: the model was started at 14:35:06 Pacific Time and was last updated at 15:00:26, which in UTC is 22:35:06 -> 23:00:26. So, indeed, the submission was terminated around five minutes after the last log time stamp.
Looking at the earlier submission, the analogous time stamp is 10:04:11, and the model started at 01:55:37 AM Pacific, or 9:55:37 AM UTC. In that case the time stamp occurred about nine minutes after the submission started, compared to the approx. 20-minute delay in the later submission, so something else happened for about 11 minutes in the second submission. My best guess is that either (1) it took longer than usual to clean up file shares from the previous submission, or (2) there was a temporary outage when communicating with Synapse, and the execution pipeline had to sleep/retry until the outage was over.
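A quick way to verify this timestamp arithmetic is to convert the dashboard's Pacific start times to UTC and compare them with the log timestamps, as in the sketch below. The year (2017) is inferred from the I0120/I0122 log prefixes, and the UTC-8 (PST) offset is an assumption appropriate for January.
```
from datetime import datetime, timedelta

PACIFIC_TO_UTC = timedelta(hours=8)  # PST is UTC-8, so add 8 h to get UTC

# Second submission (8057120): started 14:35:06 Pacific; last log line 22:54:56 UTC.
start = datetime(2017, 1, 22, 14, 35, 6) + PACIFIC_TO_UTC
last_log = datetime(2017, 1, 22, 22, 54, 56)
print(last_log - start)  # ~20 min elapsed before the LMDB step had barely begun

# First submission (8045922): started 01:55:37 Pacific; shuffling logged 10:04:11 UTC.
start = datetime(2017, 1, 20, 1, 55, 37) + PACIFIC_TO_UTC
last_log = datetime(2017, 1, 20, 10, 4, 11)
print(last_log - start)  # ~9 min, i.e. roughly 11 minutes less overhead
```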
My suggestion would be to resubmit the model. If you have ongoing issues, please let us know and we will investigate further.
Bruce
You are right that those two files are not identical, and that's on purpose: they _cannot_ be identical. If I submit the identical file again, the system will, as you know, skip the preprocessing step, because it has already been done, and therefore I would not be able to demonstrate the problem. For this reason I had to create a new submission, with a new SHA digest, in order to trigger re-running of the preprocessing steps. However, the two submissions have identical processing workflows (i.e., identical shell commands). And yet the first one completed and the second was terminated.
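To make the digest mechanism concrete, here is a minimal sketch, in Python, of preprocessing keyed by the content digest of the submission file, which is why an identical resubmission would be skipped. The cache structure and function name are hypothetical illustrations, not the platform's actual implementation.
```
import hashlib

_seen_digests = set()  # digests of submission files already preprocessed

def maybe_preprocess(submission_bytes):
    """Run preprocessing only if this exact file content is new."""
    digest = hashlib.sha256(submission_bytes).hexdigest()
    if digest in _seen_digests:
        print(digest[:12], "-> already preprocessed, skipping")
        return
    _seen_digests.add(digest)
    print(digest[:12], "-> new content, running preprocessing")

maybe_preprocess(b"preprocessing=...cpi35x...")  # first submission: runs
maybe_preprocess(b"preprocessing=...cpi35x...")  # identical resubmission: skipped
maybe_preprocess(b"preprocessing=...cpi38x...")  # changed content, new digest: runs
```
Changing even one character of the file (here, the image tag and digest) produces a new SHA-256 value, so the preprocessing step runs again.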
Submission 8057120 is file syn8054924, which has this content:
```
preprocessing=docker.synapse.org/syn7415408/cpi38x@sha256:67da607a4ec3db87e362d275f312c33ba17409a0a9495005fa79a6bc868b5e64
training=docker.synapse.org/syn7415408/cpi38x@sha256:67da607a4ec3db87e362d275f312c33ba17409a0a9495005fa79a6bc868b5e64
```
while submission 8045922 is file syn8045853, which has this content:
```
preprocessing=docker.synapse.org/syn7415408/cpi35x@sha256:976f37db57c6b9b1709cc9c897bc5705b93f0fdb71ae8dbd7e5eb35224e8d82c
training=docker.synapse.org/syn7415408/cpi35x@sha256:976f37db57c6b9b1709cc9c897bc5705b93f0fdb71ae8dbd7e5eb35224e8d82c
```
They don't appear to be the same thing.

Hi Thomas,
You are right: I posted the log file ID, not the submission ID.
The correct number for the failed submission is 8057120.
Thanks
Ljubomir
Dear Ljubomir,
I can't seem to find your submission that failed. Are you sure that is the correct submission ID? Please try submitting again, and let me know if you run into the error again.
Best,
Thomas