Hi all,
I was expecting the error in training, since it's based on Caffe and some issues still exist. But can anyone explain why the preprocess log file was truncated?
Apparently there was no error while running the preprocess container; the error refers to the train container. But the log file is incomplete with regard to the preprocess steps.
I have run this preprocess container before, so I know the expected output (log).
```
Your Submission to the Digital Mammography challenge, docker.synapse.org/syn7273857/train@sha256:5158a1a6d3ff40061c42df332670eb629f7556e5040497d40f94ce0fa3c718b9 (submission ID 7440486) has stopped before completion. The message is:
Error encountered during training.
Your log file is available here: https://www.synapse.org/#!Synapse:syn7444792 Please direct any questions to the challenge forum, https://www.synapse.org/#!Synapse:syn4224222/discussion .
```
Created by Jose Costa Pereira josecp
Here's an update to my last post: I reran two of your submissions. The first was syn7500476, which received submission ID 7542257 and whose logs were uploaded to https://www.synapse.org/#!Synapse:syn7542261. The folder has two log files, one for preprocessing and one for training. The preprocessing log looks like:
```
STDOUT: total 4.0K
STDOUT: drwxr-xr-x. 5 root root 440 Nov 16 04:59 .
STDOUT: drwxr-xr-x. 24 root root 4.0K Nov 16 04:59 ..
STDOUT: lrwxrwxrwx. 1 root root 11 Nov 16 04:59 core -> /proc/kcore
STDOUT: lrwxrwxrwx. 1 root root 13 Nov 16 04:59 fd -> /proc/self/fd
...
STDOUT: - 2dsi6cpt_1_R
STDOUT: - 2dsi6cpt_2_L
STDOUT: - 2dsi6cpt_2_R
STDOUT: - 2dvczulj_1_L
```
The training log looks like:
```
STDOUT: Wed Nov 16 05:22:51 2016
STDOUT: +------------------------------------------------------+
STDOUT: | NVIDIA-SMI 352.99 Driver Version: 352.99 |
STDOUT: |-------------------------------+----------------------+----------------------+
STDOUT: | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
STDOUT: | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
STDOUT: |===============================+======================+======================|
STDOUT: | 0 Tesla K80 Off | 0000:83:00.0 Off | Off |
STDOUT: | N/A 33C P8 27W / 149W | 56MiB / 12287MiB | 0% Default |
STDOUT: +-------------------------------+----------------------+----------------------+
STDOUT: | 1 Tesla K80 Off | 0000:84:00.0 Off | Off |
STDOUT: | N/A 30C P8 28W / 149W | 56MiB / 12287MiB | 0% Default |
STDOUT: +-------------------------------+----------------------+----------------------+
...
STDOUT: -rwxr-xr-x. 1 root root 218 Oct 26 16:14 test.sh
STDOUT: drwxrwxrwt. 3 root root 30 Oct 28 15:33 tmp
STDOUT: -rwxr-xr-x. 1 root root 1.5K Nov 4 11:42 train.sh
STDOUT: drwxr-xr-x. 10 root root 97 Oct 28 15:33 usr
STDOUT: drwxr-xr-x. 11 root root 4.0K Jul 27 01:53 var
```
I then went to the archived containers and examined their logs. The log for the preprocessing container is *much* longer than the content returned via Synapse. That is, I see the truncated log file too. The full log file is 7.6MB uncompressed (1.7MB compressed), which is beyond the enforced 1MB limit. **My suspicion is that the complete file was not uploaded because it exceeded the 1MB limit.** I then turned my attention to the training log, which matches exactly what was uploaded to Synapse. In this case the file is very small.
The second submission I reran was syn7510494. This was assigned submission ID 7542258 and its logs were uploaded to https://www.synapse.org/#!Synapse:syn7542285. I see two log files, one for preprocessing and one for training. As before the uploaded preprocessing log is small (about 1KB) while the original is much larger (5MB, uncompressed). The uploaded training log file matches the one from the archived Docker container exactly. So the situation for the second submission matches that of the first.
The obvious question is why the abbreviated log file is so much less than 1MB. The answer is in how we do the abbreviation: we upload the log file periodically (e.g. every 15 minutes). When we see that the file exceeds 1MB we simply stop uploading it. So if the file is found to be 1KB and then, 15 minutes later, 1.1MB, we stop uploading after the 1KB version. Obviously it would be preferable to truncate the 1.1MB file to 1.0MB and upload that.
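For concreteness, here is a rough sketch of that polling loop together with the preferable truncate-then-upload behavior. This is illustrative only, not the actual submission harness code; `upload_fn` and `job_running` are placeholders.
```python
import time

LOG_LIMIT_BYTES = 1024 * 1024      # 1MB cap on uploaded log files
UPLOAD_INTERVAL_SECONDS = 15 * 60  # snapshot roughly every 15 minutes


def upload_log_snapshots(log_path, upload_fn, job_running):
    """Periodically upload the log file. Rather than skipping the upload once
    the file grows past the limit, truncate the snapshot to the first 1MB so
    the latest (capped) version always reaches Synapse."""
    while job_running():
        with open(log_path, "rb") as f:
            snapshot = f.read(LOG_LIMIT_BYTES)  # read at most 1MB
        upload_fn(snapshot)  # replaces the previously uploaded snapshot
        time.sleep(UPLOAD_INTERVAL_SECONDS)
```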
We are currently redefining the data limit in a way that I think will obviate this issue: we will allow you to retrieve your model state and impose a total data limit (logs plus model state) of 1GB. For reasonably sized logs you should no longer encounter the data limit.
Hope this helps.
@josecp: Sorry for the delayed reply.
> The log file is again incomplete and relates only to the preprocessing part of the job.
Yes, it looks like we ran just the preprocessing part of your submission, yet marked the submission as complete, which is not correct. I have resubmitted syn7500476 to see if I can recreate the issue.
> Here is a job where this happened ...
The link you sent is to the log file for submission 7510497, which used file syn7510494. I have resubmitted this as well, again to see if the issue is reproducible. We will report back.
This is almost solved. And I say almost because I can see two log files in the folder created for my jobs. These logs refer to preprocess and train, but it seems that the last update of the preprocess log file is not written to the file (maybe you need to flush the last update of the log, to force it to be written to the file we can access; a sketch of this idea follows below).
This seems to be affecting only the preprocessing log.
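Something like this, purely as a sketch of what I mean by writing out the last update (the names `finalize_log_upload` and `upload_fn` are placeholders, not your actual code):
```python
def finalize_log_upload(log_path, upload_fn):
    """Hypothetical end-of-job hook: after the container exits, take one last
    snapshot of the log so that anything written since the most recent
    periodic upload still reaches the folder we can access."""
    with open(log_path, "rb") as f:
        upload_fn(f.read())
```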
Here is a job where this happened [log folder](https://www.synapse.org/#!Synapse:syn7510508) @brucehoff
I've read in a related post that the issue with the logs had been resolved. Today I've submitted a file (two containers: preprocess+train) which apparently completed successfully:
| Submission ID | Status | Status Detail | Cancel | Last Updated | Submitted Repository or File | Log Folder | Submitting User or Team | Model State File |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7500477 | ACCEPTED | COMPLETED | | 11/04/2016 02:59:00PM | syn7500476 | syn7500518 | 3345451 | |
The log file is again incomplete and relates only to the preprocessing part of the job. Are you working on a fix for this?
I've submitted a job with two containers (preprocess + train). Here is what I can see regarding the two containers:
| Submission ID | Status | Status Detail | Cancel | Last Updated | Submitted Repository or File | Log Folder | Submitting User or Team | Model State File |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7486621 | CLOSED | ERROR | | 10/31/2016 07:52:51PM | syn7486617 | syn7486700 | 3345451 | |
| 7486641 | ACCEPTED | COMPLETED | | 10/31/2016 07:55:12PM | syn7486638 | syn7486755 | 3345451 | |
I'm actually not doing much in the training container (just printing the output of a few commands), and it apparently terminates successfully. The preprocessing is apparently what causes the error, but nothing can be seen in the log file...
PS: Needless to say, this runs fine on my system using the pilot data.
The second run, with only a training log, is 7486422 (syn7486780), which was still running when you posted. The previous preprocessing run was 7477038 (syn7477342).
There was another run scheduled but it was cancelled before it started: 7485885 (syn7486778). Presumably it can be discounted.
Hope this helps.
Bob
@BobK: Below are the submissions I can find from you. Can you say which Submission ID is the one described in your last comment?
${leaderboard?queryTableResults=true&path=%2Fevaluation%2Fsubmission%2Fquery%3Fquery%3Dselect%2B%2A%2Bfrom%2Bevaluation%5F7213944%2Bwhere%2BSUBMITTER%253D%253D%25223345846%2522&paging=false&pageSize=100&showRowNumber=false&columnConfig0=none%2CSubmission ID%2CobjectId%3B%2CNONE&columnConfig1=none%2CStatus%2Cstatus%3B%2CNONE&columnConfig2=none%2CStatus Detail%2CSTATUS%5FDESCRIPTION%3B%2CNONE&columnConfig3=epochdate%2CLast Updated%2CmodifiedOn%3B%2CNONE&columnConfig4=synapseid%2CSubmitted Entity%2CentityId%3B%2CNONE&columnConfig5=none%2CEntity Version%2CversionNumber%3B%2CNONE&columnConfig6=none%2CSubmitting User or Team%2CSUBMITTER%3B%2CNONE&columnConfig7=synapseid%2C%2CSUBMISSION%5FFOLDER%3B%2CNONE}
To clarify:
After a submission of our preprocessing/training containers we received an incomplete preprocessing log file and an email saying that the training had failed.
We resubmitted with the same preprocessing container (and a new training container) but now see only a training container log file. The preprocessing step must therefore have previously succeeded; otherwise the second run would simply have repeated the preprocessing phase. The incomplete preprocessing log file from the first run was rather misleading.
We have seen something similar, and figuring it out has cost us too much time.
To test whether preprocessing actually succeeded, we eventually hit on the idea of rerunning exactly the same preprocessing container that had previously failed with only an incomplete preprocessing log file as output. This time there is only a training log file, whereas before there was only an incomplete preprocessing log file. Presumably the previous training script failed very quickly and its logs were not caught in a final snapshot.
Please fix this as soon as possible. The logs are our only way of seeing what is actually happening on the Synapse servers, and any truncation wastes our time.
Bob
Dear @thomas.yu
I know this sounds like a bug on my end, but bear with me for a second...
To make it clearer, what I'm experiencing right now is the following:
- the preprocess.sh calls a few python scripts
- the first one completes fine
- the second iterates through the metadata and prints some stuff. It terminates without any STDERR message. Furthermore, if there were an error, control would return to preprocess.sh, where I print a few debug messages to trace the flow of execution. Nothing can be seen in the logs... (see the sketch after this list)
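One thing I will double-check on my side is output buffering: inside the container stdout is a pipe, so Python block-buffers print output, and lines can be missing from the captured log if the process is killed, or the log is snapshotted, before the buffer is flushed. Here is a minimal sketch of the explicit flushing I mean; the function name and sample IDs are placeholders, not my actual script:
```python
import sys


def process_metadata(exam_ids):
    """Illustrative stand-in for the second preprocessing script: print one
    line per exam and flush immediately, so the output reaches the captured
    docker log even if the process dies (or the log is snapshotted) before
    the stdout buffer would normally be written out."""
    for exam_id in exam_ids:
        print("processing %s" % exam_id)
        sys.stdout.flush()  # or run the interpreter with `python -u`


if __name__ == "__main__":
    # sample IDs taken from the preprocessing log excerpt above
    process_metadata(["2dsi6cpt_1_R", "2dsi6cpt_2_L", "2dsi6cpt_2_R"])
```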
Dear Jose,
Apologies for the inconvenience. However, we do not truncate any log files. Your process most likely terminated with an error, which is probably the cause of the short log file. Please take a look at your log file [here](#!Synapse:syn7444793).
Best,
Thomas