Dear Organizer,
I've run into a strange situation: I received a training termination error, but the log files were not available. I'd like to know why it failed so I can make the appropriate fixes. The log files were unavailable because there was another process that was automatically run after termination of my process preventing me from seeing the log. My group ID is syn7221363. Please kindly let me know what is going on...
Sincerely,
Chris
Created by cdjk
@cdjk: On 11/25 we found that one of our servers was experiencing difficulties, so we took it offline and restarted the challenge participants' containers (one of which was yours, for submission 7772336) on other machines.
> another process that was automatically run after termination of my process preventing me from seeing the log
To see the original logs just visit the file's page in Synapse, https://www.synapse.org/#!Synapse:syn7772342, and click on "File History". You can then see any version of the log file which is of interest.
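For anyone who prefers to retrieve earlier versions programmatically instead of through "File History", here is a minimal sketch. It assumes the `synapseclient` Python package, whose `Synapse.get` call accepts `version` and `downloadLocation` arguments. The helper name `fetch_log_versions`, the version numbers, and the `logs` folder are hypothetical, chosen only for illustration; which versions actually exist has to be checked on the file's page.

```python
import os


def fetch_log_versions(syn, entity_id, versions, download_dir):
    """Download the given versions of a Synapse file; return {version: local path}.

    `syn` is expected to be a logged-in synapseclient.Synapse object.
    The function name and arguments are illustrative, not part of Synapse.
    """
    paths = {}
    for version in versions:
        # Give each version its own folder so same-named files do not collide.
        target = os.path.join(download_dir, "v{}".format(version))
        entity = syn.get(entity_id, version=version, downloadLocation=target)
        paths[version] = entity.path
    return paths


# Example usage (requires `pip install synapseclient` and Synapse credentials;
# the version numbers 1 and 2 are assumptions):
#   import synapseclient
#   syn = synapseclient.Synapse()
#   syn.login()
#   print(fetch_log_versions(syn, "syn7772342", [1, 2], "logs"))
```

Comparing the downloaded files side by side should show which version holds the log from the run that was interrupted.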
> I'd like to know why it failed so I can make the appropriate fixes.
The failure may have come from your code, or it may have been caused by the server stopping the running containers. Letting the submission rerun from the start would show which was the cause.
> I am quite lost and frustrated.
Our hope was that by restarting the job on a fresh machine we would correct the infrastructure problem without involving you. It seems that backfired and instead we caused confusion, for which we apologize.
I recommend that you resubmit your model and allow it to run to completion.
Hi Thomas,
I submitted to the training queue and received an email from Synapse on Nov 23rd, shown below:
```
Dear ***,
The data preprocessing phase of your submission (submission ID 7772336) to the Digital Mammography challenge is in progress. Log files produced while your model is pre-processing the input data will be periodically uploaded here: https://www.synapse.org/#!Synapse:syn7772341. Further notification will be provided when your model is complete. Please direct any questions to the challenge forum, https://www.synapse.org/#!Synapse:syn4224222/discussion .
Sincerely,
Challenge Administration
```
However, on Nov 25, I suddenly received a training termination email,
```
Dear ***,
Your Submission to the Digital Mammography challenge, docker.synapse.org/syn7221363/dm-preprocessing-express@sha256:e0aa9009e989892b463c2a7c5c68f046b6893184dd117d305f49d4b1fc98e441 (submission ID 7772336) has stopped before completion. The message is:
Error encountered during training.
Your log file is available here: https://www.synapse.org/#!Synapse:syn7772341 Please direct any questions to the challenge forum, https://www.synapse.org/#!Synapse:syn4224222/discussion .
Sincerely,
Challenge Administration
```
The weird part is that when I click the "Your log file is available here" link, I see a log file from another process, one that I never submitted. That is why I terminated that process: I did not want to waste any more of my allotted resources.
The log file was overwritten by a new training process (shown below) that I never submitted.
```
STDOUT: Fri Nov 25 14:25:13 UTC 2016
STDOUT: Creating csv files...
STDOUT: Fri Nov 25 14:25:15 UTC 2016
STDOUT: Extracting 1760 files...
STDERR: /usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
STDERR: warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
STDOUT: Extracting dicom files...
STDOUT: Number of files: 317617
.....
```
Thomas, please help me understand what is going on. I am quite lost and frustrated.
Sincerely,
Chris
Dear Chris,
Which evaluation queue did you submit to? I see that you submitted to both the express lane training and also the normal training queue:
[Embedded leaderboard: submissions to evaluation 7213944 for user ID 3345322. Columns: Submission ID, Status, Status Detail, Cancel, Last Updated, Submitted Repository or File, File Version, Log Folder, Submitting User or Team, Model State File.]
It says here that you requested the submission to be stopped. Please let me know which evaluation queue you submitted to and which submission you are asking about, so that I can help you further.
Best,
Thomas