Dear organizers,
I can't run caffe in the Express training queue. The program seems
unable to find the image mean file.
The setup is this: I have a simple train.sh script which runs caffe
using a solver proto file, as usual. The solver file refers to a
training proto file, which in turn specifies the mean image file:
mean_file: "/preprocessedData/mean_train.binaryproto"
The problem is that caffe crashes with a file-not-found error:
STDERR: I0117 00:41:50.388388 14 data_transformer.cpp:25] Loading mean file from: /preprocessedData/mean_train.binaryproto
STDERR: F0117 00:41:50.389724 17 db_lmdb.hpp:15] Check failed: mdb_status == 0 (2 vs. 0) No such file or directory
However, I ran `ls` on the file just before and just after the caffe
command, and both times it reported that the file is there:
-rw-r--r--. 1 root root 618362 Jan 16 19:00 /preprocessedData/mean_train.binaryproto
I also noticed that the file has SELinux permissions on it:
-rw-r--r--. root root system_u:object_r:unlabeled_t:s0 /preprocessedData/mean_train.binaryproto
I tried to disable SELinux, but the "setenforce" command isn't available.
I can't think of any other reason for this. Is there anything you can do to help diagnose/correct
the situation? Most perplexing.
Thanks
Ljubomir
Created by Ljubomir Buturovic ljubomir_buturovic

Bruce, just to let you know, I found a workaround for this problem.
Regards
Ljubomir
Thank you for looking into this. I am searching for a workaround, but it would surely
be preferable to have this fixed.
There are many submissions in the above list which exhibit this behavior. They all
completed, but caffe crashed, as I explained. The submissions differ by the amount
of information printed out. An example is Submission Id 8032703. See line 1868 onward
in the log file:
STDERR: I0117 15:52:54.454335 18 data_transformer.cpp:25] Loading mean file from: /preprocessedData/mean_train.binaryproto
STDERR: F0117 15:52:54.455720 21 db_lmdb.hpp:15] Check failed: mdb_status == 0 (2 vs. 0) No such file or directory
STDERR: *** Check failure stack trace: ***
STDERR: @ 0x7ff069fb4e6d (unknown)
STDERR: @ 0x7ff069fb6ced (unknown)
STDERR: @ 0x7ff069fb4a5c (unknown)
STDERR: @ 0x7ff069fb763e (unknown)
STDERR: @ 0x7ff06accfd31 caffe::db::LMDB::Open()
STDERR: @ 0x7ff06ab5f2e4 caffe::DataReader::Body::InternalThreadEntry()
STDERR: @ 0x7ff06ab6f9ff caffe::InternalThread::entry()
STDERR: @ 0x7ff06a61924a (unknown)
STDERR: @ 0x7ff06a1e0dc5 start_thread
STDERR: @ 0x7ff05927bced __clone
STDERR: /train.sh: line 23: 18 Aborted (core dumped) caffe train --solver /cpi_caffe_solver.prototxt
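Since the stack trace above fails inside `caffe::db::LMDB::Open()`, it may be worth checking the LMDB input as well as the mean file before launching caffe. The following is only a minimal sketch of such a pre-flight check for `train.sh`. The helper names `check_path` and `check_lmdb` are made up for illustration, the LMDB path in the usage comment is hypothetical (the actual `source:` value of the data layer is not shown in this thread), and the check assumes the default LMDB layout, in which the database is a directory containing `data.mdb`.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of caffe or the challenge tooling.

# Report whether a path exists and is readable by the current user.
check_path() {
    p="$1"
    if [ ! -e "$p" ]; then
        echo "MISSING: $p"
        return 1
    fi
    if [ ! -r "$p" ]; then
        echo "UNREADABLE: $p"
        return 1
    fi
    echo "OK: $p"
}

# Assumption: the LMDB database is a directory that contains data.mdb.
check_lmdb() {
    d="$1"
    check_path "$d" || return 1
    check_path "$d/data.mdb"
}

# Example use before the caffe command (the LMDB path is a placeholder):
#   check_path /preprocessedData/mean_train.binaryproto
#   check_lmdb /preprocessedData/train_lmdb
```

If one of the checks prints `MISSING`, the path it names is the one to compare against the training proto file.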
I'm trying to determine which submission encountered the reported issue. In the list above, the submissions that did not complete successfully are 8028591 and 8022179. Regarding 8028591, our records show that the error was:
```
invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"exec: \\\\\\\\\\\\\\\"/preprocess.sh\\\\\\\\\\\\\\\": stat /preprocess.sh: no such file or directory\\\\\\\"\\\\n\\\"
```
This suggests that there is no `/preprocess.sh` in the submitted preprocessing image. The preprocessing image is:
```
docker.synapse.org/syn7415408/cpi30x@sha256:dbc5f8fd29af5a95595dbe1e9166a86ee7c02faba83134343203bf6c707bdaad
```
I ran:
```
docker run --rm docker.synapse.org/syn7415408/cpi30x@sha256:dbc5f8fd29af5a95595dbe1e9166a86ee7c02faba83134343203bf6c707bdaad ls -al /
total 80
drwxr-xr-x 63 root root 4096 Jan 19 16:29 .
drwxr-xr-x 63 root root 4096 Jan 19 16:29 ..
-rwxr-xr-x 1 root root 0 Jan 19 16:29 .dockerenv
-rw-r--r-- 1 root root 18307 Sep 6 14:02 anaconda-post.log
lrwxrwxrwx 1 root root 7 Sep 6 13:59 bin -> usr/bin
drwxr-xr-x 5 root root 360 Jan 19 16:29 dev
drwxr-xr-x 83 root root 4096 Jan 19 16:29 etc
drwxr-xr-x 2 root root 4096 Aug 12 2015 home
lrwxrwxrwx 1 root root 7 Sep 6 13:59 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Sep 6 13:59 lib64 -> usr/lib64
drwx------ 2 root root 4096 Sep 6 13:59 lost+found
drwxr-xr-x 2 root root 4096 Aug 12 2015 media
drwxr-xr-x 2 root root 4096 Aug 12 2015 mnt
drwxr-xr-x 7 root root 4096 Nov 8 19:03 opt
dr-xr-xr-x 148 root root 0 Jan 19 16:29 proc
dr-xr-x--- 6 root root 4096 Nov 3 02:08 root
drwxr-xr-x 11 root root 4096 Nov 8 19:02 run
lrwxrwxrwx 1 root root 8 Sep 6 13:59 sbin -> usr/sbin
drwxr-xr-x 2 root root 4096 Aug 12 2015 srv
dr-xr-xr-x 13 root root 0 Jan 19 16:29 sys
drwxrwxrwt 7 root root 4096 Nov 8 19:02 tmp
-rwxrwxr-x 1 root root 282 Jan 16 15:18 train.sh
drwxr-xr-x 41 root root 4096 Nov 8 19:03 usr
drwxr-xr-x 37 root root 4096 Nov 8 19:03 var
```
Sure enough, there is no `/preprocess.sh` file.
Regarding 8022179, you were sent this message:
```
Dear Clinical Persona:
Your Submission to the Digital Mammography challenge, syn8022178 (ID 8022179), was invalid. The message is:
docker.synapse.org/syn7415408/cpi25@sha256: is not a valid Docker commit. Must be [host/]path@sha256:digest
Please direct any questions to the challenge forum, https://www.synapse.org/#!Synapse:syn4224222/discussion.
Sincerely,
Challenge Administration
```
So there was a small problem with the submission file (which you corrected in later submissions).
Neither of these two failed submissions seems related to the problem you describe. Is the problem in some other submission? Apologies if we are misunderstanding the issue. Thanks.

I am letting you know that unfortunately the problem is not resolved.
The further submissions were failed attempts to diagnose it. The only new
information is that the SELinux permissions are probably unrelated to this
issue, as the other input files to caffe have the exact same permissions and
there isn't a problem reading them.
In other words, the problem remains that caffe claims the image mean
values file does not exist, even though it is present on the filesystem.
Looking at your submission history it appears you have worked past this problem. If not, please let us know.
file access error, Express Training queue