Dear organizers,

I can't run caffe in the Express training queue. The program seems unable to find the image mean file.

The setup is this: I have a simple script which runs caffe using a solver proto file, as usual. The solver file refers to a training proto file, which in turn specifies the mean image file:

    mean_file: "/preprocessedData/mean_train.binaryproto"

The problem is that caffe crashes with a file-not-found error:

    STDERR: I0117 00:41:50.388388 14 data_transformer.cpp:25] Loading mean file from: /preprocessedData/mean_train.binaryproto
    STDERR: F0117 00:41:50.389724 17 db_lmdb.hpp:15] Check failed: mdb_status == 0 (2 vs. 0) No such file or directory

However, I ran `ls` on the file just before and just after the caffe command, and it reports that the file is there:

    -rw-r--r--. 1 root root 618362 Jan 16 19:00 /preprocessedData/mean_train.binaryproto

I also noticed that the file has an SELinux context on it:

    -rw-r--r--. root root system_u:object_r:unlabeled_t:s0 /preprocessedData/mean_train.binaryproto

I tried to disable SELinux, but the `setenforce` command isn't available. I can't think of any other reason for this. Is there anything you can do to help diagnose or correct the situation? Most perplexing.

Thanks,
Ljubomir

Created by Ljubomir Buturovic ljubomir_buturovic
Bruce, just to let you know, I found a workaround for this problem.

Regards,
Ljubomir
Thank you for looking into this. I am searching for a workaround, but it would surely be preferable to have this fixed.

There are many submissions in the above list which exhibit this behavior. They all completed, but caffe crashed, as I explained. The submissions differ by the amount of information printed out. An example is Submission Id 8032703. See line 1868 onward in the log file:

    STDERR: I0117 15:52:54.454335 18 data_transformer.cpp:25] Loading mean file from: /preprocessedData/mean_train.binaryproto
    STDERR: F0117 15:52:54.455720 21 db_lmdb.hpp:15] Check failed: mdb_status == 0 (2 vs. 0) No such file or directory
    STDERR: *** Check failure stack trace: ***
    STDERR: @ 0x7ff069fb4e6d (unknown)
    STDERR: @ 0x7ff069fb6ced (unknown)
    STDERR: @ 0x7ff069fb4a5c (unknown)
    STDERR: @ 0x7ff069fb763e (unknown)
    STDERR: @ 0x7ff06accfd31 caffe::db::LMDB::Open()
    STDERR: @ 0x7ff06ab5f2e4 caffe::DataReader::Body::InternalThreadEntry()
    STDERR: @ 0x7ff06ab6f9ff caffe::InternalThread::entry()
    STDERR: @ 0x7ff06a61924a (unknown)
    STDERR: @ 0x7ff06a1e0dc5 start_thread
    STDERR: @ 0x7ff05927bced __clone
    STDERR: / line 23: 18 Aborted (core dumped) caffe train --solver /cpi_caffe_solver.prototxt
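One thing that may be worth checking: the fatal check in the trace above is raised in `db_lmdb.hpp`, inside `caffe::db::LMDB::Open()`, not in `data_transformer.cpp`, so the path that cannot be found might be an LMDB `source` rather than the `mean_file`. A minimal sketch for listing every path the training prototxt references and flagging the ones that do not exist is below. The helper names are hypothetical, and it assumes paths appear as quoted `mean_file:` or `source:` values; it is a plain text scan, not a real protobuf parser.

```python
import os
import re

# Matches prototxt fields such as  mean_file: "/path"  or  source: '/path'.
# Assumption: only these two field names carry filesystem paths.
PATH_FIELD = re.compile(r"\b(mean_file|source)\s*:\s*[\"']([^\"']+)[\"']")


def referenced_paths(prototxt_text):
    """Return (field, path) pairs for every mean_file/source entry found."""
    return PATH_FIELD.findall(prototxt_text)


def missing_paths(prototxt_text):
    """Return the (field, path) pairs whose path does not exist on disk."""
    return [
        (field, path)
        for field, path in referenced_paths(prototxt_text)
        if not os.path.exists(path)
    ]
```

Running `missing_paths(open(train_prototxt).read())` inside the container just before the `caffe train` command, where `train_prototxt` is the path to the training proto file, would show whether any `source` directory is absent at that point.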
I'm trying to determine which submission encountered the reported issue. In the list above, the submissions that did not complete successfully are 8028591 and 8022179.

Regarding 8028591, our records show that the error was:

```
invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"exec: \\\\\\\\\\\\\\\"/\\\\\\\\\\\\\\\": stat / no such file or directory\\\\\\\"\\\\n\\\"
```

This suggests that there is no `/` in the submitted preprocessing image. The preprocessing image is:

```
```

I ran:

```
docker run --rm ls -al /
total 80
drwxr-xr-x 63 root root 4096 Jan 19 16:29 .
drwxr-xr-x 63 root root 4096 Jan 19 16:29 ..
-rwxr-xr-x 1 root root 0 Jan 19 16:29 .dockerenv
-rw-r--r-- 1 root root 18307 Sep 6 14:02 anaconda-post.log
lrwxrwxrwx 1 root root 7 Sep 6 13:59 bin -> usr/bin
drwxr-xr-x 5 root root 360 Jan 19 16:29 dev
drwxr-xr-x 83 root root 4096 Jan 19 16:29 etc
drwxr-xr-x 2 root root 4096 Aug 12 2015 home
lrwxrwxrwx 1 root root 7 Sep 6 13:59 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Sep 6 13:59 lib64 -> usr/lib64
drwx------ 2 root root 4096 Sep 6 13:59 lost+found
drwxr-xr-x 2 root root 4096 Aug 12 2015 media
drwxr-xr-x 2 root root 4096 Aug 12 2015 mnt
drwxr-xr-x 7 root root 4096 Nov 8 19:03 opt
dr-xr-xr-x 148 root root 0 Jan 19 16:29 proc
dr-xr-x--- 6 root root 4096 Nov 3 02:08 root
drwxr-xr-x 11 root root 4096 Nov 8 19:02 run
lrwxrwxrwx 1 root root 8 Sep 6 13:59 sbin -> usr/sbin
drwxr-xr-x 2 root root 4096 Aug 12 2015 srv
dr-xr-xr-x 13 root root 0 Jan 19 16:29 sys
drwxrwxrwt 7 root root 4096 Nov 8 19:02 tmp
-rwxrwxr-x 1 root root 282 Jan 16 15:18
drwxr-xr-x 41 root root 4096 Nov 8 19:03 usr
drwxr-xr-x 37 root root 4096 Nov 8 19:03 var
```

Sure enough, there is no / file.

Regarding 8022179, you were sent this message:

```
Dear Clinical Persona:

Your Submission to the Digital Mammography challenge, syn8022178 (ID 8022179), was invalid. The message is:

is not a valid Docker commit. Must be [host/]path@sha256:digest

Please direct any questions to the challenge forum,!Synapse:syn4224222/discussion.

Sincerely,
Challenge Administration
```

So there was a small problem with the submission file (which you corrected in later submissions).

Neither of these two failed submissions seems related to the problem you describe. Is the problem in some other submission? Apologies if we are misunderstanding the issue.
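For anyone hitting the same "not a valid Docker commit" message, here is a rough sketch of a local check against the `[host/]path@sha256:digest` shape before submitting. The regular expression is only an approximation of that shape, not the validator the challenge infrastructure actually uses, and the function name is hypothetical. It relies on a SHA-256 digest being 64 hexadecimal characters.

```python
import re

# Approximate pattern for "[host/]path@sha256:digest".
# Assumption: lowercase repository path segments and a 64-hex-char digest.
COMMIT_RE = re.compile(
    r"(?:[\w.\-]+(?::\d+)?/)?"            # optional host, optionally with a port
    r"[a-z0-9._\-]+(?:/[a-z0-9._\-]+)*"   # repository path
    r"@sha256:[0-9a-f]{64}"               # digest
)


def looks_like_docker_commit(ref):
    """Return True if ref has the [host/]path@sha256:digest shape."""
    return COMMIT_RE.fullmatch(ref.strip()) is not None
```

A tag-style reference such as `team/model:latest` would be rejected by this check, since it carries no `@sha256:` digest.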
Thanks. I am letting you know that unfortunately the problem is not resolved. The further submissions were failed attempts to diagnose it. The only new information is that the SELinux permissions are probably unrelated to this issue, as the other input files to caffe have the exact same permissions and there isn't a problem reading them. In other words, the problem remains that caffe claims the image mean values file does not exist, even though it is present on the filesystem.
Looking at your submission history, it appears you have worked past this problem. If not, please let us know.

[Leaderboard: submission history table with columns Submission ID, Status, Status Detail, Cancel, Submitted On, Last Updated, Submitted Repository or File, File Version, Log Folder, Submitting User or Team, Model State File]

file access error, Express Training queue