I was able to run a preprocessing container end-to-end (using only a subset of the training data). I received a notification with the log file for the preprocessing container, which ends like this:
```
STDOUT: Creating train lmdb...
STDERR: I1014 12:36:40.231169 26 convert_imageset.cpp:86] Shuffling data
STDERR: E1014 12:36:40.290498 26 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
STDERR: E1014 12:36:40.297068 26 common.cpp:120] Cannot create Curand generator. Curand won't be available.
STDERR: I1014 12:36:40.297675 26 convert_imageset.cpp:89] A total of 2000 images.
STDERR: I1014 12:36:40.333710 26 db_lmdb.cpp:35] Opened lmdb /preprocessedData/lmdbs/dcom_train_lmdb
STDERR: I1014 12:36:44.062719 26 convert_imageset.cpp:147] Processed 1000 files.
STDERR: I1014 12:36:47.521318 26 convert_imageset.cpp:147] Processed 2000 files.
STDOUT: Creating val lmdb...
STDERR: I1014 12:36:48.136538 27 convert_imageset.cpp:86] Shuffling data
STDERR: E1014 12:36:48.139518 27 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
STDERR: E1014 12:36:48.141952 27 common.cpp:120] Cannot create Curand generator. Curand won't be available.
STDERR: I1014 12:36:48.142371 27 convert_imageset.cpp:89] A total of 80 images.
STDERR: I1014 12:36:48.142601 27 db_lmdb.cpp:35] Opened lmdb /preprocessedData/lmdbs/dcom_val_lmdb
STDERR: I1014 12:36:48.408162 27 convert_imageset.cpp:153] Processed 80 files.
STDOUT: LMDB creation Done.
STDOUT: Mean image creation Done.
```
There are still some errors, but I don't think they affect the creation of the LMDB data. I'm still trying to figure out the root cause of those STDERR messages.
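For what it's worth, one quick way to double-check the LMDBs is to open them with the lmdb Python package (it is installed via pip in the preprocessing image) and count the entries. This is only a sketch; the paths are the ones from my log:
```
# Minimal sanity check: count the entries in the generated LMDBs.
# Assumes the lmdb Python package is available and that the paths match
# the ones shown in the log above.
import lmdb

for name, path in [("train", "/preprocessedData/lmdbs/dcom_train_lmdb"),
                   ("val",   "/preprocessedData/lmdbs/dcom_val_lmdb")]:
    env = lmdb.open(path, readonly=True, lock=False)
    print("%s lmdb: %d entries" % (name, env.stat()["entries"]))
    env.close()
```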
Subsequently I got a 'Submission failed' notification with the following text:
```
Your Submission to the Digital Mammography challenge, docker.synapse.org/syn7273857/caffe-preprocess@sha256:e358adf3ab5367f39afbca76c8867d7e9826726b13aca873f9667d2070102f46, has stopped before completion. The message is:
Row id: 31 has been changed since last read. Please get the latest value for this row and then attempt to update it again.
```
I tried it for a second time, and again I got:
```
Your Submission to the Digital Mammography challenge, docker.synapse.org/syn7273857/caffe-preprocess@sha256:12bdf51b49bd0c51a100316d51a6e507f5784d7314ea1bde1720b8cbdc321205, has stopped before completion. The message is:
Row id: 38 has been changed since last read. Please get the latest value for this row and then attempt to update it again.
```
I'm sure I've read another thread in the forum about a similar error, but from what I recall there is not much I can do other than requesting that the job be restarted by the challenge admins. Can someone please shed some light?
Created by Jose Costa Pereira (josecp)
Thank you, @luli.zou, we have now updated the server. @tamthuc1995: You say
> I tried to submit my Docker image again, but still received this error. Can someone help me?
You should now be able to resubmit without encountering the error. Alternatively, if there is an existing submission you would like restarted, just ask. Hi,
I've requested a stop on my current submission so that the server change can be done.
Thanks! Edit: The following refers to the "Row id <> has been changed" problem. (Rereading the thread, it seems other discussions have overlapped with the original issue.)
Thanks for your patience. At long last a fix is in place. Deploying code fixes has been tricky because your submissions may be long-running jobs, and the 'old' code could not be updated without stopping/restarting a running job. (The ability to upgrade our code without interrupting your submission is one of several improvements we have implemented. Another improvement is that infrastructure errors such as this one will be sent to our administrators, not to you, so that corrective action can be taken.) The 'new' submission processing code is up on about half our servers. There are four ongoing submissions supervised by the 'old' system: 7399304 (mnottale), 7373662 (ArtzenLabs), 7325067 (kuriso), and 7355386 (luli.zou). When these end we will upgrade the remaining servers. All other pending (N=1) and in-progress (N=5) jobs are/will be supervised by the updated system and should not encounter this problem again.
We regret the inconvenience you have experienced but are hopeful that working through these issues will help ensure a smooth process once the competitive phase is open.
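For anyone wondering what the message itself means: it is the typical signature of an optimistic-concurrency conflict, i.e. an update is rejected because the row changed between the read and the write, and the client is expected to re-read and retry. The sketch below is purely illustrative; read_row, write_row and apply_change are hypothetical stand-ins, not the actual submission-queue code:
```
# Purely illustrative sketch of the read-check-write pattern behind a
# "Row id ... has been changed since last read" conflict. The helpers
# read_row, write_row and apply_change are hypothetical stand-ins, not
# the actual submission-queue code.
def update_with_retry(read_row, write_row, apply_change, max_attempts=5):
    for _ in range(max_attempts):
        row = read_row()                   # fetch the row plus its current version
        updated = apply_change(dict(row))  # modify a local copy
        if write_row(updated, expected_version=row["version"]):
            return updated                 # versions matched; write accepted
        # Conflict: another writer got there first; re-read and retry.
    raise RuntimeError("gave up after repeated update conflicts")
```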
Thank you. I just got hit by the same error after resubmission:
```
Row id: 2 has been changed since last read. Please get the latest value for this row and then attempt to update it again.
```
I received an email saying: "Row id: 31 has been changed since last read. Please get the latest value for this row and then attempt to update it again."
I tried to submit my Docker image again, but still received this error. Can someone help me? Apparently this Docker file is using CUDA 7.5.
I found the caffe binary at '/usr/bin/', but can't find some other important utilities. In particular, I can't find either 'compute_image_mean' or 'convert_imageset'. They used to be in '/opt/caffe/.build_release/tools/'.
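In case it helps anyone else, here is a rough sketch one could run inside the container to hunt for those binaries. It only assumes Python is available; the search roots are guesses and may need adjusting:
```
# Rough sketch for locating the Caffe command-line tools inside the image.
# Only assumes Python is available; the search roots are guesses and may
# need adjusting for a different base image.
import os

tools = {"convert_imageset", "compute_image_mean", "caffe"}
search_roots = ["/opt/caffe", "/usr/local", "/usr/bin", "/usr/lib"]

for root in search_roots:
    for dirpath, _, filenames in os.walk(root):
        for name in tools.intersection(filenames):
            print(os.path.join(dirpath, name))
```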
It would be nice to have a running Caffe example... Hi @josecp (CC @brucehoff, @tschaffter),
The following errors:
```
STDERR: E1014 12:36:48.139518 27 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
STDERR: E1014 12:36:48.141952 27 common.cpp:120] Cannot create Curand generator. Curand won't be available.
```
and
```
STDOUT: - NVIDIA -------------
STDOUT: /usr/bin/nvidia-smi
STDOUT: Failed to initialize NVML: Unknown Error
STDOUT: ----------------------
STDERR: I1017 12:22:22.264183 16 caffe.cpp:217] Using GPUs 0
STDERR: I1017 12:22:22.273149 16 caffe.cpp:222] GPU 0: ???t??????b
STDERR: F1017 12:22:22.273175 16 common.cpp:151] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
STDERR: *** Check failure stack trace: ***
STDERR: @ 0x7f62c8f2fb8d google::LogMessage::Fail()
STDERR: @ 0x7f62c8f31c8f google::LogMessage::SendToLog()
STDERR: @ 0x7f62c8f2f77c google::LogMessage::Flush()
STDERR: @ 0x7f62c8f3252d google::LogMessageFatal::~LogMessageFatal()
STDERR: @ 0x7f62c96d0af7 caffe::Caffe::SetDevice()
STDERR: @ 0x40a631 train()
STDERR: @ 0x4062f3 main
STDERR: @ 0x7f62c8160f45 (unknown)
STDERR: @ 0x406b1b (unknown)
STDERR: Aborted (core dumped)
```
are known errors, and we've narrowed down the problem to the CUDA version (6.5) being used. Please replace the dependencies section of the training/preprocessing Dockerfiles with this:
```
########################################
########## START DEPENDENCIES ##########
########################################
# Base image
FROM nvidia/caffe
RUN apt-get update && apt-get install -y \
python-dev \
python-pip
# Check how to test caffe - RUN make test
RUN pip install pydicom
RUN apt-get build-dep -y python-matplotlib
RUN pip install synapseclient
RUN pip install lmdb
RUN apt-get install -y python-opencv
# needed to avoid issues with cv2
RUN ln /dev/null /dev/raw1394
RUN pip install sklearn
# GNU parallel and imagemagick for image processing boost (and OpenCV)
RUN apt-get install imagemagick php5-imagick -y
RUN apt-get install parallel -y
RUN apt-get install gnome-session-fallback -y
RUN apt-get install python-opencv -y
RUN apt-get install libdc1394-22-dev libdc1394-22 libdc1394-utils -y
######################################
########## END DEPENDENCIES ##########
######################################
```
This should hopefully fix your problems (e.g. by installing the latest CUDA version). I will update the Caffe example shortly.
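If you want to confirm that a rebuilt image really ships a newer CUDA runtime, one option is to query cudaRuntimeGetVersion through ctypes. This is only a sketch: it assumes some libcudart shared library is loadable from Python, and the soname list is a guess:
```
# Sketch: report the CUDA runtime version bundled in the image.
# Assumes some libcudart shared library is loadable; the soname list is a
# guess and may need to be extended for other base images.
import ctypes

cudart = None
for soname in ("libcudart.so", "libcudart.so.8.0", "libcudart.so.7.5"):
    try:
        cudart = ctypes.CDLL(soname)
        break
    except OSError:
        continue

if cudart is None:
    print("could not load libcudart")
else:
    version = ctypes.c_int()
    err = cudart.cudaRuntimeGetVersion(ctypes.byref(version))
    # The version is encoded as 1000*major + 10*minor, e.g. 8000 -> CUDA 8.0.
    print("cudaRuntimeGetVersion -> error code %d, version %d" % (err, version.value))
```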
Hope this helps, and best regards,
Shivy Hi,
is there anything new about this error? I committed a preprocessing + training container. After the first one finished, I got the same error and the job was canceled. Hi @brucehoff,
Thanks! But I had to cancel the submission you restarted because I had already re-submitted the job before that. I hope there is a permanent fix soon.
Li
@josecp I have asked a colleague who was involved in creating the Caffe example to have a look.
@thefaculty I restarted submission 7355190. I had exactly the same problem. Mine looks like:
```
Row id: 37 has been changed since last read. Please get the latest value for this row and then attempt to update it again.
```
Any update regarding this error? Thx! Hi Bruce, thanks for restarting the jobs. It seems that both are complete now.
There are a couple of errors appearing in my log files that I'd like to call your attention to and ask for your help with.
1) In the pre-processing logs (the script is based on your Caffe example), there are a couple of error messages when generating the LMDBs:
```
STDOUT: Creating train lmdb...
STDERR: I1017 11:27:08.734166 23 convert_imageset.cpp:86] Shuffling data
STDERR: E1017 11:27:08.803100 23 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
STDERR: E1017 11:27:08.805604 23 common.cpp:120] Cannot create Curand generator. Curand won't be available.
STDERR: I1017 11:27:08.806165 23 convert_imageset.cpp:89] A total of 2000 images.
STDERR: I1017 11:27:08.806421 23 db_lmdb.cpp:35] Opened lmdb /preprocessedData/lmdbs/dcom_train_lmdb
STDERR: I1017 11:27:11.972836 23 convert_imageset.cpp:147] Processed 1000 files.
STDERR: I1017 11:27:15.534158 23 convert_imageset.cpp:147] Processed 2000 files.
STDOUT: Creating val lmdb...
STDERR: I1017 11:27:16.026774 24 convert_imageset.cpp:86] Shuffling data
STDERR: E1017 11:27:16.029903 24 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
STDERR: E1017 11:27:16.032135 24 common.cpp:120] Cannot create Curand generator. Curand won't be available.
STDERR: I1017 11:27:16.032538 24 convert_imageset.cpp:89] A total of 80 images.
STDERR: I1017 11:27:16.032790 24 db_lmdb.cpp:35] Opened lmdb /preprocessedData/lmdbs/dcom_val_lmdb
STDERR: I1017 11:27:16.294993 24 convert_imageset.cpp:153] Processed 80 files.
STDOUT: LMDB creation Done.
STDOUT: Mean image creation Done.
```
However, it seems that the LMDB files were generated properly; I ran ls -lah in the relevant directories and obtained the following:
```
STDOUT: - lmdb dirs -------------
STDOUT: total 9.6M
STDOUT: drwxr-xr-x. 4 root root 4.0K Oct 17 11:27 .
STDOUT: drwxrwx---. 5 root 1000 8.6M Oct 17 10:45 ..
STDOUT: drwxr--r--. 2 root root 4.0K Oct 17 11:27 dcom_train_lmdb
STDOUT: drwxr--r--. 2 root root 4.0K Oct 17 11:27 dcom_val_lmdb
STDOUT: -rw-r--r--. 1 root root 1.1M Oct 17 11:27 dm_mean.binaryproto
STDOUT: - train_lmdb ------------
STDOUT: total 508M
STDOUT: drwxr--r--. 2 root root 4.0K Oct 17 11:27 .
STDOUT: drwxr-xr-x. 4 root root 4.0K Oct 17 11:27 ..
STDOUT: -rw-r--r--. 1 root root 508M Oct 17 11:27 data.mdb
STDOUT: -rw-r--r--. 1 root root 8.0K Oct 17 11:27 lock.mdb
STDOUT: - val_lmdb --------------
STDOUT: total 21M
STDOUT: drwxr--r--. 2 root root 4.0K Oct 17 11:27 .
STDOUT: drwxr-xr-x. 4 root root 4.0K Oct 17 11:27 ..
STDOUT: -rw-r--r--. 1 root root 21M Oct 17 11:27 data.mdb
STDOUT: -rw-r--r--. 1 root root 8.0K Oct 17 11:27 lock.mdb
STDOUT: -------------
```
Any idea what could be the origin of this error?
2) In the training logs, it seems that the GPU(s) are not being detected (I ran the nvidia-smi command). Hence the "caffe train" command doesn't run (see the sketch after this post):
```
STDOUT: - NVIDIA -------------
STDOUT: /usr/bin/nvidia-smi
STDOUT: Failed to initialize NVML: Unknown Error
STDOUT: ----------------------
STDERR: I1017 12:22:22.264183 16 caffe.cpp:217] Using GPUs 0
STDERR: I1017 12:22:22.273149 16 caffe.cpp:222] GPU 0: ???t??????b
STDERR: F1017 12:22:22.273175 16 common.cpp:151] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
STDERR: *** Check failure stack trace: ***
STDERR: @ 0x7f62c8f2fb8d google::LogMessage::Fail()
STDERR: @ 0x7f62c8f31c8f google::LogMessage::SendToLog()
STDERR: @ 0x7f62c8f2f77c google::LogMessage::Flush()
STDERR: @ 0x7f62c8f3252d google::LogMessageFatal::~LogMessageFatal()
STDERR: @ 0x7f62c96d0af7 caffe::Caffe::SetDevice()
STDERR: @ 0x40a631 train()
STDERR: @ 0x4062f3 main
STDERR: @ 0x7f62c8160f45 (unknown)
STDERR: @ 0x406b1b (unknown)
STDERR: Aborted (core dumped)
```
Both my scripts "train.sh" and "preprocess.sh" are based on your caffe example with some adjustments that were made along the way.
Would you be able to provide some insight on these issues?
Your help is greatly appreciated.
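A possible workaround for point 2 while the GPU detection issue is investigated: check whether nvidia-smi can see a device and only pass --gpu to caffe train when it can. The sketch below is illustrative only; the solver path is a placeholder and it is not taken from the official example:
```
# Illustrative launcher: fall back to CPU mode when no GPU is visible.
# The solver path is a placeholder; adjust to your own setup.
import subprocess

def gpu_available():
    # nvidia-smi typically exits non-zero (e.g. "Failed to initialize NVML")
    # when it cannot reach a usable GPU/driver inside the container.
    try:
        return subprocess.call(["nvidia-smi"]) == 0
    except OSError:
        return False

cmd = ["caffe", "train", "--solver=/path/to/solver.prototxt"]
if gpu_available():
    cmd.append("--gpu=0")   # caffe's standard flag for selecting GPU 0
print("Running: " + " ".join(cmd))
subprocess.check_call(cmd)
```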
Done: 7359273, 7363803 are now pending. Yes, please restart the two old ones.
4891 is a single-container submission and is still running (it's another job altogether). Jose: It's a known problem. Until I get the fix in place, the best I can offer is to restart your job. The two submissions that encountered the problem are 7359273 and 7363803. I see you submitted a new one, 7364891. Would you like the two 'old' submissions restarted?