Hi all, I'm trying to run (on my local machine) a caffe-training docker container based on the provided example.
There is one minor bug in the example, concerning the $TOOLS variable defined in the "train.sh" script.
Where you have:
```
export TOOLS=./build/tools
```
You should have:
```
export TOOLS=/opt/caffe/.build_release/tools
```
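In case it is useful, here is a rough sketch of a more defensive way to set the variable: try a list of candidate directories and keep the first one that actually contains an executable `caffe`. The helper name `pick_tools_dir` is made up for illustration and is not part of the example; the two candidate paths are the ones quoted above.

```shell
# Print the first directory (from the arguments) that contains an
# executable named "caffe"; return non-zero if none does.
# NOTE: pick_tools_dir is a hypothetical helper, not part of the example.
pick_tools_dir() {
  for dir in "$@"; do
    if [ -x "$dir/caffe" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Prefer the path that works inside the container, fall back to the original.
if dir=$(pick_tools_dir /opt/caffe/.build_release/tools ./build/tools); then
  export TOOLS="$dir"
else
  echo "caffe binary not found in any candidate directory" >&2
fi
```

This only guards against a wrong path; it does nothing about missing shared libraries.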
After fixing this, I ran the container locally using "docker run" with the appropriate options (local folders with appropriate permissions are mounted as /metadata, /trainingData, /preprocessedData and /modelState). I populated these folders with data accordingly.
Once it was running, I changed /solver.prototxt to CPU mode and ran train.sh, which yielded:
```
...
solver_mode: GPU
root@8179f18b43b3:/# vi solver.prototxt
root@8179f18b43b3:/# ./train.sh
root@8179f18b43b3:/# /opt/caffe/.build_release/tools/caffe: error while loading shared libraries: libcudart.so.6.5: cannot open shared object file: No such file or directory
```
It seems that some CUDA library was not properly included in the container when I built it, even though the container was built successfully. Any hints on how to overcome this?
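One way to narrow this down from inside the container is to ask the dynamic loader which libraries it cannot resolve for the binary. The sketch below wraps `ldd` for that purpose; the function names `filter_missing` and `missing_libs` are hypothetical, and it assumes `ldd` and `awk` are available in the image.

```shell
# Keep only the library names that ldd reports as "not found".
# NOTE: filter_missing and missing_libs are hypothetical helper names.
filter_missing() {
  awk '/not found/ { print $1 }'
}

# List the unresolved shared libraries of the binary given as $1.
missing_libs() {
  ldd "$1" | filter_missing
}

# Only run the check where the binary actually exists (i.e. in the container).
if [ -x /opt/caffe/.build_release/tools/caffe ]; then
  missing_libs /opt/caffe/.build_release/tools/caffe
fi
```

If the diagnosis above is right, `libcudart.so.6.5` should appear in the list.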
-jose
PS: by the way, my local machine is running Mac OS; the container is obviously Linux-based, as per the provided example.
Created by Jose Costa Pereira josecp

Apparently there are still issues with the CUDA shared libraries. Below is the log file for a run of a preprocessing container.
It finishes manipulating the jpegs and fails while trying to create the LMDBs.
```
(output omitted)
STDOUT: Image...04zdwqm4.jpeg ...Done.
STDOUT: Image...l3zujn4s.jpeg ...Done.
STDOUT: Image...uk5fzrog.jpeg ...Done.
STDOUT: Image...h5d52sz4.jpeg ...Done.
STDOUT: Image...x0zzhjeu.jpeg ...Done.
STDOUT: done (DCM -> JPG).
STDOUT: Creating train lmdb...
STDERR: /opt/caffe/.build_release/tools/convert_imageset: error while loading shared libraries: libcudart.so.6.5: cannot open shared object file: No such file or directory
STDOUT: Creating val lmdb...
STDERR: /opt/caffe/.build_release/tools/convert_imageset: error while loading shared libraries: libcudart.so.6.5: cannot open shared object file: No such file or directory
STDOUT: LMDB creation Done.
```
It seems that the LD_LIBRARY_PATH is not properly set. Below is what the caffe example has (and what I'm using for this run).
```
$ more bashrc
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
ln /dev/null /dev/raw1394
```

Hi Thomas. No, I have not yet been able to build nvidia-docker on Mac OS. It is not clear to me whether you must have a GPU in the system for this to succeed. NVIDIA drivers, on the other hand, seem necessary.
Note that I am able to run the code I use in the containers on my non-GPU system (I just have to change solver.prototxt to set up caffe for CPU mode). Hence, as an extra verification before pushing the containers to Synapse, I would like to run them locally, which I cannot do until I get the 'nvidia-docker' wrapper to work.
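For reference, the CPU-mode change can also be scripted instead of made by hand in an editor. This is a minimal sketch, assuming the solver file contains a line of the form `solver_mode: GPU` (as in the log earlier in this thread); `set_cpu_mode` is a hypothetical helper name.

```shell
# Rewrite "solver_mode: GPU" to "solver_mode: CPU" in the file given as $1.
# Uses a temporary file rather than "sed -i" so it behaves the same with
# GNU and BSD sed. NOTE: set_cpu_mode is a hypothetical helper name.
set_cpu_mode() {
  sed 's/^\([[:space:]]*solver_mode:[[:space:]]*\)GPU/\1CPU/' "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}

# Apply it to the solver file used in this thread, if present.
if [ -f /solver.prototxt ]; then
  set_cpu_mode /solver.prototxt
fi
```

Note that this only changes the solver setting; a binary linked against CUDA still needs the CUDA libraries to be found at load time.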
I'll keep this thread updated with further developments.

@josecp Hi Jose, were you able to install nvidia-docker on your NVIDIA-equipped Mac?

The 'Other distributions' option doesn't work on Mac OS. I will retry compiling from source when I have a little bit of time but, as I mentioned earlier, I'm getting some kind of error.
I'll update this thread accordingly.

Hi Jose,
> I suppose one uses this as well for 'docker build' images that involve any GPU commands?
`nvidia-docker` provides the same commands (build, images, etc.) as `docker`, however you can still use `docker build` to build a container that needs to be run using `nvidia-docker run`.
> it seems the git repo (https://github.com/NVIDIA/nvidia-docker) doesn't have a MacOS binary.
The GitHub page of nvidia-docker has a section `Other distributions`. Can you give it a try and let me know?
Thanks!

Hello again,
it seems the git repo (https://github.com/NVIDIA/nvidia-docker) doesn't have a MacOS binary. I've been trying to run make and make install from source (after installing the NVIDIA drivers as required).
It seems that make simply creates a docker image:
```
asterix:docker josecp$ docker images
REPOSITORY                                     TAG      IMAGE ID       CREATED        SIZE
nvidia-docker                                  build    307e044573db   3 hours ago    751.2 MB
docker.synapse.org/syn7233857/caffe-training   latest   c0096551ddb1   42 hours ago   7.026 GB
...
```
I can't seem to complete 'make install', nor have I seen the binaries generated by 'make'. Suggestions are welcome!

Hi Thomas, thanks again for the prompt help.
I didn't know about the existence of this 'nvidia-docker' tool... I suppose one uses this as well for 'docker build' images that involve any GPU commands?
I don't think this should be needed when performing 'docker push' to Synapse, though? Please confirm.

Hi Jose,
You need to run `nvidia-docker run ...` instead of `docker run ...`. See https://github.com/NVIDIA/nvidia-docker
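As a rough illustration, the sketch below only prints the command line one might use, so it can be inspected before running. The four mount points are the ones mentioned at the top of this thread and the image name is taken from the `docker images` listing in this thread; the host directory `$HOME/caffe-data` and the helper name `gpu_run_cmd` are assumptions made for the example.

```shell
# Print (do not execute) an nvidia-docker command line that mounts the four
# data folders from a host directory into the container and runs /train.sh.
# NOTE: gpu_run_cmd and the host directory layout are hypothetical.
gpu_run_cmd() {
  image=$1
  host_root=$2
  printf 'nvidia-docker run --rm'
  for d in metadata trainingData preprocessedData modelState; do
    printf ' -v %s/%s:/%s' "$host_root" "$d" "$d"
  done
  printf ' %s /train.sh\n' "$image"
}

gpu_run_cmd docker.synapse.org/syn7233857/caffe-training "$HOME/caffe-data"
```

Printing instead of executing keeps the sketch harmless on machines where nvidia-docker is not installed.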
@syohan Can you update the Caffe example?
Thanks!