We have used pytorch==1.2.0 and torchvision==0.4.0 via our Dockerfile.
Our run.sh file is this:
```
#!/bin/bash
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
python /usr/local/bin/predict_model.py
```
We are getting the following error on submission:
```
INFO: Could not find any nv binaries on this host!
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```
Basically it's not able to detect CUDA. Can someone help?
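For reference, the workaround suggested by the error message itself is to pass `map_location` to `torch.load` so the checkpoint can still be deserialized when no GPU is visible. A minimal sketch, assuming a hypothetical checkpoint path `model.pth`:
```
import torch

# Hypothetical checkpoint path, for illustration only.
CHECKPOINT = "model.pth"

# Choose the device based on what is actually visible at runtime.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# map_location remaps tensors saved on a CUDA device onto `device`,
# so loading no longer crashes when torch.cuda.is_available() is False.
state = torch.load(CHECKPOINT, map_location=device)
```
Note that this only prevents the crash by falling back to the CPU; it does not explain why CUDA is not being detected in the first place.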
Created by drago8055
I have two guesses:
1 - building on Windows is causing some unexpected issue related to permissions when we try to run the container on our end...
2 - a Docker update could help, but your version is not very old.
I am unsure what could be causing this.
The only other suggestion I have is to start an AWS EC2 instance (or one from another cloud provider) with Docker installed and try to build there, to figure out whether it is the system you are using to build.
Hi,
`docker version`
```
Client:
 Version:           19.03.1
 API version:       1.40
 Go version:        go1.12.7
 Git commit:        74b1e89e8a
 Built:             Wed Jul 31 15:18:18 2019
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea838
  Built:            Wed Nov 13 07:28:45 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```
`docker build --no-cache -t docker.synapse.org/syn21696204/pred_model-1:test .`
```
Sending build context to Docker daemon 3.072kB
Step 1/4 : FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
---> a10c611c2731
Step 2/4 : COPY run.sh /run.sh
---> 866b576f3853
Step 3/4 : RUN chmod 775 /run.sh
---> Running in efc07f089a87
Removing intermediate container efc07f089a87
---> 273b9c5fc940
Step 4/4 : ENTRYPOINT ["/bin/bash", "/run.sh"]
---> Running in 7e968d5aa426
Removing intermediate container 7e968d5aa426
---> f1b301223960
Successfully built f1b301223960
Successfully tagged docker.synapse.org/syn21696204/pred_model-1:test
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.
```
Thanks!
This is really weird. I can build the exact same container and it works fine. However, if I take the container that you submitted, retag it, upload it to Synapse, and then submit it, I can reproduce your error. This suggests to me that it is something about how the container is being built on your end.
Can you please, on the machine that you are using to build the container, run the following commands:
`docker version`
`docker build --no-cache -t docker.synapse.org//:test .` (as you did before)
and paste the entire output of both of these commands so that I can see if they are informative?
Submitted as per instructions.
Kindly check the logs. **For ref:** /syn21696204/pred_model-1:test
Thanks,
Pranay.
Hi,
As noted earlier in this thread:
>As noted before, the "cannot find nv binaries on this host" issue is a misleading warning thrown by singularity and does not reflect the actual configuration of the challenge scoring harness.
Please ignore this message; it does not mean that the GPU is not working.
My suggestion would be to build a container using these two scripts exactly as written here so that we can try and get some diagnostic information in the logs:
Dockerfile
```
# Get a good base docker image, pin it to a specific SHA digest to ensure reproducibility
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
# Required for GPU: run.sh defines PATHs to find GPU drivers, see run.sh for specific commands
COPY run.sh /run.sh
RUN chmod 775 /run.sh
# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```
run.sh
```
#!/bin/bash
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
nvidia-smi
python -c 'import torch; print(torch.rand(2,3).cuda())'
```
In addition, please build using the --no-cache flag:
`docker build --no-cache -t docker.synapse.org//:test .`
Then `docker push docker.synapse.org//:test` and submit the file to the fast lane, being sure to submit the one tagged "test"
I just ran a container using this exact config and it worked fine (please note, *it will still be an invalid container because it does not generate a prediction file*), so I'm at a bit of a loss without more information. Hopefully, by including the `nvidia-smi` and `python -c 'import torch; print(torch.rand(2,3).cuda())'` commands, we can get some informative errors in your logs.
Best,
Robert
We have tried building on Windows as well as Linux. We have also updated Docker, but we are still getting "Error: Could not find any nv binaries on this host!" and we are not able to make a submission. Could you suggest anything else we can try? We have searched but couldn't find much that would resolve it.
Help is much appreciated. Thanks.
Hi there, I built a container using the scripts you pasted above and it works fine:
my std_out.txt looks like this:
```
tensor([[0.7165, 0.5863, 0.1046],
        [0.8270, 0.4540, 0.3492]], device='cuda:0')
```
Can you make sure you correctly built and pushed the container using the scripts above? Can you also make sure you are running the latest stable version of docker?
What OS are you building on?
Hi,
We used the exact files to build and push the container to the Challenge queue.
-> Dockerfile
```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
# Required for GPU: run.sh defines PATHs to find GPU drivers, see run.sh for specific commands
COPY run.sh /run.sh
RUN chmod 775 /run.sh
# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```
-> run.sh
```
#!/bin/bash
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
python -c 'import torch; print(torch.rand(2,3).cuda())'
```
[Kindly take a look at the logs.](https://drive.google.com/drive/folders/15Ww0BFeTFO-_Uyio2lwozfC16T8VipQK?usp=sharing)
Help is much appreciated. Thanks.
Hi there,
If your container did not work, the submission is not counted against your quota, so you should still have 3 submissions remaining for the current cycle.
I'm not sure that your install of torch is configured correctly to use GPUs in the challenge pipeline. As noted before, the "cannot find nv binaries on this host" issue is a misleading warning thrown by singularity and does not reflect the actual configuration of the challenge scoring harness. My suggestion is that, instead of manually installing pytorch and cuda, you use a base docker container from pytorch. I was able to get this one to work on the challenge queue: `FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime`
To test whether this configuration works, I created a dummy container that does not produce valid prediction files, but does have pytorch:
```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
# Required for GPU: run.sh defines PATHs to find GPU drivers, see run.sh for specific commands
COPY run.sh /run.sh
RUN chmod 775 /run.sh
# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```
The run file (run.sh) just runs the following commands:
```
#!/bin/bash
# Required to use GPU: Copy and paste these path exports to your run.sh file so that the NVIDIA drivers
# Are available. Alternatively, you can define these system-wide in the Dockerfile
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
PATH=${CUDA_HOME}/bin:${PATH}
export PATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
# Command to test gpu
nvidia-smi
python -c 'import torch; print(torch.rand(2,3).cuda())'
```
Those final two commands are the only things that return information. When we submit this container on the Challenge queue, we can see that the container is able to access the GPU and use it to create a random tensor:
```
Wed May 6 13:52:48 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tensor([[0.3334, 0.6368, 0.5115],
        [0.2717, 0.0904, 0.9310]], device='cuda:0')
```
So, this seems to work. Therefore, I'd recommend building your docker container with `FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime` as the base container, instead of installing pytorch manually, to avoid the need to troubleshoot the configuration. I'd also suggest including the commands:
```
nvidia-smi
python -c 'import torch; print(torch.rand(2,3).cuda())'
```
somewhere in your run file to help you diagnose any further issues.
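As a rough sketch of what a slightly more verbose diagnostic could look like (the print labels and the idea of a separate script are illustrative, not part of the official challenge templates), a short Python script called from run.sh could report which CUDA toolkit torch was built against and whether a GPU is visible:
```
import torch

# Report the torch build and whether a GPU is visible inside the container.
print("torch version:      ", torch.__version__)
print("built against CUDA: ", torch.version.cuda)
print("cuDNN version:      ", torch.backends.cudnn.version())
print("CUDA available:     ", torch.cuda.is_available())

if torch.cuda.is_available():
    print("device name:        ", torch.cuda.get_device_name(0))
    # Move a tiny tensor onto the GPU to confirm the driver/runtime handshake works.
    print(torch.rand(2, 3).cuda())
```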
Unfortunately, my experience with pytorch is limited so I hope this suggestion helps you!
Continuing the above reply from @drago8055: we also didn't install any CUDA toolkit in our container.
Apart from this, we have one request: if possible, could we get our submissions back? Everything was working on our system, but because of the CUDA problem we lost this week's submissions. We tried submitting many times, and every time we consistently got the "cannot find nv binaries on this host" error, along with "CUDA not available" in our last submission of the week. We actually installed the pytorch and torchvision versions compatible with CUDA 10.0 only.
```
# CUDA 10.0
pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
```
Ref: https://pytorch.org/get-started/previous-versions/
Thanks. Which cudatoolkit do you have installed in your container, if any?
I am not familiar with using pytorch, but the docs indicate different installs based on the CUDA version. We have drivers for CUDA 10.0 available.
`conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch`
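One quick way to answer that question from inside a running container (a hedged suggestion, not part of the official instructions) is to ask torch itself which CUDA toolkit its wheel was built against and compare it with the CUDA 10.0 drivers mentioned above:
```
import torch

# torch.version.cuda is the CUDA toolkit the installed wheel was built against
# (it is None for CPU-only builds); compare it with the cluster's CUDA 10.0 drivers.
print("torch built against CUDA:", torch.version.cuda)
```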
the "cannot find nv binaries on this host" is a little confusing but is an irrelevant error message. Singularity throws this warning because it cannot find the binaries on the cluster, but we are mounding the GPU drivers in manually at runtime. We've tested this configuration with tensorflow and can successfully utilize the GPU even when this warning is thrown.