We have used pytorch==1.2.0 and torchvision==0.4.0 in our Dockerfile. Our run.sh file is this:

```
#!/bin/bash
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64
python /usr/local/bin/predict_model.py
```

We are getting the following error on submission:

```
INFO:  Could not find any nv binaries on this host!
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```

Basically, it is not able to detect CUDA. Can someone help?

Created by drago8055
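For reference, the RuntimeError above comes from loading a checkpoint that was saved on a GPU while `torch.cuda.is_available()` is False. A minimal sketch of a more defensive load, assuming a hypothetical checkpoint path and model class (neither is from the actual submission), looks roughly like this:

```python
import torch

# Hypothetical checkpoint path; replace with the real one used by predict_model.py.
CHECKPOINT_PATH = "model_checkpoint.pth"

# Use the GPU when the driver is actually visible, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# map_location remaps tensors that were saved on a CUDA device onto `device`,
# which avoids the "Attempting to deserialize object on a CUDA device" error.
state_dict = torch.load(CHECKPOINT_PATH, map_location=device)

# model = MyModel()            # hypothetical model class
# model.load_state_dict(state_dict)
# model.to(device)
# model.eval()
```

Note that this only makes the script degrade gracefully on a CPU-only host; the underlying issue in this thread is that the GPU should have been visible in the first place.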
I have two guesses: 1 - building on Windows is causing some unexpected permissions issue when we try to run the container on our end; 2 - a Docker update could help, although your version is not very old. Beyond that, I am unsure what could be causing this. The only other suggestion I have is to start an AWS EC2 instance (or one from another cloud provider) with Docker installed and try to build there, to figure out whether it is the system you are using to build.
Hi, docker version:

```
Client:
 Version:           19.03.1
 API version:       1.40
 Go version:        go1.12.7
 Git commit:        74b1e89e8a
 Built:             Wed Jul 31 15:18:18 2019
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea838
  Built:            Wed Nov 13 07:28:45 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```

docker build --no-cache -t docker.synapse.org/syn21696204/pred_model-1:test .

```
Sending build context to Docker daemon  3.072kB
Step 1/4 : FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
 ---> a10c611c2731
Step 2/4 : COPY run.sh /run.sh
 ---> 866b576f3853
Step 3/4 : RUN chmod 775 /run.sh
 ---> Running in efc07f089a87
Removing intermediate container efc07f089a87
 ---> 273b9c5fc940
Step 4/4 : ENTRYPOINT ["/bin/bash", "/run.sh"]
 ---> Running in 7e968d5aa426
Removing intermediate container 7e968d5aa426
 ---> f1b301223960
Successfully built f1b301223960
Successfully tagged docker.synapse.org/syn21696204/pred_model-1:test
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.
```
Thanks! This is really weird. I can build the exact same container and it works fine. However, if I take the container that you submitted, retag it, upload it to Synapse, and then submit it, I can reproduce your error. This suggests to me that it is something about how the container is being built on your end. On the machine that you are using to build the container, can you please run the following commands:

`docker version`

`docker build --no-cache -t docker.synapse.org//:test .` (as you did before)

and paste the entire output of both commands so that I can see if they are informative?
Submitted as per the instructions. Kindly check the logs.

**For ref:** /syn21696204/pred_model-1:test

Thanks, Pranay.
Hi, As noted earlier in this thread:

>As noted before, the "cannot find nv binaries on this host" issue is a misleading warning thrown by Singularity and does not reflect the actual configuration of the challenge scoring harness.

Please ignore this message; it does not mean that the GPU is not working. My suggestion would be to build a container using these two scripts exactly as written here so that we can try to get some diagnostic information in the logs:

Dockerfile

```
# Get a good base docker image, pin it to a specific SHA digest to ensure reproducibility
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime

# Required for GPU: run.sh defines PATHs to find GPU drivers, see run.sh for specific commands
COPY run.sh /run.sh
RUN chmod 775 /run.sh

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```

run.sh

```
#!/bin/bash
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64

nvidia-smi
python -c 'import torch; print(torch.rand(2,3).cuda())'
```

In addition, please build using the --no-cache flag:

`docker build --no-cache -t docker.synapse.org//:test .`

Then `docker push docker.synapse.org//:test` and submit the file to the fast lane, being sure to submit the one tagged "test".

I just ran a container using this exact config and it worked fine (please note, *it will still be an invalid container because it does not generate a prediction file*), so I'm at a bit of a loss without more information. Hopefully, by including the `nvidia-smi` and `python -c 'import torch; print(torch.rand(2,3).cuda())'` commands, we can get some informative errors in your logs.

Best, Robert
We have tried on Windows as well as Linux. We have also updated Docker, but we are still getting "Error: Could not find any nv binaries on this host!" and are not able to make a submission. Could you guide us on anything else we can try? We have searched but couldn't find much that resolves it. Help is much appreciated. Thanks.
Hi there, I built a container using the scripts you pasted above and it works fine; my std_out.txt looks like this:

```
tensor([[0.7165, 0.5863, 0.1046],
        [0.8270, 0.4540, 0.3492]], device='cuda:0')
```

Can you make sure you correctly built and pushed the container using the scripts above? Can you also make sure you are running the latest stable version of Docker? What OS are you building on?
Hi, We used the exact files to build and push the container to the Challenge queue.

-> Dockerfile

```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime

# Required for GPU: run.sh defines PATHs to find GPU drivers, see run.sh for specific commands
COPY run.sh /run.sh
RUN chmod 775 /run.sh

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```

-> run.sh

```
#!/bin/bash
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64

python -c 'import torch; print(torch.rand(2,3).cuda())'
```

[Kindly take a look at the logs.](https://drive.google.com/drive/folders/15Ww0BFeTFO-_Uyio2lwozfC16T8VipQK?usp=sharing)

Help is much appreciated. Thanks.
Hi there, If your container did not work, the submission is not counted against your quota, so you should still have 3 submissions remaining for the current cycle.

I'm not sure that your install of torch is configured correctly to use GPUs in the challenge pipeline. As noted before, the "cannot find nv binaries on this host" issue is a misleading warning thrown by Singularity and does not reflect the actual configuration of the challenge scoring harness.

My suggestion is that, instead of manually installing pytorch and cuda, you use a base docker container from pytorch. I was able to get this one to work on the challenge queue:

`FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime`

To test whether this configuration works, I created a dummy container that does not produce valid prediction files, but does have pytorch:

```
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime

# Required for GPU: run.sh defines PATHs to find GPU drivers, see run.sh for specific commands
COPY run.sh /run.sh
RUN chmod 775 /run.sh

# Required: define an entrypoint. run.sh will run the model for us, but in a different configuration
# you could simply call the model file directly as an entrypoint
ENTRYPOINT ["/bin/bash", "/run.sh"]
```

The run file (run.sh) just runs the following commands:

```
#!/bin/bash
# Required to use GPU: Copy and paste these path exports to your run.sh file so that the NVIDIA drivers
# are available. Alternatively, you can define these system-wide in the Dockerfile
export CUDA_HOME=/cm/local/apps/cuda/libs/current
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
PATH=${CUDA_HOME}/bin:${PATH}
export PATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/cm/shared/apps/cuda10.0/toolkit/10.0.130/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/share/apps/rc/software/cuDNN/7.6.2.24-CUDA-10.1.243/lib64

# Command to test gpu
nvidia-smi
python -c 'import torch; print(torch.rand(2,3).cuda())'
```

Those final two commands are the only things that return information. When we submit this container on the Challenge queue, we can see that the container is able to access the GPU and use it to create a random tensor:

```
Wed May 6 13:52:48 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tensor([[0.3334, 0.6368, 0.5115],
        [0.2717, 0.0904, 0.9310]], device='cuda:0')
```

So, this seems to work. Therefore, I'd recommend building your docker container with `FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime` as the base container, instead of installing pytorch manually, to avoid the need to troubleshoot the configuration.
I'd also suggest including the commands:

```
nvidia-smi
python -c 'import torch; print(torch.rand(2,3).cuda())'
```

somewhere in your run file to help you diagnose any further issues. Unfortunately, my experience with pytorch is limited, so I hope this suggestion helps you!
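A slightly more verbose version of those diagnostics can go in a small Python script that run.sh calls before the model, so the log states explicitly whether the GPU is visible. This is only a sketch; the file name `check_gpu.py` is made up and is not part of the challenge templates:

```python
# check_gpu.py -- hypothetical diagnostic script, call it from run.sh before the model runs.
import torch

print("torch version: ", torch.__version__)
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Confirm we can actually talk to the device and run a trivial kernel on it.
    print("device name:   ", torch.cuda.get_device_name(0))
    print(torch.rand(2, 3).cuda())
else:
    print("No GPU visible; check the LD_LIBRARY_PATH exports / driver mounts in run.sh")
```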
Continuing the above reply from @drago8055: we also didn't install any CUDA toolkit in our container. Apart from this, we have one request. If possible, could we get our submissions back? Everything was working on our system, but because of the CUDA problem we lost this week's submissions: we tried submitting many times and consistently got the "cannot find nv binaries on this host" error, along with "CUDA not available" in our last submission this week.
We actually installed the pytorch and torchvision versions compatible with CUDA 10.0 only:

```
# CUDA 10.0
pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
```

Ref: https://pytorch.org/get-started/previous-versions/

Thanks.
Which cudatoolkit do you have installed in your container, if any? I am not familiar with using pytorch, but the docs indicate different installs based on the CUDA version. We have drivers for CUDA 10.0 available:

`conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch`

The "cannot find nv binaries on this host" message is a little confusing but is an irrelevant error message. Singularity throws this warning because it cannot find the binaries on the cluster, but we are mounting the GPU drivers in manually at runtime. We've tested this configuration with tensorflow and can successfully utilize the GPU even when this warning is thrown.
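One way to answer the toolkit question from inside the container itself is to print the versions the installed torch wheel was built against; if the wheel targets a newer CUDA than the drivers the platform mounts in, `torch.cuda.is_available()` can come back False even though a GPU is present. A minimal sketch using standard torch introspection attributes:

```python
import torch

# Versions the installed wheel was compiled against (not the host driver version).
print("torch:        ", torch.__version__)
print("CUDA toolkit: ", torch.version.cuda)            # e.g. '10.0' for the CUDA 10.0 wheels
print("cuDNN:        ", torch.backends.cudnn.version())
```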
