Hello All,

I have successfully created a Docker container that functions correctly when using the "GPU" command outlined in step "5. Locally test your Docker container" of your Submission Tutorial. However, I did not write it in a way that supports CPU-only execution.

- Is CPU-only functionality a requirement?
- Is there a way to specify during container submission that the `--gpus device=0` flag should be used?

Thank you for your help,

-Richard

Created by Richard Barcus (@MrRichard)
Hey Ben @Interion,

I am just now getting around to submitting my new container using the **nvidia/cuda:11.4.1-base-ubuntu20.04** base image, so I do not know yet whether it will run in the Synapse environment (it did run in my test environment).

_Edit: The container using the base image listed above did successfully find the GPU and start. We used PyTorch for our submission; I do not know whether TensorFlow has different CUDA driver requirements._

One thing that I did notice (and I hope I am not leading you in the wrong direction) is that the container has no **/usr/local/nvidia/lib** or **/usr/local/nvidia/lib64** directories, even though the `LD_LIBRARY_PATH` environment variable points to them. There is a symbolic link at /usr/local/**cuda** which points (via alternatives) to /usr/local/cuda-11.4/ (where there are similar lib/ and lib64/ directories), which I would _expect_ to contain some version of libcublas.so.11, but it does not.

Sadly, I am no expert at TensorFlow. The only suggestion I can think of is to try another base image, such as the developer image **11.4.1-devel-ubuntu20.04** [from Docker Hub](https://registry.hub.docker.com/r/nvidia/cuda), or **tensorflow/tensorflow:latest-gpu** per [TensorFlow in Docker Containers](https://www.tensorflow.org/install/docker). If you find the correct dynamic library but in a folder that does not match `LD_LIBRARY_PATH`, you should be able to change that variable in the Dockerfile using `ENV LD_LIBRARY_PATH=/your/path/here/`.

Good luck!

-Richard
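_P.S. — a minimal sketch of that last suggestion (the devel base and the library path here are assumptions; run the `find` step against your own image to confirm where libcublas.so.11 actually lives):_

```
# Hypothetical sketch -- base image and library path are assumptions, not our exact setup
FROM nvidia/cuda:11.4.1-devel-ubuntu20.04

# Build-time sanity check: locate the cuBLAS library inside the image
RUN find /usr/local -name "libcublas.so*" 2>/dev/null || true

# Prepend the directory that actually contains libcublas.so.11
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```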
Hello,

I tried to use the following base for the Docker image, as well as the package installations specified in the Towards Data Science blog post:

```
FROM nvidia/cuda:11.4.1-base-ubuntu20.04

# installing python and pip
RUN apt-get update && apt-get install --no-install-recommends --no-install-suggests -y curl
RUN apt-get -y install unzip
RUN apt-get -y install python3
RUN apt-get -y install python3-pip

# python package installation
RUN pip3 install numpy
RUN pip3 install argparse
RUN pip3 install nibabel
RUN pip3 install tensorflow==2.6.0 tensorflow-gpu==2.6.0
RUN pip3 install cuda

# copy the model and set the entry point
COPY run_model.py /usr/local/bin/
COPY segmentation_model.h5 /usr/local/bin/

# enable executable permissions for all users
RUN chmod a+x /usr/local/bin/run_model.py

ENTRYPOINT ["python3", "/usr/local/bin/run_model.py"]
```

However, when running the actual Docker image with the GPU, it isn't able to load the cuBLAS dynamic library, and as a result it skips registering the GPU devices:

```
2021-09-15 17:28:38.587681: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
```

I'm not sure if I'm missing an installation or a command, or have something else invalid in the Dockerfile. Any help would be appreciated, thanks!

Ben
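(One quick way to confirm whether a given base image ships `libcublas.so.11` at all — a sketch, using the image tag from the Dockerfile above:)

```
docker run --rm --entrypoint /bin/bash nvidia/cuda:11.4.1-base-ubuntu20.04 \
    -c 'find / -name "libcublas.so*" 2>/dev/null; echo $LD_LIBRARY_PATH'
```

On the `-base` variant this find typically comes back empty, which matches the dlerror above; the `-runtime` and `-devel` variants ship the CUDA math libraries.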
Hello @MrRichard,

We were able to find a way to run the Docker container with the GPU and submit successfully. In our Dockerfile we did not use the Ubuntu image recommended in the tutorial (`FROM ubuntu`), but rather `FROM nvidia/cuda:11.4.1-base-ubuntu20.04`. This is the only thing we changed. I think it's a way to "force" the container to use the GPU with the correct drivers.

In another thread, it was said that the container is run with the following command (not with `--gpus device=0`):

```
docker run \
    --rm \
    --network=none \
    --runtime="nvidia" \
    -v /path/to/input:/input:ro \
    -v /path/to/output:/output:rw \
```

For more information, we read the following blog post: https://towardsdatascience.com/how-to-properly-use-the-gpu-within-a-docker-container-4c699c78c6d1

Hope it helps! Good luck,

Nguyen Quoc Duong
@vchung Hi Venera, and Sage Team,

Thanks for looking at our issue (and thank you, Quoc @shf99, for following up). I am still running into the same problem: GPU 0 is not being found in the container by PyTorch. Here is some additional context:

- Early in the process we found that CPU classification was far too time-consuming, and we opted for GPU-only classification.
- I am using PyTorch to determine whether a GPU is available, via `num_gpus = torch.cuda.device_count()`. If no devices are found, the process throws an error and stops (see the sketch after this post).
- Based on the container testing instructions (using `--gpus device=0` at docker run), I set `CUDA_VISIBLE_DEVICES=0`.
- I have tested on two different machines with different NVIDIA GPUs using the "GPU" Docker local test command you provided, and both produce nii.gz images.

Questions:

- Will the GPU device number be consistent for the container in the Synapse environment? I suspect that by limiting `CUDA_VISIBLE_DEVICES=0`, if that GPU device is not available in your CWL/toil config (e.g., if the GPU is mapped to an index > 0), PyTorch will not see any available devices.
- Do you have any suggestions for ensuring the GPU is located?

Thanks,

-Richard
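_A minimal sketch of the device check described above, with extra diagnostics that may help pinpoint why no device is visible (the log lines and error message are illustrative, not our exact code):_

```
import os
import torch

# Log what the container actually sees before failing hard.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("torch.cuda.is_available() =", torch.cuda.is_available())

num_gpus = torch.cuda.device_count()
print("torch.cuda.device_count() =", num_gpus)

if num_gpus == 0:
    # GPU-only pipeline: abort rather than fall back to a CPU run that would time out.
    raise RuntimeError("No CUDA devices visible inside the container")

device = torch.device("cuda:0")
```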
@vchung Hello,

We cannot run the container on CPU, because it takes too long (timeout). The docker run using the `--gpus device=0` flag functions correctly on our side, but when we remove that flag, we see the same issue in the logs: it cannot find the GPU.

Same question as @MrRichard: is there a way to specify during container submission that the `--gpus device=0` flag should be used?

Thank you!

Quoc Duong
Hi @MrRichard,

When we run your Docker submission, we will be running it with the GPU specified. Hope this helps!
