Hello All,

I have successfully created a Docker container that functions correctly when using the "GPU" command outlined in step "5. Locally test your Docker container" of your Submission Tutorial. However, I did not write it in a way that supports CPU-only execution.

- Is CPU-only functionality a requirement?
- Is there a way to specify during container submission that the `--gpus device=0` flag should be used?

Thank you for your help,

-Richard

Created by Richard Barcus (@MrRichard)
Hey Ben @Interion,

I am just now getting around to submitting my new container using the **nvidia/cuda:11.4.1-base-ubuntu20.04** base image, so I do not know yet whether it will run in the Synapse environment (it did run in my test environment).

_Edit: The container using the base image listed above did successfully find the GPU and start. We used PyTorch for our submission; I do not know whether TensorFlow has different CUDA driver requirements._

One thing that I did notice (and I hope I am not leading you in the wrong direction) is that the container has no **/usr/local/nvidia/lib** or **/usr/local/nvidia/lib64** directories, even though the `LD_LIBRARY_PATH` environment variable points to them. There is a symbolic link at /usr/local/**cuda** which points (via alternatives) to /usr/local/cuda-11.4/ (where there are similar lib/ and lib64/ directories), which I would _expect_ to contain some version of libcublas.so.11, but it does not.

Sadly, I am no expert at TensorFlow. The only suggestion I can think of is to try another base image, such as the developer image **11.4.1-devel-ubuntu20.04** [from Docker Hub](https://registry.hub.docker.com/r/nvidia/cuda), or **tensorflow/tensorflow:latest-gpu** per [TensorFlow in Docker Containers](https://www.tensorflow.org/install/docker). If you find the correct dynamic library but in a folder that does not match `LD_LIBRARY_PATH`, you should be able to change that variable in the Dockerfile using `ENV LD_LIBRARY_PATH=/your/path/here/`.

Good luck!

-Richard
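_P.S. — a minimal sketch of that last suggestion (the devel base and the library path here are assumptions; run the `find` step against your own image to confirm where libcublas.so.11 actually lives):_

```
# Hypothetical sketch -- base image and library path are assumptions, not our exact setup
FROM nvidia/cuda:11.4.1-devel-ubuntu20.04

# Build-time sanity check: locate the cuBLAS library inside the image
RUN find /usr/local -name "libcublas.so*" 2>/dev/null || true

# Prepend the directory that actually contains libcublas.so.11
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```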
Hello,

I tried to use the following base for the Docker image, as well as the package installations specified in the Towards Data Science blog post:

```
FROM nvidia/cuda:11.4.1-base-ubuntu20.04

# installing python and pip
RUN apt-get update && apt-get install --no-install-recommends --no-install-suggests -y curl
RUN apt-get -y install unzip
RUN apt-get -y install python3
RUN apt-get -y install python3-pip

# python package installation
RUN pip3 install numpy
RUN pip3 install argparse
RUN pip3 install nibabel
RUN pip3 install tensorflow==2.6.0 tensorflow-gpu==2.6.0
RUN pip3 install cuda

# copy the model and set the entry point
COPY run_model.py /usr/local/bin/
COPY segmentation_model.h5 /usr/local/bin/

# enable executable permissions for all users
RUN chmod a+x /usr/local/bin/run_model.py

ENTRYPOINT ["python3", "/usr/local/bin/run_model.py"]
```

However, when running the actual Docker image with the GPU, it isn't able to load the cuBLAS dynamic library, and as a result it skips registering the GPU devices:

```
2021-09-15 17:28:38.587681: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
```

I'm not sure if I'm missing an installation or a command, or have something else invalid in the Dockerfile. Any help would be appreciated, thanks!

Ben
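(One quick way to confirm whether a given base image ships `libcublas.so.11` at all — a sketch, using the image tag from the Dockerfile above:)

```
docker run --rm --entrypoint /bin/bash nvidia/cuda:11.4.1-base-ubuntu20.04 \
    -c 'find / -name "libcublas.so*" 2>/dev/null; echo $LD_LIBRARY_PATH'
```

On the `-base` variant this find typically comes back empty, which matches the dlerror above; the `-runtime` and `-devel` variants ship the CUDA math libraries.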
Hello @MrRichard,

We were able to find a way to run the Docker container with the GPU and submit successfully. In our Dockerfile we did not use the Ubuntu image recommended in the tutorial (`FROM ubuntu`), but rather `FROM nvidia/cuda:11.4.1-base-ubuntu20.04`. This is the only thing we changed. I think it's a way to "force" the container to use the GPU with the correct drivers.

In another thread, it was said that the container is run with the following command (not with `--gpus device=0`):

```
docker run \
    --rm \
    --network=none \
    --runtime="nvidia" \
    -v /path/to/input:/input:ro \
    -v /path/to/output:/output:rw \
```

For more information, we read the following blog post: https://towardsdatascience.com/how-to-properly-use-the-gpu-within-a-docker-container-4c699c78c6d1

Hope it helps! Good luck,

Nguyen Quoc Duong
@vchung Hi Venera, and Sage Team,

Thanks for looking at our issue (and thank you, Quoc @shf99, for following up). I am still running into the same problem: GPU 0 is not being found in the container by PyTorch. Here is some additional context:

- Early in the process we found that CPU classification was far too time-consuming, and we opted for GPU-only classification.
- I am using PyTorch to determine whether a GPU is available, via `num_gpus = torch.cuda.device_count()`. If no devices are found, the process throws an error and stops (see the sketch after this post).
- Based on the container testing instructions (using `--gpus device=0` at docker run), I set `CUDA_VISIBLE_DEVICES=0`.
- I have tested on two different machines with different NVIDIA GPUs using the "GPU" Docker local test command you provided, and both produce nii.gz images.

Questions:

- Will the GPU device number be consistent for the container in the Synapse environment? I suspect that by limiting `CUDA_VISIBLE_DEVICES=0`, if that GPU device is not available in your CWL/toil config (e.g., if the GPU is mapped to an index > 0), PyTorch will not see any available devices.
- Do you have any suggestions for ensuring the GPU is located?

Thanks,

-Richard
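_A minimal sketch of the device check described above, with extra diagnostics that may help pinpoint why no device is visible (the log lines and error message are illustrative, not our exact code):_

```
import os
import torch

# Log what the container actually sees before failing hard.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("torch.cuda.is_available() =", torch.cuda.is_available())

num_gpus = torch.cuda.device_count()
print("torch.cuda.device_count() =", num_gpus)

if num_gpus == 0:
    # GPU-only pipeline: abort rather than fall back to a CPU run that would time out.
    raise RuntimeError("No CUDA devices visible inside the container")

device = torch.device("cuda:0")
```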
@vchung Hello,

We cannot run the container on CPU, because it takes too long (timeout). The docker run using the `--gpus device=0` flag functions correctly on our side, but when we remove that flag, we see the same issue in the logs: it cannot find the GPU.

Same question as @MrRichard: is there a way to specify during container submission that the `--gpus device=0` flag should be used?

Thank you!

Quoc Duong
Hi @MrRichard,

When we run your Docker submission, we will be running it with the GPU specified. Hope this helps!
