Hello,
I am getting the following error when running the MLCube test on my system. It passes all the sanity checks but fails when computing the metrics. Please let me know if you have any advice.
Anita
```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
```
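(Note for anyone hitting the same traceback: it usually means the process inside the container cannot see the host's NVIDIA driver. A quick way to check whether Docker itself can reach the GPU, using a stock CUDA image such as the one mentioned later in this thread, is the following.)

```bash
# If this prints the usual nvidia-smi table, the host driver and Docker's
# GPU passthrough are fine, and the problem lies in the image or in the
# MLCube configuration (e.g. missing gpu_args).
docker run --rm --gpus=all nvidia/cuda:11.7.1-base-ubuntu22.04 nvidia-smi
```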
@sebq,
It would be great if you could use these specific versions, but I will defer your question over to @ujjwalbaid to be sure!

Thank you! So do we need to have a dockerfile matching these specific versions, or is it okay to have, for example, `FROM nvidia/cuda:11.7.1-base-ubuntu22.04` in the dockerfile, which is a more recent CUDA version?

@sebq,
We will be using cloud compute instances to run your MLCubes, which have the following specs:
* NVIDIA V100
* Driver Version: 510.47.03
* CUDA Version: 11.0
EDIT: fixed a typo.

Ok, thank you! So I take it that every team can have a dockerfile with their specific OS, NVIDIA driver version, etc.?
Hi @sebq
Right! Your docker image was not properly configured with PyTorch+CUDA.
I will let @vchung answer your question about CUDA/driver versions.
Also a note: after you properly configure your dockerfile with CUDA, the mlcube.yaml docker section should include `--gpus=all` in `gpu_args` (see my comment above).
It seems the forum did not like my `@app.command` sign, but basically the `infer` command is the same as the one given in the MedPerf MLCube template, except that we call the `print_func()` function at the end.
Hi,
Following your guidelines, I cannot access the GPU with torch. This is my dockerfile:
"
FROM python:3.9.16-slim
COPY ./requirements.txt /mlcube_project/requirements.txt
RUN pip3 install --no-cache-dir -r /mlcube_project/requirements.txt
ENV LANG C.UTF-8
COPY . /mlcube_project
ENTRYPOINT ["python3", "/mlcube_project/mlcube.py"]
"
My mlcube.yaml file is the same as the example you posted, except for the image name.
The code is just the following.

mlcube.py:
"
import typer
from dummy import print_func
app = typer.Typer()
@app.command("infer")
def infer(
data_path: str = typer.Option(..., "--data_path"),
parameters_file: str = typer.Option(..., "--parameters_file"),
output_path: str = typer.Option(..., "--output_path"),
# Provide additional parameters as described in the mlcube.yaml file
# e.g. model weights:
# weights: str = typer.Option(..., "--weights"),
):
# Modify the infer command as needed
print_func()
@app.command("hotfix")
def hotfix():
# NOOP command for typer to behave correctly. DO NOT REMOVE OR MODIFY
pass
if __name__ == "__main__":
app()
"
dummy.py:
```python
import torch


def print_func():
    print("Hello World")
    print(torch.cuda.is_available())
```
requirements.txt:
```
typer
torch
```
The container image builds fine with `mlcube configure -Pdocker.build_strategy=always`. However, when running the `mlcube run --task infer --data_path=.....` command, I get this output:
```
Hello World
False
```
So following your method, we don't have access to the GPU. I want to emphasize that my machine has a CUDA-enabled GPU, which I can see when running the `nvidia-smi` command, and I can use torch with the GPU on the host. Only when running the docker image created via MLCube can I not access the GPU, even though I run the docker image on that same machine. Is there something I am missing? A lot of teams have code that only runs on GPU, so we need access to the GPU from the docker image.
It seems like we need to build from an NVIDIA CUDA docker image to have all the NVIDIA drivers installed properly: https://towardsdatascience.com/a-complete-guide-to-building-a-docker-image-serving-a-machine-learning-system-in-production-d8b5b0533bde
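For illustration, a CUDA-based variant of the dockerfile above might look like the sketch below. The base tag follows the example from earlier in the thread, and the apt/pip setup is an assumption, not a verified recipe.

```dockerfile
# Assumed sketch: a CUDA runtime base instead of python:3.9.16-slim, so the
# container ships the NVIDIA user-space libraries alongside the Python stack.
FROM nvidia/cuda:11.7.1-base-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
COPY ./requirements.txt /mlcube_project/requirements.txt
RUN pip3 install --no-cache-dir -r /mlcube_project/requirements.txt
ENV LANG C.UTF-8
COPY . /mlcube_project
ENTRYPOINT ["python3", "/mlcube_project/mlcube.py"]
```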
However, this raises a question: say I find a CUDA version that works on my laptop, the host machine on which I am testing my container. How will the organizers of the BraTS challenge provide an environment that fits every team's requirements for OS, NVIDIA drivers, etc.?
Could you clarify? Thank you very much.
Hi @anitakriz
Please make sure you include `--gpus=all` in `gpu_args` inside the `mlcube.yaml` file, and that `accelerator_count` is `1`, as shown in the example below:
```yaml
platform:
  accelerator_count: 1

docker:
  # Image name
  image: mlcommons/mock-model-brats:0.0.0
  # Docker build context relative to $MLCUBE_ROOT. Default is `build`.
  build_context: "../project"
  # Docker file name within docker build context, default is `Dockerfile`.
  build_file: "Dockerfile"
  gpu_args: --gpus=all
```
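As far as I understand the MLCube docker runner, `gpu_args` is simply appended to the underlying `docker run` invocation, so with the configuration above the task launch is roughly equivalent to the following (the volume mounts are illustrative, not the runner's exact paths):

```bash
# Rough equivalent of what the docker runner executes once gpu_args is set;
# actual mounts are derived from the tasks section of mlcube.yaml.
docker run --gpus=all \
  -v /host/workspace/data:/container/data \
  mlcommons/mock-model-brats:0.0.0 \
  infer --data_path=/container/data
```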
If the problem persists, it's most likely that your dockerfile is not configured correctly or there is some problem with your system's driver.

Each case takes about 1 minute. Also, when I run `docker push docker.synapse.org//:`, the system says `denied: requested access to the resource is denied`. If you know how to solve this problem, please tell me.
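Regarding the `docker push` error: `denied: requested access to the resource is denied` from docker.synapse.org usually means you are not logged in to that registry, or the repository path does not match your Synapse project (if I remember the Synapse workflow correctly, images must be tagged as docker.synapse.org/<project synID>/<repo>:<tag>). Logging in first may help:

```bash
# Authenticate against the Synapse Docker registry before pushing
docker login docker.synapse.org
```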
If I use a CPU, my inference time will be significantly higher. Just wondering, how much time does it take for your model to infer? I solved this problem using a CPU, but I would also like to know how to run the code using the GPU.