Hi,
I have been trying to run the challenge script on GPU. I am using one V100 GPU with 16 GB of memory, have set `CUDA_VISIBLE_DEVICES=0`, and have set `devices = 'cuda'` in the [script](https://github.com/FETS-AI/Challenge/blob/main/Task_1/FeTS_Challenge.py#L536). However, I keep encountering the following error; any idea what could be going wrong? Thanks!
Note: the script runs fine for `small_split.csv` but **NOT** for `partitioning_1.csv`.
```
Traceback (most recent call last):
File "FeTS_Challenge.py", line 568, in
restore_from_checkpoint_folder = restore_from_checkpoint_folder)
File "/root/github/Challenge/Task_1/fets_challenge/experiment.py", line 286, in run_challenge_experiment
task_runner = copy(plan).get_task_runner(collaborator_data_loaders[col])
File "/root/setup/envs/venv/lib/python3.7/site-packages/openfl/federated/plan/plan.py", line 389, in get_task_runner
self.runner_ = Plan.build(**defaults)
File "/root/setup/envs/venv/lib/python3.7/site-packages/openfl/federated/plan/plan.py", line 182, in build
instance = getattr(module, class_name)(**settings)
File "/root/setup/envs/venv/lib/python3.7/site-packages/openfl/federated/task/runner_fets_challenge.py", line 43, in __init__
model, optimizer, train_loader, val_loader, scheduler, params = create_pytorch_objects(fets_config_dict, train_csv=train_csv, val_csv=val_csv, device=device)
File "/root/setup/envs/venv/lib/python3.7/site-packages/GANDLF/compute/generic.py", line 55, in create_pytorch_objects
) = get_class_imbalance_weights(parameters["training_data"], parameters)
File "/root/setup/envs/venv/lib/python3.7/site-packages/GANDLF/utils/tensor.py", line 357, in get_class_imbalance_weights
loader_type="penalty",
File "/root/setup/envs/venv/lib/python3.7/site-packages/GANDLF/data/ImagesFromDataFrame.py", line 200, in ImagesFromDataFrame
subject.load()
File "/root/setup/envs/venv/lib/python3.7/site-packages/torchio/data/subject.py", line 368, in load
image.load()
File "/root/setup/envs/venv/lib/python3.7/site-packages/torchio/data/image.py", line 498, in load
tensor = torch.cat(tensors)
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 17856000 bytes. Error code 12 (Cannot allocate memory)
```
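Despite the GPU log below looking healthy, the failing call is `torch.cat(tensors)` inside torchio's `image.load()`, which runs entirely on the CPU, and `Error code 12` is the POSIX `ENOMEM` errno from the host allocator. A quick stdlib check confirms what that error code means (strings shown are for Linux/glibc):

```python
import errno
import os

# Error code 12 in the traceback is POSIX ENOMEM: the *host* (CPU)
# allocator failed, so GPU memory settings are not the bottleneck here.
assert errno.ENOMEM == 12

# On Linux this prints "Cannot allocate memory", matching the traceback.
print(os.strerror(errno.ENOMEM))
```

In other words, torchio stacks the image volumes in host RAM before anything is sent to the device, so the fix is more system RAM, not a different GPU configuration.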
I do see the following in the log
```
Device requested via CUDA_VISIBLE_DEVICES: 0
Total number of CUDA devices: 1
Device finally used: 0
Sending model to aforementioned device
Memory Total : 15.8 GB, Allocated: 0.3 GB, Cached: 0.3 GB
Device - Current: 0 Count: 1 Name: Tesla V100-PCIE-16GB Availability: True
```
Created by ambrish

Thanks! I was using 120 GB of CPU RAM, which it seems was insufficient; 150 GB seems to be adequate. This is a CPU memory (RAM) error.

I get the same issue with 32 GB of RAM, and it cannot be solved even by increasing the hard-disk space available for virtual paging. I can run it on 128 GB of RAM; I am not sure whether 64 GB is enough.
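Given the figures reported above (120 GB insufficient, 150 GB adequate for `partitioning_1.csv`), a minimal pre-flight check can warn before the long data-loading step fails. This is a sketch assuming Linux (it parses `/proc/meminfo`); the 150 GB threshold is taken from the replies above, not an official requirement:

```python
import os


def available_ram_gb():
    """Return available host RAM in GB by parsing /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # Value is reported in kB; convert to GB.
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")


# Threshold reported as adequate in this thread; the exact minimum
# between 120 GB and 150 GB is unknown.
REQUIRED_GB = 150

if available_ram_gb() < REQUIRED_GB:
    print(f"Warning: only {available_ram_gb():.1f} GB RAM available; "
          f"loading partitioning_1.csv may fail with errno 12 (ENOMEM)")
```

Running such a check before `run_challenge_experiment` makes the failure mode obvious up front instead of surfacing mid-way through subject loading.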