Hi!
Since I had access to only 1 GPU, I used the simulation mode in my Flower setup. However, it is mentioned that training will take place on a cluster. I can imagine a cluster having multiple GPUs, probably with each GPU running 1 client. I would like to know what the expectation is in terms of server and client scripts. Should we write the client script so that an external function can launch each client on a separate GPU? Can you please explain?
Thank you,
Santhi
Created by Santhi Raj Kolamuri (SRajKolamuri)
Thank you Max, I will learn more about these and decide which is right given the constraints.
Santhi
Hi Santhi,
Thanks for your question!
In a cluster setup with multiple GPUs, it's common for each GPU to run one client. Your client script should be flexible enough that each client can be assigned to a specific GPU. This can be handled either by setting device IDs in the client code or through external orchestration (e.g., Docker containers or job scripts).
In the cluster setup we have, I adapt the docker run command for each container so that it specifies which GPU that container should use.
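As a minimal sketch of the device-ID approach, assuming a round-robin mapping of clients to GPUs (the helper names `gpu_for_client` and `client_env` are made up for illustration and are not part of Flower or the challenge code), a launcher could restrict each client process to one GPU through the `CUDA_VISIBLE_DEVICES` environment variable:

```python
import os


def gpu_for_client(client_id: int, num_gpus: int) -> int:
    """Map a client index to a GPU index, round-robin (an assumption, not a requirement)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return client_id % num_gpus


def client_env(client_id: int, num_gpus: int, base_env=None) -> dict:
    """Return a copy of the environment in which only the assigned GPU is visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_for_client(client_id, num_gpus))
    return env


if __name__ == "__main__":
    # Show which GPU each of 6 hypothetical clients would get on a 4-GPU node.
    for cid in range(6):
        print(cid, client_env(cid, 4, base_env={})["CUDA_VISIBLE_DEVICES"])
```

An external launcher could then start each client with something like `subprocess.Popen(["python", "client.py"], env=client_env(i, 4))`, where `client.py` is a placeholder for your own script. Inside that process only one GPU is visible, so the client code can keep using the default CUDA device without knowing which physical GPU it landed on.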
There are two main ways to handle this in Flower:
1.) Individual Docker Containers (like the [FedSurg24](https://gitlab.com/nct_tso_public/challenges/miccai2024/FedSurg24/-/tree/main?ref_type=heads) example repository): Each client and server can have its own container, with training distributed to separate GPUs. This allows isolated execution for each client.
(Note that the repository was created before I had access to the data, so its data format is not correct. Please use the version I sent you by email a few days ago.)
2.) Simulation Mode with Multiple GPUs: You can also use Flower's simulation mode, which automatically distributes clients across the available GPUs. The settings are the same whether you run on 1 GPU or on 4.
Both ways are fine for this challenge!
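To illustrate why the simulation settings do not change with the number of GPUs, here is a rough sketch, assuming a Flower version whose simulation entry point accepts a per-client resource dictionary with `num_cpus` and `num_gpus` keys (as the `client_resources` argument of `flwr.simulation.start_simulation` does). The helper names are hypothetical, and the concurrency estimate ignores CPU and memory limits:

```python
import math


def simulation_client_resources(clients_per_gpu: int, cpus_per_client: int = 2) -> dict:
    """Build a per-client resource request; each client asks for a fraction of one GPU."""
    if clients_per_gpu < 1:
        raise ValueError("clients_per_gpu must be at least 1")
    return {"num_cpus": cpus_per_client, "num_gpus": 1.0 / clients_per_gpu}


def max_concurrent_clients(num_gpus: int, resources: dict) -> int:
    """Estimate how many clients fit at once, looking at the GPU share only."""
    # The small epsilon guards against floating-point error in the division.
    return math.floor(num_gpus / resources["num_gpus"] + 1e-9)


if __name__ == "__main__":
    res = simulation_client_resources(clients_per_gpu=2)
    # The same dictionary is used on 1 GPU and on 4 GPUs; only the concurrency changes.
    print(res, max_concurrent_clients(1, res), max_concurrent_clients(4, res))
```

The per-client request stays identical on both machines. With more GPUs available, the simulation backend can simply schedule more clients at the same time.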
Let me know how you'd like to proceed, and I can assist further if needed!
Best regards,
Max