Hi,
I am having trouble confirming that changing the "epochs_per_round" value returned by the "constant_hyper_parameters" function passed to the "run_challenge_experiment" function actually has any effect.
The steps I followed:
- Using the github code on https://github.com/FETS-AI/Challenge
- Followed the installation as specified in the Readme.md file of Task 1,
- Using FeTS_Challenge.py directly, on the "small_split.csv" partitioning.
The functions given as parameters to the "run_challenge_experiment" function are therefore, as initially provided:
```
aggregation_function = weighted_average_aggregation
choose_training_collaborators = all_collaborators_train
training_hyper_parameters_for_round = constant_hyper_parameters
```
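For context on how these three policy functions get used, here is a minimal stand-in for the per-round dispatch loop (hypothetical sketch only; the real internals of `run_challenge_experiment` live in the `fets_challenge` package, and `run_rounds` is a name I made up):

```python
def constant_hyper_parameters(collaborators, db_iterator, fl_round,
                              collaborators_chosen_each_round,
                              collaborator_times_per_round):
    # same constants as in FeTS_Challenge.py
    return (5e-5, 1.0, None)

def all_collaborators_train(collaborators, db_iterator, fl_round,
                            collaborators_chosen_each_round,
                            collaborator_times_per_round):
    # every collaborator trains every round
    return collaborators

def run_rounds(n_rounds, collaborators,
               choose_training_collaborators,
               training_hyper_parameters_for_round):
    """Hypothetical stand-in: call each policy function once per round."""
    chosen_each_round, times_per_round, plan = {}, {}, []
    for fl_round in range(n_rounds):
        chosen = choose_training_collaborators(
            collaborators, None, fl_round, chosen_each_round, times_per_round)
        lr, epochs, batches = training_hyper_parameters_for_round(
            collaborators, None, fl_round, chosen_each_round, times_per_round)
        chosen_each_round[fl_round] = chosen
        plan.append((fl_round, chosen, lr, epochs, batches))
    return plan
```

The point is that the tuple returned each round is all the runner ever sees; whatever it does with `epochs` downstream is where the question below arises.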
This gives, for any round (not only round 0):
```
Collaborators chosen to train for round 0: experiment.py:396
['1', '2', '3']
INFO Hyper-parameters for round 0: experiment.py:424
learning rate: 5e-05
epochs_per_round: 1.0
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 3 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:27:18] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:06<00:00, 6.83s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.2386646866798401
Epoch Final validation dice_per_label : [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, 2.570958582744781e-13]
[14:27:25] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.2386646866798401, 'dice_per_label': [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, fets_challenge_model.py:49
2.570958582744781e-13]}
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice 0.238665 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.943770 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.007875 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.003014 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.238665 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.943770 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.007875 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.003014 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round fets_challenge_model.py:143
********************
Starting Training :
********************
Looping over training data: 0%| | 0/80 [00:00<?, ?it/s]/home/manthe/anaconda3/envs/fets2022_env/lib/python3.7/site-packages/torchio/data/queue.py:215: RuntimeWarning: Queue length (100) not divisible by the number of patches per volume (40)
warnings.warn(message, RuntimeWarning)
Looping over training data: 100%|██████████| 80/80 [00:47<00:00, 1.67it/s] Epoch Final Train loss : 1.0
Epoch Final Train dice : 0.22838935144245626
Epoch Final Train dice_per_label : [0.8640330836176873, 0.004636458929081044, 0.04488786500078277, 6.515775969446157e-12]
[14:28:13] METRIC Round 0, collaborator 3 is sending metric for task train: loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice 0.228389 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_0 0.864033 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_1 0.004636 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_2 0.044888 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for train, round 0 aggregator.py:486
METRIC Round 0, collaborator metric train result loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice: 0.228389 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_0: 0.864033 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_1: 0.004636 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_2: 0.044888 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_4: 0.000000 aggregator.py:531
[14:28:14] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:06<00:00, 6.97s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.244467630982399
Epoch Final validation dice_per_label : [0.9684193134307861, 0.00489959167316556, 0.004551596473902464, 4.3230536880302373e-13]
[14:28:21] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.244467630982399, 'dice_per_label': [0.9684193134307861, 0.00489959167316556, 0.004551596473902464, fets_challenge_model.py:49
4.3230536880302373e-13]}
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice 0.244468 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_0 0.968419 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_1 0.004900 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_2 0.004552 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for locally_tuned_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice: 0.244468 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_0: 0.968419 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_1: 0.004900 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_2: 0.004552 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
INFO All tasks completed on 3 for round 0... collaborator.py:171
INFO Collaborator 3 took simulated time: 4.6 minutes experiment.py:476
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 2 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:28:22] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:07<00:00, 7.06s/it] Epoch Final validation loss : 0.9427357912063599
Epoch Final validation dice : 0.26110994815826416
Epoch Final validation dice_per_label : [0.9405062198638916, 0.0011481185210868716, 0.045521270483732224, 0.05726420879364014]
[14:28:30] INFO 0.9427357912063599 fets_challenge_model.py:48
INFO {'dice': 0.26110994815826416, 'dice_per_label': [0.9405062198638916, 0.0011481185210868716, 0.045521270483732224, fets_challenge_model.py:49
0.05726420879364014]}
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_loss 0.942736 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice 0.261110 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.940506 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.001148 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.045521 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.057264 collaborator.py:416
INFO Collaborator 2 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 0.942736 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.261110 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.940506 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.001148 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.045521 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.057264 aggregator.py:531
INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round
...
```
This is the desired behaviour: collaborator 3 trains for 1.0 epochs on its 2 data points with 40 patches each (am I right about this?).
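For reference, here is the patch arithmetic behind the 80-iteration training bar as I understand it (the 40-patches-per-volume figure comes from the torchio warning in the trace; the 2-volumes and batch-size-1 values are my assumptions about the small_split setup):

```python
# Back-of-the-envelope check of the "80/80" training bar above.
training_volumes = 2      # assumption: collaborator 3's training data in small_split
patches_per_volume = 40   # from the torchio RuntimeWarning in the trace
batch_size = 1            # assumption

iters_per_epoch = training_volumes * patches_per_volume // batch_size
assert iters_per_epoch == 80  # matches the tqdm total for epochs_per_round = 1.0

# With epochs_per_round = 2.0 one would therefore expect either a
# 160-iteration bar or a second "Run 1 epoch of 0 round" pass.
print(iters_per_epoch)
```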
However, if I change the function as follows (literally just modifying the value given for epochs_per_round from 1.0 to 2.0):
```
def constant_hyper_parameters(collaborators,
                              db_iterator,
                              fl_round,
                              collaborators_chosen_each_round,
                              collaborator_times_per_round):
    """Set the training hyper-parameters for the round.

    Args:
        collaborators: list of strings of collaborator names
        db_iterator: iterator over history of all tensors.
            Columns: ['tensor_name', 'round', 'tags', 'nparray']
        fl_round: round number
        collaborators_chosen_each_round: a dictionary of {round: list of collaborators}.
            Each list indicates which collaborators trained in that given round.
        collaborator_times_per_round: a dictionary of {round: {collaborator: total_time_taken_in_round}}.
    Returns:
        tuple of (learning_rate, epochs_per_round, batches_per_round).
        One of epochs_per_round and batches_per_round must be None.
    """
    # these are the hyperparameters used in the May 2021 recent training of the actual FeTS Initiative
    # they were tuned using a set of data that UPenn had access to, not on the federation itself
    # they worked pretty well for us, but we think you can do better :)
    epochs_per_round = 2.0
    batches_per_round = None
    learning_rate = 5e-5
    return (learning_rate, epochs_per_round, batches_per_round)
```
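To rule out the function simply not being called, it can also be made to print what it actually hands back each round (throwaway sketch; the extra logging is my own addition, not part of FeTS_Challenge.py):

```python
def logged_constant_hyper_parameters(collaborators,
                                     db_iterator,
                                     fl_round,
                                     collaborators_chosen_each_round,
                                     collaborator_times_per_round):
    """Same constants as above, but logs the returned values per round."""
    learning_rate = 5e-5
    epochs_per_round = 2.0
    batches_per_round = None
    print(f'round {fl_round}: lr={learning_rate}, '
          f'epochs={epochs_per_round}, batches={batches_per_round}')
    return (learning_rate, epochs_per_round, batches_per_round)
```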
I still have the same results on the "small_split.csv" partitioning.
```
Collaborators chosen to train for round 0: experiment.py:396
['1', '2', '3']
INFO Hyper-parameters for round 0: experiment.py:424
learning rate: 5e-05
epochs_per_round: 2.0
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 3 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:12:30] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:07<00:00, 7.04s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.2386646866798401
Epoch Final validation dice_per_label : [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, 2.570958582744781e-13]
[14:12:37] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.2386646866798401, 'dice_per_label': [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, fets_challenge_model.py:49
2.570958582744781e-13]}
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice 0.238665 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.943770 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.007875 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.003014 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.238665 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.943770 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.007875 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.003014 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
[14:12:38] INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round fets_challenge_model.py:143
********************
Starting Training :
********************
Looping over training data: 0%| | 0/80 [00:00<?, ?it/s]/home/manthe/anaconda3/envs/fets2022_env/lib/python3.7/site-packages/torchio/data/queue.py:215: RuntimeWarning: Queue length (100) not divisible by the number of patches per volume (40)
warnings.warn(message, RuntimeWarning)
Looping over training data: 100%|██████████| 80/80 [00:49<00:00, 1.63it/s] Epoch Final Train loss : 1.0
Epoch Final Train dice : 0.2305977862328291
Epoch Final Train dice_per_label : [0.8688250705599785, 0.004667150774294626, 0.048898923238084535, 6.210030775201381e-12]
[14:13:27] METRIC Round 0, collaborator 3 is sending metric for task train: loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice 0.230598 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_0 0.868825 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_1 0.004667 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_2 0.048899 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for train, round 0 aggregator.py:486
METRIC Round 0, collaborator metric train result loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice: 0.230598 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_0: 0.868825 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_1: 0.004667 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_2: 0.048899 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_4: 0.000000 aggregator.py:531
[14:13:28] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:06<00:00, 6.98s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.24105004966259003
Epoch Final validation dice_per_label : [0.9601203799247742, 0.0006217751652002335, 0.0034580088686197996, 3.5011449577189435e-13]
[14:13:35] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.24105004966259003, 'dice_per_label': [0.9601203799247742, 0.0006217751652002335, fets_challenge_model.py:49
0.0034580088686197996, 3.5011449577189435e-13]}
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice 0.241050 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_0 0.960120 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_1 0.000622 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_2 0.003458 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for locally_tuned_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice: 0.241050 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_0: 0.960120 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_1: 0.000622 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_2: 0.003458 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
INFO All tasks completed on 3 for round 0... collaborator.py:171
INFO Collaborator 3 took simulated time: 7.08 minutes experiment.py:476
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 2 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:13:36] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:07<00:00, 7.02s/it] Epoch Final validation loss : 0.8570630550384521
Epoch Final validation dice : 0.2770087718963623
Epoch Final validation dice_per_label : [0.9444660544395447, 0.00011090079351561144, 0.020521221682429314, 0.14293694496154785]
[14:13:43] INFO 0.8570630550384521 fets_challenge_model.py:48
INFO {'dice': 0.2770087718963623, 'dice_per_label': [0.9444660544395447, 0.00011090079351561144, 0.020521221682429314, fets_challenge_model.py:49
0.14293694496154785]}
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_loss 0.857063 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice 0.277009 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.944466 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.000111 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.020521 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.142937 collaborator.py:416
INFO Collaborator 2 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 0.857063 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.277009 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.944466 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.000111 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.020521 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.142937 aggregator.py:531
INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round
```
The number of epochs per round does seem to be picked up, as stated at the beginning of the trace:
```
INFO Hyper-parameters for round 0: experiment.py:424
learning rate: 5e-05
epochs_per_round: 2.0
```
However, it does not seem to have any effect during training (same training time, same number of patches, etc.). The same happens with epochs_per_round = int(2).
Is the extra training occurring but hidden by the interface? Is it a behaviour specific to small_split? Am I doing something wrong? Is it simply not possible to modify epochs_per_round? Or is it a bug?
Thank you very much for your help,
All the best,
Matthis.
Created by Matthis Manthe

@Matthis You are correct. Non-integer values of epochs_per_round are not supported this year.

Hi again,
Thank you very much for this fix, I have been able to use it.
However, it no longer seems possible (perhaps it is simply not supported this year) to use a float value for epochs_per_round (0.5 epochs per round, for example).
I am getting a crash I did not get before, due to the following part of the fets_challenge_model.py file:
```
def train(self, col_name, round_num, input_tensor_dict, use_tqdm=False, epochs=1, **kwargs):
    """Train batches.

    Train the model on the requested number of batches.

    Args:
        col_name                : Name of the collaborator
        round_num               : What round is it
        input_tensor_dict       : Required input tensors (for model)
        use_tqdm (bool)         : Use tqdm to print a progress bar (Default=True)
        epochs                  : The number of epochs to train
        crossfold_test          : Whether or not to use cross fold trainval/test
                                  to evaluate the quality of the model under fine tuning
                                  (this uses a separate parameter to pass in the data and
                                  config used)
        crossfold_test_data_csv : Data csv used to define data used in crossfold test.
                                  This csv does not itself define the folds, just
                                  defines the total data to be used.
        crossfold_val_n         : number of folds to use for the train,val level of the nested crossfold.
        crossfold_test_n        : number of folds to use for the trainval,test level of the nested crossfold.
        kwargs                  : Key word arguments passed to GaNDLF main_run
    Returns:
        global_output_dict      : Tensors to send back to the aggregator
        local_output_dict       : Tensors to maintain in the local TensorDB
    """
    # handle the hparams
    epochs_per_round = int(input_tensor_dict.pop('epochs_per_round'))
```
epochs_per_round is thus converted to an int, so it no longer seems possible to use values lower than 1.
Is this intended behaviour? I thought it would be possible this year to use a fractional number of epochs, especially since this year's dataset is so large.
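For what it's worth, the truncation means any fractional part is simply discarded (plain-Python illustration of the int() cast above; in the real code the value is popped from input_tensor_dict):

```python
# int() truncates floats toward zero, so any epochs_per_round below 1.0
# becomes 0 epochs, and e.g. 1.9 epochs trains for only 1.
assert int(2.0) == 2   # integer-valued floats survive the cast
assert int(0.5) == 0   # fractional epochs silently become zero epochs
assert int(1.9) == 1   # the fractional part is lost, not rounded
print(int(0.5))        # -> 0
```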
Thank you again,
All the best,
Matthis.
Hi @Matthis,
Thanks for bumping this thread. This is now fixed. If you reinstall the python package, you'll see the expected number of epochs when you change that parameter.
Just as a note, the hyperparameter interface has been slightly modified: instead of returning `(learning_rate, epochs_per_round, batches_per_round)`, you should now return only `(learning_rate, epochs_per_round)`.
New interface example (taken from FeTS_Challenge.py)
```
def constant_hyper_parameters(collaborators,
                              db_iterator,
                              fl_round,
                              collaborators_chosen_each_round,
                              collaborator_times_per_round):
    """Set the training hyper-parameters for the round.

    Args:
        collaborators: list of strings of collaborator names
        db_iterator: iterator over history of all tensors.
            Columns: ['tensor_name', 'round', 'tags', 'nparray']
        fl_round: round number
        collaborators_chosen_each_round: a dictionary of {round: list of collaborators}.
            Each list indicates which collaborators trained in that given round.
        collaborator_times_per_round: a dictionary of {round: {collaborator: total_time_taken_in_round}}.
    Returns:
        tuple of (learning_rate, epochs_per_round).
    """
    # these are the hyperparameters used in the May 2021 recent training of the actual FeTS Initiative
    # they were tuned using a set of data that UPenn had access to, not on the federation itself
    # they worked pretty well for us, but we think you can do better :)
    epochs_per_round = 1
    learning_rate = 5e-5
    return (learning_rate, epochs_per_round)
```
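As an aside, round-dependent schedules are easy to express with the two-value interface; for example, something like the following (illustrative only, the function name is made up):

```python
def decayed_hyper_parameters(collaborators,
                             db_iterator,
                             fl_round,
                             collaborators_chosen_each_round,
                             collaborator_times_per_round):
    """Illustrative variant: halve the learning rate every 10 rounds."""
    base_learning_rate = 5e-5
    learning_rate = base_learning_rate * (0.5 ** (fl_round // 10))
    epochs_per_round = 1
    return (learning_rate, epochs_per_round)
```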
Hi,
First, thank you for your previous reply.
Do you have any new information about this problem? Has it been fixed yet?
Thank you again,
All the best,
Matthis.

Hi @Matthis, this is a bug. We will have a fix in place in the next few days and will let you know once it's resolved. Thanks a lot for reporting!