Hi,
I am having trouble confirming that changing the "epochs_per_round" value returned by the "constant_hyper_parameters" function passed to the "run_challenge_experiment" function actually has any effect.
The steps I followed:
- Using the github code on https://github.com/FETS-AI/Challenge
- Followed the installation as specified in the Readme.md file of Task 1,
- Using FeTS_Challenge.py directly, on the "small_split.csv" partitioning.
The functions given as parameters to the "run_challenge_experiment" function are therefore, as initially provided:
```
aggregation_function = weighted_average_aggregation
choose_training_collaborators = all_collaborators_train
training_hyper_parameters_for_round = constant_hyper_parameters
```
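For context on how these three policy functions get used, here is a minimal stand-in for the per-round dispatch loop (hypothetical sketch only; the real internals of `run_challenge_experiment` live in the `fets_challenge` package, and `run_rounds` is a name I made up):

```python
def constant_hyper_parameters(collaborators, db_iterator, fl_round,
                              collaborators_chosen_each_round,
                              collaborator_times_per_round):
    # same constants as in FeTS_Challenge.py
    return (5e-5, 1.0, None)

def all_collaborators_train(collaborators, db_iterator, fl_round,
                            collaborators_chosen_each_round,
                            collaborator_times_per_round):
    # every collaborator trains every round
    return collaborators

def run_rounds(n_rounds, collaborators,
               choose_training_collaborators,
               training_hyper_parameters_for_round):
    """Hypothetical stand-in: call each policy function once per round."""
    chosen_each_round, times_per_round, plan = {}, {}, []
    for fl_round in range(n_rounds):
        chosen = choose_training_collaborators(
            collaborators, None, fl_round, chosen_each_round, times_per_round)
        lr, epochs, batches = training_hyper_parameters_for_round(
            collaborators, None, fl_round, chosen_each_round, times_per_round)
        chosen_each_round[fl_round] = chosen
        plan.append((fl_round, chosen, lr, epochs, batches))
    return plan
```

The point is that the tuple returned each round is all the runner ever sees; whatever it does with `epochs` downstream is where the question below arises.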
This gives, for any round (not only round 0):
```
Collaborators chosen to train for round 0: experiment.py:396
['1', '2', '3']
INFO Hyper-parameters for round 0: experiment.py:424
learning rate: 5e-05
epochs_per_round: 1.0
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 3 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:27:18] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:06<00:00, 6.83s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.2386646866798401
Epoch Final validation dice_per_label : [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, 2.570958582744781e-13]
[14:27:25] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.2386646866798401, 'dice_per_label': [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, fets_challenge_model.py:49
2.570958582744781e-13]}
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice 0.238665 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.943770 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.007875 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.003014 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.238665 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.943770 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.007875 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.003014 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round fets_challenge_model.py:143
********************
Starting Training :
********************
Looping over training data: 0%| | 0/80 [00:00<?, ?it/s]/home/manthe/anaconda3/envs/fets2022_env/lib/python3.7/site-packages/torchio/data/queue.py:215: RuntimeWarning: Queue length (100) not divisible by the number of patches per volume (40)
warnings.warn(message, RuntimeWarning)
Looping over training data: 100%|██████████| 80/80 [00:47<00:00, 1.67it/s] Epoch Final Train loss : 1.0
Epoch Final Train dice : 0.22838935144245626
Epoch Final Train dice_per_label : [0.8640330836176873, 0.004636458929081044, 0.04488786500078277, 6.515775969446157e-12]
[14:28:13] METRIC Round 0, collaborator 3 is sending metric for task train: loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice 0.228389 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_0 0.864033 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_1 0.004636 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_2 0.044888 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for train, round 0 aggregator.py:486
METRIC Round 0, collaborator metric train result loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice: 0.228389 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_0: 0.864033 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_1: 0.004636 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_2: 0.044888 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_4: 0.000000 aggregator.py:531
[14:28:14] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:06<00:00, 6.97s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.244467630982399
Epoch Final validation dice_per_label : [0.9684193134307861, 0.00489959167316556, 0.004551596473902464, 4.3230536880302373e-13]
[14:28:21] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.244467630982399, 'dice_per_label': [0.9684193134307861, 0.00489959167316556, 0.004551596473902464, fets_challenge_model.py:49
4.3230536880302373e-13]}
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice 0.244468 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_0 0.968419 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_1 0.004900 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_2 0.004552 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for locally_tuned_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice: 0.244468 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_0: 0.968419 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_1: 0.004900 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_2: 0.004552 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
INFO All tasks completed on 3 for round 0... collaborator.py:171
INFO Collaborator 3 took simulated time: 4.6 minutes experiment.py:476
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 2 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:28:22] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:07<00:00, 7.06s/it] Epoch Final validation loss : 0.9427357912063599
Epoch Final validation dice : 0.26110994815826416
Epoch Final validation dice_per_label : [0.9405062198638916, 0.0011481185210868716, 0.045521270483732224, 0.05726420879364014]
[14:28:30] INFO 0.9427357912063599 fets_challenge_model.py:48
INFO {'dice': 0.26110994815826416, 'dice_per_label': [0.9405062198638916, 0.0011481185210868716, 0.045521270483732224, fets_challenge_model.py:49
0.05726420879364014]}
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_loss 0.942736 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice 0.261110 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.940506 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.001148 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.045521 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.057264 collaborator.py:416
INFO Collaborator 2 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 0.942736 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.261110 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.940506 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.001148 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.045521 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.057264 aggregator.py:531
INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round
...
```
This is the desired behaviour: collaborator 3 trains for 1.0 epochs on its 2 data points with 40 patches each (am I right about this?).
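For reference, here is the patch arithmetic behind the 80-iteration training bar as I understand it (the 40-patches-per-volume figure comes from the torchio warning in the trace; the 2-volumes and batch-size-1 values are my assumptions about the small_split setup):

```python
# Back-of-the-envelope check of the "80/80" training bar above.
training_volumes = 2      # assumption: collaborator 3's training data in small_split
patches_per_volume = 40   # from the torchio RuntimeWarning in the trace
batch_size = 1            # assumption

iters_per_epoch = training_volumes * patches_per_volume // batch_size
assert iters_per_epoch == 80  # matches the tqdm total for epochs_per_round = 1.0

# With epochs_per_round = 2.0 one would therefore expect either a
# 160-iteration bar or a second "Run 1 epoch of 0 round" pass.
print(iters_per_epoch)
```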
However, if I change the function as follows (literally just modifying the value given for epochs_per_round from 1.0 to 2.0):
```
def constant_hyper_parameters(collaborators,
                              db_iterator,
                              fl_round,
                              collaborators_chosen_each_round,
                              collaborator_times_per_round):
    """Set the training hyper-parameters for the round.

    Args:
        collaborators: list of strings of collaborator names
        db_iterator: iterator over history of all tensors.
            Columns: ['tensor_name', 'round', 'tags', 'nparray']
        fl_round: round number
        collaborators_chosen_each_round: a dictionary of {round: list of collaborators}.
            Each list indicates which collaborators trained in that given round.
        collaborator_times_per_round: a dictionary of {round: {collaborator: total_time_taken_in_round}}.
    Returns:
        tuple of (learning_rate, epochs_per_round, batches_per_round).
        One of epochs_per_round and batches_per_round must be None.
    """
    # these are the hyperparameters used in the May 2021 recent training of the actual FeTS Initiative
    # they were tuned using a set of data that UPenn had access to, not on the federation itself
    # they worked pretty well for us, but we think you can do better :)
    epochs_per_round = 2.0
    batches_per_round = None
    learning_rate = 5e-5
    return (learning_rate, epochs_per_round, batches_per_round)
```
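To rule out the function simply not being called, it can also be made to print what it actually hands back each round (throwaway sketch; the extra logging is my own addition, not part of FeTS_Challenge.py):

```python
def logged_constant_hyper_parameters(collaborators,
                                     db_iterator,
                                     fl_round,
                                     collaborators_chosen_each_round,
                                     collaborator_times_per_round):
    """Same constants as above, but logs the returned values per round."""
    learning_rate = 5e-5
    epochs_per_round = 2.0
    batches_per_round = None
    print(f'round {fl_round}: lr={learning_rate}, '
          f'epochs={epochs_per_round}, batches={batches_per_round}')
    return (learning_rate, epochs_per_round, batches_per_round)
```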
I still have the same results on the "small_split.csv" partitioning.
```
Collaborators chosen to train for round 0: experiment.py:396
['1', '2', '3']
INFO Hyper-parameters for round 0: experiment.py:424
learning rate: 5e-05
epochs_per_round: 2.0
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 3 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:12:30] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:07<00:00, 7.04s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.2386646866798401
Epoch Final validation dice_per_label : [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, 2.570958582744781e-13]
[14:12:37] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.2386646866798401, 'dice_per_label': [0.9437699913978577, 0.007874629460275173, 0.0030141547322273254, fets_challenge_model.py:49
2.570958582744781e-13]}
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice 0.238665 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.943770 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.007875 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.003014 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.238665 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.943770 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.007875 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.003014 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
[14:12:38] INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round fets_challenge_model.py:143
********************
Starting Training :
********************
Looping over training data: 0%| | 0/80 [00:00<?, ?it/s]/home/manthe/anaconda3/envs/fets2022_env/lib/python3.7/site-packages/torchio/data/queue.py:215: RuntimeWarning: Queue length (100) not divisible by the number of patches per volume (40)
warnings.warn(message, RuntimeWarning)
Looping over training data: 100%|██████████| 80/80 [00:49<00:00, 1.63it/s] Epoch Final Train loss : 1.0
Epoch Final Train dice : 0.2305977862328291
Epoch Final Train dice_per_label : [0.8688250705599785, 0.004667150774294626, 0.048898923238084535, 6.210030775201381e-12]
[14:13:27] METRIC Round 0, collaborator 3 is sending metric for task train: loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice 0.230598 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_0 0.868825 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_1 0.004667 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_2 0.048899 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task train: train_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for train, round 0 aggregator.py:486
METRIC Round 0, collaborator metric train result loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice: 0.230598 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_0: 0.868825 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_1: 0.004667 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_2: 0.048899 aggregator.py:531
METRIC Round 0, collaborator metric train result train_dice_per_label_4: 0.000000 aggregator.py:531
[14:13:28] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:06<00:00, 6.98s/it] Epoch Final validation loss : 1.0
Epoch Final validation dice : 0.24105004966259003
Epoch Final validation dice_per_label : [0.9601203799247742, 0.0006217751652002335, 0.0034580088686197996, 3.5011449577189435e-13]
[14:13:35] INFO 1.0 fets_challenge_model.py:48
INFO {'dice': 0.24105004966259003, 'dice_per_label': [0.9601203799247742, 0.0006217751652002335, fets_challenge_model.py:49
0.0034580088686197996, 3.5011449577189435e-13]}
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_loss 1.000000 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice 0.241050 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_0 0.960120 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_1 0.000622 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_2 0.003458 collaborator.py:416
METRIC Round 0, collaborator 3 is sending metric for task locally_tuned_model_validation: valid_dice_per_label_4 0.000000 collaborator.py:416
INFO Collaborator 3 is sending task results for locally_tuned_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_loss: 1.000000 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice: 0.241050 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_0: 0.960120 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_1: 0.000622 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_2: 0.003458 aggregator.py:531
METRIC Round 0, collaborator validate_local locally_tuned_model_validation result valid_dice_per_label_4: 0.000000 aggregator.py:531
INFO All tasks completed on 3 for round 0... collaborator.py:171
INFO Collaborator 3 took simulated time: 7.08 minutes experiment.py:476
INFO Waiting for tasks... collaborator.py:178
INFO Sending tasks to collaborator 2 for round 0 aggregator.py:312
INFO Received the following tasks: ['aggregated_model_validation', 'train', 'locally_tuned_model_validation'] collaborator.py:168
[14:13:36] INFO Using TaskRunner subclassing API collaborator.py:253
********************
Starting validation :
********************
Looping over validation data: 100%|██████████| 1/1 [00:07<00:00, 7.02s/it] Epoch Final validation loss : 0.8570630550384521
Epoch Final validation dice : 0.2770087718963623
Epoch Final validation dice_per_label : [0.9444660544395447, 0.00011090079351561144, 0.020521221682429314, 0.14293694496154785]
[14:13:43] INFO 0.8570630550384521 fets_challenge_model.py:48
INFO {'dice': 0.2770087718963623, 'dice_per_label': [0.9444660544395447, 0.00011090079351561144, 0.020521221682429314, fets_challenge_model.py:49
0.14293694496154785]}
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_loss 0.857063 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice 0.277009 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_0 0.944466 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_1 0.000111 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_2 0.020521 collaborator.py:416
METRIC Round 0, collaborator 2 is sending metric for task aggregated_model_validation: valid_dice_per_label_4 0.142937 collaborator.py:416
INFO Collaborator 2 is sending task results for aggregated_model_validation, round 0 aggregator.py:486
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_loss: 0.857063 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice: 0.277009 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_0: 0.944466 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_1: 0.000111 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_2: 0.020521 aggregator.py:531
METRIC Round 0, collaborator validate_agg aggregated_model_validation result valid_dice_per_label_4: 0.142937 aggregator.py:531
INFO Using TaskRunner subclassing API collaborator.py:253
INFO Run 0 epoch of 0 round
```
The number of epochs per round does seem to be picked up, as stated at the beginning of the trace:
```
INFO Hyper-parameters for round 0: experiment.py:424
learning rate: 5e-05
epochs_per_round: 2.0
```
However, it does not seem to have any effect during training (same training time, same number of patches, etc.). The same happens with epochs_per_round = int(2).
Is the extra training occurring but hidden by the interface? Is it a behaviour specific to small_split? Am I doing something wrong? Is it simply not possible to modify epochs_per_round? Or is it a bug?
Thank you very much for your help,
All the best,
Matthis.
Created by Matthis Manthe

@Matthis You are correct. Non-integer values of epochs_per_round are not supported this year.

Hi again,
Thank you very much for this fix, I have been able to use it.
However, it no longer seems possible (perhaps it is simply not supported this year) to use a float value for epochs_per_round (0.5 epochs per round, for example).
I am getting a crash I did not get before, due to the following part of the fets_challenge_model.py file:
```
def train(self, col_name, round_num, input_tensor_dict, use_tqdm=False, epochs=1, **kwargs):
    """Train batches.

    Train the model on the requested number of batches.

    Args:
        col_name                : Name of the collaborator
        round_num               : What round is it
        input_tensor_dict       : Required input tensors (for model)
        use_tqdm (bool)         : Use tqdm to print a progress bar (Default=True)
        epochs                  : The number of epochs to train
        crossfold_test          : Whether or not to use cross fold trainval/test
                                  to evaluate the quality of the model under fine tuning
                                  (this uses a separate parameter to pass in the data and
                                  config used)
        crossfold_test_data_csv : Data csv used to define data used in crossfold test.
                                  This csv does not itself define the folds, just
                                  defines the total data to be used.
        crossfold_val_n         : number of folds to use for the train,val level of the nested crossfold.
        crossfold_test_n        : number of folds to use for the trainval,test level of the nested crossfold.
        kwargs                  : Key word arguments passed to GaNDLF main_run
    Returns:
        global_output_dict      : Tensors to send back to the aggregator
        local_output_dict       : Tensors to maintain in the local TensorDB
    """
    # handle the hparams
    epochs_per_round = int(input_tensor_dict.pop('epochs_per_round'))
```
epochs_per_round is thus converted to an int, so it no longer seems possible to use values lower than 1.
Is this intended behaviour? I thought it would be possible this year to use a fractional number of epochs, especially since this year's dataset is so large.
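For what it's worth, the truncation means any fractional part is simply discarded (plain-Python illustration of the int() cast above; in the real code the value is popped from input_tensor_dict):

```python
# int() truncates floats toward zero, so any epochs_per_round below 1.0
# becomes 0 epochs, and e.g. 1.9 epochs trains for only 1.
assert int(2.0) == 2   # integer-valued floats survive the cast
assert int(0.5) == 0   # fractional epochs silently become zero epochs
assert int(1.9) == 1   # the fractional part is lost, not rounded
print(int(0.5))        # -> 0
```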
Thank you again,
All the best,
Matthis.
Hi @Matthis,
Thanks for bumping this thread. This is now fixed. If you reinstall the python package, you'll see the expected number of epochs when you change that parameter.
Just as a note, the hyperparameter interface has been slightly modified: instead of returning `(learning_rate, epochs_per_round, batches_per_round)`, you should now return only `(learning_rate, epochs_per_round)`.
New interface example (taken from FeTS_Challenge.py)
```
def constant_hyper_parameters(collaborators,
                              db_iterator,
                              fl_round,
                              collaborators_chosen_each_round,
                              collaborator_times_per_round):
    """Set the training hyper-parameters for the round.

    Args:
        collaborators: list of strings of collaborator names
        db_iterator: iterator over history of all tensors.
            Columns: ['tensor_name', 'round', 'tags', 'nparray']
        fl_round: round number
        collaborators_chosen_each_round: a dictionary of {round: list of collaborators}.
            Each list indicates which collaborators trained in that given round.
        collaborator_times_per_round: a dictionary of {round: {collaborator: total_time_taken_in_round}}.
    Returns:
        tuple of (learning_rate, epochs_per_round).
    """
    # these are the hyperparameters used in the May 2021 recent training of the actual FeTS Initiative
    # they were tuned using a set of data that UPenn had access to, not on the federation itself
    # they worked pretty well for us, but we think you can do better :)
    epochs_per_round = 1
    learning_rate = 5e-5
    return (learning_rate, epochs_per_round)
```
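As an aside, round-dependent schedules are easy to express with the two-value interface; for example, something like the following (illustrative only, the function name is made up):

```python
def decayed_hyper_parameters(collaborators,
                             db_iterator,
                             fl_round,
                             collaborators_chosen_each_round,
                             collaborator_times_per_round):
    """Illustrative variant: halve the learning rate every 10 rounds."""
    base_learning_rate = 5e-5
    learning_rate = base_learning_rate * (0.5 ** (fl_round // 10))
    epochs_per_round = 1
    return (learning_rate, epochs_per_round)
```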
Hi,
First, thank you for your previous reply.
Do you have any new information about this problem? Has it been fixed yet?
Thank you again,
All the best,
Matthis.

Hi @Matthis, this is a bug. We will have a fix in place in the next few days and will let you know once it's resolved. Thanks a lot for reporting!