Hi,
I would like to ask for some clarification regarding the use of the preprocessing folder.
Assume I would like to train 4 models (A, B, C, D), where two models (say A, B) use a different preprocessing step than the other two (C, D).
Are we allowed to hold two different preprocessing partitions, or once we submit a new preprocessing request (in a new project with a different Synapse ID), is the preprocessing folder of our previous (old) project erased? In other words, is the preprocessing folder "per team" or "per project"?
How can we simultaneously run two models that are based on the same preprocessing folder?
If we submit a model to the training express lane that includes a preprocessing step, does it erase the current preprocessing folder?
And a future suggestion: I think it would be beneficial to mount the preprocessing folder as a read & write folder during training. This way, if our preprocessing algorithm is long and composed of multiple stages, and we find a bug in the later stages, we can fix and modify the later stages as part of the training script without redoing the whole pipeline.
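To make the suggestion concrete, here is a minimal sketch of what I mean by a multi-stage preprocessing script; the `/preprocessedData` mount point, the stage names, and the marker files are all hypothetical:

```python
# Minimal sketch of a staged preprocessing pipeline.
# The mount point, stage names and marker files are hypothetical.
import os

OUT = "/preprocessedData"  # assumed read/write mount of the preprocessing folder

def run_stage(name, fn):
    """Run stage `fn` unless its completion marker already exists."""
    marker = os.path.join(OUT, f".{name}.done")
    if os.path.exists(marker):
        print(f"stage {name}: already done, skipping")
        return
    fn()
    open(marker, "w").close()  # mark the stage as completed

def decode():
    ...  # e.g. read the raw images and write decoded files under OUT

def normalize():
    ...  # e.g. intensity normalization of the decoded files

def augment():
    ...  # e.g. the late stage we found a bug in and want to redo alone

for name, fn in [("decode", decode), ("normalize", normalize), ("augment", augment)]:
    run_stage(name, fn)
```

If the training script could write to the folder, fixing a bug in the last stage would only mean deleting `.augment.done` and re-running that stage, instead of redoing the whole pipeline.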
No, the [Express Lane](https://www.synapse.org/#!Synapse:syn4224222/wiki/409764) and Challenge Lane are completely separated.

Thank you for the prompt response.
I didn't fully understand your comment:
> The pre-processing directory for the Express Lane works the same as for the Challenge (but they are separate, of course).
Assume I submitted a preprocessing script A to the **training lane**, which generated a preprocessed partition A. If I now submit a different preprocessing script B to the **express lane**, will it erase partition A?
Hi Eli,
> Are we allowed to hold two different preprocessing partitions, or once we submit a new preprocessing request (in a new project with a different Synapse ID), is the preprocessing folder of our previous (old) project erased? In other words, is the preprocessing folder "per team" or "per project"?
Considering the large size of the pre-processed data (max 10 TB/team), the system does not allow archiving multiple versions of the pre-processed data. We also keep the pre-processed data on disks connected to a specific machine in order to 1) have optimal read/write speed during pre-processing and training and 2) avoid moving TBs of data around, which is prohibitive. In our application, the cached pre-processed data are cleared when needed, i.e. when a different pre-processing docker container is submitted (identified by the docker repository name and digest, not by the Synapse ID).
Assuming that your pre-processed data usage for A, B + C, D stays within the allotted space, you could pre-process the data for A, B and for C, D (2 versions of the pre-processed data) from the same docker container, as sketched below. Training A, B, C and D could then refer to their common pre-processing docker container.
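For example, a single pre-processing entry point could write the two versions side by side. This is only a sketch; the `/trainingData` and `/preprocessedData` mount points and the partition directory names are assumptions:

```python
# Sketch: one pre-processing container producing two partitions.
# Mount points and directory names are assumptions.
import os

ROOT = "/preprocessedData"  # assumed mount of the pre-processing folder
SRC = "/trainingData"       # assumed read-only mount of the raw images

def preprocess_for_AB(src, dst):
    """Pre-processing variant used by models A and B."""
    os.makedirs(dst, exist_ok=True)
    ...  # write the A/B version of the pre-processed images here

def preprocess_for_CD(src, dst):
    """Pre-processing variant used by models C and D."""
    os.makedirs(dst, exist_ok=True)
    ...  # write the C/D version of the pre-processed images here

preprocess_for_AB(SRC, os.path.join(ROOT, "partition_AB"))
preprocess_for_CD(SRC, os.path.join(ROOT, "partition_CD"))
```

Since the cache is keyed by the docker repository name and digest, both partitions come from one submission and stay cached together.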
> In other words, is the preprocessing folder "per team" or "per project"?

Per team.
> How can we simultaneously run two models that are based on the same preprocessing folder?
Simply submit two training jobs that refer to the same pre-processing docker container. If the data are cached, the two jobs will run in parallel, assuming that enough slots are available; otherwise, the first of the two jobs will start, and the second will start once the system detects that the data are cached.
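Each training script would then simply read from the cached directory. A minimal sketch, again assuming the `/preprocessedData` mount point and the partition names from the example above:

```python
# Sketch of a training entry point reading cached pre-processed data.
# The mount point, partition names and PARTITION switch are assumptions.
import os

ROOT = "/preprocessedData"
partition = os.environ.get("PARTITION", "partition_AB")  # hypothetical switch
data_dir = os.path.join(ROOT, partition)

for fname in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, fname)
    ...  # feed `path` into the model's data loader and train
```

Two such jobs (one with `PARTITION=partition_AB`, one with `PARTITION=partition_CD`) can run side by side against the same cached pre-processing output.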
> If we submit a model to the training express lane that includes a preprocessing step, does it erase the current preprocessing folder?
The pre-processing directory for the Express Lane works the same as for the Challenge (but they are separate, of course).
> And a future suggestion: I think it would be beneficial to mount the preprocessing folder as a read & write folder during training. This way, if our preprocessing algorithm is long and composed of multiple stages, and we find a bug in the later stages, we can fix and modify the later stages as part of the training script without redoing the whole pipeline.
The reason the system doesn't allow that is that the persistence of the pre-processed data is not guaranteed. Therefore, the system needs to be able to regenerate the pre-processed data from scratch when required. As mentioned, it is prohibitive to keep each "revision" of your pre-processed data considering the size of the dataset. However, the infrastructure developed for this Challenge will continue to evolve based on feedback from the participants, and the suggested feature or a similar one may become available in the future.
Thanks!