I want to raise some serious concerns regarding the handling of intermediate data, which are unique to this contest. I think other participants have raised similar questions, so this should be beneficial to others as well.
.
Given the nature of the problems, it is very likely that participants will use parts of the input images, i.e. segmented image patches, as inputs to a machine learning model. Typically, one would first generate these image patches, write them to external storage, and then read them back from external storage for training. We can do the same for this challenge: use the preprocessing procedure to generate the image patches and put them under the preprocessedData folder, and then read them from the preprocessedData folder for training.
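To make this concrete, here is a minimal sketch of that two-stage workflow. It assumes the /preprocessedData mount point discussed in this thread, and `extract_patches()` is just a hypothetical placeholder for whatever detection or segmentation method actually produces the patches.
```
# Minimal sketch of the two-stage workflow; folder layout and helper names
# are illustrative, not the actual challenge API.
import glob
import os
import numpy as np
from PIL import Image

def extract_patches(png_path):
    """Hypothetical patch extractor: return a list of 256x256 float32 crops."""
    img = np.asarray(Image.open(png_path), dtype=np.float32)
    # ... run mass/ROI detection here; a single dummy crop is shown ...
    return [img[:256, :256]]

def preprocessing_step():
    """Preprocessing container: generate patches and write them out."""
    os.makedirs('/preprocessedData/patches', exist_ok=True)
    for png_path in glob.glob('/preprocessedData/png/*.png'):
        base = os.path.splitext(os.path.basename(png_path))[0]
        for i, patch in enumerate(extract_patches(png_path)):
            np.save('/preprocessedData/patches/%s_%d.npy' % (base, i), patch)

def training_step():
    """Training container: read the patches back and feed them to the model."""
    patches = [np.load(f) for f in glob.glob('/preprocessedData/patches/*.npy')]
    # ... train the machine learning model on `patches` ...
```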
.
Now comes the problem: **the generation of image patches is not a static process**. For example, if I develop a new method to detect masses from mammograms, I'd like to use that method to re-generate the image patches. So I have two options:
.
**Option 1: re-generate the image patches in the preprocessing step**
The problem is that I'll have to repeat the whole preprocessing procedure on the raw images, which are huge. I just recorded the time to convert the 313,847 .dcm files to .png files, and it took 31.5 hrs on the server. It would be a waste to redo the whole thing every time.
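For reference, the conversion is roughly the following (a sketch assuming pydicom and Pillow; the exact libraries and scaling I used may differ):
```
# Rough sketch of the .dcm -> .png conversion step. Pixel values are scaled
# to 8 bits here for simplicity; a 16-bit output would preserve more of the
# dynamic range of the mammograms.
import glob
import os
import numpy as np
import pydicom
from PIL import Image

def dcm_to_png(dcm_path, out_dir):
    ds = pydicom.dcmread(dcm_path)
    arr = ds.pixel_array.astype(np.float32)
    arr = (arr - arr.min()) / max(arr.max() - arr.min(), 1e-6)  # normalize to [0, 1]
    out_name = os.path.splitext(os.path.basename(dcm_path))[0] + '.png'
    Image.fromarray((arr * 255).astype(np.uint8)).save(os.path.join(out_dir, out_name))

os.makedirs('/preprocessedData/png', exist_ok=True)
for path in glob.glob('/trainingData/*.dcm'):
    dcm_to_png(path, '/preprocessedData/png')
```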
.
**Option 2: re-generate the image patches in the training step**
The problem is that during training, the only writable external storage is the modelState folder, which is supposed to be used for trained models and is limited to 5GB. That means we'll have to hold all the image patches in memory and deal with the 200GB RAM constraint. This can become a serious limitation given the large size of the training set.
.
I think the whole setup of the challenge has mainly focused on data that are either totally static or fully dynamic. It is rather awkward to deal with a dataset that sits somewhere in between, such as the image patches.
.
I really appreciate the time and effort that the organizers have put into this challenge. So I just want to offer some feedback to make it a success for all! :)
Created by Li Shen (thefaculty)

@thefaculty: Just to let you know, we are in the process of reengineering the submission processing pipeline to provide additional 'scratch space' during the training phase. More details will follow. Hope this helps.
@brucehoff Yes, "scratch space" is the right way to put it.
> the lack of writing to external storage during the training also prevents ...
Just to make sure I understand: The system *does* allow your training model to write to external storage, i.e. to the /modelState folder (with a limit of 5GB). It sounds like there is an additional need for 'scratch space' (say, 100GB) to write intermediate files which are not part of the "model state" but which you need to have around while the training model is running. Is that correct?
.
@brucehoff, Just to add another input: the lack of write access to external storage during training also prevents the use of command-line tools, such as **parallel**, to speed things up. We'll have to use the APIs of the language of choice to do parallel processing, which adds an additional programming burden.
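For instance, the in-program alternative to driving a script over files on disk with **parallel** would look something like this (a sketch; `process_one_image()` is a hypothetical per-image worker):
```
# Without scratch space, file-based tools like GNU parallel are out, so the
# parallelism has to live inside the program, e.g. Python's multiprocessing.
import glob
from multiprocessing import Pool

def process_one_image(png_path):
    # ... load the image, extract patches, return them as in-memory arrays ...
    return png_path  # placeholder

if __name__ == '__main__':
    paths = glob.glob('/preprocessedData/png/*.png')
    with Pool(processes=16) as pool:
        results = pool.map(process_one_image, paths)
```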
--
@brucehoff, I'll just throw out a ballpark estimate to make the argument. Let's say I have a detection method that can identify ROIs in mammograms, and I use it to generate one image patch per mammogram. Assume a patch size of 256×256, stored as 32-bit floats. Then I'll need:
```
256^2*313847*4/1024^3 = 76.6GB
```
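The same arithmetic as a quick check in Python:
```
# 313,847 mammograms x one 256x256 patch of 32-bit floats
print(313847 * 256 * 256 * 4 / 1024.0**3)  # ~76.6 GB
```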
So nearly 80GB are taken away from the 200GB limit. In a more complicated design, I may want to generate more than one patch per mammogram. In that situation, I'll have to be more cautious about which machine learning models I use, to avoid exceeding the memory limit.
--
@thefaculty: Thank you for this feedback. If I understand correctly, you are saying that there is a workflow that looks like this:
.dcm files -> .png files -> extraction of image patches from .png files -> machine learning on image patches
You are further saying that the Challenge constrains you to combine this workflow into two steps, either:
Option 1:
[ .dcm files -> .png files -> extraction of image patches from .png files ] -> [ machine learning on image patches ]
or
Option 2:
[ .dcm files -> .png files ] -> [ extraction of image patches from .png files -> machine learning on image patches ]
where Option 1 has the disadvantage of taking a long time for each submission (the long-running .dcm -> .png part is unchanged but has to be repeated every time the image patch extraction is changed), and Option 2 has the problem that the Challenge does not provide a place to store the image patches.
Is my understanding correct? Can you say how much storage is required in Option 2 to allow you to store the image patches?