Hello
In order to prepare our architecture, we have a question about preprocessing.
Would it be possible to use a training architecture that preprocesses all the training images and saves them to disk? The preprocessed images would then be fed to the training algorithm.
Or should the training images be preprocessed on the fly?
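For concreteness, here is a minimal sketch of the first option (preprocess once, save to disk), assuming uncompressed `.dcm` files and the pydicom library; the `preprocess` function and the directory paths are hypothetical placeholders:

```python
import os

import numpy as np
import pydicom  # assumes the pydicom package is available

RAW_DIR = "/trainingData"        # hypothetical input directory of .dcm files
CACHE_DIR = "/preprocessedData"  # hypothetical output directory

def preprocess(pixels):
    """Hypothetical placeholder: crop/resize/normalize as needed."""
    return (pixels - pixels.mean()) / (pixels.std() + 1e-8)

os.makedirs(CACHE_DIR, exist_ok=True)
for name in os.listdir(RAW_DIR):
    if not name.endswith(".dcm"):
        continue
    ds = pydicom.dcmread(os.path.join(RAW_DIR, name))
    arr = preprocess(ds.pixel_array.astype(np.float32))
    # Save once; the training epochs then load the .npy files directly.
    np.save(os.path.join(CACHE_DIR, name.replace(".dcm", ".npy")), arr)
```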
I think you need to allow loading the whole test set at once instead of sample by sample. Loading one model takes a couple of seconds, so the model should be pre-loaded in memory and the test images then read in one by one; otherwise one needs about 10 days just to run prediction with one model.

The final format of the input data is still under discussion, but my intuition is that we will provide the uncompressed DICOM files (.dcm), as it takes hours to uncompress the 640k images even using 40 cores in parallel. We are also investigating the possibility of allowing the participants to preprocess the data the way they want and save the result. The inference method would then have access to those data without needing to reprocess them. There are technical challenges to overcome considering the size of the data and the number of participants who are already registered. We will make sure to provide comprehensive documentation once the IT architecture is in its final version.
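As a minimal sketch of the pre-loading pattern described above, assuming a scikit-learn-style model saved with pickle (the model path, test directory, and `predict` call are hypothetical):

```python
import os
import pickle

import pydicom  # assumes pydicom is available

# Load the trained model once, outside the per-image loop
# ("/modelState/model.pkl" is a hypothetical path).
with open("/modelState/model.pkl", "rb") as f:
    model = pickle.load(f)

TEST_DIR = "/testData"  # hypothetical directory of test .dcm files
for name in sorted(os.listdir(TEST_DIR)):
    if not name.endswith(".dcm"):
        continue
    ds = pydicom.dcmread(os.path.join(TEST_DIR, name))
    # Score one image at a time; only the file I/O is per-sample.
    score = model.predict(ds.pixel_array.reshape(1, -1))
    print(name, score)
```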
Dear Thomas, thanks, but I did not explain very well what I mean.
I mean the full DICOM image itself. It is currently provided in .GZ format.
Metadata is about 1% of the information; a text file is nice, but otherwise one needs to parse the full file to get any information. DICOM files are fast to search for keys, since DICOM is a keyed format, so metadata is not the issue.
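For example, with the pydicom library the header can be read without touching the pixel data (a sketch; the file name is illustrative):

```python
import pydicom

# Read only the header: stop before the (large) pixel data element.
ds = pydicom.dcmread("000135.dcm", stop_before_pixels=True)
print(getattr(ds, "ImageLaterality", "missing"))  # "L" or "R" when present
print(getattr(ds, "Modality", "missing"))         # "MG" for mammography
```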
If you plan to read the pixels, you need the following flow:
1. Uncompress the file,
2. then use your tools for as long as you need to access the image,
3. then remove the uncompressed file if there is a disk-space limitation (the subject of this thread); see the sketch after this exchange.

Yes, the DICOM metadata will also be provided in a text file, as mentioned [here](https://www.synapse.org/#!Synapse:syn4224222/discussion/threadId=589).

Great.
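For reference, a minimal sketch of that uncompress/use/remove flow, assuming gzip-compressed `.dcm.gz` files and the pydicom library (the file name is illustrative):

```python
import gzip
import os
import shutil

import pydicom  # assumes pydicom is available

def read_compressed_dicom(gz_path):
    """Uncompress to a temporary .dcm, read it, then remove the copy."""
    dcm_path = gz_path[:-3]  # strip the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(dcm_path, "wb") as dst:
        shutil.copyfileobj(src, dst)       # step 1: uncompress
    try:
        ds = pydicom.dcmread(dcm_path)     # step 2: use the image
        return ds.pixel_array
    finally:
        os.remove(dcm_path)                # step 3: free the disk space

pixels = read_compressed_dicom("000135.dcm.gz")  # illustrative file name
```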
In my opinion, it would be great to provide the rest of the DICOM file in an uncompressed form, so that there is no need to uncompress it before reading.
Not all libraries support compressed DICOM directly, since DICOM has its own libraries and codecs to provide that support.

A new version of the [image crosswalk file](https://www.synapse.org/#!Synapse:syn7113506) that contains the fields *laterality* and *view* is now available.

Hi Kiko,
The final images crosswalk file will include those two variables: *laterality* ("L" or "R") and *view* ("CC", "MLO", etc.). I'll update the [images_crosswalk_pilot.tsv](https://www.synapse.org/#!Synapse:syn6174179) accordingly.

In fact, if the DICOM files are compressed as they are in the sample, iterating over some of the metadata requires uncompressing (and recompressing) the files.
The image laterality ("R" or "L") is in the DICOM metadata, but the view ("CC" or "MLO") is not, so the crosswalk needs to be accessed to select the same kind of mammography samples.
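For illustration, a minimal sketch of selecting one kind of sample via the crosswalk file; the file name and column names are assumed to match the pilot crosswalk layout:

```python
import csv

# Read the images crosswalk and keep only, e.g., left-side CC views.
# Column names are assumed; check them against the actual crosswalk file.
with open("images_crosswalk.tsv", newline="") as f:
    rows = csv.DictReader(f, delimiter="\t")
    cc_left = [r["filename"] for r in rows
               if r.get("view") == "CC" and r.get("laterality") == "L"]

print(len(cc_left), "left CC images")
```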
We foresee a scenario where the preprocessed data may be up to 10 times the volume of the original data.
We can gain a big computational advantage if:
1- we do not have to preprocess the data for each epoch (if we finally use a neural-network approach); see the sketch after this list;
2- the preprocessed data can be reused/shared among different models/architectures (although in this case the computational savings would be smaller than with option 1).
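A minimal sketch of point 1, assuming the preprocessed images were cached once as NumPy `.npy` files in a shared directory (the path is hypothetical); memory-mapping keeps the up-to-10x-larger cache out of RAM:

```python
import glob

import numpy as np

CACHE_DIR = "/preprocessedData"  # hypothetical shared cache directory

files = sorted(glob.glob(CACHE_DIR + "/*.npy"))
for epoch in range(10):
    for path in files:
        # mmap_mode avoids holding the full cache in memory; the
        # preprocessing cost is paid once, not once per epoch.
        x = np.load(path, mmap_mode="r")
        # ... feed x to the network (Theano, Torch, etc.) ...
```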
We have not decided yet which learning framework we are going to use, but it will be one of the typical choices: Theano, Torch, etc.
Best regards
Thank you for this question. It is a topic that we as organizers have been discussing lately. Can you tell me: What machine learning framework (if any) do you intend to use? How big is the preprocessed data (typically) as a function of the original, raw data? Do you typically have just one preprocessed version of the raw data or multiple?
Thanks.