I can foresee a lot of need to operate on the files under the /preprocessedData folder as we do preprocessing and model training. For example, last night I wrote a Docker image to copy all the .dcm files from the trainingData folder to the preprocessedData folder. This was merely a test of the system and is not part of my data preprocessing plan. Later I decided to delete all the copied files. How can I do this? Write another Docker image? This can be a pain in the ass if we need to do it constantly.
The /preprocessedData folder has 10 TB of storage, which is a lot but not infinite. Maybe the organizers can offer some advice on best practices for file operations on the writable volume? Thx!
Created by Li Shen thefaculty
Here is some information on how to speed up the pre-processing of the DICOM images using multiple CPU cores. There are two tools that I recommend installing: [GNU Parallel](https://www.gnu.org/software/parallel/) to run commands concurrently and [ImageMagick](http://www.imagemagick.org/script/index.php) to process images.
To install GNU parallel on RHEL 7 or CentOS 7:
```
# wget http://repo.openfusion.net/centos7-x86_64/parallel-20160622-1.of.el7.x86_64.rpm && rpm -Uvh parallel-20160622-1.of.el7.x86_64.rpm && rm -fr parallel-20160622-1.of.el7.x86_64.rpm
```
To install ImageMagick on the same Linux distributions:
```
# yum install -y ImageMagick
```
Here is a quick example that combines both tools to use all available CPU cores to 1) resize all the .dcm images in the directory /data/images to 50% of their original size and 2) export the results in PNG format to the directory /data/output.
```
$ find /data/images -name "*.dcm" | parallel convert -resize 50% {} /data/output/{/.}.png
```
NOTE: Before batch processing the images, make sure to visually check the output of your command on a sample of images from the [Pilot Set](https://www.synapse.org/#!Synapse:syn4224222/files/). In particular, resizing images and converting them to JPEG both blur them. Don't apply a lossy format conversion (e.g. to JPEG) if your goal is to use the output images as input to generate a lossless LMDB file, for example. In that case, preprocessed (e.g. resized) images must be exported to a lossless format such as PNG. If for some reason you still want to use a lossy file format, you should probably apply a sharpening kernel to the images.
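One way to follow this advice is to run the conversion on a handful of files before committing to the full batch. The sketch below only selects the sample. The helper name `sample_dcm` and the default sample size are illustrative assumptions, not part of the challenge tooling.

```shell
# List up to N .dcm files under a directory, in a stable (sorted) order.
# sample_dcm is a hypothetical helper name; the default of 5 is arbitrary.
sample_dcm() {
    dir=$1
    n=${2:-5}
    find "$dir" -type f -name "*.dcm" | sort | head -n "$n"
}

# Example (assumes GNU Parallel and ImageMagick are installed, as above):
#   sample_dcm /data/images 5 | parallel convert -resize 50% {} /data/output/{/.}.png
```

Once the sample output looks right, the same command can be run on the full `find` listing.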
By default, GNU Parallel uses all the CPU cores available.
Here are the details of the commands:
- "find /data/images -name '*.dcm'" lists all the files with the extension ".dcm" in the directory /data/images.
- "convert" is the ImageMagick tool that converts an image from one format to another.
- "{}", "{/}" and "{/.}" are placeholders defined by GNU Parallel that refer to the input fed to "parallel". Here the input is the list of absolute paths to the .dcm images returned by the "find" command. "{}" refers to one of the items, for example "/data/images/ax689k8a.dcm". "{/}" refers to the same item, but the slash ('/') tells parallel to discard the path to the directory. Thus, "{/}" represents "ax689k8a.dcm" if "{}" corresponds to "/data/images/ax689k8a.dcm". In the same fashion, the dot ('.') tells parallel to discard the extension. In our example, "{/.}" then represents "ax689k8a".
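To see what these placeholders expand to without running GNU Parallel, the same transformations can be approximated with plain shell parameter expansion. This only illustrates the behavior described above and is not how GNU Parallel implements it. The function names are made up for the example.

```shell
# Approximate GNU Parallel's {/} placeholder: strip the directory part.
strip_dir() {
    printf '%s\n' "${1##*/}"
}

# Approximate {/.}: strip the directory part, then the last extension.
strip_dir_ext() {
    base=${1##*/}
    printf '%s\n' "${base%.*}"
}

strip_dir /data/images/ax689k8a.dcm       # prints ax689k8a.dcm
strip_dir_ext /data/images/ax689k8a.dcm   # prints ax689k8a
```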
The input given to parallel can also be a text file where each line contains a different command:
```
$ cat commands.txt | parallel {}
```
GNU Parallel is a powerful tool for running batches of experiments, each with a different set of parameters. Additional information about GNU Parallel can be found in the [tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html).
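As a sketch of the batch-of-experiments use case, a commands file can be generated with a nested loop, one line per parameter combination. The script name `train.sh`, its `--lr` and `--batch` options, and the parameter values are hypothetical placeholders for whatever program you actually run.

```shell
# Write one command per parameter combination to a file.
# train.sh, --lr and --batch are hypothetical; substitute your own program.
write_commands() {
    out=$1
    : > "$out"                       # truncate or create the file
    for lr in 0.1 0.01 0.001; do
        for batch in 16 32; do
            printf './train.sh --lr %s --batch %s\n' "$lr" "$batch" >> "$out"
        done
    done
}

# Example (assumes GNU Parallel is installed):
#   write_commands commands.txt
#   cat commands.txt | parallel {}
```

Keeping the commands in a file makes it easy to inspect the full list before running it.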
> Later I decided to delete all the files copied.
We delete the contents of /preprocessedData ourselves whenever a new pre-processing container is submitted. If you only want to delete the images (not very useful by itself), you can simply submit an empty but valid pre-processing Docker container.
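If you would rather clear a directory explicitly from within your own pre-processing container, a helper along these lines could be used. This is a minimal sketch: `clear_dir` is a hypothetical name, and it assumes a `find` that supports `-mindepth` and `-delete` (as GNU find does).

```shell
# Remove everything inside a directory but keep the directory itself.
# clear_dir is a hypothetical helper; pass the directory explicitly.
clear_dir() {
    dir=$1
    [ -d "$dir" ] || return 1        # refuse to run on a missing directory
    find "$dir" -mindepth 1 -delete
}

# Example, inside a pre-processing container:
#   clear_dir /preprocessedData
```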
Any easy ways to manipulate the files on the /preprocessedData folder?