Hi,
Is anybody else having problems with slow disk I/O from the /trainingData or /preprocessedData directories?
The disk I/O speeds on the servers seem to be quite low. We have just preprocessed 85k .dcm files, but it took 18hrs even using multiple processes (48), and they seemed to spend a lot of their time in I/O wait. Here is a snippet of "vmstat 1" output; there appears to have been plenty of spare memory and CPU.
```
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd     free   buff      cache   si   so     bi    bo    in   cs us sy id wa st
 9 40  36996 20216676 445712 472973952    0    0 148972 19336 10045 1564 15  1 71 13  0
```
We had similar results when only using 24 processes, i.e. one per core.
Presumably this is due to AWS EBS limitations, but are there any plans to speed up disk access? If not, I'm concerned that it will slow down training: there's only so much you can achieve by prefetching data into memory.
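(By prefetching I mean something along these lines: warming the page cache for the next batch of files before the worker processes read them. The batch directory below is hypothetical, and this only helps while the files still fit in free RAM.)
```
# Read the next batch once so that subsequent reads come from the page cache
# (batch_001 is a hypothetical sub-directory, just for illustration)
find /trainingData/batch_001 -name "*.dcm" -print0 | xargs -0 cat > /dev/null
```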
Cheers
Bob
Yes, I did originally use 24, but that seemed to have the same problem, and I wondered whether the data was sharded across disks.
Our processing is not very complex, so it sounds like something else is wrong with it.
I will investigate further. Your figure of 5hrs tells me a lot.
Thanks!
Bob
> Here is a snippet of "vmstat 1" output.
I'm not familiar with _vmstat_, but it seems to report virtual memory statistics rather than per-disk I/O statistics. You can use _iostat_ to monitor disk performance and _dd_ for read and write tests.
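For example, something like this (assuming the sysstat package, which provides _iostat_, is installed) prints extended per-device statistics once per second; high utilisation and wait values on the devices backing /trainingData or /preprocessedData would point at the disks being the bottleneck:
```
# Extended per-device statistics, refreshed every second
iostat -x 1
# Look at the utilisation (%util) and wait (await / r_await / w_await) columns
```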
Note that the /trainingData and /preprocessedData partitions are logical volumes that span different physical volumes.
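If you want to see how those logical volumes map onto the underlying devices, something along these lines should work (the LVM commands usually need root):
```
# Filesystem sizes and mount points for the two data partitions
df -h /trainingData /preprocessedData
# Block-device tree, showing which devices sit under each mount point
lsblk
# LVM view: logical volumes with the physical devices backing them
sudo lvs -o +devices
sudo pvs
```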
```
# Test write speed
dd if=/dev/zero of=/preprocessedData/output bs=1M count=1024
# Test read speed (using /preprocessedData instead of /trainingData)
dd if=/preprocessedData/output of=/dev/null bs=1M count=1024
```
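One caveat with the read test above: since the file has just been written, much of it will still be in the page cache, so you may be measuring the cache rather than the disk. Flushing writes to disk and bypassing the cache on reads (both standard GNU dd options) should give figures closer to the raw EBS throughput:
```
# Write test that includes the time to flush the data to disk
dd if=/dev/zero of=/preprocessedData/output bs=1M count=1024 conv=fdatasync
# Read test that bypasses the page cache
dd if=/preprocessedData/output of=/dev/null bs=1M count=1024 iflag=direct
```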
> it took 18hrs even using multiple processes (48)
When running concurrent tasks, I usually run N tasks where N is the number of CPU cores (24 here). Note that at some point the I/O speed usually becomes the limiting factor and performance no longer scales linearly with the number of concurrent tasks. Moreover, 18 hours can be a long time or a very short time depending on what your pre-processing does. For reference, it takes me about five hours to resize and save the images of the training set in PNG format using [GNU Parallel](https://www.gnu.org/software/parallel/parallel_tutorial.html), ImageMagick and 24 concurrent jobs.
```
echo "Resizing and converting $(find $IMAGES_DIRECTORY -name "*.dcm" | wc -l) DICOM images to PNG format"
find $IMAGES_DIRECTORY/ -name "*.dcm" | parallel --will-cite "convert {} -resize $WIDTHx$HEIGHT! $PREPROCESS_IMAGES_DIRECTORY/{/.}.png" # faster than mogrify
```
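GNU Parallel runs one job per CPU core by default; if I/O turns out to be the bottleneck, the -j option lets you cap the number of concurrent jobs explicitly (12 below is just an example):
```
# Same conversion, capped at 12 concurrent jobs to leave headroom for I/O wait
find $IMAGES_DIRECTORY/ -name "*.dcm" | parallel --will-cite -j 12 "convert {} -resize ${WIDTH}x${HEIGHT}! $PREPROCESS_IMAGES_DIRECTORY/{/.}.png"
```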
Thanks!
Thomas