I find that reading images takes a really long time now: about 12 seconds per image on average, and 20+ seconds in the worst case. I suspect the data volume that contains all the images cannot keep up. If everybody (or many instances) reads from the same image volume, that volume's read throughput is effectively a ceiling on the combined training speed of all running instances, except those that read from their own pre-processed data volumes. Another possibility is that there are hot spots in the network.
We need to address this problem, otherwise it will take 50 days just to read all the images.
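For anyone who wants to check on their own instance, here is a rough sketch of how the per-image read time could be measured directly against the shared volume. The mount point and file extension below are placeholders, not the actual challenge paths:

```python
import glob
import time

# Placeholder mount point and extension for the shared image volume;
# adjust both to whatever your instance actually mounts.
IMAGE_DIR = "/trainingData"
PATTERN = "*.dcm"

paths = sorted(glob.glob(IMAGE_DIR + "/" + PATTERN))[:50]  # time a small sample
times = []
for p in paths:
    start = time.time()
    with open(p, "rb") as f:
        f.read()  # raw read only, no decoding, to isolate volume throughput
    times.append(time.time() - start)

print("mean:  %.1f s/image" % (sum(times) / len(times)))
print("worst: %.1f s/image" % max(times))
```

If the mean from a raw read like this is already in the 10+ second range, the bottleneck is the volume itself rather than image decoding in the training script.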
Created by vacuum (@vacuum): I have the same problem now. Using the same preprocessed images, my performance dropped from about 3 sec/batch to 30-40 sec/batch, and up to 90 sec/batch. I agree that with such a bottleneck no one else will be able to train before the deadline.
@thomas.yu Could you please take a look at submission ID 7894930? According to the logs, it was all right in the beginning, but then it became very slow. I don't think it is a script problem:
1. I didn't do any preprocessing, so the reads come from the raw image data volume.
2. The same training script was running at about 2 s/image just a few days ago.

I did not come across that problem; it may be an issue with your preprocessing script. I was able to crop, etc., save, reload, and generate features for all images in a few days (I think it was around 1 s per image).
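For reference, a one-time pass along these lines is roughly what that workflow looks like: read each raw image once, crop/resize it, and write it to the instance's own pre-processed volume so training never has to hit the shared volume again. The paths, file extension, and PIL-based resize below are placeholders and assumptions, not the script I actually used:

```python
import glob
import os
from PIL import Image

# Placeholder paths: the shared read-only image volume and this instance's own
# pre-processed data volume; adjust both, and swap PIL for a DICOM reader if needed.
RAW_DIR = "/trainingData"
OUT_DIR = "/preprocessedData"
os.makedirs(OUT_DIR, exist_ok=True)

for path in glob.glob(os.path.join(RAW_DIR, "*.png")):  # extension is a placeholder
    out_path = os.path.join(OUT_DIR, os.path.basename(path))
    if os.path.exists(out_path):
        continue                         # resume-safe: skip already-converted files
    img = Image.open(path).convert("L")  # load and convert to grayscale
    img = img.resize((512, 512))         # stand-in for the actual crop/resize step
    img.save(out_path)                   # training later reads only from OUT_DIR
```

Once the images are on the instance's own volume, the training loop no longer competes with other instances for the shared volume's read throughput.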