Hi, some time ago I posted something in the I/O thread because we thought our preprocessing script was slowed down by slow I/O operations. Now we think the problem may instead be caused by wrong Docker CPU allocation. We are doing image transformations parallelized over multiple CPUs to speed up the preprocessing time. We compared the execution time of the process on your machine with the time on our local machine (also on multiple CPUs) and saw that the average time for one image transformation on the submission hardware is roughly equal to an image transformation on our machine with one core. The wiki says "Each running model will have access to 24 CPU cores". So the first question is whether these 24 CPU cores are dedicated to each submitted model or shared between several submissions. The other question is whether there might be wrong Docker settings for the **--cpuset-cpus=** argument of the docker run command. In our opinion, it has to be something like cpuset-cpus=0-23 to allocate the 24 mentioned cores, instead of something like cpuset-cpus=0 (physical CPU 0 with multiple cores on it, which would be wrong). Thanks

Created by Michael Mielimonka mimie001
Hi Michael and Sven, Are your submissions running as expected on both the Express and Challenge clouds? Thanks!
Hi, I've run the test that you have sent me on all the machines, and the result is always successful.
```
# docker run -it --rm --cpuset-cpus=1-22 tschaffter/cpu-affinity-test /bin/bash test.sh
7ffffe
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}
numpy import
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}
matplotlib import
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}
```
All the machines received a kernel update before the opening of the Leaderboard Phase. You mentioned that the problem you encountered does not appear on the Express Lane machines, which have been installed with a more recent kernel than the Open Phase machines. Can you check whether you encounter your problem on the Challenge machines?
Hi Thomas,
> The indexes of the CPU cores may be different. What is the underlying relevant result you wish to obtain and how do you interpret it?
As my team member said, we want to see if there is any change in the CPU allocation after the imports. The indexes may be different, but in case of a failure the number of cores listed in the output would differ before and after the imports. Is there any news about our problem / the container I gave you?
What we would like to observe is whether, after "import numpy", the CPU affinity mask only allows one core to be used. Sure, the CPU core indices may be different, but if the problem is present there should be a reduction from 22 cores to only 1 core at the end.
Hi Michael, The indexes of the CPU cores may be different. What is the underlying relevant result you wish to obtain and how do you interpret it?
Hi Thomas, I uploaded a file named "minimal.rar" and shared it with you. It includes a Dockerfile based on "FROM nvidia/cuda:7.5-cudnn5-devel", test.sh and test.py. This is only a minimal test to see how the affinity is set before and after the imports. I think a real validation would make more sense running our preprocessing with the multiprocessing included, but maybe the test.py gives us some insights. To test it you only need to execute test.sh inside the running container. As a result you should get something like this:
```
STDOUT: {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46}
STDOUT: numpy import
STDOUT: {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46}
STDOUT: matplotlib import
STDOUT: {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46}
```
Thanks
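For anyone without access to minimal.rar, the core of such a check could look roughly like the sketch below (assuming Python 3 on Linux; the real test.py may differ in details):

```python
# Print the CPU affinity mask before and after the suspect imports.
# os.sched_getaffinity(0) returns the set of CPU indices the current
# process is allowed to run on (Linux only, Python 3.3+).
import os

print(os.sched_getaffinity(0))

import numpy  # noqa: E402 -- imported here on purpose, after the first check
print("numpy import")
print(os.sched_getaffinity(0))

import matplotlib  # noqa: E402
print("matplotlib import")
print(os.sched_getaffinity(0))
```

If the import problem is present, the last two sets should shrink to a single core.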
Hi Michael, Thanks for your detailed explanations.
> We suspect that the environment on the fast lane could be different from the normal submission lane, so that there is actually no way for us to validate the fix.
What we can do is run your test on the machines that have been prepared for the Challenge. For that, could you send me a Dockerfile to build ubuntu + python + dependencies (numpy, matplotlib) and, in a separate file, the minimal commands to run and the values to expect?
Hi Thomas, with respect to this post on Stack Overflow: [Python multiprocessing](http://stackoverflow.com/questions/15639779/why-does-multiprocessing-use-only-a-single-core-after-i-import-numpy/15640029#15640029) we think the problem mentioned there is the same as ours. We are using Python multiprocessing and additional libraries to do our preprocessing. It seems that some libraries, e.g. numpy, scipy, tables, pandas, skimage, ..., change the CPU affinity because they use a BLAS library. This means that the scheduler restricts the processes to a few cores, or even to one core. It also seems to be an Ubuntu problem that appears only on some hardware platforms. The answers there show how to fix the CPU affinity (Linux command `taskset` after the library imports). But our problem now is to validate the changes, because it seems that the affinity-changing problem does not appear on the fast lane. We tested our container with and without the changes and got the exact same (fast) runtimes (e.g. transform=1304+-301ms, which is 10 times faster than on the normal submission lane). I also ran a little test script to see how the CPU affinity is set before and after the imports of the mentioned libraries and printed out the CPU affinity mask in the preprocess.sh script:
```
STDOUT: {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46}
STDOUT: numpy import
STDOUT: {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46}
STDOUT: matplotlib import
STDOUT: {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46}
```
```
pid 31's current affinity mask: 7ffffe000000
```
So it shows that the CPU affinity is set to the right value from the beginning. In 7ffffe000000 the 22 bits corresponding to cores 25-46 are set to 1, which means that the cores are allocated correctly. We suspect that the environment on the fast lane could be different from the normal submission lane, so that there is actually no way for us to validate the fix. Thanks
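For reference, a minimal sketch of the kind of workaround discussed in that Stack Overflow thread, here using `os.sched_setaffinity` instead of shelling out to `taskset` (assuming Python 3 on Linux; this is an illustration, not necessarily what our preprocessing script does):

```python
import os

# Remember the affinity mask the container starts with, before any
# BLAS-linked imports that might narrow it.
original_affinity = os.sched_getaffinity(0)

import numpy       # noqa: E402 -- suspected of changing the affinity
import matplotlib  # noqa: E402

# If the imports narrowed the mask, restore the original one so that
# multiprocessing workers (which inherit the mask) can use all allocated cores.
if os.sched_getaffinity(0) != original_affinity:
    os.sched_setaffinity(0, original_affinity)
```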
Hi Yuanfang, I was referring to the problem you reported regarding the 20-minute limit that was not applied. I received confirmation from Bruce that Synapse solved the problem I mentioned, so it should be fixed. Regarding the missing image, we have found what the problem is and Bruce is addressing it. He or I will update your original ticket shortly. Thanks!
This problem still exists: IOError: [Errno 2] No such file or directory: '/trainingData/cmkg906a.dcm' (Express Lane). I posted about it in another thread days ago (so sorry for double posting). I wonder if it is because you are re-mapping some data now (I had the same problem on the long queue early on).
@yuanfang.guan: I know there was an issue on Synapse yesterday that caused some trouble with the Challenge infrastructure. Please let me know if the problem is still present.
Hi Michael,
> Although the cores are allocated correctly, it seems like a problem with the hardware platform.
We will definitely solve this mystery. The next step would be to compare the time your pre-processing algorithm takes to process all the images using 1) 1 thread and 2) 22 threads with GNU Parallel on the Express Lane. If the speed-up is around 20, then I guess the CPU cores are working correctly. Also, a difference to take into account when comparing results that you obtain locally vs. on the Challenge Cloud is that the read/write speeds of the disks are certainly different.
> The challenge started today, so will the use of the fast lane decrease our server time, or is it possible to do tests there in the hope of fixing our problem?
We mentioned in the previous newsletter that the Challenge was starting during the week of Nov. 14, not on Nov. 14. The Challenge is actually starting this Friday, Nov. 18. The Express Lane will continue to be available after the opening of the Leaderboard Phase. Using it will not consume your time quota (remember that a dummy dataset is installed on the Express Lane).
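If GNU Parallel is inconvenient to wire into the container, a rough Python-only sanity check of the same idea is to time a purely CPU-bound task with 1 and with 22 workers and compare the speed-up (a hypothetical test script, not one of the Challenge examples):

```python
# Compare wall-clock time of the same CPU-bound work with 1 and with 22 workers.
# A speed-up far below ~20 with 22 workers would point at a CPU problem
# rather than at disk I/O.
import time
from multiprocessing import Pool


def burn(_):
    # Pure-Python busy loop, so BLAS threading and disk speed play no role.
    total = 0
    for i in range(5 * 10 ** 6):
        total += i * i
    return total


if __name__ == "__main__":
    for workers in (1, 22):
        start = time.time()
        with Pool(processes=workers) as pool:
            pool.map(burn, range(44))  # 44 tasks, i.e. 2 per worker at 22 workers
        print("%2d workers: %.1f s" % (workers, time.time() - start))
```

Note that worker processes inherit the parent's affinity mask, so if the mask has been narrowed to one core the 22-worker run will be barely faster than the 1-worker run.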
Hi again, another question: The challenge started today, so will the use of the fast lane decrease our server time, or is it possible to do tests there in the hope of fixing our problem? Thanks
Hi Thomas,
> I saw that in the email received from Synapse but I don't find this statement any more in your message. `nproc` shows the number of CPU cores available to a Docker container. Here is a simple example:
Sorry, this was my fault: I saw that we had `nproc --all` in our script, which was wrong. That's why I edited my post. Now `nproc` shows 22, which is correct. But our results are still very confusing and are roughly equal to the results with one core on our machine, as my team member showed. For example, if you compare the following results, it looks like only one core is used:
> your machine: transform=30188+-10128ms
> our machine with 1 core: transform=27493+-6876ms
> our machine with 24 cores: transform=3366+-2044ms
Although the cores are allocated correctly, it seems like a problem with the hardware platform. Maybe it would be helpful if you could provide some more information about the underlying host system. Also, as my team member mentioned, we observed the machine getting slower:
> In prior submissions we had almost identical fast runtimes for the first few hundred images and it started getting slower after a few thousand images. But now it is very slow from the very beginning.
Do you have any ideas regarding our problem? Thanks
I don't want to block the queue, but one of my submissions has been running in the Express Lane for 80 minutes, even though it should have taken less than 2 minutes.
Hi Michael and Sven, Please use the training Express Lane for future tests. Thanks!
Hi Peter,
> Should we be taking steps to parallelise our code as well, or should we rely on the optimisation?
Yes, you should definitely parallelize your code. See the following examples that I've uploaded recently (see the Docker tab):
- docker.synapse.org/syn4224222/dm-preprocess-png
- docker.synapse.org/syn4224222/dm-preprocess-lmdb
- docker.synapse.org/syn4224222/dm-preprocess-caffe
Hi Michael,
> nproc shows all available cores from the host system, in our last run for example 48
I saw that in the email received from Synapse but I don't find this statement any more in your message. `nproc` shows the number of CPU cores available to a Docker container. Here is a simple example:
```
[root@vm ~]# docker run --cpuset-cpus="0" ubuntu nproc
1
[root@vm ~]# docker run --cpuset-cpus="0-21" ubuntu nproc
22
```
Before analyzing the time your custom script takes, can you show the output of `nproc` run from your submission? If you see 22, I think that we have allocated the CPU cores correctly; otherwise I'll check with Bruce to ensure that we set the parameter correctly.
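A possible explanation for the earlier reading of 48: not every way of counting cores respects `--cpuset-cpus`. From inside Python (assuming Python 3 on Linux), the two calls in this small illustrative sketch can disagree in a container:

```python
# Two ways to count CPUs from inside a container started with --cpuset-cpus.
import multiprocessing
import os

# Ignores the cpuset and reports the host's cores (like `nproc --all`).
print("host cores:", multiprocessing.cpu_count())

# Respects the cpuset / affinity mask (like plain `nproc`).
print("cores usable by this process:", len(os.sched_getaffinity(0)))
```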
Hi, I am a team member of Michael and I will show you some log output from our preprocessing container. Latest submission log from Saturday (we verified these runtimes on 20k/30k/45k/70k images):
```
STDOUT: 100/313847 images processed (elapsed=0:03:14.015248 read=748+-1058ms transform=30188+-10128ms write_normal=490+-235ms write_flipped=479+-224ms write_256=3+-0ms write_300=6+-1ms)
STDOUT: 200/313847 images processed (elapsed=0:06:06.461356 read=686+-1000ms transform=31198+-8872ms write_normal=507+-265ms write_flipped=502+-259ms write_256=3+-0ms write_300=6+-1ms)
STDOUT: 300/313847 images processed (elapsed=0:09:05.408644 read=816+-956ms transform=32255+-9266ms write_normal=580+-297ms write_flipped=571+-291ms write_256=3+-0ms write_300=6+-1ms)
STDOUT: 400/313847 images processed (elapsed=0:11:52.306820 read=644+-948ms transform=31309+-10078ms write_normal=550+-253ms write_flipped=538+-243ms write_256=3+-0ms write_300=7+-8ms)
STDOUT: 500/313847 images processed (elapsed=0:14:24.905768 read=700+-892ms transform=27517+-8397ms write_normal=492+-241ms write_flipped=483+-224ms write_256=3+-0ms write_300=6+-1ms)
```
Same container on our server with --cpuset-cpus="0-23" on the pilot images:
```
100/500 images processed (elapsed=0:00:26.063078 read=56+-83ms transform=3366+-2044ms write_normal=549+-253ms write_flipped=540+-257ms write_256=3+-0ms write_300=6+-1ms)
200/500 images processed (elapsed=0:00:42.383789 read=50+-49ms transform=1944+-504ms write_normal=582+-251ms write_flipped=564+-235ms write_256=3+-0ms write_300=6+-1ms)
300/500 images processed (elapsed=0:01:02.044930 read=50+-75ms transform=2742+-1613ms write_normal=530+-253ms write_flipped=533+-255ms write_256=3+-0ms write_300=6+-1ms)
400/500 images processed (elapsed=0:01:16.938005 read=36+-17ms transform=1660+-491ms write_normal=544+-259ms write_flipped=537+-252ms write_256=3+-0ms write_300=6+-1ms)
500/500 images processed (elapsed=0:01:31.575279 read=39+-20ms transform=1725+-504ms write_normal=564+-265ms write_flipped=554+-263ms write_256=3+-0ms write_300=6+-1ms)
```
Now with --cpuset-cpus="0" on the pilot images:
```
100/500 images processed (elapsed=0:04:25.309678 read=496+-249ms transform=27493+-6876ms write_normal=9317+-4264ms write_flipped=9317+-4222ms write_256=52+-47ms write_300=112+-47ms)
200/500 images processed (elapsed=0:08:21.399517 read=432+-114ms transform=26809+-4905ms write_normal=9391+-3752ms write_flipped=9256+-3709ms write_256=58+-49ms write_300=115+-44ms)
300/500 images processed (elapsed=0:12:11.493144 read=432+-165ms transform=27513+-6080ms write_normal=8993+-4067ms write_flipped=8889+-4070ms write_256=56+-48ms write_300=113+-48ms)
400/500 images processed (elapsed=0:16:23.822088 read=403+-128ms transform=28044+-7352ms write_normal=9656+-4506ms write_flipped=9587+-4439ms write_256=53+-46ms write_300=119+-47ms)
500/500 images processed (elapsed=0:21:01.815664 read=409+-125ms transform=30501+-7972ms write_normal=10931+-6049ms write_flipped=11029+-6394ms write_256=68+-65ms write_300=135+-77ms)
```
I admit that --cpuset-cpus="0" is not identical to what we can observe in the submission log output, but on the other hand the container running on your hardware is definitely not using 22 (or 24) cores at 100%. We are working very hard to optimize runtimes, but for a few weeks we have been observing terrible performance regardless of what we try. In prior submissions we had almost identical fast runtimes for the first few hundred images and it started getting slower after a few thousand images. But now it is very slow from the very beginning. A software design failure should be out of the question, since we are validating our scripts on our local machine (20 physical, 40 logical cores) and, as you can see, the runtimes there are much more reasonable.
Should we be taking steps to parallelise our code as well, or should we rely on the optimisation? This is an important question for me. I'm currently examining all the images with one single Python script. It'd be easy to split this into, say, ten separate scripts, each tackling a tenth of the images. Would it be worth doing this? If it's likely to speed everything up 5-10 times, then it'd make sense for me to do this and re-submit the job. If it'll only make a 5-10% difference, then it wouldn't be worth doing.
Hi,
> Are you using the example I gave you in your previous thread to illustrate the use of multiple CPU cores?
Yes, we have also tested the conversion with parallel + ImageMagick as mentioned in your example. Currently we are using Python multiprocessing, creating 20 processes for the preprocessing (keeping the global interpreter lock in mind and using the right package for real parallelization). My team member will post some plots showing the above-mentioned observations.
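For illustration only, the general pattern looks roughly like this (a sketch with a placeholder transform and a hypothetical glob pattern, not our actual preprocessing code):

```python
# A pool of 20 worker processes runs the image transform; separate processes
# sidestep the GIL for CPU-bound work.
import glob
from multiprocessing import Pool


def transform(path):
    # Placeholder for the real work: read the DICOM, transform it, write outputs.
    return path


if __name__ == "__main__":
    image_paths = glob.glob("/trainingData/*.dcm")  # input location used by the Challenge
    with Pool(processes=20) as pool:
        for n, _ in enumerate(pool.imap_unordered(transform, image_paths), 1):
            if n % 100 == 0:
                print("%d/%d images processed" % (n, len(image_paths)))
```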
By the way, the submissions currently have access to 22 CPU cores. Soon we will ask you to use an environment variable instead of hardcoding this number. If you suspect that the number of CPU cores allocated is incorrect, please provide the output of the command `nproc`.
Hi,
> is roughly equal to an image transformation on our machine with one core.
Are you using the example I gave you in your previous thread to illustrate the use of multiple CPU cores?
> whether these 24 CPU cores are dedicated to each submitted model or shared between several submissions.
They are exclusively dedicated to a single submission.
> In our opinion, it has to be something like cpuset-cpus=0-23 to allocate the 24 mentioned cores, instead of something like cpuset-cpus=0 (physical CPU 0 with multiple cores on it, which would be wrong).
We are using `cpuset-cpus=0-23`.
