I am experiencing very slow training in my last training submission. Compared to an earlier submission at the beginning of this round, the speed has slowed down by a factor of 6 (what used to take four hours now takes 24 hours). This is very annoying because the algorithm is exactly the same, yet we are wasting our scarce quota and remaining time. Is there an explanation for this?

Also, one of our inference submissions recently crashed after two days of processing. This is also strange because it had already processed many images. Is it possible that too many users in the system reduce the available memory, so that, since we are parallelizing across all the available CPU cores, there is not enough memory and the program crashes? Are the inference images in the same order across submissions, so that we can seek to the same position and try to reproduce the error and see what happens?
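For context, here is a minimal sketch of the kind of mitigation we have been considering: cap the worker pool by free memory instead of spawning one worker per core. The per-worker memory figure and `process_image` are placeholders, not our real pipeline.

```python
import os
import multiprocessing as mp

import psutil  # used only to read available memory

WORKER_MEM_GB = 8  # rough guess at per-worker memory need; adjust for the real pipeline

def safe_worker_count() -> int:
    """Choose a pool size limited by both CPU count and free memory."""
    free_gb = psutil.virtual_memory().available / 1024 ** 3
    by_memory = max(1, int(free_gb // WORKER_MEM_GB))
    by_cpu = os.cpu_count() or 1
    return min(by_cpu, by_memory)

def process_image(path: str) -> str:
    return path  # placeholder for the real per-image inference step

if __name__ == "__main__":
    with mp.Pool(safe_worker_count()) as pool:
        pool.map(process_image, ["img_0001.png", "img_0002.png"])
```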

Created by Alberto Albiol (alalbiol)
Dear @brucehoff, thanks for your response! Please take a look at submissions 8469272 (sc2-inference, just finished) and 8468933 (sc2-inference-express-lane). These two use exactly the same container image (same submit file) and exactly the same code path, except that the express-lane one only processed 1,178 images while the other processed about 120k. The express-lane run averaged about 1.25 seconds/image (it lasted 1,500 seconds), while the other averaged 4.3 seconds/image (it lasted 6.4 days). That is about a 3.4x performance difference (1.25 vs 4.3), and the performance on the express lane is quite consistent. Other than contention, I cannot think of another explanation. I am fairly confident that if you rerun 8469272, the full sc2-inference, on a dedicated machine with no other jobs, it can finish within 2 days. By the way, can you share the docker run options? Thanks!
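To make the comparison concrete, here is the back-of-the-envelope arithmetic behind the numbers above (the image counts and timings are the ones quoted in this post):

```python
express_images, express_seconds = 1178, 1500   # express-lane run (8468933)
full_images, full_rate = 120_000, 4.3          # full run (8469272), observed s/image

express_rate = express_seconds / express_images              # ~1.27 s/image
print(f"express lane: {express_rate:.2f} s/image")
print(f"full run at express speed: {full_images * express_rate / 86400:.2f} days")
print(f"full run at observed speed: {full_images * full_rate / 86400:.2f} days")
print(f"slowdown factor: {full_rate / express_rate:.1f}x")
# roughly 1.8 days vs 6.0 days, i.e. about the 3.4x gap described above
```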
@vacuum As explained in the challenge instructions:

> Each submission will have access to 22 CPU cores

We could have mentioned that this access is *exclusive*. The host machines have 48 cores and run at most two jobs at once. We use the Docker run parameters to ensure that each job gets exclusive access to those 22 CPUs, to one Tesla K80 GPU board, and to 200GB of memory.
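For illustration only, the kind of isolation described above can be expressed with standard `docker run` flags roughly like this; the image name and exact options here are placeholders, not the challenge's actual launch command.

```python
import subprocess

# Illustrative sketch: pin a container to 22 specific cores, cap memory at 200 GB,
# and expose a single GPU. Values mirror the limits described above; the image
# name and flag values are assumptions, not the real submission harness.
cmd = [
    "docker", "run", "--rm",
    "--cpuset-cpus", "0-21",   # exclusive set of 22 cores
    "--memory", "200g",        # hard memory limit
    "--gpus", "device=0",      # one GPU (Docker 19.03+ syntax; older setups used nvidia-docker)
    "submission-image:latest",
]
subprocess.run(cmd, check=True)
```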
I have seen a similar thing and I am very frustrated, e.g. 8437355 vs 8444322. If it happens during training, we lose precious quota. If it happens during inference, we lose our score completely. To make sure we get a score, we have to build a very big performance margin on top of the already very tight 5.4-second budget. On the inference express lane we can do less than 2 seconds per image on average (I have many logs to show that), but on a problematic training machine the same code can only do 8 seconds per image on average. We are not brave enough to try real inference yet. I suspect it might be because of CPU contention. What is the docker CPU option setting? If not specified, by default each container's access to the host machine's CPU cycles is unlimited. How many containers (jobs) are scheduled on each physical machine?
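As a sketch of how we collect those per-image timings (simplified; `predict_fn` stands in for our real model call):

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def run_inference(image_paths, predict_fn):
    """Log wall-clock time per image so runs on different hosts can be compared."""
    for i, path in enumerate(image_paths):
        start = time.monotonic()
        predict_fn(path)                      # real model call goes here
        elapsed = time.monotonic() - start
        logging.info("image %d (%s) took %.2f s", i, path, elapsed)

# Example: run_inference(["a.png", "b.png"], lambda p: time.sleep(0.1))
```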
This is the one that crashed after two days in the sc1 queue: 8433791. Submission 8438315 is still running, but according to the logs the processing is slow compared to 8339790. In any case, what worries us most is why 8433791 crashed.
Dear Alberto, please give me the IDs of your submissions so I can try to see what happened. Best, Tom

slow training and crash in inference