Hi @brucehoff, @thomas.yu,
I have been tracking the status of my inference submissions and found significant speed differences between jobs.
Currently, my two SC2 submissions (8507997 and 8508120) are running at a speed similar to what I predicted from the express lane. However, my other SC2 submission (8508092) and my two SC1 submissions (8502236 and 8509095) are running **4~5 times slower** than the speed predicted from the express lane.
I believe this is **NOT** caused by using different code or models. I make this claim because my SC2 submissions 8507997 and 8508092 use the same trained states and have almost exactly the same code except for one line (a Python math operation that will not significantly affect the speed). Yet 8507997 is **5 times faster** than 8508092.
Although you mentioned that each submission has independent access to 2 GPUs, CPUs, and RAM, I am still concerned that **submissions on the same physical machine affect each other's speed.** Otherwise, I cannot think of another explanation for this significant speed difference between two almost identical submissions.
BTW, I would also like to know roughly how many mammography images there are for the 18,000 subjects.
Thanks for your help.
I would also appreciate it if other teams experiencing similar issues could share their observations.
Created by Bibo Shi (@darrylbobo)
> run on the fly as required by the challenge.
To be clear, what the challenge requires is that you not use some *inference* images to inform the predictions for other inference images. If the information from each image is kept isolated (which we will test by rerunning your model on modified versions of the inference image set), then the order in which your code performs the various steps of preprocessing and inference should not matter.
> And, if we can also get a similar preprocess and scratch folder during the inference as in training stage?
You are provided 200GB of scratch space, as documented here:
https://www.synapse.org/#!Synapse:syn4224222/wiki/409763
I realize this is only 1/50 of the 10TB 'preprocessing' space provided during training, but it may allow you to create batches of preprocessed images during inference.
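For example, a rough sketch of that approach (the `/scratch/preprocessed` path and the helper names are assumptions for illustration, not documented challenge paths or APIs):

```python
# Rough sketch: cache preprocessed images in the 200GB scratch space so that
# preprocessing and model scoring can run in separate passes.
# NOTE: "/scratch/preprocessed" and these helpers are illustrative assumptions,
# not documented challenge paths or APIs.
import os
import pickle

SCRATCH_DIR = "/scratch/preprocessed"  # hypothetical scratch mount point

def cache_batch(batch_id, preprocessed_images):
    """Persist one batch of preprocessed images to scratch and return its path."""
    os.makedirs(SCRATCH_DIR, exist_ok=True)
    path = os.path.join(SCRATCH_DIR, f"batch_{batch_id:05d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(preprocessed_images, f)
    return path

def load_batch(path):
    """Load a previously cached batch for model scoring."""
    with open(path, "rb") as f:
        return pickle.load(f)
```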
Does this help? @brucehoff
Thanks for your analysis. I agree that it is difficult to track how these concurrency issues affect some of my submissions.
I will try my best to optimize my own code and speed it up.
However, I want to share the reason why my current submission is slow: I made the inference code run on the fly, as required by the challenge.
This means my code takes one mammogram at a time, preprocesses it, and applies the trained model to get the posterior probability. Among these steps, preprocessing takes the majority of the time.
But if "running on the fly" is not required, I could do inference the same way I did during training: first preprocess all the mammograms in parallel, then apply the trained model. I believe this would save a lot of time (see the sketch below).
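To make the difference concrete, here is a minimal sketch of the two strategies (the `preprocess` function and `Model` class are placeholders, not my actual submission code):

```python
# Minimal sketch contrasting per-image "on the fly" inference with batched,
# parallel preprocessing. `preprocess` and `Model` are placeholders.
from multiprocessing import Pool

def preprocess(image_path):
    # Stand-in for the expensive CPU-bound preprocessing step.
    return image_path

class Model:
    def predict(self, preprocessed):
        # Stand-in for GPU inference returning a posterior probability.
        return 0.5

def run_on_the_fly(image_paths, model):
    """Current approach: preprocess and score one mammogram at a time."""
    return [model.predict(preprocess(p)) for p in image_paths]

def run_batched(image_paths, model, n_workers=8, batch_size=256):
    """Alternative: preprocess each batch in parallel, then score it."""
    scores = []
    with Pool(n_workers) as pool:
        for start in range(0, len(image_paths), batch_size):
            batch = image_paths[start:start + batch_size]
            preprocessed = pool.map(preprocess, batch)  # parallel CPU work
            scores.extend(model.predict(x) for x in preprocessed)
    return scores
```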
So, I am wondering whether "running on the fly" is still required in the validation round, and whether we can get preprocessing and scratch folders during inference similar to those in the training stage.
If so, I believe I can speed up my inference code significantly.
Since I posted yesterday, 8509095 has advanced to 3.132 (delta=1.111) and 8508092 has advanced to 4.944 (delta=1.292). The former is running alone.
The latter is sharing a server with another submission, yet it's running faster. From this information alone there doesn't seem to be a deleterious effect of sharing a server. (Of course, the difference in speed could be due to a difference in the code of the two submissions, and it could also be that some paired submissions interfere with each other more than others do.)
From your description of your progress metric, a sub-challenge 1 submission should reach progress=7.5863 and a sub-challenge 2 submission should reach 12.6773. At this rate 8509095 would finish 4 days from now (7 days total at its current rate). 8508092 would finish 6 days from now, 9.8 days total.
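As a back-of-the-envelope check (treating the deltas reported above as per-day rates):

```python
# Simple rate extrapolation from the figures above.
def days_remaining(progress, target, delta_per_day):
    return (target - progress) / delta_per_day

print(days_remaining(3.132, 7.5863, 1.111))   # 8509095 (SC1): ~4.0 days
print(days_remaining(4.944, 12.6773, 1.292))  # 8508092 (SC2): ~6.0 days
```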
The mean completion time for a sub-challenge 1 submission in Round 3 was 2.25 days; for sub-challenge 2 it was 2.55 days. The 12-day limit is meant to allow submissions to complete even if there are issues beyond our control (e.g., increased network traffic within our cloud provider) that cause things to go a bit more slowly. I'm not sure how we can guarantee to complete a job that requires 10 days of computation under optimal conditions.
It's getting even faster now, about 600+ images per hour.
@brucehoff
Thanks for helping solve the issue.
I didn't keep track of the speed for 8509095 and 8508092 the whole time, since I had mostly given up hope on those two submissions. I just remember that as of yesterday afternoon both of them were still processing fewer than 100 mammography images per hour.
But now, based on my observations over the last two hours, both of them are speeding up and are processing more than 400 images per hour.
I hope this information is helpful for your analysis. Please do let me know if you have any suggestions regarding this concurrency issue for the final validation round. I will also look into what I can do to further speed up my code.
Thanks @darrylbobo. It looks like four of your six submissions have finished. Remaining are 8509095 (Sub-challenge 1, progress=3.132) and 8508092 (Sub-challenge 2, progress=4.944). One of the two is sharing a server and the other is running on a server by itself. Can you comment on the rate of progress of these two submissions? @brucehoff @Admin-Hoff
Sorry to bother you, but I am eager to know whether you have any suggestions for my currently running Round 3 submissions and for future validation submissions. I am worried that none of them will finish in time if this concurrency issue persists. @brucehoff
Thanks very much for your time and help. That does help me understand the situation. However, right now it seems all of my currently running jobs are slowing down again due to this concurrency issue. I am worried that none of them will finish in time if it persists. So, do you have any ideas or suggestions for my currently running Round 3 submissions and for future validation submissions?
Thanks.
> Do you know when the last time was that 8502236 shared a server with other submissions?
Yes! Each server has two "lanes", if you will, and runs just two (hopefully isolated) submissions concurrently. We keep track of when each submission starts and ends, which server it runs on, and which lane of that server it runs in. (If we restart a submission we lose this information, as it gets overwritten with the updated start, end, server, and lane info.) So we can try to answer your question by querying for which submissions have run in the parallel lane on the server where 8502236 is running, since the time it started.
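Conceptually, that lookup is something like the sketch below (the record fields and helper are hypothetical illustrations, not the actual tracking schema):

```python
# Hypothetical sketch of the "parallel lane" lookup; field names and data
# structures are illustrative only, not the actual tracking schema.
def parallel_lane_submissions(records, submission_id):
    """Return records for submissions that ran in the other lane of the same
    server, overlapping in time with the given submission."""
    target = next(r for r in records if r["submission"] == submission_id)
    return [
        r for r in records
        if r["server"] == target["server"]
        and r["lane"] != target["lane"]
        and (r["end"] is None or r["end"] >= target["start"])  # still running or overlapped
    ]
```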
Our records show that 8502236 started on 03/21/2017 01:59:37AM, PDT. Here are the submissions that ran concurrently with yours (in the parallel lane):
* xxxx047 ran from 03/21/2017 01:28:13AM to 03/23/2017 10:18:34PM
* xxxx796 ran from 03/23/2017 10:19:35PM to 03/29/2017 03:30:43AM
* xxxx096 started on 03/30/2017 12:04:49PM PDT and is still running.
So indeed the acceleration of 8502236 may coincide with xxxx796 stopping on 3/29.
> my submission 8502236 had a very good speed during the first several days (right after the Round 3 deadline) but slowed down significantly starting last Friday
That roughly coincides with the start of xxxx796 (last Friday was 3/24). So it appears that 8502236 ran alongside xxxx047 without any slowdown, but when xxxx796 started, 8502236 slowed down, only to speed up again when xxxx796 stopped.
@brucehoff
BTW, I tracked the progress over the last two hours and noticed that submission 8502236's speed is picking up (600+ images per hour, compared to fewer than 100 images per hour previously). Do you know when the last time was that 8502236 shared a server with other submissions?
Hi @brucehoff,
Thanks very much for your report.
First of all, I want to point out that the progress value for my submissions indicates how many mammography images have been processed. For example, **10.92 means 109.2K images** have been processed, so I believe that submission will finish in one or two days. Also, based on my tests on the express lane, my submissions should finish within 6 or 7 days, and my two submissions 8508120 and 8507997 are following that prediction.
Second, as I mentioned, my two SC2 submissions 8507997 and 8508092 use the same trained states and have almost exactly the same code except for one line (a Python math operation that will not significantly affect the speed). Yet 8507997 is 5 times faster than 8508092. Although you said this is not caused by sharing the server, I do want to share **the fact** that my submission 8502236 had a very good speed during the first several days (right after the Round 3 deadline) but slowed down significantly starting last Friday. This observation may indicate that although my submission is not sharing a server with other submissions now, the state of the server might somehow have changed while it was running.
Last, is there any chance that you could move my two slow-running SC1 submissions (8502236 and 8509095) to another machine, or at least one of them (8509095)? I am worried that none of them will be able to finish in time.
Thanks very much for your help.
Of 113 Round 3 submissions, most have finished, but 22 are still running, and they come from just 9 teams. From this perspective it appears that it is the nature of the submitted code, more than the server, that causes a submission to be slower than average. However, we will investigate the reported slow servers. @darrylbobo: six of the 22 still running are from your team.
Submission ID | Progress | Sharing a server?
--- | --- | ---
8509095 | 1.693 | YES
8458995 | NA | YES
8502236 | 5.349 | NO
8508092 | 2.843 | YES
8508120 | 10.16 | YES
8507997 | 10.92 | YES
If sharing a server caused submissions to be slow, then I would expect 8502236 to have greater progress than 8508120 or 8507997.
@vacuum @alalbiol
Thanks for your kind answers.
@brucehoff @thomas Is there any chance that you could move my two slow-running SC1 submissions to another machine, or at least one of them? I am worried that none of them will be able to finish in time.
Based on other teams' replies and my own observations of my SC2 submissions, I am now fairly sure that this is not due to my implementation or coding.
Thanks.
I have also observed the same thing, now during inference and previously during training. We have logs showing performance differences of about 5x between different epochs of the same training run. I suspect the problem could be access to the physical disk (but this is only an intuition).
I have seen it all the time, e.g., 8509793 and 8510126 started at the same time; one is now at 80% progress, the other at 40%. Same code.