We have detected that progress in the inference queue has decreased drastically since yesterday.
Given that computation time in this round is limited, and that the machines become less responsive as the number of submissions increases, which metric of CPU time is being applied?
**From our counter we assume that we have access to 32 CPU cores, two GPUs, and 200 GB of RAM.**
Also, access to the filesystem containing the image dataset should be 'predictable'.
Created by Kiko Albiol (kikoalbiol)

Hi Kiko,
> From our counter we assume that we have access to 32 CPU cores, two GPUs, and 200 GB of RAM.
The number of GPUs and the amount of RAM are correct, but the number of CPUs is not.
The values of these parameters are provided by the following environment variables, described [here](https://www.synapse.org/#!Synapse:syn4224222/wiki/409763):
- NUM_CPU_CORES: number of CPU cores
- NUM_GPU_DEVICES: number of GPU devices
- GPUS: list of GPU devices (e.g. "/dev/nvidia2,/dev/nvidia3")
- MEMORY_GB: total amount of memory in GB
Please use these variables in your code. Using an incorrect number of CPUs, for example, may affect the performance of your method.
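For example, a minimal Python sketch (hypothetical code, not part of the challenge infrastructure) that reads these variables and sizes a worker pool from them could look like this:

```python
import os
from multiprocessing import Pool

# Read the resource limits exposed by the submission environment.
# The fallback defaults are hypothetical; on the Express Lane and
# Inference machines these variables should always be set.
num_cpu_cores = int(os.environ.get("NUM_CPU_CORES", "1"))
num_gpu_devices = int(os.environ.get("NUM_GPU_DEVICES", "0"))
gpus = os.environ.get("GPUS", "")
gpu_devices = gpus.split(",") if gpus else []
memory_gb = float(os.environ.get("MEMORY_GB", "0"))

print(f"CPU cores: {num_cpu_cores}")
print(f"GPU devices ({num_gpu_devices}): {gpu_devices}")
print(f"Memory: {memory_gb} GB")

# Size the worker pool from the reported core count rather than
# os.cpu_count(), which may report the host's cores instead of the
# cores actually allocated to your submission.
with Pool(processes=num_cpu_cores) as pool:
    pass  # submit your per-subject work here
```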
> So, our issue now is: how are these 15 days calculated, and based on what resources?
The wall time for inference submissions in the Leaderboard Phase is 12 days, not 15 days.
The Express Lane and Inference machines have the same configuration. First, you should run your SC1/SC2 inference submission on the Express Lane to determine the progress rate (number of subjects processed / runtime of the submission, possibly discarding the initialization time of your method if required). By combining this information with the approximate number of subjects that we gave in the last newsletter, you should be able to estimate how long your method would take on the Leaderboard scoring dataset.
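As a rough illustration of that extrapolation (all numbers below are placeholders, not actual challenge figures), the estimate could be computed as follows:

```python
# Placeholder values; substitute what you observe from your own
# Express Lane run and the subject count from the newsletter.
subjects_processed = 20        # subjects completed on the Express Lane
runtime_hours = 4.0            # wall time of the Express Lane run
init_time_hours = 0.5          # one-off initialization time, discarded

progress_rate = subjects_processed / (runtime_hours - init_time_hours)  # subjects per hour

leaderboard_subjects = 1000    # approximate count from the last newsletter (placeholder)
estimated_hours = init_time_hours + leaderboard_subjects / progress_rate
estimated_days = estimated_hours / 24

print(f"Progress rate: {progress_rate:.1f} subjects/hour")
print(f"Estimated Leaderboard runtime: {estimated_days:.1f} days (wall-time limit: 12 days)")
```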
Thanks!
Thomas

Hi Thomas,
The point is that one of our jobs reached 17% within a day, but it is now halted at this value (or increasing too slowly).
This could be due to a problem with the occupancy of the service, or to a local problem.
The point is that if the issue is not due to our software but the occupancy is not predictable, then the time needed to produce a good result with our system may have been optimistically estimated (we only have 15 days). So, our issue now is: how are these 15 days calculated, and based on what resources?

Hi Kiko,
> From our counter we assume that we have access to 32 CPU cores, two GPUs, and 200 GB of RAM.
You do not receive logs from inference submissions (except for the last n bytes in case of an error, to help you figure out what the issue is). Can you please provide more information regarding the above "assumption"? In addition, what exactly is the issue? If it is related to CPU, what is your point of reference?
Thanks!