Hi, I figure the open period is a chance to get feedback on how things are going, so I wanted to summarize some of the issues I've been running into when running Docker images on the Synapse server:
1.) I don't receive the log file email until well after submitting (sometimes > 24 hours), or maybe not at all? Because of the delay, when an email does arrive it's not always clear whether it refers to an earlier run or a second attempt, which raises another difficulty:
2.) No way to tell which running job a log or completion email refers to.
3.) Manually stopped jobs never leave the queue.
4.) Jobs remain in queue as "Running" long after receiving a Model Training Complete email, seemingly indefinitely.
For the last two points, if I posted this image right, it shows my current queue, with all the old jobs hanging around and long-completed jobs still "Running":
${imageLink?synapseId=syn7314264&align=None&responsive=true}
ken.sullivan: Great, thank you Bruce! A while ago I think I ran into the concurrent-user issue you mentioned, but right now everything looks great: clean queue, quick job start, etc.
Ken: Thanks for your feedback. There are indeed a couple of issues for us to iron out before we start the scoring round. A few comments:
> Don't receive log file email until way after submitting (sometimes > 24 hours), or maybe not at all?
The first email is sent when we start running your submission. Indeed, a submission may not start running for a long time due to the size of the queue.
> it's not always clear when an email does come if it refers to an earlier run ... No way to tell which running job a log or completion email refers to.
Yes. Each submission has a unique ID. We will add the ID to the email notifications so you can see which submission each one refers to. As a workaround, if you follow the link in a received email to look at your log files, the folder containing your zipped log file starts with a seven-digit number. That number is the submission ID.
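For example, if you have downloaded one of those log archives, a minimal Python sketch like the one below would pull out the leading seven-digit submission ID. The filename here is made up; only the "starts with a seven-digit number" convention comes from the note above.

```python
import re
from pathlib import Path

def submission_id_from_log_name(path):
    """Return the leading 7-digit submission ID from a log folder or zip name, if present."""
    match = re.match(r"(\d{7})", Path(path).name)
    return match.group(1) if match else None

# Hypothetical filename; the real archives just need to start with the 7-digit ID.
print(submission_id_from_log_name("7123456_logs.zip"))  # -> 7123456
```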
> Manually stopped jobs never leave the queue.
If you could please provide a submission ID, we can investigate.
> Jobs remain in queue as "Running" long after receiving a Model Training Complete email, seemingly indefinitely.
Yes. Jobs that had completed were shown as "Running" in this view. The problem should be fixed now; if not, please post again. There is also another problem in this view: new jobs that have not yet started running are shown as "Running" when they should say "Pending." We are fixing this. Finally, the infrastructure we are using for this queue/cancellation tool severely restricts concurrent users. This is a bigger problem which we are addressing, but it may take a couple of weeks.
I see the same. But I also see an additional issue: I asked my Docker container to read in only one image, and it ran out of memory. This was also seen by other participants running the example Docker.
I have been thinking about what the reason could be. Here is my diagnosis, followed by my proposed solutions.
POTENTIAL REASONS:
1. There are only a couple of nodes.
2. A cluster is there, but somehow it just didn't kick in, so everyone is running on the same node (most likely, and easiest to fix).
3. A cluster is there and did kick in, but somehow one participant managed to submit an image that occupies all the resources. I have been wondering for about a month now how that could even be done, so if this clever person exists, please raise your hand so we can all move forward.
4. Previous Docker containers that allocated GPUs were not properly terminated.
If there is anyone who can successfully run anything, please raise your hand; that is also useful information.
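One way to help tell these cases apart (just a rough sketch of what I would try, assuming nothing more than /proc/meminfo and, optionally, nvidia-smi being present in the submitted image) is to have the container print its own view of memory and GPU state when it starts, so the numbers show up in the log file email:

```python
import subprocess

def log_resources():
    """Print memory and GPU state at container start so it lands in the submission log."""
    # Memory as the container sees it (any cgroup limit applies on top of this).
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith(("MemTotal", "MemAvailable")):
                print(line.strip())
    # GPU usage, if nvidia-smi happens to be installed in the image.
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        print(result.stdout or result.stderr)
    except FileNotFoundError:
        print("nvidia-smi not found in this image")

if __name__ == "__main__":
    log_resources()
```

If MemAvailable is already small before any image is loaded, that would point toward reasons 1-3; leftover processes showing up in the nvidia-smi output would point toward reason 4.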
PROPOSED SOLUTIONS:
All of these problems seem to come down to not enough resources, resources not being correctly allocated, or shared resources being very hard to manage.
1. We each pay for our own individual AWS resources and for access to the data. Then we all have resources. But of course, that means we would each scp all the data to our own local disks.
2. The organizers pay AWS for as much as we request, so that we each have as much resource as we need, and then we reimburse the organizers.