Hi, I have two questions regarding the email you (@Michael.Mason) sent titled "Multiple Myeloma Challenge Leader Board Round 1 Update". You mentioned that:

> Additionally we are implementing threading in the submission architecture that will allow a handful of submissions to run in parallel.

1. Do you mean that we can now run our code in parallel (e.g. using `doMC` & `foreach`), or are you trying to run our Docker containers side by side to shorten the queue?
2. With this new change/improvement, how much memory and CPU will be assigned to our Docker containers? ([related post](https://www.synapse.org/#!Synapse:syn6187098/discussion/threadId=2473))
_I would also like to thank you for investigating the submitted code to find the bottleneck(s). I know that my main struggle is the first point you mentioned (iterating through VCF files), and there is nothing I can do about it since I need to sift out every bit of information I need. As a result I need enough memory, disk I/O, and CPU to handle this, and what I can do is minimize the memory footprint and optimize (read: vectorize)._
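For concreteness, here is a minimal sketch of the kind of in-container parallelism asked about in question 1 above, combined with the chunked, vectorized VCF iteration described in the follow-up. The core count, file paths, and extracted columns are illustrative assumptions, not anything taken from an actual submission:

```r
# Illustrative sketch only: parallelize over per-patient VCFs with doMC/foreach,
# each worker streaming its file in chunks to keep the memory footprint small.
# Core count, paths, and the extracted columns are placeholder assumptions.
library(doMC)
library(foreach)

registerDoMC(cores = 4)  # hypothetical core allocation inside the container

# Read one VCF in fixed-size chunks, keeping only CHROM, POS, and FILTER.
parse_vcf_chunked <- function(path, chunk_size = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    lines <- lines[!startsWith(lines, "#")]        # drop header lines
    if (length(lines) == 0) next
    fields <- strsplit(lines, "\t", fixed = TRUE)
    out[[length(out) + 1]] <- data.frame(
      chrom  = vapply(fields, `[`, character(1), 1),
      pos    = as.integer(vapply(fields, `[`, character(1), 2)),
      filter = vapply(fields, `[`, character(1), 7),
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, out)
}

vcf_files <- list.files("vcf", pattern = "\\.vcf$", full.names = TRUE)

# One worker per file; per-file results are stacked row-wise.
variants <- foreach(f = vcf_files, .combine = rbind) %dopar% {
  cbind(sample = basename(f), parse_vcf_chunked(f))
}
```

In this sketch the parallelism lives inside the participant's container; as the replies below clarify, the threading announced by the organizers is on the submission-queue side instead.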

Created by Mehrad Mahmoudian michelangelo
Mike, I was seriously proposing that on the express lane you select only 5 patients from each cohort, so the iteration would be much faster, roughly 20 times faster than now.
Dear All,

Thank you for your feedback. We clearly need to improve the ease of use for challenges with regard to Docker. This challenge should be about improving model accuracy, not about navigating the challenge infrastructure and optimizing data processing, which many people are spending the bulk of their time on. We are working to improve pain points, but much remains to be done.

Please note, however, that this challenge is only possible because of the Docker setup. Key data providers are particularly vigilant about patient data and insist that we not make the data public (even anonymized data).

I am currently working to provide VCFs filtered by allele frequency, missense annotation, and PASS filter. The original VCFs will still be available for participants interested in rare alleles.

Thank you again for your feedback,
Mike
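For illustration, a hedged sketch of the kind of low-level filtering described above (PASS status plus an allele-frequency cutoff), operating on raw VCF lines. The AF tag, the column positions, and the 0.01 threshold are assumptions made for the sketch, not the organizers' actual pipeline:

```r
# Hedged sketch: keep PASS variants whose INFO/AF exceeds a cutoff.
# The AF tag, column positions, and the 0.01 threshold are assumptions,
# not the organizers' actual filtering pipeline.
filter_vcf_lines <- function(lines, af_cutoff = 0.01) {
  header <- lines[startsWith(lines, "#")]
  body   <- lines[!startsWith(lines, "#")]
  fields <- strsplit(body, "\t", fixed = TRUE)
  filt   <- vapply(fields, `[`, character(1), 7)   # FILTER column
  info   <- vapply(fields, `[`, character(1), 8)   # INFO column
  # pull the AF=<value> tag out of INFO; NA when the tag is absent
  af <- suppressWarnings(as.numeric(sub(".*(^|;)AF=([^;]+).*", "\\2", info)))
  keep <- filt == "PASS" & !is.na(af) & af > af_cutoff
  c(header, body[keep])                            # header plus passing rows
}
```

A filtered file of this shape could then be written back out with `writeLines()`, so it would stay readable by the same code that parses the original VCFs.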
> I think the dockerization was a good move by Synapse

I found it a terrible move, because most of my time goes into debugging their system. I find I perform terribly in challenges that require Docker, and in general the overall performance of participants drops significantly.
Dear @yuanfang.guan,

> life cannot be harder when there is docker. all my features are still NA on line, do you have any plan to open a truncated data lane, e.g. a real express lane, with like 10 samples, to help us fix the problem????

I'm not sure which subchallenge you are referring to here, but I personally found the simulated tables quite useful. Perhaps having other tables simulated would be a good idea, since from where we stand the validation set is very different from the training set (it kind of feels like navigating a Mars rover with huge latency in order to study dark matter that we can neither see nor send back to the lab). I think the dockerization was a good move by Synapse, but it definitely needs to become more user-friendly.
> We don't have the donated compute resources that the DM Challenge had but we really should not need them.

That's usually where you email the people in the 'funders and sponsors' section.
> We are considering some options to make lives easier on all sides.

Life cannot be harder than it already is when there is Docker. All my features are still NA online. Do you have any plan to open a truncated data lane, e.g. a real express lane, with like 10 samples, to help us fix the problem?
Dear Mike,

Suffice to say I can neither confirm nor deny, for a simple reason: whatever I say through this channel could reveal part of the strategy that my team and I have come up with and spent hours upon hours on :) But I can change your question to "Would this type of filtering solve the current load situation?", to which my answer is "no, not to the extent that it would be worth the effort."

We all knew that this challenge is computationally heavy, and as far as I can remember, during the web call we had, you (the organizers) mentioned that, based on some agreement, the Docker containers would be run on AWS. I even raised the question of memory and CPU allocation during the call (although I have not received a conclusive answer since). Perhaps we need to have a brainstorming session and ask all teams for suggestions.

Actually, this would be relatively easy to investigate: add some timestamping lines to people's code and run them. I strongly doubt that merely reading the files in is the issue. If all my teammates are on board, I could probably discuss some parts of my methodology with you through a proper private channel if you think it could help solve this issue before the final submission.
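A minimal sketch of the timestamping idea mentioned above, assuming plain messages written to the container log; the stage names are placeholders:

```r
# Minimal sketch of coarse timestamping between pipeline stages.
# Stage names are placeholders; output goes to the container log.
log_step <- function(label) {
  message(sprintf("[%s] %s", format(Sys.time(), "%Y-%m-%d %H:%M:%S"), label))
}

log_step("start: reading VCFs")
# ... read and parse VCF files ...
log_step("done: reading VCFs")

log_step("start: feature extraction")
# ... build feature matrix ...
log_step("done: feature extraction")
```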
Dear Mehrad,

We are considering some options to make lives easier on all sides. Allowing saving is problematic for several reasons. We are considering providing pre-filtered VCFs based on MAF and/or PASS conditions. We want these to be low-level filters so that participants who want to do additional filtering still can. Do you or others think this would help your efforts?
To support Mehrad - for this challenge, we are not using the genetic processing algorithm that we typically use, and this decision is **entirely** driven by computational constraints and processing time.
Dear Mike,

I'm not writing this as a complaint but rather as feedback, since I cannot agree with the following two points you mentioned:

> We don't have the donated compute resources that the DM Challenge had **but we really should not need them.**

Here we are dealing with gigabytes of data (especially subchallenge 1) that need to be processed. IMHO this challenge should not be about technicalities and efficiency but about the scientific merit behind the methods, yet the current situation is the exact opposite.

> In general, **teams should not need doMC and parallelization once their models are trained outside of docker.**

I beg to differ: considering the number of samples in the validation set, one still has to transform the data into a suitable format (similar to the training set) so that the prediction function can run smoothly.
**Suggestion:** I would like to suggest that you let users save their variables on your end, so that their code doesn't need to start reading files and processing everything from scratch; instead it can load those variables (if they exist) and continue processing (a minimal sketch of this idea follows below).

+ **Pros:**
  - This would drastically reduce the load on your end
  - This would drastically reduce the waiting time on our end
+ **Cons:**
  - The only downside for you would be disk quota management for the next few weeks
  - The only downside for participants would be adding a few lines to read and write their variables in the correct format based on your standards, should they wish to adopt such a solution; otherwise they can run their code as before
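A minimal sketch of the save-and-resume idea above, assuming a writable directory (here the hypothetical `/output/checkpoints`) is mounted into the container and persists between runs; the feature-building step is a placeholder:

```r
# Sketch of the save-and-resume idea: recompute a heavy intermediate object
# only if no checkpoint exists. The checkpoint directory is a hypothetical
# persistent mount; build_features_from_vcfs() is a placeholder heavy step.
checkpoint_dir <- "/output/checkpoints"   # assumed persistent volume
dir.create(checkpoint_dir, showWarnings = FALSE, recursive = TRUE)

features_file <- file.path(checkpoint_dir, "features.rds")

if (file.exists(features_file)) {
  features <- readRDS(features_file)              # resume from a previous run
} else {
  features <- build_features_from_vcfs("vcf/")    # placeholder heavy step
  saveRDS(features, features_file)                # persist for the next run
}
```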

Nevertheless, I appreciate that you (the organizers) are acknowledging these issues and pushing forward to address them.
Dear Mehrad,

Actually, the parallel code is just on our end, so participants *cannot* start using parallelization within their Docker agents, unfortunately. We have implemented some code that runs up to 7 submissions at a time per challenge, **but** it is batched in such a way that it won't kick off the next batch until all 7 have finished. This is not ideal, and we are working on broader architecture changes that will simply spin up jobs on new machines. Those changes will not be part of this challenge, unfortunately.

We don't have the donated compute resources that the DM Challenge had, but we really should not need them. We are identifying individual teams whose submissions are taking a while and giving them ideas to run faster. As an example, one team's submission ran overnight and is still running in the express lane, so we may have to kill it and give them some suggestions. In general, teams should not need doMC and parallelization once their models are trained outside of Docker. I guess it would be nice if teams could implement VCF filtering with some threading, but we are not set up to do that.

Lastly, we are now timestamping submissions on our side to determine the distribution of running times. We will then set a time threshold so teams are not stuck waiting for others to finish. We will inform participants once that threshold is determined. We apologize for the bottlenecks.
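To illustrate the batching limitation described above, a toy sketch of a fixed batch barrier: each batch of 7 must finish completely before the next one starts. `run_submission()`, the sleep times, and the worker count are stand-ins, not the actual submission runner:

```r
# Illustrative sketch only: with a fixed-size batch, every batch waits for its
# slowest job before the next batch starts -- the limitation described above.
library(parallel)

run_submission <- function(id) { Sys.sleep(runif(1, 1, 5)); id }  # fake job

submissions <- 1:20
batch_size  <- 7

# Each mclapply() call is a barrier: batch n+1 cannot start until every job
# in batch n has finished, even if most workers are already idle.
batches <- split(submissions, ceiling(seq_along(submissions) / batch_size))
results <- lapply(batches, function(b) {
  mclapply(b, run_submission, mc.cores = batch_size)
})
```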

Question about "implementing threading in the submission architecture"