Dear @PEGSDREAMChallengeParticipants, We are excited to announce that the training and validation data for the challenge are now available. We look forward to seeing your innovative solutions and wish you the best of luck. Happy coding! PEGS Organizers

Created by Gaia Andreoletti (gaia.sage)
Hi @ARTD ! Using the latest version of the synapse python client (v4.3.0) and this [Bulk downloads tutorial](https://python-docs.synapse.org/tutorials/python/download_data_in_bulk/?h=download#2-download-all-filesfolders-for-a-specific-folder-within-the-project), I was able to download the training data in 11 minutes. Here is a code snippet for your use case. Keep in mind the syntax looks a little different because the python client has undergone some major refactoring as of v4.0.0:

```
import synapseclient
from synapseclient.models import Folder

syn = synapseclient.Synapse()
syn.login()

folder = Folder(name="train_data_synthetic", parent_id="syn52817032")
folder.sync_from_synapse()
```

Hope this helps, Jenny
Hi, I am trying to download the training data using the python code below, but it is taking forever. Typically, with a good internet connection, how long should it take?

```
import synapseclient
import synapseutils

syn = synapseclient.Synapse()
syn.login(authToken="")
files = synapseutils.syncFromSynapse(syn, 'syn59063544')
```

Thanks, Ankita
@JoFa ,

> We understand the information in the description to mean that both training (*_train.*) and validation data (*_val.*) are mounted in the Docker container under "/input/". However, we (our code) only find validation data there.

Apologies for the confusion! To clarify, the training data (*_train.*) should be used to train your model/algorithms, while the validation data (*_val.*) will be used to validate them. To be more specific, what you will be submitting to Synapse is a trained model, since only the validation data will be mounted during the Leaderboard Round, and only the test data during the Final Round.

EDIT: My deepest apologies, I misspoke. You are correct that both the training and validation data should be mounted for the Docker submissions. The fix has been implemented and both the training and validation data are now mounted. We apologize for the inconvenience.
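For reference, below is a minimal sketch of how a submitted container might list and read the data mounted under "/input/". The file names are taken from other posts in this thread and the use of pyreadr for the .RData files is an assumption, so adjust both to match what is actually mounted.

```
# Minimal sketch of reading the challenge data mounted under /input/ inside a
# Docker submission. File names are taken from this thread and may differ;
# pyreadr is assumed to be installed in the container image.
import os

import pyreadr  # reads .RData files into pandas DataFrames

INPUT_DIR = "/input"

# Log everything that was mounted so the submission logs show what is available.
for name in sorted(os.listdir(INPUT_DIR)):
    print(name)

# Load the training and validation health/exposure survey files.
train = pyreadr.read_r(os.path.join(INPUT_DIR, "healthexposure_16jun22_v3.1_nonpii_train_synthetic.RData"))
val = pyreadr.read_r(os.path.join(INPUT_DIR, "healthexposure_16jun22_v3.1_nonpii_val.RData"))

# pyreadr returns a dict keyed by the R object name; take the first data frame.
train_df = next(iter(train.values()))
val_df = next(iter(val.values()))
print(train_df.shape, val_df.shape)
```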
Hi @JoFa, Thank you for identifying these issues.
1. We are looking into this.
2. We have identified some discrepancies with the epr_numbers in the files and are working on fixing these issues.

We will update this thread once the rectified files have been uploaded. Thank you, Farida
Dear @gaia.andreoletti @farida and @vchung , Two things are not clear to us about the Challenge submission workflow:
- We understand the information in the description to mean that both training (*_train.*) and validation data (*_val.*) are mounted in the Docker container under "/input/". However, we (our code) only find validation data there.
- If we have not misunderstood, the file "healthexposure_16jun22_v3.1_nonpii_val.RData" should contain all the epr_numbers (N=3062) that are expected in the returned .csv for scoring. However, using these numbers, we always receive the error "Found 50 unknown ID(s)", so only 3012 IDs seem to be accepted.

We would be very pleased to receive a brief explanation of what we have misunderstood or what we are doing wrong. Thank you in advance :-)
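As an illustration of the scoring expectation discussed above, here is a rough sketch of building a predictions .csv that covers every epr_number in the mounted validation file. The column names and the /output/ path are placeholders, not the official submission format, so check the challenge wiki before relying on anything like this.

```
# Sketch: build a predictions CSV with one row per epr_number from the
# validation RData file. Column names and the /output/ path are placeholders;
# follow the submission format described on the challenge wiki.
import pandas as pd
import pyreadr

val = pyreadr.read_r("/input/healthexposure_16jun22_v3.1_nonpii_val.RData")
val_df = next(iter(val.values()))

predictions = pd.DataFrame({
    "epr_number": val_df["epr_number"].unique(),
    "prediction": 0.5,  # replace with your model's output
})
predictions.to_csv("/output/predictions.csv", index=False)
```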
Hi @n.ramil, Thank you for identifying the issue with the epr_numbers. We have found that the synthetic RData files erroneously contain duplicate epr_numbers. We are fixing this issue and will update the data files by the end of this week. - Farida Akhtari
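Once the corrected files are posted, a quick way to confirm the duplicates are gone is a check along these lines (a sketch that assumes pyreadr can read the RData file and that the path matches your local download):

```
# Sketch: count duplicated epr_number values in a downloaded RData file.
import pyreadr

result = pyreadr.read_r("healthexposure_16jun22_v3.1_nonpii_train_synthetic.RData")
df = next(iter(result.values()))

n_dupes = df["epr_number"].duplicated().sum()
print(f"{n_dupes} duplicated epr_number value(s)")
```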
Dear Gaia @gaia.andreoletti and Farida @farida I have downloaded the GWAS data for the training and validation sets. I ran a simple allele frequency command, which finished without problems, so there are no shortcomings with the GWAS part.

However, I looked for the corresponding "epr_number" information in the other datasets. For the training data, I found only 940 epr_numbers in "healthexposure_16jun22_v3.1_nonpii_train_synthetic.RData", and exactly the same 940 epr_numbers in "bcbb_map_22jun22_v3.1_nonpii_train_synthetic.RData". The remaining 575 epr_numbers are absent.

Similarly, only 934 epr_numbers from "PEGS_GWAS_genotypes_v1.1_val_synthetic.fam" are found in "bcbb_map_22jun22_v3.1_nonpii_val_synthetic.RData". The remaining 581 are absent.

Can you please investigate this missing information in the RData files? Ramil
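For anyone who wants to reproduce this cross-file check, a sketch is below. It assumes the standard six-column PLINK .fam layout and that the IID column holds the epr_number; adjust if the IDs are encoded differently.

```
# Sketch of the cross-file check described above: how many IDs from the GWAS
# .fam file also appear in the MAP RData file. Assumes the standard six-column
# PLINK .fam layout and that the IID column holds the epr_number.
import pandas as pd
import pyreadr

fam = pd.read_csv(
    "PEGS_GWAS_genotypes_v1.1_train_synthetic.fam",
    sep=r"\s+", header=None,
    names=["FID", "IID", "PAT", "MAT", "SEX", "PHENO"],
)
rdata = pyreadr.read_r("bcbb_map_22jun22_v3.1_nonpii_train_synthetic.RData")
map_df = next(iter(rdata.values()))

fam_ids = set(fam["IID"].astype(str))
map_ids = set(map_df["epr_number"].astype(str))
print(f"in .fam and in RData: {len(fam_ids & map_ids)}")
print(f"in .fam but missing from RData: {len(fam_ids - map_ids)}")
```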
Hi @n.ramil, We have fixed the Plink files to contain all SNPs. You can now re-download the full files. Thank you for raising this with us!
Hi @n.ramil, The gender (here you mean sex) information is in the Survey files. It is the "sex_derived" variable in PEGS_freeze_v3.1_nonpii/MAP/bcbb_map_22jun22_v3.1_nonpii_train_synthetic.RData. The "epr_number" is the participant ID / primary key used to join the various files. We have deliberately not included it in the genomic data to avoid any discordance between files. You are free to edit the .fam files if you wish, or to specify the sex variable separately from the above file. We'll get back to you on the SNPs issue.
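One way to carry sex_derived over without hand-editing the .fam files is to write a small mapping file for PLINK's --update-sex flag. The sketch below assumes the .fam IID matches epr_number and that sex_derived uses a coding PLINK accepts (e.g. 1/M = male, 2/F = female); check the PEGS codebook first.

```
# Sketch: build an input file for `plink --update-sex` from the sex_derived
# variable in the MAP RData file. Assumes the .fam IID matches epr_number and
# that sex_derived uses a coding PLINK accepts; verify against the codebook.
import pandas as pd
import pyreadr

fam = pd.read_csv(
    "PEGS_GWAS_genotypes_v1.1_train_synthetic.fam",
    sep=r"\s+", header=None,
    names=["FID", "IID", "PAT", "MAT", "SEX", "PHENO"],
)
rdata = pyreadr.read_r("bcbb_map_22jun22_v3.1_nonpii_train_synthetic.RData")
map_df = next(iter(rdata.values()))[["epr_number", "sex_derived"]].copy()

fam["IID"] = fam["IID"].astype(str)
map_df["epr_number"] = map_df["epr_number"].astype(str)

merged = fam.merge(map_df, left_on="IID", right_on="epr_number", how="left")
merged[["FID", "IID", "sex_derived"]].to_csv(
    "update_sex.txt", sep="\t", header=False, index=False
)
# Then, for example:
#   plink --bfile PEGS_GWAS_genotypes_v1.1_train_synthetic \
#         --update-sex update_sex.txt --make-bed --out train_with_sex
```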
Dear @gaia.andreoletti I have compared the contents of "PEGS_GWAS_genotypes_v1.1_train_synthetic.bim" and "PEGS_GWAS_genotypes_v1.1_val_synthetic.bim".

Firstly, I found exactly 15,000,000 variants in each file, which is odd. Secondly, there are 9,395,029 SNPs that are present in "PEGS_GWAS_genotypes_v1.1_val_synthetic.bim" but absent from "PEGS_GWAS_genotypes_v1.1_train_synthetic.bim", and likewise 9,395,029 SNPs that are present in "PEGS_GWAS_genotypes_v1.1_train_synthetic.bim" but absent from "PEGS_GWAS_genotypes_v1.1_val_synthetic.bim". These are, of course, different sets of nine million SNPs.

As confirmation, these are the md5sums:
f04a35e930144c6e085fb1e376dbe6e1 PEGS_GWAS_genotypes_v1.1_train_synthetic.bim
be4647f56a8e8f18a33abd7e75ee03a5 PEGS_GWAS_genotypes_v1.1_val_synthetic.bim

These two files, as well as the test dataset file, should be identical from my point of view. Otherwise, predictions generated on SNPs from the train set would have no counterpart SNPs in the test set.

Finally, there is no gender information in the corresponding *.fam files. I suppose this information is quite important for prediction. I can impute it from the SNPs with "plink --impute-sex", but for 41 females in the training set and 52 in the validation set I need to relax the thresholds to call them female. Nevertheless, from my point of view, if you included gender directly in the *.fam files for all three datasets, you would remove an entirely unnecessary computation and make training easier for less experienced researchers. Ramil
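Until the files are harmonized, one workaround is to restrict both datasets to the SNPs they share. The sketch below compares the variant IDs in the two .bim files and writes an extract list for PLINK; it assumes the standard six-column .bim layout.

```
# Sketch: find the SNPs shared by the train and validation .bim files and write
# them to a list usable with `plink --extract`. Assumes the standard .bim
# layout: chrom, variant id, cM, position, allele 1, allele 2.
import pandas as pd

cols = ["chrom", "snp", "cm", "pos", "a1", "a2"]
train_bim = pd.read_csv("PEGS_GWAS_genotypes_v1.1_train_synthetic.bim",
                        sep=r"\s+", header=None, names=cols)
val_bim = pd.read_csv("PEGS_GWAS_genotypes_v1.1_val_synthetic.bim",
                      sep=r"\s+", header=None, names=cols)

train_snps = set(train_bim["snp"])
val_snps = set(val_bim["snp"])
shared = train_snps & val_snps
print(f"train-only: {len(train_snps - shared)}, "
      f"val-only: {len(val_snps - shared)}, shared: {len(shared)}")

with open("shared_snps.txt", "w") as fh:
    fh.write("\n".join(sorted(shared)))
# Then, for example:
#   plink --bfile PEGS_GWAS_genotypes_v1.1_train_synthetic \
#         --extract shared_snps.txt --make-bed --out train_shared
```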
Dear @vchung , Thanks for your useful reminder. The problem was solved after I put my Synapse PAT into authToken. Sincerely yours, Tsai-Min
@chentsaimin , Thank you for sharing your code; I can help you with the technical issues. I just want to check: was your Synapse PAT omitted in the above code, or did you log in with `authToken=""`? The latter would mean you are logging in as an anonymous user and, therefore, would not have download access to the challenge data. Let me know.
Dear @gaia.andreoletti, Thanks for your information. I tried to download the train_data_synthetic data, syn59063544, with the following python code:

```
import synapseclient
import synapseutils

syn = synapseclient.Synapse()
syn.login(authToken="")
files = synapseutils.syncFromSynapse(syn, 'syn59063544')
```

However, it shows me:

```
[WARNING]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: You have READ permission on this file entity but not DOWNLOAD permission. The file has NOT been downloaded.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```

The same thing happens when downloading the val_data_synthetic data. Could you please help me check it? Sincerely yours, Tsai-Min
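For anyone hitting the same warning: it can happen when the session is anonymous (for example, an empty authToken) or when the account is not yet registered for the challenge. Below is a sketch of logging in with a personal access token read from an environment variable instead of hard-coding it; the variable name here is a convention, not a requirement.

```
# Sketch: log in with a Synapse personal access token (PAT) read from an
# environment variable instead of passing an empty authToken string, then
# sync the training data folder to a local directory.
import os

import synapseclient
import synapseutils

syn = synapseclient.Synapse()
# The client can also pick up a token from the SYNAPSE_AUTH_TOKEN environment
# variable when syn.login() is called with no arguments.
syn.login(authToken=os.environ["SYNAPSE_AUTH_TOKEN"])

files = synapseutils.syncFromSynapse(syn, "syn59063544", path="train_data_synthetic")
```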
Hi @chentsaimin, here is the link: https://www.synapse.org/Synapse:syn52817032/files/. As a PEGS DREAM Challenge participant, you should be able to download the files within each folder.
Dear @gaia.andreoletti, I could not find the download link for the distributable synthetic data created from the original tabular PEGS data using the synthpop v1.8 library in R. Could you please help me check it? Sincerely yours, Tsai-Min
