Hello, community: can anyone tell me how to download the raw HTA20 data for the DREAM challenge? The raw HTA20 data is pretty big, so how can I download it and load it into an R session? Any ideas? Thanks.

Created by Jurat Shayiding jshahid
@bcbuprb: thanks again for your help and for maintaining this project. Honestly, I am very new to microarray analysis and am still learning the limma package. Your comment was helpful and much appreciated. Could you also show a few more examples using the gene-level expression data (HTA20_RMA.RData) in your workflow example? I hope your contribution will benefit others in this community. Thank you.
Both the gene-level data (HTA20_RMA.RData) and the probeset-level data (HTA20_RMA_probeset.RData) were obtained using RMA, which involves background correction, normalization, and summarization. As explained further in the wiki, the data were then corrected for batch effects. Please read the documentation carefully.
@jshahid, I'm unsure if the summarized data is already normalized. I leave that question up to @bcbuprb . I'm glad you can download the data, and am here for any Synapse support you may need. Thanks.
@thomas.yu: do you mean the RMA summarized data is already normalized? If so, I'll take care of the rest. I don't want to download the HTA20 raw data; it is too big and too time-consuming to download.
Hi, background correction and normalization are needed only if you start with the .CEL files. If you start with our version of the RMA summarized data (either gene level or probeset level), you do not need to preprocess it any further. For PCA you can look at the prcomp function in R. We can't really give crash courses in data analysis as part of this challenge. You can look to the R project and the Bioconductor project for more community support.
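As a rough illustration of the `prcomp` suggestion above, here is a minimal sketch in base R. It assumes a genes-by-samples numeric matrix; the matrix below is simulated and only stands in for the challenge data, and all object names are hypothetical.

```r
# Simulated stand-in for a genes x samples expression matrix
# (NOT the challenge data; names are hypothetical)
set.seed(1)
expr <- matrix(rnorm(200 * 12, mean = 8, sd = 1), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:12)))

# prcomp treats rows as observations, so transpose to put samples in rows
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:3], 3))

# Sample coordinates on the first two components
scores <- pca$x[, 1:2]
plot(scores, xlab = "PC1", ylab = "PC2",
     main = "PCA of samples (simulated data)")
```

With real data, coloring the points by batch or by a sample annotation is a common way to spot outliers or leftover batch structure.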
@thomas.yu: since downloading the HTA20 data is very time-consuming, is there a quick workflow that starts from the gene-level expression data (HTA20_RMA.RData) for simple tasks like **background correction, normalization and basic PCA**? Could you show a few steps to approach this correctly? I believe the community would benefit from some instructions on this. Thanks for your help.
Dear @jshahid,
Ah, my apologies. I didn't realize that syn18505793 was a folder. The `synGet` command is for file entities, for example the webinar file: https://www.synapse.org/#!Synapse:syn18778668.
```
library(synapser)
# You can use rememberMe=TRUE if you don't want to enter your username and password every time.
synLogin(username, password, rememberMe=TRUE)
ent = synGet('syn18778668')
print(ent$path)
[1] "......../.synapseCache/621/39344621/GMT20190522-170002_Preterm-Bi_2560x1440.mp4"
```
To download all the files in a folder, you will want to use:
```
library(synapser)
install.packages("synapserutils", repos=c("http://ran.synapse.org", "http://cran.fhcrc.org"))
library(synapserutils)
# login to Synapse
synLogin()
# download all the files in folder syn18505793 to a local folder called "myFolder"
all_files = syncFromSynapse(entity='syn18505793', path='/path/to/myFolder')
```
Please let me know if this doesn't work. All of this information is located here: https://docs.synapse.org/articles/downloading_data.html#downloading-in-bulk
Best, Tom
@thomas.yu: there is no attribute `ent$path`; `print(ent$path)` returns NULL. Maybe you could try it on your side. I don't know whether it's a problem with the R API.
```
> ent
Folder: HTA20 (syn18505793)
  properties:
    concreteType=org.sagebionetworks.repo.model.Folder
    createdBy=1420476
    createdOn=2019-04-17T19:28:16.540Z
    etag=01b25ab3-7ab3-11e9-98fa-026b0a0ad230
    id=syn18505793
    modifiedBy=3324230
    modifiedOn=2019-05-02T05:16:50.123Z
    name=HTA20
    parentId=syn18636841
  annotations:
> ent$path
NULL
>
```
Is there a better way to access the HTA20 raw CEL files using the R API that you provided? Can you provide a workable solution? Thanks.
Dear @jshahid,
Please review the "Accessing Data" section of our documentation for the R client: https://r-docs.synapse.org/articles/synapser.html. For your convenience:
```
ent = synGet("syn18505793")
print(ent)
print(ent$path)
```
Please run the commands above and let me know what you get.
Best, Tom
@bcbuprb: here is what I did to use the raw CEL files. I want to do preprocessing, background correction, normalization, and basic PCA, but I couldn't access the raw HTA20 CEL files in my R session. Why?
```
library(synapser)
synLogin("myusername", "mypassword")
ent = synGet("syn18505793")  # this command returns an environment, not the actual CEL files
# I tried this to access the raw CEL files, but it didn't work
df <- data.frame(setNames(lapply(ls(ent), get, envir=ent), ls(ent)))
```
If I am able to access all the raw CEL files in the HTA20 data folder, then here is my workflow:
```
library(affy)  # provides ReadAffy, bg.correct.rma, normalize.AffyBatch.quantiles
ano <- read.csv("anoSC1_v11_nokey.csv", stringsAsFactors = FALSE, header = TRUE)
rownames(ano) <- ano$sampleID
filepaths <- paste0("C:/HTA20/", ano$sampleID, ".CEL.gz")
taffy <- ReadAffy(filenames = filepaths, sampleNames = ano$sampleID, verbose = TRUE, phenoData = ano)
boxplot(taffy)
hist(taffy)
taffynormal <- bg.correct.rma(taffy)
taffynormal <- normalize.AffyBatch.quantiles(taffynormal)
boxplot(taffynormal)
```
It seems your R API does not let me get the HTA20 data. Any idea? Could you provide your workflow for making the above work? Thank you.
Preprocessing of the .CEL files to get the gene-level or probeset-level expression matrices is illustrated in the R script preprocess_data_SC1.R. If you have issues downloading the .CEL files using one of the programmatic options listed (Python, R, or command line), please provide specific error messages and our Synapse partners may be able to assist.
@bcbuprb: How can I do background correction, normalization and basic PCA for that? Is there a useful thread that I could look up? Could you provide some instructions on that? Thanks.
See the workflow example that starts with the preprocessed gene-level data: https://www.synapse.org/#!Synapse:syn18380862/discussion/threadId=5555
@bcbuprb: I am trying to start from the raw CEL files for preprocessing, background correction, normalization, and summarization of gene expression for chosen samples, but the HTA20 data is too big. Could you show me a few basic steps for using the matrix data instead? The tutorials I followed all used raw CEL files. Any further thoughts? Thanks.
See the limma package in Bioconductor for how to use an expression matrix (such as eset_HTA20) to find genes that change in expression with a given covariate.
@bcbuprb: Could you provide a short workflow showing how to use HTA20_RMA.RData for differential expression analysis? Do I need the HTA20 data (the raw data, 735 CEL files)? Any guidance? Thanks.
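To make the limma pointer above a little more concrete, here is a minimal sketch of the usual `lmFit` / `eBayes` / `topTable` sequence. It assumes limma is installed from Bioconductor; the expression matrix and the covariate are simulated stand-ins with hypothetical names, not the challenge data or the organizers' workflow.

```r
library(limma)  # Bioconductor package; must be installed separately

# Simulated stand-in for a genes x samples expression matrix (hypothetical)
set.seed(1)
expr <- matrix(rnorm(100 * 10, mean = 8), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("s", 1:10)))

# Hypothetical numeric covariate, one value per sample (column of expr)
covariate <- seq(10, 37, length.out = 10)

# Design matrix with an intercept and the covariate
design <- model.matrix(~ covariate)

# Fit one linear model per gene, then moderate the variances
fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Genes ranked by evidence of association with the covariate
top <- topTable(fit, coef = "covariate", number = 10)
print(top)
```

With real data, the covariate would come from the sample annotation file, ordered to match the columns of the expression matrix.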
The .CEL files are for those who want to use different preprocessing methods. If you are not sure how to use those files, you can start with the gene-level or probeset-level data instead (see the .RData files). I will post a basic workflow in another thread. Adi
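For readers who have not worked with .RData files before, the following is a small sketch of loading one and inspecting what it contains, using only base R. The file written here is a temporary, simulated stand-in with hypothetical names; with the challenge data you would instead point `load` at the downloaded file and look at whichever object names it reports.

```r
# Hypothetical stand-in: save a small simulated matrix to a temporary .RData file
set.seed(1)
eset_demo <- matrix(rnorm(20), nrow = 5,
                    dimnames = list(paste0("gene", 1:5), paste0("s", 1:4)))
rdata_path <- file.path(tempdir(), "demo_RMA.RData")
save(eset_demo, file = rdata_path)

# Load into a separate environment so nothing in the workspace is overwritten;
# load() returns the names of the objects it restored
loaded <- new.env()
object_names <- load(rdata_path, envir = loaded)
print(object_names)

# Pull out the first restored object and inspect it
expr <- get(object_names[1], envir = loaded)
print(dim(expr))
print(expr[1:3, 1:2])
```

Loading into a dedicated environment is a design choice, not a requirement; a plain `load(rdata_path)` restores the objects directly into the workspace.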
I tried `synGet("syn18505793")`, which returns an environment. How can I access the raw CEL files in my R session? Any idea?
Dear jshahid,
Do you get a successful login? After logging in, you have to actually download the file by going to the entity. For instance, this file: https://www.synapse.org/#!Synapse:syn18507612 has Synapse ID syn18507612. Therefore, in R you can do:
```
ent = synGet("syn18507612")
ent$path  # or ent@path; apologies, I'm currently not at a computer
```
Best, Tom
When I tried the R API, I got NULL from synLogin("jshahid", "password"). Why is that? Is there any way to load all the raw CEL files into an R session? Any idea?
Dear jshahid, Can you further explain what doesn't work about the R client? What are the specific functions you are using? Best, Tom
Thanks for your reply. I tried the R API but it doesn't work for me. Any better ideas?
Dear @jshahid, Unfortunately, I can't give you any guidance on how to use the data, but Synapse does have R and Python API clients which you can use to download the data. (This is more stable than downloading via the web.) See more here: https://docs.synapse.org/articles/getting_started_clients.html. Best, Tom
