Hello,
I noticed that the official description of the "HTA20_RMA.RData" file reads: "R Data file containing the gene level expression matrix eset_HTA20 (32,830 rows x 735 columns). Row names are ENTREZ gene IDs (except for "_at" suffix) and columns are SampleID." However, in my experience, the row names of a matrix like eset_HTA20 should be probe IDs. Is the organizers' description wrong? This confuses me.
My second question is how to convert the probe (or Entrez) IDs to gene symbols based on the package "pd.hta20.hs.entrezg".
Thank you very much for your answer.
Created by szjshuffle
Hi,
First of all, we can both be right at the same time. You say that microarray data is usually summarized at the probeset level, with probesets defined by the manufacturer (in this case around transcript clusters), which is true. This does not mean we are wrong to define probesets at the level of unique ENTREZ gene IDs (see http://brainarray.mbni.med.umich.edu).
Since we have provided the .CEL files, you may choose not to use a custom chip definition file (which is what we used, as in the call below):
data = read.celfiles(list,pkgname="pd.hta20.hs.entrezg")
but rely instead on the default package (pd.hta.2.0), which groups probes around transcript clusters; you can then map those identifiers to symbols or Entrez gene IDs (see more on how this works at https://support.bioconductor.org/p/94554/).
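A minimal sketch of that default-package route is given here, in case it helps. It is not the pipeline we ran: the function name `summarize_transcript_clusters` and the `cel_dir` argument are hypothetical, and it assumes the Bioconductor packages oligo, pd.hta.2.0, Biobase, AnnotationDbi and hta20transcriptcluster.db are installed.

```r
# Sketch only: summarize HTA 2.0 CEL files at the transcript-cluster level
# with the default platform package, then annotate with gene symbols.
# 'cel_dir' is a hypothetical folder holding the .CEL files.
summarize_transcript_clusters <- function(cel_dir) {
  stopifnot(dir.exists(cel_dir))
  cel_files <- list.files(cel_dir, pattern = "\\.CEL$",
                          full.names = TRUE, ignore.case = TRUE)

  # Without 'pkgname', read.celfiles() picks the platform package
  # (here pd.hta.2.0) from the CEL file headers.
  raw <- oligo::read.celfiles(cel_files)

  # target = "core" asks RMA for transcript-cluster (gene-level) summaries.
  eset <- oligo::rma(raw, target = "core")
  expr <- Biobase::exprs(eset)

  # Map transcript-cluster IDs to gene symbols; unmapped IDs come back as NA.
  symbol <- AnnotationDbi::mapIds(
    hta20transcriptcluster.db::hta20transcriptcluster.db,
    keys = rownames(expr), column = "SYMBOL",
    keytype = "PROBEID", multiVals = "first")

  list(expr = expr, symbol = unname(symbol))
}

# Usage (path is a placeholder):
# res <- summarize_transcript_clusters("path/to/cel_files")
# head(res$symbol)
```

Note that several transcript clusters can map to the same symbol, and many map to none, so some filtering or collapsing is usually needed afterwards.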
We preferred Entrez ID-level summaries because they make it easy to merge the data with samples profiled on other platforms in the next sub-challenge.
Regarding the second question, the gene level expression matrix has ENTREZ gene identifiers as row names (except for the _at suffix). You can use the Bioconductor package org.Hs.eg.db to map Entrez IDs to gene symbols:
> library(org.Hs.eg.db)
> load("HTA20_RMA.RData")
> symbol=as.vector(unlist(mget(gsub("_at","",rownames(eset_HTA20)), envir=org.Hs.egSYMBOL, ifnotfound=NA)))
> head(symbol)
[1] "A1BG" "NAT2" "ADA" "CDH2" "AKT3" "LINC02584"
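The same lookup can also be written with `AnnotationDbi::mapIds()`, which keeps the result aligned with the row names. This is an alternative sketch, not the code we used; the helper names `strip_at_suffix` and `entrez_to_symbol` are made up for illustration, and it assumes AnnotationDbi and org.Hs.eg.db are installed.

```r
# Remove the trailing "_at" that the custom CDF appends to each ENTREZ ID.
strip_at_suffix <- function(ids) sub("_at$", "", ids)

# Map row names such as "1_at" to gene symbols.
# Returns a character vector named by the original row names;
# IDs with no symbol in org.Hs.eg.db come back as NA.
entrez_to_symbol <- function(row_ids) {
  if (!requireNamespace("AnnotationDbi", quietly = TRUE) ||
      !requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    stop("Please install the Bioconductor packages AnnotationDbi and org.Hs.eg.db")
  }
  entrez <- strip_at_suffix(row_ids)
  symbol <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                  keys = entrez, column = "SYMBOL",
                                  keytype = "ENTREZID", multiVals = "first")
  setNames(unname(symbol), row_ids)
}

# Usage:
# load("HTA20_RMA.RData")                          # provides eset_HTA20
# symbol <- entrez_to_symbol(rownames(eset_HTA20))
# head(symbol)
```

Keeping the row names as vector names makes it easier to spot which rows have no symbol (NA) before renaming or filtering the matrix.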