Hi,
There is a column called "file" in the training data. I looked up the samples in GEO DataSets, but I cannot find a file with exactly that name.
Could you please tell me what the "file" column refers to?
For example, where can I find the "GSE68234-GPL10999_series_matrix.txt.gz.anno.tsv" file listed in "geo-rnaseq-immune-cells.xlsx"?
Thank you in advance.
Kim
Created by Kimzheng

Hi @Kimzheng and all,
I have finished placing the GEO annotation files in this folder:
https://www.synapse.org/#!Synapse:syn20614465
Best,
Brian

Hi @djo and @Kimzheng,
Dominik, thank you for helping out Kim.
Yes, as Dominik guessed, those files are the raw GEO annotations.
Here is a little more detail on Dominik's suggestion to use GEOquery in R to get them.
The documentation for the getGEO function is here:
https://www.rdocumentation.org/packages/GEOquery/versions/2.38.4/topics/getGEO
This is how you could use it to extract the annotations:
library(GEOquery)  # also attaches Biobase, which provides pData() and exprs()
gse.obj <- getGEO("GSE68234")
pData(gse.obj[["GSE68234-GPL10999_series_matrix.txt.gz"]])
If there _were_ expression data associated with this in GEO, you could access that via:
exprs(gse.obj[["GSE68234-GPL10999_series_matrix.txt.gz"]])
As Dominik says, this is RNA-seq data and sometimes that isn't held in GEO.
I am also posting those files here:
https://www.synapse.org/#!Synapse:syn20614465
but it will take quite a while for me to upload all of them. Once I have, I will update the group.
Thanks,
Brian

Hi @djo,
Thank you very much for your answer!
Actually, I was wondering whether there is some other place where the _.anno.tsv_ files are stored.
Best regards,
Kim

Hi @Kimzheng,
The `series_matrix` files are hosted by NCBI. E.g. `GSE68234-GPL10999_series_matrix.txt.gz` can be found here: [GSE68234 matrix](ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68234/matrix/). But the expression data need not be part of that file. Specifically, RNA-seq data may be attached to a GSE accession in any format (e.g. see the bottom of the [GSE68234](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68234) page and click `(custom)` under Download).
You can download all supplementary files and the SOFT annotation for a given GSE accession with:
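The directory in that FTP link follows a pattern: the last three digits of the accession number are replaced by `nnn`. Here is a minimal bash sketch of that rule. The function name `geo_matrix_url` is made up for illustration, and the handling of accessions with three or fewer digits (all grouped under `GSEnnn`) is my understanding of the GEO layout rather than something I have checked for every series.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any GEO tool): print the FTP directory
# that holds the series matrix files for one GSE accession.
geo_matrix_url() {
  local gse="$1"
  local digits="${gse#GSE}"   # numeric part of the accession, e.g. 68234
  local stem=""
  # Series are grouped by thousands: drop the last three digits, append "nnn".
  # Assumption: accessions with three or fewer digits all live under GSEnnn.
  if [ "${#digits}" -gt 3 ]; then
    stem="${digits%???}"
  fi
  printf 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE%snnn/%s/matrix/\n' "$stem" "$gse"
}

geo_matrix_url GSE68234
# prints ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68234/matrix/
```

Replacing `matrix` with `suppl` in the format string should give the supplementary-files directory for the same series.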
```
gse=GSE68234
wget "ftp://ftp.ncbi.nlm.nih.gov/geo/series/${gse:0:-3}nnn/$gse/suppl/*"
wget "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=$gse&targ=gsm&form=text&view=quick" -O "${gse}.soft"
```
and then use the annotation in `${gse}.soft` and the related paper to interpret the data. More information on the SOFT format and other GEO data can be found here: https://www.ncbi.nlm.nih.gov/geo/info/
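To get a per-sample table out of the SOFT file, something like the following sketch may help. It assumes the usual SOFT sample layout (`^SAMPLE = GSM...` headers followed by `!Sample_title` and `!Sample_characteristics_ch1` lines). The function name `soft_to_tsv` and the three-column output are my own choices, so the result will not necessarily match the layout of the `.anno.tsv` files used in the training data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flatten the sample-level fields of a GEO SOFT file
# into TSV, one row per sample:
#   accession <TAB> title <TAB> characteristics joined by "; "
soft_to_tsv() {
  awk '
    # Return everything after the first "= " on a SOFT line.
    function value(line) { sub(/^[^=]*= /, "", line); return line }
    # Print the sample collected so far, if there is one.
    function emit() { if (gsm != "") printf "%s\t%s\t%s\n", gsm, title, chars }
    /^\^SAMPLE = /                 { emit(); gsm = value($0); title = ""; chars = "" }
    /^!Sample_title = /            { title = value($0) }
    /^!Sample_characteristics_ch1 = / {
      chars = (chars == "" ? value($0) : chars "; " value($0))
    }
    END { emit() }
  ' "$1"
}

# Example call, assuming the download step has been run:
# soft_to_tsv "${gse}.soft" > "${gse}.samples.tsv"
```

Other fields (for example `!Sample_source_name_ch1`) can be picked up by adding one more pattern line in the same style.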
You can also use the R package GEOquery to access the data: https://bioconductor.org/packages/release/bioc/html/GEOquery.html It makes obtaining microarray data very convenient.
The meaning of the `.anno.tsv` suffix in the file column of `geo-rnaseq-immune-cells.xlsx` never became clear to me. It may be an annotation file generated by the dataset's creator to annotate the series matrix file.
I hope this information is of some use to you.
Kind regards,
Dominik