Hi, There is a column called "file" in the training data. I accessed the samples via GEO datasets, but I cannot find a file with exactly that name. Could you please tell me what the "file" column points to? For example, where can I find the file "GSE68234-GPL10999_series_matrix.txt.gz.anno.tsv" listed in "geo-rnaseq-immune-cells.xlsx"? Thank you in advance. Kim

Created by Kimzheng
Hi @Kimzheng and all, I have finished placing the GEO annotation files in this folder: https://www.synapse.org/#!Synapse:syn20614465 Best, Brian
Hi @djo and @Kimzheng,

Dominik, thank you for helping out Kim. Yes, as Dominik guessed, those files are the raw GEO annotations. Here is a little more detail on Dominik's suggestion to use GEOquery in R to get them. The documentation for the `getGEO` function is here: https://www.rdocumentation.org/packages/GEOquery/versions/2.38.4/topics/getGEO

This is how you could use it to extract the annotations: load the package with `library(GEOquery)`, fetch the series with `gse.obj <- getGEO("GSE68234")`, and then call `pData(gse.obj[["GSE68234-GPL10999_series_matrix.txt.gz"]])`.

If there _were_ expression data associated with this series in GEO, you could access it via `exprs(gse.obj[["GSE68234-GPL10999_series_matrix.txt.gz"]])`. As Dominik says, this is RNA-seq data, and sometimes that isn't held in GEO.

I am also posting those files here: https://www.synapse.org/#!Synapse:syn20614465 but it will take quite a while for me to upload all of them. Once I have, I will update the group.

Thanks, Brian
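For anyone not working in R, roughly the same sample table can be pulled straight out of a downloaded series matrix file, since the per-sample annotations sit in its `!Sample_` header lines. The following is only a sketch: `series_matrix_to_anno` is a hypothetical helper name, and whether the `.anno.tsv` files in the challenge were produced in exactly this way is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GEO or GEOquery): print the "!Sample_"
# header rows of a gzipped series matrix file as a samples-by-fields TSV.
series_matrix_to_anno() {
  local matrix_gz="$1"
  gzip -dc "$matrix_gz" | awk -F'\t' '
    /^!Sample_/ {
      rows++
      gsub(/"/, "")              # drop the quotes GEO puts around values
      sub(/^!Sample_/, "", $1)   # "!Sample_title" -> "title"
      for (i = 1; i <= NF; i++) cell[rows, i] = $i
      if (NF > cols) cols = NF
    }
    END {
      # Transpose: each input row is one field, each input column one sample.
      for (i = 1; i <= cols; i++) {
        line = cell[1, i]
        for (r = 2; r <= rows; r++) line = line "\t" cell[r, i]
        print line
      }
    }'
}
```

A call such as `series_matrix_to_anno GSE68234-GPL10999_series_matrix.txt.gz > anno.tsv` would write a header row of field names followed by one row per sample. Field names that occur more than once in the file (for example several `characteristics_ch1` rows) show up as repeated column names.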
Hi @djo, Thank you very much for your answer! Actually, I was wondering whether the _.anno.tsv_ files are stored somewhere else. Best regards, Kim
Hi @Kimzheng,

The `series_matrix` files are hosted by NCBI. For example, `GSE68234-GPL10999_series_matrix.txt.gz` can be found here: [GSE68234 matrix](ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68234/matrix/). However, the expression data need not be part of that file. In particular, RNA-seq data may be attached to a GSE number in any format (e.g. see the bottom of [GSE68234](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68234) and click `(custom)` under Download). You can download all supplementary files and the SOFT annotation for a given GSE number with

```
gse=GSE68234
# Supplementary files; ${gse:0:-3} drops the last three characters (needs bash 4.2+)
wget "ftp://ftp.ncbi.nlm.nih.gov/geo/series/${gse:0:-3}nnn/$gse/suppl/*"
# Sample-level annotation in SOFT format
wget "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=$gse&targ=gsm&form=text&view=quick" -O "${gse}.soft"
```

and then use the annotation in `${gse}.soft` and the related paper to interpret the data. More information on the SOFT format and other GEO data can be found here: https://www.ncbi.nlm.nih.gov/geo/info/

You can also use the R library GEOquery to access the data: https://bioconductor.org/packages/release/bioc/html/GEOquery.html It makes obtaining microarray data very convenient.

I never figured out what the `.anno.tsv` suffix in the file column of `geo-rnaseq-immune-cells.xlsx` refers to. It may be an annotation file that the creator generated to annotate the series matrix file.

I hope this information is of some use to you.

Kind regards, Dominik
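The `GSE68nnn` part of the FTP path follows from the accession number: the last three digits are replaced by `nnn`, and accessions with three or fewer digits live under `GSEnnn`. Below is a minimal sketch of that rule that avoids the negative-length substring, so it should also run on older bash versions. The function name `geo_series_dir` is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the GEO FTP directory URL for a series accession.
geo_series_dir() {
  local gse="$1" digits stub
  if [[ ! "$gse" =~ ^GSE[0-9]+$ ]]; then
    echo "not a GSE accession: $gse" >&2
    return 1
  fi
  digits="${gse#GSE}"   # numeric part, e.g. 68234
  if (( ${#digits} > 3 )); then
    # Keep everything except the last three digits, e.g. 68234 -> GSE68nnn
    stub="GSE${digits:0:${#digits}-3}nnn"
  else
    # Short accessions such as GSE123 are grouped under GSEnnn
    stub="GSEnnn"
  fi
  echo "ftp://ftp.ncbi.nlm.nih.gov/geo/series/${stub}/${gse}"
}
```

It could then be combined with the commands above, for example `wget "$(geo_series_dir GSE68234)/matrix/*"` to fetch the series matrix files, or with `/suppl/*` for the supplementary files.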

Where is the "file" in training data?