How to find which fastq file corresponds to which specimen ID

Hello! I am trying to download the FASTQ files from "ROSMAP RNAseq fastq files" (syn21589959) and was searching for the corresponding metadata, which I assume is in "ROSMAP_assay_RNAseq_metadata.csv" (syn21088596). However, I was unable to work out which FASTQ file corresponds to which specimen ID. Am I trying to match the wrong projects with each other? Any help would be greatly appreciated! Thank you!

Created by Young-Jun Jeon jeonrpm2020
So sorry, I think I figured this out. The project ID corresponds to each donor, so I was able to use both the clinical metadata file and the study metadata CSV file, along with the legend from the data dictionary (https://www.synapse.org/#!Synapse:syn3191087). Hope this helps others who may be trying to do the same thing.
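For others attempting the same join, here is a minimal Python sketch of linking the two files on the project ID. The column names (`projid`, `specimenID`, `sex`) and every value in the sample rows are assumptions made up for illustration; check the data dictionary for the real headers before using it.

```python
import csv
import io


def annotate_by_key(study_rows, clinical_rows, key):
    """Attach the clinical row to each study row that shares the same key value."""
    clinical_by_key = {row[key]: row for row in clinical_rows}
    merged = []
    for row in study_rows:
        donor = clinical_by_key.get(row[key])
        if donor is not None:
            # Study columns first, then clinical columns for the same donor.
            merged.append({**row, **donor})
    return merged


# Placeholder data only: these headers and values are hypothetical.
study_csv = io.StringIO("specimenID,projid\nSAMPLE_A,0001\nSAMPLE_B,0002\n")
clinical_csv = io.StringIO("projid,sex\n0001,F\n0003,M\n")

study_rows = list(csv.DictReader(study_csv))
clinical_rows = list(csv.DictReader(clinical_csv))
merged = annotate_by_key(study_rows, clinical_rows, key="projid")
```

With real files, the two `io.StringIO` objects would be replaced by `open(path, newline="")` handles. Rows without a matching donor are dropped rather than guessed at, so comparing `len(merged)` with `len(study_rows)` shows how many samples failed to match.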
Hi again @abby.vanderlinden, I found a file called ROSMAP_clinical.csv, and it has project IDs. However, the individual IDs (from the ROSMAP metadata) do not match the sample IDs in the clinical data CSV. The project IDs do match, so can we assume that the project ID is unique to each sample? (And for the bulk microglia ROSMAP gene expression data, should we expect to see 4 project IDs per sample, since it was sequenced on 2 lanes for paired-end seq?) Sorry for the confusion. Please email me if you'd like to see the files I am referring to (medwards@umass.edu).
Hi @jaclynbeck @abby.vanderlinden, I am also struggling to find the appropriate metadata for my study (ROSMAP, gene expression, SYN11468526). I see the metadata csv file with the sample ID and accompanying barcodes, but there isn't any sample data for sex and/or gender, age, etc. Where can I find this data, especially if my study is focused on sex differences? Thank you, Mélise Edwards
Hello, I just took a look at ROSMAP_biospecimen_metadata.csv and I do see an entry for RISK_100, on line 5168. It should be in the "specimenID" column. I found it by loading the csv into R as `tmp` and running `grep("RISK_100", tmp$specimenID)`. Can you try this (or something similar if you're not working in R) and verify whether it does/doesn't find that specimen ID on your end? Jaclyn
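For anyone not working in R, a rough Python equivalent of the check above might look like the sketch below. It uses only the standard library. The helper name `find_specimen_rows` and the sample rows are hypothetical; only the `specimenID` column name comes from this thread. Note that it tests for an exact match, which is stricter than `grep` (the R call would also match IDs such as "RISK_1000").

```python
import csv
import io


def find_specimen_rows(csv_file, specimen_id, column="specimenID"):
    """Return (line_number, row) for every row whose `column` equals `specimen_id`."""
    reader = csv.DictReader(csv_file)
    hits = []
    for row in reader:
        if row.get(column) == specimen_id:
            # line_num counts lines read so far, including the header line.
            hits.append((reader.line_num, row))
    return hits


# Placeholder rows for demonstration; real use would pass
# open("ROSMAP_biospecimen_metadata.csv", newline="") instead.
sample = io.StringIO(
    "individualID,specimenID\n"
    "DONOR_A,RISK_100\n"
    "DONOR_B,RISK_1000\n"
)
hits = find_specimen_rows(sample, "RISK_100")
```

An empty `hits` list would suggest that the ID is stored differently in your copy of the file (extra whitespace, a different column) or that the download is incomplete.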
Hello Jaclyn, I still have issues mapping the ROSMAP FASTQ files to their pathology states (AD or control). Taking the RISK_100_S63_R1_001.fastq.gz file as an example, I looked in both "ROSMAP_assay_RNAseq_metadata.csv" and "ROSMAP_biospecimen_metadata.csv", but I failed to find the "RISK_100" ID in either file. So I cannot tell which FASTQ files correspond to AD samples and which to control samples. I would highly appreciate it if you could help. Thanks in advance! Best, Hongdong
The only information I can find is from the wiki page for [syn3388564](https://www.synapse.org/#!Synapse:syn3388564): "Then RNA-Seq data were processed by our parallelized and automatic pipeline. These pipeline include trimming the beginning and ending bases from each read, identifying and trimming adapter sequences from reads, detecting and removing rRNA reads, aligning reads to reference genome. We used the non-gapped aligner Bowtie to align reads to transcriptome reference and then applied RSEM to estimate expression levels for all transcripts." and the supplementary methods from the [published paper](https://doi.org/10.1038%2Fs41593-018-0154-9), which says the same thing. So it sounds like Bowtie was the main program for alignment. The paper says they used GRCh37 to align histone modification data, so I assume they used the same reference for the RNA Seq data although they do not explicitly state that. If you need more in-depth information on their pipeline I think you will need to contact the authors directly. I hope that helps! Jaclyn
Thank you for the reply. May I ask how the BAM files were mapped? What programs and reference files were used during the mapping process? Thanks once again!
Hello! The FASTQ files should all be in the format `<specimenID>_S<#>_R<1/2>_001.fastq.gz`. So, for example, files "RISK_100_S63_R1_001.fastq.gz" and "RISK_100_S63_R2_001.fastq.gz" are for specimenID "RISK_100" in the RNAseq_metadata file. A lot of the data was provided as BAM files instead of FASTQ files, so the FASTQ folder will not have files for all the specimens listed in the metadata file. You can find those BAM files with the other specimen data here: [syn22333035](https://www.synapse.org/#!Synapse:syn22333035). I hope that helps! Let me know if you have any other questions, Jaclyn
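To apply this to a whole folder listing, the specimen ID can be pulled out of each file name with a regular expression. The sketch below assumes every name follows the `<specimenID>_S<#>_R<1/2>_001.fastq.gz` layout of the RISK_100 example; the function names are made up for illustration, and names that do not fit the pattern are skipped.

```python
import re

# Assumed layout: <specimenID>_S<sample number>_R<1 or 2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<specimen_id>.+)_S(?P<sample>\d+)_R(?P<read>[12])_001\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return (specimen_id, read_number), or None if the name does not fit the pattern."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return match.group("specimen_id"), int(match.group("read"))


def group_by_specimen(filenames):
    """Map each specimen ID to a {read_number: filename} dict."""
    grouped = {}
    for name in filenames:
        parsed = parse_fastq_name(name)
        if parsed is None:
            continue  # not a FASTQ name in the assumed layout
        specimen_id, read = parsed
        grouped.setdefault(specimen_id, {})[read] = name
    return grouped


grouped = group_by_specimen([
    "RISK_100_S63_R1_001.fastq.gz",
    "RISK_100_S63_R2_001.fastq.gz",
    "notes.txt",
])
```

The keys of `grouped` can then be compared against the `specimenID` column of the metadata file. Specimens that appear only in the metadata are probably among those delivered as BAM files.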
