Fastq to sample confusion

Hi! Firstly, thank you for depositing the raw data for this project! I am having trouble matching the fastq files to individual IDs: I have access to ROSMAP_biospecimen_metadata.csv and ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv, but they don't seem to have the information I need. Let's take this fastq as an example: 200708-B31-A_NYGC_S1_L001_R2_001.fastq.gz. I know the library batch is 200708-B31-A. But which sample does S1 correspond to, given that there are S1, S2, S3, and S4? I can't find a file that matches the S1-S4 IDs to an individual ID. Moreover, some files have the same library ID and sample number but different institutions (i.e., NYGC vs. Broad). Are these the same samples, just sequenced at different places? Or is each one its own individual sample? A file with the fastq names and the corresponding sample IDs would be very helpful. Or am I missing some fundamental piece of information? Thank you for your time and help! Best, Mark

Created by sanbomics
Hi @sanbomics, I do not think the "S" labels are informative in and of themselves; they seem to be part of the fastq naming convention. Rather, it is the libraryBatch and sequencingBatch numbers that are important, and you can find those in the annotations for each fastq file, which will link you to the specimenIDs in the [assay metadata](syn21073536). As for the replicates, the authors say in the methods of their [preprint](https://www.biorxiv.org/content/10.1101/2022.11.07.515446v1) that "The same libraries of batches B10–B63 were resequenced at The New York Genome Center using Illumina NovaSeq 6000. Sequencing data of both Broad Institute and New York Genome Center were used for analysis." But I don't know exactly how they used the data (whether the two runs were merged, or kept separate and compared). This is not my area of expertise, so I'm tagging our senior bioinformatics engineer, @wpoehlm, for more detailed technical questions. Best, Laura
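For readers following along, here is a minimal sketch of how the example filename from the first post could be split into its parts. It assumes the layout `<libraryBatch>_<center>_S<n>_L<lane>_<read>_001.fastq.gz`, inferred only from that one filename; the function and field names are illustrative, and files that do not match simply return `None`.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the single example in this thread:
#   <libraryBatch>_<center>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<library>.+)_(?P<center>[^_]+)_S(?P<s>\d+)"
    r"_L(?P<lane>\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


class FastqName(NamedTuple):
    library_batch: str  # e.g. "200708-B31-A"
    center: str         # e.g. "NYGC"
    s_index: int        # the number after "S"
    lane: int           # the number after "L"
    read: str           # "R1", "R2", "I1" or "I2"


def parse_fastq_name(name: str) -> Optional[FastqName]:
    """Split a fastq filename into fields; return None if it does not match."""
    m = FASTQ_RE.match(name)
    if m is None:
        return None
    return FastqName(
        m["library"], m["center"], int(m["s"]), int(m["lane"]), m["read"]
    )


print(parse_fastq_name("200708-B31-A_NYGC_S1_L001_R2_001.fastq.gz"))
```

The parsed library batch and center are the fields to compare against the file annotations; the S index is kept only so that files can be listed consistently.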
Hi Laura, Thanks for this information. Things make much more sense now. My initial assumption was that a "library batch" was multiple 10x libraries that were prepared at the same time. So, in conclusion: each library batch is **one** 10x sample that is a mixture of cells from 8 different individuals. De-multiplexing of individuals is done by the authors based on SNPs and requires ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv. I am still a little unsure about two points: 1) Why do many of the libraries go from S1 to S4? Should we just treat all 4 as one sample and run them all at the same time? 2) The biological replicates are LibraryBatch-A/B and the sequencing replicates are Broad/NYGC. So the Broad and NYGC data should contain the same cells/barcodes? Did the authors merge Broad/NYGC when running through cellranger? Thanks again! Best, Mark
Hi @sanbomics, For most of the fastq files, 8 individuals were pooled per sample (except those labeled "alone" and those with an obvious individualID), and you do need to use the demultiplexing file to identify the individuals, as noted in the [documentation](syn31512863) for this project. The "S" labels aren't particularly informative here, though the libraryBatch and sequencingBatch information in the annotations for each file will link back to the specimenID in the metadata. See this [discussion thread](https://www.synapse.org/#!Synapse:syn2580853/discussion/threadId=9673) for more details; it is likely to be helpful, since others have had similar questions. In addition, as noted in the same discussion thread, there are both technical and biological replicates: replicate libraries contain the same 8 brain donors (from the same brain region) but were prepared independently, and then each library was sequenced twice, at different sequencing centers (Broad and NYGC). So there will be 4 sets of sequencing data for each pool. Also, the study investigators have very recently uploaded processed files in [syn51123521](syn51123521), which might be a more straightforward place to start. Let us know if you are still stuck and we will try to help you troubleshoot! Best, Laura
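To see the "4 sets per pool" structure in a local download, one option is to bucket the fastq files by library batch and sequencing center. This is only a sketch: it assumes the first two underscore-separated fields of each filename are the library batch and the center, as in the example from the first post, and every filename below other than that example is hypothetical.

```python
from collections import defaultdict


def group_by_library_and_center(filenames):
    """Bucket fastq filenames by (library batch, sequencing center).

    Assumes names start with '<libraryBatch>_<center>_'; names with fewer
    than three underscore-separated fields are skipped.
    """
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split("_")
        if len(parts) < 3:
            continue
        groups[(parts[0], parts[1])].append(name)
    return {key: sorted(files) for key, files in groups.items()}


# Only the first name comes from this thread; the rest are hypothetical.
files = [
    "200708-B31-A_NYGC_S1_L001_R2_001.fastq.gz",
    "200708-B31-A_NYGC_S2_L001_R2_001.fastq.gz",
    "200708-B31-A_Broad_S1_L001_R2_001.fastq.gz",
    "200708-B31-B_NYGC_S1_L001_R2_001.fastq.gz",
]
for key, members in group_by_library_and_center(files).items():
    print(key, len(members))
```

Each key is one library sequenced at one center. Whether the S1-S4 files within a key should be processed together is the open question in this thread, so the grouping only makes the file layout visible and does not settle it.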
Are the S1, S2, S3, and S4 IDs arbitrary? Should they be merged for individual samples? Then the demultiplexing file starts to make a little more sense. Thank you for any input!!
