Thanks a lot for making the raw snRNA-seq data (FASTQ files) for the ROSMAP "Single Nucleus RNAseq - DLPFC, Experiment 2" available in folder syn23650894. This is an incredibly valuable dataset, especially because it contains so many samples. Are you planning to make the processed data (e.g. count matrices or hdf5 files) available as well?

Created by Thomas Sandmann sandmannt
Dear Team, I am struggling to reconcile the number of cells between the demultiplex file (syn34572333) and the cell annotation file (syn51218314). For example, for the library batch 201207-B63-B, the demultiplex file lists two individuals with 14542 cells (individual 1) and 24224 cells (individual 2). The number of cells in the annotation file is much smaller (individual 1 = 1628, individual 2 = 1950). Is this order of reduction in the number of cells expected? Why are there only 2 individuals corresponding to this library batch? Also, the number of cells matching my count matrix after running kallisto transcript quantification is only 53! This number was 165 before I updated the cell annotation and demultiplex files to the new version. Can you help please? Thanks. Best regards, Rajesh
Thanks a ton for explaining the relationships between the barcodes and batches, @masashi. Super helpful!
Dear @masashi, this makes sense. I see now that the count matrix for each batch also includes many individuals who are not in that batch, but each of them has only a very small number of cells, so I will filter out those cells for future analyses. Thank you!
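For anyone who wants to apply the same filter, here is a minimal sketch in Python. It assumes the mapping has already been read into `(libraryBatch, cellBarcode, individualID)` tuples. The function name `filter_minor_individuals` and the `min_cells` threshold are hypothetical placeholders, not values recommended by the contributors.

```python
from collections import Counter

def filter_minor_individuals(rows, min_cells=100):
    """Drop cells of individuals that have fewer than `min_cells` cells
    within a given library batch.

    rows: iterable of (libraryBatch, cellBarcode, individualID) tuples.
    min_cells: assumed cutoff; pick a value that suits your data.
    """
    rows = list(rows)
    # Count cells per (libraryBatch, individualID) pair.
    counts = Counter((batch, individual) for batch, _, individual in rows)
    return [r for r in rows if counts[(r[0], r[2])] >= min_cells]

# Toy example with made-up barcodes and individual labels.
toy_rows = [
    ("190403-B4-A", "AAA-1", "R1"),
    ("190403-B4-A", "AAC-1", "R1"),
    ("190403-B4-A", "AAG-1", "R1"),
    ("190403-B4-A", "AAT-1", "R9"),  # only one cell for R9 in this batch
]
kept = filter_minor_individuals(toy_rows, min_cells=2)
print(kept)
```

Counting per batch rather than across the whole dataset matters here, because an individual with few cells in one batch may still be a genuine member of another batch.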
Dear @08nanaka and @sandmannt, in the metadata file, the combination of "libraryBatch" and "cellBarcode" should be used to retrieve information. You find barcode "GCATCTCGTCAACCTA-1" of batch "190403-B4-A" in the metadata file but do not find the same barcode in the count matrix of the batch. This means that the barcode was filtered out at some point in the downstream analysis of the batch. The barcode "GCATCTCGTCAACCTA-1" in "190403-B4-A" has nothing to do with "GCATCTCGTCAACCTA-1" in "200316-B24-A" or "201007-B58-B". Sorry for confusing you. I will update the data files so as to avoid confusion.
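To illustrate the composite key, here is a small Python sketch. It assumes the mapping is a CSV with the columns `libraryBatch`, `cellBarcode` and `individualID`. The inline toy table, the `R1`/`R2`/`R3` labels and the helper names `load_mapping` and `annotate_batch` are made up for the example.

```python
import csv
import io

# Toy stand-in for the mapping file; the individualID values are invented.
MAPPING_CSV = """libraryBatch,cellBarcode,individualID
190403-B4-A,GCATCTCGTCAACCTA-1,R1
200316-B24-A,GCATCTCGTCAACCTA-1,R2
201007-B58-B,GCATCTCGTCAACCTA-1,R3
"""

def load_mapping(handle):
    """Index the mapping by (libraryBatch, cellBarcode), not by barcode alone."""
    mapping = {}
    for row in csv.DictReader(handle):
        mapping[(row["libraryBatch"], row["cellBarcode"])] = row["individualID"]
    return mapping

def annotate_batch(mapping, library_batch, barcodes):
    """Return {barcode: individualID or None} for the barcodes of one batch."""
    return {bc: mapping.get((library_batch, bc)) for bc in barcodes}

mapping = load_mapping(io.StringIO(MAPPING_CSV))
result = annotate_batch(
    mapping, "200316-B24-A", ["GCATCTCGTCAACCTA-1", "AAACCCAAGAAACACT-1"]
)
print(result)
```

The same barcode string resolves to a different individual in each batch, and a barcode that is absent from the mapping for that batch comes back as `None` instead of silently picking up an entry from another batch.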
Dear @08nanaka, no, unfortunately I haven't been able to make any progress toward understanding the demultiplexing results that I highlighted in my post. I am hoping that the final, i.e. not preliminary, results the authors will share eventually won't raise the same questions.
Dear @sandmannt, I was wondering if you figured out the potential issue with the ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv file? I had problems with it and just asked Dr. Fujita about it (I asked: I have been looking at the processed data and am having some difficulty. For example, I look at the data from batch "190403-B4-A". It has many cells with barcodes from different batches. For example, barcode "GCATCTCGTCAACCTA-1" is one of the cells in the data. However, when I look at the metadata (ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv), that barcode is not present as part of batch 190403-B4-A. Instead, it is present in libraries (200316-B24-A, 201007-B58-B, 191122-B6-R1710143-alone, 191122-B6-R7641350-alone) associated with 4 different individualIDs. I was wondering if the metadata is wrong or if some of the labeling of the cells in the processed data has an error?)
Thanks, @masashi !
Hi @svattathil, I used syn3191087.
Is there a file that links the IDs in the 'individualID' column of syn34572333 to ROSMAP projids, so they can be aligned with other datasets?
Ah okay that makes a lot of sense. Thank you!
Hi @08nanaka. Each library is multiplexed and contains cells from 8 individuals. The mapping from cell barcodes to individuals is available at syn34572333.
Hi @abby.vanderlinden and @masashi , I was wondering why the processed data only has 127 individuals? Will the data from the rest of the individuals (to get to all 424 or 465) become available soon?
@abby.vanderlinden I am very grateful to the data contributors for sharing their raw and processed data. One piece of information I haven't been able to locate, yet, is a table with the cell type they assigned to each cell in their preprint. Perhaps it is available already and I missed it?
That's great news, thanks a lot to the contributors for sharing the processed data!
Hi all, the counts files for the recent ROSMAP snRNAseq data are now publicly available here: https://www.synapse.org/#!Synapse:syn31512863. Thank you for your patience and thank you to our data contributors for working to share this data.
We are still waiting to hear back from the contributing lab on the processed data. I'll follow up with them again!
Hi @abby.vanderlinden, do you have an update as to when the processed data will become available? Thank you! Eléonore
Hi Thomas, you may be right -- I know the data contributors were very clear with us that the preliminary mapping could undergo changes as the analysis progressed. The publication pre-print is [here](https://www.biorxiv.org/content/10.1101/2022.11.07.515446v1.full). Dr. Masashi Fujita ( @masashi ) at Columbia uploaded the data and is the first author on the manuscript and would be a good person to talk with further!
Thanks again for providing additional information about this very valuable dataset - and thanks to the contributors for uploading the raw data (FASTQ files)!

I have processed the raw data into count matrices now, mapped to the latest human reference index provided by 10X Genomics (GRCh38, 2020-A) with cellranger v7. Then I used the provided (preliminary) mapping file (syn34572333) to assign the nuclei in each libraryBatch to individualIDs. When I spot-checked the total counts of the droplets that were assigned to individuals, I noticed that many had very low total counts (sometimes zero) in the cellranger output. As a consequence, cellranger classified these droplets as "empty" and they are not included in cellranger's filtered output.

To make sure this is not an artifact of my processing pipeline, I repeated the analysis and quantified gene expression with [alevin-fry](https://www.nature.com/articles/s41592-022-01408-3) as well. The results were very similar: again a large fraction of droplets annotated with individualIDs seemed to correspond to empty droplets.

According to section [Demultiplexing of snRNAseq reads](https://www.synapse.org/#!Synapse:syn31512863) in the description of the dataset, the contributors called variants (SNPs) using the snRNAseq reads themselves. That would

- only be expected to work for non-empty droplets and
- require at least a minimum coverage for variant calling.

That's why I am wondering if the preliminary mapping file (syn34572333) might not be entirely correct? Here is an excerpt of the mapping file for libraryBatch `201207-B63-A`, which contains nuclei from two individualIDs (individualIDs were redacted to `R1` and `R2` because this is a public post). I think these cell barcodes correspond to empty drops, i.e. they have zero counts in the raw cellranger output:

```
201207-B63-A  TTGGGATAGCCTCTCC-1  R1
201207-B63-A  GTCAGCGTCGAACCAT-1  R1
201207-B63-A  TTACCATCAGGGATAC-1  R2
201207-B63-A  TTCCTCTCAGGGATAC-1  R1
201207-B63-A  TGCATGACATCTTCGC-1  R1
201207-B63-A  GTGCACGGTCCATTCC-1  R1
201207-B63-A  CTAGACACACCTCAGG-1  R2
201207-B63-A  CATCGGGAGGCTTAGG-1  R1
201207-B63-A  TTACGTTAGCCTATCA-1  R1
201207-B63-A  ACTTCCGTCCGTTACT-1  R1
```

I wonder if this might indicate that the mapping file (syn34572333) needs to be checked - or whether there is another explanation that I haven't thought of? @abby.vanderlinden I am happy to share my R analysis code with the contributors, if they are interested. Feel free to put us in touch directly, if you like. Best regards, Thomas
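The spot check described above can be sketched roughly as follows. This is not the R code mentioned in the post, just an illustrative Python outline. It assumes the total UMI count per barcode has already been computed from the raw (unfiltered) count matrix and stored in a dictionary. The function name `summarize_mapped_barcodes`, the `min_counts` cutoff and the toy numbers are hypothetical.

```python
def summarize_mapped_barcodes(mapped_barcodes, total_counts, min_counts=500):
    """Classify barcodes from the mapping file by their total UMI count.

    mapped_barcodes: barcodes the mapping file assigns to an individual
                     (for a single libraryBatch).
    total_counts:    {barcode: total UMI count} from the raw count matrix;
                     barcodes missing from it are treated as zero.
    min_counts:      assumed cutoff below which a droplet is called "low".
    """
    summary = {"mapped": 0, "zero": 0, "low": 0, "ok": 0}
    for barcode in mapped_barcodes:
        n = total_counts.get(barcode, 0)
        summary["mapped"] += 1
        if n == 0:
            summary["zero"] += 1
        elif n < min_counts:
            summary["low"] += 1
        else:
            summary["ok"] += 1
    return summary

# Toy example: the counts are invented, not taken from the dataset.
mapped = [
    "TTGGGATAGCCTCTCC-1",
    "GTCAGCGTCGAACCAT-1",   # absent from the toy counts, treated as zero
    "AAACCCAAGAAACACT-1",
    "AAACGAACAGGTTCAT-1",
]
totals = {
    "TTGGGATAGCCTCTCC-1": 0,
    "AAACCCAAGAAACACT-1": 1200,
    "AAACGAACAGGTTCAT-1": 40,
}
summary = summarize_mapped_barcodes(mapped, totals)
print(summary)
```

A large "zero" or "low" share among barcodes that carry an individualID would be the pattern reported in the post. The cutoff is only a stand-in for whatever cell-calling rule the quantification tool applies.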
Very, very helpful - thanks a lot @abby.vanderlinden !
Hi Thomas, for this study the contributor informed me that libraries were generated as two replicates. Replicate libraries contain the same brain regions of the same 8 donors but were prepared independently. Moreover, each library was sequenced twice in different sequencing centers (Broad and NYGC), resulting in 4 sets of sequencing data for each pool of samples -- so there are technical replicates and biological replicates. This information is captured in the libraryBatch and sequencingBatch annotations on the fastq files. Those annotations can be linked back to specimenIDs in the snRNAseq assay metadata (syn21073536). There's more info on Synapse annotations here: https://help.synapse.org/docs/Annotating-Data-With-Metadata.2667708522.html. Hope this helps!
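As a rough illustration of how the two sequencing runs of a library can be paired up, here is a short Python sketch. It assumes library names follow the pattern `<libraryBatch>_<sequencing center>`, as in `201021-B60-A_NYGC` and `201021-B60-A_Broad`. The helper name `group_by_library_batch` is made up, and in practice the libraryBatch and sequencingBatch annotations on the FASTQ files are the authoritative source.

```python
from collections import defaultdict

def group_by_library_batch(library_names):
    """Group library names by libraryBatch and list the sequencing centers.

    Assumes the part after the last underscore is the sequencing center.
    """
    groups = defaultdict(set)
    for name in library_names:
        batch, center = name.rsplit("_", 1)
        groups[batch].add(center)
    return {batch: sorted(centers) for batch, centers in groups.items()}

libraries = ["201021-B60-A_NYGC", "201021-B60-A_Broad"]
grouped = group_by_library_batch(libraries)
print(grouped)
```

Libraries that share a libraryBatch but differ in sequencing center would then be the technical replicates described above.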
@abby.vanderlinden Perhaps you can help me with another question regarding the `ROSMAP Single Nucleus RNAseq (DLPFC, Experiment 2)` dataset ([syn31512863](https://www.synapse.org/#!Synapse:syn31512863))? I noticed that there seem to be FASTQ files from both the Broad and the NYGC, e.g. there are libraries `201021-B60-A_NYGC` and `201021-B60-A_Broad`. Do you know whether the same libraries were sequenced twice, e.g. by different sequencing centers? I am wondering whether I should consider them technical replicates or if there are other differences (beyond sequencing center) worth knowing about? Many thanks for any information you can share! Thomas
Thanks a lot for letting me know, Abby! Greatly appreciate that the contributors made the (many!) FASTQ files available so quickly.
Hi there, we do not have the processed data available yet -- the contributors are still finalizing it. Their priority was to make the raw data available to the community as soon as possible. We will host the counts in the portal as soon as they are available. Make sure you're subscribed to our [newsletter](https://news.adknowledgeportal.org/newsletter/) to be notified when they are released!

ROSMAP Single Nucleus RNAseq (DLPFC, Experiment 2) - Processed data?