Hi! I have a question about matching cell types for the single-cell data from syn31512863. It seems that the cell annotation file does not cover all barcodes in this dataset. Specifically, the cell annotation file in syn51218314 has fewer barcodes than the barcodes files for the processed data in syn51123521. For example, for libraryBatch='190403-B4-A', there are 22180 barcodes in processed/190403-B4-A.barcodes.tsv.gz, but in the annotation file cell-annotation.csv, ``` sum(annot$libraryBatch=='190403-B4-A') ``` gave only 13768 matches. More generally, across all batches, the cell annotation file covers only ~50% of the cells provided in the processed data. I have also checked the original publication https://www.biorxiv.org/content/10.1101/2022.11.07.515446v1 and could not find an explanation. Could anyone provide some insight on this? Thank you!
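For anyone who wants to repeat the per-batch comparison described above, here is a minimal Python sketch. It assumes that each barcodes.tsv.gz file holds one barcode per line and that the annotation CSV has columns named `libraryBatch` and `cellBarcode`. The `cellBarcode` column name and the helper function names are assumptions made for illustration, not something confirmed by the data release.

```python
import csv
import gzip
from collections import defaultdict


def read_barcodes(path):
    """Return the set of barcodes in a gzipped file with one barcode per line."""
    with gzip.open(path, "rt") as handle:
        return {line.strip() for line in handle if line.strip()}


def annotated_barcodes_by_batch(annotation_csv):
    """Group annotated barcodes by batch.

    Assumes columns named libraryBatch and cellBarcode; the second
    column name is a guess and may differ in the real file.
    """
    by_batch = defaultdict(set)
    with open(annotation_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            by_batch[row["libraryBatch"]].add(row["cellBarcode"])
    return by_batch


def coverage(all_barcodes, annotated):
    """Fraction of a batch's barcodes that also appear in the annotation."""
    if not all_barcodes:
        return 0.0
    return len(all_barcodes & annotated) / len(all_barcodes)
```

With the real files, `coverage(read_barcodes("processed/190403-B4-A.barcodes.tsv.gz"), annotated_barcodes_by_batch("cell-annotation.csv")["190403-B4-A"])` would give the covered fraction for that batch. Using the counts quoted above, 13768 / 22180 is about 0.62, provided every annotated barcode also appears in the barcodes file.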

Created by Chang Su changSU
Dear @masashi, thank you so much for all of your help. One more question: how did you define AD and Control in your paper? Did you use ceradsc, with 1 and 2 as "AD" and 3 and 4 as "Control"? That was the only diagnosis variable I could find that gave numbers of AD and control individuals similar to those in your paper.
Hi @08nanaka, in the metadata file, the combination of "libraryBatch" and "cellBarcode" should be used to retrieve information. You find barcode "GCATCTCGTCAACCTA-1" in the count matrix of batch "190403-B4-A" but do not find it under that batch in the metadata file. This means that the barcode was filtered out at some point in the downstream analysis of that batch. The barcode "GCATCTCGTCAACCTA-1" in "190403-B4-A" has nothing to do with "GCATCTCGTCAACCTA-1" in "200316-B24-A" or "201007-B58-B". Sorry for the confusion. I will update the data files to avoid it.
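To illustrate the point about using "libraryBatch" and "cellBarcode" together, here is a small Python sketch that indexes metadata rows by that pair. The rows are toy data invented for the example, the donor IDs are placeholders, and the `individualID` column name and the `load_metadata` helper are assumptions rather than a description of the actual file.

```python
import csv
import io


def load_metadata(handle):
    """Index metadata rows by the (libraryBatch, cellBarcode) pair."""
    index = {}
    for row in csv.DictReader(handle):
        key = (row["libraryBatch"], row["cellBarcode"])
        index[key] = row["individualID"]
    return index


# Toy rows, invented for illustration only; the donor IDs are placeholders.
toy = io.StringIO(
    "libraryBatch,cellBarcode,individualID\n"
    "200316-B24-A,GCATCTCGTCAACCTA-1,donor_1\n"
    "201007-B58-B,GCATCTCGTCAACCTA-1,donor_2\n"
)
index = load_metadata(toy)

# Same barcode string in two batches: two unrelated cells.
print(index[("200316-B24-A", "GCATCTCGTCAACCTA-1")])
print(index[("201007-B58-B", "GCATCTCGTCAACCTA-1")])

# No entry for this batch: the cell did not survive downstream filtering.
print(index.get(("190403-B4-A", "GCATCTCGTCAACCTA-1")))
```

Looking a cell up by barcode alone would mix up cells from different batches, which is why the key here is the pair and not the barcode string.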
Dear @masashi, @m-fujita, I have been looking at the processed data and am having some difficulty. For example, the data from batch "190403-B4-A" seem to contain many cells whose barcodes belong to different batches. Barcode "GCATCTCGTCAACCTA-1" is one of the cells in the data, but when I look at the metadata (ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv), that barcode is not listed under batch 190403-B4-A. Instead, it appears in other libraries (200316-B24-A, 201007-B58-B, 191122-B6-R1710143-alone, 191122-B6-R7641350-alone), associated with four different individualIDs. Is the metadata wrong, or is there an error in the labeling of some cells in the processed data?
Got it! That makes sense. I found a total of 436 unique individuals in the processed data syn51123521. Considering that only 424 samples are used in the publication, maybe some of the unmatched barcodes correspond to those 12 individuals, as you pointed out. Hopefully the authors can provide more details on how the data were processed in a future release. Thank you, Jaclyn, for your insights!
Hello! Currently, the cell annotations file is preliminary data that is not yet published, and might change with further data analysis. My understanding of the file is that the cells listed in it have:
1) Passed quality control
2) Been confidently identified as belonging to a single individual (since they had to run demuxlet/freemuxlet), and
3) Been identified as a specific cell type

This means that a good number of cells listed in the barcodes.tsv.gz files will be left out for now. 50% _does_ seem high to me, but from the paper on Biorxiv it looks like after obtaining the output from CellRanger (the barcodes.tsv.gz files), they excluded _all_ cells from ~55 individuals based on QC and the results of demuxlet/freemuxlet, which would account for a lot of cells gone even before doublet removal and further QC. Hopefully that helps! Let me know if you have more questions.
Jaclyn

Cell annotation file does not cover all cells for study syn31512863