Linking scRNA cells from hda5 file to the metadata file.

Hi, I am trying to link the scRNA cells from `analysis_scRNAseq_tumor_counts.h5ad` to the metadata file `analysis_scRNAseq_tumor_metadata.tsv` to obtain the linked subjects, conform your guidelines: files: analysis_scRNAseq_tumor_counts.h5ad This file contains the post-filtered count matrices (pre-normalization) for the tumor samples (microenvironmental non-tumor + tumor cells) across the 11 subjects in this study. The sample and cell state annotation can be linked from this file to the cell_barcode variable in the table "analysis_scRNAseq_tumor_metadata.tsv". However, the cell names in the hda5 file do not match with the metadata and processed counts table: ``` > SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) > d = SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") > colnames(d)[c(1:5,55248)] "Cell1" "Cell2" "Cell3" "Cell4" "Cell5" "Cell55248" ``` ``` > m = read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv",sep="\t") > m$cell_barcode[c(1:5,55248)] [1] "AAACCTGAGGACCACA-1-0" "AAACCTGCAAATACAG-1-0" "AAACCTGTCACTGGGC-1-0" "AAACGGGGTGAGTATA-1-0" "AAACGGGGTGCAACGA-1-0" "TTTACTGCACGACAGA-1-10" ``` @elang @kcjohnson Do you have suggestions how to link the subjects to the cells from `analysis_scRNAseq_tumor_counts.h5ad` ? Youri

Created by Youri Hoogstrate yhoogstrate
Hi @kcjohnson thank you so much for your prompt response and verification of the samples! Yes i'm trying to get only the IDH-m gliomas of astrocytic origin of any grade and am combining your data with some other public datasets so I really appreciate your clarification so that I can be confident that I have only those samples. Thanks again
Hi @dmiyagi , happy to provide clarification! It sounds like you are interested in analyzing IDHmut astrocytomas as defined by the 2021 WHO classification. If so, I would look at the "idh_mut_subtype" column in either the `clinical_metadata` table (Synapse) or STable1 for the 4 IDHmut-noncodels (SM002, SM008, SM015, and SM019). I am confident in these classifications given that we have WGS (driver mutations, CNVs) and methylation classification support. As for Extended Data Fig 6C, we've analyzed a number of sc/snRNA profiles of adult glioma tumors and have found that microenvironmental composition is indeed distinct by glioma subtype but it is also strongly influenced by both time point (initial vs recurrence), extent of surgical resection + spatial location of tissue sample, and a few other features. Thus, I would take caution with interpreting the distribution of normal cell types within this smaller dataset. -Kevin
@kcjohnson I'm sorry to be a major pain but I was finishing up the analysis and cross referencing with your paper. If I'm not mistaken, your STable1 has the corresponding ID to histological_classifications for the IDH astrocytomas: case_barcode histological_classification SM002 Anaplastic Astrocytoma SM008 Anaplastic astrocytoma SM015 Astrocytoma SM019 Diffuse Astrocytoma But I believe that the ID mapping file you gave had: SM019 : RV18001 SM008 : RV19005 In your Ex. Data 6C I see that SM019 and SM008 seem to have the highest % oligodentrocytes. I'm wondering if maybe the ID mapping might be incorrect or I might be misunderstanding one of the pieces of data? If i'm trying to only get IDH-m astrocytomas, should I include SM019 and SM008? or only SM004, SM001, SM015 and SM002 which is what it looks like in Ex. Data 6C ? Thank you so much!
Thank you @kcjohnson !! PS. welcome to Yale, I'm in the Gunel Lab
Hi @dmiyagi, Sorry for the delayed response - we're in the process of transitioning our lab to a new institution (JAX > Yale). Thank you for pointing out helpfulness of a direct link between the `case_barcode` present in the metadata and `sampleid` in the h5ad file. I have uploaded that file as "analysis_scRNA_id_mapping". Note that another way to link these features is to join based on the `cell_barcode` (analysis_scRNAseq_tumor_metadata.csv) and `index` (scanpy: analysis_scRNAseq_tumor_counts.h5ad) fields. Also, thanks to Youri ( @yhoogstrate ) for posting that helpful solution to converting for use in Seurat/R! -Kevin
Hi @yhoogstrate thank you so much! Although if i'm not mistaken, I believe your code is for R/Seurat? I am using Scanpy/Python. Of course I can go through and manually convert, but prefer to not make a mistake if @kcjohnson has the solution for Python because I tried his recommendation of adata.obs_names but I am seeing that the barcodes are as such: ``` Index(['AAACCTGAGGACCACA-1-0', 'AAACCTGCAAATACAG-1-0', 'AAACCTGTCACTGGGC-1-0', 'AAACGGGGTGAGTATA-1-0', 'AAACGGGGTGCAACGA-1-0', 'AAACGGGGTTGCTCCT-1-0', 'AAACGGGTCACAATGC-1-0', 'AAACGGGTCAGTTGAC-1-0', 'AAACGGGTCCCGGATG-1-0', 'AAAGATGGTTTGTTGG-1-0', ... 'TTTCGATTCCGTAATG-1-10', 'TTTGACTGTTCATCGA-1-10', 'TTTGATCCATACAGGG-1-10', 'TTTGATCTCCATTGGA-1-10', 'TTTGATCTCCTGCTAC-1-10', 'TTTGGAGCATCCTAAG-1-10', 'TTTGGAGGTACCTGTA-1-10', 'TTTGGTTCATACAGCT-1-10', 'TTTGGTTTCCGATCTC-1-10', 'TTTGGTTTCGAGTCTA-1-10'], dtype='object', name='index', length=55284) ``` can I get the conversion to "RV18001" to "RV19009" in the clinical data?
@dmiyagi - My code snipped does exactly that. Using the code below, all individual samples will be exported to separate folders, including the IDH-mut non-codels: ``` if(!file.exists("cache/analysis_scRNAseq_tumor_counts.h5")) { SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) } seurat_obj <- SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") rename <- read.table('cache/syn25956426_Johnson_cell_identifiers.txt', header=T) |> dplyr::rename(new.name = umi) |> dplyr::mutate(old.name = paste0("Cell",1:dplyr::n())) seurat_obj <- Seurat::RenameCells(seurat_obj, new.names = rename$new.name) metadata <- rename |> dplyr::left_join(read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv", sep="\t"), by=c('new.name'='cell_barcode')) |> dplyr::left_join( read.csv('data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_syn25880693_clinical_metadata.csv'), by=c('case_barcode'='case_barcode') ) stopifnot(colnames(seurat_obj) == metadata$new.name) for(slot in c("cell_state","case_barcode","case_sex","tumor_location","laterality","driver_mutations","time_point","idh_codel_subtype","who_grade","histological_classification")) { seurat_obj[[slot]] <- metadata[[slot]] } # export individual samples ---- # split per individual and export for(slot in read.csv('data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_syn25880693_clinical_metadata.csv')$case_barcode) { #slot = "SM001" export <- subset(seurat_obj, case_barcode == slot) export.slim <- Seurat::DietSeurat(export, counts = TRUE, data = TRUE, scale.data = FALSE, features = NULL, assays=c('RNA'), dimreducs = NULL, graphs = NULL, misc = F ) export.slim@misc <- export.slim@meta.data[c('case_barcode', 'case_sex', 'tumor_location', 'laterality', 'driver_mutations', 'time_point', 'idh_codel_subtype', 'who_grade')] |> tibble::remove_rownames() |> dplyr::distinct() export.slim@meta.data <- export.slim@meta.data[c('cell_state')] DropletUtils::write10xCounts(path=paste0("cache/syn25956426_Johnson__split__",slot), x=export.slim@assays$RNA@counts) write.csv(data.frame( cell_barcode = colnames(export.slim), cell_state = export.slim@meta.data$cell_state ), file=paste0("cache/syn25956426_Johnson__split__",slot,"/cell_states.csv")) rm(export, export.slim) } ``` Resulting in: ``` $ tree syn* syn25956426_Johnson__split__SM001 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM002 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM004 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM006 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM008 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM011 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM012 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM015 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM017 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM018 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx syn25956426_Johnson__split__SM019 ??? barcodes.tsv ??? cell_states.csv ??? genes.tsv ??? matrix.mtx ```
Thanks @yhoogstrate ! I realize now I meant to tag @kcjohnson . @kcjohnson when I use scanpy to look at barcodes when using adata.obs_names, I see barcodes that don't seem to match. For example I see barcodes: TTTGGAGCATCCTAAG-1-10 , or AAACCTGAGGACCACA-1-0, but clinical metadata the cases are named like "SM001" or "SM019" whereas the adata object has "RV18001" to "RV19009" can you tell me how to convert these? I'm trying to only extract the IDH mutant so it's important that I get the right samples. Thanks so much!
@dmiyagi ``` if(!file.exists("cache/analysis_scRNAseq_tumor_counts.h5")) { SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) } seurat_obj <- SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") rename <- read.table('cache/syn25956426_Johnson_cell_identifiers.txt', header=T) |> dplyr::rename(new.name = umi) |> dplyr::mutate(old.name = paste0("Cell",1:dplyr::n())) seurat_obj <- Seurat::RenameCells(seurat_obj, new.names = rename$new.name) # load per cell metadata and per patient metadata (latter from Table 'clinical metadata', not file) metadata <- rename |> dplyr::left_join(read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv", sep="\t"), by=c('new.name'='cell_barcode')) |> dplyr::left_join( read.csv('data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_syn25880693_clinical_metadata.csv'), by=c('case_barcode'='case_barcode') ) stopifnot(colnames(seurat_obj) == metadata$new.name) for(slot in c("cell_state","case_barcode","case_sex","tumor_location","laterality","driver_mutations","time_point","idh_codel_subtype","who_grade","histological_classification")) { seurat_obj[[slot]] <- metadata[[slot]] } ```
Hi Youri (@yhoogstrate), I wanted to piggy-back off this question. I am using scanpy and AnnData but I noticed that the "case_barcodes" column and the adata.obs["sampleid"] don't match. i.e. in the clinical metadata the cases are named like "SM001" or "SM019" whereas the adata object has "RV18001" to "RV19009" can you tell me how to convert these? or whether I should be using a different metadata column?
Dear Kevin, Thank you!! I wasn't aware these files were generated using python. I will try python / scanpy immediately. Thanks! Youri edit 1 --- I've exported the identifiers [on the assumption that the order matches with the R loader, will check this] : ``` from tqdm import tqdm import scanpy as sc path = "data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad" h5 = sc.read_h5ad(path) with open("cache/syn25956426_Johnson_cell_identifiers.txt","w") as fh: fh.write("umi\n") for _ in tqdm(h5.obs_names): fh.write(str(_)+"\n") ``` edit 2 --- Identifiers match with the order according to UMAP. If someone else wants to load this file into R: ``` if(!file.exists("cache/analysis_scRNAseq_tumor_counts.h5")) { SeuratDisk::Convert("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_counts.h5ad", dest = "cache/analysis_scRNAseq_tumor_counts.h5", overwrite = TRUE) } seurat_obj <- SeuratDisk::LoadH5Seurat("cache/analysis_scRNAseq_tumor_counts.h5") rename <- read.table('cache/syn25956426_Johnson_cell_identifiers.txt', header=T) |> # this is the file exported with the python script from *edit 1* dplyr::rename(new.name = umi) |> dplyr::mutate(old.name = paste0("Cell",1:dplyr::n())) seurat_obj <- Seurat::RenameCells(seurat_obj, new.names = rename$new.name) metadata <- rename |> dplyr::left_join(read.csv("data/syn25956426_Johnson/processed_data/analysis_scRNAseq_tumor_metadata.tsv", sep="\t"), by=c('new.name'='cell_barcode')) stopifnot(colnames(seurat_obj) == metadata$new.name) seurat_obj@meta.data$cell_state = metadata$cell_state seurat_obj@meta.data$case_barcode = metadata$case_barcode ```
Hi Youri, If I correctly understand your objective it's that you are looking to load the h5ad file generated via scanpy into R. I ran your commands and it looks like there was a warning message printed that indicates the issue: `Warning: No cell-level metadata present, creating fake cell names`. I assume that this occurred for you too given the generic cell names "Cell1" etc. I am not sure why that's happening as the other metadata is present. I just loaded the h5ad file into scanpy and can see the expected barcodes when using adata.obs_names. I tried looking for alternative solutions to easily load the h5ad file into R, but did not have any luck. If you are able to work with the data in scanpy, I think that's the easiest solution. Best, Kevin

Your web browser must have JavaScript enabled in order for this application to display correctly.
If you are an automated web crawler from a search engine, follow this AJAX application crawl link

Drop files to upload

Linking scRNA cells from hda5 file to the metadata file. page is loading…